
Non Metric Space (Approximate) Library in R

(This article was first published on mlampros, and kindly contributed to R-bloggers)

The nmslibR package is a wrapper of NMSLIB, which according to the authors “… is a similarity search library and a toolkit for evaluation of similarity search methods. The goal of the project is to create an effective and comprehensive toolkit for searching in generic non-metric spaces. Being comprehensive is important, because no single method is likely to be sufficient in all cases. Also note that exact solutions are hardly efficient in high dimensions and/or non-metric spaces. Hence, the main focus is on approximate methods”.

I had searched for some time (before wrapping NMSLIB) for a nearest-neighbor library that could handle high-dimensional data and scale to big datasets. I've already written a package for k-nearest-neighbor search (KernelKnn); however, it is based on brute force and consequently becomes slow when the data consists of many rows. The nmslibR package, besides the main functionality of the NMSLIB Python library, also includes an approximate kernel k-nearest-neighbors function which, as I will show below, is both fast and accurate. A comparison of NMSLIB with other popular approximate k-nearest-neighbor methods can be found here.

The NMSLIB Library,

  • is a collection of search methods for generic spaces
  • has both metric and non-metric search algorithms
  • has both exact and approximate search algorithms
  • is an evaluation toolkit that simplifies experimentation and processing of results
  • is extensible (new spaces and methods can be added)
  • is designed to be efficient

Details can be found in the NMSLIB-manual.

The nmslibR package

The nmslibR package includes the following R6 class and functions:

R6 class

  • NMSlib
      • Knn_Query()
      • knn_Query_Batch()
      • save_Index()

functions

  • KernelKnn_nmslib()
  • KernelKnnCV_nmslib()
  • dgCMatrix_2scipy_sparse()
  • mat_2scipy_sparse()

The package documentation includes details and examples for the R6 class and the functions. I'll start by explaining how a user can work with sparse matrices, since the input data can also be a Python sparse matrix.

Sparse matrices as input

The nmslibR package includes two functions (mat_2scipy_sparse and dgCMatrix_2scipy_sparse) that allow the user to convert a base R matrix or a dgCMatrix to a SciPy sparse matrix,


library(nmslibR)

# conversion from a matrix object to a scipy sparse matrix
#----------------------------------------------------------

set.seed(1)

x = matrix(runif(1000), nrow = 100, ncol = 10)

x_sparse = mat_2scipy_sparse(x, format = "sparse_row_matrix")

print(dim(x))

[1] 100  10

print(x_sparse$shape)

(100, 10)
  


# conversion from a dgCMatrix object to a scipy sparse matrix
#-------------------------------------------------------------

data = c(1, 0, 2, 0, 0, 3, 4, 5, 6)

# by default column-oriented format

dgcM = Matrix::Matrix(data = data, nrow = 3,
                      ncol = 3, byrow = TRUE,
                      sparse = TRUE)

print(dim(dgcM))

[1] 3 3

x_sparse = dgCMatrix_2scipy_sparse(dgcM)

print(x_sparse$shape)

(3, 3)
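
These SciPy sparse matrices can be passed directly as input_data to the NMSlib class (covered in the next section), provided that a sparse space and data type are used. The following is a minimal, untested sketch; the 'cosinesimil_sparse' space and the 'SPARSE_VECTOR' data type are my assumptions based on the NMSLIB manual,


# sketch : sparse input to NMSlib (assumes a sparse space and data type)
#------------------------------------------------------------------------

init_sparse = NMSlib$new(input_data = x_sparse, Index_Params = NULL,
                         Time_Params = NULL, space = 'cosinesimil_sparse',
                         space_params = NULL, method = 'hnsw',
                         data_type = 'SPARSE_VECTOR', dtype = 'FLOAT',
                         index_filepath = NULL, print_progress = FALSE)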
  

The NMSlib R6-class

The parameter settings for the NMSlib R6 class can be found in the Non-Metric Space Library (NMSLIB) Manual, which explains the NMSLIB Library in detail. In the following code chunk I'll show the functionality of the included methods, using a data set from my GitHub repository (it also appears as a .ipynb notebook in the nmslib GitHub repository).



library(nmslibR)


# download the data from my Github repository (tested on a Linux OS)
#-------------------------------------------------------------------

system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/sift_10k.txt")


# load the data in the R session
#-------------------------------

sift_10k = read.table("sift_10k.txt", quote="\"", comment.char="")


# index parameters
#-----------------

M = 15
efC = 100
num_threads = 5
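
# 'M' bounds the number of neighbors per node in the HNSW graph, while
# 'efConstruction' trades longer index-construction time for a more
# accurate index (see the NMSLIB manual for details)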

index_params = list('M' = M, 'indexThreadQty' = num_threads, 'efConstruction' = efC,
                    'post' = 0, 'skip_optimized_index' = 1)


# query-time parameters
#----------------------
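# 'efSearch' trades query speed for accuracy : higher values return more
# accurate neighbors at the cost of slower queries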

efS = 100

query_time_params = list('efSearch' = efS)


# Number of neighbors 
#--------------------

K = 100


# space to use
#---------------

space_name = 'l2sqr_sift'     
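
# as I understand the NMSLIB manual, 'l2sqr_sift' is the squared Euclidean
# distance specialized for (uint8) SIFT vectors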


# initialize NMSlib [ the data should be a matrix ]
#--------------------------------------------------

init_nms = NMSlib$new(input_data = as.matrix(sift_10k), Index_Params = index_params,
                      Time_Params = query_time_params, space = space_name,
                      space_params = NULL, method = 'hnsw',
                      data_type = 'DENSE_UINT8_VECTOR', dtype = 'INT',
                      index_filepath = NULL, print_progress = FALSE)



# returns the indices and distances of the k nearest neighbors of a single query row
#------------------------------------------------------------------------------------

init_nms$Knn_Query(query_data_row = as.matrix(sift_10k[1, ]), k = 5)



[[1]]
[1]    2    6 4585 9256  140                    # indices

[[2]]
[1] 18724 24320 68158 69067 70321               # distances
 


# returns knn's for all data
#---------------------------

all_dat = init_nms$knn_Query_Batch(as.matrix(sift_10k), k = 5, num_threads = 1)

str(all_dat)



# a list of indices and distances for all observations
#------------------------------------------------------

List of 2
 $ knn_idx : num [1:10000, 1:5] 3 4 1 2 13 14 1 2 30 31 ...
 $ knn_dist: num [1:10000, 1:5] 18724 14995 18724 14995 21038 ...


Details on the various methods and parameter settings can be found in the manual of the NMSLIB python Library.
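
Once built, an index can also be saved to disk with the save_Index() method and reloaded through the index_filepath parameter, so that it doesn't have to be reconstructed in every session. The following is a minimal, untested sketch based on the calls shown above; the file name 'sift_10k.index' is a hypothetical choice, and I'm assuming the reloaded index must be paired with the same data and parameter settings,


# save the index to disk (hypothetical file name)
#------------------------------------------------

init_nms$save_Index("sift_10k.index")


# re-initialize from the saved index rather than rebuilding it
#--------------------------------------------------------------

init_nms_reload = NMSlib$new(input_data = as.matrix(sift_10k), Index_Params = index_params,
                             Time_Params = query_time_params, space = space_name,
                             space_params = NULL, method = 'hnsw',
                             data_type = 'DENSE_UINT8_VECTOR', dtype = 'INT',
                             index_filepath = "sift_10k.index", print_progress = FALSE)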

KernelKnn using the nmslibR package

In the Vignette of the KernelKnn package (Image classification of the MNIST and CIFAR-10 data using KernelKnn and HOG (histogram of oriented gradients)) I experimented with the MNIST dataset, and a cross-validated kernel k-nearest-neighbors model gave 98.4 % accuracy based on HOG features. However, it took almost 30 minutes to complete (depending on the system configuration) using 6 threads. I've implemented a similar function using NMSLIB (KernelKnnCV_nmslib), so in the next code chunk I'll use the same parameter settings and compare computation time and accuracy.

First load the data,


# using system('wget..') on a linux OS 

system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/mnist.zip")             

mnist <- read.table(unz("mnist.zip", "mnist.csv"), nrows = 70000, header = T,
                    quote = "\"", sep = ",")



X = mnist[, -ncol(mnist)]
dim(X)

## [1] 70000   784

# the 'KernelKnnCV_nmslib' function requires the labels to be numeric,
# starting from 1 (i.e. 1, 2, ..., N)

y = mnist[, ncol(mnist)] + 1          
table(y)

## y
##    1    2    3    4    5    6    7    8    9   10 
## 6903 7877 6990 7141 6824 6313 6876 7293 6825 6958


# evaluation metric

acc = function(y_true, preds) {
  
  # predicted class = column with the highest probability (random tie-breaking)
  out = table(y_true, max.col(preds, ties.method = "random"))
  
  # proportion of correctly classified observations
  acc = sum(diag(out)) / sum(out)
  
  acc
}


then compute the HOG features,


library(OpenImageR)

hog = HOG_apply(X, cells = 6, orientations = 9, rows = 28, columns = 28, threads = 6)

## 
## time to complete : 2.101281 secs  

dim(hog)

## [1] 70000   324
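
The 324 columns arise from a 6 x 6 grid of cells with 9 orientation bins per cell (6 * 6 * 9 = 324 features per image).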


then compute the approximate kernel k-nearest-neighbors using the cosine distance,


# parameters for 'KernelKnnCV_nmslib'
#------------------------------------

M = 30
efC = 100
num_threads = 6

index_params = list('M' = M, 'indexThreadQty' = num_threads, 'efConstruction' = efC,
                    'post' = 0, 'skip_optimized_index' = 1)


efS = 100

query_time_params = list('efSearch' = efS)


# approximate kernel knn
#-----------------------

fit_hog = KernelKnnCV_nmslib(hog, y, k = 20, folds = 4, h = 1, 
                             weights_function = 'biweight_tricube_MULT', 
                             Levels = sort(unique(y)), Index_Params = index_params,
                             Time_Params = query_time_params, space = "cosinesimil", 
                             space_params = NULL, method = "hnsw", data_type = "DENSE_VECTOR", 
                             dtype = "FLOAT", index_filepath = NULL, print_progress = FALSE, 
                             num_threads = 6, seed_num = 1)


# cross-validation starts .. 

# |=================================================================================| 100%

# time to complete : 32.88805 secs 


str(fit_hog)




List of 2
 $ preds:List of 4
  ..$ : num [1:17500, 1:10] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ : num [1:17500, 1:10] 0 0 0 0 1 ...
  ..$ : num [1:17500, 1:10] 0 0 0 0 0 ...
  ..$ : num [1:17500, 1:10] 0 0 0 0 0 0 0 0 0 0 ...
 $ folds:List of 4
  ..$ fold_1: int [1:17500] 49808 21991 42918 7967 49782 28979 64440 49809 30522 36673 ...
  ..$ fold_2: int [1:17500] 51122 9469 58021 45228 2944 58052 65074 17709 2532 31262 ...
  ..$ fold_3: int [1:17500] 33205 40078 68177 32620 52721 18981 19417 53922 19102 67206 ...
  ..$ fold_4: int [1:17500] 28267 41652 28514 34525 68534 13294 48759 47521 69395 41408 ...



acc_fit_hog = unlist(lapply(1:length(fit_hog$preds),
                            function(x) acc(y[fit_hog$folds[[x]]],
                                            fit_hog$preds[[x]])))
acc_fit_hog

## [1] 0.9768000 0.9786857 0.9763429 0.9760000

cat('mean accuracy for hog-features using cross-validation :', mean(acc_fit_hog), '\n')

## mean accuracy for hog-features using cross-validation : 0.9769571


It took approx. 33 seconds to complete, with an accuracy of 97.7 %: almost 47 times faster than KernelKnn's corresponding brute-force function, at a slightly lower accuracy rate (the braycurtis distance metric might be better suited for this dataset).
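
For anyone who wants to test that hypothesis, a minimal sketch using the brute-force KernelKnn package is shown below. It is untested here and assumes the KernelKnnCV() function of that package, whose method argument accepts 'braycurtis',


# sketch : cross-validated brute-force knn with the braycurtis metric
#---------------------------------------------------------------------

library(KernelKnn)

fit_hog_bc = KernelKnnCV(hog, y, k = 20, folds = 4, h = 1,
                         method = 'braycurtis',
                         weights_function = 'biweight_tricube_MULT',
                         regression = FALSE, threads = 6,
                         Levels = sort(unique(y)), seed_num = 1)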

I also ran the corresponding brute-force algorithm of the NMSLIB Library by setting the method parameter to seq_search,



# brute force of NMSLIB   [ here we set 'Index_Params' and 'Time_Params' to NULL ]
#----------------------

fit_hog_seq = KernelKnnCV_nmslib(hog, y, k = 20, folds = 4, h = 1, 
                                weights_function = 'biweight_tricube_MULT', 
                                Levels = sort(unique(y)), Index_Params = NULL,
                                Time_Params = NULL, space = "cosinesimil", 
                                space_params = NULL, method = "seq_search", 
                                data_type = "DENSE_VECTOR", dtype = "FLOAT", 
                                index_filepath = NULL, print_progress = FALSE, 
                                num_threads = 6, seed_num = 1)


# cross-validation starts .. 

# |=================================================================================| 100%

# time to complete : 4.506177 mins  


acc_fit_hog_seq = unlist(lapply(1:length(fit_hog_seq$preds),
                                function(x) acc(y[fit_hog_seq$folds[[x]]],
                                                fit_hog_seq$preds[[x]])))
acc_fit_hog_seq

## [1] 0.9785143 0.9802286 0.9783429 0.9784571

cat('mean accuracy for hog-features using cross-validation :', mean(acc_fit_hog_seq), '\n')

## mean accuracy for hog-features using cross-validation : 0.9788857



The brute-force algorithm of the NMSLIB Library is almost 6 times faster than KernelKnn, giving an accuracy of approx. 97.9 %.
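
For reference, the three approaches compared in this post (timings and accuracies as reported above and in the KernelKnn vignette, with the same parameter settings):

  method                               time (approx.)    mean accuracy
  -----------------------------------  ----------------  -------------
  KernelKnn (brute force)              30 min            98.4 %
  KernelKnnCV_nmslib ('seq_search')    4.5 min           97.9 %
  KernelKnnCV_nmslib ('hnsw')          33 sec            97.7 %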

The README.md file of the nmslibR package includes the SystemRequirements and installation instructions.

An updated version of the nmslibR package can be found in my GitHub repository, and to report bugs / issues please use the following link: https://github.com/mlampros/nmslibR/issues.

