Skip to main content

Non Metric Space (Approximate) Library in R

(This article was first published on mlampros, and kindly contributed to R-bloggers)

The nmslibR package is a wrapper of NMSLIB, which according to the authors “… is a similarity search library and a toolkit for evaluation of similarity search methods. The goal of the project is to create an effective and comprehensive toolkit for searching in generic non-metric spaces. Being comprehensive is important, because no single method is likely to be sufficient in all cases. Also note that exact solutions are hardly efficient in high dimensions and/or non-metric spaces. Hence, the main focus is on approximate methods”.

I’ve searched for some time (before wrapping NMSLIB) for a nearest neighbor library which can work with high dimensional data and can scale with big datasets. I’ve already written a package for k-nearest-neighbor search (KernelKnn), however, it’s based on brute force and unfortunately, it requires a certain computation time if the data consists of many rows. The nmslibR package, besides the main functionality of the NMSLIB python library, also includes an Approximate Kernel k-nearest function, which as I will show in the next lines is both fast and accurate. A comparison of NMSLIB with other popular approximate k-nearest-neighbor methods can be found here.

The NMSLIB Library,

  • is a collection of search methods for generic spaces
  • has both metric and non-metric search algorithms
  • has both exact and approximate search algorithms
  • is an evaluation toolkit that simplifies experimentation and processing of results
  • is extensible (new spaces and methods can be added)
  • It was designed to be efficient

Details can be found in the NMSLIB-manual.

The nmslibR package

The nmslibR package includes the following R6-class / functions,

class
NMSlib
Knn_Query()
knn_Query_Batch()
save_Index()
functions

KernelKnn_nmslib()

KernelKnnCV_nmslib()

dgCMatrix_2scipy_sparse()

mat_2scipy_sparse()

The package documentation includes details and examples for the R6-class and functions. I’ll start explaining how a user can work with sparse matrices as the input can also be a python sparse matrix.

Sparse matrices as input

The nmslibR package includes two functions (mat_2scipy_sparse and dgCMatrix_2scipy_sparse) which allow the user to convert from a matrix / dgCMatrix to a scipy sparse matrix,


library(nmslibR)

# conversion from a matrix object to a scipy sparse matrix
#----------------------------------------------------------

set.seed(1)

x = matrix(runif(1000), nrow = 100, ncol = 10)

x_sparse = mat_2scipy_sparse(x, format = "sparse_row_matrix")

print(dim(x))

[1] 100  10

print(x_sparse$shape)

(100, 10)
  


# conversion from a dgCMatrix object to a scipy sparse matrix
#-------------------------------------------------------------

data = c(1, 0, 2, 0, 0, 3, 4, 5, 6)

# by default column-oriented format

dgcM = Matrix::Matrix(data = data, nrow = 3,

                      ncol = 3, byrow = TRUE,

                      sparse = TRUE)

print(dim(dgcM))

[1] 3 3

x_sparse = dgCMatrix_2scipy_sparse(dgcM)

print(x_sparse$shape)

(3, 3)
  

The NMSlib R6-class

The parameter settings for the NMSlib R6-class can be found in the Non-Metric Space Library (NMSLIB) Manual, which explains the NMSLIB Library in detail. In the following code chunk, I’ll show the functionality of the methods included using a data set from my Github repository (it appears as .ipynb notebook in the nmslib Github repository)



library(nmslibR)


# download the data from my Github repository (tested on a Linux OS)
#-------------------------------------------------------------------

system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/sift_10k.txt")


# load the data in the R session
#-------------------------------

sift_10k = read.table("~/sift_10k.txt", quote="\"", comment.char="")


# index parameters
#-----------------

M = 15
efC = 100
num_threads = 5

index_params = list('M'= M, 'indexThreadQty' = num_threads, 'efConstruction' = efC, 
                    
                    'post' = 0, 'skip_optimized_index' = 1 )


# query-time parameters
#----------------------

efS = 100

query_time_params = list('efSearch' = efS)


# Number of neighbors 
#--------------------

K = 100


# space to use
#---------------

space_name = 'l2sqr_sift'     


# initialize NMSlib [ the data should be a matrix ]
#--------------------------------------------------

init_nms = NMSlib$new(input_data = as.matrix(sift_10k), Index_Params = index_params, 
                      
                      Time_Params = query_time_params, space = space_name, 
                      
                      space_params = NULL, method = 'hnsw', 
                      
                      data_type = 'DENSE_UINT8_VECTOR', dtype = 'INT',
                      
                      index_filepath = NULL, print_progress = FALSE)



# returns a 1-dimensional vector
#-------------------------------

init_nms$Knn_Query(query_data_row = as.matrix(sift_10k[1, ]), k = 5)



[[1]]
[1]    2    6 4585 9256  140                    # indices

[[2]]
[1] 18724 24320 68158 69067 70321               # distances
 


# returns knn's for all data
#---------------------------

all_dat = init_nms$knn_Query_Batch(as.matrix(sift_10k), k = 5, num_threads = 1)

str(all_dat)



# a list of indices and distances for all observations
#------------------------------------------------------

List of 2
 $ knn_idx : num [1:10000, 1:5] 3 4 1 2 13 14 1 2 30 31 ...
 $ knn_dist: num [1:10000, 1:5] 18724 14995 18724 14995 21038 ...


Details on the various methods and parameter settings can be found in the manual of the NMSLIB python Library.

KernelKnn using the nmslibR package

In the Vignette of the KernelKnn (Image classification of the MNIST and CIFAR-10 data using KernelKnn and HOG (histogram of oriented gradients)) package I experimented with the mnist dataset and a cross-validated kernel k-nearest-neighbors model gave 98.4 % accuracy based on HOG (histogram of oriented gradients) features. However, it took almost 30 minutes (depending on the system configuration) to complete using 6 threads. I’ve implemented a similar function using NMSLIB (KernelKnnCV_nmslib), so in the next code chunk I’ll use the same parameter setting and I’ll compare computation time and accuracy.

First load the data,


# using system('wget..') on a linux OS 

system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/mnist.zip")             

mnist <- read.table(unz("mnist.zip", "mnist.csv"), nrows = 70000, header = T, 
                    
                    quote = "\"", sep = ",")



X = mnist[, -ncol(mnist)]
dim(X)

## [1] 70000   784

# the 'KernelKnnCV_nmslib' function requires that the labels are numeric and start from 1 : Inf

y = mnist[, ncol(mnist)] + 1          
table(y)

## y
##    1    2    3    4    5    6    7    8    9   10 
## 6903 7877 6990 7141 6824 6313 6876 7293 6825 6958


# evaluation metric

acc = function (y_true, preds) {
  
  out = table(y_true, max.col(preds, ties.method = "random"))
  
  acc = sum(diag(out))/sum(out)
  
  acc
}


then compute the HOG features,


library(OpenImageR)

hog = HOG_apply(X, cells = 6, orientations = 9, rows = 28, columns = 28, threads = 6)

## 
## time to complete : 2.101281 secs  

dim(hog)

## [1] 70000   324


then compute the approximate kernel k-nearest-neighbors using the cosine distance,


# parameters for 'KernelKnnCV_nmslib'
#------------------------------------

M = 30
efC = 100
num_threads = 6

index_params = list('M'= M, 'indexThreadQty' = num_threads, 'efConstruction' = efC,
                    
                    'post' = 0, 'skip_optimized_index' = 1 )


efS = 100

query_time_params = list('efSearch' = efS)


# approximate kernel knn
#-----------------------

fit_hog = KernelKnnCV_nmslib(hog, y, k = 20, folds = 4, h = 1, 
                             weights_function = 'biweight_tricube_MULT', 
                             Levels = sort(unique(y)), Index_Params = index_params,
                             Time_Params = query_time_params, space = "cosinesimil", 
                             space_params = NULL, method = "hnsw", data_type = "DENSE_VECTOR", 
                             dtype = "FLOAT", index_filepath = NULL, print_progress = FALSE, 
                             num_threads = 6, seed_num = 1)


# cross-validation starts .. 

# |=================================================================================| 100%

# time to complete : 32.88805 secs 


str(fit_hog)




List of 2
 $ preds:List of 4
  ..$ : num [1:17500, 1:10] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ : num [1:17500, 1:10] 0 0 0 0 1 ...
  ..$ : num [1:17500, 1:10] 0 0 0 0 0 ...
  ..$ : num [1:17500, 1:10] 0 0 0 0 0 0 0 0 0 0 ...
 $ folds:List of 4
  ..$ fold_1: int [1:17500] 49808 21991 42918 7967 49782 28979 64440 49809 30522 36673 ...
  ..$ fold_2: int [1:17500] 51122 9469 58021 45228 2944 58052 65074 17709 2532 31262 ...
  ..$ fold_3: int [1:17500] 33205 40078 68177 32620 52721 18981 19417 53922 19102 67206 ...
  ..$ fold_4: int [1:17500] 28267 41652 28514 34525 68534 13294 48759 47521 69395 41408 ...



acc_fit_hog = unlist(lapply(1:length(fit_hog$preds), 
                            
                            function(x) acc(y[fit_hog$folds[[x]]], 
                                            
                                            fit_hog$preds[[x]])))
acc_fit_hog

## [1] 0.9768000 0.9786857 0.9763429 0.9760000

cat('mean accuracy for hog-features using cross-validation :', mean(acc_fit_hog), '\n')

## mean accuracy for hog-features using cross-validation : 0.9769571


It took approx. 33 seconds to return with an accuracy of 97.7 % . Almost 47 times faster than KernelKnn’s corresponding function (brute force) with a slight lower accuracy rate (the braycurtis distance metric might be better suited for this dataset).

I also run the corresponding brute-force algorithm of the NMSLIB Library by setting the method parameter to seq_search,



# brute force of NMSLIB   [ here we set 'Index_Params' and 'Time_Params' to NULL ]
#----------------------

fit_hog_seq = KernelKnnCV_nmslib(hog, y, k = 20, folds = 4, h = 1, 
                                weights_function = 'biweight_tricube_MULT', 
                                Levels = sort(unique(y)), Index_Params = NULL,
                                Time_Params = NULL, space = "cosinesimil", 
                                space_params = NULL, method = "seq_search", 
                                data_type = "DENSE_VECTOR", dtype = "FLOAT", 
                                index_filepath = NULL, print_progress = FALSE, 
                                num_threads = 6, seed_num = 1)


# cross-validation starts .. 

# |=================================================================================| 100%

# time to complete : 4.506177 mins  


acc_fit_hog_seq = unlist(lapply(1:length(fit_hog_seq$preds), 
                                
                                function(x) acc(y[fit_hog_seq$folds[[x]]], 
                                                
                                                fit_hog_seq$preds[[x]])))
acc_fit_hog_seq

## [1] 0.9785143 0.9802286 0.9783429 0.9784571

cat('mean accuracy for hog-features using cross-validation :', mean(acc_fit_hog_seq), '\n')

## mean accuracy for hog-features using cross-validation : 0.9788857



The brute-force algorithm of the NMSLIB Library is almost 6 times faster than KernelKnn giving an accuracy of approx. 97.9 %.

The README.md file of the nmslibR package includes the SystemRequirements and installation instructions.

An updated version of the nmslibR package can be found in my Github repository and to report bugs/issues please use the following link, https://github.com/mlampros/nmslibR/issues.

To leave a comment for the author, please follow the link and comment on their blog: mlampros.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


from R-bloggers http://ift.tt/2GR9pnv
via IFTTT

Comments

Popular posts from this blog

Solving Van der Pol equation with ivp_solve

Van der Pol’s differential equation is The equation describes a system with nonlinear damping, the degree of damping given by μ. If μ = 0 the system is linear and undamped, but for positive μ the system is nonlinear and damped. We will plot the phase portrait for the solution to Van der Pol’s equation in Python using SciPy’s new ODE solver ivp_solve . The function ivp_solve does not solve second-order systems of equations directly. It solves systems of first-order equations, but a second-order differential equation can be recast as a pair of first-order equations by introducing the first derivative as a new variable. Since y is the derivative of x , the phase portrait is just the plot of ( x , y ). If μ = 0, we have a simple harmonic oscillator and the phase portrait is simply a circle. For larger values of μ the solutions enter limiting cycles, but the cycles are more complicated than just circles. Here’s the Python code that made the plot. from scipy import linspace from ...

Lawyer: 'Socialite Grifter' Anna Sorokin 'Had To Do It Her Way' (And Steal $275,000)

Opening statements were made in the "Socialite Grifter" trial on Wednesday, and both sides provided extremely different reasons why Anna Sorokin allegedly scammed a number of people and institutions out of $275,000. [ more › ] Gothamist https://ift.tt/2HXgI0E March 29, 2019 at 12:33AM

5 Massively Important AI Features In Time Tracking Applications

Artificial intelligence has transformed the future of many industries. One area that has been under- investigated is the use of AI in time tracking technology. AI is Fundamentally Changing the Future of Time Tracking Technology A time tracking software is a worthy investment irrespective of the size of your organization. It generates accurate reports based on the amount of time your team spends working on a task. These reports facilitate planning of budgets for upcoming projects. Many AI tools are changing the nature of time management. MindSync AI discussed the pivotal role of AI in time management in a Medium article . Why is time tracking software important? It helps with keeping track of the hours being invested on a given task. This sheds light on the timeline for the overall project. It also helps in determining the productivity levels of the employees. This is one of the many reasons that AI is driving workplace productivity . But how can employers utilize it effectively? ...