Search | arXiv e-print repository

Experimental Analysis of Machine Learning Techniques for Finding Search Radius in Locality Sensitive Hashing

Abstract: Finding similar data in high-dimensional spaces is one of the important tasks in multimedia applications. Approaches introduced to find exact searching techniques often use tree-based index structures which are known to suffer from the curse of the dimensionality problem that limits their performance. Approximate searching techniques prefer performance over accuracy and they return good enough res… ▽ More Finding similar data in high-dimensional spaces is one of the important tasks in multimedia applications. Approaches introduced to find exact searching techniques often use tree-based index structures which are known to suffer from the curse of the dimensionality problem that limits their performance. Approximate searching techniques prefer performance over accuracy and they return good enough results while achieving a better performance. Locality Sensitive Hashing (LSH) is one of the most popular approximate nearest neighbor search techniques for high-dimensional spaces. One of the most time-consuming processes in LSH is to find the neighboring points in the projected spaces. An improved LSH-based index structure, called radius-optimized Locality Sensitive Hashing (roLSH) has been proposed to utilize Machine Learning and efficiently find these neighboring points; thus, further improve the overall performance of LSH. In this paper, we extend roLSH by experimentally studying the effect of different types of famous Machine Learning techniques on overall performance. We compare ten regression techniques on four real-world datasets and show that Neural Network-based techniques are the best fit to be used in roLSH as their accuracy and performance trade-off are the best compared to the other techniques. △ Less

Submitted 16 November, 2022; originally announced November 2022.

ACM Class: H.2.4

arXiv:2208.04415 [pdf, ps, other]

Deep Learning Driven Natural Languages Text to SQL Query Conversion: A Survey

Authors: Ayush Kumar, Parth Nagarkar, Prabhav Nalhe, Sanjeev Vijayakumar

Abstract: With the future striving toward data-centric decision-making, seamless access to databases is of utmost importance. There is extensive research on creating an efficient text-to-sql (TEXT2SQL) model to access data from the database. Using a Natural language is one of the best interfaces that can bridge the gap between the data and results by accessing the database efficiently, especially for non-te… ▽ More With the future striving toward data-centric decision-making, seamless access to databases is of utmost importance. There is extensive research on creating an efficient text-to-sql (TEXT2SQL) model to access data from the database. Using a Natural language is one of the best interfaces that can bridge the gap between the data and results by accessing the database efficiently, especially for non-technical users. It will open the doors and create tremendous interest among users who are well versed in technical skills or not very skilled in query languages. Even if numerous deep learning-based algorithms are proposed or studied, there still is very challenging to have a generic model to solve the data query issues using natural language in a real-work scenario. The reason is the use of different datasets in different studies, which comes with its limitations and assumptions. At the same time, we do lack a thorough understanding of these proposed models and their limitations with the specific dataset it is trained on. In this paper, we try to present a holistic overview of 24 recent neural network models studied in the last couple of years, including their architectures involving convolutional neural networks, recurrent neural networks, pointer networks, reinforcement learning, generative models, etc. We also give an overview of the 11 datasets that are widely used to train the models for TEXT2SQL technologies. We also discuss the future application possibilities of TEXT2SQL technologies for seamless data queries. △ Less

Submitted 8 August, 2022; originally announced August 2022.

Comments: 19 pages, 6 figures

arXiv:2109.03784 [pdf, other]

A Survey on Machine Learning Techniques for Auto Labeling of Video, Audio, and Text Data

Authors: Shikun Zhang, Omid Jafari, Parth Nagarkar

Abstract: Machine learning has been utilized to perform tasks in many different domains such as classification, object detection, image segmentation and natural language analysis. Data labeling has always been one of the most important tasks in machine learning. However, labeling large amounts of data increases the monetary cost in machine learning. As a result, researchers started to focus on reducing data… ▽ More Machine learning has been utilized to perform tasks in many different domains such as classification, object detection, image segmentation and natural language analysis. Data labeling has always been one of the most important tasks in machine learning. However, labeling large amounts of data increases the monetary cost in machine learning. As a result, researchers started to focus on reducing data annotation and labeling costs. Transfer learning was designed and widely used as an efficient approach that can reasonably reduce the negative impact of limited data, which in turn, reduces the data preparation cost. Even transferring previous knowledge from a source domain reduces the amount of data needed in a target domain. However, large amounts of annotated data are still demanded to build robust models and improve the prediction accuracy of the model. Therefore, researchers started to pay more attention on auto annotation and labeling. In this survey paper, we provide a review of previous techniques that focuses on optimized data annotation and labeling for video, audio, and text data. △ Less

Submitted 8 September, 2021; originally announced September 2021.

arXiv:2105.14195 [pdf, other]

A Survey of Performance Optimization in Neural Network-Based Video Analytics Systems

Authors: Nada Ibrahim, Preeti Maurya, Omid Jafari, Parth Nagarkar

Abstract: Video analytics systems perform automatic events, movements, and actions recognition in a video and make it possible to execute queries on the video. As a result of a large number of video data that need to be processed, optimizing the performance of video analytics systems has become an important research topic. Neural networks are the state-of-the-art for performing video analytics tasks such as… ▽ More Video analytics systems perform automatic events, movements, and actions recognition in a video and make it possible to execute queries on the video. As a result of a large number of video data that need to be processed, optimizing the performance of video analytics systems has become an important research topic. Neural networks are the state-of-the-art for performing video analytics tasks such as video annotation and object detection. Prior survey papers consider application-specific video analytics techniques that improve accuracy of the results; however, in this survey paper, we provide a review of the techniques that focus on optimizing the performance of Neural Network-Based Video Analytics Systems. △ Less

Submitted 10 May, 2021; originally announced May 2021.

arXiv:2102.08942 [pdf, ps, other]

A Survey on Locality Sensitive Hashing Algorithms and their Applications

Authors: Omid Jafari, Preeti Maurya, Parth Nagarkar, Khandker Mushfiqul Islam, Chidambaram Crushev

Abstract: Finding nearest neighbors in high-dimensional spaces is a fundamental operation in many diverse application domains. Locality Sensitive Hashing (LSH) is one of the most popular techniques for finding approximate nearest neighbor searches in high-dimensional spaces. The main benefits of LSH are its sub-linear query performance and theoretical guarantees on the query accuracy. In this survey paper,… ▽ More Finding nearest neighbors in high-dimensional spaces is a fundamental operation in many diverse application domains. Locality Sensitive Hashing (LSH) is one of the most popular techniques for finding approximate nearest neighbor searches in high-dimensional spaces. The main benefits of LSH are its sub-linear query performance and theoretical guarantees on the query accuracy. In this survey paper, we provide a review of state-of-the-art LSH and Distributed LSH techniques. Most importantly, unlike any other prior survey, we present how Locality Sensitive Hashing is utilized in different application domains. △ Less

Submitted 17 February, 2021; originally announced February 2021.

MSC Class: H.2.4 ACM Class: H.2.4

arXiv:2006.11285 [pdf, other]

doi 10.1007/978-3-030-69377-0_6

Experimental Analysis of Locality Sensitive Hashing Techniques for High-Dimensional Approximate Nearest Neighbor Searches

Authors: Omid Jafari, Parth Nagarkar

Abstract: Finding nearest neighbors in high-dimensional spaces is a fundamental operation in many multimedia retrieval applications. Exact tree-based indexing approaches are known to suffer from the notorious curse of dimensionality for high-dimensional data. Approximate searching techniques sacrifice some accuracy while returning good enough results for faster performance. Locality Sensitive Hashing (LSH)… ▽ More Finding nearest neighbors in high-dimensional spaces is a fundamental operation in many multimedia retrieval applications. Exact tree-based indexing approaches are known to suffer from the notorious curse of dimensionality for high-dimensional data. Approximate searching techniques sacrifice some accuracy while returning good enough results for faster performance. Locality Sensitive Hashing (LSH) is a very popular technique for finding approximate nearest neighbors in high-dimensional spaces. Apart from providing theoretical guarantees on the query results, one of the main benefits of LSH techniques is their good scalability to large datasets because they are external memory based. The most dominant costs for existing LSH techniques are the algorithm time and the index I/Os required to find candidate points. Existing works do not compare both of these dominant costs in their evaluation. In this experimental survey paper, we show the impact of both these costs on the overall performance of the LSH technique. We compare three state-of-the-art techniques on four real-world datasets, and show that, in contrast to recent works, C2LSH is still the state-of-the-art algorithm in terms of performance while achieving similar accuracy as its recent competitors. △ Less

Submitted 19 June, 2020; originally announced June 2020.

Comments: arXiv admin note: text overlap with arXiv:2003.06415

MSC Class: H.2.4 ACM Class: H.2.4

Journal ref: ADC 2021. Lecture Notes in Computer Science, vol. 12610. Springer, Cham, pp. 62-73

arXiv:2006.11284 [pdf, other]

doi 10.1007/978-3-030-60936-8_25

Improving Locality Sensitive Hashing by Efficiently Finding Projected Nearest Neighbors

Authors: Omid Jafari, Parth Nagarkar, Jonathan Montaño

Abstract: Similarity search in high-dimensional spaces is an important task for many multimedia applications. Due to the notorious curse of dimensionality, approximate nearest neighbor techniques are preferred over exact searching techniques since they can return good enough results at a much better speed. Locality Sensitive Hashing (LSH) is a very popular random hashing technique for finding approximate ne… ▽ More Similarity search in high-dimensional spaces is an important task for many multimedia applications. Due to the notorious curse of dimensionality, approximate nearest neighbor techniques are preferred over exact searching techniques since they can return good enough results at a much better speed. Locality Sensitive Hashing (LSH) is a very popular random hashing technique for finding approximate nearest neighbors. Existing state-of-the-art Locality Sensitive Hashing techniques that focus on improving performance of the overall process, mainly focus on minimizing the total number of IOs while sacrificing the overall processing time. The main time-consuming process in LSH techniques is the process of finding neighboring points in projected spaces. We present a novel index structure called radius-optimized Locality Sensitive Hashing (roLSH). With the help of sampling techniques and Neural Networks, we present two techniques to find neighboring points in projected spaces efficiently, without sacrificing the accuracy of the results. Our extensive experimental analysis on real datasets shows the performance benefit of roLSH over existing state-of-the-art LSH techniques. △ Less

Submitted 19 June, 2020; originally announced June 2020.

Comments: arXiv admin note: text overlap with arXiv:2003.06415

MSC Class: H.2.4 ACM Class: H.2.4

Journal ref: SISAP 2020. Lecture Notes in Computer Science, vol 12440. Springer, Cham

arXiv:2003.06415 [pdf, other]

doi 10.1007/978-3-030-60936-8_4

mmLSH: A Practical and Efficient Technique for Processing Approximate Nearest Neighbor Queries on Multimedia Data

Authors: Omid Jafari, Parth Nagarkar, Jonathan Montaño

Abstract: Many large multimedia applications require efficient processing of nearest neighbor queries. Often, multimedia data are represented as a collection of important high-dimensional feature vectors. Existing Locality Sensitive Hashing (LSH) techniques require users to find top-k similar feature vectors for each of the feature vectors that represent the query object. This leads to wasted and redundant… ▽ More Many large multimedia applications require efficient processing of nearest neighbor queries. Often, multimedia data are represented as a collection of important high-dimensional feature vectors. Existing Locality Sensitive Hashing (LSH) techniques require users to find top-k similar feature vectors for each of the feature vectors that represent the query object. This leads to wasted and redundant work due to two main reasons: 1) not all feature vectors may contribute equally in finding the top-k similar multimedia objects, and 2) feature vectors are treated independently during query processing. Additionally, there is no theoretical guarantee on the returned multimedia results. In this work, we propose a practical and efficient indexing approach for finding top-k approximate nearest neighbors for multimedia data using LSH called mmLSH, which can provide theoretical guarantees on the returned multimedia results. Additionally, we present a buffer-conscious strategy to speed up the query processing. Experimental evaluation shows significant gains in performance time and accuracy for different real multimedia datasets when compared against state-of-the-art LSH techniques. △ Less

Submitted 22 June, 2020; v1 submitted 13 March, 2020; originally announced March 2020.

Comments: Submitted to SISAP 2020

MSC Class: H.2.4 ACM Class: H.2.4

Journal ref: SISAP 2020. Lecture Notes in Computer Science, vol 12440. Springer, Cham

arXiv:1912.07101 [pdf, other]

doi 10.1109/SSIAI49293.2020.9094616

Efficient Bitmap-based Indexing and Retrieval of Similarity Search Image Queries

Authors: Omid Jafari, Parth Nagarkar, Jonathan Montaño

Abstract: Finding similar images is a necessary operation in many multimedia applications. Images are often represented and stored as a set of high-dimensional features, which are extracted using localized feature extraction algorithms. Locality Sensitive Hashing is one of the most popular approximate processing techniques for finding similar points in high-dimensional spaces. Locality Sensitive Hashing (LS… ▽ More Finding similar images is a necessary operation in many multimedia applications. Images are often represented and stored as a set of high-dimensional features, which are extracted using localized feature extraction algorithms. Locality Sensitive Hashing is one of the most popular approximate processing techniques for finding similar points in high-dimensional spaces. Locality Sensitive Hashing (LSH) and its variants are designed to find similar points, but they are not designed to find objects (such as images, which are made up of a collection of points) efficiently. In this paper, we propose an index structure, Bitmap-Image LSH (bImageLSH), for efficient processing of high-dimensional images. Using a real dataset, we experimentally show the performance benefit of our novel design while kee** the accuracy of the image results high. △ Less

Submitted 15 December, 2019; originally announced December 2019.

Comments: Submitted to 2020 IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI 2020)

MSC Class: H.2.4 ACM Class: H.2.4

Journal ref: 2020 IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI), Albuquerque, NM, USA, 2020, pp. 58-61

arXiv:1912.07091 [pdf]

doi 10.5121/csit.2019.91410

Drawbacks and Proposed Solutions for Real-time Processing on Existing State-of-the-art Locality Sensitive Hashing Techniques

Authors: Omid Jafari, Khandker Mushfiqul Islam, Parth Nagarkar

Abstract: Nearest-neighbor query processing is a fundamental operation for many image retrieval applications. Often, images are stored and represented by high-dimensional vectors that are generated by feature-extraction algorithms. Since tree-based index structures are shown to be ineffective for high dimensional processing due to the well-known "Curse of Dimensionality", approximate nearest neighbor techni… ▽ More Nearest-neighbor query processing is a fundamental operation for many image retrieval applications. Often, images are stored and represented by high-dimensional vectors that are generated by feature-extraction algorithms. Since tree-based index structures are shown to be ineffective for high dimensional processing due to the well-known "Curse of Dimensionality", approximate nearest neighbor techniques are used for faster query processing. Locality Sensitive Hashing (LSH) is a very popular and efficient approximate nearest neighbor technique that is known for its sublinear query processing complexity and theoretical guarantees. Nowadays, with the emergence of technology, several diverse application domains require real-time high-dimensional data storing and processing capacity. Existing LSH techniques are not suitable to handle real-time data and queries. In this paper, we discuss the challenges and drawbacks of existing LSH techniques for processing real-time high-dimensional image data. Additionally, through experimental analysis, we propose improvements for existing state-of-the-art LSH techniques for efficient processing of high-dimensional image data. △ Less

Submitted 15 December, 2019; originally announced December 2019.

Comments: Accepted and Presented at the 5th International Conference on Signal and Image Processing (SIGI-2019), Dubai, UAE

MSC Class: H.2.4 ACM Class: H.2.4

arXiv:1907.11803 [pdf, other]

doi 10.1145/3323873.3325048

qwLSH: Cache-conscious Indexing for Processing Similarity Search Query Workloads in High-Dimensional Spaces

Authors: Omid Jafari, John Ossorgin, Parth Nagarkar

Abstract: Similarity search queries in high-dimensional spaces are an important type of queries in many domains such as image processing, machine learning, etc. Since exact similarity search indexing techniques suffer from the well-known curse of dimensionality in high-dimensional spaces, approximate search techniques are often utilized instead. Locality Sensitive Hashing (LSH) has been shown to be an effec… ▽ More Similarity search queries in high-dimensional spaces are an important type of queries in many domains such as image processing, machine learning, etc. Since exact similarity search indexing techniques suffer from the well-known curse of dimensionality in high-dimensional spaces, approximate search techniques are often utilized instead. Locality Sensitive Hashing (LSH) has been shown to be an effective approximate search method for solving similarity search queries in high-dimensional spaces. Often times, queries in real-world settings arrive as part of a query workload. LSH and its variants are particularly designed to solve single queries effectively. They suffer from one major drawback while executing query workloads: they do not take into consideration important data characteristics for effective cache utilization while designing the index structures. In this paper, we present qwLSH, an index structure for efficiently processing similarity search query workloads in high-dimensional spaces. We intelligently divide a given cache during processing of a query workload by using novel cost models. Experimental results show that, given a query workload, qwLSH is able to perform faster than existing techniques due to its unique cost models and strategies. △ Less

Submitted 26 July, 2019; originally announced July 2019.

Comments: Extended version of the published work

ACM Class: H.2.4

Journal ref: In Proceedings of ICMR '19, pp. 329-333. ACM, 2019

Showing 1–11 of 11 results for author: Nagarkar, P