Search | arXiv e-print repository

End-to-End User-Defined Keyword Spotting using Shifted Delta Coefficients

Authors: Kesavaraj V, Anuprabha M, Anil Kumar Vuppala

Abstract: Identifying user-defined keywords is crucial for personalizing interactions with smart devices. Previous approaches of user-defined keyword spotting (UDKWS) have relied on short-term spectral features such as mel frequency cepstral coefficients (MFCC) to detect the spoken keyword. However, these features may face challenges in accurately identifying closely related pronunciation of audio-text pair… ▽ More Identifying user-defined keywords is crucial for personalizing interactions with smart devices. Previous approaches of user-defined keyword spotting (UDKWS) have relied on short-term spectral features such as mel frequency cepstral coefficients (MFCC) to detect the spoken keyword. However, these features may face challenges in accurately identifying closely related pronunciation of audio-text pairs, due to their limited capability in capturing the temporal dynamics of the speech signal. To address this challenge, we propose to use shifted delta coefficients (SDC) which help in capturing pronunciation variability (transition between connecting phonemes) by incorporating long-term temporal information. The performance of the SDC feature is compared with various baseline features across four different datasets using a cross-attention based end-to-end system. Additionally, various configurations of SDC are explored to find the suitable temporal context for the UDKWS task. The experimental results reveal that the SDC feature outperforms the MFCC baseline feature, exhibiting an improvement of 8.32% in area under the curve (AUC) and 8.69% in terms of equal error rate (EER) on the challenging Libriphrase-hard dataset. Moreover, the proposed approach demonstrated superior performance when compared to state-of-the-art UDKWS techniques. △ Less

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2404.04654 [pdf]

Music Recommendation Based on Facial Emotion Recognition

Authors: Rajesh B, Keerthana V, Narayana Darapaneni, Anwesh Reddy P

Abstract: Introduction: Music provides an incredible avenue for individuals to express their thoughts and emotions, while also serving as a delightful mode of entertainment for enthusiasts and music lovers. Objectives: This paper presents a comprehensive approach to enhancing the user experience through the integration of emotion recognition, music recommendation, and explainable AI using GRAD-CAM. Methods:… ▽ More Introduction: Music provides an incredible avenue for individuals to express their thoughts and emotions, while also serving as a delightful mode of entertainment for enthusiasts and music lovers. Objectives: This paper presents a comprehensive approach to enhancing the user experience through the integration of emotion recognition, music recommendation, and explainable AI using GRAD-CAM. Methods: The proposed methodology utilizes a ResNet50 model trained on the Facial Expression Recognition (FER) dataset, consisting of real images of individuals expressing various emotions. Results: The system achieves an accuracy of 82% in emotion classification. By leveraging GRAD-CAM, the model provides explanations for its predictions, allowing users to understand the reasoning behind the system's recommendations. The model is trained on both FER and real user datasets, which include labelled facial expressions, and real images of individuals expressing various emotions. The training process involves pre-processing the input images, extracting features through convolutional layers, reasoning with dense layers, and generating emotion predictions through the output layer. Conclusion: The proposed methodology, leveraging the Resnet50 model with ROI-based analysis and explainable AI techniques, offers a robust and interpretable solution for facial emotion detection paper. △ Less

Submitted 6 April, 2024; originally announced April 2024.

arXiv:2404.03914 [pdf, other]

Open vocabulary keyword spotting through transfer learning from speech synthesis

Authors: Kesavaraj V, Anil Kumar Vuppala

Abstract: Identifying keywords in an open-vocabulary context is crucial for personalizing interactions with smart devices. Previous approaches to open vocabulary keyword spotting dependon a shared embedding space created by audio and text encoders. However, these approaches suffer from heterogeneous modality representations (i.e., audio-text mismatch). To address this issue, our proposed framework leverages… ▽ More Identifying keywords in an open-vocabulary context is crucial for personalizing interactions with smart devices. Previous approaches to open vocabulary keyword spotting dependon a shared embedding space created by audio and text encoders. However, these approaches suffer from heterogeneous modality representations (i.e., audio-text mismatch). To address this issue, our proposed framework leverages knowledge acquired from a pre-trained text-to-speech (TTS) system. This knowledge transfer allows for the incorporation of awareness of audio projections into the text representations derived from the text encoder. The performance of the proposed approach is compared with various baseline methods across four different datasets. The robustness of our proposed model is evaluated by assessing its performance across different word lengths and in an Out-of-Vocabulary (OOV) scenario. Additionally, the effectiveness of transfer learning from the TTS system is investigated by analyzing its different intermediate representations. The experimental results indicate that, in the challenging LibriPhrase Hard dataset, the proposed approach outperformed the cross-modality correspondence detector (CMCD) method by a significant improvement of 8.22% in area under the curve (AUC) and 12.56% in equal error rate (EER). △ Less

Submitted 18 April, 2024; v1 submitted 5 April, 2024; originally announced April 2024.

arXiv:2401.11324 [pdf, other]

BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU

Authors: Karthik V., Saim Khan, Somesh Singh, Harsha Vardhan Simhadri, Jyothi Vedurada

Abstract: Approximate Nearest Neighbour Search (ANNS) is a subroutine in algorithms routinely employed in information retrieval, pattern recognition, data mining, image processing, and beyond. Recent works have established that graph-based ANNS algorithms are practically more efficient than the other methods proposed in the literature, on large datasets. The growing volume and dimensionality of data necessi… ▽ More Approximate Nearest Neighbour Search (ANNS) is a subroutine in algorithms routinely employed in information retrieval, pattern recognition, data mining, image processing, and beyond. Recent works have established that graph-based ANNS algorithms are practically more efficient than the other methods proposed in the literature, on large datasets. The growing volume and dimensionality of data necessitates designing scalable techniques for ANNS. To this end, the prior art has explored parallelizing graph-based ANNS on GPU leveraging its high computational power and energy efficiency. The current state-of-the-art GPU-based ANNS algorithms either (i) require both the index-graph and the data to reside entirely in the GPU memory, or (ii) they partition the data into small independent shards, each of which can fit in GPU memory, and perform the search on these shards on the GPU. While the first approach fails to handle large datasets due to the limited memory available on the GPU, the latter delivers poor performance on large datasets due to high data traffic over the low-bandwidth PCIe bus. In this paper, we introduce BANG, a first-of-its-kind GPU-based ANNS method which works efficiently on billion-scale datasets that cannot entirely fit in the GPU memory. BANG stands out by harnessing compressed data on the GPU to perform distance computations while maintaining the graph on the CPU. BANG incorporates high-optimized GPU kernels and proceeds in stages that run concurrently on the GPU and CPU, taking advantage of their architectural specificities. We evaluate BANG using a single NVIDIA Ampere A100 GPU on ten popular ANN benchmark datasets. BANG outperforms the state-of-the-art in the majority of the cases. Notably, on the billion-size datasets, we are significantly faster than our competitors, achieving throughputs 40x-200x more than the competing methods for a high recall of 0.9. △ Less

Submitted 20 January, 2024; originally announced January 2024.

arXiv:2304.06861 [pdf, other]

Evaluation of Social Biases in Recent Large Pre-Trained Models

Authors: Swapnil Sharma, Nikita Anand, Kranthi Kiran G. V., Alind Jain

Abstract: Large pre-trained language models are widely used in the community. These models are usually trained on unmoderated and unfiltered data from open sources like the Internet. Due to this, biases that we see in platforms online which are a reflection of those in society are in turn captured and learned by these models. These models are deployed in applications that affect millions of people and their… ▽ More Large pre-trained language models are widely used in the community. These models are usually trained on unmoderated and unfiltered data from open sources like the Internet. Due to this, biases that we see in platforms online which are a reflection of those in society are in turn captured and learned by these models. These models are deployed in applications that affect millions of people and their inherent biases are harmful to the targeted social groups. In this work, we study the general trend in bias reduction as newer pre-trained models are released. Three recent models ( ELECTRA, DeBERTa, and DistilBERT) are chosen and evaluated against two bias benchmarks, StereoSet and CrowS-Pairs. They are compared to the baseline of BERT using the associated metrics. We explore whether as advancements are made and newer, faster, lighter models are released: are they being developed responsibly such that their inherent social biases have been reduced compared to their older counterparts? The results are compiled and we find that all the models under study do exhibit biases but have generally improved as compared to BERT. △ Less

Submitted 13 April, 2023; originally announced April 2023.

Comments: 7 pages, 4 Tables

arXiv:2211.06153 [pdf, other]

SUNDEW: An Ensemble of Predictors for Case-Sensitive Detection of Malware

Authors: Sareena Karapoola, Nikhilesh Singh, Chester Rebeiro, Kamakoti V

Abstract: Malware programs are diverse, with varying objectives, functionalities, and threat levels ranging from mere pop-ups to financial losses. Consequently, their run-time footprints across the system differ, impacting the optimal data source (Network, Operating system (OS), Hardware) and features that are instrumental to malware detection. Further, the variations in threat levels of malware classes aff… ▽ More Malware programs are diverse, with varying objectives, functionalities, and threat levels ranging from mere pop-ups to financial losses. Consequently, their run-time footprints across the system differ, impacting the optimal data source (Network, Operating system (OS), Hardware) and features that are instrumental to malware detection. Further, the variations in threat levels of malware classes affect the user requirements for detection. Thus, the optimal tuple of <data-source, features, user-requirements> is different for each malware class, impacting the state-of-the-art detection solutions that are agnostic to these subtle differences. This paper presents SUNDEW, a framework to detect malware classes using their optimal tuple of <data-source, features, user-requirements>. SUNDEW uses an ensemble of specialized predictors, each trained with a particular data source (network, OS, and hardware) and tuned for features and requirements of a specific class. While the specialized ensemble with a holistic view across the system improves detection, aggregating the independent conflicting inferences from the different predictors is challenging. SUNDEW resolves such conflicts with a hierarchical aggregation considering the threat-level, noise in the data sources, and prior domain knowledge. We evaluate SUNDEW on a real-world dataset of over 10,000 malware samples from 8 classes. It achieves an F1-Score of one for most classes, with an average of 0.93 and a limited performance overhead of 1.5%. △ Less

Submitted 14 November, 2022; v1 submitted 11 November, 2022; originally announced November 2022.

arXiv:2201.08574 [pdf, other]

Classroom Slide Narration System

Authors: Jobin K. V., Ajoy Mondal, C. V. Jawahar

Abstract: Slide presentations are an effective and efficient tool used by the teaching community for classroom communication. However, this teaching model can be challenging for blind and visually impaired (VI) students. The VI student required personal human assistance for understand the presented slide. This shortcoming motivates us to design a Classroom Slide Narration System (CSNS) that generates audio… ▽ More Slide presentations are an effective and efficient tool used by the teaching community for classroom communication. However, this teaching model can be challenging for blind and visually impaired (VI) students. The VI student required personal human assistance for understand the presented slide. This shortcoming motivates us to design a Classroom Slide Narration System (CSNS) that generates audio descriptions corresponding to the slide content. This problem poses as an image-to-markup language generation task. The initial step is to extract logical regions such as title, text, equation, figure, and table from the slide image. In the classroom slide images, the logical regions are distributed based on the location of the image. To utilize the location of the logical regions for slide image segmentation, we propose the architecture, Classroom Slide Segmentation Network (CSSN). The unique attributes of this architecture differs from most other semantic segmentation networks. Publicly available benchmark datasets such as WiSe and SPaSe are used to validate the performance of our segmentation architecture. We obtained 9.54 segmentation accuracy improvement in WiSe dataset. We extract content (information) from the slide using four well-established modules such as optical character recognition (OCR), figure classification, equation description, and table structure recognizer. With this information, we build a Classroom Slide Narration System (CSNS) to help VI students understand the slide content. The users have given better feedback on the quality output of the proposed CSNS in comparison to existing systems like Facebooks Automatic Alt-Text (AAT) and Tesseract. △ Less

Submitted 21 January, 2022; originally announced January 2022.

Journal ref: CVIP 2021

arXiv:1903.06475 [pdf, other]

doi 10.1145/3342280.3342330

White Mirror: Leaking Sensitive Information from Interactive Netflix Movies using Encrypted Traffic Analysis

Authors: Gargi Mitra, Prasanna Karthik Vairam, Patanjali SLPSK, Nitin Chandrachoodan, Kamakoti V

Abstract: Privacy leaks from Netflix videos/movies is well researched. Current state-of-the-art works have been able to obtain coarse-grained information such as the genre and the title of videos by passive observation of encrypted traffic. However, leakage of fine-grained information from encrypted traffic has not been studied so far. Such information can be used to build behavioural profiles of viewers.… ▽ More Privacy leaks from Netflix videos/movies is well researched. Current state-of-the-art works have been able to obtain coarse-grained information such as the genre and the title of videos by passive observation of encrypted traffic. However, leakage of fine-grained information from encrypted traffic has not been studied so far. Such information can be used to build behavioural profiles of viewers. On 28th December 2018, Netflix released the first mainstream interactive movie called 'Black Mirror: Bandersnatch'. In this work, we use this movie as a case-study to show for the first time that fine-grained information (i.e., choices made by users) can be revealed from encrypted traffic. We use the state information exchanged between the viewer's browser and Netflix as the side-channel. To evaluate our proposed technique, we built the first interactive video traffic dataset of 100 viewers; which we will be releasing. Preliminary results indicate that the choices made by a user can be revealed 96% of the time in the worst case. △ Less

Submitted 15 March, 2019; originally announced March 2019.

Comments: 2 pages, 2 figures, 1 table

arXiv:1903.04143 [pdf, other]

The Unconstrained Ear Recognition Challenge 2019 - ArXiv Version With Appendix

Authors: Žiga Emeršič, Aruna Kumar S. V., B. S. Harish, Weronika Gutfeter, Jalil Nourmohammadi Khiarak, Andrzej Pacut, Earnest Hansley, Mauricio Pamplona Segundo, Sudeep Sarkar, Hyeonjung Park, Gi Pyo Nam, Ig-Jae Kim, Sagar G. Sangodkar, Ümit Kaçar, Murvet Kirci, Li Yuan, Jishou Yuan, Haonan Zhao, Fei Lu, Junying Mao, Xiaoshuang Zhang, Dogucan Yaman, Fevziye Irem Eyiokur, Kadir Bulut Özler, Hazım Kemal Ekenel , et al. (6 additional authors not shown)

Abstract: This paper presents a summary of the 2019 Unconstrained Ear Recognition Challenge (UERC), the second in a series of group benchmarking efforts centered around the problem of person recognition from ear images captured in uncontrolled settings. The goal of the challenge is to assess the performance of existing ear recognition techniques on a challenging large-scale ear dataset and to analyze perfor… ▽ More This paper presents a summary of the 2019 Unconstrained Ear Recognition Challenge (UERC), the second in a series of group benchmarking efforts centered around the problem of person recognition from ear images captured in uncontrolled settings. The goal of the challenge is to assess the performance of existing ear recognition techniques on a challenging large-scale ear dataset and to analyze performance of the technology from various viewpoints, such as generalization abilities to unseen data characteristics, sensitivity to rotations, occlusions and image resolution and performance bias on sub-groups of subjects, selected based on demographic criteria, i.e. gender and ethnicity. Research groups from 12 institutions entered the competition and submitted a total of 13 recognition approaches ranging from descriptor-based methods to deep-learning models. The majority of submissions focused on ensemble based methods combining either representations from multiple deep models or hand-crafted with learned image descriptors. Our analysis shows that methods incorporating deep learning models clearly outperform techniques relying solely on hand-crafted descriptors, even though both groups of techniques exhibit similar behaviour when it comes to robustness to various covariates, such presence of occlusions, changes in (head) pose, or variability in image resolution. The results of the challenge also show that there has been considerable progress since the first UERC in 2017, but that there is still ample room for further research in this area. △ Less

Submitted 14 March, 2019; v1 submitted 11 March, 2019; originally announced March 2019.

Comments: The content of this paper was published in ICB, 2019. This ArXiv version is from before the peer review

arXiv:1806.00683 [pdf, other]

Deep Pepper: Expert Iteration based Chess agent in the Reinforcement Learning Setting

Authors: Sai Krishna G. V., Kyle Goyette, Ahmad Chamseddine, Breandan Considine

Abstract: An almost-perfect chess playing agent has been a long standing challenge in the field of Artificial Intelligence. Some of the recent advances demonstrate we are approaching that goal. In this project, we provide methods for faster training of self-play style algorithms, mathematical details of the algorithm used, various potential future directions, and discuss most of the relevant work in the are… ▽ More An almost-perfect chess playing agent has been a long standing challenge in the field of Artificial Intelligence. Some of the recent advances demonstrate we are approaching that goal. In this project, we provide methods for faster training of self-play style algorithms, mathematical details of the algorithm used, various potential future directions, and discuss most of the relevant work in the area of computer chess. Deep Pepper uses embedded knowledge to accelerate the training of the chess engine over a "tabula rasa" system such as Alpha Zero. We also release our code to promote further research. △ Less

Submitted 17 October, 2018; v1 submitted 2 June, 2018; originally announced June 2018.

Comments: Tabula Rasa, Chess engine, Learning Fast and Slow, Reinforcement Learning, Alpha Zero

arXiv:1402.4009 [pdf]

doi 10.14445/22315381/IJETT-V8P213

Privacy Shielding against Mass Surveillance

Authors: Kashyap. V, Boominathan. P

Abstract: Privacy Shielding against Mass Surveillance provides a step by step tactical approach to protecting the privacy of all the users of the internet from mass surveillance programs by the governments and other state agencies. Protection of privacy is of prime importance and Privacy Shielding provides the right means against mass surveillance programs and from malicious users trying to gain access to y… ▽ More Privacy Shielding against Mass Surveillance provides a step by step tactical approach to protecting the privacy of all the users of the internet from mass surveillance programs by the governments and other state agencies. Protection of privacy is of prime importance and Privacy Shielding provides the right means against mass surveillance programs and from malicious users trying to gain access to your systems. Although protection is difficult when massive government agencies like the National Security Agency and The Government Communications Headquarters target internet users for surveillance, it is possible because the target is not you as an individual but the entire mass as a whole. With the right approach and a broad perspective of the term Privacy, it is possible for one to freely access and share information over the internet without being victims of surveillance. △ Less

Submitted 17 February, 2014; originally announced February 2014.

Comments: 8 pages,8 figures,Published with International Journal of Engineering Trends and Technology (IJETT)

arXiv:1303.0908 [pdf]

KRAB Algorithm - A Revised Algorithm for Incremental Call Graph Generation

Authors: Rajasekhara Babu, Krishnakumar V., George Abraham, Kiransinh Borasia

Abstract: This paper is aimed to present the importance and implementation of an incremental call graph plugin. An algorithm is proposed for the call graph implementation which has better overall performance than the algorithm that has been proposed previously. In addition to this, the algorithm has been empirically proved to have excellent performance on recursive codes. The algorithm also readily checks f… ▽ More This paper is aimed to present the importance and implementation of an incremental call graph plugin. An algorithm is proposed for the call graph implementation which has better overall performance than the algorithm that has been proposed previously. In addition to this, the algorithm has been empirically proved to have excellent performance on recursive codes. The algorithm also readily checks for function skip and returns exceptions. △ Less

Submitted 4 March, 2013; originally announced March 2013.

Comments: 6 pages, 7 figures, 1 Table, Conference

Journal ref: World Applied Programming, Vol (2), Issue (5), May 2012. 294-299, Special section for proceeding of International E-Conference on Information Technology and Applications (IECITA), ISSN: 2222-2510

Showing 1–12 of 12 results for author: V., K