Search | arXiv e-print repository

Semantic SQL -- Combining and optimizing semantic predicates in SQL

Authors: Akash Mittal, Anshul Bheemreddy, Huili Tao

Abstract: In recent years, the surge in unstructured data analysis, facilitated by advancements in Machine Learning (ML), has prompted diverse approaches for handling images, text documents, and videos. Analysts, leveraging ML models, can extract meaningful information from unstructured data and store it in relational databases, allowing the execution of SQL queries for further analysis. Simultaneously, vector databases have emerged, embedding unstructured data for efficient top-k queries based on textual queries. This paper introduces a novel framework SSQL - Semantic SQL that utilizes these two approaches, enabling the incorporation of semantic queries within SQL statements. Our approach extends SQL queries with dedicated keywords for specifying semantic queries alongside predicates related to ML model results and metadata. Our experimental results show that using just semantic queries fails catastrophically to answer count and spatial queries in more than 60% of the cases. Our proposed method jointly optimizes the queries containing both semantic predicates and predicates on structured tables, such as those generated by ML models or other metadata. Further, to improve the query results, we incorporated human-in-the-loop feedback to determine the optimal similarity score threshold for returning results.

Submitted 5 April, 2024; originally announced April 2024.

arXiv:2403.18060 [pdf, ps, other]

The Cordiality Game and the Game Cordiality Number

Authors: Elliot Krop, Aryan Mittal, Michael C. Wigal

Abstract: The cordiality game is played on a graph $G$ by two players, Admirable (A) and Impish (I), who take turns selecting unlabeled vertices of $G$. Admirable labels the selected vertices by $0$ and Impish by $1$, and the resulting label on any edge is the sum modulo $2$ of the labels of the vertices incident to that edge. The two players have opposite goals: Admirable attempts to minimize the number of edges with different labels as much as possible while Impish attempts to maximize this number. When both Admirable and Impish play their optimal games, we define the \emph{game cordiality number}, $c_g(G)$, as the absolute difference between the number of edges labeled zero and one. Let $P_n$ be the path on $n$ vertices. We show $c_g(P_n)\le \frac{n-3}{3}$ when $n \equiv 0 \pmod 3$, $c_g(P_n)\le \frac{n-1}{3}$ when $n \equiv 1 \pmod 3$, and $c_g(P_n)\le \frac{n+1}{3}$ when $n \equiv 2\pmod 3$. Furthermore, we show a similar bound, $c_g(T) \leq \frac{|T|}{2}$ holds for any tree $T$.

Submitted 26 March, 2024; originally announced March 2024.

Comments: 12 pages
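
The edge-labeling rule and the path bounds quoted in this abstract can be illustrated with a small sketch. It only scores a fixed, already completed labeling and evaluates the stated upper bounds; it does not simulate optimal play. The function names `edge_label_difference` and `path_upper_bound` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Tuple


def edge_label_difference(edges: List[Tuple[int, int]], labels: Dict[int, int]) -> int:
    """Return |#edges labeled 0 - #edges labeled 1| for a complete 0/1 vertex labeling.

    Each edge label is the sum modulo 2 of its endpoint labels.
    """
    ones = sum((labels[u] + labels[v]) % 2 for u, v in edges)
    zeros = len(edges) - ones
    return abs(zeros - ones)


def path_upper_bound(n: int) -> int:
    """Upper bound on c_g(P_n) as quoted in the abstract, split by n mod 3."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    remainder = n % 3
    if remainder == 0:
        return (n - 3) // 3
    if remainder == 1:
        return (n - 1) // 3
    return (n + 1) // 3


if __name__ == "__main__":
    # Path on 4 vertices with one possible final labeling (illustrative only).
    path_edges = [(0, 1), (1, 2), (2, 3)]
    final_labels = {0: 0, 1: 0, 2: 1, 3: 1}
    print(edge_label_difference(path_edges, final_labels))
    print([path_upper_bound(n) for n in (6, 7, 8)])
```

The integer division is exact in each branch because the numerator is a multiple of 3 for the corresponding residue class.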

arXiv:2402.18434 [pdf, other]

Graph Regularized Encoder Training for Extreme Classification

Authors: Anshul Mittal, Shikhar Mohan, Deepak Saini, Suchith C. Prabhu, Jain jiao, Sumeet Agarwal, Soumen Chakrabarti, Purushottam Kar, Manik Varma

Abstract: Deep extreme classification (XC) aims to train an encoder architecture and an accompanying classifier architecture to tag a data point with the most relevant subset of labels from a very large universe of labels. XC applications in ranking, recommendation and tagging routinely encounter tail labels for which the amount of training data is exceedingly small. Graph convolutional networks (GCN) present a convenient but computationally expensive way to leverage task metadata and enhance model accuracies in these settings. This paper formally establishes that in several use cases, the steep computational cost of GCNs is entirely avoidable by replacing GCNs with non-GCN architectures. The paper notices that in these settings, it is much more effective to use graph data to regularize encoder training than to implement a GCN. Based on these insights, an alternative paradigm RAMEN is presented to utilize graph metadata in XC settings that offers significant performance boosts with zero increase in inference computational costs. RAMEN scales to datasets with up to 1M labels and offers prediction accuracy up to 15% higher on benchmark datasets than state of the art methods, including those that use graph metadata to train GCNs. RAMEN also offers 10% higher accuracy over the best baseline on a proprietary recommendation dataset sourced from click logs of a popular search engine. Code for RAMEN will be released publicly.

Submitted 28 February, 2024; originally announced February 2024.

arXiv:2312.06665 [pdf]

Predicting Neural Stem Cell Differentiation Using Deep Learning Models

Authors: Chandra Suda, Nidhi Parthasarathy, Anika Mittal, Ian Young Chen, Ananya Jalihal

Abstract: Neural stem cells have immense therapeutic potential for treating various neurological disorders. However, lengthy differentiation protocols hinder the translation of neural stem cells into clinical applications. In this study, we present a deep learning approach using convolutional neural networks (CNNs) to predict the fate of neural stem cell differentiation at an early stage. We trained a CNN model on a dataset of cellular images from neural stem cell cultures, and the model achieved impressive results with a 99.7% testing accuracy for the binary Resnet50 model and a 93.3% testing accuracy for the multiclass Resnet50 model in predicting neuron and glial cell differentiation. This demonstrates the feasibility of using CNNs for rapid, early differentiation outcome prediction from simple microscopy images, which could greatly accelerate neural stem cell research and therapies. Additionally, the model provides biological insights into morphological features associated with specific neural cell lineages.

Submitted 19 November, 2023; originally announced December 2023.

arXiv:2311.12727 [pdf, other]

Soft Random Sampling: A Theoretical and Empirical Analysis

Authors: Xiaodong Cui, Ashish Mittal, Songtao Lu, Wei Zhang, George Saon, Brian Kingsbury

Abstract: Soft random sampling (SRS) is a simple yet effective approach for efficient training of large-scale deep neural networks when dealing with massive data. SRS selects a subset uniformly at random with replacement from the full data set in each epoch. In this paper, we conduct a theoretical and empirical analysis of SRS. First, we analyze its sampling dynamics including data coverage and occupancy. Next, we investigate its convergence with non-convex objective functions and give the convergence rate. Finally, we provide its generalization performance. We empirically evaluate SRS for image recognition on CIFAR10 and automatic speech recognition on Librispeech and an in-house payload dataset to demonstrate its effectiveness. Compared to existing coreset-based data selection methods, SRS offers a better accuracy-efficiency trade-off. Especially on real-world industrial scale data sets, it is shown to be a powerful training strategy with significant speedup and competitive performance with almost no additional computing cost.

Submitted 23 November, 2023; v1 submitted 21 November, 2023; originally announced November 2023.
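
The per-epoch sampling step described in this abstract (a subset drawn uniformly at random with replacement) can be sketched as follows. The coverage formula used here, 1 - (1 - 1/n)^m for m draws from n items, is the standard result for sampling with replacement and is an assumption of this sketch; the paper's own coverage and occupancy analysis may be stated differently. The names `soft_random_sample` and `expected_coverage` are hypothetical.

```python
import random
from typing import List


def soft_random_sample(dataset_size: int, subset_size: int, rng: random.Random) -> List[int]:
    """Draw subset_size indices uniformly at random WITH replacement for one epoch."""
    return [rng.randrange(dataset_size) for _ in range(subset_size)]


def expected_coverage(dataset_size: int, num_draws: int) -> float:
    """Expected fraction of distinct items seen after num_draws draws with replacement."""
    return 1.0 - (1.0 - 1.0 / dataset_size) ** num_draws


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    n, m = 10_000, 3_000    # illustrative sizes, not taken from the paper
    seen = set()
    for epoch in range(1, 6):
        seen.update(soft_random_sample(n, m, rng))
        empirical = len(seen) / n
        predicted = expected_coverage(n, epoch * m)
        print(f"epoch {epoch}: empirical {empirical:.3f}, expected {predicted:.3f}")
```

Because duplicates are allowed, a single epoch with subset size m touches fewer than m distinct examples on average, which is what the coverage curve quantifies.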

arXiv:2310.18924 [pdf, other]

Remaining useful life prediction of Lithium-ion batteries using spatio-temporal multimodal attention networks

Authors: Sungho Suh, Dhruv Aditya Mittal, Hymalai Bello, Bo Zhou, Mayank Shekhar Jha, Paul Lukowicz

Abstract: Lithium-ion batteries are widely used in various applications, including electric vehicles and renewable energy storage. The prediction of the remaining useful life (RUL) of batteries is crucial for ensuring reliable and efficient operation, as well as reducing maintenance costs. However, determining the life cycle of batteries in real-world scenarios is challenging, and existing methods have limitations in predicting the number of cycles iteratively. In addition, existing works often oversimplify the datasets, neglecting important features of the batteries such as temperature, internal resistance, and material type. To address these limitations, this paper proposes a two-stage RUL prediction scheme for Lithium-ion batteries using a spatio-temporal multimodal attention network (ST-MAN). The proposed ST-MAN is designed to capture the complex spatio-temporal dependencies in the battery data, including the features that are often neglected in existing works. Despite operating without prior knowledge of end-of-life (EOL) events, our method consistently achieves lower error rates, boasting mean absolute error (MAE) and mean square error (MSE) of 0.0275 and 0.0014, respectively, compared to existing convolutional neural networks (CNN) and long short-term memory (LSTM)-based methods. The proposed method has the potential to improve the reliability and efficiency of battery operations and is applicable in various industries.

Submitted 6 June, 2024; v1 submitted 29 October, 2023; originally announced October 2023.

arXiv:2310.14095 [pdf, other]

Neutral Hydrogen (HI) 21 cm as a probe: Investigating Spatial Variations in Interstellar Turbulent Properties

Authors: Amit K. Mittal, Brian L Babler, Snezana Stanimirovic, Nickolas Pingel

Abstract: Interstellar turbulence shapes the HI distribution in the Milky Way (MW). How this affects large-scale statistical properties of HI column density across the MW remains largely unconstrained. We use approx 13,000 square-degree GALFA-HI survey to map statistical fluctuations of HI over the 40 km s$^{-1}$ velocity range. We calculate the spatial power spectrum (SPS) of HI column density image by running a 3-degree kernel and measuring SPS slope over a range of angular scales from 16 arcmin to 20 degree. Due to GALFA complex observing and calibration strategy, we construct detailed estimates of the noise contribution and account for GALFA beam effects on SPS. This allows us to systematically analyze HI images that trace a wide range of interstellar environments. We find that SPS slope varies between -2.6 at high Galactic latitudes, and -3.2 close to Galactic plane. The range of SPS slope values becomes tighter when we consider HI optical depth and line-of-sight length caused by the plane-parallel geometry of HI disk. This relatively uniform, large-scale distribution of SPS slope is suggestive of large-scale turbulent driving being a dominant mechanism for shaping HI structures in the MW and/or the stellar feedback turbulence being efficiently dissipated within dense molecular clouds. Only at latitudes above 60 degrees we find evidence for HI SPS slope being consistently more shallow. Those directions are largely within the Local Bubble, suggesting the recent history of this cavity, shaped by multiple supernovae explosions, has modified the turbulent state of HI and/or fractions of HI phases.

Submitted 21 October, 2023; originally announced October 2023.

Comments: Accepted in ApJ

arXiv:2310.08891 [pdf, other]

EHI: End-to-end Learning of Hierarchical Index for Efficient Dense Retrieval

Authors: Ramnath Kumar, Anshul Mittal, Nilesh Gupta, Aditya Kusupati, Inderjit Dhillon, Prateek Jain

Abstract: Dense embedding-based retrieval is now the industry standard for semantic search and ranking problems, like obtaining relevant web documents for a given query. Such techniques use a two-stage process: (a) contrastive learning to train a dual encoder to embed both the query and documents and (b) approximate nearest neighbor search (ANNS) for finding similar documents for a given query. These two stages are disjoint; the learned embeddings might be ill-suited for the ANNS method and vice-versa, leading to suboptimal performance. In this work, we propose End-to-end Hierarchical Indexing -- EHI -- that jointly learns both the embeddings and the ANNS structure to optimize retrieval performance. EHI uses a standard dual encoder model for embedding queries and documents while learning an inverted file index (IVF) style tree structure for efficient ANNS. To ensure stable and efficient learning of discrete tree-based ANNS structure, EHI introduces the notion of dense path embedding that captures the position of a query/document in the tree. We demonstrate the effectiveness of EHI on several benchmarks, including de-facto industry standard MS MARCO (Dev set and TREC DL19) datasets. For example, with the same compute budget, EHI outperforms state-of-the-art (SOTA) by 0.6% (MRR@10) on MS MARCO dev set and by 4.2% (nDCG@10) on TREC DL19 benchmarks.

Submitted 13 October, 2023; originally announced October 2023.

arXiv:2310.08304 [pdf]

CHIP: Contrastive Hierarchical Image Pretraining

Authors: Arpit Mittal, Harshil Jhaveri, Swapnil Mallick, Abhishek Ajmera

Abstract: Few-shot object classification is the task of classifying objects in an image with a limited number of examples as supervision. We propose a one-shot/few-shot classification model that can classify an object of any unseen class into a relatively general category in a hierarchically based classification. Our model uses a three-level hierarchical contrastive loss based ResNet152 classifier for classifying an object based on its features extracted from image embedding, not used during the training phase. For our experimentation, we have used a subset of the ImageNet (ILSVRC-12) dataset that contains only the animal classes for training our model and created our own dataset of unseen classes for evaluating our trained model. Our model provides satisfactory results in classifying the unknown objects into a generic category which has been later discussed in greater detail.

Submitted 12 October, 2023; originally announced October 2023.

arXiv:2309.14382 [pdf, other]

Agree To Disagree

Authors: Abhinav Raghuvanshi, Siddhesh Pawar, Anirudh Mittal

Abstract: How frequently do individuals thoroughly review terms and conditions before proceeding to register for a service, install software, or access a website? The majority of internet users do not engage in this practice. This trend is not surprising, given that terms and conditions typically consist of lengthy documents replete with intricate legal terminology and convoluted sentences. In this paper, we introduce a Machine Learning-powered approach designed to automatically parse and summarize critical information in a user-friendly manner. This technology focuses on distilling the pertinent details that users should contemplate before committing to an agreement.

Submitted 24 September, 2023; originally announced September 2023.

arXiv:2309.05475 [pdf, other]

Zero-shot Learning with Minimum Instruction to Extract Social Determinants and Family History from Clinical Notes using GPT Model

Authors: Neel Bhate, Ansh Mittal, Zhe He, Xiao Luo

Abstract: Demographics, social determinants of health, and family history documented in the unstructured text within the electronic health records are increasingly being studied to understand how this information can be utilized with the structured data to improve healthcare outcomes. After the GPT models were released, many studies have applied GPT models to extract this information from the narrative clinical notes. Different from the existing work, our research focuses on investigating the zero-shot learning on extracting this information together by providing minimum information to the GPT model. We utilize de-identified real-world clinical notes annotated for demographics, various social determinants, and family history information. Given that the GPT model might provide text different from the text in the original data, we explore two sets of evaluation metrics, including the traditional NER evaluation metrics and semantic similarity evaluation metrics, to completely understand the performance. Our results show that the GPT-3.5 method achieved an average of 0.975 F1 on demographics extraction, 0.615 F1 on social determinants extraction, and 0.722 F1 on family history extraction. We believe these results can be further improved through model fine-tuning or few-shot learning. Through the case studies, we also identified the limitations of the GPT models, which need to be addressed in future research.

Submitted 13 September, 2023; v1 submitted 11 September, 2023; originally announced September 2023.

Comments: 5 pages, 4 figures

arXiv:2309.04961 [pdf, other]

doi 10.1109/CVPR52688.2022.01207

Multi-modal Extreme Classification

Authors: Anshul Mittal, Kunal Dahiya, Shreya Malani, Janani Ramaswamy, Seba Kuruvilla, Jitendra Ajmera, Keng-hao Chang, Sumeet Agarwal, Purushottam Kar, Manik Varma

Abstract: This paper develops the MUFIN technique for extreme classification (XC) tasks with millions of labels where datapoints and labels are endowed with visual and textual descriptors. Applications of MUFIN to product-to-product recommendation and bid query prediction over several millions of products are presented. Contemporary multi-modal methods frequently rely on purely embedding-based methods. On the other hand, XC methods utilize classifier architectures to offer superior accuracies than embedding only methods but mostly focus on text-based categorization tasks. MUFIN bridges this gap by reformulating multi-modal categorization as an XC problem with several millions of labels. This presents the twin challenges of developing multi-modal architectures that can offer embeddings sufficiently expressive to allow accurate categorization over millions of labels; and training and inference routines that scale logarithmically in the number of labels. MUFIN develops an architecture based on cross-modal attention and trains it in a modular fashion using pre-training and positive and negative mining. A novel product-to-product recommendation dataset MM-AmazonTitles-300K containing over 300K products was curated from publicly available amazon.com listings with each product endowed with a title and multiple images. On all the datasets MUFIN offered at least 3% higher accuracy than leading text-based, image-based and multi-modal techniques. Code for MUFIN is available at https://github.com/Extreme-classification/MUFIN

Submitted 10 September, 2023; originally announced September 2023.

ACM Class: H.3.3

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022

arXiv:2308.13969 [pdf, other]

Fixating on Attention: Integrating Human Eye Tracking into Vision Transformers

Authors: Sharath Koorathota, Nikolas Papadopoulos, Jia Li Ma, Shruti Kumar, Xiaoxiao Sun, Arunesh Mittal, Patrick Adelman, Paul Sajda

Abstract: Modern transformer-based models designed for computer vision have outperformed humans across a spectrum of visual tasks. However, critical tasks, such as medical image interpretation or autonomous driving, still require reliance on human judgments. This work demonstrates how human visual input, specifically fixations collected from an eye-tracking device, can be integrated into transformer models to improve accuracy across multiple driving situations and datasets. First, we establish the significance of fixation regions in left-right driving decisions, as observed in both human subjects and a Vision Transformer (ViT). By comparing the similarity between human fixation maps and ViT attention weights, we reveal the dynamics of overlap across individual heads and layers. This overlap is exploited for model pruning without compromising accuracy. Thereafter, we incorporate information from the driving scene with fixation data, employing a "joint space-fixation" (JSF) attention setup. Lastly, we propose a "fixation-attention intersection" (FAX) loss to train the ViT model to attend to the same regions that humans fixated on. We find that the ViT performance is improved in accuracy and number of training epochs when using JSF and FAX. These results hold significant implications for human-guided artificial intelligence.

Submitted 26 August, 2023; originally announced August 2023.

Comments: 25 pages, 9 figures, 3 tables

arXiv:2308.12157 [pdf, other]

Evaluation of Faithfulness Using the Longest Supported Subsequence

Authors: Anirudh Mittal, Timo Schick, Mikel Artetxe, Jane Dwivedi-Yu

Abstract: As increasingly sophisticated language models emerge, their trustworthiness becomes a pivotal issue, especially in tasks such as summarization and question-answering. Ensuring their responses are contextually grounded and faithful is challenging due to the linguistic diversity and the myriad of possible answers. In this paper, we introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous substring of the claim that is supported by the context, which we refer to as the Longest Supported Subsequence (LSS). Using a new human-annotated dataset, we finetune a model to generate LSS. We introduce a new method of evaluation and demonstrate that these metrics correlate better with human ratings when LSS is employed, as opposed to when it is not. Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset. Our metric consistently outperforms other metrics on a summarization dataset across six different models. Finally, we compare several popular Large Language Models (LLMs) for faithfulness using this metric. We release the human-annotated dataset built for predicting LSS and our fine-tuned model for evaluating faithfulness.

Submitted 23 August, 2023; originally announced August 2023.
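
To make the idea of a "noncontinuous substring of the claim" concrete, the sketch below computes a token-level longest common subsequence between a claim and its context. This is a purely lexical stand-in chosen for illustration: the abstract says the authors finetune a model to generate the LSS, so semantic support is not captured here, and the function name `longest_common_subsequence` and the example sentences are hypothetical.

```python
from typing import List


def longest_common_subsequence(claim: List[str], context: List[str]) -> List[str]:
    """Return one longest subsequence of claim tokens that also appears, in order, in context."""
    n, m = len(claim), len(context)
    # dp[i][j] = LCS length of claim[i:] and context[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if claim[i] == context[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table forward to recover the matched tokens.
    matched: List[str] = []
    i = j = 0
    while i < n and j < m:
        if claim[i] == context[j]:
            matched.append(claim[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched


if __name__ == "__main__":
    claim_tokens = "the cat sat on the red mat".split()
    context_tokens = "the cat sat on a mat".split()
    supported = longest_common_subsequence(claim_tokens, context_tokens)
    print(supported, len(supported) / len(claim_tokens))
```

The ratio of matched tokens to claim length gives a crude overlap score; it should not be read as the metric evaluated in the paper.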

arXiv:2307.12549 [pdf, other]

Estimating Time to Clear Pendency of Cases in High Courts in India using Linear Regression

Authors: Kshitiz Verma, Anshu Musaddi, Ansh Mittal, Anshul Jain

Abstract: The Indian judiciary is suffering from the burden of millions of cases that are lying pending in its courts at all levels. The High Court National Judicial Data Grid (HC-NJDG) indexes all the cases pending in the high courts and publishes the data publicly. In this paper, we analyze the data that we have collected from the HC-NJDG portal on 229 randomly chosen days between August 31, 2017 and March 22, 2020, including these dates. Thus, the data analyzed in the paper spans a period of more than two and a half years. We show that: 1) the number of pending cases in most of the high courts is increasing linearly with time. 2) the case load on judges in various high courts is very unevenly distributed, making judges of some high courts a hundred times more loaded than others. 3) for some high courts it may take even a hundred years to clear the pending cases if proper measures are not taken. We also suggest some policy changes that may help clear the pendency within a fixed time of either five or fifteen years. Finally, we find that the rate of institution of cases in high courts can be easily handled by the current sanctioned strength. However, extra judges are needed only to clear earlier backlogs.

Submitted 24 July, 2023; originally announced July 2023.

Comments: 12 pages, 9 figures, JURISIN 2022. arXiv admin note: text overlap with arXiv:2307.10615

arXiv:2307.05006 [pdf, ps, other]

Improving RNN-Transducers with Acoustic LookAhead

Authors: Vinit S. Unni, Ashish Mittal, Preethi Jyothi, Sunita Sarawagi

Abstract: RNN-Transducers (RNN-Ts) have gained widespread acceptance as an end-to-end model for speech to text conversion because of their high accuracy and streaming capabilities. A typical RNN-T independently encodes the input audio and the text context, and combines the two encodings by a thin joint network. While this architecture provides SOTA streaming accuracy, it also makes the model vulnerable to strong LM biasing which manifests as multi-step hallucination of text without acoustic evidence. In this paper we propose LookAhead that makes text representations more acoustically grounded by looking ahead into the future within the audio input. This technique yields a significant 5%-20% relative reduction in word error rate on both in-domain and out-of-domain evaluation sets.

Submitted 10 July, 2023; originally announced July 2023.

Comments: 5 pages, 1 fig, 7 tables, Proceedings of Interspeech 2023

arXiv:2306.16048 [pdf, other]

Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity

Authors: Zhenlin Xu, Yi Zhu, Tiffany Deng, Abhay Mittal, Yanbei Chen, Manchen Wang, Paolo Favaro, Joseph Tighe, Davide Modolo

Abstract: This paper presents novel benchmarks for evaluating vision-language models (VLMs) in zero-shot recognition, focusing on granularity and specificity. Although VLMs excel in tasks like image captioning, they face challenges in open-world settings. Our benchmarks test VLMs' consistency in understanding concepts across semantic granularity levels and their response to varying text specificity. Findings show that VLMs favor moderately fine-grained concepts and struggle with specificity, often misjudging texts that differ from their training data. Extensive evaluations reveal limitations in current VLMs, particularly in distinguishing between correct and subtly incorrect descriptions. While fine-tuning offers some improvements, it doesn't fully address these issues, highlighting the need for VLMs with enhanced generalization capabilities for real-world applications. This study provides insights into VLM limitations and suggests directions for developing more robust models.

Submitted 18 June, 2024; v1 submitted 28 June, 2023; originally announced June 2023.

Comments: CVPR2024 MMFM workshop

arXiv:2306.14904 [pdf, other]

Determining Smallest Path Size of Multiplication Transducers Without a Restricted Digit Set

Authors: Aditya Mittal, Karthik Mittal

Abstract: Directed multiplication transducers are a tool for performing non-decimal base multiplication without an additional conversion to base 10. This allows for faster computation and provides easier visualization depending on the problem at hand. By building these multiplication transducers computationally, new patterns can be identified as these transducers can be built with much larger bases and multipliers. Through a recursive approach, we created artificial multiplication transducers, allowing for the formation of several unique conjectures specifically focused on the smallest closed loop around a multiplication transducer starting and ending at zero. We show a general recursive pattern for this loop; through this recurrence relation, the length of the smallest closed loop for a particular transducer base b along with the range of multipliers having this particular length for multiplier m was also identified. This research is expected to be explored further by testing reductions of the digit set and determining whether similar properties will hold.

Submitted 14 June, 2023; originally announced June 2023.

Comments: 15 pages, 4 figures, submitted at SoCal-Nevada MAA Session 2022 and Cal State East Bay Student Research Symposium 2022
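The abstract above describes multiplying in a non-decimal base with a transducer. A minimal sketch of that idea follows, assuming the textbook carry-state construction: states are the possible carries 0..m-1, and reading a digit (least significant first) emits one output digit and moves to the next carry. The names `build_transducer` and `multiply` are hypothetical and are not taken from the paper, whose directed construction and loop analysis may differ.

```python
def build_transducer(base, multiplier):
    """Map (carry, input_digit) -> (output_digit, next_carry).

    Assumed textbook construction, not necessarily the paper's:
    the carry is always in range(multiplier) when digits are in range(base).
    """
    table = {}
    for carry in range(multiplier):
        for digit in range(base):
            total = multiplier * digit + carry
            table[(carry, digit)] = (total % base, total // base)
    return table


def multiply(digits_lsd_first, base, multiplier):
    """Multiply a base-`base` number (least significant digit first) by `multiplier`."""
    table = build_transducer(base, multiplier)
    state = 0  # start with no carry
    out = []
    for digit in digits_lsd_first:
        emitted, state = table[(state, digit)]
        out.append(emitted)
    # Feed padding zeros until the carry is flushed and we are back at state 0.
    while state != 0:
        emitted, state = table[(state, 0)]
        out.append(emitted)
    return out


# 12 is 110 in base 3; doubling it gives 24, which is 220 in base 3.
result = multiply([0, 1, 1], base=3, multiplier=2)
```

The whole computation stays in the target base: no intermediate conversion to base 10 is needed, only a table lookup per digit.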

arXiv:2306.14812 [pdf, other]

MOVES: Movable and Moving LiDAR Scene Segmentation in Label-Free settings using Static Reconstruction

Authors: Prashant Kumar, Dhruv Makwana, Onkar Susladkar, Anurag Mittal, Prem Kumar Kalra

Abstract: Accurate static structure reconstruction and segmentation of non-stationary objects is of vital importance for autonomous navigation applications. These applications assume a LiDAR scan to consist of only static structures. In the real world however, LiDAR scans consist of non-stationary dynamic structures - moving and movable objects. Current solutions use segmentation information to isolate and remove moving structures from LiDAR scan. This strategy fails in several important use-cases where segmentation information is not available. In such scenarios, moving objects and objects with high uncertainty in their motion i.e. movable objects, may escape detection. This violates the above assumption. We present MOVES, a novel GAN based adversarial model that segments out moving as well as movable objects in the absence of segmentation information. We achieve this by accurately transforming a dynamic LiDAR scan to its corresponding static scan. This is obtained by replacing dynamic objects and corresponding occlusions with static structures which were occluded by dynamic objects. We leverage corresponding static-dynamic LiDAR pairs.

Submitted 15 October, 2023; v1 submitted 26 June, 2023; originally announced June 2023.

Comments: 35 pages, 8 figures, 6 tables

arXiv:2306.09988 [pdf, other]

A Numerically Robust and Stable Time-Space Pseudospectral Approach for Generalized Burgers-Fisher Equation

Authors: Harvindra Singh, Lokendra Balyan, A. K. Mittal, Parul Saini

Abstract: In this article, we present the time-space Chebyshev pseudospectral method (TS-CPsM) to approximate a solution to the generalised Burgers-Fisher (gBF) equation. The Chebyshev-Gauss-Lobatto (CGL) points serve as the foundation for the recommended method, which makes use of collocations in both the time and space directions. Further, using a mapping, the non-homogeneous initial-boundary value problem is transformed into a homogeneous problem, and a system of algebraic equations is obtained. The numerical approach known as Newton-Raphson is implemented in order to get the desired results for the system. The proposed method's stability analysis has been performed. Different researchers' considerations on test problems have been explored to illustrate the robustness and practicality of the approach presented. The approximate solutions we found using the proposed method are highly accurate and significantly better than the existing results.

Submitted 16 June, 2023; originally announced June 2023.

arXiv:2306.04849 [pdf, other]

ScaleDet: A Scalable Multi-Dataset Object Detector

Authors: Yanbei Chen, Manchen Wang, Abhay Mittal, Zhenlin Xu, Paolo Favaro, Joseph Tighe, Davide Modolo

Abstract: Multi-dataset training provides a viable solution for exploiting heterogeneous large-scale datasets without extra annotation cost. In this work, we propose a scalable multi-dataset detector (ScaleDet) that can scale up its generalization across datasets when increasing the number of training datasets. Unlike existing multi-dataset learners that mostly rely on manual relabelling efforts or sophisticated optimizations to unify labels across datasets, we introduce a simple yet scalable formulation to derive a unified semantic label space for multi-dataset training. ScaleDet is trained by visual-textual alignment to learn the label assignment with label semantic similarities across datasets. Once trained, ScaleDet can generalize well on any given upstream and downstream datasets with seen and unseen classes. We conduct extensive experiments using LVIS, COCO, Objects365, OpenImages as upstream datasets, and 13 datasets from Object Detection in the Wild (ODinW) as downstream datasets. Our results show that ScaleDet achieves compelling strong model performance with an mAP of 50.7 on LVIS, 58.8 on COCO, 46.8 on Objects365, 76.2 on OpenImages, and 71.8 on ODinW, surpassing state-of-the-art detectors with the same backbone.

Submitted 7 June, 2023; originally announced June 2023.

Comments: CVPR 2023

arXiv:2304.10050 [pdf, other]

Neural Radiance Fields: Past, Present, and Future

Authors: Ansh Mittal

Abstract: The various aspects like modeling and interpreting 3D environments and surroundings have enticed humans to progress their research in 3D Computer Vision, Computer Graphics, and Machine Learning. An attempt made by Mildenhall et al in their paper about NeRFs (Neural Radiance Fields) led to a boom in Computer Graphics, Robotics, Computer Vision, and the possible scope of High-Resolution Low Storage Augmented Reality and Virtual Reality-based 3D models have gained traction from res with more than 1000 preprints related to NeRFs published. This paper serves as a bridge for people starting to study these fields by building on the basics of Mathematics, Geometry, Computer Vision, and Computer Graphics to the difficulties encountered in Implicit Representations at the intersection of all these disciplines. This survey provides the history of rendering, Implicit Learning, and NeRFs, the progression of research on NeRFs, and the potential applications and implications of NeRFs in today's world. In doing so, this survey categorizes all the NeRF-related research in terms of the datasets used, objective functions, applications solved, and evaluation criteria for these applications.

Submitted 14 January, 2024; v1 submitted 19 April, 2023; originally announced April 2023.

Comments: 413 pages, 9 figures, 277 citations

arXiv:2303.08230 [pdf, other]

Bayesian Beta-Bernoulli Process Sparse Coding with Deep Neural Networks

Authors: Arunesh Mittal, Kai Yang, Paul Sajda, John Paisley

Abstract: Several approximate inference methods have been proposed for deep discrete latent variable models. However, non-parametric methods which have previously been successfully employed for classical sparse coding models have largely been unexplored in the context of deep models. We propose a non-parametric iterative algorithm for learning discrete latent representations in such deep models. Additionally, to learn scale invariant discrete features, we propose local data scaling variables. Lastly, to encourage sparsity in our representations, we propose a Beta-Bernoulli process prior on the latent factors. We evaluate our sparse coding model coupled with different likelihood models. We evaluate our method across datasets with varying characteristics and compare our results to current amortized approximate inference methods.

Submitted 14 March, 2023; originally announced March 2023.

arXiv:2301.09420 [pdf, other]

On Multi-Agent Deep Deterministic Policy Gradients and their Explainability for SMARTS Environment

Authors: Ansh Mittal, Aditya Malte

Abstract: Multi-Agent RL or MARL is one of the complex problems in Autonomous Driving literature that hampers the release of fully-autonomous vehicles today. Several simulators have been in iteration after their inception to mitigate the problem of complex scenarios with multiple agents in Autonomous Driving. One such simulator--SMARTS, discusses the importance of cooperative multi-agent learning. For this problem, we discuss two approaches--MAPPO and MADDPG, which are based on-policy and off-policy RL approaches. We compare our results with the state-of-the-art results for this challenge and discuss the potential areas of improvement while discussing the explainability of these approaches in conjunction with waypoints in the SMARTS environment.

Submitted 19 January, 2023; originally announced January 2023.

Comments: 6 pages, 5 figures

arXiv:2212.14125 [pdf, other]

MuTable (Music Table): Turn any surface into musical instrument

Authors: Akash Mittal, Ragini Gupta

Abstract: With the rise in pervasive computing solutions, interactive surfaces have gained a large popularity across multi-application domains including smart boards for education, touch-enabled kiosks for smart retail and smart mirrors for smart homes. Despite the increased popularity of such interactive surfaces, existing platforms are mostly limited to custom built surfaces with attached sensors and hardware, that are expensive and require complicated design considerations. To address this, we design a low-cost, intuitive system called MuTable that repurposes any flat surface (such as table tops) into a live musical instrument. This provides a unique, close to real-time instrument playing experience to the user to play any type of musical instrument. This is achieved by projecting the instrument's shape on any tangible surface, sensor calibration, user taps detection, tap position identification, and associated sound generation. We demonstrate the performance of our working system by reporting an accuracy of 83% for detecting softer taps, 100% accuracy for detecting the regular taps, and a precision of 95.7% for estimating hand location.

Submitted 28 December, 2022; originally announced December 2022.

arXiv:2212.05933 [pdf, other]

Nostradamus: Weathering Worth

Authors: Alapan Chaudhuri, Zeeshan Ahmed, Ashwin Rao, Shivansh Subramanian, Shreyas Pradhan, Abhishek Mittal

Abstract: Nostradamus, inspired by the French astrologer and reputed seer, is a detailed study exploring relations between environmental factors and changes in the stock market. In this paper, we analyze associative correlation and causation between environmental elements (including natural disasters, climate and weather conditions) and stock prices, using historical stock market data, historical climate data, and various climate indicators such as carbon dioxide emissions. We have conducted our study based on the US financial market, global climate trends, and daily weather records to demonstrate a significant relationship between climate and stock price fluctuation. Our analysis covers both short-term and long-term rises and dips in company stock performances. Lastly, we take four natural disasters as a case study to observe the effect they have on people's emotional state and their influence on the stock market.

Submitted 17 January, 2023; v1 submitted 8 December, 2022; originally announced December 2022.

Comments: 13 pages, 13 figures; updated abstract; updated format to Springer LNCS

arXiv:2212.04122 [pdf, other]

Reducing Collision Risk in Multi-Agent Path Planning: Application to Air traffic Management

Authors: Sarah H. Q. Li, Avi Mittal, Pierre-Loïc Garoche, Behçet Açıkmeşe

Abstract: To minimize collision risks in the multi-agent path planning problem with stochastic transition dynamics, we formulate a Markov decision process congestion game with a multi-linear congestion cost. Players within the game complete individual tasks while minimizing their own collision risks. We show that the set of Nash equilibria coincides with the first-order KKT points of a non-convex optimization problem. Our game is applied to a historical flight plan over France to reduce collision risks between commercial aircraft.

Submitted 10 December, 2022; v1 submitted 8 December, 2022; originally announced December 2022.

Comments: 6 pages, 2 figures

arXiv:2210.16892 [pdf, other]

Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training

Authors: Ashish Mittal, Durga Sivasubramanian, Rishabh Iyer, Preethi Jyothi, Ganesh Ramakrishnan

Abstract: Training state-of-the-art ASR systems such as RNN-T often has a high associated financial and environmental cost. Training with a subset of training data could mitigate this problem if the subset selected could achieve on-par performance with training with the entire dataset. Although there are many data subset selection (DSS) algorithms, direct application to the RNN-T is difficult, especially the DSS algorithms that are adaptive and use learning dynamics such as gradients, as RNN-T tend to have gradients with a significantly larger memory footprint. In this paper, we propose Partitioned Gradient Matching (PGM) a novel distributable DSS algorithm, suitable for massive datasets like those used to train RNN-T. Through extensive experiments on Librispeech 100H and Librispeech 960H, we show that PGM achieves between 3x to 6x speedup with only a very small accuracy degradation (under 1% absolute WER difference). In addition, we demonstrate similar results for PGM even in settings where the training data is corrupted with noise.

Submitted 30 October, 2022; originally announced October 2022.

arXiv:2210.05162 [pdf, other]

doi 10.1109/TAES.2023.3254527

Estimation methods for elementary chirp model parameters

Authors: Anjali Mittal, Rhythm Grover, Debasis Kundu, Amit Mitra

Abstract: In this paper, we propose some estimation techniques to estimate the elementary chirp model parameters, which are encountered in sonar, radar, acoustics, and other areas. We derive asymptotic theoretical properties of least squares estimators and approximate least squares estimators for the one-component elementary chirp model. It is proved that the proposed estimators are strongly consistent and follow the normal distribution asymptotically. We also suggest how to obtain proper initial values for these methods. The problem of finding initial values is a difficult problem when the number of components in the model is large, or when the signal-to-noise ratio is low, or when two frequency rates are close to each other. We propose sequential procedures to estimate the multiple-component elementary chirp model parameters. We prove that the theoretical properties of sequential least squares estimators and sequential approximate least squares estimators coincide with those of least squares estimators and approximate least squares estimators, respectively. To evaluate the performance of the proposed estimators, numerical experiments are performed. It is observed that the proposed sequential estimators perform well even in situations where least squares estimators do not perform well. We illustrate the performance of the proposed sequential algorithm on a bat data.

Submitted 11 October, 2022; originally announced October 2022.

arXiv:2209.11472 [pdf]

doi 10.1016/j.solener.2022.08.056

Alternate stabilization methods for CZTSSe photovoltaic devices by thermal treatment, dark electric bias and illumination

Authors: W. Ananda, M. Rennhofer, A. Mittal, N. Zechner, W. Lang

Abstract: Reliable measurement routines are crucial for power rating and yield prediction of photovoltaic emerging thin-film technologies. Copper-Zinc-Tin-Sulfur-Selenium (CZTSSe) thin-film photovoltaic devices are an emerging technology made of abundant elements. Still, sufficient stabilization methods prior to electric power measurement are missing in the international standardization, while existing standards for other thin-film technologies do not work properly for CZTSSe. This study investigated methods for achieving power stabilization of the CZTSSe solar devices. Three complementary stabilization routines for the kesterite-based solar devices were investigated as an alternative to the existing international device testing standards: rapid annealing, dark electric biasing and different operating points under illumination. The typical number of stabilization cycles for power stabilization was between 3 and 6 cycles of rapid annealing, dark electric bias and illumination with a power loss of -19.5%, -11.4%, and -1.9%, for the respective methods. The dark electric bias method was found to provide the most reliable average result for power stabilization. All stabilization methods proved to have the potential to work sufficiently in stabilizing the CZTSSe devices for standardized power measurement.

Submitted 23 September, 2022; originally announced September 2022.

arXiv:2207.11838 [pdf, other]

SAVCHOI: Detecting Suspicious Activities using Dense Video Captioning with Human Object Interactions

Authors: Ansh Mittal, Shuvam Ghosal, Rishibha Bansal

Abstract: Detecting suspicious activities in surveillance videos is a longstanding problem in real-time surveillance that leads to difficulties in detecting crimes. Hence, we propose a novel approach for detecting and summarizing suspicious activities in surveillance videos. We have also created ground truth summaries for the UCF-Crime video dataset. We modify a pre-existing approach for this task by leveraging the Human-Object Interaction (HOI) model for the Visual features in the Bi-Modal Transformer. Further, we validate our approach against the existing state-of-the-art algorithms for the Dense Video Captioning task for the ActivityNet Captions dataset. We observe that this formulation for Dense Captioning performs significantly better than other discussed BMT-based approaches for BLEU@1, BLEU@2, BLEU@3, BLEU@4, and METEOR. We further perform a comparative analysis of the dataset and the model to report the findings based on different NMS thresholds (searched using Genetic Algorithms). Here, our formulation outperforms all the models for BLEU@1, BLEU@2, BLEU@3, and most models for BLEU@4 and METEOR falling short of only ADV-INF Global by 25% and 0.5%, respectively.

Submitted 22 October, 2022; v1 submitted 24 July, 2022; originally announced July 2022.

Comments: 14 pages, 6 figures, 6 tables

arXiv:2207.04452 [pdf, other]

NGAME: Negative Mining-aware Mini-batching for Extreme Classification

Authors: Kunal Dahiya, Nilesh Gupta, Deepak Saini, Akshay Soni, Yajun Wang, Kushal Dave, Jian Jiao, Gururaj K, Prasenjit Dey, Amit Singh, Deepesh Hada, Vidit Jain, Bhawna Paliwal, Anshul Mittal, Sonu Mehta, Ramachandran Ramjee, Sumeet Agarwal, Purushottam Kar, Manik Varma

Abstract: Extreme Classification (XC) seeks to tag data points with the most relevant subset of labels from an extremely large label set. Performing deep XC with dense, learnt representations for data points and labels has attracted much attention due to its superiority over earlier XC methods that used sparse, hand-crafted features. Negative mining techniques have emerged as a critical component of all deep XC methods that allow them to scale to millions of labels. However, despite recent advances, training deep XC models with large encoder architectures such as transformers remains challenging. This paper identifies that memory overheads of popular negative mining techniques often force mini-batch sizes to remain small and slow training down. In response, this paper introduces NGAME, a light-weight mini-batch creation technique that offers provably accurate in-batch negative samples. This allows training with larger mini-batches offering significantly faster convergence and higher accuracies than existing negative sampling techniques. NGAME was found to be up to 16% more accurate than state-of-the-art methods on a wide array of benchmark datasets for extreme classification, as well as 3% more accurate at retrieving search engine queries in response to a user webpage visit to show personalized ads. In live A/B tests on a popular search engine, NGAME yielded up to 23% gains in click-through-rates.

Submitted 10 July, 2022; originally announced July 2022.

arXiv:2205.01825 [pdf, other]

AmbiPun: Generating Humorous Puns with Ambiguous Context

Authors: Anirudh Mittal, Yufei Tian, Nanyun Peng

Abstract: In this paper, we propose a simple yet effective way to generate pun sentences that does not require any training on existing puns. Our approach is inspired by humor theories that ambiguity comes from the context rather than the pun word itself. Given a pair of definitions of a pun word, our model first produces a list of related concepts through a reverse dictionary. We then utilize one-shot GPT3 to generate context words and then generate puns incorporating context words from both concepts. Human evaluation shows that our method successfully generates puns 52% of the time, outperforming well-crafted baselines and the state-of-the-art models by a large margin.

Submitted 3 May, 2022; originally announced May 2022.

Comments: To appear in NAACL 2022

arXiv:2203.13628 [pdf, other]

DeLoRes: Decorrelating Latent Spaces for Low-Resource Audio Representation Learning

Authors: Sreyan Ghosh, Ashish Seth, Deepak Mittal, Maneesh Singh, S. Umesh

Abstract: Inspired by the recent progress in self-supervised learning for computer vision, in this paper we introduce DeLoRes, a new general-purpose audio representation learning approach. Our main objective is to make our network learn representations in a resource-constrained setting (both data and compute), that can generalize well across a diverse set of downstream tasks. Inspired from the Barlow Twins objective function, we propose to learn embeddings that are invariant to distortions of an input audio sample, while making sure that they contain non-redundant information about the sample. To achieve this, we measure the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of an audio segment sampled from an audio file and make it as close to the identity matrix as possible. We use a combination of a small subset of the large-scale AudioSet dataset and FSD50K for self-supervised learning and are able to learn with less than half the parameters compared to state-of-the-art algorithms. For evaluation, we transfer these learned representations to 9 downstream classification tasks, including speech, music, and animal sounds, and show competitive results under different evaluation setups. In addition to being simple and intuitive, our pre-training algorithm is amenable to compute through its inherent nature of construction and does not require careful implementation details to avoid trivial or degenerate solutions. Furthermore, we conduct ablation studies on our results and make all our code and pre-trained models publicly available at https://github.com/Speech-Lab-IITM/DeLoRes.

Submitted 26 June, 2022; v1 submitted 25 March, 2022; originally announced March 2022.

Comments: Accepted to AAAI 2022 workshop on Self-supervised Learning for Audio and Speech Processing
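The DeLoRes abstract above describes pushing the cross-correlation matrix of two network outputs toward the identity. A minimal sketch of a Barlow Twins-style loss of that kind follows, in plain Python on small lists. The function names and the `off_diag_weight` value are assumptions made for illustration, not the authors' released code (see their repository for that).

```python
import math


def standardize(z):
    """Return z with each feature column shifted to zero mean and unit variance over the batch."""
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        # Fall back to 1.0 for a constant column to avoid dividing by zero.
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
        for i in range(n):
            out[i][j] = (col[i] - mean) / std
    return out


def cross_correlation(z1, z2):
    """d x d cross-correlation between two batches of embeddings (n x d each)."""
    a, b = standardize(z1), standardize(z2)
    n, d = len(a), len(a[0])
    return [[sum(a[k][i] * b[k][j] for k in range(n)) / n for j in range(d)]
            for i in range(d)]


def decorrelation_loss(z1, z2, off_diag_weight=0.005):
    """Penalize distance of the cross-correlation matrix from the identity.

    off_diag_weight is a hypothetical trade-off value chosen for this sketch.
    """
    c = cross_correlation(z1, z2)
    d = len(c)
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))   # invariance term
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)  # redundancy term
    return on_diag + off_diag_weight * off_diag


# Two identical "views" whose features are already uncorrelated give a loss near zero.
view = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
loss = decorrelation_loss(view, view)
```

The diagonal term rewards embeddings that agree across the two distorted views, while the off-diagonal term discourages different embedding dimensions from carrying the same information.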

arXiv:2203.02317 [pdf, other]

Adaptive Discounting of Implicit Language Models in RNN-Transducers

Authors: Vinit Unni, Shreya Khare, Ashish Mittal, Preethi Jyothi, Sunita Sarawagi, Samarth Bharadwaj

Abstract: RNN-Transducer (RNN-T) models have become synonymous with streaming end-to-end ASR systems. While they perform competitively on a number of evaluation categories, rare words pose a serious challenge to RNN-T models. One main reason for the degradation in performance on rare words is that the language model (LM) internal to RNN-Ts can become overconfident and lead to hallucinated predictions that are acoustically inconsistent with the underlying speech. To address this issue, we propose a lightweight adaptive LM discounting technique AdaptLMD, that can be used with any RNN-T architecture without requiring any external resources or additional parameters. AdaptLMD uses a two-pronged approach: 1) Randomly mask the prediction network output to encourage the RNN-T to not be overly reliant on its outputs. 2) Dynamically choose when to discount the implicit LM (ILM) based on rarity of recently predicted tokens and divergence between ILM and implicit acoustic model (IAM) scores. Comparing AdaptLMD to a competitive RNN-T baseline, we obtain up to 4% and 14% relative reductions in overall WER and rare word PER, respectively, on a conversational, code-mixed Hindi-English ASR task.

Submitted 21 February, 2022; originally announced March 2022.

Comments: Proceedings for ICASSP 2022

arXiv:2201.11479 [pdf, other]

doi 10.21428/594757db.d2f8342b

Eye-focused Detection of Bell's Palsy in Videos

Authors: Sharik Ali Ansari, Koteswar Rao Jerripothula, Pragya Nagpal, Ankush Mittal

Abstract: In this paper, we present how Bell's Palsy, a neurological disorder, can be detected just from a subject's eyes in a video. We notice that Bell's Palsy patients often struggle to blink their eyes on the affected side. As a result, we can observe a clear contrast between the blinking patterns of the two eyes. Although previous works did utilize images/videos to detect this disorder, none have explicitly focused on the eyes. Most of them require the entire face. One obvious advantage of having an eye-focused detection system is that subjects' anonymity is not at risk. Also, our AI decisions based on simple blinking patterns make them explainable and straightforward. Specifically, we develop a novel feature called blink similarity, which measures the similarity between the two blinking patterns. Our extensive experiments demonstrate that the proposed feature is quite robust, for it helps in Bell's Palsy detection even with very few labels. Our proposed eye-focused detection system is not only cheaper but also more convenient than several existing methods.

Submitted 27 January, 2022; originally announced January 2022.

Comments: Published in the Proceedings of the 34th Canadian Conference on Artificial Intelligence. Please cite this paper in the following manner: S. A. Ansari, K. R. Jerripothula, P. Nagpal, and A. Mittal. "Eye-focused Detection of Bell's Palsy in Videos". In: Proceedings of the 34th Canadian Conference on Artificial Intelligence (June 8, 2021). doi: 10.21428/594757db.d2f8342b

arXiv:2201.11407 [pdf, other]

Non-linear Motion Estimation for Video Frame Interpolation using Space-time Convolutions

Authors: Saikat Dutta, Arulkumar Subramaniam, Anurag Mittal

Abstract: Video frame interpolation aims to synthesize one or multiple frames between two consecutive frames in a video. It has a wide range of applications including slow-motion video generation, frame-rate up-scaling and developing video codecs. Some older works tackled this problem by assuming per-pixel linear motion between video frames. However, objects often follow a non-linear motion pattern in the real domain and some recent methods attempt to model per-pixel motion by non-linear models (e.g., quadratic). A quadratic model can also be inaccurate, especially in the case of motion discontinuities over time (i.e., sudden jerks) and occlusions, where some of the flow information may be invalid or inaccurate. In our paper, we propose to approximate the per-pixel motion using a space-time convolution network that is able to adaptively select the motion model to be used. Specifically, we are able to softly switch between a linear and a quadratic model. Towards this end, we use an end-to-end 3D CNN encoder-decoder architecture over bidirectional optical flows and occlusion maps to estimate the non-linear motion model of each pixel. Further, a motion refinement module is employed to refine the non-linear motion and the interpolated frames are estimated by a simple warping of the neighboring frames with the estimated per-pixel motion. Through a set of comprehensive experiments, we validate the effectiveness of our model and show that our method outperforms state-of-the-art algorithms on four datasets (Vimeo, DAVIS, HD and GoPro).

Submitted 12 April, 2022; v1 submitted 27 January, 2022; originally announced January 2022.

Comments: Accepted at CLIC workshop, CVPR 2022. Code: https://github.com/saikatdutta/NME-VFI

arXiv:2112.14314 [pdf, other]

Improving Prediction of Cognitive Performance using Deep Neural Networks in Sparse Data

Authors: Sharath Koorathota, Arunesh Mittal, Richard P. Sloan, Paul Sajda

Abstract: Cognition in midlife is an important predictor of age-related mental decline, and statistical models that predict cognitive performance can be useful for predicting decline. However, existing models struggle to capture complex relationships between physical, sociodemographic, psychological and mental health factors that affect cognition. Using data from an observational cohort study, Midlife in the United States (MIDUS), we modeled a large number of variables to predict executive function and episodic memory measures. We used cross-sectional and longitudinal outcomes with varying sparsity, or amount of missing data. Deep neural network (DNN) models consistently ranked highest in all of the cognitive performance prediction tasks, as assessed with root mean squared error (RMSE) on out-of-sample data. RMSE differences between DNN and other model types were statistically significant (T(8) = -3.70; p < 0.05). The interaction effect between model type and sparsity was significant (F(9) = 59.20; p < 0.01), indicating the success of DNNs can partly be attributed to their robustness and ability to model hierarchical relationships between health-related factors. Our findings underscore the potential of neural networks to model clinical datasets and allow better understanding of factors that lead to cognitive decline.

Submitted 28 December, 2021; originally announced December 2021.

arXiv:2112.03984 [pdf, other]

Emotion-Cause Pair Extraction in Customer Reviews

Authors: Arpit Mittal, Jeel Tejaskumar Vaishnav, Aishwarya Kaliki, Nathan Johns, Wyatt Pease

Abstract: Emotion-Cause Pair Extraction (ECPE) is a complex yet popular area in Natural Language Processing due to its importance and potential applications in various domains. In this report, we aim to present our work in ECPE in the domain of online reviews. With a manually annotated dataset, we explore an algorithm to extract emotion-cause pairs using a neural network. In addition, we propose a model using previous reference materials and combining emotion-cause pair extraction with research in the domain of emotion-aware word embeddings, where we send these embeddings into a Bi-LSTM layer which gives us the emotionally relevant clauses. With the constraint of a limited dataset, we achieved . The overall scope of our report comprises a comprehensive literature review, implementation of referenced methods for dataset construction and initial model training, and modifying previous work in ECPE by proposing an improvement to the pipeline, as well as algorithm development and implementation for the specific domain of reviews.

Submitted 7 December, 2021; originally announced December 2021.

Comments: 7 Pages, 8 Figures

arXiv:2111.13972 [pdf, other]

Tapping BERT for Preposition Sense Disambiguation

Authors: Siddhesh Pawar, Shyam Thombre, Anirudh Mittal, Girishkumar Ponkiya, Pushpak Bhattacharyya

Abstract: Prepositions are frequently occurring polysemous words. Disambiguation of prepositions is crucial in tasks like semantic role labelling, question answering, text entailment, and noun compound paraphrasing. In this paper, we propose a novel methodology for preposition sense disambiguation (PSD), which does not use any linguistic tools. In a supervised setting, the machine learning model is presented with sentences wherein prepositions have been annotated with senses. These senses are IDs in what is called The Preposition Project (TPP). We use the hidden layer representations from pre-trained BERT and BERT variants. The latent representations are then classified into the correct sense ID using a Multi Layer Perceptron. The dataset used for this task is from SemEval-2007 Task-6. Our methodology gives an accuracy of 86.85% which is better than the state-of-the-art.

Submitted 27 November, 2021; originally announced November 2021.

ACM Class: I.2.7

arXiv:2111.07370 [pdf, other]

Co-segmentation Inspired Attention Module for Video-based Computer Vision Tasks

Authors: Arulkumar Subramaniam, Jayesh Vaidya, Muhammed Abdul Majeed Ameen, Athira Nambiar, Anurag Mittal

Abstract: Video-based computer vision tasks can benefit from estimation of the salient regions and interactions between those regions. Traditionally, this has been done by identifying the object regions in the images by utilizing pre-trained models to perform object detection, object segmentation and/or object pose estimation. Although using pre-trained models is a viable approach, it has several limitations in the need for an exhaustive annotation of object categories, a possible domain gap between datasets, and a bias that is typically present in pre-trained models. In this work, we propose to utilize the common rationale that a sequence of video frames captures a set of common objects and interactions between them, thus a notion of co-segmentation between the video frame features may equip the model with the ability to automatically focus on task-specific salient regions and improve the underlying task's performance in an end-to-end manner. In this regard, we propose a generic module called "Co-Segmentation inspired Attention Module" (COSAM) that can be plugged into any CNN model to promote the notion of co-segmentation based attention among a sequence of video frame features.
We show the application of COSAM in three video-based tasks, namely: 1) video-based person re-ID, 2) video captioning, and 3) video action classification, and demonstrate that COSAM is able to capture the task-specific salient regions in video frames, thus leading to notable performance improvements along with interpretable attention maps for a variety of video-based vision tasks, with possible application to other video-based vision tasks as well.

Submitted 1 August, 2022; v1 submitted 14 November, 2021; originally announced November 2021.

Comments: 26 pages, 14 figures, Preprint submitted to CVIU journal

arXiv:2111.06685 [pdf, other]

doi 10.1145/3437963.3441810

DeepXML: A Deep Extreme Multi-Label Learning Framework Applied to Short Text Documents

Authors: Kunal Dahiya, Deepak Saini, Anshul Mittal, Ankush Shaw, Kushal Dave, Akshay Soni, Himanshu Jain, Sumeet Agarwal, Manik Varma

Abstract: Scalability and accuracy are well recognized challenges in deep extreme multi-label learning where the objective is to train architectures for automatically annotating a data point with the most relevant subset of labels from an extremely large label set. This paper develops the DeepXML framework that addresses these challenges by decomposing the deep extreme multi-label task into four simpler sub-tasks each of which can be trained accurately and efficiently. Choosing different components for the four sub-tasks allows DeepXML to generate a family of algorithms with varying trade-offs between accuracy and scalability. In particular, DeepXML yields the Astec algorithm that could be 2-12% more accurate and 5-30x faster to train than leading deep extreme classifiers on publicly available short text datasets. Astec could also efficiently train on Bing short text datasets containing up to 62 million labels while making predictions for billions of users and data points per day on commodity hardware. This allowed Astec to be deployed on the Bing search engine for a number of short text applications ranging from matching user queries to advertiser bid phrases to showing personalized ads where it yielded significant gains in click-through-rates, coverage, revenue and other online metrics over state-of-the-art techniques currently in production. DeepXML's code is available at https://github.com/Extreme-classification/deepxml

Submitted 12 November, 2021; originally announced November 2021.

ACM Class: F.2.2; I.2.7

Journal ref: Web Search and Data Mining 2021

arXiv:2111.00490 [pdf, other]

DSC-IITISM at FinCausal 2021: Combining POS tagging with Attention-based Contextual Representations for Identifying Causal Relationships in Financial Documents

Authors: Gunjan Haldar, Aman Mittal, Pradyumna Gupta

Abstract: Causality detection draws plenty of attention in the field of Natural Language Processing and linguistics research. It has essential applications in information retrieval, event prediction, question answering, financial analysis, and market research. In this study, we explore several methods to identify and extract cause-effect pairs in financial documents using transformers. For this purpose, we propose an approach that combines POS tagging with the BIO scheme, which can be integrated with modern transformer models to address this challenge of identifying causality in a given text. Our best methodology achieves an F1-Score of 0.9551, and an Exact Match Score of 0.8777 on the blind test in the FinCausal-2021 Shared Task at the FinCausal 2021 Workshop.

Submitted 31 October, 2021; originally announced November 2021.

Comments: 5 pages, 5 tables

MSC Class: 68T50 (Primary); 91F20 (Secondary)

ACM Class: I.2.7

arXiv:2110.12765 [pdf, other]

"So You Think You're Funny?": Rating the Humour Quotient in Standup Comedy

Authors: Anirudh Mittal, Pranav Jeevan, Prerak Gandhi, Diptesh Kanojia, Pushpak Bhattacharyya

Abstract: Computational Humour (CH) has attracted the interest of Natural Language Processing and Computational Linguistics communities. Creating datasets for automatic measurement of humour quotient is difficult due to multiple possible interpretations of the content. In this work, we create a multi-modal humour-annotated dataset (~40 hours) using stand-up comedy clips. We devise a novel scoring mechanism to annotate the training data with a humour quotient score using the audience's laughter. The normalized duration (laughter duration divided by the clip duration) of laughter in each clip is used to compute this humour coefficient score on a five-point scale (0-4). This method of scoring is validated by comparing with manually annotated scores, wherein a quadratic weighted kappa of 0.6 is obtained. We use this dataset to train a model that provides a "funniness" score, on a five-point scale, given the audio and its corresponding text. We compare various neural language models for the task of humour-rating and achieve an accuracy of 0.813 in terms of Quadratic Weighted Kappa (QWK). Our "Open Mic" dataset is released for further research along with the code.

Submitted 25 October, 2021; originally announced October 2021.

Comments: Accepted at EMNLP 2021 Main Conference (short papers); 4 pages, 1 figure, 3 tables

arXiv:2109.06488 [pdf]

Multilevel profiling of situation and dialogue-based deep networks for movie genre classification using movie trailers

Authors: Dinesh Kumar Vishwakarma, Mayank Jindal, Ayush Mittal, Aditya Sharma

Abstract: Automated movie genre classification has emerged as an active and essential area of research and exploration. Short duration movie trailers provide useful insights about the movie as video content consists of the cognitive and the affective level features. Previous approaches were focused upon either cognitive or affective content analysis. In this paper, we propose a novel multi-modality: situation, dialogue, and metadata-based movie genre classification framework that takes both cognition and affect-based features into consideration. A pre-features fusion-based framework that takes into account: situation-based features from a regular snapshot of a trailer that includes nouns and verbs providing the useful affect-based mapping with the corresponding genres, dialogue (speech) based feature from audio, metadata which together provides the relevant information for cognitive and affect based video analysis. We also develop the English movie trailer dataset (EMTD), which contains 2000 Hollywood movie trailers belonging to five popular genres: Action, Romance, Comedy, Horror, and Science Fiction, and perform cross-validation on the standard LMTD-9 dataset for validating the proposed framework. The results demonstrate that the proposed methodology for movie genre classification has performed excellently as depicted by the F1 scores, precision, recall, and area under the precision-recall curves.

Submitted 14 September, 2021; originally announced September 2021.

Comments: 21 pages, 7 figures

arXiv:2108.12585 [pdf, other]

On the Significance of Question Encoder Sequence Model in the Out-of-Distribution Performance in Visual Question Answering

Authors: Gouthaman KV, Anurag Mittal

Abstract: Generalizing beyond the experiences has a significant role in developing practical AI systems. It has been shown that current Visual Question Answering (VQA) models are over-dependent on the language-priors (spurious correlations between question-types and their most frequent answers) from the train set and show poor performance on Out-of-Distribution (OOD) test sets. This conduct limits their generalizability and restricts them from being utilized in real-world situations. This paper shows that the sequence model architecture used in the question-encoder has a significant role in the generalizability of VQA models. To demonstrate this, we performed a detailed analysis of various existing RNN-based and Transformer-based question-encoders, and alongside this, we proposed a novel Graph attention network (GAT)-based question-encoder. Our study found that a better choice of sequence model in the question-encoder improves the generalizability of VQA models even without using any additional relatively complex bias-mitigation approaches.

Submitted 21 December, 2021; v1 submitted 28 August, 2021; originally announced August 2021.

arXiv:2108.00368 [pdf, other]

doi 10.1145/3437963.3441807

DECAF: Deep Extreme Classification with Label Features

Authors: Anshul Mittal, Kunal Dahiya, Sheshansh Agrawal, Deepak Saini, Sumeet Agarwal, Purushottam Kar, Manik Varma

Abstract: Extreme multi-label classification (XML) involves tagging a data point with its most relevant subset of labels from an extremely large label set, with several applications such as product-to-product recommendation with millions of products. Although leading XML algorithms scale to millions of labels, they largely ignore label meta-data such as textual descriptions of the labels. On the other hand, classical techniques that can utilize label metadata via representation learning using deep networks struggle in extreme settings. This paper develops the DECAF algorithm that addresses these challenges by learning models enriched by label metadata that jointly learn model parameters and feature representations using deep networks and offer accurate classification at the scale of millions of labels. DECAF makes specific contributions to model architecture design, initialization, and training, enabling it to offer up to 2-6% more accurate prediction than leading extreme classifiers on publicly available benchmark product-to-product recommendation datasets, such as LF-AmazonTitles-1.3M. At the same time, DECAF was found to be up to 22x faster at inference than leading deep extreme classifiers, which makes it suitable for real-time applications that require predictions within a few milliseconds. The code for DECAF is available at the following URL https://github.com/Extreme-classification/DECAF.

Submitted 1 August, 2021; originally announced August 2021.

ACM Class: F.2.2; I.2.7

Journal ref: Web Search and Data Mining 2021

arXiv:2108.00261 [pdf, other]

doi 10.1145/3442381.3449815

ECLARE: Extreme Classification with Label Graph Correlations

Authors: Anshul Mittal, Noveen Sachdeva, Sheshansh Agrawal, Sumeet Agarwal, Purushottam Kar, Manik Varma

Abstract: Deep extreme classification (XC) seeks to train deep architectures that can tag a data point with its most relevant subset of labels from an extremely large label set. The core utility of XC comes from predicting labels that are rarely seen during training. Such rare labels hold the key to personalized recommendations that can delight and surprise a user. However, the large number of rare labels and small amount of training data per rare label offer significant statistical and computational challenges. State-of-the-art deep XC methods attempt to remedy this by incorporating textual descriptions of labels but do not adequately address the problem. This paper presents ECLARE, a scalable deep learning architecture that incorporates not only label text, but also label correlations, to offer accurate real-time predictions within a few milliseconds. Core contributions of ECLARE include a frugal architecture and scalable techniques to train deep models along with label correlation graphs at the scale of millions of labels. In particular, ECLARE offers predictions that are 2 to 14% more accurate on both publicly available benchmark datasets as well as proprietary datasets for a related products recommendation task sourced from the Bing search engine. Code for ECLARE is available at https://github.com/Extreme-classification/ECLARE.

Submitted 31 July, 2021; originally announced August 2021.

ACM Class: F.2.2; I.2.7

Journal ref: The Web Conference 2021

arXiv:2107.07842 [pdf, other]

A Survey of Knowledge Graph Embedding and Their Applications

Authors: Shivani Choudhary, Tarun Luthra, Ashima Mittal, Rajat Singh

Abstract: Knowledge Graph embedding provides a versatile technique for representing knowledge. These techniques can be used in a variety of applications such as completion of knowledge graph to predict missing information, recommender systems, question answering, query expansion, etc. The information embedded in a knowledge graph, though structured, is challenging to consume in a real-world application. Knowledge graph embedding enables the real-world application to consume information to improve performance. Knowledge graph embedding is an active research area. Most of the embedding methods focus on structure-based information. Recent research has extended the boundary to include text-based information and image-based information in entity embedding. Efforts have been made to enhance the representation with context information. This paper introduces growth in the field of KG embedding from simple translation-based models to enrichment-based models. This paper includes the utility of the Knowledge graph in real-world applications.

Submitted 16 July, 2021; originally announced July 2021.

Comments: 11 pages, 9 figures

arXiv:2106.15238 [pdf, other]

doi 10.21437/Interspeech.2020-3208

Representation based meta-learning for few-shot spoken intent recognition

Authors: Ashish Mittal, Samarth Bharadwaj, Shreya Khare, Saneem Chemmengath, Karthik Sankaranarayanan, Brian Kingsbury

Abstract: Spoken intent detection has become a popular approach to interface with various smart devices with ease. However, such systems are limited to the preset list of intents-terms or commands, which restricts the quick customization of personal devices to new intents. This paper presents a few-shot spoken intent classification approach with task-agnostic representations via the meta-learning paradigm. Specifically, we leverage the popular representation-based meta-learning to build a task-agnostic representation of utterances, which then uses a linear classifier for prediction. We evaluate three such approaches on our novel experimental protocol developed on two popular spoken intent classification datasets: Google Commands and the Fluent Speech Commands dataset. For a 5-shot (1-shot) classification of novel classes, the proposed framework provides an average classification accuracy of 88.6% (76.3%) on the Google Commands dataset, and 78.5% (64.2%) on the Fluent Speech Commands dataset. The performance is comparable to traditionally supervised classification models with abundant training samples.

Submitted 29 June, 2021; originally announced June 2021.

Comments: Accepted paper at Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October, 2020

Showing 1–50 of 133 results for author: Mittal, A