-
Graph-Augmented LLMs for Personalized Health Insights: A Case Study in Sleep Analysis
Authors:
Ajan Subramanian,
Zhongqi Yang,
Iman Azimi,
Amir M. Rahmani
Abstract:
Health monitoring systems have revolutionized modern healthcare by enabling the continuous capture of physiological and behavioral data, essential for preventive measures and early health intervention. While integrating this data with Large Language Models (LLMs) has shown promise in delivering interactive health advice, traditional methods like Retrieval-Augmented Generation (RAG) and fine-tuning often fail to fully utilize the complex, multi-dimensional, and temporally relevant data from wearable devices. These conventional approaches typically provide limited actionable and personalized health insights due to their inadequate capacity to dynamically integrate and interpret diverse health data streams. In response, this paper introduces a graph-augmented LLM framework designed to significantly enhance the personalization and clarity of health insights. Utilizing a hierarchical graph structure, the framework captures inter and intra-patient relationships, enriching LLM prompts with dynamic feature importance scores derived from a Random Forest Model. The effectiveness of this approach is demonstrated through a sleep analysis case study involving 20 college students during the COVID-19 lockdown, highlighting the potential of our model to generate actionable and personalized health insights efficiently. We leverage another LLM to evaluate the insights for relevance, comprehensiveness, actionability, and personalization, addressing the critical need for models that process and interpret complex health data effectively. Our findings show that augmenting prompts with our framework yields significant improvements in all 4 criteria. Through our framework, we can elicit well-crafted, more thoughtful responses tailored to a specific patient.
Submitted 24 June, 2024; v1 submitted 23 June, 2024;
originally announced June 2024.
-
Soft Language Identification for Language-Agnostic Many-to-One End-to-End Speech Translation
Authors:
Peidong Wang,
Jian Xue,
Jinyu Li,
Junkun Chen,
Aswin Shanmugam Subramanian
Abstract:
Language-agnostic many-to-one end-to-end speech translation models can convert audio signals from different source languages into text in a target language. These models do not need source language identification, which improves user experience. In some cases, the input language can be given or estimated. Our goal is to use this additional language information while preserving the quality of the other languages. We accomplish this by introducing a simple and effective linear input network. The linear input network is initialized as an identity matrix, which ensures that the model can perform as well as, or better than, the original model. Experimental results show that the proposed method can successfully enhance the specified language, while keeping the language-agnostic ability of the many-to-one ST models.
Submitted 11 June, 2024;
originally announced June 2024.
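The identity initialization described in the abstract above can be illustrated with a minimal sketch. It assumes the linear input network is a single square weight matrix applied to each input feature frame; the helper names `identity_matrix` and `apply_linear` and the example frame are hypothetical and not taken from the paper.

```python
def identity_matrix(dim):
    """Build a dim x dim identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]


def apply_linear(weight, frame):
    """Apply a square linear map to one feature frame: y_j = sum_i x_i * W[i][j]."""
    dim = len(weight)
    return [sum(frame[i] * weight[i][j] for i in range(dim)) for j in range(dim)]


# Hypothetical 3-dimensional feature frame (assumption for illustration only).
frame = [0.5, -1.25, 3.0]

# With identity initialization, the layer passes features through unchanged,
# so the downstream model sees exactly the input it saw before the layer existed.
weight = identity_matrix(len(frame))
transformed = apply_linear(weight, frame)
```

Because the layer starts as a no-op, any training of `weight` begins from the behaviour of the original model, which is the property the abstract attributes to this initialization.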
-
M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering
Authors:
Anand Subramanian,
Viktor Schlegel,
Abhinav Ramesh Kashyap,
Thanh-Tung Nguyen,
Vijay Prakash Dwivedi,
Stefan Winkler
Abstract:
There is vivid research on adapting Large Language Models (LLMs) to perform a variety of tasks in high-stakes domains such as healthcare. Despite their popularity, there is a lack of understanding of the extent and contributing factors that allow LLMs to recall relevant knowledge and combine it with presented information in the clinical and biomedical domain: a fundamental pre-requisite for success on down-stream tasks. Addressing this gap, we use Multiple Choice and Abstractive Question Answering to conduct a large-scale empirical study on 22 datasets in three generalist and three specialist biomedical sub-domains. Our multifaceted analysis of the performance of 15 LLMs, further broken down by sub-domain, source of knowledge and model architecture, uncovers success factors such as instruction tuning that lead to improved recall and comprehension. We further show that while recently proposed domain-adapted models may lack adequate knowledge, directly fine-tuning on our collected medical knowledge datasets shows encouraging results, even generalising to unseen specialist sub-domains. We complement the quantitative results with a skill-oriented manual error analysis, which reveals a significant gap between the models' capabilities to simply recall necessary knowledge and to integrate it with the presented context. To foster research and collaboration in this field we share M-QALM, our resources, standardised methodology, and evaluation results, with the research community to facilitate further advancements in clinical knowledge representation learning within language models.
Submitted 5 June, 2024;
originally announced June 2024.
-
MERP: Metaverse Extended Reality Portal
Authors:
Anisha Ghosh,
Aditya Mitra,
Anik Saha,
Sibi Chakkaravarthy Sethuraman,
Anitha Subramanian
Abstract:
A standardized control system called Metaverse Extended Reality Portal (MERP) is presented as a solution to the issues with conventional VR eyewear. The MERP system improves user awareness of the physical world while offering an immersive 3D view of the metaverse by using a shoulder-mounted projector to display a Heads-Up Display (HUD) in a designated Metaverse Experience Room. To provide natural and secure interaction inside the metaverse, a compass module and gyroscope integration enable accurate mapping of real-world motions to avatar actions. Through user tests and research, the MERP system shows that it may reduce mishaps brought on by poor spatial awareness, offering an improved metaverse experience and laying the groundwork for future developments in virtual reality technology. MERP, which is compared with existing Virtual Reality (VR) glasses used to traverse the metaverse, is projected to become a seamless, novel and better alternative. Existing VR headsets and AR glasses have well-known drawbacks that make them ineffective for prolonged usage, as they cause harm to the eyes.
Submitted 8 February, 2024;
originally announced February 2024.
-
On Approximation Schemes for Stabbing Rectilinear Polygons
Authors:
Arindam Khan,
Aditya Subramanian,
Tobias Widmann,
Andreas Wiese
Abstract:
We study the problem of stabbing rectilinear polygons, where we are given $n$ rectilinear polygons in the plane that we want to stab, i.e., we want to select horizontal line segments such that for each given rectilinear polygon there is a line segment that intersects two opposite (parallel) edges of it. Our goal is to find a set of line segments of minimum total length such that all polygons are stabbed. For the special case of rectangles, there is a $O(1)$-approximation algorithm and the problem is $\mathsf{NP}$-hard [Chan et al.]. Also, the problem admits a QPTAS [Eisenbrand et al.] and even a PTAS [Khan et al.]. However, the approximability for the setting of more general polygons, e.g., L-shapes or T-shapes, is completely open.
In this paper, we characterize the conditions under which the problem admits a $(1+\varepsilon)$-approximation algorithm. We assume that each input polygon is composed of rectangles that are placed on top of each other such that, for each pair of adjacent edges between rectangles, one edge contains the other. We show that if all input polygons satisfy the hourglass condition, then the problem admits a QPTAS. In particular, it is thus unlikely that this case is $\mathsf{APX}$-hard. Furthermore, we show that there exists a PTAS if each input polygon is composed out of rectangles with a bounded range of widths. On the other hand, if the input polygons do not satisfy these conditions, we prove that the problem is $\mathsf{APX}$-hard, already if all input polygons have only eight edges. We remark that all polygons with fewer edges automatically satisfy the hourglass condition. On the other hand, for arbitrary rectilinear polygons we even show a lower bound of $\Omega(\log n)$ for the possible approximation ratio, which implies that the best possible ratio is in $\Theta(\log n)$ since the problem is a special case of Set Cover.
Submitted 4 February, 2024;
originally announced February 2024.
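The stabbing objective in the abstract above can be made concrete for the special case of axis-parallel rectangles. The following is a minimal sketch, not the paper's algorithm: it only checks whether a given set of horizontal segments is a feasible solution and computes its total length. The type and function names (`Rect`, `Segment`, `stabs`, `is_feasible`, `total_length`) are hypothetical, and treating boundary contact as an intersection is an assumption.

```python
from typing import NamedTuple, Sequence


class Rect(NamedTuple):
    x_left: float
    x_right: float
    y_bottom: float
    y_top: float


class Segment(NamedTuple):
    y: float        # height of the horizontal segment
    x_start: float
    x_end: float


def stabs(seg: Segment, rect: Rect) -> bool:
    """A horizontal segment stabs a rectangle if it crosses both vertical edges."""
    within_height = rect.y_bottom <= seg.y <= rect.y_top
    spans_width = seg.x_start <= rect.x_left and rect.x_right <= seg.x_end
    return within_height and spans_width


def is_feasible(segments: Sequence[Segment], rects: Sequence[Rect]) -> bool:
    """Every rectangle must be stabbed by at least one segment."""
    return all(any(stabs(s, r) for s in segments) for r in rects)


def total_length(segments: Sequence[Segment]) -> float:
    """Objective value: the sum of the segment lengths."""
    return sum(s.x_end - s.x_start for s in segments)


rects = [Rect(0, 2, 0, 3), Rect(1, 4, 2, 5)]
one_long_segment = [Segment(y=2.5, x_start=0, x_end=4)]
too_short = [Segment(y=2.5, x_start=0, x_end=3)]
```

Here a single segment at height 2.5 spanning x from 0 to 4 stabs both rectangles at a cost of 4, while the shorter segment fails to reach the right edge of the second rectangle.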
-
Reducing Uncertainty in Sea-level Rise Prediction: A Spatial-variability-aware Approach
Authors:
Subhankar Ghosh,
Shuai An,
Arun Sharma,
Jayant Gupta,
Shashi Shekhar,
Aneesh Subramanian
Abstract:
Given multi-model ensemble climate projections, the goal is to accurately and reliably predict future sea-level rise while lowering the uncertainty. This problem is important because sea-level rise affects millions of people in coastal communities and beyond due to climate change's impacts on polar ice sheets and the ocean. This problem is challenging due to spatial variability and unknowns such as possible tipping points (e.g., collapse of Greenland or West Antarctic ice-shelf), climate feedback loops (e.g., clouds, permafrost thawing), future policy decisions, and human actions. Most existing climate modeling approaches use the same set of weights globally, during either regression or deep learning to combine different climate projections. Such approaches are inadequate when different regions require different weighting schemes for accurate and reliable sea-level rise predictions. This paper proposes a zonal regression model which addresses spatial variability and model inter-dependency. Experimental results show more reliable predictions using the weights learned via this approach on a regional scale.
Submitted 18 October, 2023;
originally announced October 2023.
-
Spatial-frequency channels, shape bias, and adversarial robustness
Authors:
Ajay Subramanian,
Elena Sizikova,
Najib J. Majaj,
Denis G. Pelli
Abstract:
What spatial frequency information do humans and neural networks use to recognize objects? In neuroscience, critical band masking is an established tool that can reveal the frequency-selective filters used for object recognition. Critical band masking measures the sensitivity of recognition performance to noise added at each spatial frequency. Existing critical band masking studies show that humans recognize periodic patterns (gratings) and letters by means of a spatial-frequency filter (or "channel") that has a frequency bandwidth of one octave (doubling of frequency). Here, we introduce critical band masking as a task for network-human comparison and test 14 humans and 76 neural networks on 16-way ImageNet categorization in the presence of narrowband noise. We find that humans recognize objects in natural images using the same one-octave-wide channel that they use for letters and gratings, making it a canonical feature of human object recognition. Unlike humans, the neural network channel is very broad, 2-4 times wider than the human channel. Thus, noise at certain high and low frequencies will impair network performance and spare human performance. Adversarial and augmented-image training are commonly used to increase network robustness and shape bias. Does this training align network and human object recognition channels? Three network channel properties (bandwidth, center frequency, peak noise sensitivity) correlate strongly with shape bias (51% variance explained) and robustness of adversarially-trained networks (66% variance explained). Adversarial training increases robustness but expands the channel bandwidth even further beyond the human bandwidth. Thus, critical band masking reveals that the network channel is more than twice as wide as the human channel, and that adversarial training only makes it worse. Networks with narrower channels might be more robust.
Submitted 5 November, 2023; v1 submitted 22 September, 2023;
originally announced September 2023.
-
STint: Self-supervised Temporal Interpolation for Geospatial Data
Authors:
Nidhin Harilal,
Bri-Mathias Hodge,
Aneesh Subramanian,
Claire Monteleoni
Abstract:
Supervised and unsupervised techniques have demonstrated the potential for temporal interpolation of video data. Nevertheless, most prevailing temporal interpolation techniques hinge on optical flow, which encodes the motion of pixels between video frames. On the other hand, geospatial data exhibits lower temporal resolution while encompassing a spectrum of movements and deformations that challenge several assumptions inherent to optical flow. In this work, we propose an unsupervised temporal interpolation technique, which does not rely on ground truth data or require any motion information like optical flow, thus offering a promising alternative for better generalization across geospatial domains. Specifically, we introduce a self-supervised technique of dual cycle consistency. Our proposed technique incorporates multiple cycle consistency losses, which result from interpolating two frames between consecutive input frames through a series of stages. This dual cycle consistent constraint causes the model to produce intermediate frames in a self-supervised manner. To the best of our knowledge, this is the first attempt at unsupervised temporal interpolation without the explicit use of optical flow. Our experimental evaluations across diverse geospatial datasets show that STint significantly outperforms existing state-of-the-art methods for unsupervised temporal interpolation.
Submitted 31 August, 2023;
originally announced September 2023.
-
Njobvu-AI: An open-source tool for collaborative image labeling and implementation of computer vision models
Authors:
Jonathan S. Koning,
Ashwin Subramanian,
Mazen Alotaibi,
Cara L. Appel,
Christopher M. Sullivan,
Thon Chao,
Lisa Truong,
Robyn L. Tanguay,
Pankaj Jaiswal,
Taal Levi,
Damon B. Lesmeister
Abstract:
Practitioners interested in using computer vision models lack user-friendly and open-source software that combines features to label training data, allow multiple users, train new algorithms, review output, and implement new models. Labeling training data, such as images, is a key step to developing accurate object detection algorithms using computer vision. This step is often not compatible with many cloud-based services for marking or labeling image and video data due to limited internet bandwidth in many regions of the world. Desktop tools are useful for groups working in remote locations, but users often do not have the capability to combine projects developed locally by multiple collaborators. Furthermore, many tools offer features for labeling data or using pre-trained models for classification, but few allow researchers to combine these steps to create and apply custom models. Free, open-source, and user-friendly software that offers a full suite of features (e.g., ability to work locally and online, and train custom models) is desirable to field researchers and conservationists that may have limited coding skills. We developed Njobvu-AI, a free, open-source tool that can be run on both desktop and server hardware using Node.js, allowing users to label data, combine projects for collaboration and review, train custom algorithms, and implement new computer vision models. The name Njobvu-AI (pronounced N-joh-voo AI), incorporating the Chichewa word for elephant, is inspired by a wildlife monitoring program in Malawi that was a primary impetus for the development of this tool and references similarities between the powerful memory of elephants and properties of computer vision models.
Submitted 30 August, 2023;
originally announced August 2023.
-
Fair Rank Aggregation
Authors:
Diptarka Chakraborty,
Syamantak Das,
Arindam Khan,
Aditya Subramanian
Abstract:
Ranking algorithms find extensive usage in diverse areas such as web search, employment, college admission, voting, etc. The related rank aggregation problem deals with combining multiple rankings into a single aggregate ranking. However, algorithms for both these problems might be biased against some individuals or groups due to implicit prejudice or marginalization in the historical data. We study ranking and rank aggregation problems from a fairness or diversity perspective, where the candidates (to be ranked) may belong to different groups and each group should have a fair representation in the final ranking. We allow the designer to set the parameters that define fair representation. These parameters specify the allowed range of the number of candidates from a particular group in the top-$k$ positions of the ranking. Given any ranking, we provide a fast and exact algorithm for finding the closest fair ranking for the Kendall tau metric under block-fairness. We also provide an exact algorithm for finding the closest fair ranking for the Ulam metric under strict-fairness, when there are only $O(1)$ number of groups. Our algorithms are simple, fast, and might be extendable to other relevant metrics. We also give a novel meta-algorithm for the general rank aggregation problem under the fairness framework. Surprisingly, this meta-algorithm works for any generalized mean objective (including center and median problems) and any fairness criteria. As a byproduct, we obtain 3-approximation algorithms for both center and median problems, under both Kendall tau and Ulam metrics. Furthermore, using sophisticated techniques we obtain a $(3-\varepsilon)$-approximation algorithm, for a constant $\varepsilon>0$, for the Ulam metric under strong fairness.
Submitted 21 August, 2023;
originally announced August 2023.
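Two ingredients named in the abstract above, the Kendall tau metric and group bounds on the top-$k$ positions, can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the abstract does not define block-fairness or strict-fairness, so `top_k_is_fair` only checks the generic "allowed range per group in the top-$k$" condition, and all function names, candidate labels, and bounds are hypothetical.

```python
from collections import Counter
from itertools import combinations


def kendall_tau_distance(ranking_a, ranking_b):
    """Count candidate pairs that the two rankings order differently."""
    pos_a = {c: i for i, c in enumerate(ranking_a)}
    pos_b = {c: i for i, c in enumerate(ranking_b)}
    return sum(
        1
        for x, y in combinations(ranking_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def top_k_is_fair(ranking, group_of, k, bounds):
    """Check that each group's count in the top-k lies within its (low, high) range."""
    counts = Counter(group_of[c] for c in ranking[:k])
    return all(low <= counts[g] <= high for g, (low, high) in bounds.items())


# Hypothetical instance: four candidates in two groups, one slot per group in the top 2.
group_of = {"a": "g1", "b": "g1", "c": "g2", "d": "g2"}
bounds = {"g1": (1, 1), "g2": (1, 1)}

given = ["a", "b", "c", "d"]
candidate = ["a", "c", "b", "d"]

given_is_fair = top_k_is_fair(given, group_of, k=2, bounds=bounds)
candidate_is_fair = top_k_is_fair(candidate, group_of, k=2, bounds=bounds)
distance = kendall_tau_distance(given, candidate)
```

In this instance the given ranking places two members of `g1` in the top 2 and violates the bounds, while swapping `b` and `c` yields a ranking that satisfies them at Kendall tau distance 1.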
-
PULSAR at MEDIQA-Sum 2023: Large Language Models Augmented by Synthetic Dialogue Convert Patient Dialogues to Medical Records
Authors:
Viktor Schlegel,
Hao Li,
Yuping Wu,
Anand Subramanian,
Thanh-Tung Nguyen,
Abhinav Ramesh Kashyap,
Daniel Beck,
Xiaojun Zeng,
Riza Theresa Batista-Navarro,
Stefan Winkler,
Goran Nenadic
Abstract:
This paper describes PULSAR, our system submission at the ImageClef 2023 MediQA-Sum task on summarising patient-doctor dialogues into clinical records. The proposed framework relies on domain-specific pre-training, to produce a specialised language model which is trained on task-specific natural data augmented by synthetic data generated by a black-box LLM. We find limited evidence towards the efficacy of domain-specific pre-training and data augmentation, while scaling up the language model yields the best performance gains. Our approach was ranked second and third among 13 submissions on task B of the challenge. Our code is available at https://github.com/yu**-wu/PULSAR.
Submitted 4 July, 2023;
originally announced July 2023.
-
Human activity recognition using deep learning approaches and single frame CNN and convolutional LSTM
Authors:
Sheryl Mathew,
Annapoorani Subramanian,
Pooja,
Balamurugan MS,
Manoj Kumar Rajagopal
Abstract:
Human activity recognition is one of the most important tasks in computer vision and has proved useful in different fields such as healthcare, sports training and security. There are a number of approaches that have been explored to solve this task, some of them involving sensor data, and some involving video data. In this paper, we aim to explore two deep learning-based approaches, namely single frame Convolutional Neural Networks (CNNs) and convolutional Long Short-Term Memory to recognise human actions from videos. Using a convolutional neural networks-based method is advantageous as CNNs can extract features automatically and Long Short-Term Memory networks are great when it comes to working on sequence data such as video. The two models were trained and evaluated on a benchmark action recognition dataset, UCF50, and another dataset that was created for the experimentation. Though both models exhibit good accuracies, the single frame CNN model outperforms the Convolutional LSTM model by having an accuracy of 99.8% with the UCF50 dataset.
Submitted 17 April, 2023;
originally announced April 2023.
-
SimplyMime: A Control at Our Fingertips
Authors:
Sibi Chakkaravarthy Sethuraman,
Gaurav Reddy Tadkapally,
Athresh Kiran,
Saraju P. Mohanty,
Anitha Subramanian
Abstract:
The utilization of consumer electronics, such as televisions, set-top boxes, home theaters, and air conditioners, has become increasingly prevalent in modern society as technology continues to evolve. As new devices enter our homes each year, the accumulation of multiple infrared remote controls to operate them not only results in a waste of energy and resources, but also creates a cumbersome and cluttered environment for the user. This paper presents a novel system, named SimplyMime, which aims to eliminate the need for multiple remote controls for consumer electronics and provide the user with intuitive control without the need for additional devices. SimplyMime leverages a dynamic hand gesture recognition architecture, incorporating Artificial Intelligence and Human-Computer Interaction, to create a sophisticated system that enables users to interact with a vast majority of consumer electronics with ease. Additionally, SimplyMime has a security aspect where it can verify and authenticate the user utilising the palmprint, which ensures that only authorized users can control the devices. The performance of the proposed method for detecting and recognizing gestures in a stream of motion was thoroughly tested and validated using multiple benchmark datasets, resulting in commendable accuracy levels. One of the distinct advantages of the proposed method is its minimal computational power requirements, making it highly adaptable and reliable in a wide range of circumstances. The paper proposes incorporating this technology into all consumer electronic devices that currently require a secondary remote for operation, thus promoting a more efficient and sustainable living environment.
Submitted 22 April, 2023;
originally announced April 2023.
-
MagicEye: An Intelligent Wearable Towards Independent Living of Visually Impaired
Authors:
Sibi C. Sethuraman,
Gaurav R. Tadkapally,
Saraju P. Mohanty,
Gautam Galada,
Anitha Subramanian
Abstract:
Individuals with visual impairments often face a multitude of challenging obstacles in their daily lives. Vision impairment can severely impair a person's ability to work, navigate, and retain independence. This can result in educational limits, a higher risk of accidents, and a plethora of other issues. To address these challenges, we present MagicEye, a state-of-the-art intelligent wearable device designed to assist visually impaired individuals. MagicEye employs a custom-trained CNN-based object detection model, capable of recognizing a wide range of indoor and outdoor objects frequently encountered in daily life. With a total of 35 classes, the neural network employed by MagicEye has been specifically designed to achieve high levels of efficiency and precision in object detection. The device is also equipped with facial recognition and currency identification modules, providing invaluable assistance to the visually impaired. In addition, MagicEye features a GPS sensor for navigation, allowing users to move about with ease, as well as a proximity sensor for detecting nearby objects without physical contact. In summary, MagicEye is an innovative and highly advanced wearable device that has been designed to address the many challenges faced by individuals with visual impairments. It is equipped with state-of-the-art object detection and navigation capabilities that are tailored to the needs of the visually impaired, making it one of the most promising solutions to assist those who are struggling with visual impairments.
Submitted 24 March, 2023;
originally announced March 2023.
-
Online and Dynamic Algorithms for Geometric Set Cover and Hitting Set
Authors:
Arindam Khan,
Aditya Lonkar,
Saladi Rahul,
Aditya Subramanian,
Andreas Wiese
Abstract:
Set cover and hitting set are fundamental problems in combinatorial optimization which are well-studied in the offline, online, and dynamic settings. We study the geometric versions of these problems and present new online and dynamic algorithms for them. In the online version of set cover (resp. hitting set), $m$ sets (resp.~$n$ points) are given and $n$ points (resp.~$m$ sets) arrive online, one-by-one. In the dynamic versions, points (resp. sets) can arrive as well as depart. Our goal is to maintain a set cover (resp. hitting set), minimizing the size of the computed solution.
For online set cover for (axis-parallel) squares of arbitrary sizes, we present a tight $O(\log n)$-competitive algorithm. In the same setting for hitting set, we provide a tight $O(\log N)$-competitive algorithm, assuming that all points have integral coordinates in $[0,N)^{2}$. No online algorithm had been known for either of these settings, not even for unit squares (apart from the known online algorithms for arbitrary set systems).
For both dynamic set cover and hitting set with $d$-dimensional hyperrectangles, we obtain $(\log m)^{O(d)}$-approximation algorithms with $(\log m)^{O(d)}$ worst-case update time. This partially answers an open question posed by Chan et al. [SODA'22]. Previously, no dynamic algorithms with polylogarithmic update time were known even in the setting of squares (for either of these problems). Our main technical contributions are an \emph{extended quad-tree} approach and a \emph{frequency reduction} technique that reduces geometric set cover instances to instances of general set cover with bounded frequency.
Submitted 16 March, 2023;
originally announced March 2023.
-
Automated patent extraction powers generative modeling in focused chemical spaces
Authors:
Akshay Subramanian,
Kevin P. Greenman,
Alexis Gervaix,
Tzuhsiung Yang,
Rafael Gómez-Bombarelli
Abstract:
Deep generative models have emerged as an exciting avenue for inverse molecular design, with progress coming from the interplay between training algorithms and molecular representations. One of the key challenges in their applicability to materials science and chemistry has been the lack of access to sizeable training datasets with property labels. Published patents contain the first disclosure of new materials prior to their publication in journals, and are a vast source of scientific knowledge that has remained relatively untapped in the field of data-driven molecular design. Because patents are filed seeking to protect specific uses, molecules in patents can be considered to be weakly labeled into application classes. Furthermore, patents published by the US Patent and Trademark Office (USPTO) are downloadable and have machine-readable text and molecular structures. In this work, we train domain-specific generative models using patent data sources by developing an automated pipeline to go from USPTO patent digital files to the generation of novel candidates with minimal human intervention. We test the approach on two in-class extracted datasets, one in organic electronics and another in tyrosine kinase inhibitors. We then evaluate the ability of generative models trained on these in-class datasets on two categories of tasks (distribution learning and property optimization), identify strengths and limitations, and suggest possible explanations and remedies that could be used to overcome these in practice.
Submitted 24 July, 2023; v1 submitted 14 March, 2023;
originally announced March 2023.
-
Quantifying Causes of Arctic Amplification via Deep Learning based Time-series Causal Inference
Authors:
Sahara Ali,
Omar Faruque,
Yiyi Huang,
Md. Osman Gani,
Aneesh Subramanian,
Nicole-Jienne Shchlegel,
Jianwu Wang
Abstract:
The warming of the Arctic, also known as Arctic amplification, is led by several atmospheric and oceanic drivers. However, the details of its underlying thermodynamic causes are still unknown. Inferring the causal effects of atmospheric processes on sea ice melt using fixed treatment effect strategies leads to unrealistic counterfactual estimations. Such models are also prone to bias due to time-varying confoundedness. Further, the complex non-linearity in Earth science data makes it infeasible to perform causal inference using existing marginal structural techniques. In order to tackle these challenges, we propose TCINet - time-series causal inference model to infer causation under continuous treatment using recurrent neural networks and a novel probabilistic balancing technique. Through experiments on synthetic and observational data, we show how our research can substantially improve the ability to quantify leading causes of Arctic sea ice melt, further paving paths for causal inference in observational Earth science.
Submitted 25 September, 2023; v1 submitted 22 February, 2023;
originally announced March 2023.
-
TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings
Authors:
Christoph Boeddeker,
Aswin Shanmugam Subramanian,
Gordon Wichern,
Reinhold Haeb-Umbach,
Jonathan Le Roux
Abstract:
Since diarization and source separation of meeting data are closely related tasks, we here propose an approach to perform the two objectives jointly. It builds upon the target-speaker voice activity detection (TS-VAD) diarization approach, which assumes that initial speaker embeddings are available. We replace the final combined speaker activity estimation network of TS-VAD with a network that produces speaker activity estimates at a time-frequency resolution. Those act as masks for source extraction, either via masking or via beamforming. The technique can be applied both for single-channel and multi-channel input and, in both cases, achieves a new state-of-the-art word error rate (WER) on the LibriCSS meeting data recognition task. We further compute speaker-aware and speaker-agnostic WERs to isolate the contribution of diarization errors to the overall WER performance.
Submitted 1 January, 2024; v1 submitted 7 March, 2023;
originally announced March 2023.
-
CueCAn: Cue Driven Contextual Attention For Identifying Missing Traffic Signs on Unconstrained Roads
Authors:
Varun Gupta,
Anbumani Subramanian,
C. V. Jawahar,
Rohit Saluja
Abstract:
Unconstrained Asian roads often involve poor infrastructure, affecting overall road safety. Missing traffic signs are a regular part of such roads. Missing or non-existing object detection has been studied for locating missing curbs and estimating reasonable regions for pedestrians on road scene images. Such methods involve analyzing task-specific single object cues. In this paper, we present the first and most challenging video dataset for missing objects, with multiple types of traffic signs for which the cues are visible without the signs in the scenes. We refer to it as the Missing Traffic Signs Video Dataset (MTSVD). MTSVD is challenging compared to previous works in two respects: (i) the traffic signs are generally not present in the vicinity of their cues, and (ii) the traffic sign cues are diverse and unique. Also, MTSVD is the first publicly available missing object dataset. To train the models for identifying missing signs, we complement our dataset with 10K traffic sign tracks, with 40 percent of the traffic signs having cues visible in the scenes. For identifying missing signs, we propose the Cue-driven Contextual Attention units (CueCAn), which we incorporate in our model encoder. We first train the encoder to classify the presence of traffic sign cues and then train the entire segmentation model end-to-end to localize missing traffic signs. Quantitative and qualitative analysis shows that CueCAn significantly improves the performance of base models.
Submitted 5 March, 2023;
originally announced March 2023.
-
MetaSecure: A Passwordless Authentication for the Metaverse
Authors:
Sibi Chakkaravarthy Sethuraman,
Aditya Mitra,
Anisha Ghosh,
Gautam Galada,
Anitha Subramanian
Abstract:
Metaverse in general holds a potential future for cyberspace. At the beginning of Web 2.0, it was witnessed that people were signing in with various pseudonyms or 'nyms', risking their online identities through the increasing presence of fake accounts, leading to difficulty in unique identification for different roles. However, in Web 3.0, the metaverse, a user's identity is tied to their original identity, where risking one poses a significant risk to the other. Therefore, this paper proposes a novel authentication system for securing digital assets, online identity, avatars, and accounts called Metasecure where a unique id for every entity or user to develop a human establishment is essential on a digital platform. The proposed passwordless system provides three layers of security using device attestation, facial recognition and use of physical security keys, security keys, or smartcards in accordance with Fast IDentity Online (FIDO2) specifications. It provides SDKs for authentication on any system including VR/XR glasses, thus ensuring seamlessness in accessing services in the Metaverse.
Submitted 4 January, 2023;
originally announced January 2023.
-
Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks
Authors:
Darius Petermann,
Gordon Wichern,
Aswin Shanmugam Subramanian,
Zhong-Qiu Wang,
Jonathan Le Roux
Abstract:
Emulating the human ability to solve the cocktail party problem, i.e., focus on a source of interest in a complex acoustic scene, is a long standing goal of audio source separation research. Much of this research investigates separating speech from noise, speech from speech, musical instruments from each other, or sound events from each other. In this paper, we focus on the cocktail fork problem, which takes a three-pronged approach to source separation by separating an audio mixture such as a movie soundtrack or podcast into the three broad categories of speech, music, and sound effects (SFX - understood to include ambient noise and natural sound events). We benchmark the performance of several deep learning-based source separation models on this task and evaluate them with respect to simple objective measures such as signal-to-distortion ratio (SDR) as well as objective metrics that better correlate with human perception. Furthermore, we thoroughly evaluate how source separation can influence downstream transcription tasks. First, we investigate the task of activity detection on the three sources as a way to both further improve source separation and perform transcription. We formulate the transcription tasks as speech recognition for speech and audio tagging for music and SFX. We observe that, while the use of source separation estimates improves transcription performance in comparison to the original soundtrack, performance is still sub-optimal due to artifacts introduced by the separation process. Therefore, we thoroughly investigate how remixing of the three separated source stems at various relative levels can reduce artifacts and consequently improve the transcription performance. We find that remixing music and SFX interferences at a target SNR of 17.5 dB reduces speech recognition word error rate, and similar impact from remixing is observed for tagging music and SFX content.
Submitted 14 December, 2022;
originally announced December 2022.
-
Hyperbolic Audio Source Separation
Authors:
Darius Petermann,
Gordon Wichern,
Aswin Subramanian,
Jonathan Le Roux
Abstract:
We introduce a framework for audio source separation using embeddings on a hyperbolic manifold that compactly represent the hierarchical relationship between sound sources and time-frequency features. Inspired by recent successes modeling hierarchical relationships in text and images with hyperbolic embeddings, our algorithm obtains a hyperbolic embedding for each time-frequency bin of a mixture signal and estimates masks using hyperbolic softmax layers. On a synthetic dataset containing mixtures of multiple people talking and musical instruments playing, our hyperbolic model performed comparably to a Euclidean baseline in terms of source to distortion ratio, with stronger performance at low embedding dimensions. Furthermore, we find that time-frequency regions containing multiple overlapping sources are embedded towards the center (i.e., the most uncertain region) of the hyperbolic space, and we can use this certainty estimate to efficiently trade-off between artifact introduction and interference reduction when isolating individual sounds.
Submitted 9 December, 2022;
originally announced December 2022.
-
Energetic electron precipitation driven by electromagnetic ion cyclotron waves from ELFIN's low altitude perspective
Authors:
V. Angelopoulos,
X. -J. Zhang,
A. V. Artemyev,
D. Mourenas,
E. Tsai,
C. Wilkins,
A. Runov,
J. Liu,
D. L. Turner,
W. Li,
K. Khurana,
R. E. Wirz,
V. A. Sergeev,
X. Meng,
J. Wu,
M. D. Hartinger,
T. Raita,
Y. Shen,
X. An,
X. Shi,
M. F. Bashir,
X. Shen,
L. Gan,
M. Qin,
L. Capannolo
, et al. (61 additional authors not shown)
Abstract:
We review comprehensive observations of electromagnetic ion cyclotron (EMIC) wave-driven energetic electron precipitation using data from the energetic electron detector on the Electron Losses and Fields InvestigatioN (ELFIN) mission, two polar-orbiting low-altitude spinning CubeSats, measuring 50-5000 keV electrons with good pitch-angle and energy resolution. EMIC wave-driven precipitation exhibits a distinct signature in energy-spectrograms of the precipitating-to-trapped flux ratio: peaks at 0.5 MeV which are abrupt (bursty) with significant substructure (occasionally down to sub-second timescale). Multiple ELFIN passes over the same MLT sector allow us to study the spatial and temporal evolution of the EMIC wave - electron interaction region. Using two years of ELFIN data, we assemble a statistical database of 50 events of strong EMIC wave-driven precipitation. Most reside at L=5-7 at dusk, while a smaller subset exists at L=8-12 at post-midnight. The energies of the peak-precipitation ratio and of the half-peak precipitation ratio (our proxy for the minimum resonance energy) exhibit an L-shell dependence in good agreement with theoretical estimates based on prior statistical observations of EMIC wave power spectra. The precipitation ratio's spectral shape for the most intense events has an exponential falloff away from the peak (i.e., on either side of 1.45 MeV). It too agrees well with quasi-linear diffusion theory based on prior statistics of wave spectra. Sub-MeV electron precipitation observed concurrently with strong EMIC wave-driven 1MeV precipitation has a spectral shape that is consistent with efficient pitch-angle scattering down to 200-300 keV by much less intense higher frequency EMIC waves. These results confirm the critical role of EMIC waves in driving relativistic electron losses. Nonlinear effects may abound and require further investigation.
Submitted 28 November, 2022;
originally announced November 2022.
-
Reverberation as Supervision for Speech Separation
Authors:
Rohith Aralikatti,
Christoph Boeddeker,
Gordon Wichern,
Aswin Shanmugam Subramanian,
Jonathan Le Roux
Abstract:
This paper proposes reverberation as supervision (RAS), a novel unsupervised loss function for single-channel reverberant speech separation. Prior methods for unsupervised separation required the synthesis of mixtures of mixtures or assumed the existence of a teacher model, making them difficult to consider as potential methods explaining the emergence of separation abilities in an animal's auditory system. We assume the availability of two-channel mixtures at training time, and train a neural network to separate the sources given one of the channels as input such that the other channel may be predicted from the separated sources. As the relationship between the room impulse responses (RIRs) of each channel depends on the locations of the sources, which are unknown to the network, the network cannot rely on learning that relationship. Instead, our proposed loss function fits each of the separated sources to the mixture in the target channel via Wiener filtering, and compares the resulting mixture to the ground-truth one. We show that minimizing the scale-invariant signal-to-distortion ratio (SI-SDR) of the predicted right-channel mixture with respect to the ground truth implicitly guides the network towards separating the left-channel sources. On a semi-supervised reverberant speech separation task based on the WHAMR! dataset, using training data where just 5% (resp., 10%) of the mixtures are labeled with associated isolated sources, we achieve 70% (resp., 78%) of the SI-SDR improvement obtained when training with supervision on the full training set, while a model trained only on the labeled data obtains 43% (resp., 45%).
Submitted 15 November, 2022;
originally announced November 2022.
-
Late Audio-Visual Fusion for In-The-Wild Speaker Diarization
Authors:
Zexu Pan,
Gordon Wichern,
François G. Germain,
Aswin Subramanian,
Jonathan Le Roux
Abstract:
Speaker diarization is well studied for constrained audios but little explored for challenging in-the-wild videos, which have more speakers, shorter utterances, and inconsistent on-screen speakers. We address this gap by proposing an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion. For audio, we show that an attractor-based end-to-end system (EEND-EDA) performs remarkably well when trained with our proposed recipe of a simulated proxy dataset, and propose an improved version, EEND-EDA++, that uses attention in decoding and a speaker recognition loss during training to better handle the larger number of speakers. The visual-centric sub-system leverages facial attributes and lip-audio synchrony for identity and speech activity estimation of on-screen speakers. Both sub-systems surpass the state of the art (SOTA) by a large margin, with the fused audio-visual system achieving a new SOTA on the AVA-AVD benchmark.
Submitted 27 September, 2023; v1 submitted 2 November, 2022;
originally announced November 2022.
-
IDD-3D: Indian Driving Dataset for 3D Unstructured Road Scenes
Authors:
Shubham Dokania,
A. H. Abdul Hafez,
Anbumani Subramanian,
Manmohan Chandraker,
C. V. Jawahar
Abstract:
Autonomous driving and assistance systems rely on annotated data from traffic and road scenarios to model and learn the various object relations in complex real-world scenarios. Preparation and training of deploy-able deep learning architectures require the models to be suited to different traffic scenarios and adapt to different situations. Currently, existing datasets, while large-scale, lack such diversities and are geographically biased towards mainly developed cities. An unstructured and complex driving layout found in several developing countries such as India poses a challenge to these models due to the sheer degree of variations in the object types, densities, and locations. To facilitate better research toward accommodating such scenarios, we build a new dataset, IDD-3D, which consists of multi-modal data from multiple cameras and LiDAR sensors with 12k annotated driving LiDAR frames across various traffic scenarios. We discuss the need for this dataset through statistical comparisons with existing datasets and highlight benchmarks on standard 3D object detection and tracking tasks in complex layouts. Code and data available at https://github.com/shubham1810/idd3d_kit.git
Submitted 23 October, 2022;
originally announced October 2022.
-
TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments
Authors:
Shubham Dokania,
Anbumani Subramanian,
Manmohan Chandraker,
C. V. Jawahar
Abstract:
High-quality structured data with rich annotations are critical components in intelligent vehicle systems dealing with road scenes. However, data curation and annotation require intensive investments and yield low-diversity scenarios. The recently growing interest in synthetic data raises questions about the scope of improvement in such systems and the amount of manual work still required to produce high volumes and variations of simulated data. This work proposes a synthetic data generation pipeline that utilizes existing datasets, like nuScenes, to address the difficulties and domain-gaps present in simulated datasets. We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation, mimicking real scene properties with high-fidelity, along with mechanisms to diversify samples in a physically meaningful way. We demonstrate improvements in mIoU metrics by presenting qualitative and quantitative experiments with real and synthetic data for semantic segmentation on the Cityscapes and KITTI-STEP datasets. All relevant code and data is released on github (https://github.com/shubham1810/trove_toolkit).
Submitted 16 August, 2022;
originally announced August 2022.
-
ABB-BERT: A BERT model for disambiguating abbreviations and contractions
Authors:
Prateek Kacker,
Andi Cupallari,
Aswin Gridhar Subramanian,
Nimit Jain
Abstract:
Abbreviations and contractions are commonly found in text across different domains. For example, doctors' notes contain many contractions that can be personalized based on their choices. Existing spelling correction models are not suitable to handle expansions because of many reductions of characters in words. In this work, we propose ABB-BERT, a BERT-based model, which deals with an ambiguous language containing abbreviations and contractions. ABB-BERT can rank them from thousands of options and is designed for scale. It is trained on Wikipedia text, and the algorithm allows it to be fine-tuned with little compute to get better performance for a domain or person. We are publicly releasing the training dataset for abbreviations and contractions derived from Wikipedia.
Submitted 8 July, 2022;
originally announced July 2022.
-
How is model-related uncertainty quantified and reported in different disciplines?
Authors:
Emily G. Simmonds,
Kwaku Peprah Adjei,
Christoffer Wold Andersen,
Janne Cathrin Hetle Aspheim,
Claudia Battistin,
Nicola Bulso,
Hannah Christensen,
Benjamin Cretois,
Ryan Cubero,
Ivan A. Davidovich,
Lisa Dickel,
Benjamin Dunn,
Etienne Dunn-Sigouin,
Karin Dyrstad,
Sigurd Einum,
Donata Giglio,
Haakon Gjerlow,
Amelie Godefroidt,
Ricardo Gonzalez-Gil,
Soledad Gonzalo Cogno,
Fabian Grosse,
Paul Halloran,
Mari F. Jensen,
John James Kennedy,
Peter Egge Langsaether
, et al. (18 additional authors not shown)
Abstract:
How do we know how much we know? Quantifying uncertainty associated with our modelling work is the only way we can answer how much we know about any phenomenon. With quantitative science now highly influential in the public sphere and the results from models translating into action, we must support our conclusions with sufficient rigour to produce useful, reproducible results. Incomplete consideration of model-based uncertainties can lead to false conclusions with real world impacts. Despite these potentially damaging consequences, uncertainty consideration is incomplete both within and across scientific fields. We take a unique interdisciplinary approach and conduct a systematic audit of model-related uncertainty quantification from seven scientific fields, spanning the biological, physical, and social sciences. Our results show no single field is achieving complete consideration of model uncertainties, but together we can fill the gaps. We propose opportunities to improve the quantification of uncertainty through use of a source framework for uncertainty consideration, model type specific guidelines, improved presentation, and shared best practice. We also identify shared outstanding challenges (uncertainty in input data, balancing trade-offs, error propagation, and defining how much uncertainty is required). Finally, we make nine concrete recommendations for current practice (following good practice guidelines and an uncertainty checklist, presenting uncertainty numerically, and propagating model-related uncertainty into conclusions), future research priorities (uncertainty in input data, quantifying uncertainty in complex models, and the importance of missing uncertainty in different contexts), and general research standards across the sciences (transparency about study limitations and dedicated uncertainty sections of manuscripts).
Submitted 1 July, 2022; v1 submitted 24 June, 2022;
originally announced June 2022.
-
SATBench: Benchmarking the speed-accuracy tradeoff in object recognition by humans and dynamic neural networks
Authors:
Ajay Subramanian,
Sara Price,
Omkar Kumbhar,
Elena Sizikova,
Najib J. Majaj,
Denis G. Pelli
Abstract:
The core of everyday tasks like reading and driving is active object recognition. Attempts to model such tasks are currently stymied by the inability to incorporate time. People show a flexible tradeoff between speed and accuracy and this tradeoff is a crucial human skill. Deep neural networks have emerged as promising candidates for predicting peak human object recognition performance and neural activity. However, modeling the temporal dimension, i.e., the speed-accuracy tradeoff (SAT), is essential for them to serve as useful computational models for how humans recognize objects. To this end, we here present the first large-scale (148 observers, 4 neural networks, 8 tasks) dataset of the speed-accuracy tradeoff (SAT) in recognizing ImageNet images. In each human trial, a beep, indicating the desired reaction time, sounds at a fixed delay after the image is presented, and the observer's response counts only if it occurs near the time of the beep. In a series of blocks, we test many beep latencies, i.e., reaction times. We observe that human accuracy increases with reaction time and proceed to compare its characteristics with the behavior of several dynamic neural networks that are capable of inference-time adaptive computation. Using FLOPs as an analog for reaction time, we compare networks with humans on curve-fit error, category-wise correlation, and curve steepness, and conclude that cascaded dynamic neural networks are a promising model of human reaction time in object recognition tasks.
Submitted 16 June, 2022;
originally announced June 2022.
-
Detecting, Tracking and Counting Motorcycle Rider Traffic Violations on Unconstrained Roads
Authors:
Aman Goyal,
Dev Agarwal,
Anbumani Subramanian,
C. V. Jawahar,
Ravi Kiran Sarvadevabhatla,
Rohit Saluja
Abstract:
In many Asian countries with unconstrained road traffic conditions, driving violations such as not wearing helmets and triple-riding are a significant source of fatalities involving motorcycles. Identifying and penalizing such riders is vital in curbing road accidents and improving citizens' safety. With this motivation, we propose an approach for detecting, tracking, and counting motorcycle riding violations in videos taken from a vehicle-mounted dashboard camera. We employ a curriculum learning-based object detector to better tackle challenging scenarios such as occlusions. We introduce a novel trapezium-shaped object boundary representation to increase robustness and tackle the rider-motorcycle association. We also introduce an amodal regressor that generates bounding boxes for the occluded riders. Experimental results on a large-scale unconstrained driving dataset demonstrate the superiority of our approach compared to existing approaches and other ablative variants.
Submitted 18 April, 2022;
originally announced April 2022.
-
Heterogeneous Target Speech Separation
Authors:
Efthymios Tzinis,
Gordon Wichern,
Aswin Subramanian,
Paris Smaragdis,
Jonathan Le Roux
Abstract:
We introduce a new paradigm for single-channel target source separation where the sources of interest can be distinguished using non-mutually exclusive concepts (e.g., loudness, gender, language, spatial location, etc). Our proposed heterogeneous separation framework can seamlessly leverage datasets with large distribution shifts and learn cross-domain representations under a variety of concepts used as conditioning. Our experiments show that training separation models with heterogeneous conditions facilitates the generalization to new concepts with unseen out-of-domain data while also performing substantially higher than single-domain specialist models. Notably, such training leads to more robust learning of new harder source separation discriminative concepts and can yield improvements over permutation invariant training with oracle source selection. We analyze the intrinsic behavior of source separation training with heterogeneous metadata and propose ways to alleviate emerging problems with challenging separation conditions. We release the collection of preparation recipes for all datasets used to further promote research towards this challenging task.
Submitted 7 April, 2022;
originally announced April 2022.
-
Magnetization Reversal Across Multiple Serial Barriers in a Single Fe$_3$O$_4$ Nanoparticle
Authors:
Sagar Paul,
Ganesh Kotagiri,
Rini Ganguly,
Annapoorni Subramanian,
Hervé Courtois,
Clemens B. Winkelmann,
Anjan K. Gupta
Abstract:
Depinning of nanoscale magnetic textures, such as domain walls, vortices and skyrmions, is of paramount importance for magnetic storage and information processing. We measure time-resolved magnetic switching statistics of an individual, non-single-domain Fe$_3$O$_4$ nanoparticle using a micrometer-scale superconducting quantum interference device. Surprisingly, a strong narrowing of the waiting-time distributions before reaching the final state is observed as compared to the exponential distribution expected for a single barrier. The magnetization reversal across the nanostructure is thus shown to result from multiple serial barriers in the minimum energy pathway.
Submitted 22 January, 2022;
originally announced January 2022.
-
Automatic Quantification and Visualization of Street Trees
Authors:
Arpit Bahety,
Rohit Saluja,
Ravi Kiran Sarvadevabhatla,
Anbumani Subramanian,
C. V. Jawahar
Abstract:
Assessing the number of street trees is essential for evaluating urban greenery and can help municipalities employ solutions to identify tree-starved streets. It can also help identify roads with different levels of deforestation and afforestation over time. Yet, there has been little work in the area of street trees quantification. This work first explains a data collection setup carefully designed for counting roadside trees. We then describe a unique annotation procedure aimed at robustly detecting and quantifying trees. We work on a dataset of around 1300 Indian road scenes annotated with over 2500 street trees. We additionally use the five held-out videos covering 25 km of roads for counting trees. We finally propose a street tree detection, counting, and visualization framework using current object detectors and a novel yet simple counting algorithm owing to the thoughtful collection setup. We find that the high-level visualizations based on the density of trees on the routes and Kernel Density Ranking (KDR) provide a quick, accurate, and inexpensive way to recognize tree-starved streets. We obtain a tree detection mAP of 83.74% on the test images, which is a 2.73% improvement over our baseline. We propose Tree Count Density Classification Accuracy (TCDCA) as an evaluation metric to measure tree density. We obtain TCDCA of 96.77% on the test videos, with a remarkable improvement of 22.58% over baseline, and demonstrate that our counting module's performance is close to human level. Source code: https://github.com/iHubData-Mobility/public-tree-counting.
Submitted 17 January, 2022;
originally announced January 2022.
-
Dataset of gold nanoparticle sizes and morphologies extracted from literature-mined microscopy images
Authors:
Akshay Subramanian,
Kevin Cruse,
Amalie Trewartha,
Xingzhi Wang,
A. Paul Alivisatos,
Gerbrand Ceder
Abstract:
The factors controlling the size and morphology of nanoparticles have so far been poorly understood. Data-driven techniques are an exciting avenue to explore this field through the identification of trends and correlations in data. However, for these techniques to be utilized, large datasets annotated with the structural attributes of nanoparticles are required. While experimental SEM/TEM images collected from controlled experiments are reliable sources of this information, large-scale collection of these images across a variety of experimental conditions is expensive and infeasible. Published scientific literature, which provides a vast source of high-quality figures including SEM/TEM images, can provide a large amount of data at a lower cost if effectively mined. In this work, we develop an automated pipeline to retrieve and analyse microscopy images from gold nanoparticle literature and provide a dataset of 4361 SEM/TEM images of gold nanoparticles along with automatically extracted size and morphology information. The dataset can be queried to obtain information about the physical attributes of gold nanoparticles and their statistical distributions.
Submitted 6 January, 2022; v1 submitted 2 December, 2021;
originally announced December 2021.
-
Attention Guided Cosine Margin For Overcoming Class-Imbalance in Few-Shot Road Object Detection
Authors:
Ashutosh Agarwal,
Anay Majee,
Anbumani Subramanian,
Chetan Arora
Abstract:
Few-shot object detection (FSOD) localizes and classifies objects in an image given only a few data samples. Recent trends in FSOD research show the adoption of metric and meta-learning techniques, which are prone to catastrophic forgetting and class confusion. To overcome these pitfalls in metric learning based FSOD techniques, we introduce Attention Guided Cosine Margin (AGCM) that facilitates the creation of tighter and well separated class-specific feature clusters in the classification head of the object detector. Our novel Attentive Proposal Fusion (APF) module minimizes catastrophic forgetting by reducing the intra-class variance among co-occurring classes. At the same time, the proposed Cosine Margin Cross-Entropy loss increases the angular margin between confusing classes to overcome the challenge of class confusion between already learned (base) and newly added (novel) classes. We conduct our experiments on the challenging India Driving Dataset (IDD), which presents a real-world class-imbalanced setting alongside the popular FSOD benchmark PASCAL-VOC. Our method outperforms State-of-the-Art (SoTA) approaches by up to 6.4 mAP points on the IDD-OS and up to 2.0 mAP points on the IDD-10 splits for the 10-shot setting. On the PASCAL-VOC dataset, we outperform existing SoTA approaches by up to 4.9 mAP points.
Submitted 12 November, 2021;
originally announced November 2021.
-
Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training
Authors:
Can Karakus,
Rahul Huilgol,
Fei Wu,
Anirudh Subramanian,
Cade Daniel,
Derya Cavdar,
Teng Xu,
Haohan Chen,
Arash Rahnama,
Luis Quintela
Abstract:
With deep learning models rapidly growing in size, systems-level solutions for large-model training are required. We present Amazon SageMaker model parallelism, a software library that integrates with PyTorch, and enables easy training of large models using model parallelism and other memory-saving features. In contrast to existing solutions, the implementation of the SageMaker library is much more generic and flexible, in that it can automatically partition and run pipeline parallelism over arbitrary model architectures with minimal code change, and also offers a general and extensible framework for tensor parallelism, which supports a wider range of use cases, and is modular enough to be easily applied to new training scripts. The library also preserves the native PyTorch user experience to a much larger degree, supporting module re-use and dynamic graphs, while giving the user full control over the details of the training step. We evaluate performance over GPT-3, RoBERTa, BERT, and neural collaborative filtering, and demonstrate competitive performance over existing solutions.
Submitted 10 November, 2021;
originally announced November 2021.
-
A PTAS for the horizontal rectangle stabbing problem
Authors:
Arindam Khan,
Aditya Subramanian,
Andreas Wiese
Abstract:
We study rectangle stabbing problems in which we are given $n$ axis-aligned rectangles in the plane that we want to stab, i.e., we want to select line segments such that for each given rectangle there is a line segment that intersects two opposite edges of it.
In the horizontal rectangle stabbing problem (STABBING), the goal is to find a set of horizontal line segments of minimum total length such that all rectangles are stabbed. In the general rectangle stabbing problem, also known as the horizontal-vertical stabbing problem (HV-Stabbing), the goal is to find a set of rectilinear (i.e., either vertical or horizontal) line segments of minimum total length such that all rectangles are stabbed. Both variants are NP-hard. Chan, van Dijk, Fleszar, Spoerhase, and Wolff [2018] initiated the study of these problems by providing constant approximation algorithms. Recently, Eisenbrand, Gallato, Svensson, and Venzin [2021] have presented a QPTAS and a polynomial-time 8-approximation algorithm for STABBING, but it was open whether the problem admits a PTAS.
In this paper, we obtain a PTAS for STABBING, settling this question. For HV-Stabbing, we obtain a $(2+\varepsilon)$-approximation. We also obtain PTASes for special cases of HV-Stabbing: (i) when all rectangles are squares, (ii) when each rectangle's width is at most its height, and (iii) when all rectangles are $δ$-large, i.e., have at least one edge whose length is at least $δ$, while all edge lengths are at most 1. Our result also implies improved approximations for other problems such as generalized minimum Manhattan network.
Submitted 9 November, 2021;
originally announced November 2021.
-
An efficient bundle-based approach for the share-a-ride problem
Authors:
Ana Beatriz Herthel,
Richard Hartl,
Anand Subramanian,
Thibaut Vidal
Abstract:
Some of today's most significant challenges in urban environments concern individual mobility and rapid parcel delivery. With the surge of e-commerce and the ever-increasing volume of goods to be handled, new logistic solutions are in high demand. The share-a-ride problem (SARP) was proposed as one such solution, combining people and parcel transportation in taxis. This is an NP-hard problem and thus obtaining optimal solutions can be computationally costly. In this paper, we work with a variation of SARP for ride-hailing systems, which can be formulated as a multi-depot open generalised vehicle routing problem with time windows. We present and solve a mixed-integer linear programming (MILP) formulation for this problem that bundles requests together, and we compare its results to a previously proposed two-stage method. The latter solves the so-called freight insertion problem (FIP) in the second stage, for which we consider two versions, and the problem consists of inserting parcels into predefined passenger routes obtained in the first stage. We tested the methods in three sets of instances. The developed bundle-based approach outperformed both FIP versions in solution quality and in the service of parcels. Our method also compares favourably when it comes to reducing the amount of deadheading distance.
Submitted 21 February, 2023; v1 submitted 28 October, 2021;
originally announced October 2021.
-
Meta Guided Metric Learner for Overcoming Class Confusion in Few-Shot Road Object Detection
Authors:
Anay Majee,
Anbumani Subramanian,
Kshitij Agrawal
Abstract:
Localization and recognition of less-occurring road objects have been a challenge in autonomous driving applications due to the scarcity of data samples. Few-Shot Object Detection techniques extend the knowledge from existing base object classes to learn novel road objects given few training examples. Popular techniques in FSOD adopt either meta or metric learning techniques which are prone to class confusion and base class forgetting. In this work, we introduce a novel Meta Guided Metric Learner (MGML) to overcome class confusion in FSOD. We re-weight the features of the novel classes higher than the base classes through a novel Squeeze and Excite module and encourage the learning of truly discriminative class-specific features by applying an Orthogonality Constraint to the meta learner. Our method outperforms State-of-the-Art (SoTA) approaches in FSOD on the India Driving Dataset (IDD) by up to 11 mAP points while suffering from the least class confusion of 20% given only 10 examples of each novel road object. We further show similar improvements on the few-shot splits of the PASCAL VOC dataset, where we outperform SoTA approaches by up to 5.8 mAP across all splits.
Submitted 28 October, 2021;
originally announced October 2021.
-
Multi-Domain Incremental Learning for Semantic Segmentation
Authors:
Prachi Garg,
Rohit Saluja,
Vineeth N Balasubramanian,
Chetan Arora,
Anbumani Subramanian,
C. V. Jawahar
Abstract:
Recent efforts in multi-domain learning for semantic segmentation attempt to learn multiple geographical datasets in a universal, joint model. A simple fine-tuning experiment performed sequentially on three popular road scene segmentation datasets demonstrates that existing segmentation frameworks fail at incrementally learning on a series of visually disparate geographical domains. When learning a new domain, the model catastrophically forgets previously learned knowledge. In this work, we pose the problem of multi-domain incremental learning for semantic segmentation. Given a model trained on a particular geographical domain, the goal is to (i) incrementally learn a new geographical domain, (ii) while retaining performance on the old domain, (iii) given that the previous domain's dataset is not accessible. We propose a dynamic architecture that assigns universally shared, domain-invariant parameters to capture homogeneous semantic features present in all domains, while dedicated domain-specific parameters learn the statistics of each domain. Our novel optimization strategy helps achieve a good balance between retention of old knowledge (stability) and acquiring new knowledge (plasticity). We demonstrate the effectiveness of our proposed solution on domain incremental settings pertaining to real-world driving scenes from roads of Germany (Cityscapes), the United States (BDD100k), and India (IDD).
Submitted 23 October, 2021;
originally announced October 2021.
-
An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition
Authors:
Xuankai Chang,
Takashi Maekaku,
Pengcheng Guo,
Jing Shi,
Yen-Ju Lu,
Aswin Shanmugam Subramanian,
Tianzi Wang,
Shu-wen Yang,
Yu Tsao,
Hung-yi Lee,
Shinji Watanabe
Abstract:
Self-supervised pretraining on speech data has achieved a lot of progress. High-fidelity representation of the speech signal is learned from a lot of untranscribed data and shows promising performance. Recently, several works have focused on evaluating the quality of self-supervised pretrained representations on various tasks without domain restriction, e.g., SUPERB. However, such evaluations do not provide a comprehensive comparison among many ASR benchmark corpora. In this paper, we focus on the general applications of pretrained speech representations, on advanced end-to-end automatic speech recognition (E2E-ASR) models. We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR. Without any modification of the back-end model architectures or training strategy, some of the experiments with pretrained representations, e.g., WSJ, WSJ0-2mix with HuBERT, reach or outperform current state-of-the-art (SOTA) recognition performance. Moreover, we further explore more scenarios for whether the pretraining representations are effective, such as the cross-language or overlapped speech. The scripts, configurations and the trained models have been released in ESPnet to let the community reproduce our experiments and improve them.
Submitted 9 October, 2021;
originally announced October 2021.
-
Few-Shot Batch Incremental Road Object Detection via Detector Fusion
Authors:
Anuj Tambwekar,
Kshitij Agrawal,
Anay Majee,
Anbumani Subramanian
Abstract:
Incremental few-shot learning has emerged as a new and challenging area in deep learning, whose objective is to train deep learning models using very few samples of new class data, and none of the old class data. In this work we tackle the problem of batch incremental few-shot road object detection using data from the India Driving Dataset (IDD). Our approach, DualFusion, combines object detectors in a manner that allows us to learn to detect rare objects with very limited data, all without severely degrading the performance of the detector on the abundant classes. In the IDD OpenSet incremental few-shot detection task, we achieve a mAP50 score of 40.0 on the base classes and an overall mAP50 score of 38.8, both of which are the highest to date. In the COCO batch incremental few-shot detection task, we achieve a novel AP score of 9.9, surpassing the state-of-the-art novel class performance on the same by over 6.6 times.
Submitted 18 August, 2021;
originally announced August 2021.
-
Exponential-Size Neighborhoods for the Pickup-and-Delivery Traveling Salesman Problem
Authors:
Toni Pacheco,
Rafael Martinelli,
Anand Subramanian,
Túlio A. M. Toffolo,
Thibaut Vidal
Abstract:
Neighborhood search is a cornerstone of state-of-the-art traveling salesman and vehicle routing metaheuristics. While neighborhood exploration procedures are well developed for problems with individual services, their counterparts for one-to-one pickup-and-delivery problems have been studied far less. A direct extension of classic neighborhoods is often inefficient or complex due to the necessity of jointly considering service pairs. To circumvent these issues, we introduce major improvements to existing neighborhood searches for the pickup-and-delivery traveling salesman problem and new large neighborhoods. We show that the classical Relocate-Pair neighborhood can be fully explored in $O(n^2)$ instead of $O(n^3)$ time. We adapt the 4-Opt and Balas-Simonetti neighborhoods to consider precedence constraints. Moreover, we introduce an exponential-size neighborhood called 2k-Opt, which includes all solutions generated by multiple nested 2-Opts and can be searched in $O(n^2)$ time using dynamic programming. We conduct extensive computational experiments, highlighting the significant contribution of these new neighborhoods and speed-up strategies within two classical metaheuristics. Notably, our approach makes it possible to repeatedly solve small pickup-and-delivery problem instances to optimality or near-optimality within milliseconds, and therefore it represents a valuable tool for time-critical applications such as meal delivery or mobility on demand.
Submitted 20 August, 2022; v1 submitted 12 July, 2021;
originally announced July 2021.
-
Deep Learning based Multi-Source Localization with Source Splitting and its Effectiveness in Multi-Talker Speech Recognition
Authors:
Aswin Shanmugam Subramanian,
Chao Weng,
Shinji Watanabe,
Meng Yu,
Dong Yu
Abstract:
Multi-source localization is an important and challenging technique for multi-talker conversation analysis. This paper proposes a novel supervised learning method using deep neural networks to estimate the direction of arrival (DOA) of all the speakers simultaneously from the audio mixture. At the heart of the proposal is a source splitting mechanism that creates source-specific intermediate representations inside the network. This allows our model to give source-specific posteriors as the output unlike the traditional multi-label classification approach. Existing deep learning methods perform a frame level prediction, whereas our approach performs an utterance level prediction by incorporating temporal selection and averaging inside the network to avoid post-processing. We also experiment with various loss functions and show that a variant of earth mover distance (EMD) is very effective in classifying DOA at a very high resolution by modeling inter-class relationships. In addition to using the prediction error as a metric for evaluating our localization model, we also establish its potency as a frontend with automatic speech recognition (ASR) as the downstream task. We convert the estimated DOAs into a feature suitable for ASR and pass it as an additional input feature to a strong multi-channel and multi-talker speech recognition baseline. This added input feature drastically improves the ASR performance and gives a word error rate (WER) of 6.3% on the evaluation data of our simulated noisy two speaker mixtures, while the baseline which doesn't use explicit localization input has a WER of 11.5%. We also perform ASR evaluation on real recordings with the overlapped set of the MC-WSJ-AV corpus in addition to simulated mixtures.
Submitted 28 November, 2021; v1 submitted 15 February, 2021;
originally announced February 2021.
-
Few-Shot Learning for Road Object Detection
Authors:
Anay Majee,
Kshitij Agrawal,
Anbumani Subramanian
Abstract:
Few-shot learning is a problem of high interest in the evolution of deep learning. In this work, we consider the problem of few-shot object detection (FSOD) in a real-world, class-imbalanced scenario. For our experiments, we utilize the India Driving Dataset (IDD), as it includes a class of less-occurring road objects in the image dataset and hence provides a setup suitable for few-shot learning. We evaluate both metric-learning and meta-learning based FSOD methods, in two experimental settings: (i) representative (same-domain) splits from IDD, which evaluate the ability of a model to learn in the context of road images, and (ii) object classes with less-occurring object samples, similar to the open-set setting in the real world. From our experiments, we demonstrate that the metric-learning method outperforms meta-learning on the novel classes by (i) 11.2 mAP points on the same domain, and (ii) 1.0 mAP point on the open-set. We also show that our extension of object classes in a real-world open dataset offers a rich ground for few-shot learning studies.
Submitted 17 March, 2021; v1 submitted 29 January, 2021;
originally announced January 2021.
-
The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans
Authors:
Shinji Watanabe,
Florian Boyer,
Xuankai Chang,
Pengcheng Guo,
Tomoki Hayashi,
Yosuke Higuchi,
Takaaki Hori,
Wen-Chin Huang,
Hirofumi Inaguma,
Naoyuki Kamo,
Shigeki Karita,
Chenda Li,
Jing Shi,
Aswin Shanmugam Subramanian,
Wangyou Zhang
Abstract:
This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017 to mainly deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. ESPnet now also includes text to speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE) with support for beamforming, speech separation, denoising, and dereverberation. All applications are trained in an end-to-end manner, thanks to the generic sequence-to-sequence modeling properties, and they can be further integrated and jointly optimized. Also, ESPnet provides reproducible all-in-one recipes for these applications with state-of-the-art performance in various benchmarks by incorporating transformer, advanced data augmentation, and conformer. This project aims to provide an up-to-date speech processing experience to the community so that researchers in academia and industry at various scales can develop their technologies collaboratively.
Submitted 23 December, 2020;
originally announced December 2020.
-
COVIDScholar: An automated COVID-19 research aggregation and analysis platform
Authors:
Amalie Trewartha,
John Dagdelen,
Haoyan Huo,
Kevin Cruse,
Zheren Wang,
Tanjin He,
Akshay Subramanian,
Yuxing Fei,
Benjamin Justus,
Kristin Persson,
Gerbrand Ceder
Abstract:
The ongoing COVID-19 pandemic has had far-reaching effects throughout society, and science is no exception. The scale, speed, and breadth of the scientific community's COVID-19 response has led to the emergence of new research literature on a remarkable scale -- as of October 2020, over 81,000 COVID-19 related scientific papers have been released, at a rate of over 250 per day. This has created a challenge to traditional methods of engagement with the research literature; the volume of new research is far beyond the ability of any human to read, and the urgency of response has led to an increasingly prominent role for pre-print servers and a diffusion of relevant research across sources. These factors have created a need for new tools to change the way scientific literature is disseminated. COVIDScholar is a knowledge portal designed with the unique needs of the COVID-19 research community in mind, utilizing NLP to aid researchers in synthesizing the information spread across thousands of emergent research articles, patents, and clinical trials into actionable insights and new knowledge. The search interface for this corpus, https://covidscholar.org, now serves over 2000 unique users weekly. We also present an analysis of trends in COVID-19 research over the course of 2020.
Submitted 7 December, 2020;
originally announced December 2020.
-
ESPnet-SE: End-to-End Speech Enhancement and Separation Toolkit Designed for ASR Integration
Authors:
Chenda Li,
Jing Shi,
Wangyou Zhang,
Aswin Shanmugam Subramanian,
Xuankai Chang,
Naoyuki Kamo,
Moto Hira,
Tomoki Hayashi,
Christoph Boeddeker,
Zhuo Chen,
Shinji Watanabe
Abstract:
We present ESPnet-SE, which is designed for the quick development of speech enhancement and speech separation systems in a single framework, along with an optional downstream speech recognition module. ESPnet-SE is a new project that integrates rich automatic speech recognition related models, resources, and systems to support and validate the proposed front-end implementation (i.e., speech enhancement and separation). It is capable of processing both single-channel and multi-channel data, with various functionalities including dereverberation, denoising, and source separation. We provide all-in-one recipes including data pre-processing, feature extraction, training, and evaluation pipelines for a wide range of benchmark datasets. This paper describes the design of the toolkit; several important functionalities, especially the speech recognition integration, which differentiates ESPnet-SE from other open source toolkits; and experimental results with major benchmark datasets.
Submitted 7 November, 2020;
originally announced November 2020.
-
Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization
Authors:
Aswin Shanmugam Subramanian,
Chao Weng,
Shinji Watanabe,
Meng Yu,
Yong Xu,
Shi-Xiong Zhang,
Dong Yu
Abstract:
This paper proposes a new paradigm for handling far-field multi-speaker data in an end-to-end neural network manner, called directional automatic speech recognition (D-ASR), which explicitly models source speaker locations. In D-ASR, the azimuth angle of the sources with respect to the microphone array is defined as a latent variable. This angle controls the quality of separation, which in turn determines the ASR performance. All three functionalities of D-ASR (localization, separation, and recognition) are connected as a single differentiable neural network and trained solely based on ASR error minimization objectives. The advantages of D-ASR over existing methods are threefold: (1) it provides explicit speaker locations, (2) it improves explainability, and (3) it achieves better ASR performance as the process is more streamlined. In addition, D-ASR does not require explicit direction of arrival (DOA) supervision like existing data-driven localization models, which makes it more appropriate for realistic data. For the case of two source mixtures, D-ASR achieves an average DOA prediction error of less than three degrees. It also outperforms a strong far-field multi-speaker end-to-end system in both separation quality and ASR performance.
Submitted 30 October, 2020;
originally announced November 2020.