Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation

Authors: Arvind Krishna Sridhar, Yinyi Guo, Erik Visser, Rehana Mahfuz

Abstract: There has been significant research on developing pretrained transformer architectures for multimodal-to-text generation tasks. Albeit performance improvements, such models are frequently overparameterized, hence suffer from hallucination and large memory footprint making them challenging to deploy on edge devices. In this paper, we address both these issues for the application of automated audio captioning. First, we propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination. Then, we propose a parameter efficient inference time faithful decoding algorithm that enables smaller audio captioning models with performance equivalent to larger models trained with more data. During the beam decoding step, the smaller model utilizes an audio-text shared latent representation to semantically align the generated text with corresponding input audio. Faithful guidance is introduced into the beam probability by incorporating the cosine similarity between latent representation projections of greedy rolled out intermediate beams and audio clip. We show the efficacy of our algorithm on benchmark datasets and evaluate the proposed scheme against baselines using conventional audio captioning and semantic similarity metrics while illustrating tradeoffs between performance and complexity.
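The guidance step described in this abstract can be illustrated with a minimal sketch. It assumes each beam already carries a log-probability and a text embedding in the shared latent space, and that the similarity is simply added to the log-probability with a weight `alpha`. The function names, the weight, and the additive combination are illustrative assumptions, not the authors' formulation.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rescore_beams(beams, audio_embedding, alpha=1.0):
    """Re-rank beams by log-probability plus weighted audio-text similarity.

    beams: list of (text, log_prob, text_embedding) tuples. The embeddings are
    assumed to come from some audio-text encoder; that encoder is not shown here.
    alpha: hypothetical weight on the similarity term.
    Returns (score, text) pairs sorted from best to worst.
    """
    scored = []
    for text, log_prob, text_embedding in beams:
        similarity = cosine_similarity(text_embedding, audio_embedding)
        scored.append((log_prob + alpha * similarity, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

# Toy data: the first caption aligns with the audio embedding, the second does not.
toy_audio = [1.0, 0.0]
toy_beams = [
    ("a dog barks", -1.0, [1.0, 0.0]),
    ("music plays", -0.8, [0.0, 1.0]),
]
```

With `alpha=0.0` the ranking falls back to plain log-probability, so the weight controls how strongly audio-text agreement can override the language model's preference.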

Submitted 6 September, 2023; originally announced September 2023.

Comments: 5 pages, 5 tables, 1 figure

arXiv:2309.03326 [pdf, other]

Detecting False Alarms and Misses in Audio Captions

Authors: Rehana Mahfuz, Yinyi Guo, Arvind Krishna Sridhar, Erik Visser

Abstract: Metrics to evaluate audio captions simply provide a score without much explanation regarding what may be wrong in case the score is low. Manual human intervention is needed to find any shortcomings of the caption. In this work, we introduce a metric which automatically identifies the shortcomings of an audio caption by detecting the misses and false alarms in a candidate caption with respect to a reference caption, and reports the recall, precision and F-score. Such a metric is very useful in profiling the deficiencies of an audio captioning model, which is a milestone towards improving the quality of audio captions.
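As a rough illustration of how misses and false alarms relate to recall, precision and F-score, the sketch below assumes each caption has already been reduced to a set of sound-event labels and that labels match only when identical. How the paper extracts and matches events is not stated in the abstract, so both steps are assumptions here, and the function name is hypothetical.

```python
def caption_error_profile(candidate_events, reference_events):
    """Compare sound-event label sets from a candidate and a reference caption.

    A miss is a reference event absent from the candidate; a false alarm is a
    candidate event absent from the reference. Exact label matching is assumed.
    """
    candidate = set(candidate_events)
    reference = set(reference_events)
    hits = candidate & reference
    misses = reference - candidate
    false_alarms = candidate - reference

    precision = len(hits) / len(candidate) if candidate else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    if precision + recall > 0.0:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0

    return {
        "misses": misses,
        "false_alarms": false_alarms,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }

# Toy example: one shared event, one missed event, one false alarm.
profile = caption_error_profile({"dog bark", "car horn"}, {"dog bark", "rain"})
```

Returning the miss and false-alarm sets alongside the scores is what makes the result explanatory rather than a bare number.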

Submitted 6 September, 2023; originally announced September 2023.

arXiv:2212.02712 [pdf, other]

Improved Beam Search for Hallucination Mitigation in Abstractive Summarization

Authors: Arvind Krishna Sridhar, Erik Visser

Abstract: Advancement in large pretrained language models has significantly improved their performance for conditional language generation tasks including summarization albeit with hallucinations. To reduce hallucinations, conventional methods proposed improving beam search or using a fact checker as a postprocessing step. In this paper, we investigate the use of the Natural Language Inference (NLI) entailment metric to detect and prevent hallucinations in summary generation. We propose an NLI-assisted beam re-ranking mechanism by computing entailment probability scores between the input context and summarization model-generated beams during saliency-enhanced greedy decoding. Moreover, a diversity metric is introduced to compare its effectiveness against vanilla beam search. Our proposed algorithm significantly outperforms vanilla beam decoding on XSum and CNN/DM datasets.
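One way to picture entailment-based beam re-ranking is the sketch below. It assumes an entailment scorer is available as a callable returning a probability in [0, 1], and that this probability is combined with the beam log-probability in log space using a weight. The scorer, the weight, and the combination rule are all hypothetical and are not taken from the paper; the toy scorer stands in for a real NLI model purely so the example runs.

```python
import math

def rerank_by_entailment(context, beams, entailment_prob, weight=1.0):
    """Re-rank (summary, log_prob) beams using an entailment probability.

    entailment_prob: hypothetical callable (context, summary) -> float in [0, 1].
    weight: hypothetical weight on the log entailment probability.
    Returns summaries sorted from best to worst combined score.
    """
    def combined_score(beam):
        summary, log_prob = beam
        # Clamp to avoid log(0) when the scorer returns exactly zero.
        p = max(entailment_prob(context, summary), 1e-9)
        return log_prob + weight * math.log(p)

    ranked = sorted(beams, key=combined_score, reverse=True)
    return [summary for summary, _ in ranked]

def toy_entailment_prob(context, summary):
    """Stand-in scorer: high if every summary word occurs in the context."""
    context_words = set(context.lower().split())
    summary_words = set(summary.lower().split())
    return 0.9 if summary_words <= context_words else 0.1

toy_context = "the cat sat on the mat"
toy_summaries = [("the dog ran", -1.0), ("the cat sat", -1.2)]
```

Setting `weight=0.0` recovers the original beam order by log-probability, which makes it easy to compare against vanilla beam search on the same candidates.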

Submitted 14 November, 2023; v1 submitted 5 December, 2022; originally announced December 2022.

Comments: 8 pages, 2 figures

arXiv:2209.09316 [pdf, other]

Activity report analysis with automatic single or multispan answer extraction

Authors: Ravi Choudhary, Arvind Krishna Sridhar, Erik Visser

Abstract: In the era of IoT (Internet of Things) we are surrounded by a plethora of AI enabled devices that can transcribe images, video, audio, and sensor signals into text descriptions. When such transcriptions are captured in activity reports for monitoring, life logging and anomaly detection applications, a user would typically request a summary or ask targeted questions about certain sections of the report they are interested in. Depending on the context and the type of question asked, a question answering (QA) system would need to automatically determine whether the answer covers single-span or multi-span text components. Currently available QA datasets primarily focus on single span responses only (such as SQuAD[4]) or contain a low proportion of examples with multiple span answers (such as DROP[3]). To investigate automatic selection of single/multi-span answers in the use case described, we created a new smart home environment dataset comprised of questions paired with single-span or multi-span answers depending on the question and context queried. In addition, we propose a RoBERTa[6]-based multiple span extraction question answering (MSEQA) model returning the appropriate answer span for a given question. Our experiments show that the proposed model outperforms state-of-the-art QA models on our dataset while providing comparable performance on published individual single/multi-span task datasets.

Submitted 9 September, 2022; originally announced September 2022.

arXiv:2109.05097 [pdf, other]

HypoGen: Hyperbole Generation with Commonsense and Counterfactual Knowledge

Authors: Yufei Tian, Arvind Krishna Sridhar, Nanyun Peng

Abstract: A hyperbole is an intentional and creative exaggeration not to be taken literally. Despite its ubiquity in daily life, the computational explorations of hyperboles are scarce. In this paper, we tackle the under-explored and challenging task: sentence-level hyperbole generation. We start with a representative syntactic pattern for intensification and systematically study the semantic (commonsense and counterfactual) relationships between each component in such hyperboles. Next, we leverage the COMeT and reverse COMeT models to do commonsense and counterfactual inference. We then generate multiple hyperbole candidates based on our findings from the pattern, and train neural classifiers to rank and select high-quality hyperboles. Automatic and human evaluations show that our generation method is able to generate hyperboles creatively with high success rate and intensity scores.

Submitted 10 September, 2021; originally announced September 2021.

Comments: Accepted at Findings of EMNLP21

arXiv:1707.06391 [pdf, other]

Deterministic Dispersion of Mobile Robots in Dynamic Rings

Authors: Ankush Agarwalla, John Augustine, William K. Moses Jr., Madhav Sankar K., Arvind Krishna Sridhar

Abstract: In this work, we study the problem of dispersion of mobile robots on dynamic rings. The problem of dispersion of $n$ robots on an $n$ node graph, introduced by Augustine and Moses Jr. [1], requires robots to coordinate with each other and reach a configuration where exactly one robot is present on each node. This problem has real world applications and applies whenever we want to minimize the total cost of $n$ agents sharing $n$ resources, located at various places, subject to the constraint that the cost of an agent moving to a different resource is comparatively much smaller than the cost of multiple agents sharing a resource (e.g. smart electric cars sharing recharge stations). The study of this problem also provides indirect benefits to the study of scattering on graphs, the study of exploration by mobile robots, and the study of load balancing on graphs. We solve the problem of dispersion in the presence of two types of dynamism in the underlying graph: (i) vertex permutation and (ii) 1-interval connectivity. We introduce the notion of vertex permutation dynamism and have it mean that for a given set of nodes, in every round, the adversary ensures a ring structure is maintained, but the connections between the nodes may change. We use the idea of 1-interval connectivity from Di Luna et al. [10], where for a given ring, in each round, the adversary chooses at most one edge to remove.
We assume robots have full visibility and present asymptotically time optimal algorithms to achieve dispersion in the presence of both types of dynamism when robots have chirality. When robots do not have chirality, we present asymptotically time optimal algorithms to achieve dispersion subject to certain constraints. Finally, we provide impossibility results for dispersion when robots have no visibility.

Submitted 16 October, 2017; v1 submitted 20 July, 2017; originally announced July 2017.

Comments: 21 pages, 10 figures, concise version of paper to appear in ICDCN 2018

ACM Class: F.2.2; G.2.2

Showing 1–6 of 6 results for author: Sridhar, A K