Search | arXiv e-print repository

What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Noise-free Text-Image Corruption and Evaluation

Authors: Michal Golovanevsky, William Rudman, Vedant Palit, Ritambhara Singh, Carsten Eickhoff

Abstract: Vision-Language Models (VLMs) have gained community-spanning prominence due to their ability to integrate visual and textual inputs to perform complex tasks. Despite their success, the internal decision-making processes of these models remain opaque, posing challenges in high-stakes applications. To address this, we introduce NOTICE, the first Noise-free Text-Image Corruption and Evaluation pipeli… ▽ More Vision-Language Models (VLMs) have gained community-spanning prominence due to their ability to integrate visual and textual inputs to perform complex tasks. Despite their success, the internal decision-making processes of these models remain opaque, posing challenges in high-stakes applications. To address this, we introduce NOTICE, the first Noise-free Text-Image Corruption and Evaluation pipeline for mechanistic interpretability in VLMs. NOTICE incorporates a Semantic Minimal Pairs (SMP) framework for image corruption and Symmetric Token Replacement (STR) for text. This approach enables semantically meaningful causal mediation analysis for both modalities, providing a robust method for analyzing multimodal integration within models like BLIP. Our experiments on the SVO-Probes, MIT-States, and Facial Expression Recognition datasets reveal crucial insights into VLM decision-making, identifying the significant role of middle-layer cross-attention heads. Further, we uncover a set of ``universal cross-attention heads'' that consistently contribute across tasks and modalities, each performing distinct functions such as implicit image segmentation, object inhibition, and outlier inhibition. This work paves the way for more transparent and interpretable multimodal systems. △ Less

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.15241 [pdf]

doi 10.1145/3664190.3672514

Retrieval Augmented Zero-Shot Text Classification

Authors: Tassallah Abdullahi, Ritambhara Singh, Carsten Eickhoff

Abstract: Zero-shot text learning enables text classifiers to handle unseen classes efficiently, alleviating the need for task-specific training data. A simple approach often relies on comparing embeddings of query (text) to those of potential classes. However, the embeddings of a simple query sometimes lack rich contextual information, which hinders the classification performance. Traditionally, this has b… ▽ More Zero-shot text learning enables text classifiers to handle unseen classes efficiently, alleviating the need for task-specific training data. A simple approach often relies on comparing embeddings of query (text) to those of potential classes. However, the embeddings of a simple query sometimes lack rich contextual information, which hinders the classification performance. Traditionally, this has been addressed by improving the embedding model with expensive training. We introduce QZero, a novel training-free knowledge augmentation approach that reformulates queries by retrieving supporting categories from Wikipedia to improve zero-shot text classification performance. Our experiments across six diverse datasets demonstrate that QZero enhances performance for state-of-the-art static and contextual embedding models without the need for retraining. Notably, in News and medical topic classification tasks, QZero improves the performance of even the largest OpenAI embedding model by at least 5% and 3%, respectively. Acting as a knowledge amplifier, QZero enables small word embedding models to achieve performance levels comparable to those of larger contextual models, offering the potential for significant computational savings. Additionally, QZero offers meaningful insights that illuminate query context and verify topic relevance, aiding in understanding model predictions. Overall, QZero improves embedding-based zero-shot classifiers while maintaining their simplicity. This makes it particularly valuable for resource-constrained environments and domains with constantly evolving information. △ Less

Submitted 26 June, 2024; v1 submitted 21 June, 2024; originally announced June 2024.

Comments: Proceedings of the 2024 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR '24), July 13, 2024, Washington DC, DC, USA

arXiv:2406.11769 [pdf, other]

Solving Vision Tasks with Simple Photoreceptors Instead of Cameras

Authors: Andrei Atanov, Jiawei Fu, Rishubh Singh, Isabella Yu, Andrew Spielberg, Amir Zamir

Abstract: A de facto standard in solving computer vision problems is to use a common high-resolution camera and choose its placement on an agent (i.e., position and orientation) based on human intuition. On the other hand, extremely simple and well-designed visual sensors found throughout nature allow many organisms to perform diverse, complex behaviors. In this work, motivated by these examples, we raise t… ▽ More A de facto standard in solving computer vision problems is to use a common high-resolution camera and choose its placement on an agent (i.e., position and orientation) based on human intuition. On the other hand, extremely simple and well-designed visual sensors found throughout nature allow many organisms to perform diverse, complex behaviors. In this work, motivated by these examples, we raise the following questions: 1. How effective simple visual sensors are in solving vision tasks? 2. What role does their design play in their effectiveness? We explore simple sensors with resolutions as low as one-by-one pixel, representing a single photoreceptor First, we demonstrate that just a few photoreceptors can be enough to solve many tasks, such as visual navigation and continuous control, reasonably well, with performance comparable to that of a high-resolution camera. Second, we show that the design of these simple visual sensors plays a crucial role in their ability to provide useful information and successfully solve these tasks. To find a well-performing design, we present a computational design optimization algorithm and evaluate its effectiveness across different tasks and domains, showing promising results. Finally, we perform a human survey to evaluate the effectiveness of intuitive designs devised manually by humans, showing that the computationally found design is among the best designs in most cases. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.09750 [pdf, other]

ControlVAR: Exploring Controllable Visual Autoregressive Modeling

Authors: Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Zhe Lin, Rita Singh, Bhiksha Raj

Abstract: Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs), especially in tasks like control-to-image generation. However, challenges such as expensive computational cost, high inference latency, and difficulties of integration with large language models (LLMs) have necessitated exploring alternatives to DMs. This paper introduces ControlVAR, a novel… ▽ More Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs), especially in tasks like control-to-image generation. However, challenges such as expensive computational cost, high inference latency, and difficulties of integration with large language models (LLMs) have necessitated exploring alternatives to DMs. This paper introduces ControlVAR, a novel framework that explores pixel-level controls in visual autoregressive (VAR) modeling for flexible and efficient conditional generation. In contrast to traditional conditional models that learn the conditional distribution, ControlVAR jointly models the distribution of image and pixel-level conditions during training and imposes conditional controls during testing. To enhance the joint modeling, we adopt the next-scale AR prediction paradigm and unify control and image representations. A teacher-forcing guidance strategy is proposed to further facilitate controllable generation with joint modeling. Extensive experiments demonstrate the superior efficacy and flexibility of ControlVAR across various conditional generation tasks against popular conditional DMs, \eg, ControlNet and T2I-Adaptor. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: 24 pages, 19 figures, 4 tables

arXiv:2406.04346 [pdf, other]

doi 10.1145/3644815.3644981

Automating Patch Set Generation from Code Review Comments Using Large Language Models

Authors: Tajmilur Rahman, Rahul Singh, Mir Yousuf Sultan

Abstract: The advent of Large Language Models (LLMs) has revolutionized various domains of artificial intelligence, including the realm of software engineering. In this research, we evaluate the efficacy of pre-trained LLMs in replicating the tasks traditionally performed by developers in response to code review comments. We provide code contexts to five popular LLMs and obtain the suggested code-changes (p… ▽ More The advent of Large Language Models (LLMs) has revolutionized various domains of artificial intelligence, including the realm of software engineering. In this research, we evaluate the efficacy of pre-trained LLMs in replicating the tasks traditionally performed by developers in response to code review comments. We provide code contexts to five popular LLMs and obtain the suggested code-changes (patch sets) derived from real-world code-review comments. The performance of each model is meticulously assessed by comparing their generated patch sets against the historical data of human-generated patch-sets from the same repositories. This comparative analysis aims to determine the accuracy, relevance, and depth of the LLMs' feedback, thereby evaluating their readiness to support developers in responding to code-review comments. Novelty: This particular research area is still immature requiring a substantial amount of studies yet to be done. No prior research has compared the performance of existing Large Language Models (LLMs) in code-review comments. This in-progress study assesses current LLMs in code review and paves the way for future advancements in automated code quality assurance, reducing context-switching overhead due to interruptions from code change requests. △ Less

Submitted 9 April, 2024; originally announced June 2024.

Comments: 2 pages

arXiv:2405.18793 [pdf, other]

Adaptive Discretization-based Non-Episodic Reinforcement Learning in Metric Spaces

Authors: Avik Kar, Rahul Singh

Abstract: We study non-episodic Reinforcement Learning for Lipschitz MDPs in which state-action space is a metric space, and the transition kernel and rewards are Lipschitz functions. We develop computationally efficient UCB-based algorithm, $\textit{ZoRL-}ε$ that adaptively discretizes the state-action space and show that their regret as compared with $ε$-optimal policy is bounded as… ▽ More We study non-episodic Reinforcement Learning for Lipschitz MDPs in which state-action space is a metric space, and the transition kernel and rewards are Lipschitz functions. We develop computationally efficient UCB-based algorithm, $\textit{ZoRL-}ε$ that adaptively discretizes the state-action space and show that their regret as compared with $ε$-optimal policy is bounded as $\mathcal{O}(ε^{-(2 d_\mathcal{S} + d^ε_z + 1)}\log{(T)})$, where $d^ε_z$ is the $ε$-zooming dimension. In contrast, if one uses the vanilla $\textit{UCRL-}2$ on a fixed discretization of the MDP, the regret w.r.t. a $ε$-optimal policy scales as $\mathcal{O}(ε^{-(2 d_\mathcal{S} + d + 1)}\log{(T)})$ so that the adaptivity gains are huge when $d^ε_z \ll d$. Note that the absolute regret of any 'uniformly good' algorithm for a large family of continuous MDPs asymptotically scales as at least $Ω(\log{(T)})$. Though adaptive discretization has been shown to yield $\mathcal{\tilde{O}}(H^{2.5}K^\frac{d_z + 1}{d_z + 2})$ regret in episodic RL, an attempt to extend this to the non-episodic case by employing constant duration episodes whose duration increases with $T$, is futile since $d_z \to d$ as $T \to \infty$. The current work shows how to obtain adaptivity gains for non-episodic RL. The theoretical results are supported by simulations on two systems where the performance of $\textit{ZoRL-}ε$ is compared with that of '$\textit{UCRL-C}$,' the fixed discretization-based extension of $\textit{UCRL-}2$ for systems with continuous state-action spaces. △ Less

Submitted 29 May, 2024; originally announced May 2024.

Comments: 38 pages, 2 figures

arXiv:2405.15468 [pdf, other]

Semantic Aware Diffusion Inverse Tone Map**

Authors: Abhishek Goswami, Aru Ranjan Singh, Francesco Banterle, Kurt Debattista, Thomas Bashford-Rogers

Abstract: The range of real-world scene luminance is larger than the capture capability of many digital camera sensors which leads to details being lost in captured images, most typically in bright regions. Inverse tone map** attempts to boost these captured Standard Dynamic Range (SDR) images back to High Dynamic Range (HDR) by creating a map** that linearizes the well exposed values from the SDR image… ▽ More The range of real-world scene luminance is larger than the capture capability of many digital camera sensors which leads to details being lost in captured images, most typically in bright regions. Inverse tone map** attempts to boost these captured Standard Dynamic Range (SDR) images back to High Dynamic Range (HDR) by creating a map** that linearizes the well exposed values from the SDR image, and provides a luminance boost to the clipped content. However, in most cases, the details in the clipped regions cannot be recovered or estimated. In this paper, we present a novel inverse tone map** approach for map** SDR images to HDR that generates lost details in clipped regions through a semantic-aware diffusion based inpainting approach. Our method proposes two major contributions - first, we propose to use a semantic graph to guide SDR diffusion based inpainting in masked regions in a saturated image. Second, drawing inspiration from traditional HDR imaging and bracketing methods, we propose a principled formulation to lift the SDR inpainted regions to HDR that is compatible with generative inpainting methods. Results show that our method demonstrates superior performance across different datasets on objective metrics, and subjective experiments show that the proposed method matches (and in most cases outperforms) state-of-art inverse tone map** operators in terms of objective metrics and outperforms them for visual fidelity. △ Less

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.13370 [pdf, other]

Low-Resolution Chest X-ray Classification via Knowledge Distillation and Multi-task Learning

Authors: Yasmeena Akhter, Rishabh Ranjan, Richa Singh, Mayank Vatsa

Abstract: This research addresses the challenges of diagnosing chest X-rays (CXRs) at low resolutions, a common limitation in resource-constrained healthcare settings. High-resolution CXR imaging is crucial for identifying small but critical anomalies, such as nodules or opacities. However, when images are downsized for processing in Computer-Aided Diagnosis (CAD) systems, vital spatial details and receptiv… ▽ More This research addresses the challenges of diagnosing chest X-rays (CXRs) at low resolutions, a common limitation in resource-constrained healthcare settings. High-resolution CXR imaging is crucial for identifying small but critical anomalies, such as nodules or opacities. However, when images are downsized for processing in Computer-Aided Diagnosis (CAD) systems, vital spatial details and receptive fields are lost, hampering diagnosis accuracy. To address this, this paper presents the Multilevel Collaborative Attention Knowledge (MLCAK) method. This approach leverages the self-attention mechanism of Vision Transformers (ViT) to transfer critical diagnostic knowledge from high-resolution images to enhance the diagnostic efficacy of low-resolution CXRs. MLCAK incorporates local pathological findings to boost model explainability, enabling more accurate global predictions in a multi-task framework tailored for low-resolution CXR analysis. Our research, utilizing the Vindr CXR dataset, shows a considerable enhancement in the ability to diagnose diseases from low-resolution images (e.g. 28 x 28), suggesting a critical transition from the traditional reliance on high-resolution imaging (e.g. 224 x 224). △ Less

Submitted 22 May, 2024; originally announced May 2024.

Comments: IEEE ISBI 2024

arXiv:2405.09101 [pdf, other]

Adaptive Koopman Embedding for Robust Control of Complex Nonlinear Dynamical Systems

Authors: Rajpal Singh, Chandan Kumar Sah, Jishnu Keshavan

Abstract: The discovery of linear embedding is the key to the synthesis of linear control techniques for nonlinear systems. In recent years, while Koopman operator theory has become a prominent approach for learning these linear embeddings through data-driven methods, these algorithms often exhibit limitations in generalizability beyond the distribution captured by training data and are not robust to change… ▽ More The discovery of linear embedding is the key to the synthesis of linear control techniques for nonlinear systems. In recent years, while Koopman operator theory has become a prominent approach for learning these linear embeddings through data-driven methods, these algorithms often exhibit limitations in generalizability beyond the distribution captured by training data and are not robust to changes in the nominal system dynamics induced by intrinsic or environmental factors. To overcome these limitations, this study presents an adaptive Koopman architecture capable of responding to the changes in system dynamics online. The proposed framework initially employs an autoencoder-based neural network that utilizes input-output information from the nominal system to learn the corresponding Koopman embedding offline. Subsequently, we augment this nominal Koopman architecture with a feed-forward neural network that learns to modify the nominal dynamics in response to any deviation between the predicted and observed lifted states, leading to improved generalization and robustness to a wide range of uncertainties and disturbances compared to contemporary methods. Extensive tracking control simulations, which are undertaken by integrating the proposed scheme within a Model Predictive Control framework, are used to highlight its robustness against measurement noise, disturbances, and parametric variations in system dynamics. △ Less

Submitted 20 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

Comments: Corrected the title

arXiv:2404.09530 [pdf, other]

doi 10.1145/3595916.3626448

RanLayNet: A Dataset for Document Layout Detection used for Domain Adaptation and Generalization

Authors: Avinash Anand, Raj Jaiswal, Mohit Gupta, Siddhesh S Bangar, Pijush Bhuyan, Naman Lal, Rajeev Singh, Ritika Jha, Rajiv Ratn Shah, Shin'ichi Satoh

Abstract: Large ground-truth datasets and recent advances in deep learning techniques have been useful for layout detection. However, because of the restricted layout diversity of these datasets, training on them requires a sizable number of annotated instances, which is both expensive and time-consuming. As a result, differences between the source and target domains may significantly impact how well these… ▽ More Large ground-truth datasets and recent advances in deep learning techniques have been useful for layout detection. However, because of the restricted layout diversity of these datasets, training on them requires a sizable number of annotated instances, which is both expensive and time-consuming. As a result, differences between the source and target domains may significantly impact how well these models function. To solve this problem, domain adaptation approaches have been developed that use a small quantity of labeled data to adjust the model to the target domain. In this research, we introduced a synthetic document dataset called RanLayNet, enriched with automatically assigned labels denoting spatial positions, ranges, and types of layout elements. The primary aim of this endeavor is to develop a versatile dataset capable of training models with robustness and adaptability to diverse document formats. Through empirical experimentation, we demonstrate that a deep layout identification model trained on our dataset exhibits enhanced performance compared to a model trained solely on actual documents. Moreover, we conduct a comparative analysis by fine-tuning inference models using both PubLayNet and IIIT-AR-13K datasets on the Doclaynet dataset. Our findings emphasize that models enriched with our dataset are optimal for tasks such as achieving 0.398 and 0.588 mAP95 score in the scientific document domain for the TABLE class. △ Less

Submitted 19 April, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

Comments: 8 pages, 6 figures, MMAsia 2023 Proceedings of the 5th ACM International Conference on Multimedia in Asia

Journal ref: In Proceedings of the 5th ACM International Conference on Multimedia in Asia 2023. Association for Computing Machinery, NY, USA, Article 74, pp. 1-6

arXiv:2404.03587 [pdf, other]

Anticipate & Collab: Data-driven Task Anticipation and Knowledge-driven Planning for Human-robot Collaboration

Authors: Shivam Singh, Karthik Swaminathan, Raghav Arora, Ramandeep Singh, Ahana Datta, Dipanjan Das, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna

Abstract: An agent assisting humans in daily living activities can collaborate more effectively by anticipating upcoming tasks. Data-driven methods represent the state of the art in task anticipation, planning, and related problems, but these methods are resource-hungry and opaque. Our prior work introduced a proof of concept framework that used an LLM to anticipate 3 high-level tasks that served as goals f… ▽ More An agent assisting humans in daily living activities can collaborate more effectively by anticipating upcoming tasks. Data-driven methods represent the state of the art in task anticipation, planning, and related problems, but these methods are resource-hungry and opaque. Our prior work introduced a proof of concept framework that used an LLM to anticipate 3 high-level tasks that served as goals for a classical planning system that computed a sequence of low-level actions for the agent to achieve these goals. This paper describes DaTAPlan, our framework that significantly extends our prior work toward human-robot collaboration. Specifically, DaTAPlan planner computes actions for an agent and a human to collaboratively and jointly achieve the tasks anticipated by the LLM, and the agent automatically adapts to unexpected changes in human action outcomes and preferences. We evaluate DaTAPlan capabilities in a realistic simulation environment, demonstrating accurate task anticipation, effective human-robot collaboration, and the ability to adapt to unexpected changes. Project website: https://dataplan-hrc.github.io △ Less

Submitted 4 April, 2024; originally announced April 2024.

arXiv:2403.20312 [pdf, other]

Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations

Authors: Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa, Richa Singh, Aparna Bharati

Abstract: Existing vision-language models (VLMs) treat text descriptions as a unit, confusing individual concepts in a prompt and impairing visual semantic matching and reasoning. An important aspect of reasoning in logic and language is negations. This paper highlights the limitations of popular VLMs such as CLIP, at understanding the implications of negations, i.e., the effect of the word "not" in a given… ▽ More Existing vision-language models (VLMs) treat text descriptions as a unit, confusing individual concepts in a prompt and impairing visual semantic matching and reasoning. An important aspect of reasoning in logic and language is negations. This paper highlights the limitations of popular VLMs such as CLIP, at understanding the implications of negations, i.e., the effect of the word "not" in a given prompt. To enable evaluation of VLMs on fluent prompts with negations, we present CC-Neg, a dataset containing 228,246 images, true captions and their corresponding negated captions. Using CC-Neg along with modifications to the contrastive loss of CLIP, our proposed CoN-CLIP framework, has an improved understanding of negations. This training paradigm improves CoN-CLIP's ability to encode semantics reliably, resulting in 3.85% average gain in top-1 accuracy for zero-shot image classification across 8 datasets. Further, CoN-CLIP outperforms CLIP on challenging compositionality benchmarks such as SugarCREPE by 4.4%, showcasing emergent compositional understanding of objects, relations, and attributes in text. Overall, our work addresses a crucial limitation of VLMs by introducing a dataset and framework that strengthens semantic associations between images and text, demonstrating improved large-scale foundation models with significantly reduced computational cost, promoting efficiency and accessibility. △ Less

Submitted 29 March, 2024; originally announced March 2024.

Comments: 14 pages + 6 figures in main manuscript (excluding references)

arXiv:2403.15248 [pdf, other]

Self-Supervised Backbone Framework for Diverse Agricultural Vision Tasks

Authors: Sudhir Sornapudi, Rajhans Singh

Abstract: Computer vision in agriculture is game-changing with its ability to transform farming into a data-driven, precise, and sustainable industry. Deep learning has empowered agriculture vision to analyze vast, complex visual data, but heavily rely on the availability of large annotated datasets. This remains a bottleneck as manual labeling is error-prone, time-consuming, and expensive. The lack of effi… ▽ More Computer vision in agriculture is game-changing with its ability to transform farming into a data-driven, precise, and sustainable industry. Deep learning has empowered agriculture vision to analyze vast, complex visual data, but heavily rely on the availability of large annotated datasets. This remains a bottleneck as manual labeling is error-prone, time-consuming, and expensive. The lack of efficient labeling approaches inspired us to consider self-supervised learning as a paradigm shift, learning meaningful feature representations from raw agricultural image data. In this work, we explore how self-supervised representation learning unlocks the potential applicability to diverse agriculture vision tasks by eliminating the need for large-scale annotated datasets. We propose a lightweight framework utilizing SimCLR, a contrastive learning approach, to pre-train a ResNet-50 backbone on a large, unannotated dataset of real-world agriculture field images. Our experimental analysis and results indicate that the model learns robust features applicable to a broad range of downstream agriculture tasks discussed in the paper. Additionally, the reduced reliance on annotated data makes our approach more cost-effective and accessible, paving the way for broader adoption of computer vision in agriculture. △ Less

Submitted 22 March, 2024; originally announced March 2024.

arXiv:2403.13989 [pdf, other]

FastFlip: Compositional Error Injection Analysis

Authors: Keyur Joshi, Rahul Singh, Tommaso Bassetto, Sarita Adve, Darko Marinov, Sasa Misailovic

Abstract: Instruction-level error injection analyses aim to find instructions where errors often lead to unacceptable outcomes like Silent Data Corruptions (SDCs). These analyses require significant time, which is especially problematic if developers wish to regularly analyze software that evolves over time. We present FastFlip, a combination of empirical error injection and symbolic SDC propagation analy… ▽ More Instruction-level error injection analyses aim to find instructions where errors often lead to unacceptable outcomes like Silent Data Corruptions (SDCs). These analyses require significant time, which is especially problematic if developers wish to regularly analyze software that evolves over time. We present FastFlip, a combination of empirical error injection and symbolic SDC propagation analyses that enables fast, compositional error injection analysis of evolving programs. FastFlip calculates how SDCs propagate across program sections and correctly accounts for unexpected side effects that can occur due to errors. Using FastFlip, we analyze five benchmarks, plus two modified versions of each benchmark. FastFlip speeds up the analysis of incrementally modified programs by $3.2\times$ (geomean). FastFlip selects a set of instructions to protect against SDCs that minimizes the runtime cost of protection while protecting against a developer-specified target fraction of all SDC-causing errors. △ Less

Submitted 26 March, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

arXiv:2403.12321 [pdf, other]

Explainable agency: human preferences for simple or complex explanations

Authors: Michelle Blom, Ronal Singh, Tim Miller, Liz Sonenberg, Kerry Trentelman, Adam Saulwick

Abstract: Research in cognitive psychology has established that whether people prefer simpler explanations to complex ones is context dependent, but the question of `simple vs. complex' becomes critical when an artificial agent seeks to explain its decisions or predictions to humans. We present a model for abstracting causal reasoning chains for the purpose of explanation. This model uses a set of rules to… ▽ More Research in cognitive psychology has established that whether people prefer simpler explanations to complex ones is context dependent, but the question of `simple vs. complex' becomes critical when an artificial agent seeks to explain its decisions or predictions to humans. We present a model for abstracting causal reasoning chains for the purpose of explanation. This model uses a set of rules to progressively abstract different types of causal information in causal proof traces. We perform online studies using 123 Amazon MTurk participants and with five industry experts over two domains: maritime patrol and weather prediction. We found participants' satisfaction with generated explanations was based on the consistency of relationships among the causes (coherence) that explain an event; and that the important question is not whether people prefer simple or complex explanations, but what types of causal information are relevant to individuals in specific contexts. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.09806 [pdf, other]

xLP: Explainable Link Prediction for Master Data Management

Authors: Balaji Ganesan, Matheen Ahmed Pasha, Srinivasa Parkala, Neeraj R Singh, Gayatri Mishra, Sumit Bhatia, Hima Patel, Somashekar Naganna, Sameep Mehta

Abstract: Explaining neural model predictions to users requires creativity. Especially in enterprise applications, where there are costs associated with users' time, and their trust in the model predictions is critical for adoption. For link prediction in master data management, we have built a number of explainability solutions drawing from research in interpretability, fact verification, path ranking, neu… ▽ More Explaining neural model predictions to users requires creativity. Especially in enterprise applications, where there are costs associated with users' time, and their trust in the model predictions is critical for adoption. For link prediction in master data management, we have built a number of explainability solutions drawing from research in interpretability, fact verification, path ranking, neuro-symbolic reasoning and self-explaining AI. In this demo, we present explanations for link prediction in a creative way, to allow users to choose explanations they are more comfortable with. △ Less

Submitted 14 March, 2024; originally announced March 2024.

Comments: 8 pages, 4 figures, NeurIPS 2020 Competition and Demonstration Track. arXiv admin note: text overlap with arXiv:2012.05516

arXiv:2403.07043 [pdf, other]

A Collision Cone Approach for Control Barrier Functions

Authors: Manan Tayal, Bhavya Giri Goswami, Karthik Rajgopal, Rajpal Singh, Tejas Rao, Jishnu Keshavan, Pushpak Jagtap, Shishir Kolathaya

Abstract: This work presents a unified approach for collision avoidance using Collision-Cone Control Barrier Functions (CBFs) in both ground (UGV) and aerial (UAV) unmanned vehicles. We propose a novel CBF formulation inspired by collision cones, to ensure safety by constraining the relative velocity between the vehicle and the obstacle to always point away from each other. The efficacy of this approach is… ▽ More This work presents a unified approach for collision avoidance using Collision-Cone Control Barrier Functions (CBFs) in both ground (UGV) and aerial (UAV) unmanned vehicles. We propose a novel CBF formulation inspired by collision cones, to ensure safety by constraining the relative velocity between the vehicle and the obstacle to always point away from each other. The efficacy of this approach is demonstrated through simulations and hardware implementations on the TurtleBot, Stoch-Jeep, and Crazyflie 2.1 quadrotor robot, showcasing its effectiveness in avoiding collisions with dynamic obstacles in both ground and aerial settings. The real-time controller is developed using CBF Quadratic Programs (CBF-QPs). Comparative analysis with the state-of-the-art CBFs highlights the less conservative nature of the proposed approach. Overall, this research contributes to a novel control formation that can give a guarantee for collision avoidance in unmanned vehicles by modifying the control inputs from existing path-planning controllers. △ Less

Submitted 11 March, 2024; originally announced March 2024.

Comments: 13 pages, 16 pages. arXiv admin note: substantial text overlap with arXiv:2209.11524, arXiv:2303.15871, arXiv:2310.10839

arXiv:2403.04924 [pdf, other]

$\text{R}^2$-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations

Authors: Xiang Li, Kai Qiu, **glu Wang, Xiaohao Xu, Rita Singh, Kashu Yamazak, Hao Chen, Xiaonan Huang, Bhiksha Raj

Abstract: Referring perception, which aims at grounding visual objects with multimodal referring guidance, is essential for bridging the gap between humans, who provide instructions, and the environment where intelligent systems perceive. Despite progress in this field, the robustness of referring perception models (RPMs) against disruptive perturbations is not well explored. This work thoroughly assesses t… ▽ More Referring perception, which aims at grounding visual objects with multimodal referring guidance, is essential for bridging the gap between humans, who provide instructions, and the environment where intelligent systems perceive. Despite progress in this field, the robustness of referring perception models (RPMs) against disruptive perturbations is not well explored. This work thoroughly assesses the resilience of RPMs against various perturbations in both general and specific contexts. Recognizing the complex nature of referring perception tasks, we present a comprehensive taxonomy of perturbations, and then develop a versatile toolbox for synthesizing and evaluating the effects of composite disturbances. Employing this toolbox, we construct $\text{R}^2$-Bench, a benchmark for assessing the Robustness of Referring perception models under noisy conditions across five key tasks. Moreover, we propose the $\text{R}^2$-Agent, an LLM-based agent that simplifies and automates model evaluation via natural language instructions. Our investigation uncovers the vulnerabilities of current RPMs to various perturbations and provides tools for assessing model robustness, potentially promoting the safe and resilient integration of intelligent systems into complex real-world scenarios. △ Less

Submitted 7 March, 2024; originally announced March 2024.

Comments: Code and dataset will be released at https://github.com/lxa9867/r2bench

arXiv:2403.04379 [pdf, other]

Performance evaluation of conditional handover in 5G systems under fading scenario

Authors: Souvik Deb, Megh Rathod, Rishi Balamurugan, Shankar K. Ghosh, Rajeev K. Singh, Samriddha Sanyal

Abstract: To enhance the handover performance in fifth generation (5G) cellular systems, conditional handover (CHO) has been evolved as a promising solution. Unlike A3 based handover where handover execution is certain after receiving handover command from the serving access network, in CHO, handover execution is conditional on the RSRP measurements from both current and target access networks, as well as o… ▽ More To enhance the handover performance in fifth generation (5G) cellular systems, conditional handover (CHO) has been evolved as a promising solution. Unlike A3 based handover where handover execution is certain after receiving handover command from the serving access network, in CHO, handover execution is conditional on the RSRP measurements from both current and target access networks, as well as on mobility parameters such as preparation and execution offsets. Analytic evaluation of conditional handover performance is unprecedented in literature. In this work, handover performance of CHO has been carried out in terms of handover latency, handover packet loss and handover failure probability. A Markov model accounting the effect of different mobility parameters (e.g., execution offset, preparation offset, time-to-preparation and time-to-execution), UE velocity and channel fading characteristics; has been proposed to characterize handover failure. Results obtained from the analytic model has been validated against extensive simulation results. Our study reveal that optimal configuration of $O_{exec}$, $O_{prep}$, $T_{exec}$ and $T_{prep}$ is actually conditional on underlying UE velocity and fading characteristics. This study will be helpful for the mobile operators to choose appropriate thresholds of the mobility parameters under different channel condition and UE velocities. △ Less

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2402.10781 [pdf, other]

Towards 6G Evolution: Three Enhancements, Three Innovations, and Three Major Challenges

Authors: Rohit Singh, Aryan Kaushik, Wonjae Shin, Marco Di Renzo, Vincenzo Sciancalepore, Doohwan Lee, Hirofumi Sasaki, Arman Shojaeifard, Octavia A. Dobre

Abstract: Over the past few decades, wireless communication has witnessed remarkable growth, experiencing several transformative changes. This article aims to provide a comprehensive overview of wireless communication technologies, from the foundations to the recent wireless advances. Specifically, we take a neutral look at the state-of-the-art technologies for 5G and the ongoing evolutions towards 6G, revi… ▽ More Over the past few decades, wireless communication has witnessed remarkable growth, experiencing several transformative changes. This article aims to provide a comprehensive overview of wireless communication technologies, from the foundations to the recent wireless advances. Specifically, we take a neutral look at the state-of-the-art technologies for 5G and the ongoing evolutions towards 6G, reviewing the recommendations of the International Mobile Communication vision for 2030 (IMT-2030). We first highlight specific features of IMT 2030, including three IMT-2020 extensions (URLLC+, eMBB+, and mMTC+) and three new innovations (Ubiquitous connectivity and integrating the new capabilities of sensing & AI with communication functionality). Then, we delve into three major challenges in implementing 6G, along with global standardization efforts. Besides, a proof of concept is provided by demonstrating terahertz (THz) signal transmission using Orbital Angular Momentum (OAM) multiplexing, which is one of the potential candidates for 6G and beyond. To inspire further potential research, we conclude by identifying research opportunities and future visions on IMT-2030 recommendations. △ Less

Submitted 16 February, 2024; originally announced February 2024.

Comments: 8 pages, 4 figures, 1 table

arXiv:2402.10454 [pdf, other]

Optimizing Skin Lesion Classification via Multimodal Data and Auxiliary Task Integration

Authors: Mahapara Khurshid, Mayank Vatsa, Richa Singh

Abstract: The rising global prevalence of skin conditions, some of which can escalate to life-threatening stages if not timely diagnosed and treated, presents a significant healthcare challenge. This issue is particularly acute in remote areas where limited access to healthcare often results in delayed treatment, allowing skin diseases to advance to more critical stages. One of the primary challenges in dia… ▽ More The rising global prevalence of skin conditions, some of which can escalate to life-threatening stages if not timely diagnosed and treated, presents a significant healthcare challenge. This issue is particularly acute in remote areas where limited access to healthcare often results in delayed treatment, allowing skin diseases to advance to more critical stages. One of the primary challenges in diagnosing skin diseases is their low inter-class variations, as many exhibit similar visual characteristics, making accurate classification challenging. This research introduces a novel multimodal method for classifying skin lesions, integrating smartphone-captured images with essential clinical and demographic information. This approach mimics the diagnostic process employed by medical professionals. A distinctive aspect of this method is the integration of an auxiliary task focused on super-resolution image prediction. This component plays a crucial role in refining visual details and enhancing feature extraction, leading to improved differentiation between classes and, consequently, elevating the overall effectiveness of the model. The experimental evaluations have been conducted using the PAD-UFES20 dataset, applying various deep-learning architectures. The results of these experiments not only demonstrate the effectiveness of the proposed method but also its potential applicability under-resourced healthcare environments. △ Less

Submitted 16 February, 2024; originally announced February 2024.

arXiv:2402.09585 [pdf, other]

Domain Adaptation for Contrastive Audio-Language Models

Authors: Soham Deshmukh, Rita Singh, Bhiksha Raj

Abstract: Audio-Language Models (ALM) aim to be general-purpose audio models by providing zero-shot capabilities at test time. The zero-shot performance of ALM improves by using suitable text prompts for each domain. The text prompts are usually hand-crafted through an ad-hoc process and lead to a drop in ALM generalization and out-of-distribution performance. Existing approaches to improve domain performan… ▽ More Audio-Language Models (ALM) aim to be general-purpose audio models by providing zero-shot capabilities at test time. The zero-shot performance of ALM improves by using suitable text prompts for each domain. The text prompts are usually hand-crafted through an ad-hoc process and lead to a drop in ALM generalization and out-of-distribution performance. Existing approaches to improve domain performance, like few-shot learning or fine-tuning, require access to annotated data and iterations of training. Therefore, we propose a test-time domain adaptation method for ALMs that does not require access to annotations. Our method learns a domain vector by enforcing consistency across augmented views of the testing audio. We extensively evaluate our approach on 12 downstream tasks across domains. With just one example, our domain adaptation method leads to 3.2% (max 8.4%) average zero-shot performance improvement. After adaptation, the model still retains the generalization property of ALMs. △ Less

Submitted 14 February, 2024; originally announced February 2024.

arXiv:2402.05398 [pdf, other]

On the Effect of Image Resolution on Semantic Segmentation

Authors: Ritambhara Singh, Abhishek Jain, Pietro Perona, Shivani Agarwal, Junfeng Yang

Abstract: High-resolution semantic segmentation requires substantial computational resources. Traditional approaches in the field typically downscale the input images before processing and then upscale the low-resolution outputs back to their original dimensions. While this strategy effectively identifies broad regions, it often misses finer details. In this study, we demonstrate that a streamlined model ca… ▽ More High-resolution semantic segmentation requires substantial computational resources. Traditional approaches in the field typically downscale the input images before processing and then upscale the low-resolution outputs back to their original dimensions. While this strategy effectively identifies broad regions, it often misses finer details. In this study, we demonstrate that a streamlined model capable of directly producing high-resolution segmentations can match the performance of more complex systems that generate lower-resolution results. By simplifying the network architecture, we enable the processing of images at their native resolution. Our approach leverages a bottom-up information propagation technique across various scales, which we have empirically shown to enhance segmentation accuracy. We have rigorously tested our method using leading-edge semantic segmentation datasets. Specifically, for the Cityscapes dataset, we further boost accuracy by applying the Noisy Student Training technique. △ Less

Submitted 7 February, 2024; originally announced February 2024.

Comments: arXiv admin note: text overlap with arXiv:2209.08667 by other authors

arXiv:2402.01922 [pdf, other]

A General Framework for Learning from Weak Supervision

Authors: Hao Chen, **dong Wang, Lei Feng, Xiang Li, Yidong Wang, Xing Xie, Masashi Sugiyama, Rita Singh, Bhiksha Raj

Abstract: Weakly supervised learning generally faces challenges in applicability to various scenarios with diverse weak supervision and in scalability due to the complexity of existing algorithms, thereby hindering the practical deployment. This paper introduces a general framework for learning from weak supervision (GLWS) with a novel algorithm. Central to GLWS is an Expectation-Maximization (EM) formulati… ▽ More Weakly supervised learning generally faces challenges in applicability to various scenarios with diverse weak supervision and in scalability due to the complexity of existing algorithms, thereby hindering the practical deployment. This paper introduces a general framework for learning from weak supervision (GLWS) with a novel algorithm. Central to GLWS is an Expectation-Maximization (EM) formulation, adeptly accommodating various weak supervision sources, including instance partial labels, aggregate statistics, pairwise observations, and unlabeled data. We further present an advanced algorithm that significantly simplifies the EM computational demands using a Non-deterministic Finite Automaton (NFA) along with a forward-backward algorithm, which effectively reduces time complexity from quadratic or factorial often required in existing solutions to linear scale. The problem of learning from arbitrary weak supervision is therefore converted to the NFA modeling of them. GLWS not only enhances the scalability of machine learning models but also demonstrates superior performance and versatility across 11 weak supervision scenarios. We hope our work paves the way for further advancements and practical deployment in this field. △ Less

Submitted 5 June, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

Comments: 24 pages, 20 tables, 9 figures

arXiv:2402.01292 [pdf, other]

Towards the New XAI: A Hypothesis-Driven Approach to Decision Support Using Evidence

Authors: Thao Le, Tim Miller, Liz Sonenberg, Ronal Singh

Abstract: Prior research on AI-assisted human decision-making has explored several different explainable AI (XAI) approaches. A recent paper has proposed a paradigm shift calling for hypothesis-driven XAI through a conceptual framework called evaluative AI that gives people evidence that supports or refutes hypotheses without necessarily giving a decision-aid recommendation. In this paper, we describe and e… ▽ More Prior research on AI-assisted human decision-making has explored several different explainable AI (XAI) approaches. A recent paper has proposed a paradigm shift calling for hypothesis-driven XAI through a conceptual framework called evaluative AI that gives people evidence that supports or refutes hypotheses without necessarily giving a decision-aid recommendation. In this paper, we describe and evaluate an approach for hypothesis-driven XAI based on the Weight of Evidence (WoE) framework, which generates both positive and negative evidence for a given hypothesis. Through human behavioural experiments, we show that our hypothesis-driven approach increases decision accuracy and reduces reliance compared to a recommendation-driven approach and an AI-explanation-only baseline, but with a small increase in under-reliance compared to the recommendation-driven approach. Further, we show that participants used our hypothesis-driven approach in a materially different way to the two baselines. △ Less

Submitted 27 April, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

Comments: 26 pages

arXiv:2402.00282 [pdf, other]

PAM: Prompting Audio-Language Models for Audio Quality Assessment

Authors: Soham Deshmukh, Dareen Alharthi, Benjamin Elizalde, Hannes Gamper, Mahmoud Al Ismail, Rita Singh, Bhiksha Raj, Huaming Wang

Abstract: While audio quality is a key performance metric for various audio processing tasks, including generative modeling, its objective measurement remains a challenge. Audio-Language Models (ALMs) are pre-trained on audio-text pairs that may contain information about audio quality, the presence of artifacts, or noise. Given an audio input and a text prompt related to quality, an ALM can be used to calcu… ▽ More While audio quality is a key performance metric for various audio processing tasks, including generative modeling, its objective measurement remains a challenge. Audio-Language Models (ALMs) are pre-trained on audio-text pairs that may contain information about audio quality, the presence of artifacts, or noise. Given an audio input and a text prompt related to quality, an ALM can be used to calculate a similarity score between the two. Here, we exploit this capability and introduce PAM, a no-reference metric for assessing audio quality for different audio processing tasks. Contrary to other "reference-free" metrics, PAM does not require computing embeddings on a reference dataset nor training a task-specific model on a costly set of human listening scores. We extensively evaluate the reliability of PAM against established metrics and human listening scores on four tasks: text-to-audio (TTA), text-to-music generation (TTM), text-to-speech (TTS), and deep noise suppression (DNS). We perform multiple ablation studies with controlled distortions, in-the-wild setups, and prompt choices. Our evaluation shows that PAM correlates well with existing metrics and human listening scores. These results demonstrate the potential of ALMs for computing a general-purpose audio quality metric. △ Less

Submitted 31 January, 2024; originally announced February 2024.

arXiv:2401.15589 [pdf, other]

OpineBot: Class Feedback Reimagined Using a Conversational LLM

Authors: Henansh Tanwar, Kunal Shrivastva, Rahul Singh, Dhruv Kumar

Abstract: Conventional class feedback systems often fall short, relying on static, unengaging surveys offering little incentive for student participation. To address this, we present OpineBot, a novel system employing large language models (LLMs) to conduct personalized, conversational class feedback via chatbot interface. We assessed OpineBot's effectiveness in a user study with 20 students from an Indian… ▽ More Conventional class feedback systems often fall short, relying on static, unengaging surveys offering little incentive for student participation. To address this, we present OpineBot, a novel system employing large language models (LLMs) to conduct personalized, conversational class feedback via chatbot interface. We assessed OpineBot's effectiveness in a user study with 20 students from an Indian university's Operating-Systems class, utilizing surveys and interviews to analyze their experiences. Findings revealed a resounding preference for OpineBot compared to conventional methods, highlighting its ability to engage students, produce deeper feedback, offering a dynamic survey experience. This research represents a work in progress, providing early results, marking a significant step towards revolutionizing class feedback through LLM-based technology, promoting student engagement, and leading to richer data for instructors. This ongoing research presents preliminary findings and marks a notable advancement in transforming classroom feedback using LLM-based technology to enhance student engagement and generate comprehensive data for educators. △ Less

Submitted 28 January, 2024; originally announced January 2024.

Comments: Under review

arXiv:2401.12863 [pdf, other]

KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning

Authors: Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, Godawari Sudhakar Rao

Abstract: Large Language Models (LLMs) have demonstrated impressive performance in natural language processing tasks by leveraging chain of thought (CoT) that enables step-by-step thinking. Extending LLMs with multimodal capabilities is the recent interest, but incurs computational cost and requires substantial hardware resources. To address these challenges, we propose KAM-CoT a framework that integrates C… ▽ More Large Language Models (LLMs) have demonstrated impressive performance in natural language processing tasks by leveraging chain of thought (CoT) that enables step-by-step thinking. Extending LLMs with multimodal capabilities is the recent interest, but incurs computational cost and requires substantial hardware resources. To address these challenges, we propose KAM-CoT a framework that integrates CoT reasoning, Knowledge Graphs (KGs), and multiple modalities for a comprehensive understanding of multimodal tasks. KAM-CoT adopts a two-stage training process with KG grounding to generate effective rationales and answers. By incorporating external knowledge from KGs during reasoning, the model gains a deeper contextual understanding reducing hallucinations and enhancing the quality of answers. This knowledge-augmented CoT reasoning empowers the model to handle questions requiring external context, providing more informed answers. Experimental findings show KAM-CoT outperforms the state-of-the-art methods. On the ScienceQA dataset, we achieve an average accuracy of 93.87%, surpassing GPT-3.5 (75.17%) by 18% and GPT-4 (83.99%) by 10%. Remarkably, KAM-CoT achieves these results with only 280M trainable parameters at a time, demonstrating its cost-efficiency and effectiveness. △ Less

Submitted 23 January, 2024; originally announced January 2024.

Comments: AAAI 2024

arXiv:2401.12803 [pdf, other]

Enhancements for 5G NR PRACH Reception: An AI/ML Approach

Authors: Rohit Singh, Anil Kumar Yerrapragada, Jeeva Keshav S, Radha Krishna Ganti

Abstract: Random Access is an important step in enabling the initial attachment of a User Equipment (UE) to a Base Station (gNB). The UE identifies itself by embedding a Preamble Index (RAPID) in the phase rotation of a known base sequence, which it transmits on the Physical Random Access Channel (PRACH). The signal on the PRACH also enables the estimation of propagation delay, often known as Timing Advance… ▽ More Random Access is an important step in enabling the initial attachment of a User Equipment (UE) to a Base Station (gNB). The UE identifies itself by embedding a Preamble Index (RAPID) in the phase rotation of a known base sequence, which it transmits on the Physical Random Access Channel (PRACH). The signal on the PRACH also enables the estimation of propagation delay, often known as Timing Advance (TA), which is induced by virtue of the UE's position. Traditional receivers estimate the RAPID and TA using correlation-based techniques. This paper presents an alternative receiver approach that uses AI/ML models, wherein two neural networks are proposed, one for the RAPID and one for the TA. Different from other works, these two models can run in parallel as opposed to sequentially. Experiments with both simulated data and over-the-air hardware captures highlight the improved performance of the proposed AI/ML-based techniques compared to conventional correlation methods. △ Less

Submitted 12 January, 2024; originally announced January 2024.

arXiv:2401.09620 [pdf, other]

Cost-effective and performant virtual WANs with CORNIFER

Authors: Anjali, Rachee Singh, Michael M. Swift

Abstract: Virtual wide-area networks (WANs) are WAN-as-a-service cloud offerings that aim to bring the performance benefits of dedicated wide-area interconnects to enterprise customers. In this work, we show that the topology of a virtual WAN can render it both performance and cost inefficient. We develop Cornifer, a tool that designs virtual WAN topologies by deciding the number of virtual WAN nodes and th… ▽ More Virtual wide-area networks (WANs) are WAN-as-a-service cloud offerings that aim to bring the performance benefits of dedicated wide-area interconnects to enterprise customers. In this work, we show that the topology of a virtual WAN can render it both performance and cost inefficient. We develop Cornifer, a tool that designs virtual WAN topologies by deciding the number of virtual WAN nodes and their location in the cloud to minimize connection latency at low cost to enterprises. By leveraging millions of latency measurements from vantage points across the world to cloud points of presence, Cornifer designs virtual WAN topologies that improve weighted client latency by 26% and lower cost by 28% compared to the state-of-the-art. Cornifer identifies virtual WAN topologies at the Pareto frontier of the deployment cost vs. connection latency trade-off and proposes a heuristic for automatic selection of Pareto-optimal virtual WAN topologies for enterprises. △ Less

Submitted 17 January, 2024; originally announced January 2024.

Comments: 21 pages

ACM Class: C.2.1; C.4

arXiv:2401.00547

On Learning for Ambiguous Chance Constrained Problems

Authors: A Ch Madhusudanarao, Rahul Singh

Abstract: We study chance constrained optimization problems $\min_x f(x)$ s.t. $P(\left\{ θ: g(x,θ)\le 0 \right\})\ge 1-ε$ where $ε\in (0,1)$ is the violation probability, when the distribution $P$ is not known to the decision maker (DM). When the DM has access to a set of distributions $\mathcal{U}$ such that $P$ is contained in $\mathcal{U}$, then the problem is known as the ambiguous chance-constrained p… ▽ More We study chance constrained optimization problems $\min_x f(x)$ s.t. $P(\left\{ θ: g(x,θ)\le 0 \right\})\ge 1-ε$ where $ε\in (0,1)$ is the violation probability, when the distribution $P$ is not known to the decision maker (DM). When the DM has access to a set of distributions $\mathcal{U}$ such that $P$ is contained in $\mathcal{U}$, then the problem is known as the ambiguous chance-constrained problem \cite{erdougan2006ambiguous}. We study ambiguous chance-constrained problem for the case when $\mathcal{U}$ is of the form $\left\{μ:\frac{μ(y)}{ν(y)}\leq C, \forall y\inΘ, μ(y)\ge 0\right\}$, where $ν$ is a ``reference distribution.'' We show that in this case the original problem can be ``well-approximated'' by a sampled problem in which $N$ i.i.d. samples of $θ$ are drawn from $ν$, and the original constraint is replaced with $g(x,θ_i)\le 0,~i=1,2,\ldots,N$. We also derive the sample complexity associated with this approximation, i.e., for $ε,δ>0$ the number of samples which must be drawn from $ν$ so that with a probability greater than $1-δ$ (over the randomness of $ν$), the solution obtained by solving the sampled program yields an $ε$-feasible solution for the original chance constrained problem. △ Less

Submitted 11 February, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

Comments: We have "not considered the uniform bound" for violation probabilities corresponding to the set of distributions in the ambiguity set

arXiv:2312.11561 [pdf, other]

COPD-FlowNet: Elevating Non-invasive COPD Diagnosis with CFD Simulations

Authors: Aryan Tyagi, Aryaman Rao, Shubhanshu Rao, Raj Kumar Singh

Abstract: Chronic Obstructive Pulmonary Disorder (COPD) is a prevalent respiratory disease that significantly impacts the quality of life of affected individuals. This paper presents COPDFlowNet, a novel deep-learning framework that leverages a custom Generative Adversarial Network (GAN) to generate synthetic Computational Fluid Dynamics (CFD) velocity flow field images specific to the trachea of COPD patie… ▽ More Chronic Obstructive Pulmonary Disorder (COPD) is a prevalent respiratory disease that significantly impacts the quality of life of affected individuals. This paper presents COPDFlowNet, a novel deep-learning framework that leverages a custom Generative Adversarial Network (GAN) to generate synthetic Computational Fluid Dynamics (CFD) velocity flow field images specific to the trachea of COPD patients. These synthetic images serve as a valuable resource for data augmentation and model training. Additionally, COPDFlowNet incorporates a custom Convolutional Neural Network (CNN) architecture to predict the location of the obstruction site. △ Less

Submitted 17 December, 2023; originally announced December 2023.

Comments: 2 pages 2 tables 3 figures

arXiv:2312.10127 [pdf, other]

doi 10.1145/3620678.3624783

How Does It Function? Characterizing Long-term Trends in Production Serverless Workloads

Authors: Artjom Joosen, Ahmed Hassan, Martin Asenov, Rajkarn Singh, Luke Darlow, Jianfeng Wang, Adam Barker

Abstract: This paper releases and analyzes two new Huawei cloud serverless traces. The traces span a period of over 7 months with over 1.4 trillion function invocations combined. The first trace is derived from Huawei's internal workloads and contains detailed per-second statistics for 200 functions running across multiple Huawei cloud data centers. The second trace is a representative workload from Huawei'… ▽ More This paper releases and analyzes two new Huawei cloud serverless traces. The traces span a period of over 7 months with over 1.4 trillion function invocations combined. The first trace is derived from Huawei's internal workloads and contains detailed per-second statistics for 200 functions running across multiple Huawei cloud data centers. The second trace is a representative workload from Huawei's public FaaS platform. This trace contains per-minute arrival rates for over 5000 functions running in a single Huawei data center. We present the internals of a production FaaS platform by characterizing resource consumption, cold-start times, programming languages used, periodicity, per-second versus per-minute burstiness, correlations, and popularity. Our findings show that there is considerable diversity in how serverless functions behave: requests vary by up to 9 orders of magnitude across functions, with some functions executed over 1 billion times per day; scheduling time, execution time and cold-start distributions vary across 2 to 4 orders of magnitude and have very long tails; and function invocation counts demonstrate strong periodicity for many individual functions and on an aggregate level. Our analysis also highlights the need for further research in estimating resource reservations and time-series prediction to account for the huge diversity in how serverless functions behave. Datasets and code available at https://github.com/sir-lab/data-release △ Less

Submitted 15 December, 2023; originally announced December 2023.

ACM Class: D.4.7; I.5.1; C.4

Journal ref: SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing, October 2023, Pages 443-458

arXiv:2312.06699 [pdf, other]

Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning

Authors: Zaber Ibn Abdul Hakim, Najibul Haque Sarker, Rahul Pratap Singh, Bishmoy Paul, Ali Dabouei, Min Xu

Abstract: A thorough comprehension of textual data is a fundamental element in multi-modal video analysis tasks. However, recent works have shown that the current models do not achieve a comprehensive understanding of the textual data during the training for the target downstream tasks. Orthogonal to the previous approaches to this limitation, we postulate that understanding the significance of the sentence… ▽ More A thorough comprehension of textual data is a fundamental element in multi-modal video analysis tasks. However, recent works have shown that the current models do not achieve a comprehensive understanding of the textual data during the training for the target downstream tasks. Orthogonal to the previous approaches to this limitation, we postulate that understanding the significance of the sentence components according to the target task can potentially enhance the performance of the models. Hence, we utilize the knowledge of a pre-trained large language model (LLM) to generate text samples from the original ones, targeting specific sentence components. We propose a weakly supervised importance estimation module to compute the relative importance of the components and utilize them to improve different video-language tasks. Through rigorous quantitative analysis, our proposed method exhibits significant improvement across several video-language tasks. In particular, our approach notably enhances video-text retrieval by a relative improvement of 8.3\% in video-to-text and 1.4\% in text-to-video retrieval over the baselines, in terms of R@1. Additionally, in video moment retrieval, average mAP shows a relative improvement ranging from 2.0\% to 13.7 \% across different baselines. △ Less

Submitted 9 December, 2023; originally announced December 2023.

arXiv:2312.04231 [pdf, other]

Adventures of Trustworthy Vision-Language Models: A Survey

Authors: Mayank Vatsa, Anubhooti Jain, Richa Singh

Abstract: Recently, transformers have become incredibly popular in computer vision and vision-language tasks. This notable rise in their usage can be primarily attributed to the capabilities offered by attention mechanisms and the outstanding ability of transformers to adapt and apply themselves to a variety of tasks and domains. Their versatility and state-of-the-art performance have established them as in… ▽ More Recently, transformers have become incredibly popular in computer vision and vision-language tasks. This notable rise in their usage can be primarily attributed to the capabilities offered by attention mechanisms and the outstanding ability of transformers to adapt and apply themselves to a variety of tasks and domains. Their versatility and state-of-the-art performance have established them as indispensable tools for a wide array of applications. However, in the constantly changing landscape of machine learning, the assurance of the trustworthiness of transformers holds utmost importance. This paper conducts a thorough examination of vision-language transformers, employing three fundamental principles of responsible AI: Bias, Robustness, and Interpretability. The primary objective of this paper is to delve into the intricacies and complexities associated with the practical use of transformers, with the overarching goal of advancing our comprehension of how to enhance their reliability and accountability. △ Less

Submitted 7 December, 2023; originally announced December 2023.

Comments: Accepted in AAAI 2024

arXiv:2311.08723 [pdf, other]

doi 10.18653/v1/2023.emnlp-main.810

Token Prediction as Implicit Classification to Identify LLM-Generated Text

Authors: Yutian Chen, Hao Kang, Vivian Zhai, Liangze Li, Rita Singh, Bhiksha Raj

Abstract: This paper introduces a novel approach for identifying the possible large language models (LLMs) involved in text generation. Instead of adding an additional classification layer to a base LM, we reframe the classification task as a next-token prediction task and directly fine-tune the base LM to perform it. We utilize the Text-to-Text Transfer Transformer (T5) model as the backbone for our experi… ▽ More This paper introduces a novel approach for identifying the possible large language models (LLMs) involved in text generation. Instead of adding an additional classification layer to a base LM, we reframe the classification task as a next-token prediction task and directly fine-tune the base LM to perform it. We utilize the Text-to-Text Transfer Transformer (T5) model as the backbone for our experiments. We compared our approach to the more direct approach of utilizing hidden states for classification. Evaluation shows the exceptional performance of our method in the text classification task, highlighting its simplicity and efficiency. Furthermore, interpretability studies on the features extracted by our model reveal its ability to differentiate distinctive writing styles among various LLMs even in the absence of an explicit classifier. We also collected a dataset named OpenLLMText, containing approximately 340k text samples from human and LLMs, including GPT3.5, PaLM, LLaMA, and GPT2. △ Less

Submitted 15 November, 2023; originally announced November 2023.

Comments: EMNLP 2023, Main Conference

arXiv:2310.15848 [pdf, other]

On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms

Authors: Surbhi Mittal, Kartik Thakral, Richa Singh, Mayank Vatsa, Tamar Glaser, Cristian Canton Ferrer, Tal Hassner

Abstract: Artificial Intelligence (AI) has made its way into various scientific fields, providing astonishing improvements over existing algorithms for a wide variety of tasks. In recent years, there have been severe concerns over the trustworthiness of AI technologies. The scientific community has focused on the development of trustworthy AI algorithms. However, machine and deep learning algorithms, popula… ▽ More Artificial Intelligence (AI) has made its way into various scientific fields, providing astonishing improvements over existing algorithms for a wide variety of tasks. In recent years, there have been severe concerns over the trustworthiness of AI technologies. The scientific community has focused on the development of trustworthy AI algorithms. However, machine and deep learning algorithms, popular in the AI community today, depend heavily on the data used during their development. These learning algorithms identify patterns in the data, learning the behavioral objective. Any flaws in the data have the potential to translate directly into algorithms. In this study, we discuss the importance of Responsible Machine Learning Datasets and propose a framework to evaluate the datasets through a responsible rubric. While existing work focuses on the post-hoc evaluation of algorithms for their trustworthiness, we provide a framework that considers the data component separately to understand its role in the algorithm. We discuss responsible datasets through the lens of fairness, privacy, and regulatory compliance and provide recommendations for constructing future datasets. After surveying over 100 datasets, we use 60 datasets for analysis and demonstrate that none of these datasets is immune to issues of fairness, privacy preservation, and regulatory compliance. We provide modifications to the ``datasheets for datasets" with important additions for improved dataset documentation. With governments around the world regularizing data protection laws, the method for the creation of datasets in the scientific community requires revision. We believe this study is timely and relevant in today's era of AI. △ Less

Submitted 24 November, 2023; v1 submitted 24 October, 2023; originally announced October 2023.

Comments: corrected typos

arXiv:2310.09449 [pdf, other]

Pairwise Similarity Learning is SimPLE

Authors: Yandong Wen, Weiyang Liu, Yao Feng, Bhiksha Raj, Rita Singh, Adrian Weller, Michael J. Black, Bernhard Schölkopf

Abstract: In this paper, we focus on a general yet important learning problem, pairwise similarity learning (PSL). PSL subsumes a wide range of important applications, such as open-set face recognition, speaker verification, image retrieval and person re-identification. The goal of PSL is to learn a pairwise similarity function assigning a higher similarity score to positive pairs (i.e., a pair of samples w… ▽ More In this paper, we focus on a general yet important learning problem, pairwise similarity learning (PSL). PSL subsumes a wide range of important applications, such as open-set face recognition, speaker verification, image retrieval and person re-identification. The goal of PSL is to learn a pairwise similarity function assigning a higher similarity score to positive pairs (i.e., a pair of samples with the same label) than to negative pairs (i.e., a pair of samples with different label). We start by identifying a key desideratum for PSL, and then discuss how existing methods can achieve this desideratum. We then propose a surprisingly simple proxy-free method, called SimPLE, which requires neither feature/proxy normalization nor angular margin and yet is able to generalize well in open-set recognition. We apply the proposed method to three challenging PSL tasks: open-set face recognition, image retrieval and speaker verification. Comprehensive experimental results on large-scale benchmarks show that our method performs significantly better than current state-of-the-art methods. △ Less

Submitted 13 October, 2023; originally announced October 2023.

Comments: Published in ICCV 2023 (Project page: https://simple.is.tue.mpg.de/)

arXiv:2310.07209 [pdf, other]

Multi-task Explainable Skin Lesion Classification

Authors: Mahapara Khurshid, Mayank Vatsa, Richa Singh

Abstract: Skin cancer is one of the deadliest diseases and has a high mortality rate if left untreated. The diagnosis generally starts with visual screening and is followed by a biopsy or histopathological examination. Early detection can aid in lowering mortality rates. Visual screening can be limited by the experience of the doctor. Due to the long tail distribution of dermatological datasets and signific… ▽ More Skin cancer is one of the deadliest diseases and has a high mortality rate if left untreated. The diagnosis generally starts with visual screening and is followed by a biopsy or histopathological examination. Early detection can aid in lowering mortality rates. Visual screening can be limited by the experience of the doctor. Due to the long tail distribution of dermatological datasets and significant intra-variability between classes, automatic classification utilizing computer-aided methods becomes challenging. In this work, we propose a multitask few-shot-based approach for skin lesions that generalizes well with few labelled data to address the small sample space challenge. The proposed approach comprises a fusion of a segmentation network that acts as an attention module and classification network. The output of the segmentation network helps to focus on the most discriminatory features while making a decision by the classification network. To further enhance the classification performance, we have combined segmentation and classification loss in a weighted manner. We have also included the visualization results that explain the decisions made by the algorithm. Three dermatological datasets are used to evaluate the proposed method thoroughly. We also conducted cross-database experiments to ensure that the proposed approach is generalizable across similar datasets. Experimental results demonstrate the efficacy of the proposed work. △ Less

Submitted 11 October, 2023; originally announced October 2023.

arXiv:2310.04445 [pdf, other]

LoFT: Local Proxy Fine-tuning For Improving Transferability Of Adversarial Attacks Against Large Language Model

Authors: Muhammad Ahmed Shah, Roshan Sharma, Hira Dhamyal, Raphael Olivier, Ankit Shah, Joseph Konan, Dareen Alharthi, Hazim T Bukhari, Massa Baali, Soham Deshmukh, Michael Kuhlmann, Bhiksha Raj, Rita Singh

Abstract: It has been shown that Large Language Model (LLM) alignments can be circumvented by appending specially crafted attack suffixes with harmful queries to elicit harmful responses. To conduct attacks against private target models whose characterization is unknown, public models can be used as proxies to fashion the attack, with successful attacks being transferred from public proxies to private targe… ▽ More It has been shown that Large Language Model (LLM) alignments can be circumvented by appending specially crafted attack suffixes with harmful queries to elicit harmful responses. To conduct attacks against private target models whose characterization is unknown, public models can be used as proxies to fashion the attack, with successful attacks being transferred from public proxies to private target models. The success rate of attack depends on how closely the proxy model approximates the private model. We hypothesize that for attacks to be transferrable, it is sufficient if the proxy can approximate the target model in the neighborhood of the harmful query. Therefore, in this paper, we propose \emph{Local Fine-Tuning (LoFT)}, \textit{i.e.}, fine-tuning proxy models on similar queries that lie in the lexico-semantic neighborhood of harmful queries to decrease the divergence between the proxy and target models. First, we demonstrate three approaches to prompt private target models to obtain similar queries given harmful queries. Next, we obtain data for local fine-tuning by eliciting responses from target models for the generated similar queries. Then, we optimize attack suffixes to generate attack prompts and evaluate the impact of our local fine-tuning on the attack's success rate. Experiments show that local fine-tuning of proxy models improves attack transferability and increases attack success rate by $39\%$, $7\%$, and $0.5\%$ (absolute) on target models ChatGPT, GPT-4, and Claude respectively. △ Less

Submitted 21 October, 2023; v1 submitted 2 October, 2023; originally announced October 2023.

arXiv:2310.02298 [pdf, other]

Prompting Audios Using Acoustic Properties For Emotion Representation

Authors: Hira Dhamyal, Benjamin Elizalde, Soham Deshmukh, Huaming Wang, Bhiksha Raj, Rita Singh

Abstract: Emotions lie on a continuum, but current models treat emotions as a finite valued discrete variable. This representation does not capture the diversity in the expression of emotion. To better represent emotions we propose the use of natural language descriptions (or prompts). In this work, we address the challenge of automatically generating these prompts and training a model to better learn emoti… ▽ More Emotions lie on a continuum, but current models treat emotions as a finite valued discrete variable. This representation does not capture the diversity in the expression of emotion. To better represent emotions we propose the use of natural language descriptions (or prompts). In this work, we address the challenge of automatically generating these prompts and training a model to better learn emotion representations from audio and prompt pairs. We use acoustic properties that are correlated to emotion like pitch, intensity, speech rate, and articulation rate to automatically generate prompts i.e. 'acoustic prompts'. We use a contrastive learning objective to map speech to their respective acoustic prompts. We evaluate our model on Emotion Audio Retrieval and Speech Emotion Recognition. Our results show that the acoustic prompts significantly improve the model's performance in EAR, in various Precision@K metrics. In SER, we observe a 3.8% relative accuracy improvement on the Ravdess dataset. △ Less

Submitted 6 December, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2211.07737

arXiv:2310.00808 [pdf, other]

Completing Visual Objects via Bridging Generation and Segmentation

Authors: Xiang Li, Yinpeng Chen, Chung-Ching Lin, Hao Chen, Kai Hu, Rita Singh, Bhiksha Raj, Lijuan Wang, Zicheng Liu

Abstract: This paper presents a novel approach to object completion, with the primary goal of reconstructing a complete object from its partially visible components. Our method, named MaskComp, delineates the completion process through iterative stages of generation and segmentation. In each iteration, the object mask is provided as an additional condition to boost image generation, and, in return, the gene… ▽ More This paper presents a novel approach to object completion, with the primary goal of reconstructing a complete object from its partially visible components. Our method, named MaskComp, delineates the completion process through iterative stages of generation and segmentation. In each iteration, the object mask is provided as an additional condition to boost image generation, and, in return, the generated images can lead to a more accurate mask by fusing the segmentation of images. We demonstrate that the combination of one generation and one segmentation stage effectively functions as a mask denoiser. Through alternation between the generation and segmentation stages, the partial object mask is progressively refined, providing precise shape guidance and yielding superior object completion results. Our experiments demonstrate the superiority of MaskComp over existing approaches, e.g., ControlNet and Stable Diffusion, establishing it as an effective solution for object completion. △ Less

Submitted 2 February, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

arXiv:2310.00706 [pdf, other]

Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech

Authors: Dareen Alharthi, Roshan Sharma, Hira Dhamyal, Soumi Maiti, Bhiksha Raj, Rita Singh

Abstract: Modern speech synthesis systems have improved significantly, with synthetic speech being indistinguishable from real speech. However, efficient and holistic evaluation of synthetic speech still remains a significant challenge. Human evaluation using Mean Opinion Score (MOS) is ideal, but inefficient due to high costs. Therefore, researchers have developed auxiliary automatic metrics like Word Erro… ▽ More Modern speech synthesis systems have improved significantly, with synthetic speech being indistinguishable from real speech. However, efficient and holistic evaluation of synthetic speech still remains a significant challenge. Human evaluation using Mean Opinion Score (MOS) is ideal, but inefficient due to high costs. Therefore, researchers have developed auxiliary automatic metrics like Word Error Rate (WER) to measure intelligibility. Prior works focus on evaluating synthetic speech based on pre-trained speech recognition models, however, this can be limiting since this approach primarily measures speech intelligibility. In this paper, we propose an evaluation technique involving the training of an ASR model on synthetic speech and assessing its performance on real speech. Our main assumption is that by training the ASR model on the synthetic speech, the WER on real speech reflects the similarity between distributions, a broader assessment of synthetic speech quality beyond intelligibility. Our proposed metric demonstrates a strong correlation with both MOS naturalness and MOS intelligibility when compared to SpeechLMScore and MOSNet on three recent Text-to-Speech (TTS) systems: MQTTS, StyleTTS, and YourTTS. △ Less

Submitted 1 October, 2023; originally announced October 2023.

arXiv:2310.00132 [pdf, other]

QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition

Authors: Xiang Li, **glu Wang, Xiaohao Xu, Xiulian Peng, Rita Singh, Yan Lu, Bhiksha Raj

Abstract: Audiovisual segmentation (AVS) is a challenging task that aims to segment visual objects in videos according to their associated acoustic cues. With multiple sound sources and background disturbances involved, establishing robust correspondences between audio and visual contents poses unique challenges due to (1) complex entanglement across sound sources and (2) frequent changes in the occurrence… ▽ More Audiovisual segmentation (AVS) is a challenging task that aims to segment visual objects in videos according to their associated acoustic cues. With multiple sound sources and background disturbances involved, establishing robust correspondences between audio and visual contents poses unique challenges due to (1) complex entanglement across sound sources and (2) frequent changes in the occurrence of distinct sound events. Assuming sound events occur independently, the multi-source semantic space can be represented as the Cartesian product of single-source sub-spaces. We are motivated to decompose the multi-source audio semantics into single-source semantics for more effective interactions with visual content. We propose a semantic decomposition method based on product quantization, where the multi-source semantics can be decomposed and represented by several disentangled and noise-suppressed single-source semantics. Furthermore, we introduce a global-to-local quantization mechanism, which distills knowledge from stable global (clip-level) features into local (frame-level) ones, to handle frequent changes in audio semantics. Extensive experiments demonstrate that our semantically decomposed audio representation significantly improves AVS performance, e.g., +21.2% mIoU on the challenging AVS-Semantic benchmark with ResNet50 backbone. https://github.com/lxa9867/QSD. △ Less

Submitted 19 April, 2024; v1 submitted 29 September, 2023; originally announced October 2023.

arXiv:2309.13544 [pdf]

Related Rhythms: Recommendation System To Discover Music You May Like

Authors: Rahul Singh, Pranav Kanuparthi

Abstract: Machine Learning models are being utilized extensively to drive recommender systems, which is a widely explored topic today. This is especially true of the music industry, where we are witnessing a surge in growth. Besides a large chunk of active users, these systems are fueled by massive amounts of data. These large-scale systems yield applications that aim to provide a better user experience and… ▽ More Machine Learning models are being utilized extensively to drive recommender systems, which is a widely explored topic today. This is especially true of the music industry, where we are witnessing a surge in growth. Besides a large chunk of active users, these systems are fueled by massive amounts of data. These large-scale systems yield applications that aim to provide a better user experience and to keep customers actively engaged. In this paper, a distributed Machine Learning (ML) pipeline is delineated, which is capable of taking a subset of songs as input and producing a new subset of songs identified as being similar to the inputted subset. The publicly accessible Million Songs Dataset (MSD) enables researchers to develop and explore reasonably efficient systems for audio track analysis and recommendations, without having to access a commercialized music platform. The objective of the proposed application is to leverage an ML system trained to optimally recommend songs that a user might like. △ Less

Submitted 24 September, 2023; originally announced September 2023.

ACM Class: I.2.6; H.3.3

arXiv:2309.13542 [pdf, other]

Integrated Sensing and Communications for IoT: Synergies with Key 6G Technology Enablers

Authors: Aryan Kaushik, Rohit Singh, Ming Li, Honghao Luo, Shalanika Dayarathna, Rajitha Senanayake, Xueli An, Richard A. Stirling-Gallacher, Wonjae Shin, Marco Di Renzo

Abstract: The Internet of Things (IoT) and wireless generations have been evolving simultaneously for the past few decades. Built upon wireless communication and sensing technologies, IoT networks are usually evaluated based on metrics that measure the device ability to sense information and effectively share it with the network, which makes Integrated Sensing and Communication (ISAC) a pivotal candidate fo… ▽ More The Internet of Things (IoT) and wireless generations have been evolving simultaneously for the past few decades. Built upon wireless communication and sensing technologies, IoT networks are usually evaluated based on metrics that measure the device ability to sense information and effectively share it with the network, which makes Integrated Sensing and Communication (ISAC) a pivotal candidate for the sixth-generation (6G) IoT standards. This paper reveals several innovative aspects of ISAC from an IoT perspective in 6G, empowering various modern IoT use cases and key technology enablers. Moreover, we address the challenges and future potential of ISAC-enabled IoT, including synergies with Reconfigurable Intelligent Surfaces (RIS), Artificial Intelligence (AI), and key updates of ISAC-IoT in 6G standardization. Furthermore, several evolutionary concepts are introduced to open future research in 6G ISAC-IoT, including the interplay with Non-Terrestrial Networks (NTN) and Orthogonal Time-Frequency Space (OTFS) modulation. △ Less

Submitted 23 September, 2023; originally announced September 2023.

Comments: 7 pages, 6 figures

arXiv:2309.13227 [pdf, other]

Importance of negative sampling in weak label learning

Authors: Ankit Shah, Fuyu Tang, Zelin Ye, Rita Singh, Bhiksha Raj

Abstract: Weak-label learning is a challenging task that requires learning from data "bags" containing positive and negative instances, but only the bag labels are known. The pool of negative instances is usually larger than positive instances, thus making selecting the most informative negative instance critical for performance. Such a selection strategy for negative instances from each bag is an open prob… ▽ More Weak-label learning is a challenging task that requires learning from data "bags" containing positive and negative instances, but only the bag labels are known. The pool of negative instances is usually larger than positive instances, thus making selecting the most informative negative instance critical for performance. Such a selection strategy for negative instances from each bag is an open problem that has not been well studied for weak-label learning. In this paper, we study several sampling strategies that can measure the usefulness of negative instances for weak-label learning and select them accordingly. We test our method on CIFAR-10 and AudioSet datasets and show that it improves the weak-label classification performance and reduces the computational cost compared to random sampling methods. Our work reveals that negative instances are not all equally irrelevant, and selecting them wisely can benefit weak-label learning. △ Less

Submitted 22 September, 2023; originally announced September 2023.

arXiv:2309.07372 [pdf, other]

Training Audio Captioning Models without Audio

Authors: Soham Deshmukh, Benjamin Elizalde, Dimitra Emmanouilidou, Bhiksha Raj, Rita Singh, Huaming Wang

Abstract: Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an a… ▽ More Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an approach to train AAC systems using only text. Our approach leverages the multimodal space of contrastively trained audio-text models, such as CLAP. During training, a decoder generates captions conditioned on the pretrained CLAP text encoder. During inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap between text and audio embeddings, we propose the use of noise injection or a learnable adapter, during training. We find that the proposed text-only framework performs competitively with state-of-the-art models trained with paired audio, showing that efficient text-to-audio transfer is possible. Finally, we showcase both stylized audio captioning and caption enrichment while training without audio or human-created text captions. △ Less

Submitted 13 September, 2023; originally announced September 2023.

arXiv:2308.14190 [pdf, other]

doi 10.59275/j.melba.2024-5d51

Score-Based Generative Models for PET Image Reconstruction

Authors: Imraj RD Singh, Alexander Denker, Riccardo Barbano, Željko Kereta, Bangti **, Kris Thielemans, Peter Maass, Simon Arridge

Abstract: Score-based generative models have demonstrated highly promising results for medical image reconstruction tasks in magnetic resonance imaging or computed tomography. However, their application to Positron Emission Tomography (PET) is still largely unexplored. PET image reconstruction involves a variety of challenges, including Poisson noise with high variance and a wide dynamic range. To address t… ▽ More Score-based generative models have demonstrated highly promising results for medical image reconstruction tasks in magnetic resonance imaging or computed tomography. However, their application to Positron Emission Tomography (PET) is still largely unexplored. PET image reconstruction involves a variety of challenges, including Poisson noise with high variance and a wide dynamic range. To address these challenges, we propose several PET-specific adaptations of score-based generative models. The proposed framework is developed for both 2D and 3D PET. In addition, we provide an extension to guided reconstruction using magnetic resonance images. We validate the approach through extensive 2D and 3D $\textit{in-silico}$ experiments with a model trained on patient-realistic data without lesions, and evaluate on data without lesions as well as out-of-distribution data with lesions. This demonstrates the proposed method's robustness and significant potential for improved PET reconstruction. △ Less

Submitted 23 January, 2024; v1 submitted 27 August, 2023; originally announced August 2023.

Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2024:001

MSC Class: 15A29; 45Q05 ACM Class: I.4.9; J.2; I.2.1

Journal ref: Machine.Learning.for.Biomedical.Imaging. 2 (2024)

arXiv:2308.08979 [pdf]

doi 10.1007/s44206-023-00084-w

Modeling Digital Penetration of the Industrialized Society and its Ensuing Transfiguration

Authors: Johannes Vrana, Ripudaman Singh

Abstract: The Fourth Industrial Revolution, ushered by the deeper integration of digital technologies into professional and social spaces, provides an opportunity to meaningfully serve society. Humans have tremendous capability to innovatively improve social well-being when the situation is clear. Which was not the case during the first three revolutions. Thus, society has been accepting lifestyle changes w… ▽ More The Fourth Industrial Revolution, ushered by the deeper integration of digital technologies into professional and social spaces, provides an opportunity to meaningfully serve society. Humans have tremendous capability to innovatively improve social well-being when the situation is clear. Which was not the case during the first three revolutions. Thus, society has been accepting lifestyle changes willingly and several negative consequences unwillingly. Since the fourth one is still in its infancy, we can control it better. This paper presents a unified model of the industrialized ecosystem covering value creation, value consumption, enabling infrastructure, required skills, and additional governance. This design thinking viewpoint, which includes the consumer side of digital transformation, sets the stage for the next major lifestyle change, termed Digital Transfiguration. For validation and ease of comprehension, the model draws upon the well-understood automobile industry. This model unifies the digital penetration of both industrial creation and social consumption, in a manner that aligns several stakeholders on their transformation journey. △ Less

Submitted 21 August, 2023; v1 submitted 17 August, 2023; originally announced August 2023.

Comments: 22 pages, 6 figures, 2 tables

Journal ref: DISO 2, 54 (2023)

Showing 1–50 of 468 results for author: Singh, R