Multi-objective Binary Differential Approach with Parameter Tuning for Discovering Business Process Models: MoD-ProM

Authors: Sonia Deshmukh, Shikha Gupta, Naveen Kumar

Abstract: Process discovery approaches analyze the business data to automatically uncover structured information, known as a process model. The quality of a process model is measured using quality dimensions -- completeness (replay fitness), preciseness, simplicity, and generalization. Traditional process discovery algorithms usually output a single process model. A single model may not accurately capture the observed behavior and may overfit the training data. We have formulated the process discovery problem in a multi-objective framework that yields several candidate solutions for the end user who can pick a suitable model based on the local environmental constraints (possibly varying). We consider the Binary Differential Evolution approach in a multi-objective framework for the task of process discovery. The proposed method employs dichotomous crossover/mutation operators. The parameters are tuned using Grey relational analysis combined with the Taguchi approach. We have compared the proposed approach with the well-known single-objective algorithms and state-of-the-art multi-objective evolutionary algorithm -- Non-dominated Sorting Genetic Algorithm (NSGA-II). Additional comparison via computing a weighted average of the quality dimensions is also undertaken. Results show that the proposed algorithm is computationally efficient and produces diversified candidate solutions that score high on the fitness functions. It is shown that the process models generated by the proposed approach are superior to or at least as good as those generated by the state-of-the-art algorithms.

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.05398 [pdf, other]

Evaluation of Posits for Spectral Analysis Using a Software-Defined Dataflow Architecture

Authors: Sameer Deshmukh, Daniel Khankin, William Killian, John Gustafson, Elad Raz

Abstract: Spectral analysis plays an important role in detection of damage in structures and deep learning. The choice of a floating-point format plays a crucial role in determining the accuracy and performance of spectral analysis. The IEEE Std 754™ floating-point format (IEEE 754 for short) is supported by most major hardware vendors for "normal" floats. However, it has several limitations. Previous work has attempted to evaluate posit format with respect to accuracy and performance. The accuracy of the posit has been established over IEEE 754 for a variety of applications. For example, our analysis of the Fast Fourier Transform shows 2x better accuracy when using a 32-bit posit vs. a 32-bit IEEE 754 format. For spectral analysis, 32-bit posits are substantially more accurate than 32-bit IEEE 754 floats. Although posit has shown better accuracy than IEEE 754, a fair evaluation of posit with IEEE 754 format using a real hardware implementation has been lacking so far. A software simulation of posit format on an x86 CPU is about $\mathbf{69.3\times}$ slower than native IEEE 754 hardware for normal floats for a Fast Fourier Transform (FFT) of $\mathbf{2^{28}}$ points. We propose the use of a software-defined dataflow architecture to evaluate performance and accuracy of posits in spectral analysis. Our dataflow architecture uses reconfigurable logical elements that express algorithms using only integer operations. Our architecture does not have an FPU, and we express both IEEE 754 and posit arithmetic using the same integer operations within the hardware. On our dataflow architecture, the posit format is only $\mathbf{1.8\times}$ slower than IEEE 754 for a Fast Fourier Transform (FFT) of $\mathbf{2^{28}\approx 268}$ million points. With this implementation, we empirically propose a new lower bound for the performance of posit compared to IEEE 754 format.

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2402.09585 [pdf, other]

Domain Adaptation for Contrastive Audio-Language Models

Authors: Soham Deshmukh, Rita Singh, Bhiksha Raj

Abstract: Audio-Language Models (ALM) aim to be general-purpose audio models by providing zero-shot capabilities at test time. The zero-shot performance of ALM improves by using suitable text prompts for each domain. The text prompts are usually hand-crafted through an ad-hoc process and lead to a drop in ALM generalization and out-of-distribution performance. Existing approaches to improve domain performance, like few-shot learning or fine-tuning, require access to annotated data and iterations of training. Therefore, we propose a test-time domain adaptation method for ALMs that does not require access to annotations. Our method learns a domain vector by enforcing consistency across augmented views of the testing audio. We extensively evaluate our approach on 12 downstream tasks across domains. With just one example, our domain adaptation method leads to 3.2% (max 8.4%) average zero-shot performance improvement. After adaptation, the model still retains the generalization property of ALMs.

Submitted 14 February, 2024; originally announced February 2024.

arXiv:2402.00282 [pdf, other]

PAM: Prompting Audio-Language Models for Audio Quality Assessment

Authors: Soham Deshmukh, Dareen Alharthi, Benjamin Elizalde, Hannes Gamper, Mahmoud Al Ismail, Rita Singh, Bhiksha Raj, Huaming Wang

Abstract: While audio quality is a key performance metric for various audio processing tasks, including generative modeling, its objective measurement remains a challenge. Audio-Language Models (ALMs) are pre-trained on audio-text pairs that may contain information about audio quality, the presence of artifacts, or noise. Given an audio input and a text prompt related to quality, an ALM can be used to calculate a similarity score between the two. Here, we exploit this capability and introduce PAM, a no-reference metric for assessing audio quality for different audio processing tasks. Contrary to other "reference-free" metrics, PAM does not require computing embeddings on a reference dataset nor training a task-specific model on a costly set of human listening scores. We extensively evaluate the reliability of PAM against established metrics and human listening scores on four tasks: text-to-audio (TTA), text-to-music generation (TTM), text-to-speech (TTS), and deep noise suppression (DNS). We perform multiple ablation studies with controlled distortions, in-the-wild setups, and prompt choices. Our evaluation shows that PAM correlates well with existing metrics and human listening scores. These results demonstrate the potential of ALMs for computing a general-purpose audio quality metric.

Submitted 31 January, 2024; originally announced February 2024.

arXiv:2401.08264 [pdf, ps, other]

Towards a Transpiler for C/C++ to Safer Rust

Authors: Dhiren Tripuramallu, Swapnil Singh, Shrirang Deshmukh, Srinivas Pinisetty, Shinde Arjun Shivaji, Raja Balusamy, Ajaganna Bandeppa

Abstract: Rust is a multi-paradigm programming language developed by Mozilla that focuses on performance and safety. Rust code is arguably known best for its speed and memory safety, a property essential while developing embedded systems. Thus, it becomes one of the alternatives when developing operating systems for embedded devices. How to convert an existing C++ code base to Rust is also gaining greater attention. In this work, we focus on the process of transpiling C++ code to a Rust codebase in a robust and safe manner. The manual transpilation process is carried out to understand the different constructs of the Rust language and how they correspond to C++ constructs. Based on the learning from the manual transpilation, a transpilation table is created to aid in future transpilation efforts and to develop an automated transpiler. We also studied the existing automated transpilers and identified the problems and inefficiencies they involved. The results of the transpilation process were closely monitored and evaluated, showing improved memory safety without compromising performance and reliability of the resulting codebase. The study concludes with a comprehensive analysis of the findings, an evaluation of the implications for future research, and recommendations for the same in this area.

Submitted 16 January, 2024; originally announced January 2024.

arXiv:2311.07602 [pdf, other]

Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors

Authors: Sameer Deshmukh, Rio Yokota, George Bosilca

Abstract: Factorization and multiplication of dense matrices and tensors are critical, yet extremely expensive pieces of the scientific toolbox. Careful use of low rank approximation can drastically reduce the computation and memory requirements of these operations. In addition to a lower arithmetic complexity, such methods can, by their structure, be designed to efficiently exploit modern hardware architectures. The majority of existing work relies on batched BLAS libraries to handle the computation of many small dense matrices. We show that through careful analysis of the cache utilization, register accumulation using SIMD registers and a redesign of the implementation, one can achieve significantly higher throughput for these types of batched low-rank matrices across a large range of block and batch sizes. We test our algorithm on 3 CPUs using diverse ISAs -- the Fujitsu A64FX using ARM SVE, the Intel Xeon 6148 using AVX-512 and AMD EPYC 7502 using AVX-2, and show that our new batching methodology is able to obtain more than twice the throughput of vendor optimized libraries for all CPU architectures and problem sizes.

Submitted 10 November, 2023; originally announced November 2023.

arXiv:2311.00921 [pdf, other]

$O(N)$ distributed direct factorization of structured dense matrices using runtime systems

Authors: Sameer Deshmukh, Qinxiang Ma, Rio Yokota, George Bosilca

Abstract: Structured dense matrices result from boundary integral problems in electrostatics and geostatistics, and also Schur complements in sparse preconditioners such as multi-frontal methods. Exploiting the structure of such matrices can reduce the time for dense direct factorization from $O(N^3)$ to $O(N)$. The Hierarchically Semi-Separable (HSS) matrix is one such low rank matrix format that can be factorized using a Cholesky-like algorithm called ULV factorization. The HSS-ULV algorithm is highly parallel because it removes the dependency on trailing sub-matrices at each HSS level. However, a key merge step that links two successive HSS levels remains a challenge for efficient parallelization. In this paper, we use an asynchronous runtime system PaRSEC with the HSS-ULV algorithm. We compare our work with STRUMPACK and LORAPO, both state-of-the-art implementations of dense direct low rank factorization, and achieve up to 2x better factorization time for matrices arising from a diverse set of applications on up to 128 nodes of Fugaku for similar or better accuracy for all the problems that we survey.

Submitted 1 November, 2023; originally announced November 2023.

arXiv:2310.04445 [pdf, other]

LoFT: Local Proxy Fine-tuning For Improving Transferability Of Adversarial Attacks Against Large Language Model

Authors: Muhammad Ahmed Shah, Roshan Sharma, Hira Dhamyal, Raphael Olivier, Ankit Shah, Joseph Konan, Dareen Alharthi, Hazim T Bukhari, Massa Baali, Soham Deshmukh, Michael Kuhlmann, Bhiksha Raj, Rita Singh

Abstract: It has been shown that Large Language Model (LLM) alignments can be circumvented by appending specially crafted attack suffixes with harmful queries to elicit harmful responses. To conduct attacks against private target models whose characterization is unknown, public models can be used as proxies to fashion the attack, with successful attacks being transferred from public proxies to private target models. The success rate of attack depends on how closely the proxy model approximates the private model. We hypothesize that for attacks to be transferable, it is sufficient if the proxy can approximate the target model in the neighborhood of the harmful query. Therefore, in this paper, we propose Local Fine-Tuning (LoFT), i.e., fine-tuning proxy models on similar queries that lie in the lexico-semantic neighborhood of harmful queries to decrease the divergence between the proxy and target models. First, we demonstrate three approaches to prompt private target models to obtain similar queries given harmful queries. Next, we obtain data for local fine-tuning by eliciting responses from target models for the generated similar queries. Then, we optimize attack suffixes to generate attack prompts and evaluate the impact of our local fine-tuning on the attack's success rate. Experiments show that local fine-tuning of proxy models improves attack transferability and increases attack success rate by $39\%$, $7\%$, and $0.5\%$ (absolute) on target models ChatGPT, GPT-4, and Claude respectively.

Submitted 21 October, 2023; v1 submitted 2 October, 2023; originally announced October 2023.

arXiv:2310.02298 [pdf, other]

Prompting Audios Using Acoustic Properties For Emotion Representation

Authors: Hira Dhamyal, Benjamin Elizalde, Soham Deshmukh, Huaming Wang, Bhiksha Raj, Rita Singh

Abstract: Emotions lie on a continuum, but current models treat emotions as a finite valued discrete variable. This representation does not capture the diversity in the expression of emotion. To better represent emotions we propose the use of natural language descriptions (or prompts). In this work, we address the challenge of automatically generating these prompts and training a model to better learn emotion representations from audio and prompt pairs. We use acoustic properties that are correlated to emotion like pitch, intensity, speech rate, and articulation rate to automatically generate prompts i.e. 'acoustic prompts'. We use a contrastive learning objective to map speech to their respective acoustic prompts. We evaluate our model on Emotion Audio Retrieval and Speech Emotion Recognition. Our results show that the acoustic prompts significantly improve the model's performance in EAR, in various Precision@K metrics. In SER, we observe a 3.8% relative accuracy improvement on the Ravdess dataset.

Submitted 6 December, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2211.07737

arXiv:2310.01995 [pdf, other]

Development of Machine Vision Approach for Mechanical Component Identification based on its Dimension and Pitch

Authors: Toshit Jain, Faisel Mushtaq, K Ramesh, Sandip Deshmukh, Tathagata Ray, Chandu Parimi, Praveen Tandon, Pramod Kumar Jha

Abstract: In this work, a highly customizable and scalable vision based system for automation of mechanical assembly lines is described. The proposed system calculates the features that are required to classify and identify the different kinds of bolts that are used in the assembly line. The system describes a novel method of calculating the pitch of the bolt in addition to bolt identification and calculating the dimensions of the bolts. This identification and classification system is extremely lightweight and can be run on bare minimum hardware. The system is very fast in the order of milliseconds, hence the system can be used successfully even if the components are steadily moving on a conveyor. The results show that our system can correctly identify the parts in our dataset with 98% accuracy using the calculated features.

Submitted 3 October, 2023; originally announced October 2023.

Comments: 8 pages

ACM Class: I.4.7

arXiv:2309.07372 [pdf, other]

Training Audio Captioning Models without Audio

Authors: Soham Deshmukh, Benjamin Elizalde, Dimitra Emmanouilidou, Bhiksha Raj, Rita Singh, Huaming Wang

Abstract: Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an approach to train AAC systems using only text. Our approach leverages the multimodal space of contrastively trained audio-text models, such as CLAP. During training, a decoder generates captions conditioned on the pretrained CLAP text encoder. During inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap between text and audio embeddings, we propose the use of noise injection or a learnable adapter, during training. We find that the proposed text-only framework performs competitively with state-of-the-art models trained with paired audio, showing that efficient text-to-audio transfer is possible. Finally, we showcase both stylized audio captioning and caption enrichment while training without audio or human-created text captions.

Submitted 13 September, 2023; originally announced September 2023.

arXiv:2309.05767 [pdf, other]

Natural Language Supervision for General-Purpose Audio Representations

Authors: Benjamin Elizalde, Soham Deshmukh, Huaming Wang

Abstract: Audio-Language models jointly learn multimodal text and audio representations that enable Zero-Shot inference. Models rely on the encoders to create powerful representations of the input and generalize to multiple tasks ranging from sounds, music, and speech. Although models have achieved remarkable performance, there is still a performance gap with task-specific models. In this paper, we propose a Contrastive Language-Audio Pretraining model that is pretrained with a diverse collection of 4.6M audio-text pairs employing two innovative encoders for Zero-Shot inference. To learn audio representations, we trained an audio encoder on 22 audio tasks, instead of the standard training of sound event classification. To learn language representations, we trained an autoregressive decoder-only model instead of the standard encoder-only models. Then, the audio and language representations are brought into a joint multimodal space using Contrastive Learning. We used our encoders to improve the downstream performance by a margin. We extensively evaluated the generalization of our representations on 26 downstream tasks, the largest in the literature. Our model achieves state of the art results in several tasks leading the way towards general-purpose audio representations.

Submitted 6 February, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

arXiv:2308.11239 [pdf, other]

LOCATE: Self-supervised Object Discovery via Flow-guided Graph-cut and Bootstrapped Self-training

Authors: Silky Singh, Shripad Deshmukh, Mausoom Sarkar, Balaji Krishnamurthy

Abstract: Learning object segmentation in image and video datasets without human supervision is a challenging problem. Humans easily identify moving salient objects in videos using the gestalt principle of common fate, which suggests that what moves together belongs together. Building upon this idea, we propose a self-supervised object discovery approach that leverages motion and appearance information to produce high-quality object segmentation masks. Specifically, we redesign the traditional graph cut on images to include motion information in a linear combination with appearance information to produce edge weights. Remarkably, this step produces object segmentation masks comparable to the current state-of-the-art on multiple benchmarks. To further improve performance, we bootstrap a segmentation network trained on these preliminary masks as pseudo-ground truths to learn from its own outputs via self-training. We demonstrate the effectiveness of our approach, named LOCATE, on multiple standard video object segmentation, image saliency detection, and object segmentation benchmarks, achieving results on par with and, in many cases surpassing state-of-the-art methods. We also demonstrate the transferability of our approach to novel domains through a qualitative study on in-the-wild images. Additionally, we present extensive ablation analysis to support our design choices and highlight the contribution of each component of our proposed method.

Submitted 2 December, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

Comments: Accepted to British Machine Vision Conference (BMVC) 2023

arXiv:2308.01385 [pdf, other]

doi 10.1145/3570361.3592498

BEAVIS: Balloon Enabled Aerial Vehicle for IoT and Sensing

Authors: Suryansh Sharma, Ashutosh Simha, R. Venkatesha Prasad, Shubham Deshmukh, Kavin B. Saravanan, Ravi Ramesh, Luca Mottola

Abstract: UAVs are becoming versatile and valuable platforms for various applications. However, the main limitation is their flying time. We present BEAVIS, a novel aerial robotic platform striking an unparalleled trade-off between the manoeuvrability of drones and the long lasting capacity of blimps. BEAVIS scores highly in applications where drones enjoy unconstrained mobility yet suffer from limited lifetime. A nonlinear flight controller exploiting novel, unexplored, aerodynamic phenomena to regulate the ambient pressure and enable all translational and yaw degrees of freedom is proposed without direct actuation in the vertical direction. BEAVIS has built-in rotor fault detection and tolerance. We explain the design and the necessary background in detail. We verify the dynamics of BEAVIS and demonstrate its distinct advantages, such as agility, over existing platforms including the degrees of freedom akin to a drone with 11.36x increased lifetime. We exemplify the potential of BEAVIS to become an invaluable platform for many applications.

Submitted 2 August, 2023; originally announced August 2023.

Comments: To be published in the 29th Annual International Conference on Mobile Computing and Networking (ACM MobiCom 23), October 2-6, 2023, Madrid, Spain. ACM, New York, NY, USA, 15 pages

arXiv:2307.13192 [pdf, other]

Counterfactual Explanation Policies in RL

Authors: Shripad V. Deshmukh, Srivatsan R, Supriti Vijay, Jayakumar Subramanian, Chirag Agarwal

Abstract: As Reinforcement Learning (RL) agents are increasingly employed in diverse decision-making problems using reward preferences, it becomes important to ensure that policies learned by these frameworks in mapping observations to a probability distribution of the possible actions are explainable. However, there is little to no work in the systematic understanding of these complex policies in a contrastive manner, i.e., what minimal changes to the policy would improve/worsen its performance to a desired level. In this work, we present COUNTERPOL, the first framework to analyze RL policies using counterfactual explanations in the form of minimal changes to the policy that lead to the desired outcome. We do so by incorporating counterfactuals in supervised learning in RL with the target outcome regulated using desired return. We establish a theoretical connection between Counterpol and widely used trust region-based policy optimization methods in RL. Extensive empirical analysis shows the efficacy of COUNTERPOL in generating explanations for (un)learning skills while keeping close to the original policy. Our results on five different RL environments with diverse state and action spaces demonstrate the utility of counterfactual explanations, paving the way for new frontiers in designing and developing counterfactual policies.

Submitted 24 July, 2023; originally announced July 2023.

Comments: ICML Workshop on Counterfactuals in Minds and Machines, 2023

arXiv:2307.04392 [pdf, other]

FODVid: Flow-guided Object Discovery in Videos

Authors: Silky Singh, Shripad Deshmukh, Mausoom Sarkar, Rishabh Jain, Mayur Hemani, Balaji Krishnamurthy

Abstract: Segmentation of objects in a video is challenging due to the nuances such as motion blurring, parallax, occlusions, changes in illumination, etc. Instead of addressing these nuances separately, we focus on building a generalizable solution that avoids overfitting to the individual intricacies. Such a solution would also help us save enormous resources involved in human annotation of video corpora. To solve Video Object Segmentation (VOS) in an unsupervised setting, we propose a new pipeline (FODVid) based on the idea of guiding segmentation outputs using flow-guided graph-cut and temporal consistency. Basically, we design a segmentation model incorporating intra-frame appearance and flow similarities, and inter-frame temporal continuation of the objects under consideration. We perform an extensive experimental analysis of our straightforward methodology on the standard DAVIS16 video benchmark. Though simple, our approach produces results comparable (within a range of ~2 mIoU) to the existing top approaches in unsupervised VOS. The simplicity and effectiveness of our technique opens up new avenues for research in the video domain.

Submitted 10 July, 2023; originally announced July 2023.

Comments: CVPR 2023 (L3D-IVU workshop)

arXiv:2305.11834 [pdf, other]

Pengi: An Audio Language Model for Audio Tasks

Authors: Soham Deshmukh, Benjamin Elizalde, Rita Singh, Huaming Wang

Abstract: In the domain of audio processing, Transfer Learning has facilitated the rise of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches have led to the development of versatile models capable of tackling a wide array of tasks, while delivering state-of-the-art performance. However, current models inherently lack the capacity to produce the requisite language for open-ended tasks, such as Audio Captioning or Audio Question & Answering. We introduce Pengi, a novel Audio Language Model that leverages Transfer Learning by framing all audio tasks as text-generation tasks. It takes as input, an audio recording, and text, and generates free-form text as output. The input audio is represented as a sequence of continuous embeddings by an audio encoder. A text encoder does the same for the corresponding text input. Both sequences are combined as a prefix to prompt a pre-trained frozen language model. The unified architecture of Pengi enables open-ended tasks and close-ended tasks without any additional fine-tuning or task-specific extensions. When evaluated on 22 downstream tasks, our approach yields state-of-the-art performance in several of them. Our results show that connecting language models with audio models is a major step towards general-purpose audio understanding.

Submitted 18 January, 2024; v1 submitted 19 May, 2023; originally announced May 2023.

Comments: Accepted at NeurIPS 2023. The manuscript is updated with additional experiments suggested by reviewers

arXiv:2305.04073 [pdf, other]

Explaining RL Decisions with Trajectories

Authors: Shripad Vilasrao Deshmukh, Arpan Dasgupta, Balaji Krishnamurthy, Nan Jiang, Chirag Agarwal, Georgios Theocharous, Jayakumar Subramanian

Abstract: Explanation is a key component for the adoption of reinforcement learning (RL) in many real-world decision-making problems. In the literature, the explanation is often provided by saliency attribution to the features of the RL agent's state. In this work, we propose a complementary approach to these explanations, particularly for offline RL, where we attribute the policy decisions of a trained RL agent to the trajectories encountered by it during training. To do so, we encode trajectories in offline training data individually as well as collectively (encoding a set of trajectories). We then attribute policy decisions to a set of trajectories in this encoded space by estimating the sensitivity of the decision with respect to that set. Further, we demonstrate the effectiveness of the proposed approach in terms of quality of attributions as well as practical scalability in diverse environments that involve both discrete and continuous state and action spaces such as grid-worlds, video games (Atari) and continuous control (MuJoCo). We also conduct a human study on a simple navigation task to observe how their understanding of the task compares with data attributed for a trained RL policy. Keywords -- Explainable AI, Verifiability of AI Decisions, Explainable RL.

Submitted 22 January, 2024; v1 submitted 6 May, 2023; originally announced May 2023.

Comments: Published at International Conference on Learning Representations (ICLR), 2023

arXiv:2302.09719 [pdf, ps, other]

Synergy between human and machine approaches to sound/scene recognition and processing: An overview of ICASSP special session

Authors: Laurie M. Heller, Benjamin Elizalde, Bhiksha Raj, Soham Deshmukh

Abstract: Machine Listening, as usually formalized, attempts to perform a task that is, from our perspective, fundamentally human-performable, and performed by humans. Current automated models of Machine Listening vary from purely data-driven approaches to approaches imitating human systems. In recent years, the most promising approaches have been hybrid in that they have used data-driven approaches informed by models of the perceptual, cognitive, and semantic processes of the human system. Not only does the guidance provided by models of human perception and domain knowledge enable better, and more generalizable Machine Listening, in the converse, the lessons learned from these models may be used to verify or improve our models of human perception themselves. This paper summarizes advances in the development of such hybrid approaches, ranging from Machine Listening models that are informed by models of peripheral (human) auditory processes, to those that employ or derive semantic information encoded in relations between sounds. The research described herein was presented in a special session on "Synergy between human and machine approaches to sound/scene recognition and processing" at the 2023 ICASSP meeting.

Submitted 23 February, 2023; v1 submitted 19 February, 2023; originally announced February 2023.

Comments: 4 pages. Summary of Special Session planned for 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://2023.ieeeicassp.org/ Second version has corrected spelling of an author's name

arXiv:2211.07737 [pdf, other]

Describing emotions with acoustic property prompts for speech emotion recognition

Authors: Hira Dhamyal, Benjamin Elizalde, Soham Deshmukh, Huaming Wang, Bhiksha Raj, Rita Singh

Abstract: Emotions lie on a broad continuum and treating emotions as a discrete number of classes limits the ability of a model to capture the nuances in the continuum. The challenge is how to describe the nuances of emotions and how to enable a model to learn the descriptions. In this work, we devise a method to automatically create a description (or prompt) for a given audio by computing acoustic properties, such as pitch, loudness, speech rate, and articulation rate. We pair a prompt with its corresponding audio using 5 different emotion datasets. We trained a neural network model using these audio-text pairs. Then, we evaluate the model using one more dataset. We investigate how the model can learn to associate the audio with the descriptions, resulting in performance improvement of Speech Emotion Recognition and Speech Audio Retrieval. We expect our findings to motivate research describing the broad continuum of emotion.

Submitted 14 November, 2022; originally announced November 2022.

arXiv:2211.05100 [pdf, other]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Authors: BigScience Workshop, :, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major , et al. (369 additional authors not shown)

Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

arXiv:2209.14275 [pdf, other]

Audio Retrieval with WavText5K and CLAP Training

Authors: Soham Deshmukh, Benjamin Elizalde, Huaming Wang

Abstract: Audio-Text retrieval takes a natural language query to retrieve relevant audio files in a database. Conversely, Text-Audio retrieval takes an audio file as a query to retrieve relevant natural language descriptions. Most of the literature trains retrieval systems with one audio captioning dataset, but evaluating the benefit of training with multiple datasets is underexplored. Moreover, retrieval systems have to learn the alignment between elaborated sentences describing audio content of variable length ranging from a few seconds to several minutes. In this work, we propose a new collection of web audio-text pairs and a new framework for retrieval. First, we provide a new collection of about five thousand web audio-text pairs that we refer to as WavText5K. When used to train our retrieval system, WavText5K improved performance more than other audio captioning datasets. Second, our framework learns to connect language and audio content by using a text encoder, two audio encoders, and a contrastive learning objective. Combining both audio encoders helps to process variable length audio. The two contributions beat state-of-the-art performance for AudioCaps and Clotho on Text-Audio retrieval by a relative 2% and 16%, and Audio-Text retrieval by 6% and 23%.

Submitted 28 September, 2022; originally announced September 2022.

arXiv:2209.06584 [pdf, other]

One-Shot Doc Snippet Detection: Powering Search in Document Beyond Text

Authors: Abhinav Java, Shripad Deshmukh, Milan Aggarwal, Surgan Jandial, Mausoom Sarkar, Balaji Krishnamurthy

Abstract: Active consumption of digital documents has yielded scope for research in various applications, including search. Traditionally, searching within a document has been cast as a text matching problem ignoring the rich layout and visual cues commonly present in structured documents, forms, etc. To that end, we ask a mostly unexplored question: "Can we search for other similar snippets present in a target document page given a single query instance of a document snippet?". We propose MONOMER to solve this as a one-shot snippet detection task. MONOMER fuses context from visual, textual, and spatial modalities of snippets and documents to find query snippet in target documents. We conduct extensive ablations and experiments showing MONOMER outperforms several baselines from one-shot object detection (BHRL), template matching, and document understanding (LayoutLMv3). Due to the scarcity of relevant data for the task at hand, we train MONOMER on programmatically generated data having many visually similar query snippets and target document pairs from two datasets - Flamingo Forms and PubLayNet. We also do a human study to validate the generated data.

Submitted 12 September, 2022; originally announced September 2022.

arXiv:2209.03578 [pdf]

Sign Language Detection

Authors: Shubham Deshmukh, Favin Fernandes, Amey Chavan

Abstract: With the advancements in computer vision techniques, the need to classify images based on their features has become a huge task and necessity. In this project we propose two models: the first performs feature extraction and classification using ORB and SVM, and the second uses a CNN architecture. The end result of the project is to understand the concept behind feature extraction and image classification. The trained CNN model will also be converted to tflite format for Android development.

Submitted 8 September, 2022; originally announced September 2022.

Comments: 8 pages, 10 figures

arXiv:2209.03576 [pdf]

Suspicious and Anomaly Detection

Authors: Shubham Deshmukh, Favin Fernandes, Monali Ahire, Devarshi Borse, Amey Chavan

Abstract: In this project we propose a CNN architecture to detect anomalous and suspicious activities; the activities chosen for the project are running, jumping and kicking in public places and carrying a gun, bat or knife in public places. We compare the trained model with pre-existing models such as Yolo, vgg16 and vgg19. The trained model is then implemented for real-time detection, and the .tflite format of the trained .h5 model is also used to build an Android classification.

Submitted 8 September, 2022; originally announced September 2022.

Comments: 7 pages, 10 figures

arXiv:2209.03570 [pdf]

SANIP: Shopping Assistant and Navigation for the visually impaired

Authors: Shubham Deshmukh, Favin Fernandes, Amey Chavan, Monali Ahire, Devashri Borse, Jyoti Madake

Abstract: The proposed shopping assistant model SANIP is going to help blind persons to detect hand held objects and also to get a video feedback of the information retrieved from the detected and recognized objects. The proposed model consists of three python models i.e. Custom Object Detection, Text Detection and Barcode detection. For object detection of the hand held object, we have created our own custom dataset that comprises daily goods such as Parle-G, Tide, and Lays. Other than that we have also collected images of Cart and Exit signs as it is essential for any person to use a cart and also notice the exit sign in case of emergency. For the other 2 models proposed the text and barcode information retrieved is converted from text to speech and relayed to the Blind person. The model was used to detect objects that were trained on and was successful in detecting and recognizing the desired output with a good accuracy and precision.

Submitted 8 September, 2022; originally announced September 2022.

Comments: 6 pages, 8 figures. arXiv admin note: text overlap with arXiv:2011.04244 by other authors

arXiv:2208.09439 [pdf, other]

Adapting Task-Oriented Dialogue Models for Email Conversations

Authors: Soham Deshmukh, Charles Lee

Abstract: Intent detection is a key part of any Natural Language Understanding (NLU) system of a conversational assistant. Detecting the correct intent is essential yet difficult for email conversations where multiple directives and intents are present. In such settings, conversation context can become a key disambiguating factor for detecting the user's request from the assistant. One prominent way of incorporating context is modeling past conversation history like task-oriented dialogue models. However, the nature of email conversations (long form) restricts direct usage of the latest advances in task-oriented dialogue models. So in this paper, we provide an effective transfer learning framework (EMToD) that allows the latest development in dialogue models to be adapted for long-form conversations. We show that the proposed EMToD framework improves intent detection performance over pre-trained language models by 45% and over pre-trained dialogue models by 30% for task-oriented email conversations. Additionally, the modular nature of the proposed framework allows plug-and-play for any future developments in both pre-trained language and task-oriented dialogue models.

Submitted 19 August, 2022; originally announced August 2022.

arXiv:2206.15076 [pdf, other]

BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing

Authors: Jason Alan Fries, Leon Weber, Natasha Seelam, Gabriel Altay, Debajyoti Datta, Samuele Garda, Myungsun Kang, Ruisi Su, Wojciech Kusa, Samuel Cahyawijaya, Fabio Barth, Simon Ott, Matthias Samwald, Stephen Bach, Stella Biderman, Mario Sänger, Bo Wang, Alison Callahan, Daniel León Periñán, Théo Gigant, Patrick Haller, Jenny Chim, Jose David Posada, John Michael Giorgi, Karthik Rangasai Sivaraman , et al. (18 additional authors not shown)

Abstract: Training and evaluating language models increasingly requires the construction of meta-datasets -- diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing, supervised datasets into a diversity of novel pretraining tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BigBIO, a community library of 126+ biomedical NLP datasets, currently covering 12 task categories and 10+ languages. BigBIO facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero shot language model evaluation. We discuss our process for task schema harmonization, data auditing, contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BigBIO is an ongoing community effort and is available at https://github.com/bigscience-workshop/biomedical

Submitted 30 June, 2022; originally announced June 2022.

Comments: Submitted to NeurIPS 2022 Datasets and Benchmarks Track

arXiv:2206.04769 [pdf, other]

CLAP: Learning Audio Concepts From Natural Language Supervision

Authors: Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, Huaming Wang

Abstract: Mainstream Audio Analytics models are trained to learn under the paradigm of one class label to many recordings focusing on one task. Learning under such restricted supervision limits the flexibility of models because they require labeled audio for training and can only predict the predefined categories. Instead, we propose to learn audio concepts from natural language supervision. We call our approach Contrastive Language-Audio Pretraining (CLAP), which learns to connect language and audio by using two encoders and contrastive learning to bring audio and text descriptions into a joint multimodal space. We trained CLAP with 128k audio and text pairs and evaluated it on 16 downstream tasks across 8 domains, such as Sound Event Classification, Music tasks, and Speech-related tasks. Although CLAP was trained with significantly fewer pairs than similar computer vision models, it establishes SoTA for Zero-Shot performance. Additionally, we evaluated CLAP in a supervised learning setup and achieve SoTA in 5 tasks. Hence, CLAP's Zero-Shot capability removes the need of training with class labels, enables flexible class prediction at inference time, and generalizes to multiple downstream tasks.

Submitted 9 June, 2022; originally announced June 2022.

arXiv:2205.03513 [pdf, other]

Digital Twin Framework for Time to Failure Forecasting of Wind Turbine Gearbox: A Concept

Authors: Mili Wadhwani, Sakshi Deshmukh, Harsh S. Dhiman

Abstract: A wind turbine is a complex machine with its rotating and non-rotating equipment being sensitive to faults. Due to increased wear and tear, the maintenance aspect of a wind turbine is of critical importance. Unexpected failure of wind turbine components can lead to increased O&M costs which ultimately reduces effective power capture of a wind farm. Fault detection in wind turbines is often supplemented with SCADA data available from wind farm operators in the form of time-series format with a 10-minute sample interval. Moreover, time-series analysis and data representation has become a powerful tool to get a deeper understanding of the dynamic processes in complex machinery like a wind turbine. Wind turbine SCADA data is usually available in the form of a multivariate time-series with variables like gearbox oil temperature, gearbox bearing temperature, nacelle temperature, rotor speed and active power produced. In this preprint, we discuss the concept of a digital twin for time to failure forecasting of the wind turbine gearbox where a predictive module continuously gets updated with real-time SCADA data and generates meaningful insights for the wind farm operator.

Submitted 28 April, 2022; originally announced May 2022.

arXiv:2110.02148 [pdf, other]

NaRLE: Natural Language Models using Reinforcement Learning with Emotion Feedback

Authors: Ruijie Zhou, Soham Deshmukh, Jeremiah Greer, Charles Lee

Abstract: Current research in dialogue systems is focused on conversational assistants working on short conversations in either task-oriented or open domain settings. In this paper, we focus on improving task-based conversational assistants online, primarily those working on document-type conversations (e.g., emails) whose contents may or may not be completely related to the assistant's task. We propose "NARLE", a deep reinforcement learning (RL) framework for improving the natural language understanding (NLU) component of dialogue systems online without the need to collect human labels for customer data. The proposed solution associates user emotion with the assistant's action and uses that to improve NLU models using policy gradients. For two intent classification problems, we empirically show that using reinforcement learning to fine tune the pre-trained supervised learning models improves performance up to 43%. Furthermore, we demonstrate the robustness of the method to partial and noisy implicit feedback.

Submitted 5 October, 2021; originally announced October 2021.

arXiv:2108.13307

Security For System-On-Chip (SoC) Using Neural Networks

Authors: Vedant Ghodke, Shubham Deshmukh, Atharva Deshpande, Ninad Ekbote, Swati Shilaskar

Abstract: With the growth of embedded systems, VLSI design phases complexity and cost factors across the globe and has become outsourced. Modern computing ICs are now using system-on-chip for better on-chip processing and communication. In the era of Internet-of-Things (IoT), security has become one of the most crucial parts of a System-on-Chip (SoC). Malicious activities generate abnormal traffic patterns which affect the operation of the system and its performance which cannot be afforded in a computation hungry world. SoCs have a chance of functionality failure, leakage of information, even a denial of services (DoS), Hardware Trojan Horses and many more factors which are categorized as security threats. In this paper, we aim to compare and describe different types of malicious security threats and how neural networks can be used to prevent those attacks. Spiking Neural Networks (SNN), Runtime Neural Architecture (RTNA) are some of the neural networks which prevent SoCs from attacks. Finally, the development trends in SoC security are also highlighted.

Submitted 28 September, 2022; v1 submitted 30 August, 2021; originally announced August 2021.

Comments: Challenges with content validity

arXiv:2106.06858 [pdf, other]

Improving weakly supervised sound event detection with self-supervised auxiliary tasks

Authors: Soham Deshmukh, Bhiksha Raj, Rita Singh

Abstract: While multitask and transfer learning has shown to improve the performance of neural networks in limited data settings, they require pretraining of the model on large datasets beforehand. In this paper, we focus on improving the performance of weakly supervised sound event detection in low data and noisy settings simultaneously without requiring any pretraining task. To that extent, we propose a shared encoder architecture with sound event detection as a primary task and an additional secondary decoder for a self-supervised auxiliary task. We empirically evaluate the proposed framework for weakly supervised sound event detection on a remix dataset of the DCASE 2019 task 1 acoustic scene data with DCASE 2018 Task 2 sounds event data under 0, 10 and 20 dB SNR. To ensure we retain the localisation information of multiple sound events, we propose a two-step attention pooling mechanism that provides a time-frequency localisation of multiple audio events in the clip. The proposed framework with two-step attention outperforms existing benchmark models by 22.3%, 12.8%, 5.9% on 0, 10 and 20 dB SNR respectively. We carry out an ablation study to determine the contribution of the auxiliary task and two-step attention pooling to the SED performance improvement.

Submitted 12 June, 2021; originally announced June 2021.

Comments: Accepted at INTERSPEECH 21

arXiv:2010.16318 [pdf, other]

doi 10.1109/ICASSP39728.2021.9414530

Interpreting glottal flow dynamics for detecting COVID-19 from voice

Authors: Soham Deshmukh, Mahmoud Al Ismail, Rita Singh

Abstract: In the pathogenesis of COVID-19, impairment of respiratory functions is often one of the key symptoms. Studies show that in these cases, voice production is also adversely affected -- vocal fold oscillations are asynchronous, asymmetrical and more restricted during phonation. This paper proposes a method that analyzes the differential dynamics of the glottal flow waveform (GFW) during voice production to identify features in them that are most significant for the detection of COVID-19 from voice. Since it is hard to measure this directly in COVID-19 patients, we infer it from recorded speech signals and compare it to the GFW computed from physical model of phonation. For normal voices, the difference between the two should be minimal, since physical models are constructed to explain phonation under assumptions of normalcy. Greater differences implicate anomalies in the bio-physical factors that contribute to the correctness of the physical model, revealing their significance indirectly. Our proposed method uses a CNN-based 2-step attention model that locates anomalies in time-feature space in the difference of the two GFWs, allowing us to infer their potential as discriminative features for classification. The viability of this method is demonstrated using a clinically curated dataset of COVID-19 positive and negative subjects.

Submitted 29 October, 2020; originally announced October 2020.

arXiv:2010.10707 [pdf, other]

Detection of COVID-19 through the analysis of vocal fold oscillations

Authors: Mahmoud Al Ismail, Soham Deshmukh, Rita Singh

Abstract: Phonation, or the vibration of the vocal folds, is the primary source of vocalization in the production of voiced sounds by humans. It is a complex bio-mechanical process that is highly sensitive to changes in the speaker's respiratory parameters. Since most symptomatic cases of COVID-19 present with moderate to severe impairment of respiratory functions, we hypothesize that signatures of COVID-19 may be observable by examining the vibrations of the vocal folds. Our goal is to validate this hypothesis, and to quantitatively characterize the changes observed to enable the detection of COVID-19 from voice. For this, we use a dynamical system model for the oscillation of the vocal folds, and solve it using our recently developed ADLES algorithm to yield vocal fold oscillation patterns directly from recorded speech. Experimental results on a clinically curated dataset of COVID-19 positive and negative subjects reveal characteristic patterns of vocal fold oscillations that are correlated with COVID-19. We show that these are prominent and discriminative enough that even simple classifiers such as logistic regression yield high detection accuracies using just the recordings of isolated extended vowels.

Submitted 20 October, 2020; originally announced October 2020.

Comments: 5 pages, 6 figures

arXiv:2008.07085 [pdf, other]

Multi-Task Learning for Interpretable Weakly Labelled Sound Event Detection

Authors: Soham Deshmukh, Bhiksha Raj, Rita Singh

Abstract: Weakly Labelled learning has garnered a lot of attention in recent years due to its potential to scale Sound Event Detection (SED) and is formulated as a Multiple Instance Learning (MIL) problem. This paper proposes a Multi-Task Learning (MTL) framework for learning from Weakly Labelled Audio data which encompasses the traditional MIL setup. To show the utility of the proposed framework, we use reconstruction of the input Time-Frequency (T-F) representation as the auxiliary task. We show that the chosen auxiliary task de-noises the internal T-F representation and improves SED performance under noisy recordings. Our second contribution is a two-step Attention Pooling mechanism. By having two steps in the attention mechanism, the network retains better T-F level information without compromising SED performance. Visualising the first-step and second-step attention weights helps in localising the audio event in the T-F domain. To evaluate the proposed framework, we remix the DCASE 2019 Task 1 acoustic scene data with DCASE 2018 Task 2 sound event data at 0, 10 and 20 dB SNR, resulting in a multi-class Weakly Labelled SED problem. The proposed total framework outperforms existing benchmark models over all SNRs, with improvements of 22.3%, 12.8% and 5.9% over the benchmark model at 0, 10 and 20 dB SNR respectively. We carry out an ablation study to determine the contribution of each auxiliary task and of 2-step Attention Pooling to the SED performance improvement. The code is publicly released.

Submitted 29 October, 2020; v1 submitted 17 August, 2020; originally announced August 2020.

arXiv:2002.11500 [pdf, other]

Robust Underlay Device-to-Device Communications on Multiple Channels

Authors: Mohamed Elnourani, Siddharth Deshmukh, Baltasar Beferull-Lozano, Daniel Romero

Abstract: Most recent works in device-to-device (D2D) underlay communications focus on the optimization of either power or channel allocation to improve the spectral efficiency, and typically consider uplink and downlink separately. Further, several of them also assume perfect knowledge of channel-state information (CSI). In this paper, we formulate a joint uplink and downlink resource allocation scheme, which assigns both power and channel resources to D2D pairs and cellular users in an underlay network scenario. The objective is to maximize the overall network rate while maintaining fairness among the D2D pairs. In addition, we also consider imperfect CSI, where we guarantee a certain outage probability to maintain the desired quality-of-service (QoS). The resulting problem is a mixed integer non-convex optimization problem and we propose both centralized and decentralized algorithms to solve it, using convex relaxation, fractional programming, and alternating optimization. In the decentralized setting, the computational load is distributed among the D2D pairs and the base station, while also keeping the communication overhead low. Moreover, we provide a theoretical convergence analysis, including the rate of convergence to stationary points. The proposed algorithms have been experimentally tested in a simulation environment, showing their favorable performance compared with state-of-the-art alternatives.

Submitted 26 February, 2020; originally announced February 2020.

Comments: 30 pages, 7 figures, 2 tables. Submitted to IEEE Transactions on Wireless Communications

arXiv:1912.12191 [pdf, other]

Explain Your Move: Understanding Agent Actions Using Specific and Relevant Feature Attribution

Authors: Nikaash Puri, Sukriti Verma, Piyush Gupta, Dhruv Kayastha, Shripad Deshmukh, Balaji Krishnamurthy, Sameer Singh

Abstract: As deep reinforcement learning (RL) is applied to more tasks, there is a need to visualize and understand the behavior of learned agents. Saliency maps explain agent behavior by highlighting the features of the input state that are most relevant for the agent in taking an action. Existing perturbation-based approaches to compute saliency often highlight regions of the input that are not relevant to the action taken by the agent. Our proposed approach, SARFA (Specific and Relevant Feature Attribution), generates more focused saliency maps by balancing two aspects (specificity and relevance) that capture different desiderata of saliency. The first captures the impact of perturbation on the relative expected reward of the action to be explained. The second downweighs irrelevant features that alter the relative expected rewards of actions other than the action to be explained. We compare SARFA with existing approaches on agents trained to play board games (Chess and Go) and Atari games (Breakout, Pong and Space Invaders). We show through illustrative examples (Chess, Atari, Go), human studies (Chess), and automated evaluation methods (Chess) that SARFA generates saliency maps that are more interpretable for humans than existing approaches. For the code release and demo videos, see https://nikaashpuri.github.io/sarfa-saliency/.

Submitted 3 April, 2020; v1 submitted 23 December, 2019; originally announced December 2019.

Comments: Accepted at the International Conference on Learning Representations (ICLR) 2020

arXiv:1912.03718 [pdf, other]

Improved Covariance Matrix Estimator using Shrinkage Transformation and Random Matrix Theory

Authors: Samruddhi Deshmukh, Amartansh Dubey

Abstract: One of the major challenges in multivariate analysis is the estimation of the population covariance matrix from the sample covariance matrix (SCM). Most recent covariance matrix estimators use either shrinkage transformations or asymptotic results from Random Matrix Theory (RMT). Shrinkage techniques help in pulling extreme correlation values towards certain target values, whereas tools from RMT help in removing noisy eigenvalues of the SCM. These techniques use different approaches to achieve a similar goal, which is to remove noisy correlations and add structure to the SCM to overcome the bias-variance trade-off. In this paper, we first critically evaluate the pros and cons of these two techniques and then propose an improved estimator which exploits the advantages of both by taking an optimally weighted convex combination of covariance matrices estimated by an improved shrinkage transformation and an RMT-based filter. It is a generalized estimator which can adapt to changing sampling noise conditions in various datasets by performing hyperparameter optimization. We show the effectiveness of this estimator on the problem of designing a financial portfolio with minimum risk. We have chosen this problem because the complex properties of stock market data provide extreme conditions to test the robustness of a covariance estimator. Using data from four of the world's largest stock exchanges, we show that our proposed estimator outperforms existing estimators in minimizing the out-of-sample risk of the portfolio and hence predicts population statistics more precisely. Since covariance analysis is a crucial statistical tool, this estimator can be used in a wide range of machine learning, signal processing and high dimensional pattern recognition applications.

Submitted 8 December, 2019; originally announced December 2019.

arXiv:1905.11824 [pdf, other]

Attacker Behaviour Profiling using Stochastic Ensemble of Hidden Markov Models

Authors: Soham Deshmukh, Rahul Rade, Dr. Faruk Kazi

Abstract: Cyber threat intelligence is one of the emerging areas of focus in information security. Much of the recent work has focused on rule-based methods and detection of network attacks using Intrusion Detection algorithms. In this paper we propose a framework for inspecting and modelling the behavioural aspect of an attacker to gain better insight into, and predictive power over, the attacker's future actions. For modelling we propose a novel semi-supervised algorithm called Fusion Hidden Markov Model (FHMM), which is more robust to noise, requires comparatively less training time, and utilizes the benefits of ensemble learning to better model temporal relationships in data. This paper evaluates the performance of FHMM and compares it with both traditional algorithms like Markov Chain and Hidden Markov Model (HMM) and recently developed Deep Recurrent Neural Network (Deep RNN) architectures. We conduct the experiments on a dataset consisting of real attack data from a Cowrie honeypot system. FHMM provides accuracy comparable to deep RNN architectures at significantly lower training time. Given these experimental results, we recommend using FHMM for modelling discrete temporal data for significantly faster training and better performance than existing methods.

Submitted 6 June, 2021; v1 submitted 28 May, 2019; originally announced May 2019.

arXiv:1608.05513

Data Centroid Based Multi-Level Fuzzy Min-Max Neural Network

Authors: Shraddha Deshmukh, Sagar Gandhi, Pratap Sanap, Vivek Kulkarni

Abstract: Recently, a multi-level fuzzy min-max neural network (MLF) was proposed, which improves the classification accuracy by handling an overlapped region (area of confusion) with the help of a tree structure. In this brief, an extension of MLF is proposed which defines a new boundary region, where the previously proposed methods mark decisions with less confidence and hence misclassification is more frequent. A methodology to classify patterns more accurately is presented. Our work enhances the testing procedure by means of data centroids. We present an illustrative example that clearly highlights the advantage of our approach. Results on standard datasets are also presented as evidence of a consistent improvement in the classification rate.

Submitted 20 December, 2016; v1 submitted 19 August, 2016; originally announced August 2016.

Comments: This paper has been withdrawn by the author due to crucial evidence that the similar work has already been published

arXiv:1103.0633 [pdf]

RDBNorma: - A semi-automated tool for relational database schema normalization up to third normal form

Authors: Y. V. Dongare, P. S. Dhabe, S. V. Deshmukh

Abstract: This paper proposes a tool called RDBNorma, which uses a novel approach to represent a relational database schema and its functional dependencies in computer memory using only one linked list, and which semi-automates the process of relational database schema normalization up to third normal form. The paper addresses the issues of representing a relational schema along with its functional dependencies using one linked list, and presents algorithms that use this representation to convert a relation into second and third normal form. We have compared the performance of RDBNorma with an existing tool called Micro using standard relational schemas collected from various sources. The proposed tool is observed to be at least 2.89 times faster than Micro and to require around half the space Micro needs to represent a relation. The comparison was done by entering all the attributes and functional dependencies that hold on a relation in the same order, and by implementing both tools in the same language and on the same machine.

Submitted 3 March, 2011; originally announced March 2011.

Comments: 22 pages and international journal

Journal ref: International Journal of Database Management Systems ( IJDMS ), Vol.3, No.1, February 2011

Showing 1–42 of 42 results for author: Deshmukh, S