
The Reasonable Effectiveness of Speaker Embeddings for Violence Detection

Authors: Sarthak Jain, Orchid Chetia Phukan, Arun Balaji Buduru, Rajesh Sharma

Abstract: In this paper, we focus on audio violence detection (AVD). AVD is necessary for several reasons, especially in the context of maintaining safety, preventing harm, and ensuring security in various environments. This calls for accurate AVD systems. As in many related applications in audio processing, the most common approach for improving performance is to leverage self-supervised (SSL) pre-trained models (PTMs). However, these SSL models are very large, with millions of parameters, which can hinder real-world deployment, especially in compute-constrained environments. To resolve this, we propose the use of speaker recognition models, which are much smaller than the SSL models. Experimenting with speaker recognition model embeddings and SVM and Random Forest classifiers, we show that speaker recognition model embeddings perform the best in comparison to state-of-the-art (SOTA) SSL models and achieve SOTA results.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted to INTERSPEECH 24 Show & Tell Demonstrations

arXiv:2406.06781

PERSONA: An Application for Emotion Recognition, Gender Recognition and Age Estimation

Authors: Devyani Koshal, Orchid Chetia Phukan, Sarthak Jain, Arun Balaji Buduru, Rajesh Sharma

Abstract: Emotion Recognition (ER), Gender Recognition (GR), and Age Estimation (AE) constitute paralinguistic tasks that rely not on the spoken content but primarily on speech characteristics such as pitch and tone. While previous research has made significant strides in developing models for each task individually, there has been comparatively less emphasis on concurrently learning these tasks, despite their inherent interconnectedness. As such, in this demonstration, we present PERSONA, an application for predicting ER, GR, and AE with a single model in the backend. One notable point is that, through a comparative study, we show that representations from a speaker recognition pre-trained model (PTM) are better suited for such a multi-task learning format than those from the state-of-the-art (SOTA) self-supervised (SSL) PTM. Our methodology obviates the need for deploying separate models for each task and can potentially conserve resources and time during the training and deployment phases.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted to INTERSPEECH 2024 Show & Tell Demonstrations

arXiv:2406.06774

ComFeAT: Combination of Neural and Spectral Features for Improved Depression Detection

Authors: Orchid Chetia Phukan, Sarthak Jain, Shubham Singh, Muskaan Singh, Arun Balaji Buduru, Rajesh Sharma

Abstract: In this work, we focus on the detection of depression through speech analysis. Previous research has widely explored features extracted from pre-trained models (PTMs) primarily trained for paralinguistic tasks. Although these features have led to sufficient advances in speech-based depression detection, their performance declines in real-world settings. To address this, in this paper, we introduce ComFeAT, an application that employs a CNN model trained on a combination of features extracted from PTMs (a.k.a. neural features) and spectral features to enhance depression detection. Spectral features are robust to domain variations but do not perform as well as neural features. Surprisingly, combining them shows complementary behavior and improves over both neural and spectral features individually. The proposed method also improves over previous state-of-the-art (SOTA) works on the E-DAIC benchmark.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted to INTERSPEECH 2024 Show & Tell Demonstrations

arXiv:2403.15966

Fisher Information Approach for Masking the Sensing Plan: Applications in Multifunction Radars

Authors: Shashwat Jain, Vikram Krishnamurthy, Muralidhar Rangaswamy, Bosung Kang, Sandeep Gogineni

Abstract: How to design a Markov Decision Process (MDP) based radar controller that makes small sacrifices in performance to mask its sensing plan from an adversary? The radar controller purposefully minimizes the Fisher information of its emissions so that an adversary cannot identify the controller's model parameters accurately. Unlike classical open loop statistical inference, where the Fisher information serves as a lower bound for the achievable covariance, this paper employs the Fisher information as a design constraint for a closed loop radar controller to mask its sensing plan. We analytically derive a closed-form expression for the determinant of the Fisher Information Matrix (FIM) pertaining to the parameters of the MDP-based controller. Subsequently, we constrain the MDP with respect to the determinant of the FIM. Numerical results show that the introduction of minor perturbations to the MDP's transition kernel and the total operation cost can reduce the Fisher Information of the emissions. Consequently, this reduction amplifies the variability in policy and transition kernel estimation errors, thwarting the adversary's accuracy in estimating the controller's sensing plan.

Submitted 23 March, 2024; originally announced March 2024.

arXiv:2312.05187

Seamless: Multilingual Expressive and Streaming Speech Translation

Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, et al. (40 additional authors not shown)

Abstract: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model, SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes.
Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at https://github.com/facebookresearch/seamless_communication

Submitted 8 December, 2023; originally announced December 2023.

arXiv:2311.12564

Summary of the DISPLACE Challenge 2023 - DIarization of SPeaker and LAnguage in Conversational Environments

Authors: Shikha Baghel, Shreyas Ramoji, Somil Jain, Pratik Roy Chowdhuri, Prachi Singh, Deepu Vijayasenan, Sriram Ganapathy

Abstract: In multi-lingual societies, where multiple languages are spoken in a small geographic vicinity, informal conversations often involve a mix of languages. Existing speech technologies may be inefficient in extracting information from such conversations, where the speech data is rich in diversity with multiple languages and speakers. The DISPLACE (DIarization of SPeaker and LAnguage in Conversational Environments) challenge constitutes an open call for evaluating and benchmarking speaker and language diarization technologies in this challenging condition. The challenge entailed two tracks: Track-1 focused on speaker diarization (SD) in multilingual situations, while Track-2 addressed language diarization (LD) in a multi-speaker scenario. Both tracks were evaluated using the same underlying audio data. To facilitate this evaluation, a real-world dataset featuring multilingual, multi-speaker conversational far-field speech was recorded and distributed. Furthermore, a baseline system that mimicked the state of the art in these tasks was made available for both the SD and LD tasks. The challenge garnered a total of 42 world-wide registrations and received a total of 19 combined submissions for Track-1 and Track-2. This paper describes the challenge, details of the datasets, tasks, and the baseline system. Additionally, the paper provides a concise overview of the submitted systems in both tracks, with an emphasis on the top-performing systems.
The paper also presents insights and future perspectives for SD and LD tasks, focusing on the key challenges that the systems need to overcome before wide-spread commercial deployment on such conversations.

Submitted 3 January, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

arXiv:2305.16333

Text Generation with Speech Synthesis for ASR Data Augmentation

Authors: Zhuangqun Huang, Gil Keren, Ziran Jiang, Shashank Jain, David Goss-Grubbs, Nelson Cheng, Farnaz Abtahi, Duc Le, David Zhang, Antony D'Avirro, Ethan Campbell-Taylor, Jessie Salas, Irina-Elena Veliche, Xi Chen

Abstract: Aiming at reducing the reliance on expensive human annotations, data synthesis for Automatic Speech Recognition (ASR) has remained an active area of research. While prior work mainly focuses on synthetic speech generation for ASR data augmentation, its combination with text generation methods is considerably less explored. In this work, we explore text augmentation for ASR using large-scale pre-trained neural networks, and systematically compare those to traditional text augmentation methods. The generated synthetic texts are then converted to synthetic speech using a text-to-speech (TTS) system and added to the ASR training data. In experiments conducted on three datasets, we find that neural models achieve 9%-15% relative WER improvement and outperform traditional methods. We conclude that text augmentation, particularly through modern neural approaches, is a viable tool for improving the accuracy of ASR systems.

Submitted 22 May, 2023; originally announced May 2023.

arXiv:2303.00830

DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments

Authors: Shikha Baghel, Shreyas Ramoji, Sidharth, Ranjana H, Prachi Singh, Somil Jain, Pratik Roy Chowdhuri, Kaustubh Kulkarni, Swapnil Padhi, Deepu Vijayasenan, Sriram Ganapathy

Abstract: In multilingual societies, social conversations often involve code-mixed speech. Current speech technology may not be well equipped to extract information from multi-lingual multi-speaker conversations. The DISPLACE challenge entails a first-of-its-kind task to benchmark speaker and language diarization on the same data, as the data contains multi-speaker conversations in multilingual code-mixed speech. The challenge attempts to highlight outstanding issues in speaker diarization (SD) in multilingual settings with code-mixing. Further, language diarization (LD) in multi-speaker settings also introduces new challenges, where the system has to disambiguate speaker switches from code switches. For this challenge, a natural multilingual, multi-speaker conversational dataset is distributed for development and evaluation purposes. The systems are evaluated on single-channel far-field recordings. We also release a baseline system and report the highlights of the system submissions.

Submitted 5 June, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

arXiv:2302.12520

A Novel Demand Response Model and Method for Peak Reduction in Smart Grids -- PowerTAC

Authors: Sanjay Chandlekar, Arthik Boroju, Shweta Jain, Sujit Gujar

Abstract: One of the widely used peak reduction methods in smart grids is demand response, where one analyzes the shift in customers' (agents') usage patterns in response to the signal from the distribution company. Often, these signals are in the form of incentives offered to agents. This work studies the effect of incentives on the probabilities of accepting such offers in a real-world smart grid simulator, PowerTAC. We first show that there exists a function that depicts the probability of an agent reducing its load as a function of the discounts offered to it. We call it the reduction probability (RP). The RP function is further parametrized by the rate of reduction (RR), which can differ for each agent. We provide an optimal algorithm, MJS--ExpResponse, that outputs the discounts to each agent by maximizing the expected reduction under a budget constraint. When RRs are unknown, we propose a Multi-Armed Bandit (MAB) based online algorithm, namely MJSUCB--ExpResponse, to learn RRs. Experimentally, we show that it exhibits sublinear regret. Finally, we showcase the efficacy of the proposed algorithm in mitigating demand peaks in a real-world smart grid system using the PowerTAC simulator as a test bed.

Submitted 24 February, 2023; originally announced February 2023.

Comments: 11 pages, 5 figures, 2 tables, Accepted as an Extended Abstract in AAMAS'23

arXiv:2302.02045

Radar Clutter Covariance Estimation: A Nonlinear Spectral Shrinkage Approach

Authors: Shashwat Jain, Vikram Krishnamurthy, Muralidhar Rangaswamy, Bosung Kang, Sandeep Gogineni

Abstract: In this paper, we exploit the spiked covariance structure of the clutter plus noise covariance matrix for radar signal processing. Using state-of-the-art techniques from high-dimensional statistics, we propose a nonlinear shrinkage-based rotation invariant spiked covariance matrix estimator. We state the convergence of the estimated spiked eigenvalues. We use a dataset generated from the high-fidelity, site-specific physics-based radar simulation software RFView to compare the proposed algorithm against the existing Rank Constrained Maximum Likelihood (RCML)-Expected Likelihood (EL) covariance estimation algorithm. We demonstrate that the computation time for the estimation by the proposed algorithm is less than that of the RCML-EL algorithm, with identical Signal to Clutter plus Noise Ratio (SCNR) performance. We show that the proposed algorithm and the RCML-EL-based algorithm share the same optimization problem in high dimensions. We use the Low-Rank Adaptive Normalized Matched Filter (LR-ANMF) detector to compute the detection probabilities for different false alarm probabilities over a range of target SNR. We present preliminary results which demonstrate the robustness of the detector against contaminating clutter discretes using the Challenge Dataset from RFView. Finally, we empirically show that the minimum variance distortionless response (MVDR) beamformer error variance for the proposed algorithm is identical to the error variance resulting from the true covariance matrix.

Submitted 3 February, 2023; originally announced February 2023.

arXiv:2212.02002

Adaptive ECCM for Mitigating Smart Jammers

Authors: Kunal Pattanayak, Shashwat Jain, Vikram Krishnamurthy, Chris Berry

Abstract: This paper considers adaptive radar electronic counter-counter measures (ECCM) to mitigate ECM by an adversarial jammer. Our ECCM approach models the jammer-radar interaction as a Principal Agent Problem (PAP), a popular economics framework for interaction between two entities with an information imbalance. In our setup, the radar does not know the jammer's utility. Instead, the radar learns the jammer's utility adaptively over time using inverse reinforcement learning. The radar's adaptive ECCM objective is two-fold: (1) maximize its utility by solving the PAP, and (2) estimate the jammer's utility by observing its response. Our adaptive ECCM scheme uses deep ideas from revealed preference in micro-economics and the principal agent problem in contract theory. Our numerical results show that, over time, our adaptive ECCM both identifies and mitigates the jammer's utility.

Submitted 4 December, 2022; originally announced December 2022.

arXiv:2210.11302

Fleet-Level Environmental Assessments for Feasibility of Aviation Emission Reduction Goals

Authors: Kolawole Ogunsina, Hsun Chao, Nithin Jojo Kolencherry, Samarth Jain, Kushal Moolchandani, Daniel DeLaurentis, William Crossley

Abstract: The International Air Transport Association (IATA) is one of several organizations that have presented goals for future CO2 emissions from commercial aviation with the intent of alleviating the associated environmental impacts. These goals include attaining carbon-neutral growth in the year 2020 and total aviation CO2 emissions in 2050 equal to 50% of 2005 aviation CO2 emissions. This paper presents the use of a simulation-based approach to predict future CO2 emissions from commercial aviation based upon a set of scenarios developed as part of the Aircraft Technology Modeling and Assessment project within ASCENT, the FAA Center of Excellence for Alternative Jet Fuels and the Environment. Results indicate that, in future scenarios with increasing demand for air travel, it is difficult to reduce CO2 emissions in 2050 to levels equal to or below 2005 levels, although neutral CO2 growth after 2020 may be possible.

Submitted 16 September, 2022; originally announced October 2022.

Comments: Presented at the Council of Engineering Systems Universities (CESUN) conference in 2018

arXiv:2209.06573

Using Spectral Submanifolds for Nonlinear Periodic Control

Authors: Florian Mahlknecht, John Irvin Alora, Shobhit Jain, Edward Schmerling, Riccardo Bonalli, George Haller, Marco Pavone

Abstract: Very high dimensional nonlinear systems arise in many engineering problems due to semi-discretization of the governing partial differential equations, e.g. through finite element methods. The complexity of these systems presents computational challenges for direct application to automatic control. While model reduction has seen ubiquitous applications in control, the use of nonlinear model reduction methods in this setting remains difficult. The problem lies in preserving the structure of the nonlinear dynamics in the reduced order model for high-fidelity control. In this work, we leverage recent advances in Spectral Submanifold (SSM) theory to enable model reduction under well-defined assumptions for the purpose of efficiently synthesizing feedback controllers.

Submitted 14 September, 2022; originally announced September 2022.

Comments: 8 pages, 6 figures, conference on decision and control 2022

arXiv:2209.04235

IEEE 802.11ad Based Joint Radar Communication Transceiver: Design, Prototype and Performance Analysis

Authors: Akanksha Sneh, Soumya Jain, V Sri Sindhu, Shobha Sundar Ram, Sumit Darak

Abstract: Rapid beam alignment is required to support high gain millimeter wave (mmW) communication links between a base station (BS) and mobile users (MU). The standard IEEE 802.11ad protocol enables beam alignment at the BS and MU through a lengthy beam training procedure accomplished through additional packet overhead. However, this results in increased latency and reduced throughput. Auxiliary radar functionality embedded within the communication protocol has been proposed in prior literature to enable rapid beam alignment of communication beams without the requirement of channel overheads. In this work, we propose a complete architectural framework of a joint radar-communication wireless transceiver wherein radar-based detection of MU is realized to enable subsequent narrow beam communication. We provide a software prototype implementation with transceiver design details, signal models and signal processing algorithms. The prototype is experimentally evaluated with realistic simulations in free space and Rician propagation conditions and demonstrated to accelerate the beam alignment by a factor of four while reducing the overall bit error rate (BER), resulting in significant improvement in throughput with respect to standard 802.11ad. Likewise, the radar performance is found to be comparable to commonly used mmW radars.

Submitted 9 September, 2022; originally announced September 2022.

Comments: 14 pages, 13 figures

arXiv:2205.12378

Lyapunov based Stochastic Stability of a Quantum Decision System for Human-Machine Interaction

Authors: Luke Snow, Shashwat Jain, Vikram Krishnamurthy

Abstract: In mathematical psychology, decision makers are modeled using the Lindbladian equations from quantum mechanics to capture important human-centric features such as order effects and violation of the sure thing principle. We consider human-machine interaction involving a quantum decision maker (human) and a controller (machine). Given a sequence of human decisions over time, how can the controller dynamically provide input messages to adapt these decisions so as to converge to a specific decision? We show via novel stochastic Lyapunov arguments how the Lindbladian dynamics of the quantum decision maker can be controlled to converge to a specific decision asymptotically. Our methodology yields a useful mathematical framework for human-sensor decision making. The stochastic Lyapunov results are also of independent interest as they generalize recent results in the literature.

Submitted 24 May, 2022; originally announced May 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2204.00059

arXiv:2204.00059

Lyapunov based Stochastic Stability of Human-Machine Interaction: A Quantum Decision System Approach

Authors: Luke Snow, Shashwat Jain, Vikram Krishnamurthy

Abstract: In mathematical psychology, decision makers are modeled using the Lindbladian equations from quantum mechanics to capture important human-centric features such as order effects and violation of the sure thing principle. We consider human-machine interaction involving a quantum decision maker (human) and a controller (machine). Given a sequence of human decisions over time, how can the controller dynamically provide input messages to adapt these decisions so as to converge to a specific decision? We show via novel stochastic Lyapunov arguments how the Lindbladian dynamics of the quantum decision maker can be controlled to converge to a specific decision asymptotically. Our methodology yields a useful mathematical framework for human-sensor decision making. The stochastic Lyapunov results are also of independent interest as they generalize recent results in the literature.

Submitted 31 March, 2022; originally announced April 2022.

arXiv:2203.14482

doi 10.1109/ISBI52829.2022.9761493

Leveraging Clinically Relevant Biometric Constraints To Supervise A Deep Learning Model For The Accurate Caliper Placement To Obtain Sonographic Measurements Of The Fetal Brain

Authors: Hari Shankar, Adithya Narayan, Shefali Jain, Divya Singh, Pooja Vyas, Nivedita Hegde, Purbayan Kar, Abhi Lad, Jens Thang, Jagruthi Atada, Duy Nguyen, PS Roopa, Akhila Vasudeva, Prathima Radhakrishnan, Sripad Krishna Devalla

Abstract: Multiple studies have demonstrated that obtaining standardized fetal brain biometry from mid-trimester ultrasonography (USG) examination is key for the reliable assessment of fetal neurodevelopment and the screening of central nervous system (CNS) anomalies. Obtaining these measurements is highly subjective, expertise-driven, and requires years of training experience, limiting quality prenatal care for all pregnant mothers. In this study, we propose a deep learning (DL) approach to compute 3 key fetal brain biometry from the 2D USG images of the transcerebellar plane (TC) through the accurate and automated caliper placement (2 per biometry) by modeling it as a landmark detection problem. We leveraged clinically relevant biometric constraints (relationship between caliper points) and domain-relevant data augmentation to improve the accuracy of a U-Net DL model (trained/tested on: 596 images, 473 subjects/143 images, 143 subjects). We performed multiple experiments demonstrating the effect of the DL backbone, data augmentation, generalizability and benchmarked against a recent state-of-the-art approach through extensive clinical validation (DL vs. 7 experienced clinicians). For all cases, the mean errors in the placement of the individual caliper points and the computed biometry were comparable to error rates among clinicians. The clinical translation of the proposed framework can assist novice users from low-resource settings in the reliable and standardized assessment of fetal brain sonograms.

Submitted 31 July, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

Comments: Accepted for presentation at 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI)

arXiv:2202.13553 [pdf, other]

Towards A Device-Independent Deep Learning Approach for the Automated Segmentation of Sonographic Fetal Brain Structures: A Multi-Center and Multi-Device Validation

Authors: Abhi Lad, Adithya Narayan, Hari Shankar, Shefali Jain, Pooja Punjani Vyas, Divya Singh, Nivedita Hegde, Jagruthi Atada, Jens Thang, Saw Shier Nee, Arunkumar Govindarajan, Roopa PS, Muralidhar V Pai, Akhila Vasudeva, Prathima Radhakrishnan, Sripad Krishna Devalla

Abstract: Quality assessment of prenatal ultrasonography is essential for the screening of fetal central nervous system (CNS) anomalies. The interpretation of fetal brain structures is highly subjective, expertise-driven, and requires years of training experience, limiting quality prenatal care for all pregnant mothers. With recent advancement in Artificial Intelligence (AI), specifically deep learning (DL), assistance in precise anatomy identification through semantic segmentation essential for the reliable assessment of growth and neurodevelopment, and detection of structural abnormalities have been proposed. However, existing works only identify certain structures (e.g., cavum septum pellucidum, lateral ventricles, cerebellum) from either of the axial views (transventricular, transcerebellar), limiting the scope for a thorough anatomical assessment as per practice guidelines necessary for the screening of CNS anomalies. Further, existing works do not analyze the generalizability of these DL algorithms across images from multiple ultrasound devices and centers, thus limiting their real-world clinical impact. In this study, we propose a DL based segmentation framework for the automated segmentation of 10 key fetal brain structures from 2 axial planes from fetal brain USG images (2D). We developed a custom U-Net variant that uses inceptionv4 block as a feature extractor and leverages custom domain-specific data augmentation. Quantitatively, the mean (10 structures; test sets 1/2/3/4) Dice-coefficients were: 0.827, 0.802, 0.731, 0.783. Irrespective of the USG device/center, the DL segmentations were qualitatively comparable to their manual segmentations. The proposed DL system offered a promising and generalizable performance (multi-centers, multi-device) and also presents evidence in support of device-induced variation in image quality (a challenge to generalizability) by using UMAP analysis.

Submitted 28 February, 2022; originally announced February 2022.

Comments: SPIE Medical Imaging 2022: Computer Aided Diagnosis (12033-75), 11 pages, 7 figures

arXiv:2110.06123 [pdf, other]

COVID-19 Diagnosis from Cough Acoustics using ConvNets and Data Augmentation

Authors: Saranga Kingkor Mahanta, Darsh Kaushik, Shubham Jain, Hoang Van Truong, Koushik Guha

Abstract: With the periodic rise and fall of COVID-19 and countries being inflicted by its waves, an efficient, economic, and effortless diagnosis procedure for the virus has been the utmost need of the hour. COVID-19 positive individuals may even be asymptomatic making the diagnosis difficult, but amongst the infected subjects, the asymptomatic ones need not be entirely free of symptoms caused by the virus. They might not show any observable symptoms like the symptomatic subjects, but they may differ from uninfected ones in the way they cough. These differences in the coughing sounds are minute and indiscernible to the human ear, however, these can be captured using machine learning-based statistical models. In this paper, we present a deep learning approach to analyze the acoustic dataset provided in Track 1 of the DiCOVA 2021 Challenge containing cough sound recordings belonging to both COVID-19 positive and negative examples. To perform the classification on the sound recordings as belonging to a COVID-19 positive or negative examples, we propose a ConvNet model. Our model achieved an AUC score percentage of 72.23 on the blind test set provided by the same for an unbiased evaluation of the models. The ConvNet model incorporated with Data Augmentation further increased the AUC-ROC percentage from 72.23 to 87.07. It also outperformed the DiCOVA 2021 Challenge's baseline model by 23% thus, claiming the top position on the DiCOVA 2021 Challenge leaderboard. This paper proposes the use of Mel frequency cepstral coefficients as the feature input for the proposed model.

Submitted 3 May, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

Comments: DiCOVA, top 1st, This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2109.14546 [pdf]

An Energy Efficient Health Monitoring Approach with Wireless Body Area Networks

Authors: Seemandhar Jain, Prarthi Jain, Prabhat K. Upadhyay, Jules M. Moualeu, Abhishek Srivastava

Abstract: Wireless Body Area Networks (WBANs) comprise a network of sensors subcutaneously implanted or placed near the body surface and facilitate continuous monitoring of health parameters of a patient. Research endeavours involving WBAN are directed towards effective transmission of detected parameters to a Local Processing Unit (LPU, usually a mobile device) and analysis of the parameters at the LPU or a back-end cloud. An important concern in WBAN is the lightweight nature of WBAN nodes and the need to conserve their energy. This is especially true for subcutaneously implanted nodes that cannot be recharged or regularly replaced. Work in energy conservation is mostly aimed at optimising the routing of signals to minimise energy expended. In this paper, a simple yet innovative approach to energy conservation and detection of alarming health status is proposed. Energy conservation is ensured through a two-tier approach wherein the first tier eliminates 'uninteresting' health parameter readings at the site of a sensing node and prevents these from being transmitted across the WBAN to the LPU. A reading is categorised as uninteresting if it deviates very slightly from its immediately preceding reading and does not provide new insight on the patient's well being. In addition to this, readings that are faulty and emanate from possible sensor malfunctions are also eliminated. These eliminations are done at the site of the sensor using algorithms that are light enough to effectively function in the extremely resource-constrained environments of the sensor nodes. We notice, through experiments, that this eliminates and thus reduces around 90% of the readings that need to be transmitted to the LPU leading to significant energy savings. Furthermore, the proper functioning of these algorithms in such constrained environments is confirmed and validated over a hardware simulation set up. The second tier of assessment includes a proposed anomaly detection model at the LPU that is capable of identifying anomalies from streaming health parameter readings and indicates an adverse medical condition. In addition to being able to handle streaming data, the model works within the resource-constrained environments of an LPU and eliminates the need of transmitting the data to a back-end cloud, ensuring further energy savings. The anomaly detection capability of the model is validated using data available from the critical care units of hospitals and is shown to be superior to other anomaly detection techniques.
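Illustrative note: the first-tier elimination described in this abstract can be sketched roughly as follows. This is a minimal sketch, not the authors' algorithm, which the abstract does not spell out. The function name `filter_readings`, the threshold `min_delta`, and the plausibility bounds `valid_min` and `valid_max` are hypothetical values chosen only for illustration.

```python
from typing import Iterable, List, Optional


def filter_readings(
    readings: Iterable[float],
    min_delta: float = 0.5,    # hypothetical threshold for an "uninteresting" change
    valid_min: float = 30.0,   # hypothetical lower plausibility bound
    valid_max: float = 220.0,  # hypothetical upper plausibility bound
) -> List[float]:
    """Return only the readings a sensor node would transmit to the LPU."""
    transmitted: List[float] = []
    previous: Optional[float] = None
    for value in readings:
        # Drop readings outside the plausible range (possible sensor fault).
        if not (valid_min <= value <= valid_max):
            continue
        # Drop readings that barely differ from the preceding valid reading.
        if previous is not None and abs(value - previous) < min_delta:
            previous = value
            continue
        transmitted.append(value)
        previous = value
    return transmitted


heart_rate = [72.0, 72.1, 72.2, 500.0, 72.3, 80.0, 80.2, 95.0]
print(filter_readings(heart_rate))
```

In this toy trace only three of the eight readings would be sent on, which mirrors the kind of reduction the abstract reports, though the figures here are invented.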

Submitted 27 September, 2021; originally announced September 2021.

Comments: 23 pages, 18 figures. (Full Abstract : https://seemandhar.herokuapp.com/wban)

arXiv:2105.11241 [pdf]

Generation of COVID-19 Chest CT Scan Images using Generative Adversarial Networks

Authors: Prerak Mann, Sahaj Jain, Saurabh Mittal, Aruna Bhat

Abstract: SARS-CoV-2, also known as COVID-19 or Coronavirus, is a viral contagious disease that is infected by a novel coronavirus, and has been rapidly spreading across the globe. It is very important to test and isolate people to reduce spread, and from here comes the need to do this quickly and efficiently. According to some studies, Chest-CT outperforms RT-PCR lab testing, which is the current standard, when diagnosing COVID-19 patients. Due to this, computer vision researchers have developed various deep learning systems that can predict COVID-19 using a Chest-CT scan correctly to a certain degree. The accuracy of these systems is limited since deep learning neural networks such as CNNs (Convolutional Neural Networks) need a significantly large quantity of data for training in order to produce good quality results. Since the disease is relatively recent and more focus has been on CXR (Chest XRay) images, the available chest CT Scan image dataset is much less. We propose a method, by utilizing GANs, to generate synthetic chest CT images of both positive and negative COVID-19 patients. Using a pre-built predictive model, we concluded that around 40% of the generated images are correctly predicted as COVID-19 positive. The dataset thus generated can be used to train a CNN-based classifier which can help determine COVID-19 in a patient with greater accuracy.

Submitted 20 May, 2021; originally announced May 2021.

arXiv:2104.00793 [pdf, ps, other]

Effect of Radiology Report Labeler Quality on Deep Learning Models for Chest X-Ray Interpretation

Authors: Saahil Jain, Akshay Smit, Andrew Y. Ng, Pranav Rajpurkar

Abstract: Although deep learning models for chest X-ray interpretation are commonly trained on labels generated by automatic radiology report labelers, the impact of improvements in report labeling on the performance of chest X-ray classification models has not been systematically investigated. We first compare the CheXpert, CheXbert, and VisualCheXbert labelers on the task of extracting accurate chest X-ray image labels from radiology reports, reporting that the VisualCheXbert labeler outperforms the CheXpert and CheXbert labelers. Next, after training image classification models using labels generated from the different radiology report labelers on one of the largest datasets of chest X-rays, we show that an image classification model trained on labels from the VisualCheXbert labeler outperforms image classification models trained on labels from the CheXpert and CheXbert labelers. Our work suggests that recent improvements in radiology report labeling can translate to the development of higher performing chest X-ray classification models.

Submitted 27 November, 2021; v1 submitted 1 April, 2021; originally announced April 2021.

Comments: In Neural Information Processing Systems (NeurIPS) Workshop on Data-Centric AI (DCAI)

arXiv:2103.16670 [pdf, other]

doi 10.1007/978-3-030-87589-3_58

Contrastive Learning of Single-Cell Phenotypic Representations for Treatment Classification

Authors: Alexis Perakis, Ali Gorji, Samriddhi Jain, Krishna Chaitanya, Simone Rizza, Ender Konukoglu

Abstract: Learning robust representations to discriminate cell phenotypes based on microscopy images is important for drug discovery. Drug development efforts typically analyse thousands of cell images to screen for potential treatments. Early works focus on creating hand-engineered features from these images or learn such features with deep neural networks in a fully or weakly-supervised framework. Both require prior knowledge or labelled datasets. Therefore, subsequent works propose unsupervised approaches based on generative models to learn these representations. Recently, representations learned with self-supervised contrastive loss-based methods have yielded state-of-the-art results on various imaging tasks compared to earlier unsupervised approaches. In this work, we leverage a contrastive learning framework to learn appropriate representations from single-cell fluorescent microscopy images for the task of Mechanism-of-Action classification. The proposed work is evaluated on the annotated BBBC021 dataset, and we obtain state-of-the-art results in NSC, NCSB and drop metrics for an unsupervised approach. We observe an improvement of 10% in NCSB accuracy and 11% in NSC-NSCB drop over the previously best unsupervised method. Moreover, the performance of our unsupervised approach ties with the best supervised approach. Additionally, we observe that our framework performs well even without post-processing, unlike earlier methods. With this, we conclude that one can learn robust cell representations with contrastive learning.

Submitted 30 March, 2021; originally announced March 2021.

Comments: 12 pages, 2 figures, 7 tables. This article is a pre-print and is currently under review at a conference

Journal ref: In: Lian C., Cao X., Rekik I., Xu X., Yan P. (eds) Machine Learning in Medical Imaging. MLMI 2021. Lecture Notes in Computer Science, vol 12966. Springer, Cham

arXiv:2103.00383 [pdf, other]

Brain Signals to Rescue Aphasia, Apraxia and Dysarthria Speech Recognition

Authors: Gautam Krishna, Mason Carnahan, Shilpa Shamapant, Yashitha Surendranath, Saumya Jain, Arundhati Ghosh, Co Tran, Jose del R Millan, Ahmed H Tewfik

Abstract: In this paper, we propose a deep learning-based algorithm to improve the performance of automatic speech recognition (ASR) systems for aphasia, apraxia, and dysarthria speech by utilizing electroencephalography (EEG) features recorded synchronously with aphasia, apraxia, and dysarthria speech. We demonstrate a significant decoding performance improvement by more than 50% during test time for isolated speech recognition task and we also provide preliminary results indicating performance improvement for the more challenging continuous speech recognition task by utilizing EEG features. The results presented in this paper show the first step towards demonstrating the possibility of utilizing non-invasive neural signals to design a real-time robust speech prosthetic for stroke survivors recovering from aphasia, apraxia, and dysarthria. Our aphasia, apraxia, and dysarthria speech-EEG data set will be released to the public to help further advance this interesting and crucial research.

Submitted 17 July, 2021; v1 submitted 27 February, 2021; originally announced March 2021.

Comments: Accepted to IEEE EMBC 2021

arXiv:2102.11467 [pdf, other]

doi 10.1145/3450439.3451862

VisualCheXbert: Addressing the Discrepancy Between Radiology Report Labels and Image Labels

Authors: Saahil Jain, Akshay Smit, Steven QH Truong, Chanh DT Nguyen, Minh-Thanh Huynh, Mudit Jain, Victoria A. Young, Andrew Y. Ng, Matthew P. Lungren, Pranav Rajpurkar

Abstract: Automatic extraction of medical conditions from free-text radiology reports is critical for supervising computer vision models to interpret medical images. In this work, we show that radiologists labeling reports significantly disagree with radiologists labeling corresponding chest X-ray images, which reduces the quality of report labels as proxies for image labels. We develop and evaluate methods to produce labels from radiology reports that have better agreement with radiologists labeling images. Our best performing method, called VisualCheXbert, uses a biomedically-pretrained BERT model to directly map from a radiology report to the image labels, with a supervisory signal determined by a computer vision model trained to detect medical conditions from chest X-ray images. We find that VisualCheXbert outperforms an approach using an existing radiology report labeler by an average F1 score of 0.14 (95% CI 0.12, 0.17). We also find that VisualCheXbert better agrees with radiologists labeling chest X-ray images than do radiologists labeling the corresponding radiology reports by an average F1 score across several medical conditions of between 0.12 (95% CI 0.09, 0.15) and 0.21 (95% CI 0.18, 0.24).

Submitted 15 March, 2021; v1 submitted 22 February, 2021; originally announced February 2021.

Comments: Accepted to ACM Conference on Health, Inference, and Learning (ACM-CHIL) 2021

arXiv:2010.06200 [pdf, other]

End-to-end Triplet Loss based Emotion Embedding System for Speech Emotion Recognition

Authors: Puneet Kumar, Sidharth Jain, Balasubramanian Raman, Partha Pratim Roy, Masakazu Iwamura

Abstract: In this paper, an end-to-end neural embedding system based on triplet loss and residual learning has been proposed for speech emotion recognition. The proposed system learns the embeddings from the emotional information of the speech utterances. The learned embeddings are used to recognize the emotions portrayed by given speech samples of various lengths. The proposed system implements Residual Neural Network architecture. It is trained using softmax pre-training and triplet loss function. The weights between the fully connected and embedding layers of the trained network are used to calculate the embedding values. The embedding representations of various emotions are mapped onto a hyperplane, and the angles among them are computed using the cosine similarity. These angles are utilized to classify a new speech sample into its appropriate emotion class. The proposed system has demonstrated 91.67% and 64.44% accuracy while recognizing emotions for RAVDESS and IEMOCAP dataset, respectively.
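Illustrative note: the final step in this abstract, assigning a new sample to the emotion class whose embedding it is closest to in angle, can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the three-dimensional class embeddings, the label names, and the function names are all hypothetical, and a real system would obtain the embeddings from the trained network.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(sample: Sequence[float],
             class_embeddings: Dict[str, Sequence[float]]) -> str:
    """Pick the label whose embedding has the highest cosine similarity."""
    return max(
        class_embeddings,
        key=lambda label: cosine_similarity(sample, class_embeddings[label]),
    )


# Hypothetical per-emotion reference embeddings (made up for illustration).
class_embeddings = {
    "happy": [1.0, 0.0, 0.0],
    "sad": [0.0, 1.0, 0.0],
    "angry": [0.0, 0.0, 1.0],
}

print(classify([0.9, 0.2, 0.1], class_embeddings))
```

Because cosine similarity ignores vector length, the comparison depends only on direction, which fits the abstract's description of classifying by angle.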

Submitted 13 October, 2020; originally announced October 2020.

Comments: Accepted in ICPR 2020

arXiv:2008.02344 [pdf, ps, other]

Exploiting Temporal Attention Features for Effective Denoising in Videos

Authors: Aryansh Omray, Samyak Jain, Utsav Krishnan, Pratik Chattopadhyay

Abstract: Video Denoising is one of the fundamental tasks of any video processing pipeline. It is different from image denoising due to the temporal aspects of video frames, and any image denoising approach applied to videos will result in flickering. The proposed method makes use of temporal as well as spatial dimensions of video frames as part of a two-stage pipeline. Each stage in the architecture named as Spatio-Temporal Network uses a channel-wise attention mechanism to forward the encoder signal to the decoder side. The Attention Block used in this paper uses soft attention to rank the filters for better training.

Submitted 27 August, 2020; v1 submitted 5 August, 2020; originally announced August 2020.

arXiv:2006.13817 [pdf, other]

Stacked Convolutional Neural Network for Diagnosis of COVID-19 Disease from X-ray Images

Authors: Mahesh Gour, Sweta Jain

Abstract: Automatic and rapid screening of COVID-19 from the chest X-ray images has become an urgent need in this pandemic situation of SARS-CoV-2 worldwide in 2020. However, accurate and reliable screening of patients is a massive challenge due to the discrepancy between COVID-19 and other viral pneumonia in X-ray images. In this paper, we design a new stacked convolutional neural network model for the automatic diagnosis of COVID-19 disease from the chest X-ray images. We obtain different sub-models from the VGG19 and developed a 30-layered CNN model (named as CovNet30) during the training, and obtained sub-models are stacked together using logistic regression. The proposed CNN model combines the discriminating power of the different CNN's sub-models and classifies chest X-ray images into COVID-19, Normal, and Pneumonia classes. In addition, we generate X-ray images dataset referred to as COVID19CXr, which includes 2764 chest x-ray images of 1768 patients from the three publicly available data repositories. The proposed stacked CNN achieves an accuracy of 92.74%, the sensitivity of 93.33%, PPV of 92.13%, and F1-score of 0.93 for the classification of X-ray images. Our proposed approach shows its superiority over the existing methods for the diagnosis of the COVID-19 from the X-ray images.

Submitted 22 June, 2020; originally announced June 2020.

Comments: 6 tables, 4 figures

arXiv:2005.08834 [pdf, other]

Designing Just-in-Time Detection for Gamified Fitness Frameworks

Authors: Slobodan Milanko, Alexander Launi, Shubham Jain

Abstract: This paper presents our findings from a multi-year effort to detect motion events early using inertial sensors in real-world settings. We believe early event detection is the next step in advancing motion tracking, and can enable just-in-time interventions, particularly for mHealth applications. Our system targets strength training workouts in the fitness domain, where users perform well-defined movements for each exercise, while wearing an inertial sensor. We collect data for 20 exercises across 12 users over 26 months. We propose an algorithm to detect repetitions before they end, to allow a user to visualize movement derived metrics in real-time. We further develop a gamified approach to display this information to the user and encourage them to perform consistent movements. Participants in a feasibility study find the gamified feedback useful in improving their form. Our system can detect repetition events as early as 500 ms before it ends, which is 2x faster and more accurate than state-of-the-art trackers. We believe our approach will open exciting avenues for tracking, detection, and gamification for fitness frameworks.

Submitted 18 May, 2020; originally announced May 2020.

arXiv:2004.04736 [pdf, other]

Capsules for Biomedical Image Segmentation

Authors: Rodney LaLonde, Ziyue Xu, Ismail Irmakci, Sanjay Jain, Ulas Bagci

Abstract: Our work expands the use of capsule networks to the task of object segmentation for the first time in the literature. This is made possible via the introduction of locally-constrained routing and transformation matrix sharing, which reduces the parameter/memory burden and allows for the segmentation of objects at large resolutions. To compensate for the loss of global information in constraining the routing, we propose the concept of "deconvolutional" capsules to create a deep encoder-decoder style network, called SegCaps. We extend the masked reconstruction regularization to the task of segmentation and perform thorough ablation experiments on each component of our method. The proposed convolutional-deconvolutional capsule network, SegCaps, shows state-of-the-art results while using a fraction of the parameters of popular segmentation networks. To validate our proposed method, we perform experiments segmenting pathological lungs from clinical and pre-clinical thoracic computed tomography (CT) scans and segmenting muscle and adipose (fat) tissue from magnetic resonance imaging (MRI) scans of human subjects' thighs. Notably, our experiments in lung segmentation represent the largest-scale study in pathological lung segmentation in the literature, where we conduct experiments across five extremely challenging datasets, containing both clinical and pre-clinical subjects, and nearly 2000 computed-tomography scans. Our newly developed segmentation platform outperforms other methods across all datasets while utilizing less than 5% of the parameters in the popular U-Net for biomedical image segmentation. Further, we demonstrate capsules' ability to generalize to unseen rotations/reflections on natural images.

Submitted 10 December, 2020; v1 submitted 8 April, 2020; originally announced April 2020.

Comments: Extension of the non-archival Capsules of Object Segmentation with experiments on both clinical and pre-clinical pathological lung segmentation from CT scans and muscular and adipose tissue segmentation from MR images. Accepted for publication in Medical Image Analysis. DOI: https://doi.org/10.1016/j.media.2020.101889. arXiv admin note: text overlap with arXiv:1804.04241

arXiv:2003.08809 [pdf, other]

Morphological Reconstruction of Detached Dendritic Spines via Geodesic Path Prediction

Authors: Sammit Jain, Suvadip Mukherjee, Lydia Danglot, Jean-Christophe Olivo-Marin

Abstract: Morphological reconstruction of dendritic spines from fluorescent microscopy is a critical open problem in neuro-image analysis. Existing segmentation tools are ill-equipped to handle thin spines with long, poorly illuminated neck membranes. We address this issue, and introduce an unsupervised path prediction technique based on a stochastic framework which seeks the optimal solution from a path-space of possible spine neck reconstructions. Our method is specifically designed to reduce bias due to outliers, and is adept at reconstructing challenging shapes from images plagued by noise and poor contrast. Experimental analyses on two photon microscopy data demonstrate the efficacy of our method, where an improvement of 12.5% is observed over the state-of-the-art in terms of mean absolute reconstruction error.

Submitted 21 September, 2020; v1 submitted 19 March, 2020; originally announced March 2020.

Comments: S. Jain and S. Mukherjee contributed equally to this work. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2002.12868 [pdf]

doi 10.1681/ASN.2020050652

Neural Network Segmentation of Interstitial Fibrosis, Tubular Atrophy, and Glomerulosclerosis in Renal Biopsies

Authors: Brandon Ginley, Kuang-Yu Jen, Avi Rosenberg, Felicia Yen, Sanjay Jain, Agnes Fogo, Pinaki Sarder

Abstract: Glomerulosclerosis, interstitial fibrosis, and tubular atrophy (IFTA) are histologic indicators of irrecoverable kidney injury. In standard clinical practice, the renal pathologist visually assesses, under the microscope, the percentage of sclerotic glomeruli and the percentage of renal cortical involvement by IFTA. Estimation of IFTA is a subjective process due to a varied spectrum and definition of morphological manifestations. Modern artificial intelligence and computer vision algorithms have the ability to reduce inter-observer variability through rigorous quantitation. In this work, we apply convolutional neural networks for the segmentation of glomerulosclerosis and IFTA in periodic acid-Schiff stained renal biopsies. The convolutional network approach achieves high performance in intra-institutional holdout data, and achieves moderate performance in inter-institutional holdout data, which the network had never seen in training. The convolutional approach demonstrated interesting properties, such as learning to predict regions better than the provided ground truth as well as developing its own conceptualization of segmental sclerosis. Subsequent estimations of IFTA and glomerulosclerosis percentages showed high correlation with ground truth.

Submitted 28 February, 2020; originally announced February 2020.

arXiv:2002.11151 [pdf, other]

TxSim: Modeling Training of Deep Neural Networks on Resistive Crossbar Systems

Authors: Sourjya Roy, Shrihari Sridharan, Shubham Jain, Anand Raghunathan

Abstract: Resistive crossbars have attracted significant interest in the design of Deep Neural Network (DNN) accelerators due to their ability to natively execute massively parallel vector-matrix multiplications within dense memory arrays. However, crossbar-based computations face a major challenge due to a variety of device and circuit-level non-idealities, which manifest as errors in the vector-matrix multiplications and eventually degrade DNN accuracy. To address this challenge, there is a need for tools that can model the functional impact of non-idealities on DNN training and inference. Existing efforts towards this goal are either limited to inference, or are too slow to be used for large-scale DNN training. We propose TxSim, a fast and customizable modeling framework to functionally evaluate DNN training on crossbar-based hardware considering the impact of non-idealities. The key features of TxSim that differentiate it from prior efforts are: (i) it comprehensively models non-idealities during all training operations (forward propagation, backward propagation, and weight update) and (ii) it achieves computational efficiency by mapping crossbar evaluations to well-optimized BLAS routines and incorporates speedup techniques to further reduce simulation time with minimal impact on accuracy. TxSim achieves orders-of-magnitude improvement in simulation speed over prior works, and thereby makes it feasible to evaluate training of large-scale DNNs on crossbars. Our experiments using TxSim reveal that the accuracy degradation in DNN training due to non-idealities can be substantial (3%-10%) for large-scale DNNs, underscoring the need for further research in mitigation techniques. We also analyze the impact of various device and circuit-level parameters and the associated non-idealities to provide key insights that can guide the design of crossbar-based DNN training accelerators.

Submitted 7 January, 2021; v1 submitted 25 February, 2020; originally announced February 2020.

arXiv:1908.01134 [pdf, ps, other]

A Fuzzy Edge Detector Driven Telegraph Total Variation Model For Image Despeckling

Authors: Sudeb Majee, Subit K Jain, Rajendra K Ray, Ananta K Majee

Abstract: Speckle noise suppression is a challenging and crucial pre-processing stage for higher-level image analysis. In this work, a new attempt has been made using the telegraph total variation equation and fuzzy set theory for speckle noise suppression. The intuitionistic fuzzy divergence (IFD) function has been used to distinguish between edges and noise. To the best of the authors' knowledge, most of the studies on the multiplicative speckle noise removal process focus only on diffusion-based filters, and little attention has been paid to the study of fuzzy set theory. The proposed approach enjoys the benefits of both the telegraph total variation equation and the fuzzy edge detector, and is not only robust to noise but also preserves image structural details. Moreover, we establish the existence and uniqueness of a weak solution of the regularized version of the proposed model using the Schauder fixed point theorem. With the proposed model, despeckling is carried out on natural and Synthetic Aperture Radar (SAR) images. The experimental results of the proposed model are reported and are found to be better in terms of noise suppression and detail/edge preservation than those of the existing approaches.

Submitted 5 August, 2019; v1 submitted 3 August, 2019; originally announced August 2019.

Comments: 19 pages, 4 figures, 3 tables

arXiv:1812.07509 [pdf]

doi 10.1038/s42256-019-0018-3

Iterative annotation to ease neural network training: Specialized machine learning in medical image analysis

Authors: Brendon Lutnick, Brandon Ginley, Darshana Govind, Sean D. McGarry, Peter S. LaViolette, Rabi Yacoub, Sanjay Jain, John E. Tomaszewski, Kuang-Yu Jen, Pinaki Sarder

Abstract: Neural networks promise to bring robust, quantitative analysis to medical fields, but adoption is limited by the technicalities of training these networks. To address this translation gap between medical researchers and neural networks in the field of pathology, we have created an intuitive interface which utilizes the commonly used whole slide image (WSI) viewer, Aperio ImageScope (Leica Biosystems Imaging, Inc.), for the annotation and display of neural network predictions on WSIs. Leveraging this, we propose the use of a human-in-the-loop strategy to reduce the burden of WSI annotation. We track network performance improvements as a function of iteration and quantify the use of this pipeline for the segmentation of renal histologic findings on WSIs. More specifically, we present network performance when applied to segmentation of renal micro compartments, and demonstrate multi-class segmentation in human and mouse renal tissue slides. Finally, to show the adaptability of this technique to other medical imaging fields, we demonstrate its ability to iteratively segment human prostate glands from radiology imaging data.

Submitted 18 December, 2018; originally announced December 2018.

Comments: 15 pages, 7 figures, 2 supplemental figures (on the last page)

Journal ref: Nature Machine Intelligence 1.2 (2019): 112

Showing 1–35 of 35 results for author: Jain, S