-
PlateSegFL: A Privacy-Preserving License Plate Detection Using Federated Segmentation Learning
Authors:
Md. Shahriar Rahman Anuvab,
Mishkat Sultana,
Md. Atif Hossain,
Shashwata Das,
Suvarthi Chowdhury,
Rafeed Rahman,
Dibyo Fabian Dofadar,
Shahriar Rahman Rana
Abstract:
Automatic License Plate Recognition (ALPR) is an integral component of an intelligent transport system with extensive applications in secure transportation, vehicle-to-vehicle communication, stolen vehicle detection, traffic violations, and traffic flow management. Existing license plate detection systems focus on one-shot learners or pre-trained models that operate with a geometric bounding box, limiting the model's performance. Furthermore, continuous video data streams uploaded to the central server result in network and complexity issues. To combat this, PlateSegFL was introduced, which implements U-Net-based segmentation along with Federated Learning (FL). U-Net is well-suited for multi-class image segmentation tasks because it can analyze a large number of classes and generate a pixel-level segmentation map for each class. Federated Learning is used to reduce the quantity of data required while safeguarding the user's privacy. Different computing platforms, such as mobile phones, are able to collaborate on the development of a standard prediction model that makes efficient use of time, incorporates more diverse data, delivers predictions in real time, and requires no physical effort from the user, resulting in an F1 score of around 95%.
Submitted 7 April, 2024;
originally announced April 2024.
-
A Lightweight Attention-based Deep Network via Multi-Scale Feature Fusion for Multi-View Facial Expression Recognition
Authors:
Ali Ezati,
Mohammadreza Dezyani,
Rajib Rana,
Roozbeh Rajabi,
Ahmad Ayatollahi
Abstract:
Convolutional neural networks (CNNs) and their variations have shown effectiveness in facial expression recognition (FER). However, they face challenges when dealing with high computational complexity and multi-view head poses in real-world scenarios. We introduce a lightweight attentional network incorporating multi-scale feature fusion (LANMSFF) to tackle these issues. For the first challenge, we have carefully designed a lightweight fully convolutional network (FCN). We address the second challenge by presenting two novel components, namely mass attention (MassAtt) and point wise feature selection (PWFS) blocks. The MassAtt block simultaneously generates channel and spatial attention maps to recalibrate feature maps by emphasizing important features while suppressing irrelevant ones. On the other hand, the PWFS block employs a feature selection mechanism that discards less meaningful features prior to the fusion process. This mechanism distinguishes it from previous methods that directly fuse multi-scale features. Our proposed approach achieved results comparable to state-of-the-art methods in terms of parameter counts and robustness to pose variation, with accuracy rates of 90.77% on KDEF, 70.44% on FER-2013, and 86.96% on FERPlus datasets. The code for LANMSFF is available at https://github.com/AE-1129/LANMSFF.
Submitted 21 March, 2024;
originally announced March 2024.
-
emoDARTS: Joint Optimisation of CNN & Sequential Neural Network Architectures for Superior Speech Emotion Recognition
Authors:
Thejan Rajapakshe,
Rajib Rana,
Sara Khalifa,
Berrak Sisman,
Bjorn W. Schuller,
Carlos Busso
Abstract:
Speech Emotion Recognition (SER) is crucial for enabling computers to understand the emotions conveyed in human communication. With recent advancements in Deep Learning (DL), the performance of SER models has significantly improved. However, designing an optimal DL architecture requires specialised knowledge and experimental assessments. Fortunately, Neural Architecture Search (NAS) provides a potential solution for automatically determining the best DL model. The Differentiable Architecture Search (DARTS) is a particularly efficient method for discovering optimal models. This study presents emoDARTS, a DARTS-optimised joint CNN and Sequential Neural Network (SeqNN: LSTM, RNN) architecture that enhances SER performance. The literature supports the selection of CNN and LSTM coupling to improve performance.
While DARTS has previously been used to choose CNN and LSTM operations independently, our technique adds a novel mechanism for selecting CNN and SeqNN operations in conjunction using DARTS. Unlike earlier work, we do not impose limits on the layer order of the CNN. Instead, we let DARTS choose the best layer order inside the DARTS cell. We demonstrate that emoDARTS outperforms conventionally designed CNN-LSTM models and surpasses the best-reported SER results achieved through DARTS on CNN-LSTM by evaluating our approach on the IEMOCAP, MSP-IMPROV, and MSP-Podcast datasets.
Submitted 20 March, 2024;
originally announced March 2024.
-
BFT-PoLoc: A Byzantine Fortified Trigonometric Proof of Location Protocol using Internet Delays
Authors:
Peiyao Sheng,
Vishal Sevani,
Ranvir Rana,
Himanshu Tyagi,
Pramod Viswanath
Abstract:
Internet platforms depend on accurately determining the geographical locations of online users to deliver targeted services (e.g., advertising). The advent of decentralized platforms (blockchains) emphasizes the importance of geographically distributed nodes, making the validation of locations more crucial. In these decentralized settings, mutually non-trusting participants need to prove their locations to each other. The incentives for claiming a desired location include decentralization properties (validators of a blockchain), explicit rewards for improving coverage (physical infrastructure blockchains), and regulatory compliance; these incentives entice participants to maliciously misreport their true location via VPNs, by tampering with internet delays, or by compromising other parties (challengers) to misrepresent their location. Traditional delay-based geolocation methods focus on reducing the noise in measurements and are very vulnerable to wilful divergences from the prescribed protocol.
In this paper we use Internet delay measurements to securely prove the location of IP addresses while being immune to a large fraction of Byzantine actions. Our core methods are to endow Internet telemetry tools (e.g., ping) with cryptographic primitives (signatures and hash functions) together with Byzantine resistant data inferences subject to Euclidean geometric constraints. We introduce two new networking protocols, robust against Byzantine actions: Proof of Internet Geometry (PoIG) converts delay measurements into precise distance estimates across the Internet; Proof of Location (PoLoc) enables accurate and efficient multilateration of a specific IP address. The key algorithmic innovations are in conducting "Byzantine fortified trigonometry" (BFT) inferences of data, endowing low rank matrix completion methods with Byzantine resistance.
Submitted 28 March, 2024; v1 submitted 19 March, 2024;
originally announced March 2024.
-
Proof of Diligence: Cryptoeconomic Security for Rollups
Authors:
Peiyao Sheng,
Ranvir Rana,
Himanshu Tyagi,
Pramod Viswanath
Abstract:
Layer 1 (L1) blockchains such as Ethereum are secured under an "honest supermajority of stake" assumption for a large pool of validators who verify each and every transaction on it. This high security comes at a scalability cost which not only affects the throughput of the blockchain but also results in high gas fees for executing transactions on chain. The most successful solution for this problem is provided by optimistic rollups, Layer 2 (L2) blockchains that execute transactions outside L1 but post the transaction data on L1. The security for such L2 chains is argued, informally, under the assumption that a set of nodes will check the transaction data posted on L1 and raise an alarm (a fraud proof) if faulty transactions are detected. However, all current deployments lack a proper incentive mechanism for ensuring that these nodes will do their job "diligently", and simply rely on a cursory incentive alignment argument for security. We solve this problem by introducing an incentivized watchtower network designed to serve as the first line of defense for rollups. Our main contribution is a "Proof of Diligence" protocol that requires watchtowers to continuously provide a proof that they have verified L2 assertions and get rewarded for the same. The Proof of Diligence protocol includes a carefully designed incentive mechanism that is provably secure when watchtowers are rational actors, under a mild rational independence assumption.
Our proposed system is now live on Ethereum testnet. We deployed a watchtower network and implemented Proof of Diligence for multiple optimistic rollups. We extract execution as well as inclusion proofs for transactions as a part of the bounty. Each watchtower has minimal additional computational overhead beyond access to standard L1 and L2 RPC nodes.
Submitted 11 February, 2024;
originally announced February 2024.
-
ZeroSwap: Data-driven Optimal Market Making in DeFi
Authors:
Viraj Nadkarni,
Jiachen Hu,
Ranvir Rana,
Chi Jin,
Sanjeev Kulkarni,
Pramod Viswanath
Abstract:
Automated Market Makers (AMMs) are major centers of matching liquidity supply and demand in Decentralized Finance. Their functioning relies primarily on the presence of liquidity providers (LPs) incentivized to invest their assets into a liquidity pool. However, the prices at which a pooled asset is traded are often more stale than the prices on centralized and more liquid exchanges. This leads to the LPs suffering losses to arbitrage. This problem is addressed by adapting market prices to trader behavior, captured via the classical market microstructure model of Glosten and Milgrom. In this paper, we propose the first optimal Bayesian and the first model-free data-driven algorithm to optimally track the external price of the asset. The notion of optimality that we use enforces a zero-profit condition on the prices of the market maker, hence the name ZeroSwap. This ensures that the market maker balances losses to informed traders with profits from noise traders. The key property of our approach is the ability to estimate the external market price without the need for price oracles or loss oracles. Our theoretical guarantees on the performance of both these algorithms, ensuring the stability and convergence of their price recommendations, are of independent interest in the theory of reinforcement learning. We empirically demonstrate the robustness of our algorithms to changing market conditions.
Submitted 29 April, 2024; v1 submitted 13 October, 2023;
originally announced October 2023.
-
DATT: Deep Adaptive Trajectory Tracking for Quadrotor Control
Authors:
Kevin Huang,
Rwik Rana,
Alexander Spitzer,
Guanya Shi,
Byron Boots
Abstract:
Precise arbitrary trajectory tracking for quadrotors is challenging due to unknown nonlinear dynamics, trajectory infeasibility, and actuation limits. To tackle these challenges, we present Deep Adaptive Trajectory Tracking (DATT), a learning-based approach that can precisely track arbitrary, potentially infeasible trajectories in the presence of large disturbances in the real world. DATT builds on a novel feedforward-feedback-adaptive control structure trained in simulation using reinforcement learning. When deployed on real hardware, DATT is augmented with a disturbance estimator using L1 adaptive control in closed-loop, without any fine-tuning. DATT significantly outperforms competitive adaptive nonlinear and model predictive controllers for both feasible smooth and infeasible trajectories in unsteady wind fields, including challenging scenarios where baselines completely fail. Moreover, DATT can efficiently run online with an inference time of less than 3.2 ms, less than 1/4 of that of the adaptive nonlinear model predictive control baseline.
Submitted 13 December, 2023; v1 submitted 13 October, 2023;
originally announced October 2023.
-
Integrating Contrastive Learning into a Multitask Transformer Model for Effective Domain Adaptation
Authors:
Chung-Soo Ahn,
Jagath C. Rajapakse,
Rajib Rana
Abstract:
While speech emotion recognition (SER) research has made significant progress, achieving generalization across various corpora continues to pose a problem. We propose a novel domain adaptation technique that embodies a multitask framework with SER as the primary task, and contrastive learning and information maximisation loss as auxiliary tasks, underpinned by fine-tuning of transformers pre-trained on large language models. Empirical results obtained through experiments on well-established datasets such as IEMOCAP and MSP-IMPROV illustrate that our proposed model achieves state-of-the-art performance in SER within cross-corpus scenarios.
Submitted 7 October, 2023;
originally announced October 2023.
-
Deep Model Predictive Optimization
Authors:
Jacob Sacks,
Rwik Rana,
Kevin Huang,
Alex Spitzer,
Guanya Shi,
Byron Boots
Abstract:
A major challenge in robotics is to design robust policies which enable complex and agile behaviors in the real world. On one end of the spectrum, we have model-free reinforcement learning (MFRL), which is incredibly flexible and general but often results in brittle policies. In contrast, model predictive control (MPC) continually re-plans at each time step to remain robust to perturbations and model inaccuracies. However, despite its real-world successes, MPC often under-performs the optimal strategy. This is due to model quality, myopic behavior from short planning horizons, and approximations due to computational constraints. And even with a perfect model and enough compute, MPC can get stuck in bad local optima, depending heavily on the quality of the optimization algorithm. To this end, we propose Deep Model Predictive Optimization (DMPO), which learns the inner-loop of an MPC optimization algorithm directly via experience, specifically tailored to the needs of the control problem. We evaluate DMPO on a real quadrotor agile trajectory tracking task, on which it improves performance over a baseline MPC algorithm for a given computational budget. It can outperform the best MPC algorithm by up to 27% with fewer samples and an end-to-end policy trained with MFRL by 19%. Moreover, because DMPO requires fewer samples, it can also achieve these benefits with 4.3X less memory. When we subject the quadrotor to turbulent wind fields with an attached drag plate, DMPO can adapt zero-shot while still outperforming all baselines. Additional results can be found at https://tinyurl.com/mr2ywmnw.
Submitted 6 October, 2023;
originally announced October 2023.
-
Inequivalent $Z_2^n$-graded brackets, $n$-bit parastatistics and statistical transmutations of supersymmetric quantum mechanics
Authors:
M. M. Balbino,
I. P. de Freitas,
R. G. Rana,
F. Toppan
Abstract:
Given an associative ring of $Z_2^n$-graded operators, the number of inequivalent brackets of Lie-type which are compatible with the grading and satisfy graded Jacobi identities is $b_n= n+\lfloor n/2\rfloor+1$. This follows from the Rittenberg-Wyler and Scheunert analysis of "color" Lie (super)algebras which is revisited here in terms of Boolean logic gates. The inequivalent brackets, recovered from $Z_2^n\times Z_2^n\rightarrow Z_2$ mappings, are defined by consistent sets of commutators/anticommutators describing particles accommodated into an $n$-bit parastatistics (ordinary bosons/fermions correspond to $1$ bit). Depending on the given graded Lie (super)algebra, its graded sectors can fall into different classes of equivalence expressing different types of (para)bosons and/or (para)fermions. As a first application we construct $Z_2^2$ and $Z_2^3$-graded quantum Hamiltonians which respectively admit $b_2=4$ and $b_3=5$ inequivalent multiparticle quantizations (the inequivalent parastatistics are discriminated by measuring the eigenvalues of certain observables in some given states). As a main physical application we prove that the $N$-extended, $1D$ supersymmetric and superconformal quantum mechanics, for $N=1,2,4,8$, are respectively described by $s_{N}=2,6,10,14$ alternative formulations based on the inequivalent graded Lie (super)algebras. These numbers correspond to all possible "statistical transmutations" of a given set of supercharges which, for ${N}=1,2,4,8$, are accommodated into a $Z_2^n$-grading with $n=1,2,3,4$ (the identification is $N= 2^{n-1}$). In the simplest ${N}=2$ setting (the $2$-particle sector of the de DFF deformed oscillator with $sl(2|1)$ spectrum-generating superalgebra), the $Z_2^2$-graded parastatistics imply a degeneration of the energy levels which cannot be reproduced by ordinary bosons/fermions statistics.
Submitted 2 September, 2023;
originally announced September 2023.
-
Architecture Optimization Dramatically Improves Reverse Bias Stability in Perovskite Solar Cells: A Role of Polymer Hole Transport Layers
Authors:
Fangyuan Jiang,
Yangwei Shi,
Tanka R. Rana,
Daniel Morales,
Isaac Gould,
Declan P. McCarthy,
Joel Smith,
Grey Christoforo,
Hannah Contreras,
Stephen Barlow,
Aditya D. Mohite,
Henry Snaith,
Seth R. Marder,
J. Devin MacKenzie,
Michael D. McGehee,
David S. Ginger
Abstract:
We report that device architecture engineering has a substantial impact on the reverse bias instability that has been reported as a critical issue in commercializing perovskite solar cells. We demonstrate breakdown voltages exceeding -15 V in typical pin structured perovskite solar cells via two steps: i) using polymer hole transporting materials; ii) using a more electrochemically stable gold electrode. While device degradation can be exacerbated by higher reverse bias and prolonged exposure, our as-fabricated perovskite solar cells completely recover their performance even after stressing at -7 V for 9 hours both in the dark and under partial illumination. Following these observations, we systematically discuss and compare the reverse bias driven degradation pathways in perovskite solar cells with different device architectures. Our model highlights the role of electrochemical reaction rates and species in dictating the reverse bias stability of perovskite solar cells.
Submitted 15 August, 2023;
originally announced August 2023.
-
SAKSHI: Decentralized AI Platforms
Authors:
Suma Bhat,
Canhui Chen,
Zerui Cheng,
Zhixuan Fang,
Ashwin Hebbar,
Sreeram Kannan,
Ranvir Rana,
Peiyao Sheng,
Himanshu Tyagi,
Pramod Viswanath,
Xuechao Wang
Abstract:
Large AI models (e.g., Dall-E, GPT4) have electrified the scientific, technological and societal landscape through their superhuman capabilities. These services are offered largely in a traditional web2.0 format (e.g., OpenAI's GPT4 service). As more large AI models proliferate (personalizing and specializing to a variety of domains), there is a tremendous need to have a neutral trust-free platform that allows the hosting of AI models, clients receiving AI services efficiently, yet in a trust-free, incentive compatible, Byzantine behavior resistant manner. In this paper we propose SAKSHI, a trust-free decentralized platform specifically suited for AI services. The key design principles of SAKSHI are the separation of the data path (where AI query and service is managed) and the control path (where routers and compute and storage hosts are managed) from the transaction path (where the metering and billing of services are managed over a blockchain). This separation is enabled by a "proof of inference" layer which provides cryptographic resistance against a variety of misbehaviors, including poor AI service, nonpayment for service, copying of AI models. This is joint work between multiple universities (Princeton University, University of Illinois at Urbana-Champaign, Tsinghua University, HKUST) and two startup companies (Witness Chain and Eigen Layer).
Submitted 31 July, 2023;
originally announced July 2023.
-
Natural Language Processing in Electronic Health Records in Relation to Healthcare Decision-making: A Systematic Review
Authors:
Elias Hossain,
Rajib Rana,
Niall Higgins,
Jeffrey Soar,
Prabal Datta Barua,
Anthony R. Pisani,
Ph. D,
Kathryn Turner
Abstract:
Background: Natural Language Processing (NLP) is widely used to extract clinical insights from Electronic Health Records (EHRs). However, the lack of annotated data, automated tools, and other challenges hinder the full utilisation of NLP for EHRs. Various Machine Learning (ML), Deep Learning (DL) and NLP techniques are studied and compared to understand the limitations and opportunities in this space comprehensively.
Methodology: After screening 261 articles from 11 databases, we included 127 papers for full-text review covering seven categories of articles: 1) medical note classification, 2) clinical entity recognition, 3) text summarisation, 4) deep learning (DL) and transfer learning architecture, 5) information extraction, 6) Medical language translation and 7) other NLP applications. This study follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.
Result and Discussion: EHR was the most commonly used data type among the selected articles, and the datasets were primarily unstructured. Various ML and DL methods were used, with prediction or classification being the most common application of ML or DL. The most common use cases were: the International Classification of Diseases, Ninth Revision (ICD-9) classification, clinical note analysis, and named entity recognition (NER) for clinical descriptions and research on psychiatric disorders.
Conclusion: We find that the adopted ML models were not adequately assessed. In addition, the data imbalance problem is quite important, and techniques to address this underlying problem are still needed. Future studies should address key limitations in studies, primarily identifying Lupus Nephritis, Suicide Attempts, perinatal self-harm and ICD-9 classification.
Submitted 22 June, 2023;
originally announced June 2023.
-
Enhancing Speech Emotion Recognition Through Differentiable Architecture Search
Authors:
Thejan Rajapakshe,
Rajib Rana,
Sara Khalifa,
Berrak Sisman,
Björn Schuller
Abstract:
Speech Emotion Recognition (SER) is a critical enabler of emotion-aware communication in human-computer interactions. Recent advancements in Deep Learning (DL) have substantially enhanced the performance of SER models through increased model complexity. However, designing optimal DL architectures requires prior experience and experimental evaluations. Encouragingly, Neural Architecture Search (NAS) offers a promising avenue to determine an optimal DL model automatically. In particular, Differentiable Architecture Search (DARTS) is an efficient method of using NAS to search for optimised models. This paper proposes a DARTS-optimised joint CNN and LSTM architecture, to improve SER performance, where the literature informs the selection of CNN and LSTM coupling to offer improved performance. While DARTS has previously been applied to CNN and LSTM combinations, our approach introduces a novel mechanism, particularly in selecting CNN operations using DARTS. In contrast to previous studies, we refrain from imposing constraints on the order of the layers for the CNN within the DARTS cell; instead, we allow DARTS to determine the optimal layer order autonomously. Experimenting with the IEMOCAP and MSP-IMPROV datasets, we demonstrate that our proposed methodology achieves significantly higher SER accuracy than hand-engineering the CNN-LSTM configuration. It also outperforms the best-reported SER results achieved using DARTS on CNN-LSTM.
Submitted 18 January, 2024; v1 submitted 23 May, 2023;
originally announced May 2023.
-
AI-Based Emotion Recognition: Promise, Peril, and Prescriptions for Prosocial Path
Authors:
Siddique Latif,
Hafiz Shehbaz Ali,
Muhammad Usama,
Rajib Rana,
Björn Schuller,
Junaid Qadir
Abstract:
Automated emotion recognition (AER) technology can detect humans' emotional states in real-time using facial expressions, voice attributes, text, body movements, and neurological signals and has a broad range of applications across many sectors. It helps businesses get a much deeper understanding of their customers, enables monitoring of individuals' moods in healthcare, education, or the automotive industry, and enables identification of violence and threat in forensics, to name a few. However, AER technology also risks using artificial intelligence (AI) to interpret sensitive human emotions. It can be used for economic and political power and against individual rights. Human emotions are highly personal, and users have justifiable concerns about privacy invasion, emotional manipulation, and bias. In this paper, we present the promises and perils of AER applications. We discuss the ethical challenges related to the data and AER systems and highlight the prescriptions for prosocial perspectives for future AER applications. We hope this work will help AI researchers and developers design prosocial AER applications.
Submitted 14 November, 2022;
originally announced November 2022.
-
Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering
Authors:
Shamane Siriwardhana,
Rivindu Weerasekera,
Elliott Wen,
Tharindu Kaluarachchi,
Rajib Rana,
Suranga Nanayakkara
Abstract:
Retrieval Augmented Generation (RAG) is a recent advancement in Open-Domain Question Answering (ODQA). RAG has only been trained and explored with a Wikipedia-based external knowledge base and is not optimized for use in other specialized domains such as healthcare and news. In this paper, we evaluate the impact of joint training of the retriever and generator components of RAG for the task of domain adaptation in ODQA. We propose RAG-end2end, an extension to RAG, that can adapt to a domain-specific knowledge base by updating all components of the external knowledge base during training. In addition, we introduce an auxiliary training signal to inject more domain-specific knowledge. This auxiliary signal forces RAG-end2end to reconstruct a given sentence by accessing the relevant information from the external knowledge base. Our novel contribution is that, unlike RAG, RAG-end2end jointly trains the retriever and generator for the end QA task and domain adaptation. We evaluate our approach with datasets from three domains: COVID-19, News, and Conversations, and achieve significant performance improvements compared to the original RAG model. Our work has been open-sourced through the Huggingface Transformers library, attesting to our work's credibility and technical consistency.
Submitted 5 October, 2022;
originally announced October 2022.
-
Optimal Bootstrapping of PoW Blockchains
Authors:
Ranvir Rana,
Dimitris Karakostas,
Sreeram Kannan,
Aggelos Kiayias,
Pramod Viswanath
Abstract:
Proof of Work (PoW) blockchains are susceptible to adversarial majority mining attacks in the early stages due to incipient participation and corresponding low net hash power. Bootstrapping ensures safety and liveness during the transient stage by protecting against a majority mining attack, allowing a PoW chain to grow the participation base and corresponding mining hash power. Liveness is especially important since a loss of liveness will lead to loss of honest mining rewards, decreasing honest participation, hence creating an undesired spiral; indeed existing bootstrapping mechanisms offer especially weak liveness guarantees.
In this paper, we propose Advocate, a new bootstrapping methodology, which achieves two main results: (a) optimal liveness and low latency under a super-majority adversary for the Nakamoto longest chain protocol and (b) immediate black-box generalization to a variety of parallel-chain based scaling architectures, including OHIE and Prism. We demonstrate via a full-stack implementation the robustness of Advocate under a 90% adversarial majority.
Submitted 22 August, 2022;
originally announced August 2022.
-
Speech Synthesis with Mixed Emotions
Authors:
Kun Zhou,
Berrak Sisman,
Rajib Rana,
B. W. Schuller,
Haizhou Li
Abstract:
Emotional speech synthesis aims to synthesize human voices with various emotional effects. The current studies are mostly focused on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between the speech samples of different emotions. We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework. During training, the framework not only explicitly characterizes emotion styles, but also explores the ordinal nature of emotions by quantifying the differences with other emotions. At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector. The objective and subjective evaluations have validated the effectiveness of the proposed framework. To the best of our knowledge, this research is the first study on modelling, synthesizing, and evaluating mixed emotions in speech.
Submitted 28 December, 2022; v1 submitted 11 August, 2022;
originally announced August 2022.
-
Domain Adapting Deep Reinforcement Learning for Real-world Speech Emotion Recognition
Authors:
Thejan Rajapakshe,
Rajib Rana,
Sara Khalifa,
Bjorn W. Schuller
Abstract:
Computers can understand and then engage with people in an emotionally intelligent way thanks to speech-emotion recognition (SER). However, the performance of SER in cross-corpus and real-world live data feed scenarios can be significantly improved. The inability to adapt an existing model to a new domain is one of the shortcomings of SER methods. To address this challenge, researchers have developed domain adaptation techniques that transfer knowledge learnt by a model across domains. Although existing domain adaptation techniques have improved performance across domains, they can be improved to adapt to a real-world live data feed situation where a model can self-tune while deployed. In this paper, we present a deep reinforcement learning-based strategy (RL-DA) for adapting a pre-trained model to a real-world live data feed setting while interacting with the environment and collecting continual feedback. RL-DA is evaluated on SER tasks, including cross-corpus and cross-language domain adaptation schema. Evaluation results show that in a live data feed setting, RL-DA outperforms a baseline strategy by 11% and 14% in cross-corpus and cross-language scenarios, respectively.
Submitted 23 September, 2022; v1 submitted 6 July, 2022;
originally announced July 2022.
-
Multitask Learning from Augmented Auxiliary Data for Improving Speech Emotion Recognition
Authors:
Siddique Latif,
Rajib Rana,
Sara Khalifa,
Raja Jurdak,
Björn W. Schuller
Abstract:
Despite the recent progress in speech emotion recognition (SER), state-of-the-art systems lack generalisation across different conditions. A key underlying reason for poor generalisation is the scarcity of emotion datasets, which is a significant roadblock to designing robust machine learning (ML) models. Recent works in SER focus on utilising multitask learning (MTL) methods to improve generalisation by learning shared representations. However, most of these studies propose MTL solutions with the requirement of meta labels for auxiliary tasks, which limits the training of SER systems. This paper proposes an MTL framework (MTL-AUG) that learns generalised representations from augmented data. We utilise augmentation-type classification and unsupervised reconstruction as auxiliary tasks, which allow training SER systems on augmented data without requiring any meta labels for auxiliary tasks. The semi-supervised nature of MTL-AUG allows for the exploitation of the abundant unlabelled data to further boost the performance of SER. We comprehensively evaluate the proposed framework in the following settings: (1) within corpus, (2) cross-corpus and cross-language, (3) noisy speech, and (4) adversarial attacks. Our evaluations using the widely used IEMOCAP, MSP-IMPROV, and EMODB datasets show improved results compared to existing state-of-the-art methods.
Submitted 12 July, 2022;
originally announced July 2022.
-
Self Supervised Adversarial Domain Adaptation for Cross-Corpus and Cross-Language Speech Emotion Recognition
Authors:
Siddique Latif,
Rajib Rana,
Sara Khalifa,
Raja Jurdak,
Björn Schuller
Abstract:
Despite the recent advancement in speech emotion recognition (SER) within a single corpus setting, the performance of these SER systems degrades significantly for cross-corpus and cross-language scenarios. The key reason is the lack of generalisation in SER systems towards unseen conditions, which causes them to perform poorly in cross-corpus and cross-language settings. To address this issue, recent studies focus on utilising adversarial methods to learn domain-generalised representations for improving cross-corpus and cross-language SER. However, many of these methods only focus on cross-corpus SER without addressing the cross-language SER performance degradation due to a larger domain gap between source and target language data. This contribution proposes an adversarial dual discriminator (ADDi) network that uses a three-player adversarial game to learn generalised representations without requiring any target data labels. We also introduce a self-supervised ADDi (sADDi) network that utilises self-supervised pre-training with unlabelled data. We propose synthetic data generation as a pretext task in sADDi, enabling the network to produce emotionally discriminative and domain-invariant representations and providing complementary synthetic data to augment the system. The proposed model is rigorously evaluated using five publicly available datasets in three languages and compared with multiple studies on cross-corpus and cross-language SER. Experimental results demonstrate that the proposed model achieves improved performance compared to the state-of-the-art methods.
Submitted 18 April, 2022;
originally announced April 2022.
-
Emotion Intensity and its Control for Emotional Voice Conversion
Authors:
Kun Zhou,
Berrak Sisman,
Rajib Rana,
Björn W. Schuller,
Haizhou Li
Abstract:
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated as discrete categories overlooking the fact that speech also conveys emotions with various intensity levels that the listener can perceive. In this paper, we aim to explicitly characterize and control the intensity of emotion. We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding. We further learn the actual emotion encoder from an emotion-labelled database and study the use of relative attributes to represent fine-grained emotion intensity. To ensure emotional intelligibility, we incorporate emotion classification loss and emotion embedding similarity loss into the training of the EVC network. As desired, the proposed network controls the fine-grained emotion intensity in the output speech. Through both objective and subjective evaluations, we validate the effectiveness of the proposed network for emotional expressiveness and emotion intensity control.
Submitted 18 July, 2022; v1 submitted 9 January, 2022;
originally announced January 2022.
-
Fast and Real-time End to End Control in Autonomous Racing Cars Through Representation Learning
Authors:
Praveen Venkatesh,
Rwik Rana,
Harish PM
Abstract:
The challenges presented in an autonomous racing situation are distinct from those faced in regular autonomous driving and require faster end-to-end algorithms and consideration of a longer horizon in determining optimal current actions keeping in mind upcoming maneuvers and situations. In this paper, we propose an end-to-end method for autonomous racing that takes in as inputs video information from an onboard camera and determines final steering and throttle control actions. We use the following split to construct such a method (1) learning a low dimensional representation of the scene, (2) pre-generating the optimal trajectory for the given scene, and (3) tracking the predicted trajectory using a classical control method. In learning a low-dimensional representation of the scene, we use intermediate representations with a novel unsupervised trajectory planner to generate expert trajectories, and hence utilize them to directly predict race lines from a given front-facing input image. Thus, the proposed algorithm employs the best of two worlds - the robustness of learning-based approaches to perception and the accuracy of optimization-based approaches for trajectory generation in an end-to-end learning-based framework. We deploy and demonstrate our framework on CARLA, a photorealistic simulator for testing self-driving cars in realistic environments.
Submitted 30 November, 2021;
originally announced November 2021.
-
Memory Guided Road Detection
Authors:
Praveen Venkatesh,
Rwik Rana,
Varun Jain
Abstract:
In self-driving car applications, there is a requirement to predict the location of the lane given an input RGB front-facing image. In this paper, we propose an architecture that allows us to increase the speed and robustness of road detection without a large hit in accuracy by introducing an underlying shared feature space that is propagated over time, which serves as a flowing dynamic memory. By utilizing the gist of previous frames, we train the network to predict the current road with greater accuracy and less deviation from previous frames.
Submitted 27 June, 2021;
originally announced June 2021.
-
The Lorentz-violating real scalar field at thermal equilibrium
Authors:
A. R. Aguirre,
G. Flores-Hidalgo,
R. G. Rana,
E. S. Souza
Abstract:
In this paper we study Lorentz-violation (LV) effects on the thermodynamic properties of a real scalar field theory due to the presence of a constant background tensor field. In particular, we analyse and compute explicitly the deviations of the internal energy, pressure, and entropy of the system at thermal equilibrium due to the LV contributions. For the free massless scalar field we obtain exact results, whereas for the massive case we perform approximate calculations. Finally, we consider the self-interacting $\phi^4$ theory, and perform perturbative expansions in the coupling constant to obtain the relevant thermodynamic quantities.
Submitted 15 March, 2021;
originally announced March 2021.
-
Optical Kerr nonlinearity and multi-photon absorption of DSTMS measured by Z-scan method
Authors:
Jiang Li,
Rakesh Rana,
Liguo Zhu,
Cangli Liu,
Harald Schneider,
Alexej Pashkin
Abstract:
We investigate the optical Kerr nonlinearity and multi-photon absorption (MPA) properties of DSTMS excited by femtosecond pulses at a wavelength of 1.43 μm, which is optimal for terahertz generation via difference frequency mixing. The MPA and the optical Kerr coefficients of DSTMS at 1.43 μm are strongly anisotropic, indicating a dominating contribution from cascaded 2nd-order nonlinearity. These results suggest that the saturation of the THz generation efficiency is mainly related to the MPA process and to a spectral broadening caused by cascaded 2nd-order frequency mixing within DSTMS.
Submitted 25 August, 2021; v1 submitted 5 February, 2021;
originally announced February 2021.
-
A novel policy for pre-trained Deep Reinforcement Learning for Speech Emotion Recognition
Authors:
Thejan Rajapakshe,
Rajib Rana,
Sara Khalifa,
Björn W. Schuller,
Jiajun Liu
Abstract:
Reinforcement Learning (RL) is a semi-supervised learning paradigm in which an agent learns by interacting with an environment. Deep learning in combination with RL, called Deep Reinforcement Learning (deep RL), provides an efficient method to learn how to interact with the environment. Deep RL has gained tremendous success in gaming, such as AlphaGo, but its potential has rarely been explored for challenging tasks like Speech Emotion Recognition (SER). Deep RL used for SER can potentially improve the performance of an automated call centre agent by dynamically learning emotion-aware responses to customer queries. While the policy employed by the RL agent plays a major role in action selection, there is no current RL policy tailored for SER. In addition, an extended learning period is a general challenge for deep RL, which can impact the speed of learning for SER. Therefore, in this paper, we introduce a novel policy, the "Zeta policy", which is tailored for SER, and apply pre-training in deep RL to achieve a faster learning rate. Pre-training with a cross dataset was also studied to discover the feasibility of pre-training the RL agent with a similar dataset in a scenario where no real environmental data is available. The IEMOCAP and SAVEE datasets were used for the evaluation, with the problem being to recognize four emotions (happy, sad, angry, and neutral) in the utterances provided. Experimental results show that the proposed "Zeta policy" performs better than existing policies. The results also support that pre-training can reduce the training time by reducing the warm-up period and is robust to cross-corpus scenarios.
Submitted 31 January, 2021; v1 submitted 3 January, 2021;
originally announced January 2021.
-
High-Fidelity Audio Generation and Representation Learning with Guided Adversarial Autoencoder
Authors:
Kazi Nazmul Haque,
Rajib Rana,
Björn W Schuller
Abstract:
Unsupervised disentangled representation learning from unlabelled audio data and high-fidelity audio generation have become two linchpins in the machine learning research fields. However, the representation learned from an unsupervised setting does not guarantee its usability for any downstream task at hand, which can be a waste of resources if the training was conducted for that particular posterior job. Also, during the representation learning, if the model is highly biased towards the downstream task, it loses its generalisation capability: this directly benefits the downstream job, but the ability to scale to other related tasks is lost. Therefore, to fill this gap, we propose a new autoencoder-based model named "Guided Adversarial Autoencoder (GAAE)", which can learn both post-task-specific representations and the general representation capturing the factors of variation in the training data, leveraging a small percentage of labelled samples, thus making it suitable for future related tasks. Furthermore, our proposed model can generate audio with superior quality, which is indistinguishable from the real audio samples. Hence, with the extensive experimental results, we have demonstrated that by harnessing the power of high-fidelity audio generation, the proposed GAAE model can learn powerful representations from an unlabelled dataset, leveraging a small percentage of labelled data as supervision/guidance.
Submitted 17 October, 2020; v1 submitted 1 June, 2020;
originally announced June 2020.
-
Deep Reinforcement Learning with Pre-training for Time-efficient Training of Automatic Speech Recognition
Authors:
Thejan Rajapakshe,
Siddique Latif,
Rajib Rana,
Sara Khalifa,
Björn W. Schuller
Abstract:
Deep reinforcement learning (deep RL) is a combination of deep learning with reinforcement learning principles to create efficient methods that can learn by interacting with their environment. This has led to breakthroughs in many complex tasks, such as playing the game "Go", that were previously difficult to solve. However, deep RL requires significant training time, making it difficult to use in various real-life applications such as Human-Computer Interaction (HCI). In this paper, we study pre-training in deep RL to reduce the training time and improve the performance of Speech Recognition, a popular application of HCI. To evaluate the performance improvement in training, we use the publicly available "Speech Command" dataset, which contains utterances of 30 command keywords spoken by 2,618 speakers. Results show that pre-training with deep RL offers faster convergence compared to non-pre-trained RL while achieving improved speech recognition accuracy.
Submitted 21 May, 2020;
originally announced May 2020.
-
Free2Shard: Adaptive-adversary-resistant sharding via Dynamic Self Allocation
Authors:
Ranvir Rana,
Sreeram Kannan,
David Tse,
Pramod Viswanath
Abstract:
Propelled by the growth of large-scale blockchain deployments, much recent progress has been made in designing sharding protocols that achieve throughput scaling linearly in the number of nodes. However, existing protocols are not robust to an adversary adaptively corrupting a fixed fraction of nodes. In this paper, we propose Free2Shard -- a new architecture that achieves near-linear scaling while being secure against a fully adaptive adversary.
The focal point of this architecture is a dynamic self-allocation algorithm that lets users allocate themselves to shards in response to adversarial action, without requiring a central or cryptographic proof. This architecture has several attractive features unusual for sharding protocols, including: (a) the ability to handle the regime of large number of shards (relative to the number of nodes); (b) heterogeneous shard demands; (c) requiring only a small minority to follow the self-allocation; (d) asynchronous shard rotation; (e) operation in a purely identity-free proof-of-work setting. The key technical contribution is a deep mathematical connection to the classical work of Blackwell in dynamic game theory.
Submitted 19 May, 2020;
originally announced May 2020.
-
Deep Architecture Enhancing Robustness to Noise, Adversarial Attacks, and Cross-corpus Setting for Speech Emotion Recognition
Authors:
Siddique Latif,
Rajib Rana,
Sara Khalifa,
Raja Jurdak,
Björn W. Schuller
Abstract:
Speech emotion recognition (SER) systems can achieve high accuracy when the training and test data are identically distributed, but this assumption is frequently violated in practice and the performance of SER systems plummets against unforeseen data shifts. The design of robust models for accurate SER is challenging, which limits its use in practical applications. In this paper we propose a deeper neural network architecture wherein we fuse DenseNet, LSTM and Highway Network to learn powerful discriminative features which are robust to noise. We also propose data augmentation with our network architecture to further improve the robustness. We comprehensively evaluate the architecture coupled with data augmentation against (1) noise, (2) adversarial attacks and (3) cross-corpus settings. Our evaluations on the widely used IEMOCAP and MSP-IMPROV datasets show promising results when compared with existing studies and state-of-the-art models.
Submitted 25 July, 2020; v1 submitted 18 May, 2020;
originally announced May 2020.
-
Augmenting Generative Adversarial Networks for Speech Emotion Recognition
Authors:
Siddique Latif,
Muhammad Asim,
Rajib Rana,
Sara Khalifa,
Raja Jurdak,
Björn W. Schuller
Abstract:
Generative adversarial networks (GANs) have shown potential in learning emotional attributes and generating new data samples. However, their performance is usually hindered by the unavailability of larger speech emotion recognition (SER) data. In this work, we propose a framework that utilises the mixup data augmentation scheme to augment the GAN in feature learning and generation. To show the effectiveness of the proposed framework, we present results for SER on (i) synthetic feature vectors, (ii) augmentation of the training data with synthetic features, and (iii) encoded features in compressed representation. Our results show that the proposed framework can effectively learn compressed emotional representations and can generate synthetic samples that help improve performance in within-corpus and cross-corpus evaluation.
Submitted 25 July, 2020; v1 submitted 18 May, 2020;
originally announced May 2020.
-
Nonlinear Charge Transport in InGaAs Nanowires at Terahertz Frequencies
Authors:
Rakesh Rana,
Leila Balaghi,
Ivan Fotev,
Harald Schneider,
Manfred Helm,
Emmanouil Dimakis,
Alexej Pashkin
Abstract:
We probe the electron transport properties in the shell of GaAs/In0.2Ga0.8As core/shell nanowires at high electric fields using optical pump / THz probe spectroscopy with broadband THz pulses and peak electric fields up to 0.6 MV/cm. The plasmon resonance of the photoexcited charge carriers exhibits a systematic redshift and a suppression of its spectral weight for THz driving fields exceeding 0.4 MV/cm. This behavior is attributed to the intervalley electron scattering resulting in the increase of the average electron effective mass and the corresponding decrease of the electron mobility by about 2 times at the highest fields. We demonstrate that the increase of the effective mass is non-uniform along the nanowires and takes place mainly in their middle part, leading to a spatially inhomogeneous carrier response. Our results quantify the nonlinear transport regime in GaAs-based nanowires and show their high potential for development of nano-devices operating at THz frequencies.
Submitted 1 April, 2020;
originally announced April 2020.
-
Guided Generative Adversarial Neural Network for Representation Learning and High Fidelity Audio Generation using Fewer Labelled Audio Data
Authors:
Kazi Nazmul Haque,
Rajib Rana,
John H. L. Hansen,
Björn Schuller
Abstract:
Recent improvements in Generative Adversarial Neural Networks (GANs) have shown their ability to generate higher quality samples as well as to learn good representations for transfer learning. Most of the representation learning methods based on GANs learn representations ignoring their post-use scenario, which can lead to increased generalisation ability. However, the model can become redundant if it is intended for a specific task. For example, assume we have a vast unlabelled audio dataset, and we want to learn a representation from this dataset so that it can be used to improve the emotion recognition performance of a small labelled audio dataset. During the representation learning training, if the model does not know the post emotion recognition task, it can completely ignore emotion-related characteristics in the learnt representation. This is a fundamental challenge for any unsupervised representation learning model. In this paper, we aim to address this challenge by proposing a novel GAN framework: Guided Generative Adversarial Neural Network (GGAN), which guides a GAN to focus on learning desired representations and generating superior quality samples for audio data, leveraging fewer labelled samples. Experimental results show that using a very small amount of labelled data as guidance, a GGAN learns significantly better representations.
Submitted 1 June, 2020; v1 submitted 5 March, 2020;
originally announced March 2020.
-
Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends
Authors:
Siddique Latif,
Rajib Rana,
Sara Khalifa,
Raja Jurdak,
Junaid Qadir,
Björn W. Schuller
Abstract:
Research on speech processing has traditionally considered the task of designing hand-engineered acoustic features (feature engineering) as a separate distinct problem from the task of designing efficient machine learning (ML) models to make prediction and classification decisions. There are two main drawbacks to this approach: firstly, the feature engineering being manual is cumbersome and requires human knowledge; and secondly, the designed features might not be best for the objective at hand. This has motivated the adoption of a recent trend in speech community towards utilisation of representation learning techniques, which can learn an intermediate representation of the input signal automatically that better suits the task at hand and hence lead to improved performance. The significance of representation learning has increased with advances in deep learning (DL), where the representations are more useful and less dependent on human knowledge, making it very conducive for tasks like classification, prediction, etc. The main contribution of this paper is to present an up-to-date and comprehensive survey on different techniques of speech representation learning by bringing together the scattered research across three distinct research areas including Automatic Speech Recognition (ASR), Speaker Recognition (SR), and Speaker Emotion Recognition (SER). Recent reviews in speech have been conducted for ASR, SR, and SER, however, none of these has focused on the representation learning from speech -- a gap that our survey aims to bridge.
Submitted 24 September, 2021; v1 submitted 2 January, 2020;
originally announced January 2020.
-
Galaxy And Mass Assembly (GAMA): Properties and evolution of red spiral galaxies
Authors:
Smriti Mahajan,
Kriti Kamal Gupta,
Rahul Rana,
M. J. I. Brown,
S. Phillipps,
Joss Bland-Hawthorn,
M. N. Bremer,
S. Brough,
B. W. Holwerda,
A. M. Hopkins,
J. Loveday,
Kevin Pimbblet,
Lingyu Wang
Abstract:
We use multi-wavelength data from the Galaxy and Mass Assembly (GAMA) survey to explore the cause of red optical colours in nearby (0.002<z<0.06) spiral galaxies. We show that the colours of red spiral galaxies are a direct consequence of some environment-related mechanism(s) which has removed dust and gas, leading to a lower star formation rate. We conclude that this process acts on long timescales (several Gyr) due to a lack of morphological transformation associated with the transition in optical colour. The sSFR and dust-to-stellar mass ratio of red spiral galaxies is found to be statistically lower than blue spiral galaxies. On the other hand, red spirals are on average $0.9$ dex more massive, and reside in environments 2.6 times denser than their blue counterparts. We find no evidence of excessive nuclear activity, or higher inclination angles to support these as the major causes for the red optical colours seen in >= 47% of all spirals in our sample. Furthermore, for a small subsample of our spiral galaxies which are detected in HI, we find that the SFR of gas-rich red spiral galaxies is lower by ~1 dex than their blue counterparts.
Submitted 25 October, 2019;
originally announced October 2019.
-
Pre-training in Deep Reinforcement Learning for Automatic Speech Recognition
Authors:
Thejan Rajapakshe,
Rajib Rana,
Siddique Latif,
Sara Khalifa,
Björn W. Schuller
Abstract:
Deep reinforcement learning (deep RL) is a combination of deep learning with reinforcement learning principles to create efficient methods that can learn by interacting with their environment. This led to breakthroughs in many complex tasks that were previously difficult to solve. However, deep RL requires a large amount of training time that makes it difficult to use in various real-life applications like human-computer interaction (HCI). Therefore, in this paper, we study pre-training in deep RL to reduce the training time and improve the performance in speech recognition, a popular application of HCI. We achieve significantly improved performance in less time on a publicly available speech command recognition dataset.
Submitted 26 October, 2019; v1 submitted 24 October, 2019;
originally announced October 2019.
-
Barracuda: The Power of $\ell$-polling in Proof-of-Stake Blockchains
Authors:
Giulia Fanti,
Jiantao Jiao,
Ashok Makkuva,
Sewoong Oh,
Ranvir Rana,
Pramod Viswanath
Abstract:
A blockchain is a database of sequential events that is maintained by a distributed group of nodes. A key consensus problem in blockchains is that of determining the next block (data element) in the sequence. Many blockchains address this by electing a new node to propose each new block. The new block is (typically) appended to the tip of the proposer's local blockchain, and subsequently broadcast to the rest of the network. Without network delay (or adversarial behavior), this procedure would give a perfect chain, since each proposer would have the same view of the blockchain. A major challenge in practice is forking. Due to network delays, a proposer may not yet have the most recent block, and may, therefore, create a side chain that branches from the middle of the main chain. Forking reduces throughput, since only a single main chain can survive, and all other blocks are discarded. We propose a new P2P protocol for blockchains called Barracuda, in which each proposer, prior to proposing a block, polls $\ell$ other nodes for their local blocktree information. Under a stochastic network model, we prove that this lightweight primitive improves throughput as if the entire network were a factor of $\ell$ faster. We provide guidelines on how to implement Barracuda in practice, guaranteeing robustness against several real-world factors.
Submitted 18 September, 2019;
originally announced September 2019.
-
Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition
Authors:
Siddique Latif,
Rajib Rana,
Sara Khalifa,
Raja Jurdak,
Julien Epps,
Björn W. Schuller
Abstract:
Despite the emerging importance of Speech Emotion Recognition (SER), the state-of-the-art accuracy is quite low and needs improvement to make commercial applications of SER viable. A key underlying reason for the low accuracy is the scarcity of emotion datasets, which is a challenge for developing any robust machine learning model in general. In this paper, we propose a solution to this problem: a multi-task learning framework that uses auxiliary tasks for which data is abundantly available. We show that utilisation of this additional data can improve the primary task of SER for which only limited labelled data is available. In particular, we use gender identification and speaker recognition as auxiliary tasks, which allow the use of very large datasets, e.g., speaker classification datasets. To maximise the benefit of multi-task learning, we further use an adversarial autoencoder (AAE) within our framework, which has a strong capability to learn powerful and discriminative features. Furthermore, the unsupervised AAE in combination with the supervised classification networks enables semi-supervised learning which incorporates a discriminative component in the AAE unsupervised training pipeline. This semi-supervised learning essentially helps to improve generalisation of our framework and thus leads to improvements in SER performance. The proposed model is rigorously evaluated for categorical and dimensional emotion, and cross-corpus scenarios. Experimental results demonstrate that the proposed model achieves state-of-the-art performance on two publicly available datasets.
Submitted 22 March, 2020; v1 submitted 13 July, 2019;
originally announced July 2019.
-
Disentangled Representation Learning with Information Maximizing Autoencoder
Authors:
Kazi Nazmul Haque,
Siddique Latif,
Rajib Rana
Abstract:
Learning disentangled representations from unlabelled data is a non-trivial problem. In this paper we propose the Information Maximising Autoencoder (InfoAE), where the encoder learns a powerful disentangled representation by maximizing the mutual information between the representation and given information in an unsupervised fashion. We have evaluated our model on the MNIST dataset and achieved 98.9 ($\pm .1$) $\%$ test accuracy while using completely unsupervised training.
Submitted 18 April, 2019;
originally announced April 2019.
-
Direct Modelling of Speech Emotion from Raw Speech
Authors:
Siddique Latif,
Rajib Rana,
Sara Khalifa,
Raja Jurdak,
Julien Epps
Abstract:
Speech emotion recognition is a challenging task and heavily depends on hand-engineered acoustic features, which are typically crafted to echo human perception of speech signals. However, a filter bank that is designed from perceptual evidence is not always guaranteed to be the best in a statistical modelling framework where the end goal is, for example, emotion classification. This has fuelled the emerging trend of learning representations from raw speech, especially using deep neural networks. In particular, a combination of Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) has gained great traction, owing to the intrinsic ability of LSTM to learn contextual information crucial for emotion recognition, while CNNs have been used for their ability to overcome the scalability problem of regular neural networks. In this paper, we show that there are still opportunities to improve the performance of emotion recognition from raw speech by exploiting the properties of CNNs in modelling contextual information. We propose the use of parallel convolutional layers to harness multiple temporal resolutions in the feature extraction block that is jointly trained with the LSTM-based classification network for the emotion recognition task. Our results suggest that the proposed model can reach the performance of a CNN trained with hand-engineered features from both the IEMOCAP and MSP-IMPROV datasets.
Submitted 27 July, 2020; v1 submitted 8 April, 2019;
originally announced April 2019.
-
Automated Screening for Distress: A Perspective for the Future
Authors:
Rajib Rana,
Siddique Latif,
Raj Gururajan,
Anthony Gray,
Geraldine Mackenzie,
Gerald Humphris,
Jeff Dunn
Abstract:
Distress is a complex condition which affects a significant percentage of cancer patients and may lead to depression, anxiety, sadness, suicide and other forms of psychological morbidity. Compelling evidence supports screening for distress as a means of facilitating early intervention and subsequent improvements in psychological well-being and overall quality of life. Nevertheless, despite the existence of evidence based and easily administered screening tools, for example, the Distress Thermometer, routine screening for distress is yet to achieve widespread implementation. Efforts are intensifying to utilise innovative, cost effective methods now available through emerging technologies in the informatics and computational arenas.
Submitted 27 July, 2020; v1 submitted 22 February, 2019;
originally announced February 2019.
-
Adversarial Machine Learning And Speech Emotion Recognition: Utilizing Generative Adversarial Networks For Robustness
Authors:
Siddique Latif,
Rajib Rana,
Junaid Qadir
Abstract:
Deep learning has undoubtedly offered tremendous improvements in the performance of state-of-the-art speech emotion recognition (SER) systems. However, recent research on adversarial examples poses enormous challenges on the robustness of SER systems by showing the susceptibility of deep neural networks to adversarial examples as they rely only on small and imperceptible perturbations. In this study, we evaluate how adversarial examples can be used to attack SER systems and propose the first black-box adversarial attack on SER systems. We also explore potential defenses including adversarial training and generative adversarial network (GAN) to enhance robustness. Experimental evaluations suggest various interesting aspects of the effective utilization of adversarial examples useful for achieving robustness for SER systems opening up opportunities for researchers to further innovate in this space.
Submitted 30 December, 2018; v1 submitted 28 November, 2018;
originally announced November 2018.
-
Automating Motion Correction in Multishot MRI Using Generative Adversarial Networks
Authors:
Siddique Latif,
Muhammad Asim,
Muhammad Usman,
Junaid Qadir,
Rajib Rana
Abstract:
Multishot Magnetic Resonance Imaging (MRI) has recently gained popularity as it accelerates the MRI data acquisition process without compromising the quality of the final MR image. However, it suffers from motion artifacts caused by patient movements, which may lead to misdiagnosis. Modern state-of-the-art motion correction techniques are able to counter small-degree motion; however, their adoption is hindered by their time complexity. This paper proposes a Generative Adversarial Network (GAN) for reconstructing motion-free high-fidelity images while reducing the image reconstruction time by an impressive two orders of magnitude.
Submitted 23 November, 2018;
originally announced November 2018.
-
Non-thermal nature of photo-induced insulator-to-metal transition in NbO$_2$
Authors:
Rakesh Rana,
J. Michael Klopf,
Jörg Grenzer,
Harald Schneider,
Manfred Helm,
Alexej Pashkin
Abstract:
We study the photo-induced metallization process in niobium dioxide NbO$_2$. This compound undergoes the thermal insulator-to-metal transition at the remarkably high temperature of 1080 K. Our optical pump - terahertz probe measurements reveal the ultrafast switching of the film on a sub-picosecond timescale and the formation of a metastable metallic phase when the incident pump fluence exceeds the threshold of 10 mJ/cm$^2$. Remarkably, this threshold value corresponds to the deposited energy which is capable of heating NbO$_2$ only up to 790 K, thus, evidencing the non-thermal character of the photo-induced insulator-to-metal transition. We also observe an enhanced formation of the metallic phase above the second threshold of 17.5 mJ/cm$^2$ which corresponds to the onset of the thermal switching. The transient optical conductivity in the metastable phase can be modeled using the Drude-Smith model confirming its metallic character. The present observation of non-thermal transition in NbO$_2$ can serve as an important test bed for understanding photo-induced phenomena in strongly correlated oxides.
Submitted 20 September, 2018; v1 submitted 19 September, 2018;
originally announced September 2018.
-
Communication Algorithms via Deep Learning
Authors:
Hyeji Kim,
Yihan Jiang,
Ranvir Rana,
Sreeram Kannan,
Sewoong Oh,
Pramod Viswanath
Abstract:
Coding theory is a central discipline underpinning wireline and wireless modems that are the workhorses of the information age. Progress in coding theory is largely driven by individual human ingenuity with sporadic breakthroughs over the past century. In this paper we study whether it is possible to automate the discovery of decoding algorithms via deep learning. We study a family of sequential codes parameterized by recurrent neural network (RNN) architectures. We show that creatively designed and trained RNN architectures can decode well-known sequential codes such as the convolutional and turbo codes with close to optimal performance on the additive white Gaussian noise (AWGN) channel, which itself is achieved by breakthrough algorithms of our times (Viterbi and BCJR decoders, representing dynamic programming and forward-backward algorithms). We show strong generalizations, i.e., we train at a specific signal-to-noise ratio and block length but test at a wide range of these quantities, as well as robustness and adaptivity to deviations from the AWGN setting.
Submitted 23 May, 2018;
originally announced May 2018.
-
Carrier driven antiferromagnetism and exchange-bias in SrRuO3/CaRuO3 heterostructures
Authors:
Parul Pandey,
Ching-Hao Chang,
Angus Huang,
Rakesh Rana,
Changan Wang,
Chi Xu,
Horng-Tay Jeng,
Manfred Helm,
R. Ganesh,
Shengqiang Zhou
Abstract:
Oxide heterostructures exhibit a rich variety of magnetic and transport properties which arise due to contact at an interface. This can lead to surprising effects that are very different from the bulk properties of the materials involved. We report the magnetic properties of bilayers of SrRuO3, a well-known ferromagnet, and CaRuO3, which is nominally a paramagnet. We find intriguing features that are consistent with CaRuO3 developing dual magnetic character, with both a net moment as well as antiferromagnetic order. We argue the ordered SrRuO3 layer induces an undulating polarization profile in the conduction electrons of CaRuO3, by a mechanism akin to Friedel oscillations. At low temperatures, this oscillating polarization is inherited by rigid local moments within CaRuO3, leading to a robust exchange bias. We present ab initio simulations in support of this picture. Our results demonstrate a new ordering mechanism and throw light on the magnetic character of CaRuO3.
Submitted 16 February, 2018;
originally announced February 2018.
-
Phonocardiographic Sensing using Deep Learning for Abnormal Heartbeat Detection
Authors:
Siddique Latif,
Muhammad Usman,
Rajib Rana,
Junaid Qadir
Abstract:
Cardiac auscultation involves expert interpretation of abnormalities in heart sounds using a stethoscope. Deep learning based cardiac auscultation is of significant interest to the healthcare community as it can help reduce the burden of manual auscultation with automated detection of abnormal heartbeats. However, the problem of automatic cardiac auscultation is complicated due to the requirement of reliability and high accuracy, and due to the presence of background noise in the heartbeat sound. In this work, we propose a Recurrent Neural Network (RNN) based automated cardiac auscultation solution. Our choice of RNNs is motivated by the great success of deep learning in medical applications and by the observation that RNNs represent the deep learning configuration most suitable for dealing with sequential or temporal data even in the presence of noise. We explore the use of various RNN models, and demonstrate that these models significantly improve the abnormal heartbeat classification score. Our proposed approach using RNNs can potentially be used for real-time abnormal heartbeat detection in the Internet of Medical Things for remote monitoring applications.
Submitted 27 July, 2020; v1 submitted 25 January, 2018;
originally announced January 2018.
-
Transfer Learning for Improving Speech Emotion Classification Accuracy
Authors:
Siddique Latif,
Rajib Rana,
Shahzad Younis,
Junaid Qadir,
Julien Epps
Abstract:
The majority of existing speech emotion recognition research focuses on automatic emotion detection using training and testing data from the same corpus collected under the same conditions. The performance of such systems has been shown to drop significantly in cross-corpus and cross-language scenarios. To address the problem, this paper exploits a transfer learning technique, novel in cross-language and cross-corpus scenarios, to improve the performance of speech emotion recognition systems. Evaluations on five different corpora in three different languages show that Deep Belief Networks (DBNs) offer better accuracy than previous approaches on cross-corpus emotion recognition, relative to a Sparse Autoencoder and SVM baseline system. Results also suggest that using a large number of languages for training and using a small fraction of the target data in training can significantly boost accuracy compared with the baseline, even for the corpus with limited training examples.
Submitted 27 July, 2020; v1 submitted 19 January, 2018;
originally announced January 2018.
-
Image denoising and restoration with CNN-LSTM Encoder Decoder with Direct Attention
Authors:
Kazi Nazmul Haque,
Mohammad Abu Yousuf,
Rajib Rana
Abstract:
Image denoising is always a challenging task in the field of computer vision and image processing. In this paper, we propose an encoder-decoder model with direct attention, which is capable of denoising and reconstructing highly corrupted images. Our model consists of an encoder and a decoder, where the encoder is a convolutional neural network and the decoder is a multilayer Long Short-Term Memory network. In the proposed model, the encoder reads an image and captures the abstraction of that image in a vector, while the decoder takes that vector as well as the corrupted image to reconstruct a clean image. We have trained our model on the MNIST handwritten digit database after blacking out the lower half of every image and adding noise on top of that. After such massive destruction of the images, where it is hard for a human to understand their content, our model can retrieve the image with minimal error. Our proposed model has been compared with a convolutional encoder-decoder, and it performed better at generating the missing parts of the images than the convolutional autoencoder.
Submitted 16 January, 2018;
originally announced January 2018.