Search | arXiv e-print repository

Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition

Authors: Shijian Deng, Erin E. Kosloski, Siddhi Patel, Zeke A. Barnett, Yiyang Nan, Alexander Kaplan, Sisira Aarukapalli, William T. Doan, Matthew Wang, Harsh Singh, Pamela R. Rollins, Yapeng Tian

Abstract: In this article, we introduce a novel problem of audio-visual autism behavior recognition, which includes social behavior recognition, an essential aspect previously omitted in AI-assisted autism screening research. We define the task at hand as one that is audio-visual autism behavior recognition, which uses audio and visual cues, including any speech present in the audio, to recognize autism-rel… ▽ More In this article, we introduce a novel problem of audio-visual autism behavior recognition, which includes social behavior recognition, an essential aspect previously omitted in AI-assisted autism screening research. We define the task at hand as one that is audio-visual autism behavior recognition, which uses audio and visual cues, including any speech present in the audio, to recognize autism-related behaviors. To facilitate this new research direction, we collected an audio-visual autism spectrum dataset (AV-ASD), currently the largest video dataset for autism screening using a behavioral approach. It covers an extensive range of autism-associated behaviors, including those related to social communication and interaction. To pave the way for further research on this new problem, we intensively explored leveraging foundation models and multimodal large language models across different modalities. Our experiments on the AV-ASD dataset demonstrate that integrating audio, visual, and speech modalities significantly enhances the performance in autism behavior recognition. Additionally, we explored the use of a post-hoc to ad-hoc pipeline in a multimodal large language model to investigate its potential to augment the model's explanatory capability during autism behavior recognition. We will release our dataset, code, and pre-trained models. △ Less

Submitted 22 March, 2024; originally announced June 2024.

arXiv:2310.11713 [pdf, other]

Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation

Authors: Yiyang Su, Ali Vosoughi, Shijian Deng, Yapeng Tian, Chenliang Xu

Abstract: The audio-visual sound separation field assumes visible sources in videos, but this excludes invisible sounds beyond the camera's view. Current methods struggle with such sounds lacking visible cues. This paper introduces a novel "Audio-Visual Scene-Aware Separation" (AVSA-Sep) framework. It includes a semantic parser for visible and invisible sounds and a separator for scene-informed separation.… ▽ More The audio-visual sound separation field assumes visible sources in videos, but this excludes invisible sounds beyond the camera's view. Current methods struggle with such sounds lacking visible cues. This paper introduces a novel "Audio-Visual Scene-Aware Separation" (AVSA-Sep) framework. It includes a semantic parser for visible and invisible sounds and a separator for scene-informed separation. AVSA-Sep successfully separates both sound types, with joint training and cross-modal alignment enhancing effectiveness. △ Less

Submitted 18 October, 2023; originally announced October 2023.

Comments: Accepted at ICCV 2023 - AV4D, 4 figures, 3 tables

arXiv:2310.02141 [pdf, other]

Adaptive Gait Modeling and Optimization for Principally Kinematic Systems

Authors: Siming Deng, Noah J. Cowan, Brian A. Bittner

Abstract: Robotic adaptation to unanticipated operating conditions is crucial to achieving persistence and robustness in complex real world settings. For a wide range of cutting-edge robotic systems, such as micro- and nano-scale robots, soft robots, medical robots, and bio-hybrid robots, it is infeasible to anticipate the operating environment a priori due to complexities that arise from numerous factors i… ▽ More Robotic adaptation to unanticipated operating conditions is crucial to achieving persistence and robustness in complex real world settings. For a wide range of cutting-edge robotic systems, such as micro- and nano-scale robots, soft robots, medical robots, and bio-hybrid robots, it is infeasible to anticipate the operating environment a priori due to complexities that arise from numerous factors including imprecision in manufacturing, chemo-mechanical forces, and poorly understood contact mechanics. Drawing inspiration from data-driven modeling, geometric mechanics (or gauge theory), and adaptive control, we employ an adaptive system identification framework and demonstrate its efficacy in enhancing the performance of principally kinematic locomotors (those governed by Rayleigh dissipation or zero momentum conservation). We showcase the capability of the adaptive model to efficiently accommodate varying terrains and iteratively modified behaviors within a behavior optimization framework. This provides both the ability to improve fundamental behaviors and perform motion tracking to precision. Notably, we are capable of optimizing the gaits of the Purcell swimmer using approximately 10 cycles per link, which for the nine-link Purcell swimmer provides a factor of ten improvement in optimization speed over the state of the art. Beyond simply a computational speed up, this ten-fold improvement may enable this method to be successfully deployed for in-situ behavior refinement, injury recovery, and terrain adaptation, particularly in domains where simulations provide poor guides for the real world. △ Less

Submitted 18 April, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

Comments: 7 pages, 4 figures

arXiv:2307.09729 [pdf, other]

NTIRE 2023 Quality Assessment of Video Enhancement Challenge

Authors: Xiaohong Liu, Xiongkuo Min, Wei Sun, Yulun Zhang, Kai Zhang, Radu Timofte, Guangtao Zhai, Yixuan Gao, Yuqin Cao, Tengchuan Kou, Yunlong Dong, Ziheng Jia, Yilin Li, Wei Wu, Shuming Hu, Sibin Deng, Pengxiang Xiao, Ying Chen, Kai Li, Kai Zhao, Kun Yuan, Ming Sun, Heng Cong, Hao Wang, Lingzhi Fu , et al. (47 additional authors not shown)

Abstract: This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to address a major challenge in the field of video processing, namely, video quality assessment (VQA) for enhanced videos. The challenge uses the VQA Dataset for Perceptual… ▽ More This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to address a major challenge in the field of video processing, namely, video quality assessment (VQA) for enhanced videos. The challenge uses the VQA Dataset for Perceptual Video Enhancement (VDPVE), which has a total of 1211 enhanced videos, including 600 videos with color, brightness, and contrast enhancements, 310 videos with deblurring, and 301 deshaked videos. The challenge has a total of 167 registered participants. 61 participating teams submitted their prediction results during the development phase, with a total of 3168 submissions. A total of 176 submissions were submitted by 37 participating teams during the final testing phase. Finally, 19 participating teams submitted their models and fact sheets, and detailed the methods they used. Some methods have achieved better results than baseline methods, and the winning methods have demonstrated superior prediction performance. △ Less

Submitted 18 July, 2023; originally announced July 2023.

arXiv:2307.02836 [pdf, other]

Noise-to-Norm Reconstruction for Industrial Anomaly Detection and Localization

Authors: Shiqi Deng, Zhiyu Sun, Ruiyan Zhuang, Jun Gong

Abstract: Anomaly detection has a wide range of applications and is especially important in industrial quality inspection. Currently, many top-performing anomaly-detection models rely on feature-embedding methods. However, these methods do not perform well on datasets with large variations in object locations. Reconstruction-based methods use reconstruction errors to detect anomalies without considering pos… ▽ More Anomaly detection has a wide range of applications and is especially important in industrial quality inspection. Currently, many top-performing anomaly-detection models rely on feature-embedding methods. However, these methods do not perform well on datasets with large variations in object locations. Reconstruction-based methods use reconstruction errors to detect anomalies without considering positional differences between samples. In this study, a reconstruction-based method using the noise-to-norm paradigm is proposed, which avoids the invariant reconstruction of anomalous regions. Our reconstruction network is based on M-net and incorporates multiscale fusion and residual attention modules to enable end-to-end anomaly detection and localization. Experiments demonstrate that the method is effective in reconstructing anomalous regions into normal patterns and achieving accurate anomaly detection and localization. On the MPDD and VisA datasets, our proposed method achieved more competitive results than the latest methods, and it set a new state-of-the-art standard on the MPDD dataset. △ Less

Submitted 6 July, 2023; originally announced July 2023.

arXiv:2307.01062 [pdf, other]

A Data-Driven Approach to Geometric Modeling of Systems with Low-Bandwidth Actuator Dynamics

Authors: Siming Deng, Junning Liu, Bibekananda Datta, Aishwarya Pantula, David H. Gracias, Thao D. Nguyen, Brian A. Bittner, Noah J. Cowan

Abstract: It is challenging to perform system identification on soft robots due to their underactuated, high-dimensional dynamics. In this work, we present a data-driven modeling framework, based on geometric mechanics (also known as gauge theory) that can be applied to systems with low-bandwidth control of the system's internal configuration. This method constructs a series of connected models comprising a… ▽ More It is challenging to perform system identification on soft robots due to their underactuated, high-dimensional dynamics. In this work, we present a data-driven modeling framework, based on geometric mechanics (also known as gauge theory) that can be applied to systems with low-bandwidth control of the system's internal configuration. This method constructs a series of connected models comprising actuator and locomotor dynamics based on data points from stochastically perturbed, repeated behaviors. By deriving these connected models from general formulations of dissipative Lagrangian systems with symmetry, we offer a method that can be applied broadly to robots with first-order, low-pass actuator dynamics, including swelling-driven actuators used in hydrogel crawlers. These models accurately capture the dynamics of the system shape and body movements of a simplified swimming robot model. We further apply our approach to a stimulus-responsive hydrogel simulator that captures the complexity of chemo-mechanical interactions that drive shape changes in biomedically relevant micromachines. Finally, we propose an approach of numerically optimizing control signals by iteratively refining models, which is applied to optimize the input waveform for the hydrogel crawler. This transfer to realistic environments provides promise for applications in locomotor design and biomedical engineering. △ Less

Submitted 3 October, 2023; v1 submitted 3 July, 2023; originally announced July 2023.

Comments: 9 pages, 6 figures

arXiv:2212.13913 [pdf]

Highly-Accurate Electricity Load Estimation via Knowledge Aggregation

Authors: Yuting Ding, Di Wu, Yi He, Xin Luo, Song Deng

Abstract: Mid-term and long-term electric energy demand prediction is essential for the planning and operations of the smart grid system. Mainly in countries where the power system operates in a deregulated environment. Traditional forecasting models fail to incorporate external knowledge while modern data-driven ignore the interpretation of the model, and the load series can be influenced by many complex f… ▽ More Mid-term and long-term electric energy demand prediction is essential for the planning and operations of the smart grid system. Mainly in countries where the power system operates in a deregulated environment. Traditional forecasting models fail to incorporate external knowledge while modern data-driven ignore the interpretation of the model, and the load series can be influenced by many complex factors making it difficult to cope with the highly unstable and nonlinear power load series. To address the forecasting problem, we propose a more accurate district level load prediction model Based on domain knowledge and the idea of decomposition and ensemble. Its main idea is three-fold: a) According to the non-stationary characteristics of load time series with obvious cyclicality and periodicity, decompose into series with actual economic meaning and then carry out load analysis and forecast. 2) Kernel Principal Component Analysis(KPCA) is applied to extract the principal components of the weather and calendar rule feature sets to realize data dimensionality reduction. 3) Give full play to the advantages of various models based on the domain knowledge and propose a hybrid model(XASXG) based on Autoregressive Integrated Moving Average model(ARIMA), support vector regression(SVR) and Extreme gradient boosting model(XGBoost). With such designs, it accurately forecasts the electricity demand in spite of their highly unstable characteristic. We compared our method with nine benchmark methods, including classical statistical models as well as state-of-the-art models based on machine learning, on the real time series of monthly electricity demand in four Chinese cities. The empirical study shows that the proposed hybrid model is superior to all competitors in terms of accuracy and prediction bias. △ Less

Submitted 6 December, 2022; originally announced December 2022.

arXiv:2211.16806 [pdf, other]

Toward Robust Diagnosis: A Contour Attention Preserving Adversarial Defense for COVID-19 Detection

Authors: Kun Xiang, Xing Zhang, **wen She, **peng Liu, Haohan Wang, Shiqi Deng, Shancheng Jiang

Abstract: As the COVID-19 pandemic puts pressure on healthcare systems worldwide, the computed tomography image based AI diagnostic system has become a sustainable solution for early diagnosis. However, the model-wise vulnerability under adversarial perturbation hinders its deployment in practical situation. The existing adversarial training strategies are difficult to generalized into medical imaging field… ▽ More As the COVID-19 pandemic puts pressure on healthcare systems worldwide, the computed tomography image based AI diagnostic system has become a sustainable solution for early diagnosis. However, the model-wise vulnerability under adversarial perturbation hinders its deployment in practical situation. The existing adversarial training strategies are difficult to generalized into medical imaging field challenged by complex medical texture features. To overcome this challenge, we propose a Contour Attention Preserving (CAP) method based on lung cavity edge extraction. The contour prior features are injected to attention layer via a parameter regularization and we optimize the robust empirical risk with hybrid distance metric. We then introduce a new cross-nation CT scan dataset to evaluate the generalization capability of the adversarial robustness under distribution shift. Experimental results indicate that the proposed method achieves state-of-the-art performance in multiple adversarial defense and generalization tasks. The code and dataset are available at https://github.com/Quinn777/CAP. △ Less

Submitted 30 November, 2022; originally announced November 2022.

Comments: Accepted to AAAI 2023

arXiv:2210.06091 [pdf]

Summary on the ISCSLP 2022 Chinese-English Code-Switching ASR Challenge

Authors: Shuhao Deng, Chengfei Li, **feng Bai, Qingqing Zhang, Wei-Qiang Zhang, Runyan Yang, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan

Abstract: Code-switching automatic speech recognition becomes one of the most challenging and the most valuable scenarios of automatic speech recognition, due to the code-switching phenomenon between multilingual language and the frequent occurrence of code-switching phenomenon in daily life. The ISCSLP 2022 Chinese-English Code-Switching Automatic Speech Recognition (CSASR) Challenge aims to promote the de… ▽ More Code-switching automatic speech recognition becomes one of the most challenging and the most valuable scenarios of automatic speech recognition, due to the code-switching phenomenon between multilingual language and the frequent occurrence of code-switching phenomenon in daily life. The ISCSLP 2022 Chinese-English Code-Switching Automatic Speech Recognition (CSASR) Challenge aims to promote the development of code-switching automatic speech recognition. The ISCSLP 2022 CSASR challenge provided two training sets, TAL_CSASR corpus and MagicData-RAMC corpus, a development and a test set for participants, which are used for CSASR model training and evaluation. Along with the challenge, we also provide the baseline system performance for reference. As a result, more than 40 teams participated in this challenge, and the winner team achieved 16.70% Mixture Error Rate (MER) performance on the test set and has achieved 9.8% MER absolute improvement compared with the baseline system. In this paper, we will describe the datasets, the associated baselines system and the requirements, and summarize the CSASR challenge results and major techniques and tricks used in the submitted systems. △ Less

Submitted 13 October, 2022; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: accepted by ISCSLP 2022

arXiv:2209.11971 [pdf, other]

A Homogeneous Processing Fabric for Matrix-Vector Multiplication and Associative Search Using Ferroelectric Time-Domain Compute-in-Memory

Authors: Xunzhao Yin, Qingrong Huang, Franz Müller, Shan Deng, Alptekin Vardar, Sourav De, Zhouhang Jiang, Mohsen Imani, Cheng Zhuo, Thomas Kämpfe, Kai Ni

Abstract: In this work, we propose a ferroelectric FET(FeFET) time-domain compute-in-memory (TD-CiM) array as a homogeneous processing fabric for binary multiplication-accumulation (MAC) and content addressable memory (CAM). We demonstrate that: i) the XOR(XNOR)/AND logic function can be realized using a single cell composed of 2FeFETs connected in series; ii) a two-phase computation in an inverter chain wi… ▽ More In this work, we propose a ferroelectric FET(FeFET) time-domain compute-in-memory (TD-CiM) array as a homogeneous processing fabric for binary multiplication-accumulation (MAC) and content addressable memory (CAM). We demonstrate that: i) the XOR(XNOR)/AND logic function can be realized using a single cell composed of 2FeFETs connected in series; ii) a two-phase computation in an inverter chain with each stage featuring the XOR/AND cell to control the associated capacitor loading and the computation results of binary MAC and CAM are reflected in the chain output signal delay, illustrating full digital compatibility; iii) comprehensive theoretical and experimental validation of the proposed 2FeFET cell and inverter delay chains and their robustness against FeFET variation; iv) the homogeneous processing fabric is applied in hyperdimensional computing to show dynamic and fine-grain resource allocation to accommodate different tasks requiring varying demands over the binary MAC and CAM resources. △ Less

Submitted 24 September, 2022; originally announced September 2022.

Comments: 8 pages, 8 figures

arXiv:2206.14746 [pdf, other]

Placenta Segmentation in Ultrasound Imaging: Addressing Sources of Uncertainty and Limited Field-of-View

Authors: Veronika A. Zimmer, Alberto Gomez, Emily Skelton, Robert Wright, Gavin Wheeler, Shujie Deng, Nooshin Ghavami, Karen Lloyd, Jacqueline Matthew, Bernhard Kainz, Daniel Rueckert, Joseph V. Hajnal, Julia A. Schnabel

Abstract: Automatic segmentation of the placenta in fetal ultrasound (US) is challenging due to the (i) high diversity of placenta appearance, (ii) the restricted quality in US resulting in highly variable reference annotations, and (iii) the limited field-of-view of US prohibiting whole placenta assessment at late gestation. In this work, we address these three challenges with a multi-task learning approac… ▽ More Automatic segmentation of the placenta in fetal ultrasound (US) is challenging due to the (i) high diversity of placenta appearance, (ii) the restricted quality in US resulting in highly variable reference annotations, and (iii) the limited field-of-view of US prohibiting whole placenta assessment at late gestation. In this work, we address these three challenges with a multi-task learning approach that combines the classification of placental location (e.g., anterior, posterior) and semantic placenta segmentation in a single convolutional neural network. Through the classification task the model can learn from larger and more diverse datasets while improving the accuracy of the segmentation task in particular in limited training set conditions. With this approach we investigate the variability in annotations from multiple raters and show that our automatic segmentations (Dice of 0.86 for anterior and 0.83 for posterior placentas) achieve human-level performance as compared to intra- and inter-observer variability. Lastly, our approach can deliver whole placenta segmentation using a multi-view US acquisition pipeline consisting of three stages: multi-probe image acquisition, image fusion and image segmentation. This results in high quality segmentation of larger structures such as the placenta in US with reduced image artifacts which are beyond the field-of-view of single probes. △ Less

Submitted 29 June, 2022; originally announced June 2022.

Comments: 21 pages (18 + appendix), 13 figures (9 + appendix)

arXiv:2206.13135 [pdf]

TALCS: An Open-Source Mandarin-English Code-Switching Corpus and a Speech Recognition Baseline

Authors: Chengfei Li, Shuhao Deng, Yao** Wang, Guang**g Wang, Yaguang Gong, Changbin Chen, **feng Bai

Abstract: This paper introduces a new corpus of Mandarin-English code-switching speech recognition--TALCS corpus, suitable for training and evaluating code-switching speech recognition systems. TALCS corpus is derived from real online one-to-one English teaching scenes in TAL education group, which contains roughly 587 hours of speech sampled at 16 kHz. To our best knowledge, TALCS corpus is the largest wel… ▽ More This paper introduces a new corpus of Mandarin-English code-switching speech recognition--TALCS corpus, suitable for training and evaluating code-switching speech recognition systems. TALCS corpus is derived from real online one-to-one English teaching scenes in TAL education group, which contains roughly 587 hours of speech sampled at 16 kHz. To our best knowledge, TALCS corpus is the largest well labeled Mandarin-English code-switching open source automatic speech recognition (ASR) dataset in the world. In this paper, we will introduce the recording procedure in detail, including audio capturing devices and corpus environments. And the TALCS corpus is freely available for download under the permissive license1. Using TALCS corpus, we conduct ASR experiments in two popular speech recognition toolkits to make a baseline system, including ESPnet and Wenet. The Mixture Error Rate (MER) performance in the two speech recognition toolkits is compared in TALCS corpus. The experimental results implies that the quality of audio recordings and transcriptions are promising and the baseline system is workable. △ Less

Submitted 27 June, 2022; originally announced June 2022.

Comments: accepted by INTERSPEECH 2022

arXiv:2204.08013 [pdf, other]

doi 10.1109/TBME.2022.3181153

One-step Method for Material Quantitation using In-line Tomography with Single Scanning

Authors: Suyu Liao, Shiwo Deng, Yining Zhu, Huitao Zhang, Pei** Zhu, Kai Zhang, Xing Zhao

Abstract: Objective: Quantitative technique based on In-line phase-contrast computed tomography with single scanning attracts more attention in application due to the flexibility of the implementation. However, the quantitative results usually suffer from artifacts and noise, since the phase retrieval and reconstruction are independent ("two-steps") without feedback from the original data. Our goal is to de… ▽ More Objective: Quantitative technique based on In-line phase-contrast computed tomography with single scanning attracts more attention in application due to the flexibility of the implementation. However, the quantitative results usually suffer from artifacts and noise, since the phase retrieval and reconstruction are independent ("two-steps") without feedback from the original data. Our goal is to develop a method for material quantitative imaging based on a priori information specifically for the single-scanning data. Method: An iterative method that directly reconstructs the refractive index decrement delta and imaginary beta of the object from observed data ("one-step") within single object-to-detector distance (ODD) scanning. Simultaneously, high-quality quantitative reconstruction results are obtained by using a linear approximation that achieves material decomposition in the iterative process. Results: By comparing the equivalent atomic number of the material decomposition results in experiments, the accuracy of the proposed method is greater than 97.2%. Conclusion: The quantitative reconstruction and decomposition results are effectively improved, and there are feedback and corrections during the iteration, which effectively reduce the impact of noise and errors. Significance: This algorithm has the potential for quantitative imaging research, especially for imaging live samples and human breast preclinical studies. △ Less

Submitted 17 April, 2022; originally announced April 2022.

Journal ref: IEEE Transactions on Biomedical Engineering, 2022

arXiv:2203.07948 [pdf, other]

An Ultra-Compact Single FeFET Binary and Multi-Bit Associative Search Engine

Authors: Xunzhao Yin, Franz Müller, Qingrong Huang, Chao Li, Mohsen Imani, Zeyu Yang, Jiahao Cai, Maximilian Lederer, Ricardo Olivo, Nellie Laleni, Shan Deng, Zijian Zhao, Cheng Zhuo, Thomas Kämpfe, Kai Ni

Abstract: Content addressable memory (CAM) is widely used in associative search tasks for its highly parallel pattern matching capability. To accommodate the increasingly complex and data-intensive pattern matching tasks, it is critical to keep improving the CAM density to enhance the performance and area efficiency. In this work, we demonstrate: i) a novel ultra-compact 1FeFET CAM design that enables paral… ▽ More Content addressable memory (CAM) is widely used in associative search tasks for its highly parallel pattern matching capability. To accommodate the increasingly complex and data-intensive pattern matching tasks, it is critical to keep improving the CAM density to enhance the performance and area efficiency. In this work, we demonstrate: i) a novel ultra-compact 1FeFET CAM design that enables parallel associative search and in-memory hamming distance calculation; ii) a multi-bit CAM for exact search using the same CAM cell; iii) compact device designs that integrate the series resistor current limiter into the intrinsic FeFET structure to turn the 1FeFET1R into an effective 1FeFET cell; iv) a successful 2-step search operation and a sufficient sensing margin of the proposed binary and multi-bit 1FeFET1R CAM array with sizes of practical interests in both experiments and simulations, given the existing unoptimized FeFET device variation; v) 89.9x speedup and 66.5x energy efficiency improvement over the state-of-the art alignment tools on GPU in accelerating genome pattern matching applications through the hyperdimensional computing paradigm. △ Less

Submitted 15 March, 2022; originally announced March 2022.

Comments: 20 pages, 14 figures

arXiv:2110.02495 [pdf, other]

Deep Random Forest with Ferroelectric Analog Content Addressable Memory

Authors: Xunzhao Yin, Franz Müller, Ann Franchesca Laguna, Chao Li, Wenwen Ye, Qingrong Huang, Qinming Zhang, Zhiguo Shi, Maximilian Lederer, Nellie Laleni, Shan Deng, Zijian Zhao, Michael Niemier, Xiaobo Sharon Hu, Cheng Zhuo, Thomas Kämpfe, Kai Ni

Abstract: Deep random forest (DRF), which incorporates the core features of deep learning and random forest (RF), exhibits comparable classification accuracy, interpretability, and low memory and computational overhead when compared with deep neural networks (DNNs) in various information processing tasks for edge intelligence. However, the development of efficient hardware to accelerate DRF is lagging behin… ▽ More Deep random forest (DRF), which incorporates the core features of deep learning and random forest (RF), exhibits comparable classification accuracy, interpretability, and low memory and computational overhead when compared with deep neural networks (DNNs) in various information processing tasks for edge intelligence. However, the development of efficient hardware to accelerate DRF is lagging behind its DNN counterparts. The key for hardware acceleration of DRF lies in efficiently realizing the branch-split operation at decision nodes when traversing a decision tree. In this work, we propose to implement DRF through simple associative searches realized with ferroelectric analog content addressable memory (ACAM). Utilizing only two ferroelectric field effect transistors (FeFETs), the ultra-compact ACAM cell can perform a branch-split operation with an energy-efficient associative search by storing the decision boundaries as the analog polarization states in an FeFET. The DRF accelerator architecture and the corresponding map** of the DRF model to the ACAM arrays are presented. The functionality, characteristics, and scalability of the FeFET ACAM based DRF and its robustness against FeFET device non-idealities are validated both in experiments and simulations. Evaluation results show that the FeFET ACAM DRF accelerator exhibits 10^6x/16x and 10^6x/2.5x improvements in terms of energy and latency when compared with other deep random forest hardware implementations on the state-of-the-art CPU/ReRAM, respectively. △ Less

Submitted 6 October, 2021; originally announced October 2021.

Comments: 44 pages, 16 figures

arXiv:2107.06172 [pdf, other]

Arrhenius.jl: A Differentiable Combustion SimulationPackage

Authors: Weiqi Ji, Xingyu Su, Bin Pang, Sean Joseph Cassady, Alison M. Ferris, Yujuan Li, Zhuyin Ren, Ronald Hanson, Sili Deng

Abstract: Combustion kinetic modeling is an integral part of combustion simulation, and extensive studies have been devoted to develo** both high fidelity and computationally affordable models. Despite these efforts, modeling combustion kinetics is still challenging due to the demand for expert knowledge and optimization against experiments, as well as the lack of understanding of the associated uncertain… ▽ More Combustion kinetic modeling is an integral part of combustion simulation, and extensive studies have been devoted to develo** both high fidelity and computationally affordable models. Despite these efforts, modeling combustion kinetics is still challenging due to the demand for expert knowledge and optimization against experiments, as well as the lack of understanding of the associated uncertainties. Therefore, data-driven approaches that enable efficient discovery and calibration of kinetic models have received much attention in recent years, the core of which is the optimization based on big data. Differentiable programming is a promising approach for learning kinetic models from data by efficiently computing the gradient of objective functions to model parameters. However, it is often challenging to implement differentiable programming in practice. Therefore, it is still not available in widely utilized combustion simulation packages such as CHEMKIN and Cantera. Here, we present a differentiable combustion simulation package leveraging the eco-system in Julia, including DifferentialEquations.jl for solving differential equations, ForwardDiff.jl for auto-differentiation, and Flux.jl for incorporating neural network models into combustion simulations and optimizing neural network models using the state-of-the-art deep learning optimizers. We demonstrate the benefits of differentiable programming in efficient and accurate gradient computations, with applications in uncertainty quantification, kinetic model reduction, data assimilation, and model discovery. △ Less

Submitted 19 June, 2021; originally announced July 2021.

arXiv:2010.12715 [pdf, other]

Improving Noise Robustness of an End-to-End Neural Model for Automatic Speech Recognition

Authors: Jagadeesh Balam, Jocelyn Huang, Vitaly Lavrukhin, Slyne Deng, Somshubra Majumdar, Boris Ginsburg

Abstract: We present our experiments in training robust to noise an end-to-end automatic speech recognition (ASR) model using intensive data augmentation. We explore the efficacy of fine-tuning a pre-trained model to improve noise robustness, and we find it to be a very efficient way to train for various noisy conditions, especially when the conditions in which the model will be used, are unknown. Starting… ▽ More We present our experiments in training robust to noise an end-to-end automatic speech recognition (ASR) model using intensive data augmentation. We explore the efficacy of fine-tuning a pre-trained model to improve noise robustness, and we find it to be a very efficient way to train for various noisy conditions, especially when the conditions in which the model will be used, are unknown. Starting with a model trained on clean data helps establish baseline performance on clean speech. We carefully fine-tune this model to both maintain the performance on clean speech, and improve the model accuracy in noisy conditions. With this schema, we trained robust to noise English and Mandarin ASR models on large public corpora. All described models and training recipes are open sourced in NeMo, a toolkit for conversational AI. △ Less

Submitted 23 October, 2020; originally announced October 2020.

arXiv:2008.06091 [pdf, other]

A Technical Overview of AV1

Authors: **gning Han, Bohan Li, Debargha Mukherjee, Ching-Han Chiang, Adrian Grange, Cheng Chen, Hui Su, Sarah Parker, Sai Deng, Urvang Joshi, Yue Chen, Yunqing Wang, Paul Wilkins, Yaowu Xu, James Bankoski

Abstract: The AV1 video compression format is developed by the Alliance for Open Media consortium. It achieves more than 30% reduction in bit-rate compared to its predecessor VP9 for the same decoded video quality. This paper provides a technical overview of the AV1 codec design that enables the compression performance gains with considerations for hardware feasibility. The AV1 video compression format is developed by the Alliance for Open Media consortium. It achieves more than 30% reduction in bit-rate compared to its predecessor VP9 for the same decoded video quality. This paper provides a technical overview of the AV1 codec design that enables the compression performance gains with considerations for hardware feasibility. △ Less

Submitted 8 February, 2021; v1 submitted 13 August, 2020; originally announced August 2020.

arXiv:1908.10267 [pdf, other]

DRD-Net: Detail-recovery Image Deraining via Context Aggregation Networks

Authors: Sen Deng, Mingqiang Wei, Jun Wang, Luming Liang, Haoran Xie, Meng Wang

Abstract: Image deraining is a fundamental, yet not well-solved problem in computer vision and graphics. The traditional image deraining approaches commonly behave ineffectively in medium and heavy rain removal, while the learning-based ones lead to image degradations such as the loss of image details, halo artifacts and/or color distortion. Unlike existing image deraining approaches that lack the detail-re… ▽ More Image deraining is a fundamental, yet not well-solved problem in computer vision and graphics. The traditional image deraining approaches commonly behave ineffectively in medium and heavy rain removal, while the learning-based ones lead to image degradations such as the loss of image details, halo artifacts and/or color distortion. Unlike existing image deraining approaches that lack the detail-recovery mechanism, we propose an end-to-end detail-recovery image deraining network (termed a DRD-Net) for single images. We for the first time introduce two sub-networks with a comprehensive loss function which synergize to derain and recover the lost details caused by deraining. We have three key contributions. First, we present a rain residual network to remove rain streaks from the rainy images, which combines the squeeze-and-excitation (SE) operation with residual blocks to make full advantage of spatial contextual information. Second, we design a new connection style block, named structure detail context aggregation block (SDCAB), which aggregates context feature information and has a large reception field. Third, benefiting from the SDCAB, we construct a detail repair network to encourage the lost details to return for eliminating image degradations. We have validated our approach on four recognized datasets (three synthetic and one real-world). Both quantitative and qualitative comparisons show that our approach outperforms the state-of-the-art deraining methods in terms of the deraining robustness and detail accuracy. The source code has been available for public evaluation and use on GitHub. △ Less

Submitted 28 August, 2019; v1 submitted 27 August, 2019; originally announced August 2019.

arXiv:1904.05243 [pdf]

doi 10.21437/Interspeech.2018-1299

A Compact and Discriminative Feature Based on Auditory Summary Statistics for Acoustic Scene Classification

Authors: Hongwei Song, Jiqing Han, Shiwen Deng

Abstract: One of the biggest challenges of acoustic scene classification (ASC) is to find proper features to better represent and characterize environmental sounds. Environmental sounds generally involve more sound sources while exhibiting less structure in temporal spectral representations. However, the background of an acoustic scene exhibits temporal homogeneity in acoustic properties, suggesting it coul… ▽ More One of the biggest challenges of acoustic scene classification (ASC) is to find proper features to better represent and characterize environmental sounds. Environmental sounds generally involve more sound sources while exhibiting less structure in temporal spectral representations. However, the background of an acoustic scene exhibits temporal homogeneity in acoustic properties, suggesting it could be characterized by distribution statistics rather than temporal details. In this work, we investigated using auditory summary statistics as the feature for ASC tasks. The inspiration comes from a recent neuroscience study, which shows the human auditory system tends to perceive sound textures through time-averaged statistics. Based on these statistics, we further proposed to use linear discriminant analysis to eliminate redundancies among these statistics while kee** the discriminative information, providing an extreme com-pact representation for acoustic scenes. Experimental results show the outstanding performance of the proposed feature over the conventional handcrafted features. △ Less

Submitted 10 April, 2019; originally announced April 2019.

Comments: Accepted as a conference paper of Interspeech 2018

Journal ref: in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2018-September, 2018, pp. 3294-3298

arXiv:1904.05204 [pdf, other]

doi 10.21437/Interspeech.2019-2231

Acoustic Scene Classification by Implicitly Identifying Distinct Sound Events

Authors: Hongwei Song, Jiqing Han, Shiwen Deng, Zhihao Du

Abstract: In this paper, we propose a new strategy for acoustic scene classification (ASC) , namely recognizing acoustic scenes through identifying distinct sound events. This differs from existing strategies, which focus on characterizing global acoustical distributions of audio or the temporal evolution of short-term audio features, without analysis down to the level of sound events. To identify distinct… ▽ More In this paper, we propose a new strategy for acoustic scene classification (ASC) , namely recognizing acoustic scenes through identifying distinct sound events. This differs from existing strategies, which focus on characterizing global acoustical distributions of audio or the temporal evolution of short-term audio features, without analysis down to the level of sound events. To identify distinct sound events for each scene, we formulate ASC in a multi-instance learning (MIL) framework, where each audio recording is mapped into a bag-of-instances representation. Here, instances can be seen as high-level representations for sound events inside a scene. We also propose a MIL neural networks model, which implicitly identifies distinct instances (i.e., sound events). Furthermore, we propose two specially designed modules that model the multi-temporal scale and multi-modal natures of the sound events respectively. The experiments were conducted on the official development set of the DCASE2018 Task1 Subtask B, and our best-performing model improves over the official baseline by 9.4% (68.3% vs 58.9%) in terms of classification accuracy. This study indicates that recognizing acoustic scenes by identifying distinct sound events is effective and paves the way for future studies that combine this strategy with previous ones. △ Less

Submitted 26 April, 2019; v1 submitted 10 April, 2019; originally announced April 2019.

Comments: code URL typo, code is available at https://github.com/hackerekcah/distinct-events-asc.git

Journal ref: Proc. Interspeech 2019, 3860-3864

Showing 1–21 of 21 results for author: Deng, S