Search | arXiv e-print repository

NAIST Simultaneous Speech Translation System for IWSLT 2024

Authors: Yuka Ko, Ryo Fukuda, Yuta Nishikawa, Yasumasa Kano, Tomoya Yanagita, Kosuke Doi, Mana Makinae, Haotian Tan, Makoto Sakai, Sakriani Sakti, Katsuhito Sudoh, Satoshi Nakamura

Abstract: This paper describes NAIST's submission to the simultaneous track of the IWSLT 2024 Evaluation Campaign: English-to-{German, Japanese, Chinese} speech-to-text translation and English-to-Japanese speech-to-speech translation. We develop a multilingual end-to-end speech-to-text translation model combining two pre-trained language models, HuBERT and mBART. We trained this model with two decoding policies, Local Agreement (LA) and AlignAtt. The submitted models employ the LA policy because it outperformed the AlignAtt policy in previous models. Our speech-to-speech translation method is a cascade of the above speech-to-text model and an incremental text-to-speech (TTS) module that incorporates a phoneme estimation model, a parallel acoustic model, and a parallel WaveGAN vocoder. We improved our incremental TTS by applying the Transformer architecture with the AlignAtt policy for the estimation model. The results show that our upgraded TTS module contributed to improving the system performance.

Submitted 30 June, 2024; originally announced July 2024.

Comments: IWSLT 2024 system paper

arXiv:2406.12164 [pdf, other]

A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis

Authors: Guoqiang Hu, Huaning Tan, Ruilai Li

Abstract: Acoustic features play an important role in improving the quality of the synthesised speech. Currently, the Mel spectrogram is a widely employed acoustic feature in most acoustic models. However, due to the fine-grained loss caused by its Fourier transform process, the clarity of speech synthesised by Mel spectrogram is compromised in mutant signals. In order to obtain a more detailed Mel spectrogram, we propose a Mel spectrogram enhancement paradigm based on the continuous wavelet transform (CWT). This paradigm introduces an additional task: a more detailed wavelet spectrogram, which like the post-processing network takes as input the Mel spectrogram output by the decoder. We choose Tacotron2 and Fastspeech2 for experimental validation in order to test autoregressive (AR) and non-autoregressive (NAR) speech systems, respectively. The experimental results demonstrate that the speech synthesised using the model with the Mel spectrogram enhancement paradigm exhibits higher MOS, with an improvement of 0.14 and 0.09 compared to the baseline model, respectively. These findings provide some validation for the universality of the enhancement paradigm, as they demonstrate the success of the paradigm in different architectures.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2404.17357 [pdf, other]

Simultaneous Tri-Modal Medical Image Fusion and Super-Resolution using Conditional Diffusion Model

Authors: Yushen Xu, Xiaosong Li, Yuchan Jie, Haishu Tan

Abstract: In clinical practice, tri-modal medical image fusion, compared to the existing dual-modal technique, can provide a more comprehensive view of the lesions, aiding physicians in evaluating the disease's shape, location, and biological activity. However, due to the limitations of imaging equipment and considerations for patient safety, the quality of medical images is usually limited, leading to sub-optimal fusion performance and affecting the depth of image analysis by the physician. Thus, there is an urgent need for a technology that can both enhance image resolution and integrate multi-modal information. Although current image processing methods can effectively address image fusion and super-resolution individually, solving both problems synchronously remains extremely challenging. In this paper, we propose TFS-Diff, a model that simultaneously realizes tri-modal medical image fusion and super-resolution. Specifically, TFS-Diff is based on the diffusion model generation of a random iterative denoising process. We also develop a simple objective function and propose a fusion super-resolution loss that effectively evaluates the uncertainty in the fusion and ensures the stability of the optimization process. A channel attention module is also proposed to effectively integrate key information from different modalities for clinical diagnosis, avoiding information loss caused by multiple image processing. Extensive experiments on public Harvard datasets show that TFS-Diff significantly surpasses the existing state-of-the-art methods in both quantitative and visual evaluations. The source code will be available at GitHub.

Submitted 13 May, 2024; v1 submitted 26 April, 2024; originally announced April 2024.

arXiv:2404.17126 [pdf, other]

Deep Evidential Learning for Dose Prediction

Authors: Hai Siong Tan, Kuancheng Wang, Rafe Mcbeth

Abstract: In this work, we present a novel application of an uncertainty-quantification framework called Deep Evidential Learning in the domain of radiotherapy dose prediction. Using medical images of the Open Knowledge-Based Planning Challenge dataset, we found that this model can be effectively harnessed to yield uncertainty estimates that inherited correlations with prediction errors upon completion of network training. This was achieved only after reformulating the original loss function for a stable implementation. We found that (i) epistemic uncertainty was highly correlated with prediction errors, with various association indices comparable or stronger than those for Monte-Carlo Dropout and Deep Ensemble methods, (ii) the median error varied with uncertainty threshold much more linearly for epistemic uncertainty in Deep Evidential Learning relative to these other two conventional frameworks, indicative of a more uniformly calibrated sensitivity to model errors, (iii) relative to epistemic uncertainty, aleatoric uncertainty demonstrated a more significant shift in its distribution in response to Gaussian noise added to CT intensity, compatible with its interpretation as reflecting data noise. Collectively, our results suggest that Deep Evidential Learning is a promising approach that can endow deep-learning models in radiotherapy dose prediction with statistical robustness. Towards enhancing its clinical relevance, we demonstrate how we can use such a model to construct the predicted Dose-Volume-Histograms' confidence intervals.

Submitted 25 April, 2024; originally announced April 2024.

Comments: 24 pages, 8 figures

arXiv:2403.17770 [pdf, other]

CT Synthesis with Conditional Diffusion Models for Abdominal Lymph Node Segmentation

Authors: Yongrui Yu, Hanyu Chen, Zitian Zhang, Qiong Xiao, Wenhui Lei, Linrui Dai, Yu Fu, Hui Tan, Guan Wang, Peng Gao, Xiaofan Zhang

Abstract: Despite the significant success achieved by deep learning methods in medical image segmentation, researchers still struggle in the computer-aided diagnosis of abdominal lymph nodes due to the complex abdominal environment, small and indistinguishable lesions, and limited annotated data. To address these problems, we present a pipeline that integrates the conditional diffusion model for lymph node generation and the nnU-Net model for lymph node segmentation to improve the segmentation performance of abdominal lymph nodes through synthesizing a diversity of realistic abdominal lymph node data. We propose LN-DDPM, a conditional denoising diffusion probabilistic model (DDPM) for lymph node (LN) generation. LN-DDPM utilizes lymph node masks and anatomical structure masks as model conditions. These conditions work in two conditioning mechanisms: global structure conditioning and local detail conditioning, to distinguish between lymph nodes and their surroundings and better capture lymph node characteristics. The obtained paired abdominal lymph node images and masks are used for the downstream segmentation task. Experimental results on the abdominal lymph node datasets demonstrate that LN-DDPM outperforms other generative methods in the abdominal lymph node image synthesis and better assists the downstream abdominal lymph node segmentation task.

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.10024 [pdf, other]

MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage

Authors: Hao Hao Tan, Kin Wai Cheuk, Taemin Cho, Wei-Hsiang Liao, Yuki Mitsufuji

Abstract: This paper presents enhancements to the MT3 model, a state-of-the-art (SOTA) token-based multi-instrument automatic music transcription (AMT) model. Despite SOTA performance, MT3 has the issue of instrument leakage, where transcriptions are fragmented across different instruments. To mitigate this, we propose MR-MT3, with enhancements including a memory retention mechanism, prior token sampling, and token shuffling. These methods are evaluated on the Slakh2100 dataset, demonstrating improved onset F1 scores and reduced instrument leakage. In addition to the conventional multi-instrument transcription F1 score, new metrics such as the instrument leakage ratio and the instrument detection F1 score are introduced for a more comprehensive assessment of transcription quality. The study also explores the issue of domain overfitting by evaluating MT3 on single-instrument monophonic datasets such as ComMU and NSynth. The findings, along with the source code, are shared to facilitate future work aimed at refining token-based multi-instrument AMT models.

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2401.03173 [pdf, other]

doi 10.4108/eetcasa.v10i1.4681

UGGNet: Bridging U-Net and VGG for Advanced Breast Cancer Diagnosis

Authors: Tran Cao Minh, Nguyen Kim Quoc, Phan Cong Vinh, Dang Nhu Phu, Vuong Xuan Chi, Ha Minh Tan

Abstract: In the field of medical imaging, breast ultrasound has emerged as a crucial diagnostic tool for early detection of breast cancer. However, the accuracy of diagnosing the location of the affected area and the extent of the disease depends on the experience of the physician. In this paper, we propose a novel model called UGGNet, combining the power of the U-Net and VGG architectures to enhance the performance of breast ultrasound image analysis. The U-Net component of the model helps accurately segment the lesions, while the VGG component utilizes deep convolutional layers to extract features. The fusion of these two architectures in UGGNet aims to optimize both segmentation and feature representation, providing a comprehensive solution for accurate diagnosis in breast ultrasound images. Experimental results have demonstrated that the UGGNet model achieves a notable accuracy of 78.2% on the "Breast Ultrasound Images Dataset."

Submitted 6 January, 2024; originally announced January 2024.

Comments: Submitted to the journal "EAI Endorsed Transactions on Context-aware Systems and Applications", 2 images, 5 data tables

Journal ref: EAI Endorsed Transactions on Context-aware Systems and Applications, 10(1), 2024

arXiv:2311.06572 [pdf, other]

Swin UNETR++: Advancing Transformer-Based Dense Dose Prediction Towards Fully Automated Radiation Oncology Treatments

Authors: Kuancheng Wang, Hai Siong Tan, Rafe Mcbeth

Abstract: The field of Radiation Oncology is uniquely positioned to benefit from the use of artificial intelligence to fully automate the creation of radiation treatment plans for cancer therapy. This time-consuming and specialized task combines patient imaging with organ and tumor segmentation to generate a 3D radiation dose distribution to meet clinical treatment goals, similar to voxel-level dense prediction. In this work, we propose Swin UNETR++, which contains a lightweight 3D Dual Cross-Attention (DCA) module to capture the intra- and inter-volume relationships of each patient's unique anatomy, which fully convolutional neural networks lack. Our model was trained, validated, and tested on the Open Knowledge-Based Planning dataset. In addition to metrics of Dose Score $\overline{S_{\text{Dose}}}$ and DVH Score $\overline{S_{\text{DVH}}}$ that quantitatively measure the difference between the predicted and ground-truth 3D radiation dose distribution, we propose the qualitative metrics of average volume-wise acceptance rate $\overline{R_{\text{VA}}}$ and average patient-wise clinical acceptance rate $\overline{R_{\text{PA}}}$ to assess the clinical reliability of the predictions. Swin UNETR++ demonstrates near-state-of-the-art performance on the validation and test datasets (validation: $\overline{S_{\text{DVH}}}$=1.492 Gy, $\overline{S_{\text{Dose}}}$=2.649 Gy, $\overline{R_{\text{VA}}}$=88.58%, $\overline{R_{\text{PA}}}$=100.0%; test: $\overline{S_{\text{DVH}}}$=1.634 Gy, $\overline{S_{\text{Dose}}}$=2.757 Gy, $\overline{R_{\text{VA}}}$=90.50%, $\overline{R_{\text{PA}}}$=98.0%), establishing a basis for future studies to translate 3D dose predictions into a deliverable treatment plan, facilitating full automation.

Submitted 17 March, 2024; v1 submitted 11 November, 2023; originally announced November 2023.

Comments: Extended Abstract presented at Machine Learning for Health (ML4H) symposium 2023, December 10th, 2023, New Orleans, United States, 16 pages

arXiv:2311.00940 [pdf, other]

Dynamic Uploading Scheduling in mmWave-Based Sensor Networks via Mobile Blocker Detection

Authors: Yifei Sun, Bojie Lv, Rui Wang, Haisheng Tan, Francis C. M. Lau

Abstract: The freshness of information, measured as Age of Information (AoI), is critical for many applications in next-generation wireless sensor networks (WSNs). Due to its high bandwidth, millimeter wave (mmWave) communication is seen to be frequently exploited in WSNs to facilitate the deployment of bandwidth-demanding applications. However, the vulnerability of mmWave to user mobility typically results in link blockage and thus postponed real-time communications. In this paper, joint sampling and uploading scheduling in an AoI-oriented WSN working in the mmWave band is considered, where a single human blocker is moving randomly and signal propagation paths may be blocked. The locations of signal reflectors and the real-time position of the blocker can be detected via wireless sensing technologies. With the knowledge of the blocker motion pattern, the statistics of future wireless channels can be predicted. As a result, the AoI degradation arising from link blockage can be forecast and mitigated. Specifically, we formulate the long-term sampling, uplink transmission time and power allocation as an infinite-horizon Markov decision process (MDP) with discounted cost. Due to the curse of dimensionality, the optimal solution is infeasible. A novel low-complexity solution framework with guaranteed performance in the worst case is proposed where the forecast of link blockage is exploited in a value function approximation. Simulations show that compared with several heuristic benchmarks, our proposed policy, benefiting from the awareness of link blockage, can reduce the average cost by up to 49.6%.

Submitted 1 November, 2023; originally announced November 2023.

Comments: 10 pages, 5 figures, accepted for publication on ICPADS23

arXiv:2308.11162 [pdf, other]

A Preliminary Investigation into Search and Matching for Tumour Discrimination in WHO Breast Taxonomy Using Deep Networks

Authors: Abubakr Shafique, Ricardo Gonzalez, Liron Pantanowitz, Puay Hoon Tan, Alberto Machado, Ian A Cree, Hamid R. Tizhoosh

Abstract: Breast cancer is one of the most common cancers affecting women worldwide. They include a group of malignant neoplasms with a variety of biological, clinical, and histopathological characteristics. There are more than 35 different histological forms of breast lesions that can be classified and diagnosed histologically according to cell morphology, growth, and architecture patterns. Recently, deep learning, in the field of artificial intelligence, has drawn a lot of attention for the computerized representation of medical images. Searchable digital atlases can provide pathologists with patch matching tools allowing them to search among evidently diagnosed and treated archival cases, a technology that may be regarded as computational second opinion. In this study, we indexed and analyzed the WHO breast taxonomy (Classification of Tumours 5th Ed.) spanning 35 tumour types. We visualized all tumour types using deep features extracted from a state-of-the-art deep learning model, pre-trained on millions of diagnostic histopathology images from the TCGA repository. Furthermore, we test the concept of a digital "atlas" as a reference for search and matching with rare test cases. The patch similarity search within the WHO breast taxonomy data reached over 88% accuracy when validating through "majority vote" and more than 91% accuracy when validating using top-n tumour types. These results show for the first time that complex relationships among common and rare breast lesions can be investigated using an indexed digital archive.

Submitted 21 August, 2023; originally announced August 2023.

arXiv:2303.16734 [pdf, other]

Predictive Resource Allocation in mmWave Systems with Rotation Detection

Authors: Yifei Sun, Bojie Lv, Rui Wang, Haisheng Tan, Francis C. M. Lau

Abstract: Millimeter wave (mmWave) has been regarded as a promising technology to support high-capacity communications in the 5G era. However, its high-layer performance, such as latency and packet drop rate in the long term, highly depends on resource allocation because the mmWave channel suffers significant fluctuation with rotating users due to the mmWave sparse channel property and the limited field-of-view (FoV) of antenna arrays. In this paper, downlink transmission scheduling considering rotation of user equipments (UE) and limited antenna FoV in an mmWave system is optimized via a novel approximate Markov decision process (MDP) method. Specifically, we consider the joint downlink UE selection and power allocation in a number of frames where future orientations of rotating UEs can be predicted via embedded motion sensors. The problem is formulated as a finite-horizon MDP with non-stationary state transition probabilities. A novel low-complexity solution framework is proposed via one iteration step over a base policy whose average future cost can be predicted with analytical expressions. It is demonstrated by simulations that compared with existing benchmarks, the proposed scheme can schedule the downlink transmission and suppress the packet drop rate efficiently in non-stationary mmWave links.

Submitted 29 March, 2023; originally announced March 2023.

Comments: 7 pages, 5 figures. Paper accepted for publication in IEEE International Conference on Communications, 2023

arXiv:2212.09988 [pdf, other]

Multi-Reference Image Super-Resolution: A Posterior Fusion Approach

Authors: Ke Zhao, Haining Tan, Tsz Fung Yau

Abstract: Reference-based Super-resolution (RefSR) approaches have recently been proposed to overcome the ill-posed problem of image super-resolution by providing additional information from a high-resolution image. Multi-reference super-resolution extends this approach by allowing more information to be incorporated. This paper proposes a 2-step-weighting posterior fusion approach to combine the outputs of RefSR models with multiple references. Extensive experiments on the CUFED5 dataset demonstrate that the proposed methods can be applied to various state-of-the-art RefSR models to get a consistent improvement in image quality.

Submitted 19 December, 2022; originally announced December 2022.

arXiv:2208.02250 [pdf]

Adversarial Attacks on ASR Systems: An Overview

Authors: Xiao Zhang, Hao Tan, Xuan Huang, Denghui Zhang, Keke Tang, Zhaoquan Gu

Abstract: With the development of hardware and algorithms, ASR (Automatic Speech Recognition) systems have evolved a lot. As the models get simpler and development and deployment become easier, ASR systems are getting closer to our lives. On the one hand, we often use apps or APIs of ASR to generate subtitles and record meetings. On the other hand, smart speakers and self-driving cars rely on ASR systems to control AIoT devices. In the past few years, there have been many works on adversarial example attacks against ASR systems. By adding a small perturbation to the waveforms, the recognition results can be changed substantially. In this paper, we describe the development of ASR systems, different assumptions of attacks, and how to evaluate these attacks. Next, we introduce the current works on adversarial example attacks from two attack assumptions: white-box attack and black-box attack. Different from other surveys, we pay more attention to the layer at which they perturb waveforms in the ASR system, the relationship between these attacks, and their implementation methods. We focus on the effect of their works.

Submitted 3 August, 2022; originally announced August 2022.

arXiv:2206.02996 [pdf, other]

doi 10.23919/JCIN.2022.10005216

An Indoor Environment Sensing and Localization System via mmWave Phased Array

Authors: Yifei Sun, Jie Li, Tong Zhang, Rui Wang, Xiaohui Peng, Tony Xiao Han, Haisheng Tan

Abstract: An indoor layout sensing and localization system in the 60 GHz millimeter wave (mmWave) band, named mmReality, is elaborated in this paper. The mmReality system consists of one transmitter and one mobile receiver, each with a phased array and a single radio frequency (RF) chain. To reconstruct the room layout, the pilot signal is delivered from the transmitter to the receiver via different pairs of transmission and receiving beams, so that the signals at all antenna elements can be resolved. Then, the spatial smoothing and two-dimensional multiple signal classification (MUSIC) algorithm is applied to detect the angles-of-arrival (AoAs) and angles-of-departure (AoDs) of the rays from the transmitter to the receiver. Moreover, the technique of multi-carrier ranging is adopted to measure the distance of each propagation path. Synthesizing the above geometrical parameters, the location of the receiver relative to the transmitter can be pinpointed, and both line-of-sight (LoS) and non-line-of-sight (NLoS) paths can also be determined. Therefore, the room layout can be reconstructed by moving the receiver and repeating the above measurement in different locations of the room. At the end, we show that the reconstructed room layout can be utilized to locate a mobile device according to its AoA spectrum, even with a single access point.

Submitted 9 January, 2023; v1 submitted 7 June, 2022; originally announced June 2022.

Comments: Paper accepted for publication in Journal of Communications and Information Networks, 2022

arXiv:2201.01669 [pdf, other]

Using Deep Learning with Large Aggregated Datasets for COVID-19 Classification from Cough

Authors: Esin Darici Haritaoglu, Nicholas Rasmussen, Daniel C. H. Tan, Jennifer Ranjani J., Jaclyn Xiao, Gunvant Chaudhari, Akanksha Rajput, Praveen Govindan, Christian Canham, Wei Chen, Minami Yamaura, Laura Gomezjurado, Aaron Broukhim, Amil Khanzada, Mert Pilanci

Abstract: The Covid-19 pandemic has been one of the most devastating events in recent history, claiming the lives of more than 5 million people worldwide. Even with the worldwide distribution of vaccines, there is an apparent need for affordable, reliable, and accessible screening techniques to serve parts of the World that do not have access to Western medicine. Artificial Intelligence can provide a solution utilizing cough sounds as a primary screening mode for COVID-19 diagnosis. This paper presents multiple models that have achieved relatively respectable performance on the largest evaluation dataset currently presented in academic literature. Through investigation of a self-supervised learning model (Area under the ROC curve, AUC = 0.807) and a convolutional neural network (CNN) model (AUC = 0.802), we observe the possibility of model bias with limited datasets. Moreover, we observe that performance increases with training data size, showing the need for the worldwide collection of data to help combat the Covid-19 pandemic with non-traditional means.

Submitted 29 March, 2022; v1 submitted 5 January, 2022; originally announced January 2022.

arXiv:2112.14574 [pdf]

Industry 4.0: Challenges and success factors for adopting digital technologies in airports

Authors: Jia Hao Tan, Tariq Masood

Abstract: With the advent of Industry 4.0 technologies in the last decade, airports have undergone digitalisation to capitalise on the purported benefits of these technologies such as improved operational efficiency and passenger experience. The ongoing COVID-19 pandemic with emergence of its variants (e.g. Delta, Omicron) has exacerbated the need for airports to adopt new technologies such as contactless and robotic technologies to facilitate travel during this pandemic. However, there is limited knowledge of recent challenges and success factors for adoption of digital technologies in airports. Therefore, through an industry survey of airport operators and managers around the world (n=102, 0.754<Composite Reliability<0.892; conducted during COVID-19), this study identifies the challenges faced in adopting Industry 4.0 technologies (n=20) as well as enhances understanding of best practices or success factors that supported technology adoption in airports. The widely used technology, organisation, environment (TOE) framework is used as a theoretical basis for the quantitative part of the questionnaire. A complementary qualitative part is used to underpin and extend the findings. The industry survey is the first of its kind conducted to understand the implementation challenges that airport operators face in adopting Industry 4.0 technologies in the airport. The survey results have shown that the Industry 4.0 technologies were not implemented to a similar extent in airports despite the generic challenges that were faced in adopting the various Industry 4.0 technologies in the airport.

Submitted 29 December, 2021; originally announced December 2021.

Comments: 25 pages, 4 figures, 9 tables

arXiv:2112.14333 [pdf]

Adoption of Industry 4.0 technologies in airports -- A systematic literature review

Authors: Jia Hao Tan, Tariq Masood

Abstract: Airports have been constantly evolving and adopting digital technologies to improve operational efficiency, enhance passenger experience, generate ancillary revenues and boost capacity from existing infrastructure. The COVID-19 pandemic has also challenged airports and aviation stakeholders alike to adapt and manage new operational challenges such as facilitating a contactless travel experience and ensuring business continuity. Digitalisation using Industry 4.0 technologies offers opportunities for airports to address short-term challenges associated with the COVID-19 pandemic while also preparing for future long-term challenges that ensue from the crisis. Through a systematic literature review of 102 relevant articles, we discuss the current state of adoption of Industry 4.0 technologies in airports, the associated challenges as well as future research directions. The results of this review suggest that the implementation of Industry 4.0 technologies is slowly gaining traction within the airport environment, and shall continue to remain relevant in the digital transformation journeys in developing future airports.

Submitted 28 December, 2021; originally announced December 2021.

Comments: 25 pages, 2 figures, 2 tables, 106 references

arXiv:2112.00702 [pdf, other]

Semi-supervised music emotion recognition using noisy student training and harmonic pitch class profiles

Authors: Hao Hao Tan

Abstract: We present Mirable's submission to the 2021 Emotions and Themes in Music challenge. In this work, we intend to address the question: can we leverage semi-supervised learning techniques on music emotion recognition? With that, we experiment with noisy student training, which has improved model performance in the image classification domain. As the noisy student method requires a strong teacher model, we further delve into the factors including (i) input training length and (ii) complementary music representations to further boost the performance of the teacher model. For (i), we find that models trained with short input length perform better in PR-AUC, whereas those trained with long input length perform better in ROC-AUC. For (ii), we find that using harmonic pitch class profiles (HPCP) consistently improves tagging performance, which suggests that harmonic representation is useful for music emotion tagging. Finally, we find that the noisy student method only improves tagging results for the case of long training length. Additionally, we find that ensembling representations trained with different training lengths can improve tagging results significantly, which suggests a possible direction to explore incorporating multiple temporal resolutions in the network architecture for future work.

Submitted 9 December, 2021; v1 submitted 1 December, 2021; originally announced December 2021.

Comments: MediaEval 2021 submission for Emotion and Themes in Music

arXiv:2108.07007 [pdf, other]

Flying Guide Dog: Walkable Path Discovery for the Visually Impaired Utilizing Drones and Transformer-based Semantic Segmentation

Authors: Haobin Tan, Chang Chen, Xinyu Luo, Jiaming Zhang, Constantin Seibold, Kailun Yang, Rainer Stiefelhagen

Abstract: Lacking the ability to sense ambient environments effectively, blind and visually impaired people (BVIP) face difficulty in walking outdoors, especially in urban areas. Therefore, tools for assisting BVIP are of great importance. In this paper, we propose a novel "flying guide dog" prototype for BVIP assistance using drone and street view semantic segmentation. Based on the walkable areas extracted from the segmentation prediction, the drone can adjust its movement automatically and thus lead the user to walk along the walkable path. By recognizing the color of pedestrian traffic lights, our prototype can help the user to cross a street safely. Furthermore, we introduce a new dataset named Pedestrian and Vehicle Traffic Lights (PVTL), which is dedicated to traffic light recognition. The result of our user study in real-world scenarios shows that our prototype is effective and easy to use, providing new insight into BVIP assistance.

Submitted 16 August, 2021; originally announced August 2021.

Comments: Code, dataset, and video demo will be made publicly available at https://github.com/EckoTan0804/flying-guide-dog

arXiv:2102.08015 [pdf]

Improving speech recognition models with small samples for air traffic control systems

Authors: Yi Lin, Qin Li, Bo Yang, Zhen Yan, Huachun Tan, Zhengmao Chen

Abstract: In the domain of air traffic control (ATC) systems, efforts to train a practical automatic speech recognition (ASR) model always face the problem of small training samples, since the collection and annotation of speech samples is an expert- and domain-dependent task. In this work, a novel training approach based on pretraining and transfer learning is proposed to address this issue, and an improved end-to-end deep learning model is developed to address the specific challenges of ASR in the ATC domain. An unsupervised pretraining strategy is first proposed to learn speech representations from unlabeled samples for a certain dataset. Specifically, a masking strategy is applied to improve the diversity of the samples without losing their general patterns. Subsequently, transfer learning is applied to fine-tune a pretrained or other optimized baseline model to finally achieve the supervised ASR task. By virtue of the common terminology used in the ATC domain, the transfer learning task can be regarded as a sub-domain adaptation task, in which the transferred model is optimized using a joint corpus consisting of baseline samples and new transcribed samples from the target dataset. This joint corpus construction strategy enriches the size and diversity of the training samples, which is important for addressing the issue of the small transcribed corpus. In addition, speed perturbation is applied to augment the new transcribed samples to further improve the quality of the speech corpus. Three real ATC datasets are used to validate the proposed ASR model and training strategies. The experimental results demonstrate that the ASR performance is significantly improved on all three datasets, with an absolute character error rate only one-third of that achieved through the supervised training. The applicability of the proposed strategies to other ASR approaches is also validated.

Submitted 16 February, 2021; originally announced February 2021.

Comments: This work has been accepted by Neurocomputing for publication

arXiv:2012.04885 [pdf]

doi 10.1038/s41467-021-26216-9

Annotation-efficient deep learning for automatic medical image segmentation

Authors: Shanshan Wang, Cheng Li, Rongpin Wang, Zaiyi Liu, Meiyun Wang, Hongna Tan, Yaping Wu, Xinfeng Liu, Hui Sun, Rui Yang, Xin Liu, Jie Chen, Huihui Zhou, Ismail Ben Ayed, Hairong Zheng

Abstract: Automatic medical image segmentation plays a critical role in scientific research and medical care. Existing high-performance deep learning methods typically rely on large training datasets with high-quality manual annotations, which are difficult to obtain in many clinical applications. Here, we introduce Annotation-effIcient Deep lEarning (AIDE), an open-source framework to handle imperfect training datasets. Methodological analyses and empirical evaluations are conducted, and we demonstrate that AIDE surpasses conventional fully-supervised models by presenting better performance on open datasets possessing scarce or noisy annotations. We further test AIDE in a real-life case study for breast tumor segmentation. Three datasets containing 11,852 breast images from three medical centers are employed, and AIDE, utilizing 10% training annotations, consistently produces segmentation maps comparable to those generated by fully-supervised counterparts or provided by independent radiologists. The 10-fold enhanced efficiency in utilizing expert labels has the potential to promote a wide range of biomedical applications.

Submitted 23 September, 2021; v1 submitted 9 December, 2020; originally announced December 2020.

arXiv:2007.15474 [pdf, other]

Music FaderNets: Controllable Music Generation Based On High-Level Features via Low-Level Feature Modelling

Authors: Hao Hao Tan, Dorien Herremans

Abstract: High-level musical qualities (such as emotion) are often abstract, subjective, and hard to quantify. Given these difficulties, it is not easy to learn good feature representations with supervised learning techniques, either because of the insufficiency of labels, or the subjectiveness (and hence large variance) in human-annotated labels. In this paper, we present a framework that can learn high-level feature representations with a limited amount of data, by first modelling their corresponding quantifiable low-level attributes. We refer to our proposed framework as Music FaderNets, which is inspired by the fact that low-level attributes can be continuously manipulated by separate "sliding faders" through feature disentanglement and latent regularization techniques. High-level features are then inferred from the low-level representations through semi-supervised clustering using Gaussian Mixture Variational Autoencoders (GM-VAEs). Using arousal as an example of a high-level feature, we show that the "faders" of our model are disentangled and change linearly w.r.t. the modelled low-level attributes of the generated output music. Furthermore, we demonstrate that the model successfully learns the intrinsic relationship between arousal and its corresponding low-level attributes (rhythm and note density), with only 1% of the training set being labelled. Finally, using the learnt high-level feature representations, we explore the application of our framework in style transfer tasks across different arousal states. The effectiveness of this approach is verified through a subjective listening test.

Submitted 29 July, 2020; originally announced July 2020.

Journal ref: Proc. of 21st International Society of Music Information Retrieval Conference, ISMIR 2020

arXiv:2006.09833 [pdf, other]

Generative Modelling for Controllable Audio Synthesis of Expressive Piano Performance

Authors: Hao Hao Tan, Yin-Jyun Luo, Dorien Herremans

Abstract: We present a controllable neural audio synthesizer based on Gaussian Mixture Variational Autoencoders (GM-VAE), which can generate realistic piano performances in the audio domain that closely follow temporal conditions of two essential style features for piano performances: articulation and dynamics. We demonstrate how the model is able to apply fine-grained style morphing over the course of synthesizing the audio. This is based on conditions which are latent variables that can be sampled from the prior or inferred from other pieces. One of the envisioned use cases is to inspire creative and brand new interpretations for existing pieces of piano music.

Submitted 12 July, 2020; v1 submitted 16 June, 2020; originally announced June 2020.

Journal ref: Published at ICML Workshop on Machine Learning for Media Discovery (ML4MD) 2020

arXiv:2004.00879 [pdf, other]

Enhance the performance of navigation: A two-stage machine learning approach

Authors: Yimin Fan, Zhiyuan Wang, Yuanpeng Lin, Haisheng Tan

Abstract: Real-time traffic navigation is an important capability in smart transportation technologies and has been extensively studied in recent years. Due to the vast development of edge devices, collecting real-time traffic data is no longer a problem. However, real-time traffic navigation is still considered to be a particularly challenging problem because of the time-varying patterns of the traffic flow and unpredictable accidents and congestion. To give accurate and reliable navigation results, predicting the future traffic flow (speed, congestion, volume, etc.) in a fast and accurate way is of great importance. In this paper, we adopt the ideas of ensemble learning and develop a two-stage machine learning model to give accurate navigation results. We model the traffic flow as a time series and apply the XGBoost algorithm to get accurate predictions of future traffic conditions (1st stage). We then apply the Top K Dijkstra algorithm to find a set of shortest paths from the given start point to the destination as the candidates for the output optimal path. With the prediction results from the 1st stage, we find one optimal path from the candidates as the output of the navigation algorithm. We show that our navigation algorithm can be greatly improved via EOPF (Enhanced Optimal Path Finding), which is based on a neural network (2nd stage). We show that our method can be over 7% better than the method without EOPF in many situations, which indicates the effectiveness of our model.

Submitted 2 April, 2020; originally announced April 2020.

Comments: 8 pages, under review

arXiv:2002.12588 [pdf, other]

Regional Registration of Whole Slide Image Stacks Containing Highly Deformed Artefacts

Authors: Mahsa Paknezhad, Sheng Yang Michael Loh, Yukti Choudhury, Valerie Koh Cui Koh, Timothy Tay Kwang Yong, Hui Shan Tan, Ravindran Kanesvaran, Puay Hoon Tan, John Yuen Shyi Peng, Weimiao Yu, Yongcheng Benjamin Tan, Yong Zhen Loy, Min-Han Tan, Hwee Kuan Lee

Abstract: Motivation: High resolution 2D whole slide imaging provides rich information about the tissue structure. This information can be a lot richer if these 2D images can be stacked into a 3D tissue volume. A 3D analysis, however, requires accurate reconstruction of the tissue volume from the 2D image stack. This task is not trivial due to the distortions that each individual tissue slice experiences while cutting and mounting the tissue on the glass slide. Performing registration for the whole tissue slices may be adversely affected by the deformed tissue regions. Consequently, regional registration is found to be more effective. In this paper, we propose an accurate and robust regional registration algorithm for whole slide images which incrementally focuses registration on the area around the region of interest. Results: Using mean similarity index as the metric, the proposed algorithm (mean $\pm$ std: $0.84 \pm 0.11$) followed by a fine registration algorithm ($0.86 \pm 0.08$) outperformed the state-of-the-art linear whole tissue registration algorithm ($0.74 \pm 0.19$) and the regional version of this algorithm ($0.81 \pm 0.15$). The proposed algorithm also outperforms the state-of-the-art nonlinear registration algorithm (original: $0.82 \pm 0.12$, regional: $0.77 \pm 0.22$) for whole slide images and a recently proposed patch-based registration algorithm (patch size 256: $0.79 \pm 0.16$, patch size 512: $0.77 \pm 0.16$) for medical images.
Availability: The C++ implementation code is available online at the github repository: https://github.com/MahsaPaknezhad/WSIRegistration

Submitted 28 February, 2020; originally announced February 2020.

arXiv:1911.12796 [pdf, other]

Light-weight Calibrator: a Separable Component for Unsupervised Domain Adaptation

Authors: Shaokai Ye, Kailu Wu, Mu Zhou, Yunfei Yang, Sia huat Tan, Kaidi Xu, Jiebo Song, Chenglong Bao, Kaisheng Ma

Abstract: Existing domain adaptation methods aim at learning features that can be generalized among domains. These methods commonly require updating the source classifier to adapt to the target domain and do not properly handle the trade-off between the source domain and the target domain. In this work, instead of training a classifier to adapt to the target domain, we use a separable component called data calibrator to help the fixed source classifier recover discrimination power in the target domain, while preserving the source domain's performance. When the difference between two domains is small, the source classifier's representation is sufficient to perform well in the target domain and outperforms GAN-based methods in digits. Otherwise, the proposed method can leverage synthetic images generated by GANs to boost performance and achieve state-of-the-art performance in digits datasets and driving scene semantic segmentation. Our method empirically reveals that certain intriguing hints, which can be mitigated by adversarial attack to domain discriminators, are one of the sources for performance degradation under the domain shift.

Submitted 28 February, 2020; v1 submitted 28 November, 2019; originally announced November 2019.

Comments: Accepted by CVPR2020

arXiv:1911.00364 [pdf, other]

Validation of a deep learning mammography model in a population with low screening rates

Authors: Kevin Wu, Eric Wu, Yaping Wu, Hongna Tan, Greg Sorensen, Meiyun Wang, Bill Lotter

Abstract: A key promise of AI applications in healthcare is in increasing access to quality medical care in under-served populations and emerging markets. However, deep learning models are often only trained on data from advantaged populations that have the infrastructure and resources required for large-scale data collection. In this paper, we aim to empirically investigate the potential impact of such biases on breast cancer detection in mammograms. We specifically explore how a deep learning algorithm trained on screening mammograms from the US and UK generalizes to mammograms collected at a hospital in China, where screening is not widely implemented. For the evaluation, we use a top-scoring model developed for the Digital Mammography DREAM Challenge. Despite the change in institution and population composition, we find that the model generalizes well, exhibiting similar performance to that achieved in the DREAM Challenge, even when controlling for tumor size. We also illustrate a simple but effective method for filtering predictions based on model variance, which can be particularly useful for deployment in new settings. While there are many components in developing a clinically effective system, these results represent a promising step towards increasing access to life-saving screening mammography in populations where screening rates are currently low.

Submitted 1 November, 2019; originally announced November 2019.

Journal ref: NeurIPS 2019. Fair ML for Health Workshop

arXiv:1810.12093 [pdf]

80-Channel WDM-MDM Transmission over 50-km Ring-Core Fiber Using a Compact OAM DEMUX and Modular 4x4 MIMO Equalization

Authors: Junwei Zhang, Yuanhui Wen, Heyun Tan, Jie Liu, Lei Shen, Maochun Wang, Jiangbo Zhu, Changjian Guo, Yujie Chen, Zhaohui Li, Siyuan Yu

Abstract: 8-OAM modes each carrying 10 wavelengths with 2.56-Tbit/s aggregated capacity and 10.24-bit/s/Hz spectral efficiency have been transmitted over 50-km specially designed ring-core fiber, using a compact OAM mode sorter and only modular 4x4 MIMO equalization.

Submitted 22 October, 2018; originally announced October 2018.

Comments: 3 pages, 2 figures, conference

Showing 1–28 of 28 results for author: Tan, H