-
Overview of Dialogue Robot Competition 2023
Authors:
Takashi Minato,
Ryuichiro Higashinaka,
Kurima Sakai,
Tomo Funayama,
Hiromitsu Nishizaki,
Takayuki Naga
Abstract:
We have held dialogue robot competitions in 2020 and 2022 to compare the performances of interactive robots using an android that closely resembles a human. In 2023, the third competition DRC2023 was held. The task of DRC2023 was designed to be more challenging than the previous travel agent dialogue tasks. Since anyone can now develop a dialogue system using LLMs, the participating teams are requ…
▽ More
We have held dialogue robot competitions in 2020 and 2022 to compare the performances of interactive robots using an android that closely resembles a human. In 2023, the third competition DRC2023 was held. The task of DRC2023 was designed to be more challenging than the previous travel agent dialogue tasks. Since anyone can now develop a dialogue system using LLMs, the participating teams are required to develop a system that effectively uses information about the situation on the spot (real-time information), which is not handled by ChatGPT and other systems. DRC2023 has two rounds, a preliminary round and the final round as well as the previous competitions. The preliminary round has held on Oct.27 -- Nov.20, 2023 at real travel agency stores. The final round will be held on December 23, 2023. This paper provides an overview of the task settings and evaluation method of DRC2023 and the preliminary round results.
△ Less
Submitted 7 January, 2024;
originally announced January 2024.
-
Proceedings of the Dialogue Robot Competition 2023
Authors:
Ryuichiro Higashinaka,
Takashi Minato,
Hiromitsu Nishizaki,
Takayuki Nagai
Abstract:
The Dialogic Robot Competition 2023 (DRC2023) is a competition for humanoid robots (android robots that closely resemble humans) to compete in interactive capabilities. This is the third year of the competition. The top four teams from the preliminary competition held in November 2023 will compete in the final competition on Saturday, December 23. The task for the interactive robots is to recommen…
▽ More
The Dialogic Robot Competition 2023 (DRC2023) is a competition for humanoid robots (android robots that closely resemble humans) to compete in interactive capabilities. This is the third year of the competition. The top four teams from the preliminary competition held in November 2023 will compete in the final competition on Saturday, December 23. The task for the interactive robots is to recommend a tourism plan for a specific region. The robots can employ multimodal behaviors, such as language and gestures, to engage the user in the sightseeing plan they recommend. In the preliminary round, the interactive robots were stationed in a travel agency office, where visitors conversed with them and rated their performance via a questionnaire. In the final round, dialogue researchers and tourism industry professionals interacted with the robots and evaluated their performance. This event allows visitors to gain insights into the types of dialogue services that future dialogue robots should offer. The proceedings include papers on dialogue systems developed by the 12 teams participating in DRC2023, as well as an overview of the papers provided by all the teams.
△ Less
Submitted 14 January, 2024; v1 submitted 21 December, 2023;
originally announced December 2023.
-
Pretraining Conformer with ASR or ASV for Anti-Spoofing Countermeasure
Authors:
Yikang Wang,
Hiromitsu Nishizaki,
Ming Li
Abstract:
Finding synthetic artifacts of spoofing data will help the anti-spoofing countermeasures (CMs) system discriminate between spoofed and real speech. The Conformer combines the best of convolutional neural network and the Transformer, allowing it to aggregate global and local information. This may benefit the CM system to capture the synthetic artifacts hidden both locally and globally. In this pape…
▽ More
Finding synthetic artifacts of spoofing data will help the anti-spoofing countermeasures (CMs) system discriminate between spoofed and real speech. The Conformer combines the best of convolutional neural network and the Transformer, allowing it to aggregate global and local information. This may benefit the CM system to capture the synthetic artifacts hidden both locally and globally. In this paper, we present the transfer learning based MFA-Conformer structure for CM systems. By pre-training the Conformer encoder with different tasks, the robustness of the CM system is enhanced. The proposed method is evaluated on both Chinese and English spoofing detection databases. In the FAD clean set, proposed method achieves an EER of 0.04%, which dramatically outperforms the baseline. Our system is also comparable to the pre-training methods base on Wav2Vec 2.0. Moreover, we also provide a detailed analysis of the robustness of different models.
△ Less
Submitted 30 October, 2023; v1 submitted 4 July, 2023;
originally announced July 2023.
-
Low Pass Filtering and Bandwidth Extension for Robust Anti-spoofing Countermeasure Against Codec Variabilities
Authors:
Yikang Wang,
Xingming Wang,
Hiromitsu Nishizaki,
Ming Li
Abstract:
A reliable voice anti-spoofing countermeasure system needs to robustly protect automatic speaker verification (ASV) systems in various kinds of spoofing scenarios. However, the performance of countermeasure systems could be degraded by channel effects and codecs. In this paper, we show that using the low-frequency subbands of signals as input can mitigate the negative impact introduced by codecs o…
▽ More
A reliable voice anti-spoofing countermeasure system needs to robustly protect automatic speaker verification (ASV) systems in various kinds of spoofing scenarios. However, the performance of countermeasure systems could be degraded by channel effects and codecs. In this paper, we show that using the low-frequency subbands of signals as input can mitigate the negative impact introduced by codecs on the countermeasure systems. To validate this, two types of low-pass filters with different cut-off frequencies are applied to countermeasure systems, and the equal error rate (EER) is reduced by up to 25% relatively. In addition, we propose a deep learning based bandwidth extension approach to further improve the detection accuracy. Recent studies show that the error rate of countermeasure systems increase dramatically when the silence part is removed by Voice Activity Detection (VAD), our experimental results show that the filtering and bandwidth extension approaches are also effective under the codec condition when VAD is applied.
△ Less
Submitted 11 November, 2022;
originally announced November 2022.
-
Overview of Dialogue Robot Competition 2022
Authors:
Takashi Minato,
Ryuichiro Higashinaka,
Kurima Sakai,
Tomo Funayama,
Hiromitsu Nishizaki,
Takayuki Nagai
Abstract:
Although many competitions have been held on dialogue systems in the past, no competition has been organized specifically for dialogue with humanoid robots. As the first such attempt in the world, we held a dialogue robot competition in 2020 to compare the performances of interactive robots using an android that closely resembles a human. Dialogue Robot Competition 2022 (DRC2022) was the second co…
▽ More
Although many competitions have been held on dialogue systems in the past, no competition has been organized specifically for dialogue with humanoid robots. As the first such attempt in the world, we held a dialogue robot competition in 2020 to compare the performances of interactive robots using an android that closely resembles a human. Dialogue Robot Competition 2022 (DRC2022) was the second competition, held in August 2022. The task and regulations followed those of the first competition, while the evaluation method was improved and the event was internationalized. The competition has two rounds, a preliminary round and the final round. In the preliminary round, twelve participating teams competed in performance of a dialogue robot in the manner of a field experiment, and then three of those teams were selected as finalists. The final round will be held on October 25, 2022, in the Robot Competition session of IROS2022. This paper provides an overview of the task settings and evaluation method of DRC2022 and the results of the preliminary round.
△ Less
Submitted 23 October, 2022;
originally announced October 2022.
-
Proceedings of the Dialogue Robot Competition 2022
Authors:
Ryuichiro Higashinaka,
Takashi Minato,
Hiromitsu Nishizaki,
Takayuki Nagai
Abstract:
The proceedings contain papers on the dialogue systems developed by the twelve teams participating in DRC2022, as well as an overview paper summarizing the competition.
The proceedings contain papers on the dialogue systems developed by the twelve teams participating in DRC2022, as well as an overview paper summarizing the competition.
△ Less
Submitted 25 October, 2022; v1 submitted 21 October, 2022;
originally announced October 2022.
-
Combination of Time-domain, Frequency-domain, and Cepstral-domain Acoustic Features for Speech Commands Classification
Authors:
Yikang Wang,
Hiromitsu Nishizaki
Abstract:
In speech-related classification tasks, frequency-domain acoustic features such as logarithmic Mel-filter bank coefficients (FBANK) and cepstral-domain acoustic features such as Mel-frequency cepstral coefficients (MFCC) are often used. However, time-domain features perform more effectively in some sound classification tasks which contain non-vocal or weakly speech-related sounds. We previously pr…
▽ More
In speech-related classification tasks, frequency-domain acoustic features such as logarithmic Mel-filter bank coefficients (FBANK) and cepstral-domain acoustic features such as Mel-frequency cepstral coefficients (MFCC) are often used. However, time-domain features perform more effectively in some sound classification tasks which contain non-vocal or weakly speech-related sounds. We previously proposed a feature called bit sequence representation (BSR), which is a time-domain binary acoustic feature based on the raw waveform. Compared with MFCC, BSR performed better in environmental sound detection and showed comparable accuracy performance in limited-vocabulary speech recognition tasks. In this paper, we propose a novel improvement BSR feature called BSR-float16 to represent floating-point values more precisely. We experimentally demonstrated the complementarity among time-domain, frequency-domain, and cepstral-domain features using a dataset called Speech Commands proposed by Google. Therefore, we used a simple back-end score fusion method to improve the final classification accuracy. The fusion results also showed better noise robustness.
△ Less
Submitted 16 June, 2022; v1 submitted 30 March, 2022;
originally announced March 2022.
-
Frequency-Directional Attention Model for Multilingual Automatic Speech Recognition
Authors:
Akihiro Dobashi,
Chee Siang Leow,
Hiromitsu Nishizaki
Abstract:
This paper proposes a model for transforming speech features using the frequency-directional attention model for End-to-End (E2E) automatic speech recognition. The idea is based on the hypothesis that in the phoneme system of each language, the characteristics of the frequency bands of speech when uttering them are different. By transforming the input Mel filter bank features with an attention mod…
▽ More
This paper proposes a model for transforming speech features using the frequency-directional attention model for End-to-End (E2E) automatic speech recognition. The idea is based on the hypothesis that in the phoneme system of each language, the characteristics of the frequency bands of speech when uttering them are different. By transforming the input Mel filter bank features with an attention model that characterizes the frequency direction, a feature transformation suitable for ASR in each language can be expected. This paper introduces a Transformer-encoder as a frequency-directional attention model. We evaluated the proposed method on a multilingual E2E ASR system for six different languages and found that the proposed method could achieve, on average, 5.3 points higher accuracy than the ASR model for each language by introducing the frequency-directional attention mechanism. Furthermore, visualization of the attention weights based on the proposed method suggested that it is possible to transform acoustic features considering the frequency characteristics of each language.
△ Less
Submitted 29 March, 2022;
originally announced March 2022.
-
Peer Collaborative Learning for Polyphonic Sound Event Detection
Authors:
Hayato Endo,
Hiromitsu Nishizaki
Abstract:
This paper describes that semi-supervised learning called peer collaborative learning (PCL) can be applied to the polyphonic sound event detection (PSED) task, which is one of the tasks in the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge. Many deep learning models have been studied to find out what kind of sound events occur where and for how long in a given audio c…
▽ More
This paper describes that semi-supervised learning called peer collaborative learning (PCL) can be applied to the polyphonic sound event detection (PSED) task, which is one of the tasks in the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge. Many deep learning models have been studied to find out what kind of sound events occur where and for how long in a given audio clip. The characteristic of PCL used in this paper is the combination of ensemble-based knowledge distillation into sub-networks and student-teacher model-based knowledge distillation, which can train a robust PSED model from a small amount of strongly labeled data, weakly labeled data, and a large amount of unlabeled data. We evaluated the proposed PCL model using the DCASE 2019 Task 4 datasets and achieved an F1-score improvement of about 10% compared to the baseline model.
△ Less
Submitted 7 October, 2021;
originally announced October 2021.
-
ExKaldi-RT: A Real-Time Automatic Speech Recognition Extension Toolkit of Kaldi
Authors:
Yu Wang,
Chee Siang Leow,
Akio Kobayashi,
Takehito Utsuro,
Hiromitsu Nishizaki
Abstract:
This paper describes the ExKaldi-RT online automatic speech recognition (ASR) toolkit that is implemented based on the Kaldi ASR toolkit and Python language. ExKaldi-RT provides tools for building online recognition pipelines. While similar tools are available built on Kaldi, a key feature of ExKaldi-RT that it works on Python, which has an easy-to-use interface that allows online ASR system devel…
▽ More
This paper describes the ExKaldi-RT online automatic speech recognition (ASR) toolkit that is implemented based on the Kaldi ASR toolkit and Python language. ExKaldi-RT provides tools for building online recognition pipelines. While similar tools are available built on Kaldi, a key feature of ExKaldi-RT that it works on Python, which has an easy-to-use interface that allows online ASR system developers to develop original research, such as by applying neural network-based signal processing and by decoding model trained with deep learning frameworks. We performed benchmark experiments on the minimum LibriSpeech corpus, and it showed that ExKaldi-RT could achieve competitive ASR performance in real-time recognition.
△ Less
Submitted 8 August, 2021; v1 submitted 3 April, 2021;
originally announced April 2021.
-
Audio Classification of Bit-Representation Waveform
Authors:
Masaki Okawa,
Takuya Saito,
Naoki Sawada,
Hiromitsu Nishizaki
Abstract:
This study investigated the waveform representation for audio signal classification. Recently, many studies on audio waveform classification such as acoustic event detection and music genre classification have been published. Most studies on audio waveform classification have proposed the use of a deep learning (neural network) framework. Generally, a frequency analysis method such as Fourier tran…
▽ More
This study investigated the waveform representation for audio signal classification. Recently, many studies on audio waveform classification such as acoustic event detection and music genre classification have been published. Most studies on audio waveform classification have proposed the use of a deep learning (neural network) framework. Generally, a frequency analysis method such as Fourier transform is applied to extract the frequency or spectral information from the input audio waveform before inputting the raw audio waveform into the neural network. In contrast to these previous studies, in this paper, we propose a novel waveform representation method, in which audio waveforms are represented as a bit sequence, for audio classification. In our experiment, we compare the proposed bit representation waveform, which is directly given to a neural network, to other representations of audio waveforms such as a raw audio waveform and a power spectrum with two classification tasks: one is an acoustic event classification task and the other is a sound/music classification task. The experimental results showed that the bit representation waveform achieved the best classification performance for both the tasks.
△ Less
Submitted 18 September, 2019; v1 submitted 8 April, 2019;
originally announced April 2019.
-
Evidence for incommensurate spin fluctuations in Sr_2RuO_4
Authors:
Y. Sidis,
M. Braden,
P. Bourges,
B. Hennion. S. Nishizaki,
Y. Maeno,
Y. Mori
Abstract:
We report first inelastic neutron scattering measurements in the normal state of Sr_2RuO_4 that reveal the existence of incommensurate magnetic spin fluctuations located at ${\bf q}_0=(\pm 0.6π/a, \pm 0.6π/a, 0)$. This finding confirms recent band structure calculations that have predicted incommensurate magnetic responses related to dynamical nesting properties of its Fermi surface.
We report first inelastic neutron scattering measurements in the normal state of Sr_2RuO_4 that reveal the existence of incommensurate magnetic spin fluctuations located at ${\bf q}_0=(\pm 0.6π/a, \pm 0.6π/a, 0)$. This finding confirms recent band structure calculations that have predicted incommensurate magnetic responses related to dynamical nesting properties of its Fermi surface.
△ Less
Submitted 23 April, 1999;
originally announced April 1999.