-
ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech
Authors:
Xin Wang,
Junichi Yamagishi,
Massimiliano Todisco,
Hector Delgado,
Andreas Nautsch,
Nicholas Evans,
Md Sahidullah,
Ville Vestman,
Tomi Kinnunen,
Kong Aik Lee,
Lauri Juvela,
Paavo Alku,
Yu-Huai Peng,
Hsin-Te Hwang,
Yu Tsao,
Hsin-Min Wang,
Sebastien Le Maguer,
Markus Becker,
Fergus Henderson,
Rob Clark,
Yu Zhang,
Quan Wang,
Ye Jia,
Kai Onuma,
Koji Mushika,
et al. (15 additional authors not shown)
Abstract:
Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentation attack detection" systems. In addition to impersonation, ASV systems are vulnerable to replay, speech synthesis, and voice conversion attacks. The ASVspoof 2019 edition is the first to consider all three spoofing attack types within a single challenge. While they originate from the same source database and same underlying protocol, they are explored in two specific use case scenarios. Spoofing attacks within a logical access (LA) scenario are generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques. Replay spoofing attacks within a physical access (PA) scenario are generated through carefully controlled simulations that support much more revealing analysis than possible previously. Also new to the 2019 edition is the use of the tandem detection cost function metric, which reflects the impact of spoofing and countermeasures on the reliability of a fixed ASV system. This paper describes the database design, protocol, spoofing attack implementations, and baseline ASV and countermeasure results. It also describes a human assessment on spoofed data in logical access. It was demonstrated that the spoofing data in the ASVspoof 2019 database have varied degrees of perceived quality and similarity to the target speakers, including spoofed data that cannot be differentiated from bona-fide utterances even by human subjects.
Submitted 14 July, 2020; v1 submitted 4 November, 2019;
originally announced November 2019.
-
Studying Mutual Phonetic Influence with a Web-Based Spoken Dialogue System
Authors:
Eran Raveh,
Ingmar Steiner,
Iona Gessinger,
Bernd Möbius
Abstract:
This paper presents a study on mutual speech variation influences in a human-computer setting. The study highlights behavioral patterns in data collected as part of a shadowing experiment, and is performed using a novel end-to-end platform for studying phonetic variation in dialogue. It includes a spoken dialogue system capable of detecting and tracking the state of phonetic features in the user's speech and adapting accordingly. It provides visual and numeric representations of the changes in real time, offering a high degree of customization, and can be used for simulating or reproducing speech variation scenarios. The replicated experiment presented in this paper along with the analysis of the relationship between the human and non-human interlocutors lays the groundwork for a spoken dialogue system with personalized speaking style, which we expect will improve the naturalness and efficiency of human-computer interaction.
Submitted 13 September, 2018;
originally announced September 2018.
-
A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks
Authors:
Arif Khan,
Ingmar Steiner,
Yusuke Sugano,
Andreas Bulling,
Ross Macdonald
Abstract:
Phonetic segmentation is the process of splitting speech into distinct phonetic units. Human experts routinely perform this task manually by analyzing auditory and visual cues using analysis software, which is an extremely time-consuming process. Methods exist for automatic segmentation, but these are not always accurate enough. In order to improve automatic segmentation, we need to model it as close to the manual segmentation as possible. This corpus is an effort to capture the human segmentation behavior by recording experts performing a segmentation task. We believe that this data will enable us to highlight the important aspects of manual segmentation, which can be used in automatic segmentation to improve its accuracy.
Submitted 11 May, 2018; v1 submitted 13 December, 2017;
originally announced December 2017.
-
Creating New Language and Voice Components for the Updated MaryTTS Text-to-Speech Synthesis Platform
Authors:
Ingmar Steiner,
Sébastien Le Maguer
Abstract:
We present a new workflow to create components for the MaryTTS text-to-speech synthesis platform, which is popular with researchers and developers, extending it to support new languages and custom synthetic voices. This workflow replaces the previous toolkit with an efficient, flexible process that leverages modern build automation and cloud-hosted infrastructure. Moreover, it is compatible with the updated MaryTTS architecture, enabling new features and state-of-the-art paradigms such as synthesis based on deep neural networks (DNNs). Like MaryTTS itself, the new tools are free, open source software (FOSS), and promote the use of open data.
Submitted 11 May, 2018; v1 submitted 13 December, 2017;
originally announced December 2017.
-
Synthesis of Tongue Motion and Acoustics from Text using a Multimodal Articulatory Database
Authors:
Ingmar Steiner,
Sébastien Le Maguer,
Alexander Hewer
Abstract:
We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a 3D model of the tongue surface to an articulatory dataset and training a statistical parametric speech synthesis system directly on the tongue model parameters. We evaluate the model at every step by comparing the spatial coordinates of predicted articulatory movements against the reference data. The results indicate a global mean Euclidean distance of less than 2.8 mm, and our approach can be adapted to add an articulatory modality to conventional TTS applications without the need for extra data.
Submitted 13 April, 2018; v1 submitted 29 December, 2016;
originally announced December 2016.
-
A real-time framework for visual feedback of articulatory data using statistical shape models
Authors:
Kristy James,
Alexander Hewer,
Ingmar Steiner,
Stefanie Wuhrer
Abstract:
We present a novel open-source framework for visualizing electromagnetic articulography (EMA) data in real-time, with a modular framework and anatomically accurate tongue and palate models derived by multilinear subspace learning.
Submitted 19 December, 2016;
originally announced December 2016.
-
A Multilinear Tongue Model Derived from Speech Related MRI Data of the Human Vocal Tract
Authors:
Alexander Hewer,
Stefanie Wuhrer,
Ingmar Steiner,
Korin Richmond
Abstract:
We present a multilinear statistical model of the human tongue that captures anatomical and tongue-pose-related shape variations separately. The model is derived from 3D magnetic resonance imaging data of 11 speakers sustaining speech-related vocal tract configurations. The extraction is performed using a minimally supervised method based on an image segmentation approach and a template fitting technique. Furthermore, it uses image denoising to deal with possibly corrupt data, palate surface information reconstruction to handle palatal tongue contacts, and a bootstrap strategy to refine the obtained shapes. Our evaluation concludes that limiting the degrees of freedom for the anatomical and speech-related variations to 5 and 4, respectively, produces a model that can reliably register unknown data while avoiding overfitting effects. Furthermore, we show that it can be used to generate a plausible tongue animation by tracking sparse motion capture data.
Submitted 17 April, 2018; v1 submitted 15 December, 2016;
originally announced December 2016.
-
A statistical shape space model of the palate surface trained on 3D MRI scans of the vocal tract
Authors:
Alexander Hewer,
Ingmar Steiner,
Timo Bolkart,
Stefanie Wuhrer,
Korin Richmond
Abstract:
We describe a minimally-supervised method for computing a statistical shape space model of the palate surface. The model is created from a corpus of volumetric magnetic resonance imaging (MRI) scans collected from 12 speakers. We extract a 3D mesh of the palate from each speaker, then train the model using principal component analysis (PCA). The palate model is then tested using 3D MRI from another corpus and evaluated using a high-resolution optical scan. We find that the error is low even when only a handful of measured coordinates are available. In both cases, our approach yields promising results. It can be applied to extract the palate shape from MRI data, and could be useful to other analysis modalities, such as electromagnetic articulography (EMA) and ultrasound tongue imaging (UTI).
Submitted 4 September, 2015;
originally announced February 2016.
-
Speech animation using electromagnetic articulography as motion capture data
Authors:
Ingmar Steiner,
Korin Richmond,
Slim Ouni
Abstract:
Electromagnetic articulography (EMA) captures the position and orientation of a number of markers, attached to the articulators, during speech. As such, it performs the same function for speech that conventional motion capture does for full-body movements acquired with optical modalities, a long-time staple technique of the animation industry. In this paper, EMA data is processed from a motion-capture perspective and applied to the visualization of an existing multimodal corpus of articulatory data, creating a kinematic 3D model of the tongue and teeth by adapting a conventional motion-capture-based animation paradigm. This is accomplished using off-the-shelf, open-source software. Such an animated model can then be easily integrated into multimedia applications as a digital asset, allowing the analysis of speech production in an intuitive and accessible manner. The processing of the EMA data, its co-registration with 3D data from vocal tract magnetic resonance imaging (MRI) and dental scans, and the modeling workflow are presented in detail, and several issues are discussed.
Submitted 30 October, 2013;
originally announced October 2013.
-
Using multimodal speech production data to evaluate articulatory animation for audiovisual speech synthesis
Authors:
Ingmar Steiner,
Korin Richmond,
Slim Ouni
Abstract:
The importance of modeling speech articulation for high-quality audiovisual (AV) speech synthesis is widely acknowledged. Nevertheless, while state-of-the-art, data-driven approaches to facial animation can make use of sophisticated motion capture techniques, the animation of the intraoral articulators (viz. the tongue, jaw, and velum) typically makes use of simple rules or viseme morphing, in stark contrast to the otherwise high quality of facial modeling. Using appropriate speech production data could significantly improve the quality of articulatory animation for AV synthesis.
Submitted 22 September, 2012;
originally announced September 2012.
-
Artimate: an articulatory animation framework for audiovisual speech synthesis
Authors:
Ingmar Steiner,
Slim Ouni
Abstract:
We present a modular framework for articulatory animation synthesis using speech motion capture data obtained with electromagnetic articulography (EMA). Adapting a skeletal animation approach, the articulatory motion data is applied to a three-dimensional (3D) model of the vocal tract, creating a portable resource that can be integrated in an audiovisual (AV) speech synthesis platform to provide realistic animation of the tongue and teeth for a virtual character. The framework also provides an interface to articulatory animation synthesis, as well as an example application to illustrate its use with a 3D game engine. We rely on cross-platform, open-source software and open standards to provide a lightweight, accessible, and portable workflow.
Submitted 15 March, 2012;
originally announced March 2012.
-
Progress in animation of an EMA-controlled tongue model for acoustic-visual speech synthesis
Authors:
Ingmar Steiner,
Slim Ouni
Abstract:
We present a technique for the animation of a 3D kinematic tongue model, one component of the talking head of an acoustic-visual (AV) speech synthesizer. The skeletal animation approach is adapted to make use of a deformable rig controlled by tongue motion capture data obtained with electromagnetic articulography (EMA), while the tongue surface is extracted from volumetric magnetic resonance imaging (MRI) data. Initial results are shown and future work outlined.
Submitted 19 January, 2012;
originally announced January 2012.
-
GROND - a 7-channel imager
Authors:
J. Greiner,
W. Bornemann,
C. Clemens,
M. Deuter,
G. Hasinger,
M. Honsberg,
H. Huber,
S. Huber,
M. Krauss,
T. Krühler,
A. Küpcü Yoldaş,
H. Mayer-Hasselwander,
B. Mican,
N. Primak,
F. Schrey,
I. Steiner,
G. Szokoly,
C. C. Thöne,
A. Yoldaş,
S. Klose,
U. Laux,
J. Winkler
Abstract:
We describe the construction of GROND, a 7-channel imager, primarily designed for rapid observations of gamma-ray burst afterglows. It allows simultaneous imaging in the Sloan g'r'i'z' and near-infrared $JHK$ bands. GROND was commissioned at the MPI/ESO 2.2m telescope at La Silla (Chile) in April 2007, and first results of its performance and calibration are presented.
Submitted 30 January, 2008;
originally announced January 2008.