arXiv:1911.01601

ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech

Authors: Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Hector Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Sebastien Le Maguer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan Wang, Ye Jia, Kai Onuma, Koji Mushika, et al. (15 additional authors not shown)

Abstract: Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentation attack detection" systems. In addition to impersonation, ASV systems are vulnerable to replay, speech synthesis, and voice conversion attacks. The ASVspoof 2019 edition is the first to consider all three spoofing attack types within a single challenge. While they originate from the same source database and same underlying protocol, they are explored in two specific use case scenarios. Spoofing attacks within a logical access (LA) scenario are generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques. Replay spoofing attacks within a physical access (PA) scenario are generated through carefully controlled simulations that support much more revealing analysis than possible previously. Also new to the 2019 edition is the use of the tandem detection cost function metric, which reflects the impact of spoofing and countermeasures on the reliability of a fixed ASV system. This paper describes the database design, protocol, spoofing attack implementations, and baseline ASV and countermeasure results. It also describes a human assessment on spoofed data in logical access. It was demonstrated that the spoofing data in the ASVspoof 2019 database have varied degrees of perceived quality and similarity to the target speakers, including spoofed data that cannot be differentiated from bona-fide utterances even by human subjects.

Submitted 14 July, 2020; v1 submitted 4 November, 2019; originally announced November 2019.

Comments: Accepted, Computer Speech and Language. This manuscript version is made available under the CC-BY-NC-ND 4.0. For the published version on Elsevier website, please visit https://doi.org/10.1016/j.csl.2020.101114

arXiv:1809.04945

doi 10.1007/978-3-319-99579-3_57

Studying Mutual Phonetic Influence with a Web-Based Spoken Dialogue System

Authors: Eran Raveh, Ingmar Steiner, Iona Gessinger, Bernd Möbius

Abstract: This paper presents a study on mutual speech variation influences in a human-computer setting. The study highlights behavioral patterns in data collected as part of a shadowing experiment, and is performed using a novel end-to-end platform for studying phonetic variation in dialogue. It includes a spoken dialogue system capable of detecting and tracking the state of phonetic features in the user's speech and adapting accordingly. It provides visual and numeric representations of the changes in real time, offering a high degree of customization, and can be used for simulating or reproducing speech variation scenarios. The replicated experiment presented in this paper along with the analysis of the relationship between the human and non-human interlocutors lays the groundwork for a spoken dialogue system with personalized speaking style, which we expect will improve the naturalness and efficiency of human-computer interaction.

Submitted 13 September, 2018; originally announced September 2018.

Comments: Proc. 20th International Conference on Speech and Computer (SPECOM)

arXiv:1712.04798

A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks

Authors: Arif Khan, Ingmar Steiner, Yusuke Sugano, Andreas Bulling, Ross Macdonald

Abstract: Phonetic segmentation is the process of splitting speech into distinct phonetic units. Human experts routinely perform this task manually by analyzing auditory and visual cues using analysis software, which is an extremely time-consuming process. Methods exist for automatic segmentation, but these are not always accurate enough. In order to improve automatic segmentation, we need to model it as close to the manual segmentation as possible. This corpus is an effort to capture the human segmentation behavior by recording experts performing a segmentation task. We believe that this data will enable us to highlight the important aspects of manual segmentation, which can be used in automatic segmentation to improve its accuracy.

Submitted 11 May, 2018; v1 submitted 13 December, 2017; originally announced December 2017.

Journal ref: Proc. LREC 11 (2018) 4277-4281

arXiv:1712.04787

Creating New Language and Voice Components for the Updated MaryTTS Text-to-Speech Synthesis Platform

Authors: Ingmar Steiner, Sébastien Le Maguer

Abstract: We present a new workflow to create components for the MaryTTS text-to-speech synthesis platform, which is popular with researchers and developers, extending it to support new languages and custom synthetic voices. This workflow replaces the previous toolkit with an efficient, flexible process that leverages modern build automation and cloud-hosted infrastructure. Moreover, it is compatible with the updated MaryTTS architecture, enabling new features and state-of-the-art paradigms such as synthesis based on deep neural networks (DNNs). Like MaryTTS itself, the new tools are free, open source software (FOSS), and promote the use of open data.

Submitted 11 May, 2018; v1 submitted 13 December, 2017; originally announced December 2017.

Journal ref: Proc. LREC 11 (2018) 3171-3175

arXiv:1612.09352

doi 10.1109/TASLP.2017.2756818

Synthesis of Tongue Motion and Acoustics from Text using a Multimodal Articulatory Database

Authors: Ingmar Steiner, Sébastien Le Maguer, Alexander Hewer

Abstract: We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a 3D model of the tongue surface to an articulatory dataset and training a statistical parametric speech synthesis system directly on the tongue model parameters. We evaluate the model at every step by comparing the spatial coordinates of predicted articulatory movements against the reference data. The results indicate a global mean Euclidean distance of less than 2.8 mm, and our approach can be adapted to add an articulatory modality to conventional TTS applications without the need for extra data.

Submitted 13 April, 2018; v1 submitted 29 December, 2016; originally announced December 2016.

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (2017) 2351 - 2361

arXiv:1612.06114

A real-time framework for visual feedback of articulatory data using statistical shape models

Authors: Kristy James, Alexander Hewer, Ingmar Steiner, Stefanie Wuhrer

Abstract: We present a novel open-source framework for visualizing electromagnetic articulography (EMA) data in real-time, with a modular framework and anatomically accurate tongue and palate models derived by multilinear subspace learning.

Submitted 19 December, 2016; originally announced December 2016.

Comments: 17th Annual Conference of the International Speech Communication Association (Interspeech), Oct 2016, San Francisco, United States

arXiv:1612.05005

doi 10.1016/j.csl.2018.02.001

A Multilinear Tongue Model Derived from Speech Related MRI Data of the Human Vocal Tract

Authors: Alexander Hewer, Stefanie Wuhrer, Ingmar Steiner, Korin Richmond

Abstract: We present a multilinear statistical model of the human tongue that captures anatomical and tongue pose related shape variations separately. The model is derived from 3D magnetic resonance imaging data of 11 speakers sustaining speech related vocal tract configurations. The extraction is performed by using a minimally supervised method that uses as basis an image segmentation approach and a template fitting technique. Furthermore, it uses image denoising to deal with possibly corrupt data, palate surface information reconstruction to handle palatal tongue contacts, and a bootstrap strategy to refine the obtained shapes. Our evaluation concludes that limiting the degrees of freedom for the anatomical and speech related variations to 5 and 4, respectively, produces a model that can reliably register unknown data while avoiding overfitting effects. Furthermore, we show that it can be used to generate a plausible tongue animation by tracking sparse motion capture data.

Submitted 17 April, 2018; v1 submitted 15 December, 2016; originally announced December 2016.

Journal ref: Computer Speech & Language 51 (2018) 68-92

arXiv:1602.07679

A statistical shape space model of the palate surface trained on 3D MRI scans of the vocal tract

Authors: Alexander Hewer, Ingmar Steiner, Timo Bolkart, Stefanie Wuhrer, Korin Richmond

Abstract: We describe a minimally-supervised method for computing a statistical shape space model of the palate surface. The model is created from a corpus of volumetric magnetic resonance imaging (MRI) scans collected from 12 speakers. We extract a 3D mesh of the palate from each speaker, then train the model using principal component analysis (PCA). The palate model is then tested using 3D MRI from another corpus and evaluated using a high-resolution optical scan. We find that the error is low even when only a handful of measured coordinates are available. In both cases, our approach yields promising results. It can be applied to extract the palate shape from MRI data, and could be useful to other analysis modalities, such as electromagnetic articulography (EMA) and ultrasound tongue imaging (UTI).

Submitted 4 September, 2015; originally announced February 2016.

Comments: Proceedings of the 18th International Congress of Phonetic Sciences, Aug 2015, Glasgow, United Kingdom. 2015, http://www.icphs2015.info/

arXiv:1310.8585

Speech animation using electromagnetic articulography as motion capture data

Authors: Ingmar Steiner, Korin Richmond, Slim Ouni

Abstract: Electromagnetic articulography (EMA) captures the position and orientation of a number of markers, attached to the articulators, during speech. As such, it performs the same function for speech that conventional motion capture does for full-body movements acquired with optical modalities, a long-time staple technique of the animation industry. In this paper, EMA data is processed from a motion-capture perspective and applied to the visualization of an existing multimodal corpus of articulatory data, creating a kinematic 3D model of the tongue and teeth by adapting a conventional motion capture based animation paradigm. This is accomplished using off-the-shelf, open-source software. Such an animated model can then be easily integrated into multimedia applications as a digital asset, allowing the analysis of speech production in an intuitive and accessible manner. The processing of the EMA data, its co-registration with 3D data from vocal tract magnetic resonance imaging (MRI) and dental scans, and the modeling workflow are presented in detail, and several issues discussed.

Submitted 30 October, 2013; originally announced October 2013.

Journal ref: AVSP - 12th International Conference on Auditory-Visual Speech Processing - 2013 (2013) 55-60

arXiv:1209.4982

Using multimodal speech production data to evaluate articulatory animation for audiovisual speech synthesis

Authors: Ingmar Steiner, Korin Richmond, Slim Ouni

Abstract: The importance of modeling speech articulation for high-quality audiovisual (AV) speech synthesis is widely acknowledged. Nevertheless, while state-of-the-art, data-driven approaches to facial animation can make use of sophisticated motion capture techniques, the animation of the intraoral articulators (viz. the tongue, jaw, and velum) typically makes use of simple rules or viseme morphing, in stark contrast to the otherwise high quality of facial modeling. Using appropriate speech production data could significantly improve the quality of articulatory animation for AV synthesis.

Submitted 22 September, 2012; originally announced September 2012.

Journal ref: 3rd International Symposium on Facial Analysis and Animation (2012)

arXiv:1203.3574

Artimate: an articulatory animation framework for audiovisual speech synthesis

Authors: Ingmar Steiner, Slim Ouni

Abstract: We present a modular framework for articulatory animation synthesis using speech motion capture data obtained with electromagnetic articulography (EMA). Adapting a skeletal animation approach, the articulatory motion data is applied to a three-dimensional (3D) model of the vocal tract, creating a portable resource that can be integrated in an audiovisual (AV) speech synthesis platform to provide realistic animation of the tongue and teeth for a virtual character. The framework also provides an interface to articulatory animation synthesis, as well as an example application to illustrate its use with a 3D game engine. We rely on cross-platform, open-source software and open standards to provide a lightweight, accessible, and portable workflow.

Submitted 15 March, 2012; originally announced March 2012.

Comments: Workshop on Innovation and Applications in Speech Technology (2012)

arXiv:1201.4080

Progress in animation of an EMA-controlled tongue model for acoustic-visual speech synthesis

Authors: Ingmar Steiner, Slim Ouni

Abstract: We present a technique for the animation of a 3D kinematic tongue model, one component of the talking head of an acoustic-visual (AV) speech synthesizer. The skeletal animation approach is adapted to make use of a deformable rig controlled by tongue motion capture data obtained with electromagnetic articulography (EMA), while the tongue surface is extracted from volumetric magnetic resonance imaging (MRI) data. Initial results are shown and future work outlined.

Submitted 19 January, 2012; originally announced January 2012.

Journal ref: Elektronische Sprachsignalverarbeitung 2011 TUDpress (Ed.) (2011) 245-252

arXiv:0801.4801

doi 10.1086/587032

GROND - a 7-channel imager

Authors: J. Greiner, W. Bornemann, C. Clemens, M. Deuter, G. Hasinger, M. Honsberg, H. Huber, S. Huber, M. Krauss, T. Krühler, A. Küpcü Yoldaş, H. Mayer-Hasselwander, B. Mican, N. Primak, F. Schrey, I. Steiner, G. Szokoly, C. C. Thöne, A. Yoldaş, S. Klose, U. Laux, J. Winkler

Abstract: We describe the construction of GROND, a 7-channel imager, primarily designed for rapid observations of gamma-ray burst afterglows. It allows simultaneous imaging in the Sloan g'r'i'z' and near-infrared $JHK$ bands. GROND was commissioned at the MPI/ESO 2.2m telescope at La Silla (Chile) in April 2007, and first results of its performance and calibration are presented.

Submitted 30 January, 2008; originally announced January 2008.

Comments: 25 pages, 21 figs, PASP (subm); version with full-resolution figures at http://www.mpe.mpg.de/~jcg/GROND/grond_pasp.pdf

Showing 1–13 of 13 results for author: Steiner, I