Showing 1–19 of 19 results for author: Kolossa, D

  1. arXiv:2305.17000  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    DistriBlock: Identifying adversarial audio samples by leveraging characteristics of the output distribution

    Authors: Matías P. Pizarro B., Dorothea Kolossa, Asja Fischer

    Abstract: Adversarial attacks can mislead automatic speech recognition (ASR) systems into predicting an arbitrary target text, thus posing a clear security threat. To prevent such attacks, we propose DistriBlock, an efficient detection strategy applicable to any ASR system that predicts a probability distribution over output tokens in each time step. We measure a set of characteristics of this distribution:…

    Submitted 22 May, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

  2. arXiv:2112.07400  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Robustifying automatic speech recognition by extracting slowly varying features

    Authors: Matias Pizarro, Dorothea Kolossa, Asja Fischer

    Abstract: In the past few years, it has been shown that deep learning systems are highly vulnerable to attacks with adversarial examples. Neural-network-based automatic speech recognition (ASR) systems are no exception. Targeted and untargeted attacks can modify an audio input signal in such a way that humans still recognise the same words, while ASR systems are steered to predict a different transcripti…

    Submitted 14 December, 2021; originally announced December 2021.

  3. arXiv:2109.15108  [pdf, other]

    eess.AS cs.SD

    Federated Learning in ASR: Not as Easy as You Think

    Authors: Wentao Yu, Jan Freiwald, Sören Tewes, Fabien Huennemeyer, Dorothea Kolossa

    Abstract: With the growing availability of smart devices and cloud services, personal speech assistance systems are increasingly used on a daily basis. Most devices redirect the voice recordings to a central server, which uses them for upgrading the recognizer model. This leads to major privacy concerns, since private data could be misused by the server or third parties. Federated learning is a decentralize…

    Submitted 30 September, 2021; originally announced September 2021.

    Journal ref: ITG Conference on Speech Communication, 2021

  4. arXiv:2109.04894  [pdf, other]

    eess.AS cs.SD eess.IV

    Large-vocabulary Audio-visual Speech Recognition in Noisy Environments

    Authors: Wentao Yu, Steffen Zeiler, Dorothea Kolossa

    Abstract: Audio-visual speech recognition (AVSR) can effectively and significantly improve the recognition rates of small-vocabulary systems, compared to their audio-only counterparts. For large-vocabulary systems, however, there are still many difficulties, such as unsatisfactory video recognition accuracies, that make it hard to improve over audio-only baselines. In this paper, we specifically consider su…

    Submitted 10 September, 2021; originally announced September 2021.

    Journal ref: The IEEE 23rd International Workshop on Multimedia Signal Processing (MMSP), 2021

  5. arXiv:2106.03903  [pdf, other]

    cs.SD cs.LG eess.AS

    PILOT: Introducing Transformers for Probabilistic Sound Event Localization

    Authors: Christopher Schymura, Benedikt Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa

    Abstract: Sound event localization aims at estimating the positions of sound sources in the environment with respect to an acoustic receiver (e.g. a microphone array). Recent advances in this domain have most prominently focused on utilizing deep recurrent neural networks. Inspired by the success of transformer architectures as a suitable alternative to classical recurrent neural networks, this paper introduces…

    Submitted 7 June, 2021; originally announced June 2021.

    Comments: Accepted at INTERSPEECH 2021

  6. arXiv:2104.09482  [pdf, other]

    eess.AS cs.SD

    Fusing information streams in end-to-end audio-visual speech recognition

    Authors: Wentao Yu, Steffen Zeiler, Dorothea Kolossa

    Abstract: End-to-end acoustic speech recognition has quickly gained widespread popularity and shows promising results in many studies. Specifically, the joint transformer/CTC model provides very good performance in many tasks. However, under noisy and distorted conditions, the performance still degrades notably. While audio-visual speech recognition can significantly improve the recognition rate of end-to-en…

    Submitted 19 April, 2021; originally announced April 2021.

    Comments: 5 pages

    Journal ref: Published in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021

  7. arXiv:2103.01173  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Unsupervised Classification of Voiced Speech and Pitch Tracking Using Forward-Backward Kalman Filtering

    Authors: Benedikt Boenninghoff, Robert M. Nickel, Steffen Zeiler, Dorothea Kolossa

    Abstract: The detection of voiced speech, the estimation of the fundamental frequency, and the tracking of pitch values over time are crucial subtasks for a variety of speech processing techniques. Many different algorithms have been developed for each of the three subtasks. We present a new algorithm that integrates the three subtasks into a single procedure. The algorithm can be applied to pre-recorded sp…

    Submitted 1 March, 2021; originally announced March 2021.

    Comments: Speech Communication; 12. ITG Symposium, 5-7 Oct. 2016

  8. arXiv:2103.00417  [pdf, other]

    cs.SD cs.LG eess.AS

    Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization

    Authors: Christopher Schymura, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa

    Abstract: Sound event localization frameworks based on deep neural networks have shown increased robustness with respect to reverberation and noise in comparison to classical parametric approaches. In particular, recurrent architectures that incorporate temporal context into the estimation process seem to be well-suited for this task. This paper proposes a novel approach to sound event localization by utili…

    Submitted 28 February, 2021; originally announced March 2021.

    Comments: Published in Proceedings of the 28th European Signal Processing Conference (EUSIPCO), 2020

  9. arXiv:2102.11588  [pdf, other]

    cs.SD cs.AI cs.CL cs.CV cs.LG eess.AS eess.IV

    Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain

    Authors: Julio Wissing, Benedikt Boenninghoff, Dorothea Kolossa, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Christopher Schymura

    Abstract: Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization. Both applications benefit from a known speaker position when, for instance, applying beamforming or assigning unique speaker identities. Recently, several approaches utilizing acoustic signals augmented with visual data have been proposed for this task. However, both the…

    Submitted 24 February, 2021; v1 submitted 23 February, 2021; originally announced February 2021.

    Comments: 4 pages, 6 figures, ICASSP 2021

  10. arXiv:2010.10682  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    VenoMave: Targeted Poisoning Against Speech Recognition

    Authors: Hojjat Aghakhani, Lea Schönherr, Thorsten Eisenhofer, Dorothea Kolossa, Thorsten Holz, Christopher Kruegel, Giovanni Vigna

    Abstract: Despite remarkable improvements, automatic speech recognition is susceptible to adversarial perturbations. Compared to standard machine learning architectures, these attacks are significantly more challenging, especially since the inputs to a speech recognition system are time series that contain both acoustic and linguistic properties of speech. Extracting all recognition-relevant information req…

    Submitted 20 April, 2023; v1 submitted 20 October, 2020; originally announced October 2020.

  11. arXiv:2010.08574  [pdf, other]

    eess.AS

    Non-intrusive speech intelligibility prediction using automatic speech recognition derived measures

    Authors: Mahdie Karbasi, Stefan Bleeck, Dorothea Kolossa

    Abstract: The estimation of speech intelligibility is still far from being a solved problem. One aspect is especially problematic: most of the standard models require a clean reference signal in order to estimate intelligibility. This is an issue of some significance, as a reference signal is often unavailable in practice. In this work, therefore, a non-intrusive speech intelligibility estimation framework i…

    Submitted 28 October, 2021; v1 submitted 16 October, 2020; originally announced October 2020.

  12. arXiv:2007.14223  [pdf, other]

    eess.AS cs.SD

    Multimodal Integration for Large-Vocabulary Audio-Visual Speech Recognition

    Authors: Wentao Yu, Steffen Zeiler, Dorothea Kolossa

    Abstract: For many small- and medium-vocabulary tasks, audio-visual speech recognition can significantly improve the recognition rates compared to audio-only systems. However, there is still an ongoing debate regarding the best combination strategy for multi-modal information, which should allow for the translation of these gains to large-vocabulary recognition. While an integration at the level of state-po…

    Submitted 28 July, 2020; originally announced July 2020.

    Comments: 5 pages

    Journal ref: Published in Proceedings of the 28th European Signal Processing Conference (EUSIPCO), 2020

  13. arXiv:2005.14611  [pdf, other]

    eess.AS cs.CR cs.LG cs.SD stat.ML

    Detecting Adversarial Examples for Speech Recognition via Uncertainty Quantification

    Authors: Sina Däubener, Lea Schönherr, Asja Fischer, Dorothea Kolossa

    Abstract: Machine learning systems, and specifically automatic speech recognition (ASR) systems, are vulnerable to adversarial attacks, where an attacker maliciously changes the input. In the case of ASR systems, the most interesting cases are targeted attacks, in which an attacker aims to force the system into recognizing given target transcriptions in an arbitrary audio sample. The increasing nu…

    Submitted 2 August, 2020; v1 submitted 24 May, 2020; originally announced May 2020.

  14. arXiv:2003.08685  [pdf, other]

    cs.CV eess.IV

    Leveraging Frequency Analysis for Deep Fake Image Recognition

    Authors: Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, Thorsten Holz

    Abstract: Deep neural networks can generate images that are astonishingly realistic, so much so that it is often hard for humans to distinguish them from actual photos. These achievements have been largely made possible by Generative Adversarial Networks (GANs). While deep fake images have been thoroughly investigated in the image domain - a classical approach from the area of image forensics - an analysis…

    Submitted 26 June, 2020; v1 submitted 19 March, 2020; originally announced March 2020.

    Comments: Accepted to ICML 2020. New experiments, updated several sections, code: https://github.com/RUB-SysSec/GANDCTAnalysis

  15. arXiv:1912.05869  [pdf, other]

    eess.AS cs.NE cs.SD q-bio.NC

    On Neural Phone Recognition of Mixed-Source ECoG Signals

    Authors: Ahmed Hussen Abdelaziz, Shuo-Yiin Chang, Nelson Morgan, Erik Edwards, Dorothea Kolossa, Dan Ellis, David A. Moses, Edward F. Chang

    Abstract: The emerging field of neural speech recognition (NSR) using electrocorticography has recently attracted remarkable research interest for studying how human brains recognize speech in quiet and noisy surroundings. In this study, we demonstrate the utility of NSR systems to objectively prove the ability of human beings to attend to a single speech source while suppressing the interfering signals in…

    Submitted 12 December, 2019; originally announced December 2019.

    Comments: 5 pages, showing algorithms, results and references from our collaboration during a 2017 postdoc stay of the first author

  16. arXiv:1908.01551  [pdf, other]

    cs.CR cs.LG cs.SD eess.AS

    Imperio: Robust Over-the-Air Adversarial Examples for Automatic Speech Recognition Systems

    Authors: Lea Schönherr, Thorsten Eisenhofer, Steffen Zeiler, Thorsten Holz, Dorothea Kolossa

    Abstract: Automatic speech recognition (ASR) systems can be fooled via targeted adversarial examples, which induce the ASR to produce arbitrary transcriptions in response to altered audio signals. However, state-of-the-art adversarial examples typically have to be fed into the ASR system directly, and are not successful when played in a room. The few published over-the-air adversarial examples fall into one…

    Submitted 24 November, 2020; v1 submitted 5 August, 2019; originally announced August 2019.

  17. Joining Sound Event Detection and Localization Through Spatial Segregation

    Authors: Ivo Trowitzsch, Christopher Schymura, Dorothea Kolossa, Klaus Obermayer

    Abstract: Identification and localization of sounds are both integral parts of computational auditory scene analysis. Although each can be solved separately, the goal of forming coherent auditory objects and achieving a comprehensive spatial scene understanding suggests pursuing a joint solution of the two problems. This work presents an approach that robustly binds localization with the detection of sound…

    Submitted 21 December, 2019; v1 submitted 29 March, 2019; originally announced April 2019.

    Comments: Accepted for publication in IEEE/ACM Transactions on Audio, Speech, and Language Processing

  18. arXiv:1903.06031  [pdf, other]

    cs.CV cs.SD eess.AS

    Audiovisual Speaker Tracking using Nonlinear Dynamical Systems with Dynamic Stream Weights

    Authors: Christopher Schymura, Dorothea Kolossa

    Abstract: Data fusion plays an important role in many technical applications that require efficient processing of multimodal sensory observations. A prominent example is audiovisual signal processing, which has gained increasing attention in automatic speech recognition, speaker localization and related tasks. If appropriately combined with acoustic information, additional visual cues can help to improve th…

    Submitted 14 March, 2019; originally announced March 2019.

  19. arXiv:1808.05665  [pdf, other]

    cs.CR cs.SD eess.AS

    Adversarial Attacks Against Automatic Speech Recognition Systems via Psychoacoustic Hiding

    Authors: Lea Schönherr, Katharina Kohls, Steffen Zeiler, Thorsten Holz, Dorothea Kolossa

    Abstract: Voice interfaces are becoming widely accepted as input methods for a diverse set of devices. This development is driven by rapid improvements in automatic speech recognition (ASR), which now performs on par with human listening in many tasks. These improvements are based on an ongoing evolution of DNNs as the computational core of ASR. However, recent research results show that DNNs are vulnerable to a…

    Submitted 30 October, 2018; v1 submitted 16 August, 2018; originally announced August 2018.