Replacing Human Audio with Synthetic Audio for On-device Unspoken Punctuation Prediction
Authors:
Daria Soboleva,
Ondrej Skopek,
Márius Šajgalík,
Victor Cărbune,
Felix Weissenberger,
Julia Proskurnia,
Bogdan Prisacari,
Daniel Valcarce,
Justin Lu,
Rohit Prabhavalkar,
Balint Miklos
Abstract:
We present a novel multi-modal unspoken punctuation prediction system for the English language which combines acoustic and text features. We demonstrate, for the first time, that by relying exclusively on synthetic data generated using a prosody-aware text-to-speech system, we can outperform a model trained with expensive human audio recordings on the unspoken punctuation prediction problem. Our model architecture is well suited for on-device use. This is achieved by leveraging hash-based embeddings of automatic speech recognition text output in conjunction with acoustic features as input to a quasi-recurrent neural network, keeping the model size small and latency low.
Submitted 10 February, 2021; v1 submitted 20 October, 2020;
originally announced October 2020.