Showing 1–4 of 4 results for author: Polák, P

  1. arXiv:2310.11141  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Long-form Simultaneous Speech Translation: Thesis Proposal

    Authors: Peter Polák

    Abstract: Simultaneous speech translation (SST) aims to provide real-time translation of spoken language, even before the speaker finishes their sentence. Traditionally, SST has been addressed primarily by cascaded systems that decompose the task into subtasks, including speech recognition, segmentation, and machine translation. However, the advent of deep learning has sparked significant interest in end-to…

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: IJCNLP-AACL SRW 2023 - camera-ready version

  2. arXiv:2309.11384  [pdf, ps, other]

    cs.CL cs.AI cs.SD eess.AS

    Long-Form End-to-End Speech Translation via Latent Alignment Segmentation

    Authors: Peter Polák, Ondřej Bojar

    Abstract: Current simultaneous speech translation models can process audio only up to a few seconds long. Contemporary datasets provide an oracle segmentation into sentences based on human-annotated transcripts and translations. However, the segmentation into sentences is not available in the real world. Current speech segmentation approaches either offer poor segmentation quality or have to trade latency f…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  3. arXiv:2309.11379  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff

    Authors: Peter Polák, Brian Yan, Shinji Watanabe, Alex Waibel, Ondřej Bojar

    Abstract: Blockwise self-attentional encoder models have recently emerged as one promising end-to-end approach to simultaneous speech translation. These models employ a blockwise beam search with hypothesis reliability scoring to determine when to wait for more input speech before translating further. However, this method maintains multiple hypotheses until the entire speech input is consumed -- this scheme…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: Accepted at INTERSPEECH 2023

    Journal ref: Polák, P., Yan, B., Watanabe, S., Waibel, A., Bojar, O. (2023) Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff. Proc. INTERSPEECH 2023, 3979-3983

  4. arXiv:2304.04596  [pdf, other]

    cs.SD cs.CL eess.AS

    ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

    Authors: Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng, Siddharth Dalmia, Peter Polák, Patrick Fernandes, Dan Berrebbi, Tomoki Hayashi, Xiaohui Zhang, Zhaoheng Ni, Moto Hira, Soumi Maiti, Juan Pino, Shinji Watanabe

    Abstract: ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST) -- each task is supported with a wide variety of approaches, differentiating ESPnet-…

    Submitted 6 July, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

    Comments: ACL 2023; System Demonstration