Skip to main content

Showing 1–7 of 7 results for author: Berry, L

Searching in archive eess. Search in all archives.
.
  1. arXiv:2402.06959  [pdf, other

    cs.CL cs.SD eess.AS

    SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data

    Authors: Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng, Hung-yi Lee, Hsin-Min Wang, David Harwath

    Abstract: The recently proposed visually grounded speech model SpeechCLIP is an innovative framework that bridges speech and text through images via CLIP without relying on text transcription. On this basis, this paper introduces two extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire (CIF) module to replace a fixed number of CLS tokens in the cascaded architecture. Second, we propos… ▽ More

    Submitted 10 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024, Self-supervision in Audio, Speech, and Beyond (SASB) workshop

  2. arXiv:2402.05819  [pdf, other

    eess.AS cs.CL cs.LG

    Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model

    Authors: Hung-Chieh Fang, Nai-Xuan Ye, Yi-Jen Shih, Puyuan Peng, Hsuan-Fu Wang, Layne Berry, Hung-yi Lee, David Harwath

    Abstract: Recent advances in self-supervised speech models have shown significant improvement in many downstream tasks. However, these models predominantly centered on frame-level training objectives, which can fall short in spoken language understanding tasks that require semantic comprehension. Existing works often rely on additional speech-text data as intermediate targets, which is costly in the real-wo… ▽ More

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024 workshop on Self-supervision in Audio, Speech, and Beyond (SASB)

  3. arXiv:2401.12913  [pdf, other

    gr-qc astro-ph.IM eess.IV

    Advancing Glitch Classification in Gravity Spy: Multi-view Fusion with Attention-based Machine Learning for Advanced LIGO's Fourth Observing Run

    Authors: Yunan Wu, Michael Zevin, Christopher P. L. Berry, Kevin Crowston, Carsten Ă˜sterlund, Zoheyr Doctor, Sharan Banagiri, Corey B. Jackson, Vicky Kalogera, Aggelos K. Katsaggelos

    Abstract: The first successful detection of gravitational waves by ground-based observatories, such as the Laser Interferometer Gravitational-Wave Observatory (LIGO), marked a revolutionary breakthrough in our comprehension of the Universe. However, due to the unprecedented sensitivity required to make such observations, gravitational-wave detectors also capture disruptive noise sources called glitches, pot… ▽ More

    Submitted 23 January, 2024; originally announced January 2024.

  4. arXiv:2309.10787  [pdf, other

    eess.AS cs.CV cs.MM cs.SD

    AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

    Authors: Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee

    Abstract: Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual a… ▽ More

    Submitted 19 March, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024; Evaluation Code: https://github.com/roger-tseng/av-superb Submission Platform: https://av.superbbenchmark.org

  5. arXiv:2211.01180  [pdf, other

    cs.CL cs.SD eess.AS

    M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

    Authors: Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-yi Lee, David Harwath

    Abstract: This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages. We identify key differenc… ▽ More

    Submitted 10 April, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP 2023

  6. arXiv:2210.00705  [pdf, other

    cs.CL cs.SD eess.AS

    SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

    Authors: Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-yi Lee, David Harwath

    Abstract: Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance speech models without transcriptions. We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions… ▽ More

    Submitted 25 October, 2022; v1 submitted 3 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  7. arXiv:1909.04542  [pdf, other

    eess.IV cs.CV

    Integrating cross-modality hallucinated MRI with CT to aid mediastinal lung tumor segmentation

    Authors: Jue Jiang, Jason Hu, Neelam Tyagi, Andreas Rimner, Sean L. Berry, Joseph O. Deasy, Harini Veeraraghavan

    Abstract: Lung tumors, especially those located close to or surrounded by soft tissues like the mediastinum, are difficult to segment due to the low soft tissue contrast on computed tomography images. Magnetic resonance images contain superior soft-tissue contrast information that can be leveraged if both modalities were available for training. Therefore, we developed a cross-modality educed learning approa… ▽ More

    Submitted 10 September, 2019; originally announced September 2019.

    Comments: This paper has been accepted by MICCAI 2019