Listen, Think, and Understand

Gong, Yuan; Luo, Hongyin; Liu, Alexander H.; Karlinsky, Leonid; Glass, James

arXiv:2305.10790v1 (eess)

[Submitted on 18 May 2023 (this version), latest version 19 Feb 2024 (v3)]

Abstract: The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is crucial for many applications. Although significant progress has been made in this area since the development of AudioSet, most existing models are designed to map audio inputs to pre-defined, discrete sound label sets. In contrast, humans can not only classify sounds into coarse-grained categories, but also listen to the details of the sounds, explain the reasons for their predictions, think about what the sounds imply, and understand the scene and what action needs to be taken. Such capabilities beyond perception are not yet present in existing audio models. On the other hand, modern large language models (LLMs) exhibit emerging reasoning ability but lack audio perception capabilities. Therefore, we ask the question: can we build an AI model that has both audio perception and reasoning abilities?
In this paper, we propose a novel audio foundation model, called LTU (Listen, Think, and Understand). To train LTU, we created a new OpenAQA-5M dataset consisting of 1.9 million closed-ended and 3.7 million open-ended, diverse (audio, question, answer) tuples, and used an autoregressive training framework and a perception-to-understanding curriculum. LTU demonstrates strong performance and generalization ability on conventional audio tasks such as classification and captioning. Moreover, it exhibits remarkable reasoning and comprehension abilities in the audio domain. To the best of our knowledge, LTU is the first audio-enabled large language model that bridges audio perception with advanced reasoning.

Comments:	Preprint, work in progress
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2305.10790 [eess.AS]
	(or arXiv:2305.10790v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2305.10790

Submission history

From: Yuan Gong
[v1] Thu, 18 May 2023 08:03:37 UTC (480 KB)
[v2] Mon, 2 Oct 2023 14:50:34 UTC (990 KB)
[v3] Mon, 19 Feb 2024 23:51:49 UTC (997 KB)
