-
Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment
Authors:
Wenliang Zhong,
Wenyi Wu,
Qi Li,
Rob Barton,
Boxin Du,
Shioulin Sam,
Karim Bouyarmane,
Ismail Tutar,
Junzhou Huang
Abstract:
Multimodal Large Language Models (MLLMs) have achieved SOTA performance in various visual language tasks by fusing visual representations with LLMs through visual adapters. In this paper, we first establish that adapters using query-based Transformers, such as Q-former, are a simplified Multi-instance Learning method that does not consider instance heterogeneity/correlation. We then propose a general component termed Multi-instance Visual Prompt Generator (MIVPG) to incorporate enriched visual representations into LLMs by taking advantage of instance correlation between images or patches for the same sample. Quantitative evaluation on three public vision-language (VL) datasets from different scenarios shows that the proposed MIVPG improves on Q-former in the main VL tasks.
Submitted 5 June, 2024;
originally announced June 2024.
-
Preliminary Study on SSCF-derived Polar Coordinate for ASR
Authors:
Sotheara Leang,
Eric Castelli,
Dominique Vaufreydaz,
Sethserey Sam
Abstract:
The transition angles are defined to describe the vowel-to-vowel transitions in the acoustic space of the Spectral Subband Centroids, and the findings show that they are similar among speakers and speaking rates. In this paper, we propose to investigate the usage of polar coordinates in favor of angles to describe a speech signal by characterizing its acoustic trajectory and using them in Automatic Speech Recognition. According to the experimental results evaluated on the BRAF100 dataset, the polar coordinates achieved significantly higher accuracy than the angles in mixed and cross-gender speech recognition, demonstrating that these representations are superior at defining the acoustic trajectory of the speech signal. Furthermore, the accuracy was significantly improved when they were utilized with their first- and second-order derivatives ($Δ$, $ΔΔ$), especially in cross-female recognition. However, the results showed they were not much more gender-independent than the conventional Mel-frequency Cepstral Coefficients (MFCCs).
Submitted 30 November, 2022;
originally announced December 2022.
-
Tropicalization of classical moduli spaces
Authors:
Qingchun Ren,
Steven V Sam,
Bernd Sturmfels
Abstract:
The image of the complement of a hyperplane arrangement under a monomial map can be tropicalized combinatorially using matroid theory. We apply this to classical moduli spaces that are associated with complex reflection arrangements. Starting from modular curves, we visit the Segre cubic, the Igusa quartic, and moduli of marked del Pezzo surfaces of degrees 2 and 3. Our primary example is the Burkhardt quartic, whose tropicalization is a 3-dimensional fan in 39-dimensional space. This effectuates a synthesis of concrete and abstract approaches to tropical moduli of genus 2 curves.
Submitted 16 November, 2013; v1 submitted 5 March, 2013;
originally announced March 2013.