Search | arXiv e-print repository

Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models

Authors: Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, Ethan Yeo, Eugenie Lamprecht, Qi Liu, Yuqi Wang, Eric Chen, Deyu Fu, Lei Li, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Mikel Artetxe, Yi Tay

Abstract: We introduce Vibe-Eval: a new open benchmark and framework for evaluating multimodal chat models. Vibe-Eval consists of 269 visual understanding prompts, including 100 of hard difficulty, complete with gold-standard responses authored by experts. Vibe-Eval is open-ended and challenging with dual objectives: (i) vibe checking multimodal chat models for day-to-day tasks and (ii) rigorously testing and probing the capabilities of present frontier models. Notably, our hard set contains >50% questions that all frontier models answer incorrectly. We explore the nuances of designing, evaluating, and ranking models on ultra challenging prompts. We also discuss trade-offs between human and automatic evaluation, and show that automatic model evaluation using Reka Core roughly correlates to human judgment. We offer free API access for the purpose of lightweight evaluation and plan to conduct formal human evaluations for public models that perform well on the Vibe-Eval's automatic scores. We release the evaluation code and data, see https://github.com/reka-ai/reka-vibe-eval

Submitted 3 May, 2024; originally announced May 2024.

arXiv:2403.10498 [pdf, other]

doi 10.1007/JHEP07(2024)012

$DK/Dπ$ scattering and an exotic virtual bound state at the $SU(3)$ flavour symmetric point from lattice QCD

Authors: J. Daniel E. Yeo, Christopher E. Thomas, David J. Wilson

Abstract: Elastic $S$-wave scattering of a charm meson with a light pseudoscalar meson in $J^P =0^+$ is investigated in the flavour $\bar{\mathbf{3}}$, $\mathbf{6}$ and $\overline{\mathbf{15}}$ sectors at the $SU(3)_f$ flavour point using lattice QCD, working on three volumes with $m_π \approx 700$ MeV. Large bases of interpolating operators are employed to extract finite-volume spectra, which are subsequently used with the Lüscher method to provide constraints on infinite-volume scattering amplitudes. Examining the singularities of the amplitudes, the $S$-wave amplitude in the flavour $\bar{\mathbf{3}}$ sector is found to contain a deeply bound state, strongly coupled to elastic threshold, corresponding to the $J^P = 0^+$ $D_{s0}^*(2317)$. In the exotic flavour $\mathbf{6}$ sector a virtual bound state is found at $\sqrt{s_{\rm{pole}}} = 2510 - 2610$ MeV, roughly $40-140$ MeV below threshold, whereas the $\overline{\mathbf{15}}$ channel shows weak repulsion.

Submitted 15 March, 2024; originally announced March 2024.

Comments: 41 pages, 15 figures

Journal ref: JHEP 07 (2024) 012

arXiv:2311.14413 [pdf, ps, other]

A continuum model for the elongation and orientation of Von Willebrand Factor with applications in arterial flow

Authors: Edwina F. Yeo, James M. Oliver, Netanel Korin, Sarah L. Waters

Abstract: The blood protein Von Willebrand Factor (VWF) is critical in facilitating arterial thrombosis. At pathologically high shear rates the protein unfolds and binds to the arterial wall, enabling the rapid deposition of platelets from the blood. We present a novel continuum model for VWF dynamics in flow based on a modified viscoelastic fluid model that incorporates a single constitutive relation to describe the propensity of VWF to unfold as a function of the scalar shear rate. Using experimental data of VWF unfolding in pure shear flow, we fix the parameters for VWF's unfolding propensity and the maximum VWF length, so that the protein is half unfolded at a shear rate of approximately 5,000 s$^{-1}$. We then use the theoretical model to predict VWF's behaviour in two complex flows where experimental data is challenging to obtain: pure elongational flow and stenotic arterial flow. In pure elongational flow, our model predicts that VWF is 50% unfolded at approximately 2,000 s$^{-1}$, matching the established hypothesis that VWF unfolds at lower shear rates in elongational flow than in shear flow. We demonstrate the sensitivity of this elongational flow prediction to the value of maximum VWF length used in the model, which varies significantly across experimental studies, predicting that VWF can unfold between 600 and 3,200 s$^{-1}$ depending on the selected value. Finally, we examine VWF dynamics in a range of idealised arterial stenoses, predicting the relative extension of VWF in elongational flow structures in the centre of the artery compared to high-shear regions near the arterial walls.

Submitted 24 November, 2023; originally announced November 2023.

arXiv:2306.10821 [pdf, other]

Comparison of L2 Korean pronunciation error patterns from five L1 backgrounds by using automatic phonetic transcription

Authors: Eun Jung Yeo, Hyungshin Ryu, Jooyoung Lee, Sunhee Kim, Minhwa Chung

Abstract: This paper presents a large-scale analysis of L2 Korean pronunciation error patterns from five different language backgrounds, Chinese, Vietnamese, Japanese, Thai, and English, by using automatic phonetic transcription. For the analysis, confusion matrices are generated for each L1, by aligning canonical phone sequences and automatically transcribed phone sequences obtained from a fine-tuned Wav2Vec2 XLS-R phone recognizer. Each value in the confusion matrices is compared to capture frequent common error patterns and to specify patterns unique to a certain language background. Using the Foreign Speakers' Voice Data of Korean for Artificial Intelligence Learning dataset, common error pattern types are found to be (1) substitutions of aspirated or tense consonants with plain consonants, (2) deletions of syllable-final consonants, and (3) substitutions of diphthongs with monophthongs. On the other hand, thirty-nine patterns including (1) syllable-final /l/ substitutions with /n/ for Vietnamese and (2) /ɯ/ insertions for Japanese are discovered as language-dependent.

Submitted 19 June, 2023; originally announced June 2023.

Comments: 5 pages, 2 figures, accepted to ICPhS 2023
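
The alignment-and-count step described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes phones arrive as plain Python lists and swaps in the standard library's `difflib.SequenceMatcher` for whatever alignment the paper actually uses. The function name `phone_confusions` and the `<del>`/`<ins>` markers are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def phone_confusions(canonical, transcribed):
    """Count (canonical, transcribed) phone pairs for one utterance.

    "<del>" marks a canonical phone with no transcribed counterpart;
    "<ins>" marks a transcribed phone with no canonical counterpart.
    (Both markers are illustrative, not taken from the paper.)
    """
    counts = Counter()
    matcher = SequenceMatcher(a=canonical, b=transcribed, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ref, hyp = canonical[i1:i2], transcribed[j1:j2]
        if tag == "equal":
            counts.update(zip(ref, hyp))
        elif tag == "replace":
            # Pair phones position by position; leftovers become del/ins.
            n = min(len(ref), len(hyp))
            counts.update(zip(ref[:n], hyp[:n]))
            counts.update((p, "<del>") for p in ref[n:])
            counts.update(("<ins>", p) for p in hyp[n:])
        elif tag == "delete":
            counts.update((p, "<del>") for p in ref)
        elif tag == "insert":
            counts.update(("<ins>", p) for p in hyp)
    return counts


# Toy example: syllable-final /l/ realised as /n/.
substitution = phone_confusions(["k", "a", "l"], ["k", "a", "n"])
# Toy example: syllable-final consonant deleted.
deletion = phone_confusions(["p", "a", "t"], ["p", "a"])
```

Summing such counters over all utterances of one L1 group would give the cells of that group's confusion matrix; off-diagonal cells are the substitution, deletion, and insertion patterns.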

arXiv:2305.18392 [pdf, other]

Speech Intelligibility Assessment of Dysarthric Speech by using Goodness of Pronunciation with Uncertainty Quantification

Authors: Eun Jung Yeo, Kwanghee Choi, Sunhee Kim, Minhwa Chung

Abstract: This paper proposes an improved Goodness of Pronunciation (GoP) that utilizes Uncertainty Quantification (UQ) for automatic speech intelligibility assessment for dysarthric speech. Current GoP methods rely heavily on neural network-driven overconfident predictions, which is unsuitable for assessing dysarthric speech due to its significant acoustic differences from healthy speech. To alleviate the problem, UQ techniques were used on GoP by 1) normalizing the phoneme prediction (entropy, margin, maxlogit, logit-margin) and 2) modifying the scoring function (scaling, prior normalization). As a result, prior-normalized maxlogit GoP achieves the best performance, with a relative increase of 5.66%, 3.91%, and 23.65% compared to the baseline GoP for English, Korean, and Tamil, respectively. Furthermore, phoneme analysis is conducted to identify which phoneme scores significantly correlate with intelligibility scores in each language.

Submitted 28 May, 2023; originally announced May 2023.

Comments: Accepted to Interspeech 2023
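
The four uncertainty measures named in this abstract (entropy, margin, maxlogit, logit-margin) have common textbook forms, sketched below for a single frame of phoneme logits. This is an assumption-laden illustration: the paper's GoP variants may be defined differently (for example, relative to the target phoneme), and the function name `uncertainty_scores` is hypothetical.

```python
import math


def uncertainty_scores(logits):
    """Return generic uncertainty scores for one frame's phoneme logits.

    Textbook definitions only; not necessarily the paper's GoP formulas.
    """
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # shift for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    sorted_probs = sorted(probs, reverse=True)
    sorted_logits = sorted(logits, reverse=True)
    return {
        # Shannon entropy of the softmax distribution (high = uncertain).
        "entropy": -sum(p * math.log(p) for p in probs if p > 0.0),
        # Gap between the two most probable phonemes (low = uncertain).
        "margin": sorted_probs[0] - sorted_probs[1],
        # Largest raw logit, with no softmax normalisation.
        "maxlogit": sorted_logits[0],
        # Gap between the two largest raw logits.
        "logit_margin": sorted_logits[0] - sorted_logits[1],
    }


flat = uncertainty_scores([0.0, 0.0, 0.0])    # maximally uncertain frame
peaked = uncertainty_scores([5.0, 1.0, 0.0])  # confident frame
```

The logit-based scores avoid the softmax, which is one common motivation for using them when softmax probabilities are suspected of being overconfident.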

arXiv:2210.15387 [pdf, other]

Automatic Severity Classification of Dysarthric speech by using Self-supervised Model with Multi-task Learning

Authors: Eun Jung Yeo, Kwanghee Choi, Sunhee Kim, Minhwa Chung

Abstract: Automatic assessment of dysarthric speech is essential for sustained treatments and rehabilitation. However, obtaining atypical speech is challenging, often leading to data scarcity issues. To tackle the problem, we propose a novel automatic severity assessment method for dysarthric speech, using the self-supervised model in conjunction with multi-task learning. Wav2vec 2.0 XLS-R is jointly trained for two different tasks: severity classification and auxiliary automatic speech recognition (ASR). For the baseline experiments, we employ hand-crafted acoustic features and machine learning classifiers such as SVM, MLP, and XGBoost. Explored on the Korean dysarthric speech QoLT database, our model outperforms the traditional baseline methods, with a relative percentage increase of 1.25% for F1-score. In addition, the proposed model surpasses the model trained without ASR head, achieving 10.61% relative percentage improvements. Furthermore, we present how multi-task learning affects the severity classification performance by analyzing the latent representations and regularization effect.

Submitted 28 April, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

Comments: Accepted to ICASSP 2023

arXiv:2210.15386 [pdf, other]

Opening the Black Box of wav2vec Feature Encoder

Authors: Kwanghee Choi, Eun Jung Yeo

Abstract: Self-supervised models, namely, wav2vec and its variants, have shown promising results in various downstream tasks in the speech domain. However, their inner workings are poorly understood, calling for in-depth analyses on what the model learns. In this paper, we concentrate on the convolutional feature encoder where its latent space is often speculated to represent discrete acoustic units. To analyze the embedding space in a reductive manner, we feed it synthesized audio signals, each of which is a summation of simple sine waves. Through extensive experiments, we conclude that various information is embedded inside the feature encoder representations: (1) fundamental frequency, (2) formants, and (3) amplitude, packed with (4) sufficient temporal detail. Further, the information incorporated inside the latent representations is analogous to spectrograms but with a fundamental difference: latent representations construct a metric space so that closer representations imply acoustic similarity.

Submitted 27 October, 2022; originally announced October 2022.
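
The probe inputs described in this abstract are sums of simple sine waves. A minimal sketch of generating one such signal follows; the 16 kHz sample rate, the 0.1 s duration, the example frequencies and amplitudes, and the function name `synthesize` are all assumptions for illustration, not values taken from the paper.

```python
import math


def synthesize(components, duration_s=0.1, sample_rate=16000):
    """Sum sine waves given as (frequency_hz, amplitude) pairs.

    Returns a list of float samples; sample_rate and duration_s are
    illustrative defaults, not the paper's settings.
    """
    n = int(round(duration_s * sample_rate))
    return [
        sum(a * math.sin(2.0 * math.pi * f * t / sample_rate) for f, a in components)
        for t in range(n)
    ]


# A 100 Hz fundamental plus a weaker component at 200 Hz.
signal = synthesize([(100.0, 0.5), (200.0, 0.25)])
```

Varying one component's frequency or amplitude while holding the others fixed gives controlled inputs whose encoder representations can then be compared.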

arXiv:2209.13260 [pdf, other]

Multilingual analysis of intelligibility classification using English, Korean, and Tamil dysarthric speech datasets

Authors: Eun Jung Yeo, Sunhee Kim, Minhwa Chung

Abstract: This paper analyzes dysarthric speech datasets from three languages with different prosodic systems: English, Korean, and Tamil. We inspect 39 acoustic measurements which reflect three speech dimensions including voice quality, pronunciation, and prosody. As multilingual analysis, examination on the mean values of acoustic measurements by intelligibility levels is conducted. Further, automatic intelligibility classification is performed to scrutinize the optimal feature set by languages. Analyses suggest pronunciation features, such as Percentage of Correct Consonants, Percentage of Correct Vowels, and Percentage of Correct Phonemes to be language-independent measurements. Voice quality and prosody features, however, generally present different aspects by languages. Experimental results additionally show that different speech dimensions play a greater role for different languages: prosody for English, pronunciation for Korean, both prosody and pronunciation for Tamil. This paper contributes to speech pathology in that it differentiates between language-independent and language-dependent measurements in intelligibility classification for English, Korean, and Tamil dysarthric speech.

Submitted 2 November, 2022; v1 submitted 27 September, 2022; originally announced September 2022.

Comments: 6 pages, 1 figure, O-COCOSDA 2022

arXiv:2209.12942 [pdf]

Cross-lingual Dysarthria Severity Classification for English, Korean, and Tamil

Authors: Eun Jung Yeo, Kwanghee Choi, Sunhee Kim, Minhwa Chung

Abstract: This paper proposes a cross-lingual classification method for English, Korean, and Tamil, which employs both language-independent features and language-unique features. First, we extract thirty-nine features from diverse speech dimensions such as voice quality, pronunciation, and prosody. Second, feature selections are applied to identify the optimal feature set for each language. A set of shared features and a set of distinctive features are distinguished by comparing the feature selection results of the three languages. Lastly, automatic severity classification is performed, utilizing the two feature sets. Notably, the proposed method removes different features by languages to prevent the negative effect of unique features for other languages. Accordingly, eXtreme Gradient Boosting (XGBoost) algorithm is employed for classification, due to its strength in imputing missing data. In order to validate the effectiveness of our proposed method, two baseline experiments are conducted: experiments using the intersection set of mono-lingual feature sets (Intersection) and experiments using the union set of mono-lingual feature sets (Union). According to the experimental results, our method achieves better performance with a 67.14% F1 score, compared to 64.52% for the Intersection experiment and 66.74% for the Union experiment. Further, the proposed method attains better performances than mono-lingual classifications for all three languages, achieving 17.67%, 2.28%, and 7.79% relative percentage increases for English, Korean, and Tamil, respectively. The result specifies that commonly shared features and language-specific features must be considered separately for cross-language dysarthria severity classification.

Submitted 26 September, 2022; originally announced September 2022.

Comments: 9 pages, 4 figures, APSIPA 2022

arXiv:1308.4487 [pdf, other]

doi 10.1088/0957-4484/23/49/495702

Strain dependence of the heat transport properties of graphene nanoribbons

Authors: Pei Shan Emmeline Yeo, Kian ** Loh, Chee Kwan Gan

Abstract: Using a combination of accurate density-functional theory and a nonequilibrium Green's function method, we calculate the ballistic thermal conductance characteristics of tensile-strained armchair (AGNR) and zigzag (ZGNR) edge graphene nanoribbons, with widths between 3-50 Å. The optimized lateral lattice constants for AGNRs of different widths display a three-family behavior when the ribbons are grouped according to $N$ modulo 3, where $N$ represents the number of carbon atoms across the width of the ribbon. Two lowest-frequency out-of-plane acoustic modes play a decisive role in increasing the thermal conductance of AGNR-$N$ at low temperatures. At high temperatures the effect of tensile strain is to reduce the thermal conductance of AGNR-$N$ and ZGNR-$N$. These results could be explained by the changes in force constants in the in-plane and out-of-plane directions with the application of strain. This fundamental atomistic understanding of the heat transport in graphene nanoribbons paves a way to effect changes in their thermal properties via strain at various temperatures.

Submitted 21 August, 2013; originally announced August 2013.

Comments: 10 pages

Journal ref: Nanotechnology v.23, p.495702 (2012)

arXiv:1308.4474 [pdf, other]

doi 10.1039/c3ta12211e

First-principles study of the thermoelectric properties of strained graphene nanoribbons

Authors: Pei Shan Emmeline Yeo, Michael B. Sullivan, Kian ** Loh, Chee Kwan Gan

Abstract: We study the transport properties, in particular, the thermoelectric figure of merit $ZT$ of armchair graphene nanoribbons, AGNR-$N$ (for $N=4-12$, with widths ranging from 3.7 to 13.6 Å) through strain engineering, where $N$ is the number of carbon dimer lines across the AGNR width. We find that the tensile strain applied to AGNR-$N$ changes the transport properties by modifying the electronic structures and phonon dispersion relations. The tensile strain increases the $ZT$ value of the AGNR-$N$ families with $N=3p$ and $N=3p+2$, where $p$ is an integer. Our analysis based on accurate density-functional theory calculations suggests a possible route to increase the $ZT$ values of AGNR-$N$ for potential thermoelectric applications.

Submitted 20 August, 2013; originally announced August 2013.

Comments: 6 pages

Journal ref: Journal of Materials Chemistry A, v.1, p.10762 (2013)
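
For readers unfamiliar with the quantity in this abstract, the thermoelectric figure of merit is conventionally defined as below. The symbols are the standard ones and are not taken from the listing itself: $S$ is the Seebeck coefficient, $\sigma$ the electrical conductivity, $T$ the absolute temperature, and $\kappa_e$, $\kappa_{ph}$ the electronic and phononic contributions to the thermal conductivity.

```latex
ZT = \frac{S^{2}\,\sigma\,T}{\kappa_{e} + \kappa_{ph}}
```

Under this definition, a change that raises the power factor $S^{2}\sigma$ or lowers the total thermal conductivity raises $ZT$, which is consistent with the abstract's statement that strain acts through both the electronic structure and the phonon dispersion.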

Showing 1–11 of 11 results for author: Yeo, E