Search | arXiv e-print repository

ALCAP: Alignment-Augmented Music Captioner

Authors: Zihao He, Weituo Hao, Wei-Tsung Lu, Changyou Chen, Kristina Lerman, Xuchen Song

Abstract: Music captioning has gained significant attention in the wake of the rising prominence of streaming media platforms. Traditional approaches often prioritize either the audio or lyrics aspect of the music, inadvertently ignoring the intricate interplay between the two. However, a comprehensive understanding of music necessitates the integration of both these elements. In this study, we delve into this overlooked realm by introducing a method to systematically learn multimodal alignment between audio and lyrics through contrastive learning. This not only recognizes and emphasizes the synergy between audio and lyrics but also paves the way for models to achieve deeper cross-modal coherence, thereby producing high-quality captions. We provide both theoretical and empirical results demonstrating the advantage of the proposed method, which achieves new state-of-the-art on two music captioning datasets.

Submitted 21 October, 2023; v1 submitted 21 December, 2022; originally announced December 2022.
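
The abstract above says only that audio and lyrics are aligned "through contrastive learning" and does not give the loss. As a rough illustration of what such an alignment objective can look like, the sketch below uses a symmetric InfoNCE-style loss over a batch of paired embeddings. The loss form, the `temperature` value, and the function names are assumptions for illustration, not the paper's confirmed method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(audio, lyrics, temperature=0.1):
    """Symmetric InfoNCE-style loss (assumed form, not taken from the paper).

    audio[i] and lyrics[i] are treated as the positive pair; every other
    combination in the batch acts as a negative.
    """
    n = len(audio)
    sims = [[cosine(a, l) / temperature for l in lyrics] for a in audio]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], computed with the log-sum-exp trick.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / n

    cols = [list(c) for c in zip(*sims)]  # lyrics-to-audio direction
    return 0.5 * (cross_entropy(sims) + cross_entropy(cols))

# Hypothetical toy embeddings: matched pairs vs. deliberately swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is near zero when each audio vector matches its own lyric vector and large when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to reward.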

arXiv:2109.05056 [pdf, other]

Speaker Turn Modeling for Dialogue Act Classification

Authors: Zihao He, Leili Tavabi, Kristina Lerman, Mohammad Soleymani

Abstract: Dialogue Act (DA) classification is the task of classifying utterances with respect to the function they serve in a dialogue. Existing approaches to DA classification model utterances without incorporating the turn changes among speakers throughout the dialogue, therefore treating it no different than non-interactive written text. In this paper, we propose to integrate the turn changes in conversations among speakers when modeling DAs. Specifically, we learn conversation-invariant speaker turn embeddings to represent the speaker turns in a conversation; the learned speaker turn embeddings are then merged with the utterance embeddings for the downstream task of DA classification. With this simple yet effective mechanism, our model is able to capture the semantics from the dialogue content while accounting for different speaker turns in a conversation. Validation on three benchmark public datasets demonstrates superior performance of our model.

Submitted 10 September, 2021; originally announced September 2021.
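
The mechanism described in the abstract above (speaker turn embeddings merged with utterance embeddings) can be sketched roughly as follows. Several details here are assumptions rather than facts from the abstract: turn labels are binary and flip at every speaker change, "merged" is implemented as element-wise addition, and the embedding table `turn_table` is fixed by hand instead of being learned.

```python
def turn_labels(speakers):
    """Assign 0/1 labels that flip whenever the speaker changes.

    The labels depend only on turn changes, not on who the speakers are,
    which is one (assumed) way to make them conversation-invariant.
    """
    labels, current = [], 0
    for i, speaker in enumerate(speakers):
        if i > 0 and speaker != speakers[i - 1]:
            current = 1 - current
        labels.append(current)
    return labels

def merge_turns(utterance_vecs, speakers, turn_table):
    """Add the turn embedding for each utterance to its utterance embedding.

    turn_table is a hypothetical 2-row lookup table; a real model would
    learn it jointly with the classifier.
    """
    labels = turn_labels(speakers)
    return [
        [u + t for u, t in zip(vec, turn_table[label])]
        for vec, label in zip(utterance_vecs, labels)
    ]

# Hypothetical two-utterance dialogue with 2-dimensional embeddings.
merged = merge_turns(
    [[1.0, 2.0], [3.0, 4.0]],
    ["A", "B"],
    [[0.5, 0.5], [-0.5, -0.5]],
)
```

The merged vectors would then be passed to whatever downstream DA classifier is used; that part is left out of the sketch.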

arXiv:2106.00614 [pdf, other]

Pattern Discovery in Time Series with Byte Pair Encoding

Authors: Nazgol Tavabi, Kristina Lerman

Abstract: The growing popularity of wearable sensors has generated large quantities of temporal physiological and activity data. The ability to analyze this data offers new opportunities for real-time health monitoring and forecasting. However, temporal physiological data presents many analytic challenges: the data is noisy, contains many missing values, and each series has a different length. Most methods proposed for time series analysis and classification do not handle datasets with these characteristics, nor do they offer interpretability and explainability, a critical requirement in the health domain. We propose an unsupervised method for learning representations of time series based on common patterns identified within them. The patterns are interpretable, variable in length, and extracted using the Byte Pair Encoding compression technique. In this way, the method can capture both long-term and short-term dependencies present in the data. We show that this method applies to both univariate and multivariate time series and beats state-of-the-art approaches on a real-world dataset collected from wearable sensors.

Submitted 29 May, 2021; originally announced June 2021.
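
To make the Byte Pair Encoding idea in the abstract above concrete, here is a minimal sketch of pattern extraction on a single series. It assumes the series is first discretized into symbols with equal-width bins (the abstract does not say how discretization is done), and the function names and `n_merges` parameter are illustrative only. It does not reproduce the paper's handling of missing values or multivariate data.

```python
from collections import Counter

def discretize(series, n_bins=4):
    """Map each value to a bin index using equal-width bins (assumed scheme)."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant series
    return [min(int((x - lo) / width), n_bins - 1) for x in series]

def bpe_patterns(symbols, n_merges=3):
    """Repeatedly merge the most frequent adjacent pair into a longer pattern.

    Each token is a tuple of base symbols, so merged patterns stay readable
    and can have different lengths.
    """
    seq = [(s,) for s in symbols]
    merges = []
    for _ in range(n_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:  # stop when no pair repeats
            break
        merged, out, i = left + right, [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        merges.append(merged)
    return seq, merges

# Toy symbol sequence with a repeating (0, 1) motif.
encoded, patterns = bpe_patterns([0, 1, 0, 1, 0, 1, 2], n_merges=3)
```

On this toy input the first merge captures the short motif `(0, 1)` and the second merge builds the longer pattern `(0, 1, 0, 1)` from it, which illustrates how short and long patterns can both emerge from the same procedure.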

arXiv:2103.12149 [pdf, other]

A Directed, Bi-Populated Preferential Attachment Model with Applications to Analyzing the Glass Ceiling Effect

Authors: Buddhika Nettasinghe, Nazanin Alipourfard, Vikram Krishnamurthy, Kristina Lerman

Abstract: Preferential attachment, homophily, and their consequences such as the glass ceiling effect have been well-studied in the context of undirected networks. However, the lack of an intuitive, theoretically tractable model of a directed, bi-populated (i.e., containing two groups) network with variable levels of preferential attachment, homophily, and growth dynamics (e.g., the rate at which new nodes join, whether the new nodes mostly follow existing nodes or the existing nodes follow them, etc.) has largely prevented such consequences from being explored in the context of directed networks, where they more naturally occur due to the asymmetry of links. To this end, we present a rigorous theoretical analysis of the Directed Mixed Preferential Attachment model and use it to analyze the glass ceiling effect in directed networks. More specifically, we derive closed-form expressions for the power-law exponents of the in- and out-degree distributions of each group (minority and majority) and compare them with each other to obtain insights. In particular, our results yield answers to questions such as: When does the minority group have a heavier out-degree (or in-degree) distribution compared to the majority group? What effect does frequent addition of edges between existing nodes have on the in- and out-degree distributions of the majority and minority groups? Such insights shed light on the interplay between the structure (i.e., the in- and out-degree distributions of the two groups) and dynamics (characterized collectively by the homophily, preferential attachment, group sizes, and growth dynamics) of various real-world networks. Finally, we utilize the obtained analytical results to characterize the conditions under which the glass ceiling effect emerges in a directed network. Our analytical results are supported by detailed numerical results.

Submitted 22 March, 2021; originally announced March 2021.

arXiv:2003.08474 [pdf, other]

doi 10.1038/s41597-020-00655-3

TILES-2018, a longitudinal physiologic and behavioral data set of hospital workers

Authors: Karel Mundnich, Brandon M. Booth, Michelle L'Hommedieu, Tiantian Feng, Benjamin Girault, Justin L'Hommedieu, Mackenzie Wildman, Sophia Skaaden, Amrutha Nadarajan, Jennifer L. Villatte, Tiago H. Falk, Kristina Lerman, Emilio Ferrara, Shrikanth Narayanan

Abstract: We present a novel longitudinal multimodal corpus of physiological and behavioral data collected from direct clinical providers in a hospital workplace. We designed the study to investigate the use of off-the-shelf wearable and environmental sensors to understand individual-specific constructs such as job performance, interpersonal interaction, and well-being of hospital workers over time in their natural day-to-day job settings. We collected behavioral and physiological data from $n = 212$ participants through Internet-of-Things Bluetooth data hubs, wearable sensors (including a wristband, a biometrics-tracking garment, a smartphone, and an audio-feature recorder), together with a battery of surveys to assess personality traits, behavioral states, job performance, and well-being over time. Besides the default use of the data set, we envision several novel research opportunities and potential applications, including multi-modal and multi-task behavioral modeling, authentication through biometrics, and privacy-aware and privacy-preserving machine learning.

Submitted 18 December, 2020; v1 submitted 18 March, 2020; originally announced March 2020.

Comments: 57 pages, 9 figures, journal paper

Journal ref: Sci Data 7, 354 (2020)

arXiv:1911.06959 [pdf, other]

Learning Behavioral Representations from Wearable Sensors

Authors: Nazgol Tavabi, Homa Hosseinmardi, Jennifer L. Villatte, Andrés Abeliuk, Shrikanth Narayanan, Emilio Ferrara, Kristina Lerman

Abstract: Continuous collection of physiological data from wearable sensors enables temporal characterization of individual behaviors. Understanding the relation between an individual's behavioral patterns and psychological states can help identify strategies to improve quality of life. One challenge in analyzing physiological data is extracting the underlying behavioral states from the temporal sensor signals and interpreting them. Here, we use a non-parametric Bayesian approach to model sensor data from multiple people and discover the dynamic behaviors they share. We apply this method to data collected from sensors worn by a population of hospital workers and show that the learned states can cluster participants into meaningful groups and better predict their cognitive and psychological states. This method offers a way to learn interpretable compact behavioral representations from multivariate sensor signals.

Submitted 4 July, 2020; v1 submitted 16 November, 2019; originally announced November 2019.

Showing 1–6 of 6 results for author: Lerman, K