Unsupervised Speaker Diarization that is Agnostic to Language, Overlap-Aware, and Tuning Free

Authors: M. Iftekhar Tanveer, Diego Casabuena, Jussi Karlgren, Rosie Jones

Abstract: Podcasts are conversational in nature and speaker changes are frequent, requiring speaker diarization for content understanding. We propose an unsupervised technique for speaker diarization without relying on language-specific components. The algorithm is overlap-aware and does not require information about the number of speakers. Our approach shows a 79% improvement on purity scores (34% on F-score) against the Google Cloud Platform solution on podcast data.

Submitted 25 July, 2022; originally announced July 2022.

Comments: Published at Interspeech 2022

arXiv:2103.14131 [pdf, other]

Persistence Homology of TEDtalk: Do Sentence Embeddings Have a Topological Shape?

Authors: Shouman Das, Syed A. Haque, Md. Iftekhar Tanveer

Abstract: Topological data analysis (TDA) has recently emerged as a new technique to extract meaningful discriminative features from high-dimensional data. In this paper, we investigate the possibility of applying TDA to improve the classification accuracy of public speaking rating. We calculated persistence image vectors for the sentence embeddings of TEDtalk data and fed these vectors as additional inputs to our machine learning models. We found a negative result: this topological information does not improve the model accuracy significantly. In some cases, it makes the accuracy slightly worse than the original one. From our results, we could not conclude that the topological shapes of the sentence embeddings can help us train a better model for public speaking rating.

Submitted 25 March, 2021; originally announced March 2021.

Comments: 6 pages, 2 figures

arXiv:2012.06157 [pdf, other]

Fairness in Rating Prediction by Awareness of Verbal and Gesture Quality of Public Speeches

Authors: Ankani Chattoraj, Rupam Acharyya, Shouman Das, Md. Iftekhar Tanveer, Ehsan Hoque

Abstract: The role of verbal and non-verbal cues towards great public speaking has been a topic of exploration for many decades. We identify a commonality across present theories, the element of "variety or heterogeneity" in channels or modes of communication (e.g. resorting to stories, scientific facts, emotional connections, facial expressions etc.) which is essential for effectively communicating information. We use this observation to formalize a novel HEterogeneity Metric, HEM, that quantifies the quality of a talk both in the verbal and non-verbal domain (transcript and facial gestures). We use TED talks as an input repository of public speeches because it consists of speakers from a diverse community besides having a wide outreach. We show that there is an interesting relationship between HEM and the ratings of TED talks given to speakers by viewers. It emphasizes that HEM inherently and successfully represents the quality of a talk based on "variety or heterogeneity". Further, we also discover that HEM successfully captures the prevalent bias in ratings with respect to race and gender, which we call sensitive attributes (because prediction based on these might result in unfair outcomes). We incorporate the HEM metric into the loss function of a neural network with the goal of reducing unfairness in rating predictions with respect to race and gender. Our results show that the modified loss function improves fairness in prediction without considerably affecting prediction accuracy of the neural network. Our work ties together a novel metric for public speeches in both verbal and non-verbal domains with the computational power of a neural network to design a fair prediction system for speakers.

Submitted 15 November, 2021; v1 submitted 11 December, 2020; originally announced December 2020.

arXiv:2003.00683 [pdf, other]

Detection and Mitigation of Bias in Ted Talk Ratings

Authors: Rupam Acharyya, Shouman Das, Ankani Chattoraj, Oishani Sengupta, Md Iftekar Tanveer

Abstract: Unbiased data collection is essential to guaranteeing fairness in artificial intelligence models. Implicit bias is a form of behavioral conditioning that leads us to attribute predetermined characteristics to members of certain groups, and it informs the data collection process. This paper quantifies implicit bias in viewer ratings of TEDTalks, a diverse social platform assessing social and professional performance, in order to present the correlations of different kinds of bias across sensitive attributes. Although the viewer ratings of these videos should purely reflect the speaker's competence and skill, our analysis of the ratings demonstrates the presence of overwhelming and predominant implicit bias with respect to race and gender. In our paper, we present strategies to detect and mitigate bias that are critical to removing unfairness in AI.

Submitted 2 March, 2020; originally announced March 2020.

arXiv:2002.12721 [pdf, other]

To be or not to be? A spatial predictive crime model for Rochester

Authors: Ankani Chattoraj, Rupam Acharyya, Sabyasachi Shivkumar, Md Iftekar Tanveer, Mohammad Rafayet Ali

Abstract: This project uses a spatial model (Geographically Weighted Regression) to relate various physical and social features to crime rates. Besides making interesting predictions from basic data statistics, the trained model can be used to predict on the test data. The high accuracy of this prediction on test data then allows us to make predictions of crime probabilities in different areas based on the location, the population, the property rate, the time of the day/year and so on. This further suggests that an application can be built to help people traveling around Rochester be aware when and if they enter a crime-prone area.

Submitted 27 February, 2020; originally announced February 2020.

arXiv:1911.11558 [pdf, other]

FairyTED: A Fair Rating Predictor for TED Talk Data

Authors: Rupam Acharyya, Shouman Das, Ankani Chattoraj, Md. Iftekhar Tanveer

Abstract: With the recent trend of applying machine learning in every aspect of human life, it is important to incorporate fairness into the core of the predictive algorithms. We address the problem of predicting the quality of public speeches while being fair with respect to sensitive attributes of the speakers, e.g. gender and race. We use the TED talks as an input repository of public speeches because it consists of speakers from a diverse community and has a wide outreach. Utilizing the theories of Causal Models, Counterfactual Fairness and state-of-the-art neural language models, we propose a mathematical framework for fair prediction of the public speaking quality. We employ grounded assumptions to construct a causal model capturing how different attributes affect public speaking quality. This causal model contributes to generating counterfactual data to train a fair predictive model. Our framework is general enough to utilize any assumption within the causal model. Experimental results show that while prediction accuracy is comparable to recent work on this dataset, our predictions are counterfactually fair with respect to a novel metric when compared to true data labels. The FairyTED setup not only allows organizers to make an informed and diverse selection of speakers from the unobserved counterfactual possibilities but also ensures that viewers and new users are not influenced by unfair and unbalanced ratings from arbitrary visitors to the www.ted.com website when deciding to view a talk.

Submitted 25 November, 2019; originally announced November 2019.

Comments: 9 pages, 4 figures, 3 tables. Accepted as a conference paper to be presented at AAAI 2020

arXiv:1906.03940 [pdf, other]

Predicting TED Talk Ratings from Language and Prosody

Authors: Md Iftekhar Tanveer, Md Kamrul Hassan, Daniel Gildea, M. Ehsan Hoque

Abstract: We use the largest open repository of public speaking, TED Talks, to predict the ratings of the online viewers. Our dataset contains over 2200 TED Talk transcripts (including over 200 thousand sentences), audio features and the associated meta information including about 5.5 million ratings from spontaneous visitors of the website. We propose three neural network architectures and compare with statistical machine learning. Our experiments reveal that it is possible to predict all the 14 different ratings with an average AUC of 0.83 using the transcripts and prosody features only. The dataset and the complete source code are available for further analysis.

Submitted 20 May, 2019; originally announced June 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1905.08392

arXiv:1905.08392 [pdf, other]

A Causality-Guided Prediction of the TED Talk Ratings from the Speech-Transcripts using Neural Networks

Authors: Md Iftekhar Tanveer, Md Kamrul Hasan, Daniel Gildea, M. Ehsan Hoque

Abstract: Automated prediction of public speaking performance enables novel systems for tutoring public speaking skills. We use the largest open repository, TED Talks, to predict the ratings provided by the online viewers. The dataset contains over 2200 talk transcripts and the associated meta information including over 5.5 million ratings from spontaneous visitors to the website. We carefully removed the bias present in the dataset (e.g., the speakers' reputations, popularity gained by publicity, etc.) by modeling the data generating process using a causal diagram. We use a word sequence based recurrent architecture and a dependency tree based recursive architecture as the neural networks for predicting the TED talk ratings. Our neural network models can predict the ratings with an average F-score of 0.77 which largely outperforms the competitive baseline method.

Submitted 20 May, 2019; originally announced May 2019.

arXiv:1904.06618 [pdf, other]

doi 10.18653/v1/D19-1211

UR-FUNNY: A Multimodal Language Dataset for Understanding Humor

Authors: Md Kamrul Hasan, Wasifur Rahman, Amir Zadeh, Jianyuan Zhong, Md Iftekhar Tanveer, Louis-Philippe Morency, Mohammed Hoque

Abstract: Humor is a unique and creative communicative behavior displayed during social interactions. It is produced in a multimodal manner, through the usage of words (text), gestures (vision) and prosodic cues (acoustic). Understanding humor from these three modalities falls within the boundaries of multimodal language, a recent research trend in natural language processing that models natural language as it happens in face-to-face communication. Although humor detection is an established research area in NLP, in a multimodal context it is an understudied area. This paper presents a diverse multimodal dataset, called UR-FUNNY, to open the door to understanding multimodal language used in expressing humor. The dataset and accompanying studies present a framework in multimodal humor detection for the natural language processing community. UR-FUNNY is publicly available for research.

Submitted 13 April, 2019; originally announced April 2019.

Journal ref: EMNLP-IJCNLP, 2019, 2046-2056

arXiv:1707.04790 [pdf, other]

Automatic Identification of Non-Meaningful Body-Movements and What It Reveals About Humans

Authors: Md Iftekhar Tanveer, RuJie Zhao, Mohammed Hoque

Abstract: We present a framework to identify whether a public speaker's body movements are meaningful or non-meaningful ("Mannerisms") in the context of their speeches. In a dataset of 84 public speaking videos from 28 individuals, we extract 314 unique body movement patterns (e.g. pacing, gesturing, shifting body weights, etc.). Online workers and the speakers themselves annotated the meaningfulness of the patterns. We extracted five types of features from the audio-video recordings: disfluency, prosody, body movements, facial, and lexical. We use linear classifiers to predict the annotations with AUC up to 0.82. Analysis of the classifier weights reveals that it puts larger weights on the lexical features while predicting self-annotations. In contrast, it puts a larger weight on prosody features while predicting audience annotations. This analysis might provide a subtle hint that public speakers tend to focus more on the verbal features while evaluating self-performances. The audience, on the other hand, tends to focus more on the non-verbal aspects of the speech. The dataset and code associated with this work have been released for peer review and further analysis.

Submitted 15 July, 2017; originally announced July 2017.

arXiv:1505.07310 [pdf, other]

Use of Laplacian Projection Technique for Summarizing Likert Scale Annotations

Authors: M. Iftekhar Tanveer

Abstract: Summarizing Likert scale ratings from human annotators is an important step for collecting human judgments. In this project we study a novel, graph theoretic method for this purpose. We also analyze a few interesting properties for this approach using real annotation datasets.

Submitted 26 May, 2015; originally announced May 2015.

arXiv:1504.03425 [pdf, ps, other]

Automated Analysis and Prediction of Job Interview Performance

Authors: Iftekhar Naim, M. Iftekhar Tanveer, Daniel Gildea, Mohammed Hoque

Abstract: We present a computational framework for automatically quantifying verbal and nonverbal behaviors in the context of job interviews. The proposed framework is trained by analyzing the videos of 138 interview sessions with 69 internship-seeking undergraduates at the Massachusetts Institute of Technology (MIT). Our automated analysis includes facial expressions (e.g., smiles, head gestures, facial tracking points), language (e.g., word counts, topic modeling), and prosodic information (e.g., pitch, intonation, and pauses) of the interviewees. The ground truth labels are derived by taking a weighted average over the ratings of 9 independent judges. Our framework can automatically predict the ratings for interview traits such as excitement, friendliness, and engagement with correlation coefficients of 0.75 or higher, and can quantify the relative importance of prosody, language, and facial expressions. By analyzing the relative feature weights learned by the regression models, our framework recommends speaking more fluently, using fewer filler words, speaking as "we" (vs. "I"), using more unique words, and smiling more. We also find that the students who were rated highly while answering the first interview question were also rated highly overall (i.e., first impression matters). Finally, our MIT Interview dataset will be made available to other researchers to further validate and expand our findings.

Submitted 14 April, 2015; originally announced April 2015.

Comments: 14 pages, 8 figures, 6 tables

Showing 1–12 of 12 results for author: Tanveer, M I