
Academic Article Recommendation Using Multiple Perspectives

Authors: Kenneth Church, Omar Alonso, Peter Vickers, Jiameng Sun, Abteen Ebrahimi, Raman Chandrasekar

Abstract: We argue that Content-based filtering (CBF) and Graph-based methods (GB) complement one another in Academic Search recommendations. The scientific literature can be viewed as a conversation between authors and the audience. CBF uses abstracts to infer authors' positions, and GB uses citations to infer responses from the audience. In this paper, we describe nine differences between CBF and GB, as well as synergistic opportunities for hybrid combinations. Two embeddings will be used to illustrate these opportunities: (1) Specter, a CBF method based on BERT-like deepnet encodings of abstracts, and (2) ProNE, a GB method based on spectral clustering of more than 200M papers and 2B citations from Semantic Scholar.

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2406.19504 [pdf, other]

Are Generative Language Models Multicultural? A Study on Hausa Culture and Emotions using ChatGPT

Authors: Ibrahim Said Ahmad, Shiran Dudy, Resmi Ramachandranpillai, Kenneth Church

Abstract: Large Language Models (LLMs), such as ChatGPT, are widely used to generate content for various purposes and audiences. However, these models may not reflect the cultural and emotional diversity of their users, especially for low-resource languages. In this paper, we investigate how ChatGPT represents Hausa's culture and emotions. We compare responses generated by ChatGPT with those provided by native Hausa speakers on 37 culturally relevant questions. We conducted experiments using emotion analysis and applied two similarity metrics to measure the alignment between human and ChatGPT responses. We also collected human participants' ratings and feedback on ChatGPT responses. Our results show that ChatGPT has some level of similarity to human responses, but also exhibits some gaps and biases in its knowledge and awareness of the Hausa culture and emotions. We discuss the implications and limitations of our methodology and analysis and suggest ways to improve the performance and evaluation of LLMs for low-resource languages.

Submitted 27 June, 2024; originally announced June 2024.

arXiv:2403.18251 [pdf, other]

Since the Scientific Literature Is Multilingual, Our Models Should Be Too

Authors: Abteen Ebrahimi, Kenneth Church

Abstract: English has long been assumed the lingua franca of scientific research, and this notion is reflected in the natural language processing (NLP) research involving scientific document representation. In this position piece, we quantitatively show that the literature is largely multilingual and argue that current models and benchmarks should reflect this linguistic diversity. We provide evidence that text-based models fail to create meaningful representations for non-English papers and highlight the negative user-facing impacts of using English-only models indiscriminately across a multilingual domain. We end with suggestions for the NLP community on how to improve performance on non-English documents.

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2307.15456 [pdf, other]

doi 10.3233/FAIA230311

Worrisome Properties of Neural Network Controllers and Their Symbolic Representations

Authors: Jacek Cyranka, Kevin E M Church, Jean-Philippe Lessard

Abstract: We raise concerns about controllers' robustness in simple reinforcement learning benchmark problems. We focus on neural network controllers and their low neuron and symbolic abstractions. A typical controller reaching high mean return values still generates an abundance of persistent low-return solutions, which is a highly undesirable property, easily exploitable by an adversary. We find that the simpler controllers admit more persistent bad solutions. We provide an algorithm for a systematic robustness study and prove existence of persistent solutions and, in some cases, periodic orbits, using a computer-assisted proof methodology.

Submitted 28 July, 2023; originally announced July 2023.

Comments: accepted to ECAI23

Journal ref: Frontiers in Artificial Intelligence and Applications, ECAI 2023

arXiv:2211.10780 [pdf, other]

ArtELingo: A Million Emotion Annotations of WikiArt with Emphasis on Diversity over Language and Culture

Authors: Youssef Mohamed, Mohamed Abdelfattah, Shyma Alhuwaider, Feifan Li, Xiangliang Zhang, Kenneth Ward Church, Mohamed Elhoseiny

Abstract: This paper introduces ArtELingo, a new benchmark and dataset, designed to encourage work on diversity across languages and cultures. Following ArtEmis, a collection of 80k artworks from WikiArt with 0.45M emotion labels and English-only captions, ArtELingo adds another 0.79M annotations in Arabic and Chinese, plus 4.8K in Spanish to evaluate "cultural-transfer" performance. More than 51K artworks have 5 annotations or more in 3 languages. This diversity makes it possible to study similarities and differences across languages and cultures. Further, we investigate captioning tasks, and find diversity improves the performance of baseline models. ArtELingo is publicly available at https://www.artelingo.org/ with standard splits and baseline models. We hope our work will help ease future research on multilinguality and culturally-aware AI.

Submitted 19 November, 2022; originally announced November 2022.

Comments: 9 pages, Accepted at EMNLP 22, for more details see https://www.artelingo.org/

arXiv:2204.12672 [pdf, other]

Data-Driven Adaptive Simultaneous Machine Translation

Authors: Guangxu Xun, Mingbo Ma, Yuchen Bian, Xingyu Cai, Jiaji Huang, Renjie Zheng, Junkun Chen, Jiahong Yuan, Kenneth Church, Liang Huang

Abstract: In simultaneous translation (SimulMT), the most widely used strategy is the wait-k policy thanks to its simplicity and effectiveness in balancing translation quality and latency. However, wait-k suffers from two major limitations: (a) it is a fixed policy that cannot adaptively adjust latency given context, and (b) its training is much slower than full-sentence translation. To alleviate these issues, we propose a novel and efficient training scheme for adaptive SimulMT by augmenting the training corpus with adaptive prefix-to-prefix pairs, while the training complexity remains the same as that of training full-sentence translation models. Experiments on two language pairs show that our method outperforms all strong baselines in terms of translation quality and latency.

Submitted 26 April, 2022; originally announced April 2022.

arXiv:2203.03763 [pdf, other]

doi 10.1007/s10714-022-03054-8

Periodic orbits in Hořava-Lifshitz cosmologies

Authors: Kevin E. M. Church, Olivier Hénot, Phillipo Lappicy, Jean-Philippe Lessard, Hauke Sprink

Abstract: We consider spatially homogeneous Hořava-Lifshitz (HL) models that perturb General Relativity (GR) by a parameter $v\in (0,1)$ such that GR occurs at $v=1/2$. We describe the dynamics for the extremal case $v=0$, which possess the usual Bianchi hierarchy: type $\mathrm{I}$ (Kasner circle of equilibria), type $\mathrm{II}$ (heteroclinics that induce the Kasner map) and type $\mathrm{VI_0},\mathrm{VII_0}$ (further heteroclinics). For type $\mathrm{VIII}$ and $\mathrm{IX}$, we use a computer-assisted approach to prove the existence of periodic orbits which are far from the Mixmaster attractor and thereby we obtain a new behaviour which is not described by the BKL picture of bouncing Kasner-like states.

Submitted 7 December, 2022; v1 submitted 7 March, 2022; originally announced March 2022.

Comments: 21 pages, 7 figures. arXiv admin note: text overlap with arXiv:2012.07614

arXiv:2202.13326 [pdf, other]

Computer-assisted proofs of Hopf bubbles and degenerate Hopf bifurcations

Authors: Kevin Church, Elena Queirolo

Abstract: We present a computer-assisted approach to prove the existence of Hopf bubbles and degenerate Hopf bifurcations in ordinary and delay differential equations. We apply the method to rigorously investigate these nonlocal bifurcation structures in the FitzHugh-Nagumo equation, the extended Lorenz-84 model and a time-delay SI model.

Submitted 27 February, 2022; originally announced February 2022.

Comments: 51 pages, 15 figures

MSC Class: 34K18 (Primary); 37G15 (Secondary)

arXiv:2201.01942 [pdf, other]

Efficiently Disentangle Causal Representations

Authors: Yuanpeng Li, Joel Hestness, Mohamed Elhoseiny, Liang Zhao, Kenneth Church

Abstract: This paper proposes an efficient approach to learning disentangled representations with causal mechanisms based on the difference of conditional probabilities in original and new distributions. We approximate the difference with models' generalization abilities so that it fits in the standard machine learning framework and can be efficiently computed. In contrast to the state-of-the-art approach, which relies on the learner's adaptation speed to a new distribution, the proposed approach only requires evaluating the model's generalization ability. We provide a theoretical explanation for the advantage of the proposed method, and our experiments show that the proposed technique is 1.9--11.0$\times$ more sample efficient and 9.4--32.4 times quicker than the previous method on various tasks. The source code is available at https://github.com/yuanpeng16/EDCR.

Submitted 1 January, 2024; v1 submitted 6 January, 2022; originally announced January 2022.

Comments: 17 pages, 7 figures

Report number: Causal-01

arXiv:2111.03628 [pdf, other]

Exploiting a Zoo of Checkpoints for Unseen Tasks

Authors: Jiaji Huang, Qiang Qiu, Kenneth Church

Abstract: There are so many models in the literature that it is difficult for practitioners to decide which combinations are likely to be effective for a new task. This paper attempts to address this question by capturing relationships among checkpoints published on the web. We model the space of tasks as a Gaussian process. The covariance can be estimated from checkpoints and unlabeled probing data. With the Gaussian process, we can identify representative checkpoints by a maximum mutual information criterion. This objective is submodular. A greedy method identifies representatives that are likely to "cover" the task space. These representatives generalize to new tasks with superior performance. Empirical evidence is provided for applications from both computational linguistics and computer vision.

Submitted 5 November, 2021; originally announced November 2021.

Comments: Accepted at NeurIPS 2021

arXiv:2108.01132 [pdf, other]

The Role of Phonetic Units in Speech Emotion Recognition

Authors: Jiahong Yuan, Xingyu Cai, Renjie Zheng, Liang Huang, Kenneth Church

Abstract: We propose a method for emotion recognition through emotion-dependent speech recognition using Wav2vec 2.0. Our method achieved a significant improvement over most previously reported results on IEMOCAP, a benchmark emotion dataset. Different types of phonetic units are employed and compared in terms of accuracy and robustness of emotion recognition within and across datasets and languages. Models of phonemes, broad phonetic classes, and syllables all significantly outperform the utterance model, demonstrating that phonetic units are helpful and should be incorporated in speech emotion recognition. The best performance is from using broad phonetic classes. Further research is needed to investigate the optimal set of broad phonetic classes for the task of emotion recognition. Finally, we found that Wav2vec 2.0 can be fine-tuned to recognize coarser-grained or larger phonetic units than phonemes, such as broad phonetic classes and syllables.

Submitted 2 August, 2021; originally announced August 2021.

arXiv:2108.01129 [pdf, other]

Decoupling recognition and transcription in Mandarin ASR

Authors: Jiahong Yuan, Xingyu Cai, Dongji Gao, Renjie Zheng, Liang Huang, Kenneth Church

Abstract: Much of the recent literature on automatic speech recognition (ASR) is taking an end-to-end approach. Unlike English, where the writing system is closely related to sound, Chinese characters (Hanzi) represent meaning, not sound. We propose factoring audio -> Hanzi into two sub-tasks: (1) audio -> Pinyin and (2) Pinyin -> Hanzi, where Pinyin is a system of phonetic transcription of standard Chinese. Factoring the audio -> Hanzi task in this way achieves 3.9% CER (character error rate) on the Aishell-1 corpus, the best result reported on this dataset so far.

Submitted 2 August, 2021; originally announced August 2021.

Comments: submitted to ASRU 2021

arXiv:2108.01122 [pdf, other]

Automatic recognition of suprasegmentals in speech

Authors: Jiahong Yuan, Neville Ryant, Xingyu Cai, Kenneth Church, Mark Liberman

Abstract: This study reports our efforts to improve automatic recognition of suprasegmentals by fine-tuning wav2vec 2.0 with CTC, a method that has been successful in automatic speech recognition. We demonstrate that the method can improve the state-of-the-art on automatic recognition of syllables, tones, and pitch accents. Utilizing segmental information, by employing tonal finals or tonal syllables as recognition units, can significantly improve Mandarin tone recognition. Language models are helpful when tonal syllables are used as recognition units, but not helpful when tones are recognition units. Finally, Mandarin tone recognition can benefit from English phoneme recognition by combining the two tasks in fine-tuning wav2vec 2.0.

Submitted 3 August, 2021; v1 submitted 2 August, 2021; originally announced August 2021.

Comments: submitted to ASRU 2021

arXiv:2105.05915 [pdf, other]

Better than BERT but Worse than Baseline

Authors: Boxiang Liu, Jiaji Huang, Xingyu Cai, Kenneth Church

Abstract: This paper compares BERT-SQuAD and Ab3P on the Abbreviation Definition Identification (ADI) task. ADI inputs a text and outputs short forms (abbreviations/acronyms) and long forms (expansions). BERT with reranking improves over BERT without reranking but fails to reach the Ab3P rule-based baseline. What is BERT missing? Reranking introduces two new features: charmatch and freq. The first feature identifies opportunities to take advantage of character constraints in acronyms and the second feature identifies opportunities to take advantage of frequency constraints across documents.

Submitted 12 May, 2021; originally announced May 2021.

Comments: 6 pages, 2 figures, 5 tables

arXiv:2012.01477 [pdf, other]

The Third DIHARD Diarization Challenge

Authors: Neville Ryant, Prachi Singh, Venkat Krishnamohan, Rajat Varma, Kenneth Church, Christopher Cieri, Jun Du, Sriram Ganapathy, Mark Liberman

Abstract: DIHARD III was the third in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variability in recording equipment, noise conditions, and conversational domain. Speaker diarization was evaluated under two speech activity conditions (diarization from a reference speech activity vs. diarization from scratch) and 11 diverse domains. The domains span a range of recording conditions and interaction types, including read audio-books, meeting speech, clinical interviews, web videos, and, for the first time, conversational telephone speech. A total of 30 organizations (forming 21 teams) from industry and academia submitted 499 valid system outputs. The evaluation results indicate that speaker diarization has improved markedly since DIHARD I, particularly for two-party interactions, but that for many domains (e.g., web video) the problem remains far from solved.

Submitted 5 April, 2021; v1 submitted 2 December, 2020; originally announced December 2020.

Comments: arXiv admin note: text overlap with arXiv:1906.07839

arXiv:2010.10048 [pdf, other]

Fluent and Low-latency Simultaneous Speech-to-Speech Translation with Self-adaptive Training

Authors: Renjie Zheng, Mingbo Ma, Baigong Zheng, Kaibo Liu, Jiahong Yuan, Kenneth Church, Liang Huang

Abstract: Simultaneous speech-to-speech translation is widely useful but extremely challenging, since it needs to generate target-language speech concurrently with the source-language speech, with only a few seconds delay. In addition, it needs to continuously translate a stream of sentences, but all recent solutions merely focus on the single-sentence scenario. As a result, current approaches accumulate latencies progressively when the speaker talks faster, and introduce unnatural pauses when the speaker talks slower. To overcome these issues, we propose Self-Adaptive Translation (SAT) which flexibly adjusts the length of translations to accommodate different source speech rates. At similar levels of translation quality (as measured by BLEU), our method generates more fluent target speech (as measured by the naturalness metric MOS) with substantially lower latency than the baseline, in both Zh <-> En directions.

Submitted 21 October, 2020; v1 submitted 20 October, 2020; originally announced October 2020.

Comments: 10 pages, accepted by Findings of EMNLP 2020

Journal ref: Findings of EMNLP 2020

arXiv:2006.05815 [pdf, other]

Third DIHARD Challenge Evaluation Plan

Authors: Neville Ryant, Kenneth Church, Christopher Cieri, Jun Du, Sriram Ganapathy, Mark Liberman

Abstract: This paper introduces the third DIHARD challenge, the third in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variation in recording equipment, noise conditions, and conversational domain. The challenge comprises two tracks evaluating diarization performance when starting from a reference speech segmentation (track 1) and diarization from scratch on raw audio (track 2). We describe the task, metrics, datasets, and evaluation protocol.

Submitted 2 December, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

Comments: Version 1.2 - Planned schedule updated - Updated numbers in tables from final versions of development/evaluation sets - Corrected typo

arXiv:2004.00436 [pdf, other]

Exploring Long Tail Visual Relationship Recognition with Large Vocabulary

Authors: Sherif Abdelkarim, Aniket Agarwal, Panos Achlioptas, Jun Chen, Jiaji Huang, Boyang Li, Kenneth Church, Mohamed Elhoseiny

Abstract: Several approaches have been proposed in recent literature to alleviate the long-tail problem, mainly in object classification tasks. In this paper, we make the first large-scale study concerning the task of Long-Tail Visual Relationship Recognition (LTVRR). LTVRR aims at improving the learning of structured visual relationships that come from the long-tail (e.g., "rabbit grazing on grass"). In this setup, the subject, relation, and object classes each follow a long-tail distribution. To begin our study and make a future benchmark for the community, we introduce two LTVRR-related benchmarks, dubbed VG8K-LT and GQA-LT, built upon the widely used Visual Genome and GQA datasets. We use these benchmarks to study the performance of several state-of-the-art long-tail models on the LTVRR setup. Lastly, we propose a visiolinguistic hubless (VilHub) loss and a Mixup augmentation technique adapted to LTVRR setup, dubbed as RelMix. Both VilHub and RelMix can be easily integrated on top of existing models and despite being simple, our results show that they can remarkably improve the performance, especially on tail classes. Benchmarks, code, and models have been made available at: https://github.com/Vision-CAIR/LTVRR.

Submitted 25 September, 2021; v1 submitted 25 March, 2020; originally announced April 2020.

ACM Class: I.2.10; I.5.0; I.4.0

arXiv:1912.07766 [pdf, ps, other]

User manual and tutorial for ISIM1s: a tiny MATLAB package for single stage invariant manifold-guided impulsive stabilization of delay equations

Authors: Kevin E. M. Church

Abstract: ISIM1s consists of a few MATLAB functions and a script that can be used to derive stabilizing impulsive controllers for delay differential equations. This document serves as both a manual and tutorial on the functionality of the ISIM1s package. Brief background on the theoretically guaranteed stabilization scenario is provided before the primary MATLAB script is explained. The tutorial demonstrates how the package can be used to derive stabilizing impulsive controllers for delay differential equations of various complexity scales. Emphasis is placed on the role of various tuning parameters.

Submitted 12 February, 2021; v1 submitted 16 December, 2019; originally announced December 2019.

arXiv:1911.02750 [pdf, other]

Incremental Text-to-Speech Synthesis with Prefix-to-Prefix Framework

Authors: Mingbo Ma, Baigong Zheng, Kaibo Liu, Renjie Zheng, Hairong Liu, Kainan Peng, Kenneth Church, Liang Huang

Abstract: Text-to-speech synthesis (TTS) has witnessed rapid progress in recent years, where neural methods became capable of producing audios with high naturalness. However, these efforts still suffer from two types of latencies: (a) the computational latency (synthesizing time), which grows linearly with the sentence length even with parallel approaches, and (b) the input latency in scenarios where the input text is incrementally generated (such as in simultaneous translation, dialog generation, and assistive technologies). To reduce these latencies, we devise the first neural incremental TTS approach based on the recently proposed prefix-to-prefix framework. We synthesize speech in an online fashion, playing a segment of audio while generating the next, resulting in an $O(1)$ rather than $O(n)$ latency.

Submitted 6 October, 2020; v1 submitted 6 November, 2019; originally announced November 2019.

Comments: Findings of EMNLP 2020

arXiv:1906.07839 [pdf, ps, other]

The Second DIHARD Diarization Challenge: Dataset, task, and baselines

Authors: Neville Ryant, Kenneth Church, Christopher Cieri, Alejandrina Cristia, Jun Du, Sriram Ganapathy, Mark Liberman

Abstract: This paper introduces the second DIHARD challenge, the second in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variation in recording equipment, noise conditions, and conversational domain. The challenge comprises four tracks evaluating diarization performance under two input conditions (single channel vs. multi-channel) and two segmentation conditions (diarization from a reference speech segmentation vs. diarization from scratch). In order to prevent participants from overtuning to a particular combination of recording conditions and conversational domain, recordings are drawn from a variety of sources ranging from read audiobooks to meeting speech, to child language acquisition recordings, to dinner parties, to web video. We describe the task and metrics, challenge design, datasets, and baseline systems for speech enhancement, speech activity detection, and diarization.

Submitted 18 June, 2019; originally announced June 2019.

Comments: Accepted by Interspeech 2019

arXiv:1810.10045 [pdf, other]

Language Modeling at Scale

Authors: Mostofa Patwary, Milind Chabbi, Heewoo Jun, Jiaji Huang, Gregory Diamos, Kenneth Church

Abstract: We show how Zipf's Law can be used to scale up language modeling (LM) to take advantage of more training data and more GPUs. LM plays a key role in many important natural language applications such as speech recognition and machine translation. Scaling up LM is important since it is widely accepted by the community that there is no data like more data. Eventually, we would like to train on terabytes (TBs) of text (trillions of words). Modern training methods are far from this goal, because of various bottlenecks, especially memory (within GPUs) and communication (across GPUs). This paper shows how Zipf's Law can address these bottlenecks by grouping parameters for common words and character sequences, because $U \ll N$, where $U$ is the number of unique words (types) and $N$ is the size of the training set (tokens). For a local batch size $K$ with $G$ GPUs and a $D$-dimension embedding matrix, we reduce the original per-GPU memory and communication asymptotic complexity from $\Theta(GKD)$ to $\Theta(GK + UD)$. Empirically, we find $U \propto (GK)^{0.64}$ on four publicly available large datasets. When we scale up the number of GPUs to 64, a factor of 8, training time speeds up by factors up to 6.7$\times$ (for character LMs) and 6.3$\times$ (for word LMs) with negligible loss of accuracy. Our weak scaling on 192 GPUs on the Tieba dataset shows a 35\% improvement in LM prediction accuracy by training on 93 GB of data (2.5$\times$ larger than publicly available SOTA dataset), but taking only 1.25$\times$ increase in training time, compared to 3 GB of the same dataset running on 6 GPUs.
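The complexity claim in this abstract can be made concrete with a little arithmetic. The sketch below is only an illustration of the two asymptotic expressions, $\Theta(GKD)$ and $\Theta(GK + UD)$, with $U \propto (GK)^{0.64}$. The function name `per_gpu_cost`, the proportionality constant `c`, and the example values of `G`, `K`, and `D` are assumptions made here, not figures from the paper.

```python
def per_gpu_cost(G, K, D, c=1.0, alpha=0.64):
    """Compare the two asymptotic cost expressions from the abstract.

    G: number of GPUs, K: local batch size, D: embedding dimension.
    c is a hypothetical proportionality constant (the abstract only
    states U is proportional to (G*K)**0.64); alpha is that exponent.
    Returns (baseline, reduced, U), ignoring constant factors.
    """
    tokens = G * K                 # tokens in one global batch
    U = c * tokens ** alpha        # estimated unique words (types)
    baseline = G * K * D           # Theta(G K D)
    reduced = G * K + U * D        # Theta(G K + U D)
    return baseline, reduced, U


if __name__ == "__main__":
    # Hypothetical configuration, chosen only for illustration.
    baseline, reduced, U = per_gpu_cost(G=64, K=1024, D=512)
    print(f"unique words U ~ {U:.0f}")
    print(f"baseline / reduced ~ {baseline / reduced:.1f}x")
```

Because $U$ grows sublinearly in $GK$, the gap between the two expressions widens as more GPUs are added, which is the intuition behind the reported scaling behaviour.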

Submitted 23 October, 2018; originally announced October 2018.

arXiv:1510.05192 [pdf]

Three Hours a Day: Understanding Current Teen Practices of Smartphone Application Use

Authors: Frank Bentley, Karen Church, Beverly Harrison, Kent Lyons, Matthew Rafalow

Abstract: Teens are using mobile devices for an increasing number of activities. Smartphones and a variety of mobile apps for communication, entertainment, and productivity have become an integral part of their lives. This mobile phone use has evolved rapidly as technology has changed and thus studies from even 2 or 3 years ago may not reflect new patterns and practices as smartphones have become more sophisticated. In order to understand teens' current practices around smartphone use, we conducted a two-week, mixed-methods study with 14 diverse teens. Through voicemail diaries, interviews, and real-world usage data from a logging application installed on their smartphones, we developed an understanding of the types of apps used by teens, when they use these apps, and their reasons for using specific apps in particular situations. We found that the teens in our study used their smartphones for an average of almost 3 hours per day and that two-thirds of all app use involved interacting with an average of almost 10 distinct communications applications. From our study data, we highlight key implications for the design of future mobile apps or services, specifically new social and communications-related applications that allow teens to maintain desired levels of privacy and permanence on the content that they share.

Submitted 17 October, 2015; originally announced October 2015.

ACM Class: H.5.m

arXiv:1505.03014 [pdf, other]

Frappe: Understanding the Usage and Perception of Mobile App Recommendations In-The-Wild

Authors: Linas Baltrunas, Karen Church, Alexandros Karatzoglou, Nuria Oliver

Abstract: This paper describes a real-world deployment of a context-aware mobile app recommender system (RS) called Frappe. Using a hybrid approach, we conducted a large-scale app market deployment with 1000 Android users combined with a small-scale local user study involving 33 users. The resulting usage logs and subjective feedback enabled us to gather key insights into (1) context-dependent app usage and (2) the perceptions and experiences of end-users while interacting with context-aware mobile app recommendations. While Frappe performs very well based on usage-centric evaluation metrics, insights from the small-scale study reveal some negative user experiences. Our results point to a number of actionable lessons learned specifically related to designing, deploying and evaluating mobile context-aware RS in-the-wild with real users.

Submitted 12 May, 2015; originally announced May 2015.

Report number: 11 ACM Class: H.3.3; H.5.2

arXiv:cs/0610155 [pdf, ps, other]

Nonlinear Estimators and Tail Bounds for Dimension Reduction in $l_1$ Using Cauchy Random Projections

Authors: Ping Li, Trevor J. Hastie, Kenneth W. Church

Abstract: For dimension reduction in $l_1$, the method of Cauchy random projections multiplies the original data matrix $\mathbf{A} \in\mathbb{R}^{n\times D}$ with a random matrix $\mathbf{R} \in \mathbb{R}^{D\times k}$ ($k\ll\min(n,D)$) whose entries are i.i.d. samples of the standard Cauchy C(0,1). Because of the impossibility results, one cannot hope to recover the pairwise $l_1$ distances in $\mathbf{A}$ from $\mathbf{B} = \mathbf{AR} \in \mathbb{R}^{n\times k}$ using linear estimators without incurring large errors. However, nonlinear estimators are still useful for certain applications in data stream computation, information retrieval, learning, and data mining. We propose three types of nonlinear estimators: the bias-corrected sample median estimator, the bias-corrected geometric mean estimator, and the bias-corrected maximum likelihood estimator. The sample median estimator and the geometric mean estimator are asymptotically (as $k\to \infty$) equivalent but the latter is more accurate at small $k$. We derive explicit tail bounds for the geometric mean estimator and establish an analog of the Johnson-Lindenstrauss (JL) lemma for dimension reduction in $l_1$, which is weaker than the classical JL lemma for dimension reduction in $l_2$. Asymptotically, both the sample median estimator and the geometric mean estimator are about 80% efficient compared to the maximum likelihood estimator (MLE). We analyze the moments of the MLE and propose approximating the distribution of the MLE by an inverse Gaussian.
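A minimal sketch of the projection and two of the nonlinear estimators described in this abstract, using only the Python standard library. It relies on the fact that each projected coordinate of a difference vector is Cauchy-distributed with scale equal to the $l_1$ distance $d$, so the median of the absolute values estimates $d$, and the geometric mean scaled by $\cos^k(\pi/(2k))$ is unbiased for $d$. The function names are hypothetical, and the median version shown here omits the paper's bias correction.

```python
import math
import random
import statistics


def cauchy_matrix(D, k, rng):
    """D x k matrix of i.i.d. standard Cauchy C(0,1) samples (inverse CDF)."""
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(k)]
            for _ in range(D)]


def project(u, R):
    """Row vector u (length D) times R (D x k), giving k projected values."""
    k = len(R[0])
    return [sum(u[i] * R[i][j] for i in range(len(u))) for j in range(k)]


def median_estimate(x):
    """Sample median of |x_j|; uncorrected, so slightly biased at small k."""
    return statistics.median(abs(v) for v in x)


def geometric_mean_estimate(x):
    """Bias-corrected geometric mean: cos(pi/(2k))**k * prod |x_j|**(1/k)."""
    k = len(x)
    if any(v == 0 for v in x):
        return 0.0  # the product is zero if any projection is exactly zero
    log_gm = sum(math.log(abs(v)) for v in x) / k
    return math.cos(math.pi / (2 * k)) ** k * math.exp(log_gm)


if __name__ == "__main__":
    rng = random.Random(0)
    D, k = 50, 2000
    u = [rng.uniform(-1, 1) for _ in range(D)]
    v = [rng.uniform(-1, 1) for _ in range(D)]
    diff = [a - b for a, b in zip(u, v)]  # projection is linear in u - v
    x = project(diff, cauchy_matrix(D, k, rng))
    print("true l1 distance:", sum(abs(t) for t in diff))
    print("median estimate: ", median_estimate(x))
    print("geo-mean estimate:", geometric_mean_estimate(x))
```

The logarithms are summed rather than multiplying the raw values because Cauchy samples are heavy-tailed and a direct product can overflow or underflow for large $k$.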

Submitted 27 October, 2006; originally announced October 2006.

arXiv:cmp-lg/9407021 [pdf, ps]

K-vec: A New Approach for Aligning Parallel Texts

Authors: Pascale Fung, Kenneth Church

Abstract: Various methods have been proposed for aligning texts in two or more languages such as the Canadian Parliamentary Debates (Hansards). Some of these methods generate a bilingual lexicon as a by-product. We present an alternative alignment strategy which we call K-vec, that starts by estimating the lexicon. For example, it discovers that the English word "fisheries" is similar to the French "pêches" by noting that the distribution of "fisheries" in the English text is similar to the distribution of "pêches" in the French. K-vec does not depend on sentence boundaries.
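The idea of comparing word distributions without sentence boundaries can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each text is cut into K equal-sized pieces, each word gets a binary vector marking the pieces it occurs in, and vectors are compared with a Dice coefficient, which is a stand-in similarity score chosen here for simplicity. The function names and the toy token lists are hypothetical.

```python
def occurrence_vector(tokens, word, K):
    """Binary vector of length K: 1 where `word` occurs in that piece of text."""
    n = len(tokens)
    vec = [0] * K
    for pos, tok in enumerate(tokens):
        if tok == word:
            vec[pos * K // n] = 1  # index of the piece containing this position
    return vec


def dice(a, b):
    """Dice coefficient between two binary vectors (stand-in similarity)."""
    both = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * both / total if total else 0.0


if __name__ == "__main__":
    # Toy parallel "texts"; real input would be full tokenized documents.
    english = ["fisheries", "x", "x", "x", "x", "fisheries", "budget", "x"]
    french = ["pêches", "y", "y", "y", "pêches", "y", "y", "budget"]
    K = 4
    e = occurrence_vector(english, "fisheries", K)
    print(dice(e, occurrence_vector(french, "pêches", K)))
    print(dice(e, occurrence_vector(french, "budget", K)))
```

Word pairs whose vectors overlap in most pieces are candidate translations, and those pairs can then serve as anchor points for alignment.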

Submitted 25 July, 1994; originally announced July 1994.

Comments: 7 pages, uuencoded, compressed PostScript; Proc. COLING-94

Showing 1–26 of 26 results for author: Church, K