Search | arXiv e-print repository

Small Language Models are Good Too: An Empirical Study of Zero-Shot Classification

Authors: Pierre Lepagnol, Thomas Gerald, Sahar Ghannay, Christophe Servan, Sophie Rosset

Abstract: This study is part of the debate on the efficiency of large versus small language models for text classification by prompting.We assess the performance of small language models in zero-shot text classification, challenging the prevailing dominance of large models.Across 15 datasets, our investigation benchmarks language models from 77M to 40B parameters using different architectures and scoring fu… ▽ More This study is part of the debate on the efficiency of large versus small language models for text classification by prompting.We assess the performance of small language models in zero-shot text classification, challenging the prevailing dominance of large models.Across 15 datasets, our investigation benchmarks language models from 77M to 40B parameters using different architectures and scoring functions. Our findings reveal that small models can effectively classify texts, getting on par with or surpassing their larger counterparts.We developed and shared a comprehensive open-source repository that encapsulates our methodologies. This research underscores the notion that bigger isn't always better, suggesting that resource-efficient small models may offer viable solutions for specific data classification challenges. △ Less

Submitted 17 April, 2024; originally announced April 2024.

Journal ref: LREC-COLING 2024, May 2024, TURIN, Italy

arXiv:2403.19727 [pdf, ps, other]

New Semantic Task for the French Spoken Language Understanding MEDIA Benchmark

Authors: Nadège Alavoine, Gaëlle Laperriere, Christophe Servan, Sahar Ghannay, Sophie Rosset

Abstract: Intent classification and slot-filling are essential tasks of Spoken Language Understanding (SLU). In most SLUsystems, those tasks are realized by independent modules. For about fifteen years, models achieving both of themjointly and exploiting their mutual enhancement have been proposed. A multilingual module using a joint modelwas envisioned to create a touristic dialogue system for a European p… ▽ More Intent classification and slot-filling are essential tasks of Spoken Language Understanding (SLU). In most SLUsystems, those tasks are realized by independent modules. For about fifteen years, models achieving both of themjointly and exploiting their mutual enhancement have been proposed. A multilingual module using a joint modelwas envisioned to create a touristic dialogue system for a European project, HumanE-AI-Net. A combination ofmultiple datasets, including the MEDIA dataset, was suggested for training this joint model. The MEDIA SLU datasetis a French dataset distributed since 2005 by ELRA, mainly used by the French research community and free foracademic research since 2020. Unfortunately, it is annotated only in slots but not intents. An enhanced version ofMEDIA annotated with intents has been built to extend its use to more tasks and use cases. This paper presents thesemi-automatic methodology used to obtain this enhanced version. In addition, we present the first results of SLUexperiments on this enhanced dataset using joint models for intent classification and slot-filling. △ Less

Submitted 28 March, 2024; originally announced March 2024.

Journal ref: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024, Torino, Italy

arXiv:2403.18338 [pdf, other]

mALBERT: Is a Compact Multilingual BERT Model Still Worth It?

Authors: Christophe Servan, Sahar Ghannay, Sophie Rosset

Abstract: Within the current trend of Pretained Language Models (PLM), emerge more and more criticisms about the ethical andecological impact of such models. In this article, considering these critical remarks, we propose to focus on smallermodels, such as compact models like ALBERT, which are more ecologically virtuous than these PLM. However,PLMs enable huge breakthroughs in Natural Language Processing ta… ▽ More Within the current trend of Pretained Language Models (PLM), emerge more and more criticisms about the ethical andecological impact of such models. In this article, considering these critical remarks, we propose to focus on smallermodels, such as compact models like ALBERT, which are more ecologically virtuous than these PLM. However,PLMs enable huge breakthroughs in Natural Language Processing tasks, such as Spoken and Natural LanguageUnderstanding, classification, Question--Answering tasks. PLMs also have the advantage of being multilingual, and,as far as we know, a multilingual version of compact ALBERT models does not exist. Considering these facts, wepropose the free release of the first version of a multilingual compact ALBERT model, pre-trained using Wikipediadata, which complies with the ethical aspect of such a language model. We also evaluate the model against classicalmultilingual PLMs in classical NLP tasks. Finally, this paper proposes a rare study on the subword tokenizationimpact on language performances. △ Less

Submitted 27 March, 2024; originally announced March 2024.

Comments: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, May 2024, Torino, Italy

arXiv:2302.09526 [pdf, other]

Mixed Semi-Supervised Generalized-Linear-Regression with applications to Deep-Learning and Interpolators

Authors: Oren Yuval, Saharon Rosset

Abstract: We present a methodology for using unlabeled data to design semi supervised learning (SSL) methods that improve the prediction performance of supervised learning for regression tasks. The main idea is to design different mechanisms for integrating the unlabeled data, and include in each of them a mixing parameter $α$, controlling the weight given to the unlabeled data. Focusing on Generalized Line… ▽ More We present a methodology for using unlabeled data to design semi supervised learning (SSL) methods that improve the prediction performance of supervised learning for regression tasks. The main idea is to design different mechanisms for integrating the unlabeled data, and include in each of them a mixing parameter $α$, controlling the weight given to the unlabeled data. Focusing on Generalized Linear Models (GLM) and linear interpolators classes of models, we analyze the characteristics of different mixing mechanisms, and prove that in all cases, it is invariably beneficial to integrate the unlabeled data with some nonzero mixing ratio $α>0$, in terms of predictive performance. Moreover, we provide a rigorous framework to estimate the best mixing ratio $α^*$ where mixed SSL delivers the best predictive performance, while using the labeled and unlabeled data on hand. The effectiveness of our methodology in delivering substantial improvement compared to the standard supervised models, in a variety of settings, is demonstrated empirically through extensive simulation, in a manner that supports the theoretical analysis. We also demonstrate the applicability of our methodology (with some intuitive modifications) to improve more complex models, such as deep neural networks, in real-world regression tasks. △ Less

Submitted 28 May, 2024; v1 submitted 19 February, 2023; originally announced February 2023.

Comments: 63 pages, 10 figures

MSC Class: 62F10; 62J12; 68T07

arXiv:2207.09157 [pdf, ps, other]

On the cross-lingual transferability of multilingual prototypical models across NLU tasks

Authors: Oralie Cattan, Christophe Servan, Sophie Rosset

Abstract: Supervised deep learning-based approaches have been applied to task-oriented dialog and have proven to be effective for limited domain and language applications when a sufficient number of training examples are available. In practice, these approaches suffer from the drawbacks of domain-driven design and under-resourced languages. Domain and language models are supposed to grow and change as the p… ▽ More Supervised deep learning-based approaches have been applied to task-oriented dialog and have proven to be effective for limited domain and language applications when a sufficient number of training examples are available. In practice, these approaches suffer from the drawbacks of domain-driven design and under-resourced languages. Domain and language models are supposed to grow and change as the problem space evolves. On one hand, research on transfer learning has demonstrated the cross-lingual ability of multilingual Transformers-based models to learn semantically rich representations. On the other, in addition to the above approaches, meta-learning have enabled the development of task and language learning algorithms capable of far generalization. Through this context, this article proposes to investigate the cross-lingual transferability of using synergistically few-shot learning with prototypical neural networks and multilingual Transformers-based models. Experiments in natural language understanding tasks on MultiATIS++ corpus shows that our approach substantially improves the observed transfer learning performances between the low and the high resource languages. More generally our approach confirms that the meaningful latent space learned in a given language can be can be generalized to unseen and under-resourced ones using meta-learning. △ Less

Submitted 19 July, 2022; originally announced July 2022.

Comments: Accepted to the ACL workshop METANLP 2021

MSC Class: 68T50 ACM Class: I.2.7

arXiv:2207.09152 [pdf, ps, other]

Benchmarking Transformers-based models on French Spoken Language Understanding tasks

Authors: Oralie Cattan, Sahar Ghannay, Christophe Servan, Sophie Rosset

Abstract: In the last five years, the rise of the self-attentional Transformer-based architectures led to state-of-the-art performances over many natural language tasks. Although these approaches are increasingly popular, they require large amounts of data and computational resources. There is still a substantial need for benchmarking methodologies ever upwards on under-resourced languages in data-scarce ap… ▽ More In the last five years, the rise of the self-attentional Transformer-based architectures led to state-of-the-art performances over many natural language tasks. Although these approaches are increasingly popular, they require large amounts of data and computational resources. There is still a substantial need for benchmarking methodologies ever upwards on under-resourced languages in data-scarce application conditions. Most pre-trained language models were massively studied using the English language and only a few of them were evaluated on French. In this paper, we propose a unified benchmark, focused on evaluating models quality and their ecological impact on two well-known French spoken language understanding tasks. Especially we benchmark thirteen well-established Transformer-based models on the two available spoken language understanding tasks for French: MEDIA and ATIS-FR. Within this framework, we show that compact models can reach comparable results to bigger ones while their ecological impact is considerably lower. However, this assumption is nuanced and depends on the considered compression method. △ Less

Submitted 19 July, 2022; originally announced July 2022.

Comments: Accepted paper at INTERSPEECH 2022

MSC Class: 68T50 ACM Class: I.2.7

arXiv:2207.09150 [pdf, ps, other]

On the Usability of Transformers-based models for a French Question-Answering task

Authors: Oralie Cattan, Christophe Servan, Sophie Rosset

Abstract: For many tasks, state-of-the-art results have been achieved with Transformer-based architectures, resulting in a paradigmatic shift in practices from the use of task-specific architectures to the fine-tuning of pre-trained language models. The ongoing trend consists in training models with an ever-increasing amount of data and parameters, which requires considerable resources. It leads to a strong… ▽ More For many tasks, state-of-the-art results have been achieved with Transformer-based architectures, resulting in a paradigmatic shift in practices from the use of task-specific architectures to the fine-tuning of pre-trained language models. The ongoing trend consists in training models with an ever-increasing amount of data and parameters, which requires considerable resources. It leads to a strong search to improve resource efficiency based on algorithmic and hardware improvements evaluated only for English. This raises questions about their usability when applied to small-scale learning problems, for which a limited amount of training data is available, especially for under-resourced languages tasks. The lack of appropriately sized corpora is a hindrance to applying data-driven and transfer learning-based approaches with strong instability cases. In this paper, we establish a state-of-the-art of the efforts dedicated to the usability of Transformer-based models and propose to evaluate these improvements on the question-answering performances of French language which have few resources. We address the instability relating to data scarcity by investigating various training strategies with data augmentation, hyperparameters optimization and cross-lingual transfer. We also introduce a new compact model for French FrALBERT which proves to be competitive in low-resource settings. △ Less

Submitted 19 July, 2022; originally announced July 2022.

Comments: French compact model paper: FrALBERT, Accepted to RANLP 2021

MSC Class: 68T50 ACM Class: I.2.7

arXiv:2206.03314 [pdf, other]

Integrating Random Effects in Deep Neural Networks

Authors: Giora Simchoni, Saharon Rosset

Abstract: Modern approaches to supervised learning like deep neural networks (DNNs) typically implicitly assume that observed responses are statistically independent. In contrast, correlated data are prevalent in real-life large-scale applications, with typical sources of correlation including spatial, temporal and clustering structures. These correlations are either ignored by DNNs, or ad-hoc solutions are… ▽ More Modern approaches to supervised learning like deep neural networks (DNNs) typically implicitly assume that observed responses are statistically independent. In contrast, correlated data are prevalent in real-life large-scale applications, with typical sources of correlation including spatial, temporal and clustering structures. These correlations are either ignored by DNNs, or ad-hoc solutions are developed for specific use cases. We propose to use the mixed models framework to handle correlated data in DNNs. By treating the effects underlying the correlation structure as random effects, mixed models are able to avoid overfitted parameter estimates and ultimately yield better predictive performance. The key to combining mixed models and DNNs is using the Gaussian negative log-likelihood (NLL) as a natural loss function that is minimized with DNN machinery including stochastic gradient descent (SGD). Since NLL does not decompose like standard DNN loss functions, the use of SGD with NLL presents some theoretical and implementation challenges, which we address. Our approach which we call LMMNN is demonstrated to improve performance over natural competitors in various correlation scenarios on diverse simulated and real datasets. Our focus is on a regression setting and tabular datasets, but we also show some results for classification. Our code is available at https://github.com/gsimchoni/lmmnn. △ Less

Submitted 27 January, 2023; v1 submitted 7 June, 2022; originally announced June 2022.

Comments: 53 pages, 9 figures

arXiv:2109.06483 [pdf, other]

Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation

Authors: Juan M. Coria, Hervé Bredin, Sahar Ghannay, Sophie Rosset

Abstract: We propose to address online speaker diarization as a combination of incremental clustering and local diarization applied to a rolling buffer updated every 500ms. Every single step of the proposed pipeline is designed to take full advantage of the strong ability of a recently proposed end-to-end overlap-aware segmentation to detect and separate overlap** speakers. In particular, we propose a mod… ▽ More We propose to address online speaker diarization as a combination of incremental clustering and local diarization applied to a rolling buffer updated every 500ms. Every single step of the proposed pipeline is designed to take full advantage of the strong ability of a recently proposed end-to-end overlap-aware segmentation to detect and separate overlap** speakers. In particular, we propose a modified version of the statistics pooling layer (initially introduced in the x-vector architecture) to give less weight to frames where the segmentation model predicts simultaneous speakers. Furthermore, we derive cannot-link constraints from the initial segmentation step to prevent two local speakers from being wrongfully merged during the incremental clustering step. Finally, we show how the latency of the proposed approach can be adjusted between 500ms and 5s to match the requirements of a particular use case, and we provide a systematic analysis of the influence of latency on the overall performance (on AMI, DIHARD and VoxConverse). △ Less

Submitted 14 September, 2021; originally announced September 2021.

Comments: To appear in ASRU 2021. Code available at https://github.com/juanmc2005/StreamingSpeakerDiarization/

arXiv:2102.13589 [pdf, other]

Evaluate On-the-job Learning Dialogue Systems and a Case Study for Natural Language Understanding

Authors: Mathilde Veron, Sophie Rosset, Olivier Galibert, Guillaume Bernard

Abstract: On-the-job learning consists in continuously learning while being used in production, in an open environment, meaning that the system has to deal on its own with situations and elements never seen before. The kind of systems that seem to be especially adapted to on-the-job learning are dialogue systems, since they can take advantage of their interactions with users to collect feedback to adapt and… ▽ More On-the-job learning consists in continuously learning while being used in production, in an open environment, meaning that the system has to deal on its own with situations and elements never seen before. The kind of systems that seem to be especially adapted to on-the-job learning are dialogue systems, since they can take advantage of their interactions with users to collect feedback to adapt and improve their components over time. Some dialogue systems performing on-the-job learning have been built and evaluated but no general methodology has yet been defined. Thus in this paper, we propose a first general methodology for evaluating on-the-job learning dialogue systems. We also describe a task-oriented dialogue system which improves on-the-job its natural language component through its user interactions. We finally evaluate our system with the described methodology. △ Less

Submitted 26 February, 2021; originally announced February 2021.

Comments: Accepted to NeurIPS 2020 Human in the Loop Dialogue Systems Workshop

arXiv:2009.00606 [pdf, other]

doi 10.1214/22-EJS1985

Semi-Supervised Empirical Risk Minimization: Using unlabeled data to improve prediction

Authors: Oren Yuval, Saharon Rosset

Abstract: We present a general methodology for using unlabeled data to design semi supervised learning (SSL) variants of the Empirical Risk Minimization (ERM) learning process. Focusing on generalized linear regression, we analyze of the effectiveness of our SSL approach in improving prediction performance. The key ideas are carefully considering the null model as a competitor, and utilizing the unlabeled d… ▽ More We present a general methodology for using unlabeled data to design semi supervised learning (SSL) variants of the Empirical Risk Minimization (ERM) learning process. Focusing on generalized linear regression, we analyze of the effectiveness of our SSL approach in improving prediction performance. The key ideas are carefully considering the null model as a competitor, and utilizing the unlabeled data to determine signal-noise combinations where SSL outperforms both supervised learning and the null model. We then use SSL in an adaptive manner based on estimation of the signal and noise. In the special case of linear regression with Gaussian covariates, we prove that the non-adaptive SSL version is in fact not capable of improving on both the supervised estimator and the null model simultaneously, beyond a negligible O(1/n) term. On the other hand, the adaptive model presented in this work, can achieve a substantial improvement over both competitors simultaneously, under a variety of settings. This is shown empirically through extensive simulations, and extended to other scenarios, such as non-Gaussian covariates, misspecified linear regression, or generalized linear regression with non-linear link functions. △ Less

Submitted 5 February, 2022; v1 submitted 1 September, 2020; originally announced September 2020.

Comments: 39 pages, 4 figures

arXiv:2008.13173 [pdf, other]

LIMSI_UPV at SemEval-2020 Task 9: Recurrent Convolutional Neural Network for Code-mixed Sentiment Analysis

Authors: Somnath Banerjee, Sahar Ghannay, Sophie Rosset, Anne Vilnat, Paolo Rosso

Abstract: This paper describes the participation of LIMSI UPV team in SemEval-2020 Task 9: Sentiment Analysis for Code-Mixed Social Media Text. The proposed approach competed in SentiMix Hindi-English subtask, that addresses the problem of predicting the sentiment of a given Hindi-English code-mixed tweet. We propose Recurrent Convolutional Neural Network that combines both the recurrent neural network and… ▽ More This paper describes the participation of LIMSI UPV team in SemEval-2020 Task 9: Sentiment Analysis for Code-Mixed Social Media Text. The proposed approach competed in SentiMix Hindi-English subtask, that addresses the problem of predicting the sentiment of a given Hindi-English code-mixed tweet. We propose Recurrent Convolutional Neural Network that combines both the recurrent neural network and the convolutional network to better capture the semantics of the text, for code-mixed sentiment analysis. The proposed system obtained 0.69 (best run) in terms of F1 score on the given test data and achieved the 9th place (Codalab username: somban) in the SentiMix Hindi-English subtask. △ Less

Submitted 30 August, 2020; originally announced August 2020.

Comments: To be published in the Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval-2020), Barcelona, Spain, Sep. Association for Computational Linguistics

arXiv:2003.14021 [pdf, ps, other]

A Comparison of Metric Learning Loss Functions for End-To-End Speaker Verification

Authors: Juan M. Coria, Hervé Bredin, Sahar Ghannay, Sophie Rosset

Abstract: Despite the growing popularity of metric learning approaches, very little work has attempted to perform a fair comparison of these techniques for speaker verification. We try to fill this gap and compare several metric learning loss functions in a systematic manner on the VoxCeleb dataset. The first family of loss functions is derived from the cross entropy loss (usually used for supervised classi… ▽ More Despite the growing popularity of metric learning approaches, very little work has attempted to perform a fair comparison of these techniques for speaker verification. We try to fill this gap and compare several metric learning loss functions in a systematic manner on the VoxCeleb dataset. The first family of loss functions is derived from the cross entropy loss (usually used for supervised classification) and includes the congenerous cosine loss, the additive angular margin loss, and the center loss. The second family of loss functions focuses on the similarity between training samples and includes the contrastive loss and the triplet loss. We show that the additive angular margin loss function outperforms all other loss functions in the study, while learning more robust representations. Based on a combination of SincNet trainable features and the x-vector architecture, the network used in this paper brings us a step closer to a really-end-to-end speaker verification system, when combined with the additive angular margin loss, while still being competitive with the x-vector baseline. In the spirit of reproducible research, we also release open source Python code for reproducing our results, and share pretrained PyTorch models on torch.hub that can be used either directly or after fine-tuning. △ Less

Submitted 31 March, 2020; originally announced March 2020.

arXiv:1905.13354 [pdf, other]

DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation

Authors: Rachel Bawden, Sophie Rosset, Thomas Lavergne, Eric Bilinski

Abstract: We present a new English-French test set for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue. The test set contains 144 spontaneous dialogues (5,700+ sentences) between native English and French speakers, mediated by one of two neural MT systems in a range of role-play settings. The dialogues are accompanied by fine-grained sentence-level judgments of MT quality… ▽ More We present a new English-French test set for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue. The test set contains 144 spontaneous dialogues (5,700+ sentences) between native English and French speakers, mediated by one of two neural MT systems in a range of role-play settings. The dialogues are accompanied by fine-grained sentence-level judgments of MT quality, produced by the dialogue participants themselves, as well as by manually normalised versions and reference translations produced a posteriori. The motivation for the corpus is two-fold: to provide (i) a unique resource for evaluating MT models, and (ii) a corpus for the analysis of MT-mediated communication. We provide a preliminary analysis of the corpus to confirm that the participants' judgments reveal perceptible differences in MT quality between the two MT systems used. △ Less

Submitted 30 May, 2019; originally announced May 2019.

arXiv:1905.04071 [pdf, other]

doi 10.1007/s10462-020-09866-x

Survey on Evaluation Methods for Dialogue Systems

Authors: Jan Deriu, Alvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, Mark Cieliebak

Abstract: In this paper we survey the methods and concepts developed for the evaluation of dialogue systems. Evaluation is a crucial part during the development process. Often, dialogue systems are evaluated by means of human evaluations and questionnaires. However, this tends to be very cost and time intensive. Thus, much work has been put into finding methods, which allow to reduce the involvement of huma… ▽ More In this paper we survey the methods and concepts developed for the evaluation of dialogue systems. Evaluation is a crucial part during the development process. Often, dialogue systems are evaluated by means of human evaluations and questionnaires. However, this tends to be very cost and time intensive. Thus, much work has been put into finding methods, which allow to reduce the involvement of human labour. In this survey, we present the main concepts and methods. For this, we differentiate between the various classes of dialogue systems (task-oriented dialogue systems, conversational dialogue systems, and question-answering dialogue systems). We cover each class by introducing the main technologies developed for the dialogue systems and then by presenting the evaluation methods regarding this class. △ Less

Submitted 26 June, 2020; v1 submitted 10 May, 2019; originally announced May 2019.

Journal ref: Artificial Intelligence Review, June 2020

arXiv:1903.08560 [pdf, other]

Surprises in High-Dimensional Ridgeless Least Squares Interpolation

Authors: Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J. Tibshirani

Abstract: Interpolators -- estimators that achieve zero training error -- have attracted growing attention in machine learning, mainly because state-of-the art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$ norm ("ridgeless") interpolation in high-dimensional least squares regression. We consider two different models for the feature distribution: a linear model, w… ▽ More Interpolators -- estimators that achieve zero training error -- have attracted growing attention in machine learning, mainly because state-of-the art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$ norm ("ridgeless") interpolation in high-dimensional least squares regression. We consider two different models for the feature distribution: a linear model, where the feature vectors $x_i \in {\mathbb R}^p$ are obtained by applying a linear transform to a vector of i.i.d. entries, $x_i = Σ^{1/2} z_i$ (with $z_i \in {\mathbb R}^p$); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, $x_i = \varphi(W z_i)$ (with $z_i \in {\mathbb R}^d$, $W \in {\mathbb R}^{p \times d}$ a matrix of i.i.d. entries, and $\varphi$ an activation function acting componentwise on $W z_i$). We recover -- in a precise quantitative way -- several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization. △ Less

Submitted 7 December, 2020; v1 submitted 19 March, 2019; originally announced March 2019.

Comments: 68 pages; 16 figures. This revision contains non-asymptotic version of earlier results, and results for general coefficients

arXiv:1901.08974 [pdf, other]

doi 10.1111/rssb.12537

On the cross-validation bias due to unsupervised pre-processing

Authors: Amit Moscovich, Saharon Rosset

Abstract: Cross-validation is the de facto standard for predictive model evaluation and selection. In proper use, it provides an unbiased estimate of a model's predictive performance. However, data sets often undergo various forms of data-dependent preprocessing, such as mean-centering, rescaling, dimensionality reduction, and outlier removal. It is often believed that such preprocessing stages, if done in… ▽ More Cross-validation is the de facto standard for predictive model evaluation and selection. In proper use, it provides an unbiased estimate of a model's predictive performance. However, data sets often undergo various forms of data-dependent preprocessing, such as mean-centering, rescaling, dimensionality reduction, and outlier removal. It is often believed that such preprocessing stages, if done in an unsupervised manner (that does not incorporate the class labels or response values) are generally safe to do prior to cross-validation. In this paper, we study three commonly-practiced preprocessing procedures prior to a regression analysis: (i) variance-based feature selection; (ii) grou** of rare categorical features; and (iii) feature rescaling. We demonstrate that unsupervised preprocessing can, in fact, introduce a substantial bias into cross-validation estimates and potentially hurt model selection. This bias may be either positive or negative and its exact magnitude depends on all the parameters of the problem in an intricate manner. Further research is needed to understand the real-world impact of this bias across different application domains, particularly when dealing with small sample sizes and high-dimensional data. △ Less

Submitted 27 May, 2021; v1 submitted 25 January, 2019; originally announced January 2019.

Comments: 31 pages, 6 figures, 1 table. New sections: (4.2.) Experiments on a real dataset; (6.) Potential impact on model selection; (7.1.) Upper bounds based on stability arguments. Updated Fig. 1. with larger sample sizes

MSC Class: 62-07 ACM Class: G.3

Journal ref: J. R. Stat. Soc. B (2022), 84(4), 1474-1502

arXiv:1811.10071 [pdf, other]

Innovation Representation of Stochastic Processes with Application to Causal Inference

Authors: Amichai Painsky, Saharon Rosset, Meir Feder

Abstract: Typically, real-world stochastic processes are not easy to analyze. In this work we study the representation of any stochastic process as a memoryless innovation process triggering a dynamic system. We show that such a representation is always feasible for innovation processes taking values over a continuous set. However, the problem becomes more challenging when the alphabet size of the innovatio… ▽ More Typically, real-world stochastic processes are not easy to analyze. In this work we study the representation of any stochastic process as a memoryless innovation process triggering a dynamic system. We show that such a representation is always feasible for innovation processes taking values over a continuous set. However, the problem becomes more challenging when the alphabet size of the innovation is finite. In this case, we introduce both lossless and lossy frameworks, and provide closed-form solutions and practical algorithmic methods. In addition, we discuss the properties and uniqueness of our suggested approach. Finally, we show that the innovation representation problem has many applications. We focus our attention to Entropic Causal Inference, which has recently demonstrated promising performance, compared to alternative methods. △ Less

Submitted 25 November, 2018; originally announced November 2018.

Comments: arXiv admin note: text overlap with arXiv:1611.04035 by other authors

arXiv:1811.09417 [pdf, other]

Natural language understanding for task oriented dialog in the biomedical domain in a low resources context

Authors: Antoine Neuraz, Leonardo Campillos Llanos, Anita Burgun, Sophie Rosset

Abstract: In the biomedical domain, the lack of sharable datasets often limit the possibility of develo** natural language processing systems, especially dialogue applications and natural language understanding models. To overcome this issue, we explore data generation using templates and terminologies and data augmentation approaches. Namely, we report our experiments using paraphrasing and word represen… ▽ More In the biomedical domain, the lack of sharable datasets often limit the possibility of develo** natural language processing systems, especially dialogue applications and natural language understanding models. To overcome this issue, we explore data generation using templates and terminologies and data augmentation approaches. Namely, we report our experiments using paraphrasing and word representations learned on a large EHR corpus with Fasttext and ELMo, to learn a NLU model without any available dataset. We evaluate on a NLU task of natural language queries in EHRs divided in slot-filling and intent classification sub-tasks. On the slot-filling task, we obtain a F-score of 0.76 with the ELMo representation; and on the classification task, a mean F-score of 0.71. Our results show that this method could be used to develop a baseline system. △ Less

Submitted 29 November, 2018; v1 submitted 23 November, 2018; originally announced November 2018.

Comments: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 arXiv:1811.07216

arXiv:1810.11197 [pdf, other]

Lossless (and Lossy) Compression of Random Forests

Authors: Amichai Painsky, Saharon Rosset

Abstract: Ensemble methods are among the state-of-the-art predictive modeling approaches. Applied to modern big data, these methods often require a large number of sub-learners, where the complexity of each learner typically grows with the size of the dataset. This phenomenon results in an increasing demand for storage space, which may be very costly. This problem mostly manifests in a subscriber based envi… ▽ More Ensemble methods are among the state-of-the-art predictive modeling approaches. Applied to modern big data, these methods often require a large number of sub-learners, where the complexity of each learner typically grows with the size of the dataset. This phenomenon results in an increasing demand for storage space, which may be very costly. This problem mostly manifests in a subscriber based environment, where a user-specific ensemble needs to be stored on a personal device with strict storage limitations (such as a cellular device). In this work we introduce a novel method for lossless compression of tree-based ensemble methods, focusing on random forests. Our suggested method is based on probabilistic modeling of the ensemble's trees, followed by model clustering via Bregman divergence. This allows us to find a minimal set of models that provides an accurate description of the trees, and at the same time is small enough to store and maintain. Our compression scheme demonstrates high compression rates on a variety of modern datasets. Importantly, our scheme enables predictions from the compressed format and a perfect reconstruction of the original ensemble. In addition, we introduce a theoretically sound lossy compression scheme, which allows us to control the trade-off between the distortion and the coding rate. △ Less

Submitted 26 October, 2018; originally announced October 2018.

arXiv:1809.05815 [pdf, other]

doi 10.1109/TSP.2018.2872006

Linear Independent Component Analysis over Finite Fields: Algorithms and Bounds

Authors: Amichai Painsky, Saharon Rosset, Meir Feder

Abstract: Independent Component Analysis (ICA) is a statistical tool that decomposes an observed random vector into components that are as statistically independent as possible. ICA over finite fields is a special case of ICA, in which both the observations and the decomposed components take values over a finite alphabet. This problem is also known as minimal redundancy representation or factorial coding. I… ▽ More Independent Component Analysis (ICA) is a statistical tool that decomposes an observed random vector into components that are as statistically independent as possible. ICA over finite fields is a special case of ICA, in which both the observations and the decomposed components take values over a finite alphabet. This problem is also known as minimal redundancy representation or factorial coding. In this work we focus on linear methods for ICA over finite fields. We introduce a basic lower bound which provides a fundamental limit to the ability of any linear solution to solve this problem. Based on this bound, we present a greedy algorithm that outperforms all currently known methods. Importantly, we show that the overhead of our suggested algorithm (compared with the lower bound) typically decreases, as the scale of the problem grows. In addition, we provide a sub-optimal variant of our suggested method that significantly reduces the computational complexity at a relatively small cost in performance. Finally, we discuss the universal abilities of linear transformations in decomposing random vectors, compared with existing non-linear solutions. △ Less

Submitted 16 September, 2018; originally announced September 2018.

arXiv:1803.04307 [pdf, ps, other]

The Everlasting Database: Statistical Validity at a Fair Price

Authors: Blake Woodworth, Vitaly Feldman, Saharon Rosset, Nathan Srebro

Abstract: The problem of handling adaptivity in data analysis, intentional or not, permeates a variety of fields, including test-set overfitting in ML challenges and the accumulation of invalid scientific discoveries. We propose a mechanism for answering an arbitrarily long sequence of potentially adaptive statistical queries, by charging a price for each query and using the proceeds to collect additional s… ▽ More The problem of handling adaptivity in data analysis, intentional or not, permeates a variety of fields, including test-set overfitting in ML challenges and the accumulation of invalid scientific discoveries. We propose a mechanism for answering an arbitrarily long sequence of potentially adaptive statistical queries, by charging a price for each query and using the proceeds to collect additional samples. Crucially, we guarantee statistical validity without any assumptions on how the queries are generated. We also ensure with high probability that the cost for $M$ non-adaptive queries is $O(\log M)$, while the cost to a potentially adaptive user who makes $M$ queries that do not depend on any others is $O(\sqrt{M})$. △ Less

Submitted 2 April, 2019; v1 submitted 12 March, 2018; originally announced March 2018.

Comments: 22 pages, accepted to NeurIPS 2018

arXiv:1607.07003 [pdf, other]

Large Alphabet Source Coding using Independent Component Analysis

Authors: Amichai Painsky, Saharon Rosset, Meir Feder

Abstract: Large alphabet source coding is a basic and well-studied problem in data compression. It has many applications such as compression of natural language text, speech and images. The classic perception of most commonly used methods is that a source is best described over an alphabet which is at least as large as the observed alphabet. In this work we challenge this approach and introduce a conceptual… ▽ More Large alphabet source coding is a basic and well-studied problem in data compression. It has many applications such as compression of natural language text, speech and images. The classic perception of most commonly used methods is that a source is best described over an alphabet which is at least as large as the observed alphabet. In this work we challenge this approach and introduce a conceptual framework in which a large alphabet source is decomposed into "as statistically independent as possible" components. This decomposition allows us to apply entropy encoding to each component separately, while benefiting from their reduced alphabet size. We show that in many cases, such decomposition results in a sum of marginal entropies which is only slightly greater than the entropy of the source. Our suggested algorithm, based on a generalization of the Binary Independent Component Analysis, is applicable for a variety of large alphabet source coding setups. This includes the classical lossless compression, universal compression and high-dimensional vector quantization. In each of these setups, our suggested approach outperforms most commonly used methods. Moreover, our proposed framework is significantly easier to implement in most of these cases. △ Less

Submitted 24 July, 2016; originally announced July 2016.

arXiv:1508.04934 [pdf, other]

Generalized Independent Component Analysis Over Finite Alphabets

Authors: Amichai Painsky, Saharon Rosset, Meir Feder

Abstract: Independent component analysis (ICA) is a statistical method for transforming an observable multidimensional random vector into components that are as statistically independent as possible from each other.Usually the ICA framework assumes a model according to which the observations are generated (such as a linear transformation with additive noise). ICA over finite fields is a special case of ICA… ▽ More Independent component analysis (ICA) is a statistical method for transforming an observable multidimensional random vector into components that are as statistically independent as possible from each other.Usually the ICA framework assumes a model according to which the observations are generated (such as a linear transformation with additive noise). ICA over finite fields is a special case of ICA in which both the observations and the independent components are over a finite alphabet. In this work we consider a generalization of this framework in which an observation vector is decomposed to its independent components (as much as possible) with no prior assumption on the way it was generated. This generalization is also known as Barlow's minimal redundancy representation problem and is considered an open problem. We propose several theorems and show that this NP hard problem can be accurately solved with a branch and bound search tree algorithm, or tightly approximated with a series of linear problems. Our contribution provides the first efficient and constructive set of solutions to Barlow's problem.The minimal redundancy representation (also known as factorial code) has many applications, mainly in the fields of Neural Networks and Deep Learning. The Binary ICA (BICA) is also shown to have applications in several domains including medical diagnosis, multi-cluster assignment, network tomography and internet resource management. In this work we show this formulation further applies to multiple disciplines in source coding such as predictive coding, distributed source coding and coding of large alphabet sources. △ Less

Submitted 20 August, 2015; originally announced August 2015.

Comments: arXiv admin note: text overlap with arXiv:1007.0528 by other authors

Showing 1–24 of 24 results for author: Rosset, S