Showing 1–24 of 24 results for author: Rosset, S

Searching in archive cs.
  1. arXiv:2404.11122  [pdf, other]

    cs.AI

    Small Language Models are Good Too: An Empirical Study of Zero-Shot Classification

    Authors: Pierre Lepagnol, Thomas Gerald, Sahar Ghannay, Christophe Servan, Sophie Rosset

    Abstract: This study is part of the debate on the efficiency of large versus small language models for text classification by prompting. We assess the performance of small language models in zero-shot text classification, challenging the prevailing dominance of large models. Across 15 datasets, our investigation benchmarks language models from 77M to 40B parameters using different architectures and scoring fu…

    Submitted 17 April, 2024; originally announced April 2024.

    Journal ref: LREC-COLING 2024, May 2024, TURIN, Italy

  2. arXiv:2403.19727  [pdf, ps, other]

    cs.CL cs.AI

    New Semantic Task for the French Spoken Language Understanding MEDIA Benchmark

    Authors: Nadège Alavoine, Gaëlle Laperriere, Christophe Servan, Sahar Ghannay, Sophie Rosset

    Abstract: Intent classification and slot-filling are essential tasks of Spoken Language Understanding (SLU). In most SLU systems, those tasks are realized by independent modules. For about fifteen years, models achieving both of them jointly and exploiting their mutual enhancement have been proposed. A multilingual module using a joint model was envisioned to create a touristic dialogue system for a European p…

    Submitted 28 March, 2024; originally announced March 2024.

    Journal ref: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024, Torino, Italy

  3. arXiv:2403.18338  [pdf, other]

    cs.AI

    mALBERT: Is a Compact Multilingual BERT Model Still Worth It?

    Authors: Christophe Servan, Sahar Ghannay, Sophie Rosset

    Abstract: Within the current trend of Pretrained Language Models (PLM), more and more criticisms emerge about the ethical and ecological impact of such models. In this article, considering these critical remarks, we propose to focus on smaller models, such as compact models like ALBERT, which are more ecologically virtuous than these PLM. However, PLMs enable huge breakthroughs in Natural Language Processing ta…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, May 2024, Torino, Italy

  4. arXiv:2302.09526  [pdf, other]

    stat.ME cs.LG stat.ML

    Mixed Semi-Supervised Generalized-Linear-Regression with applications to Deep-Learning and Interpolators

    Authors: Oren Yuval, Saharon Rosset

    Abstract: We present a methodology for using unlabeled data to design semi-supervised learning (SSL) methods that improve the prediction performance of supervised learning for regression tasks. The main idea is to design different mechanisms for integrating the unlabeled data, and include in each of them a mixing parameter $\alpha$, controlling the weight given to the unlabeled data. Focusing on Generalized Line…

    Submitted 28 May, 2024; v1 submitted 19 February, 2023; originally announced February 2023.

    Comments: 63 pages, 10 figures

    MSC Class: 62F10; 62J12; 68T07

  5. arXiv:2207.09157  [pdf, ps, other]

    cs.CL

    On the cross-lingual transferability of multilingual prototypical models across NLU tasks

    Authors: Oralie Cattan, Christophe Servan, Sophie Rosset

    Abstract: Supervised deep learning-based approaches have been applied to task-oriented dialog and have proven to be effective for limited domain and language applications when a sufficient number of training examples are available. In practice, these approaches suffer from the drawbacks of domain-driven design and under-resourced languages. Domain and language models are supposed to grow and change as the p…

    Submitted 19 July, 2022; originally announced July 2022.

    Comments: Accepted to the ACL workshop METANLP 2021

    MSC Class: 68T50 ACM Class: I.2.7

  6. arXiv:2207.09152  [pdf, ps, other]

    cs.CL cs.AI

    Benchmarking Transformers-based models on French Spoken Language Understanding tasks

    Authors: Oralie Cattan, Sahar Ghannay, Christophe Servan, Sophie Rosset

    Abstract: In the last five years, the rise of the self-attentional Transformer-based architectures led to state-of-the-art performances over many natural language tasks. Although these approaches are increasingly popular, they require large amounts of data and computational resources. There is still a substantial need for benchmarking methodologies ever upwards on under-resourced languages in data-scarce ap…

    Submitted 19 July, 2022; originally announced July 2022.

    Comments: Accepted paper at INTERSPEECH 2022

    MSC Class: 68T50 ACM Class: I.2.7

  7. arXiv:2207.09150  [pdf, ps, other]

    cs.CL cs.AI

    On the Usability of Transformers-based models for a French Question-Answering task

    Authors: Oralie Cattan, Christophe Servan, Sophie Rosset

    Abstract: For many tasks, state-of-the-art results have been achieved with Transformer-based architectures, resulting in a paradigmatic shift in practices from the use of task-specific architectures to the fine-tuning of pre-trained language models. The ongoing trend consists in training models with an ever-increasing amount of data and parameters, which requires considerable resources. It leads to a strong…

    Submitted 19 July, 2022; originally announced July 2022.

    Comments: French compact model paper: FrALBERT, Accepted to RANLP 2021

    MSC Class: 68T50 ACM Class: I.2.7

  8. arXiv:2206.03314  [pdf, other]

    stat.ML cs.LG

    Integrating Random Effects in Deep Neural Networks

    Authors: Giora Simchoni, Saharon Rosset

    Abstract: Modern approaches to supervised learning like deep neural networks (DNNs) typically implicitly assume that observed responses are statistically independent. In contrast, correlated data are prevalent in real-life large-scale applications, with typical sources of correlation including spatial, temporal and clustering structures. These correlations are either ignored by DNNs, or ad-hoc solutions are…

    Submitted 27 January, 2023; v1 submitted 7 June, 2022; originally announced June 2022.

    Comments: 53 pages, 9 figures

  9. arXiv:2109.06483  [pdf, other]

    eess.AS cs.SD

    Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation

    Authors: Juan M. Coria, Hervé Bredin, Sahar Ghannay, Sophie Rosset

    Abstract: We propose to address online speaker diarization as a combination of incremental clustering and local diarization applied to a rolling buffer updated every 500 ms. Every single step of the proposed pipeline is designed to take full advantage of the strong ability of a recently proposed end-to-end overlap-aware segmentation to detect and separate overlapping speakers. In particular, we propose a mod…

    Submitted 14 September, 2021; originally announced September 2021.

    Comments: To appear in ASRU 2021. Code available at https://github.com/juanmc2005/StreamingSpeakerDiarization/

  10. arXiv:2102.13589  [pdf, other]

    cs.CL

    Evaluate On-the-job Learning Dialogue Systems and a Case Study for Natural Language Understanding

    Authors: Mathilde Veron, Sophie Rosset, Olivier Galibert, Guillaume Bernard

    Abstract: On-the-job learning consists in continuously learning while being used in production, in an open environment, meaning that the system has to deal on its own with situations and elements never seen before. The kind of systems that seem to be especially adapted to on-the-job learning are dialogue systems, since they can take advantage of their interactions with users to collect feedback to adapt and…

    Submitted 26 February, 2021; originally announced February 2021.

    Comments: Accepted to NeurIPS 2020 Human in the Loop Dialogue Systems Workshop

  11. arXiv:2009.00606  [pdf, other]

    stat.ML cs.LG stat.ME

    Semi-Supervised Empirical Risk Minimization: Using unlabeled data to improve prediction

    Authors: Oren Yuval, Saharon Rosset

    Abstract: We present a general methodology for using unlabeled data to design semi-supervised learning (SSL) variants of the Empirical Risk Minimization (ERM) learning process. Focusing on generalized linear regression, we analyze the effectiveness of our SSL approach in improving prediction performance. The key ideas are carefully considering the null model as a competitor, and utilizing the unlabeled d…

    Submitted 5 February, 2022; v1 submitted 1 September, 2020; originally announced September 2020.

    Comments: 39 pages, 4 figures

  12. arXiv:2008.13173  [pdf, other]

    cs.CL cs.AI

    LIMSI_UPV at SemEval-2020 Task 9: Recurrent Convolutional Neural Network for Code-mixed Sentiment Analysis

    Authors: Somnath Banerjee, Sahar Ghannay, Sophie Rosset, Anne Vilnat, Paolo Rosso

    Abstract: This paper describes the participation of the LIMSI_UPV team in SemEval-2020 Task 9: Sentiment Analysis for Code-Mixed Social Media Text. The proposed approach competed in the SentiMix Hindi-English subtask, which addresses the problem of predicting the sentiment of a given Hindi-English code-mixed tweet. We propose a Recurrent Convolutional Neural Network that combines both the recurrent neural network and…

    Submitted 30 August, 2020; originally announced August 2020.

    Comments: To be published in the Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval-2020), Barcelona, Spain, Sep. Association for Computational Linguistics

  13. arXiv:2003.14021  [pdf, ps, other]

    cs.LG cs.SD eess.AS stat.ML

    A Comparison of Metric Learning Loss Functions for End-To-End Speaker Verification

    Authors: Juan M. Coria, Hervé Bredin, Sahar Ghannay, Sophie Rosset

    Abstract: Despite the growing popularity of metric learning approaches, very little work has attempted to perform a fair comparison of these techniques for speaker verification. We try to fill this gap and compare several metric learning loss functions in a systematic manner on the VoxCeleb dataset. The first family of loss functions is derived from the cross entropy loss (usually used for supervised classi…

    Submitted 31 March, 2020; originally announced March 2020.

  14. arXiv:1905.13354  [pdf, other]

    cs.CL

    DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation

    Authors: Rachel Bawden, Sophie Rosset, Thomas Lavergne, Eric Bilinski

    Abstract: We present a new English-French test set for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue. The test set contains 144 spontaneous dialogues (5,700+ sentences) between native English and French speakers, mediated by one of two neural MT systems in a range of role-play settings. The dialogues are accompanied by fine-grained sentence-level judgments of MT quality…

    Submitted 30 May, 2019; originally announced May 2019.

  15. arXiv:1905.04071  [pdf, other]

    cs.CL cs.AI cs.HC cs.LG

    Survey on Evaluation Methods for Dialogue Systems

    Authors: Jan Deriu, Alvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, Mark Cieliebak

    Abstract: In this paper we survey the methods and concepts developed for the evaluation of dialogue systems. Evaluation is a crucial part of the development process. Often, dialogue systems are evaluated by means of human evaluations and questionnaires. However, this tends to be very cost- and time-intensive. Thus, much work has been put into finding methods that make it possible to reduce the involvement of huma…

    Submitted 26 June, 2020; v1 submitted 10 May, 2019; originally announced May 2019.

    Journal ref: Artificial Intelligence Review, June 2020

  16. arXiv:1903.08560  [pdf, other]

    math.ST cs.LG stat.ML

    Surprises in High-Dimensional Ridgeless Least Squares Interpolation

    Authors: Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J. Tibshirani

    Abstract: Interpolators -- estimators that achieve zero training error -- have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$ norm ("ridgeless") interpolation in high-dimensional least squares regression. We consider two different models for the feature distribution: a linear model, w…

    Submitted 7 December, 2020; v1 submitted 19 March, 2019; originally announced March 2019.

    Comments: 68 pages; 16 figures. This revision contains non-asymptotic version of earlier results, and results for general coefficients

  17. arXiv:1901.08974  [pdf, other]

    stat.ME cs.LG stat.ML

    On the cross-validation bias due to unsupervised pre-processing

    Authors: Amit Moscovich, Saharon Rosset

    Abstract: Cross-validation is the de facto standard for predictive model evaluation and selection. In proper use, it provides an unbiased estimate of a model's predictive performance. However, data sets often undergo various forms of data-dependent preprocessing, such as mean-centering, rescaling, dimensionality reduction, and outlier removal. It is often believed that such preprocessing stages, if done in…

    Submitted 27 May, 2021; v1 submitted 25 January, 2019; originally announced January 2019.

    Comments: 31 pages, 6 figures, 1 table. New sections: (4.2.) Experiments on a real dataset; (6.) Potential impact on model selection; (7.1.) Upper bounds based on stability arguments. Updated Fig. 1. with larger sample sizes

    MSC Class: 62-07 ACM Class: G.3

    Journal ref: J. R. Stat. Soc. B (2022), 84(4), 1474-1502

  18. arXiv:1811.10071  [pdf, other]

    cs.IT

    Innovation Representation of Stochastic Processes with Application to Causal Inference

    Authors: Amichai Painsky, Saharon Rosset, Meir Feder

    Abstract: Typically, real-world stochastic processes are not easy to analyze. In this work we study the representation of any stochastic process as a memoryless innovation process triggering a dynamic system. We show that such a representation is always feasible for innovation processes taking values over a continuous set. However, the problem becomes more challenging when the alphabet size of the innovatio…

    Submitted 25 November, 2018; originally announced November 2018.

    Comments: arXiv admin note: text overlap with arXiv:1611.04035 by other authors

  19. arXiv:1811.09417  [pdf, other]

    cs.CL

    Natural language understanding for task oriented dialog in the biomedical domain in a low resources context

    Authors: Antoine Neuraz, Leonardo Campillos Llanos, Anita Burgun, Sophie Rosset

    Abstract: In the biomedical domain, the lack of sharable datasets often limits the possibility of developing natural language processing systems, especially dialogue applications and natural language understanding models. To overcome this issue, we explore data generation using templates and terminologies and data augmentation approaches. Namely, we report our experiments using paraphrasing and word represen…

    Submitted 29 November, 2018; v1 submitted 23 November, 2018; originally announced November 2018.

    Comments: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 arXiv:1811.07216

  20. arXiv:1810.11197  [pdf, other]

    cs.LG stat.ML

    Lossless (and Lossy) Compression of Random Forests

    Authors: Amichai Painsky, Saharon Rosset

    Abstract: Ensemble methods are among the state-of-the-art predictive modeling approaches. Applied to modern big data, these methods often require a large number of sub-learners, where the complexity of each learner typically grows with the size of the dataset. This phenomenon results in an increasing demand for storage space, which may be very costly. This problem mostly manifests in a subscriber-based envi…

    Submitted 26 October, 2018; originally announced October 2018.

  21. Linear Independent Component Analysis over Finite Fields: Algorithms and Bounds

    Authors: Amichai Painsky, Saharon Rosset, Meir Feder

    Abstract: Independent Component Analysis (ICA) is a statistical tool that decomposes an observed random vector into components that are as statistically independent as possible. ICA over finite fields is a special case of ICA, in which both the observations and the decomposed components take values over a finite alphabet. This problem is also known as minimal redundancy representation or factorial coding. I…

    Submitted 16 September, 2018; originally announced September 2018.

  22. arXiv:1803.04307  [pdf, ps, other]

    cs.LG

    The Everlasting Database: Statistical Validity at a Fair Price

    Authors: Blake Woodworth, Vitaly Feldman, Saharon Rosset, Nathan Srebro

    Abstract: The problem of handling adaptivity in data analysis, intentional or not, permeates a variety of fields, including test-set overfitting in ML challenges and the accumulation of invalid scientific discoveries. We propose a mechanism for answering an arbitrarily long sequence of potentially adaptive statistical queries, by charging a price for each query and using the proceeds to collect additional s…

    Submitted 2 April, 2019; v1 submitted 12 March, 2018; originally announced March 2018.

    Comments: 22 pages, accepted to NeurIPS 2018

  23. arXiv:1607.07003  [pdf, other]

    cs.IT

    Large Alphabet Source Coding using Independent Component Analysis

    Authors: Amichai Painsky, Saharon Rosset, Meir Feder

    Abstract: Large alphabet source coding is a basic and well-studied problem in data compression. It has many applications such as compression of natural language text, speech and images. The classic perception of most commonly used methods is that a source is best described over an alphabet which is at least as large as the observed alphabet. In this work we challenge this approach and introduce a conceptual…

    Submitted 24 July, 2016; originally announced July 2016.

  24. arXiv:1508.04934  [pdf, other]

    cs.IT

    Generalized Independent Component Analysis Over Finite Alphabets

    Authors: Amichai Painsky, Saharon Rosset, Meir Feder

    Abstract: Independent component analysis (ICA) is a statistical method for transforming an observable multidimensional random vector into components that are as statistically independent as possible from each other. Usually the ICA framework assumes a model according to which the observations are generated (such as a linear transformation with additive noise). ICA over finite fields is a special case of ICA…

    Submitted 20 August, 2015; originally announced August 2015.

    Comments: arXiv admin note: text overlap with arXiv:1007.0528 by other authors