Search | arXiv e-print repository

arXiv:2403.13369 [pdf, other]

Clinical information extraction for Low-resource languages with Few-shot learning using Pre-trained language models and Prompting

Authors: Phillip Richter-Pechanski, Philipp Wiesenbach, Dominic M. Schwab, Christina Kiriakou, Nicolas Geis, Christoph Dieterich, Anette Frank

Abstract: Automatic extraction of medical information from clinical documents poses several challenges: high costs of required clinical expertise, limited interpretability of model predictions, restricted computational resources and privacy regulations. Recent advances in domain-adaptation and prompting methods showed promising results with minimal training data using lightweight masked language models, which are suited for well-established interpretability methods. We are first to present a systematic evaluation of these methods in a low-resource setting, by performing multi-class section classification on German doctor's letters. We conduct extensive class-wise evaluations supported by Shapley values, to validate the quality of our small training data set and to ensure the interpretability of model predictions. We demonstrate that a lightweight, domain-adapted pretrained model, prompted with just 20 shots, outperforms a traditional classification model by 30.5% accuracy. Our results serve as a process-oriented guideline for clinical information extraction projects working with low-resource.

Submitted 20 March, 2024; originally announced March 2024.

arXiv:2309.14047 [pdf, other]

Random-Energy Secret Sharing via Extreme Synergy

Authors: Vudtiwat Ngampruetikorn, David J. Schwab

Abstract: The random-energy model (REM), a solvable spin-glass model, has impacted an incredibly diverse set of problems, from protein folding to combinatorial optimization to many-body localization. Here, we explore a new connection to secret sharing. We formulate a secret-sharing scheme, based on the REM, and analyze its information-theoretic properties. Our analyses reveal that the correlations between subsystems of the REM are highly synergistic and form the basis for secure secret-sharing schemes. We derive the ranges of temperatures and secret lengths over which the REM satisfies the requirement of secure secret sharing. We show further that a special point in the phase diagram exists at which the REM-based scheme is optimal in its information encoding. Our analytical results for the thermodynamic limit are in good qualitative agreement with numerical simulations of finite systems, for which the strict security requirement is replaced by a tradeoff between secrecy and recoverability. Our work offers a further example of information theory as a unifying concept, connecting problems in statistical physics to those in computation.

Submitted 25 September, 2023; originally announced September 2023.

Comments: 6 pages, 5 figures

arXiv:2309.05472 [pdf, other]

LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech

Authors: Titouan Parcollet, Ha Nguyen, Solene Evain, Marcely Zanon Boito, Adrien Pupier, Salima Mdhaffar, Hang Le, Sina Alisamir, Natalia Tomashenko, Marco Dinarelli, Shucong Zhang, Alexandre Allauzen, Maximin Coavoux, Yannick Esteve, Mickael Rouvier, Jerome Goulian, Benjamin Lecouteux, Francois Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, Laurent Besacier

Abstract: Self-supervised learning (SSL) is at the origin of unprecedented improvements in many different domains including computer vision and natural language processing. Speech processing drastically benefitted from SSL as most of the current domain-related tasks are now being approached with pre-trained models. This work introduces LeBenchmark 2.0 an open-source framework for assessing and building SSL-equipped French speech technologies. It includes documented, large-scale and heterogeneous corpora with up to 14,000 hours of heterogeneous speech, ten pre-trained SSL wav2vec 2.0 models containing from 26 million to one billion learnable parameters shared with the community, and an evaluation protocol made of six downstream tasks to complement existing benchmarks. LeBenchmark 2.0 also presents unique perspectives on pre-trained SSL models for speech with the investigation of frozen versus fine-tuned downstream models, task-agnostic versus task-specific pre-trained models as well as a discussion on the carbon footprint of large-scale model training. Overall, the newly introduced models trained on 14,000 hours of French speech outperform multilingual and previous LeBenchmark SSL models across the benchmark but also required up to four times more energy for pre-training.

Submitted 18 March, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

Comments: Published in Computer Science and Language. Preprint allowed

arXiv:2307.11170 [pdf, other]

UMLS-KGI-BERT: Data-Centric Knowledge Integration in Transformers for Biomedical Entity Recognition

Authors: Aidan Mannion, Thierry Chevalier, Didier Schwab, Lorraine Geouriot

Abstract: Pre-trained transformer language models (LMs) have in recent years become the dominant paradigm in applied NLP. These models have achieved state-of-the-art performance on tasks such as information extraction, question answering, sentiment analysis, document classification and many others. In the biomedical domain, significant progress has been made in adapting this paradigm to NLP tasks that require the integration of domain-specific knowledge as well as statistical modelling of language. In particular, research in this area has focused on the question of how best to construct LMs that take into account not only the patterns of token distribution in medical text, but also the wealth of structured information contained in terminology resources such as the UMLS. This work contributes a data-centric paradigm for enriching the language representations of biomedical transformer-encoder LMs by extracting text sequences from the UMLS. This allows for graph-based learning objectives to be combined with masked-language pre-training. Preliminary results from experiments in the extension of pre-trained LMs as well as training from scratch show that this framework improves downstream performance on multiple biomedical and clinical Named Entity Recognition (NER) tasks.

Submitted 20 July, 2023; originally announced July 2023.

arXiv:2303.17762 [pdf, other]

Generalized Information Bottleneck for Gaussian Variables

Authors: Vudtiwat Ngampruetikorn, David J. Schwab

Abstract: The information bottleneck (IB) method offers an attractive framework for understanding representation learning, however its applications are often limited by its computational intractability. Analytical characterization of the IB method is not only of practical interest, but it can also lead to new insights into learning phenomena. Here we consider a generalized IB problem, in which the mutual information in the original IB method is replaced by correlation measures based on Renyi and Jeffreys divergences. We derive an exact analytical IB solution for the case of Gaussian correlated variables. Our analysis reveals a series of structural transitions, similar to those previously observed in the original IB case. We find further that although solving the original, Renyi and Jeffreys IB problems yields different representations in general, the structural transitions occur at the same critical tradeoff parameters, and the Renyi and Jeffreys IB solutions perform well under the original IB objective. Our results suggest that formulating the IB method with alternative correlation measures could offer a strategy for obtaining an approximate solution to the original IB problem.

Submitted 30 March, 2023; originally announced March 2023.

Comments: 7 pages, 3 figures

arXiv:2301.11716 [pdf, other]

Pre-training for Speech Translation: CTC Meets Optimal Transport

Authors: Phuong-Hang Le, Hongyu Gong, Changhan Wang, Juan Pino, Benjamin Lecouteux, Didier Schwab

Abstract: The gap between speech and text modalities is a major challenge in speech-to-text translation (ST). Different methods have been proposed to reduce this gap, but most of them require architectural changes in ST training. In this work, we propose to mitigate this issue at the pre-training stage, requiring no change in the ST model. First, we show that the connectionist temporal classification (CTC) loss can reduce the modality gap by design. We provide a quantitative comparison with the more common cross-entropy loss, showing that pre-training with CTC consistently achieves better final ST accuracy. Nevertheless, CTC is only a partial solution and thus, in our second contribution, we propose a novel pre-training method combining CTC and optimal transport to further reduce this gap. Our method pre-trains a Siamese-like model composed of two encoders, one for acoustic inputs and the other for textual inputs, such that they produce representations that are close to each other in the Wasserstein space. Extensive experiments on the standard CoVoST-2 and MuST-C datasets show that our pre-training method applied to the vanilla encoder-decoder Transformer achieves state-of-the-art performance under the no-external-data setting, and performs on par with recent strong multi-task learning systems trained with external data. Finally, our method can also be applied on top of these multi-task systems, leading to further improvements for these models. Code and pre-trained models are available at https://github.com/formiel/fairseq.

Submitted 5 June, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

Comments: ICML 2023 (oral presentation). This version fixed URLs, updated affiliations & acknowledgements, and improved formatting

arXiv:2208.03848 [pdf, other]

Information bottleneck theory of high-dimensional regression: relevancy, efficiency and optimality

Authors: Vudtiwat Ngampruetikorn, David J. Schwab

Abstract: Avoiding overfitting is a central challenge in machine learning, yet many large neural networks readily achieve zero training loss. This puzzling contradiction necessitates new approaches to the study of overfitting. Here we quantify overfitting via residual information, defined as the bits in fitted models that encode noise in training data. Information efficient learning algorithms minimize residual information while maximizing the relevant bits, which are predictive of the unknown generative models. We solve this optimization to obtain the information content of optimal algorithms for a linear regression problem and compare it to that of randomized ridge regression. Our results demonstrate the fundamental trade-off between residual and relevant information and characterize the relative information efficiency of randomized regression with respect to optimal algorithms. Finally, using results from random matrix theory, we reveal the information complexity of learning a linear map in high dimensions and unveil information-theoretic analogs of double and multiple descent phenomena.

Submitted 11 October, 2022; v1 submitted 7 August, 2022; originally announced August 2022.

Comments: NeurIPS 2022

ACM Class: H.1.1; I.2.6

arXiv:2109.12801 [pdf, other]

Effect Of Personalized Calibration On Gaze Estimation Using Deep-Learning

Authors: Nairit Bandyopadhyay, Sébastien Riou, Didier Schwab

Abstract: With the increase in computation power and the development of new state-of-the-art deep learning algorithms, appearance-based gaze estimation is becoming more and more popular. It is believed to work well with curated laboratory data sets, however it faces several challenges when deployed in real world scenario. One such challenge is to estimate the gaze of a person about which the Deep Learning model trained for gaze estimation has no knowledge about. To analyse the performance in such scenarios we have tried to simulate a calibration mechanism. In this work we use the MPIIGaze data set. We trained a multi modal convolutional neural network and analysed its performance with and without calibration and this evaluation provides clear insights on how calibration improved the performance of the Deep Learning model in estimating gaze in the wild.

Submitted 27 September, 2021; originally announced September 2021.

arXiv:2106.01463 [pdf, other]

Lightweight Adapter Tuning for Multilingual Speech Translation

Authors: Hang Le, Juan Pino, Changhan Wang, Jiatao Gu, Didier Schwab, Laurent Besacier

Abstract: Adapter modules were recently introduced as an efficient alternative to fine-tuning in NLP. Adapter tuning consists in freezing pretrained parameters of a model and injecting lightweight modules between layers, resulting in the addition of only a small number of task-specific trainable parameters. While adapter tuning was investigated for multilingual neural machine translation, this paper proposes a comprehensive analysis of adapters for multilingual speech translation (ST). Starting from different pre-trained models (a multilingual ST trained on parallel data or a multilingual BART (mBART) trained on non-parallel multilingual data), we show that adapters can be used to: (a) efficiently specialize ST to specific language pairs with a low extra cost in terms of parameters, and (b) transfer from an automatic speech recognition (ASR) task and an mBART pre-trained model to a multilingual ST task. Experiments show that adapter tuning offer competitive results to full fine-tuning, while being much more parameter-efficient.

Submitted 12 July, 2021; v1 submitted 2 June, 2021; originally announced June 2021.

Comments: Accepted at ACL-IJCNLP 2021

arXiv:2105.14940 [pdf, other]

Do Multilingual Neural Machine Translation Models Contain Language Pair Specific Attention Heads?

Authors: Zae Myung Kim, Laurent Besacier, Vassilina Nikoulina, Didier Schwab

Abstract: Recent studies on the analysis of the multilingual representations focus on identifying whether there is an emergence of language-independent representations, or whether a multilingual model partitions its weights among different languages. While most of such work has been conducted in a "black-box" manner, this paper aims to analyze individual components of a multilingual neural translation (NMT) model. In particular, we look at the encoder self-attention and encoder-decoder attention heads (in a many-to-one NMT model) that are more specific to the translation of a certain language pair than others by (1) employing metrics that quantify some aspects of the attention weights such as "variance" or "confidence", and (2) systematically ranking the importance of attention heads with respect to translation quality. Experimental results show that surprisingly, the set of most important attention heads are very similar across the language pairs and that it is possible to remove nearly one-third of the less important heads without hurting the translation quality greatly.

Submitted 31 May, 2021; originally announced May 2021.

Comments: 10 pages, accepted at Findings of ACL 2021 (short)

arXiv:2105.13977 [pdf, other]

Perturbation Theory for the Information Bottleneck

Authors: Vudtiwat Ngampruetikorn, David J. Schwab

Abstract: Extracting relevant information from data is crucial for all forms of learning. The information bottleneck (IB) method formalizes this, offering a mathematically precise and conceptually appealing framework for understanding learning phenomena. However the nonlinearity of the IB problem makes it computationally expensive and analytically intractable in general. Here we derive a perturbation theory for the IB method and report the first complete characterization of the learning onset, the limit of maximum relevant information per bit extracted from data. We test our results on synthetic probability distributions, finding good agreement with the exact numerical solution near the onset of learning. We explore the difference and subtleties in our derivation and previous attempts at deriving a perturbation theory for the learning onset and attribute the discrepancy to a flawed assumption. Our work also provides a fresh perspective on the intimate relationship between the IB method and the strong data processing inequality.

Submitted 25 October, 2021; v1 submitted 28 May, 2021; originally announced May 2021.

Comments: NeurIPS 2021

arXiv:2104.11462 [pdf, ps, other]

doi 10.21437/Interspeech.2021-556

LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech

Authors: Solene Evain, Ha Nguyen, Hang Le, Marcely Zanon Boito, Salima Mdhaffar, Sina Alisamir, Ziyi Tong, Natalia Tomashenko, Marco Dinarelli, Titouan Parcollet, Alexandre Allauzen, Yannick Esteve, Benjamin Lecouteux, Francois Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, Laurent Besacier

Abstract: Self-Supervised Learning (SSL) using huge unlabeled data has been successfully explored for image and natural language processing. Recent works also investigated SSL from speech. They were notably successful to improve performance on downstream tasks such as automatic speech recognition (ASR). While these works suggest it is possible to reduce dependence on labeled data for building efficient speech systems, their evaluation was mostly made on ASR and using multiple and heterogeneous experimental settings (most of them for English). This questions the objective comparison of SSL approaches and the evaluation of their impact on building speech systems. In this paper, we propose LeBenchmark: a reproducible framework for assessing SSL from speech. It not only includes ASR (high and low resource) tasks but also spoken language understanding, speech translation and emotion recognition. We also focus on speech technologies in a language different than English: French. SSL models of different sizes are trained from carefully sourced and documented datasets. Experiments show that SSL is beneficial for most but not all tasks which confirms the need for exhaustive and reliable benchmarks to evaluate its real impact. LeBenchmark is shared with the scientific community for reproducible research in SSL from speech.

Submitted 10 June, 2021; v1 submitted 23 April, 2021; originally announced April 2021.

Comments: Will be presented at Interspeech 2021

Journal ref: Proc. Interspeech 2021

arXiv:2103.12719 [pdf, other]

Characterizing and Improving the Robustness of Self-Supervised Learning through Background Augmentations

Authors: Chaitanya K. Ryali, David J. Schwab, Ari S. Morcos

Abstract: Recent progress in self-supervised learning has demonstrated promising results in multiple visual tasks. An important ingredient in high-performing self-supervised methods is the use of data augmentation by training models to place different augmented views of the same image nearby in embedding space. However, commonly used augmentation pipelines treat images holistically, ignoring the semantic relevance of parts of an image-e.g. a subject vs. a background-which can lead to the learning of spurious correlations. Our work addresses this problem by investigating a class of simple, yet highly effective "background augmentations", which encourage models to focus on semantically-relevant content by discouraging them from focusing on image backgrounds. Through a systematic investigation, we show that background augmentations lead to substantial improvements in performance across a spectrum of state-of-the-art self-supervised methods (MoCo-v2, BYOL, SwAV) on a variety of tasks, e.g. $\sim$+1-2% gains on ImageNet, enabling performance on par with the supervised baseline. Further, we find the improvement in limited-labels settings is even larger (up to 4.2%). Background augmentations also improve robustness to a number of distribution shifts, including natural adversarial examples, ImageNet-9, adversarial attacks, ImageNet-Renditions. We also make progress in completely unsupervised saliency detection, in the process of generating saliency masks used for background augmentations.

Submitted 12 November, 2021; v1 submitted 23 March, 2021; originally announced March 2021.

Comments: Technical Report; Additional Results

arXiv:2011.00747 [pdf, other]

Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation

Authors: Hang Le, Juan Pino, Changhan Wang, Jiatao Gu, Didier Schwab, Laurent Besacier

Abstract: We introduce dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST). Our models are based on the original Transformer architecture (Vaswani et al., 2017) but consist of two decoders, each responsible for one task (ASR or ST). Our major contribution lies in how these decoders interact with each other: one decoder can attend to different information sources from the other via a dual-attention mechanism. We propose two variants of these architectures corresponding to two different levels of dependencies between the decoders, called the parallel and cross dual-decoder Transformers, respectively. Extensive experiments on the MuST-C dataset show that our models outperform the previously-reported highest translation performance in the multilingual settings, and outperform as well bilingual one-to-one results. Furthermore, our parallel models demonstrate no trade-off between ASR and ST compared to the vanilla multi-task architecture. Our code and pre-trained models are available at https://github.com/formiel/speech-translation.

Submitted 1 November, 2020; originally announced November 2020.

Comments: Accepted at COLING 2020 (Oral)

Journal ref: The 28th International Conference on Computational Linguistics (COLING 2020)

arXiv:2010.06682 [pdf, other]

Are all negatives created equal in contrastive instance discrimination?

Authors: Tiffany Tianhui Cai, Jonathan Frankle, David J. Schwab, Ari S. Morcos

Abstract: Self-supervised learning has recently begun to rival supervised learning on computer vision tasks. Many of the recent approaches have been based on contrastive instance discrimination (CID), in which the network is trained to recognize two augmented versions of the same instance (a query and positive) while discriminating against a pool of other instances (negatives). The learned representation is then used on downstream tasks such as image classification. Using methodology from MoCo v2 (Chen et al., 2020), we divided negatives by their difficulty for a given query and studied which difficulty ranges were most important for learning useful representations. We found a minority of negatives -- the hardest 5% -- were both necessary and sufficient for the downstream task to reach nearly full accuracy. Conversely, the easiest 95% of negatives were unnecessary and insufficient. Moreover, the very hardest 0.1% of negatives were unnecessary and sometimes detrimental. Finally, we studied the properties of negatives that affect their hardness, and found that hard negatives were more semantically similar to the query, and that some negatives were more consistently easy or hard than we would expect by chance. Together, our results indicate that negatives vary in importance and that CID may benefit from more intelligent negative treatment.

Submitted 25 October, 2020; v1 submitted 13 October, 2020; originally announced October 2020.

Comments: Fixed author name error

arXiv:2009.12789 [pdf, other]

Learning Optimal Representations with the Decodable Information Bottleneck

Authors: Yann Dubois, Douwe Kiela, David J. Schwab, Ramakrishna Vedantam

Abstract: We address the question of characterizing and finding optimal representations for supervised learning. Traditionally, this question has been tackled using the Information Bottleneck, which compresses the inputs while retaining information about the targets, in a decoder-agnostic fashion. In machine learning, however, our goal is not compression but rather generalization, which is intimately linked to the predictive family or decoder of interest (e.g. linear classifier). We propose the Decodable Information Bottleneck (DIB) that considers information retention and compression from the perspective of the desired predictive family. As a result, DIB gives rise to representations that are optimal in terms of expected test performance and can be estimated with guarantees. Empirically, we show that the framework can be used to enforce a small generalization gap on downstream classifiers and to predict the generalization ability of neural networks.

Submitted 16 July, 2021; v1 submitted 27 September, 2020; originally announced September 2020.

Comments: Accepted at NeurIPS 2020

arXiv:2007.14823 [pdf, other]

Theory of gating in recurrent neural networks

Authors: Kamesh Krishnamurthy, Tankut Can, David J. Schwab

Abstract: Recurrent neural networks (RNNs) are powerful dynamical models, widely used in machine learning (ML) and neuroscience. Prior theoretical work has focused on RNNs with additive interactions. However, gating - i.e. multiplicative - interactions are ubiquitous in real neurons and also the central feature of the best-performing RNNs in ML. Here, we show that gating offers flexible control of two salient features of the collective dynamics: i) timescales and ii) dimensionality. The gate controlling timescales leads to a novel, marginally stable state, where the network functions as a flexible integrator. Unlike previous approaches, gating permits this important function without parameter fine-tuning or special symmetries. Gates also provide a flexible, context-dependent mechanism to reset the memory trace, thus complementing the memory function. The gate modulating the dimensionality can induce a novel, discontinuous chaotic transition, where inputs push a stable system to strong chaotic activity, in contrast to the typically stabilizing effect of inputs. At this transition, unlike additive RNNs, the proliferation of critical points (topological complexity) is decoupled from the appearance of chaotic dynamics (dynamical complexity). The rich dynamics are summarized in phase diagrams, thus providing a map for principled parameter initialization choices to ML practitioners.

Submitted 1 December, 2021; v1 submitted 29 July, 2020; originally announced July 2020.

Comments: 13 figures

arXiv:2004.11759 [pdf, other]

Learning Term Discrimination

Authors: Jibril Frej, Phillipe Mulhem, Didier Schwab, Jean-Pierre Chevallet

Abstract: Document indexing is a key component for efficient information retrieval (IR). After preprocessing steps such as stemming and stop-word removal, document indexes usually store term-frequencies (tf). Along with tf (that only reflects the importance of a term in a document), traditional IR models use term discrimination values (TDVs) such as inverse document frequency (idf) to favor discriminative terms during retrieval. In this work, we propose to learn TDVs for document indexing with shallow neural networks that approximate traditional IR ranking functions such as TF-IDF and BM25. Our proposal outperforms, both in terms of nDCG and recall, traditional approaches, even with few positively labelled query-document pairs as learning data. Our learned TDVs, when used to filter out terms of the vocabulary that have zero discrimination value, allow to both significantly lower the memory footprint of the inverted index and speed up the retrieval process (BM25 is up to 3 times faster), without degrading retrieval quality.

Submitted 28 April, 2020; v1 submitted 24 April, 2020; originally announced April 2020.

Comments: Accepted to ACM SIGIR 2020

arXiv:2003.00152 [pdf, other]

Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs

Authors: Jonathan Frankle, David J. Schwab, Ari S. Morcos

Abstract: A wide variety of deep learning techniques from style transfer to multitask learning rely on training affine transformations of features. Most prominent among these is the popular feature normalization technique BatchNorm, which normalizes activations and then subsequently applies a learned affine transform. In this paper, we aim to understand the role and expressive power of affine parameters used to transform features in this way. To isolate the contribution of these parameters from that of the learned features they transform, we investigate the performance achieved when training only these parameters in BatchNorm and freezing all weights at their random initializations. Doing so leads to surprisingly high performance considering the significant limitations that this style of training imposes. For example, sufficiently deep ResNets reach 82% (CIFAR-10) and 32% (ImageNet, top-5) accuracy in this configuration, far higher than when training an equivalent number of randomly chosen parameters elsewhere in the network. BatchNorm achieves this performance in part by naturally learning to disable around a third of the random features. Not only do these results highlight the expressive power of affine parameters in deep learning, but - in a broader sense - they characterize the expressive power of neural networks constructed simply by shifting and rescaling random features.

Submitted 21 March, 2021; v1 submitted 28 February, 2020; originally announced March 2020.

Comments: Published in ICLR 2021

arXiv:2002.10365 [pdf, other]

The Early Phase of Neural Network Training

Authors: Jonathan Frankle, David J. Schwab, Ari S. Morcos

Abstract: Recent studies have shown that many important aspects of neural network learning take place within the very earliest iterations or epochs of training. For example, sparse, trainable sub-networks emerge (Frankle et al., 2019), gradient descent moves into a small subspace (Gur-Ari et al., 2018), and the network undergoes a critical period (Achille et al., 2019). Here, we examine the changes that deep neural networks undergo during this early phase of training. We perform extensive measurements of the network state during these early iterations of training and leverage the framework of Frankle et al. (2019) to quantitatively probe the weight distribution and its reliance on various aspects of the dataset. We find that, within this framework, deep networks are not robust to reinitializing with random weights while maintaining signs, and that weight distributions are highly non-independent even after only a few hundred iterations. Despite this behavior, pre-training with blurred inputs or an auxiliary self-supervised task can approximate the changes in supervised networks, suggesting that these changes are not inherently label-dependent, though labels significantly accelerate this process. Together, these results help to elucidate the network changes occurring during this pivotal initial period of learning.

Submitted 24 February, 2020; originally announced February 2020.

Comments: ICLR 2020 Camera Ready. Available on OpenReview at https://openreview.net/forum?id=Hkl1iRNFwS

arXiv:2002.00025 [pdf, other]

Gating creates slow modes and controls phase-space complexity in GRUs and LSTMs

Authors: Tankut Can, Kamesh Krishnamurthy, David J. Schwab

Abstract: Recurrent neural networks (RNNs) are powerful dynamical models for data with complex temporal structure. However, training RNNs has traditionally proved challenging due to exploding or vanishing of gradients. RNN models such as LSTMs and GRUs (and their variants) significantly mitigate these issues associated with training by introducing various types of gating units into the architecture. While these gates empirically improve performance, how the addition of gates influences the dynamics and trainability of GRUs and LSTMs is not well understood. Here, we take the perspective of studying randomly initialized LSTMs and GRUs as dynamical systems, and ask how the salient dynamical properties are shaped by the gates. We leverage tools from random matrix theory and mean-field theory to study the state-to-state Jacobians of GRUs and LSTMs. We show that the update gate in the GRU and the forget gate in the LSTM can lead to an accumulation of slow modes in the dynamics. Moreover, the GRU update gate can poise the system at a marginally stable point. The reset gate in the GRU and the output and input gates in the LSTM control the spectral radius of the Jacobian, and the GRU reset gate also modulates the complexity of the landscape of fixed-points. Furthermore, for the GRU we obtain a phase diagram describing the statistical properties of fixed-points. We also provide a preliminary comparison of training performance to the various dynamical regimes realized by varying hyperparameters. Looking to the future, we have introduced a powerful set of techniques which can be adapted to a broad class of RNNs, to study the influence of various architectural choices on dynamics, and potentially motivate the principled discovery of novel architectures.

Submitted 15 June, 2020; v1 submitted 31 January, 2020; originally announced February 2020.

Comments: 18+18 pages, 4 figures, to appear in Proceedings of Machine Learning Research Vol. 107, 2020, 1st Annual Conference on Mathematical and Scientific Machine Learning

arXiv:1912.05372 [pdf, ps, other]

FlauBERT: Unsupervised Language Model Pre-training for French

Authors: Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab

Abstract: Language models have become a key step to achieve state-of-the-art results in many different Natural Language Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their contextualization at the sentence level. This has been widely demonstrated for English using contextualized representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research community for further reproducible experiments in French NLP.

Submitted 12 March, 2020; v1 submitted 11 December, 2019; originally announced December 2019.

Comments: Accepted to LREC 2020

arXiv:1912.01901 [pdf, other]

WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset

Authors: Jibril Frej, Didier Schwab, Jean-Pierre Chevallet

Abstract: Over the past years, deep learning methods allowed for new state-of-the-art results in ad-hoc information retrieval. However such methods usually require large amounts of annotated data to be effective. Since most standard ad-hoc information retrieval datasets publicly available for academic research (e.g. Robust04, ClueWeb09) have at most 250 annotated queries, the recent deep learning models for information retrieval perform poorly on these datasets. These models (e.g. DUET, Conv-KNRM) are trained and evaluated on data collected from commercial search engines not publicly available for academic research which is a problem for reproducibility and the advancement of research. In this paper, we propose WIKIR: an open-source toolkit to automatically build large-scale English information retrieval datasets based on Wikipedia. WIKIR is publicly available on GitHub. We also provide wikIR78k and wikIRS78k: two large-scale publicly available datasets that both contain 78,628 queries and 3,060,191 (query, relevant documents) pairs.

Submitted 17 March, 2020; v1 submitted 4 December, 2019; originally announced December 2019.

Comments: Accepted at LREC 2020

MSC Class: H.3.3; ACM Class: H.3.3

arXiv:1911.02898 [pdf, other]

The LIG system for the English-Czech Text Translation Task of IWSLT 2019

Authors: Loïc Vial, Benjamin Lecouteux, Didier Schwab, Hang Le, Laurent Besacier

Abstract: In this paper, we present our submission for the English to Czech Text Translation Task of IWSLT 2019. Our system aims to study how pre-trained language models, used as input embeddings, can improve a specialized machine translation system trained on few data. Therefore, we implemented a Transformer-based encoder-decoder neural system which is able to use the output of a pre-trained language model as input embeddings, and we compared its performance under three configurations: 1) without any pre-trained language model (constrained), 2) using a language model trained on the monolingual parts of the allowed English-Czech data (constrained), and 3) using a language model trained on a large quantity of external monolingual data (unconstrained). We used BERT as external pre-trained language model (configuration 3), and BERT architecture for training our own language model (configuration 2). Regarding the training data, we trained our MT system on a small quantity of parallel text: one set only consists of the provided MuST-C corpus, and the other set consists of the MuST-C corpus and the News Commentary corpus from WMT. We observed that using the external pre-trained BERT improves the scores of our system by +0.8 to +1.5 of BLEU on our development set, and +0.97 to +1.94 of BLEU on the test set. However, using our own language model trained only on the allowed parallel data seems to improve the machine translation performances only when the system is trained on the smallest dataset.

Submitted 7 November, 2019; originally announced November 2019.

Comments: IWSLT 2019

arXiv:1910.00195 [pdf, other]

How noise affects the Hessian spectrum in overparameterized neural networks

Authors: Mingwei Wei, David J Schwab

Abstract: Stochastic gradient descent (SGD) forms the core optimization method for deep neural networks. While some theoretical progress has been made, it still remains unclear why SGD leads the learning dynamics in overparameterized networks to solutions that generalize well. Here we show that for overparameterized networks with a degenerate valley in their loss landscape, SGD on average decreases the trace of the Hessian of the loss. We also generalize this result to other noise structures and show that isotropic noise in the non-degenerate subspace of the Hessian decreases its determinant. In addition to explaining SGD's role in sculpting the Hessian spectrum, this opens the door to new optimization approaches that may confer better generalization performance. We test our results with experiments on toy models and deep neural networks.

Submitted 29 October, 2019; v1 submitted 1 October, 2019; originally announced October 2019.

arXiv:1905.05677 [pdf, other]

Sense Vocabulary Compression through the Semantic Knowledge of WordNet for Neural Word Sense Disambiguation

Authors: Loïc Vial, Benjamin Lecouteux, Didier Schwab

Abstract: In this article, we tackle the issue of the limited quantity of manually sense annotated corpora for the task of word sense disambiguation, by exploiting the semantic relationships between senses such as synonymy, hypernymy and hyponymy, in order to compress the sense vocabulary of Princeton WordNet, and thus reduce the number of different sense tags that must be observed to disambiguate all words of the lexical database. We propose two different methods that greatly reduce the size of neural WSD models, with the benefit of improving their coverage without additional training data, and without impacting their precision. In addition to our method, we present a WSD system which relies on pre-trained BERT word vectors in order to achieve results that significantly outperform the state of the art on all WSD evaluation tasks.

Submitted 27 August, 2019; v1 submitted 14 May, 2019; originally announced May 2019.

Comments: In proceedings of the 10th Global WordNet Conference - GWC 2019. arXiv admin note: text overlap with arXiv:1811.00960

arXiv:1903.02606 [pdf, other]

Mean-field Analysis of Batch Normalization

Authors: Mingwei Wei, James Stokes, David J Schwab

Abstract: Batch Normalization (BatchNorm) is an extremely useful component of modern neural network architectures, enabling optimization using higher learning rates and achieving faster convergence. In this paper, we use mean-field theory to analytically quantify the impact of BatchNorm on the geometry of the loss landscape for multi-layer networks consisting of fully-connected and convolutional layers. We show that it has a flattening effect on the loss landscape, as quantified by the maximum eigenvalue of the Fisher Information Matrix. These findings are then used to justify the use of larger learning rates for networks that use BatchNorm, and we provide quantitative characterization of the maximal allowable learning rate to ensure convergence. Experiments support our theoretically predicted maximum learning rate, and furthermore suggest that networks with smaller values of the BatchNorm parameter achieve lower loss after the same number of epochs of training.

Submitted 6 March, 2019; originally announced March 2019.

arXiv:1902.04706 [pdf, other]

Simultaneously Learning Vision and Feature-based Control Policies for Real-world Ball-in-a-Cup

Authors: Devin Schwab, Tobias Springenberg, Murilo F. Martins, Thomas Lampe, Michael Neunert, Abbas Abdolmaleki, Tim Hertweck, Roland Hafner, Francesco Nori, Martin Riedmiller

Abstract: We present a method for fast training of vision based control policies on real robots. The key idea behind our method is to perform multi-task Reinforcement Learning with auxiliary tasks that differ not only in the reward to be optimized but also in the state-space in which they operate. In particular, we allow auxiliary task policies to utilize task features that are available only at training-time. This allows for fast learning of auxiliary policies, which subsequently generate good data for training the main, vision-based control policies. This method can be seen as an extension of the Scheduled Auxiliary Control (SAC-X) framework. We demonstrate the efficacy of our method by using both a simulated and real-world Ball-in-a-Cup game controlled by a robot arm. In simulation, our approach leads to significant learning speed-ups when compared to standard SAC-X. On the real robot we show that the task can be learned from-scratch, i.e., with no transfer from simulation and no imitation learning. Videos of our learned policies running on the real robot can be found at https://sites.google.com/view/rss-2019-sawyer-bic/.

Submitted 18 February, 2019; v1 submitted 12 February, 2019; originally announced February 2019.

Comments: Videos can be found at https://sites.google.com/view/rss-2019-sawyer-bic/

arXiv:1811.00960 [pdf, other]

Improving the Coverage and the Generalization Ability of Neural Word Sense Disambiguation through Hypernymy and Hyponymy Relationships

Authors: Loïc Vial, Benjamin Lecouteux, Didier Schwab

Abstract: In Word Sense Disambiguation (WSD), the predominant approach generally involves a supervised system trained on sense annotated corpora. The limited quantity of such corpora however restricts the coverage and the performance of these systems. In this article, we propose a new method that solves these issues by taking advantage of the knowledge present in WordNet, and especially the hypernymy and hyponymy relationships between synsets, in order to reduce the number of different sense tags that are necessary to disambiguate all words of the lexical database. Our method leads to state of the art results on most WSD evaluation tasks, while improving the coverage of supervised systems, reducing the training time and the size of the models, without additional training data. In addition, we exhibit results that significantly outperform the state of the art when our method is combined with an ensembling technique and the addition of the WordNet Gloss Tagged as training corpus.

Submitted 2 November, 2018; originally announced November 2018.

arXiv:1808.02093 [pdf, other]

Learning to Share and Hide Intentions using Information Regularization

Authors: DJ Strouse, Max Kleiman-Weiner, Josh Tenenbaum, Matt Botvinick, David Schwab

Abstract: Learning to cooperate with friends and compete with foes is a key component of multi-agent reinforcement learning. Typically to do so, one requires access to either a model of or interaction with the other agent(s). Here we show how to learn effective strategies for cooperation and competition in an asymmetric information game with no such model or interaction. Our approach is to encourage an agent to reveal or hide their intentions using an information-theoretic regularizer. We consider both the mutual information between goal and action given state, as well as the mutual information between goal and state. We show how to optimize these regularizers in a way that is easy to integrate with policy gradient reinforcement learning. Finally, we demonstrate that cooperative (competitive) policies learned with our approach lead to more (less) reward for a second agent in two simple asymmetric information games.

Submitted 1 January, 2019; v1 submitted 6 August, 2018; originally announced August 2018.

Comments: Presented at the 32nd Conference on Neural Information Processing Systems (NIPS 2018)

arXiv:1803.08823 [pdf, other]

doi 10.1016/j.physrep.2019.03.001

A high-bias, low-variance introduction to Machine Learning for physicists

Authors: Pankaj Mehta, Marin Bukov, Ching-Hao Wang, Alexandre G. R. Day, Clint Richardson, Charles K. Fisher, David J. Schwab

Abstract: Machine Learning (ML) is one of the most exciting and dynamic areas of modern research and application. The purpose of this review is to provide an introduction to the core concepts and tools of machine learning in a manner easily understood and intuitive to physicists. The review begins by covering fundamental concepts in ML and modern statistics such as the bias-variance tradeoff, overfitting, regularization, generalization, and gradient descent before moving on to more advanced topics in both supervised and unsupervised learning. Topics covered in the review include ensemble models, deep learning and neural networks, clustering and data visualization, energy-based models (including MaxEnt models and Restricted Boltzmann Machines), and variational methods. Throughout, we emphasize the many natural connections between ML and statistical physics. A notable aspect of the review is the use of Python Jupyter notebooks to introduce modern ML/statistical packages to readers using physics-inspired datasets (the Ising Model and Monte-Carlo simulations of supersymmetric decays of proton-proton collisions). We conclude with an extended outlook discussing possible uses of machine learning for furthering our understanding of the physical world as well as open problems in ML where physicists may be able to contribute. (Notebooks are available at https://physics.bu.edu/~pankajm/MLnotebooks.html )

Submitted 27 May, 2019; v1 submitted 23 March, 2018; originally announced March 2018.

Comments: Notebooks have been updated. 122 pages, 78 figures, 20 Python notebooks

Journal ref: Physics Reports 810 (2019) 1-124

arXiv:1802.02053 [pdf, other]

Système de traduction automatique statistique Anglais-Arabe

Authors: Marwa Hadj Salah, Didier Schwab, Hervé Blanchon, Mounir Zrigui

Abstract: Machine translation (MT) is the process of translating text written in a source language into text in a target language. In this article, we present our English-Arabic statistical machine translation system. First, we present the general process for setting up a statistical machine translation system, then we describe the tools as well as the different corpora we used to build our MT system. Our system was evaluated in terms of the BLEU score (24.51%).

Submitted 6 February, 2018; originally announced February 2018.

Comments: in French

arXiv:1801.03844 [pdf, ps, other]

Enhancing Translation Language Models with Word Embedding for Information Retrieval

Authors: Jibril Frej, Jean-Pierre Chevallet, Didier Schwab

Abstract: In this paper, we explore the usage of Word Embedding semantic resources for the Information Retrieval (IR) task. These embeddings, produced by a shallow neural network, have been shown to capture semantic similarities between words (Mikolov et al., 2013). Hence, our goal is to enhance IR Language Models by addressing the term mismatch problem. To do so, we applied the model presented in the paper Integrating and Evaluating Neural Word Embedding in Information Retrieval by Zuccon et al. (2015), which proposes to estimate the translation probability of a Translation Language Model using the cosine similarity between Word Embeddings. The results we obtained so far did not show a statistically significant improvement compared to the classical Language Model.

Submitted 11 January, 2018; originally announced January 2018.

arXiv:1712.09657 [pdf, other]

The information bottleneck and geometric clustering

Authors: DJ Strouse, David J Schwab

Abstract: The information bottleneck (IB) approach to clustering takes a joint distribution $P\!\left(X,Y\right)$ and maps the data $X$ to cluster labels $T$ which retain maximal information about $Y$ (Tishby et al., 1999). This objective results in an algorithm that clusters data points based upon the similarity of their conditional distributions $P\!\left(Y\mid X\right)$. This is in contrast to classic "geometric clustering" algorithms such as $k$-means and gaussian mixture models (GMMs) which take a set of observed data points $\left\{ \mathbf{x}_{i}\right\} _{i=1:N}$ and cluster them based upon their geometric (typically Euclidean) distance from one another. Here, we show how to use the deterministic information bottleneck (DIB) (Strouse and Schwab, 2017), a variant of IB, to perform geometric clustering, by choosing cluster labels that preserve information about data point location on a smoothed dataset. We also introduce a novel method to choose the number of clusters, based on identifying solutions where the tradeoff between number of clusters used and spatial information preserved is strongest. We apply this approach to a variety of simple clustering problems, showing that DIB with our model selection procedure recovers the generative cluster labels. We also show that, in particular limits of our model parameters, clustering with DIB and IB is equivalent to $k$-means and EM fitting of a GMM with hard and soft assignments, respectively. Thus, clustering with (D)IB generalizes and provides an information-theoretic perspective on these classic algorithms.

Submitted 31 May, 2020; v1 submitted 27 December, 2017; originally announced December 2017.

Comments: Updated to final published version with more detailed relationship to GMMs/k-means

Journal ref: Neural Computation 31 (2019) 596-612

arXiv:1705.08828 [pdf, other]

Deep Investigation of Cross-Language Plagiarism Detection Methods

Authors: Jeremy Ferrero, Laurent Besacier, Didier Schwab, Frederic Agnes

Abstract: This paper is a deep investigation of cross-language plagiarism detection methods on a new recently introduced open dataset, which contains parallel and comparable collections of documents with multiple characteristics (different genres, languages and sizes of texts). We investigate cross-language plagiarism detection methods for 6 language pairs on 2 granularities of text units in order to draw robust conclusions on the best methods while deeply analyzing correlations across document styles and languages.

Submitted 24 May, 2017; originally announced May 2017.

Comments: Accepted to BUCC (10th Workshop on Building and Using Comparable Corpora) colocated with ACL 2017

arXiv:1704.02293 [pdf, other]

Comparison of Global Algorithms in Word Sense Disambiguation

Authors: Loïc Vial, Andon Tchechmedjiev, Didier Schwab

Abstract: This article compares four probabilistic algorithms (global algorithms) for Word Sense Disambiguation (WSD) in terms of the number of scorer calls (local algorithm) and the F1 score as determined by a gold-standard scorer. Two algorithms come from the state of the art, a Simulated Annealing Algorithm (SAA) and a Genetic Algorithm (GA), and two are state-of-the-art probabilistic search algorithms that we are the first to adapt to WSD, namely a Cuckoo Search Algorithm (CSA) and a Bat Search algorithm (BS). As WSD requires evaluating exponentially many word sense combinations (with branching factors of up to 6 or more), probabilistic algorithms make it possible to find approximate solutions in tractable time by sampling the search space. We find that CSA, GA and SAA all eventually converge to similar results (0.98 F1 score), but CSA gets there faster (in fewer scorer calls) and reaches up to 0.95 F1 before SAA in fewer scorer calls. In BS, a strict convergence criterion prevents it from reaching above 0.89 F1.

Submitted 7 April, 2017; originally announced April 2017.

arXiv:1704.01346 [pdf, ps, other]

CompiLIG at SemEval-2017 Task 1: Cross-Language Plagiarism Detection Methods for Semantic Textual Similarity

Authors: Jeremy Ferrero, Frederic Agnes, Laurent Besacier, Didier Schwab

Abstract: We present our submitted systems for Semantic Textual Similarity (STS) Track 4 at SemEval-2017. Given a pair of Spanish-English sentences, each system must estimate their semantic similarity by a score between 0 and 5. In our submission, we use syntax-based, dictionary-based, context-based, and MT-based methods. We also combine these methods in unsupervised and supervised ways. Our best run ranked 1st on track 4a with a correlation of 83.02% with human annotations.

Submitted 5 April, 2017; originally announced April 2017.

arXiv:1702.03082 [pdf, other]

Using Word Embedding for Cross-Language Plagiarism Detection

Authors: J. Ferrero, F. Agnes, L. Besacier, D. Schwab

Abstract: This paper proposes to use distributed representation of words (word embeddings) in cross-language textual similarity detection. The main contributions of this paper are the following: (a) we introduce new cross-language similarity detection methods based on distributed representation of words; (b) we combine the different methods proposed to verify their complementarity and finally obtain an overall F1 score of 89.15% for English-French similarity detection at chunk level (88.5% at sentence level) on a very challenging corpus.

Submitted 10 February, 2017; originally announced February 2017.

Comments: Accepted to EACL 2017 (short)

arXiv:1609.03541 [pdf, ps, other]

Comment on "Why does deep and cheap learning work so well?" [arXiv:1608.08225]

Authors: David J. Schwab, Pankaj Mehta

Abstract: In a recent paper, "Why does deep and cheap learning work so well?", Lin and Tegmark claim to show that the mapping between deep belief networks and the variational renormalization group derived in [arXiv:1410.3831] does not hold. In this comment, we show that these claims are incorrect and stem from a misunderstanding of the variational RG procedure proposed by Kadanoff. We also explain why the "counterexample" of Lin and Tegmark is compatible with the mapping proposed in [arXiv:1410.3831].

Submitted 12 September, 2016; originally announced September 2016.

Comments: Comment on arXiv:1608.08225

arXiv:1605.05775 [pdf, other]

Supervised Learning with Quantum-Inspired Tensor Networks

Authors: E. Miles Stoudenmire, David J. Schwab

Abstract: Tensor networks are efficient representations of high-dimensional tensors which have been very successful for physics and mathematics applications. We demonstrate how algorithms for optimizing such networks can be adapted to supervised learning tasks by using matrix product states (tensor trains) to parameterize models for classifying images. For the MNIST data set we obtain less than 1% test set classification error. We discuss how the tensor network form imparts additional structure to the learned model and suggest a possible generative interpretation.

Submitted 18 May, 2017; v1 submitted 18 May, 2016; originally announced May 2016.

Comments: 11 pages, 15 figures; updated version includes corrections, links to sample codes, expanded discussion, and additional references

Journal ref: Advances in Neural Information Processing Systems 29, 4799 (2016)

arXiv:1604.00268 [pdf, other]

The deterministic information bottleneck

Authors: DJ Strouse, David J Schwab

Abstract: Lossy compression and clustering fundamentally involve a decision about what features are relevant and which are not. The information bottleneck method (IB) by Tishby, Pereira, and Bialek formalized this notion as an information-theoretic optimization problem and proposed an optimal tradeoff between throwing away as many bits as possible, and selectively keeping those that are most important. In the IB, compression is measured by mutual information. Here, we introduce an alternative formulation that replaces mutual information with entropy, which we call the deterministic information bottleneck (DIB), that we argue better captures this notion of compression. As suggested by its name, the solution to the DIB problem turns out to be a deterministic encoder, or hard clustering, as opposed to the stochastic encoder, or soft clustering, that is optimal under the IB. We compare the IB and DIB on synthetic data, showing that the IB and DIB perform similarly in terms of the IB cost function, but that the DIB significantly outperforms the IB in terms of the DIB cost function. We also empirically find that the DIB offers a considerable gain in computational efficiency over the IB, over a range of convergence parameters. Our derivation of the DIB also suggests a method for continuously interpolating between the soft clustering of the IB and the hard clustering of the DIB.

Submitted 19 December, 2016; v1 submitted 1 April, 2016; originally announced April 2016.

Comments: 15 pages, 4 figures

arXiv:1410.3831 [pdf, ps, other]

An exact mapping between the Variational Renormalization Group and Deep Learning

Authors: Pankaj Mehta, David J. Schwab

Abstract: Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.

Submitted 14 October, 2014; originally announced October 2014.

Comments: 8 pages, 3 figures

arXiv:1201.4279 [pdf]

Évaluation et consolidation d'un réseau lexical via un outil pour retrouver le mot sur le bout de la langue

Authors: Alain Joubert, Mathieu Lafourcade, Didier Schwab, Michael Zock

Abstract: Since September 2007, a large-scale lexical network for French has been under construction through methods based on a form of popular consensus by means of games (JeuxDeMots project). Human intervention can be considered marginal. It is limited to corrections, adjustments and validation of the senses of terms, which amounts to less than 0.5% of the relations in the network. To assess the quality of this resource built by non-expert users (players of the game), we use an approach similar to the one used for its construction. The resource must be validated by laymen, persistently over time, on open-class vocabulary. We propose checking whether our tool is able to solve the Tip of the Tongue (TOT) problem. Just like JeuxDeMots, our tool can be considered an online game. Like the former, it allows the acquisition of new relations, thus enriching the existing network.

Submitted 20 January, 2012; originally announced January 2012.

Journal ref: TALN, Montpellier : France (2011)

arXiv:0906.0859 [pdf]

A Partial Order on Bipartite Graphs with n Vertices

Authors: Emil Daniel Schwab

Abstract: The paper examines a partial order on bipartite graphs (X1, X2, E) with n vertices, X1 ∪ X2 = {1, 2, ..., n}. This partial order is a natural partial order of subobjects of an object in a triangular category with bipartite graphs as morphisms.

Submitted 4 June, 2009; originally announced June 2009.

Comments: 10 pages, presented at the 5th International Conference "Actualities and Perspectives on Hardware and Software" - APHS2009, Timisoara, Romania

Journal ref: Ann. Univ. Tibiscus Comp. Sci. Series VII(2009),315-324

Showing 1–44 of 44 results for author: Schwab, D