-
Compositional preference models for aligning LMs
Authors:
Dongyoung Go,
Tomasz Korbak,
Germán Kruszewski,
Jos Rozen,
Marc Dymetman
Abstract:
As language models (LMs) become more capable, it is increasingly important to align them with human preferences. However, the dominant paradigm for training Preference Models (PMs) for that purpose suffers from fundamental limitations, such as lack of transparency and scalability, along with susceptibility to overfitting the preference dataset. We propose Compositional Preference Models (CPMs), a…
▽ More
As language models (LMs) become more capable, it is increasingly important to align them with human preferences. However, the dominant paradigm for training Preference Models (PMs) for that purpose suffers from fundamental limitations, such as lack of transparency and scalability, along with susceptibility to overfitting the preference dataset. We propose Compositional Preference Models (CPMs), a novel PM framework that decomposes one global preference assessment into several interpretable features, obtains scalar scores for these features from a prompted LM, and aggregates these scores using a logistic regression classifier. Through these simple steps, CPMs allow to control which properties of the preference data are used to train the preference model and to build it based on features that are believed to underlie the human preference judgment. Our experiments show that CPMs not only improve generalization and are more robust to overoptimization than standard PMs, but also that best-of-n samples obtained using CPMs tend to be preferred over samples obtained using conventional PMs. Overall, our approach demonstrates the benefits of endowing PMs with priors about which features determine human preferences while relying on LM capabilities to extract those features in a scalable and robust way.
△ Less
Submitted 14 March, 2024; v1 submitted 16 October, 2023;
originally announced October 2023.
-
Should you marginalize over possible tokenizations?
Authors:
Nadezhda Chirkova,
Germán Kruszewski,
Jos Rozen,
Marc Dymetman
Abstract:
Autoregressive language models (LMs) map token sequences to probabilities. The usual practice for computing the probability of any character string (e.g. English sentences) is to first transform it into a sequence of tokens that is scored by the model. However, there are exponentially many token sequences that represent any given string. To truly compute the probability of a string one should marg…
▽ More
Autoregressive language models (LMs) map token sequences to probabilities. The usual practice for computing the probability of any character string (e.g. English sentences) is to first transform it into a sequence of tokens that is scored by the model. However, there are exponentially many token sequences that represent any given string. To truly compute the probability of a string one should marginalize over all tokenizations, which is typically intractable. Here, we analyze whether the practice of ignoring the marginalization is justified. To this end, we devise an importance-sampling-based algorithm that allows us to compute estimates of the marginal probabilities and compare them to the default procedure in a range of state-of-the-art models and datasets. Our results show that the gap in log-likelihood is no larger than 0.5% in most cases, but that it becomes more pronounced for data with long complex words.
△ Less
Submitted 30 June, 2023;
originally announced June 2023.
-
disco: a toolkit for Distributional Control of Generative Models
Authors:
Germán Kruszewski,
Jos Rozen,
Marc Dymetman
Abstract:
Pre-trained language models and other generative models have revolutionized NLP and beyond. However, these models tend to reproduce undesirable biases present in their training data. Also, they may overlook patterns that are important but challenging to capture. To address these limitations, researchers have introduced distributional control techniques. These techniques, not limited to language, a…
▽ More
Pre-trained language models and other generative models have revolutionized NLP and beyond. However, these models tend to reproduce undesirable biases present in their training data. Also, they may overlook patterns that are important but challenging to capture. To address these limitations, researchers have introduced distributional control techniques. These techniques, not limited to language, allow controlling the prevalence (i.e., expectations) of any features of interest in the model's outputs. Despite their potential, the widespread adoption of these techniques has been hindered by the difficulty in adapting complex, disconnected code. Here, we present disco, an open-source Python library that brings these techniques to the broader public.
△ Less
Submitted 8 March, 2023;
originally announced March 2023.
-
Aligning Language Models with Preferences through f-divergence Minimization
Authors:
Dongyoung Go,
Tomasz Korbak,
Germán Kruszewski,
Jos Rozen,
Nahyeon Ryu,
Marc Dymetman
Abstract:
Aligning language models with preferences can be posed as approximating a target distribution representing some desired behavior. Existing approaches differ both in the functional form of the target distribution and the algorithm used to approximate it. For instance, Reinforcement Learning from Human Feedback (RLHF) corresponds to minimizing a reverse KL from an implicit target distribution arisin…
▽ More
Aligning language models with preferences can be posed as approximating a target distribution representing some desired behavior. Existing approaches differ both in the functional form of the target distribution and the algorithm used to approximate it. For instance, Reinforcement Learning from Human Feedback (RLHF) corresponds to minimizing a reverse KL from an implicit target distribution arising from a KL penalty in the objective. On the other hand, Generative Distributional Control (GDC) has an explicit target distribution and minimizes a forward KL from it using the Distributional Policy Gradient (DPG) algorithm. In this paper, we propose a new approach, f-DPG, which allows the use of any f-divergence to approximate any target distribution that can be evaluated. f-DPG unifies both frameworks (RLHF, GDC) and the approximation methods (DPG, RL with KL penalties). We show the practical benefits of various choices of divergence objectives and demonstrate that there is no universally optimal objective but that different divergences present different alignment and diversity trade-offs. We show that Jensen-Shannon divergence strikes a good balance between these objectives, and frequently outperforms forward KL divergence by a wide margin, leading to significant improvements over prior work. These distinguishing characteristics between divergences persist as the model size increases, highlighting the importance of selecting appropriate divergence objectives.
△ Less
Submitted 6 June, 2023; v1 submitted 16 February, 2023;
originally announced February 2023.
-
On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting
Authors:
Tomasz Korbak,
Hady Elsahar,
Germán Kruszewski,
Marc Dymetman
Abstract:
The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a training-from-scratch to a fine-tuning paradigm. While in some applications the goal is to "nudge" the pre-trained distribution towards preferred outputs, in others it is to steer it towards a different distribution over the sample space. Two main paradigms have emerged t…
▽ More
The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a training-from-scratch to a fine-tuning paradigm. While in some applications the goal is to "nudge" the pre-trained distribution towards preferred outputs, in others it is to steer it towards a different distribution over the sample space. Two main paradigms have emerged to tackle this challenge: Reward Maximization (RM) and, more recently, Distribution Matching (DM). RM applies standard Reinforcement Learning (RL) techniques, such as Policy Gradients, to gradually increase the reward signal. DM prescribes to first make explicit the target distribution that the model is fine-tuned to approximate. Here we explore the theoretical connections between the two paradigms, and show that methods such as KL-control developed for RM can also be construed as belonging to DM. We further observe that while DM differs from RM, it can suffer from similar training difficulties, such as high gradient variance. We leverage connections between the two paradigms to import the concept of baseline into DM methods. We empirically validate the benefits of adding a baseline on an array of controllable language generation tasks such as constraining topic, sentiment, and gender distributions in texts sampled from a language model. We observe superior performance in terms of constraint satisfaction, stability and sample efficiency.
△ Less
Submitted 14 November, 2022; v1 submitted 1 June, 2022;
originally announced June 2022.
-
Sampling from Discrete Energy-Based Models with Quality/Efficiency Trade-offs
Authors:
Bryan Eikema,
Germán Kruszewski,
Hady Elsahar,
Marc Dymetman
Abstract:
Energy-Based Models (EBMs) allow for extremely flexible specifications of probability distributions. However, they do not provide a mechanism for obtaining exact samples from these distributions. Monte Carlo techniques can aid us in obtaining samples if some proposal distribution that we can easily sample from is available. For instance, rejection sampling can provide exact samples but is often di…
▽ More
Energy-Based Models (EBMs) allow for extremely flexible specifications of probability distributions. However, they do not provide a mechanism for obtaining exact samples from these distributions. Monte Carlo techniques can aid us in obtaining samples if some proposal distribution that we can easily sample from is available. For instance, rejection sampling can provide exact samples but is often difficult or impossible to apply due to the need to find a proposal distribution that upper-bounds the target distribution everywhere. Approximate Markov chain Monte Carlo sampling techniques like Metropolis-Hastings are usually easier to design, exploiting a local proposal distribution that performs local edits on an evolving sample. However, these techniques can be inefficient due to the local nature of the proposal distribution and do not provide an estimate of the quality of their samples. In this work, we propose a new approximate sampling technique, Quasi Rejection Sampling (QRS), that allows for a trade-off between sampling efficiency and sampling quality, while providing explicit convergence bounds and diagnostics. QRS capitalizes on the availability of high-quality global proposal distributions obtained from deep learning models. We demonstrate the effectiveness of QRS sampling for discrete EBMs over text for the tasks of controlled text generation with distributional constraints and paraphrase generation. We show that we can sample from such EBMs with arbitrary precision at the cost of sampling efficiency.
△ Less
Submitted 10 December, 2021;
originally announced December 2021.
-
Controlling Conditional Language Models without Catastrophic Forgetting
Authors:
Tomasz Korbak,
Hady Elsahar,
German Kruszewski,
Marc Dymetman
Abstract:
Machine learning is shifting towards general-purpose pretrained generative models, trained in a self-supervised manner on large amounts of data, which can then be applied to solve a large number of tasks. However, due to their generic training methodology, these models often fail to meet some of the downstream requirements (e.g., hallucinations in abstractive summarization or style violations in c…
▽ More
Machine learning is shifting towards general-purpose pretrained generative models, trained in a self-supervised manner on large amounts of data, which can then be applied to solve a large number of tasks. However, due to their generic training methodology, these models often fail to meet some of the downstream requirements (e.g., hallucinations in abstractive summarization or style violations in code generation). This raises the important question of how to adapt pre-trained generative models to meet all requirements without destroying their general capabilities ("catastrophic forgetting"). Recent work has proposed to solve this problem by representing task-specific requirements through energy-based models (EBMs) and approximating these EBMs using distributional policy gradients (DPG). Despite its effectiveness, this approach is however limited to unconditional distributions. In this paper, we extend DPG to conditional tasks by proposing Conditional DPG (CDPG). We evaluate CDPG on four different control objectives across three tasks (translation, summarization and code generation) and two pretrained models (T5 and GPT-Neo). Our results show that fine-tuning using CDPG robustly moves these pretrained models closer towards meeting control objectives and -- in contrast with baseline approaches -- does not result in catastrophic forgetting.
△ Less
Submitted 20 June, 2022; v1 submitted 1 December, 2021;
originally announced December 2021.
-
Energy-Based Models for Code Generation under Compilability Constraints
Authors:
Tomasz Korbak,
Hady Elsahar,
Marc Dymetman,
Germán Kruszewski
Abstract:
Neural language models can be successfully trained on source code, leading to applications such as code completion. However, their versatile autoregressive self-supervision objective overlooks important global sequence-level features that are present in the data such as syntactic correctness or compilability. In this work, we pose the problem of learning to generate compilable code as constraint s…
▽ More
Neural language models can be successfully trained on source code, leading to applications such as code completion. However, their versatile autoregressive self-supervision objective overlooks important global sequence-level features that are present in the data such as syntactic correctness or compilability. In this work, we pose the problem of learning to generate compilable code as constraint satisfaction. We define an Energy-Based Model (EBM) representing a pre-trained generative model with an imposed constraint of generating only compilable sequences. We then use the KL-Adaptive Distributional Policy Gradient algorithm (Khalifa et al., 2021) to train a generative model approximating the EBM. We conduct experiments showing that our proposed approach is able to improve compilability rates without sacrificing diversity and complexity of the generated samples.
△ Less
Submitted 9 June, 2021;
originally announced June 2021.
-
A Distributional Approach to Controlled Text Generation
Authors:
Muhammad Khalifa,
Hady Elsahar,
Marc Dymetman
Abstract:
We propose a Distributional Approach for addressing Controlled Text Generation from pre-trained Language Models (LMs). This approach permits to specify, in a single formal framework, both "pointwise" and "distributional" constraints over the target LM -- to our knowledge, the first model with such generality -- while minimizing KL divergence from the initial LM distribution. The optimal target dis…
▽ More
We propose a Distributional Approach for addressing Controlled Text Generation from pre-trained Language Models (LMs). This approach permits to specify, in a single formal framework, both "pointwise" and "distributional" constraints over the target LM -- to our knowledge, the first model with such generality -- while minimizing KL divergence from the initial LM distribution. The optimal target distribution is then uniquely determined as an explicit EBM (Energy-Based Model) representation. From that optimal representation we then train a target controlled Autoregressive LM through an adaptive distributional variant of Policy Gradient. We conduct a first set of experiments over pointwise constraints showing the advantages of our approach over a set of baselines, in terms of obtaining a controlled LM balancing constraint satisfaction with divergence from the initial LM. We then perform experiments over distributional constraints, a unique feature of our approach, demonstrating its potential as a remedy to the problem of Bias in Language Models. Through an ablation study, we show the effectiveness of our adaptive technique for obtaining faster convergence. (Code available at https://github.com/naver/gdc)
△ Less
Submitted 6 May, 2021; v1 submitted 21 December, 2020;
originally announced December 2020.
-
Distributional Reinforcement Learning for Energy-Based Sequential Models
Authors:
Tetiana Parshakova,
Jean-Marc Andreoli,
Marc Dymetman
Abstract:
Global Autoregressive Models (GAMs) are a recent proposal [Parshakova et al., CoNLL 2019] for exploiting global properties of sequences for data-efficient learning of seq2seq models. In the first phase of training, an Energy-Based model (EBM) over sequences is derived. This EBM has high representational power, but is unnormalized and cannot be directly exploited for sampling. To address this issue…
▽ More
Global Autoregressive Models (GAMs) are a recent proposal [Parshakova et al., CoNLL 2019] for exploiting global properties of sequences for data-efficient learning of seq2seq models. In the first phase of training, an Energy-Based model (EBM) over sequences is derived. This EBM has high representational power, but is unnormalized and cannot be directly exploited for sampling. To address this issue [Parshakova et al., CoNLL 2019] proposes a distillation technique, which can only be applied under limited conditions. By relating this problem to Policy Gradient techniques in RL, but in a \emph{distributional} rather than \emph{optimization} perspective, we propose a general approach applicable to any sequential EBM. Its effectiveness is illustrated on GAM-based experiments.
△ Less
Submitted 18 December, 2019;
originally announced December 2019.
-
Character-based NMT with Transformer
Authors:
Rohit Gupta,
Laurent Besacier,
Marc Dymetman,
Matthias Gallé
Abstract:
Character-based translation has several appealing advantages, but its performance is in general worse than a carefully tuned BPE baseline. In this paper we study the impact of character-based input and output with the Transformer architecture. In particular, our experiments on EN-DE show that character-based Transformer models are more robust than their BPE counterpart, both when translating noisy…
▽ More
Character-based translation has several appealing advantages, but its performance is in general worse than a carefully tuned BPE baseline. In this paper we study the impact of character-based input and output with the Transformer architecture. In particular, our experiments on EN-DE show that character-based Transformer models are more robust than their BPE counterpart, both when translating noisy text, and when translating text from a different domain. To obtain comparable BLEU scores in clean, in-domain data and close the gap with BPE-based models we use known techniques to train deeper Transformer models.
△ Less
Submitted 12 November, 2019;
originally announced November 2019.
-
Machine Translation of Restaurant Reviews: New Corpus for Domain Adaptation and Robustness
Authors:
Alexandre Bérard,
Ioan Calapodescu,
Marc Dymetman,
Claude Roux,
Jean-Luc Meunier,
Vassilina Nikoulina
Abstract:
We share a French-English parallel corpus of Foursquare restaurant reviews (https://europe.naverlabs.com/research/natural-language-processing/machine-translation-of-restaurant-reviews), and define a new task to encourage research on Neural Machine Translation robustness and domain adaptation, in a real-world scenario where better-quality MT would be greatly beneficial. We discuss the challenges of…
▽ More
We share a French-English parallel corpus of Foursquare restaurant reviews (https://europe.naverlabs.com/research/natural-language-processing/machine-translation-of-restaurant-reviews), and define a new task to encourage research on Neural Machine Translation robustness and domain adaptation, in a real-world scenario where better-quality MT would be greatly beneficial. We discuss the challenges of such user-generated content, and train good baseline models that build upon the latest techniques for MT robustness. We also perform an extensive evaluation (automatic and human) that shows significant improvements over existing online systems. Finally, we propose task-specific metrics based on sentiment analysis or translation accuracy of domain-specific polysemous words.
△ Less
Submitted 31 October, 2019;
originally announced October 2019.
-
Global Autoregressive Models for Data-Efficient Sequence Learning
Authors:
Tetiana Parshakova,
Jean-Marc Andreoli,
Marc Dymetman
Abstract:
Standard autoregressive seq2seq models are easily trained by max-likelihood, but tend to show poor results under small-data conditions. We introduce a class of seq2seq models, GAMs (Global Autoregressive Models), which combine an autoregressive component with a log-linear component, allowing the use of global \textit{a priori} features to compensate for lack of data. We train these models in two s…
▽ More
Standard autoregressive seq2seq models are easily trained by max-likelihood, but tend to show poor results under small-data conditions. We introduce a class of seq2seq models, GAMs (Global Autoregressive Models), which combine an autoregressive component with a log-linear component, allowing the use of global \textit{a priori} features to compensate for lack of data. We train these models in two steps. In the first step, we obtain an \emph{unnormalized} GAM that maximizes the likelihood of the data, but is improper for fast inference or evaluation. In the second step, we use this GAM to train (by distillation) a second autoregressive model that approximates the \emph{normalized} distribution associated with the GAM, and can be used for fast inference and evaluation. Our experiments focus on language modelling under synthetic conditions and show a strong perplexity reduction of using the second autoregressive model over the standard one.
△ Less
Submitted 19 September, 2019; v1 submitted 16 September, 2019;
originally announced September 2019.
-
Moment Matching Training for Neural Machine Translation: A Preliminary Study
Authors:
Cong Duy Vu Hoang,
Ioan Calapodescu,
Marc Dymetman
Abstract:
In previous works, neural sequence models have been shown to improve significantly if external prior knowledge can be provided, for instance by allowing the model to access the embeddings of explicit features during both training and inference. In this work, we propose a different point of view on how to incorporate prior knowledge in a principled way, using a moment matching framework. In this ap…
▽ More
In previous works, neural sequence models have been shown to improve significantly if external prior knowledge can be provided, for instance by allowing the model to access the embeddings of explicit features during both training and inference. In this work, we propose a different point of view on how to incorporate prior knowledge in a principled way, using a moment matching framework. In this approach, the standard local cross-entropy training of the sequential model is combined with a moment matching training mode that encourages the equality of the expectations of certain predefined features between the model distribution and the empirical distribution. In particular, we show how to derive unbiased estimates of some stochastic gradients that are central to the training, and compare our framework with a formally related one: policy gradient training in reinforcement learning, pointing out some important differences in terms of the kinds of prior assumptions in both approaches. Our initial results are promising, showing the effectiveness of our proposed framework.
△ Less
Submitted 27 December, 2018; v1 submitted 24 December, 2018;
originally announced December 2018.
-
Char2char Generation with Reranking for the E2E NLG Challenge
Authors:
Shubham Agarwal,
Marc Dymetman,
Eric Gaussier
Abstract:
This paper describes our submission to the E2E NLG Challenge. Recently, neural seq2seq approaches have become mainstream in NLG, often resorting to pre- (respectively post-) processing delexicalization (relexicalization) steps at the word-level to handle rare words. By contrast, we train a simple character level seq2seq model, which requires no pre/post-processing (delexicalization, tokenization o…
▽ More
This paper describes our submission to the E2E NLG Challenge. Recently, neural seq2seq approaches have become mainstream in NLG, often resorting to pre- (respectively post-) processing delexicalization (relexicalization) steps at the word-level to handle rare words. By contrast, we train a simple character level seq2seq model, which requires no pre/post-processing (delexicalization, tokenization or even lowercasing), with surprisingly good results. For further improvement, we explore two re-ranking approaches for scoring candidates. We also introduce a synthetic dataset creation procedure, which opens up a new way of creating artificial datasets for Natural Language Generation.
△ Less
Submitted 4 November, 2018;
originally announced November 2018.
-
Symbolic Priors for RNN-based Semantic Parsing
Authors:
Chunyang Xiao,
Marc Dymetman,
Claire Gardent
Abstract:
Seq2seq models based on Recurrent Neural Networks (RNNs) have recently received a lot of attention in the domain of Semantic Parsing for Question Answering. While in principle they can be trained directly on pairs (natural language utterances, logical forms), their performance is limited by the amount of available data. To alleviate this problem, we propose to exploit various sources of prior know…
▽ More
Seq2seq models based on Recurrent Neural Networks (RNNs) have recently received a lot of attention in the domain of Semantic Parsing for Question Answering. While in principle they can be trained directly on pairs (natural language utterances, logical forms), their performance is limited by the amount of available data. To alleviate this problem, we propose to exploit various sources of prior knowledge: the well-formedness of the logical forms is modeled by a weighted context-free grammar; the likelihood that certain entities present in the input utterance are also present in the logical form is modeled by weighted finite-state automata. The grammar and automata are combined together through an efficient intersection algorithm to form a soft guide ("background") to the RNN. We test our method on an extension of the Overnight dataset and show that it not only strongly improves over an RNN baseline, but also outperforms non-RNN models based on rich sets of hand-crafted features.
△ Less
Submitted 20 September, 2018;
originally announced September 2018.
-
Log-Linear RNNs: Towards Recurrent Neural Networks with Flexible Prior Knowledge
Authors:
Marc Dymetman,
Chunyang Xiao
Abstract:
We introduce LL-RNNs (Log-Linear RNNs), an extension of Recurrent Neural Networks that replaces the softmax output layer by a log-linear output layer, of which the softmax is a special case. This conceptually simple move has two main advantages. First, it allows the learner to combat training data sparsity by allowing it to model words (or more generally, output symbols) as complex combinations of…
▽ More
We introduce LL-RNNs (Log-Linear RNNs), an extension of Recurrent Neural Networks that replaces the softmax output layer by a log-linear output layer, of which the softmax is a special case. This conceptually simple move has two main advantages. First, it allows the learner to combat training data sparsity by allowing it to model words (or more generally, output symbols) as complex combinations of attributes without requiring that each combination is directly observed in the training data (as the softmax does). Second, it permits the inclusion of flexible prior knowledge in the form of a priori specified modular features, where the neural network component learns to dynamically control the weights of a log-linear distribution exploiting these features.
We conduct experiments in the domain of language modelling of French, that exploit morphological prior knowledge and show an important decrease in perplexity relative to a baseline RNN.
We provide other motivating iillustrations, and finally argue that the log-linear and the neural-network components contribute complementary strengths to the LL-RNN: the LL aspect allows the model to incorporate rich prior knowledge, while the NN aspect, according to the "representation learning" paradigm, allows the model to discover novel combination of characteristics.
△ Less
Submitted 16 December, 2016; v1 submitted 8 July, 2016;
originally announced July 2016.
-
LSTM-based Mixture-of-Experts for Knowledge-Aware Dialogues
Authors:
Phong Le,
Marc Dymetman,
Jean-Michel Renders
Abstract:
We introduce an LSTM-based method for dynamically integrating several word-prediction experts to obtain a conditional language model which can be good simultaneously at several subtasks. We illustrate this general approach with an application to dialogue where we integrate a neural chat model, good at conversational aspects, with a neural question-answering model, good at retrieving precise inform…
▽ More
We introduce an LSTM-based method for dynamically integrating several word-prediction experts to obtain a conditional language model which can be good simultaneously at several subtasks. We illustrate this general approach with an application to dialogue where we integrate a neural chat model, good at conversational aspects, with a neural question-answering model, good at retrieving precise information from a knowledge-base, and show how the integration combines the strengths of the independent components. We hope that this focused contribution will attract attention on the benefits of using such mixtures of experts in NLP.
△ Less
Submitted 5 May, 2016;
originally announced May 2016.
-
Assisting Composition of Email Responses: a Topic Prediction Approach
Authors:
Spandana Gella,
Marc Dymetman,
Jean Michel Renders,
Sriram Venkatapathy
Abstract:
We propose an approach for hel** agents compose email replies to customer requests. To enable that, we use LDA to extract latent topics from a collection of email exchanges. We then use these latent topics to label our data, obtaining a so-called "silver standard" topic labelling. We exploit this labelled set to train a classifier to: (i) predict the topic distribution of the entire agent's emai…
▽ More
We propose an approach for hel** agents compose email replies to customer requests. To enable that, we use LDA to extract latent topics from a collection of email exchanges. We then use these latent topics to label our data, obtaining a so-called "silver standard" topic labelling. We exploit this labelled set to train a classifier to: (i) predict the topic distribution of the entire agent's email response, based on features of the customer's email; and (ii) predict the topic distribution of the next sentence in the agent's reply, based on the customer's email features and on features of the agent's current sentence. The experimental results on a large email collection from a contact center in the tele- com domain show that the proposed ap- proach is effective in predicting the best topic of the agent's next sentence. In 80% of the cases, the correct topic is present among the top five recommended topics (out of fifty possible ones). This shows the potential of this method to be applied in an interactive setting, where the agent is presented a small list of likely topics to choose from for the next sentence.
△ Less
Submitted 7 October, 2015;
originally announced October 2015.
-
The OS* Algorithm: a Joint Approach to Exact Optimization and Sampling
Authors:
Marc Dymetman,
Guillaume Bouchard,
Simon Carter
Abstract:
Most current sampling algorithms for high-dimensional distributions are based on MCMC techniques and are approximate in the sense that they are valid only asymptotically. Rejection sampling, on the other hand, produces valid samples, but is unrealistically slow in high-dimension spaces. The OS* algorithm that we propose is a unified approach to exact optimization and sampling, based on incremental…
▽ More
Most current sampling algorithms for high-dimensional distributions are based on MCMC techniques and are approximate in the sense that they are valid only asymptotically. Rejection sampling, on the other hand, produces valid samples, but is unrealistically slow in high-dimension spaces. The OS* algorithm that we propose is a unified approach to exact optimization and sampling, based on incremental refinements of a functional upper bound, which combines ideas of adaptive rejection sampling and of A* optimization search. We show that the choice of the refinement can be done in a way that ensures tractability in high-dimension spaces, and we present first experiments in two different settings: inference in high-order HMMs and in large discrete graphical models.
△ Less
Submitted 3 July, 2012;
originally announced July 2012.
-
Some Remarks on the Geometry of Grammar
Authors:
Marc Dymetman
Abstract:
This paper, following (Dymetman:1998), presents an approach to grammar description and processing based on the geometry of cancellation diagrams, a concept which plays a central role in combinatorial group theory (Lyndon-Schuppe:1977). The focus here is on the geometric intuitions and on relating group-theoretical diagrams to the traditional charts associated with context-free grammars and type-…
▽ More
This paper, following (Dymetman:1998), presents an approach to grammar description and processing based on the geometry of cancellation diagrams, a concept which plays a central role in combinatorial group theory (Lyndon-Schuppe:1977). The focus here is on the geometric intuitions and on relating group-theoretical diagrams to the traditional charts associated with context-free grammars and type-0 rewriting systems. The paper is structured as follows. We begin in Section 1 by analyzing charts in terms of constructs called cells, which are a geometrical counterpart to rules. Then we move in Section 2 to a presentation of cancellation diagrams and show how they can be used computationally. In Section 3 we give a formal algebraic presentation of the concept of group computation structure, which is based on the standard notions of free group and conjugacy. We then relate in Section 4 the geometric and the algebraic views of computation by using the fundamental theorem of combinatorial group theory (Rotman:1994). In Section 5 we study in more detail the relationship between the two views on the basis of a simple grammar stated as a group computation structure. In section 6 we extend this grammar to handle non-local constructs such as relative pronouns and quantifiers. We conclude in Section 7 with some brief notes on the differences between normal submonoids and normal subgroups, group computation versus rewriting systems, and the use of group morphisms to study the computational complexity of parsing and generation.
△ Less
Submitted 5 March, 1999;
originally announced March 1999.
-
Group Theory and Grammatical Description
Authors:
Marc Dymetman
Abstract:
This paper presents a model for linguistic description based on group theory. A grammar in this model, or "G-grammar", is a collection of lexical expressions which are products of logical forms, phonological forms, and their inverses. Phrasal descriptions are obtained by forming products of lexical expressions and by cancelling contiguous elements which are inverses of each other. We show applic…
▽ More
This paper presents a model for linguistic description based on group theory. A grammar in this model, or "G-grammar", is a collection of lexical expressions which are products of logical forms, phonological forms, and their inverses. Phrasal descriptions are obtained by forming products of lexical expressions and by cancelling contiguous elements which are inverses of each other. We show applications of this model to parsing and generation, long-distance movement, and quantifier sco**. We believe that by moving from the free monoid over a vocabulary V --- standard in formal language studies --- to the free group over V, deep affinities between linguistic phenomena and classical algebra come to the surface, and that the consequences of tap** the mathematical connections thus established could be considerable.
△ Less
Submitted 7 May, 1998;
originally announced May 1998.
-
Charts, Interaction-Free Grammars, and the Compact Representation of Ambiguity
Authors:
Marc Dymetman
Abstract:
Recently researchers working in the LFG framework have proposed algorithms for taking advantage of the implicit context-free components of a unification grammar [Maxwell 96]. This paper clarifies the mathematical foundations of these techniques, provides a uniform framework in which they can be formally studied and eliminates the need for special purpose runtime data-structures recording ambigui…
▽ More
Recently researchers working in the LFG framework have proposed algorithms for taking advantage of the implicit context-free components of a unification grammar [Maxwell 96]. This paper clarifies the mathematical foundations of these techniques, provides a uniform framework in which they can be formally studied and eliminates the need for special purpose runtime data-structures recording ambiguity. The paper posits the identity: Ambiguous Feature Structures = Grammars, which states that (finitely) ambiguous representations are best seen as unification grammars of a certain type, here called ``interaction-free'' grammars, which generate in a backtrack-free way each of the feature structures subsumed by the ambiguous representation. This work extends a line of research [Billot and Lang 89, Lang 94] which stresses the connection between charts and grammars: a chart can be seen as a specialization of the reference grammar for a given input string. We show how this specialization grammar can be transformed into an interaction-free form which has the same practicality as a listing of the individual solutions, but is produced in less time and space.
△ Less
Submitted 12 May, 1997;
originally announced May 1997.
-
A Simple Transformation for Offline-Parsable Grammars and its Termination Properties
Authors:
Marc Dymetman
Abstract:
We present, in easily reproducible terms, a simple transformation for offline-parsable grammars which results in a provably terminating parsing program directly top-down interpretable in Prolog. The transformation consists in two steps: (1) removal of empty-productions, followed by: (2) left-recursion elimination. It is related both to left-corner parsing (where the grammar is compiled, rather t…
▽ More
We present, in easily reproducible terms, a simple transformation for offline-parsable grammars which results in a provably terminating parsing program directly top-down interpretable in Prolog. The transformation consists in two steps: (1) removal of empty-productions, followed by: (2) left-recursion elimination. It is related both to left-corner parsing (where the grammar is compiled, rather than interpreted through a parsing program, and with the advantage of guaranteed termination in the presence of empty productions) and to the Generalized Greibach Normal Form for DCGs (with the advantage of implementation simplicity).
△ Less
Submitted 14 May, 1996;
originally announced May 1996.
-
Extended Dependency Structures and their Formal Interpretation
Authors:
Marc Dymetman,
Max Copperman
Abstract:
We describe two ``semantically-oriented'' dependency-structure formalisms, U-forms and S-forms. U-forms have been previously used in machine translation as interlingual representations, but without being provided with a formal interpretation. S-forms, which we introduce in this paper, are a scoped version of U-forms, and we define a compositional semantics mechanism for them. Two types of semant…
▽ More
We describe two ``semantically-oriented'' dependency-structure formalisms, U-forms and S-forms. U-forms have been previously used in machine translation as interlingual representations, but without being provided with a formal interpretation. S-forms, which we introduce in this paper, are a scoped version of U-forms, and we define a compositional semantics mechanism for them. Two types of semantic composition are basic: complement incorporation and modifier incorporation. Binding of variables is done at the time of incorporation, permitting much flexibility in composition order and a simple account of the semantic effects of permuting several incorporations.
△ Less
Submitted 9 May, 1996; v1 submitted 29 April, 1996;
originally announced April 1996.
-
Towards an Automatic Dictation System for Translators: the TransTalk Project
Authors:
Marc Dymetman,
Julie Brousseau,
George Foster,
Pierre Isabelle,
Yves Normandin,
Pierre Plamondon
Abstract:
Professional translators often dictate their translations orally and have them typed afterwards. The TransTalk project aims at automating the second part of this process. Its originality as a dictation system lies in the fact that both the acoustic signal produced by the translator and the source text under translation are made available to the system. Probable translations of the source text ca…
▽ More
Professional translators often dictate their translations orally and have them typed afterwards. The TransTalk project aims at automating the second part of this process. Its originality as a dictation system lies in the fact that both the acoustic signal produced by the translator and the source text under translation are made available to the system. Probable translations of the source text can be predicted and these predictions used to help the speech recognition system in its lexical choices. We present the results of the first prototype, which show a marked improvement in the performance of the speech recognition task when translation predictions are taken into account.
△ Less
Submitted 28 September, 1994;
originally announced September 1994.