-
Learning to Play 7 Wonders Duel Without Human Supervision
Authors:
Giovanni Paolini,
Lorenzo Moreschini,
Francesco Veneziano,
Alessandro Iraci
Abstract:
This paper introduces ZeusAI, an artificial intelligence system developed to play the board game 7 Wonders Duel. Inspired by the AlphaZero reinforcement learning algorithm, ZeusAI relies on a combination of Monte Carlo Tree Search and a Transformer Neural Network to learn the game without human supervision. ZeusAI competes at the level of top human players, develops both known and novel strategies, and allows us to test rule variants to improve the game's balance. This work demonstrates how AI can help in understanding and enhancing board games.
Submitted 2 June, 2024;
originally announced June 2024.
-
General Purpose Verification for Chain of Thought Prompting
Authors:
Robert Vacareanu,
Anurag Pratik,
Evangelia Spiliopoulou,
Zheng Qi,
Giovanni Paolini,
Neha Anna John,
Jie Ma,
Yassine Benajiba,
Miguel Ballesteros
Abstract:
Many of the recent capabilities demonstrated by Large Language Models (LLMs) arise primarily from their ability to exploit contextual information. In this paper, we explore ways to improve reasoning capabilities of LLMs through (1) exploration of different chains of thought and (2) validation of the individual steps of the reasoning process. We propose three general principles that a model should adhere to while reasoning: (i) Relevance, (ii) Mathematical Accuracy, and (iii) Logical Consistency. We apply these constraints to the reasoning steps generated by the LLM to improve the accuracy of the final generation. The constraints are applied in the form of verifiers: the model itself is asked to verify if the generated steps satisfy each constraint. To further steer the generations towards high-quality solutions, we use the perplexity of the reasoning steps as an additional verifier. We evaluate our method on 4 distinct types of reasoning tasks, spanning a total of 9 different datasets. Experiments show that our method is always better than vanilla generation, and, in 6 out of the 9 datasets, it is better than best-of-N sampling, which samples N reasoning chains and picks the lowest-perplexity generation.
Submitted 30 April, 2024;
originally announced May 2024.
-
Fewer Truncations Improve Language Modeling
Authors:
Hantian Ding,
Zijian Wang,
Giovanni Paolini,
Varun Kumar,
Anoop Deoras,
Dan Roth,
Stefano Soatto
Abstract:
In large language model training, input documents are typically concatenated together and then split into sequences of equal length to avoid padding tokens. Despite its efficiency, the concatenation approach compromises data integrity -- it inevitably breaks many documents into incomplete pieces, leading to excessive truncations that hinder the model from learning to compose logically coherent and factually consistent content that is grounded on the complete context. To address the issue, we propose Best-fit Packing, a scalable and efficient method that packs documents into training sequences through length-aware combinatorial optimization. Our method completely eliminates unnecessary truncations while retaining the same training efficiency as concatenation. Empirical results from both text and code pre-training show that our method achieves superior performance (e.g., relatively +4.7% on reading comprehension; +16.8% in context following; and +9.2% on program synthesis), and reduces closed-domain hallucination effectively by up to 58.3%.
Submitted 2 May, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
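As a rough illustration of the packing idea in the abstract above, the sketch below uses the textbook best-fit-decreasing bin-packing heuristic. It is an assumption made for illustration, not the paper's implementation: the function name `best_fit_decreasing` and its parameters are hypothetical, and documents longer than the sequence length are simply split into full-length chunks first.

```python
from bisect import bisect_left, insort


def best_fit_decreasing(doc_lengths, max_len):
    """Pack document lengths into sequences of capacity max_len.

    Hypothetical helper: returns a list of bins, each a list of chunk lengths.
    """
    # Split over-long documents into max_len-sized chunks plus a remainder.
    chunks = []
    for n in doc_lengths:
        while n > max_len:
            chunks.append(max_len)
            n -= max_len
        if n > 0:
            chunks.append(n)

    # Place the largest chunks first.
    chunks.sort(reverse=True)

    bins = []  # bins[i] is the list of chunk lengths in sequence i
    free = []  # sorted list of (remaining_space, bin_index)
    for c in chunks:
        # Find the bin with the smallest remaining space that still fits c.
        i = bisect_left(free, (c, -1))
        if i == len(free):
            # No existing bin fits: open a new one.
            bins.append([c])
            idx = len(bins) - 1
            rem = max_len - c
        else:
            rem, idx = free.pop(i)
            bins[idx].append(c)
            rem -= c
        if rem > 0:
            insort(free, (rem, idx))
    return bins
```

For example, `best_fit_decreasing([5, 3, 2, 7, 4], 8)` packs five documents into three sequences without cutting any of them.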
-
Taxonomy Expansion for Named Entity Recognition
Authors:
Karthikeyan K,
Yogarshi Vyas,
Jie Ma,
Giovanni Paolini,
Neha Anna John,
Shuai Wang,
Yassine Benajiba,
Vittorio Castelli,
Dan Roth,
Miguel Ballesteros
Abstract:
Training a Named Entity Recognition (NER) model often involves fixing a taxonomy of entity types. However, requirements evolve and we might need the NER model to recognize additional entity types. A simple approach is to re-annotate the entire dataset with both existing and additional entity types and then train the model on the re-annotated dataset. However, this is an extremely laborious task. To remedy this, we propose a novel approach called Partial Label Model (PLM) that uses only partially annotated datasets. We experiment with 6 diverse datasets and show that PLM consistently performs better than most other approaches (0.5 - 2.5 F1), including in novel settings for taxonomy expansion not considered in prior work. The gap between PLM and all other approaches is especially large in settings where there is limited data available for the additional entity types (as much as 11 F1), thus suggesting a more cost-effective approach to taxonomy expansion.
Submitted 22 May, 2023;
originally announced May 2023.
-
A Weak Supervision Approach for Few-Shot Aspect Based Sentiment
Authors:
Robert Vacareanu,
Siddharth Varia,
Kishaloy Halder,
Shuai Wang,
Giovanni Paolini,
Neha Anna John,
Miguel Ballesteros,
Smaranda Muresan
Abstract:
We explore how weak supervision on abundant unlabeled data can be leveraged to improve few-shot performance in aspect-based sentiment analysis (ABSA) tasks. We propose a pipeline approach to construct a noisy ABSA dataset, and we use it to adapt a pre-trained sequence-to-sequence model to the ABSA tasks. We test the resulting model on three widely used ABSA datasets, before and after fine-tuning. Our proposed method preserves the full fine-tuning performance while showing significant improvements (15.84% absolute F1) in the few-shot learning scenario for the harder tasks. In zero-shot (i.e., without fine-tuning), our method outperforms the previous state of the art on the aspect extraction sentiment classification (AESC) task and is, additionally, capable of performing the harder aspect sentiment triplet extraction (ASTE) task.
Submitted 19 May, 2023;
originally announced May 2023.
-
À-la-carte Prompt Tuning (APT): Combining Distinct Data Via Composable Prompting
Authors:
Benjamin Bowman,
Alessandro Achille,
Luca Zancato,
Matthew Trager,
Pramuditha Perera,
Giovanni Paolini,
Stefano Soatto
Abstract:
We introduce À-la-carte Prompt Tuning (APT), a transformer-based scheme to tune prompts on distinct data so that they can be arbitrarily composed at inference time. The individual prompts can be trained in isolation, possibly on different devices, at different times, and on different distributions or domains. Furthermore, each prompt only contains information about the subset of data it was exposed to during training. During inference, models can be assembled based on arbitrary selections of data sources, which we call "à-la-carte learning". À-la-carte learning enables constructing bespoke models specific to each user's individual access rights and preferences. We can add or remove information from the model by simply adding or removing the corresponding prompts without retraining from scratch. We demonstrate that à-la-carte built models achieve accuracy within $5\%$ of models trained on the union of the respective sources, with comparable cost in terms of training and inference time. For the continual learning benchmarks Split CIFAR-100 and CORe50, we achieve state-of-the-art performance.
Submitted 15 February, 2023;
originally announced February 2023.
-
Stacked Residuals of Dynamic Layers for Time Series Anomaly Detection
Authors:
L. Zancato,
A. Achille,
G. Paolini,
A. Chiuso,
S. Soatto
Abstract:
We present an end-to-end differentiable neural network architecture to perform anomaly detection in multivariate time series by incorporating a Sequential Probability Ratio Test on the prediction residual. The architecture is a cascade of dynamical systems designed to separate linearly predictable components of the signal, such as trends and seasonality, from the non-linear ones. The former are modeled by local Linear Dynamic Layers, and their residual is fed to a generic Temporal Convolutional Network that also aggregates global statistics from different time series as context for the local predictions of each one. The last layer implements the anomaly detector, which exploits the temporal structure of the prediction residuals to detect both isolated point anomalies and set-point changes. It is based on a novel application of the classic CUSUM algorithm, adapted through the use of a variational approximation of f-divergences. The model automatically adapts to the time scales of the observed signals. It approximates a SARIMA model at the get-go, and auto-tunes to the statistics of the signal and its covariates, without the need for supervision, as more data is observed. The resulting system, which we call STRIC, outperforms both state-of-the-art robust statistical methods and deep neural network architectures on multiple anomaly detection benchmarks.
Submitted 24 February, 2022;
originally announced February 2022.
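For orientation, the classic cumulative-sum (CUSUM) change detector that the abstract above builds on can be sketched as below. This is the textbook two-sided version applied to a sequence of prediction residuals, not the paper's variational adaptation; the function name `cusum_alarms` and the `drift` and `threshold` parameters are assumptions chosen for illustration.

```python
def cusum_alarms(residuals, drift=0.5, threshold=4.0):
    """Return the time indices at which a two-sided CUSUM test raises an alarm.

    Hypothetical helper: `drift` is the slack subtracted at every step and
    `threshold` is the level the cumulative statistic must exceed.
    """
    g_pos = 0.0  # accumulates evidence of an upward shift
    g_neg = 0.0  # accumulates evidence of a downward shift
    alarms = []
    for t, r in enumerate(residuals):
        g_pos = max(0.0, g_pos + r - drift)
        g_neg = max(0.0, g_neg - r - drift)
        if g_pos > threshold or g_neg > threshold:
            alarms.append(t)
            # Restart both statistics after an alarm.
            g_pos = 0.0
            g_neg = 0.0
    return alarms
```

With residuals that sit at zero and then jump to a constant offset, the statistic needs a few steps to accumulate before it crosses the threshold, which is what lets it separate sustained set-point changes from noise.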
-
DIVA: Dataset Derivative of a Learning Task
Authors:
Yonatan Dukler,
Alessandro Achille,
Giovanni Paolini,
Avinash Ravichandran,
Marzia Polito,
Stefano Soatto
Abstract:
We present a method to compute the derivative of a learning task with respect to a dataset. A learning task is a function from a training set to the validation error, which can be represented by a trained deep neural network (DNN). The "dataset derivative" is a linear operator, computed around the trained model, that informs how perturbations of the weight of each training sample affect the validation error, usually computed on a separate validation dataset. Our method, DIVA (Differentiable Validation) hinges on a closed-form differentiable expression of the leave-one-out cross-validation error around a pre-trained DNN. Such expression constitutes the dataset derivative. DIVA could be used for dataset auto-curation, for example removing samples with faulty annotations, augmenting a dataset with additional relevant samples, or rebalancing. More generally, DIVA can be used to optimize the dataset, along with the parameters of the model, as part of the training process without the need for a separate validation dataset, unlike bi-level optimization methods customary in AutoML. To illustrate the flexibility of DIVA, we report experiments on sample auto-curation tasks such as outlier rejection, dataset extension, and automatic aggregation of multi-modal data.
Submitted 18 November, 2021;
originally announced November 2021.
-
Estimating informativeness of samples with Smooth Unique Information
Authors:
Hrayr Harutyunyan,
Alessandro Achille,
Giovanni Paolini,
Orchid Majumder,
Avinash Ravichandran,
Rahul Bhotika,
Stefano Soatto
Abstract:
We define a notion of information that an individual sample provides to the training of a neural network, and we specialize it to measure both how much a sample informs the final weights and how much it informs the function computed by the weights. Though related, we show that these quantities have a qualitatively different behavior. We give efficient approximations of these quantities using a linearized network and demonstrate empirically that the approximation is accurate for real-world architectures, such as pre-trained ResNets. We apply these measures to several problems, such as dataset summarization, analysis of under-sampled classes, comparison of informativeness of different data sources, and detection of adversarial and corrupted examples. Our work generalizes existing frameworks but enjoys better computational properties for heavily over-parametrized models, which makes it possible to apply it to real-world networks.
Submitted 28 March, 2021; v1 submitted 17 January, 2021;
originally announced January 2021.
-
Structured Prediction as Translation between Augmented Natural Languages
Authors:
Giovanni Paolini,
Ben Athiwaratkun,
Jason Krone,
Jie Ma,
Alessandro Achille,
Rishita Anubhai,
Cicero Nogueira dos Santos,
Bing Xiang,
Stefano Soatto
Abstract:
We propose a new framework, Translation between Augmented Natural Languages (TANL), to solve many structured prediction language tasks including joint entity and relation extraction, nested named entity recognition, relation classification, semantic role labeling, event extraction, coreference resolution, and dialogue state tracking. Instead of tackling the problem by training task-specific discriminative classifiers, we frame it as a translation task between augmented natural languages, from which the task-relevant information can be easily extracted. Our approach can match or outperform task-specific models on all tasks, and in particular, achieves new state-of-the-art results on joint entity and relation extraction (CoNLL04, ADE, NYT, and ACE2005 datasets), relation classification (FewRel and TACRED), and semantic role labeling (CoNLL-2005 and CoNLL-2012). We accomplish this while using the same architecture and hyperparameters for all tasks and even when training a single model to solve all tasks at the same time (multi-task learning). Finally, we show that our framework can also significantly improve the performance in a low-resource regime, thanks to better use of label semantics.
Submitted 2 December, 2021; v1 submitted 14 January, 2021;
originally announced January 2021.
-
Where is the Information in a Deep Neural Network?
Authors:
Alessandro Achille,
Giovanni Paolini,
Stefano Soatto
Abstract:
Whatever information a deep neural network has gleaned from training data is encoded in its weights. How this information affects the response of the network to future data remains largely an open question. Indeed, even defining and measuring information entails some subtleties, since a trained network is a deterministic map, so standard information measures can be degenerate. We measure information in a neural network via the optimal trade-off between accuracy of the response and complexity of the weights, measured by their coding length. Depending on the choice of code, the definition can reduce to standard measures such as Shannon Mutual Information and Fisher Information. However, the more general definition allows us to relate information to generalization and invariance, through a novel notion of effective information in the activations of a deep network. We establish a novel relation between the information in the weights and the effective information in the activations, and use this result to show that models with low (information) complexity not only generalize better, but are bound to learn invariant representations of future inputs. These relations hinge not only on the architecture of the model, but also on how it is trained, highlighting the complex inter-dependency between the class of functions implemented by deep neural networks, the loss function used for training them from finite data, and the inductive bias implicit in the optimization.
Submitted 21 June, 2020; v1 submitted 29 May, 2019;
originally announced May 2019.
-
The Information Complexity of Learning Tasks, their Structure and their Distance
Authors:
Alessandro Achille,
Giovanni Paolini,
Glen Mbeng,
Stefano Soatto
Abstract:
We introduce an asymmetric distance in the space of learning tasks, and a framework to compute their complexity. These concepts are foundational for the practice of transfer learning, whereby a parametric model is pre-trained for a task, and then fine-tuned for another. The framework we develop is non-asymptotic, captures the finite nature of the training dataset, and allows distinguishing learning from memorization. It encompasses, as special cases, classical notions from Kolmogorov complexity, Shannon, and Fisher Information. However, unlike some of those frameworks, it can be applied to large-scale models and real-world datasets. Our framework is the first to measure complexity in a way that accounts for the effect of the optimization scheme, which is critical in Deep Learning.
Submitted 14 July, 2020; v1 submitted 5 April, 2019;
originally announced April 2019.
-
Collapsibility to a subcomplex of a given dimension is NP-complete
Authors:
Giovanni Paolini
Abstract:
In this paper we extend the works of Tancer and of Malgouyres and Francés, showing that $(d,k)$-collapsibility is NP-complete for $d\geq k+2$ except $(2,0)$. By $(d,k)$-collapsibility we mean the following problem: determine whether a given $d$-dimensional simplicial complex can be collapsed to some $k$-dimensional subcomplex. The question of establishing the complexity status of $(d,k)$-collapsibility was asked by Tancer, who proved NP-completeness of $(d,0)$ and $(d,1)$-collapsibility (for $d\geq 3$). Our extended result, together with the known polynomial-time algorithms for $(2,0)$ and $d=k+1$, answers the question completely.
Submitted 5 April, 2019; v1 submitted 20 March, 2017;
originally announced March 2017.
-
An algorithm for canonical forms of finite subsets of $\mathbb{Z}^d$ up to affinities
Authors:
Giovanni Paolini
Abstract:
In this paper we describe an algorithm for the computation of canonical forms of finite subsets of $\mathbb{Z}^d$, up to affinities over $\mathbb{Z}$. For fixed dimension $d$, this algorithm has worst-case asymptotic complexity $O(n \log^2 n \, s\,\mu(s))$, where $n$ is the number of points in the given subset, $s$ is an upper bound to the size of the binary representation of any of the $n$ points, and $\mu(s)$ is an upper bound to the number of operations required to multiply two $s$-bit numbers. In particular, the problem is fixed-parameter tractable with respect to the dimension $d$. This problem arises e.g. in the context of computation of invariants of finitely presented groups with abelianized group isomorphic to $\mathbb{Z}^d$. In that context one needs to decide whether two Laurent polynomials in $d$ indeterminates, considered as elements of the group ring over the abelianized group, are equivalent with respect to a change of basis.
Submitted 27 September, 2018; v1 submitted 14 August, 2014;
originally announced August 2014.