-
Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation
Authors:
Katherine M. Collins,
Najoung Kim,
Yonatan Bitton,
Verena Rieser,
Shayegan Omidshafiei,
Yushi Hu,
Sherol Chen,
Senjuti Dutta,
Minsuk Chang,
Kimin Lee,
Youwei Liang,
Georgina Evans,
Sahil Singla,
Gang Li,
Adrian Weller,
Junfeng He,
Deepak Ramachandran,
Krishnamurthy Dj Dvijotham
Abstract:
Human feedback plays a critical role in learning and refining reward models for text-to-image generation, but the optimal form the feedback should take for learning an accurate reward function has not been conclusively established. This paper investigates the effectiveness of fine-grained feedback which captures nuanced distinctions in image quality and prompt-alignment, compared to traditional coarse-grained feedback (for example, thumbs up/down or ranking between a set of options). While fine-grained feedback holds promise, particularly for systems catering to diverse societal preferences, we show that demonstrating its superiority to coarse-grained feedback is not automatic. Through experiments on real and synthetic preference data, we surface the complexities of building effective models due to the interplay of model choice, feedback type, and the alignment between human judgment and computational interpretation. We identify key challenges in eliciting and utilizing fine-grained feedback, prompting a reassessment of its assumed benefits and practicality. Our findings -- e.g., that fine-grained feedback can lead to worse models for a fixed budget, in some settings; however, in controlled settings with known attributes, fine-grained rewards can indeed be more helpful -- call for careful consideration of feedback attributes and potentially beckon novel modeling approaches to appropriately unlock the potential value of fine-grained feedback in-the-wild.
Submitted 24 June, 2024;
originally announced June 2024.
-
Online Bandit Learning with Offline Preference Data
Authors:
Akhil Agnihotri,
Rahul Jain,
Deepak Ramachandran,
Zheng Wen
Abstract:
Reinforcement Learning with Human Feedback (RLHF) is at the core of fine-tuning methods for generative AI models for language and images. Such feedback is often sought as rank or preference feedback from human raters, as opposed to eliciting scores since the latter tends to be very noisy. On the other hand, RL theory and algorithms predominantly assume that a reward feedback is available. In particular, approaches for online learning that can be helpful in adaptive data collection via active learning cannot incorporate offline preference data. In this paper, we adopt a finite-armed linear bandit model as a prototypical model of online learning. We consider an offline preference dataset to be available generated by an expert of unknown 'competence'. We propose $\texttt{warmPref-PS}$, a posterior sampling algorithm for online learning that can be warm-started with an offline dataset with noisy preference feedback. We show that by modeling the competence of the expert that generated it, we are able to use such a dataset most effectively. We support our claims with novel theoretical analysis of its Bayesian regret, as well as extensive empirical evaluation of an approximate algorithm which performs substantially better (almost 25 to 50% regret reduction in our studies) as compared to baselines.
Submitted 13 June, 2024;
originally announced June 2024.
-
e-COP: Episodic Constrained Optimization of Policies
Authors:
Akhil Agnihotri,
Rahul Jain,
Deepak Ramachandran,
Sahil Singla
Abstract:
In this paper, we present the $\texttt{e-COP}$ algorithm, the first policy optimization algorithm for constrained Reinforcement Learning (RL) in episodic (finite horizon) settings. Such formulations are applicable when there are separate sets of optimization criteria and constraints on a system's behavior. We approach this problem by first establishing a policy difference lemma for the episodic setting, which provides the theoretical foundation for the algorithm. Then, we propose to combine a set of established and novel solution ideas to yield the $\texttt{e-COP}$ algorithm that is easy to implement and numerically stable, and provide a theoretical guarantee on optimality under certain scaling assumptions. Through extensive empirical analysis using benchmarks in the Safety Gym suite, we show that our algorithm has similar or better performance than SoTA (non-episodic) algorithms adapted for the episodic setting. The scalability of the algorithm opens the door to its application in safety-constrained Reinforcement Learning from Human Feedback for Large Language or Diffusion Models.
Submitted 13 June, 2024;
originally announced June 2024.
-
Using Domain Knowledge to Guide Dialog Structure Induction via Neural Probabilistic Soft Logic
Authors:
Connor Pryor,
Quan Yuan,
Jeremiah Liu,
Mehran Kazemi,
Deepak Ramachandran,
Tania Bedrax-Weiss,
Lise Getoor
Abstract:
Dialog Structure Induction (DSI) is the task of inferring the latent dialog structure (i.e., a set of dialog states and their temporal transitions) of a given goal-oriented dialog. It is a critical component for modern dialog system design and discourse analysis. Existing DSI approaches are often purely data-driven, deploy models that infer latent states without access to domain knowledge, underperform when the training corpus is limited/noisy, or have difficulty when test dialogs exhibit distributional shifts from the training domain. This work explores a neural-symbolic approach as a potential solution to these problems. We introduce Neural Probabilistic Soft Logic Dialogue Structure Induction (NEUPSL DSI), a principled approach that injects symbolic knowledge into the latent space of a generative neural model. We conduct a thorough empirical investigation on the effect of NEUPSL DSI learning on hidden representation quality, few-shot learning, and out-of-domain generalization performance. Over three dialog structure induction datasets and across unsupervised and semi-supervised settings for standard and cross-domain generalization, the injection of symbolic knowledge using NEUPSL DSI provides a consistent boost in performance over the canonical baselines.
Submitted 26 March, 2024;
originally announced March 2024.
-
Understanding Subjectivity through the Lens of Motivational Context in Model-Generated Image Satisfaction
Authors:
Senjuti Dutta,
Sherol Chen,
Sunny Mak,
Amnah Ahmad,
Katherine Collins,
Alena Butryna,
Deepak Ramachandran,
Krishnamurthy Dvijotham,
Ellie Pavlick,
Ravi Rajakumar
Abstract:
Image generation models are poised to become ubiquitous in a range of applications. These models are often fine-tuned and evaluated using human quality judgments that assume a universal standard, failing to consider the subjectivity of such tasks. To investigate how to quantify subjectivity, and the scale of its impact, we measure how assessments differ among human annotators across different use cases. Simulating the effects of ordinarily latent elements of annotators' subjectivity, we contrive a set of motivations (t-shirt graphics, presentation visuals, and phone background images) to contextualize a set of crowdsourcing tasks. Our results show that human evaluations of images vary within individual contexts and across combinations of contexts. Three key factors affecting this subjectivity are image appearance, image alignment with text, and representation of objects mentioned in the text. Our study highlights the importance of taking individual users and contexts into account, both when building and evaluating generative models.
Submitted 26 February, 2024;
originally announced March 2024.
-
Prompt Expansion for Adaptive Text-to-Image Generation
Authors:
Siddhartha Datta,
Alexander Ku,
Deepak Ramachandran,
Peter Anderson
Abstract:
Text-to-image generation models are powerful but difficult to use. Users craft specific prompts to get better images, though the images can be repetitive. This paper proposes a Prompt Expansion framework that helps users generate high-quality, diverse images with less effort. The Prompt Expansion model takes a text query as input and outputs a set of expanded text prompts that are optimized such that, when passed to a text-to-image model, they generate a wider variety of appealing images. We conduct a human evaluation study that shows that images generated through Prompt Expansion are more aesthetically pleasing and diverse than those generated by baseline methods. Overall, this paper presents a novel and effective approach to improving the text-to-image generation experience.
Submitted 27 December, 2023;
originally announced December 2023.
-
Rich Human Feedback for Text-to-Image Generation
Authors:
Youwei Liang,
Junfeng He,
Gang Li,
Peizhao Li,
Arseniy Klimovskiy,
Nicholas Carolan,
Jiao Sun,
Jordi Pont-Tuset,
Sarah Young,
Feng Yang,
Junjie Ke,
Krishnamurthy Dj Dvijotham,
Katie Collins,
Yiwen Luo,
Yang Li,
Kai J Kohlhoff,
Deepak Ramachandran,
Vidhya Navalpakkam
Abstract:
Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models, prior works collected human-provided scores as feedback on generated images and trained a reward model to improve the T2I generation. In this paper, we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which words in the text prompt are misrepresented or missing on the image. We collect such rich human feedback on 18K generated images (RichHF-18K) and train a multimodal transformer to predict the rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks with predicted heatmaps to inpaint the problematic regions. Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which human feedback data were collected (Stable Diffusion variants). The RichHF-18K data set will be released in our GitHub repository: https://github.com/google-research/google-research/tree/master/richhf_18k.
Submitted 8 April, 2024; v1 submitted 15 December, 2023;
originally announced December 2023.
-
Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
Authors:
Jacob Eisenstein,
Chirag Nagpal,
Alekh Agarwal,
Ahmad Beirami,
Alex D'Amour,
DJ Dvijotham,
Adam Fisch,
Katherine Heller,
Stephen Pfohl,
Deepak Ramachandran,
Peter Shaw,
Jonathan Berant
Abstract:
Reward models play a key role in aligning language model applications towards human preferences. However, this setup creates an incentive for the language model to exploit errors in the reward model to achieve high estimated reward, a phenomenon often termed \emph{reward hacking}. A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate. We explore the application of reward ensembles to alignment at both training time (through reinforcement learning) and inference time (through reranking). First, we show that reward models are \emph{underspecified}: reward models that perform similarly in-distribution can yield very different rewards when used in alignment, due to distribution shift. Second, underspecification results in overoptimization, where alignment to one reward model does not improve reward as measured by another reward model trained on the same data. Third, overoptimization is mitigated by the use of reward ensembles, and ensembles that vary by their \emph{pretraining} seeds lead to better generalization than ensembles that differ only by their \emph{fine-tuning} seeds, with both outperforming individual reward models. However, even pretrain reward ensembles do not eliminate reward hacking: we show several qualitative reward hacking phenomena that are not mitigated by ensembling because all reward models in the ensemble exhibit similar error patterns.
Submitted 20 December, 2023; v1 submitted 14 December, 2023;
originally announced December 2023.
-
Modeling subjectivity (by Mimicking Annotator Annotation) in toxic comment identification across diverse communities
Authors:
Senjuti Dutta,
Sid Mittal,
Sherol Chen,
Deepak Ramachandran,
Ravi Rajakumar,
Ian Kivlichan,
Sunny Mak,
Alena Butryna,
Praveen Paritosh
Abstract:
The prevalence and impact of toxic discussions online have made content moderation crucial. Automated systems can play a vital role in identifying toxicity and reducing the reliance on human moderation. Nevertheless, identifying toxic comments for diverse communities continues to present challenges that are addressed in this paper. The two-part goal of this study is to (1) identify intuitive variances from annotator disagreement using quantitative analysis and (2) model the subjectivity of these viewpoints. To achieve our goal, we published a new dataset\footnote{\url{https://github.com/XXX}} with expert annotators' annotations and used two other public datasets to identify the subjectivity of toxicity. Then, leveraging a Large Language Model (LLM), we evaluate the model's ability to mimic diverse viewpoints on toxicity by varying the size of the training data and by testing both on the same set of annotators used during model training and on a separate set of annotators. We conclude that subjectivity is evident across all annotator groups, demonstrating the shortcomings of majority-rule voting. Moving forward, subjective annotations should serve as ground truth labels for training models for domains like toxicity in diverse communities.
Submitted 31 October, 2023;
originally announced November 2023.
-
Demystifying Embedding Spaces using Large Language Models
Authors:
Guy Tennenholtz,
Yinlam Chow,
Chih-Wei Hsu,
Jihwan Jeong,
Lior Shani,
Azamat Tulepbergenov,
Deepak Ramachandran,
Martin Mladenov,
Craig Boutilier
Abstract:
Embeddings have become a pivotal means to represent complex, multi-faceted information about entities, concepts, and relationships in a condensed and useful format. Nevertheless, they often preclude direct interpretation. While downstream tasks make use of these compressed representations, meaningful interpretation usually requires visualization using dimensionality reduction or specialized machine learning interpretability methods. This paper addresses the challenge of making such embeddings more interpretable and broadly useful, by employing Large Language Models (LLMs) to directly interact with embeddings -- transforming abstract vectors into understandable narratives. By injecting embeddings into LLMs, we enable querying and exploration of complex embedding data. We demonstrate our approach on a variety of diverse tasks, including: enhancing concept activation vectors (CAVs), communicating novel embedded entities, and decoding user preferences in recommender systems. Our work couples the immense information potential of embeddings with the interpretative power of LLMs.
Submitted 13 March, 2024; v1 submitted 6 October, 2023;
originally announced October 2023.
-
TaskLAMA: Probing the Complex Task Understanding of Language Models
Authors:
Quan Yuan,
Mehran Kazemi,
Xin Xu,
Isaac Noble,
Vaiva Imbrasaite,
Deepak Ramachandran
Abstract:
Structured Complex Task Decomposition (SCTD) is the problem of breaking down a complex real-world task (such as planning a wedding) into a directed acyclic graph over individual steps that contribute to achieving the task, with edges specifying temporal dependencies between them. SCTD is an important component of assistive planning tools, and a challenge for commonsense reasoning systems. We probe how accurately SCTD can be done with the knowledge extracted from Large Language Models (LLMs). We introduce a high-quality human-annotated dataset for this problem and novel metrics to fairly assess performance of LLMs against several baselines. Our experiments reveal that LLMs are able to decompose complex tasks into individual steps effectively, with a relative improvement of 15% to 280% over the best baseline. We also propose a number of approaches to further improve their performance, with a relative improvement of 7% to 37% over the base model. However, we find that LLMs still struggle to predict pairwise temporal dependencies, which reveals a gap in their understanding of complex tasks.
Submitted 29 August, 2023;
originally announced August 2023.
-
BoardgameQA: A Dataset for Natural Language Reasoning with Contradictory Information
Authors:
Mehran Kazemi,
Quan Yuan,
Deepti Bhatia,
Najoung Kim,
Xin Xu,
Vaiva Imbrasaite,
Deepak Ramachandran
Abstract:
Automated reasoning with unstructured natural text is a key requirement for many potential applications of NLP and for developing robust AI systems. Recently, Language Models (LMs) have demonstrated complex reasoning capacities even without any finetuning. However, existing evaluation for automated reasoning assumes access to a consistent and coherent set of information over which models reason. When reasoning in the real world, the available information is frequently inconsistent or contradictory, and therefore models need to be equipped with a strategy to resolve such conflicts when they arise. One widely-applicable way of resolving conflicts is to impose preferences over information sources (e.g., based on source credibility or information recency) and adopt the source with higher preference. In this paper, we formulate the problem of reasoning with contradictory information guided by preferences over sources as the classical problem of defeasible reasoning, and develop a dataset called BoardgameQA for measuring the reasoning capacity of LMs in this setting. BoardgameQA also incorporates reasoning with implicit background knowledge, to better reflect reasoning problems in downstream applications. We benchmark various LMs on BoardgameQA and the results reveal a significant gap in the reasoning capacity of state-of-the-art LMs on this problem, showing that reasoning with conflicting information does not surface out-of-the-box in LMs. While performance can be improved with finetuning, it nevertheless remains poor.
Submitted 13 June, 2023;
originally announced June 2023.
-
Nuclear T-violation search using octupole-deformed nuclei in a crystal
Authors:
Harish D. Ramachandran,
Amar C. Vutha
Abstract:
Precision measurements with atoms and molecules can search for subtle violations of time-reversal symmetry (T) in nuclei, and thereby probe a variety of new physics models. We present a detailed scheme for a nuclear T-violation search experiment using $^{153}$Eu$^{3+}$ ions doped in non-centrosymmetric sites within a Y$_2$SiO$_5$ crystal. The ions in this solid contain nuclei that are highly sensitive to T-violation, and avail of large atomic enhancements by being polarized within the solid. But in particular, the system and methods that we discuss here enable the use of vast numbers of nuclei trapped in crystals, while also offering a number of stringent tests to ward off systematic errors. Our approach maps out a path to probe new physics at the PeV energy scale.
Submitted 7 July, 2023; v1 submitted 20 April, 2023;
originally announced April 2023.
-
Coherent quantum beats: spectroscopy of energy differences masked by inhomogeneous broadening
Authors:
Harish D. Ramachandran,
Julia E. Ford,
Amar C. Vutha
Abstract:
Precision spectroscopy of solid-state systems is challenging due to inhomogeneous broadening. We describe a technique -- coherent quantum beats -- that enables the measurement of small frequency shifts within an inhomogeneously broadened distribution while addressing the full ensemble. We show that the technique can be used to obtain improvements in signal size and spectral resolution, offering advantages for precision measurements in solids.
Submitted 30 June, 2023; v1 submitted 12 April, 2023;
originally announced April 2023.
-
Pushing the Accuracy-Group Robustness Frontier with Introspective Self-play
Authors:
Jeremiah Zhe Liu,
Krishnamurthy Dj Dvijotham,
Jihyeon Lee,
Quan Yuan,
Martin Strobel,
Balaji Lakshminarayanan,
Deepak Ramachandran
Abstract:
Standard empirical risk minimization (ERM) training can produce deep neural network (DNN) models that are accurate on average but under-perform in under-represented population subgroups, especially when there are imbalanced group distributions in the long-tailed training data. Therefore, approaches that improve the accuracy-group robustness trade-off frontier of a DNN model (i.e. improving worst-group accuracy without sacrificing average accuracy, or vice versa) are of crucial importance. Uncertainty-based active learning (AL) can potentially improve the frontier by preferentially sampling underrepresented subgroups to create a more balanced training dataset. However, the quality of uncertainty estimates from modern DNNs tends to degrade in the presence of spurious correlations and dataset bias, compromising the effectiveness of AL for sampling tail groups. In this work, we propose Introspective Self-play (ISP), a simple approach to improve the uncertainty estimation of a deep neural network under dataset bias, by adding an auxiliary introspection task requiring a model to predict the bias for each data point in addition to the label. We show that ISP provably improves the bias-awareness of the model representation and the resulting uncertainty estimates. On two real-world tabular and language tasks, ISP serves as a simple "plug-in" for AL model training, consistently improving both the tail-group sampling rate and the final accuracy-fairness trade-off frontier of popular AL methods.
Submitted 11 February, 2023;
originally announced February 2023.
-
Understanding Finetuning for Factual Knowledge Extraction from Language Models
Authors:
Mehran Kazemi,
Sid Mittal,
Deepak Ramachandran
Abstract:
Language models (LMs) pretrained on large corpora of text from the web have been observed to contain large amounts of various types of knowledge about the world. This observation has led to a new and exciting paradigm in knowledge graph construction where, instead of manual curation or text mining, one extracts knowledge from the parameters of an LM. Recently, it has been shown that finetuning LMs on a set of factual knowledge makes them produce better answers to queries from a different set, thus making finetuned LMs a good candidate for knowledge extraction and, consequently, knowledge graph construction. In this paper, we analyze finetuned LMs for factual knowledge extraction. We show that along with its previously known positive effects, finetuning also leads to a (potentially harmful) phenomenon which we call Frequency Shock, where at test time the model over-predicts rare entities that appear in the training set and under-predicts common entities that do not appear in the training set enough times. We show that Frequency Shock leads to a degradation in the predictions of the model and, beyond a point, the harm from Frequency Shock can even outweigh the positive effects of finetuning, making finetuning harmful overall. We then consider two solutions to remedy the identified negative effect: (1) model mixing and (2) mixture finetuning with the LM's pre-training task. The two solutions combined lead to significant improvements compared to vanilla finetuning.
Submitted 26 January, 2023;
originally announced January 2023.
-
LAMBADA: Backward Chaining for Automated Reasoning in Natural Language
Authors:
Mehran Kazemi,
Najoung Kim,
Deepti Bhatia,
Xin Xu,
Deepak Ramachandran
Abstract:
Remarkable progress has been made on automated reasoning with natural text, by using Language Models (LMs) and methods such as Chain-of-Thought and Selection-Inference. These techniques search for proofs in the forward direction from axioms to the conclusion, which suffers from a combinatorial explosion of the search space, and thus high failure rates for problems requiring longer chains of reasoning. The classical automated reasoning literature has shown that reasoning in the backward direction (i.e. from the intended conclusion to supporting axioms) is significantly more efficient at proof-finding. Importing this intuition into the LM setting, we develop a Backward Chaining algorithm, called LAMBADA, that decomposes reasoning into four sub-modules. These sub-modules are simply implemented by few-shot prompted LM inference. We show that LAMBADA achieves sizable accuracy boosts over state-of-the-art forward reasoning methods on challenging logical reasoning datasets, particularly when deep and accurate proof chains are required.
Submitted 29 May, 2023; v1 submitted 20 December, 2022;
originally announced December 2022.
-
A Riemannian Genuine Measure of Entanglement for Pure States
Authors:
Dharmaraj Ramachandran,
Radhika Vathsan
Abstract:
While several measures exist for entanglement of multipartite pure states, a true entanglement measure for mixed states still eludes us. A deeper study of the geometry of quantum states may be the way to address this issue, in which context we come up with a measure for pure states based on a geodesic distance on the space of quantum states. Our measure satisfies all the desirable properties of a "Genuine Measure of Entanglement" (GME), and in comparison with some of the other existing measures, shows better smoothness and discriminance.
Submitted 13 January, 2024; v1 submitted 11 November, 2022;
originally announced November 2022.
-
BaF molecules in neon ice: trapping, spectroscopy and optical control of electron spins
Authors:
Samuel J. Li,
Harish D. Ramachandran,
Rhys Anderson,
Amar C. Vutha
Abstract:
We have trapped BaF molecules in neon ice, and used laser-induced fluorescence spectroscopy to map out optical transitions in the trapped molecules. Our measurements show that the neon lattice does not significantly perturb certain optical transitions in the trapped molecules. We used one of these transitions to polarize the electron spins, detect spin flips and measure hyperfine transitions in the trapped molecules, entirely using lasers. This demonstration with heavy polar molecules opens up new opportunities for precision measurements of beyond-standard-model physics.
Submitted 25 January, 2023; v1 submitted 14 July, 2022;
originally announced July 2022.
-
Tackling Provably Hard Representative Selection via Graph Neural Networks
Authors:
Mehran Kazemi,
Anton Tsitsulin,
Hossein Esfandiari,
MohammadHossein Bateni,
Deepak Ramachandran,
Bryan Perozzi,
Vahab Mirrokni
Abstract:
Representative Selection (RS) is the problem of finding a small subset of exemplars from a dataset that is representative of the dataset. In this paper, we study RS for attributed graphs, and focus on finding representative nodes that optimize the accuracy of a model trained on the selected representatives. Theoretically, we establish a new hardness result for RS (in the absence of a graph structure) by proving that a particular, highly practical variant of it (RS for Learning) is hard to approximate in polynomial time within any reasonable factor, which implies a significant potential gap between the optimum solution of widely-used surrogate functions and the actual accuracy of the model. We then study the setting where a (homophilous) graph structure is available, or can be constructed, between the data points. We show that with an appropriate modeling approach, the presence of such a structure can turn a hard RS (for learning) problem into one that can be effectively solved. To this end, we develop RS-GNN, a representation learning-based RS model based on Graph Neural Networks. Empirically, we demonstrate the effectiveness of RS-GNN on problems with predefined graph structures as well as problems with graphs induced from node feature similarities, by showing that RS-GNN achieves significant improvements over established baselines on a suite of eight benchmarks.
Submitted 19 July, 2023; v1 submitted 20 May, 2022;
originally announced May 2022.
-
FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue
Authors:
Alon Albalak,
Yi-Lin Tuan,
Pegah Jandaghi,
Connor Pryor,
Luke Yoffe,
Deepak Ramachandran,
Lise Getoor,
Jay Pujara,
William Yang Wang
Abstract:
Task transfer, transferring knowledge contained in related tasks, holds the promise of reducing the quantity of labeled data required to fine-tune language models. Dialogue understanding encompasses many diverse tasks, yet task transfer has not been thoroughly studied in conversational AI. This work explores conversational task transfer by introducing FETA: a benchmark for few-sample task transfer in open-domain dialogue. FETA contains two underlying sets of conversations upon which there are 10 and 7 tasks annotated, enabling the study of intra-dataset task transfer, that is, task transfer without domain adaptation. We utilize three popular language models and three learning algorithms to analyze the transferability between 132 source-target task pairs and create a baseline for future work. We run experiments in the single- and multi-source settings and report valuable findings, e.g., most performance trends are model-specific, and span extraction and multiple-choice tasks benefit the most from task transfer. In addition to task transfer, FETA can be a valuable resource for future research into the efficiency and generalizability of pre-training datasets and model architectures, as well as for learning settings such as continual and multitask learning.
Submitted 13 October, 2022; v1 submitted 12 May, 2022;
originally announced May 2022.
-
Discovering Personalized Semantics for Soft Attributes in Recommender Systems using Concept Activation Vectors
Authors:
Christina Göpfert,
Alex Haig,
Yinlam Chow,
Chih-wei Hsu,
Ivan Vendrov,
Tyler Lu,
Deepak Ramachandran,
Hubert Pham,
Mohammad Ghavamzadeh,
Craig Boutilier
Abstract:
Interactive recommender systems have emerged as a promising paradigm to overcome the limitations of the primitive user feedback used by traditional recommender systems (e.g., clicks, item consumption, ratings). They allow users to express intent, preferences, constraints, and contexts in a richer fashion, often using natural language (including faceted search and dialogue). Yet more research is needed to find the most effective ways to use this feedback. One challenge is inferring a user's semantic intent from the open-ended terms or attributes often used to describe a desired item, and using it to refine recommendation results. Leveraging concept activation vectors (CAVs) [26], a recently developed approach for model interpretability in machine learning, we develop a framework to learn a representation that captures the semantics of such attributes and connects them to user preferences and behaviors in recommender systems. One novel feature of our approach is its ability to distinguish objective and subjective attributes (both subjectivity of degree and of sense), and associate different senses of subjective attributes with different users. We demonstrate on both synthetic and real-world data sets that our CAV representation not only accurately interprets users' subjective semantics, but can also be used to improve recommendations through interactive item critiquing.
Submitted 2 June, 2023; v1 submitted 6 February, 2022;
originally announced February 2022.
-
Which Linguist Invented the Lightbulb? Presupposition Verification for Question-Answering
Authors:
Najoung Kim,
Ellie Pavlick,
Burcu Karagol Ayan,
Deepak Ramachandran
Abstract:
Many Question-Answering (QA) datasets contain unanswerable questions, but their treatment in QA systems remains primitive. Our analysis of the Natural Questions (Kwiatkowski et al. 2019) dataset reveals that a substantial portion of unanswerable questions ($\sim$21%) can be explained based on the presence of unverifiable presuppositions. We discuss the shortcomings of current models in handling such questions, and describe how an improved system could handle them. Through a user preference study, we demonstrate that the oracle behavior of our proposed system that provides responses based on presupposition failure is preferred over the oracle behavior of existing QA systems. Then we discuss how our proposed system could be implemented, presenting a novel framework that breaks down the problem into three steps: presupposition generation, presupposition verification and explanation generation. We report our progress in tackling each subproblem, and present a preliminary approach to integrating these steps into an existing QA system. We find that adding presuppositions and their verifiability to an existing model yields modest gains in downstream performance and unanswerability detection. The biggest bottleneck is the verification component, which needs to be substantially improved for the integrated system to approach ideal behavior -- even transfer from the best entailment models currently falls short.
Submitted 3 September, 2021; v1 submitted 2 January, 2021;
originally announced January 2021.
-
Do Language Embeddings Capture Scales?
Authors:
Xikun Zhang,
Deepak Ramachandran,
Ian Tenney,
Yanai Elazar,
Dan Roth
Abstract:
Pretrained Language Models (LMs) have been shown to possess significant linguistic, common sense, and factual knowledge. One form of knowledge that has not been studied yet in this context is information about the scalar magnitudes of objects. We show that pretrained language models capture a significant amount of this information but are short of the capability required for general common-sense reasoning. We identify contextual information in pre-training and numeracy as two key factors affecting their performance and show that a simple method of canonicalizing numbers can have a significant effect on the results.
Submitted 24 November, 2020; v1 submitted 11 October, 2020;
originally announced October 2020.
-
How Large Are Lions? Inducing Distributions over Quantitative Attributes
Authors:
Yanai Elazar,
Abhijit Mahabal,
Deepak Ramachandran,
Tania Bedrax-Weiss,
Dan Roth
Abstract:
Most current NLP systems have little knowledge about quantitative attributes of objects and events. We propose an unsupervised method for collecting quantitative information from large amounts of web data, and use it to create a new, very large resource consisting of distributions over physical quantities associated with objects, adjectives, and verbs which we call Distributions over Quantitative (DoQ). This contrasts with recent work in this area which has focused on making only relative comparisons such as "Is a lion bigger than a wolf?". Our evaluation shows that DoQ compares favorably with state of the art results on existing datasets for relative comparisons of nouns and adjectives, and on a new dataset we introduce.
Submitted 4 June, 2019;
originally announced June 2019.
-
An upper bound for the clique number using clique ceiling numbers
Authors:
R. Dharmarajan,
D. Ramachandran
Abstract:
In this article we present the idea of clique ceiling numbers of the vertices of a given graph that has a universal vertex. We follow up with a polynomial-time algorithm to compute an upper bound for the clique number of such a graph using clique ceiling numbers. We compare this algorithm with some upper bound formulas for the clique number.
Submitted 3 June, 2019;
originally announced June 2019.
-
On the tractability of the maximum clique problem
Authors:
R. Dharmarajan,
D. Ramachandran
Abstract:
The maximum clique problem is a classical NP-complete problem in graph theory and has important applications in many domains. In this paper we show, in a partially non-constructive way, the existence of an exact polynomial-time algorithm for this problem. We outline the algorithm in pseudo-code style. Then we prove its exactness and efficiency by analysis.
Submitted 17 May, 2019; v1 submitted 26 March, 2019;
originally announced March 2019.
-
A modified greedy algorithm to improve bounds for the vertex cover number
Authors:
R. Dharmarajan,
D. Ramachandran
Abstract:
In any attempt at designing an efficient algorithm for the minimum vertex cover problem, obtaining good upper and lower bounds for the vertex cover number could be crucial. In this article we present a modified greedy algorithm of worst-case time complexity $O(n^3)$ to obtain bounds for the vertex cover number of an input graph of order $n$. Using simple facts, the proposed algorithm computes a lower bound for the vertex cover number. Then, using this lower bound, it outputs a minimal vertex cover and hence gives an upper bound. The algorithm ensures the output vertex cover is always minimal, a feature that improves upon the existing greedy algorithms.
Submitted 3 January, 2019;
originally announced January 2019.