Better Synthetic Data by Retrieving and Transforming Existing Datasets

Authors: Saumya Gandhi, Ritu Gala, Vijay Viswanathan, Tongshuang Wu, Graham Neubig

Abstract: Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.

Submitted 26 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

Comments: PDF fixed in v3

arXiv:2307.04205 [pdf, other]

Extending the Forward Forward Algorithm

Authors: Saumya Gandhi, Ritu Gala, Jonah Kornberg, Advaith Sridhar

Abstract: The Forward Forward algorithm, proposed by Geoffrey Hinton in November 2022, is a novel method for training neural networks as an alternative to backpropagation. In this project, we replicate Hinton's experiments on the MNIST dataset, and subsequently extend the scope of the method with two significant contributions. First, we establish a baseline performance for the Forward Forward network on the IMDb movie reviews dataset. As far as we know, our results on this sentiment analysis task mark the first instance of the algorithm's extension beyond computer vision. Second, we introduce a novel pyramidal optimization strategy for the loss threshold, a hyperparameter specific to the Forward Forward method. Our pyramidal approach shows that a good thresholding strategy causes a difference of up to 8% in test error. Lastly, we perform visualizations of the trained parameters and derive several significant insights, such as a notably larger (10-20x) mean and variance in the weights acquired by the Forward Forward network. Repository: https://github.com/Ads-cmu/ForwardForward

Submitted 14 July, 2023; v1 submitted 9 July, 2023; originally announced July 2023.

arXiv:2106.03036 [pdf]

Real-Time Cognitive Evaluation of Online Learners through Automatically Generated Questions

Authors: Ritu Gala, Revathi Vijayaraghavan, Valmik Nikam, Arvind Kiwelekar

Abstract: With the increased adoption of E-learning platforms, keeping online learners engaged throughout a lesson is challenging. One approach to tackle this challenge is to probe learners periodically by asking questions. The paper presents an approach to generate questions from a given video lecture automatically. The generated questions are intended to evaluate learners' lower-level cognitive abilities. The approach automatically extracts text from video lectures to generate wh-type questions. When learners respond with an answer, the proposed approach further evaluates the response and provides feedback. Besides enhancing learners' engagement, the main benefit of this approach is that it frees instructors from designing questions to check the comprehension of a topic. Thus, instructors can spend this time productively on other activities.

Submitted 6 June, 2021; originally announced June 2021.

arXiv:2105.12504 [pdf]

doi 10.5121/csit.2021.110605

Blockchain-Based Approach to Foster Student Engagement on Campus

Authors: Ritu Gala, Eshita Shukla, Nidhee Kamble, Revathi Vijayaraghavan, Dhiren Patel

Abstract: On-campus activities like positions of responsibility in campus amenities and participation in research benefit the students as well as the university, while also making students financially self-sufficient to a certain extent. However, this student participation is stymied by lack of awareness and motivation. Significant impetus to innovation and student participation can be provided by incentivization of these activities. In this paper, we propose a system to create a blockchain-based economy, to incentivize students with empirical benefits or monetary awards calculated using objective algorithms. The incentivization algorithms have been designed for three promising use cases: research work, positions of responsibility in universities, and crowdfunding. The demonstrated implementation of this system utilises VJTI Chain, an already established Proof of Authority blockchain in VJTI Mumbai, India. This creates a circular economy within the university which encourages students to earn more rewards by reinforcing positive feedback.

Submitted 26 May, 2021; originally announced May 2021.

Comments: 13 pages, 6 figures

Journal ref: International Conference of Education (CONEDU 2021), May 22–23, 2021, Zurich, Switzerland. Volume 11, Number 06, pp. 53–65

arXiv:2007.09880 [pdf, other]

Mixture Representation Learning with Coupled Autoencoders

Authors: Yeganeh M. Marghi, Rohan Gala, Uygar Sümbül

Abstract: Jointly identifying a mixture of discrete and continuous factors of variability without supervision is a key problem in unraveling complex phenomena. Variational inference has emerged as a promising method to learn interpretable mixture representations. However, posterior approximation in high-dimensional latent spaces, particularly for discrete factors, remains challenging. Here, we propose an unsupervised variational framework using multiple interacting networks called cpl-mixVAE that scales well to high-dimensional discrete settings. In this framework, the mixture representation of each network is regularized by imposing a consensus constraint on the discrete factor. We justify the use of this framework by providing both theoretical and experimental results. Finally, we use the proposed method to jointly uncover discrete and continuous factors of variability describing gene expression in a single-cell transcriptomic dataset profiling more than a hundred cortical neuron types.

Submitted 12 April, 2021; v1 submitted 20 July, 2020; originally announced July 2020.

Comments: 10 pages, 6 figures, conference

arXiv:1911.05663 [pdf, other]

A coupled autoencoder approach for multi-modal analysis of cell types

Authors: Rohan Gala, Nathan Gouwens, Zizhen Yao, Agata Budzillo, Osnat Penn, Bosiljka Tasic, Gabe Murphy, Hongkui Zeng, Uygar Sümbül

Abstract: Recent developments in high throughput profiling of individual neurons have spurred data driven exploration of the idea that there exist natural groupings of neurons referred to as cell types. The promise of this idea is that the immense complexity of brain circuits can be reduced, and effectively studied by means of interactions between cell types. While clustering of neuron populations based on a particular data modality can be used to define cell types, such definitions are often inconsistent across different characterization modalities. We pose this issue of cross-modal alignment as an optimization problem and develop an approach based on coupled training of autoencoders as a framework for such analyses. We apply this framework to a Patch-seq dataset consisting of transcriptomic and electrophysiological profiles for the same set of neurons to study consistency of representations across modalities, and evaluate cross-modal data prediction ability. We explore the problem where only a subset of neurons is characterized with more than one modality, and demonstrate that representations learned by coupled autoencoders can be used to identify types sampled only by a single modality.

Submitted 5 November, 2019; originally announced November 2019.

Comments: Main text : 10 pages, 5 figures. Supp text : 6 pages, 3 figures

Showing 1–6 of 6 results for author: Gala, R