Search | arXiv e-print repository

Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks

Authors: Soumajyoti Sarkar, Leonard Lausen

Abstract: Tables stored in databases and tables which are present in web pages and articles account for a large part of semi-structured data that is available on the internet. It then becomes pertinent to develop a modeling approach with large language models (LLMs) that can be used to solve diverse table tasks such as semantic parsing, question answering as well as classification problems. Traditionally, there existed separate models specialized for each task individually. It raises the question of how far can we go to build a unified model that works well on some table tasks without significant degradation on others. To that end, we attempt at creating a shared modeling approach in the pretraining stage with encoder-decoder style LLMs that can cater to diverse tasks. We evaluate our approach that continually pretrains and finetunes different model families of T5 with data from tables and surrounding context, on these downstream tasks at different model scales. Through multiple ablation studies, we observe that our pretraining with self-supervised objectives can significantly boost the performance of the models on these tasks. As an example of one improvement, we observe that the instruction finetuned public models which come specialized on text question answering (QA) and have been trained on table data still have room for improvement when it comes to table specific QA. Our work is the first attempt at studying the advantages of a unified approach to table specific pretraining when scaled from 770M to 11B sequence to sequence models while also comparing the instruction finetuned variants of the models.

Submitted 1 October, 2023; originally announced October 2023.

arXiv:2307.08623 [pdf, other]

HYTREL: Hypergraph-enhanced Tabular Data Representation Learning

Authors: Pei Chen, Soumajyoti Sarkar, Leonard Lausen, Balasubramaniam Srinivasan, Sheng Zha, Ruihong Huang, George Karypis

Abstract: Language models pretrained on large collections of tabular data have demonstrated their effectiveness in several downstream tasks. However, many of these models do not take into account the row/column permutation invariances, hierarchical structure, etc. that exist in tabular data. To alleviate these limitations, we propose HYTREL, a tabular language model, that captures the permutation invariances and three more structural properties of tabular data by using hypergraphs - where the table cells make up the nodes and the cells occurring jointly together in each row, column, and the entire table are used to form three different types of hyperedges. We show that HYTREL is maximally invariant under certain conditions for tabular data, i.e., two tables obtain the same representations via HYTREL iff the two tables are identical up to permutations. Our empirical results demonstrate that HYTREL consistently outperforms other competitive baselines on four downstream tasks with minimal pretraining, illustrating the advantages of incorporating the inductive biases associated with tabular data into the representations. Finally, our qualitative analyses showcase that HYTREL can assimilate the table structures to generate robust representations for the cells, rows, columns, and the entire table.

Submitted 26 October, 2023; v1 submitted 14 July, 2023; originally announced July 2023.

Comments: NeurIPS 2023 (spotlight)

arXiv:2306.03438 [pdf, other]

Large Language Models of Code Fail at Completing Code with Potential Bugs

Authors: Tuan Dinh, Jinman Zhao, Samson Tan, Renato Negrinho, Leonard Lausen, Sheng Zha, George Karypis

Abstract: Large language models of code (Code-LLMs) have recently brought tremendous advances to code completion, a fundamental feature of programming assistance and code intelligence. However, most existing works ignore the possible presence of bugs in the code context for generation, which are inevitable in software development. Therefore, we introduce and study the buggy-code completion problem, inspired by the realistic scenario of real-time code suggestion where the code context contains potential bugs -- anti-patterns that can become bugs in the completed program. To systematically study the task, we introduce two datasets: one with synthetic bugs derived from semantics-altering operator changes (buggy-HumanEval) and one with realistic bugs derived from user submissions to coding problems (buggy-FixEval). We find that the presence of potential bugs significantly degrades the generation performance of the high-performing Code-LLMs. For instance, the passing rates of CODEGEN-2B-MONO on test cases of buggy-HumanEval drop more than 50% given a single potential bug in the context. Finally, we investigate several post-hoc methods for mitigating the adverse effect of potential bugs and find that there remains a significant gap in post-mitigation performance.

Submitted 30 November, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

Comments: 27 pages, accepted to NeurIPS 2023

arXiv:2306.00381 [pdf, other]

Better Context Makes Better Code Language Models: A Case Study on Function Call Argument Completion

Authors: Hengzhi Pei, Jinman Zhao, Leonard Lausen, Sheng Zha, George Karypis

Abstract: Pretrained code language models have enabled great progress towards program synthesis. However, common approaches only consider in-file local context and thus miss information and constraints imposed by other parts of the codebase and its external dependencies. Existing code completion benchmarks also lack such context. To resolve these restrictions we curate a new dataset of permissively licensed Python packages that includes full projects and their dependencies and provide tools to extract non-local information with the help of program analyzers. We then focus on the task of function call argument completion which requires predicting the arguments to function calls. We show that existing code completion models do not yield good results on our completion task. To better solve this task, we query a program analyzer for information relevant to a given function call, and consider ways to provide the analyzer results to different code completion models during inference and training. Our experiments show that providing access to the function implementation and function usages greatly improves the argument completion performance. Our ablation study provides further insights on how different types of information available from the program analyzer and different ways of incorporating the information affect the model performance.

Submitted 1 June, 2023; originally announced June 2023.

Comments: 12 pages. Accepted to AAAI 2023

ACM Class: I.2.2; I.2.7

arXiv:2211.03966 [pdf, ps, other]

Parameter and Data Efficient Continual Pre-training for Robustness to Dialectal Variance in Arabic

Authors: Soumajyoti Sarkar, Kaixiang Lin, Sailik Sengupta, Leonard Lausen, Sheng Zha, Saab Mansour

Abstract: The use of multilingual language models for tasks in low and high-resource languages has been a success story in deep learning. In recent times, Arabic has been receiving widespread attention on account of its dialectal variance. While prior research studies have tried to adapt these multilingual models for dialectal variants of Arabic, it still remains a challenging problem owing to the lack of sufficient monolingual dialectal data and parallel translation data of such dialectal variants. It remains an open problem on whether the limited dialectical data can be used to improve the models trained in Arabic on its dialectal variants. First, we show that multilingual-BERT (mBERT) incrementally pretrained on Arabic monolingual data takes less training time and yields comparable accuracy when compared to our custom monolingual Arabic model and beat existing models (by an avg metric of +$6.41$). We then explore two continual pre-training methods -- (1) using small amounts of dialectical data for continual finetuning and (2) parallel Arabic to English data and a Translation Language Modeling loss function. We show that both approaches help improve performance on dialectal classification tasks ($+4.64$ avg. gain) when used on monolingual models.

Submitted 7 November, 2022; originally announced November 2022.

arXiv:2204.11117 [pdf, other]

Exploring the Role of Task Transferability in Large-Scale Multi-Task Learning

Authors: Vishakh Padmakumar, Leonard Lausen, Miguel Ballesteros, Sheng Zha, He He, George Karypis

Abstract: Recent work has found that multi-task training with a large number of diverse tasks can uniformly improve downstream performance on unseen target tasks. In contrast, literature on task transferability has established that the choice of intermediate tasks can heavily affect downstream task performance. In this work, we aim to disentangle the effect of scale and relatedness of tasks in multi-task representation learning. We find that, on average, increasing the scale of multi-task learning, in terms of the number of tasks, indeed results in better learned representations than smaller multi-task setups. However, if the target tasks are known ahead of time, then training on a smaller set of related tasks is competitive to the large-scale multi-task training at a reduced computational cost.

Submitted 12 July, 2022; v1 submitted 23 April, 2022; originally announced April 2022.

Comments: NAACL 2022 - Camera ready version

arXiv:1907.04433 [pdf, other]

GluonCV and GluonNLP: Deep Learning in Computer Vision and Natural Language Processing

Authors: Jian Guo, He He, Tong He, Leonard Lausen, Mu Li, Haibin Lin, Xingjian Shi, Chenguang Wang, Junyuan Xie, Sheng Zha, Aston Zhang, Hang Zhang, Zhi Zhang, Zhongyue Zhang, Shuai Zheng, Yi Zhu

Abstract: We present GluonCV and GluonNLP, the deep learning toolkits for computer vision and natural language processing based on Apache MXNet (incubating). These toolkits provide state-of-the-art pre-trained models, training scripts, and training logs, to facilitate rapid prototyping and promote reproducible research. We also provide modular APIs with flexible building blocks to enable efficient customization. Leveraging the MXNet ecosystem, the deep learning models in GluonCV and GluonNLP can be deployed onto a variety of platforms with different programming languages. The Apache 2.0 license has been adopted by GluonCV and GluonNLP to allow for software distribution, modification, and usage.

Submitted 12 February, 2020; v1 submitted 9 July, 2019; originally announced July 2019.

Journal ref: Journal of Machine Learning Research 21 (2020) 1-7

arXiv:1712.05902 [pdf, other]

NSML: A Machine Learning Platform That Enables You to Focus on Your Models

Authors: Nako Sung, Minkyu Kim, Hyunwoo Jo, Youngil Yang, Jingwoong Kim, Leonard Lausen, Youngkwan Kim, Gayoung Lee, Donghyun Kwak, Jung-Woo Ha, Sunghun Kim

Abstract: Machine learning libraries such as TensorFlow and PyTorch simplify model implementation. However, researchers are still required to perform a non-trivial amount of manual tasks such as GPU allocation, training status tracking, and comparison of models with different hyperparameter settings. We propose a system to handle these tasks and help researchers focus on models. We present the requirements of the system based on a collection of discussions from an online study group comprising 25k members. These include automatic GPU allocation, learning status visualization, handling model parameter snapshots as well as hyperparameter modification during learning, and comparison of performance metrics between models via a leaderboard. We describe the system architecture that fulfills these requirements and present a proof-of-concept implementation, NAVER Smart Machine Learning (NSML). We test the system and confirm substantial efficiency improvements for model development.

Submitted 15 December, 2017; originally announced December 2017.

Comments: 8 pages, 4 figures

arXiv:1706.03458 [pdf, other]

Deep Learning for Precipitation Nowcasting: A Benchmark and A New Model

Authors: Xingjian Shi, Zhihan Gao, Leonard Lausen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, Wang-chun Woo

Abstract: With the goal of making high-resolution forecasts of regional rainfall, precipitation nowcasting has become an important and fundamental technology underlying various public services ranging from rainstorm warnings to flight safety. Recently, the Convolutional LSTM (ConvLSTM) model has been shown to outperform traditional optical flow based methods for precipitation nowcasting, suggesting that deep learning models have a huge potential for solving the problem. However, the convolutional recurrence structure in ConvLSTM-based models is location-invariant while natural motion and transformation (e.g., rotation) are location-variant in general. Furthermore, since deep-learning-based precipitation nowcasting is a newly emerging area, clear evaluation protocols have not yet been established. To address these problems, we propose both a new model and a benchmark for precipitation nowcasting. Specifically, we go beyond ConvLSTM and propose the Trajectory GRU (TrajGRU) model that can actively learn the location-variant structure for recurrent connections. Besides, we provide a benchmark that includes a real-world large-scale dataset from the Hong Kong Observatory, a new training loss, and a comprehensive evaluation protocol to facilitate future research and gauge the state of the art.

Submitted 5 October, 2017; v1 submitted 12 June, 2017; originally announced June 2017.

Comments: NIPS 2017 Spotlight

arXiv:1609.04695 [pdf]

doi 10.1088/2040-8978/18/2/024002

Excitation of surface plasmon polariton modes with multiple nitrogen vacancy centers in single nanodiamonds

Authors: Shailesh Kumar, Jens L. Lausen, Cesar E. Garcia-Ortiz, Sebastian K. H. Andersen, Alexander S. Roberts, Ilya P. Radko, Cameron L. C. Smith, Anders Kristensen, Sergey I. Bozhevolnyi

Abstract: Nitrogen-vacancy (NV) centers in diamonds are interesting due to their remarkable characteristics that are well suited to applications in quantum-information processing and magnetic field sensing, as well as representing stable fluorescent sources. Multiple NV centers in nanodiamonds (NDs) are especially useful as biological fluorophores due to their chemical neutrality, brightness and room-temperature photostability. Furthermore, NDs containing multiple NV centers also have potential in high-precision magnetic field and temperature sensing. Coupling NV centers to propagating surface plasmon polariton (SPP) modes gives a base for lab-on-a-chip sensing devices, allows enhanced fluorescence emission and collection which can further enhance the precision of NV-based sensors. Here, we investigate coupling of multiple NV centers in individual NDs to the SPP modes supported by silver surfaces protected by thin dielectric layers and by gold V-grooves (VGs) produced via the self-terminated silicon etching. In the first case, we concentrate on monitoring differences in fluorescence spectra obtained from a source ND, which is illuminated by a pump laser, and from a scattering ND illuminated only by the fluorescence-excited SPP radiation. In the second case, we observe changes in the average NV lifetime when the same ND is characterized outside and inside a VG. Fluorescence emission from the VG terminations is also observed, which confirms the NV coupling to the VG-supported SPP modes.

Submitted 15 September, 2016; originally announced September 2016.

Comments: 22 pages, 13 figures

Journal ref: J. Opt. 18 (2016) 024002

Showing 1–10 of 10 results for author: Lausen, L