-
Point-SAM: Promptable 3D Segmentation Model for Point Clouds
Authors:
Yuchen Zhou,
Jiayuan Gu,
Tung Yen Chiang,
Fanbo Xiang,
Hao Su
Abstract:
The development of 2D foundation models for image segmentation has been significantly advanced by the Segment Anything Model (SAM). However, achieving similar success in 3D models remains a challenge due to issues such as non-unified data formats, lightweight models, and the scarcity of labeled data with diverse masks. To this end, we propose a 3D promptable segmentation model (Point-SAM) focusing on point clouds. Our approach utilizes a transformer-based method, extending SAM to the 3D domain. We leverage part-level and object-level annotations and introduce a data engine to generate pseudo labels from SAM, thereby distilling 2D knowledge into our 3D model. Our model outperforms state-of-the-art models on several indoor and outdoor benchmarks and demonstrates a variety of applications, such as 3D annotation. Code and a demo can be found at https://github.com/zyc00/Point-SAM.
Submitted 25 June, 2024;
originally announced June 2024.
-
Measuring model variability using robust non-parametric testing
Authors:
Sinjini Banerjee,
Tim Marrinan,
Reilly Cannon,
Tony Chiang,
Anand D. Sarwate
Abstract:
Training a deep neural network often involves stochastic optimization, meaning each run will produce a different model. The seed used to initialize random elements of the optimization procedure heavily influences the quality of a trained model, which may be obscured by many commonly reported summary statistics, like accuracy. However, the random seed is often not included in hyper-parameter optimization, perhaps because the relationship between seed and model quality is hard to describe. This work attempts to describe the relationship between deep net models trained with different random seeds and the behavior of the expected model. We adopt robust hypothesis testing to propose a novel summary statistic for network similarity, referred to as the $α$-trimming level. We use the $α$-trimming level to show that the empirical cumulative distribution function of an ensemble model created from a collection of trained models with different random seeds approximates the average of these functions as the number of models in the collection grows large. This insight provides guidance for how many random seeds should be sampled to ensure that an ensemble of these trained models is a reliable representative. We also show that the $α$-trimming level is more expressive than different performance metrics like validation accuracy, churn, or expected calibration error when taken alone and may help with random seed selection in a more principled fashion. We demonstrate the value of the proposed statistic in real experiments and illustrate the advantage of fine-tuning over random seed with an experiment in transfer learning.
Submitted 12 June, 2024;
originally announced June 2024.
-
Understanding In-Context Learning with a Pelican Soup Framework
Authors:
Ting-Rui Chiang,
Dani Yogatama
Abstract:
Many existing theoretical analyses of in-context learning for natural language processing are based on latent variable models that leave gaps between theory and practice. We aim to close these gaps by proposing a theoretical framework, the Pelican Soup Framework. In this framework, we introduce (1) the notion of a common sense knowledge base, (2) a general formalism for natural language classification tasks, and (3) the notion of meaning association. Under this framework, we can establish a $\mathcal{O}(1/T)$ loss bound for in-context learning, where $T$ is the number of example-label pairs in the demonstration. Compared with previous works, our bound reflects the effect of the choice of verbalizers and the effect of instruction tuning. An additional notion of atom concepts makes it possible for our framework to explain the generalization to tasks unseen in the language model training data. Finally, we propose a toy setup, Calcutec, and a digit addition task that mimics types of distribution shifts a model needs to overcome to perform in-context learning. We also experiment with GPT2-Large on real-world NLP tasks. Our empirical results demonstrate the efficacy of our framework to explain in-context learning.
Submitted 15 February, 2024;
originally announced February 2024.
-
On Retrieval Augmentation and the Limitations of Language Model Training
Authors:
Ting-Rui Chiang,
Xinyan Velocity Yu,
Joshua Robinson,
Ollie Liu,
Isabelle Lee,
Dani Yogatama
Abstract:
Augmenting a language model (LM) with $k$-nearest neighbors ($k$NN) retrieval on its training data alone can decrease its perplexity, though the underlying reasons for this remain elusive. In this work, we rule out one previously posited possibility -- the "softmax bottleneck." We then create a new dataset to evaluate LM generalization ability in the setting where training data contains additional information that is not causally relevant. This task is challenging even for GPT-3.5 Turbo. We show that, for both GPT-2 and Mistral 7B, $k$NN retrieval augmentation consistently improves performance in this setting. Finally, to make $k$NN retrieval more accessible, we propose using a multi-layer perceptron model that maps datastore keys to values as a drop-in replacement for traditional retrieval. This reduces storage costs by over 25x.
Submitted 2 April, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
Efficient kernel surrogates for neural network-based regression
Authors:
Saad Qadeer,
Andrew Engel,
Amanda Howard,
Adam Tsou,
Max Vargas,
Panos Stinis,
Tony Chiang
Abstract:
Despite their immense promise in performing a variety of learning tasks, a theoretical understanding of the limitations of Deep Neural Networks (DNNs) has so far eluded practitioners. This is partly due to the inability to determine the closed forms of the learned functions, making it harder to study their generalization properties on unseen datasets. Recent work has shown that randomly initialized DNNs in the infinite width limit converge to kernel machines relying on a Neural Tangent Kernel (NTK) with known closed form. These results suggest, and experimental evidence corroborates, that empirical kernel machines can also act as surrogates for finite width DNNs. The high computational cost of assembling the full NTK, however, makes this approach infeasible in practice, motivating the need for low-cost approximations. In the current work, we study the performance of the Conjugate Kernel (CK), an efficient approximation to the NTK that has been observed to yield fairly similar results. For the regression problem of smooth functions and logistic regression classification, we show that the CK performance is only marginally worse than that of the NTK and, in certain cases, is shown to be superior. In particular, we establish bounds for the relative test losses, verify them with numerical tests, and identify the regularity of the kernel as the key determinant of performance. In addition to providing a theoretical grounding for using CKs instead of NTKs, our framework suggests a recipe for improving DNN accuracy inexpensively. We present a demonstration of this on the foundation model GPT-2 by comparing its performance on a classification task using a conventional approach and our prescription. We also show how our approach can be used to improve physics-informed operator network training for regression tasks as well as convolutional neural network training for vision classification tasks.
Submitted 24 January, 2024; v1 submitted 28 October, 2023;
originally announced October 2023.
-
The Distributional Hypothesis Does Not Fully Explain the Benefits of Masked Language Model Pretraining
Authors:
Ting-Rui Chiang,
Dani Yogatama
Abstract:
We analyze the masked language modeling pretraining objective function from the perspective of the distributional hypothesis. We investigate whether the better sample efficiency and the better generalization capability of models pretrained with masked language modeling can be attributed to the semantic similarity encoded in the pretraining data's distributional property. Via a synthetic dataset, our analysis suggests that the distributional property indeed leads to the better sample efficiency of pretrained masked language models, but does not fully explain the generalization capability. We also conduct analyses over two real-world datasets and demonstrate that the distributional property does not explain the generalization ability of pretrained natural language models either. Our results illustrate our limited understanding of model pretraining and provide future research directions.
Submitted 24 October, 2023;
originally announced October 2023.
-
Foundation Model's Embedded Representations May Detect Distribution Shift
Authors:
Max Vargas,
Adam Tsou,
Andrew Engel,
Tony Chiang
Abstract:
Sampling biases can cause distribution shifts between train and test datasets for supervised learning tasks, obscuring our ability to understand the generalization capacity of a model. This is especially important considering the wide adoption of pre-trained foundational neural networks -- whose behavior remains poorly understood -- for transfer learning (TL) tasks. We present a case study for TL on the Sentiment140 dataset and show that many pre-trained foundation models encode different representations of Sentiment140's manually curated test set $M$ from the automatically labeled training set $P$, confirming that a distribution shift has occurred. We argue training on $P$ and measuring performance on $M$ is a biased measure of generalization. Experiments on pre-trained GPT-2 show that the features learnable from $P$ do not improve (and in fact hamper) performance on $M$. Linear probes on pre-trained GPT-2's representations are robust and may even outperform overall fine-tuning, implying a fundamental importance for discerning distribution shift in train/test splits for model interpretation.
Submitted 2 February, 2024; v1 submitted 20 October, 2023;
originally announced October 2023.
-
Visual Forecasting as a Mid-level Representation for Avoidance
Authors:
Hsuan-Kung Yang,
Tsung-Chih Chiang,
Ting-Ru Liu,
Chun-Wei Huang,
Jou-Min Liu,
Chun-Yi Lee
Abstract:
The challenge of navigation in environments with dynamic objects continues to be a central issue in the study of autonomous agents. While predictive methods hold promise, their reliance on precise state information makes them less practical for real-world implementation. This study presents visual forecasting as an innovative alternative. By introducing intuitive visual cues, this approach projects the future trajectories of dynamic objects to improve agent perception and enable anticipatory actions. Our research explores two distinct strategies for conveying predictive information through visual forecasting: (1) sequences of bounding boxes, and (2) augmented paths. To validate the proposed visual forecasting strategies, we initiate evaluations in simulated environments using the Unity engine and then extend these evaluations to real-world scenarios to assess both practicality and effectiveness. The results confirm the viability of visual forecasting as a promising solution for navigation and obstacle avoidance in dynamic environments.
Submitted 17 September, 2023;
originally announced October 2023.
-
Robust Nonparametric Hypothesis Testing to Understand Variability in Training Neural Networks
Authors:
Sinjini Banerjee,
Reilly Cannon,
Tim Marrinan,
Tony Chiang,
Anand D. Sarwate
Abstract:
Training a deep neural network (DNN) often involves stochastic optimization, which means each run will produce a different model. Several works suggest this variability is negligible when models have the same performance, which in the case of classification is test accuracy. However, models with similar test accuracy may not be computing the same function. We propose a new measure of closeness between classification models based on the output of the network before thresholding. Our measure is based on a robust hypothesis-testing framework and can be adapted to other quantities derived from trained models.
Submitted 30 September, 2023;
originally announced October 2023.
-
Exploring Learned Representations of Neural Networks with Principal Component Analysis
Authors:
Amit Harlev,
Andrew Engel,
Panos Stinis,
Tony Chiang
Abstract:
Understanding feature representation for deep neural networks (DNNs) remains an open question within the general field of explainable AI. We use principal component analysis (PCA) to study the performance of a k-nearest neighbors classifier (k-NN), nearest class-centers classifier (NCC), and support vector machines on the learned layer-wise representations of a ResNet-18 trained on CIFAR-10. We show that in certain layers, as little as 20% of the intermediate feature-space variance is necessary for high-accuracy classification and that across all layers, the first ~100 PCs completely determine the performance of the k-NN and NCC classifiers. We relate our findings to neural collapse and provide partial evidence for the related phenomenon of intermediate neural collapse. Our preliminary work provides three distinct yet interpretable surrogate models for feature representation with an affine linear model the best performing. We also show that leveraging several surrogate models affords us a clever method to estimate where neural collapse may initially occur within the DNN.
Submitted 26 September, 2023;
originally announced September 2023.
-
Minibatching Offers Improved Generalization Performance for Second Order Optimizers
Authors:
Eric Silk,
Swarnita Chakraborty,
Nairanjana Dasgupta,
Anand D. Sarwate,
Andrew Lumsdaine,
Tony Chiang
Abstract:
Training deep neural networks (DNNs) used in modern machine learning is computationally expensive. Machine learning scientists, therefore, rely on stochastic first-order methods for training, coupled with significant hand-tuning, to obtain good performance. To better understand performance variability of different stochastic algorithms, including second-order methods, we conduct an empirical study that treats performance as a response variable across multiple training sessions of the same model. Using 2-factor Analysis of Variance (ANOVA) with interactions, we show that batch size used during training has a statistically significant effect on the peak accuracy of the methods, and that full batch largely performed the worst. In addition, we found that second-order optimizers (SOOs) generally exhibited significantly lower variance at specific batch sizes, suggesting they may require less hyperparameter tuning, leading to a reduced overall time to solution for model training.
Submitted 25 May, 2023;
originally announced July 2023.
-
Faithful and Efficient Explanations for Neural Networks via Neural Tangent Kernel Surrogate Models
Authors:
Andrew Engel,
Zhichao Wang,
Natalie S. Frank,
Ioana Dumitriu,
Sutanay Choudhury,
Anand Sarwate,
Tony Chiang
Abstract:
A recent trend in explainable AI research has focused on surrogate modeling, where neural networks are approximated as simpler ML algorithms such as kernel machines. A second trend has been to utilize kernel functions in various explain-by-example or data attribution tasks. In this work, we combine these two trends to analyze approximate empirical neural tangent kernels (eNTK) for data attribution. Approximation is critical for eNTK analysis due to the high computational cost of computing the eNTK. We define a new approximate eNTK and perform a novel analysis of how well the resulting kernel machine surrogate models correlate with the underlying neural network. We introduce two new random projection variants of the approximate eNTK which allow users to tune the time and memory complexity of their calculation. We conclude that kernel machines using an approximate neural tangent kernel as the kernel function are effective surrogate models, with the introduced trace NTK the most consistent performer. Open source software allowing users to efficiently calculate kernel functions in the PyTorch framework is available (https://github.com/pnnl/projection_ntk).
Submitted 11 March, 2024; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Virtual Guidance as a Mid-level Representation for Navigation
Authors:
Hsuan-Kung Yang,
Tsung-Chih Chiang,
Ting-Ru Liu,
Chun-Wei Huang,
Jou-Min Liu,
Chun-Yi Lee
Abstract:
In the context of autonomous navigation, effectively conveying abstract navigational cues to agents in dynamic environments poses challenges, particularly when the navigation information is multimodal. To address this issue, the paper introduces a novel technique termed "Virtual Guidance," which is designed to visually represent non-visual instructional signals. These visual cues, rendered as colored paths or spheres, are overlaid onto the agent's camera view, serving as easily comprehensible navigational instructions. We evaluate our proposed method through experiments in both simulated and real-world settings. In the simulated environments, our virtual guidance outperforms baseline hybrid approaches in several metrics, including adherence to planned routes and obstacle avoidance. Furthermore, we extend the concept of virtual guidance to transform text-prompt-based instructions into a visually intuitive format for real-world experiments. Our results validate the adaptability of virtual guidance and its efficacy in enabling policy transfer from simulated scenarios to real-world ones.
Submitted 17 September, 2023; v1 submitted 5 March, 2023;
originally announced March 2023.
-
ExReg: Wide-range Photo Exposure Correction via a Multi-dimensional Regressor with Attention
Authors:
Tzu-Hao Chiang,
Hao-Chien Hsueh,
Ching-Chun Hsiao,
Ching-Chun Huang
Abstract:
Photo exposure correction is widely investigated, but fewer studies focus on correcting under- and over-exposed images simultaneously. Three issues remain open in handling and correcting under- and over-exposed images in a unified way. First, a locally-adaptive exposure adjustment may be more flexible than learning a global mapping. Second, it is an ill-posed problem to determine the suitable exposure values locally. Third, photos with the same content but different exposures may not reach consistent adjustment results. To this end, we propose a novel exposure correction network, ExReg, to address the challenges by formulating exposure correction as a multi-dimensional regression process. Given an input image, a compact multi-exposure generation network is introduced to generate images with different exposure conditions for multi-dimensional regression and exposure correction in the next stage. An auxiliary module is designed to predict the region-wise exposure values, guiding the mainly proposed Encoder-Decoder ANP (Attentive Neural Processes) to regress the final corrected image. The experimental results show that ExReg can generate well-exposed results and outperform the SOTA method by 1.3dB in PSNR for extensive exposure problems. In addition, given the same image under various exposures for testing, the corrected results are more visually consistent and physically accurate.
Submitted 14 December, 2022;
originally announced December 2022.
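For readers unfamiliar with the metric quoted in the ExReg abstract, the following sketch shows the standard PSNR definition, 10·log10(MAX² / MSE), and what a fixed decibel gain implies for mean squared error. It is not code from the paper; the function names `psnr` and `mse_ratio_for_gain`, the flat pixel lists, and the 8-bit peak value of 255 are assumptions made for illustration.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR improves by gain_db decibels."""
    return 10.0 ** (-gain_db / 10.0)


# A 1.3 dB PSNR improvement corresponds to roughly a 26% lower MSE.
ratio = mse_ratio_for_gain(1.3)
```

Because PSNR is logarithmic in the error, a constant dB gain always corresponds to the same multiplicative reduction in MSE, regardless of the starting quality.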
-
Spectral Evolution and Invariance in Linear-width Neural Networks
Authors:
Zhichao Wang,
Andrew Engel,
Anand Sarwate,
Ioana Dumitriu,
Tony Chiang
Abstract:
We investigate the spectral properties of linear-width feed-forward neural networks, where the sample size is asymptotically proportional to network width. Empirically, we show that the spectra of weight in this high dimensional regime are invariant when trained by gradient descent for small constant learning rates; we provide a theoretical justification for this observation and prove the invariance of the bulk spectra for both conjugate and neural tangent kernels. We demonstrate similar characteristics when training with stochastic gradient descent with small learning rates. When the learning rate is large, we exhibit the emergence of an outlier whose corresponding eigenvector is aligned with the training data structure. We also show that after adaptive gradient training, where a lower test error and feature learning emerge, both weight and kernel matrices exhibit heavy tail behavior. Simple examples are provided to explain when heavy tails can have better generalizations. We exhibit different spectral properties such as invariant bulk, spike, and heavy-tailed distribution from a two-layer neural network using different training strategies, and then correlate them to the feature learning. Analogous phenomena also appear when we train conventional neural networks with real-world data. We conclude that monitoring the evolution of the spectra during training is an essential step toward understanding the training dynamics and feature learning.
Submitted 7 November, 2023; v1 submitted 11 November, 2022;
originally announced November 2022.
-
DialCrowd 2.0: A Quality-Focused Dialog System Crowdsourcing Toolkit
Authors:
Jessica Huynh,
Ting-Rui Chiang,
Jeffrey Bigham,
Maxine Eskenazi
Abstract:
Dialog system developers need high-quality data to train, fine-tune and assess their systems. They often use crowdsourcing for this since it provides large quantities of data from many workers. However, the data may not be of sufficiently good quality. This can be due to the way that the requester presents a task and how they interact with the workers. This paper introduces DialCrowd 2.0 to help requesters obtain higher quality data by, for example, presenting tasks more clearly and facilitating effective communication with workers. DialCrowd 2.0 guides developers in creating improved Human Intelligence Tasks (HITs) and is directly applicable to the workflows used currently by developers and researchers.
Submitted 25 July, 2022;
originally announced July 2022.
-
TorchNTK: A Library for Calculation of Neural Tangent Kernels of PyTorch Models
Authors:
Andrew Engel,
Zhichao Wang,
Anand D. Sarwate,
Sutanay Choudhury,
Tony Chiang
Abstract:
We introduce torchNTK, a python library to calculate the empirical neural tangent kernel (NTK) of neural network models in the PyTorch framework. We provide an efficient method to calculate the NTK of multilayer perceptrons. We compare the explicit differentiation implementation against autodifferentiation implementations, which have the benefit of extending the utility of the library to any architecture supported by PyTorch, such as convolutional networks. A feature of the library is that we expose the user to layerwise NTK components, and show that in some regimes a layerwise calculation is more memory efficient. We conduct preliminary experiments to demonstrate use cases for the software and probe the NTK.
Submitted 24 May, 2022;
originally announced May 2022.
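As a rough illustration of the quantity the torchNTK abstract refers to, the sketch below computes an empirical NTK Gram matrix, K[i][j] = ∇θ f(x_i) · ∇θ f(x_j), for a tiny two-layer tanh network in plain Python using hand-derived gradients. It does not use or reproduce the torchNTK API; the names `init_params`, `param_gradient`, and `empirical_ntk`, the network shape, and the initialization scaling are all assumptions chosen for the example.

```python
import math
import random


def init_params(d_in, width, seed=0):
    """Random weights for f(x) = sum_j a[j] * tanh(W[j] . x) (illustrative scaling)."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) / math.sqrt(d_in) for _ in range(d_in)]
         for _ in range(width)]
    a = [rng.gauss(0.0, 1.0) / math.sqrt(width) for _ in range(width)]
    return W, a


def param_gradient(x, W, a):
    """Flattened gradient of the scalar output f(x) with respect to (W, a)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(w_j, x))) for w_j in W]
    grad = []
    for a_j, h_j in zip(a, hidden):
        # d f / d W[j][k] = a[j] * (1 - tanh^2(W[j] . x)) * x[k]
        grad.extend(a_j * (1.0 - h_j * h_j) * xi for xi in x)
    # d f / d a[j] = tanh(W[j] . x)
    grad.extend(hidden)
    return grad


def empirical_ntk(X, W, a):
    """Gram matrix of parameter gradients: K[i][j] = grad f(x_i) . grad f(x_j)."""
    grads = [param_gradient(x, W, a) for x in X]
    return [[sum(p * q for p, q in zip(g_i, g_j)) for g_j in grads]
            for g_i in grads]


W, a = init_params(d_in=3, width=8)
X = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5], [0.0, 2.0, 1.0]]
K = empirical_ntk(X, W, a)
```

The two loops in `param_gradient` keep the per-layer contributions separate, so summing only the first or only the second group of terms would give a layerwise kernel component of the kind the abstract mentions.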
-
Breaking Down Multilingual Machine Translation
Authors:
Ting-Rui Chiang,
Yi-Pei Chen,
Yi-Ting Yeh,
Graham Neubig
Abstract:
While multilingual training is now an essential ingredient in machine translation (MT) systems, recent work has demonstrated that it has different effects in different multilingual settings, such as many-to-one, one-to-many, and many-to-many learning. These training settings expose the encoder and the decoder in a machine translation model to different data distributions. In this paper, we examine how different varieties of multilingual training contribute to learning these two components of the MT model. Specifically, we compare bilingual models with encoders and/or decoders initialized by multilingual training. We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs). We further find the important attention heads for each language pair and compare their correlations during inference. Our analysis sheds light on how multilingual translation models work and enables us to propose methods to improve performance by training with highly related languages. Our many-to-one models for high-resource languages and one-to-many models for LRL outperform the best results reported by Aharoni et al. (2019).
Submitted 3 April, 2022; v1 submitted 15 October, 2021;
originally announced October 2021.
-
Are you doing what I say? On modalities alignment in ALFRED
Authors:
Ting-Rui Chiang,
Yi-Ting Yeh,
Ta-Chung Chi,
Yau-Shian Wang
Abstract:
ALFRED is a recently proposed benchmark that requires a model to complete tasks in simulated house environments specified by instructions in natural language. We hypothesize that the key to success is accurately aligning the text modality with visual inputs. Motivated by this, we inspect how well existing models can align these modalities using our proposed intrinsic metric, boundary adherence score (BAS). The results show that previous models indeed fail to perform proper alignment. To address this issue, we introduce approaches aimed at improving model alignment and demonstrate how improved alignment improves end task performance.
Submitted 11 October, 2021;
originally announced October 2021.
-
On a Benefit of Mask Language Modeling: Robustness to Simplicity Bias
Authors:
Ting-Rui Chiang
Abstract:
Despite the success of pretrained masked language models (MLM), why MLM pretraining is useful is still a question not fully answered. In this work we theoretically and empirically show that MLM pretraining makes models robust to lexicon-level spurious features, partly answering the question. We theoretically show that, when we can model the distribution of a spurious feature $Π$ conditioned on the context, then (1) $Π$ is at least as informative as the spurious feature, and (2) learning from $Π$ is at least as simple as learning from the spurious feature. Therefore, MLM pretraining rescues the model from the simplicity bias caused by the spurious feature. We also explore the efficacy of MLM pretraining in causal settings. Finally, we close the gap between our theories and real-world practices by conducting experiments on the hate speech detection and named entity recognition tasks.
Submitted 11 October, 2021;
originally announced October 2021.
-
Improving Dialogue State Tracking by Joint Slot Modeling
Authors:
Ting-Rui Chiang,
Yi-Ting Yeh
Abstract:
Dialogue state tracking models play an important role in a task-oriented dialogue system. However, most of them model the slot types as conditionally independent given the input. We discover that this may cause the model to be confused by slot types that share the same data type. To mitigate this issue, we propose TripPy-MRF and TripPy-LSTM, which model the slots jointly. Our results show that they are able to alleviate the confusion mentioned above, and they push the state-of-the-art on the MultiWoZ 2.1 dataset from 58.7 to 61.3. Our implementation is available at https://github.com/CTinRay/Trippy-Joint.
Submitted 14 November, 2021; v1 submitted 28 September, 2021;
originally announced September 2021.
-
Relating Neural Text Degeneration to Exposure Bias
Authors:
Ting-Rui Chiang,
Yun-Nung Chen
Abstract:
This work focuses on relating two mysteries in neural-based text generation: exposure bias and text degeneration. Despite the long time since exposure bias was first mentioned and the numerous studies on its remedy, to our knowledge, its impact on text generation has not yet been verified. Text degeneration is a problem that the widely-used pre-trained language model GPT-2 was recently found to suffer from (Holtzman et al., 2020). Motivated by the unknown cause of text degeneration, in this paper we attempt to relate these two mysteries. Specifically, we first qualitatively and quantitatively identify mistakes made before text degeneration occurs. Then we investigate the significance of the mistakes by inspecting the hidden states in GPT-2. Our results show that text degeneration is likely to be partly caused by exposure bias. We also study the self-reinforcing mechanism of text degeneration, explaining why the mistakes amplify. In sum, our study provides a more concrete foundation for further investigation of exposure bias and text degeneration problems.
Submitted 17 September, 2021;
originally announced September 2021.
-
Why Can You Lay Off Heads? Investigating How BERT Heads Transfer
Authors:
Ting-Rui Chiang,
Yun-Nung Chen
Abstract:
The huge size of the widely used BERT family models has led to recent efforts on model distillation. The main goal of distillation is to create a task-agnostic pre-trained model that can be fine-tuned on downstream tasks without fine-tuning its full-sized version. Despite the progress of distillation, to what degree and for what reason a task-agnostic model can be created from distillation has not been well studied. The mechanisms behind transfer learning of those BERT models are not well investigated either. Therefore, this work focuses on analyzing the acceptable performance reduction during distillation in order to guide future distillation procedures. Specifically, we first inspect the prunability of the Transformer heads in RoBERTa and ALBERT using the head importance estimation proposed by Michel et al. (2019), and then check the coherence of the important heads between the pre-trained task and downstream tasks. Hence, the acceptable reduction in performance on the pre-trained task when distilling a model can be derived from the results, and we further compare the behavior of the pruned model before and after fine-tuning. Our studies provide guidance for future directions in BERT family model distillation.
Submitted 13 June, 2021;
originally announced June 2021.
-
Deriving monadic quicksort (Declarative Pearl)
Authors:
Shin-Cheng Mu,
Tsung-Ju Chiang
Abstract:
To demonstrate derivation of monadic programs, we present a specification of sorting using the non-determinism monad, and derive pure quicksort on lists and state-monadic quicksort on arrays. In the derivation one may switch between point-free and pointwise styles, and deploy techniques familiar to functional programmers such as pattern matching and induction on structures or on sizes. Derivation of stateful programs resembles reasoning backwards from the postcondition.
Submitted 27 January, 2021;
originally announced January 2021.
-
Longest segment of balanced parentheses -- an exercise in program inversion in a segment problem (Functional Pearl)
Authors:
Shin-Cheng Mu,
Tsung-Ju Chiang
Abstract:
Given a string of parentheses, the task is to find the longest consecutive segment that is balanced, in linear time. We find this problem interesting because it involves a combination of techniques: the usual approach for solving segment problems, and a theorem for constructing the inverse of a function -- through which we derive an instance of shift-reduce parsing.
Submitted 21 August, 2021; v1 submitted 24 January, 2021;
originally announced January 2021.
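For readers unfamiliar with the problem stated in the abstract above, here is a minimal sketch of one common linear-time solution, a single left-to-right scan with a stack of indices. It is not the derivation from the paper, which arrives at its program through program inversion in a functional style; the function name `longest_balanced_segment` is hypothetical, and the input is assumed to contain only the characters `(` and `)`.

```python
def longest_balanced_segment(s: str) -> str:
    """Return the longest contiguous balanced segment of s.

    Assumes s consists only of '(' and ')'. Ties are resolved in favor
    of the leftmost segment. Runs in time linear in len(s).
    """
    # The bottom of the stack always holds the index just before the
    # start of the current candidate segment; above it sit the indices
    # of the '(' characters that are still unmatched.
    stack = [-1]
    best_start, best_len = 0, 0
    for i, ch in enumerate(s):
        if ch == "(":
            stack.append(i)
        else:
            stack.pop()
            if not stack:
                # Unmatched ')': no balanced segment can cross index i.
                stack.append(i)
            else:
                # Everything after stack[-1] up to i is balanced.
                length = i - stack[-1]
                if length > best_len:
                    best_len = length
                    best_start = stack[-1] + 1
    return s[best_start:best_start + best_len]
```

Each index is pushed and popped at most once, which is what keeps the scan linear. Pushing on `(` and popping on `)` loosely mirrors the shift and reduce steps of a parser, though the correspondence with the paper's derived program is only informal.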
-
An Empirical Study of Content Understanding in Conversational Question Answering
Authors:
Ting-Rui Chiang,
Hao-Tong Ye,
Yun-Nung Chen
Abstract:
Following a large body of work on context-free question answering systems, there is an emerging trend of conversational question answering models in the natural language processing field. Thanks to the recently collected datasets, including QuAC and CoQA, there has been more work on conversational question answering, and recent work has achieved competitive performance on both datasets. However, to the best of our knowledge, two important questions for conversational comprehension research have not been well studied: 1) How well can the benchmark datasets reflect models' content understanding? 2) Do the models make good use of the conversation content when answering questions? To investigate these questions, we design different training settings, testing settings, as well as an attack to verify the models' capability of content understanding on QuAC and CoQA. The experimental results indicate some potential hazards in the benchmark datasets, QuAC and CoQA, for conversational comprehension research. Our analysis also sheds light on both what models may learn and how datasets may bias the models. With its deep investigation of the task, we believe that this work can benefit the future progress of conversational comprehension. The source code is available at https://github.com/MiuLab/CQA-Study.
Submitted 27 November, 2019; v1 submitted 24 September, 2019;
originally announced September 2019.
-
Learning Multi-Level Information for Dialogue Response Selection by Highway Recurrent Transformer
Authors:
Ting-Rui Chiang,
Chao-Wei Huang,
Shang-Yu Su,
Yun-Nung Chen
Abstract:
With the increasing research interest in dialogue response generation, there is an emerging branch formulating this task as selecting next sentences, where, given the partial dialogue contexts, the goal is to determine the most probable next sentence. Following the recent success of the Transformer model, this paper proposes (1) a new variant of attention mechanism based on multi-head attention, called highway attention, and (2) a recurrent model based on the Transformer and the proposed highway attention, called the Highway Recurrent Transformer. Experiments on the response selection task in the seventh Dialog System Technology Challenge (DSTC7) show the proposed model's ability to model both utterance-level and dialogue-level information; the effectiveness of each module is further analyzed as well.
Submitted 21 March, 2019;
originally announced March 2019.
-
RAP-Net: Recurrent Attention Pooling Networks for Dialogue Response Selection
Authors:
Chao-Wei Huang,
Ting-Rui Chiang,
Shang-Yu Su,
Yun-Nung Chen
Abstract:
Response selection has been an emerging research topic due to the growing interest in dialogue modeling, where the goal of the task is to select an appropriate response for continuing dialogues. To further push the end-to-end dialogue model toward real-world scenarios, the seventh Dialog System Technology Challenge (DSTC7) proposed a challenging track based on real chatlog datasets. The competition focuses on dialogue modeling with several advanced characteristics: (1) natural language diversity, (2) the capability of precisely selecting a proper response from a large set of candidates or handling the scenario without any correct answer, and (3) knowledge grounding. This paper introduces recurrent attention pooling networks (RAP-Net), a novel framework for response selection, which can estimate well the relevance between the dialogue contexts and the candidates. The proposed RAP-Net is shown to be effective and to generalize across different datasets and settings in the DSTC7 experiments.
Submitted 21 March, 2019;
originally announced March 2019.
-
Semantically-Aligned Equation Generation for Solving and Reasoning Math Word Problems
Authors:
Ting-Rui Chiang,
Yun-Nung Chen
Abstract:
Solving math word problems is a challenging task that requires accurate natural language understanding to bridge natural language texts and math expressions. Motivated by the intuition about how humans generate equations given the problem texts, this paper presents a neural approach to automatically solve math word problems by operating on symbols according to their semantic meanings in texts. This paper views the process of generating equations as a bridge between the semantic world and the symbolic world, where the proposed neural math solver is based on an encoder-decoder framework. In the proposed model, the encoder is designed to understand the semantics of problems, and the decoder focuses on tracking semantic meanings of the generated symbols and then deciding which symbol to generate next. The preliminary experiments are conducted on the Math23K dataset, and our model significantly outperforms both the state-of-the-art single model and the best non-retrieval-based model by about 10% accuracy, demonstrating the effectiveness of bridging the symbolic and semantic worlds for math word problems.
Submitted 9 June, 2019; v1 submitted 1 November, 2018;
originally announced November 2018.
-
OpenVanilla - A Non-Intrusive Plug-In Framework of Text Services
Authors:
Tien-chien Chiang,
Deng-Liu,
Kang-min Liu,
Weizhong Yang,
Pek-tiong Tan,
Mengjuei Hsieh,
Tsung-hsiang Chang,
Wen-Lien Hsu
Abstract:
This paper has been withdrawn by the author, because it was merged into cs.HC/0508041
Submitted 29 June, 2006; v1 submitted 4 August, 2005;
originally announced August 2005.