-
DEM: Distribution Edited Model for Training with Mixed Data Distributions
Authors:
Dhananjay Ram,
Aditya Rawal,
Momchil Hardalov,
Nikolaos Pappas,
Sheng Zha
Abstract:
Training with mixed data distributions is a common and important part of creating multi-task and instruction-following models. The diversity of the data distributions and the cost of joint training make the optimization procedure extremely challenging. Data mixing methods partially address this problem, but they yield sub-optimal performance across data sources and require multiple expensive training runs. In this paper, we propose a simple and efficient alternative for better optimization of the data sources by combining models individually trained on each data source with the base model using basic element-wise vector operations. The resulting model, namely Distribution Edited Model (DEM), is 11x cheaper than standard data mixing and outperforms strong baselines on a variety of benchmarks, yielding up to 6.2% improvement on MMLU, 11.5% on BBH, 16.1% on DROP, and 9.3% on HELM with models of size 3B to 13B. Notably, DEM does not require full re-training when modifying a single data source, thus making it very flexible and scalable for training with diverse data sources.
Submitted 21 June, 2024;
originally announced June 2024.
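The element-wise combination described in the DEM abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact formula, so the rule used here (base plus a weighted sum of per-source differences from the base) is an assumption, and the names `combine_models`, `finetuned`, and `weights` are hypothetical.

```python
def combine_models(base, finetuned, weights):
    """Return base + sum_i w_i * (finetuned_i - base), parameter by parameter.

    base      : dict mapping parameter name -> list of floats
    finetuned : list of dicts with the same keys and shapes as `base`
    weights   : one scalar per fine-tuned model (assumed mixing coefficients)
    """
    combined = {}
    for name, base_vals in base.items():
        merged = list(base_vals)  # start from a copy of the base parameters
        for model, w in zip(finetuned, weights):
            for j, (ft, b) in enumerate(zip(model[name], base_vals)):
                merged[j] += w * (ft - b)  # add the weighted per-source shift
        combined[name] = merged
    return combined


# Toy example: one parameter tensor flattened to two values.
base = {"layer.weight": [1.0, 2.0]}
model_a = {"layer.weight": [2.0, 2.0]}  # trained on data source A
model_b = {"layer.weight": [1.0, 4.0]}  # trained on data source B
merged = combine_models(base, [model_a, model_b], [0.5, 0.5])
```

Under this assumed rule, changing one data source only requires retraining that source's model and recombining, which matches the abstract's point that full re-training is not needed.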
-
HLAT: High-quality Large Language Model Pre-trained on AWS Trainium
Authors:
Haozheng Fan,
Hao Zhou,
Guangtai Huang,
Parameswaran Raman,
Xinwei Fu,
Gaurav Gupta,
Dhananjay Ram,
Yida Wang,
Jun Huan
Abstract:
Getting large language models (LLMs) to perform well on the downstream tasks requires pre-training over trillions of tokens. This typically demands a large number of powerful computational devices in addition to a stable distributed training framework to accelerate the training. The growing number of applications leveraging AI/ML has led to a scarcity of the expensive conventional accelerators (such as GPUs), which creates a need for alternative specialized accelerators that are scalable and cost-efficient. AWS Trainium is the second-generation machine learning accelerator that has been purposely built for training large deep learning models. Its corresponding instance, Amazon EC2 trn1, is an alternative to GPU instances for LLM training. However, training LLMs with billions of parameters on trn1 is challenging due to its relatively nascent software ecosystem. In this paper, we showcase HLAT: a 7 billion parameter decoder-only LLM pre-trained using trn1 instances over 1.8 trillion tokens. The performance of HLAT is benchmarked against popular open source baseline models including LLaMA and OpenLLaMA, which have been trained on NVIDIA GPUs and Google TPUs, respectively. On various evaluation tasks, we show that HLAT achieves model quality on par with the baselines. We also share the best practice of using the Neuron Distributed Training Library (NDTL), a customized distributed training library for AWS Trainium to achieve efficient training. Our work demonstrates that AWS Trainium powered by the NDTL is able to successfully pre-train state-of-the-art LLMs with high performance and cost-effectiveness.
Submitted 16 April, 2024;
originally announced April 2024.
-
Magnetic, thermodynamic, and magnetotransport properties of CeGaGe and PrGaGe single crystals
Authors:
Daloo Ram,
Sudip Malick,
Zakir Hossain,
Dariusz Kaczorowski
Abstract:
We investigate the physical properties of high-quality single crystals CeGaGe and PrGaGe using magnetization, heat capacity, and magnetotransport measurements. Gallium-indium binary flux was used to grow these single crystals that crystallize in a body-centered tetragonal structure. Magnetic susceptibility data reveal a magnetic phase transition around 6.0 and 19.4 K in CeGaGe and PrGaGe, respectively, which is further confirmed by heat capacity and electrical resistivity data. A number of additional anomalies have been observed below the ordering temperature in the magnetic susceptibility data, indicating a complex magnetic structure. The magnetic measurements also reveal a strong magnetocrystalline anisotropy in both compounds. Our detailed analysis of the crystalline electric field (CEF) effect as observed in magnetic susceptibility and heat capacity data suggests that the $J$ = 5/2 multiplet of CeGaGe splits into three doublets, while the $J$ = 4 degenerate ground state of PrGaGe splits into five singlets and two doublets. The estimated energy levels from the CEF analysis are consistent with the magnetic entropy.
Submitted 29 January, 2024;
originally announced January 2024.
-
Electronic structure and physical properties of candidate topological material GdAgGe
Authors:
D. Ram,
J. Singh,
M. K. Hooda,
O. Pavlosiuk,
V. Kanchana,
Z. Hossain,
D. Kaczorowski
Abstract:
We grew needle-shaped single crystals of GdAgGe, which crystallizes in a noncentrosymmetric hexagonal crystal structure with space group P$\overline{6}$2$m$ (189). The magnetic susceptibility data for $H \perp c$ reveal two pronounced antiferromagnetic transitions at $T_{N1}$ = 20 K and $T_{N2}$ = 14.5 K. The magnetic susceptibility anomalies are less prominent for $H \parallel c$. The transition at $T_{N1}$ is accompanied by a pronounced heat capacity anomaly confirming the bulk nature of the magnetic transition. Below $T_{N1}$, the electrical resistivity data follows a $T^{3/2}$ dependence. In the magnetically ordered state, GdAgGe shows positive transverse magnetoresistance, which increases with decreasing temperature and increasing field, reaching a value of $\sim$ 27% at 9 T and 10 K. The Hall resistivity data and electronic band structure calculations suggest that both the hole and electron charge carriers contribute to the transport properties. The electronic band structure displays linear band crossings near the Fermi level. The calculations reveal that GdAgGe has a nodal line with drumhead surface states coupled with a nonzero Berry phase, making it a nontrivial nodal-line semimetal.
Submitted 27 January, 2024;
originally announced January 2024.
-
Multiple magnetic transitions, metamagnetism and large magnetoresistance in GdAuGe single crystals
Authors:
D. Ram,
J. Singh,
M. K. Hooda,
K. Singh,
V. Kanchana,
D. Kaczorowski,
Z. Hossain
Abstract:
We report the physical properties of GdAuGe single crystals, which were grown using Bi flux. The powder x-ray diffraction data show that the compound crystallizes in the hexagonal NdPtSb-type structure (space group P6$_3$mc). Magnetization measurements performed for field configurations H||c and H||ab show that GdAuGe orders antiferromagnetically at the Néel temperature, TN = 17.2 K. Around this temperature, heat capacity and electrical resistivity data exhibit a prominent anomaly due to the antiferromagnetic (AFM) transition. In addition to an AFM phase transition, the magnetization data for H||c display the signature of field-induced metamagnetic (MM) transitions below TN. The critical field range for these transitions varies from 0.2 to 6.2 T. The critical fields for the MM transitions decrease with increasing temperature and approach zero as the temperature approaches TN. Interestingly, the magnetoresistance (MR) data (for H||c) record a sharp increase in values at the critical fields that coincide with those seen in magnetization data, tracking the presence of MM transitions. MR is positive and large (169% at 9 T and 2 K) at low temperatures. Above TN, MR becomes small and switches to negative values. Hall resistivity data reveal the predominance of hole charge carriers in the system. In addition, we observe the emergence of a step-like feature in the Hall resistivity data within the field range of the second MM transition, and a significantly large anomalous Hall conductivity of 1270 Ω$^{-1}$ cm$^{-1}$ at 2 K. The H-T phase diagram constructed from our detailed magnetization and magnetotransport measurements reveals multiple intricate magnetic phase transitions. The electronic and magnetic structures of GdAuGe are also thoroughly investigated using first-principles methods. The electronic band structure calculations reveal that GdAuGe is a Dirac nodal-line semimetal.
Submitted 16 December, 2023;
originally announced December 2023.
-
Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer
Authors:
Qingru Zhang,
Dhananjay Ram,
Cole Hawkins,
Sheng Zha,
Tuo Zhao
Abstract:
Pretrained transformer models have demonstrated remarkable performance across various natural language processing tasks. These models leverage the attention mechanism to capture long- and short-range dependencies in the sequence. However, the (full) attention mechanism incurs a high computational cost, quadratic in the sequence length, which is not affordable in tasks with long sequences, e.g., inputs with 8k tokens. Although sparse attention can be used to improve computational efficiency, as suggested in existing work, it has limited modeling capacity and often fails to capture complicated dependencies in long sequences. To tackle this challenge, we propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans. Specifically, MASFormer is equipped with full attention to capture long-range dependencies, but only at a small number of layers. For the remaining layers, MASFormer only employs sparse attention to capture short-range dependencies. Our experiments on natural language modeling and generation tasks show that a decoder-only MASFormer model of 1.3B parameters can achieve competitive performance to vanilla transformers with full attention while significantly reducing computational cost (up to 75%). Additionally, we investigate the effectiveness of continual training with long sequence data and how sequence length impacts downstream generation performance, which may be of independent interest.
Submitted 18 October, 2023;
originally announced October 2023.
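The mixed-attention-span idea in the MASFormer abstract (full attention at a few layers, sparse attention elsewhere) can be illustrated with per-layer attention masks. This is a rough sketch under stated assumptions: the sparse pattern is taken to be a causal sliding window, and the window size and the choice of which layer uses full attention are hypothetical values, since the abstract specifies neither.

```python
def causal_mask(seq_len, window=None):
    """Build a boolean mask; mask[i][j] is True when query i may attend to key j.

    With window=None the mask is full causal attention. With an integer window,
    each position sees only itself and the previous (window - 1) positions.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i  # causal: never attend to future positions
            if window is not None:
                visible = visible and (i - j) < window  # restrict the span
            row.append(visible)
        mask.append(row)
    return mask


def layer_masks(num_layers, full_layers, seq_len, window):
    """Full causal masks for layers in `full_layers`, sliding-window masks elsewhere."""
    return [
        causal_mask(seq_len) if layer in full_layers else causal_mask(seq_len, window)
        for layer in range(num_layers)
    ]


# Hypothetical configuration: 4 layers, only the last one uses full attention.
masks = layer_masks(num_layers=4, full_layers={3}, seq_len=6, window=2)
```

The number of True entries per row grows with the sequence length only in the full-attention layers; in the windowed layers it is bounded by the window size, which is where the cost saving would come from.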
-
Rotational Variability and Detection of Superflares in a Young Brown Dwarf by TESS
Authors:
Rajib Kumbhakar,
Soumen Mondal,
Samrat Ghosh,
Diya Ram,
Sudip Pramanik
Abstract:
We present a comprehensive analysis of a Transiting Exoplanet Survey Satellite (TESS) high-quality light curve for a young brown dwarf, MHO 4, having spectral type M7.0, in the Taurus star-forming region. We investigate the rotation periods and characterize the BD's dynamic atmosphere and surface features. We present light curve analysis of MHO 4, and estimate the rotation period to be around 2.224 d. Remarkably, MHO 4 exhibits two significant flaring events. Furthermore, we also estimated bolometric flare energies to be within the energy range of $10^{34}$ to $10^{35}$ erg, which sits in the superflare category.
Submitted 3 September, 2023;
originally announced September 2023.
-
Evolutionary Dynamics of Social Inequality and Coincidence of Gini and Kolkata indices under Unrestricted Competition
Authors:
Suchismita Banerjee,
Soumyajyoti Biswas,
Bikas K. Chakrabarti,
Sai Krishna Challagundla,
Asim Ghosh,
Suhaas Reddy Guntaka,
Hanesh Koganti,
Anvesh Reddy Kondapalli,
Raju Maiti,
Manipushpak Mitra,
Dachepalli R. S. Ram
Abstract:
Social inequalities are ubiquitous and here we show that the values of the Gini ($g$) and Kolkata ($k$) indices, two generic inequality indices, approach each other (starting from $g = 0$ and $k = 0.5$ for equality) as the competitions grow in various social institutions like markets, universities, elections, etc. It is further shown that these two indices become equal and stabilize at a value (at $g = k \simeq 0.87$) under unrestricted competitions. We propose to view this coincidence of inequality indices as a generalized version of the (more than a) century old 80-20 law of Pareto. Furthermore, the coincidence of the inequality indices noted here is very similar to the ones seen before for self-organized critical (SOC) systems. The observations here, therefore, stand as a quantitative support towards viewing interacting socio-economic systems in the framework of SOC, an idea conjectured for years.
Submitted 4 October, 2022; v1 submitted 14 November, 2021;
originally announced November 2021.
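The two indices named in the abstract above can be computed from a Lorenz curve. The sketch below assumes the usual definitions: $g$ is one minus twice the area under the Lorenz curve $L(p)$, and $k$ is the point where $L(k) = 1 - k$, so that a $(1-k)$ fraction of people holds a $k$ fraction of the total. The piecewise-linear Lorenz curve and the function names are illustrative choices, and the input is assumed to be non-negative with a positive total.

```python
def lorenz_points(values):
    """Return (population fraction, cumulative share) points, poorest first."""
    xs = sorted(values)
    total = sum(xs)
    n = len(xs)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((i / n, running / total))
    return points


def gini_index(values):
    """g = 1 - 2 * (area under the Lorenz curve), using the trapezoid rule."""
    pts = lorenz_points(values)
    area = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1.0 - 2.0 * area


def kolkata_index(values):
    """Solve L(k) = 1 - k by linear interpolation on the Lorenz curve."""
    pts = lorenz_points(values)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        f0 = y0 + x0 - 1.0  # f(p) = L(p) + p - 1 increases strictly with p
        f1 = y1 + x1 - 1.0
        if f0 <= 0.0 <= f1:
            t = -f0 / (f1 - f0)  # position of the zero crossing in this segment
            return x0 + t * (x1 - x0)
    return 1.0  # not reached for valid input, since f(0) = -1 and f(1) = 1


equal = [1, 1, 1, 1]      # perfect equality
unequal = [0, 0, 0, 1]    # one person holds everything
```

For the equal distribution this gives $g = 0$ and $k = 0.5$, the starting point quoted in the abstract.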
-
Scaling Behavior of the Hirsch Index for Failure Avalanches, Percolation Clusters and Paper Citations
Authors:
Asim Ghosh,
Bikas K. Chakrabarti,
Dachepalli R. S. Ram,
Manipushpak Mitra,
Raju Maiti,
Soumyajyoti Biswas,
Suchismita Banerjee
Abstract:
A popular measure for citation inequalities of individual scientists has been the Hirsch index ($h$). If for any scientist the number $n_c$ of citations is plotted against the serial number $n_p$ of the paper having those many citations (when the papers are ordered from highest cited to lowest) then $h$ corresponds to the nearest lower integer value of $n_p$ below the fixed point of the non-linear citation function (or given by $n_c = h = n_p$ if both $n_p$ and $n_c$ are dense set of integers near the $h$ value). The same index can be estimated (from $h=s=n_{s}$) for the avalanche or cluster of size ($s$) distributions ($n_s$) in elastic fiber bundle or percolation models. Another such inequality index, called the Kolkata index ($k$) says that $(1-k)$ fraction of papers attract $k$ fraction of citations ($k=0.80$ corresponds to the 80-20 law of Pareto). We find, for stress ($σ$), lattice occupation probability ($p$) or Kolkata index ($k$) near the bundle failure threshold ($σ_c$) or percolation threshold ($p_c$) or critical value of Kolkata index $k_c$, good fit to Widom-Stauffer like scaling $h/[\sqrt{N}/log N]$ = $f(\sqrt{N}[σ_c -σ]^α)$, $h/[\sqrt{N}/log N]=f(\sqrt{N}|p_c -p|^α)$ or $h/[\sqrt{N_c}/log N_c]=f(\sqrt{N_c}|k_c -k|^α)$ respectively, with asymptotically defined scaling function $f$, for systems of size $N$ (total number of fibers or lattice sites) or $N_c$ (total number of citations), and $α$ denoting the appropriate scaling exponent. We also show that if the number ($N_m$) of members of parliaments or national assemblies of different countries (with population $N$) is identified as their respective $h-$index, then the data fit the scaling relation $N_m \sim \sqrt N /log N$, resolving a major recent controversy.
Submitted 24 October, 2022; v1 submitted 29 September, 2021;
originally announced September 2021.
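The Hirsch index definition given in the abstract above (papers ranked from most to least cited, with $h$ the largest rank whose citation count is at least that rank) can be sketched in a few lines. The helper `scaling_reference`, which evaluates the $\sqrt{N}/\log N$ factor from the quoted scaling relation, is a hypothetical convenience function; it uses the natural logarithm, which is an assumption since the abstract does not state the base.

```python
import math


def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citations, reverse=True)  # highest cited first
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break  # counts only decrease from here, so no later rank can qualify
    return h


def scaling_reference(n):
    """Evaluate sqrt(N) / log(N), the factor h is compared against (N > 1)."""
    return math.sqrt(n) / math.log(n)


papers = [10, 8, 5, 4, 3]
h = h_index(papers)
```

For the list above, four papers have at least four citations each and the fifth has only three, so `h` is 4.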
-
Neural Network based End-to-End Query by Example Spoken Term Detection
Authors:
Dhananjay Ram,
Lesly Miculicich,
Hervé Bourlard
Abstract:
This paper focuses on the problem of query by example spoken term detection (QbE-STD) in a zero-resource scenario. State-of-the-art approaches primarily rely on dynamic time warping (DTW) based template matching techniques using phone posterior or bottleneck features extracted from a deep neural network (DNN). We use both monolingual and multilingual bottleneck features, and show that multilingual features perform increasingly better with more training languages. Previously, it has been shown that the DTW based matching can be replaced with a CNN based matching while using posterior features. Here, we show that the CNN based matching outperforms DTW based matching using bottleneck features as well. In this case, the feature extraction and pattern matching stages of our QbE-STD system are optimized independently of each other. We propose to integrate these two stages in a fully neural network based end-to-end learning framework to enable joint optimization of those two stages simultaneously. The proposed approaches are evaluated on two challenging multilingual datasets: Spoken Web Search 2013 and Query by Example Search on Speech Task 2014, demonstrating in each case significant improvements.
Submitted 19 November, 2019;
originally announced November 2019.
-
Multilingual Bottleneck Features for Query by Example Spoken Term Detection
Authors:
Dhananjay Ram,
Lesly Miculicich,
Hervé Bourlard
Abstract:
State of the art solutions to query by example spoken term detection (QbE-STD) usually rely on bottleneck feature representation of the query and audio document to perform dynamic time warping (DTW) based template matching. Here, we present a study on QbE-STD performance using several monolingual as well as multilingual bottleneck features extracted from feed forward networks. Then, we propose to employ residual networks (ResNet) to estimate the bottleneck features and show significant improvements over the corresponding feed forward network based features. The neural networks are trained on the GlobalPhone corpus and QbE-STD experiments are performed on the very challenging QUESST 2014 database.
Submitted 30 June, 2019;
originally announced July 2019.
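The dynamic time warping (DTW) template matching mentioned in the abstract above can be sketched with the textbook dynamic program. This is only a basic illustration: it aligns two complete one-dimensional sequences with an absolute-difference cost, whereas QbE-STD systems typically match a query against sub-segments of a longer utterance using distances between feature vectors. The function name and the toy inputs are hypothetical.

```python
def dtw_distance(a, b):
    """Minimum cumulative |a_i - b_j| cost over all monotonic alignments of a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of the first i items of a and first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


query = [1, 2, 3]
stretched = [1, 2, 2, 3]  # same shape as the query, spoken more slowly
distance = dtw_distance(query, stretched)
```

Because the alignment may repeat elements, the stretched copy matches the query at zero cost, which is the property that makes DTW tolerant to speaking-rate differences.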
-
Mini-Max Algorithm via Pohozaev Manifold
Authors:
L. A. Maia,
D. Raom,
R. Ruviaro,
Y. D. Sobral
Abstract:
A new algorithm for solving non-homogeneous asymptotically linear and superlinear problems is proposed. The ground state solution of the problem, which in general is obtained as a mini-max of the associated functional, is obtained as the minimum of the functional constrained to the Pohozaev manifold instead. Examples are given of the use of this method for finding numerical radially symmetric positive solutions depending on various parameters.
Submitted 8 May, 2019;
originally announced May 2019.
-
Document-Level Neural Machine Translation with Hierarchical Attention Networks
Authors:
Lesly Miculicich,
Dhananjay Ram,
Nikolaos Pappas,
James Henderson
Abstract:
Neural Machine Translation (NMT) can be improved by including document-level contextual information. For this purpose, we propose a hierarchical attention model to capture the context in a structured and dynamic manner. The model is integrated in the original NMT architecture as another level of abstraction, conditioning on the NMT model's own previous hidden states. Experiments show that hierarchical attention significantly improves the BLEU score over a strong NMT baseline with the state-of-the-art in context-aware methods, and that both the encoder and decoder benefit from context in complementary ways.
Submitted 1 October, 2018; v1 submitted 5 September, 2018;
originally announced September 2018.
-
Self-Attentive Residual Decoder for Neural Machine Translation
Authors:
Lesly Miculicich Werlen,
Nikolaos Pappas,
Dhananjay Ram,
Andrei Popescu-Belis
Abstract:
Neural sequence-to-sequence networks with attention have achieved remarkable performance for machine translation. One of the reasons for their effectiveness is their ability to capture relevant source-side contextual information at each time-step prediction through an attention mechanism. However, the target-side context is solely based on the sequence model which, in practice, is prone to a recency bias and lacks the ability to capture effectively non-sequential dependencies among words. To address this limitation, we propose a target-side-attentive residual recurrent network for decoding, where attention over previous words contributes directly to the prediction of the next word. The residual learning facilitates the flow of information from the distant past and is able to emphasize any of the previously translated words, hence it gains access to a wider context. The proposed model outperforms a neural MT baseline as well as a memory and self-attention network on three language pairs. The analysis of the attention learned by the decoder confirms that it emphasizes a wider context, and that it captures syntactic-like structures.
Submitted 1 October, 2018; v1 submitted 14 September, 2017;
originally announced September 2017.
-
A Bayesian Approach to Estimation of Speaker Normalization Parameters
Authors:
Dhananjay Ram,
Debasis Kundu,
Rajesh M. Hegde
Abstract:
In this work, a Bayesian approach to speaker normalization is proposed to compensate for the degradation in performance of a speaker independent speech recognition system. The speaker normalization method proposed herein uses the technique of vocal tract length normalization (VTLN). The VTLN parameters are estimated using a novel Bayesian approach which utilizes the Gibbs sampler, a special type of Markov Chain Monte Carlo method. Additionally, the hyperparameters are estimated using a maximum likelihood approach. This model is used assuming that the human vocal tract can be modeled as a tube of uniform cross section. It captures the variation in length of the vocal tract of different speakers more effectively than the linear model used in the literature. The work has also investigated different methods like minimization of Mean Square Error (MSE) and Mean Absolute Error (MAE) for the estimation of VTLN parameters. Both single pass and two pass approaches are then used to build a VTLN based speech recognizer. Experimental results on recognition of vowels and Hindi phrases from a medium vocabulary indicate that the Bayesian method improves the performance by a considerable margin.
Submitted 19 October, 2016;
originally announced October 2016.