-
SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation
Authors:
Abhishek Divekar,
Greg Durrett
Abstract:
It is often desirable to distill the capabilities of large language models (LLMs) into smaller student models due to compute and memory constraints. One way to do this for classification tasks is via dataset synthesis, which can be accomplished by generating examples of each label from the LLM. Prior approaches to synthesis use few-shot prompting, which relies on the LLM's parametric knowledge to generate usable examples. However, this leads to issues of repetition, bias towards popular entities, and stylistic differences from human text. In this work, we propose Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval augmentation to introduce variety into the dataset synthesis process: as retrieved passages vary, the LLM is seeded with different content to generate its examples. We empirically study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor, requiring complex synthesis strategies. We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance, when compared to 32-shot prompting and four prior approaches. We release our extensive codebase at https://github.com/amazon-science/synthesizrr
Submitted 8 July, 2024; v1 submitted 16 May, 2024;
originally announced May 2024.
-
Benchmarking datasets for Anomaly-based Network Intrusion Detection: KDD CUP 99 alternatives
Authors:
Abhishek Divekar,
Meet Parekh,
Vaibhav Savla,
Rudra Mishra,
Mahesh Shirole
Abstract:
Machine Learning has been steadily gaining traction for its use in Anomaly-based Network Intrusion Detection Systems (A-NIDS). Research into this domain is frequently performed using the KDD CUP 99 dataset as a benchmark. Several studies question its usability while constructing a contemporary NIDS, due to the skewed response distribution, non-stationarity, and failure to incorporate modern attacks. In this paper, we compare the performance for KDD-99 alternatives when trained using classification models commonly found in literature: Neural Network, Support Vector Machine, Decision Tree, Random Forest, Naive Bayes and K-Means. Applying the SMOTE oversampling technique and random undersampling, we create a balanced version of NSL-KDD and prove that skewed target classes in KDD-99 and NSL-KDD hamper the efficacy of classifiers on minority classes (U2R and R2L), leading to possible security risks. We explore UNSW-NB15, a modern substitute to KDD-99 with greater uniformity of pattern distribution. We benchmark this dataset before and after SMOTE oversampling to observe the effect on minority performance. Our results indicate that classifiers trained on UNSW-NB15 match or better the Weighted F1-Score of those trained on NSL-KDD and KDD-99 in the binary case, thus advocating UNSW-NB15 as a modern substitute to these datasets.
Submitted 13 November, 2018;
originally announced November 2018.
-
Guaranteed sparse signal recovery with highly coherent sensing matrices
Authors:
Guangliang Chen,
Atul Divekar,
Deanna Needell
Abstract:
Compressive sensing is a methodology for the reconstruction of sparse or compressible signals using far fewer samples than required by the Nyquist criterion. However, many of the results in compressive sensing concern random sampling matrices such as Gaussian and Bernoulli matrices. In common physically feasible signal acquisition and reconstruction scenarios such as super-resolution of images, the sensing matrix has a non-random structure with highly correlated columns. Here we present a compressive sensing type recovery algorithm, called Partial Inversion (PartInv), that overcomes the correlations among the columns. We provide theoretical justification as well as empirical comparisons.
Submitted 31 March, 2014; v1 submitted 1 November, 2013;
originally announced November 2013.
-
Changes in quasi-periodic variations of solar photospheric fields: precursor to the deep solar minimum in the cycle 23?
Authors:
Susanta Kumar Bisoi,
P. Janardhan,
D. Chakrabarty,
S. Ananthakrishnan,
Ankur Divekar
Abstract:
Using both wavelet and Fourier analysis, a study has been undertaken of the changes in the quasi-periodic variations in solar photospheric fields in the build-up to one of the deepest solar minima experienced in the past 100 years. This unusual and deep solar minimum occurred between solar cycles 23 and 24. The study, carried out using ground-based synoptic magnetograms spanning the period 1975.14 to 2009.86, covered solar cycles 21, 22 and 23. A hemispheric asymmetry in periodicities of the photospheric fields was seen only at latitudes above $\pm45^{\circ}$ when the data was divided, based on a wavelet analysis, into two parts: one prior to 1996 and the other after 1996. Furthermore, the hemispheric asymmetry was observed to be confined to the latitude range $45^{\circ}$ to $60^{\circ}$. This can be attributed to the variations in polar surges that primarily depend on both the emergence of surface magnetic flux and varying solar surface flows. The observed asymmetry, when coupled with the fact that both solar fields above $\pm45^{\circ}$ and micro-turbulence levels in the inner heliosphere have been decreasing since the early to mid-nineties (Janardhan et al., 2011), suggests that around this time active changes occurred in the solar dynamo that governs the underlying basic processes in the sun. These changes in turn probably initiated the build-up to the very deep solar minimum at the end of cycle 23. The decline in fields above $\pm45^{\circ}$ for well over a solar cycle would imply that weak polar fields have been generated in the past two successive solar cycles, viz. cycles 22 and 23. A continuation of this declining trend beyond 22 years, if it occurs, will have serious implications on our current understanding of the solar dynamo.
Submitted 28 April, 2013;
originally announced April 2013.
-
Using Correlated Subset Structure for Compressive Sensing Recovery
Authors:
Atul Divekar,
Deanna Needell
Abstract:
Compressive sensing is a methodology for the reconstruction of sparse or compressible signals using far fewer samples than required by the Nyquist criterion. However, many of the results in compressive sensing concern random sampling matrices such as Gaussian and Bernoulli matrices. In common physically feasible signal acquisition and reconstruction scenarios such as super-resolution of images, the sensing matrix has a non-random structure with highly correlated columns. Here we present a compressive sensing recovery algorithm that exploits this correlation structure. We provide algorithmic justification as well as empirical comparisons.
Submitted 10 June, 2013; v1 submitted 15 February, 2013;
originally announced February 2013.