Synthetic Datasets for Program Similarity Research
Authors:
Alexander Interrante-Grant,
Michael Wang,
Lisa Baer,
Ryan Whelan,
Tim Leek
Abstract:
Program similarity has become an increasingly popular area of research with various security applications such as plagiarism detection, author identification, and malware analysis. However, program similarity research faces a few unique dataset quality problems in evaluating the effectiveness of novel approaches. First, few high-quality datasets for binary program similarity exist and are widely u…
▽ More
Program similarity has become an increasingly popular area of research with various security applications such as plagiarism detection, author identification, and malware analysis. However, program similarity research faces a few unique dataset quality problems in evaluating the effectiveness of novel approaches. First, few high-quality datasets for binary program similarity exist and are widely used in this domain. Second, there are potentially many different, disparate definitions of what makes one program similar to another and in many cases there is often a large semantic gap between the labels provided by a dataset and any useful notion of behavioral or semantic similarity. In this paper, we present HELIX - a framework for generating large, synthetic program similarity datasets. We also introduce Blind HELIX, a tool built on top of HELIX for extracting HELIX components from library code automatically using program slicing. We evaluate HELIX and Blind HELIX by comparing the performance of program similarity tools on a HELIX dataset to a hand-crafted dataset built from multiple, disparate notions of program similarity. Using Blind HELIX, we show that HELIX can generate realistic and useful datasets of virtually infinite size for program similarity research with ground truth labels that embody practical notions of program similarity. Finally, we discuss the results and reason about relative tool ranking.
△ Less
Submitted 6 May, 2024;
originally announced May 2024.
Nonlinear functional map** of the human brain
Authors:
Nicholas Allgaier,
Tobias Banaschewski,
Gareth Barker,
Arun L. W. Bokde,
Josh C. Bongard,
Uli Bromberg,
Christian Büchel,
Anna Cattrell,
Patricia J. Conrod,
Christopher M. Danforth,
Sylvane Desrivières,
Peter S. Dodds,
Herta Flor,
Vincent Frouin,
Jürgen Gallinat,
Penny Gowland,
Andreas Heinz,
Bernd Ittermann,
Scott Mackey,
Jean-Luc Martinot,
Kevin Murphy,
Frauke Nees,
Dimitri Papadopoulos-Orfanos,
Luise Poustka,
Michael N. Smolka
, et al. (5 additional authors not shown)
Abstract:
The field of neuroimaging has truly become data rich, and novel analytical methods capable of gleaning meaningful information from large stores of imaging data are in high demand. Those methods that might also be applicable on the level of individual subjects, and thus potentially useful clinically, are of special interest. In the present study, we introduce just such a method, called nonlinear fu…
▽ More
The field of neuroimaging has truly become data rich, and novel analytical methods capable of gleaning meaningful information from large stores of imaging data are in high demand. Those methods that might also be applicable on the level of individual subjects, and thus potentially useful clinically, are of special interest. In the present study, we introduce just such a method, called nonlinear functional map** (NFM), and demonstrate its application in the analysis of resting state fMRI from a 242-subject subset of the IMAGEN project, a European study of adolescents that includes longitudinal phenotypic, behavioral, genetic, and neuroimaging data. NFM employs a computational technique inspired by biological evolution to discover and mathematically characterize interactions among ROI (regions of interest), without making linear or univariate assumptions. We show that statistics of the resulting interaction relationships comport with recent independent work, constituting a preliminary cross-validation. Furthermore, nonlinear terms are ubiquitous in the models generated by NFM, suggesting that some of the interactions characterized here are not discoverable by standard linear methods of analysis. We discuss one such nonlinear interaction in the context of a direct comparison with a procedure involving pairwise correlation, designed to be an analogous linear version of functional map**. We find another such interaction that suggests a novel distinction in brain function between drinking and non-drinking adolescents: a tighter coupling of ROI associated with emotion, reward, and interoceptive processes such as thirst, among drinkers. Finally, we outline many improvements and extensions of the methodology to reduce computational expense, complement other analytical tools like graph-theoretic analysis, and allow for voxel level NFM to eliminate the necessity of ROI selection.
△ Less
Submitted 8 September, 2015;
originally announced October 2015.