Code4ML: a Large-scale Dataset of annotated Machine Learning Code

Authors: Anastasia Drozdova, Polina Guseva, Ekaterina Trofimova, Anna Scherbakova, Andrey Ustyuzhanin

Abstract: Program code as a data source is gaining popularity in the data science community. Possible applications for models trained on such assets range from classification for data dimensionality reduction to automatic code generation. However, without annotation, the number of methods that can be applied is somewhat limited. To address the lack of annotated datasets, we present the Code4ML corpus. It contains code snippets, task summaries, competitions and dataset descriptions publicly available from Kaggle, the leading platform for hosting data science competitions. The corpus consists of ~2.5 million snippets of ML code collected from ~100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose. The Code4ML dataset can potentially help address a number of software engineering or data science challenges through a data-driven approach. For example, it can be helpful for semantic code classification, code auto-completion, and code generation for an ML task specified in natural language.

Submitted 28 October, 2022; originally announced October 2022.

Comments: Under review

arXiv:2201.11252 [pdf, other]

Semantic Code Classification for Automated Machine Learning

Authors: Polina Guseva, Anastasia Drozdova, Natalia Denisenko, Daria Sapozhnikova, Ivan Pyaternev, Anna Scherbakova, Andrey Ustuzhanin

Abstract: A range of applications for automatic machine learning need the generation process to be controllable. In this work, we propose a way to control the output via a sequence of simple actions called semantic code classes. Finally, we present a semantic code classification task and discuss methods for solving this problem on the Natural Language to Machine Learning (NL2ML) dataset.

Submitted 25 January, 2022; originally announced January 2022.

Comments: 15 pages including references, New In ML workshop at NeurIPS'21

arXiv:2111.07256 [pdf]

Towards annotation of text worlds in a literary work

Authors: Elena Mikhalkova, Timofei Protasov, Anastasiia Drozdova, Anastasiia Bashmakova, Polina Gavin

Abstract: Literary texts are usually rich in meanings, and their interpretation complicates corpus studies and automatic processing. There have been several attempts to create collections of literary texts with annotation of literary elements like the author's speech, characters, events, scenes, etc. However, they resulted in small collections and standalone rules for annotation. The present article describes an experiment on lexical annotation of text worlds in a literary work and quantitative methods of their comparison. The experiment shows that for well-agreed tag assignment, annotation rules should be set much more strictly. However, if borders between text worlds and other elements are the result of a subjective interpretation, they should be modeled as fuzzy entities.

Submitted 14 November, 2021; originally announced November 2021.

Comments: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue" (Moscow, Russia), Issue 18, supplementary volume

arXiv:2008.01009 [pdf, other]

The Splay-List: A Distribution-Adaptive Concurrent Skip-List

Authors: Vitaly Aksenov, Dan Alistarh, Alexandra Drozdova, Amirkeivan Mohtashami

Abstract: The design and implementation of efficient concurrent data structures have seen significant attention. However, most of this work has focused on concurrent data structures providing good worst-case guarantees. In real workloads, objects are often accessed at different rates, since access distributions may be non-uniform. Efficient distribution-adaptive data structures are known in the sequential case, e.g. the splay-trees; however, they often are hard to translate efficiently in the concurrent case. In this paper, we investigate distribution-adaptive concurrent data structures and propose a new design called the splay-list. At a high level, the splay-list is similar to a standard skip-list, with the key distinction that the height of each element adapts dynamically to its access rate: popular elements "move up," whereas rarely-accessed elements decrease in height. We show that the splay-list provides order-optimal amortized complexity bounds for a subset of operations while being amenable to efficient concurrent implementation. Experimental results show that the splay-list can leverage distribution-adaptivity to improve on the performance of classic concurrent designs, and can outperform the only previously-known distribution-adaptive design in certain settings.

Submitted 3 August, 2020; originally announced August 2020.

arXiv:cs/9902014

Proceedings from Critical Infrastructure: The Path Ahead (XIWT Symposium on Cross-Industry Activities for Information Infrastructure Robustness)

Authors: Barry M. Leiner, Ekaterina A. Drozdova

Abstract: The Cross-Industry Working Team (XIWT), with the support of the Stanford University Consortium for Research on Information Security and Policy (CRISP), sponsored a symposium on cross-industry activities aimed at improving the reliability, dependability, and robustness of the information infrastructure. Held 3-4 November 1998 in Crystal City, Virginia, the symposium engaged representatives from industry, academia, and government in discussion of current and potential cross-industry, cross-sector activities including information exchange, collaborative operations, and cooperative research and development. This proceedings summarizes the discussions and results of the meeting.

Submitted 8 February, 1999; originally announced February 1999.

ACM Class: C.2.0

Showing 1–5 of 5 results for author: Drozdova, A