-
The Lean Data Scientist: Recent Advances towards Overcoming the Data Bottleneck
Authors:
Chen Shani,
Jonathan Zarecki,
Dafna Shahaf
Abstract:
Machine learning (ML) is revolutionizing the world, affecting almost every field of science and industry. Recent algorithms (in particular, deep networks) are increasingly data-hungry, requiring large datasets for training. Thus, the dominant paradigm in ML today involves constructing large, task-specific datasets.
However, obtaining quality datasets of such magnitude proves to be a difficult challenge. A variety of methods have been proposed to address this data bottleneck problem, but they are scattered across different areas, and it is hard for a practitioner to keep up with the latest developments. In this work, we propose a taxonomy of these methods. Our goal is twofold: (1) We wish to raise the community's awareness of the methods that already exist and encourage more efficient use of resources, and (2) we hope that such a taxonomy will contribute to our understanding of the problem, inspiring novel ideas and strategies to replace current annotation-heavy approaches.
Submitted 15 November, 2022;
originally announced November 2022.
-
Topo2vec: Topography Embedding Using the Fractal Effect
Authors:
Jonathan Kavitzky,
Jonathan Zarecki,
Idan Brusilovsky,
Uriel Singer
Abstract:
Recent advances in deep learning have transformed many fields by introducing generic embedding spaces, capable of achieving great predictive performance with minimal labeling effort. The geology field has not yet met such success. In this work, we introduce an extension for self-supervised learning techniques tailored for exploiting the fractal-effect in remote-sensing images. The fractal-effect assumes that the same structures (for example, rivers, peaks, and saddles) will appear at all scales. We demonstrate our method's effectiveness on elevation data and also use the effect at inference time. We perform an extensive analysis on several classification tasks and emphasize the method's effectiveness in detecting the same class at different scales. To the best of our knowledge, this is the first attempt to build a generic representation for topographic images.
Submitted 19 August, 2021;
originally announced August 2021.
-
Textual Membership Queries
Authors:
Jonathan Zarecki,
Shaul Markovitch
Abstract:
Human labeling of data can be very time-consuming and expensive, yet in many cases it is critical for the success of the learning process. In order to minimize human labeling efforts, we propose a novel active learning solution that does not rely on existing sources of unlabeled data. It uses a small amount of labeled data as the core set for the synthesis of useful membership queries (MQs): unlabeled instances generated by an algorithm for human labeling. Our solution uses modification operators, functions that modify instances to some extent. We apply the operators to a small set of instances (the core set), creating a set of new membership queries. Using this framework, we look at the instance space as a search space and apply search algorithms in order to generate new examples highly relevant to the learner. We implement this framework in the textual domain, test it on several text classification tasks, and show improved classifier performance as more MQs are labeled and incorporated into the training set. To the best of our knowledge, this is the first work on membership queries in the textual domain.
Submitted 7 August, 2020; v1 submitted 11 May, 2018;
originally announced May 2018.
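The query-synthesis loop described in the Textual Membership Queries abstract (modification operators applied to a core set, with the instance space treated as a search space) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the synonym table, the word-swap operator, the toy uncertainty score standing in for a learner, and the beam-style search with its parameters.

```python
from typing import Callable, Dict, List

# Hypothetical synonym table; the abstract does not say which operators are used.
SYNONYMS: Dict[str, List[str]] = {
    "good": ["great", "fine"],
    "bad": ["poor", "awful"],
}
POSITIVE = {"good", "great", "fine"}
NEGATIVE = {"bad", "poor", "awful"}


def replace_word_operator(sentence: str) -> List[str]:
    """Modification operator: every sentence reachable by one synonym swap."""
    words = sentence.split()
    results = []
    for i, word in enumerate(words):
        for alt in SYNONYMS.get(word, []):
            results.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return results


def toy_uncertainty(sentence: str) -> float:
    """Stand-in for learner uncertainty (assumption): highest when positive
    and negative cue words balance out."""
    words = sentence.split()
    margin = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return 1.0 / (1.0 + abs(margin))


def synthesize_queries(
    core_set: List[str],
    operators: List[Callable[[str], List[str]]],
    score: Callable[[str], float],
    depth: int = 2,
    beam_width: int = 3,
    n_queries: int = 2,
) -> List[str]:
    """Search the instance space from the core set and return the
    highest-scoring unseen instances as membership queries."""
    seen = set(core_set)
    frontier = list(core_set)
    candidates: List[str] = []
    for _ in range(depth):
        generated = []
        for instance in frontier:
            for op in operators:
                for new_instance in op(instance):
                    if new_instance not in seen:
                        seen.add(new_instance)
                        generated.append(new_instance)
        # Keep only the most promising instances for the next expansion step.
        generated.sort(key=score, reverse=True)
        frontier = generated[:beam_width]
        candidates.extend(generated)
    candidates.sort(key=score, reverse=True)
    return candidates[:n_queries]


core = ["the food was good but the service was bad"]
queries = synthesize_queries(core, [replace_word_operator], toy_uncertainty)
for q in queries:
    print(q)  # each printed sentence would be sent to a human for labeling
```

In an actual active learning setup, the returned queries would be labeled by a human and added to the training set, and the scoring function would come from the current classifier rather than from a fixed word list.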