-
CELESTIAL: Classification Enabled via Labelless Embeddings with Self-supervised Telescope Image Analysis Learning
Authors:
Suhas Kotha,
Anirudh Koul,
Siddha Ganju,
Meher Kasam
Abstract:
A common class of problems in remote sensing is scene classification, a fundamentally important task for natural hazards identification, geographic image retrieval, and environment monitoring. Recent developments in this field rely on label-dependent supervised learning techniques, which are ill-suited to the 35 petabytes of unlabelled satellite imagery in NASA GIBS. To solve this problem, we establish CELESTIAL, a self-supervised learning pipeline for effectively leveraging sparsely labeled satellite imagery. This pipeline successfully adapts SimCLR, an algorithm that first learns image representations on unlabelled data and then fine-tunes this knowledge on the provided labels. Our results show CELESTIAL requires only a third of the labels that the supervised method needs to attain the same accuracy on an experimental dataset. The first unsupervised tier can enable applications such as reverse image search for NASA Worldview (i.e. searching for similar atmospheric phenomena over years of unlabelled data with minimal samples), and the second supervised tier can significantly lower the need for expensive data annotation. In the future, we hope to generalize the CELESTIAL pipeline to other data types, algorithms, and applications.
Submitted 19 January, 2022;
originally announced January 2022.
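
The first, label-free stage of SimCLR, which the CELESTIAL abstract above adapts, trains an encoder with a contrastive objective commonly called NT-Xent: two augmented views of the same image should have similar embeddings, while views of different images should not. The following is a minimal NumPy sketch of that loss only, not the authors' pipeline. The function name `nt_xent_loss` and the default temperature of 0.5 are assumptions made for illustration.

```python
import numpy as np


def nt_xent_loss(z_a, z_b, temperature=0.5):
    """Contrastive (NT-Xent) loss for two batches of embeddings.

    z_a, z_b: arrays of shape (N, D); row i of each is the embedding of a
    different augmented view of the same image i.
    The temperature default is an illustrative assumption.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit length -> cosine similarity

    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)  # a view is never its own negative or positive

    # The positive for row i is row i + n, and the positive for row i + n is row i.
    positive = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Row-wise log-softmax, shifted by the row maximum for numerical stability.
    shifted = sim - sim.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    return -log_prob[np.arange(2 * n), positive].mean()
```

In a full pipeline the embeddings would come from an encoder applied to augmented satellite tiles; the loss is lower when the two views of each image agree than when they are unrelated.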
-
Scalable Reverse Image Search Engine for NASA Worldview
Authors:
Abhigya Sodani,
Michael Levy,
Anirudh Koul,
Meher Anand Kasam,
Siddha Ganju
Abstract:
Researchers often spend weeks sifting through decades of unlabeled satellite imagery (on NASA Worldview) in order to develop datasets on which they can start conducting research. We developed an interactive, scalable, and fast image similarity search engine (which can take one or more images as the query) that automatically sifts through the unlabeled dataset, reducing dataset generation time from weeks to minutes. In this work, we describe key components of the end-to-end pipeline. Our similarity search system was created to identify images similar to an input image in a potentially petabyte-scale database. For this, we had to break down each query image into its features, which were generated by a CNN trained in a supervised manner and stripped of its classification layer. To store and search these features efficiently, we had to make several scalability improvements. To improve the speed, reduce the storage, and shrink memory requirements for embedding search, we add a fully connected layer to our CNN that maps every image to a 128-length vector before it enters the classification layers. This helped us compress the size of our image features from 2048 (for ResNet, which was initially tried as our featurizer) to 128 for our new custom model. Additionally, we utilize existing approximate nearest neighbor search libraries to significantly speed up embedding search. Our system currently searches over our entire database of images at 5 seconds per query on a single virtual machine in the cloud. In the future, we would like to incorporate a SimCLR-based featurizing model, which could be trained without any human labelling (since the classification aspect of the model is irrelevant to this use case).
Submitted 10 August, 2021;
originally announced August 2021.
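
The search step described in the abstract above can be sketched as a nearest-neighbor lookup over 128-length embeddings. This sketch swaps in exact brute-force cosine search with NumPy in place of the approximate nearest neighbor libraries the abstract mentions, so it shows the idea rather than the scalable system. Combining several query images by averaging their normalized embeddings is an assumption, as are the names `normalize` and `search`; the abstract does not say how multiple queries are merged.

```python
import numpy as np


def normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def search(index, queries, k=5):
    """Return the k most similar database rows for one or more query embeddings.

    index:   (N, 128) array of database embeddings.
    queries: (Q, 128) array holding one or more query embeddings.
    Multiple queries are averaged into one vector (an assumption of this
    sketch, not a detail taken from the abstract).
    """
    database = normalize(index)
    query = normalize(normalize(queries).mean(axis=0))
    scores = database @ query          # cosine similarity to every image
    top = np.argsort(-scores)[:k]      # exact search; an ANN library would replace this
    return top, scores[top]
```

The move from 2048 to 128 features is a 16-fold reduction: stored as 32-bit floats, each image needs 512 bytes instead of 8192.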
-
Scalable Data Balancing for Unlabeled Satellite Imagery
Authors:
Deep Patel,
Erin Gao,
Anirudh Koul,
Siddha Ganju,
Meher Anand Kasam
Abstract:
Data imbalance is a ubiquitous problem in machine learning. In large-scale collected and annotated datasets, data imbalance is either mitigated manually by undersampling frequent classes and oversampling rare classes, or planned for with imputation and augmentation techniques. In both cases, balancing data requires labels. In other words, only annotated data can be balanced. Collecting fully annotated datasets is challenging, especially for large-scale satellite systems such as NASA's unlabeled 35 PB Earth Imagery dataset. Although the NASA Earth Imagery dataset is unlabeled, there are implicit properties of the data source that we can rely on to hypothesize about its imbalance, such as the distribution of land and water in the case of the Earth's imagery. We present a new iterative method to balance unlabeled data. Our method utilizes image embeddings as a proxy for image labels that can be used to balance data and, ultimately, increases overall accuracy when a model is trained on the balanced data.
Submitted 7 July, 2021;
originally announced July 2021.
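
The abstract above does not spell out its iterative method, so the following is only one plausible illustration of using embeddings as a proxy for labels: cluster the embeddings with k-means, treat cluster ids as pseudo-labels, and draw the same number of images from each cluster. It is not the authors' algorithm. The names `farthest_point_init`, `kmeans`, and `balanced_sample`, and the choice of k-means itself, are assumptions of this sketch.

```python
import numpy as np


def farthest_point_init(x, k):
    """Pick k starting centers: the first point, then repeatedly the farthest point."""
    centers = [x[0]]
    for _ in range(k - 1):
        dist = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[dist.argmax()])
    return np.array(centers, dtype=float)


def kmeans(x, k, iters=20):
    """Plain Lloyd's k-means; returns one cluster id (pseudo-label) per row of x."""
    centers = farthest_point_init(x, k)
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels


def balanced_sample(labels, per_cluster, seed=0):
    """Undersample: keep at most per_cluster randomly chosen indices from each cluster."""
    rng = np.random.default_rng(seed)
    picked = []
    for j in np.unique(labels):
        idx = np.flatnonzero(labels == j)
        picked.append(rng.choice(idx, size=min(per_cluster, len(idx)), replace=False))
    return np.concatenate(picked)
```

For example, if 90 percent of the embeddings fall in a "water"-like cluster and 10 percent in a "land"-like cluster, sampling a fixed count per cluster yields a subset with equal representation of both, without any human labels.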
-
Reducing Effects of Swath Gaps on Unsupervised Machine Learning Models for NASA MODIS Instruments
Authors:
Sarah Chen,
Esther Cao,
Anirudh Koul,
Siddha Ganju,
Satyarth Praveen,
Meher Anand Kasam
Abstract:
Due to the nature of their pathways, NASA Terra and NASA Aqua satellites capture imagery containing swath gaps, which are areas of no data. Swath gaps can overlap the region of interest (ROI) completely, often rendering the entire imagery unusable by Machine Learning (ML) models. This problem is further exacerbated when the ROI rarely occurs (e.g. a hurricane) and, on occurrence, is partially overlapped by a swath gap. With annotated data as supervision, a model can learn to differentiate between the area of focus and the swath gap. However, annotation is expensive, and currently the vast majority of existing data is unannotated. Hence, we propose an augmentation technique that considerably reduces the presence of swath gaps in order to allow CNNs to focus on the ROI, and thus successfully use data with swath gaps for training. We experiment on the UC Merced Land Use Dataset, where we add swath gaps through empty polygons (covering up to 20 percent of the image area) and then apply augmentation techniques to fill the swath gaps. We compare the model trained with our augmentation techniques on the swath gap-filled data with the model trained on the original swath gap-less data and note substantially improved performance. Additionally, we perform a qualitative analysis using activation maps that visualizes the effectiveness of our trained network in not paying attention to the swath gaps. We also evaluate our results against a human baseline and show that, in certain cases, the filled swath gaps look so realistic that even a human evaluator did not distinguish between original satellite images and swath gap-filled images. Since this method is aimed at unlabeled data, it is widely generalizable and impactful for large-scale unannotated datasets from various space data domains.
Submitted 31 July, 2021; v1 submitted 13 June, 2021;
originally announced June 2021.
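
The experimental setup in the abstract above (add synthetic swath gaps to clean images, then fill them) can be illustrated with a small NumPy sketch. The gap here is a slanted band covering at most 20 percent of the image, and the fill copies randomly chosen valid pixels from the same image into the gap. Both choices are assumptions made for illustration: the abstract does not say which polygon shapes or fill techniques the authors used, and the names `swath_gap_mask` and `fill_gap_with_random_pixels` are hypothetical.

```python
import numpy as np


def swath_gap_mask(h, w, fraction=0.2, slant=0.1, offset=0.3):
    """Boolean (h, w) mask of a slanted band; True marks missing (gap) pixels.

    fraction: band width as a share of image width (upper bound on gap area).
    slant:    horizontal drift of the band from top row to bottom row, as a share of width.
    offset:   where the band starts, as a share of the free horizontal space.
    """
    band = int(fraction * w)
    shift = int(slant * w)
    x0 = int(offset * (w - band - shift))
    cols = np.arange(w)[None, :]
    start = x0 + (np.arange(h)[:, None] * shift) // max(h - 1, 1)
    return (cols >= start) & (cols < start + band)


def fill_gap_with_random_pixels(image, mask, seed=0):
    """Fill gap pixels with pixels sampled at random from the valid region."""
    rng = np.random.default_rng(seed)
    filled = image.copy()
    valid = image[~mask]                               # (n_valid, channels)
    picks = rng.integers(0, len(valid), size=int(mask.sum()))
    filled[mask] = valid[picks]
    return filled
```

A training pipeline would apply such a fill as an augmentation step before images reach the CNN, so that the network sees plausible texture rather than a hard-edged empty region.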
-
SpaceML: Distributed Open-source Research with Citizen Scientists for the Advancement of Space Technology for NASA
Authors:
Anirudh Koul,
Siddha Ganju,
Meher Kasam,
James Parr
Abstract:
Traditionally, academic labs conduct open-ended research with the primary focus on discoveries with long-term value, rather than direct products that can be deployed in the real world. On the other hand, research in industry is driven by its expected commercial return on investment, and hence focuses on real-world products with short-term timelines. In both cases, opportunity is selective, often available only to researchers with advanced educational backgrounds. Research often happens behind closed doors and may be kept confidential until either its publication or product release, exacerbating the problem of AI reproducibility and slowing down future research by others in the field. As many research organizations tend to focus exclusively on specific areas, opportunities for interdisciplinary research diminish. Undertaking long-term, bold research in unexplored fields with non-commercial yet great public value is hard due to factors including the high upfront risk, budgetary constraints, and a lack of availability of data and experts in niche fields. Only a few companies or well-funded research labs can afford to do such long-term research. With research organizations focused on an exploding array of fields and resources spread thin, opportunities for the maturation of interdisciplinary research diminish. Apart from these exigencies, there is also a need to engage citizen scientists as open-source contributors who play an active part in the research dialogue. We present a short case study of SpaceML, an extension of the Frontier Development Lab, an AI accelerator for NASA. SpaceML distributes open-source research and invites volunteer citizen scientists to partake in the development and deployment of high social value products at the intersection of space and AI.
Submitted 16 February, 2021; v1 submitted 19 December, 2020;
originally announced December 2020.
-
Learnings from Frontier Development Lab and SpaceML -- AI Accelerators for NASA and ESA
Authors:
Siddha Ganju,
Anirudh Koul,
Alexander Lavin,
Josh Veitch-Michaelis,
Meher Kasam,
James Parr
Abstract:
Research with AI and ML technologies lives in a variety of settings with often asynchronous goals and timelines: academic labs and government organizations pursue open-ended research focusing on discoveries with long-term value, while research in industry is driven by commercial pursuits and hence focuses on short-term timelines and return on investment. The journey from research to product is often tacit or ad hoc, resulting in technology transition failures, further exacerbated when research and development is interorganizational and interdisciplinary. Moreover, much of the ability to produce results remains locked in the private repositories and know-how of the individual researcher, slowing the impact on future research by others and contributing to the ML community's challenges in reproducibility. With research organizations focused on an exploding array of fields, opportunities for the handover and maturation of interdisciplinary research diminish. With these tensions, we see an emerging need to measure the correctness, impact, and relevance of research during its development to enable better collaboration, improved reproducibility, faster progress, and more trusted outcomes. We perform a case study of the Frontier Development Lab (FDL), an AI accelerator under a public-private partnership from NASA and ESA. FDL research follows principled practices that are grounded in responsible development, conduct, and dissemination of AI research, enabling FDL to churn out successful interdisciplinary and interorganizational research projects, measured through NASA's Technology Readiness Levels. We also take a look at the SpaceML Open Source Research Program, which helps accelerate and transition FDL's research to deployable projects with widespread adoption amongst citizen scientists.
Submitted 9 November, 2020;
originally announced November 2020.