-
Nodule detection and generation on chest X-rays: NODE21 Challenge
Authors:
Ecem Sogancioglu,
Bram van Ginneken,
Finn Behrendt,
Marcel Bengs,
Alexander Schlaefer,
Miron Radu,
Di Xu,
Ke Sheng,
Fabien Scalzo,
Eric Marcus,
Samuele Papa,
Jonas Teuwen,
Ernst Th. Scholten,
Steven Schalekamp,
Nils Hendrix,
Colin Jacobs,
Ward Hendrix,
Clara I Sánchez,
Keelin Murphy
Abstract:
Pulmonary nodules may be an early manifestation of lung cancer, the leading cause of cancer-related deaths among both men and women. Numerous studies have established that deep learning methods can yield high-performance levels in the detection of lung nodules in chest X-rays. However, the lack of gold-standard public datasets slows down the progression of the research and prevents benchmarking of…
▽ More
Pulmonary nodules may be an early manifestation of lung cancer, the leading cause of cancer-related deaths among both men and women. Numerous studies have established that deep learning methods can yield high-performance levels in the detection of lung nodules in chest X-rays. However, the lack of gold-standard public datasets slows down the progression of the research and prevents benchmarking of methods for this task. To address this, we organized a public research challenge, NODE21, aimed at the detection and generation of lung nodules in chest X-rays. While the detection track assesses state-of-the-art nodule detection systems, the generation track determines the utility of nodule generation algorithms to augment training data and hence improve the performance of the detection systems. This paper summarizes the results of the NODE21 challenge and performs extensive additional experiments to examine the impact of the synthetically generated nodule training images on the detection algorithm performance.
△ Less
Submitted 4 January, 2024;
originally announced January 2024.
-
Constructing a Highlight Classifier with an Attention Based LSTM Neural Network
Authors:
Michael Kuehne,
Marius Radu
Abstract:
Data is being produced in larger quantities than ever before in human history. It's only natural to expect a rise in demand for technology that aids humans in sifting through and analyzing this inexhaustible supply of information. This need exists in the market research industry, where large amounts of consumer research data is collected through video recordings. At present, the standard method fo…
▽ More
Data is being produced in larger quantities than ever before in human history. It's only natural to expect a rise in demand for technology that aids humans in sifting through and analyzing this inexhaustible supply of information. This need exists in the market research industry, where large amounts of consumer research data is collected through video recordings. At present, the standard method for analyzing video data is human labor. Market researchers manually review the vast majority of consumer research video in order to identify relevant portions - highlights. The industry state of the art turnaround ratio is 2.2 - for every hour of video content 2.2 hours of manpower are required. In this study we present a novel approach for NLP-based highlight identification and extraction based on a supervised learning model that aides market researchers in sifting through their data. Our approach hinges on a manually curated user-generated highlight clips constructed from long and short-form video data. The problem is best suited for an NLP approach due to the availability of video transcription. We evaluate multiple classes of models, from gradient boosting to recurrent neural networks, comparing their performance in extraction and identification of highlights. The best performing models are then evaluated using four sampling methods designed to analyze documents much larger than the maximum input length of the classifiers. We report very high performances for the standalone classifiers, ROC AUC scores in the range 0.93-0.94, but observe a significant drop in effectiveness when evaluated on large documents. Based on our results we suggest combinations of models/sampling algorithms for various use cases.
△ Less
Submitted 12 February, 2020;
originally announced February 2020.
-
Microsoft Malware Classification Challenge
Authors:
Royi Ronen,
Marian Radu,
Corina Feuerstein,
Elad Yom-Tov,
Mansour Ahmadi
Abstract:
The Microsoft Malware Classification Challenge was announced in 2015 along with a publication of a huge dataset of nearly 0.5 terabytes, consisting of disassembly and bytecode of more than 20K malware samples. Apart from serving in the Kaggle competition, the dataset has become a standard benchmark for research on modeling malware behaviour. To date, the dataset has been cited in more than 50 rese…
▽ More
The Microsoft Malware Classification Challenge was announced in 2015 along with a publication of a huge dataset of nearly 0.5 terabytes, consisting of disassembly and bytecode of more than 20K malware samples. Apart from serving in the Kaggle competition, the dataset has become a standard benchmark for research on modeling malware behaviour. To date, the dataset has been cited in more than 50 research papers. Here we provide a high-level comparison of the publications citing the dataset. The comparison simplifies finding potential research directions in this field and future performance evaluation of the dataset.
△ Less
Submitted 22 February, 2018;
originally announced February 2018.
-
Minimally Supervised Feature Selection for Classification (Master's Thesis, University Politehnica of Bucharest)
Authors:
Alexandra Maria Radu
Abstract:
In the context of the highly increasing number of features that are available nowadays we design a robust and fast method for feature selection. The method tries to select the most representative features that are independent from each other, but are strong together. We propose an algorithm that requires very limited labeled data (as few as one labeled frame per class) and can accommodate as many…
▽ More
In the context of the highly increasing number of features that are available nowadays we design a robust and fast method for feature selection. The method tries to select the most representative features that are independent from each other, but are strong together. We propose an algorithm that requires very limited labeled data (as few as one labeled frame per class) and can accommodate as many unlabeled samples. We also present here the supervised approach from which we started. We compare our two formulations with established methods like AdaBoost, SVM, Lasso, Elastic Net and FoBa and show that our method is much faster and it has constant training time. Moreover, the unsupervised approach outperforms all the methods with which we compared and the difference might be quite prominent. The supervised approach is in most cases better than the other methods, especially when the number of training shots is very limited. All that the algorithm needs is to choose from a pool of positively correlated features. The methods are evaluated on the Youtube-Objects dataset of videos and on MNIST digits dataset, while at training time we also used features obtained on CIFAR10 dataset and others pre-trained on ImageNet dataset. Thereby, we also proved that transfer learning is useful, even though the datasets differ very much: from low-resolution centered images from 10 classes, to high-resolution images with objects from 1000 classes occurring in different regions of the images or to very difficult videos with very high intraclass variance. 7
△ Less
Submitted 9 December, 2015;
originally announced December 2015.