-
WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation
Authors:
Jongheon Jeong,
Yang Zou,
Taewan Kim,
Dongqing Zhang,
Avinash Ravichandran,
Onkar Dabeer
Abstract:
Visual anomaly classification and segmentation are vital for automating industrial quality inspection. The focus of prior research in the field has been on training custom models for each quality inspection task, which requires task-specific images and annotation. In this paper we move away from this regime, addressing zero-shot and few-normal-shot anomaly classification and segmentation. Recently CLIP, a vision-language model, has shown revolutionary generality with competitive zero-/few-shot performance in comparison to full-supervision. But CLIP falls short on anomaly classification and segmentation tasks. Hence, we propose window-based CLIP (WinCLIP) with (1) a compositional ensemble on state words and prompt templates and (2) efficient extraction and aggregation of window/patch/image-level features aligned with text. We also propose its few-normal-shot extension WinCLIP+, which uses complementary information from normal images. In MVTec-AD (and VisA), without further tuning, WinCLIP achieves 91.8%/85.1% (78.1%/79.6%) AUROC in zero-shot anomaly classification and segmentation while WinCLIP+ does 93.1%/95.2% (83.8%/96.4%) in 1-normal-shot, surpassing state-of-the-art by large margins.
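The image-level zero-shot scoring step can be illustrated with a minimal sketch. It assumes CLIP-style text and image embeddings are already computed; the toy 3-d vectors, the temperature of 0.01, and all function names below are hypothetical stand-ins, not the authors' code. Embeddings for the "normal" and "anomalous" prompts (the state words combined with templates) are each averaged into one class embedding, and the anomaly score is a two-way softmax over cosine similarities. The window/patch-level aggregation WinCLIP uses for segmentation is not shown.

```python
import math

def normalize(v):
    # Scale a vector to unit L2 norm.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def prompt_ensemble(prompt_embeddings):
    # Average the unit-normalized embeddings of all prompts for one class
    # (e.g. every state word inserted into every template), then renormalize.
    acc = [0.0] * len(prompt_embeddings[0])
    for v in prompt_embeddings:
        for i, x in enumerate(normalize(v)):
            acc[i] += x
    return normalize(acc)

def zero_shot_anomaly_score(image_emb, normal_prompts, anomalous_prompts, temperature=0.01):
    # Two-way softmax over cosine similarities; returns P(anomalous).
    s_normal = cosine(image_emb, prompt_ensemble(normal_prompts)) / temperature
    s_anom = cosine(image_emb, prompt_ensemble(anomalous_prompts)) / temperature
    m = max(s_normal, s_anom)  # subtract the max for numerical stability
    e_normal, e_anom = math.exp(s_normal - m), math.exp(s_anom - m)
    return e_anom / (e_normal + e_anom)

# Toy 3-d vectors standing in for CLIP text and image embeddings (hypothetical).
normal_prompts = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]]
anomalous_prompts = [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]
print(zero_shot_anomaly_score([0.1, 1.0, 0.0], normal_prompts, anomalous_prompts))
```

An image embedding that lies closer to the anomalous prompt ensemble receives a score above 0.5.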
Submitted 26 March, 2023;
originally announced March 2023.
-
A Meta-Learning Approach to Predicting Performance and Data Requirements
Authors:
Achin Jain,
Gurumurthy Swaminathan,
Paolo Favaro,
Hao Yang,
Avinash Ravichandran,
Hrayr Harutyunyan,
Alessandro Achille,
Onkar Dabeer,
Bernt Schiele,
Ashwin Swaminathan,
Stefano Soatto
Abstract:
We propose an approach to estimate the number of samples required for a model to reach a target performance. We find that the power law, the de facto principle to estimate model performance, leads to large error when using a small dataset (e.g., 5 samples per class) for extrapolation. This is because the log-performance error against the log-dataset size follows a nonlinear progression in the few-shot regime followed by a linear progression in the high-shot regime. We introduce a novel piecewise power law (PPL) that handles the two data regimes differently. To estimate the parameters of the PPL, we introduce a random forest regressor trained via meta learning that generalizes across classification/detection tasks, ResNet/ViT based architectures, and random/pre-trained initializations. The PPL improves the performance estimation on average by 37% across 16 classification and 33% across 10 detection datasets, compared to the power law. We further extend the PPL to provide a confidence bound and use it to limit the prediction horizon that reduces over-estimation of data by 76% on classification and 91% on detection datasets.
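For reference, the plain power-law baseline that the abstract compares against models error as $e(n) \approx a\,n^{-b}$, i.e. a single straight line in log-log space. A minimal sketch, assuming error (not accuracy) is the performance metric, fits that line by least squares and inverts it to estimate the dataset size needed for a target error. The function names are hypothetical, and the piecewise power law with its meta-learned random forest regressor is not reproduced here.

```python
import math

def fit_power_law(sizes, errors):
    """Fit error ~= a * n**(-b) by least squares on (log n, log error)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (a, b)

def samples_for_target(a, b, target_error):
    """Invert error = a * n**(-b) to get the required dataset size n."""
    return (a / target_error) ** (1.0 / b)

sizes = [10, 20, 40, 80]
errors = [2.0 * n ** -0.5 for n in sizes]  # synthetic data that follows a power law exactly
a, b = fit_power_law(sizes, errors)
print(a, b, samples_for_target(a, b, 0.1))  # roughly 2.0, 0.5, 400
```

Because this fit is one straight line in log-log space, it cannot capture the nonlinear few-shot segment the abstract describes, which is why extrapolating from very small datasets this way can be badly off.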
Submitted 2 March, 2023;
originally announced March 2023.
-
ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers
Authors:
Achin Jain,
Kibok Lee,
Gurumurthy Swaminathan,
Hao Yang,
Bernt Schiele,
Avinash Ravichandran,
Onkar Dabeer
Abstract:
Annotating bounding boxes for object detection is expensive, time-consuming, and error-prone. In this work, we propose a DETR-based framework called ComplETR that is designed to explicitly complete missing annotations in partially annotated dense scene datasets. This reduces the need to annotate every object instance in the scene, thereby reducing annotation cost. ComplETR augments object queries in the DETR decoder with patch information of objects in the image. Combined with a matching loss, it can effectively find objects that are similar to the input patch and complete the missing annotations. We show that our framework on its own outperforms state-of-the-art methods such as Soft Sampling and Unbiased Teacher, and that it can also be used in conjunction with these methods to further improve their performance. Our framework is also agnostic to the choice of downstream object detector; we show performance improvements for several popular detectors such as Faster R-CNN, Cascade R-CNN, CenterNet2, and Deformable DETR on multiple dense scene datasets.
Submitted 12 September, 2022;
originally announced September 2022.
-
SPot-the-Difference Self-Supervised Pre-training for Anomaly Detection and Segmentation
Authors:
Yang Zou,
Jongheon Jeong,
Latha Pemula,
Dongqing Zhang,
Onkar Dabeer
Abstract:
Visual anomaly detection is commonly used in industrial quality inspection. In this paper, we present a new dataset as well as a new self-supervised learning method for ImageNet pre-training to improve anomaly detection and segmentation in 1-class and 2-class 5/10/high-shot training setups. We release the Visual Anomaly (VisA) Dataset consisting of 10,821 high-resolution color images (9,621 normal and 1,200 anomalous samples) covering 12 objects in 3 domains, making it the largest industrial anomaly detection dataset to date. Both image- and pixel-level labels are provided. We also propose a new self-supervised framework, SPot-the-difference (SPD), which can regularize contrastive self-supervised pre-training, such as SimSiam, MoCo and SimCLR, to be more suitable for anomaly detection tasks. Our experiments on the VisA and MVTec-AD datasets show that SPD consistently improves these contrastive pre-training baselines and even the supervised pre-training. For example, SPD improves Area Under the Precision-Recall curve (AU-PR) for anomaly segmentation by 5.9% and 6.8% over SimSiam and supervised pre-training respectively in the 2-class high-shot regime. We open-source the project at http://github.com/amazon-research/spot-diff .
Submitted 28 July, 2022;
originally announced July 2022.
-
Rethinking Few-Shot Object Detection on a Multi-Domain Benchmark
Authors:
Kibok Lee,
Hao Yang,
Satyaki Chakraborty,
Zhaowei Cai,
Gurumurthy Swaminathan,
Avinash Ravichandran,
Onkar Dabeer
Abstract:
Most existing works on few-shot object detection (FSOD) focus on a setting where both pre-training and few-shot learning datasets are from a similar domain. However, few-shot algorithms are important in multiple domains; hence evaluation needs to reflect the broad applications. We propose a Multi-dOmain Few-Shot Object Detection (MoFSOD) benchmark consisting of 10 datasets from a wide range of domains to evaluate FSOD algorithms. We comprehensively analyze the impacts of freezing layers, different architectures, and different pre-training datasets on FSOD performance. Our empirical results show several key factors that have not been explored in previous works: 1) contrary to previous belief, on a multi-domain benchmark, fine-tuning (FT) is a strong baseline for FSOD, performing on par or better than the state-of-the-art (SOTA) algorithms; 2) utilizing FT as the baseline allows us to explore multiple architectures, and we found them to have a significant impact on down-stream few-shot tasks, even with similar pre-training performances; 3) by decoupling pre-training and few-shot learning, MoFSOD allows us to explore the impact of different pre-training datasets, and the right choice can boost the performance of the down-stream tasks significantly. Based on these findings, we list possible avenues of investigation for improving FSOD performance and propose two simple modifications to existing algorithms that lead to SOTA performance on the MoFSOD benchmark. The code is available at https://github.com/amazon-research/few-shot-object-detection-benchmark.
Submitted 22 July, 2022;
originally announced July 2022.
-
An End-to-End System for Crowdsourced 3d Maps for Autonomous Vehicles: The Mapping Component
Authors:
Onkar Dabeer,
Radhika Gowaikar,
Slawomir K. Grzechnik,
Mythreya J. Lakshman,
Gerhard Reitmayr,
Kiran Somasundaram,
Ravi Teja Sukhavasi,
Xinzhou Wu
Abstract:
Autonomous vehicles rely on precise high definition (HD) 3d maps for navigation. This paper presents the mapping component of an end-to-end system for crowdsourcing precise 3d maps with semantically meaningful landmarks such as traffic signs (6-DoF pose, shape and size) and traffic lanes (3d splines). The system uses consumer grade parts, and in particular, relies on a single front facing camera and a consumer grade GPS. Using real-time sign and lane triangulation on-device in the vehicle, with offline sign/lane clustering across multiple journeys and offline Bundle Adjustment across multiple journeys in the backend, we construct maps with mean absolute accuracy at sign corners of less than 20 cm from 25 journeys. To the best of our knowledge, this is the first end-to-end HD mapping pipeline in global coordinates in the automotive context using cost effective sensors.
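The triangulation step can be illustrated, in a deliberately simplified form, as finding the point closest (in least squares) to several viewing rays. This sketch assumes a 2d top-down setting where each observation is a known camera position plus a bearing direction toward a landmark; the paper works in 3d and refines with bundle adjustment, neither of which is shown. `triangulate_2d` is a hypothetical helper. It minimizes the sum of squared perpendicular distances to the rays by solving $\sum_i (I - d_i d_i^\top)\,p = \sum_i (I - d_i d_i^\top)\,c_i$.

```python
import math

def triangulate_2d(centers, directions):
    """Least-squares point closest to 2d rays c_i + t * d_i."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (cx, cy), (dx, dy) in zip(centers, directions):
        n = math.hypot(dx, dy)
        dx, dy = dx / n, dy / n
        # Projector onto the ray's normal direction: I - d d^T.
        p11, p12, p22 = 1.0 - dx * dx, -dx * dy, 1.0 - dy * dy
        a11 += p11
        a12 += p12
        a22 += p22
        b1 += p11 * cx + p12 * cy
        b2 += p12 * cx + p22 * cy
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("rays are (nearly) parallel")
    # Solve the symmetric 2x2 system by Cramer's rule.
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

# Two camera positions observing the same sign.
print(triangulate_2d([(0.0, 0.0), (10.0, 0.0)], [(5.0, 5.0), (-5.0, 5.0)]))
```

With noisy bearings from many frames or journeys, the same normal equations simply accumulate more terms.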
Submitted 31 March, 2017; v1 submitted 29 March, 2017;
originally announced March 2017.
-
Transmit Beamforming for MIMO Communication Systems with Low Precision ADC at the Receiver
Authors:
Tapan Shah,
Onkar Dabeer
Abstract:
Multiple antenna systems have been extensively used by standards designing multi-gigabit communication systems operating in bandwidths of several GHz. In this paper, we study the use of transmitter (Tx) beamforming techniques to improve the performance of a MIMO system with a low precision ADC. We motivate an approach that uses eigenmode transmit beamforming (which imposes a diagonal structure on the complete MIMO system) together with an eigenmode power allocation that minimizes the uncoded BER of the finite precision system. Although we cannot guarantee optimality of this approach, we observe that even with a low precision ADC, it performs comparably to a full precision system with no eigenmode power allocation. For example, in a high throughput MIMO system, simulation results show that for a rate-3/4 LDPC-coded 2x2 MIMO OFDM 16-QAM system with a 3-bit precision ADC at the receiver, a BER of 0.0001 is achieved at an SNR of 26 dB. This is 1 dB better than that required for the same system with full precision but equal eigenmode power allocation.
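The abstract does not spell out its ADC model. A common idealization, assumed here, is a b-bit uniform mid-rise quantizer that clips at ±`full_scale` and is applied separately to the in-phase and quadrature components of each received sample. `quantize` is a hypothetical helper for illustration, not code from the paper.

```python
import math

def quantize(x, bits, full_scale):
    """b-bit uniform mid-rise quantizer with clipping at +/- full_scale."""
    levels = 2 ** bits
    step = 2.0 * full_scale / levels
    # Index of the bin containing x, clipped to the valid range [0, levels - 1].
    idx = math.floor((x + full_scale) / step)
    idx = min(max(idx, 0), levels - 1)
    # Return the midpoint of that bin as the reconstructed value.
    return -full_scale + (idx + 0.5) * step

# A 3-bit ADC has 8 output levels; values outside the range saturate.
print([quantize(v, 3, 1.0) for v in (-5.0, -0.3, 0.1, 5.0)])
```

A complex baseband sample `z` would be quantized as `complex(quantize(z.real, b, A), quantize(z.imag, b, A))`.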
Submitted 6 October, 2013;
originally announced October 2013.
-
Clustered regression with unknown clusters
Authors:
Kishor Barman,
Onkar Dabeer
Abstract:
We consider a collection of prediction experiments, which are clustered in the sense that groups of experiments exhibit similar relationships between the predictor and response variables. The experiment clusters as well as the regression relationships are unknown. The regression relationships define the experiment clusters, and in general, the predictor and response variables may not exhibit any clustering. We call this prediction problem clustered regression with unknown clusters (CRUC), and in this paper we focus on linear regression. We study and compare several methods for CRUC, demonstrate their applicability to the Yahoo Learning-to-rank Challenge (YLRC) dataset, and investigate an associated mathematical model. CRUC is at the crossroads of many prior works, and we study several prediction algorithms with diverse origins: an adaptation of the expectation-maximization algorithm, an approach inspired by K-means clustering, the singular value thresholding approach to matrix rank minimization under quadratic constraints, an adaptation of the Curds and Whey method in multiple regression, and a local regression (LoR) scheme reminiscent of neighborhood methods in collaborative filtering. Based on empirical evaluation on the YLRC dataset as well as simulated data, we identify the LoR method as a good practical choice: it yields best or near-best prediction performance at a reasonable computational load, and it is less sensitive to the choice of the algorithm parameter. We also provide some analysis of the LoR method for an associated mathematical model, which sheds light on optimal parameter choice and prediction performance.
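To make the "approach inspired by K-means clustering" concrete, here is a minimal sketch of one such alternating scheme, not the paper's exact algorithm. It assumes one-dimensional regression through the origin (y = w x), a known number of clusters, and given initial slopes; all names are hypothetical. Each experiment is assigned to the cluster whose slope fits its data with the smallest squared residual, and each cluster's slope is then refit on the pooled data of its experiments.

```python
def fit_slope(points):
    # Least-squares slope for y = w * x (no intercept).
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return sxy / sxx

def residual(points, w):
    return sum((y - w * x) ** 2 for x, y in points)

def kmeans_regression(experiments, init_slopes, iters=20):
    slopes = list(init_slopes)
    labels = [0] * len(experiments)
    for _ in range(iters):
        # Assignment step: each experiment joins its best-fitting cluster.
        labels = [min(range(len(slopes)), key=lambda k: residual(pts, slopes[k]))
                  for pts in experiments]
        # Update step: refit each cluster's slope on its pooled data.
        for k in range(len(slopes)):
            pooled = [p for pts, lab in zip(experiments, labels) if lab == k for p in pts]
            if pooled:
                slopes[k] = fit_slope(pooled)
    return slopes, labels

# Four experiments: two follow y ~ x, two follow y ~ 3x.
experiments = [
    [(1, 1.0), (2, 2.0)],
    [(1, 1.1), (3, 2.9)],
    [(1, 3.0), (2, 6.0)],
    [(2, 6.2), (1, 2.9)],
]
print(kmeans_regression(experiments, init_slopes=[0.5, 4.0]))
```

Like K-means itself, this alternation depends on initialization and can settle in a local optimum.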
Submitted 23 March, 2011;
originally announced March 2011.
-
Analysis of a Collaborative Filter Based on Popularity Amongst Neighbors
Authors:
Kishor Barman,
Onkar Dabeer
Abstract:
In this paper, we analyze a collaborative filter that answers the simple question: What is popular amongst your friends? While this basic principle seems to be prevalent in many practical implementations, there does not appear to be much theoretical analysis of its performance. In this paper, we partly fill this gap. While recent works on this topic, such as the low-rank matrix completion literature, consider the probability of error in recovering the entire rating matrix, we consider probability of an error in an individual recommendation (bit error rate (BER)). For a mathematical model introduced in [1],[2], we identify three regimes of operation for our algorithm (named Popularity Amongst Friends (PAF)) in the limit as the matrix size grows to infinity. In a regime characterized by large number of samples and small degrees of freedom (defined precisely for the model in the paper), the asymptotic BER is zero; in a regime characterized by large number of samples and large degrees of freedom, the asymptotic BER is bounded away from 0 and 1/2 (and is identified exactly except for a special case); and in a regime characterized by a small number of samples, the algorithm fails. We also present numerical results for the MovieLens and Netflix datasets. We discuss the empirical performance in light of our theoretical results and compare with an approach based on low-rank matrix completion.
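The "popularity amongst friends" principle can be sketched as a neighborhood majority vote. The paper defines friends precisely for its mathematical model. Here the definition is an assumption for illustration: ratings are ±1 (None when unrated), and a friend is another user whose agreement rate with the target user on co-rated items is at least `threshold`. The prediction is the sign of the friends' summed ratings on the item. All names are hypothetical.

```python
def agreement(r_u, r_v):
    """Fraction of co-rated items on which two users agree (None = unrated)."""
    common = [(a, b) for a, b in zip(r_u, r_v) if a is not None and b is not None]
    if not common:
        return 0.0
    return sum(1 for a, b in common if a == b) / len(common)

def paf_predict(ratings, user, item, threshold=0.75):
    """Predict +1/-1 for (user, item) by majority vote of friends who rated item."""
    votes = 0
    for v, r_v in enumerate(ratings):
        if v == user or r_v[item] is None:
            continue
        if agreement(ratings[user], r_v) >= threshold:
            votes += r_v[item]
    if votes == 0:
        return None  # no friend rated the item, or the vote is tied
    return 1 if votes > 0 else -1

ratings = [
    [1, 1, -1, None],
    [1, 1, -1, 1],
    [1, 1, None, 1],
    [-1, -1, 1, -1],
    [-1, -1, 1, None],
]
print(paf_predict(ratings, 0, 3), paf_predict(ratings, 4, 3))
```

Each recommendation is a single binary decision, which matches the per-recommendation bit error rate the abstract analyzes.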
Submitted 15 July, 2011; v1 submitted 9 June, 2010;
originally announced June 2010.
-
Local Popularity Based Collaborative Filters
Authors:
Kishor Barman,
Onkar Dabeer
Abstract:
Motivated by applications such as recommendation systems, we consider the estimation of a binary random field X obtained by row and column permutations of a block constant random matrix. The estimation of X is based on observations Y, which are obtained by passing entries of X through a binary symmetric channel (BSC) and an erasure channel. We focus on the analysis of a specific algorithm based on local popularity when the erasure rate approaches unity at a specified rate. We study the bit error rate (BER) in the limit as the matrix size approaches infinity. Our main result states that if the cluster size (that is, the size of the constancy blocks in the original matrix) is above a certain threshold, then the BER approaches zero, but below the threshold, the BER is bounded away from zero. The lower bound depends on the noise level in the observations and the size of the clusters in relation to the threshold. The threshold depends on the rate at which the erasure probability approaches unity.
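The observation model described above can be simulated in a few lines. This sketch assumes equal-sized row and column clusters, i.i.d. uniform block bits, a fixed flip probability, and a fixed erasure probability (the paper studies erasure probability tending to one, which here would simply mean setting `p_erase` close to 1). `sample_observations` is a hypothetical helper. Since the flips and erasures act independently on each entry, applying the erasure first does not change the distribution of Y.

```python
import random

def sample_observations(n_row_clusters, n_col_clusters, cluster_size,
                        p_flip, p_erase, seed=0):
    rng = random.Random(seed)
    # Block-constant binary matrix: one random bit per (row cluster, column cluster).
    block = [[rng.randint(0, 1) for _ in range(n_col_clusters)]
             for _ in range(n_row_clusters)]
    m, n = n_row_clusters * cluster_size, n_col_clusters * cluster_size
    # Unknown row and column permutations hide the cluster structure.
    row_perm, col_perm = list(range(m)), list(range(n))
    rng.shuffle(row_perm)
    rng.shuffle(col_perm)
    x = [[block[row_perm[i] // cluster_size][col_perm[j] // cluster_size]
          for j in range(n)] for i in range(m)]
    y = []
    for row in x:
        out = []
        for bit in row:
            if rng.random() < p_erase:
                out.append(None)       # erasure channel: entry unobserved
            elif rng.random() < p_flip:
                out.append(1 - bit)    # BSC: bit flipped
            else:
                out.append(bit)
        y.append(out)
    return x, y

x, y = sample_observations(3, 2, 4, p_flip=0.1, p_erase=0.8)
print(sum(v is not None for row in y for v in row), "of", len(x) * len(x[0]), "entries observed")
```

An estimator such as the local popularity scheme would receive only `y` and try to recover `x`.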
Submitted 4 May, 2010; v1 submitted 19 January, 2010;
originally announced January 2010.
-
A Channel Coding Perspective of Collaborative Filtering
Authors:
S. T. Aditya,
Onkar Dabeer,
Bikash Kumar Dey
Abstract:
We consider the problem of collaborative filtering from a channel coding perspective. We model the underlying rating matrix as a finite alphabet matrix with block constant structure. The observations are obtained from this underlying matrix through a discrete memoryless channel with a noisy part representing noisy user behavior and an erasure part representing missing data. Moreover, the clusters over which the underlying matrix is constant are unknown. We establish a sharp threshold result for this model: if the largest cluster size is smaller than $C_1 \log(mn)$ (where the rating matrix is of size $m \times n$), then the underlying matrix cannot be recovered with any estimator, but if the smallest cluster size is larger than $C_2 \log(mn)$, then we show a polynomial time estimator with diminishing probability of error. In the case of uniform cluster size, not only the order of the threshold, but also the constant is identified.
Submitted 18 August, 2009;
originally announced August 2009.
-
A Channel Coding Perspective of Recommendation Systems
Authors:
S. T. Aditya,
Onkar Dabeer,
Bikash Kumar Dey
Abstract:
Motivated by recommendation systems, we consider the problem of estimating block constant binary matrices (of size $m \times n$) from sparse and noisy observations. The observations are obtained from the underlying block constant matrix after unknown row and column permutations, erasures, and errors. We derive upper and lower bounds on the achievable probability of error. For fixed erasure and error probability, we show that there exists a constant $C_1$ such that if the cluster sizes are less than $C_1 \ln(mn)$, then for any algorithm the probability of error approaches one as $m, n \to \infty$. On the other hand, we show that a simple polynomial time algorithm gives probability of error diminishing to zero provided the cluster sizes are greater than $C_2 \ln(mn)$ for a suitable constant $C_2$.
Submitted 13 January, 2009;
originally announced January 2009.
-
Transceiver Design with Low-Precision Analog-to-Digital Conversion : An Information-Theoretic Perspective
Authors:
Jaspreet Singh,
Onkar Dabeer,
Upamanyu Madhow
Abstract:
Modern communication receiver architectures center around digital signal processing (DSP), with the bulk of the receiver processing being performed on digital signals obtained after analog-to-digital conversion (ADC). In this paper, we explore Shannon-theoretic performance limits when ADC precision is drastically reduced, from typical values of 8-12 bits used in current communication transceivers, to 1-3 bits. The goal is to obtain insight into whether DSP-centric transceiver architectures are feasible as communication bandwidths scale up, recognizing that high-precision ADC at high sampling rates is either unavailable, or too costly or power-hungry. Specifically, we evaluate the communication limits imposed by low-precision ADC for the ideal real discrete-time Additive White Gaussian Noise (AWGN) channel, under an average power constraint on the input. For an ADC with K quantization bins (i.e., a precision of log2 K bits), we show that the Shannon capacity is achievable by a discrete input distribution with at most K + 1 mass points. For 2-bin (1-bit) symmetric ADC, this result is tightened to show that binary antipodal signaling is optimum for any signal-to-noise ratio (SNR). For multi-bit ADC, the capacity is computed numerically, and the results obtained are used to make the following encouraging observations regarding system design with low-precision ADC: (a) even at moderately high SNR of up to 20 dB, 2-3 bit quantization results in only 10-20% reduction of spectral efficiency, which is acceptable for large communication bandwidths, (b) standard equiprobable pulse amplitude modulation with ADC thresholds set to implement maximum likelihood hard decisions is asymptotically optimum at high SNR, and works well at low to moderate SNRs as well.
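For the 1-bit case, the optimality of binary antipodal signaling means the quantized channel reduces to a binary symmetric channel. Its crossover probability is $p = Q(\sqrt{\mathrm{SNR}})$, so the capacity is $1 - h_2(p)$ bits per channel use, where $h_2$ is the binary entropy function. The sketch below assumes SNR is defined as $P/\sigma^2$ with inputs $\pm\sqrt{P}$ and a quantizer threshold at zero; the function names are hypothetical.

```python
import math

def q_function(x):
    # Gaussian tail probability Q(x) = P(N(0, 1) > x).
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def one_bit_capacity(snr_db):
    """Capacity in bits/channel use of the AWGN channel with a 1-bit symmetric ADC."""
    snr = 10.0 ** (snr_db / 10.0)
    crossover = q_function(math.sqrt(snr))  # BSC crossover probability
    return 1.0 - binary_entropy(crossover)

for snr_db in (0, 10, 20):
    print(snr_db, round(one_bit_capacity(snr_db), 4))
```

At 0 dB this gives about 0.369 bits per channel use, and it saturates at 1 bit as SNR grows, which is the ceiling a 1-bit ADC imposes.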
Submitted 7 April, 2008;
originally announced April 2008.
-
Capacity of the Discrete-Time AWGN Channel Under Output Quantization
Authors:
Jaspreet Singh,
Onkar Dabeer,
Upamanyu Madhow
Abstract:
We investigate the limits of communication over the discrete-time Additive White Gaussian Noise (AWGN) channel, when the channel output is quantized using a small number of bits. We first provide a proof of our recent conjecture on the optimality of a discrete input distribution in this scenario. Specifically, we show that for any given output quantizer choice with K quantization bins (i.e., a precision of log2 K bits), the input distribution, under an average power constraint, need not have any more than K + 1 mass points to achieve the channel capacity. The cutting-plane algorithm is employed to compute this capacity and to generate optimum input distributions. Numerical optimization over the choice of the quantizer is then performed (for 2-bit and 3-bit symmetric quantization), and the results we obtain show that the loss due to low-precision output quantization, which is small at low signal-to-noise ratio (SNR) as expected, can be quite acceptable even for moderate to high SNR values. For example, at SNRs up to 20 dB, 2-3 bit quantization achieves 80-90% of the capacity achievable using infinite-precision quantization.
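Evaluating the objective that such an optimization works on can be illustrated for a fixed input distribution. The sketch below computes $I(X;Y)$ for a discrete input passed through real AWGN and then a K-bin quantizer, using $W(k \mid x) = \Phi((t_{k+1} - x)/\sigma) - \Phi((t_k - x)/\sigma)$. It does not implement the cutting-plane search over input distributions, so for a given quantizer it yields only an achievable rate (a lower bound on capacity). The equiprobable 4-PAM input with thresholds at the midpoints (the ML hard-decision thresholds mentioned in the companion abstract) is an illustrative choice, and the names are hypothetical.

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def mutual_information(points, probs, thresholds, sigma=1.0):
    """I(X;Y) in bits for discrete X and Y = quantize(X + N(0, sigma^2))."""
    edges = [-math.inf] + sorted(thresholds) + [math.inf]
    # Transition probabilities w[i][k] = P(Y = bin k | X = points[i]).
    w = [[phi((hi - x) / sigma) - phi((lo - x) / sigma)
          for lo, hi in zip(edges[:-1], edges[1:])] for x in points]
    p_y = [sum(p * w[i][k] for i, p in enumerate(probs)) for k in range(len(edges) - 1)]
    info = 0.0
    for i, p in enumerate(probs):
        for k, wk in enumerate(w[i]):
            if p > 0 and wk > 0:
                info += p * wk * math.log2(wk / p_y[k])
    return info

# Equiprobable 4-PAM {+-a, +-3a} with a 2-bit quantizer at the midpoints.
# Average power is 5a^2, so with sigma = 1, SNR = 5a^2; a = sqrt(20) gives 20 dB.
a = math.sqrt(20.0)
print(mutual_information([-3 * a, -a, a, 3 * a], [0.25] * 4, [-2 * a, 0.0, 2 * a]))
```

With K = 4 bins the result can never exceed log2 K = 2 bits, and at 20 dB this particular input already comes very close to that bound.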
Submitted 15 May, 2008; v1 submitted 8 January, 2008;
originally announced January 2008.