-
Gallery Filter Network for Person Search
Authors:
Lucas Jaffe,
Avideh Zakhor
Abstract:
In person search, we aim to localize a query person from one scene in other gallery scenes. The cost of this search operation is dependent on the number of gallery scenes, making it beneficial to reduce the pool of likely scenes. We describe and demonstrate the Gallery Filter Network (GFN), a novel module which can efficiently discard gallery scenes from the search process, and benefit scoring for…
▽ More
In person search, we aim to localize a query person from one scene in other gallery scenes. The cost of this search operation is dependent on the number of gallery scenes, making it beneficial to reduce the pool of likely scenes. We describe and demonstrate the Gallery Filter Network (GFN), a novel module which can efficiently discard gallery scenes from the search process, and benefit scoring for persons detected in remaining scenes. We show that the GFN is robust under a range of different conditions by testing on different retrieval sets, including cross-camera, occluded, and low-resolution scenarios. In addition, we develop the base SeqNeXt person search model, which improves and simplifies the original SeqNet model. We show that the SeqNeXt+GFN combination yields significant performance gains over other state-of-the-art methods on the standard PRW and CUHK-SYSU person search datasets. To aid experimentation for this and other models, we provide standardized tooling for the data processing and evaluation pipeline typically used for person search research.
△ Less
Submitted 25 October, 2022; v1 submitted 23 October, 2022;
originally announced October 2022.
-
Misinformation Detection in Social Media Video Posts
Authors:
Kehan Wang,
David Chan,
Seth Z. Zhao,
John Canny,
Avideh Zakhor
Abstract:
With the growing adoption of short-form video by social media platforms, reducing the spread of misinformation through video posts has become a critical challenge for social media providers. In this paper, we develop methods to detect misinformation in social media posts, exploiting modalities such as video and text. Due to the lack of large-scale public data for misinformation detection in multi-…
▽ More
With the growing adoption of short-form video by social media platforms, reducing the spread of misinformation through video posts has become a critical challenge for social media providers. In this paper, we develop methods to detect misinformation in social media posts, exploiting modalities such as video and text. Due to the lack of large-scale public data for misinformation detection in multi-modal datasets, we collect 160,000 video posts from Twitter, and leverage self-supervised learning to learn expressive representations of joint visual and textual data. In this work, we propose two new methods for detecting semantic inconsistencies within short-form social media video posts, based on contrastive learning and masked language modeling. We demonstrate that our new approaches outperform current state-of-the-art methods on both artificial data generated by random-swap** of positive samples and in the wild on a new manually-labeled test set for semantic misinformation.
△ Less
Submitted 30 July, 2022; v1 submitted 15 February, 2022;
originally announced February 2022.
-
Recognition-Aware Learned Image Compression
Authors:
Maxime Kawawa-Beaudan,
Ryan Roggenkemper,
Avideh Zakhor
Abstract:
Learned image compression methods generally optimize a rate-distortion loss, trading off improvements in visual distortion for added bitrate. Increasingly, however, compressed imagery is used as an input to deep learning networks for various tasks such as classification, object detection, and superresolution. We propose a recognition-aware learned compression method, which optimizes a rate-distort…
▽ More
Learned image compression methods generally optimize a rate-distortion loss, trading off improvements in visual distortion for added bitrate. Increasingly, however, compressed imagery is used as an input to deep learning networks for various tasks such as classification, object detection, and superresolution. We propose a recognition-aware learned compression method, which optimizes a rate-distortion loss alongside a task-specific loss, jointly learning compression and recognition networks. We augment a hierarchical autoencoder-based compression network with an EfficientNet recognition model and use two hyperparameters to trade off between distortion, bitrate, and recognition performance. We characterize the classification accuracy of our proposed method as a function of bitrate and find that for low bitrates our method achieves as much as 26% higher recognition accuracy at equivalent bitrates compared to traditional methods such as Better Portable Graphics (BPG).
△ Less
Submitted 31 January, 2022;
originally announced February 2022.
-
Drone Object Detection Using RGB/IR Fusion
Authors:
Lizhi Yang,
Ruhang Ma,
Avideh Zakhor
Abstract:
Object detection using aerial drone imagery has received a great deal of attention in recent years. While visible light images are adequate for detecting objects in most scenarios, thermal cameras can extend the capabilities of object detection to night-time or occluded objects. As such, RGB and Infrared (IR) fusion methods for object detection are useful and important. One of the biggest challeng…
▽ More
Object detection using aerial drone imagery has received a great deal of attention in recent years. While visible light images are adequate for detecting objects in most scenarios, thermal cameras can extend the capabilities of object detection to night-time or occluded objects. As such, RGB and Infrared (IR) fusion methods for object detection are useful and important. One of the biggest challenges in applying deep learning methods to RGB/IR object detection is the lack of available training data for drone IR imagery, especially at night. In this paper, we develop several strategies for creating synthetic IR images using the AIRSim simulation engine and CycleGAN. Furthermore, we utilize an illumination-aware fusion framework to fuse RGB and IR images for object detection on the ground. We characterize and test our methods for both simulated and actual data. Our solution is implemented on an NVIDIA Jetson Xavier running on an actual drone, requiring about 28 milliseconds of processing per RGB/IR image pair.
△ Less
Submitted 11 January, 2022;
originally announced January 2022.
-
Immediate Proximity Detection Using Wi-Fi-Enabled Smartphones
Authors:
Zach Van Hyfte,
Avideh Zakhor
Abstract:
Smartphone apps for exposure notification and contact tracing have been shown to be effective in controlling the COVID-19 pandemic. However, Bluetooth Low Energy tokens similar to those broadcast by existing apps can still be picked up far away from the transmitting device. In this paper, we present a new class of methods for detecting whether or not two Wi-Fi-enabled devices are in immediate phys…
▽ More
Smartphone apps for exposure notification and contact tracing have been shown to be effective in controlling the COVID-19 pandemic. However, Bluetooth Low Energy tokens similar to those broadcast by existing apps can still be picked up far away from the transmitting device. In this paper, we present a new class of methods for detecting whether or not two Wi-Fi-enabled devices are in immediate physical proximity, i.e. 2 or fewer meters apart, as established by the U.S. Centers for Disease Control and Prevention (CDC). Our goal is to enhance the accuracy of smartphone-based exposure notification and contact tracing systems. We present a set of binary machine learning classifiers that take as input pairs of Wi-Fi RSSI fingerprints. We empirically verify that a single classifier cannot generalize well to a range of different environments with vastly different numbers of detectable Wi-Fi Access Points (APs). However, specialized classifiers, tailored to situations where the number of detectable APs falls within a certain range, are able to detect immediate physical proximity significantly more accurately. As such, we design three classifiers for situations with low, medium, and high numbers of detectable APs. These classifiers distinguish between pairs of RSSI fingerprints recorded 2 or fewer meters apart and pairs recorded further apart but still in Bluetooth range. We characterize their balanced accuracy for this task to be between 66.8% and 77.8%.
△ Less
Submitted 4 June, 2021;
originally announced June 2021.
-
Multi-Modal Semantic Inconsistency Detection in Social Media News Posts
Authors:
Scott McCrae,
Kehan Wang,
Avideh Zakhor
Abstract:
As computer-generated content and deepfakes make steady improvements, semantic approaches to multimedia forensics will become more important. In this paper, we introduce a novel classification architecture for identifying semantic inconsistencies between video appearance and text caption in social media news posts. We develop a multi-modal fusion framework to identify mismatches between videos and…
▽ More
As computer-generated content and deepfakes make steady improvements, semantic approaches to multimedia forensics will become more important. In this paper, we introduce a novel classification architecture for identifying semantic inconsistencies between video appearance and text caption in social media news posts. We develop a multi-modal fusion framework to identify mismatches between videos and captions in social media posts by leveraging an ensemble method based on textual analysis of the caption, automatic audio transcription, semantic video analysis, object detection, named entity consistency, and facial verification. To train and test our approach, we curate a new video-based dataset of 4,000 real-world Facebook news posts for analysis. Our multi-modal approach achieves 60.5% classification accuracy on random mismatches between caption and appearance, compared to accuracy below 50% for uni-modal models. Further ablation studies confirm the necessity of fusion across modalities for correctly identifying semantic inconsistencies.
△ Less
Submitted 26 May, 2021;
originally announced May 2021.
-
Fast, Accurate Barcode Detection in Ultra High-Resolution Images
Authors:
Jerome Quenum,
Kehan Wang,
Avideh Zakhor
Abstract:
Object detection in Ultra High-Resolution (UHR) images has long been a challenging problem in computer vision due to the varying scales of the targeted objects. When it comes to barcode detection, resizing UHR input images to smaller sizes often leads to the loss of pertinent information, while processing them directly is highly inefficient and computationally expensive. In this paper, we propose…
▽ More
Object detection in Ultra High-Resolution (UHR) images has long been a challenging problem in computer vision due to the varying scales of the targeted objects. When it comes to barcode detection, resizing UHR input images to smaller sizes often leads to the loss of pertinent information, while processing them directly is highly inefficient and computationally expensive. In this paper, we propose using semantic segmentation to achieve a fast and accurate detection of barcodes of various scales in UHR images. Our pipeline involves a modified Region Proposal Network (RPN) on images of size greater than 10k$\times$10k and a newly proposed Y-Net segmentation network, followed by a post-processing workflow for fitting a bounding box around each segmented barcode mask. The end-to-end system has a latency of 16 milliseconds, which is $2.5\times$ faster than YOLOv4 and $5.9\times$ faster than Mask R-CNN. In terms of accuracy, our method outperforms YOLOv4 and Mask R-CNN by a $mAP$ of 5.5% and 47.1% respectively, on a synthetic dataset. We have made available the generated synthetic barcode dataset and its code at http://www.github.com/viplabB/SBD/.
△ Less
Submitted 10 June, 2021; v1 submitted 13 February, 2021;
originally announced February 2021.
-
Temporal LiDAR Frame Prediction for Autonomous Driving
Authors:
David Deng,
Avideh Zakhor
Abstract:
Anticipating the future in a dynamic scene is critical for many fields such as autonomous driving and robotics. In this paper we propose a class of novel neural network architectures to predict future LiDAR frames given previous ones. Since the ground truth in this application is simply the next frame in the sequence, we can train our models in a self-supervised fashion. Our proposed architectures…
▽ More
Anticipating the future in a dynamic scene is critical for many fields such as autonomous driving and robotics. In this paper we propose a class of novel neural network architectures to predict future LiDAR frames given previous ones. Since the ground truth in this application is simply the next frame in the sequence, we can train our models in a self-supervised fashion. Our proposed architectures are based on FlowNet3D and Dynamic Graph CNN. We use Chamfer Distance (CD) and Earth Mover's Distance (EMD) as loss functions and evaluation metrics. We train and evaluate our models using the newly released nuScenes dataset, and characterize their performance and complexity with several baselines. Compared to directly using FlowNet3D, our proposed architectures achieve CD and EMD nearly an order of magnitude lower. In addition, we show that our predictions generate reasonable scene flow approximations without using any labelled supervision.
△ Less
Submitted 17 December, 2020;
originally announced December 2020.
-
GenScan: A Generative Method for Populating Parametric 3D Scan Datasets
Authors:
Mohammad Keshavarzi,
Oladapo Afolabi,
Luisa Caldas,
Allen Y. Yang,
Avideh Zakhor
Abstract:
The availability of rich 3D datasets corresponding to the geometrical complexity of the built environments is considered an ongoing challenge for 3D deep learning methodologies. To address this challenge, we introduce GenScan, a generative system that populates synthetic 3D scan datasets in a parametric fashion. The system takes an existing captured 3D scan as an input and outputs alternative vari…
▽ More
The availability of rich 3D datasets corresponding to the geometrical complexity of the built environments is considered an ongoing challenge for 3D deep learning methodologies. To address this challenge, we introduce GenScan, a generative system that populates synthetic 3D scan datasets in a parametric fashion. The system takes an existing captured 3D scan as an input and outputs alternative variations of the building layout including walls, doors, and furniture with corresponding textures. GenScan is a fully automated system that can also be manually controlled by a user through an assigned user interface. Our proposed system utilizes a combination of a hybrid deep neural network and a parametrizer module to extract and transform elements of a given 3D scan. GenScan takes advantage of style transfer techniques to generate new textures for the generated scenes. We believe our system would facilitate data augmentation to expand the currently limited 3D geometry datasets commonly used in 3D computer vision, generative design, and general 3D deep learning tasks.
△ Less
Submitted 7 December, 2020;
originally announced December 2020.
-
Duodepth: Static Gesture Recognition Via Dual Depth Sensors
Authors:
Ilya Chugunov,
Avideh Zakhor
Abstract:
Static gesture recognition is an effective non-verbal communication channel between a user and their devices; however many modern methods are sensitive to the relative pose of the user's hands with respect to the capture device, as parts of the gesture can become occluded. We present two methodologies for gesture recognition via synchronized recording from two depth cameras to alleviate this occlu…
▽ More
Static gesture recognition is an effective non-verbal communication channel between a user and their devices; however many modern methods are sensitive to the relative pose of the user's hands with respect to the capture device, as parts of the gesture can become occluded. We present two methodologies for gesture recognition via synchronized recording from two depth cameras to alleviate this occlusion problem. One is a more classic approach using iterative closest point registration to accurately fuse point clouds and a single PointNet architecture for classification, and the other is a dual Point-Net architecture for classification without registration. On a manually collected data-set of 20,100 point clouds we show a 39.2% reduction in misclassification for the fused point cloud method, and 53.4% for the dual PointNet, when compared to a standard single camera pipeline.
△ Less
Submitted 25 June, 2020;
originally announced June 2020.