Search | arXiv e-print repository

CATSE: A Context-Aware Framework for Causal Target Sound Extraction

Authors: Shrishail Baligar, Mikolaj Kegler, Bryce Irvin, Marko Stamenovic, Shawn Newsam

Abstract: Target Sound Extraction (TSE) focuses on the problem of separating sources of interest, indicated by a user's cue, from the input mixture. Most existing solutions operate in an offline fashion and are not suited to the low-latency causal processing constraints imposed by applications in live-streamed content such as augmented hearing. We introduce a family of context-aware low-latency causal TSE models suitable for real-time processing. First, we explore the utility of context by providing the TSE model with oracle information about what sound classes make up the input mixture, where the objective of the model is to extract one or more sources of interest indicated by the user. Since the practical applications of oracle models are limited due to their assumptions, we introduce a composite multi-task training objective involving separation and classification losses. Our evaluation involving single- and multi-source extraction shows the benefit of using context information in the model either by means of providing full context or via the proposed multi-task training loss without the need for full context information. Specifically, we show that our proposed model outperforms size- and latency-matched Waveformer, a state-of-the-art model for real-time TSE.

Submitted 21 March, 2024; originally announced March 2024.

Comments: Submitted to EUSIPCO 2024

arXiv:2210.13207 [pdf]

GeoAI at ACM SIGSPATIAL: The New Frontier of Geospatial Artificial Intelligence Research

Authors: Dalton Lunga, Yingjie Hu, Shawn Newsam, Song Gao, Bruno Martins, Lexie Yang, Xueqing Deng

Abstract: Geospatial Artificial Intelligence (GeoAI) is an interdisciplinary field enjoying tremendous adoption. However, the efficient design and implementation of GeoAI systems face many open challenges. This is mainly due to the lack of standardized approaches to artificial intelligence tool development, inadequate platforms, and a lack of multidisciplinary engagements, which all motivate domain experts to seek a shared stage with scientists and engineers to solve problems of significant impact on society. Since its inception in 2017, the GeoAI series of workshops has been co-located with the Association for Computing Machinery International Conference on Advances in Geographic Information Systems. The workshop series has fostered a nexus for geoscientists, computer scientists, engineers, entrepreneurs, and decision-makers, from academia, industry, and government to engage in artificial intelligence, spatiotemporal data computing, and geospatial data science research, motivated by various challenges. In this article, we revisit and discuss the state of GeoAI open research directions, the recent developments, and an emerging agenda calling for a continued cross-disciplinary community engagement.

Submitted 20 October, 2022; originally announced October 2022.

Comments: 12 pages, 1 figure, 1 table

arXiv:2204.05547 [pdf, other]

DistPro: Searching A Fast Knowledge Distillation Process via Meta Optimization

Authors: Xueqing Deng, Dawei Sun, Shawn Newsam, Peng Wang

Abstract: Recent Knowledge distillation (KD) studies show that different manually designed schemes impact the learned results significantly. Yet, in KD, automatically searching an optimal distillation scheme has not yet been well explored. In this paper, we propose DistPro, a novel framework which searches for an optimal KD process via differentiable meta-learning. Specifically, given a pair of student and teacher networks, DistPro first sets up a rich set of KD connection from the transmitting layers of the teacher to the receiving layers of the student, and in the meanwhile, various transforms are also proposed for comparing feature maps along its pathway for the distillation. Then, each combination of a connection and a transform choice (pathway) is associated with a stochastic weighting process which indicates its importance at every step during the distillation. In the searching stage, the process can be effectively learned through our proposed bi-level meta-optimization strategy. In the distillation stage, DistPro adopts the learned processes for knowledge distillation, which significantly improves the student accuracy especially when faster training is required. Lastly, we find the learned processes can be generalized between similar tasks and networks. In our experiments, DistPro produces state-of-the-art (SoTA) accuracy under varying number of learning epochs on popular datasets, i.e. CIFAR100 and ImageNet, which demonstrate the effectiveness of our framework.

Submitted 12 April, 2022; originally announced April 2022.

Comments: 14 pages, 5 figures

arXiv:2204.05538 [pdf, other]

NightLab: A Dual-level Architecture with Hardness Detection for Segmentation at Night

Authors: Xueqing Deng, Peng Wang, Xiaochen Lian, Shawn Newsam

Abstract: The semantic segmentation of nighttime scenes is a challenging problem that is key to impactful applications like self-driving cars. Yet, it has received little attention compared to its daytime counterpart. In this paper, we propose NightLab, a novel nighttime segmentation framework that leverages multiple deep learning models imbued with night-aware features to yield State-of-The-Art (SoTA) performance on multiple night segmentation benchmarks. Notably, NightLab contains models at two levels of granularity, i.e. image and regional, and each level is composed of light adaptation and segmentation modules. Given a nighttime image, the image level model provides an initial segmentation estimate while, in parallel, a hardness detection module identifies regions and their surrounding context that need further analysis. A regional level model focuses on these difficult regions to provide a significantly improved segmentation. All the models in NightLab are trained end-to-end using a set of proposed night-aware losses without handcrafted heuristics. Extensive experiments on the NightCity and BDD100K datasets show NightLab achieves SoTA performance compared to concurrent methods.

Submitted 12 April, 2022; originally announced April 2022.

Comments: 8 pages, 6 figures, accepted at CVPR 2022

arXiv:2203.03809 [pdf, other]

Image Search with Text Feedback by Additive Attention Compositional Learning

Authors: Yuxin Tian, Shawn Newsam, Kofi Boakye

Abstract: Effective image retrieval with text feedback stands to impact a range of real-world applications, such as e-commerce. Given a source image and text feedback that describes the desired modifications to that image, the goal is to retrieve the target images that resemble the source yet satisfy the given modifications by composing a multi-modal (image-text) query. We propose a novel solution to this problem, Additive Attention Compositional Learning (AACL), that uses a multi-modal transformer-based architecture and effectively models the image-text contexts. Specifically, we propose a novel image-text composition module based on additive attention that can be seamlessly plugged into deep neural networks. We also introduce a new challenging benchmark derived from the Shopping100k dataset. AACL is evaluated on three large-scale datasets (FashionIQ, Fashion200k, and Shopping100k), each with strong baselines. Extensive experiments show that AACL achieves new state-of-the-art results on all three datasets.

Submitted 7 March, 2022; originally announced March 2022.

arXiv:2106.13227 [pdf, other]

AutoAdapt: Automated Segmentation Network Search for Unsupervised Domain Adaptation

Authors: Xueqing Deng, Yi Zhu, Yuxin Tian, Shawn Newsam

Abstract: Neural network-based semantic segmentation has achieved remarkable results when large amounts of annotated data are available, that is, in the supervised case. However, such data is expensive to collect and so methods have been developed to adapt models trained on related, often synthetic data for which labels are readily available. Current adaptation approaches do not consider the dependence of the generalization/transferability of these models on network architecture. In this paper, we perform neural architecture search (NAS) to provide architecture-level perspective and analysis for domain adaptation. We identify the optimization gap that exists when searching architectures for unsupervised domain adaptation which makes this NAS problem uniquely difficult. We propose bridging this gap by using maximum mean discrepancy and regional weighted entropy to estimate the accuracy metric. Experimental results on several widely adopted benchmarks show that our proposed AutoAdapt framework indeed discovers architectures that improve the performance of a number of existing adaptation techniques.

Submitted 24 June, 2021; originally announced June 2021.

Comments: short version has been accepted at 1st NAS workshop co-organized with CVPR 2021
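
Illustrative note on the entry above: the abstract names maximum mean discrepancy (MMD) as one ingredient for estimating accuracy without target labels. The following is only a minimal sketch of the standard biased MMD² estimate with a Gaussian kernel, not the AutoAdapt implementation; the function names, the `bandwidth` parameter, and the toy feature vectors are assumptions made for illustration.

```python
import math


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors.

    source, target: non-empty lists of equal-length lists of floats
    (hypothetical source-domain and target-domain features).
    """
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


if __name__ == "__main__":
    near = [[0.0, 0.0], [1.0, 0.0]]
    far = [[5.0, 5.0], [6.0, 5.0]]
    print(mmd_squared(near, near))  # identical sets give a value of about zero
    print(mmd_squared(near, far))   # well-separated sets give a larger value
```

A small value suggests the two feature distributions are close under the chosen kernel; how the paper combines this with regional weighted entropy is not specified in the abstract and is not modeled here.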

arXiv:2012.04222 [pdf, other]

Scale Aware Adaptation for Land-Cover Classification in Remote Sensing Imagery

Authors: Xueqing Deng, Yi Zhu, Yuxin Tian, Shawn Newsam

Abstract: Land-cover classification using remote sensing imagery is an important Earth observation task. Recently, land cover classification has benefited from the development of fully connected neural networks for semantic segmentation. The benchmark datasets available for training deep segmentation models in remote sensing imagery tend to be small, however, often consisting of only a handful of images from a single location with a single scale. This limits the models' ability to generalize to other datasets. Domain adaptation has been proposed to improve the models' generalization but we find these approaches are not effective for dealing with the scale variation commonly found between remote sensing image collections. We therefore propose a scale aware adversarial learning framework to perform joint cross-location and cross-scale land-cover classification. The framework has a dual discriminator architecture with a standard feature discriminator as well as a novel scale discriminator. We also introduce a scale attention module which produces scale-enhanced features. Experimental results show that the proposed framework outperforms state-of-the-art domain adaptation methods by a large margin.

Submitted 8 December, 2020; originally announced December 2020.

Comments: The open-sourced codes are available on Github: https://github.com/xdeng7/scale-aware_da

arXiv:1912.10667 [pdf, other]

Generalizing Deep Models for Overhead Image Segmentation Through Getis-Ord Gi* Pooling

Authors: Xueqing Deng, Yi Zhu, Yuxin Tian, Shawn Newsam

Abstract: That most deep learning models are purely data driven is both a strength and a weakness. Given sufficient training data, the optimal model for a particular problem can be learned. However, this is usually not the case and so instead the model is either learned from scratch from a limited amount of training data or pre-trained on a different problem and then fine-tuned. Both of these situations are potentially suboptimal and limit the generalizability of the model. Inspired by this, we investigate methods to inform or guide deep learning models for geospatial image analysis to increase their performance when a limited amount of training data is available or when they are applied to scenarios other than which they were trained on. In particular, we exploit the fact that there are certain fundamental rules as to how things are distributed on the surface of the Earth and these rules do not vary substantially between locations. Based on this, we develop a novel feature pooling method for convolutional neural networks using Getis-Ord Gi* analysis from geostatistics. Experimental results show our proposed pooling function has significantly better generalization performance compared to a standard data-driven approach when applied to overhead image segmentation.

Submitted 23 December, 2019; originally announced December 2019.
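
Illustrative note on the entry above: the abstract builds its pooling on the Getis-Ord Gi* statistic from geostatistics. The sketch below computes the textbook Gi* z-score for each location from a list of values and a spatial weight matrix. It does not show how the paper turns the statistic into a pooling layer; the function name, the zero-denominator fallback, and the toy 1-D neighborhood are assumptions made for illustration.

```python
import math


def getis_ord_gi_star(values, weights):
    """Return the Gi* z-score for every location.

    values:  list of n observations x_j
    weights: n x n nested list of spatial weights w_ij; the diagonal w_ii is
             included, which is what distinguishes Gi* from Gi
    """
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation of all observations.
    s = math.sqrt(sum(x * x for x in values) / n - mean * mean)
    scores = []
    for i in range(n):
        w = weights[i]
        sum_w = sum(w)
        sum_w2 = sum(wij * wij for wij in w)
        numerator = sum(wij * xj for wij, xj in zip(w, values)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        # Assumption: report 0.0 when the statistic is undefined.
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


if __name__ == "__main__":
    # Toy 1-D example: each end location is weighted with itself and one neighbor.
    values = [1.0, 1.0, 1.0, 10.0, 10.0]
    weights = [
        [1, 1, 0, 0, 0],
        [1, 1, 1, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 1, 1, 1],
        [0, 0, 0, 1, 1],
    ]
    print(getis_ord_gi_star(values, weights))
```

Large positive scores mark clusters of high values (hot spots) and large negative scores mark clusters of low values (cold spots).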

arXiv:1907.10211 [pdf, other]

Motion-Aware Feature for Improved Video Anomaly Detection

Authors: Yi Zhu, Shawn Newsam

Abstract: Motivated by our observation that motion information is the key to good anomaly detection performance in video, we propose a temporal augmented network to learn a motion-aware feature. This feature alone can achieve competitive performance with previous state-of-the-art methods, and when combined with them, can achieve significant performance improvements. Furthermore, we incorporate temporal context into the Multiple Instance Learning (MIL) ranking model by using an attention block. The learned attention weights can help to differentiate between anomalous and normal video segments better. With the proposed motion-aware feature and the temporal MIL ranking model, we outperform previous approaches by a large margin on both anomaly detection and anomalous action recognition tasks in the UCF Crime dataset.

Submitted 23 July, 2019; originally announced July 2019.

Comments: BMVC 2019

arXiv:1902.06923 [pdf, other]

Using Conditional Generative Adversarial Networks to Generate Ground-Level Views From Overhead Imagery

Authors: Xueqing Deng, Yi Zhu, Shawn Newsam

Abstract: This paper develops a deep-learning framework to synthesize a ground-level view of a location given an overhead image. We propose a novel conditional generative adversarial network (cGAN) in which the trained generator generates realistic looking and representative ground-level images using overhead imagery as auxiliary information. The generator is an encoder-decoder network which allows us to compare low- and high-level features as well as their concatenation for encoding the overhead imagery. We also demonstrate how our framework can be used to perform land cover classification by modifying the trained cGAN to extract features from overhead imagery. This is interesting because, although we are using this modified cGAN as a feature extractor for overhead imagery, it incorporates knowledge of how locations look from the ground.

Submitted 19 February, 2019; originally announced February 2019.

Comments: 5 pages. arXiv admin note: text overlap with arXiv:1806.05129

arXiv:1812.01593 [pdf, other]

Improving Semantic Segmentation via Video Propagation and Label Relaxation

Authors: Yi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao, Bryan Catanzaro

Abstract: Semantic segmentation requires large amounts of pixel-wise annotations to learn accurate models. In this paper, we present a video prediction-based methodology to scale up training sets by synthesizing new training samples in order to improve the accuracy of semantic segmentation networks. We exploit video prediction models' ability to predict future frames in order to also predict future labels. A joint propagation strategy is also proposed to alleviate mis-alignments in synthesized samples. We demonstrate that training segmentation models on datasets augmented by the synthesized samples leads to significant improvements in accuracy. Furthermore, we introduce a novel boundary label relaxation technique that makes training robust to annotation noise and propagation artifacts along object boundaries. Our proposed methods achieve state-of-the-art mIoUs of 83.5% on Cityscapes and 82.9% on CamVid. Our single model, without model ensembles, achieves 72.8% mIoU on the KITTI semantic segmentation test set, which surpasses the winning entry of the ROB challenge 2018. Our code and videos can be found at https://nv-adlr.github.io/publication/2018-Segmentation.

Submitted 2 July, 2019; v1 submitted 4 December, 2018; originally announced December 2018.

Comments: CVPR 2019 Oral. Code link: https://github.com/NVIDIA/semantic-segmentation. YouTube link: https://www.youtube.com/watch?v=aEbXjGZDZSQ

arXiv:1810.12522 [pdf, other]

Random Temporal Skipping for Multirate Video Analysis

Authors: Yi Zhu, Shawn Newsam

Abstract: Current state-of-the-art approaches to video understanding adopt temporal jittering to simulate analyzing the video at varying frame rates. However, this does not work well for multirate videos, in which actions or subactions occur at different speeds. The frame sampling rate should vary in accordance with the different motion speeds. In this work, we propose a simple yet effective strategy, termed random temporal skipping, to address this situation. This strategy effectively handles multirate videos by randomizing the sampling rate during training. It is an exhaustive approach, which can potentially cover all motion speed variations. Furthermore, due to the large temporal skipping, our network can see video clips that originally cover over 100 frames. Such a time range is enough to analyze most actions/events. We also introduce an occlusion-aware optical flow learning method that generates improved motion maps for human action recognition. Our framework is end-to-end trainable, runs in real-time, and achieves state-of-the-art performance on six widely adopted video benchmarks.

Submitted 30 October, 2018; originally announced October 2018.

Comments: Accepted at ACCV 2018. Camera ready
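
Illustrative note on the entry above: the abstract describes randomizing the frame sampling rate during training. One plausible reading, sketched below, draws a fresh random skip between consecutive sampled frames. The function name, its parameters, and the per-frame redraw of the skip are assumptions for illustration, not details confirmed by the abstract.

```python
import random


def sample_with_random_skips(num_frames, clip_len, max_skip, rng=random):
    """Pick up to clip_len frame indices with a random gap between picks.

    num_frames: number of frames in the video
    clip_len:   number of frames the network consumes per clip
    max_skip:   largest number of frames skipped between two picks
    rng:        object exposing randint(a, b), e.g. random.Random(seed)
    """
    indices = []
    position = rng.randint(0, max_skip)  # random starting offset
    for _ in range(clip_len):
        if position >= num_frames:
            break  # ran past the end of the video
        indices.append(position)
        # Advance by one frame plus a freshly drawn skip.
        position += 1 + rng.randint(0, max_skip)
    return indices


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_with_random_skips(num_frames=300, clip_len=16, max_skip=10, rng=rng))
```

With `max_skip=10` and `clip_len=16`, a single clip can span up to 176 original frames, which matches the abstract's point that skipping lets the network see clips covering over 100 frames. Setting `max_skip=0` falls back to consecutive frames.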

arXiv:1810.12521 [pdf, other]

Gated Transfer Network for Transfer Learning

Authors: Yi Zhu, Jia Xue, Shawn Newsam

Abstract: Deep neural networks have led to a series of breakthroughs in computer vision given sufficient annotated training datasets. For novel tasks with limited labeled data, the prevalent approach is to transfer the knowledge learned in the pre-trained models to the new tasks by fine-tuning. Classic model fine-tuning utilizes the fact that well trained neural networks appear to learn cross domain features. These features are treated equally during transfer learning. In this paper, we explore the impact of feature selection in model fine-tuning by introducing a transfer module, which assigns weights to features extracted from pre-trained models. The proposed transfer module proves the importance of feature selection for transferring models from source to target domains. It is shown to significantly improve upon fine-tuning results with only marginal extra computational cost. We also incorporate an auxiliary classifier as an extra regularizer to avoid over-fitting. Finally, we build a Gated Transfer Network (GTN) based on our transfer module and achieve state-of-the-art results on six different tasks.

Submitted 30 October, 2018; originally announced October 2018.

Comments: Accepted at ACCV 2018. Camera ready

arXiv:1806.05129 [pdf, other]

What Is It Like Down There? Generating Dense Ground-Level Views and Image Features From Overhead Imagery Using Conditional Generative Adversarial Networks

Authors: Xueqing Deng, Yi Zhu, Shawn Newsam

Abstract: This paper investigates conditional generative adversarial networks (cGANs) to overcome a fundamental limitation of using geotagged media for geographic discovery, namely its sparse and uneven spatial distribution. We train a cGAN to generate ground-level views of a location given overhead imagery. We show the "fake" ground-level images are natural looking and are structurally similar to the real images. More significantly, we show the generated images are representative of the locations and that the representations learned by the cGANs are informative. In particular, we show that dense feature maps generated using our framework are more effective for land-cover classification than approaches which spatially interpolate features extracted from sparse ground-level images. To our knowledge, ours is the first work to use cGANs to generate ground-level views given overhead imagery and to explore the benefits of the learned representations.

Submitted 23 September, 2018; v1 submitted 13 June, 2018; originally announced June 2018.

Comments: 10 pages, 5 figures, camera-ready version of ACM SIGSPATIAL 2018 (ORAL)

arXiv:1805.02733 [pdf, other]

Learning Optical Flow via Dilated Networks and Occlusion Reasoning

Authors: Yi Zhu, Shawn Newsam

Abstract: Despite the significant progress that has been made on estimating optical flow recently, most estimation methods, including classical and deep learning approaches, still have difficulty with multi-scale estimation, real-time computation, and/or occlusion reasoning. In this paper, we introduce dilated convolution and occlusion reasoning into unsupervised optical flow estimation to address these issues. The dilated convolution allows our network to avoid upsampling via deconvolution and the resulting gridding artifacts. Dilated convolution also results in a smaller memory footprint which speeds up inference. The occlusion reasoning prevents our network from learning incorrect deformations due to occluded image regions during training. Our proposed method outperforms state-of-the-art unsupervised approaches on the KITTI benchmark. We also demonstrate its generalization capability by applying it to action recognition in video.

Submitted 7 May, 2018; originally announced May 2018.

Comments: Accepted at ICIP 2018

arXiv:1803.08460 [pdf, other]

Towards Universal Representation for Unseen Action Recognition

Authors: Yi Zhu, Yang Long, Yu Guan, Shawn Newsam, Ling Shao

Abstract: Unseen Action Recognition (UAR) aims to recognise novel action categories without training examples. While previous methods focus on inner-dataset seen/unseen splits, this paper proposes a pipeline using a large-scale training source to achieve a Universal Representation (UR) that can generalise to a more realistic Cross-Dataset UAR (CD-UAR) scenario. We first address UAR as a Generalised Multiple-Instance Learning (GMIL) problem and discover 'building-blocks' from the large-scale ActivityNet dataset using distribution kernels. Essential visual and semantic components are preserved in a shared space to achieve the UR that can efficiently generalise to new datasets. Predicted UR exemplars can be improved by a simple semantic adaptation, and then an unseen action can be directly recognised using UR during the test. Without further training, extensive experiments manifest significant improvements over the UCF101 and HMDB51 benchmarks.

Submitted 22 March, 2018; originally announced March 2018.

Comments: Accepted at CVPR 2018

arXiv:1802.07452 [pdf, other]

Spatial Morphing Kernel Regression For Feature Interpolation

Authors: Xueqing Deng, Yi Zhu, Shawn Newsam

Abstract: In recent years, geotagged social media has become popular as a novel source for geographic knowledge discovery. Ground-level images and videos provide a different perspective than overhead imagery and can be applied to a range of applications such as land use mapping, activity detection, pollution mapping, etc. The sparse and uneven distribution of this data presents a problem, however, for generating dense maps. We therefore investigate the problem of spatially interpolating the high-dimensional features extracted from sparse social media to enable dense labeling using standard classifiers. Further, we show how prior knowledge about region boundaries can be used to improve the interpolation through spatial morphing kernel regression. We show that an interpolate-then-classify framework can produce dense maps from sparse observations but that care must be taken in choosing the interpolation method. We also show that the spatial morphing kernel improves the results.

Submitted 4 May, 2018; v1 submitted 21 February, 2018; originally announced February 2018.

Comments: accepted by ICIP 2018

arXiv:1802.02668 [pdf, other]

Fine-Grained Land Use Classification at the City Scale Using Ground-Level Images

Authors: Yi Zhu, Xueqing Deng, Shawn Newsam

Abstract: We perform fine-grained land use mapping at the city scale using ground-level images. Mapping land use is considerably more difficult than mapping land cover and is generally not possible using overhead imagery as it requires close-up views and seeing inside buildings. We postulate that the growing collections of georeferenced, ground-level images suggest an alternate approach to this geographic knowledge discovery problem. We develop a general framework that uses Flickr images to map 45 different land-use classes for the City of San Francisco. Individual images are classified using a novel convolutional neural network containing two streams, one for recognizing objects and another for recognizing scenes. This network is trained in an end-to-end manner directly on the labeled training images. We propose several strategies to overcome the noisiness of our user-generated data including search-based training set augmentation and online adaptive training. We derive a ground truth map of San Francisco in order to evaluate our method. We demonstrate the effectiveness of our approach through geo-visualization and quantitative analysis. Our framework achieves over 29% recall at the individual land parcel level which represents a strong baseline for the challenging 45-way land use classification problem especially given the noisiness of the image data.

Submitted 7 February, 2018; originally announced February 2018.

arXiv:1711.03641 [pdf, other]

Quantitative Comparison of Open-Source Data for Fine-Grain Mapping of Land Use

Authors: Xueqing Deng, Shawn Newsam

Abstract: This paper performs a quantitative comparison of open-source data available on the Internet for the fine-grain mapping of land use. Three points of interest (POI) data sources--Google Places, Bing Maps, and the Yellow Pages--and one volunteered geographic information data source--Open Street Map (OSM)--are compared with each other at the parcel level for San Francisco with respect to a proposed fine-grain land-use taxonomy. The sources are also compared to coarse-grain authoritative data which we consider to be the ground truth. Results show limited agreement among the data sources as well as limited accuracy with respect to the authoritative data even at coarse class granularity. We conclude that POI and OSM data do not appear to be sufficient alone for fine-grain land-use mapping.

Submitted 9 November, 2017; originally announced November 2017.

Comments: ACM SIGSPATIAL 2017 Workshop on Urban GIS

arXiv:1707.06316 [pdf, other]

DenseNet for Dense Flow

Authors: Yi Zhu, Shawn Newsam

Abstract: Classical approaches for estimating optical flow have achieved rapid progress in the last decade. However, most of them are too slow to be applied in real-time video analysis. Due to the great success of deep learning, recent work has focused on using CNNs to solve such dense prediction problems. In this paper, we investigate a new deep architecture, Densely Connected Convolutional Networks (DenseNet), to learn optical flow. This specific architecture is ideal for the problem at hand as it provides shortcut connections throughout the network, which leads to implicit deep supervision. We extend current DenseNet to a fully convolutional network to learn motion estimation in an unsupervised manner. Evaluation results on three standard benchmarks demonstrate that DenseNet is a better fit than other widely adopted CNN architectures for optical flow estimation.

Submitted 19 July, 2017; originally announced July 2017.

Comments: Accepted at ICIP 2017

arXiv:1706.07911 [pdf, other]

Large-Scale Mapping of Human Activity using Geo-Tagged Videos

Authors: Yi Zhu, Sen Liu, Shawn Newsam

Abstract: This paper is the first work to perform spatio-temporal mapping of human activity using the visual content of geo-tagged videos. We utilize a recent deep-learning based video analysis framework, termed hidden two-stream networks, to recognize a range of activities in YouTube videos. This framework is efficient and can run in real time or faster which is important for recognizing events as they occur in streaming video or for reducing latency in analyzing already captured video. This is, in turn, important for using video in smart-city applications. We perform a series of experiments to show our approach is able to accurately map activities both spatially and temporally. We also demonstrate the advantages of using the visual content over the tags/titles.

Submitted 28 November, 2017; v1 submitted 24 June, 2017; originally announced June 2017.

Comments: Accepted at ACM SIGSPATIAL 2017

arXiv:1706.03424 [pdf]

doi 10.1016/j.isprsjprs.2018.01.004

PatternNet: A Benchmark Dataset for Performance Evaluation of Remote Sensing Image Retrieval

Authors: Weixun Zhou, Shawn Newsam, Congmin Li, Zhenfeng Shao

Abstract: Remote sensing image retrieval (RSIR), which aims to efficiently retrieve data of interest from large collections of remote sensing data, is a fundamental task in remote sensing. Over the past several decades, there has been significant effort to extract powerful feature representations for this task since the retrieval performance depends on the representative strength of the features. Benchmark datasets are also critical for developing, evaluating, and comparing RSIR approaches. Current benchmark datasets are deficient in that 1) they were originally collected for land use/land cover classification and not image retrieval, 2) they are relatively small in terms of the number of classes as well as the number of sample images per class, and 3) the retrieval performance has saturated. These limitations have severely restricted the development of novel feature representations for RSIR, particularly the recent deep-learning based features which require large amounts of training data. We therefore present in this paper a new large-scale remote sensing dataset termed "PatternNet" that was collected specifically for RSIR. PatternNet was collected from high-resolution imagery and contains 38 classes with 800 images per class. We also provide a thorough review of RSIR approaches ranging from traditional handcrafted feature based methods to recent deep learning based ones. We evaluate over 35 methods to establish extensive baseline results for future RSIR research using the PatternNet benchmark.

Submitted 10 July, 2017; v1 submitted 11 June, 2017; originally announced June 2017.

Comments: 49 pages

arXiv:1704.03503 [pdf, other]

UC Merced Submission to the ActivityNet Challenge 2016

Authors: Yi Zhu, Shawn Newsam, Zaikun Xu

Abstract: This notebook paper describes our system for the untrimmed classification task in the ActivityNet challenge 2016. We investigate multiple state-of-the-art approaches for action recognition in long, untrimmed videos. We exploit hand-crafted motion boundary histogram features as well as feature activations from deep networks such as VGG16, GoogLeNet, and C3D. These features are separately fed to linear, one-versus-rest support vector machine classifiers to produce confidence scores for each action class. These predictions are then fused along with the softmax scores of the recent ultra-deep ResNet-101 using weighted averaging.

Submitted 11 April, 2017; originally announced April 2017.

Comments: Notebook paper for ActivityNet 2016 challenge, untrimmed video classification track

arXiv:1704.00389 [pdf, other]

Hidden Two-Stream Convolutional Networks for Action Recognition

Authors: Yi Zhu, Zhenzhong Lan, Shawn Newsam, Alexander G. Hauptmann

Abstract: Analyzing videos of human actions involves understanding the temporal relationships among video frames. State-of-the-art action recognition approaches rely on traditional optical flow estimation methods to pre-compute motion information for CNNs. Such a two-stage approach is computationally expensive, storage demanding, and not end-to-end trainable. In this paper, we present a novel CNN architecture that implicitly captures motion information between adjacent frames. We name our approach hidden two-stream CNNs because it only takes raw video frames as input and directly predicts action classes without explicitly computing optical flow. Our end-to-end approach is 10x faster than its two-stage baseline. Experimental results on four challenging action recognition datasets: UCF101, HMDB51, THUMOS14 and ActivityNet v1.2 show that our approach significantly outperforms the previous best real-time approaches.

Submitted 30 October, 2018; v1 submitted 2 April, 2017; originally announced April 2017.

Comments: Accepted at ACCV 2018, camera ready. Code available at https://github.com/bryanyzhu/Hidden-Two-Stream

arXiv:1702.02295 [pdf, other]

Guided Optical Flow Learning

Authors: Yi Zhu, Zhenzhong Lan, Shawn Newsam, Alexander G. Hauptmann

Abstract: We study the unsupervised learning of CNNs for optical flow estimation using proxy ground truth data. Supervised CNNs, due to their immense learning capacity, have shown superior performance on a range of computer vision problems including optical flow prediction. They however require the ground truth flow which is usually not accessible except on limited synthetic data. Without the guidance of ground truth optical flow, unsupervised CNNs often perform worse as they are naturally ill-conditioned. We therefore propose a novel framework in which proxy ground truth data generated from classical approaches is used to guide the CNN learning. The models are further refined in an unsupervised fashion using an image reconstruction loss. Our guided learning approach is competitive with or superior to state-of-the-art approaches on three standard benchmark datasets yet is completely unsupervised and can run in real time.

Submitted 1 July, 2017; v1 submitted 8 February, 2017; originally announced February 2017.

Comments: CVPR17 Workshop. Code available at https://github.com/bryanyzhu/GuidedNet

arXiv:1612.07403 [pdf, other]

Efficient Action Detection in Untrimmed Videos via Multi-Task Learning

Authors: Yi Zhu, Shawn Newsam

Abstract: This paper studies the joint learning of action recognition and temporal localization in long, untrimmed videos. We employ a multi-task learning framework that performs the three highly related steps of action proposal, action recognition, and action localization refinement in parallel instead of the standard sequential pipeline that performs the steps in order. We develop a novel temporal actionness regression module that estimates what proportion of a clip contains action. We use it for temporal localization but it could have other applications like video retrieval, surveillance, summarization, etc. We also introduce random shear augmentation during training to simulate viewpoint change. We evaluate our framework on three popular video benchmarks. Results demonstrate that our joint model is efficient in terms of storage and computation in that we do not need to compute and cache dense trajectory features, and that it is several times faster than its sequential ConvNets counterpart. Yet, despite being more efficient, it outperforms state-of-the-art methods with respect to accuracy.

Submitted 4 April, 2017; v1 submitted 21 December, 2016; originally announced December 2016.

Comments: WACV 2017 camera ready, minor updates about test time efficiency

arXiv:1610.03023 [pdf]

doi 10.3390/rs9050489

Learning Low Dimensional Convolutional Neural Networks for High-Resolution Remote Sensing Image Retrieval

Authors: Weixun Zhou, Shawn Newsam, Congmin Li, Zhenfeng Shao

Abstract: Learning powerful feature representations for image retrieval has always been a challenging task in the field of remote sensing. Traditional methods focus on extracting low-level hand-crafted features which are not only time-consuming but also tend to achieve unsatisfactory performance due to the content complexity of remote sensing images. In this paper, we investigate how to extract deep feature representations based on convolutional neural networks (CNN) for high-resolution remote sensing image retrieval (HRRSIR). To this end, two effective schemes are proposed to generate powerful feature representations for HRRSIR. In the first scheme, the deep features are extracted from the fully-connected and convolutional layers of the pre-trained CNN models, respectively; in the second scheme, we propose a novel CNN architecture based on conventional convolution layers and a three-layer perceptron. The novel CNN model is then trained on a large remote sensing dataset to learn low dimensional features. The two schemes are evaluated on several public and challenging datasets, and the results indicate that the proposed schemes and in particular the novel CNN are able to achieve state-of-the-art performance.

Submitted 30 December, 2016; v1 submitted 10 October, 2016; originally announced October 2016.

Journal ref: Remote Sens., 9(5), 489 (2017)

arXiv:1609.06772 [pdf, other]

Spatio-Temporal Sentiment Hotspot Detection Using Geotagged Photos

Authors: Yi Zhu, Shawn Newsam

Abstract: We perform spatio-temporal analysis of public sentiment using geotagged photo collections. We develop a deep learning-based classifier that predicts the emotion conveyed by an image. This allows us to associate sentiment with place. We perform spatial hotspot detection and show that different emotions have distinct spatial distributions that match expectations. We also perform temporal analysis using the capture time of the photos. Our spatio-temporal hotspot detection correctly identifies emerging concentrations of specific emotions and year-by-year analyses of select locations show there are strong temporal correlations between the predicted emotions and known events.

Submitted 21 September, 2016; originally announced September 2016.

Comments: To appear in ACM SIGSPATIAL 2016

arXiv:1609.06653 [pdf, other]

Land Use Classification using Convolutional Neural Networks Applied to Ground-Level Images

Authors: Yi Zhu, Shawn Newsam

Abstract: Land use mapping is a fundamental yet challenging task in geographic science. In contrast to land cover mapping, it is generally not possible using overhead imagery. The recent, explosive growth of online geo-referenced photo collections suggests an alternate approach to geographic knowledge discovery. In this work, we present a general framework that uses ground-level images from Flickr for land use mapping. Our approach benefits from several novel aspects. First, we address the noisiness of the online photo collections, such as imprecise geolocation and uneven spatial distribution, by performing location and indoor/outdoor filtering, and semi-supervised dataset augmentation. Our indoor/outdoor classifier achieves state-of-the-art performance on several benchmark datasets and approaches human-level accuracy. Second, we utilize high-level semantic image features extracted using deep learning, specifically convolutional neural networks, which allow us to achieve upwards of 76% accuracy on a challenging eight-class land use mapping problem.

Submitted 21 September, 2016; originally announced September 2016.

Comments: ACM SIGSPATIAL 2015, Best Poster Award

arXiv:1608.04339 [pdf, other]

Depth2Action: Exploring Embedded Depth for Large-Scale Action Recognition

Authors: Yi Zhu, Shawn Newsam

Abstract: This paper performs the first investigation into depth for large-scale human action recognition in video where the depth cues are estimated from the videos themselves. We develop a new framework called depth2action and experiment thoroughly into how best to incorporate the depth information. We introduce spatio-temporal depth normalization (STDN) to enforce temporal consistency in our estimated depth sequences. We also propose modified depth motion maps (MDMM) to capture the subtle temporal changes in depth. These two components significantly improve the action recognition performance. We evaluate our depth2action framework on three large-scale action recognition video benchmarks. Our model achieves state-of-the-art performance when combined with appearance and motion information thus demonstrating that depth2action is indeed complementary to existing approaches.

Submitted 15 August, 2016; originally announced August 2016.

Comments: ECCVW 2016, Web-scale Vision and Social Media (VSM) workshop

Showing 1–30 of 30 results for author: Newsam, S