Search | arXiv e-print repository

Multimodal Wildland Fire Smoke Detection

Authors: Siddhant Baldota, Shreyas Anantha Ramaprasad, Jaspreet Kaur Bhamra, Shane Luna, Ravi Ramachandra, Eugene Zen, Harrison Kim, Daniel Crawl, Ismael Perez, Ilkay Altintas, Garrison W. Cottrell, Mai H. Nguyen

Abstract: Research has shown that climate change creates warmer temperatures and drier conditions, leading to longer wildfire seasons and increased wildfire risks in the United States. These factors have in turn led to increases in the frequency, extent, and severity of wildfires in recent years. Given the danger posed by wildland fires to people, property, wildlife, and the environment, there is an urgency… ▽ More Research has shown that climate change creates warmer temperatures and drier conditions, leading to longer wildfire seasons and increased wildfire risks in the United States. These factors have in turn led to increases in the frequency, extent, and severity of wildfires in recent years. Given the danger posed by wildland fires to people, property, wildlife, and the environment, there is an urgency to provide tools for effective wildfire management. Early detection of wildfires is essential to minimizing potentially catastrophic destruction. In this paper, we present our work on integrating multiple data sources in SmokeyNet, a deep learning model using spatio-temporal information to detect smoke from wildland fires. Camera image data is integrated with weather sensor measurements and processed by SmokeyNet to create a multimodal wildland fire smoke detection system. We present our results comparing performance in terms of both accuracy and time-to-detection for multimodal data vs. a single data source. With a time-to-detection of only a few minutes, SmokeyNet can serve as an automated early notification system, providing a useful tool in the fight against destructive wildfires. △ Less

Submitted 28 December, 2022; originally announced December 2022.

arXiv:2112.08598 [pdf, other]

doi 10.3390/rs14041007

FIgLib & SmokeyNet: Dataset and Deep Learning Model for Real-Time Wildland Fire Smoke Detection

Authors: Anshuman Dewangan, Yash Pande, Hans-Werner Braun, Frank Vernon, Ismael Perez, Ilkay Altintas, Garrison W. Cottrell, Mai H. Nguyen

Abstract: The size and frequency of wildland fires in the western United States have dramatically increased in recent years. On high-fire-risk days, a small fire ignition can rapidly grow and become out of control. Early detection of fire ignitions from initial smoke can assist the response to such fires before they become difficult to manage. Past deep learning approaches for wildfire smoke detection have… ▽ More The size and frequency of wildland fires in the western United States have dramatically increased in recent years. On high-fire-risk days, a small fire ignition can rapidly grow and become out of control. Early detection of fire ignitions from initial smoke can assist the response to such fires before they become difficult to manage. Past deep learning approaches for wildfire smoke detection have suffered from small or unreliable datasets that make it difficult to extrapolate performance to real-world scenarios. In this work, we present the Fire Ignition Library (FIgLib), a publicly available dataset of nearly 25,000 labeled wildfire smoke images as seen from fixed-view cameras deployed in Southern California. We also introduce SmokeyNet, a novel deep learning architecture using spatiotemporal information from camera imagery for real-time wildfire smoke detection. When trained on the FIgLib dataset, SmokeyNet outperforms comparable baselines and rivals human performance. We hope that the availability of the FIgLib dataset and the SmokeyNet architecture will inspire further research into deep learning methods for wildfire smoke detection, leading to automated notification systems that reduce the time to wildfire response. △ Less

Submitted 14 May, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

Journal ref: Remote Sensing. 2022; 14(4):1007

arXiv:2005.08072 [pdf, other]

Speech Recognition and Multi-Speaker Diarization of Long Conversations

Authors: Huanru Henry Mao, Shuyang Li, Julian McAuley, Garrison Cottrell

Abstract: Speech recognition (ASR) and speaker diarization (SD) models have traditionally been trained separately to produce rich conversation transcripts with speaker labels. Recent advances have shown that joint ASR and SD models can learn to leverage audio-lexical inter-dependencies to improve word diarization performance. We introduce a new benchmark of hour-long podcasts collected from the weekly This… ▽ More Speech recognition (ASR) and speaker diarization (SD) models have traditionally been trained separately to produce rich conversation transcripts with speaker labels. Recent advances have shown that joint ASR and SD models can learn to leverage audio-lexical inter-dependencies to improve word diarization performance. We introduce a new benchmark of hour-long podcasts collected from the weekly This American Life radio program to better compare these approaches when applied to extended multi-speaker conversations. We find that training separate ASR and SD models perform better when utterance boundaries are known but otherwise joint models can perform better. To handle long conversations with unknown utterance boundaries, we introduce a striding attention decoding algorithm and data augmentation techniques which, combined with model pre-training, improves ASR and SD. △ Less

Submitted 4 November, 2020; v1 submitted 16 May, 2020; originally announced May 2020.

arXiv:2004.02032 [pdf, other]

Generating Rationales in Visual Question Answering

Authors: Hammad A. Ayyubi, Md. Mehrab Tanjim, Julian J. McAuley, Garrison W. Cottrell

Abstract: Despite recent advances in Visual QuestionAnswering (VQA), it remains a challenge todetermine how much success can be attributedto sound reasoning and comprehension ability.We seek to investigate this question by propos-ing a new task ofrationale generation. Es-sentially, we task a VQA model with generat-ing rationales for the answers it predicts. Weuse data from the Visual Commonsense Rea-soning… ▽ More Despite recent advances in Visual QuestionAnswering (VQA), it remains a challenge todetermine how much success can be attributedto sound reasoning and comprehension ability.We seek to investigate this question by propos-ing a new task ofrationale generation. Es-sentially, we task a VQA model with generat-ing rationales for the answers it predicts. Weuse data from the Visual Commonsense Rea-soning (VCR) task, as it contains ground-truthrationales along with visual questions and an-swers. We first investigate commonsense un-derstanding in one of the leading VCR mod-els, ViLBERT, by generating rationales frompretrained weights using a state-of-the-art lan-guage model, GPT-2. Next, we seek to jointlytrain ViLBERT with GPT-2 in an end-to-endfashion with the dual task of predicting the an-swer in VQA and generating rationales. Weshow that this kind of training injects com-monsense understanding in the VQA modelthrough quantitative and qualitative evaluationmetrics △ Less

Submitted 4 April, 2020; originally announced April 2020.

arXiv:2003.04887 [pdf, other]

ReZero is All You Need: Fast Convergence at Large Depth

Authors: Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, Julian McAuley

Abstract: Deep networks often suffer from vanishing or exploding gradients due to inefficient signal propagation, leading to long training times or convergence difficulties. Various architecture designs, sophisticated residual-style networks, and initialization schemes have been shown to improve deep signal propagation. Recently, Pennington et al. used free probability theory to show that dynamical isometry… ▽ More Deep networks often suffer from vanishing or exploding gradients due to inefficient signal propagation, leading to long training times or convergence difficulties. Various architecture designs, sophisticated residual-style networks, and initialization schemes have been shown to improve deep signal propagation. Recently, Pennington et al. used free probability theory to show that dynamical isometry plays an integral role in efficient deep learning. We show that the simplest architecture change of gating each residual connection using a single zero-initialized parameter satisfies initial dynamical isometry and outperforms more complex approaches. Although much simpler than its predecessors, this gate enables training thousands of fully connected layers with fast convergence and better test performance for ResNets trained on CIFAR-10. We apply this technique to language modeling and find that we can easily train 120-layer Transformers. When applied to 12 layer Transformers, it converges 56% faster on enwiki8. △ Less

Submitted 24 June, 2020; v1 submitted 10 March, 2020; originally announced March 2020.

arXiv:2002.07405 [pdf, other]

Deflecting Adversarial Attacks

Authors: Yao Qin, Nicholas Frosst, Colin Raffel, Garrison Cottrell, Geoffrey Hinton

Abstract: There has been an ongoing cycle where stronger defenses against adversarial attacks are subsequently broken by a more advanced defense-aware attack. We present a new approach towards ending this cycle where we "deflect'' adversarial attacks by causing the attacker to produce an input that semantically resembles the attack's target class. To this end, we first propose a stronger defense based on Ca… ▽ More There has been an ongoing cycle where stronger defenses against adversarial attacks are subsequently broken by a more advanced defense-aware attack. We present a new approach towards ending this cycle where we "deflect'' adversarial attacks by causing the attacker to produce an input that semantically resembles the attack's target class. To this end, we first propose a stronger defense based on Capsule Networks that combines three detection mechanisms to achieve state-of-the-art detection performance on both standard and defense-aware attacks. We then show that undetected attacks against our defense often perceptually resemble the adversarial target class by performing a human study where participants are asked to label images produced by the attack. These attack images can no longer be called "adversarial'' because our network classifies them the same way as humans do. △ Less

Submitted 18 February, 2020; originally announced February 2020.

arXiv:1908.09451 [pdf, ps, other]

Improving Neural Story Generation by Targeted Common Sense Grounding

Authors: Huanru Henry Mao, Bodhisattwa Prasad Majumder, Julian McAuley, Garrison W. Cottrell

Abstract: Stories generated with neural language models have shown promise in grammatical and stylistic consistency. However, the generated stories are still lacking in common sense reasoning, e.g., they often contain sentences deprived of world knowledge. We propose a simple multi-task learning scheme to achieve quantitatively better common sense reasoning in language models by leveraging auxiliary trainin… ▽ More Stories generated with neural language models have shown promise in grammatical and stylistic consistency. However, the generated stories are still lacking in common sense reasoning, e.g., they often contain sentences deprived of world knowledge. We propose a simple multi-task learning scheme to achieve quantitatively better common sense reasoning in language models by leveraging auxiliary training signals from datasets designed to provide common sense grounding. When combined with our two-stage fine-tuning pipeline, our method achieves improved common sense reasoning and state-of-the-art perplexity on the Writing Prompts (Fan et al., 2018) story generation dataset. △ Less

Submitted 27 February, 2020; v1 submitted 25 August, 2019; originally announced August 2019.

arXiv:1907.04868 [pdf, other]

LakhNES: Improving multi-instrumental music generation with cross-domain pre-training

Authors: Chris Donahue, Huanru Henry Mao, Yiting Ethan Li, Garrison W. Cottrell, Julian McAuley

Abstract: We are interested in the task of generating multi-instrumental music scores. The Transformer architecture has recently shown great promise for the task of piano score generation; here we adapt it to the multi-instrumental setting. Transformers are complex, high-dimensional language models which are capable of capturing long-term structure in sequence data, but require large amounts of data to fit.… ▽ More We are interested in the task of generating multi-instrumental music scores. The Transformer architecture has recently shown great promise for the task of piano score generation; here we adapt it to the multi-instrumental setting. Transformers are complex, high-dimensional language models which are capable of capturing long-term structure in sequence data, but require large amounts of data to fit. Their success on piano score generation is partially explained by the large volumes of symbolic data readily available for that domain. We leverage the recently-introduced NES-MDB dataset of four-instrument scores from an early video game sound synthesis chip (the NES), which we find to be well-suited to training with the Transformer architecture. To further improve the performance of our model, we propose a pre-training technique to leverage the information in a large collection of heterogeneous music, namely the Lakh MIDI dataset. Despite differences between the two corpora, we find that this transfer learning procedure improves both quantitative and qualitative performance for our primary task. △ Less

Submitted 10 July, 2019; originally announced July 2019.

Comments: Published as a conference paper at ISMIR 2019

arXiv:1907.02957 [pdf, other]

Detecting and Diagnosing Adversarial Images with Class-Conditional Capsule Reconstructions

Authors: Yao Qin, Nicholas Frosst, Sara Sabour, Colin Raffel, Garrison Cottrell, Geoffrey Hinton

Abstract: Adversarial examples raise questions about whether neural network models are sensitive to the same visual features as humans. In this paper, we first detect adversarial examples or otherwise corrupted images based on a class-conditional reconstruction of the input. To specifically attack our detection mechanism, we propose the Reconstructive Attack which seeks both to cause a misclassification and… ▽ More Adversarial examples raise questions about whether neural network models are sensitive to the same visual features as humans. In this paper, we first detect adversarial examples or otherwise corrupted images based on a class-conditional reconstruction of the input. To specifically attack our detection mechanism, we propose the Reconstructive Attack which seeks both to cause a misclassification and a low reconstruction error. This reconstructive attack produces undetected adversarial examples but with much smaller success rate. Among all these attacks, we find that CapsNets always perform better than convolutional networks. Then, we diagnose the adversarial examples for CapsNets and find that the success of the reconstructive attack is highly related to the visual similarity between the source and target class. Additionally, the resulting perturbations can cause the input image to appear visually more like the target class and hence become non-adversarial. This suggests that CapsNets use features that are more aligned with human perception and have the potential to address the central issue raised by adversarial examples. △ Less

Submitted 18 February, 2020; v1 submitted 5 July, 2019; originally announced July 2019.

Journal ref: ICLR 2020

arXiv:1903.10346 [pdf, other]

Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition

Authors: Yao Qin, Nicholas Carlini, Ian Goodfellow, Garrison Cottrell, Colin Raffel

Abstract: Adversarial examples are inputs to machine learning models designed by an adversary to cause an incorrect output. So far, adversarial examples have been studied most extensively in the image domain. In this domain, adversarial examples can be constructed by imperceptibly modifying images to cause misclassification, and are practical in the physical world. In contrast, current targeted adversarial… ▽ More Adversarial examples are inputs to machine learning models designed by an adversary to cause an incorrect output. So far, adversarial examples have been studied most extensively in the image domain. In this domain, adversarial examples can be constructed by imperceptibly modifying images to cause misclassification, and are practical in the physical world. In contrast, current targeted adversarial examples applied to speech recognition systems have neither of these properties: humans can easily identify the adversarial perturbations, and they are not effective when played over-the-air. This paper makes advances on both of these fronts. First, we develop effectively imperceptible audio adversarial examples (verified through a human study) by leveraging the psychoacoustic principle of auditory masking, while retaining 100% targeted success rate on arbitrary full-sentence targets. Next, we make progress towards physical-world over-the-air audio adversarial examples by constructing perturbations which remain effective even after applying realistic simulated environmental distortions. △ Less

Submitted 7 June, 2019; v1 submitted 22 March, 2019; originally announced March 2019.

Comments: International Conference on Machine Learning (ICML), 2019

arXiv:1805.08403 [pdf, other]

Autofocus Layer for Semantic Segmentation

Authors: Yao Qin, Konstantinos Kamnitsas, Siddharth Ancha, Jay Nanavati, Garrison Cottrell, Antonio Criminisi, Aditya Nori

Abstract: We propose the autofocus convolutional layer for semantic segmentation with the objective of enhancing the capabilities of neural networks for multi-scale processing. Autofocus layers adaptively change the size of the effective receptive field based on the processed context to generate more powerful features. This is achieved by parallelising multiple convolutional layers with different dilation r… ▽ More We propose the autofocus convolutional layer for semantic segmentation with the objective of enhancing the capabilities of neural networks for multi-scale processing. Autofocus layers adaptively change the size of the effective receptive field based on the processed context to generate more powerful features. This is achieved by parallelising multiple convolutional layers with different dilation rates, combined by an attention mechanism that learns to focus on the optimal scales driven by context. By sharing the weights of the parallel convolutions we make the network scale-invariant, with only a modest increase in the number of parameters. The proposed autofocus layer can be easily integrated into existing networks to improve a model's representational power. We evaluate our models on the challenging tasks of multi-organ segmentation in pelvic CT and brain tumor segmentation in MRI and achieve very promising performance. △ Less

Submitted 11 June, 2018; v1 submitted 22 May, 2018; originally announced May 2018.

Comments: Published on MICCAI 2018

arXiv:1801.00887 [pdf, other]

doi 10.1109/ICSC.2018.00077

DeepJ: Style-Specific Music Generation

Authors: Huanru Henry Mao, Taylor Shin, Garrison W. Cottrell

Abstract: Recent advances in deep neural networks have enabled algorithms to compose music that is comparable to music composed by humans. However, few algorithms allow the user to generate music with tunable parameters. The ability to tune properties of generated music will yield more practical benefits for aiding artists, filmmakers, and composers in their creative tasks. In this paper, we introduce DeepJ… ▽ More Recent advances in deep neural networks have enabled algorithms to compose music that is comparable to music composed by humans. However, few algorithms allow the user to generate music with tunable parameters. The ability to tune properties of generated music will yield more practical benefits for aiding artists, filmmakers, and composers in their creative tasks. In this paper, we introduce DeepJ - an end-to-end generative model that is capable of composing music conditioned on a specific mixture of composer styles. Our innovations include methods to learn musical style and music dynamics. We use our model to demonstrate a simple technique for controlling the style of generated music as a proof of concept. Evaluation of our model using human raters shows that we have improved over the Biaxial LSTM approach. △ Less

Submitted 2 January, 2018; originally announced January 2018.

arXiv:1711.05255 [pdf, other]

Deep-ESN: A Multiple Projection-encoding Hierarchical Reservoir Computing Framework

Authors: Qianli Ma, Lifeng Shen, Garrison W. Cottrell

Abstract: As an efficient recurrent neural network (RNN) model, reservoir computing (RC) models, such as Echo State Networks, have attracted widespread attention in the last decade. However, while they have had great success with time series data [1], [2], many time series have a multiscale structure, which a single-hidden-layer RC model may have difficulty capturing. In this paper, we propose a novel hiera… ▽ More As an efficient recurrent neural network (RNN) model, reservoir computing (RC) models, such as Echo State Networks, have attracted widespread attention in the last decade. However, while they have had great success with time series data [1], [2], many time series have a multiscale structure, which a single-hidden-layer RC model may have difficulty capturing. In this paper, we propose a novel hierarchical reservoir computing framework we call Deep Echo State Networks (Deep-ESNs). The most distinctive feature of a Deep-ESN is its ability to deal with time series through hierarchical projections. Specifically, when an input time series is projected into the high-dimensional echo-state space of a reservoir, a subsequent encoding layer (e.g., a PCA, autoencoder, or a random projection) can project the echo-state representations into a lower-dimensional space. These low-dimensional representations can then be processed by another ESN. By using projection layers and encoding layers alternately in the hierarchical framework, a Deep-ESN can not only attenuate the effects of the collinearity problem in ESNs, but also fully take advantage of the temporal kernel property of ESNs to explore multiscale dynamics of time series. To fuse the multiscale representations obtained by each reservoir, we add connections from each encoding layer to the last output layer. Theoretical analyses prove that stability of a Deep-ESN is guaranteed by the echo state property (ESP), and the time complexity is equivalent to a conventional ESN. Experimental results on some artificial and real world time series demonstrate that Deep-ESNs can capture multiscale dynamics, and outperform both standard ESNs and previous hierarchical ESN-based models. △ Less

Submitted 13 November, 2017; originally announced November 2017.

arXiv:1708.03901 [pdf, other]

Belief Tree Search for Active Object Recognition

Authors: Mohsen Malmir, Garrison W. Cottrell

Abstract: Active Object Recognition (AOR) has been approached as an unsupervised learning problem, in which optimal trajectories for object inspection are not known and are to be discovered by reducing label uncertainty measures or training with reinforcement learning. Such approaches have no guarantees of the quality of their solution. In this paper, we treat AOR as a Partially Observable Markov Decision P… ▽ More Active Object Recognition (AOR) has been approached as an unsupervised learning problem, in which optimal trajectories for object inspection are not known and are to be discovered by reducing label uncertainty measures or training with reinforcement learning. Such approaches have no guarantees of the quality of their solution. In this paper, we treat AOR as a Partially Observable Markov Decision Process (POMDP) and find near-optimal policies on training data using Belief Tree Search (BTS) on the corresponding belief Markov Decision Process (MDP). AOR then reduces to the problem of knowledge transfer from near-optimal policies on training set to the test set. We train a Long Short Term Memory (LSTM) network to predict the best next action on the training set rollouts. We sho that the proposed AOR method generalizes well to novel views of familiar objects and also to novel objects. We compare this supervised scheme against guided policy search, and find that the LSTM network reaches higher recognition accuracy compared to the guided policy method. We further look into optimizing the observation function to increase the total collected reward of optimal policy. In AOR, the observation function is known only approximately. We propose a gradient-based method update to this approximate observation function to increase the total reward of any policy. We show that by optimizing the observation function and retraining the supervised LSTM network, the AOR performance on the test set improves significantly. △ Less

Submitted 13 August, 2017; originally announced August 2017.

Comments: IROS 2017

arXiv:1707.05911 [pdf, other]

Recognizing and Curating Photo Albums via Event-Specific Image Importance

Authors: Yufei Wang, Zhe Lin, Xiaohui Shen, Radomir Mech, Gavin Miller, Garrison W. Cottrell

Abstract: Automatic organization of personal photos is a problem with many real world ap- plications, and can be divided into two main tasks: recognizing the event type of the photo collection, and selecting interesting images from the collection. In this paper, we attempt to simultaneously solve both tasks: album-wise event recognition and image- wise importance prediction. We collected an album dataset wi… ▽ More Automatic organization of personal photos is a problem with many real world ap- plications, and can be divided into two main tasks: recognizing the event type of the photo collection, and selecting interesting images from the collection. In this paper, we attempt to simultaneously solve both tasks: album-wise event recognition and image- wise importance prediction. We collected an album dataset with both event type labels and image importance labels, refined from an existing CUFED dataset. We propose a hybrid system consisting of three parts: A siamese network-based event-specific image importance prediction, a Convolutional Neural Network (CNN) that recognizes the event type, and a Long Short-Term Memory (LSTM)-based sequence level event recognizer. We propose an iterative updating procedure for event type and image importance score prediction. We experimentally verified that image importance score prediction and event type recognition can each help the performance of the other. △ Less

Submitted 18 July, 2017; originally announced July 2017.

Comments: Accepted as oral in BMVC 2017

arXiv:1705.09425 [pdf, other]

Hierarchical Cellular Automata for Visual Saliency

Authors: Yao Qin, Mengyang Feng, Huchuan Lu, Garrison W. Cottrell

Abstract: Saliency detection, finding the most important parts of an image, has become increasingly popular in computer vision. In this paper, we introduce Hierarchical Cellular Automata (HCA) -- a temporally evolving model to intelligently detect salient objects. HCA consists of two main components: Single-layer Cellular Automata (SCA) and Cuboid Cellular Automata (CCA). As an unsupervised propagation mech… ▽ More Saliency detection, finding the most important parts of an image, has become increasingly popular in computer vision. In this paper, we introduce Hierarchical Cellular Automata (HCA) -- a temporally evolving model to intelligently detect salient objects. HCA consists of two main components: Single-layer Cellular Automata (SCA) and Cuboid Cellular Automata (CCA). As an unsupervised propagation mechanism, Single-layer Cellular Automata can exploit the intrinsic relevance of similar regions through interactions with neighbors. Low-level image features as well as high-level semantic information extracted from deep neural networks are incorporated into the SCA to measure the correlation between different image patches. With these hierarchical deep features, an impact factor matrix and a coherence matrix are constructed to balance the influences on each cell's next state. The saliency values of all cells are iteratively updated according to a well-defined update rule. Furthermore, we propose CCA to integrate multiple saliency maps generated by SCA at different scales in a Bayesian framework. Therefore, single-layer propagation and multi-layer integration are jointly modeled in our unified HCA. Surprisingly, we find that the SCA can improve all existing methods that we applied it to, resulting in a similar precision level regardless of the original results. The CCA can act as an efficient pixel-wise aggregation algorithm that can integrate state-of-the-art methods, resulting in even better results. Extensive experiments on four challenging datasets demonstrate that the proposed algorithm outperforms state-of-the-art conventional methods and is competitive with deep learning based approaches. △ Less

Submitted 25 May, 2017; originally announced May 2017.

arXiv:1705.04282 [pdf, other]

Learning to see people like people

Authors: Amanda Song, Linjie Li, Chad Atalla, Garrison Cottrell

Abstract: Humans make complex inferences on faces, ranging from objective properties (gender, ethnicity, expression, age, identity, etc) to subjective judgments (facial attractiveness, trustworthiness, sociability, friendliness, etc). While the objective aspects of face perception have been extensively studied, relatively fewer computational models have been developed for the social impressions of faces. Br… ▽ More Humans make complex inferences on faces, ranging from objective properties (gender, ethnicity, expression, age, identity, etc) to subjective judgments (facial attractiveness, trustworthiness, sociability, friendliness, etc). While the objective aspects of face perception have been extensively studied, relatively fewer computational models have been developed for the social impressions of faces. Bridging this gap, we develop a method to predict human impressions of faces in 40 subjective social dimensions, using deep representations from state-of-the-art neural networks. We find that model performance grows as the human consensus on a face trait increases, and that model predictions outperform human groups in correlation with human averages. This illustrates the learnability of subjective social perception of faces, especially when there is high human consensus. Our system can be used to decide which photographs from a personal collection will make the best impression. The results are significant for the field of social robotics, demonstrating that robots can learn the subjective judgments defining the underlying fabric of human interaction. △ Less

Submitted 5 May, 2017; originally announced May 2017.

Comments: 10 pages

arXiv:1704.06972 [pdf, other]

Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition

Authors: Yufei Wang, Zhe Lin, Xiaohui Shen, Scott Cohen, Garrison W. Cottrell

Abstract: Recently, there has been a lot of interest in automatically generating descriptions for an image. Most existing language-model based approaches for this task learn to generate an image description word by word in its original word order. However, for humans, it is more natural to locate the objects and their relationships first, and then elaborate on each object, describing notable attributes. We… ▽ More Recently, there has been a lot of interest in automatically generating descriptions for an image. Most existing language-model based approaches for this task learn to generate an image description word by word in its original word order. However, for humans, it is more natural to locate the objects and their relationships first, and then elaborate on each object, describing notable attributes. We present a coarse-to-fine method that decomposes the original image description into a skeleton sentence and its attributes, and generates the skeleton sentence and attribute phrases separately. By this decomposition, our method can generate more accurate and novel descriptions than the previous state-of-the-art. Experimental results on the MS-COCO and a larger scale Stock3M datasets show that our algorithm yields consistent improvements across different evaluation metrics, especially on the SPICE metric, which has much higher correlation with human ratings than the conventional metrics. Furthermore, our algorithm can generate descriptions with varied length, benefiting from the separate control of the skeleton and attributes. This enables image description generation that better accommodates user preferences. △ Less

Submitted 23 April, 2017; originally announced April 2017.

Comments: Accepted by CVPR 2017

arXiv:1704.02971 [pdf, other]

A Dual-Stage Attention-Based Recurrent Neural Network for Time Series Prediction

Authors: Yao Qin, Dong** Song, Haifeng Chen, Wei Cheng, Guofei Jiang, Garrison Cottrell

Abstract: The Nonlinear autoregressive exogenous (NARX) model, which predicts the current value of a time series based upon its previous values as well as the current and past values of multiple driving (exogenous) series, has been studied for decades. Despite the fact that various NARX models have been developed, few of them can capture the long-term temporal dependencies appropriately and select the relev… ▽ More The Nonlinear autoregressive exogenous (NARX) model, which predicts the current value of a time series based upon its previous values as well as the current and past values of multiple driving (exogenous) series, has been studied for decades. Despite the fact that various NARX models have been developed, few of them can capture the long-term temporal dependencies appropriately and select the relevant driving series to make predictions. In this paper, we propose a dual-stage attention-based recurrent neural network (DA-RNN) to address these two issues. In the first stage, we introduce an input attention mechanism to adaptively extract relevant driving series (a.k.a., input features) at each time step by referring to the previous encoder hidden state. In the second stage, we use a temporal attention mechanism to select relevant encoder hidden states across all time steps. With this dual-stage attention scheme, our model can not only make predictions effectively, but can also be easily interpreted. Thorough empirical studies based upon the SML 2010 dataset and the NASDAQ 100 Stock dataset demonstrate that the DA-RNN can outperform state-of-the-art methods for time series prediction. △ Less

Submitted 14 August, 2017; v1 submitted 7 April, 2017; originally announced April 2017.

Comments: International Joint Conference on Artificial Intelligence (IJCAI), 2017

arXiv:1702.08502 [pdf, other]

Understanding Convolution for Semantic Segmentation

Authors: Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, Garrison Cottrell

Abstract: Recent advances in deep learning, especially deep convolutional neural networks (CNNs), have led to significant improvement over previous semantic segmentation systems. Here we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are of both theoretical and practical value. First, we design dense upsampling convolution (DUC) to generate pixel-lev… ▽ More Recent advances in deep learning, especially deep convolutional neural networks (CNNs), have led to significant improvement over previous semantic segmentation systems. Here we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are of both theoretical and practical value. First, we design dense upsampling convolution (DUC) to generate pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields (RF) of the network to aggregate global information; 2) alleviates what we call the "gridding issue" caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a state-of-art result of 80.1% mIOU in the test set at the time of submission. We also have achieved state-of-the-art overall on the KITTI road estimation benchmark and the PASCAL VOC2012 segmentation task. Our source code can be found at https://github.com/TuSimple/TuSimple-DUC . △ Less

Submitted 31 May, 2018; v1 submitted 27 February, 2017; originally announced February 2017.

Comments: WACV 2018. Updated acknowledgements. Source code: https://github.com/TuSimple/TuSimple-DUC

arXiv:1604.07872 [pdf]

Are Face and Object Recognition Independent? A Neurocomputational Modeling Exploration

Authors: Panqu Wang, Isabel Gauthier, Garrison Cottrell

Abstract: Are face and object recognition abilities independent? Although it is commonly believed that they are, Gauthier et al.(2014) recently showed that these abilities become more correlated as experience with nonface categories increases. They argued that there is a single underlying visual ability, v, that is expressed in performance with both face and nonface categories as experience grows. Using the… ▽ More Are face and object recognition abilities independent? Although it is commonly believed that they are, Gauthier et al.(2014) recently showed that these abilities become more correlated as experience with nonface categories increases. They argued that there is a single underlying visual ability, v, that is expressed in performance with both face and nonface categories as experience grows. Using the Cambridge Face Memory Test and the Vanderbilt Expertise Test, they showed that the shared variance between Cambridge Face Memory Test and Vanderbilt Expertise Test performance increases monotonically as experience increases. Here, we address why a shared resource across different visual domains does not lead to competition and to an inverse correlation in abilities? We explain this conundrum using our neurocomputational model of face and object processing (The Model, TM). Our results show that, as in the behavioral data, the correlation between subordinate level face and object recognition accuracy increases as experience grows. We suggest that different domains do not compete for resources because the relevant features are shared between faces and objects. The essential power of experience is to generate a "spreading transform" for faces that generalizes to objects that must be individuated. Interestingly, when the task of the network is basic level categorization, no increase in the correlation between domains is observed. Hence, our model predicts that it is the type of experience that matters and that the source of the correlation is in the fusiform face area, rather than in cortical areas that subserve basic level categorization. This result is consistent with our previous modeling elucidating why the FFA is recruited for novel domains of expertise (Tong et al., 2008). △ Less

Submitted 26 April, 2016; originally announced April 2016.

Journal ref: Journal of Cognitive Neuroscience, 28(4):558-574. 2016

arXiv:1604.07457 [pdf, other]

Modeling the Contribution of Central Versus Peripheral Vision in Scene, Object, and Face Recognition

Authors: Panqu Wang, Garrison Cottrell

Abstract: It is commonly believed that the central visual field is important for recognizing objects and faces, and the peripheral region is useful for scene recognition. However, the relative importance of central versus peripheral information for object, scene, and face recognition is unclear. In a behavioral study, Larson and Loschky (2009) investigated this question by measuring the scene recognition ac… ▽ More It is commonly believed that the central visual field is important for recognizing objects and faces, and the peripheral region is useful for scene recognition. However, the relative importance of central versus peripheral information for object, scene, and face recognition is unclear. In a behavioral study, Larson and Loschky (2009) investigated this question by measuring the scene recognition accuracy as a function of visual angle, and demonstrated that peripheral vision was indeed more useful in recognizing scenes than central vision. In this work, we modeled and replicated the result of Larson and Loschky (2009), using deep convolutional neural networks. Having fit the data for scenes, we used the model to predict future data for large-scale scene recognition as well as for objects and faces. Our results suggest that the relative order of importance of using central visual field information is face recognition>object recognition>scene recognition, and vice-versa for peripheral information. △ Less

Submitted 25 April, 2016; originally announced April 2016.

Comments: CogSci 2016 Conference Paper

arXiv:1602.08486 [pdf, other]

A Single Model Explains both Visual and Auditory Precortical Coding

Authors: Honghao Shan, Matthew H. Tong, Garrison W. Cottrell

Abstract: Precortical neural systems encode information collected by the senses, but the driving principles of the encoding used have remained a subject of debate. We present a model of retinal coding that is based on three constraints: information preservation, minimization of the neural wiring, and response equalization. The resulting novel version of sparse principal components analysis successfully capt… ▽ More Precortical neural systems encode information collected by the senses, but the driving principles of the encoding used have remained a subject of debate. We present a model of retinal coding that is based on three constraints: information preservation, minimization of the neural wiring, and response equalization. The resulting novel version of sparse principal components analysis successfully captures a number of known characteristics of the retinal coding system, such as center-surround receptive fields, color opponency channels, and spatiotemporal responses that correspond to magnocellular and parvocellular pathways. Furthermore, when trained on auditory data, the same model learns receptive fields well fit by gammatone filters, commonly used to model precortical auditory coding. This suggests that efficient coding may be a unifying principle of precortical encoding across modalities. △ Less

Submitted 7 April, 2016; v1 submitted 26 February, 2016; originally announced February 2016.

arXiv:1512.05484 [pdf, other]

Deep Active Object Recognition by Joint Label and Action Prediction

Authors: Mohsen Malmir, Karan Sikka, Deborah Forster, Ian Fasel, Javier R. Movellan, Garrison W. Cottrell

Abstract: An active object recognition system has the advantage of being able to act in the environment to capture images that are more suited for training and that lead to better performance at test time. In this paper, we propose a deep convolutional neural network for active object recognition that simultaneously predicts the object label, and selects the next action to perform on the object with the aim… ▽ More An active object recognition system has the advantage of being able to act in the environment to capture images that are more suited for training and that lead to better performance at test time. In this paper, we propose a deep convolutional neural network for active object recognition that simultaneously predicts the object label, and selects the next action to perform on the object with the aim of improving recognition performance. We treat active object recognition as a reinforcement learning problem and derive the cost function to train the network for joint prediction of the object label and the action. A generative model of object similarities based on the Dirichlet distribution is proposed and embedded in the network for encoding the state of the system. The training is carried out by simultaneously minimizing the label and action prediction errors using gradient descent. We empirically show that the proposed network is able to predict both the object label and the actions on GERMS, a dataset for active object recognition. We compare the test label prediction accuracy of the proposed model with Dirichlet and Naive Bayes state encoding. The results of experiments suggest that the proposed model equipped with Dirichlet state encoding is superior in performance, and selects images that lead to better training and higher accuracy of label prediction at test time. △ Less

Submitted 17 December, 2015; originally announced December 2015.

arXiv:1511.04103 [pdf, other]

Basic Level Categorization Facilitates Visual Object Recognition

Authors: Panqu Wang, Garrison W. Cottrell

Abstract: Recent advances in deep learning have led to significant progress in the computer vision field, especially for visual object recognition tasks. The features useful for object classification are learned by feed-forward deep convolutional neural networks (CNNs) automatically, and they are shown to be able to predict and decode neural representations in the ventral visual pathway of humans and monkey… ▽ More Recent advances in deep learning have led to significant progress in the computer vision field, especially for visual object recognition tasks. The features useful for object classification are learned by feed-forward deep convolutional neural networks (CNNs) automatically, and they are shown to be able to predict and decode neural representations in the ventral visual pathway of humans and monkeys. However, despite the huge amount of work on optimizing CNNs, there has not been much research focused on linking CNNs with guiding principles from the human visual cortex. In this work, we propose a network optimization strategy inspired by both of the developmental trajectory of children's visual object recognition capabilities, and Bar (2003), who hypothesized that basic level information is carried in the fast magnocellular pathway through the prefrontal cortex (PFC) and then projected back to inferior temporal cortex (IT), where subordinate level categorization is achieved. We instantiate this idea by training a deep CNN to perform basic level object categorization first, and then train it on subordinate level categorization. We apply this idea to training AlexNet (Krizhevsky et al., 2012) on the ILSVRC 2012 dataset and show that the top-5 accuracy increases from 80.13% to 82.14%, demonstrating the effectiveness of the method. We also show that subsequent transfer learning on smaller datasets gives superior results. △ Less

Submitted 7 January, 2016; v1 submitted 12 November, 2015; originally announced November 2015.

Comments: ICLR 2016 submission R1

arXiv:1412.6177 [pdf, other]

Example Selection For Dictionary Learning

Authors: Tomoki Tsuchida, Garrison W. Cottrell

Abstract: In unsupervised learning, an unbiased uniform sampling strategy is typically used, in order that the learned features faithfully encode the statistical structure of the training data. In this work, we explore whether active example selection strategies - algorithms that select which examples to use, based on the current estimate of the features - can accelerate learning. Specifically, we investiga… ▽ More In unsupervised learning, an unbiased uniform sampling strategy is typically used, in order that the learned features faithfully encode the statistical structure of the training data. In this work, we explore whether active example selection strategies - algorithms that select which examples to use, based on the current estimate of the features - can accelerate learning. Specifically, we investigate effects of heuristic and saliency-inspired selection algorithms on the dictionary learning task with sparse activations. We show that some selection algorithms do improve the speed of learning, and we speculate on why they might work. △ Less

Submitted 31 March, 2015; v1 submitted 18 December, 2014; originally announced December 2014.

arXiv:1312.6077 [pdf]

Efficient Visual Coding: From Retina To V2

Authors: Honghao Shan, Garrison Cottrell

Abstract: The human visual system has a hierarchical structure consisting of layers of processing, such as the retina, V1, V2, etc. Understanding the functional roles of these visual processing layers would help to integrate the psychophysiological and neurophysiological models into a consistent theory of human vision, and would also provide insights to computer vision research. One classical theory of the… ▽ More The human visual system has a hierarchical structure consisting of layers of processing, such as the retina, V1, V2, etc. Understanding the functional roles of these visual processing layers would help to integrate the psychophysiological and neurophysiological models into a consistent theory of human vision, and would also provide insights to computer vision research. One classical theory of the early visual pathway hypothesizes that it serves to capture the statistical structure of the visual inputs by efficiently coding the visual information in its outputs. Until recently, most computational models following this theory have focused upon explaining the receptive field properties of one or two visual layers. Recent work in deep networks has eliminated this concern, however, there is till the retinal layer to consider. Here we improve on a previously-described hierarchical model Recursive ICA (RICA) [1] which starts with PCA, followed by a layer of sparse coding or ICA, followed by a component-wise nonlinearity derived from considerations of the variable distributions expected by ICA. This process is then repeated. In this work, we improve on this model by using a new version of sparse PCA (sPCA), which results in biologically-plausible receptive fields for both the sPCA and ICA/sparse coding. When applied to natural image patches, our model learns visual features exhibiting the receptive field properties of retinal ganglion cells/lateral geniculate nucleus (LGN) cells, V1 simple cells, V1 complex cells, and V2 cells. Our work provides predictions for experimental neuroscience studies. For example, our result suggests that a previous neurophysiological study improperly discarded some of their recorded neurons; we predict that their discarded neurons capture the shape contour of objects. △ Less

Submitted 17 December, 2014; v1 submitted 20 December, 2013; originally announced December 2013.

Comments: For the ICLR 2014 conference

Showing 1–27 of 27 results for author: Cottrell, G