-
EmoNets: Multimodal deep learning approaches for emotion recognition in video
Authors:
Samira Ebrahimi Kahou,
Xavier Bouthillier,
Pascal Lamblin,
Caglar Gulcehre,
Vincent Michalski,
Kishore Konda,
Sébastien Jean,
Pierre Froumenty,
Yann Dauphin,
Nicolas Boulanger-Lewandowski,
Raul Chandias Ferrari,
Mehdi Mirza,
David Warde-Farley,
Aaron Courville,
Pascal Vincent,
Roland Memisevic,
Christopher Pal,
Yoshua Bengio
Abstract:
The task of the emotion recognition in the wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches which consider combinations of features from multiple modalities for label assignment. In this paper we present our approach to learning several specialist models using deep learning techniques, each focusing on one modality. Among these are a convolutional neural network, focusing on capturing visual information in detected faces, a deep belief net focusing on the representation of the audio stream, a K-Means based "bag-of-mouths" model, which extracts visual features around the mouth region and a relational autoencoder, which addresses spatio-temporal aspects of videos. We explore multiple methods for the combination of cues from these modalities into one common classifier. This achieves a considerably greater accuracy than predictions from our strongest single-modality classifier. Our method was the winning submission in the 2013 EmotiW challenge and achieved a test set accuracy of 47.67% on the 2014 dataset.
Submitted 29 March, 2015; v1 submitted 5 March, 2015;
originally announced March 2015.
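One simple way to combine cues from several single-modality classifiers is a weighted average of their class probabilities. The sketch below illustrates only that idea; the abstract does not say which combination methods were used, and the function names, modality names, emotion labels, weights, and scores are all hypothetical.

```python
# Hypothetical late-fusion sketch. Nothing here is taken from the paper:
# label names, modality names, weights, and scores are illustrative only.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]


def fuse_predictions(modality_probs, weights):
    """Return the weighted average of per-modality class probabilities."""
    total_weight = sum(weights[name] for name in modality_probs)
    fused = [0.0] * len(EMOTIONS)
    for name, probs in modality_probs.items():
        for i, p in enumerate(probs):
            fused[i] += weights[name] * p / total_weight
    return fused


def predict(modality_probs, weights):
    """Pick the emotion with the highest fused probability."""
    fused = fuse_predictions(modality_probs, weights)
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best]


# Two made-up specialist outputs for one clip.
clip_probs = {
    "face_cnn": [0.10, 0.05, 0.05, 0.50, 0.10, 0.10, 0.10],
    "audio": [0.30, 0.10, 0.10, 0.20, 0.10, 0.10, 0.10],
}
clip_weights = {"face_cnn": 2.0, "audio": 1.0}
label = predict(clip_probs, clip_weights)
```

In practice the per-modality weights would be tuned on held-out data rather than fixed by hand as they are here.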
-
Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research
Authors:
Atousa Torabi,
Christopher Pal,
Hugo Larochelle,
Aaron Courville
Abstract:
In this work, we introduce a dataset of video annotated with high quality natural language phrases describing the visual content in a given segment of time. Our dataset is based on the Descriptive Video Service (DVS) that is now encoded on many digital media products such as DVDs. DVS is an audio narration describing the visual elements and actions in a movie for the visually impaired. It is temporally aligned with the movie and mixed with the original movie soundtrack. We describe an automatic DVS segmentation and alignment method for movies that enables us to scale up the collection of a DVS-derived dataset with minimal human intervention. Using this method, we have collected the largest DVS-derived dataset for video description of which we are aware. Our dataset currently includes over 84.6 hours of paired video/sentences from 92 DVDs and is growing.
Submitted 3 March, 2015;
originally announced March 2015.
-
Describing Videos by Exploiting Temporal Structure
Authors:
Li Yao,
Atousa Torabi,
Kyunghyun Cho,
Nicolas Ballas,
Christopher Pal,
Hugo Larochelle,
Aaron Courville
Abstract:
Recent progress in using recurrent neural networks (RNNs) for image description has motivated the exploration of their application for video description. However, while images are static, working with videos requires modeling their dynamic temporal structure and then properly integrating that information into a natural language description. In this context, we propose an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions. First, our approach incorporates a spatio-temporal 3-D convolutional neural network (3-D CNN) representation of the short temporal dynamics. The 3-D CNN representation is trained on video action recognition tasks, so as to produce a representation that is tuned to human motion and behavior. Second, we propose a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN. Our approach exceeds the current state of the art for both BLEU and METEOR metrics on the Youtube2Text dataset. We also present results on a new, larger and more challenging dataset of paired video and natural language descriptions.
Submitted 30 September, 2015; v1 submitted 27 February, 2015;
originally announced February 2015.
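The temporal attention idea in this abstract, weighting temporal segments by their relevance to the current state of the text-generating RNN, can be sketched with a softmax over per-segment scores. This is a minimal illustration under stated assumptions, not the authors' model: the dot-product scoring function, the function names, and the toy feature vectors are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def temporal_attention(segment_features, decoder_state):
    """Soft attention over temporal segments.

    Assumption: relevance is a plain dot product between each segment's
    feature vector and the decoder state. A learned scoring function
    would normally take its place.
    """
    scores = [
        sum(f * h for f, h in zip(features, decoder_state))
        for features in segment_features
    ]
    weights = softmax(scores)
    dim = len(segment_features[0])
    # Context vector: attention-weighted sum of the segment features.
    context = [
        sum(w * features[d] for w, features in zip(weights, segment_features))
        for d in range(dim)
    ]
    return weights, context


# Three toy segments with 2-D features and a toy decoder state.
toy_segments = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
toy_state = [2.0, 0.0]
attn_weights, attn_context = temporal_attention(toy_segments, toy_state)
```

The weights sum to one, so the context vector is a convex combination of segment features that is recomputed each time the decoder state changes.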
-
Design space exploration for image processing architectures on FPGA targets
Authors:
Chandrajit Pal,
Avik Kotal,
Asit Samanta,
Amlan Chakrabarti,
Ranjan Ghosh
Abstract:
With the emergence of embedded applications in image and video processing, communication, and cryptography, techniques that improve pictorial information for human perception, such as deblurring and denoising, are attracting renewed research interest in fields such as satellite imaging, medical imaging, and mobile applications. These developments are driven primarily by advances in semiconductor technology leading to FPGA-based programmable logic devices, which combine the advantages of custom hardware and dedicated DSP resources. In addition, FPGAs provide a powerful reconfiguration feature and hence are an ideal target for rapid prototyping. We have endeavoured to exploit the hardware parallelism of FPGA technology, which leads to higher computational density and throughput, and have observed better performance than can be obtained by merely porting image processing software algorithms to hardware. In this paper, we present an elaborate review, based on our expertise and experience, of the transformations needed to adapt an image processing software algorithm to hardware, including the optimization techniques that make its operation in hardware comparatively faster.
Submitted 15 April, 2014;
originally announced April 2014.
-
A brief experience on journey through hardware developments for image processing and its applications on Cryptography
Authors:
Sangeet Saha,
Chandrajit Pal,
Rourab Paul,
Satyabrata Maity,
Suman Sau
Abstract:
The importance of embedded applications in image and video processing, communication, and cryptography has been growing in current research. Improvement of pictorial information for human perception, such as deblurring and de-noising, in fields such as satellite imaging and medical imaging is a renewed research thrust. Specifically, we elaborate on our experience of the significance of computer vision as one of the domains where hardware-implemented algorithms perform far better than those implemented in software. So far, embedded design engineers have successfully implemented their designs by means of Application Specific Integrated Circuits (ASICs) and/or Digital Signal Processors (DSPs). However, with the advancement of VLSI technology, a very powerful hardware device was developed, namely the Field Programmable Gate Array (FPGA), which combines the key advantages of ASICs and DSPs and can be reprogrammed, making it a very attractive device for rapid prototyping. Communication of image and video data among multiple FPGAs calls for secured transmission, which makes cryptography unavoidable. This paper shows how the ** a hardware-software co-design environment.
Submitted 27 December, 2012;
originally announced December 2012.
-
An Empirical Study and Analysis of the Dynamic Load Balancing Techniques Used in Parallel Computing Systems
Authors:
Ardhendu Mandal,
Subhas Chandra Pal
Abstract:
A parallel computer system is a collection of processing elements that communicate and cooperate to solve large computational problems efficiently. To achieve this, the large computational problem is first partitioned into several tasks with different workloads, which are then assigned to the different processing elements for computation. Distribution of the workload is known as load balancing. An appropriate distribution of workloads across the various processing elements is very important, as disproportionate workloads can eliminate the performance benefit of parallelizing the job. Hence, load balancing on parallel systems is a critical and challenging activity. Load balancing algorithms can be broadly categorized as static or dynamic. Static load balancing algorithms distribute the tasks to processing elements at compile time, while dynamic algorithms bind tasks to processing elements at run time. This paper briefly explains only the dynamic load balancing techniques used in parallel systems and concludes with a comparative performance analysis of these algorithms.
Submitted 8 September, 2011;
originally announced September 2011.
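The static versus dynamic distinction drawn in this abstract can be illustrated with a small simulation. The sketch below is not one of the techniques the paper surveys. It assumes a fixed round-robin schedule as the static policy and a greedy "next task goes to the first idle worker" rule as the dynamic policy, and the function names and task costs are hypothetical.

```python
import heapq


def static_round_robin(task_costs, n_workers):
    """Static policy: task i is bound to worker i % n_workers in advance,
    without looking at how long each task takes. Returns the makespan."""
    loads = [0.0] * n_workers
    for i, cost in enumerate(task_costs):
        loads[i % n_workers] += cost
    return max(loads)


def dynamic_first_idle(task_costs, n_workers):
    """Dynamic policy: at run time, each task goes to whichever worker
    becomes idle first. Returns the makespan."""
    # Heap of (time at which the worker becomes idle, worker id).
    idle_at = [(0.0, worker) for worker in range(n_workers)]
    heapq.heapify(idle_at)
    for cost in task_costs:
        free_time, worker = heapq.heappop(idle_at)
        heapq.heappush(idle_at, (free_time + cost, worker))
    return max(free_time for free_time, _ in idle_at)


# Alternating heavy and light tasks make the fixed schedule lopsided.
costs = [8, 1, 8, 1, 8, 1]
static_makespan = static_round_robin(costs, 2)
dynamic_makespan = dynamic_first_idle(costs, 2)
```

With these costs the fixed schedule piles every heavy task onto one worker, while the run-time rule spreads them out and finishes sooner.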