Search | arXiv e-print repository

Learning Data Association for Multi-Object Tracking using Only Coordinates

Authors: Mehdi Miah, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Abstract: We propose a novel Transformer-based module to address the data association problem for multi-object tracking. From detections obtained by a pretrained detector, this module uses only coordinates from bounding boxes to estimate an affinity score between pairs of tracks extracted from two distinct temporal windows. This module, named TWiX, is trained on sets of tracks with the objective of discrimi… ▽ More We propose a novel Transformer-based module to address the data association problem for multi-object tracking. From detections obtained by a pretrained detector, this module uses only coordinates from bounding boxes to estimate an affinity score between pairs of tracks extracted from two distinct temporal windows. This module, named TWiX, is trained on sets of tracks with the objective of discriminating pairs of tracks coming from the same object from those which are not. Our module does not use the intersection over union measure, nor does it requires any motion priors or any camera motion compensation technique. By inserting TWiX within an online cascade matching pipeline, our tracker C-TWiX achieves state-of-the-art performance on the DanceTrack and KITTIMOT datasets, and gets competitive results on the MOT17 dataset. The code will be made available upon publication. △ Less

Submitted 12 March, 2024; originally announced March 2024.

Comments: Preprint submitted to Pattern Recognition

arXiv:2403.03296 [pdf, other]

CenterDisks: Real-time instance segmentation with disk covering

Authors: Katia Jodogne-Del Litto, Guillaume-Alexandre Bilodeau

Abstract: Increasing the accuracy of instance segmentation methods is often done at the expense of speed. Using coarser representations, we can reduce the number of parameters and thus obtain real-time masks. In this paper, we take inspiration from the set cover problem to predict mask approximations. Given ground-truth binary masks of objects of interest as training input, our method learns to predict the… ▽ More Increasing the accuracy of instance segmentation methods is often done at the expense of speed. Using coarser representations, we can reduce the number of parameters and thus obtain real-time masks. In this paper, we take inspiration from the set cover problem to predict mask approximations. Given ground-truth binary masks of objects of interest as training input, our method learns to predict the approximate coverage of these objects by disks without supervision on their location or radius. Each object is represented by a fixed number of disks with different radii. In the learning phase, we consider the radius as proportional to a standard deviation in order to compute the error to propagate on a set of two-dimensional Gaussian functions rather than disks. We trained and tested our instance segmentation method on challenging datasets showing dense urban settings with various road users. Our method achieve state-of-the art results on the IDD and KITTI dataset with an inference time of 0.040 s on a single RTX 3090 GPU. △ Less

Submitted 5 March, 2024; originally announced March 2024.

arXiv:2402.18503 [pdf, other]

Detection of Micromobility Vehicles in Urban Traffic Videos

Authors: Khalil Sabri, Célia Djilali, Guillaume-Alexandre Bilodeau, Nicolas Saunier, Wassim Bouachir

Abstract: Urban traffic environments present unique challenges for object detection, particularly with the increasing presence of micromobility vehicles like e-scooters and bikes. To address this object detection problem, this work introduces an adapted detection model that combines the accuracy and speed of single-frame object detection with the richer features offered by video object detection frameworks.… ▽ More Urban traffic environments present unique challenges for object detection, particularly with the increasing presence of micromobility vehicles like e-scooters and bikes. To address this object detection problem, this work introduces an adapted detection model that combines the accuracy and speed of single-frame object detection with the richer features offered by video object detection frameworks. This is done by applying aggregated feature maps from consecutive frames processed through motion flow to the YOLOX architecture. This fusion brings a temporal perspective to YOLOX detection abilities, allowing for a better understanding of urban mobility patterns and substantially improving detection reliability. Tested on a custom dataset curated for urban micromobility scenarios, our model showcases substantial improvement over existing state-of-the-art methods, demonstrating the need to consider spatio-temporal information for detecting such small and thin objects. Our approach enhances detection in challenging conditions, including occlusions, ensuring temporal consistency, and effectively mitigating motion blur. △ Less

Submitted 28 February, 2024; originally announced February 2024.

arXiv:2402.10752 [pdf, other]

STF: Spatio-Temporal Fusion Module for Improving Video Object Detection

Authors: Noreen Anwar, Guillaume-Alexandre Bilodeau, Wassim Bouachir

Abstract: Consecutive frames in a video contain redundancy, but they may also contain relevant complementary information for the detection task. The objective of our work is to leverage this complementary information to improve detection. Therefore, we propose a spatio-temporal fusion framework (STF). We first introduce multi-frame and single-frame attention modules that allow a neural network to share feat… ▽ More Consecutive frames in a video contain redundancy, but they may also contain relevant complementary information for the detection task. The objective of our work is to leverage this complementary information to improve detection. Therefore, we propose a spatio-temporal fusion framework (STF). We first introduce multi-frame and single-frame attention modules that allow a neural network to share feature maps between nearby frames to obtain more robust object representations. Second, we introduce a dual-frame fusion module that merges feature maps in a learnable manner to improve them. Our evaluation is conducted on three different benchmarks including video sequences of moving road users. The performed experiments demonstrate that the proposed spatio-temporal fusion module leads to improved detection performance compared to baseline object detectors. Code is available at https://github.com/noreenanwar/STF-module △ Less

Submitted 16 February, 2024; originally announced February 2024.

Comments: 8 pages,3 figures

arXiv:2312.06486 [pdf, other]

STDiff: Spatio-temporal Diffusion for Continuous Stochastic Video Prediction

Authors: Xi Ye, Guillaume-Alexandre Bilodeau

Abstract: Predicting future frames of a video is challenging because it is difficult to learn the uncertainty of the underlying factors influencing their contents. In this paper, we propose a novel video prediction model, which has infinite-dimensional latent variables over the spatio-temporal domain. Specifically, we first decompose the video motion and content information, then take a neural stochastic di… ▽ More Predicting future frames of a video is challenging because it is difficult to learn the uncertainty of the underlying factors influencing their contents. In this paper, we propose a novel video prediction model, which has infinite-dimensional latent variables over the spatio-temporal domain. Specifically, we first decompose the video motion and content information, then take a neural stochastic differential equation to predict the temporal motion information, and finally, an image diffusion model autoregressively generates the video frame by conditioning on the predicted motion feature and the previous frame. The better expressiveness and stronger stochasticity learning capability of our model lead to state-of-the-art video prediction performances. As well, our model is able to achieve temporal continuous prediction, i.e., predicting in an unsupervised way the future video frames with an arbitrarily high frame rate. Our code is available at \url{https://github.com/XiYe20/STDiffProject}. △ Less

Submitted 11 December, 2023; originally announced December 2023.

Journal ref: AAAI2024

arXiv:2311.03177 [pdf, other]

1D-Convolutional transformer for Parkinson disease diagnosis from gait

Authors: Safwen Naimi, Wassim Bouachir, Guillaume-Alexandre Bilodeau

Abstract: This paper presents an efficient deep neural network model for diagnosing Parkinson's disease from gait. More specifically, we introduce a hybrid ConvNet-Transformer architecture to accurately diagnose the disease by detecting the severity stage. The proposed architecture exploits the strengths of both Convolutional Neural Networks and Transformers in a single end-to-end model, where the former is… ▽ More This paper presents an efficient deep neural network model for diagnosing Parkinson's disease from gait. More specifically, we introduce a hybrid ConvNet-Transformer architecture to accurately diagnose the disease by detecting the severity stage. The proposed architecture exploits the strengths of both Convolutional Neural Networks and Transformers in a single end-to-end model, where the former is able to extract relevant local features from Vertical Ground Reaction Force (VGRF) signal, while the latter allows to capture long-term spatio-temporal dependencies in data. In this manner, our hybrid architecture achieves an improved performance compared to using either models individually. Our experimental results show that our approach is effective for detecting the different stages of Parkinson's disease from gait data, with a final accuracy of 88%, outperforming other state-of-the-art AI methods on the Physionet gait dataset. Moreover, our method can be generalized and adapted for other classification problems to jointly address the feature relevance and spatio-temporal dependency problems in 1D signals. Our source code and pre-trained models are publicly available at https://github.com/SafwenNaimi/1D-Convolutional-transformer-for-Parkinson-disease-diagnosis-from-gait. △ Less

Submitted 6 November, 2023; originally announced November 2023.

Comments: 17 pages, 5 Figures, 6 Tables. Accepted for publication in Neural Computing and Applications (NCAA) 2023

arXiv:2310.17080 [pdf, other]

Automating lichen monitoring in ecological studies using instance segmentation of time-lapse images

Authors: Safwen Naimi, Olfa Koubaa, Wassim Bouachir, Guillaume-Alexandre Bilodeau, Gregory Jeddore, Patricia Baines, David Correia, Andre Arsenault

Abstract: Lichens are symbiotic organisms composed of fungi, algae, and/or cyanobacteria that thrive in a variety of environments. They play important roles in carbon and nitrogen cycling, and contribute directly and indirectly to biodiversity. Ecologists typically monitor lichens by using them as indicators to assess air quality and habitat conditions. In particular, epiphytic lichens, which live on trees,… ▽ More Lichens are symbiotic organisms composed of fungi, algae, and/or cyanobacteria that thrive in a variety of environments. They play important roles in carbon and nitrogen cycling, and contribute directly and indirectly to biodiversity. Ecologists typically monitor lichens by using them as indicators to assess air quality and habitat conditions. In particular, epiphytic lichens, which live on trees, are key markers of air quality and environmental health. A new method of monitoring epiphytic lichens involves using time-lapse cameras to gather images of lichen populations. These cameras are used by ecologists in Newfoundland and Labrador to subsequently analyze and manually segment the images to determine lichen thalli condition and change. These methods are time-consuming and susceptible to observer bias. In this work, we aim to automate the monitoring of lichens over extended periods and to estimate their biomass and condition to facilitate the task of ecologists. To accomplish this, our proposed framework uses semantic segmentation with an effective training approach to automate monitoring and biomass estimation of epiphytic lichens on time-lapse images. We show that our method has the potential to significantly improve the accuracy and efficiency of lichen population monitoring, making it a valuable tool for forest ecologists and environmental scientists to evaluate the impact of climate change on Canada's forests. To the best of our knowledge, this is the first time that such an approach has been used to assist ecologists in monitoring and analyzing epiphytic lichens. △ Less

Submitted 25 October, 2023; originally announced October 2023.

Comments: 6 pages, 3 Figures, 8 Tables, Accepted for publication in IEEE International Conference on Machine Learning and Applications (ICMLA), copyright IEEE

arXiv:2310.17078 [pdf, other]

HCT: Hybrid Convnet-Transformer for Parkinson's disease detection and severity prediction from gait

Authors: Safwen Naimi, Wassim Bouachir, Guillaume-Alexandre Bilodeau

Abstract: In this paper, we propose a novel deep learning method based on a new Hybrid ConvNet-Transformer architecture to detect and stage Parkinson's disease (PD) from gait data. We adopt a two-step approach by dividing the problem into two sub-problems. Our Hybrid ConvNet-Transformer model first distinguishes healthy versus parkinsonian patients. If the patient is parkinsonian, a multi-class Hybrid ConvN… ▽ More In this paper, we propose a novel deep learning method based on a new Hybrid ConvNet-Transformer architecture to detect and stage Parkinson's disease (PD) from gait data. We adopt a two-step approach by dividing the problem into two sub-problems. Our Hybrid ConvNet-Transformer model first distinguishes healthy versus parkinsonian patients. If the patient is parkinsonian, a multi-class Hybrid ConvNet-Transformer model determines the Hoehn and Yahr (H&Y) score to assess the PD severity stage. Our hybrid architecture exploits the strengths of both Convolutional Neural Networks (ConvNets) and Transformers to accurately detect PD and determine the severity stage. In particular, we take advantage of ConvNets to capture local patterns and correlations in the data, while we exploit Transformers for handling long-term dependencies in the input signal. We show that our hybrid method achieves superior performance when compared to other state-of-the-art methods, with a PD detection accuracy of 97% and a severity staging accuracy of 87%. Our source code is available at: https://github.com/SafwenNaimi △ Less

Submitted 25 October, 2023; originally announced October 2023.

Comments: 6 pages, 6 figures, 3 tables, Accepted for publication in IEEE International Conference on Machine Learning and Applications (ICMLA), copyright IEEE

arXiv:2305.05490 [pdf, other]

Real-time instance segmentation with polygons using an Intersection-over-Union loss

Authors: Katia Jodogne-Del Litto, Guillaume-Alexandre Bilodeau

Abstract: Predicting a binary mask for an object is more accurate but also more computationally expensive than a bounding box. Polygonal masks as developed in CenterPoly can be a good compromise. In this paper, we improve over CenterPoly by enhancing the classical regression L1 loss with a novel region-based loss and a novel order loss, as well as with a new training process for the vertices prediction head… ▽ More Predicting a binary mask for an object is more accurate but also more computationally expensive than a bounding box. Polygonal masks as developed in CenterPoly can be a good compromise. In this paper, we improve over CenterPoly by enhancing the classical regression L1 loss with a novel region-based loss and a novel order loss, as well as with a new training process for the vertices prediction head. Moreover, the previous methods that predict polygonal masks use different coordinate systems, but it is not clear if one is better than another, if we abstract the architecture requirement. We therefore investigate their impact on the prediction. We also use a new evaluation protocol with oracle predictions for the detection head, to further isolate the segmentation process and better compare the polygonal masks with binary masks. Our instance segmentation method is trained and tested with challenging datasets containing urban scenes, with a high density of road users. Experiments show, in particular, that using a combination of a regression loss and a region-based loss allows significant improvements on the Cityscapes and IDD test set compared to CenterPoly. Moreover the inference stage remains fast enough to reach real-time performance with an average of 0.045 s per frame for 2048$\times$1024 images on a single RTX 2070 GPU. The code is available $\href{https://github.com/KatiaJDL/CenterPoly-v2}{\text{here}}$. △ Less

Submitted 9 May, 2023; originally announced May 2023.

arXiv:2304.11291 [pdf, other]

VisiTherS: Visible-thermal infrared stereo disparity estimation of human silhouette

Authors: Noreen Anwar, Philippe Duplessis-Guindon, Guillaume-Alexandre Bilodeau, Wassim Bouachir

Abstract: This paper presents a novel approach for visible-thermal infrared stereoscopy, focusing on the estimation of disparities of human silhouettes. Visible-thermal infrared stereo poses several challenges, including occlusions and differently textured matching regions in both spectra. Finding matches between two spectra with varying colors, textures, and shapes adds further complexity to the task. To a… ▽ More This paper presents a novel approach for visible-thermal infrared stereoscopy, focusing on the estimation of disparities of human silhouettes. Visible-thermal infrared stereo poses several challenges, including occlusions and differently textured matching regions in both spectra. Finding matches between two spectra with varying colors, textures, and shapes adds further complexity to the task. To address the aforementioned challenges, this paper proposes a novel approach where a high-resolution convolutional neural network is used to better capture relationships between the two spectra. To do so, a modified HRNet backbone is used for feature extraction. This HRNet backbone is capable of capturing fine details and textures as it extracts features at multiple scales, thereby enabling the utilization of both local and global information. For matching visible and thermal infrared regions, our method extracts features on each patch using two modified HRNet streams. Features from the two streams are then combined for predicting the disparities by concatenation and correlation. Results on public datasets demonstrate the effectiveness of the proposed approach by improving the results by approximately 18 percentage points on the $\leq$ 1 pixel error, highlighting its potential for improving accuracy in this task. The code of VisiTherS is available on GitHub at the following link https://github.com/philippeDG/VisiTherS. △ Less

Submitted 21 April, 2023; originally announced April 2023.

Comments: 8 pages,3 Figures,CVPR workshop

arXiv:2304.09012 [pdf, other]

GUILGET: GUI Layout GEneration with Transformer

Authors: Andrey Sobolevsky, Guillaume-Alexandre Bilodeau, **ghui Cheng, ** L. C. Guo

Abstract: Sketching out Graphical User Interface (GUI) layout is part of the pipeline of designing a GUI and a crucial task for the success of a software application. Arranging all components inside a GUI layout manually is a time-consuming task. In order to assist designers, we developed a method named GUILGET to automatically generate GUI layouts from positional constraints represented as GUI arrangement… ▽ More Sketching out Graphical User Interface (GUI) layout is part of the pipeline of designing a GUI and a crucial task for the success of a software application. Arranging all components inside a GUI layout manually is a time-consuming task. In order to assist designers, we developed a method named GUILGET to automatically generate GUI layouts from positional constraints represented as GUI arrangement graphs (GUI-AGs). The goal is to support the initial step of GUI design by producing realistic and diverse GUI layouts. The existing image layout generation techniques often cannot incorporate GUI design constraints. Thus, GUILGET needs to adapt existing techniques to generate GUI layouts that obey to constraints specific to GUI designs. GUILGET is based on transformers in order to capture the semantic in relationships between elements from GUI-AG. Moreover, the model learns constraints through the minimization of losses responsible for placing each component inside its parent layout, for not letting components overlap if they are inside the same parent, and for component alignment. Our experiments, which are conducted on the CLAY dataset, reveal that our model has the best understanding of relationships from GUI-AG and has the best performances in most of evaluation metrics. Therefore, our work contributes to improved GUI layout generation by proposing a novel method that effectively accounts for the constraints on GUI elements and paves the road for a more efficient GUI design pipeline. △ Less

Submitted 18 April, 2023; originally announced April 2023.

Comments: 12 pages, 5 figures, Canadian AI Conference 2023

arXiv:2304.06114 [pdf, ps, other]

TopTrack: Tracking Objects By Their Top

Authors: Jacob Meilleur, Guillaume-Alexandre Bilodeau

Abstract: In recent years, the joint detection-and-tracking paradigm has been a very popular way of tackling the multi-object tracking (MOT) task. Many of the methods following this paradigm use the object center keypoint for detection. However, we argue that the center point is not optimal since it is often not visible in crowded scenarios, which results in many missed detections when the objects are parti… ▽ More In recent years, the joint detection-and-tracking paradigm has been a very popular way of tackling the multi-object tracking (MOT) task. Many of the methods following this paradigm use the object center keypoint for detection. However, we argue that the center point is not optimal since it is often not visible in crowded scenarios, which results in many missed detections when the objects are partially occluded. We propose TopTrack, a joint detection-and-tracking method that uses the top of the object as a keypoint for detection instead of the center because it is more often visible. Furthermore, TopTrack processes consecutive frames in separate streams in order to facilitate training. We performed experiments to show that using the object top as a keypoint for detection can reduce the amount of missed detections, which in turn leads to more complete trajectories and less lost trajectories. TopTrack manages to achieve competitive results with other state-of-the-art trackers on two MOT benchmarks. △ Less

Submitted 12 April, 2023; originally announced April 2023.

Comments: 14 pages, 7 figures, submitted to Machine Vision and Applications

arXiv:2212.06026 [pdf, other]

Video Prediction by Efficient Transformers

Authors: Xi Ye, Guillaume-Alexandre Bilodeau

Abstract: Video prediction is a challenging computer vision task that has a wide range of applications. In this work, we present a new family of Transformer-based models for video prediction. Firstly, an efficient local spatial-temporal separation attention mechanism is proposed to reduce the complexity of standard Transformers. Then, a full autoregressive model, a partial autoregressive model and a non-aut… ▽ More Video prediction is a challenging computer vision task that has a wide range of applications. In this work, we present a new family of Transformer-based models for video prediction. Firstly, an efficient local spatial-temporal separation attention mechanism is proposed to reduce the complexity of standard Transformers. Then, a full autoregressive model, a partial autoregressive model and a non-autoregressive model are developed based on the new efficient Transformer. The partial autoregressive model has a similar performance with the full autoregressive model but a faster inference speed. The non-autoregressive model not only achieves a faster inference speed but also mitigates the quality degradation problem of the autoregressive counterparts, but it requires additional parameters and loss function for learning. Given the same attention mechanism, we conducted a comprehensive study to compare the proposed three video prediction variants. Experiments show that the proposed video prediction models are competitive with more complex state-of-the-art convolutional-LSTM based models. The source code is available at https://github.com/XiYe20/VPTR. △ Less

Submitted 12 December, 2022; originally announced December 2022.

Comments: Accepted by Image and Vision Computing. arXiv admin note: text overlap with arXiv:2203.15836

arXiv:2210.05810 [pdf, other]

A unified model for continuous conditional video prediction

Authors: Xi Ye, Guillaume-Alexandre Bilodeau

Abstract: Different conditional video prediction tasks, like video future frame prediction and video frame interpolation, are normally solved by task-related models even though they share many common underlying characteristics. Furthermore, almost all conditional video prediction models can only achieve discrete prediction. In this paper, we propose a unified model that addresses these two issues at the sam… ▽ More Different conditional video prediction tasks, like video future frame prediction and video frame interpolation, are normally solved by task-related models even though they share many common underlying characteristics. Furthermore, almost all conditional video prediction models can only achieve discrete prediction. In this paper, we propose a unified model that addresses these two issues at the same time. We show that conditional video prediction can be formulated as a neural process, which maps input spatio-temporal coordinates to target pixel values given context spatio-temporal coordinates and context pixel values. Specifically, we feed the implicit neural representation of coordinates and context pixel features into a Transformer-based non-autoregressive conditional video prediction model. Our task-specific models outperform previous work for video future frame prediction and video interpolation on multiple datasets. Importantly, the model is able to interpolate or predict with an arbitrary high frame rate, i.e., continuous prediction. Our source code is available at \url{https://npvp.github.io}. △ Less

Submitted 6 April, 2023; v1 submitted 11 October, 2022; originally announced October 2022.

Comments: Accepted by CVPR2023 Workshop

arXiv:2204.10677 [pdf, other]

Improving tracking with a tracklet associator

Authors: Rémi Nahon, Guillaume-Alexandre Bilodeau, Gilles Pesant

Abstract: Multiple object tracking (MOT) is a task in computer vision that aims to detect the position of various objects in videos and to associate them to a unique identity. We propose an approach based on Constraint Programming (CP) whose goal is to be grafted to any existing tracker in order to improve its object association results. We developed a modular algorithm divided into three independent phases… ▽ More Multiple object tracking (MOT) is a task in computer vision that aims to detect the position of various objects in videos and to associate them to a unique identity. We propose an approach based on Constraint Programming (CP) whose goal is to be grafted to any existing tracker in order to improve its object association results. We developed a modular algorithm divided into three independent phases. The first phase consists in recovering the tracklets provided by a base tracker and to cut them at the places where uncertain associations are spotted, for example, when tracklets overlap, which may cause identity switches. In the second phase, we associate the previously constructed tracklets using a Belief Propagation Constraint Programming algorithm, where we propose various constraints that assign scores to each of the tracklets based on multiple characteristics, such as their dynamics or the distance between them in time and space. Finally, the third phase is a rudimentary interpolation model to fill in the remaining holes in the trajectories we built. Experiments show that our model leads to improvements in the results for all three of the state-of-the-art trackers on which we tested it (3 to 4 points gained on HOTA and IDF1). △ Less

Submitted 22 April, 2022; originally announced April 2022.

Comments: 8 pages, 6 figures, CRV 2022

MSC Class: ACM-class: I.4.8

arXiv:2204.09089 [pdf, other]

4D-MultispectralNet: Multispectral Stereoscopic Disparity Estimation using Human Masks

Authors: Philippe Duplessis-Guindon, Guillaume-Alexandre Bilodeau

Abstract: Multispectral stereoscopy is an emerging field. A lot of work has been done in classical stereoscopy, but multispectral stereoscopy is not studied as frequently. This type of stereoscopy can be used in autonomous vehicles to complete the information given by RGB cameras. It helps to identify objects in the surroundings when the conditions are more difficult, such as in night scenes. This paper foc… ▽ More Multispectral stereoscopy is an emerging field. A lot of work has been done in classical stereoscopy, but multispectral stereoscopy is not studied as frequently. This type of stereoscopy can be used in autonomous vehicles to complete the information given by RGB cameras. It helps to identify objects in the surroundings when the conditions are more difficult, such as in night scenes. This paper focuses on the RGB-LWIR spectrum. RGB-LWIR stereoscopy has the same challenges as classical stereoscopy, that is occlusions, textureless surfaces and repetitive patterns, plus specific ones related to the different modalities. Finding matches between two spectrums adds another layer of complexity. Color, texture and shapes are more likely to vary from a spectrum to another. To address this additional challenge, this paper focuses on estimating the disparity of people present in a scene. Given the fact that people's shape is captured in both RGB and LWIR, we propose a novel method that uses segmentation masks of the human in both spectrum and than concatenate them to the original images before the first layer of a Siamese Network. This method helps to improve the accuracy, particularly within the one pixel error range. △ Less

Submitted 19 April, 2022; originally announced April 2022.

Comments: 6 pages, 2 figures

arXiv:2204.08671 [pdf, other]

ActAR: Actor-Driven Pose Embeddings for Video Action Recognition

Authors: Soufiane Lamghari, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Abstract: Human action recognition (HAR) in videos is one of the core tasks of video understanding. Based on video sequences, the goal is to recognize actions performed by humans. While HAR has received much attention in the visible spectrum, action recognition in infrared videos is little studied. Accurate recognition of human actions in the infrared domain is a highly challenging task because of the redun… ▽ More Human action recognition (HAR) in videos is one of the core tasks of video understanding. Based on video sequences, the goal is to recognize actions performed by humans. While HAR has received much attention in the visible spectrum, action recognition in infrared videos is little studied. Accurate recognition of human actions in the infrared domain is a highly challenging task because of the redundant and indistinguishable texture features present in the sequence. Furthermore, in some cases, challenges arise from the irrelevant information induced by the presence of multiple active persons not contributing to the actual action of interest. Therefore, most existing methods consider a standard paradigm that does not take into account these challenges, which is in some part due to the ambiguous definition of the recognition task in some cases. In this paper, we propose a new method that simultaneously learns to recognize efficiently human actions in the infrared spectrum, while automatically identifying the key-actors performing the action without using any prior knowledge or explicit annotations. Our method is composed of three stages. In the first stage, optical flow-based key-actor identification is performed. Then for each key-actor, we estimate key-poses that will guide the frame selection process. A scale-invariant encoding process along with embedded pose filtering are performed in order to enhance the quality of action representations. Experimental results on InfAR dataset show that our proposed model achieves promising recognition performance and learns useful action representations. △ Less

Submitted 19 April, 2022; originally announced April 2022.

arXiv:2204.00423 [pdf, other]

Transformers for 1D Signals in Parkinson's Disease Detection from Gait

Authors: Duc Minh Dimitri Nguyen, Mehdi Miah, Guillaume-Alexandre Bilodeau, Wassim Bouachir

Abstract: This paper focuses on the detection of Parkinson's disease based on the analysis of a patient's gait. The growing popularity and success of Transformer networks in natural language processing and image recognition motivated us to develop a novel method for this problem based on an automatic features extraction via Transformers. The use of Transformers in 1D signal is not really widespread yet, but… ▽ More This paper focuses on the detection of Parkinson's disease based on the analysis of a patient's gait. The growing popularity and success of Transformer networks in natural language processing and image recognition motivated us to develop a novel method for this problem based on an automatic features extraction via Transformers. The use of Transformers in 1D signal is not really widespread yet, but we show in this paper that they are effective in extracting relevant features from 1D signals. As Transformers require a lot of memory, we decoupled temporal and spatial information to make the model smaller. Our architecture used temporal Transformers, dimension reduction layers to reduce the dimension of the data, a spatial Transformer, two fully connected layers and an output layer for the final prediction. Our model outperforms the current state-of-the-art algorithm with 95.2\% accuracy in distinguishing a Parkinsonian patient from a healthy one on the Physionet dataset. A key learning from this work is that Transformers allow for greater stability in results. The source code and pre-trained models are released in https://github.com/DucMinhDimitriNguyen/Transformers-for-1D-signals-in-Parkinson-s-disease-detection-from-gait.git △ Less

Submitted 1 April, 2022; originally announced April 2022.

Comments: International Conference on Pattern Recognition (ICPR 2022)

arXiv:2203.15836 [pdf, other]

VPTR: Efficient Transformers for Video Prediction

Authors: Xi Ye, Guillaume-Alexandre Bilodeau

Abstract: In this paper, we propose a new Transformer block for video future frames prediction based on an efficient local spatial-temporal separation attention mechanism. Based on this new Transformer block, a fully autoregressive video future frames prediction Transformer is proposed. In addition, a non-autoregressive video prediction Transformer is also proposed to increase the inference speed and reduce… ▽ More In this paper, we propose a new Transformer block for video future frames prediction based on an efficient local spatial-temporal separation attention mechanism. Based on this new Transformer block, a fully autoregressive video future frames prediction Transformer is proposed. In addition, a non-autoregressive video prediction Transformer is also proposed to increase the inference speed and reduce the accumulated inference errors of its autoregressive counterpart. In order to avoid the prediction of very similar future frames, a contrastive feature loss is applied to maximize the mutual information between predicted and ground-truth future frame features. This work is the first that makes a formal comparison of the two types of attention-based video future frames prediction models over different scenarios. The proposed models reach a performance competitive with more complex state-of-the-art models. The source code is available at \emph{https://github.com/XiYe20/VPTR}. △ Less

Submitted 29 March, 2022; originally announced March 2022.

arXiv:2111.03715 [pdf, other]

Leveraging Sentiment Analysis Knowledge to Solve Emotion Detection Tasks

Authors: Maude Nguyen-The, Guillaume-Alexandre Bilodeau, Jan Rockemann

Abstract: Identifying and understanding underlying sentiment or emotions in text is a key component of multiple natural language processing applications. While simple polarity sentiment analysis is a well-studied subject, fewer advances have been made in identifying more complex, finer-grained emotions using only textual data. In this paper, we present a Transformer-based model with a Fusion of Adapter laye… ▽ More Identifying and understanding underlying sentiment or emotions in text is a key component of multiple natural language processing applications. While simple polarity sentiment analysis is a well-studied subject, fewer advances have been made in identifying more complex, finer-grained emotions using only textual data. In this paper, we present a Transformer-based model with a Fusion of Adapter layers which leverages knowledge from more simple sentiment analysis tasks to improve the emotion detection task on large scale dataset, such as CMU-MOSEI, using the textual modality only. Results show that our proposed method is competitive with other approaches. We obtained state-of-the-art results for emotion recognition on CMU-MOSEI even while using only the textual modality. △ Less

Submitted 5 November, 2021; originally announced November 2021.

arXiv:2111.01606 [pdf, other]

PolyTrack: Tracking with Bounding Polygons

Authors: Gaspar Faure, Hughes Perreault, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Abstract: In this paper, we present a novel method called PolyTrack for fast multi-object tracking and segmentation using bounding polygons. Polytrack detects objects by producing heatmaps of their center keypoint. For each of them, a rough segmentation is done by computing a bounding polygon over each instance instead of the traditional bounding box. Tracking is done by taking two consecutive frames as inp… ▽ More In this paper, we present a novel method called PolyTrack for fast multi-object tracking and segmentation using bounding polygons. Polytrack detects objects by producing heatmaps of their center keypoint. For each of them, a rough segmentation is done by computing a bounding polygon over each instance instead of the traditional bounding box. Tracking is done by taking two consecutive frames as input and computing a center offset for each object detected in the first frame to predict its location in the second frame. A Kalman filter is also applied to reduce the number of ID switches. Since our target application is automated driving systems, we apply our method on urban environment videos. We trained and evaluated PolyTrack on the MOTS and KITTIMOTS datasets. Results show that tracking polygons can be a good alternative to bounding box and mask tracking. The code of PolyTrack is available at https://github.com/gafaua/PolyTrack. △ Less

Submitted 2 November, 2021; originally announced November 2021.

Comments: NeurIPS 2021 Machine Learning for Autonomous Driving Workshop

arXiv:2110.11284 [pdf, other]

Multi-Object Tracking and Segmentation with a Space-Time Memory Network

Authors: Mehdi Miah, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Abstract: We propose a method for multi-object tracking and segmentation based on a novel memory-based mechanism to associate tracklets. The proposed tracker, MeNToS, addresses particularly the long-term data association problem, when objects are not observable for long time intervals. Indeed, the recently introduced HOTA metric (High Order Tracking Accuracy), which has a better alignment than the formerly… ▽ More We propose a method for multi-object tracking and segmentation based on a novel memory-based mechanism to associate tracklets. The proposed tracker, MeNToS, addresses particularly the long-term data association problem, when objects are not observable for long time intervals. Indeed, the recently introduced HOTA metric (High Order Tracking Accuracy), which has a better alignment than the formerly established MOTA (Multiple Object Tracking Accuracy) with the human visual assessment of tracking, has shown that improvements are still needed for data association, despite the recent improvement in object detection. In MeNToS, after creating tracklets using instance segmentation and optical flow, the proposed method relies on a space-time memory network originally developed for one-shot video object segmentation to improve the association of sequence of detections (tracklets) with temporal gaps. We evaluate our tracker on KITTIMOTS and MOTSChallenge and we show the benefit of our data association strategy with the HOTA metric. Additional ablation studies demonstrate that our approach using a space-time memory network gives better and more robust long-term association than those based on a re-identification network. Our project page is at \url{www.mehdimiah.com/mentos+}. △ Less

Submitted 15 May, 2023; v1 submitted 21 October, 2021; originally announced October 2021.

Comments: arXiv admin note: text overlap with arXiv:2107.07067 Accepted at CRV 2023 (Conference on Robots and Vision)

arXiv:2109.12414 [pdf, other]

Vehicle Detection and Tracking From Surveillance Cameras in Urban Scenes

Authors: Oumayma Messoussi, Felipe Gohring de Magalhaes, Francois Lamarre, Francis Perreault, Ibrahima Sogoba, Guillaume-Alexandre Bilodeau, Gabriela Nicolescu

Abstract: Detecting and tracking vehicles in urban scenes is a crucial step in many traffic-related applications as it helps to improve road user safety among other benefits. Various challenges remain unresolved in multi-object tracking (MOT) including target information description, long-term occlusions and fast motion. We propose a multi-vehicle detection and tracking system following the tracking-by-dete… ▽ More Detecting and tracking vehicles in urban scenes is a crucial step in many traffic-related applications as it helps to improve road user safety among other benefits. Various challenges remain unresolved in multi-object tracking (MOT) including target information description, long-term occlusions and fast motion. We propose a multi-vehicle detection and tracking system following the tracking-by-detection paradigm that tackles the previously mentioned challenges. Our MOT method extends an Intersection-over-Union (IOU)-based tracker with vehicle re-identification features. This allows us to utilize appearance information to better match objects after long occlusion phases and/or when object location is significantly shifted due to fast motion. We outperform our baseline MOT method on the UA-DETRAC benchmark while maintaining a total processing speed suitable for online use cases. △ Less

Submitted 25 September, 2021; originally announced September 2021.

arXiv:2109.07298 [pdf, other]

FFAVOD: Feature Fusion Architecture for Video Object Detection

Authors: Hughes Perreault, Guillaume-Alexandre Bilodeau, Nicolas Saunier, Maguelonne Héritier

Abstract: A significant amount of redundancy exists between consecutive frames of a video. Object detectors typically produce detections for one image at a time, without any capabilities for taking advantage of this redundancy. Meanwhile, many applications for object detection work with videos, including intelligent transportation systems, advanced driver assistance systems and video surveillance. Our work… ▽ More A significant amount of redundancy exists between consecutive frames of a video. Object detectors typically produce detections for one image at a time, without any capabilities for taking advantage of this redundancy. Meanwhile, many applications for object detection work with videos, including intelligent transportation systems, advanced driver assistance systems and video surveillance. Our work aims at taking advantage of the similarity between video frames to produce better detections. We propose FFAVOD, standing for feature fusion architecture for video object detection. We first introduce a novel video object detection architecture that allows a network to share feature maps between nearby frames. Second, we propose a feature fusion module that learns to merge feature maps to enhance them. We show that using the proposed architecture and the fusion module can improve the performance of three base object detectors on two object detection benchmarks containing sequences of moving road users. Additionally, to further increase performance, we propose an improvement to the SpotNet attention module. Using our architecture on the improved SpotNet detector, we obtain the state-of-the-art performance on the UA-DETRAC public benchmark as well as on the UAVDT dataset. Code is available at https://github.com/hu64/FFAVOD. △ Less

Submitted 15 September, 2021; originally announced September 2021.

Comments: Accepted for publication in Pattern Recognition Letters

arXiv:2108.08923 [pdf, other]

CenterPoly: real-time instance segmentation using bounding polygons

Authors: Hughes Perreault, Guillaume-Alexandre Bilodeau, Nicolas Saunier, Maguelonne Héritier

Abstract: We present a novel method, called CenterPoly, for real-time instance segmentation using bounding polygons. We apply it to detect road users in dense urban environments, making it suitable for applications in intelligent transportation systems like automated vehicles. CenterPoly detects objects by their center keypoint while predicting a fixed number of polygon vertices for each object, thus perfor… ▽ More We present a novel method, called CenterPoly, for real-time instance segmentation using bounding polygons. We apply it to detect road users in dense urban environments, making it suitable for applications in intelligent transportation systems like automated vehicles. CenterPoly detects objects by their center keypoint while predicting a fixed number of polygon vertices for each object, thus performing detection and segmentation in parallel. Most of the network parameters are shared by the network heads, making it fast and lightweight enough to run at real-time speed. To properly convert mask ground-truth to polygon ground-truth, we designed a vertex selection strategy to facilitate the learning of the polygons. Additionally, to better segment overlap** objects in dense urban scenes, we also train a relative depth branch to determine which instances are closer and which are further, using available weak annotations. We propose several models with different backbones to show the possible speed / accuracy trade-offs. The models were trained and evaluated on Cityscapes, KITTI and IDD and the results are reported on their public benchmark, which are state-of-the-art at real-time speeds. Code is available at https://github.com/hu64/CenterPoly △ Less

Submitted 15 September, 2021; v1 submitted 19 August, 2021; originally announced August 2021.

Comments: Accepted to the 2nd Autonomous Vehicle Vision Workshop (AVVision)

arXiv:2107.07067 [pdf, other]

MeNToS: Tracklets Association with a Space-Time Memory Network

Authors: Mehdi Miah, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Abstract: We propose a method for multi-object tracking and segmentation (MOTS) that does not require fine-tuning or per benchmark hyperparameter selection. The proposed method addresses particularly the data association problem. Indeed, the recently introduced HOTA metric, that has a better alignment with the human visual assessment by evenly balancing detections and associations quality, has shown that im… ▽ More We propose a method for multi-object tracking and segmentation (MOTS) that does not require fine-tuning or per benchmark hyperparameter selection. The proposed method addresses particularly the data association problem. Indeed, the recently introduced HOTA metric, that has a better alignment with the human visual assessment by evenly balancing detections and associations quality, has shown that improvements are still needed for data association. After creating tracklets using instance segmentation and optical flow, the proposed method relies on a space-time memory network (STM) developed for one-shot video object segmentation to improve the association of tracklets with temporal gaps. To the best of our knowledge, our method, named MeNToS, is the first to use the STM network to track object masks for MOTS. We took the 4th place in the RobMOTS challenge. The project page is https://mehdimiah.com/mentos.html. △ Less

Submitted 14 July, 2021; originally announced July 2021.

Comments: Presented at the "Robust Video Scene Understanding: Tracking and Video Segmentation" workshop (CVPR-W 2021)

arXiv:2106.06059 [pdf, other]

Predicting Next Local Appearance for Video Anomaly Detection

Authors: Pankaj Raj Roy, Guillaume-Alexandre Bilodeau, Lama Seoud

Abstract: We present a local anomaly detection method in videos. As opposed to most existing methods that are computationally expensive and are not very generalizable across different video scenes, we propose an adversarial framework that learns the temporal local appearance variations by predicting the appearance of a normally behaving object in the next frame of a scene by only relying on its current and… ▽ More We present a local anomaly detection method in videos. As opposed to most existing methods that are computationally expensive and are not very generalizable across different video scenes, we propose an adversarial framework that learns the temporal local appearance variations by predicting the appearance of a normally behaving object in the next frame of a scene by only relying on its current and past appearances. In the presence of an abnormally behaving object, the reconstruction error between the real and the predicted next appearance of that object indicates the likelihood of an anomaly. Our method is competitive with the existing state-of-the-art while being significantly faster for both training and inference and being better at generalizing to unseen video scenes. △ Less

Submitted 10 June, 2021; originally announced June 2021.

Comments: Accepted as an oral presentation for MVA'2021

arXiv:2103.01222 [pdf, other]

Multiple Convolutional Features in Siamese Networks for Object Tracking

Authors: Zhenxi Li, Guillaume-Alexandre Bilodeau, Wassim Bouachir

Abstract: Siamese trackers demonstrated high performance in object tracking due to their balance between accuracy and speed. Unlike classification-based CNNs, deep similarity networks are specifically designed to address the image similarity problem, and thus are inherently more appropriate for the tracking task. However, Siamese trackers mainly use the last convolutional layers for similarity analysis and… ▽ More Siamese trackers demonstrated high performance in object tracking due to their balance between accuracy and speed. Unlike classification-based CNNs, deep similarity networks are specifically designed to address the image similarity problem, and thus are inherently more appropriate for the tracking task. However, Siamese trackers mainly use the last convolutional layers for similarity analysis and target search, which restricts their performance. In this paper, we argue that using a single convolutional layer as feature representation is not an optimal choice in a deep similarity framework. We present a Multiple Features-Siamese Tracker (MFST), a novel tracking algorithm exploiting several hierarchical feature maps for robust tracking. Since convolutional layers provide several abstraction levels in characterizing an object, fusing hierarchical features allows to obtain a richer and more efficient representation of the target. Moreover, we handle the target appearance variations by calibrating the deep features extracted from two different CNN models. Based on this advanced feature representation, our method achieves high tracking accuracy, while outperforming the standard siamese tracker on object tracking benchmarks. The source code and trained models are available at https://github.com/zhenxili96/MFST. △ Less

Submitted 1 March, 2021; originally announced March 2021.

Comments: Accepted for Machine Vision and Applications, 2021. arXiv admin note: substantial text overlap with arXiv:2103.00810

arXiv:2103.00810 [pdf, other]

MFST: Multi-Features Siamese Tracker

Authors: Zhenxi Li, Guillaume-Alexandre Bilodeau, Wassim Bouachir

Abstract: Siamese trackers have recently achieved interesting results due to their balance between accuracy and speed. This success is mainly due to the fact that deep similarity networks were specifically designed to address the image similarity problem. Therefore, they are inherently more appropriate than classical CNNs for the tracking task. However, Siamese trackers rely on the last convolutional layers… ▽ More Siamese trackers have recently achieved interesting results due to their balance between accuracy and speed. This success is mainly due to the fact that deep similarity networks were specifically designed to address the image similarity problem. Therefore, they are inherently more appropriate than classical CNNs for the tracking task. However, Siamese trackers rely on the last convolutional layers for similarity analysis and target search, which restricts their performance. In this paper, we argue that using a single convolutional layer as feature representation is not the optimal choice within the deep similarity framework, as multiple convolutional layers provide several abstraction levels in characterizing an object. Starting from this motivation, we present the Multi-Features Siamese Tracker (MFST), a novel tracking algorithm exploiting several hierarchical feature maps for robust deep similarity tracking. MFST proceeds by fusing hierarchical features to ensure a richer and more efficient representation. Moreover, we handle appearance variation by calibrating deep features extracted from two different CNN models. Based on this advanced feature representation, our algorithm achieves high tracking accuracy, while outperforming several state-of-the-art trackers, including standard Siamese trackers. The code and trained models are available at https://github.com/zhenxili96/MFST. △ Less

Submitted 1 March, 2021; originally announced March 2021.

Comments: ICPR 2021, Oral

arXiv:2011.06722 [pdf, other]

Local Anomaly Detection in Videos using Object-Centric Adversarial Learning

Authors: Pankaj Raj Roy, Guillaume-Alexandre Bilodeau, Lama Seoud

Abstract: We propose a novel unsupervised approach based on a two-stage object-centric adversarial framework that only needs object regions for detecting frame-level local anomalies in videos. The first stage consists in learning the correspondence between the current appearance and past gradient images of objects in scenes deemed normal, allowing us to either generate the past gradient from current appeara… ▽ More We propose a novel unsupervised approach based on a two-stage object-centric adversarial framework that only needs object regions for detecting frame-level local anomalies in videos. The first stage consists in learning the correspondence between the current appearance and past gradient images of objects in scenes deemed normal, allowing us to either generate the past gradient from current appearance or the reverse. The second stage extracts the partial reconstruction errors between real and generated images (appearance and past gradient) with normal object behaviour, and trains a discriminator in an adversarial fashion. In inference mode, we employ the trained image generators with the adversarially learned binary classifier for outputting region-level anomaly detection scores. We tested our method on four public benchmarks, UMN, UCSD, Avenue and ShanghaiTech and our proposed object-centric adversarial approach yields competitive or even superior results compared to state-of-the-art methods. △ Less

Submitted 12 November, 2020; originally announced November 2020.

Comments: Accepted for The First International Workshop on Deep Learning for Human-Centric Activity Understanding (ICPR2020 workshop)

arXiv:2010.08841 [pdf, other]

A Grid-based Representation for Human Action Recognition

Authors: Soufiane Lamghari, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Abstract: Human action recognition (HAR) in videos is a fundamental research topic in computer vision. It consists mainly in understanding actions performed by humans based on a sequence of visual observations. In recent years, HAR have witnessed significant progress, especially with the emergence of deep learning models. However, most of existing approaches for action recognition rely on information that i… ▽ More Human action recognition (HAR) in videos is a fundamental research topic in computer vision. It consists mainly in understanding actions performed by humans based on a sequence of visual observations. In recent years, HAR have witnessed significant progress, especially with the emergence of deep learning models. However, most of existing approaches for action recognition rely on information that is not always relevant for this task, and are limited in the way they fuse the temporal information. In this paper, we propose a novel method for human action recognition that encodes efficiently the most discriminative appearance information of an action with explicit attention on representative pose features, into a new compact grid representation. Our GRAR (Grid-based Representation for Action Recognition) method is tested on several benchmark datasets demonstrating that our model can accurately recognize human actions, despite intra-class appearance variations and occlusion challenges. △ Less

Submitted 29 October, 2020; v1 submitted 17 October, 2020; originally announced October 2020.

Comments: Accepted on 25th International Conference on Pattern Recognition (ICPR 2020)

arXiv:2010.07881 [pdf, other]

An Empirical Analysis of Visual Features for Multiple Object Tracking in Urban Scenes

Authors: Mehdi Miah, Justine Pepin, Nicolas Saunier, Guillaume-Alexandre Bilodeau

Abstract: This paper addresses the problem of selecting appearance features for multiple object tracking (MOT) in urban scenes. Over the years, a large number of features has been used for MOT. However, it is not clear whether some of them are better than others. Commonly used features are color histograms, histograms of oriented gradients, deep features from convolutional neural networks and re-identificat… ▽ More This paper addresses the problem of selecting appearance features for multiple object tracking (MOT) in urban scenes. Over the years, a large number of features has been used for MOT. However, it is not clear whether some of them are better than others. Commonly used features are color histograms, histograms of oriented gradients, deep features from convolutional neural networks and re-identification (ReID) features. In this study, we assess how good these features are at discriminating objects enclosed by a bounding box in urban scene tracking scenarios. Several affinity measures, namely the $\mathrm{L}_1$, $\mathrm{L}_2$ and the Bhattacharyya distances, Rank-1 counts and the cosine similarity, are also assessed for their impact on the discriminative power of the features. Results on several datasets show that features from ReID networks are the best for discriminating instances from one another regardless of the quality of the detector. If a ReID model is not available, color histograms may be selected if the detector has a good recall and there are few occlusions; otherwise, deep features are more robust to detectors with lower recall. The project page is http://www.mehdimiah.com/visual_features. △ Less

Submitted 15 October, 2020; originally announced October 2020.

Comments: Accepted on 25th International Conference on Pattern Recognition (ICPR 2020)

arXiv:2007.15560 [pdf, other]

Unsupervised Disentanglement GAN for Domain Adaptive Person Re-Identification

Authors: Yacine Khraimeche, Guillaume-Alexandre Bilodeau, David Steele, Harshad Mahadik

Abstract: While recent person re-identification (ReID) methods achieve high accuracy in a supervised setting, their generalization to an unlabelled domain is still an open problem. In this paper, we introduce a novel unsupervised disentanglement generative adversarial network (UD-GAN) to address the domain adaptation issue of supervised person ReID. Our framework jointly trains a ReID network for discrimina… ▽ More While recent person re-identification (ReID) methods achieve high accuracy in a supervised setting, their generalization to an unlabelled domain is still an open problem. In this paper, we introduce a novel unsupervised disentanglement generative adversarial network (UD-GAN) to address the domain adaptation issue of supervised person ReID. Our framework jointly trains a ReID network for discriminative features extraction in a source labelled domain using identity annotation, and adapts the ReID model to an unlabelled target domain by learning disentangled latent representations on the domain. Identity-unrelated features in the target domain are distilled from the latent features. As a result, the ReID features better encompass the identity of a person in the unsupervised domain. We conducted experiments on the Market1501, DukeMTMC and MSMT17 datasets. Results show that the unsupervised domain adaptation problem in ReID is very challenging. Nevertheless, our method shows improvement in half of the domain transfers and achieve state-of-the-art performance for one of them. △ Less

Submitted 30 July, 2020; originally announced July 2020.

Comments: 8 pages, 5 figures, submitted to ICPR 2020

arXiv:2005.00088 [pdf, other]

Domain Siamese CNNs for Sparse Multispectral Disparity Estimation

Authors: David-Alexandre Beaupre, Guillaume-Alexandre Bilodeau

Abstract: Multispectral disparity estimation is a difficult task for many reasons: it has all the same challenges as traditional visible-visible disparity estimation (occlusions, repetitive patterns, textureless surfaces), in addition of having very few common visual information between images (e.g. color information vs. thermal information). In this paper, we propose a new CNN architecture able to do dispa… ▽ More Multispectral disparity estimation is a difficult task for many reasons: it has all the same challenges as traditional visible-visible disparity estimation (occlusions, repetitive patterns, textureless surfaces), in addition of having very few common visual information between images (e.g. color information vs. thermal information). In this paper, we propose a new CNN architecture able to do disparity estimation between images from different spectrum, namely thermal and visible in our case. Our proposed model takes two patches as input and proceeds to do domain feature extraction for each of them. Features from both domains are then merged with two fusion operations, namely correlation and concatenation. These merged vectors are then forwarded to their respective classification heads, which are responsible for classifying the inputs as being same or not. Using two merging operations gives more robustness to our feature extraction process, which leads to more precise disparity estimation. Our method was tested using the publicly available LITIV 2014 and LITIV 2018 datasets, and showed best results when compared to other state of the art methods. △ Less

Submitted 30 April, 2020; originally announced May 2020.

arXiv:2003.13644 [pdf, other]

Supervised and Unsupervised Detections for Multiple Object Tracking in Traffic Scenes: A Comparative Study

Authors: Hui-Lee Ooi, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Abstract: In this paper, we propose a multiple object tracker, called MF-Tracker, that integrates multiple classical features (spatial distances and colours) and modern features (detection labels and re-identification features) in its tracking framework. Since our tracker can work with detections coming either from unsupervised and supervised object detectors, we also investigated the impact of supervised a… ▽ More In this paper, we propose a multiple object tracker, called MF-Tracker, that integrates multiple classical features (spatial distances and colours) and modern features (detection labels and re-identification features) in its tracking framework. Since our tracker can work with detections coming either from unsupervised and supervised object detectors, we also investigated the impact of supervised and unsupervised detection inputs in our method and for tracking road users in general. We also compared our results with existing methods that were applied on the UA-Detrac and the UrbanTracker datasets. Results show that our proposed method is performing very well in both datasets with different inputs (MOTA ranging from 0:3491 to 0:5805 for unsupervised inputs on the UrbanTracker dataset and an average MOTA of 0:7638 for supervised inputs on the UA Detrac dataset) under different circumstances. A well-trained supervised object detector can give better results in challenging scenarios. However, in simpler scenarios, if good training data is not available, unsupervised method can perform well and can be a good alternative. △ Less

Submitted 30 March, 2020; originally announced March 2020.

Comments: Accepted for ICIAR 2020

arXiv:2003.10898 [pdf, other]

RN-VID: A Feature Fusion Architecture for Video Object Detection

Authors: Hughes Perreault, Maguelonne Héritier, Pierre Gravel, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Abstract: Consecutive frames in a video are highly redundant. Therefore, to perform the task of video object detection, executing single frame detectors on every frame without reusing any information is quite wasteful. It is with this idea in mind that we propose RN-VID (standing for RetinaNet-VIDeo), a novel approach to video object detection. Our contributions are twofold. First, we propose a new architec… ▽ More Consecutive frames in a video are highly redundant. Therefore, to perform the task of video object detection, executing single frame detectors on every frame without reusing any information is quite wasteful. It is with this idea in mind that we propose RN-VID (standing for RetinaNet-VIDeo), a novel approach to video object detection. Our contributions are twofold. First, we propose a new architecture that allows the usage of information from nearby frames to enhance feature maps. Second, we propose a novel module to merge feature maps of same dimensions using re-ordering of channels and 1 x 1 convolutions. We then demonstrate that RN-VID achieves better mean average precision (mAP) than corresponding single frame detectors with little additional cost during inference. △ Less

Submitted 2 April, 2020; v1 submitted 24 March, 2020; originally announced March 2020.

arXiv:2003.04468 [pdf, other]

Tracking Road Users using Constraint Programming

Authors: Alexandre Pineault, Guillaume-Alexandre Bilodeau, Gilles Pesant

Abstract: In this paper, we aim at improving the tracking of road users in urban scenes. We present a constraint programming (CP) approach for the data association phase found in the tracking-by-detection paradigm of the multiple object tracking (MOT) problem. Such an approach can solve the data association problem more efficiently than graph-based methods and can handle better the combinatorial explosion o… ▽ More In this paper, we aim at improving the tracking of road users in urban scenes. We present a constraint programming (CP) approach for the data association phase found in the tracking-by-detection paradigm of the multiple object tracking (MOT) problem. Such an approach can solve the data association problem more efficiently than graph-based methods and can handle better the combinatorial explosion occurring when multiple frames are analyzed. Because our focus is on the data association problem, our MOT method only uses simple image features, which are the center position and color of detections for each frame. Constraints are defined on these two features and on the general MOT problem. For example, we enforce color appearance preservation over trajectories and constrain the extent of motion between frames. Filtering layers are used in order to eliminate detection candidates before using CP and to remove dummy trajectories produced by the CP solver. Our proposed method was tested on a motorized vehicles tracking dataset and produces results that outperform the top methods of the UA-DETRAC benchmark. △ Less

Submitted 9 March, 2020; originally announced March 2020.

arXiv:2002.05540 [pdf, other]

SpotNet: Self-Attention Multi-Task Network for Object Detection

Authors: Hughes Perreault, Guillaume-Alexandre Bilodeau, Nicolas Saunier, Maguelonne Héritier

Abstract: Humans are very good at directing their visual attention toward relevant areas when they search for different types of objects. For instance, when we search for cars, we will look at the streets, not at the top of buildings. The motivation of this paper is to train a network to do the same via a multi-task learning approach. To train visual attention, we produce foreground/background segmentation… ▽ More Humans are very good at directing their visual attention toward relevant areas when they search for different types of objects. For instance, when we search for cars, we will look at the streets, not at the top of buildings. The motivation of this paper is to train a network to do the same via a multi-task learning approach. To train visual attention, we produce foreground/background segmentation labels in a semi-supervised way, using background subtraction or optical flow. Using these labels, we train an object detection model to produce foreground/background segmentation maps as well as bounding boxes while sharing most model parameters. We use those segmentation maps inside the network as a self-attention mechanism to weight the feature map used to produce the bounding boxes, decreasing the signal of non-relevant areas. We show that by using this method, we obtain a significant mAP improvement on two traffic surveillance datasets, with state-of-the-art results on both UA-DETRAC and UAVDT. △ Less

Submitted 11 June, 2020; v1 submitted 13 February, 2020; originally announced February 2020.

arXiv:1911.13114 [pdf, other]

Color inference from semantic labeling for person search in videos

Authors: Jules Simon, Guillaume-Alexandre Bilodeau, David Steele, Harshad Mahadik

Abstract: We propose an explainable model to generate semantic color labels for person search. In this context, persons are described from their semantic parts, such as hat, shirt, etc. Person search consists in looking for people based on these descriptions. In this work, we aim to improve the accuracy of color labels for people. Our goal is to handle the high variability of human perception. Existing solu… ▽ More We propose an explainable model to generate semantic color labels for person search. In this context, persons are described from their semantic parts, such as hat, shirt, etc. Person search consists in looking for people based on these descriptions. In this work, we aim to improve the accuracy of color labels for people. Our goal is to handle the high variability of human perception. Existing solutions are based on hand-crafted features or learnt features that are not explainable. Moreover most of them only focus on a limited set of colors. We propose a method based on binary search trees and a large peer-labelled color name dataset. This allows us to synthesize the human perception of colors. Using semantic segmentation and our color labeling method, we label segments of pedestrians with their associated colors. We evaluate our solution on person search on datasets such as PCN, and show a precision as high as 80.4%. △ Less

Submitted 6 April, 2020; v1 submitted 29 November, 2019; originally announced November 2019.

Comments: 8 pages, 7 figures ICIAR 2020

arXiv:1910.11509 [pdf, other]

doi 10.1016/j.eswa.2019.113075

Deep 1D-Convnet for accurate Parkinson disease detection and severity prediction from gait

Authors: Imanne El Maachi, Guillaume-Alexandre Bilodeau, Wassim Bouachir

Abstract: Diagnosing Parkinson's disease is a complex task that requires the evaluation of several motor and non-motor symptoms. During diagnosis, gait abnormalities are among the important symptoms that physicians should consider. However, gait evaluation is challenging and relies on the expertise and subjectivity of clinicians. In this context, the use of an intelligent gait analysis algorithm may assist… ▽ More Diagnosing Parkinson's disease is a complex task that requires the evaluation of several motor and non-motor symptoms. During diagnosis, gait abnormalities are among the important symptoms that physicians should consider. However, gait evaluation is challenging and relies on the expertise and subjectivity of clinicians. In this context, the use of an intelligent gait analysis algorithm may assist physicians in order to facilitate the diagnosis process. This paper proposes a novel intelligent Parkinson detection system based on deep learning techniques to analyze gait information. We used 1D convolutional neural network (1D-Convnet) to build a Deep Neural Network (DNN) classifier. The proposed model processes 18 1D-signals coming from foot sensors measuring the vertical ground reaction force (VGRF). The first part of the network consists of 18 parallel 1D-Convnet corresponding to system inputs. The second part is a fully connected network that connects the concatenated outputs of the 1D-Convnets to obtain a final classification. We tested our algorithm in Parkinson's detection and in the prediction of the severity of the disease with the Unified Parkinson's Disease Rating Scale (UPDRS). Our experiments demonstrate the high efficiency of the proposed method in the detection of Parkinson disease based on gait data. The proposed algorithm achieved an accuracy of 98.7 %. To our knowledge, this is the state-of-the-start performance in Parkinson's gait recognition. Furthermore, we achieved an accuracy of 85.3 % in Parkinson's severity prediction. To the best of our knowledge, this is the first algorithm to perform a severity prediction based on the UPDRS. Our results show that the model is able to learn intrinsic characteristics from gait data and to generalize to unseen subjects, which could be helpful in a clinical diagnosis. △ Less

Submitted 16 May, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

Comments: Source code available at https://github.com/imanneelmaachi/Parkinson-disease-detection-and-severity-prediction-from-gait

Journal ref: Expert Systems with Applications, 113075 (2019)

arXiv:1905.06381 [pdf, other]

Tracking in Urban Traffic Scenes from Background Subtraction and Object Detection

Authors: Hui-Lee Ooi, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Abstract: In this paper, we propose to combine detections from background subtraction and from a multiclass object detector for multiple object tracking (MOT) in urban traffic scenes. These objects are associated across frames using spatial, colour and class label information, and trajectory prediction is evaluated to yield the final MOT outputs. The proposed method was tested on the Urban tracker dataset a… ▽ More In this paper, we propose to combine detections from background subtraction and from a multiclass object detector for multiple object tracking (MOT) in urban traffic scenes. These objects are associated across frames using spatial, colour and class label information, and trajectory prediction is evaluated to yield the final MOT outputs. The proposed method was tested on the Urban tracker dataset and shows competitive performances compared to state-of-the-art approaches. Results show that the integration of different detection inputs remains a challenging task that greatly affects the MOT performance. △ Less

Submitted 15 May, 2019; originally announced May 2019.

arXiv:1903.12049 [pdf, other]

Road User Detection in Videos

Authors: Hughes Perreault, Guillaume-Alexandre Bilodeau, Nicolas Saunier, Pierre Gravel

Abstract: Successive frames of a video are highly redundant, and the most popular object detection methods do not take advantage of this fact. Using multiple consecutive frames can improve detection of small objects or difficult examples and can improve speed and detection consistency in a video sequence, for instance by interpolating features between frames. In this work, a novel approach is introduced to… ▽ More Successive frames of a video are highly redundant, and the most popular object detection methods do not take advantage of this fact. Using multiple consecutive frames can improve detection of small objects or difficult examples and can improve speed and detection consistency in a video sequence, for instance by interpolating features between frames. In this work, a novel approach is introduced to perform online video object detection using two consecutive frames of video sequences involving road users. Two new models, RetinaNet-Double and RetinaNet-Flow, are proposed, based respectively on the concatenation of a target frame with a preceding frame, and the concatenation of the optical flow with the target frame. The models are trained and evaluated on three public datasets. Experiments show that using a preceding frame improves performance over single frame detectors, but using explicit optical flow usually does not. △ Less

Submitted 28 March, 2019; originally announced March 2019.

arXiv:1903.11040 [pdf, other]

Adversarially Learned Abnormal Trajectory Classifier

Authors: Pankaj Raj Roy, Guillaume-Alexandre Bilodeau

Abstract: We address the problem of abnormal event detection from trajectory data. In this paper, a new adversarial approach is proposed for building a deep neural network binary classifier, trained in an unsupervised fashion, that can distinguish normal from abnormal trajectory-based events without the need for setting manual detection threshold. Inspired by the generative adversarial network (GAN) framewo… ▽ More We address the problem of abnormal event detection from trajectory data. In this paper, a new adversarial approach is proposed for building a deep neural network binary classifier, trained in an unsupervised fashion, that can distinguish normal from abnormal trajectory-based events without the need for setting manual detection threshold. Inspired by the generative adversarial network (GAN) framework, our GAN version is a discriminative one in which the discriminator is trained to distinguish normal and abnormal trajectory reconstruction errors given by a deep autoencoder. With urban traffic videos and their associated trajectories, our proposed method gives the best accuracy for abnormal trajectory detection. In addition, our model can easily be generalized for abnormal trajectory-based event detection and can still yield the best behavioural detection results as demonstrated on the CAVIAR dataset. △ Less

Submitted 3 April, 2019; v1 submitted 26 March, 2019; originally announced March 2019.

Comments: Accepted for the 16th Conference on Computer and Robot Vision (CRV) 2019

arXiv:1809.02851 [pdf, other]

Online Mutual Foreground Segmentation for Multispectral Stereo Videos

Authors: Pierre-Luc St-Charles, Guillaume-Alexandre Bilodeau, Robert Bergevin

Abstract: The segmentation of video sequences into foreground and background regions is a low-level process commonly used in video content analysis and smart surveillance applications. Using a multispectral camera setup can improve this process by providing more diverse data to help identify objects despite adverse imaging conditions. The registration of several data sources is however not trivial if the ap… ▽ More The segmentation of video sequences into foreground and background regions is a low-level process commonly used in video content analysis and smart surveillance applications. Using a multispectral camera setup can improve this process by providing more diverse data to help identify objects despite adverse imaging conditions. The registration of several data sources is however not trivial if the appearance of objects produced by each sensor differs substantially. This problem is further complicated when parallax effects cannot be ignored when using close-range stereo pairs. In this work, we present a new method to simultaneously tackle multispectral segmentation and stereo registration. Using an iterative procedure, we estimate the labeling result for one problem using the provisional result of the other. Our approach is based on the alternating minimization of two energy functions that are linked through the use of dynamic priors. We rely on the integration of shape and appearance cues to find proper multispectral correspondences, and to properly segment objects in low contrast regions. We also formulate our model as a frame processing pipeline using higher order terms to improve the temporal coherence of our results. Our method is evaluated under different configurations on multiple multispectral datasets, and our implementation is available online. △ Less

Submitted 21 December, 2018; v1 submitted 8 September, 2018; originally announced September 2018.

Comments: Preprint accepted for publication in IJCV (December 2018)

arXiv:1809.02073 [pdf, other]

Multiple Object Tracking in Urban Traffic Scenes with a Multiclass Object Detector

Authors: Hui-Lee Ooi, Guillaume-Alexandre Bilodeau, Nicolas Saunier, David-Alexandre Beaupré

Abstract: Multiple object tracking (MOT) in urban traffic aims to produce the trajectories of the different road users that move across the field of view with different directions and speeds and that can have varying appearances and sizes. Occlusions and interactions among the different objects are expected and common due to the nature of urban road traffic. In this work, a tracking framework employing clas… ▽ More Multiple object tracking (MOT) in urban traffic aims to produce the trajectories of the different road users that move across the field of view with different directions and speeds and that can have varying appearances and sizes. Occlusions and interactions among the different objects are expected and common due to the nature of urban road traffic. In this work, a tracking framework employing classification label information from a deep learning detection approach is used for associating the different objects, in addition to object position and appearances. We want to investigate the performance of a modern multiclass object detector for the MOT task in traffic scenes. Results show that the object labels improve tracking performance, but that the output of object detectors are not always reliable. △ Less

Submitted 6 September, 2018; originally announced September 2018.

Comments: 13th International Symposium on Visual Computing (ISVC)

arXiv:1809.00957 [pdf, other]

Road User Abnormal Trajectory Detection using a Deep Autoencoder

Authors: Pankaj Raj Roy, Guillaume-Alexandre Bilodeau

Abstract: In this paper, we focus on the development of a method that detects abnormal trajectories of road users at traffic intersections. The main difficulty with this is the fact that there are very few abnormal data and the normal ones are insufficient for the training of any kinds of machine learning model. To tackle these problems, we proposed the solution of using a deep autoencoder network trained s… ▽ More In this paper, we focus on the development of a method that detects abnormal trajectories of road users at traffic intersections. The main difficulty with this is the fact that there are very few abnormal data and the normal ones are insufficient for the training of any kinds of machine learning model. To tackle these problems, we proposed the solution of using a deep autoencoder network trained solely through augmented data considered as normal. By generating artificial abnormal trajectories, our method is tested on four different outdoor urban users scenes and performs better compared to some classical outlier detection methods. △ Less

Submitted 25 August, 2018; originally announced September 2018.

Comments: This paper has been accepted for oral presentation at ISVC'18

arXiv:1808.07349 [pdf, other]

Multi-Branch Siamese Networks with Online Selection for Object Tracking

Authors: Zhenxi Li, Guillaume-Alexandre Bilodeau, Wassim Bouachir

Abstract: In this paper, we propose a robust object tracking algorithm based on a branch selection mechanism to choose the most efficient object representations from multi-branch siamese networks. While most deep learning trackers use a single CNN for target representation, the proposed Multi-Branch Siamese Tracker (MBST) employs multiple branches of CNNs pre-trained for different tasks, and used for variou… ▽ More In this paper, we propose a robust object tracking algorithm based on a branch selection mechanism to choose the most efficient object representations from multi-branch siamese networks. While most deep learning trackers use a single CNN for target representation, the proposed Multi-Branch Siamese Tracker (MBST) employs multiple branches of CNNs pre-trained for different tasks, and used for various target representations in our tracking method. With our branch selection mechanism, the appropriate CNN branch is selected depending on the target characteristics in an online manner. By using the most adequate target representation with respect to the tracked object, our method achieves real-time tracking, while obtaining improved performance compared to standard Siamese network trackers on object tracking benchmarks. △ Less

Submitted 31 August, 2018; v1 submitted 22 August, 2018; originally announced August 2018.

Comments: ISVC2018, oral presentation

arXiv:1801.09646 [pdf, other]

Improving Multiple Object Tracking with Optical Flow and Edge Preprocessing

Authors: David-Alexandre Beaupré, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Abstract: In this paper, we present a new method for detecting road users in an urban environment which leads to an improvement in multiple object tracking. Our method takes as an input a foreground image and improves the object detection and segmentation. This new image can be used as an input to trackers that use foreground blobs from background subtraction. The first step is to create foreground images f… ▽ More In this paper, we present a new method for detecting road users in an urban environment which leads to an improvement in multiple object tracking. Our method takes as an input a foreground image and improves the object detection and segmentation. This new image can be used as an input to trackers that use foreground blobs from background subtraction. The first step is to create foreground images for all the frames in an urban video. Then, starting from the original blobs of the foreground image, we merge the blobs that are close to one another and that have similar optical flow. The next step is extracting the edges of the different objects to detect multiple objects that might be very close (and be merged in the same blob) and to adjust the size of the original blobs. At the same time, we use the optical flow to detect occlusion of objects that are moving in opposite directions. Finally, we make a decision on which information we keep in order to construct a new foreground image with blobs that can be used for tracking. The system is validated on four videos of an urban traffic dataset. Our method improves the recall and precision metrics for the object detection task compared to the vanilla background subtraction method and improves the CLEAR MOT metrics in the tracking tasks for most videos. △ Less

Submitted 29 January, 2018; originally announced January 2018.

arXiv:1801.03551 [pdf, other]

From Superpixel to Human Shape Modelling for Carried Object Detection

Authors: Farnoosh Ghadiri, Robert Bergevin, Guillaume-Alexandre Bilodeau

Abstract: Detecting carried objects is one of the requirements for develo** systems to reason about activities involving people and objects. We present an approach to detect carried objects from a single video frame with a novel method that incorporates features from multiple scales. Initially, a foreground mask in a video frame is segmented into multi-scale superpixels. Then the human-like regions in the… ▽ More Detecting carried objects is one of the requirements for develo** systems to reason about activities involving people and objects. We present an approach to detect carried objects from a single video frame with a novel method that incorporates features from multiple scales. Initially, a foreground mask in a video frame is segmented into multi-scale superpixels. Then the human-like regions in the segmented area are identified by matching a set of extracted features from superpixels against learned features in a codebook. A carried object probability map is generated using the complement of the matching probabilities of superpixels to human-like regions and background information. A group of superpixels with high carried object probability and strong edge support is then merged to obtain the shape of the carried object. We applied our method to two challenging datasets, and results show that our method is competitive with or better than the state-of-the-art. △ Less

Submitted 10 January, 2018; originally announced January 2018.

arXiv:1801.01974 [pdf]

doi 10.1109/TIFS.2018.2866295

Domain-Specific Face Synthesis for Video Face Recognition from a Single Sample Per Person

Authors: Fania Mokhayeri, Eric Granger, Guillaume-Alexandre Bilodeau

Abstract: The performance of still-to-video FR systems can decline significantly because faces captured in unconstrained operational domain (OD) over multiple video cameras have a different underlying data distribution compared to faces captured under controlled conditions in the enrollment domain (ED) with a still camera. This is particularly true when individuals are enrolled to the system using a single… ▽ More The performance of still-to-video FR systems can decline significantly because faces captured in unconstrained operational domain (OD) over multiple video cameras have a different underlying data distribution compared to faces captured under controlled conditions in the enrollment domain (ED) with a still camera. This is particularly true when individuals are enrolled to the system using a single reference still. To improve the robustness of these systems, it is possible to augment the reference set by generating synthetic faces based on the original still. However, without knowledge of the OD, many synthetic images must be generated to account for all possible capture conditions. FR systems may, therefore, require complex implementations and yield lower accuracy when training on many less relevant images. This paper introduces an algorithm for domain-specific face synthesis (DSFS) that exploits the representative intra-class variation information available from the OD. Prior to operation, a compact set of faces from unknown persons appearing in the OD is selected through clustering in the captured condition space. The domain-specific variations of these face images are projected onto the reference stills by integrating an image-based face relighting technique inside the 3D reconstruction framework. A compact set of synthetic faces is generated that resemble individuals of interest under the capture conditions relevant to the OD. In a particular implementation based on sparse representation classification, the synthetic faces generated with the DSFS are employed to form a cross-domain dictionary that account for structured sparsity. Experimental results reveal that augmenting the reference gallery set of FR systems using the proposed DSFS approach can provide a higher level of accuracy compared to state-of-the-art approaches, with only a moderate increase in its computational complexity. △ Less

Submitted 1 October, 2018; v1 submitted 6 January, 2018; originally announced January 2018.

Journal ref: Transaction on Information Forensics and Security, Vol. 14, Issue 3, pp. 757-772, 2018

Showing 1–50 of 60 results for author: Bilodeau, G