doi 10.13140/RG.2.2.29455.19362

Attack on Scene Flow using Point Clouds

Authors: Haniyeh Ehsani Oskouie, Mohammad-Shahram Moin, Shohreh Kasaei

Abstract: Deep neural networks have made significant advancements in accurately estimating scene flow using point clouds, which is vital for many applications like video analysis, action recognition, and navigation. The robustness of these techniques, however, remains a concern, particularly in the face of adversarial attacks that have been proven to deceive state-of-the-art deep neural networks in many domains. Surprisingly, the robustness of scene flow networks against such attacks has not been thoroughly investigated. To bridge this gap, the proposed approach introduces adversarial white-box attacks specifically tailored for scene flow networks. Experimental results show that the generated adversarial examples obtain up to 33.7 relative degradation in average end-point error on the KITTI and FlyingThings3D datasets. The study also reveals the significant impact that attacks targeting point clouds in only one dimension or color channel have on average end-point error. Analyzing the success and failure of these attacks on the scene flow networks and their 2D optical flow network variants shows a higher vulnerability for the optical flow networks.

Submitted 17 June, 2024; v1 submitted 21 April, 2024; originally announced April 2024.
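
The abstract above reports attack strength as a relative degradation in average end-point error (EPE). A minimal sketch of how such a figure could be computed follows. The function names and the toy flow vectors are hypothetical, and the definition of relative degradation used here (increase in EPE divided by the clean EPE) is an assumption, since the listing does not state the paper's exact formula or unit.

```python
import math


def average_epe(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth 3D flow vectors."""
    if len(pred_flow) != len(gt_flow) or not pred_flow:
        raise ValueError("flows must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred_flow, gt_flow)) / len(pred_flow)


def relative_degradation(epe_clean, epe_attacked):
    """Assumed definition: increase in EPE relative to the clean EPE."""
    return (epe_attacked - epe_clean) / epe_clean


# Toy data (hypothetical): two points with 3D flow vectors.
gt = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
clean_pred = [(1.0, 0.0, 1.0), (0.0, 2.0, 1.0)]      # each off by 1 unit
attacked_pred = [(1.0, 3.0, 4.0), (0.0, 2.0, 5.0)]   # each off by 5 units

epe_clean = average_epe(clean_pred, gt)
epe_attacked = average_epe(attacked_pred, gt)
degradation = relative_degradation(epe_clean, epe_attacked)
```

With these toy values the clean EPE is 1, the attacked EPE is 5, and the relative degradation under the assumed definition is 4.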

arXiv:2404.11335 [pdf, other]

SoccerNet Game State Reconstruction: End-to-End Athlete Tracking and Identification on a Minimap

Authors: Vladimir Somers, Victor Joos, Anthony Cioppa, Silvio Giancola, Seyed Abolfazl Ghasemzadeh, Floriane Magera, Baptiste Standaert, Amir Mohammad Mansourian, Xin Zhou, Shohreh Kasaei, Bernard Ghanem, Alexandre Alahi, Marc Van Droogenbroeck, Christophe De Vleeschouwer

Abstract: Tracking and identifying athletes on the pitch holds a central role in collecting essential insights from the game, such as estimating the total distance covered by players or understanding team tactics. This tracking and identification process is crucial for reconstructing the game state, defined by the athletes' positions and identities on a 2D top-view of the pitch (i.e., a minimap). However, reconstructing the game state from videos captured by a single camera is challenging. It requires understanding the position of the athletes and the viewpoint of the camera to localize and identify players within the field. In this work, we formalize the task of Game State Reconstruction and introduce SoccerNet-GSR, a novel Game State Reconstruction dataset focusing on football videos. SoccerNet-GSR is composed of 200 video sequences of 30 seconds, annotated with 9.37 million line points for pitch localization and camera calibration, as well as over 2.36 million athlete positions on the pitch with their respective role, team, and jersey number. Furthermore, we introduce GS-HOTA, a novel metric to evaluate game state reconstruction methods. Finally, we propose and release an end-to-end baseline for game state reconstruction, bootstrapping the research on this task. Our experiments show that GSR is a challenging novel task, which opens the field for future research. Our dataset and codebase are publicly available at https://github.com/SoccerNet/sn-gamestate.

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2403.05451 [pdf, other]

Attention-guided Feature Distillation for Semantic Segmentation

Authors: Amir M. Mansourian, Arya Jalali, Rozhan Ahmadi, Shohreh Kasaei

Abstract: In contrast to existing complex methodologies commonly employed for distilling knowledge from a teacher to a student, the proposed method showcases the efficacy of a simple yet powerful method for utilizing refined feature maps to transfer attention. The proposed method has proven to be effective in distilling rich information, outperforming existing methods in semantic segmentation as a dense prediction task. The proposed Attention-guided Feature Distillation (AttnFD) method employs the Convolutional Block Attention Module (CBAM), which refines feature maps by taking into account both channel-specific and spatial information content. By only using the Mean Squared Error (MSE) loss function between the refined feature maps of the teacher and the student, AttnFD demonstrates outstanding performance in semantic segmentation, achieving state-of-the-art results in terms of mean Intersection over Union (mIoU) on the Pascal VOC 2012 and Cityscapes datasets. The code is available at https://github.com/AmirMansurian/AttnFD.

Submitted 8 March, 2024; originally announced March 2024.

Comments: 17 pages, 8 figures, and 3 tables

arXiv:2402.02474 [pdf, other]

Deep Spectral Improvement for Unsupervised Image Instance Segmentation

Authors: Farnoosh Arefi, Amir M. Mansourian, Shohreh Kasaei

Abstract: Deep spectral methods reframe the image decomposition process as a graph partitioning task by extracting features using self-supervised learning and utilizing the Laplacian of the affinity matrix to obtain eigensegments. However, instance segmentation has received less attention compared to other tasks within the context of deep spectral methods. This paper addresses the fact that not all channels of the feature map extracted from a self-supervised backbone contain sufficient information for instance segmentation purposes. In fact, some channels are noisy and hinder the accuracy of the task. To overcome this issue, this paper proposes two channel reduction modules: Noise Channel Reduction (NCR) and Deviation-based Channel Reduction (DCR). The NCR retains channels with lower entropy, as they are less likely to be noisy, while DCR prunes channels with low standard deviation, as they lack sufficient information for effective instance segmentation. Furthermore, the paper demonstrates that the dot product, commonly used in deep spectral methods, is not suitable for instance segmentation due to its sensitivity to feature map values, potentially leading to incorrect instance segments. A new similarity metric called Bray-Curtis over Chebyshev (BoC) is proposed to address this issue. It takes into account the distribution of features in addition to their values, providing a more robust similarity measure for instance segmentation. Quantitative and qualitative results on the Youtube-VIS2019 dataset highlight the improvements achieved by the proposed channel reduction methods and the use of BoC instead of the conventional dot product for creating the affinity matrix. These improvements are observed in terms of mean Intersection over Union and extracted instance segments, demonstrating enhanced instance segmentation performance. The code is available at: https://github.com/farnooshar/SpecUnIIS

Submitted 6 February, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

Comments: 11 pages, 13 figures and 5 tables
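
As a rough illustration of two ingredients named in the abstract above, the sketch below prunes low-variance channels in the spirit of DCR and computes the standard Bray-Curtis and Chebyshev distances. It is not the authors' implementation: the function names, the `min_std` threshold, and the toy data are hypothetical, the entropy-based NCR step is not shown, and the way the paper combines the two distances into BoC is not stated in the listing, so it is deliberately left out.

```python
from statistics import pstdev


def deviation_based_reduction(channels, min_std):
    """Return indices of channels whose standard deviation is at least min_std.

    `channels` is a list of flattened feature-map channels (lists of floats).
    The fixed threshold is an assumption made for illustration.
    """
    return [i for i, values in enumerate(channels) if pstdev(values) >= min_std]


def bray_curtis(u, v):
    """Standard Bray-Curtis dissimilarity: sum|u - v| / sum|u + v|."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(abs(a + b) for a, b in zip(u, v))
    return num / den


def chebyshev(u, v):
    """Standard Chebyshev distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(u, v))


# Toy feature map (hypothetical): three channels, four spatial positions each.
channels = [
    [0.5, 0.5, 0.5, 0.5],      # constant, carries no information
    [0.0, 1.0, 0.0, 1.0],      # high variation
    [0.20, 0.21, 0.20, 0.21],  # nearly constant
]
kept = deviation_based_reduction(channels, min_std=0.1)

bc = bray_curtis([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
ch = chebyshev([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
```

In this toy case only the second channel survives the pruning step, since the other two have a standard deviation well below the chosen threshold.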

arXiv:2401.17828 [pdf, other]

Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation

Authors: Rozhan Ahmadi, Shohreh Kasaei

Abstract: In recent years, weakly supervised semantic segmentation using image-level labels as supervision has received significant attention in the field of computer vision. Most existing methods have addressed the challenges arising from the lack of spatial information in these labels by focusing on facilitating supervised learning through the generation of pseudo-labels from class activation maps (CAMs). Due to the localized pattern detection of CNNs, CAMs often emphasize only the most discriminative parts of an object, making it challenging to accurately distinguish foreground objects from each other and the background. Recent studies have shown that Vision Transformer (ViT) features, due to their global view, are more effective in capturing the scene layout than CNNs. However, the use of hierarchical ViTs has not been extensively explored in this field. This work explores the use of Swin Transformer by proposing "SWTformer" to enhance the accuracy of the initial seed CAMs by bringing local and global views together. SWTformer-V1 generates class probabilities and CAMs using only the patch tokens as features. SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract additional information and utilizes a background-aware mechanism to generate more accurate localization maps with improved cross-object discrimination. Based on experiments on the PascalVOC 2012 dataset, SWTformer-V1 achieves 0.98% higher localization accuracy in mAP, outperforming state-of-the-art models. It also yields comparable performance in generating initial localization maps, on average 0.82% mIoU higher than other methods, while depending only on the classification network. SWTformer-V2 further improves the accuracy of the generated seed CAMs by 5.32% mIoU, further proving the effectiveness of the local-to-global view provided by the Swin transformer. Code available at: https://github.com/RozhanAhmadi/SWTformer

Submitted 11 March, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

Comments: 7 pages, 4 figures, 3 tables

arXiv:2401.09942 [pdf, other]

doi 10.1145/3606038.3616172

Multi-task Learning for Joint Re-identification, Team Affiliation, and Role Classification for Sports Visual Tracking

Authors: Amir M. Mansourian, Vladimir Somers, Christophe De Vleeschouwer, Shohreh Kasaei

Abstract: Effective tracking and re-identification of players is essential for analyzing soccer videos. However, it is a challenging task due to the non-linear motion of players, the similarity in appearance of players from the same team, and frequent occlusions. Therefore, the ability to extract meaningful embeddings to represent players is crucial in developing an effective tracking and re-identification system. In this paper, a multi-purpose part-based person representation method, called PRTreID, is proposed that performs three tasks simultaneously: role classification, team affiliation, and re-identification. In contrast to available literature, a single network is trained with multi-task supervision to solve all three tasks jointly. The proposed joint method is computationally efficient due to the shared backbone. Also, the multi-task learning leads to richer and more discriminative representations, as demonstrated by both quantitative and qualitative results. To demonstrate the effectiveness of PRTreID, it is integrated with a state-of-the-art tracking method, using a part-based post-processing module to handle long-term tracking. The proposed tracking method outperforms all existing tracking methods on the challenging SoccerNet tracking dataset.

Submitted 18 January, 2024; originally announced January 2024.

Journal ref: Proceedings of the 6th International Workshop on Multimedia Content Analysis in Sports (MMSports 2023), October 29, 2023, Ottawa, ON, Canada

arXiv:2401.00833 [pdf, other]

Rethinking RAFT for Efficient Optical Flow

Authors: Navid Eslami, Farnoosh Arefi, Amir M. Mansourian, Shohreh Kasaei

Abstract: Despite significant progress in deep learning-based optical flow methods, accurately estimating large displacements and repetitive patterns remains a challenge. The limitations of local features and similarity search patterns used in these algorithms contribute to this issue. Additionally, some existing methods suffer from slow runtime and excessive graphic memory consumption. To address these problems, this paper proposes a novel approach based on the RAFT framework. The proposed Attention-based Feature Localization (AFL) approach incorporates the attention mechanism to handle global feature extraction and address repetitive patterns. It introduces an operator for matching pixels with corresponding counterparts in the second frame and assigning accurate flow values. Furthermore, an Amorphous Lookup Operator (ALO) is proposed to enhance convergence speed and improve RAFT's ability to handle large displacements by reducing data redundancy in its search operator and expanding the search space for similarity extraction. The proposed method, Efficient RAFT (Ef-RAFT), achieves significant improvements of 10% on the Sintel dataset and 5% on the KITTI dataset over RAFT. Remarkably, these enhancements are attained with a modest 33% reduction in speed and a mere 13% increase in memory usage. The code is available at: https://github.com/n3slami/Ef-RAFT

Submitted 1 January, 2024; originally announced January 2024.

Comments: 7 pages, 5 figures, 4 tables

MSC Class: ACM-class: F.2.2; I.2.7

arXiv:2401.00496 [pdf, other]

SAR-RARP50: Segmentation of surgical instrumentation and Action Recognition on Robot-Assisted Radical Prostatectomy Challenge

Authors: Dimitrios Psychogyios, Emanuele Colleoni, Beatrice Van Amsterdam, Chih-Yang Li, Shu-Yu Huang, Yuchong Li, Fucang Jia, Baosheng Zou, Guotai Wang, Yang Liu, Maxence Boels, Jiayu Huo, Rachel Sparks, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin, Mengya Xu, An Wang, Yanan Wu, Long Bai, Hongliang Ren, Atsushi Yamada, Yuriko Harai, Yuto Ishikawa, Kazuyuki Hayashi , et al. (25 additional authors not shown)

Abstract: Surgical tool segmentation and action recognition are fundamental building blocks in many computer-assisted intervention applications, ranging from surgical skills assessment to decision support systems. Nowadays, learning-based action recognition and segmentation approaches outperform classical methods, relying, however, on large, annotated datasets. Furthermore, action recognition and tool segmentation algorithms are often trained and make predictions in isolation from each other, without exploiting potential cross-task relationships. With the EndoVis 2022 SAR-RARP50 challenge, we release the first multimodal, publicly available, in-vivo dataset for surgical action recognition and semantic instrumentation segmentation, containing 50 suturing video segments of Robotic Assisted Radical Prostatectomy (RARP). The aim of the challenge is twofold. First, to enable researchers to leverage the scale of the provided dataset and develop robust and highly accurate single-task action recognition and tool segmentation approaches in the surgical domain. Second, to further explore the potential of multitask-based learning approaches and determine their comparative advantage against their single-task counterparts. A total of 12 teams participated in the challenge, contributing 7 action recognition methods, 9 instrument segmentation techniques, and 4 multitask approaches that integrated both action recognition and instrument segmentation. The complete SAR-RARP50 dataset is available at: https://rdr.ucl.ac.uk/projects/SARRARP50_Segmentation_of_surgical_instrumentation_and_Action_Recognition_on_Robot-Assisted_Radical_Prostatectomy_Challenge/191091

Submitted 23 January, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

arXiv:2309.15589 [pdf, other]

doi 10.1016/j.media.2024.103155

Domain generalization across tumor types, laboratories, and species -- insights from the 2022 edition of the Mitosis Domain Generalization Challenge

Authors: Marc Aubreville, Nikolas Stathonikos, Taryn A. Donovan, Robert Klopfleisch, Jonathan Ganz, Jonas Ammeling, Frauke Wilm, Mitko Veta, Samir Jabari, Markus Eckstein, Jonas Annuscheit, Christian Krumnow, Engin Bozaba, Sercan Cayir, Hongyan Gu, Xiang 'Anthony' Chen, Mostafa Jahanifar, Adam Shephard, Satoshi Kondo, Satoshi Kasai, Sujatha Kotte, VG Saipradeep, Maxime W. Lafarge, Viktor H. Koelzer, Ziyue Wang , et al. (5 additional authors not shown)

Abstract: Recognition of mitotic figures in histologic tumor specimens is highly relevant to patient outcome assessment. This task is challenging for algorithms and human experts alike, with deterioration of algorithmic performance under shifts in image representations. Considerable covariate shifts occur when assessment is performed on different tumor types, images are acquired using different digitization devices, or specimens are produced in different laboratories. This observation motivated the inception of the 2022 challenge on MItosis Domain Generalization (MIDOG 2022). The challenge provided annotated histologic tumor images from six different domains and evaluated the algorithmic approaches for mitotic figure detection provided by nine challenge participants on ten independent domains. Ground truth for mitotic figure detection was established in two ways: a three-expert consensus and an independent, immunohistochemistry-assisted set of labels. This work represents an overview of the challenge tasks, the algorithmic strategies employed by the participants, and potential factors contributing to their success. With an $F_1$ score of 0.764 for the top-performing team, we summarize that domain generalization across various tumor domains is possible with today's deep learning-based recognition pipelines. However, we also found that domain characteristics not present in the training set (feline as new species, spindle cell shape as new morphology and a new scanner) led to small but significant decreases in performance. When assessed against the immunohistochemistry-assisted reference standard, all methods resulted in reduced recall scores, but with only minor changes in the order of participants in the ranking.

Submitted 31 January, 2024; v1 submitted 27 September, 2023; originally announced September 2023.

Journal ref: Medical Image Analysis Volume 94, May 2024, 103155
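
The abstract above compares detectors by $F_1$ score and recall. For reference, the sketch below shows how these metrics follow from detection counts. The function name and the counts are hypothetical and are not taken from the challenge results.

```python
def detection_scores(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall, written here in count form.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Hypothetical counts: 80 correct detections, 20 spurious detections, 20 missed figures.
precision, recall, f1 = detection_scores(tp=80, fp=20, fn=20)

# A stricter reference standard that exposes more missed figures lowers recall.
_, recall_strict, _ = detection_scores(tp=80, fp=20, fn=40)
```

The second call illustrates the effect described in the abstract: with more reference labels left undetected, recall drops even though the detector's output is unchanged.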

arXiv:2308.04243 [pdf, other]

AICSD: Adaptive Inter-Class Similarity Distillation for Semantic Segmentation

Authors: Amir M. Mansourian, Rozhan Ahmadi, Shohreh Kasaei

Abstract: In recent years, deep neural networks have achieved remarkable accuracy in computer vision tasks. With inference time being a crucial factor, particularly in dense prediction tasks such as semantic segmentation, knowledge distillation has emerged as a successful technique for improving the accuracy of lightweight student networks. The existing methods often neglect the information in channels and among different classes. To overcome these limitations, this paper proposes a novel method called Inter-Class Similarity Distillation (ICSD) for the purpose of knowledge distillation. The proposed method transfers high-order relations from the teacher network to the student network by independently computing intra-class distributions for each class from network outputs. This is followed by calculating inter-class similarity matrices for distillation using KL divergence between distributions of each pair of classes. To further improve the effectiveness of the proposed method, an Adaptive Loss Weighting (ALW) training strategy is proposed. Unlike existing methods, the ALW strategy gradually reduces the influence of the teacher network towards the end of the training process to account for errors in the teacher's predictions. Extensive experiments conducted on two well-known datasets for semantic segmentation, Cityscapes and Pascal VOC 2012, validate the effectiveness of the proposed method in terms of mIoU and pixel accuracy. The proposed method outperforms most existing knowledge distillation methods, as demonstrated by both quantitative and qualitative evaluations. Code is available at: https://github.com/AmirMansurian/AICSD

Submitted 8 August, 2023; originally announced August 2023.

Comments: 10 pages, 5 figures, 5 tables
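
The abstract above describes building inter-class similarity matrices from per-class distributions using KL divergence. One plausible reading is sketched below under stated assumptions: each class channel of the network output is turned into a distribution with a softmax over spatial positions, and the teacher and student matrices are compared with a mean squared error. Both choices, along with all function names and toy logits, are assumptions for illustration and are not confirmed by the listing.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries in q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def inter_class_similarity(logits):
    """Matrix of pairwise KL divergences between per-class spatial distributions.

    `logits[c]` holds the flattened output map for class c (assumed layout).
    """
    dists = [softmax(channel) for channel in logits]
    return [[kl_divergence(p, q) for q in dists] for p in dists]


def matrix_mse(a, b):
    """Mean squared difference between two square matrices (assumed distillation loss)."""
    n = len(a)
    return sum((a[i][j] - b[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


# Toy outputs (hypothetical): two classes, four spatial positions.
teacher_logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, 2.0]]
student_logits = [[1.0, 0.0, 0.0, 0.5], [0.0, 0.5, 1.0, 1.0]]

teacher_matrix = inter_class_similarity(teacher_logits)
student_matrix = inter_class_similarity(student_logits)
loss = matrix_mse(teacher_matrix, student_matrix)
```

The diagonal of each matrix is zero because every class distribution is compared with itself, and the loss vanishes only when the student reproduces the teacher's pairwise class relations.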

arXiv:2305.18033 [pdf]

doi 10.1038/s41597-023-02422-6

The ACROBAT 2022 Challenge: Automatic Registration Of Breast Cancer Tissue

Authors: Philippe Weitz, Masi Valkonen, Leslie Solorzano, Circe Carr, Kimmo Kartasalo, Constance Boissin, Sonja Koivukoski, Aino Kuusela, Dusan Rasic, Yanbo Feng, Sandra Sinius Pouplier, Abhinav Sharma, Kajsa Ledesma Eriksson, Stephanie Robertson, Christian Marzahl, Chandler D. Gatenbee, Alexander R. A. Anderson, Marek Wodzinski, Artur Jurgas, Niccolò Marini, Manfredo Atzori, Henning Müller, Daniel Budelmann, Nick Weiss, Stefan Heldmann , et al. (16 additional authors not shown)

Abstract: The alignment of tissue between histopathological whole-slide-images (WSI) is crucial for research and clinical applications. Advances in computing, deep learning, and availability of large WSI datasets have revolutionised WSI analysis. Therefore, the current state-of-the-art in WSI registration is unclear. To address this, we conducted the ACROBAT challenge, based on the largest WSI registration dataset to date, including 4,212 WSIs from 1,152 breast cancer patients. The challenge objective was to align WSIs of tissue that was stained with routine diagnostic immunohistochemistry to its H&E-stained counterpart. We compare the performance of eight WSI registration algorithms, including an investigation of the impact of different WSI properties and clinical covariates. We find that conceptually distinct WSI registration methods can lead to highly accurate registration performances and identify covariates that impact performances across methods. These results establish the current state-of-the-art in WSI registration and guide researchers in selecting and developing methods.

Submitted 29 May, 2023; originally announced May 2023.

arXiv:2305.07152 [pdf, other]

Surgical tool classification and localization: results and methods from the MICCAI 2022 SurgToolLoc challenge

Authors: Aneeq Zia, Kiran Bhattacharyya, Xi Liu, Max Berniker, Ziheng Wang, Rogerio Nespolo, Satoshi Kondo, Satoshi Kasai, Kousuke Hirasawa, Bo Liu, David Austin, Yiheng Wang, Michal Futrega, Jean-Francois Puget, Zhenqiang Li, Yoichi Sato, Ryo Fujii, Ryo Hachiuma, Mana Masuda, Hideo Saito, An Wang, Mengya Xu, Mobarakol Islam, Long Bai, Winnie Pang , et al. (46 additional authors not shown)

Abstract: The ability to automatically detect and track surgical instruments in endoscopic videos can enable transformational interventions. Assessing surgical performance and efficiency, identifying skilled tool use and choreography, and planning operational and logistical aspects of OR resources are just a few of the applications that could benefit. Unfortunately, obtaining the annotations needed to train machine learning models to identify and localize surgical tools is a difficult task. Annotating bounding boxes frame-by-frame is tedious and time-consuming, yet large amounts of data with a wide variety of surgical tools and surgeries must be captured for robust training. Moreover, ongoing annotator training is needed to stay up to date with surgical instrument innovation. In robotic-assisted surgery, however, potentially informative data like timestamps of instrument installation and removal can be programmatically harvested. The ability to rely on tool installation data alone would significantly reduce the workload to train robust tool-tracking models. With this motivation in mind, we invited the surgical data science community to participate in the challenge, SurgToolLoc 2022. The goal was to leverage tool presence data as weak labels for machine learning models trained to detect tools and localize them in video frames with bounding boxes. We present the results of this challenge along with many of the teams' efforts. We conclude by discussing these results in the broader context of machine learning and surgical data science. The training data used for this challenge, consisting of 24,695 video clips with tool presence labels, is also being released publicly and can be accessed at https://console.cloud.google.com/storage/browser/isi-surgtoolloc-2022.

Submitted 31 May, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

arXiv:2303.06274 [pdf]

CoNIC Challenge: Pushing the Frontiers of Nuclear Detection, Segmentation, Classification and Counting

Authors: Simon Graham, Quoc Dang Vu, Mostafa Jahanifar, Martin Weigert, Uwe Schmidt, Wenhua Zhang, Jun Zhang, Sen Yang, **xi Xiang, Xiyue Wang, Josef Lorenz Rumberger, Elias Baumann, Peter Hirsch, Lihao Liu, Chenyang Hong, Angelica I. Aviles-Rivero, Ayushi Jain, Heeyoung Ahn, Yiyu Hong, Hussam Azzuni, Min Xu, Mohammad Yaqub, Marie-Claire Blache, Benoît Piégu, Bertrand Vernay , et al. (64 additional authors not shown)

Abstract: Nuclear detection, segmentation and morphometric profiling are essential in helping us further understand the relationship between histology and patient outcome. To drive innovation in this area, we set up a community-wide challenge using the largest available dataset of its kind to assess nuclear segmentation and cellular composition. Our challenge, named CoNIC, stimulated the development of reproducible algorithms for cellular recognition with real-time result inspection on public leaderboards. We conducted an extensive post-challenge analysis based on the top-performing models using 1,658 whole-slide images of colon tissue. With around 700 million detected nuclei per model, associated features were used for dysplasia grading and survival analysis, where we demonstrated that the challenge's improvement over the previous state-of-the-art led to significant boosts in downstream performance. Our findings also suggest that eosinophils and neutrophils play an important role in the tumour microenvironment. We release challenge models and WSI-level results to foster the development of further methods for biomarker discovery.

Submitted 14 March, 2023; v1 submitted 10 March, 2023; originally announced March 2023.

arXiv:2302.06294 [pdf, other]

doi 10.1016/j.media.2023.102888

CholecTriplet2022: Show me a tool and tell me the triplet -- an endoscopic vision challenge for surgical action triplet detection

Authors: Chinedu Innocent Nwoye, Tong Yu, Saurav Sharma, Aditya Murali, Deepak Alapatt, Armine Vardazaryan, Kun Yuan, Jonas Hajek, Wolfgang Reiter, Amine Yamlahi, Finn-Henri Smidt, Xiaoyang Zou, Guoyan Zheng, Bruno Oliveira, Helena R. Torres, Satoshi Kondo, Satoshi Kasai, Felix Holm, Ege Özsoy, Shuangchun Gui, Han Li, Sista Raviteja, Rachana Sathish, Pranav Poudel, Binod Bhattarai , et al. (24 additional authors not shown)

Abstract: Formalizing surgical activities as triplets of the used instruments, actions performed, and target anatomies is becoming a gold standard approach for surgical activity modeling. The benefit is that this formalization helps to obtain a more detailed understanding of tool-tissue interaction which can be used to develop better Artificial Intelligence assistance for image-guided surgery. Earlier efforts and the CholecTriplet challenge introduced in 2021 have put together techniques aimed at recognizing these triplets from surgical footage. Also estimating the spatial locations of the triplets would offer more precise intraoperative context-aware decision support for computer-assisted intervention. This paper presents the CholecTriplet2022 challenge, which extends surgical action triplet modeling from recognition to detection. It includes weakly-supervised bounding box localization of every visible surgical instrument (or tool), as the key actors, and the modeling of each tool-activity in the form of <instrument, verb, target> triplet. The paper describes a baseline method and 10 new deep learning algorithms presented at the challenge to solve the task. It also provides thorough methodological comparisons of the methods, an in-depth analysis of the obtained results across multiple metrics, visual and procedural challenges, their significance, and useful insights for future research directions and applications in surgery.

Submitted 14 July, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

Comments: MICCAI EndoVis CholecTriplet2022 challenge report. Published at Elsevier journal of Medical Image Analysis. 25 pages, 15 figures, 8 tables

Journal ref: Medical Image Analysis, Volume 89, 2023, 102888, ISSN 1361-8415

arXiv:2302.01738 [pdf, other]

AIROGS: Artificial Intelligence for RObust Glaucoma Screening Challenge

Authors: Coen de Vente, Koenraad A. Vermeer, Nicolas Jaccard, He Wang, Hongyi Sun, Firas Khader, Daniel Truhn, Temirgali Aimyshev, Yerkebulan Zhanibekuly, Tien-Dung Le, Adrian Galdran, Miguel Ángel González Ballester, Gustavo Carneiro, Devika R G, Hrishikesh P S, Densen Puthussery, Hong Liu, Zekang Yang, Satoshi Kondo, Satoshi Kasai, Edward Wang, Ashritha Durvasula, Jónathan Heras, Miguel Ángel Zapata, Teresa Araújo, et al. (11 additional authors not shown)

Abstract: The early detection of glaucoma is essential in preventing visual impairment. Artificial intelligence (AI) can be used to analyze color fundus photographs (CFPs) in a cost-effective manner, making glaucoma screening more accessible. While AI models for glaucoma screening from CFPs have shown promising results in laboratory settings, their performance decreases significantly in real-world scenarios due to the presence of out-of-distribution and low-quality images. To address this issue, we propose the Artificial Intelligence for Robust Glaucoma Screening (AIROGS) challenge. This challenge includes a large dataset of around 113,000 images from about 60,000 patients and 500 different screening centers, and encourages the development of algorithms that are robust to ungradable and unexpected input data. We evaluated solutions from 14 teams in this paper, and found that the best teams performed similarly to a set of 20 expert ophthalmologists and optometrists. The highest-scoring team achieved an area under the receiver operating characteristic curve of 0.99 (95% CI: 0.98-0.99) for detecting ungradable images on-the-fly. Additionally, many of the algorithms showed robust performance when tested on three other publicly available datasets. These results demonstrate the feasibility of robust AI-enabled glaucoma screening.

Submitted 10 February, 2023; v1 submitted 3 February, 2023; originally announced February 2023.

Comments: 19 pages, 8 figures, 3 tables

arXiv:2301.10575 [pdf, other]

Trainable Loss Weights in Super-Resolution

Authors: Arash Chaichi Mellatshahi, Shohreh Kasaei

Abstract: In recent years, limited research has discussed the loss function in the super-resolution process. The majority of those studies have only used perceptual similarity conventionally, even though the development of an appropriate loss can improve the quality of other methods as well. In this article, a new weighting method for pixel-wise loss is proposed. With the help of this method, it is possible to use trainable weights based on the general structure of the image and its perceptual features while maintaining the advantages of pixel-wise loss. Also, a criterion for comparing weights of loss is introduced so that the weights can be estimated directly by a convolutional neural network. In addition, the expectation-maximization method is used for the simultaneous estimation of the super-resolution network and the weighting network. Furthermore, a new activation function, called "FixedSum", is introduced which can keep the sum of all components of the vector constant while keeping the output components between zero and one. As experimental results show, weighted loss by the proposed method leads to better results than the unweighted loss and weighted loss based on uncertainty in both signal-to-noise and perceptual similarity senses on the state-of-the-art networks. Code is available online.

Submitted 27 November, 2023; v1 submitted 25 January, 2023; originally announced January 2023.

Comments: 9 pages, 6 figures, 2 tables

MSC Class: 68T07 (Primary) 68T45 (Secondary) ACM Class: I.4

arXiv:2210.14164 [pdf, other]

No-Box Attacks on 3D Point Cloud Classification

Authors: Hanieh Naderi, Chinthaka Dinesh, Ivan V. Bajic, Shohreh Kasaei

Abstract: Adversarial attacks pose serious challenges for deep neural network (DNN)-based analysis of various input signals. In the case of 3D point clouds, methods have been developed to identify points that play a key role in network decision, and these become crucial in generating existing adversarial attacks. For example, a saliency map approach is a popular method for identifying adversarial drop points, whose removal would significantly impact the network decision. Generally, methods for identifying adversarial points rely on the access to the DNN model itself to determine which points are critically important for the model's decision. This paper aims to provide a novel viewpoint on this problem, where adversarial points can be predicted without access to the target DNN model, which is referred to as a "no-box" attack. To this end, we define 14 point cloud features and use multiple linear regression to examine whether these features can be used for adversarial point prediction, and which combination of features is best suited for this purpose. Experiments show that a suitable combination of features is able to predict adversarial points of four different networks -- PointNet, PointNet++, DGCNN, and PointConv -- significantly better than a random guess and comparable to white-box attacks. Additionally, we show that no-box attack is transferable to unseen models. The results also provide further insight into DNNs for point cloud classification, by showing which features play key roles in their decision-making process.

Submitted 27 January, 2024; v1 submitted 19 October, 2022; originally announced October 2022.

Comments: 10 pages, 6 figures

arXiv:2208.12635 [pdf]

A Two Step Approach for Whole Slide Image Registration

Authors: Satoshi Kondo, Satoshi Kasai, Kousuke Hirasawa

Abstract: Multi-stain whole-slide-image (WSI) registration is an active field of research. It is unclear, however, how the current WSI registration methods would perform on a real-world data set. AutomatiC Registration Of Breast cAncer Tissue (ACROBAT) challenge is held to verify the performance of the current WSI registration methods by using a new dataset that originates from routine diagnostics to assess real-world applicability. In this report, we present our solution for the ACROBAT challenge. We employ a two-step approach including rigid and non-rigid transforms. The experimental results show that the median 90th percentile is 1,250 um for the validation dataset.

Submitted 24 August, 2022; originally announced August 2022.

arXiv:2208.12041 [pdf]

Multi-Modality Abdominal Multi-Organ Segmentation with Deep Supervised 3D Segmentation Model

Authors: Satoshi Kondo, Satoshi Kasai

Abstract: To promote the development of medical image segmentation technology, AMOS, a large-scale abdominal multi-organ dataset for versatile medical image segmentation, is provided and AMOS 2022 challenge is held by using the dataset. In this report, we present our solution for the AMOS 2022 challenge. We employ residual U-Net with deep supervision as our base model. The experimental results show that the mean scores of Dice similarity coefficient and normalized surface dice are 0.8504 and 0.8476 for CT only task and CT/MRI task, respectively.

Submitted 23 August, 2022; originally announced August 2022.

arXiv:2204.13260 [pdf]

doi 10.1126/sciadv.abq5652

Pattern recognition with neuromorphic computing using magnetic-field induced dynamics of skyrmions

Authors: Tomoyuki Yokouchi, Satoshi Sugimoto, Bivas Rana, Shinichiro Seki, Naoki Ogawa, Yuki Shiomi, Shinya Kasai, Yoshichika Otani

Abstract: Nonlinear phenomena in physical systems can be used for brain-inspired computing with low energy consumption. Response from the dynamics of a topological spin structure called skyrmion is one of the candidates for such neuromorphic computing. However, its ability has not been well explored experimentally. Here, we experimentally demonstrate neuromorphic computing using nonlinear response originating from magnetic-field induced dynamics of skyrmions. We designed a simple-structured skyrmion-based neuromorphic device and succeeded in handwritten digit recognition with the accuracy as large as 94.7 % and waveform recognition. Notably, there exists a positive correlation between the recognition accuracy and the number of skyrmions in the devices. The large degrees of freedom of skyrmion systems, such as the position and the size, give rise to the more complex nonlinear mapping and the larger output dimension, and thus high accuracy. Our results provide a guideline for developing energy-saving and high-performance skyrmion neuromorphic computing devices.

Submitted 27 April, 2022; originally announced April 2022.

arXiv:2202.11944 [pdf, ps, other]

Computer Aided Diagnosis and Out-of-Distribution Detection in Glaucoma Screening Using Color Fundus Photography

Authors: Satoshi Kondo, Satoshi Kasai, Kosuke Hirasawa

Abstract: Artificial Intelligence for RObust Glaucoma Screening (AIROGS) Challenge is held for developing solutions for glaucoma screening from color fundus photography that are robust to real-world scenarios. This report describes our method submitted to the AIROGS challenge. Our method employs convolutional neural networks to classify input images to "referable glaucoma" or "no referable glaucoma". In addition, we introduce an inference-time out-of-distribution (OOD) detection method to identify ungradable images. Our OOD detection is based on an energy-based method combined with activation rectification.

Submitted 24 February, 2022; originally announced February 2022.

arXiv:2202.11804 [pdf, ps, other]

Nuclei panoptic segmentation and composition regression with multi-task deep neural networks

Authors: Satoshi Kondo, Satoshi Kasai

Abstract: Nuclear segmentation, classification and quantification within Haematoxylin & Eosin stained histology images enables the extraction of interpretable cell-based features that can be used in downstream explainable models in computational pathology. The Colon Nuclei Identification and Counting (CoNIC) Challenge is held to help drive forward research and innovation for automatic nuclei recognition in computational pathology. This report describes our proposed method submitted to the CoNIC challenge. Our method employs a multi-task learning framework, which performs a panoptic segmentation task and a regression task. For the panoptic segmentation task, we use encoder-decoder type deep neural networks predicting a direction map in addition to a segmentation map in order to separate neighboring nuclei into different instances.

Submitted 23 February, 2022; originally announced February 2022.

arXiv:2202.11287 [pdf, other]

LPF-Defense: 3D Adversarial Defense based on Frequency Analysis

Authors: Hanieh Naderi, Kimia Noorbakhsh, Arian Etemadi, Shohreh Kasaei

Abstract: Although 3D point cloud classification has recently been widely deployed in different application scenarios, it is still very vulnerable to adversarial attacks. This increases the importance of robust training of 3D models in the face of adversarial attacks. Based on our analysis on the performance of existing adversarial attacks, more adversarial perturbations are found in the mid and high-frequency components of input data. Therefore, by suppressing the high-frequency content in the training phase, the model's robustness against adversarial examples is improved. Experiments showed that the proposed defense method decreases the success rate of six attacks on PointNet, PointNet++, and DGCNN models. In particular, improvements are achieved with an average increase of classification accuracy by 3.8 % on drop100 attack and 4.26 % on drop200 attack compared to the state-of-the-art methods. The method also improves model accuracy on the original dataset compared to other available methods.

Submitted 24 August, 2022; v1 submitted 22 February, 2022; originally announced February 2022.

Comments: 15 pages, 7 figures

arXiv:2202.07537 [pdf, ps, other]

doi 10.1109/TIT.2023.3249636

Information-Theoretic Analysis of Minimax Excess Risk

Authors: Hassan Hafez-Kolahi, Behrad Moniri, Shohreh Kasaei

Abstract: Two main concepts studied in machine learning theory are generalization gap (difference between train and test error) and excess risk (difference between test error and the minimum possible error). While information-theoretic tools have been used extensively to study the generalization gap of learning algorithms, the information-theoretic nature of excess risk has not yet been fully investigated. In this paper, some steps are taken toward this goal. We consider the frequentist problem of minimax excess risk as a zero-sum game between the algorithm designer and the world. Then, we argue that it is desirable to modify this game in a way that the order of play can be swapped. We then prove that, under some regularity conditions, if the world and designer can play randomly the duality gap is zero and the order of play can be changed. In this case, a Bayesian problem surfaces in the dual representation. This makes it possible to utilize recent information-theoretic results on minimum excess risk in Bayesian learning to provide bounds on the minimax excess risk. We demonstrate the applicability of the results by providing information theoretic insight on two important classes of problems: classification when the hypothesis space has finite VC-dimension, and regularized least squares.

Submitted 28 February, 2023; v1 submitted 15 February, 2022; originally announced February 2022.

Comments: Published in the IEEE Transactions on Information Theory

arXiv:2110.03745 [pdf, other]

Adversarial Attack by Limited Point Cloud Surface Modifications

Authors: Atrin Arya, Hanieh Naderi, Shohreh Kasaei

Abstract: Recent research has revealed that the security of deep neural networks that directly process 3D point clouds to classify objects can be threatened by adversarial samples. Although existing adversarial attack methods achieve high success rates, they do not restrict the point modifications enough to preserve the point cloud appearance. To overcome this shortcoming, two constraints are proposed. These include applying hard boundary constraints on the number of modified points and on the point perturbation norms. Due to the restrictive nature of the problem, the search space contains many local maxima. The proposed method addresses this issue by using a high step-size at the beginning of the algorithm to search the main surface of the point cloud fast and effectively. Then, in order to converge to the desired output, the step-size is gradually decreased. To evaluate the performance of the proposed method, it is run on the ModelNet40 and ScanObjectNN datasets by employing the state-of-the-art point cloud classification models; including PointNet, PointNet++, and DGCNN. The obtained results show that it can perform successful attacks and achieve state-of-the-art results by only a limited number of point modifications while preserving the appearance of the point cloud. Moreover, due to the effective search algorithm, it can perform successful attacks in just a few steps. Additionally, the proposed step-size scheduling algorithm shows an improvement of up to $14.5\%$ when adopted by other methods as well. The proposed method also performs effectively against popular defense methods.

Submitted 7 October, 2021; originally announced October 2021.

arXiv:2107.03463 [pdf, other]

CHASE: Robust Visual Tracking via Cell-Level Differentiable Neural Architecture Search

Authors: Seyed Mojtaba Marvasti-Zadeh, Javad Khaghani, Li Cheng, Hossein Ghanei-Yakhdan, Shohreh Kasaei

Abstract: A strong visual object tracker nowadays relies on its well-crafted modules, which typically consist of manually-designed network architectures to deliver high-quality tracking results. Not surprisingly, the manual design process becomes a particularly challenging barrier, as it demands sufficient prior experience, enormous effort, intuition, and perhaps some good luck. Meanwhile, neural architecture search has been gaining ground in practical applications as a promising method in tackling the issue of automated search of feasible network structures. In this work, we propose a novel cell-level differentiable architecture search mechanism with early stopping to automate the network design of the tracking module, aiming to adapt backbone features to the objective of Siamese tracking networks during offline training. Besides, the proposed early stopping strategy avoids over-fitting and performance collapse problems leading to generalization improvement. The proposed approach is simple, efficient, and with no need to stack a series of modules to construct a network. Our approach is easy to incorporate into existing trackers, which is empirically validated using different differentiable architecture search-based methods and tracking objectives. Extensive experimental evaluations demonstrate the superior performance of our approach over five commonly-used benchmarks.

Submitted 26 October, 2021; v1 submitted 2 July, 2021; originally announced July 2021.

Comments: The first two authors contributed equally to this work. Accepted manuscript in BMVC 2021

arXiv:2105.04180 [pdf, other]

Rate-Distortion Analysis of Minimum Excess Risk in Bayesian Learning

Authors: Hassan Hafez-Kolahi, Behrad Moniri, Shohreh Kasaei, Mahdieh Soleymani Baghshah

Abstract: In parametric Bayesian learning, a prior is assumed on the parameter $W$ which determines the distribution of samples. In this setting, Minimum Excess Risk (MER) is defined as the difference between the minimum expected loss achievable when learning from data and the minimum expected loss that could be achieved if $W$ was observed. In this paper, we build upon and extend the recent results of (Xu & Raginsky, 2020) to analyze the MER in Bayesian learning and derive information-theoretic bounds on it. We formulate the problem as a (constrained) rate-distortion optimization and show how the solution can be bounded above and below by two other rate-distortion functions that are easier to study. The lower bound represents the minimum possible excess risk achievable by any process using $R$ bits of information from the parameter $W$. For the upper bound, the optimization is further constrained to use $R$ bits from the training set, a setting which relates MER to information-theoretic bounds on the generalization gap in frequentist learning. We derive information-theoretic bounds on the difference between these upper and lower bounds and show that they can provide order-wise tight rates for MER under certain conditions. This analysis gives more insight into the information-theoretic nature of Bayesian learning as well as providing novel bounds.

Submitted 17 July, 2021; v1 submitted 10 May, 2021; originally announced May 2021.

Comments: Accepted at ICML 2021

arXiv:2103.07640 [pdf, other]

Generating Unrestricted Adversarial Examples via Three Parameters

Authors: Hanieh Naderi, Leili Goli, Shohreh Kasaei

Abstract: Deep neural networks have been shown to be vulnerable to adversarial examples deliberately constructed to misclassify victim models. As most adversarial examples have restricted their perturbations to $L_{p}$-norm, existing defense methods have focused on these types of perturbations and less attention has been paid to unrestricted adversarial examples; which can create more realistic attacks, able to deceive models without affecting human predictions. To address this problem, the proposed adversarial attack generates an unrestricted adversarial example with a limited number of parameters. The attack selects three points on the input image and based on their locations transforms the image into an adversarial example. By limiting the range of movement and location of these three points and using a discriminatory network, the proposed unrestricted adversarial example preserves the image appearance. Experimental results show that the proposed adversarial examples obtain an average success rate of 93.5% in terms of human evaluation on the MNIST and SVHN datasets. It also reduces the model accuracy by an average of 73% on six datasets MNIST, FMNIST, SVHN, CIFAR10, CIFAR100, and ImageNet. It should be noted that, in the case of attacks, lower accuracy in the victim model denotes a more successful attack. The adversarial train of the attack also improves model robustness against a randomly transformed image.

Submitted 13 March, 2021; originally announced March 2021.

arXiv:2010.04516 [pdf, other]

Be Your Own Best Competitor! Multi-Branched Adversarial Knowledge Transfer

Authors: Mahdi Ghorbani, Fahimeh Fooladgar, Shohreh Kasaei

Abstract: Deep neural network architectures have attained remarkable improvements in scene understanding tasks. Utilizing an efficient model is one of the most important constraints for limited-resource devices. Recently, several compression methods have been proposed to diminish the heavy computational burden and memory consumption. Among them, the pruning and quantizing methods exhibit a critical drop in performance by compressing the model parameters, while the knowledge distillation methods improve the performance of compact models by focusing on training lightweight networks with the supervision of cumbersome networks. In the proposed method, the knowledge distillation has been performed within the network by constructing multiple branches over the primary stream of the model, known as the self-distillation method. Therefore, the ensemble of sub-neural network models has been proposed to transfer the knowledge among themselves with the knowledge distillation policies as well as an adversarial learning strategy. Hence, the proposed ensemble of sub-models is trained against a discriminator model adversarially. Besides, their knowledge is transferred within the ensemble by four different loss functions. The proposed method has been devoted to both lightweight image classification and encoder-decoder architectures to boost the performance of small and compact models without incurring extra computational overhead at the inference process. Extensive experimental results on the main challenging datasets show that the proposed network outperforms the primary model in terms of accuracy at the same number of parameters and computational cost. The obtained results show that the proposed model has achieved significant improvement over earlier ideas of self-distillation methods. The effectiveness of the proposed models has also been illustrated in the encoder-decoder model.

Submitted 9 October, 2020; originally announced October 2020.

Comments: 11 pages, 4 figures

arXiv:2009.09235 [pdf, other]

Open-Ended Fine-Grained 3D Object Categorization by Combining Shape and Texture Features in Multiple Colorspaces

Authors: Nils Keunecke, S. Hamidreza Kasaei

Abstract: As a consequence of an ever-increasing number of service robots, there is a growing demand for highly accurate real-time 3D object recognition. Considering the expansion of robot applications in more complex and dynamic environments, it is evident that it is not possible to pre-program all object categories and anticipate all exceptions in advance. Therefore, robots should have the functionality to learn about new object categories in an open-ended fashion while working in the environment. Towards this goal, we propose a deep transfer learning approach to generate a scale- and pose-invariant object representation by considering shape and texture information in multiple colorspaces. The obtained global object representation is then fed to an instance-based object category learning and recognition, where a non-expert human user exists in the learning loop and can interactively guide the process of experience acquisition by teaching new object categories, or by correcting insufficient or erroneous categories. In this work, shape information encodes the common patterns of all categories, while texture information is used to describe the appearance of each instance in detail. Multiple color space combinations and network architectures are evaluated to find the most descriptive system. Experimental results showed that the proposed network architecture out-performed the selected state-of-the-art approaches in terms of object classification accuracy and scalability. Furthermore, we performed a real robot experiment in the context of serve-a-beer scenario to show the real-time performance of the proposed approach.

Submitted 28 May, 2021; v1 submitted 19 September, 2020; originally announced September 2020.

arXiv:2008.13015 [pdf, other]

Adaptive Exploitation of Pre-trained Deep Convolutional Neural Networks for Robust Visual Tracking

Authors: Seyed Mojtaba Marvasti-Zadeh, Hossein Ghanei-Yakhdan, Shohreh Kasaei

Abstract: Due to the automatic feature extraction procedure via multi-layer nonlinear transformations, the deep learning-based visual trackers have recently achieved great success in challenging scenarios for visual tracking purposes. Although many of those trackers utilize the feature maps from pre-trained convolutional neural networks (CNNs), the effects of selecting different models and exploiting various combinations of their feature maps are still not compared completely. To the best of our knowledge, all those methods use a fixed number of convolutional feature maps without considering the scene attributes (e.g., occlusion, deformation, and fast motion) that might occur during tracking. As a pre-requisition, this paper proposes adaptive discriminative correlation filters (DCF) based on the methods that can exploit CNN models with different topologies. First, the paper provides a comprehensive analysis of four commonly used CNN models to determine the best feature maps of each model. Second, with the aid of analysis results as attribute dictionaries, adaptive exploitation of deep features is proposed to improve the accuracy and robustness of visual trackers regarding video characteristics. Third, the generalization of the proposed method is validated on various tracking datasets as well as CNN models with similar architectures. Finally, extensive experimental results demonstrate the effectiveness of the proposed adaptive method compared with state-of-the-art visual tracking methods.

Submitted 22 December, 2020; v1 submitted 29 August, 2020; originally announced August 2020.

Comments: Accepted Manuscript in Multimedia Tools and Applications (MTAP), Springer

arXiv:2007.06866 [pdf, other]

Alleviating Over-segmentation Errors by Detecting Action Boundaries

Authors: Yuchi Ishikawa, Seito Kasai, Yoshimitsu Aoki, Hirokatsu Kataoka

Abstract: We propose an effective framework for the temporal action segmentation task, namely an Action Segment Refinement Framework (ASRF). Our model architecture consists of a long-term feature extractor and two branches: the Action Segmentation Branch (ASB) and the Boundary Regression Branch (BRB). The long-term feature extractor provides shared features for the two branches with a wide temporal receptive field. The ASB classifies video frames with action classes, while the BRB regresses the action boundary probabilities. The action boundaries predicted by the BRB refine the output from the ASB, which results in a significant performance improvement. Our contributions are three-fold: (i) We propose a framework for temporal action segmentation, the ASRF, which divides temporal action segmentation into frame-wise action classification and action boundary regression. Our framework refines frame-level hypotheses of action classes using predicted action boundaries. (ii) We propose a loss function for smoothing the transition of action probabilities, and analyze combinations of various loss functions for temporal action segmentation. (iii) Our framework outperforms state-of-the-art methods on three challenging datasets, offering an improvement of up to 13.7% in terms of segmental edit distance and up to 16.1% in terms of segmental F1 score. Our code will be publicly available soon.

Submitted 14 July, 2020; originally announced July 2020.

Comments: under review

arXiv:2006.02597 [pdf, other]

COMET: Context-Aware IoU-Guided Network for Small Object Tracking

Authors: Seyed Mojtaba Marvasti-Zadeh, Javad Khaghani, Hossein Ghanei-Yakhdan, Shohreh Kasaei, Li Cheng

Abstract: We consider the problem of tracking an unknown small target from aerial videos of medium to high altitudes. This is a challenging problem, which is even more pronounced in unavoidable scenarios of drastic camera motion and high density. To address this problem, we introduce a context-aware IoU-guided tracker (COMET) that exploits a multitask two-stream network and an offline reference proposal generation strategy. The proposed network fully exploits target-related information by multi-scale feature learning and attention modules. The proposed strategy introduces an efficient sampling strategy to generalize the network on the target and its parts without imposing extra computational complexity during online tracking. These strategies contribute considerably to handling significant occlusions and viewpoint changes. Empirically, COMET outperforms state-of-the-art methods on a range of aerial-view datasets that focus on tracking small objects. Specifically, COMET outperforms the celebrated ATOM tracker by an average margin of 6.2% (and 7%) in precision (and success) score on challenging benchmarks of UAVDT, VisDrone-2019, and Small-90.

Submitted 18 September, 2020; v1 submitted 3 June, 2020; originally announced June 2020.

Comments: Accepted manuscript in ACCV 2020

arXiv:2005.09183 [pdf, other]

Retrieving and Highlighting Action with Spatiotemporal Reference

Authors: Seito Kasai, Yuchi Ishikawa, Masaki Hayashi, Yoshimitsu Aoki, Kensho Hara, Hirokatsu Kataoka

Abstract: In this paper, we present a framework that jointly retrieves and spatiotemporally highlights actions in videos by enhancing current deep cross-modal retrieval methods. Our work takes on the novel task of action highlighting, which visualizes where and when actions occur in an untrimmed video setting. Action highlighting is a fine-grained task, compared to conventional action recognition tasks which focus on classification or window-based localization. Leveraging weak supervision from annotated captions, our framework acquires spatiotemporal relevance maps and generates local embeddings which relate to the nouns and verbs in captions. Through experiments, we show that our model generates various maps conditioned on different actions, whereas conventional visual reasoning methods only go as far as to show a single deterministic saliency map. Also, our model improves retrieval recall over our baseline without alignment by 2-3% on the MSR-VTT dataset.

Submitted 18 May, 2020; originally announced May 2020.

Comments: Accepted to ICIP 2020

arXiv:2004.02933 [pdf, other]

Efficient Scale Estimation Methods using Lightweight Deep Convolutional Neural Networks for Visual Tracking

Authors: Seyed Mojtaba Marvasti-Zadeh, Hossein Ghanei-Yakhdan, Shohreh Kasaei

Abstract: In recent years, visual tracking methods that are based on discriminative correlation filters (DCF) have been very promising. However, most of these methods suffer from a lack of robust scale estimation skills. Although a wide range of recent DCF-based methods exploit the features that are extracted from deep convolutional neural networks (CNNs) in their translation model, the scale of the visual target is still estimated by hand-crafted features. Whereas the exploitation of CNNs imposes a high computational burden, this paper exploits pre-trained lightweight CNN models to propose two efficient scale estimation methods, which not only improve the visual tracking performance but also provide acceptable tracking speeds. The proposed methods are formulated based on either holistic or region representation of convolutional feature maps to efficiently integrate into DCF formulations to learn a robust scale model in the frequency domain. Moreover, in contrast to the conventional scale estimation methods with iterative feature extraction of different target regions, the proposed methods exploit one-pass feature extraction processes that significantly improve the computational efficiency. Comprehensive experimental results on the OTB-50, OTB-100, TC-128 and VOT-2018 visual tracking datasets demonstrate that the proposed visual tracking methods outperform the state-of-the-art methods.

Submitted 11 December, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

Comments: Accepted Manuscript in Neural Computing and Applications (NCAA), Springer

arXiv:2004.02932 [pdf, other]

Beyond Background-Aware Correlation Filters: Adaptive Context Modeling by Hand-Crafted and Deep RGB Features for Visual Tracking

Authors: Seyed Mojtaba Marvasti-Zadeh, Hossein Ghanei-Yakhdan, Shohreh Kasaei

Abstract: In recent years, the background-aware correlation filters have attracted a lot of research interest in visual target tracking. However, these methods cannot suitably model the target appearance due to the exploitation of hand-crafted features. On the other hand, the recent deep learning-based visual tracking methods have provided a competitive performance along with extensive computations. In this paper, an adaptive background-aware correlation filter-based tracker is proposed that effectively models the target appearance by using either the histogram of oriented gradients (HOG) or convolutional neural network (CNN) feature maps. The proposed method exploits the fast 2D non-maximum suppression (NMS) algorithm and the semantic information comparison to detect challenging situations. When the HOG-based response map is not reliable, or the context region has a low semantic similarity with prior regions, the proposed method constructs the CNN context model to improve the target region estimation. Furthermore, the rejection option allows the proposed method to update the CNN context model only on valid regions. Comprehensive experimental results demonstrate that the proposed adaptive method clearly outperforms the state-of-the-art methods in accuracy and robustness of visual target tracking on the OTB-50, OTB-100, TC-128, UAV-123, and VOT-2015 datasets.

Submitted 29 September, 2021; v1 submitted 6 April, 2020; originally announced April 2020.

Comments: To appear in Multimedia Tools and Applications, Springer, 2021

arXiv:2004.01382 [pdf, other]

Effective Fusion of Deep Multitasking Representations for Robust Visual Tracking

Authors: Seyed Mojtaba Marvasti-Zadeh, Hossein Ghanei-Yakhdan, Shohreh Kasaei, Kamal Nasrollahi, Thomas B. Moeslund

Abstract: Visual object tracking remains an active research field in computer vision due to persisting challenges with various problem-specific factors in real-world scenes. Many existing tracking methods based on discriminative correlation filters (DCFs) employ feature extraction networks (FENs) to model the target appearance during the learning process. However, using deep feature maps extracted from FENs based on different residual neural networks (ResNets) has not previously been investigated. This paper aims to evaluate the performance of twelve state-of-the-art ResNet-based FENs in a DCF-based framework to determine the best for visual tracking purposes. First, it ranks their best feature maps and explores the generalized adoption of the best ResNet-based FEN into another DCF-based method. Then, the proposed method extracts deep semantic information from a fully convolutional FEN and fuses it with the best ResNet-based feature maps to strengthen the target representation in the learning process of continuous convolution filters. Finally, it introduces a new and efficient semantic weighting method (using semantic segmentation feature maps on each video frame) to reduce the drift problem. Extensive experimental results on the well-known OTB-2013, OTB-2015, TC-128 and VOT-2018 visual tracking datasets demonstrate that the proposed method effectively outperforms state-of-the-art methods in terms of precision and robustness of visual tracking.

Submitted 20 September, 2021; v1 submitted 3 April, 2020; originally announced April 2020.

Comments: To appear in The Visual Computer (International Journal of Computer Graphics), Springer, 2021

arXiv:2003.08151 [pdf, other]

The State of Lifelong Learning in Service Robots: Current Bottlenecks in Object Perception and Manipulation

Authors: S. Hamidreza Kasaei, Jorik Melsen, Floris van Beers, Christiaan Steenkist, Klemen Voncina

Abstract: Service robots are appearing more and more in our daily life. The development of service robots combines multiple fields of research, from object perception to object manipulation. The state-of-the-art continues to improve to make a proper coupling between object perception and manipulation. This coupling is necessary for service robots not only to perform various tasks in a reasonable amount of time but also to continually adapt to new environments and safely interact with non-expert human users. Nowadays, robots are able to recognize various objects, and quickly plan a collision-free trajectory to grasp a target object in predefined settings. Besides, in most of the cases, there is a reliance on large amounts of training data. Therefore, the knowledge of such robots is fixed after the training phase, and any changes in the environment require complicated, time-consuming, and expensive robot re-programming by human experts. Therefore, these approaches are still too rigid for real-life applications in unstructured environments, where a significant portion of the environment is unknown and cannot be directly sensed or controlled. In such environments, no matter how extensive the training data used for batch learning, a robot will always face new objects. Therefore, apart from batch learning, the robot should be able to continually learn about new object categories and grasp affordances from very few training examples on-site.
Moreover, apart from robot self-learning, non-expert users could interactively guide the process of experience acquisition by teaching new concepts, or by correcting insufficient or erroneous concepts. In this way, the robot will constantly learn how to help humans in everyday tasks by gaining more and more experiences without the need for re-programming.

Submitted 6 May, 2021; v1 submitted 18 March, 2020; originally announced March 2020.

arXiv:2002.03892 [pdf, other]

Learning to Grasp 3D Objects using Deep Residual U-Nets

Authors: Yikun Li, Lambert Schomaker, S. Hamidreza Kasaei

Abstract: Grasp synthesis is one of the challenging tasks for any robot object manipulation task. In this paper, we present a new deep learning-based grasp synthesis approach for 3D objects. In particular, we propose an end-to-end 3D Convolutional Neural Network to predict the objects' graspable areas. We named our approach Res-U-Net since the architecture of the network is designed based on U-Net structure and residual network-styled blocks. It is devised to plan 6-DOF grasps for any desired object, be efficient to compute and use, and be robust against varying point cloud density and Gaussian noise. We have performed extensive experiments to assess the performance of the proposed approach concerning graspable part detection, grasp success rate, and robustness to varying point cloud density and Gaussian noise. Experiments validate the promising performance of the proposed architecture in all aspects. A video showing the performance of our approach in the simulation environment can be found at: http://youtu.be/5_yAJCc8owo

Submitted 12 September, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

arXiv:2002.03779 [pdf, other]

Investigating the Importance of Shape Features, Color Constancy, Color Spaces and Similarity Measures in Open-Ended 3D Object Recognition

Authors: S. Hamidreza Kasaei, Maryam Ghorbani, Jits Schilperoort, Wessel van der Rest

Abstract: Despite the recent success of state-of-the-art 3D object recognition approaches, service robots frequently fail to recognize many objects in real human-centric environments. For these robots, object recognition is a challenging task due to the high demand for accurate and real-time response under changing and unpredictable environmental conditions. Most of the recent approaches use either the shape information only and ignore the role of color information or vice versa. Furthermore, they mainly utilize the $L_n$ Minkowski family functions to measure the similarity of two object views, while there are various distance measures that are applicable to compare two object views. In this paper, we explore the importance of shape information, color constancy, color spaces, and various similarity measures in open-ended 3D object recognition. Towards this goal, we extensively evaluate the performance of object recognition approaches in three different configurations, including color-only, shape-only, and combinations of color and shape, in both offline and online settings. Experimental results concerning scalability, memory usage, and object recognition performance show that all of the combinations of color and shape yield significant improvements over the shape-only and color-only approaches. The underlying reason is that color information is an important feature to distinguish objects that have very similar geometric properties with different colors and vice versa.
Moreover, by combining color and shape information, we demonstrate that the robot can learn new object categories from very few training examples in a real-world setting.

Submitted 26 September, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

arXiv:2001.00526 [pdf, other]

doi 10.1007/s11042-020-09223-8

Lightweight Residual Densely Connected Convolutional Neural Network

Authors: Fahimeh Fooladgar, Shohreh Kasaei

Abstract: Extremely efficient convolutional neural network architectures are one of the most important requirements for limited-resource devices (such as embedded and mobile devices). The computing power and memory size are two important constraints of these devices. Recently, some architectures have been proposed to overcome these limitations by considering specific hardware-software equipment. In this paper, the lightweight residual densely connected blocks are proposed to guarantee the deep supervision, efficient gradient flow, and feature reuse abilities of convolutional neural networks. The proposed method decreases the cost of training and inference processes without using any special hardware-software equipment by just reducing the number of parameters and computational operations while achieving a feasible accuracy. Extensive experimental results demonstrate that the proposed architecture is more efficient than the AlexNet and VGGNet in terms of model size, required parameters, and even accuracy. The proposed model has been evaluated on the ImageNet, MNIST, Fashion MNIST, SVHN, CIFAR-10, and CIFAR-100. It achieves state-of-the-art results on Fashion MNIST dataset and reasonable results on the others. The obtained results show the superiority of the proposed method to efficient models such as the SqueezeNet. It is also comparable with state-of-the-art efficient models such as CondenseNet and ShuffleNet.

Submitted 8 June, 2020; v1 submitted 2 January, 2020; originally announced January 2020.

arXiv:1912.12082 [pdf, other]

Pointwise Attention-Based Atrous Convolutional Neural Networks

Authors: Mobina Mahdavi, Fahimeh Fooladgar, Shohreh Kasaei

Abstract: With the rapid progress of deep convolutional neural networks, in almost all robotic applications, the availability of 3D point clouds improves the accuracy of 3D semantic segmentation methods. Rendering of these irregular, unstructured, and unordered 3D points to 2D images from multiple viewpoints imposes some issues such as loss of information due to 3D to 2D projection, discretizing artifacts, and high computational costs. To efficiently deal with a large number of points and incorporate more context of each point, a pointwise attention-based atrous convolutional neural network architecture is proposed. It focuses on salient 3D feature points among all feature maps while considering outstanding contextual information via spatial channel-wise attention modules. The proposed model has been evaluated on the two most important 3D point cloud datasets for the 3D semantic segmentation task. It achieves a reasonable performance compared to state-of-the-art models in terms of accuracy, with a much smaller number of parameters.

Submitted 27 December, 2019; originally announced December 2019.

Comments: 7 pages, 6 figures. Author one and author two contributed equally

arXiv:1912.11691 [pdf, other]

Multi-Modal Attention-based Fusion Model for Semantic Segmentation of RGB-Depth Images

Authors: Fahimeh Fooladgar, Shohreh Kasaei

Abstract: The 3D scene understanding is mainly considered as a crucial requirement in computer vision and robotics applications. One of the high-level tasks in 3D scene understanding is semantic segmentation of RGB-Depth images. With the availability of RGB-D cameras, it is desired to improve the accuracy of the scene understanding process by exploiting the depth features along with the appearance features. As depth images are independent of illumination, they can improve the quality of semantic labeling alongside RGB images. Consideration of both common and specific features of these two modalities improves the performance of semantic segmentation. One of the main problems in RGB-Depth semantic segmentation is how to fuse or combine these two modalities to achieve more advantages of each modality while being computationally efficient. Recently, the methods that employ deep convolutional neural networks have reached the state-of-the-art results by early, late, and middle fusion strategies. In this paper, an efficient encoder-decoder model with the attention-based fusion block is proposed to integrate mutual influences between feature maps of these two modalities. This block explicitly extracts the interdependences among concatenated feature maps of these modalities to exploit more powerful feature maps from RGB-Depth images.
The extensive experimental results on three main challenging datasets of NYU-V2, SUN RGB-D, and Stanford 2D-3D-Semantic show that the proposed network outperforms the state-of-the-art models with respect to computational cost as well as model size. Experimental results also illustrate the effectiveness of the proposed lightweight attention-based fusion model in terms of accuracy.

Submitted 25 December, 2019; originally announced December 2019.

arXiv:1912.09539 [pdf, other]

Interactive Open-Ended Learning for 3D Object Recognition

Authors: S. Hamidreza Kasaei

Abstract: The thesis contributes in several important ways to the research area of 3D object category learning and recognition. To cope with the mentioned limitations, we look at human cognition, in particular at the fact that human beings learn to recognize object categories ceaselessly over time. This ability to refine knowledge from the set of accumulated experiences facilitates the adaptation to new environments. Inspired by this capability, we seek to create a cognitive object perception and perceptual learning architecture that can learn 3D object categories in an open-ended fashion. In this context, "open-ended" implies that the set of categories to be learned is not known in advance, and the training instances are extracted from actual experiences of a robot, and thus become gradually available, rather than being available since the beginning of the learning process. In particular, this architecture provides perception capabilities that will allow robots to incrementally learn object categories from the set of accumulated experiences and reason about how to perform complex tasks. This framework integrates detection, tracking, teaching, learning, and recognition of objects. An extensive set of systematic experiments, in multiple experimental settings, was carried out to thoroughly evaluate the described learning approaches. Experimental results show that the proposed system is able to interact with human users, learn new object categories over time, as well as perform complex tasks.
The contributions presented in this thesis have been fully implemented and evaluated on different standard object and scene datasets and empirically evaluated on different robotic platforms.

Submitted 19 December, 2019; originally announced December 2019.

Comments: PhD thesis

arXiv:1912.00535 [pdf, other]

doi 10.1109/TITS.2020.3046478

Deep Learning for Visual Tracking: A Comprehensive Survey

Authors: Seyed Mojtaba Marvasti-Zadeh, Li Cheng, Hossein Ghanei-Yakhdan, Shohreh Kasaei

Abstract: Visual target tracking is one of the most sought-after yet challenging research topics in computer vision. Given the ill-posed nature of the problem and its popularity in a broad range of real-world scenarios, a number of large-scale benchmark datasets have been established, on which considerable methods have been developed and demonstrated with significant progress in recent years -- predominantly by recent deep learning (DL)-based methods. This survey aims to systematically investigate the current DL-based visual tracking methods, benchmark datasets, and evaluation metrics. It also extensively evaluates and analyzes the leading visual tracking methods. First, the fundamental characteristics, primary motivations, and contributions of DL-based methods are summarized from nine key aspects of: network architecture, network exploitation, network training for visual tracking, network objective, network output, exploitation of correlation filter advantages, aerial-view tracking, long-term tracking, and online tracking. Second, popular visual tracking benchmarks and their respective properties are compared, and their evaluation metrics are summarized. Third, the state-of-the-art DL-based methods are comprehensively examined on a set of well-established benchmarks of OTB2013, OTB2015, VOT2018, LaSOT, UAV123, UAVDT, and VisDrone2019. Finally, by conducting critical analyses of these state-of-the-art trackers quantitatively and qualitatively, their pros and cons under various common scenarios are investigated.
It may serve as a gentle use guide for practitioners to weigh when and under what conditions to choose which method(s). It also facilitates a discussion on ongoing issues and sheds light on promising research directions.

Submitted 26 January, 2021; v1 submitted 1 December, 2019; originally announced December 2019.

Comments: Accepted Manuscript in IEEE Transactions on Intelligent Transportation Systems

arXiv:1909.09706 [pdf, other]

Do Compressed Representations Generalize Better?

Authors: Hassan Hafez-Kolahi, Shohreh Kasaei, Mahdiyeh Soleymani-Baghshah

Abstract: One of the most studied problems in machine learning is finding reasonable constraints that guarantee the generalization of a learning algorithm. These constraints are usually expressed as some simplicity assumptions on the target. For instance, in the Vapnik-Chervonenkis (VC) theory the space of possible hypotheses is considered to have a limited VC dimension. In this paper, the constraint on the entropy $H(X)$ of the input variable $X$ is studied as a simplicity assumption. It is proven that the sample complexity to achieve an $ε$-$δ$ Probably Approximately Correct (PAC) hypothesis is bounded by $\frac{2^{6H(X)/ε}+\log\frac{1}{δ}}{ε^2}$ which is sharp up to the $\frac{1}{ε^2}$ factor. Moreover, it is shown that if a feature learning process is employed to learn the compressed representation from the dataset, this bound no longer exists. These findings have important implications on the Information Bottleneck (IB) theory which had been utilized to explain the generalization power of Deep Neural Networks (DNNs), but its applicability for this purpose is currently under debate by researchers. In particular, this is a rigorous proof for the previous heuristic that compressed representations are exponentially easier to be learned. However, our analysis pinpoints two factors preventing the IB, in its current form, from being applicable in studying neural networks. Firstly, the exponential dependence of sample complexity on $\frac{1}{ε}$, which can lead to a dramatic effect on the bounds in practical applications when $ε$ is small.
Secondly, our analysis reveals that arguments based on input compression are inherently insufficient to explain generalization of methods like DNNs in which the features are also learned using available data.

Submitted 2 January, 2020; v1 submitted 20 September, 2019; originally announced September 2019.

arXiv:1907.12924 [pdf, other]

Look Further to Recognize Better: Learning Shared Topics and Category-Specific Dictionaries for Open-Ended 3D Object Recognition

Authors: S. Hamidreza Kasaei

Abstract: Service robots are expected to operate effectively in human-centric environments for long periods of time. In such realistic scenarios, fine-grained object categorization is as important as basic-level object categorization. We tackle this problem by proposing an open-ended object recognition approach which concurrently learns both the object categories and the local features for encoding objects. In this work, each object is represented using a set of general latent visual topics and category-specific dictionaries. The general topics encode the common patterns of all categories, while the category-specific dictionary describes the content of each category in detail. The proposed approach discovers both sets of general and specific representations in an unsupervised fashion and updates them incrementally using new object views. Experimental results show that our approach yields significant improvements over the previous state-of-the-art approaches concerning scalability and object classification performance. Moreover, our approach demonstrates the capability of learning from very few training examples in a real-world setting. Regarding computation time, the best result was obtained with a Bag-of-Words method followed by a variant of the Latent Dirichlet Allocation approach.

Submitted 26 July, 2019; originally announced July 2019.

Comments: arXiv admin note: text overlap with arXiv:1902.03057

arXiv:1907.10932 [pdf, other]

Object Perception and Grasping in Open-Ended Domains

Authors: S. Hamidreza Kasaei

Abstract: Nowadays service robots are leaving the structured and completely known environments and entering human-centric settings. For these robots, object perception and grasping are two challenging tasks due to the high demand for accurate and real-time responses. Although many problems have already been understood and solved successfully, many challenges still remain. Open-ended learning is one of these challenges waiting for many improvements. Cognitive science revealed that humans learn to recognize object categories and grasp affordances ceaselessly over time. This ability allows adapting to new environments by enhancing their knowledge from the accumulation of experiences and the conceptualization of new object categories. Inspired by this, an autonomous robot must have the ability to process visual information and conduct learning and recognition tasks in an open-ended fashion. In this context, "open-ended" implies that the set of object categories to be learned is not known in advance, and the training instances are extracted from online experiences of a robot, and become gradually available over time, rather than being completely available at the beginning of the learning process. In my research, I mainly focus on interactive open-ended learning approaches to recognize multiple objects and their grasp affordances concurrently. In particular, I try to address the following research questions: (i) What is the importance of open-ended learning for autonomous robots? (ii) How robots could learn incrementally from their own experiences as well as from interaction with humans? (iii) What are the limitations of Deep Learning approaches to be used in an open-ended manner? (iv) How to evaluate open-ended learning approaches and what are the right metrics to do so?

Submitted 25 July, 2019; originally announced July 2019.

arXiv:1904.03743 [pdf, other]

Information Bottleneck and its Applications in Deep Learning

Authors: Hassan Hafez-Kolahi, Shohreh Kasaei

Abstract: Information Theory (IT) has been used in Machine Learning (ML) from the early days of this field. In the last decade, advances in Deep Neural Networks (DNNs) have led to surprising improvements in many applications of ML. The result has been a paradigm shift in the community toward revisiting previous ideas and applications in this new framework. Ideas from IT are no exception. One of the ideas which is being revisited by many researchers in this new era is the Information Bottleneck (IB), a formulation of information extraction based on IT. The IB is promising in both analyzing and improving DNNs. The goal of this survey is to review the IB concept and demonstrate its applications in deep learning. The information-theoretic nature of IB also makes it a good candidate for showing the more general concept of how IT can be used in ML. Two important concepts are highlighted in this narrative on the subject: i) the concise and universal view that IT provides on seemingly unrelated methods of ML, demonstrated by explaining how IB relates to minimal sufficient statistics, stochastic gradient descent, and variational auto-encoders, and ii) the common technical mistakes and problems caused by applying ideas from IT, which are discussed through a careful study of some recent methods suffering from them.

Submitted 7 April, 2019; originally announced April 2019.

arXiv:1904.02530 [pdf, other]

Interactive Open-Ended Object, Affordance and Grasp Learning for Robotic Manipulation

Authors: S. Hamidreza Kasaei, Nima Shafii, Luis Seabra Lopes, Ana Maria Tome

Abstract: Service robots are expected to autonomously and efficiently work in human-centric environments. For this type of robot, object perception and manipulation are challenging tasks due to the need for accurate and real-time response. This paper presents an interactive open-ended learning approach to recognize multiple objects and their grasp affordances concurrently. This is an important contribution in the field of service robots since, no matter how extensive the training data used for batch learning, a robot might always be confronted with an unknown object when operating in human-centric environments. The paper describes the system architecture and the learning and recognition capabilities. Grasp learning associates grasp configurations (i.e., end-effector positions and orientations) to grasp affordance categories. The grasp affordance category and the grasp configuration are taught through verbal and kinesthetic teaching, respectively. A Bayesian approach is adopted for learning and recognition of object categories and an instance-based approach is used for learning and recognition of affordance categories. An extensive set of experiments has been performed to assess the performance of the proposed approach regarding recognition accuracy, scalability and grasp success rate on challenging datasets and real-world scenarios.

Submitted 4 April, 2019; originally announced April 2019.

Showing 1–50 of 53 results for author: Kasai, S