
CSFNet: A Cosine Similarity Fusion Network for Real-Time RGB-X Semantic Segmentation of Driving Scenes

Authors: Danial Qashqai, Emad Mousavian, Shahriar Baradaran Shokouhi, Sattar Mirzakuchaki

Abstract: Semantic segmentation, as a crucial component of complex visual interpretation, plays a fundamental role in autonomous vehicle vision systems. Recent studies have significantly improved the accuracy of semantic segmentation by exploiting complementary information and developing multimodal methods. Despite the gains in accuracy, multimodal semantic segmentation methods suffer from high computational complexity and low inference speed. Therefore, it is a challenging task to implement multimodal methods in driving applications. To address this problem, we propose the Cosine Similarity Fusion Network (CSFNet) as a real-time RGB-X semantic segmentation model. Specifically, we design a Cosine Similarity Attention Fusion Module (CS-AFM) that effectively rectifies and fuses features of two modalities. The CS-AFM module leverages cross-modal similarity to achieve high generalization ability. By enhancing the fusion of cross-modal features at lower levels, CS-AFM paves the way for the use of a single-branch network at higher levels. Therefore, we use dual and single-branch architectures in an encoder, along with an efficient context module and a lightweight decoder for fast and accurate predictions. To verify the effectiveness of CSFNet, we use the Cityscapes, MFNet, and ZJU datasets for the RGB-D/T/P semantic segmentation. According to the results, CSFNet has competitive accuracy with state-of-the-art methods while being state-of-the-art in terms of speed among multimodal semantic segmentation models. It also achieves high efficiency due to its low parameter count and computational complexity. The source code for CSFNet will be available at https://github.com/Danial-Qashqai/CSFNet.

Submitted 1 July, 2024; originally announced July 2024.
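
The CSFNet abstract above describes fusing two modalities by cross-modal similarity but gives no implementation details. The following is only a minimal sketch of the general idea, not the authors' CS-AFM: it assumes each modality is given as a list of channels (each a flat list of activations) and uses the per-channel cosine similarity as a gate on the auxiliary modality. The function names `cosine_similarity` and `fuse_channels` and the clamping of negative similarities to zero are illustrative assumptions.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity of two equal-length vectors (eps avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def fuse_channels(rgb_feats, x_feats):
    """Hypothetical fusion rule: add the X-modality channel to the RGB channel,
    weighted by how similar the two channels are.

    rgb_feats, x_feats: lists of channels, each channel a flat list of floats.
    """
    fused = []
    for rgb_ch, x_ch in zip(rgb_feats, x_feats):
        # Assumption: anti-correlated channels are ignored rather than subtracted.
        weight = max(cosine_similarity(rgb_ch, x_ch), 0.0)
        fused.append([r + weight * x for r, x in zip(rgb_ch, x_ch)])
    return fused


if __name__ == "__main__":
    rgb = [[1.0, 0.0], [0.0, 1.0]]
    depth = [[2.0, 0.0], [1.0, 0.0]]
    print(fuse_channels(rgb, depth))
```

In this toy rule, a channel whose auxiliary features point in the same direction as the RGB features is reinforced, while an orthogonal channel is passed through unchanged.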

arXiv:2311.06651 [pdf, other]

Traffic Sign Recognition Using Local Vision Transformer

Authors: Ali Farzipour, Omid Nejati Manzari, Shahriar B. Shokouhi

Abstract: Recognition of traffic signs is a crucial aspect of self-driving cars and driver assistance systems, and machine vision tasks such as traffic sign recognition have gained significant attention. CNNs have been frequently used in machine vision, but introducing vision transformers has provided an alternative approach to global feature learning. This paper proposes a novel model that blends the advantages of both convolutional and transformer-based networks for traffic sign recognition. The proposed model includes convolutional blocks for capturing local correlations and transformer-based blocks for learning global dependencies. Additionally, a locality module is incorporated to enhance local perception. The performance of the suggested model is evaluated on the Persian Traffic Sign Dataset and German Traffic Sign Recognition Benchmark and compared with SOTA convolutional and transformer-based models. The experimental evaluations demonstrate that the hybrid network with the locality module outperforms pure transformer-based models and some of the best convolutional networks in accuracy. Specifically, our proposed final model reached 99.66% accuracy on the German Traffic Sign Recognition Benchmark and 99.8% on the Persian Traffic Sign Dataset, higher than the best convolutional models. Moreover, it outperforms existing CNNs and ViTs while maintaining fast inference speed. Consequently, the proposed model proves to be significantly faster and more suitable for real-world applications.

Submitted 11 November, 2023; originally announced November 2023.

arXiv:2302.09462 [pdf, other]

doi 10.1016/j.compbiomed.2023.106791

MedViT: A Robust Vision Transformer for Generalized Medical Image Classification

Authors: Omid Nejati Manzari, Hamid Ahmadabadi, Hossein Kashiani, Shahriar B. Shokouhi, Ahmad Ayatollahi

Abstract: Convolutional Neural Networks (CNNs) have advanced existing medical systems for automatic disease diagnosis. However, there are still concerns about the reliability of deep medical diagnosis systems against the potential threats of adversarial attacks since inaccurate diagnosis could lead to disastrous consequences in the safety realm. In this study, we propose a highly robust yet efficient CNN-Transformer hybrid model which is equipped with the locality of CNNs as well as the global connectivity of vision Transformers. To mitigate the high quadratic complexity of the self-attention mechanism while jointly attending to information in various representation subspaces, we construct our attention mechanism by means of an efficient convolution operation. Moreover, to alleviate the fragility of our Transformer model against adversarial attacks, we attempt to learn smoother decision boundaries. To this end, we augment the shape information of an image in the high-level feature space by permuting the feature mean and variance within mini-batches. With less computational complexity, our proposed hybrid model demonstrates its high robustness and generalization ability compared to the state-of-the-art studies on a large-scale collection of standardized MedMNIST-2D datasets.

Submitted 18 February, 2023; originally announced February 2023.

Journal ref: Computers in Biology and Medicine 2023
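
The MedViT abstract above mentions permuting the feature mean and variance within mini-batches. As a rough illustration only, and not the authors' implementation, the sketch below normalizes each sample's feature vector with its own moments and then rescales it with the moments of another sample chosen by a permutation. The function name `swap_moments`, the use of a flat per-sample feature list, and the explicit `perm` argument are assumptions made for the example.

```python
import statistics


def swap_moments(batch, perm, eps=1e-5):
    """Re-style each sample with the mean/std of the sample at perm[i].

    batch: list of per-sample feature vectors (lists of floats).
    perm:  list of indices; perm[i] is the sample whose moments sample i receives.
    """
    means = [statistics.fmean(feats) for feats in batch]
    # Population standard deviation; eps keeps the division stable for flat features.
    stds = [statistics.pstdev(feats) + eps for feats in batch]
    out = []
    for i, feats in enumerate(batch):
        j = perm[i]
        out.append([(v - means[i]) / stds[i] * stds[j] + means[j] for v in feats])
    return out


if __name__ == "__main__":
    batch = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
    print(swap_moments(batch, perm=[1, 0]))
```

With the identity permutation the features come back unchanged (up to rounding); with a shuffled permutation each sample keeps its normalized pattern but takes on another sample's first- and second-order statistics.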

arXiv:2301.11553 [pdf, other]

Robust Transformer with Locality Inductive Bias and Feature Normalization

Authors: Omid Nejati Manzari, Hossein Kashiani, Hojat Asgarian Dehkordi, Shahriar Baradaran Shokouhi

Abstract: Vision transformers have been demonstrated to yield state-of-the-art results on a variety of computer vision tasks using attention-based networks. However, research works in transformers mostly do not investigate the robustness/accuracy trade-off, and they still struggle to handle adversarial perturbations. In this paper, we explore the robustness of vision transformers against adversarial perturbations and try to enhance their robustness/accuracy trade-off in white box attack settings. To this end, we propose the Locality iN Locality (LNL) transformer model. We prove that the locality introduction to LNL contributes to the robustness performance since it aggregates local information such as lines, edges, shapes, and even objects. In addition, to further improve the robustness performance, we encourage LNL to extract training signal from the moments (a.k.a., mean and standard deviation) and the normalized features. We validate the effectiveness and generality of LNL by achieving state-of-the-art results in terms of accuracy and robustness metrics on German Traffic Sign Recognition Benchmark (GTSRB) and Canadian Institute for Advanced Research (CIFAR-10). More specifically, for traffic sign classification, the proposed LNL yields gains of 1.1% and ~35% in terms of clean and robustness accuracy compared to the state-of-the-art studies.

Submitted 27 January, 2023; originally announced January 2023.

Comments: 9 pages, 3 Figures, 6 Tables

Journal ref: Engineering Science and Technology, an International Journal, 2023

arXiv:2207.06067 [pdf, other]

Pyramid Transformer for Traffic Sign Detection

Authors: Omid Nejati Manzari, Amin Boudesh, Shahriar B. Shokouhi

Abstract: Traffic sign detection is a vital task in the visual system of self-driving cars and the automated driving system. Recently, novel Transformer-based models have achieved encouraging results for various computer vision tasks. We still observed that vanilla ViT could not yield satisfactory results in traffic sign detection because the overall size of the datasets is very small and the class distribution of traffic signs is extremely unbalanced. To overcome this problem, a novel Pyramid Transformer with locality mechanisms is proposed in this paper. Specifically, Pyramid Transformer has several spatial pyramid reduction layers to shrink and embed the input image into tokens with rich multi-scale context by using atrous convolutions. Moreover, it inherits an intrinsic scale invariance inductive bias and is able to learn local feature representation for objects at various scales, thereby enhancing the network robustness against the size discrepancy of traffic signs. The experiments are conducted on the German Traffic Sign Detection Benchmark (GTSDB). The results demonstrate the superiority of the proposed model in the traffic sign detection tasks. More specifically, Pyramid Transformer achieves 77.8% mAP on GTSDB when applied to the Cascade RCNN as the backbone, which surpasses most well-known and widely-used state-of-the-art models.

Submitted 22 July, 2022; v1 submitted 13 July, 2022; originally announced July 2022.

arXiv:2112.07015 [pdf, other]

Multi-Expert Human Action Recognition with Hierarchical Super-Class Learning

Authors: Hojat Asgarian Dehkordi, Ali Soltani Nezhad, Hossein Kashiani, Shahriar Baradaran Shokouhi, Ahmad Ayatollahi

Abstract: In still image human action recognition, existing studies have mainly leveraged extra bounding box information along with class labels to mitigate the lack of temporal information in still images; however, preparing extra data with manual annotation is time-consuming and also prone to human errors. Moreover, the existing studies have not addressed action recognition with long-tailed distribution. In this paper, we propose a two-phase multi-expert classification method for human action recognition to cope with long-tailed distribution by means of super-class learning and without any extra information. To choose the best configuration for each super-class and characterize inter-class dependency between different action classes, we propose a novel Graph-Based Class Selection (GCS) algorithm. In the proposed approach, a coarse-grained phase selects the most relevant fine-grained experts. Then, the fine-grained experts encode the intricate details within each super-class so that the inter-class variation increases. Extensive experimental evaluations are conducted on various public human action recognition datasets, including Stanford40, Pascal VOC 2012 Action, BU101+, and IHAR datasets. The experimental results demonstrate that the proposed method yields promising improvements. To be more specific, in IHAR, Stanford40, Pascal VOC 2012 Action, and BU101+ benchmarks, the proposed approach outperforms the state-of-the-art studies by 8.92%, 0.41%, 0.66%, and 2.11% with much less computational cost and without any auxiliary annotation information. Besides, it is proven that in addressing action recognition with long-tailed distribution, the proposed method outperforms its counterparts by a significant margin.

Submitted 13 December, 2021; originally announced December 2021.

Comments: 47 pages

arXiv:2102.03932 [pdf]

doi 10.1002/mp.15156

Automatic Breast Lesion Detection in Ultrafast DCE-MRI Using Deep Learning

Authors: Fazael Ayatollahi, Shahriar B. Shokouhi, Ritse M. Mann, Jonas Teuwen

Abstract: Purpose: We propose a deep learning-based computer-aided detection (CADe) method to detect breast lesions in ultrafast DCE-MRI sequences. This method uses both the three-dimensional spatial information and temporal information obtained from the early-phase of the dynamic acquisition. Methods: The proposed CADe method, based on a modified 3D RetinaNet model, operates on ultrafast T1 weighted sequences, which are preprocessed for motion compensation, temporal normalization, and are cropped before passing into the model. The model is optimized to enable the detection of relatively small breast lesions in a screening setting, focusing on detection of lesions that are harder to differentiate from confounding structures inside the breast. Results: The method was developed based on a dataset consisting of 489 ultrafast MRI studies obtained from 462 patients containing a total of 572 lesions (365 malignant, 207 benign) and achieved a detection rate, sensitivity, and detection rate of benign lesions of 0.90 (0.876-0.934), 0.95 (0.934-0.980), and 0.81 (0.751-0.871) at 4 false positives per normal breast with 10-fold cross-testing, respectively. Conclusions: The deep learning architecture used for the proposed CADe application can efficiently detect benign and malignant lesions on ultrafast DCE-MRI. Furthermore, utilizing the less visible hard-to-detect lesions in training improves the learning process and, subsequently, detection of malignant breast lesions.

Submitted 15 August, 2021; v1 submitted 7 February, 2021; originally announced February 2021.

Journal ref: Medical physics vol. 48,10 (2021): 5897-5907

arXiv:2008.09891 [pdf, other]

Online Visual Tracking with One-Shot Context-Aware Domain Adaptation

Authors: Hossein Kashiani, Amir Abbas Hamidi Imani, Shahriar Baradaran Shokouhi, Ahmad Ayatollahi

Abstract: Online learning policy makes visual trackers more robust against different distortions through learning domain-specific cues. However, the trackers adopting this policy fail to fully leverage the discriminative context of the background areas. Moreover, owing to the lack of sufficient data at each time step, the online learning approach can also make the trackers prone to over-fitting to the background regions. In this paper, we propose a domain adaptation approach to strengthen the contributions of the semantic background context. The domain adaptation approach is backboned with only an off-the-shelf deep model. The strength of the proposed approach comes from its discriminative ability to handle severe occlusion and background clutter challenges. We further introduce a cost-sensitive loss alleviating the dominance of non-semantic background candidates over the semantic candidates, thereby dealing with the data imbalance issue. Experimental results demonstrate that our tracker achieves competitive results at real-time speed compared to the state-of-the-art trackers.

Submitted 17 April, 2021; v1 submitted 22 August, 2020; originally announced August 2020.

Comments: 36 pages, 1 algorithm, 8 figures, 1 table

arXiv:2003.09893 [pdf, other]

doi 10.1109/ICCKE48569.2019.8965014

Ensembles of Deep Neural Networks for Action Recognition in Still Images

Authors: Sina Mohammadi, Sina Ghofrani Majelan, Shahriar B. Shokouhi

Abstract: Despite the fact that notable improvements have been made recently in the field of feature extraction and classification, human action recognition is still challenging, especially in images, in which, unlike videos, there is no motion. Thus, the methods proposed for recognizing human actions in videos cannot be applied to still images. A big challenge in action recognition in still images is the lack of large enough datasets, which is problematic for training deep Convolutional Neural Networks (CNNs) due to the overfitting issue. In this paper, by taking advantage of pre-trained CNNs, we employ the transfer learning technique to tackle the lack of massive labeled action recognition datasets. Furthermore, since the last layer of the CNN has class-specific information, we apply an attention mechanism on the output feature maps of the CNN to extract more discriminative and powerful features for classification of human actions. Moreover, we use eight different pre-trained CNNs in our framework and investigate their performance on Stanford 40 dataset. Finally, we propose using the Ensemble Learning technique to enhance the overall accuracy of action classification by combining the predictions of multiple models. The best setting of our method is able to achieve 93.17% accuracy on the Stanford 40 dataset.

Submitted 22 March, 2020; originally announced March 2020.

Comments: 5 pages, 2 figures, 3 tables, Accepted by ICCKE 2019

Journal ref: 2019 9th International Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 2019, pp. 315-318
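
The abstract above describes combining the predictions of multiple models through ensemble learning but does not say which combination rule is used. The sketch below shows one common rule, averaging per-class probabilities and taking the arg-max, purely as an assumed example; the function name `ensemble_predict` and the sample probabilities are hypothetical.

```python
def ensemble_predict(model_probs):
    """Average class probabilities over models and return (best_class, averaged_probs).

    model_probs: list with one entry per model; each entry is a list of
    class probabilities for the same input image.
    """
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    avg = [sum(probs[c] for probs in model_probs) / n_models for c in range(n_classes)]
    # Index of the class with the highest averaged probability.
    best = max(range(n_classes), key=avg.__getitem__)
    return best, avg


if __name__ == "__main__":
    # Three hypothetical models voting on a two-class problem.
    votes = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
    print(ensemble_predict(votes))
```

Averaging probabilities (soft voting) lets a confident model outweigh an uncertain one; majority voting over hard labels is an equally plausible alternative.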

arXiv:1810.00119 [pdf, other]

Visual Object Tracking based on Adaptive Siamese and Motion Estimation Network

Authors: Hossein Kashiani, Shahriar B. Shokouhi

Abstract: Recently, the convolutional neural network (CNN) has attracted much attention in different areas of computer vision, due to its powerful abstract feature representation. Visual object tracking is one of the interesting and important areas in computer vision that has achieved remarkable improvements in recent years. In this work, we aim to improve both the motion and observation models in visual object tracking by leveraging the representation power of CNNs. To this end, a motion estimation network (named MEN) is utilized to seek the most likely locations of the target and prepare a further clue in addition to the previous target position. Hence the motion estimation would be enhanced by generating a small number of candidates near two plausible positions. The generated candidates are then fed into a trained Siamese network to detect the most probable candidate. Each candidate is compared to an adaptable buffer, which is updated under a predefined condition. To take into account the target appearance changes, a weighting CNN (called WCNN) adaptively assigns weights to the final similarity scores of the Siamese network using sequence-specific information. Evaluation results on well-known benchmark datasets (OTB100, OTB50 and OTB2013) prove that the proposed tracker outperforms the state-of-the-art competitors.

Submitted 28 September, 2018; originally announced October 2018.

Comments: 28 pages, 1 algorithm, 7 figures, 2 tables, Submitted to Elsevier, Image and Vision Computing

arXiv:1803.06141 [pdf]

doi 10.1109/ICCKE.2017.8167940

Patchwise object tracking via structural local sparse appearance model

Authors: Hossein Kashiyani, Shahriar B. Shokouhi

Abstract: In this paper, we propose a robust visual tracking method which exploits the relationships of targets in adjacent frames using patchwise joint sparse representation. Two sets of overlapping patches with different sizes are extracted from target candidates to construct two dictionaries with consideration of joint sparse representation. By applying this representation to the structural sparse appearance model, we gain two advantages. First, the correlation of target patches over time is considered. Second, using this local appearance model with different patch sizes takes into account local features of the target thoroughly. Furthermore, the position of candidate patches and their occlusion levels are utilized simultaneously to obtain the final likelihood of target candidates. Evaluations on a recent challenging benchmark show that our tracking method outperforms the state-of-the-art trackers.

Submitted 16 March, 2018; originally announced March 2018.

Comments: 6 pages, 3 figures, Accepted by ICCKE 2017

arXiv:1209.1949 [pdf]

Improved Robust DWT-Watermarking in YCbCr Color Space

Authors: Atefeh Elahian, Mehdi Khalili, Shahriar Baradaran Shokouhi

Abstract: Digital watermarking is an effective way to protect copyright. In this paper, a robust watermarking algorithm based on wavelet transformation is proposed which can confirm the copyright without the original image. The wavelet transformation technique is effective in image analysis and processing. Thus, color-image watermarking algorithms based on the discrete wavelet transformation (DWT) have begun to draw increasing attention. In the proposed approach, the watermark is encrypted by the Arnold transform and the host image is converted into the YCbCr color space. Then its Y channel is decomposed into wavelet coefficients, the selected approximation coefficients are quantized, and the least significant bit of each quantized coefficient is replaced by the encrypted watermark using the LSB insertion technique. The experimental results show that the watermark embedded by this algorithm has better robustness, including against wavelet compression, and better imperceptibility than the traditional embedding methods in the RGB color space.

Submitted 10 September, 2012; originally announced September 2012.

Comments: 5 Pages, 4 Figures, 3 Tables

MSC Class: 68U10; 68U20; 65C20; 94A08; 94A24; 94A60; 11T71; 14G50; 68P25; 81P94

ACM Class: D.4.6; K.6.5; K.4.2

Journal ref: Global journal of Computer Application and Technology (GJCAT), Vol.1, No.3, 2011, Pages 300-304
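
Two steps named in the watermarking abstract above, Arnold-transform scrambling of the watermark and LSB insertion into quantized coefficients, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the DWT and YCbCr conversion are omitted, the coefficients are a plain list of floats, and the quantization step `step`, the iteration count, and all function names are hypothetical.

```python
def arnold(img, iterations=1):
    """Scramble a square N x N matrix with the Arnold cat map (x, y) -> (x + y, x + 2y) mod N."""
    n = len(img)
    for _ in range(iterations):
        out = [[0] * n for _ in range(n)]
        for x in range(n):
            for y in range(n):
                out[(x + y) % n][(x + 2 * y) % n] = img[x][y]
        img = out
    return img


def arnold_inverse(img, iterations=1):
    """Undo `arnold` using the inverse map (x, y) -> (2x - y, y - x) mod N."""
    n = len(img)
    for _ in range(iterations):
        out = [[0] * n for _ in range(n)]
        for x in range(n):
            for y in range(n):
                out[(2 * x - y) % n][(y - x) % n] = img[x][y]
        img = out
    return img


def embed_bits(coeffs, bits, step=2.0):
    """Quantize each coefficient and overwrite the LSB of the quantized value with a watermark bit."""
    marked = []
    for c, bit in zip(coeffs, bits):
        q = round(c / step)          # quantized integer level
        q = (q & ~1) | bit           # replace least significant bit
        marked.append(q * step)      # back to coefficient scale
    return marked


def extract_bits(coeffs, step=2.0):
    """Recover the embedded bits from the LSB of each re-quantized coefficient."""
    return [round(c / step) & 1 for c in coeffs]


if __name__ == "__main__":
    watermark = [[1, 0, 1], [0, 1, 1], [0, 0, 1]]
    scrambled = arnold(watermark, iterations=2)
    bits = [b for row in scrambled for b in row]
    coeffs = [10.3, -7.9, 4.0, 12.6, 3.3, -1.2, 8.8, 5.1, 0.4]  # stand-in for approximation coefficients
    marked = embed_bits(coeffs, bits)
    recovered = extract_bits(marked)
    restored = arnold_inverse([recovered[i:i + 3] for i in range(0, 9, 3)], iterations=2)
    print(restored == watermark)
```

Extraction here needs only the marked coefficients, the step size, and the iteration count, which matches the abstract's statement that the copyright can be confirmed without the original image.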

Showing 1–12 of 12 results for author: Shokouhi, S B