SBCFormer: Lightweight Network Capable of Full-size ImageNet Classification at 1 FPS on Single Board Computers

Authors: Xiangyong Lu, Masanori Suganuma, Takayuki Okatani

Abstract: Computer vision has become increasingly prevalent in solving real-world problems across diverse domains, including smart agriculture, fishery, and livestock management. These applications may not require processing many image frames per second, leading practitioners to use single board computers (SBCs). Although many lightweight networks have been developed for mobile/edge devices, they primarily target smartphones with more powerful processors and not SBCs with low-end CPUs. This paper introduces a CNN-ViT hybrid network called SBCFormer, which achieves high accuracy and fast computation on such low-end CPUs. The hardware constraints of these CPUs make the Transformer's attention mechanism preferable to convolution. However, using attention on low-end CPUs presents a challenge: high-resolution internal feature maps demand excessive computational resources, but reducing their resolution results in the loss of local image details. SBCFormer introduces an architectural design to address this issue. As a result, SBCFormer achieves the best trade-off between accuracy and speed on a Raspberry Pi 4 Model B with an ARM-Cortex A72 CPU. For the first time, it achieves an ImageNet-1K top-1 accuracy of around 80% at a speed of 1.0 frame/sec on the SBC. Code is available at https://github.com/xyongLu/SBCFormer.

Submitted 7 November, 2023; originally announced November 2023.

Comments: 11 pages, 2 figures, WACV2024

Journal ref: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV2024)

arXiv:2310.04671 [pdf, other]

Exploring the Potential of Multi-Modal AI for Driving Hazard Prediction

Authors: Korawat Charoenpitaks, Van-Quang Nguyen, Masanori Suganuma, Masahiro Takahashi, Ryoma Niihara, Takayuki Okatani

Abstract: This paper addresses the problem of predicting hazards that drivers may encounter while driving a car. We formulate it as a task of anticipating impending accidents using a single input image captured by car dashcams. Unlike existing approaches to driving hazard prediction that rely on computational simulations or anomaly detection from videos, this study focuses on high-level inference from static images. The problem requires predicting and reasoning about future events based on uncertain observations, which falls under visual abductive reasoning. To enable research in this understudied area, a new dataset named the DHPR (Driving Hazard Prediction and Reasoning) dataset is created. The dataset consists of 15K dashcam images of street scenes, and each image is associated with a tuple containing car speed, a hypothesized hazard description, and visual entities present in the scene. These are annotated by human annotators, who identify risky scenes and provide descriptions of potential accidents that could occur a few seconds later. We present several baseline methods and evaluate their performance on our dataset, identifying remaining issues and discussing future directions. This study contributes to the field by introducing a novel problem formulation and dataset, enabling researchers to explore the potential of multi-modal AI for driving hazard prediction.

Submitted 1 July, 2024; v1 submitted 6 October, 2023; originally announced October 2023.

Comments: Main Paper: 11 pages, Supplementary Materials: 25 pages

Journal ref: IEEE Trans. Intell. Veh. (2024) 1-11

arXiv:2307.03243 [pdf, other]

That's BAD: Blind Anomaly Detection by Implicit Local Feature Clustering

Authors: Jie Zhang, Masanori Suganuma, Takayuki Okatani

Abstract: Recent studies on visual anomaly detection (AD) of industrial objects/textures have achieved quite good performance. They consider an unsupervised setting, specifically the one-class setting, in which we assume the availability of a set of normal (i.e., anomaly-free) images for training. In this paper, we consider a more challenging scenario of unsupervised AD, in which we detect anomalies in a given set of images that might contain both normal and anomalous samples. The setting does not assume the availability of known normal data and thus is completely free from human annotation, which differs from the standard AD considered in recent studies. For clarity, we call the setting blind anomaly detection (BAD). We show that BAD can be converted into a local outlier detection problem and propose a novel method named PatchCluster that can accurately detect image- and pixel-level anomalies. Experimental results show that PatchCluster achieves promising performance without the knowledge of normal data, even comparable to the SOTA methods applied in the one-class setting, which need it.

Submitted 6 July, 2023; originally announced July 2023.

arXiv:2307.03101 [pdf, other]

Contextual Affinity Distillation for Image Anomaly Detection

Authors: Jie Zhang, Masanori Suganuma, Takayuki Okatani

Abstract: Previous works on unsupervised industrial anomaly detection mainly focus on local structural anomalies such as cracks and color contamination. While achieving significantly high detection performance on this kind of anomaly, they are faced with logical anomalies that violate the long-range dependencies, such as a normal object placed in the wrong position. In this paper, based on previous knowledge distillation works, we propose to use two students (local and global) to better mimic the teacher's behavior. The local student, which is used in previous studies, mainly focuses on structural anomaly detection, while the global student pays attention to logical anomalies. To further encourage the global student's learning to capture long-range dependencies, we design the global context condensing block (GCCB) and propose a contextual affinity loss for the student training and anomaly scoring. Experimental results show the proposed method doesn't need cumbersome training techniques and achieves a new state-of-the-art performance on the MVTec LOCO AD dataset.

Submitted 6 July, 2023; originally announced July 2023.

arXiv:2307.02897 [pdf, other]

RefVSR++: Exploiting Reference Inputs for Reference-based Video Super-resolution

Authors: Han Zou, Masanori Suganuma, Takayuki Okatani

Abstract: Smartphones equipped with a multi-camera system comprising multiple cameras with different fields of view (FoVs) are becoming more prevalent. These camera configurations are compatible with reference-based SR and video SR, which can be executed simultaneously while recording video on the device. Thus, combining these two SR methods can improve image quality. Recently, Lee et al. have presented such a method, RefVSR. In this paper, we consider how to optimally utilize the observations obtained, including input low-resolution (LR) video and reference (Ref) video. RefVSR extends conventional video SR quite simply, aggregating the LR and Ref inputs over time in a single bidirectional stream. However, considering the content difference between LR and Ref images due to their FoVs, we can derive the maximum information from the two image sequences by aggregating them independently in the temporal direction. Then, we propose an improved method, RefVSR++, which can aggregate two features in parallel in the temporal direction, one for aggregating the fused LR and Ref inputs and the other for Ref inputs over time. Furthermore, we equip RefVSR++ with enhanced mechanisms to align image features over time, which is the key to the success of video SR. We experimentally show that RefVSR++ outperforms RefVSR by over 1dB in PSNR, achieving the new state-of-the-art.

Submitted 6 July, 2023; originally announced July 2023.

arXiv:2307.02875 [pdf, other]

Reference-based Motion Blur Removal: Learning to Utilize Sharpness in the Reference Image

Authors: Han Zou, Masanori Suganuma, Takayuki Okatani

Abstract: Despite the recent advancement in the study of removing motion blur in an image, it is still hard to deal with strong blurs. While there are limits in removing blurs from a single image, using multiple images has more potential, e.g., using an additional image as a reference to deblur a blurry image. A typical setting is deblurring an image using a nearby sharp image(s) in a video sequence, as in the studies of video deblurring. This paper proposes a better method to use the information present in a reference image. The method does not need a strong assumption on the reference image. We can utilize an alternative shot of the identical scene, just like in video deblurring, or we can even employ a distinct image from another scene. Our method first matches local patches of the target and reference images and then fuses their features to estimate a sharp image. We employ a patch-based feature matching strategy to solve the difficult problem of matching the blurry image with the sharp reference. Our method can be integrated into pre-existing networks designed for single image deblurring. The experimental results show the effectiveness of the proposed method.

Submitted 6 July, 2023; originally announced July 2023.

arXiv:2207.09775 [pdf, other]

Rectifying Open-set Object Detection: A Taxonomy, Practical Applications, and Proper Evaluation

Authors: Yusuke Hosoya, Masanori Suganuma, Takayuki Okatani

Abstract: Open-set object detection (OSOD) has recently gained attention. Its goal is to detect unknown objects while correctly detecting known objects. In this paper, we first point out that the recent studies' formalization of OSOD, which generalizes open-set recognition (OSR) and thus considers an unlimited variety of unknown objects, has a fundamental issue. This issue emerges from the difference between image classification and object detection, making it hard to evaluate OSOD methods' performance properly. We then introduce a novel scenario of OSOD, which considers known and unknown classes within a specified super-class of object classes. This new scenario has practical applications and is free from the above issue, enabling proper evaluation of OSOD performance and probably making the problem more manageable. Finally, we experimentally evaluate existing OSOD methods with the new scenario using multiple datasets, showing that the current state-of-the-art OSOD methods attain limited performance similar to a simple baseline method. The paper also presents a taxonomy of OSOD that clarifies different problem formalizations. We hope our study helps the community reconsider OSOD problems and progress in the right direction.

Submitted 29 November, 2022; v1 submitted 20 July, 2022; originally announced July 2022.

Comments: 17 pages, 7 figures

arXiv:2207.09666 [pdf, other]

GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features

Authors: Van-Quang Nguyen, Masanori Suganuma, Takayuki Okatani

Abstract: Current state-of-the-art methods for image captioning employ region-based features, as they provide object-level information that is essential to describe the content of images; they are usually extracted by an object detector such as Faster R-CNN. However, they have several issues, such as lack of contextual information, the risk of inaccurate detection, and the high computational cost. The first two could be resolved by additionally using grid-based features. However, how to extract and fuse these two types of features is uncharted. This paper proposes a Transformer-only neural architecture, dubbed GRIT (Grid- and Region-based Image captioning Transformer), that effectively utilizes the two visual features to generate better captions. GRIT replaces the CNN-based detector employed in previous methods with a DETR-based one, making it computationally faster. Moreover, its monolithic design consisting only of Transformers enables end-to-end training of the model. This innovative design and the integration of the dual visual features bring about significant performance improvement. The experimental results on several image captioning benchmarks show that GRIT outperforms previous methods in inference accuracy and speed.

Submitted 20 July, 2022; originally announced July 2022.

Comments: Accepted to ECCV 2022; 14 pages with appendix; Code: https://github.com/davidnvq/grit

arXiv:2207.03047 [pdf, other]

Single-image Defocus Deblurring by Integration of Defocus Map Prediction Tracing the Inverse Problem Computation

Authors: Qian Ye, Masanori Suganuma, Takayuki Okatani

Abstract: In this paper, we consider the problem of defocus image deblurring. Previous classical methods follow two-step approaches, i.e., first defocus map estimation and then non-blind deblurring. In the era of deep learning, some researchers have tried to address these two problems by CNN. However, the simple concatenation of the defocus map, which represents the blur level, leads to suboptimal performance. Considering the spatially variant property of the defocus blur and the blur level indicated in the defocus map, we employ the defocus map as conditional guidance to adjust the features from the input blurry images instead of simple concatenation. Then we propose a simple but effective network with spatial modulation based on the defocus map. To achieve this, we design a network consisting of three sub-networks, including the defocus map estimation network, a condition network that encodes the defocus map into condition features, and the defocus deblurring network that performs spatially dynamic modulation based on the condition features. Moreover, the spatially dynamic modulation is based on an affine transform function to adjust the features from the input blurry images. Experimental results show that our method can achieve better quantitative and qualitative evaluation performance than the existing state-of-the-art methods on the commonly used public test datasets.

Submitted 6 July, 2022; originally announced July 2022.

arXiv:2207.02539 [pdf, other]

Learning Regularized Multi-Scale Feature Flow for High Dynamic Range Imaging

Authors: Qian Ye, Masanori Suganuma, Jun Xiao, Takayuki Okatani

Abstract: Reconstructing ghosting-free high dynamic range (HDR) images of dynamic scenes from a set of multi-exposure images is a challenging task, especially with large object motion and occlusions, leading to visible artifacts using existing methods. To address this problem, we propose a deep network that tries to learn multi-scale feature flow guided by the regularized loss. It first extracts multi-scale features and then aligns features from non-reference images. After alignment, we use residual channel attention blocks to merge the features from different images. Extensive qualitative and quantitative comparisons show that our approach achieves state-of-the-art performance and produces excellent results where color artifacts and geometric distortions are significantly reduced.

Submitted 6 July, 2022; originally announced July 2022.

arXiv:2207.00067 [pdf, other]

Rethinking Unsupervised Domain Adaptation for Semantic Segmentation

Authors: Zhijie Wang, Masanori Suganuma, Takayuki Okatani

Abstract: Unsupervised domain adaptation (UDA) adapts a model trained on one domain (called source) to a novel domain (called target) using only unlabeled data. Due to the high cost of annotation, researchers have developed many UDA methods for semantic segmentation, which assume no labeled sample is available in the target domain. We question the practicality of this assumption for two reasons. First, after training a model with a UDA method, we must somehow verify the model before deployment. Second, UDA methods have at least a few hyper-parameters that need to be determined. The surest solution to these is to evaluate the model using validation data, i.e., a certain amount of labeled target-domain samples. This question about the basic assumption of UDA leads us to rethink UDA from a data-centric point of view. Specifically, we assume we have access to a minimum level of labeled data. Then, we ask how much is necessary to find good hyper-parameters of existing UDA methods. We then consider what happens if we use the same data for supervised training of the same model, e.g., finetuning. We conducted experiments to answer these questions with popular scenarios, {GTA5, SYNTHIA}$\rightarrow$Cityscapes. We found that i) choosing good hyper-parameters needs only a few labeled images for some UDA methods whereas a lot more for others; and ii) simple finetuning works surprisingly well; it outperforms many UDA methods if only several dozens of labeled images are available.

Submitted 22 January, 2024; v1 submitted 30 June, 2022; originally announced July 2022.

Comments: Under review in Pattern Recognition Letters

arXiv:2109.06432 [pdf, other]

Improved Few-shot Segmentation by Redefinition of the Roles of Multi-level CNN Features

Authors: Zhijie Wang, Masanori Suganuma, Takayuki Okatani

Abstract: This study is concerned with few-shot segmentation, i.e., segmenting the region of an unseen object class in a query image, given support image(s) of its instances. The current methods rely on the pretrained CNN features of the support and query images. The key to good performance is the proper fusion of their mid-level and high-level features; the former contains shape-oriented information, while the latter has class-oriented information. Current state-of-the-art methods follow the approach of Tian et al., which gives the mid-level features the primary role and the high-level features the secondary role. In this paper, we reinterpret this widely employed approach by redefining the roles of the multi-level features; we swap the primary and secondary roles. Specifically, we regard that the current methods improve the initial estimate generated from the high-level features using the mid-level features. This reinterpretation suggests a new application of the current methods: to apply the same network multiple times to iteratively update the estimate of the object's region, starting from its initial estimate. Our experiments show that this method is effective and has updated the previous state-of-the-art on COCO-20$^i$ in the 1-shot and 5-shot settings and on PASCAL-5$^i$ in the 1-shot setting.

Submitted 14 September, 2021; v1 submitted 14 September, 2021; originally announced September 2021.

arXiv:2109.06422 [pdf, other]

Cross-Region Domain Adaptation for Class-level Alignment

Authors: Zhijie Wang, Xing Liu, Masanori Suganuma, Takayuki Okatani

Abstract: Semantic segmentation requires a lot of training data, which necessitates costly annotation. There have been many studies on unsupervised domain adaptation (UDA) from one domain to another, e.g., from computer graphics to real images. However, there is still a gap in accuracy between UDA and supervised training on native domain data. It is arguably attributable to class-level misalignment between the source and target domain data. To cope with this, we propose a method that applies adversarial training to align two feature distributions in the target domain. It uses a self-training framework to split the image into two regions (i.e., trusted and untrusted), which form two distributions to align in the feature space. We term this approach cross-region adaptation (CRA) to distinguish it from the previous methods of aligning different domain distributions, which we call cross-domain adaptation (CDA). CRA can be applied after any CDA method. Experimental results show that this always improves the accuracy of the combined CDA method, having updated the state-of-the-art.

Submitted 6 October, 2022; v1 submitted 14 September, 2021; originally announced September 2021.

Comments: Under review in Computer Vision and Image Understanding

arXiv:2109.03585 [pdf, other]

Matching in the Dark: A Dataset for Matching Image Pairs of Low-light Scenes

Authors: Wenzheng Song, Masanori Suganuma, Xing Liu, Noriyuki Shimobayashi, Daisuke Maruta, Takayuki Okatani

Abstract: This paper considers matching images of low-light scenes, aiming to widen the frontier of SfM and visual SLAM applications. Recent image sensors can record the brightness of scenes with more than eight-bit precision, available in their RAW-format image. We are interested in making full use of such high-precision information to match extremely low-light scene images that conventional methods cannot handle. For extreme low-light scenes, even if some of their brightness information exists in the RAW format images' low bits, the standard raw image processing on cameras fails to utilize them properly. As was recently shown by Chen et al., CNNs can learn to produce images with a natural appearance from such RAW-format images. To consider if and how well we can utilize such information stored in RAW-format images for image matching, we have created a new dataset named MID (matching in the dark). Using it, we experimentally evaluated combinations of eight image-enhancing methods and eleven image matching methods consisting of classical/neural local descriptors and classical/neural initial point-matching methods. The results show the advantage of using the RAW-format images and the strengths and weaknesses of the above component methods. They also imply there is room for further research.

Submitted 14 September, 2021; v1 submitted 8 September, 2021; originally announced September 2021.

Comments: 15 pages, 14 figures, ICCV2021

MSC Class: 68T40; 68T07

arXiv:2106.00596 [pdf, other]

Look Wide and Interpret Twice: Improving Performance on Interactive Instruction-following Tasks

Authors: Van-Quang Nguyen, Masanori Suganuma, Takayuki Okatani

Abstract: There is a growing interest in the community in making an embodied AI agent perform a complicated task while interacting with an environment following natural language directives. Recent studies have tackled the problem using ALFRED, a well-designed dataset for the task, but achieved only very low accuracy. This paper proposes a new method, which outperforms the previous methods by a large margin. It is based on a combination of several new ideas. One is a two-stage interpretation of the provided instructions. The method first selects and interprets an instruction without using visual information, yielding a tentative action sequence prediction. It then integrates the prediction with the visual information etc., yielding the final prediction of an action and an object. As the class of the object to interact with is identified in the first stage, it can accurately select the correct object from the input image. Moreover, our method considers multiple egocentric views of the environment and extracts essential information by applying hierarchical attention conditioned on the current instruction. This contributes to the accurate prediction of actions for navigation. A preliminary version of the method won the ALFRED Challenge 2020. The current version achieves the unseen environment's success rate of 4.45% with a single view, which is further improved to 8.37% with multiple views.

Submitted 6 June, 2021; v1 submitted 1 June, 2021; originally announced June 2021.

Comments: To appear in IJCAI2021. 8-page main paper and Appendix following. Appendix E for details of entry submission to EAI 2021. Github: https://github.com/davidnvq/lwit-alfred

arXiv:2005.03463 [pdf, other]

How Can CNNs Use Image Position for Segmentation?

Authors: Rito Murase, Masanori Suganuma, Takayuki Okatani

Abstract: Convolution is an equivariant operation, and image position does not affect its result. A recent study shows that the zero-padding employed in convolutional layers of CNNs provides position information to the CNNs. The study further claims that the position information enables accurate inference for several tasks, such as object recognition, segmentation, etc. However, there is a technical issue with the design of the experiments of the study, and thus the correctness of the claim is yet to be verified. Moreover, the absolute image position may not be essential for the segmentation of natural images, in which target objects will appear at any image position. In this study, we investigate how positional information is and can be utilized for segmentation tasks. Toward this end, we consider positional encoding (PE) that adds channels embedding image position to the input images and compare PE with several padding methods. Considering the above nature of natural images, we choose medical image segmentation tasks, in which the absolute position appears to be relatively important, as the same organs (of different patients) are captured in similar sizes and positions. We draw a mixed conclusion from the experimental results; the positional encoding certainly works in some cases, but the absolute image position may not be as important for segmentation tasks as one might think.

Submitted 7 May, 2020; originally announced May 2020.

Comments: 11 pages

arXiv:1911.11390 [pdf, other]

Efficient Attention Mechanism for Visual Dialog that can Handle All the Interactions between Multiple Inputs

Authors: Van-Quang Nguyen, Masanori Suganuma, Takayuki Okatani

Abstract: It has been a primary concern in recent studies of vision and language tasks to design an effective attention mechanism dealing with interactions between the two modalities. The Transformer has recently been extended and applied to several bi-modal tasks, yielding promising results. For visual dialog, it becomes necessary to consider interactions between three or more inputs, i.e., an image, a question, and a dialog history, or even its individual dialog components. In this paper, we present a neural architecture named Light-weight Transformer for Many Inputs (LTMI) that can efficiently deal with all the interactions between multiple such inputs in visual dialog. It has a block structure similar to the Transformer and employs the same design of attention computation, whereas it has only a small number of parameters, yet has sufficient representational power for the purpose. Assuming a standard setting of visual dialog, a layer built upon the proposed attention block has fewer than one-tenth of the parameters of its counterpart, a natural Transformer extension. The experimental results on the VisDial datasets validate the effectiveness of the proposed approach, showing improvements of the best NDCG score on the VisDial v1.0 dataset from 57.59 to 60.92 with a single model, from 64.47 to 66.53 with ensemble models, and even to 74.88 with additional finetuning. Our implementation code is available at https://github.com/davidnvq/visdial.

Submitted 17 July, 2020; v1 submitted 26 November, 2019; originally announced November 2019.

Comments: Accepted to ECCV 2020, 14 pages. Slight change in title

arXiv:1910.09212 [pdf, other]

Analysis and a Solution of Momentarily Missed Detection for Anchor-based Object Detectors

Authors: Yusuke Hosoya, Masanori Suganuma, Takayuki Okatani

Abstract: The employment of convolutional neural networks has led to significant performance improvement on the task of object detection. However, when applying existing detectors to continuous frames in a video, we often encounter momentary miss-detection of objects, that is, objects are undetected exceptionally at a few frames, although they are correctly detected at all other frames. In this paper, we analyze the mechanism of how such miss-detection occurs. For the most popular class of detectors that are based on anchor boxes, we show the following: i) besides apparent causes such as motion blur, occlusions, background clutter, etc., the majority of remaining miss-detection can be explained by an improper behavior of the detectors at boundaries of the anchor boxes; and ii) this can be rectified by improving the way of choosing positive samples from candidate anchor boxes when training the detectors.

Submitted 16 January, 2020; v1 submitted 21 October, 2019; originally announced October 2019.

Comments: Accepted to WACV 2020, 9 pages

arXiv:1907.04508 [pdf, other]

Restoring Images with Unknown Degradation Factors by Recurrent Use of a Multi-branch Network

Authors: Xing Liu, Masanori Suganuma, Xiyang Luo, Takayuki Okatani

Abstract: The employment of convolutional neural networks has achieved unprecedented performance in the task of image restoration for a variety of degradation factors. However, high-performance networks have been specifically designed for a single degradation factor. In this paper, we tackle a harder problem, restoring a clean image from its degraded version with an unknown degradation factor, subject to the condition that it is one of the known factors. Toward this end, we design a network having multiple pairs of input and output branches and use it in a recurrent fashion such that a different branch pair is used at each of the recurrent paths. We reinforce the shared part of the network with improved components so that it can handle different degradation factors. We also propose a two-step training method for the network, which consists of multi-task learning and finetuning. The experimental results show that the proposed network yields at least comparable or sometimes even better performance on four degradation factors as compared with the best dedicated network for each of the four. We also test it on a further harder task where the input image contains multiple degradation factors that are mixed with unknown mixture ratios, showing that it achieves better performance than the previous state-of-the-art method designed for the task.

Submitted 21 January, 2020; v1 submitted 10 July, 2019; originally announced July 2019.

arXiv:1905.10628 [pdf, other]

Hyperparameter-Free Out-of-Distribution Detection Using Softmax of Scaled Cosine Similarity

Authors: Engkarat Techapanurak, Masanori Suganuma, Takayuki Okatani

Abstract: The ability to detect out-of-distribution (OOD) samples is vital to secure the reliability of deep neural networks in real-world applications. Considering the nature of OOD samples, detection methods should not have hyperparameters that need to be tuned depending on incoming OOD samples. However, most of the recently proposed methods do not meet this requirement, leading to compromised performance in real-world applications. In this paper, we propose a simple, hyperparameter-free method based on softmax of scaled cosine similarity. It resembles the approach employed by modern metric learning methods, but it differs in details; the differences are essential to achieve high detection performance. We show through experiments that our method outperforms the existing methods on the evaluation test recently proposed by Shafaei et al., which takes the above issue of hyperparameter dependency into account. We also show that it achieves at least comparable performance to other methods on the conventional test, where their hyperparameters are chosen using explicit OOD samples. Furthermore, it is computationally more efficient than most of the previous methods, since it needs only a single forward pass.

Submitted 25 November, 2019; v1 submitted 25 May, 2019; originally announced May 2019.

Comments: Extend the supplementary material
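
The abstract above names "softmax of scaled cosine similarity" without giving details. The following is a minimal sketch of that general idea only, not the authors' implementation: the fixed `scale` argument, the function names, and the use of the maximum cosine similarity as an OOD score are all assumptions made for illustration. Feature and weight vectors are assumed to be non-zero.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def scaled_cosine_softmax(feature, class_weights, scale):
    """Class probabilities from softmax(scale * cos(feature, w_c)).

    `scale` is a hypothetical fixed scalar in this sketch.
    """
    logits = [scale * cosine(feature, w) for w in class_weights]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_cosine(feature, class_weights):
    """Assumed OOD score: a low maximum similarity suggests an OOD input."""
    return max(cosine(feature, w) for w in class_weights)
```

Because cosine similarity ignores vector length, the score in this sketch depends only on the direction of the feature vector, and it is obtained from a single forward pass.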

arXiv:1903.08817 [pdf, other]

Dual Residual Networks Leveraging the Potential of Paired Operations for Image Restoration

Authors: Xing Liu, Masanori Suganuma, Zhun Sun, Takayuki Okatani

Abstract: In this paper, we study the design of deep neural networks for tasks of image restoration. We propose a novel style of residual connections dubbed "dual residual connection", which exploits the potential of paired operations, e.g., up- and down-sampling or convolution with large- and small-size kernels. We design a modular block implementing this connection style; it is equipped with two containers to which arbitrary paired operations are inserted. Adopting the "unraveled" view of the residual networks proposed by Veit et al., we point out that a stack of the proposed modular blocks allows the first operation in a block to interact with the second operation in any subsequent blocks. Specifying the two operations in each of the stacked blocks, we build a complete network for each individual task of image restoration. We experimentally evaluate the proposed approach on five image restoration tasks using nine datasets. The results show that the proposed networks with properly chosen paired operations outperform previous methods on almost all of the tasks and datasets.

Submitted 7 April, 2019; v1 submitted 20 March, 2019; originally announced March 2019.

Comments: i) Accepted to CVPR 2019 ii) Code, trained models and additional results for visual comparison will be provided at https://github.com/liu-vis/DualResidualNetworks

arXiv:1812.00733 [pdf, other]

Attention-based Adaptive Selection of Operations for Image Restoration in the Presence of Unknown Combined Distortions

Authors: Masanori Suganuma, Xing Liu, Takayuki Okatani

Abstract: Many studies have been conducted so far on image restoration, the problem of restoring a clean image from its distorted version. There are many different types of distortion which affect image quality. Previous studies have focused on single types of distortion, proposing methods for removing them. However, image quality degrades due to multiple factors in the real world. Thus, depending on applications, e.g., vision for autonomous cars or surveillance cameras, we need to be able to deal with multiple combined distortions with unknown mixture ratios. For this purpose, we propose a simple yet effective layer architecture of neural networks. It performs multiple operations in parallel, which are weighted by an attention mechanism to enable selection of proper operations depending on the input. The layer can be stacked to form a deep network, which is differentiable and thus can be trained in an end-to-end fashion by gradient descent. The experimental results show that the proposed method works better than previous methods by a good margin on tasks of restoring images with multiple combined distortions.

Submitted 7 April, 2019; v1 submitted 3 December, 2018; originally announced December 2018.

Comments: CVPR 2019
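
The layer described in the abstract above runs several operations in parallel and combines them with input-dependent attention weights. A toy sketch of that pattern on 1-D signals follows. It is not the paper's architecture: the two operations, the linear map from the input mean to attention logits, and all names are hypothetical choices for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def identity(x):
    """Operation 1 (hypothetical): pass the signal through unchanged."""
    return list(x)

def mean_filter3(x):
    """Operation 2 (hypothetical): 3-tap box filter with edge replication."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0
            for i in range(n)]

def attention_weighted_layer(x, operations, attn_params):
    """Weighted sum of parallel operation outputs.

    attn_params holds one (a, b) pair per operation; the attention logit
    is a * mean(x) + b, an assumed stand-in for a learned attention branch.
    """
    mean_x = sum(x) / len(x)
    weights = softmax([a * mean_x + b for a, b in attn_params])
    outputs = [op(x) for op in operations]
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(x))]
```

For example, `attention_weighted_layer([0.0, 3.0, 0.0], [identity, mean_filter3], [(0.0, 0.0), (0.0, 0.0)])` gives equal weight to both operations. Since every step is a smooth function of its parameters, a layer of this shape can in principle be trained by gradient descent, as the abstract notes.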

arXiv:1803.00370 [pdf, other]

Exploiting the Potential of Standard Convolutional Autoencoders for Image Restoration by Evolutionary Search

Authors: Masanori Suganuma, Mete Ozay, Takayuki Okatani

Abstract: Researchers have applied deep neural networks to image restoration tasks, in which they proposed various network architectures, loss functions, and training methods. In particular, adversarial training, which is employed in recent studies, seems to be a key ingredient to success. In this paper, we show that simple convolutional autoencoders (CAEs) built upon only standard network components, i.e., convolutional layers and skip connections, can outperform the state-of-the-art methods which employ adversarial training and sophisticated loss functions. The secret is to employ an evolutionary algorithm to automatically search for good architectures. Training optimized CAEs by minimizing the $\ell_2$ loss between reconstructed images and their ground truths using the ADAM optimizer is all we need. Our experimental results show that this approach achieves 27.8 dB peak signal to noise ratio (PSNR) on the CelebA dataset and 40.4 dB on the SVHN dataset, compared to 22.8 dB and 33.0 dB provided by the former state-of-the-art methods, respectively.

Submitted 1 March, 2018; originally announced March 2018.

Comments: Our code is available at https://github.com/sg-nm/Evolutionary-Autoencoders
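
The abstract above reports results as PSNR in dB and trains with an $\ell_2$ loss. For readers less familiar with the metric, here is a short sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE). The flat-list image representation and the default peak value of 255 are assumptions for illustration, not details taken from the paper.

```python
import math

def mse(reference, reconstruction):
    """Mean squared error between two equally sized pixel sequences."""
    return sum((r - p) ** 2
               for r, p in zip(reference, reconstruction)) / len(reference)

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(reference, reconstruction)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)
```

Because PSNR is logarithmic in the MSE, a 5.0 dB gain at a fixed peak value corresponds to roughly a 3.2-fold reduction in mean squared error.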

arXiv:1704.00764 [pdf, other]

A Genetic Programming Approach to Designing Convolutional Neural Network Architectures

Authors: Masanori Suganuma, Shinichi Shirakawa, Tomoharu Nagao

Abstract: The convolutional neural network (CNN), which is one of the deep learning models, has seen much success in a variety of computer vision tasks. However, designing CNN architectures still requires expert knowledge and a lot of trial and error. In this paper, we attempt to automatically construct CNN architectures for an image classification task based on Cartesian genetic programming (CGP). In our method, we adopt highly functional modules, such as convolutional blocks and tensor concatenation, as the node functions in CGP. The CNN structure and connectivity represented by the CGP encoding method are optimized to maximize the validation accuracy. To evaluate the proposed method, we constructed a CNN architecture for the image classification task with the CIFAR-10 dataset. The experimental result shows that the proposed method can be used to automatically find the competitive CNN architecture compared with state-of-the-art models.

Submitted 11 August, 2017; v1 submitted 3 April, 2017; originally announced April 2017.

Comments: This is the revised version of the GECCO 2017 paper. The code of our method is available at https://github.com/sg-nm/cgp-cnn
