
doi 10.23919/MVA57639.2023.10215935

MVA2023 Small Object Detection Challenge for Spotting Birds: Dataset, Methods, and Results

Authors: Yuki Kondo, Norimichi Ukita, Takayuki Yamaguchi, Hao-Yu Hou, Mu-Yi Shen, Chia-Chi Hsu, En-Ming Huang, Yu-Chen Huang, Yu-Cheng Xia, Chien-Yao Wang, Chun-Yi Lee, Da Huo, Marc A. Kastner, Tingwei Liu, Yasutomo Kawanishi, Takatsugu Hirayama, Takahiro Komamizu, Ichiro Ide, Yosuke Shinya, Xinyao Liu, Guang Liang, Syusuke Yasui

Abstract: Small Object Detection (SOD) is an important machine vision topic because (i) a variety of real-world applications require object detection for distant objects and (ii) SOD is a challenging task due to the noisy, blurred, and less-informative image appearances of small objects. This paper proposes a new SOD dataset consisting of 39,070 images including 137,121 bird instances, which is called the Small Object Detection for Spotting Birds (SOD4SB) dataset. The detail of the challenge with the SOD4SB dataset is introduced in this paper. In total, 223 participants joined this challenge. This paper briefly introduces the award-winning methods. The dataset, the baseline code, and the website for evaluation on the public testset are publicly available.

Submitted 18 July, 2023; originally announced July 2023.

Comments: This paper is included in the proceedings of the 18th International Conference on Machine Vision Applications (MVA2023). It will be officially published at a later date. Project page: https://www.mva-org.jp/mva2023/challenge

Journal ref: 2023 18th International Conference on Machine Vision and Applications (MVA)

arXiv:2303.08380

Pre-instruction for Pedestrians Interacting Autonomous Vehicles with an eHMI: Effects on Their Psychology and Walking Behavior

Authors: Hailong Liu, Takatsugu Hirayama

Abstract: eHMIs refer to a novel and explicit communication method for pedestrian-AV negotiation in interactions, such as in encounter scenarios. However, pedestrians with limited experience in negotiating with AVs could lack a comprehensive and correct understanding of the meaning of the driving-intention information conveyed by AVs through eHMI, particularly in the current context where AVs and eHMIs are not yet mainstream. Consequently, pedestrians who misunderstand the driving intention of the AVs during the encounter may feel threatened and perform unpredictable behaviors. To solve this issue, this study proposes using the pre-instruction on the rationale of eHMI to help pedestrians correctly understand driving intentions and predict AV behavior. Consequently, this can improve their subjective feelings (i.e., sense of danger, trust in AV, and sense of relief) and decision-making. In addition, this study suggests that the eHMI could better guide pedestrian behavior through the pre-instruction. The results of interaction experiments in the road crossing scene show that participants found it more difficult to recognize the situation when they encountered an AV without eHMI than when they encountered a manual driving vehicle (MV); in addition, participants' subjective feelings and hesitation during decision-making worsened significantly. After the pre-instruction, the participants could understand the driving intention of an AV with eHMI and predict driving behavior more easily. Furthermore, the participants' subjective feelings and hesitation to make decisions improved, reaching the same criteria used for MV. Moreover, this study found that the information guidance of using eHMI influenced the participants' walking speed, resulting in a small variation over the time horizon via multiple trials when they fully understood the principle of eHMI through the pre-instruction.

Submitted 15 March, 2023; originally announced March 2023.

arXiv:2303.03144

IPA-CLIP: Integrating Phonetic Priors into Vision and Language Pretraining

Authors: Chihaya Matsuhira, Marc A. Kastner, Takahiro Komamizu, Takatsugu Hirayama, Keisuke Doman, Yasutomo Kawanishi, Ichiro Ide

Abstract: Recently, large-scale Vision and Language (V&L) pretraining has become the standard backbone of many multimedia systems. While it has shown remarkable performance even in unseen situations, it often performs in ways not intuitive to humans. In particular, such models usually do not consider the pronunciation of the input, which humans would utilize to understand language, especially when it comes to unknown words. Thus, this paper inserts a phonetic prior into Contrastive Language-Image Pretraining (CLIP), one of the V&L pretrained models, to make it consider the pronunciation similarity among its pronunciation inputs. To achieve this, we first propose a phoneme embedding that utilizes the phoneme relationships provided by the International Phonetic Alphabet (IPA) chart as a phonetic prior. Next, by distilling the frozen CLIP text encoder, we train a pronunciation encoder employing the IPA-based embedding. The proposed model named IPA-CLIP comprises this pronunciation encoder and the original CLIP encoders (image and text). Quantitative evaluation reveals that the phoneme distribution on the embedding space represents phonetic relationships more accurately when using the proposed phoneme embedding. Furthermore, in some multimodal retrieval tasks, we confirm that the proposed pronunciation encoder enhances the performance of the text encoder and that the pronunciation encoder handles nonsense words in a more phonetic manner than the text encoder. Finally, qualitative evaluation verifies the correlation between the pronunciation encoder and human perception regarding pronunciation similarity.

Submitted 6 March, 2023; originally announced March 2023.

Comments: 11 pages, 8 figures, 5 Tables

arXiv:2105.03891

Interaction Detection Between Vehicles and Vulnerable Road Users: A Deep Generative Approach with Attention

Authors: Hao Cheng, Li Feng, Hailong Liu, Takatsugu Hirayama, Hiroshi Murase, Monika Sester

Abstract: Intersections where vehicles are permitted to turn and interact with vulnerable road users (VRUs) like pedestrians and cyclists are among some of the most challenging locations for automated and accurate recognition of road users' behavior. In this paper, we propose a deep conditional generative model for interaction detection at such locations. It aims to automatically analyze massive video data about the continuity of road users' behavior. This task is essential for many intelligent transportation systems such as traffic safety control and self-driving cars that depend on the understanding of road users' locomotion. A Conditional Variational Auto-Encoder based model with Gaussian latent variables is trained to encode road users' behavior and perform probabilistic and diverse predictions of interactions. The model takes as input the information of road users' type, position and motion automatically extracted by a deep learning object detector and optical flow from videos, and generates frame-wise probabilities that represent the dynamics of interactions between a turning vehicle and any VRUs involved. The model's efficacy was validated by testing on real-world datasets acquired from two different intersections. It achieved an F1-score above 0.96 at a right-turn intersection in Germany and 0.89 at a left-turn intersection in Japan, both with very busy traffic flows.

Submitted 9 May, 2021; originally announced May 2021.

arXiv:2102.07958

doi 10.1109/IV48863.2021.9575246

Importance of Instruction for Pedestrian-Automated Driving Vehicle Interaction with an External Human Machine Interface: Effects on Pedestrians' Situation Awareness, Trust, Perceived Risks and Decision Making

Authors: Hailong Liu, Takatsugu Hirayama, Masaya Watanabe

Abstract: Compared to a manual driving vehicle (MV), an automated driving vehicle lacks a way to communicate with the pedestrian through the driver when it interacts with the pedestrian because the driver usually does not participate in driving tasks. Thus, an external human machine interface (eHMI) can be viewed as a novel explicit communication method for providing driving intentions of an automated driving vehicle (AV) to pedestrians when they need to negotiate in an interaction, e.g., an encountering scene. However, the eHMI may not guarantee that the pedestrians will fully recognize the intention of the AV. In this paper, we propose that the instruction of the eHMI's rationale can help pedestrians correctly understand the driving intentions and predict the behavior of the AV, and thus their subjective feelings (i.e., dangerous feeling, trust in the AV, and feeling of relief) and decision-making are also improved. The results of an interaction experiment in a road-crossing scene indicate that the participants found it more difficult to be aware of the situation when they encountered an AV w/o eHMI compared to when they encountered an MV; further, the participants' subjective feelings and hesitation in decision-making also deteriorated significantly. When the eHMI was used in the AV, the situational awareness, subjective feelings and decision-making of the participants regarding the AV w/ eHMI were improved. After the instruction, it was easier for the participants to understand the driving intention and predict driving behavior of the AV w/ eHMI. Further, the subjective feelings and the hesitation related to decision-making were improved and reached the same standards as those for the MV.

Submitted 30 May, 2021; v1 submitted 15 February, 2021; originally announced February 2021.

Comments: 5 figures, Accepted by IEEE IV2021

arXiv:2003.00689

doi 10.1109/ITSC45102.2020.9294696

What Timing for an Automated Vehicle to Make Pedestrians Understand Its Driving Intentions for Improving Their Perception of Safety?

Authors: Hailong Liu, Takatsugu Hirayama, Luis Yoichi Morales, Hiroshi Murase

Abstract: Although automated driving systems have been used frequently, they are still unpopular in society. To increase the popularity of automated vehicles (AVs), assisting pedestrians to accurately understand the driving intentions and improving their perception of safety when interacting with AVs are considered effective. Therefore, the AV should send information about its driving intention to pedestrians when they interact with each other. However, the following questions should be answered regarding how the AV sends the information to them: 1) What timing for an AV to make pedestrians understand its driving intentions after being noticed by them? 2) What timing for an AV to make pedestrians feel safe after being noticed by them? Thirteen participants were invited to interact with a manually driven vehicle and an AV in an experiment. The participants' gaze information and a subjective evaluation of their understanding of the driving intention as well as their perception of safety were collected. By analyzing the participants' gaze duration on the vehicle with their subjective evaluations, we found that the AV should enable the pedestrian to accurately understand its driving intention within 0.5 to 6.5 s and make the pedestrian feel safe within 0.5 to 8.0 s while the pedestrian is gazing at it.

Submitted 12 June, 2020; v1 submitted 2 March, 2020; originally announced March 2020.

Comments: Accepted by IEEE ITSC 2020, 7 pages, 9 figures, 1 table

arXiv:2001.01340

doi 10.1080/10447318.2022.2073006

What Is the Gaze Behavior of Pedestrians in Interactions with an Automated Vehicle When They Do Not Understand Its Intentions?

Authors: Hailong Liu, Takatsugu Hirayama, Luis Yoichi Morales, Hiroshi Murase

Abstract: Interactions between pedestrians and automated vehicles (AVs) will increase significantly with the popularity of AVs. However, pedestrians often do not have enough trust in AVs, particularly when they are confused about an AV's intention in an interaction. This study seeks to evaluate if pedestrians clearly understand the driving intentions of AVs in interactions and presents experimental research on the relationship between gaze behaviors of pedestrians and their understanding of the intentions of the AV. The hypothesis investigated in this study was that the less the pedestrian understands the driving intentions of the AV, the longer the duration of their gazing behavior will be. A pedestrian-vehicle interaction experiment was designed to verify the proposed hypothesis. A robotic wheelchair was used as the manual driving vehicle (MV) and AV for interacting with pedestrians while pedestrians' gaze data and their subjective evaluation of the driving intentions were recorded. The experimental results supported our hypothesis as there was a negative correlation between the pedestrians' gaze duration on the AV and their understanding of the driving intentions of the AV. Moreover, the gaze duration of most of the pedestrians on the MV was shorter than that on an AV. Therefore, we conclude with two recommendations to designers of external human-machine interfaces (eHMI): (1) when a pedestrian is engaged in an interaction with an AV, the driving intentions of the AV should be provided; (2) if the pedestrian still gazes at the AV after the AV displays its driving intentions, the AV should provide clearer information about its driving intentions.

Submitted 12 May, 2020; v1 submitted 5 January, 2020; originally announced January 2020.

Comments: 10 pages, 10 figures

arXiv:1905.05601

doi 10.1016/j.ifacol.2019.12.073

Saliency difference based objective evaluation method for a superimposed screen of the HUD with various background

Authors: Hailong Liu, Toshihiro Hiraoka, Takatsugu Hirayama, Dongmin Kim

Abstract: The head-up display (HUD) is an emerging device which can project information on a transparent screen. The HUD has been used in airplanes and vehicles, and it is usually placed in front of the operator's view. In the case of the vehicle, the driver can see not only various information on the HUD but also the backgrounds (driving environment) through the HUD. However, the projected information on the HUD may interfere with the colors in the background because the HUD is transparent. For example, a red message on the HUD will be less noticeable when there is an overlap between it and the red brake light from the front vehicle. As the first step to solve this issue, how to evaluate the mutual interference between the information on the HUD and backgrounds is important. Therefore, this paper proposes a method to evaluate the mutual interference based on saliency. It can be evaluated by comparing the HUD part cut from a saliency map of a measured image with the HUD image.

Submitted 13 May, 2019; originally announced May 2019.

Comments: 10 pages, 5 figures, 1 table, accepted by IFAC-HMS 2019

Showing 1–8 of 8 results for author: Hirayama, T