Search | arXiv e-print repository

arXiv:2407.00985 [pdf, other]

Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models

Authors: Takayuki Nishimura, Katsuyuki Kuyo, Motonari Kambara, Komei Sugiura

Abstract: We consider the task of generating segmentation masks for the target object from an object manipulation instruction, which allows users to give open vocabulary instructions to domestic service robots. Conventional segmentation generation approaches often fail to account for objects outside the camera's field of view and cases in which the order of vertices differs but still represents the same polygon, which leads to erroneous mask generation. In this study, we propose a novel method that generates segmentation masks from open vocabulary instructions. We implement a novel loss function using optimal transport to prevent significant loss where the order of vertices differs but still represents the same polygon. To evaluate our approach, we constructed a new dataset based on the REVERIE dataset and Matterport3D dataset. The results demonstrated the effectiveness of the proposed method compared with existing mask generation methods. Remarkably, our best model achieved a +16.32% improvement on the dataset compared with a representative polygon-based method.

Submitted 1 July, 2024; originally announced July 2024.

Comments: Accepted for presentation at IROS2024

arXiv:2404.03161 [pdf, other]

BioVL-QR: Egocentric Biochemical Video-and-Language Dataset Using Micro QR Codes

Authors: Taichi Nishimura, Koki Yamamoto, Yuto Haneji, Keiya Kajimura, Chihiro Nishiwaki, Eriko Daikoku, Natsuko Okuda, Fumihito Ono, Hirotaka Kameko, Shinsuke Mori

Abstract: This paper introduces a biochemical vision-and-language dataset, which consists of 24 egocentric experiment videos, corresponding protocols, and video-and-language alignments. The key challenge in the wet-lab domain is that detecting equipment, reagents, and containers is difficult because the lab environment is cluttered with objects on the table and some objects are indistinguishable. Therefore, previous studies assume that objects are manually annotated and given for downstream tasks, but this is costly and time-consuming. To address this issue, this study focuses on Micro QR Codes to detect objects automatically. From our preliminary study, we found that detecting objects using only Micro QR Codes is still difficult because the researchers manipulate objects, frequently causing blur and occlusion. To address this, we also propose a novel object labeling method that combines a Micro QR Code detector and an off-the-shelf hand object detector. As one of the applications of our dataset, we conduct the task of generating protocols from experiment videos and find that our approach can generate accurate protocols.

Submitted 3 April, 2024; originally announced April 2024.

Comments: 6 pages

arXiv:2404.02523 [pdf, other]

Text-driven Affordance Learning from Egocentric Vision

Authors: Tomoya Yoshida, Shuhei Kurita, Taichi Nishimura, Shinsuke Mori

Abstract: Visual affordance learning is a key component for robots to understand how to interact with objects. Conventional approaches in this field rely on pre-defined objects and actions, falling short of capturing diverse interactions in real-world scenarios. The key idea of our approach is employing textual instruction, targeting various affordances for a wide range of objects. This approach covers both hand-object and tool-object interactions. We introduce text-driven affordance learning, aiming to learn contact points and manipulation trajectories from an egocentric view following textual instruction. In our task, contact points are represented as heatmaps, and the manipulation trajectory as sequences of coordinates that incorporate both linear and rotational movements for various manipulations. However, when we gather data for this task, manual annotations of these diverse interactions are costly. To this end, we propose a pseudo dataset creation pipeline and build a large pseudo-training dataset: TextAFF80K, consisting of over 80K instances of the contact points, trajectories, images, and text tuples. We extend existing referring expression comprehension models for our task, and experimental results show that our approach robustly handles multiple affordances, serving as a new standard for affordance learning in real-world scenarios.

Submitted 3 April, 2024; originally announced April 2024.

arXiv:2403.16483 [pdf, other]

Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks

Authors: Keyaki Ohno, Hirotaka Kameko, Keisuke Shirai, Taichi Nishimura, Shinsuke Mori

Abstract: Geoparsing is the task of estimating the latitude and longitude (coordinates) of location expressions in texts. Geoparsing must deal with the ambiguity of the expressions that indicate multiple locations with the same notation. For evaluating geoparsing systems, several corpora have been proposed in previous work. However, these corpora are small-scale and suffer from limited coverage of location expressions in general domains. In this paper, we propose Wikipedia Hyperlink-based Location Linking (WHLL), a novel method to construct a large-scale corpus for geoparsing from Wikipedia articles. WHLL leverages hyperlinks in Wikipedia to annotate multiple location expressions with coordinates. With this method, we constructed the WHLL corpus, a new large-scale corpus for geoparsing. The WHLL corpus consists of 1.3M articles, each containing about 7.8 unique location expressions. 45.6% of location expressions are ambiguous and refer to more than one location with the same notation. In each article, the location expressions in the article title and in hyperlinks to other articles are assigned coordinates. By utilizing hyperlinks, we can accurately assign coordinates to location expressions even when they are ambiguous in the texts. Experimental results show that there remains room for improvement by disambiguating location expressions.

Submitted 25 March, 2024; originally announced March 2024.

Comments: LREC-COLING 2024

arXiv:2403.16320 [pdf]

doi 10.1109/LRA.2024.3376953

Single-Motor Robotic Gripper with Multi-Surface Fingers for Variable Grasping Configurations

Authors: Toshihiro Nishimura, Yosuke Suzuki, Tokuo Tsuji, Tetsuyou Watanabe

Abstract: This study proposes a novel robotic gripper with variable grasping configurations for grasping various objects. The fingers of the developed gripper incorporate multiple different surfaces. The gripper possesses the function of altering the finger surfaces facing a target object by rotating the fingers in their longitudinal direction. In the proposed design equipped with two fingers, the two fingers incorporate three and four surfaces, respectively, resulting in the nine available grasping configurations by the combination of these finger surfaces. The developed gripper is equipped with the functions of opening/closing its fingers for grasping and rotating its fingers to alter the grasping configuration, all achieved with a single motor. To enable the two motions using a single motor, this study introduces a self-motion switching mechanism utilizing magnets. This mechanism automatically transitions between gripper motions based on the direction of the motor rotation when the gripper is fully opened. In this state, rotating the motor towards closing initiates the finger closing action, while further opening the fingers from the fully opened state activates the finger rotation. This letter presents the gripper design, the mechanics of the self-motion switching mechanism, the control method, and the grasping configuration selection strategy. The performance of the gripper is experimentally demonstrated.

Submitted 24 March, 2024; originally announced March 2024.

Journal ref: in IEEE Robotics and Automation Letters, vol. 9, no. 5, pp. 4114-4121, May 2024

arXiv:2401.09774 [pdf, other]

On the Audio Hallucinations in Large Audio-Video Language Models

Authors: Taichi Nishimura, Shota Nakada, Masayoshi Kondo

Abstract: Large audio-video language models can generate descriptions for both video and audio. However, they sometimes ignore audio content, producing audio descriptions solely reliant on visual information. This paper refers to this as audio hallucinations and analyzes them in large audio-video language models. We gather 1,000 sentences by inquiring about audio information and annotate whether they contain hallucinations. If a sentence is hallucinated, we also categorize the type of hallucination. The results reveal that 332 sentences are hallucinated, with distinct trends observed in nouns and verbs for each hallucination type. Based on this, we tackle a task of audio hallucination classification using pre-trained audio-text models in the zero-shot and fine-tuning settings. Our experimental results reveal that the zero-shot models achieve higher performance (52.2% in F1) than the random baseline (40.3%), and that the fine-tuning models achieve 87.9%, outperforming the zero-shot models.

Submitted 18 January, 2024; originally announced January 2024.

Comments: 6 pages

arXiv:2312.00414 [pdf, other]

Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval

Authors: Taichi Nishimura, Shota Nakada, Masayoshi Kondo

Abstract: In this paper, we propose an efficient and high-performance method for partially relevant video retrieval, which aims to retrieve long videos that contain at least one moment relevant to the input text query. The challenge lies in encoding dense frames using visual backbones. This requires models to handle the increased frames, resulting in significant computation costs for long videos. To mitigate the costs, previous studies use lightweight visual backbones, yielding sub-optimal retrieval performance due to their limited capabilities. However, it is undesirable to simply replace the backbones with high-performance large vision-and-language models (VLMs) due to their low efficiency. To address this dilemma, instead of dense frames, we focus on super images, which are created by rearranging the video frames in an $N \times N$ grid layout. This reduces the number of visual encodings to $\frac{1}{N^2}$ and mitigates the low efficiency of large VLMs. Based on this idea, we make two contributions. First, we explore whether VLMs generalize to super images in a zero-shot setting. To this end, we propose a method called query-attentive super image retrieval (QASIR), which attends to partial moments relevant to the input query. The zero-shot QASIR yields two discoveries: (1) it enables VLMs to generalize to super images and (2) the grid size $N$, image resolution, and VLM size are key trade-off parameters between performance and computation costs. Second, we introduce fine-tuning and hybrid QASIR that combines high- and low-efficiency models to strike a balance between performance and computation costs. This reveals two findings: (1) the fine-tuning QASIR enhances VLMs to learn super images effectively, and (2) the hybrid QASIR minimizes the performance drop of large VLMs while reducing the computation costs.
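
Illustrative note: the super-image construction mentioned in the abstract above (video frames rearranged into an $N \times N$ grid, so that roughly $\frac{1}{N^2}$ as many images need to be encoded) can be sketched as follows. This is a minimal sketch and not the authors' implementation: the function name `make_super_images`, the row-major tiling order, and the choice to pad a short final grid by repeating the last frame are all assumptions made for illustration.

```python
import numpy as np


def make_super_images(frames: np.ndarray, n: int) -> np.ndarray:
    """Tile consecutive frames into n x n grids (hypothetical helper).

    frames: array of shape (T, H, W, C).
    Returns an array of shape (ceil(T / n^2), n * H, n * W, C).
    Assumption: frames fill each grid in row-major order, and the last
    grid is padded by repeating the final frame.
    """
    t, h, w, c = frames.shape
    per_image = n * n
    pad = (-t) % per_image  # frames missing from the last grid
    if pad:
        filler = np.repeat(frames[-1:], pad, axis=0)
        frames = np.concatenate([frames, filler], axis=0)
    # (G, row, col, H, W, C): G groups of n * n consecutive frames
    groups = frames.reshape(-1, n, n, h, w, c)
    # Interleave grid rows with pixel rows and grid columns with pixel
    # columns, then merge them into one large image per group.
    return groups.transpose(0, 1, 3, 2, 4, 5).reshape(-1, n * h, n * w, c)


if __name__ == "__main__":
    # 8 dummy frames of size 2 x 3 with one channel; frame i is filled with i.
    video = np.stack([np.full((2, 3, 1), i) for i in range(8)])
    supers = make_super_images(video, n=2)
    print(supers.shape)  # 8 frames become 2 super images of size 4 x 6
```

With `n = 2`, eight frames yield two super images, so a visual encoder would run twice instead of eight times, which is the $\frac{1}{N^2}$ reduction the abstract refers to.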

Submitted 11 March, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

Comments: 24 pages

arXiv:2311.16444 [pdf, other]

Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos

Authors: Takehiko Ohkawa, Takuma Yagi, Taichi Nishimura, Ryosuke Furuta, Atsushi Hashimoto, Yoshitaka Ushiku, Yoichi Sato

Abstract: We propose a novel benchmark for cross-view knowledge transfer of dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and their captions) is primarily studied with exocentric videos (e.g., YouCook2), benchmarks with egocentric videos are restricted due to data scarcity. To overcome the limited video availability, transferring knowledge from abundant exocentric web videos is demanded as a practical approach. However, learning the correspondence between exocentric and egocentric views is difficult due to their dynamic view changes. The web videos contain mixed views focusing on either human body actions or close-up hand-object interactions, while the egocentric view is constantly shifting as the camera wearer moves. This necessitates the in-depth study of cross-view transfer under complex view changes. In this work, we first create a real-life egocentric dataset (EgoYC2) whose captions are shared with YouCook2, enabling transfer learning between these datasets assuming their ground-truth is accessible. To bridge the view gaps, we propose a view-invariant learning method using adversarial training in both the pre-training and fine-tuning stages. While the pre-training is designed to learn invariant features against the mixed views in the web videos, the view-invariant fine-tuning further mitigates the view gaps between both datasets. We validate our proposed method by studying how effectively it overcomes the view change problem and efficiently transfers the knowledge to the egocentric domain. Our benchmark pushes the study of the cross-view transfer into a new task domain of dense video captioning and will envision methodologies to describe egocentric videos in natural language.

Submitted 29 November, 2023; v1 submitted 27 November, 2023; originally announced November 2023.

arXiv:2311.06855 [pdf, other]

DialMAT: Dialogue-Enabled Transformer with Moment-Based Adversarial Training

Authors: Kanta Kaneda, Ryosuke Korekata, Yuiga Wada, Shunya Nagashima, Motonari Kambara, Yui Iioka, Haruka Matsuo, Yuto Imai, Takayuki Nishimura, Komei Sugiura

Abstract: This paper focuses on the DialFRED task, which is the task of embodied instruction following in a setting where an agent can actively ask questions about the task. To address this task, we propose DialMAT. DialMAT introduces Moment-based Adversarial Training, which incorporates adversarial perturbations into the latent space of language, image, and action. Additionally, it introduces a crossmodal parallel feature extraction mechanism that applies foundation models to both language and image. We evaluated our model using a dataset constructed from the DialFRED dataset and demonstrated superior performance compared to the baseline method in terms of success rate and path weighted success rate. The model secured the top position in the DialFRED Challenge, which took place at the CVPR 2023 Embodied AI workshop.

Submitted 12 November, 2023; originally announced November 2023.

Comments: Accepted for presentation at Fourth Annual Embodied AI Workshop at CVPR

arXiv:2310.17197 [pdf]

doi 10.1109/TMECH.2023.3294491

Lightweight High-Speed and High-Force Gripper for Assembly

Authors: Toshihiro Nishimura, Takeshi Takaki, Yosuke Suzuki, Tokuo Tsuji, Tetsuyou Watanabe

Abstract: This paper presents a novel industrial robotic gripper with a high grasping speed (maximum: 1396 mm/s), high tip force (maximum: 80 N) for grasping, large motion range, and lightweight design (0.3 kg). To realize these features, the high-speed section of the quick-return mechanism and load-sensitive continuously variable transmission mechanism are installed in the gripper. The gripper is also equipped with a self-centering function. The high grasping speed and self-centering function improve the cycle time in robotic operations. In addition, the high tip force is advantageous for stably grasping and assembling heavy objects. Moreover, the design of the gripper reduces the gripper's proportion of the manipulator's payload, thus increasing the weight of the object that can be grasped. The gripper performance was validated through kinematic and static analyses as well as experimental evaluations. This paper also presents the analysis of the self-centering function of the developed gripper.

Submitted 26 October, 2023; originally announced October 2023.

arXiv:2310.17192 [pdf]

doi 10.1109/LRA.2023.3315559

Single-Motor Robotic Gripper With Three Functional Modes for Grasping in Confined Spaces

Authors: Toshihiro Nishimura, Tetsuyou Watanabe

Abstract: This study proposes a novel robotic gripper driven by a single motor. The main task is to pick up objects in confined spaces. For this purpose, the developed gripper has three operating modes: grasping, finger-bending, and pull-in modes. Using these three modes, the developed gripper can rotate and translate a grasped object, i.e., can perform in-hand manipulation. This in-hand manipulation is effective for grasping in extremely confined spaces, such as the inside of a box in a shelf, to avoid interference between the grasped object and obstacles. To achieve the three modes using a single motor, the developed gripper is equipped with two novel self-motion switching mechanisms. These mechanisms switch their motions automatically when the motion being generated is prevented. An analysis of the mechanism and control methodology used to achieve the desired behavior are presented. Furthermore, the validity of the analysis and methodology are experimentally demonstrated. The gripper performance is also evaluated through the grasping tests.

Submitted 26 October, 2023; originally announced October 2023.

arXiv:2306.13894 [pdf, other]

OUXT Polaris: Autonomous Navigation System for the 2022 Maritime RobotX Challenge

Authors: Kenta Okamoto, Akihisa Nagata, Kyoma Arai, Yusei Nagao, Tatsuki Nishimura, Kento Hirogaki, Shunya Tanaka, Masato Kobayashi, Tatsuya Sanada, Masaya Kataoka

Abstract: OUXT-Polaris has been developing an autonomous navigation system by participating in the Maritime RobotX Challenge 2014, 2016, and 2018. In this paper, we describe the improvement of the previous vessel system. We also indicate the advantage of the improved design. Moreover, we describe the developing method under Covid-19 using simulation / miniature-size hardware and the feature components for the next RobotX Challenge.

Submitted 24 June, 2023; originally announced June 2023.

Comments: Technical Design Paper of 2022 Maritime RobotX Challenge

arXiv:2303.04929 [pdf]

doi 10.1109/LRA.2023.3254465

Flexible and slim device switching air blowing and suction by a single airflow control

Authors: Seita Nojiri, Toshihiro Nishimura, Kenjiro Tadakuma, Tetsuyou Watanabe

Abstract: This study proposes a soft robotic device with a slim and flexible body that switches between air blowing and suction with a single airflow control. Suction is achieved by jet flow entraining surrounding air, and blowing is achieved by blocking and reversing jet flow. The thin and flexible flap gate enables the switching. Air flow is blocked while the gate is closed and passes through while the gate is open. The opening and closing of the flap gate are controlled by the expansion of the inflatable chambers installed near the gate. The extent of expansion is determined by the upstream static pressure. Therefore, the gate can be controlled by the input airflow rate. The dimensions of the flap gate are introduced as a design parameter, and we show that the parameter contributes to the blowing and suction capacities. We also experimentally demonstrate that the proposed device is available for a variable friction system and an end effector for picking up a thin object covered with dust.

Submitted 8 March, 2023; originally announced March 2023.

arXiv:2303.04927 [pdf]

doi 10.1109/LRA.2023.3254461

High-payload and self-adaptive robotic hand with 1-degree-of-freedom translation/rotation switching mechanism

Authors: Toshihiro Nishimura, Tsubasa Muryoe, Tetsuyou Watanabe

Abstract: This study proposes a novel robotic hand that can achieve self-adaptive grasping and a large payload (over 20 kg) with a single actuator. Accordingly, two novel mechanisms, an actuation system with self-motion switching and a self-adaptive finger with a self-locking mechanism, are installed in a 1-degree-of-freedom robotic hand. The actuation system switches the output motion from translational to rotational according to the applied external load. The finger is bent by inserting a flexible shaft inside it. Its bending posture can conform to the shape of the object owing to the flexible shaft, and the posture is fixed by a self-locking mechanism, which can be released by the rotational motion of the actuation system. This study presents a mechanical analysis of these mechanisms to achieve the desired behavior. The analysis was validated experimentally, and a robotic hand with these mechanisms was evaluated using grasping tests.

Submitted 8 March, 2023; originally announced March 2023.

arXiv:2301.03196 [pdf, other]

doi 10.1109/GLOBECOM54140.2023.10437162

Near-optimal stochastic MIMO signal detection with a mixture of t-distribution prior

Authors: Junichiro Hagiwara, Kazushi Matsumura, Hiroki Asumi, Yukiko Kasuga, Toshihiko Nishimura, Takanori Sato, Yasutaka Ogawa, Takeo Ohgane

Abstract: Multiple-input multiple-output (MIMO) systems will play a crucial role in future wireless communication, but improving their signal detection performance to increase transmission efficiency remains a challenge. To address this issue, we propose extending the discrete signal detection problem in MIMO systems to a continuous one and applying the Hamiltonian Monte Carlo method, an efficient Markov chain Monte Carlo algorithm. In our previous studies, we have used a mixture of normal distributions for the prior distribution. In this study, we propose using a mixture of t-distributions, which further improves detection performance. Based on our theoretical analysis and computer simulations, the proposed method can achieve near-optimal signal detection with polynomial computational complexity. This high-performance and practical MIMO signal detection could contribute to the development of the 6th-generation mobile network.

Submitted 7 March, 2024; v1 submitted 9 January, 2023; originally announced January 2023.

Comments: Published in the 2023 IEEE Global Communications Conference (GLOBECOM)

arXiv:2211.05303 [pdf]

doi 10.1109/LRA.2022.3187823

1-degree-of-freedom Robotic Gripper With Infinite Self-Twist Function

Authors: Toshihiro Nishimura, Yosuke Suzuki, Tokuo Tsuji, Tetsuyou Watanabe

Abstract: This study proposed a novel robotic gripper that can achieve grasping and infinite wrist twisting motions using a single actuator. The gripper is equipped with a differential gear mechanism that allows switching between the grasping and twisting motions according to the magnitude of the tip force applied to the finger. The grasping motion is activated when the tip force is below a set value, and the wrist twisting motion is activated when the tip force exceeds this value. "Twist grasping," a special grasping mode that allows the wrapping of a flexible thin object around the fingers of the gripper, can be achieved by the twisting motion. Twist grasping is effective for handling objects with flexible thin parts, such as laminated packaging pouches, that are difficult to grasp using conventional antipodal grasping. In this study, the gripper design is presented, and twist grasping is analyzed. The gripper performance is experimentally validated.

Submitted 9 November, 2022; originally announced November 2022.

arXiv:2211.05257 [pdf]

doi 10.1109/LRA.2022.3192653

Single-Fingered Reconfigurable Robotic Gripper With a Folding Mechanism for Narrow Working Spaces

Authors: Toshihiro Nishimura, Tsubasa Muryoe, Yoshitatsu Asama, Hiroki Ikeuchi, Ryo Toshima, Tetsuyou Watanabe

Abstract: This letter proposes a novel single-fingered reconfigurable robotic gripper for grasping objects in narrow working spaces. The finger of the developed gripper realizes two configurations, namely, the insertion and grasping modes, using only a single motor. In the insertion mode, the finger assumes a thin shape such that it can insert its tip into a narrow space. The grasping mode of the finger is activated through a folding mechanism. Mode switching can be achieved in two ways: switching the mode actively by a motor, or combining passive rotation of the fingertip through contact with the support surface and active motorized construction of the claw. The latter approach is effective when it is unclear how much finger insertion is required for a specific task. The structure provides a simple control scheme. The performance of the proposed robotic gripper design and control methodology was experimentally evaluated. The minimum width of the insertion space required to grasp an object is 4 mm (1 mm, when using a strategy).

Submitted 21 November, 2022; v1 submitted 9 November, 2022; originally announced November 2022.

Comments: This study was presented at IROS 2022

Journal ref: IEEE Robotics and Automation Letters, Vol.7, No.4 (2022) 10192-10199

arXiv:2209.10134 [pdf, other]

Recipe Generation from Unsegmented Cooking Videos

Authors: Taichi Nishimura, Atsushi Hashimoto, Yoshitaka Ushiku, Hirotaka Kameko, Shinsuke Mori

Abstract: This paper tackles recipe generation from unsegmented cooking videos, a task that requires agents to (1) extract key events in completing the dish and (2) generate sentences for the extracted events. Our task is similar to dense video captioning (DVC), which aims at detecting events thoroughly and generating sentences for them. However, unlike DVC, in recipe generation, recipe story awareness is crucial, and a model should extract an appropriate number of events in the correct order and generate accurate sentences based on them. We analyze the output of the DVC model and confirm that although (1) several events are adoptable as a recipe story, (2) the generated sentences for such events are not grounded in the visual content. Based on this, we set our goal to obtain correct recipes by selecting oracle events from the output events and re-generating sentences for them. To achieve this, we propose a transformer-based multimodal recurrent approach of training an event selector and sentence generator for selecting oracle events from the DVC's events and generating sentences for them. In addition, we extend the model by including ingredients to generate more accurate recipes. The experimental results show that the proposed method outperforms state-of-the-art DVC models. We also confirm that, by modeling the recipe in a story-aware manner, the proposed model outputs the appropriate number of events in the correct order.

Submitted 18 February, 2024; v1 submitted 21 September, 2022; originally announced September 2022.

Comments: Accepted at ACM TOMM; ACM Transactions on Multimedia Computing, Communications, and Applications

arXiv:2209.05840 [pdf, other]

Visual Recipe Flow: A Dataset for Learning Visual State Changes of Objects with Recipe Flows

Authors: Keisuke Shirai, Atsushi Hashimoto, Taichi Nishimura, Hirotaka Kameko, Shuhei Kurita, Yoshitaka Ushiku, Shinsuke Mori

Abstract: We present a new multimodal dataset called Visual Recipe Flow, which enables us to learn the result of each cooking action in a recipe text. The dataset consists of object state changes and the workflow of the recipe text. The state change is represented as an image pair, while the workflow is represented as a recipe flow graph (r-FG). The image pairs are grounded in the r-FG, which provides the cross-modal relation. With our dataset, one can try a range of applications, from multimodal commonsense reasoning to procedural text generation.

Submitted 13 September, 2022; originally announced September 2022.

Comments: COLING 2022

arXiv:2203.14470 [pdf]

doi 10.1109/LRA.2022.3157964

Soft robotic hand with finger-bending/friction-reduction switching mechanism through 1-degree-of-freedom flow control

Authors: Toshihiro Nishimura, Kensuke Shimizu, Seita Nojiri, Kenjiro Tadakuma, Yosuke Suzuki, Tokuo Tsuji, Tetsuyou Watanabe

Abstract: This paper proposes a novel pneumatic soft robotic hand that incorporates a mechanism that can switch the airflow path using a single airflow control. The developed hand can control the finger motion and operate the surface friction variable mechanism. In the friction variable mechanism, a lubricant is injected onto the high-friction finger surface to reduce surface friction. To inject the lubrication using a positive-pressure airflow, the Venturi effect is applied. The design and evaluation of the airflow-path switching and friction variable mechanisms are described. Moreover, the entire design of a soft robotic hand equipped with these mechanisms is presented. The performance was validated through grasping, placing, and manipulation tests.

Submitted 27 March, 2022; originally announced March 2022.

Journal ref: IEEE Robotics and Automation Letters (2022) (Early Access)

arXiv:1909.11274 [pdf, other]

Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network

Authors: Taiji Suzuki, Hiroshi Abe, Tomoaki Nishimura

Abstract: One of the biggest issues in deep learning theory is the generalization ability of networks with huge model size. The classical learning theory suggests that overparameterized models cause overfitting. However, practically used large deep models avoid overfitting, which is not well explained by the classical approaches. To resolve this issue, several attempts have been made. Among them, the compression based bound is one of the promising approaches. However, the compression based bound can be applied only to a compressed network, and it is not applicable to the non-compressed original network. In this paper, we give a unified framework that can convert compression based bounds to those for non-compressed original networks. The bound gives an even better rate than the one for the compressed network by improving the bias term. By establishing the unified framework, we can obtain a data dependent generalization error bound which gives a tighter evaluation than the data independent ones.

Submitted 21 June, 2020; v1 submitted 24 September, 2019; originally announced September 2019.

Comments: published in ICLR2020

arXiv:1808.08558 [pdf, other]

Spectral Pruning: Compressing Deep Neural Networks via Spectral Analysis and its Generalization Error

Authors: Taiji Suzuki, Hiroshi Abe, Tomoya Murata, Shingo Horiuchi, Kotaro Ito, Tokuma Wachi, So Hirai, Masatoshi Yukishima, Tomoaki Nishimura

Abstract: Compression techniques for deep neural network models are becoming very important for the efficient execution of high-performance deep learning systems on edge-computing devices. The concept of model compression is also important for analyzing the generalization error of deep learning, known as the compression-based error bound. However, there is still a huge gap between a practically effective compression method and its rigorous background of statistical learning theory. To resolve this issue, we develop a new theoretical framework for model compression and propose a new pruning method called spectral pruning based on this framework. We define the "degrees of freedom" to quantify the intrinsic dimensionality of a model by using the eigenvalue distribution of the covariance matrix across the internal nodes and show that the compression ability is essentially controlled by this quantity. Moreover, we present a sharp generalization error bound of the compressed model and characterize the bias-variance tradeoff induced by the compression procedure. We apply our method to several datasets to justify our theoretical analyses and show the superiority of the proposed method.

Submitted 13 July, 2020; v1 submitted 26 August, 2018; originally announced August 2018.

Comments: 17 pages, 4 figures. Accepted in IJCAI-PRICAI 2020. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pages 2839--2846

Showing 1–22 of 22 results for author: Nishimura, T