-
Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models
Authors:
Takayuki Nishimura,
Katsuyuki Kuyo,
Motonari Kambara,
Komei Sugiura
Abstract:
We consider the task of generating segmentation masks for the target object from an object manipulation instruction, which allows users to give open vocabulary instructions to domestic service robots. Conventional segmentation generation approaches often fail to account for objects outside the camera's field of view and cases in which the order of vertices differs but still represents the same polygon, which leads to erroneous mask generation. In this study, we propose a novel method that generates segmentation masks from open vocabulary instructions. We implement a novel loss function using optimal transport to prevent significant loss where the order of vertices differs but still represents the same polygon. To evaluate our approach, we constructed a new dataset based on the REVERIE dataset and Matterport3D dataset. The results demonstrated the effectiveness of the proposed method compared with existing mask generation methods. Remarkably, our best model achieved a +16.32% improvement on the dataset compared with a representative polygon-based method.
Submitted 1 July, 2024;
originally announced July 2024.
-
BioVL-QR: Egocentric Biochemical Video-and-Language Dataset Using Micro QR Codes
Authors:
Taichi Nishimura,
Koki Yamamoto,
Yuto Haneji,
Keiya Kajimura,
Chihiro Nishiwaki,
Eriko Daikoku,
Natsuko Okuda,
Fumihito Ono,
Hirotaka Kameko,
Shinsuke Mori
Abstract:
This paper introduces a biochemical vision-and-language dataset, which consists of 24 egocentric experiment videos, corresponding protocols, and video-and-language alignments. The key challenge in the wet-lab domain is that detecting equipment, reagents, and containers is difficult because the lab environment is cluttered with objects filling the table and some objects are indistinguishable. Therefore, previous studies assume that objects are manually annotated and given for downstream tasks, but this is costly and time-consuming. To address this issue, this study focuses on Micro QR Codes to detect objects automatically. From our preliminary study, we found that detecting objects using only Micro QR Codes is still difficult because the researchers manipulate objects, frequently causing blur and occlusion. To address this, we also propose a novel object labeling method that combines a Micro QR Code detector and an off-the-shelf hand object detector. As one application of our dataset, we conduct the task of generating protocols from experiment videos and find that our approach can generate accurate protocols.
Submitted 3 April, 2024;
originally announced April 2024.
-
Text-driven Affordance Learning from Egocentric Vision
Authors:
Tomoya Yoshida,
Shuhei Kurita,
Taichi Nishimura,
Shinsuke Mori
Abstract:
Visual affordance learning is a key component for robots to understand how to interact with objects. Conventional approaches in this field rely on pre-defined objects and actions, falling short of capturing diverse interactions in real-world scenarios. The key idea of our approach is employing textual instruction, targeting various affordances for a wide range of objects. This approach covers both hand-object and tool-object interactions. We introduce text-driven affordance learning, aiming to learn contact points and manipulation trajectories from an egocentric view following textual instruction. In our task, contact points are represented as heatmaps, and the manipulation trajectory as sequences of coordinates that incorporate both linear and rotational movements for various manipulations. However, when we gather data for this task, manual annotations of these diverse interactions are costly. To this end, we propose a pseudo dataset creation pipeline and build a large pseudo-training dataset: TextAFF80K, consisting of over 80K instances of the contact points, trajectories, images, and text tuples. We extend existing referring expression comprehension models for our task, and experimental results show that our approach robustly handles multiple affordances, serving as a new standard for affordance learning in real-world scenarios.
Submitted 3 April, 2024;
originally announced April 2024.
-
Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks
Authors:
Keyaki Ohno,
Hirotaka Kameko,
Keisuke Shirai,
Taichi Nishimura,
Shinsuke Mori
Abstract:
Geoparsing is the task of estimating the latitude and longitude (coordinates) of location expressions in texts. Geoparsing must deal with the ambiguity of expressions that indicate multiple locations with the same notation. For evaluating geoparsing systems, several corpora have been proposed in previous work. However, these corpora are small-scale and suffer from limited coverage of location expressions in general domains. In this paper, we propose Wikipedia Hyperlink-based Location Linking (WHLL), a novel method to construct a large-scale corpus for geoparsing from Wikipedia articles. WHLL leverages hyperlinks in Wikipedia to annotate multiple location expressions with coordinates. With this method, we constructed the WHLL corpus, a new large-scale corpus for geoparsing. The WHLL corpus consists of 1.3M articles, each containing about 7.8 unique location expressions. 45.6% of location expressions are ambiguous and refer to more than one location with the same notation. In each article, location expressions in the article title and those hyperlinked to other articles are assigned coordinates. By utilizing hyperlinks, we can accurately assign coordinates to location expressions even when they are ambiguous in the texts. Experimental results show that there remains room for improvement by disambiguating location expressions.
Submitted 25 March, 2024;
originally announced March 2024.
-
Single-Motor Robotic Gripper with Multi-Surface Fingers for Variable Grasping Configurations
Authors:
Toshihiro Nishimura,
Yosuke Suzuki,
Tokuo Tsuji,
Tetsuyou Watanabe
Abstract:
This study proposes a novel robotic gripper with variable grasping configurations for grasping various objects. The fingers of the developed gripper incorporate multiple different surfaces. The gripper can alter the finger surfaces facing a target object by rotating the fingers in their longitudinal direction. In the proposed design equipped with two fingers, the two fingers incorporate three and four surfaces, respectively, resulting in the nine available grasping configurations by the combination of these finger surfaces. The developed gripper is equipped with the functions of opening/closing its fingers for grasping and rotating its fingers to alter the grasping configuration, all achieved with a single motor. To enable the two motions using a single motor, this study introduces a self-motion switching mechanism utilizing magnets. This mechanism automatically transitions between gripper motions based on the direction of the motor rotation when the gripper is fully opened. In this state, rotating the motor towards closing initiates the finger closing action, while further opening the fingers from the fully opened state activates the finger rotation. This letter presents the gripper design, the mechanics of the self-motion switching mechanism, the control method, and the grasping configuration selection strategy. The performance of the gripper is experimentally demonstrated.
Submitted 24 March, 2024;
originally announced March 2024.
-
On the Audio Hallucinations in Large Audio-Video Language Models
Authors:
Taichi Nishimura,
Shota Nakada,
Masayoshi Kondo
Abstract:
Large audio-video language models can generate descriptions for both video and audio. However, they sometimes ignore audio content, producing audio descriptions solely reliant on visual information. This paper refers to this as audio hallucinations and analyzes them in large audio-video language models. We gather 1,000 sentences by inquiring about audio information and annotate whether they contain hallucinations. If a sentence is hallucinated, we also categorize the type of hallucination. The results reveal that 332 sentences are hallucinated, with distinct trends observed in nouns and verbs for each hallucination type. Based on this, we tackle the task of audio hallucination classification using pre-trained audio-text models in the zero-shot and fine-tuning settings. Our experimental results reveal that the zero-shot models achieve higher performance (52.2% in F1) than the random baseline (40.3%), and the fine-tuning models achieve 87.9%, outperforming the zero-shot models.
Submitted 18 January, 2024;
originally announced January 2024.
-
Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval
Authors:
Taichi Nishimura,
Shota Nakada,
Masayoshi Kondo
Abstract:
In this paper, we propose an efficient and high-performance method for partially relevant video retrieval, which aims to retrieve long videos that contain at least one moment relevant to the input text query. The challenge lies in encoding dense frames using visual backbones. This requires models to handle the increased frames, resulting in significant computation costs for long videos. To mitigate the costs, previous studies use lightweight visual backbones, yielding sub-optimal retrieval performance due to their limited capabilities. However, it is undesirable to simply replace the backbones with high-performance large vision-and-language models (VLMs) due to their low efficiency. To address this dilemma, instead of dense frames, we focus on super images, which are created by rearranging the video frames in an $N \times N$ grid layout. This reduces the number of visual encodings to $\frac{1}{N^2}$ and mitigates the low efficiency of large VLMs. Based on this idea, we make two contributions. First, we explore whether VLMs generalize to super images in a zero-shot setting. To this end, we propose a method called query-attentive super image retrieval (QASIR), which attends to partial moments relevant to the input query. The zero-shot QASIR yields two discoveries: (1) it enables VLMs to generalize to super images and (2) the grid size $N$, image resolution, and VLM size are key trade-off parameters between performance and computation costs. Second, we introduce fine-tuning and hybrid QASIR that combines high- and low-efficiency models to strike a balance between performance and computation costs. This reveals two findings: (1) the fine-tuning QASIR enhances VLMs to learn super images effectively, and (2) the hybrid QASIR minimizes the performance drop of large VLMs while reducing the computation costs.
Submitted 11 March, 2024; v1 submitted 1 December, 2023;
originally announced December 2023.
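The super-image idea in the abstract above (rearranging video frames into an $N \times N$ grid so that one visual encoding covers $N^2$ frames) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `make_super_images`, the row-major frame order, the zero padding of an incomplete final grid, and the representation of frames as nested lists are all assumptions made for illustration.

```python
def make_super_images(frames, n):
    """Tile consecutive frames into n x n grids in row-major order.

    frames: list of equally sized 2D lists (H rows x W columns).
    Returns a list of 2D lists, each of size (n*H) x (n*W).
    Assumption: if len(frames) is not a multiple of n*n, the last grid
    is padded with blank (all-zero) frames.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    if not frames:
        return []
    h, w = len(frames[0]), len(frames[0][0])
    blank = [[0] * w for _ in range(h)]
    per_grid = n * n
    supers = []
    for start in range(0, len(frames), per_grid):
        chunk = frames[start:start + per_grid]
        chunk += [blank] * (per_grid - len(chunk))  # pad the last grid
        grid = []
        for grid_row in range(n):
            row_frames = chunk[grid_row * n:(grid_row + 1) * n]
            for y in range(h):
                line = []
                for frame in row_frames:
                    line.extend(frame[y])  # concatenate pixel rows side by side
                grid.append(line)
        supers.append(grid)
    return supers


# Toy usage: eight 1x1 "frames" tiled with N = 2 give two super images,
# so a visual backbone would run 2 times instead of 8.
frames = [[[i]] for i in range(1, 9)]
supers = make_super_images(frames, 2)
print(len(frames), "->", len(supers))  # prints "8 -> 2"
```

With real video frames one would normally use an array library and resize each grid to the backbone's input resolution; the nested-list form here only keeps the sketch free of dependencies.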
-
Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos
Authors:
Takehiko Ohkawa,
Takuma Yagi,
Taichi Nishimura,
Ryosuke Furuta,
Atsushi Hashimoto,
Yoshitaka Ushiku,
Yoichi Sato
Abstract:
We propose a novel benchmark for cross-view knowledge transfer of dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and their captions) is primarily studied with exocentric videos (e.g., YouCook2), benchmarks with egocentric videos are restricted due to data scarcity. To overcome the limited video availability, transferring knowledge from abundant exocentric web videos is demanded as a practical approach. However, learning the correspondence between exocentric and egocentric views is difficult due to their dynamic view changes. The web videos contain mixed views focusing on either human body actions or close-up hand-object interactions, while the egocentric view is constantly shifting as the camera wearer moves. This necessitates the in-depth study of cross-view transfer under complex view changes. In this work, we first create a real-life egocentric dataset (EgoYC2) whose captions are shared with YouCook2, enabling transfer learning between these datasets assuming their ground-truth is accessible. To bridge the view gaps, we propose a view-invariant learning method using adversarial training in both the pre-training and fine-tuning stages. While the pre-training is designed to learn invariant features against the mixed views in the web videos, the view-invariant fine-tuning further mitigates the view gaps between both datasets. We validate our proposed method by studying how effectively it overcomes the view change problem and efficiently transfers the knowledge to the egocentric domain. Our benchmark pushes the study of the cross-view transfer into a new task domain of dense video captioning and will envision methodologies to describe egocentric videos in natural language.
Submitted 29 November, 2023; v1 submitted 27 November, 2023;
originally announced November 2023.
-
DialMAT: Dialogue-Enabled Transformer with Moment-Based Adversarial Training
Authors:
Kanta Kaneda,
Ryosuke Korekata,
Yuiga Wada,
Shunya Nagashima,
Motonari Kambara,
Yui Iioka,
Haruka Matsuo,
Yuto Imai,
Takayuki Nishimura,
Komei Sugiura
Abstract:
This paper focuses on the DialFRED task, which is the task of embodied instruction following in a setting where an agent can actively ask questions about the task. To address this task, we propose DialMAT. DialMAT introduces Moment-based Adversarial Training, which incorporates adversarial perturbations into the latent space of language, image, and action. Additionally, it introduces a crossmodal parallel feature extraction mechanism that applies foundation models to both language and image. We evaluated our model using a dataset constructed from the DialFRED dataset and demonstrated superior performance compared to the baseline method in terms of success rate and path weighted success rate. The model secured the top position in the DialFRED Challenge, which took place at the CVPR 2023 Embodied AI workshop.
Submitted 12 November, 2023;
originally announced November 2023.
-
Lightweight High-Speed and High-Force Gripper for Assembly
Authors:
Toshihiro Nishimura,
Takeshi Takaki,
Yosuke Suzuki,
Tokuo Tsuji,
Tetsuyou Watanabe
Abstract:
This paper presents a novel industrial robotic gripper with a high grasping speed (maximum: 1396 mm/s), high tip force (maximum: 80 N) for grasping, large motion range, and lightweight design (0.3 kg). To realize these features, the high-speed section of the quick-return mechanism and load-sensitive continuously variable transmission mechanism are installed in the gripper. The gripper is also equipped with a self-centering function. The high grasping speed and self-centering function improve the cycle time in robotic operations. In addition, the high tip force is advantageous for stably grasping and assembling heavy objects. Moreover, the design of the gripper reduces the gripper's proportion of the manipulator's payload, thus increasing the weight of the object that can be grasped. The gripper performance was validated through kinematic and static analyses as well as experimental evaluations. This paper also presents the analysis of the self-centering function of the developed gripper.
Submitted 26 October, 2023;
originally announced October 2023.
-
Single-Motor Robotic Gripper With Three Functional Modes for Grasping in Confined Spaces
Authors:
Toshihiro Nishimura,
Tetsuyou Watanabe
Abstract:
This study proposes a novel robotic gripper driven by a single motor. The main task is to pick up objects in confined spaces. For this purpose, the developed gripper has three operating modes: grasping, finger-bending, and pull-in modes. Using these three modes, the developed gripper can rotate and translate a grasped object, i.e., can perform in-hand manipulation. This in-hand manipulation is effective for grasping in extremely confined spaces, such as the inside of a box in a shelf, to avoid interference between the grasped object and obstacles. To achieve the three modes using a single motor, the developed gripper is equipped with two novel self-motion switching mechanisms. These mechanisms switch their motions automatically when the motion being generated is prevented. An analysis of the mechanism and the control methodology used to achieve the desired behavior are presented. Furthermore, the validity of the analysis and methodology is experimentally demonstrated. The gripper performance is also evaluated through grasping tests.
Submitted 26 October, 2023;
originally announced October 2023.
-
OUXT Polaris: Autonomous Navigation System for the 2022 Maritime RobotX Challenge
Authors:
Kenta Okamoto,
Akihisa Nagata,
Kyoma Arai,
Yusei Nagao,
Tatsuki Nishimura,
Kento Hirogaki,
Shunya Tanaka,
Masato Kobayashi,
Tatsuya Sanada,
Masaya Kataoka
Abstract:
OUXT-Polaris has been developing an autonomous navigation system by participating in the Maritime RobotX Challenge 2014, 2016, and 2018. In this paper, we describe the improvements to the previous vessel system and indicate the advantages of the improved design. Moreover, we describe our development method under COVID-19, which used simulation and miniature-size hardware, and the feature components for the next RobotX Challenge.
Submitted 24 June, 2023;
originally announced June 2023.
-
Flexible and slim device switching air blowing and suction by a single airflow control
Authors:
Seita Nojiri,
Toshihiro Nishimura,
Kenjiro Tadakuma,
Tetsuyou Watanabe
Abstract:
This study proposes a soft robotic device with a slim and flexible body that switches between air blowing and suction with a single airflow control. Suction is achieved by jet flow entraining surrounding air, and blowing is achieved by blocking and reversing jet flow. The thin and flexible flap gate enables the switching. Air flow is blocked while the gate is closed and passes through while the gate is open. The opening and closing of the flap gate are controlled by the expansion of the inflatable chambers installed near the gate. The extent of expansion is determined by the upstream static pressure. Therefore, the gate can be controlled by the input airflow rate. The dimensions of the flap gate are introduced as a design parameter, and we show that the parameter contributes to the blowing and suction capacities. We also experimentally demonstrate that the proposed device is available for a variable friction system and an end effector for picking up a thin object covered with dust.
Submitted 8 March, 2023;
originally announced March 2023.
-
High-payload and self-adaptive robotic hand with 1-degree-of-freedom translation/rotation switching mechanism
Authors:
Toshihiro Nishimura,
Tsubasa Muryoe,
Tetsuyou Watanabe
Abstract:
This study proposes a novel robotic hand that can achieve self-adaptive grasping and a large payload (over 20 kg) with a single actuator. Accordingly, two novel mechanisms, an actuation system with self-motion switching and a self-adaptive finger with a self-locking mechanism, are installed in a 1-degree-of-freedom robotic hand. The actuation system switches the output motion from translational to rotational according to the applied external load. The finger is bent by inserting a flexible shaft inside it. Its bending posture can conform to the shape of the object owing to the flexible shaft, and the posture is fixed by a self-locking mechanism, which can be released by the rotational motion of the actuation system. This study presents a mechanical analysis of these mechanisms to achieve the desired behavior. The analysis was validated experimentally, and a robotic hand with these mechanisms was evaluated using grasping tests.
Submitted 8 March, 2023;
originally announced March 2023.
-
Near-optimal stochastic MIMO signal detection with a mixture of t-distribution prior
Authors:
Junichiro Hagiwara,
Kazushi Matsumura,
Hiroki Asumi,
Yukiko Kasuga,
Toshihiko Nishimura,
Takanori Sato,
Yasutaka Ogawa,
Takeo Ohgane
Abstract:
Multiple-input multiple-output (MIMO) systems will play a crucial role in future wireless communication, but improving their signal detection performance to increase transmission efficiency remains a challenge. To address this issue, we propose extending the discrete signal detection problem in MIMO systems to a continuous one and applying the Hamiltonian Monte Carlo method, an efficient Markov chain Monte Carlo algorithm. In our previous studies, we have used a mixture of normal distributions for the prior distribution. In this study, we propose using a mixture of t-distributions, which further improves detection performance. Based on our theoretical analysis and computer simulations, the proposed method can achieve near-optimal signal detection with polynomial computational complexity. This high-performance and practical MIMO signal detection could contribute to the development of the 6th-generation mobile network.
Submitted 7 March, 2024; v1 submitted 9 January, 2023;
originally announced January 2023.
-
1-degree-of-freedom Robotic Gripper With Infinite Self-Twist Function
Authors:
Toshihiro Nishimura,
Yosuke Suzuki,
Tokuo Tsuji,
Tetsuyou Watanabe
Abstract:
This study proposes a novel robotic gripper that can achieve grasping and infinite wrist twisting motions using a single actuator. The gripper is equipped with a differential gear mechanism that allows switching between the grasping and twisting motions according to the magnitude of the tip force applied to the finger. The grasping motion is activated when the tip force is below a set value, and the wrist twisting motion is activated when the tip force exceeds this value. "Twist grasping," a special grasping mode that allows the wrapping of a flexible thin object around the fingers of the gripper, can be achieved by the twisting motion. Twist grasping is effective for handling objects with flexible thin parts, such as laminated packaging pouches, that are difficult to grasp using conventional antipodal grasping. In this study, the gripper design is presented, and twist grasping is analyzed. The gripper performance is experimentally validated.
Submitted 9 November, 2022;
originally announced November 2022.
-
Single-Fingered Reconfigurable Robotic Gripper With a Folding Mechanism for Narrow Working Spaces
Authors:
Toshihiro Nishimura,
Tsubasa Muryoe,
Yoshitatsu Asama,
Hiroki Ikeuchi,
Ryo Toshima,
Tetsuyou Watanabe
Abstract:
This letter proposes a novel single-fingered reconfigurable robotic gripper for grasping objects in narrow working spaces. The finger of the developed gripper realizes two configurations, namely, the insertion and grasping modes, using only a single motor. In the insertion mode, the finger assumes a thin shape such that it can insert its tip into a narrow space. The grasping mode of the finger is activated through a folding mechanism. Mode switching can be achieved in two ways: switching the mode actively by a motor, or combining passive rotation of the fingertip through contact with the support surface and active motorized construction of the claw. The latter approach is effective when it is unclear how much finger insertion is required for a specific task. The structure provides a simple control scheme. The performance of the proposed robotic gripper design and control methodology was experimentally evaluated. The minimum width of the insertion space required to grasp an object is 4 mm (1 mm, when using a strategy).
Submitted 21 November, 2022; v1 submitted 9 November, 2022;
originally announced November 2022.
-
Recipe Generation from Unsegmented Cooking Videos
Authors:
Taichi Nishimura,
Atsushi Hashimoto,
Yoshitaka Ushiku,
Hirotaka Kameko,
Shinsuke Mori
Abstract:
This paper tackles recipe generation from unsegmented cooking videos, a task that requires agents to (1) extract key events in completing the dish and (2) generate sentences for the extracted events. Our task is similar to dense video captioning (DVC), which aims at detecting events thoroughly and generating sentences for them. However, unlike DVC, in recipe generation, recipe story awareness is crucial, and a model should extract an appropriate number of events in the correct order and generate accurate sentences based on them. We analyze the output of the DVC model and confirm that although (1) several events are adoptable as a recipe story, (2) the generated sentences for such events are not grounded in the visual content. Based on this, we set our goal to obtain correct recipes by selecting oracle events from the output events and re-generating sentences for them. To achieve this, we propose a transformer-based multimodal recurrent approach of training an event selector and sentence generator for selecting oracle events from the DVC's events and generating sentences for them. In addition, we extend the model by including ingredients to generate more accurate recipes. The experimental results show that the proposed method outperforms state-of-the-art DVC models. We also confirm that, by modeling the recipe in a story-aware manner, the proposed model outputs the appropriate number of events in the correct order.
Submitted 18 February, 2024; v1 submitted 21 September, 2022;
originally announced September 2022.
-
Visual Recipe Flow: A Dataset for Learning Visual State Changes of Objects with Recipe Flows
Authors:
Keisuke Shirai,
Atsushi Hashimoto,
Taichi Nishimura,
Hirotaka Kameko,
Shuhei Kurita,
Yoshitaka Ushiku,
Shinsuke Mori
Abstract:
We present a new multimodal dataset called Visual Recipe Flow, which enables us to learn the result of each cooking action in a recipe text. The dataset consists of object state changes and the workflow of the recipe text. The state change is represented as an image pair, while the workflow is represented as a recipe flow graph (r-FG). The image pairs are grounded in the r-FG, which provides the cross-modal relation. With our dataset, one can try a range of applications, from multimodal commonsense reasoning to procedural text generation.
Submitted 13 September, 2022;
originally announced September 2022.
-
Soft robotic hand with finger-bending/friction-reduction switching mechanism through 1-degree-of-freedom flow control
Authors:
Toshihiro Nishimura,
Kensuke Shimizu,
Seita Nojiri,
Kenjiro Tadakuma,
Yosuke Suzuki,
Tokuo Tsuji,
Tetsuyou Watanabe
Abstract:
This paper proposes a novel pneumatic soft robotic hand that incorporates a mechanism that can switch the airflow path using a single airflow control. The developed hand can control the finger motion and operate the surface friction variable mechanism. In the friction variable mechanism, a lubricant is injected onto the high-friction finger surface to reduce surface friction. To inject the lubricant using a positive-pressure airflow, the Venturi effect is applied. The design and evaluation of the airflow-path switching and friction variable mechanisms are described. Moreover, the entire design of a soft robotic hand equipped with these mechanisms is presented. The performance was validated through grasping, placing, and manipulation tests.
Submitted 27 March, 2022;
originally announced March 2022.
-
Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network
Authors:
Taiji Suzuki,
Hiroshi Abe,
Tomoaki Nishimura
Abstract:
One of the biggest issues in deep learning theory is the generalization ability of networks with huge model size. The classical learning theory suggests that overparameterized models cause overfitting. However, practically used large deep models avoid overfitting, which is not well explained by the classical approaches. To resolve this issue, several attempts have been made. Among them, the compression based bound is one of the promising approaches. However, the compression based bound can be applied only to a compressed network, and it is not applicable to the non-compressed original network. In this paper, we give a unified framework that can convert compression based bounds to those for non-compressed original networks. The bound gives an even better rate than the one for the compressed network by improving the bias term. By establishing the unified framework, we can obtain a data dependent generalization error bound which gives a tighter evaluation than the data independent ones.
Submitted 21 June, 2020; v1 submitted 24 September, 2019;
originally announced September 2019.
-
Spectral Pruning: Compressing Deep Neural Networks via Spectral Analysis and its Generalization Error
Authors:
Taiji Suzuki,
Hiroshi Abe,
Tomoya Murata,
Shingo Horiuchi,
Kotaro Ito,
Tokuma Wachi,
So Hirai,
Masatoshi Yukishima,
Tomoaki Nishimura
Abstract:
Compression techniques for deep neural network models are becoming very important for the efficient execution of high-performance deep learning systems on edge-computing devices. The concept of model compression is also important for analyzing the generalization error of deep learning, known as the compression-based error bound. However, there is still a huge gap between a practically effective compression method and its rigorous background of statistical learning theory. To resolve this issue, we develop a new theoretical framework for model compression and propose a new pruning method called spectral pruning based on this framework. We define the "degrees of freedom" to quantify the intrinsic dimensionality of a model by using the eigenvalue distribution of the covariance matrix across the internal nodes and show that the compression ability is essentially controlled by this quantity. Moreover, we present a sharp generalization error bound of the compressed model and characterize the bias-variance tradeoff induced by the compression procedure. We apply our method to several datasets to justify our theoretical analyses and show the superiority of the proposed method.
Submitted 13 July, 2020; v1 submitted 26 August, 2018;
originally announced August 2018.