Search | arXiv e-print repository

doi 10.1109/ICRA40945.2020.9197188

Stable Tool-Use with Flexible Musculoskeletal Hands by Learning the Predictive Model of Sensor State Transition

Authors: Kento Kawaharazuka, Kei Tsuzuki, Moritaka Onitsuka, Yuki Asano, Kei Okada, Koji Kawasaki, Masayuki Inaba

Abstract: The flexible under-actuated musculoskeletal hand is superior in its adaptability and impact resistance. On the other hand, since the relationship between sensors and actuators cannot be uniquely determined, almost all its controls are based on feedforward controls. When grasping and using a tool, the contact state of the hand gradually changes due to the inertia of the tool or impact of action, and the initial contact state is hardly kept. In this study, we propose a system that trains the predictive network of sensor state transition using the actual robot sensor information, and keeps the initial contact state by a feedback control using the network. We conduct experiments of hammer hitting, vacuuming, and brooming, and verify the effectiveness of this study.

Submitted 24 June, 2024; originally announced June 2024.

Comments: Accepted at ICRA2020

arXiv:2406.17134 [pdf, other]

doi 10.1109/LRA.2020.2972841

Musculoskeletal AutoEncoder: A Unified Online Acquisition Method of Intersensory Networks for State Estimation, Control, and Simulation of Musculoskeletal Humanoids

Authors: Kento Kawaharazuka, Kei Tsuzuki, Moritaka Onitsuka, Yuki Asano, Kei Okada, Koji Kawasaki, Masayuki Inaba

Abstract: While the musculoskeletal humanoid has various biomimetic benefits, the modeling of its complex structure is difficult, and many learning-based systems have been developed so far. There are various methods, such as control methods using acquired relationships between joints and muscles represented by a data table or neural network, and state estimation methods using Extended Kalman Filter or table search. In this study, we construct a Musculoskeletal AutoEncoder representing the relationship among joint angles, muscle tensions, and muscle lengths, and propose a unified method of state estimation, control, and simulation of musculoskeletal humanoids using it. By updating the Musculoskeletal AutoEncoder online using the actual robot sensor information, we can continuously conduct more accurate state estimation, control, and simulation than before the online learning. We conducted several experiments using the musculoskeletal humanoid Musashi, and verified the effectiveness of this study.

Submitted 24 June, 2024; originally announced June 2024.

Comments: Accepted at IEEE Robotics and Automation Letters

arXiv:2406.12658 [pdf, other]

Federated Learning with a Single Shared Image

Authors: Sunny Soni, Aaqib Saeed, Yuki M. Asano

Abstract: Federated Learning (FL) enables multiple machines to collaboratively train a machine learning model without sharing of private training data. Yet, especially for heterogeneous models, a key bottleneck remains the transfer of knowledge gained from each client model with the server. One popular method, FedDF, uses distillation to tackle this task with the use of a common, shared dataset on which predictions are exchanged. However, in many contexts such a dataset might be difficult to acquire due to privacy and the clients might not allow for storage of a large shared dataset. To this end, in this paper, we introduce a new method that improves this knowledge distillation method to only rely on a single shared image between clients and server. In particular, we propose a novel adaptive dataset pruning algorithm that selects the most informative crops generated from only a single image. With this, we show that federated learning with distillation under a limited shared dataset budget works better by using a single image compared to multiple individual ones. Finally, we extend our approach to allow for training heterogeneous client architectures by incorporating a non-uniform distillation schedule and client-model mirroring on the server side.

Submitted 18 June, 2024; originally announced June 2024.

Comments: 8 Pages, 3 Figures, Appendix 4 Pages, CVPRW 2024

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 7782-7790

arXiv:2406.05573 [pdf, other]

doi 10.1109/MRA.2020.2987805

Toward Autonomous Driving by Musculoskeletal Humanoids: A Study of Developed Hardware and Learning-Based Software

Authors: Kento Kawaharazuka, Kei Tsuzuki, Yuya Koga, Yusuke Omura, Tasuku Makabe, Koki Shinjo, Moritaka Onitsuka, Yuya Nagamatsu, Yuki Asano, Kei Okada, Koji Kawasaki, Masayuki Inaba

Abstract: This paper summarizes an autonomous driving project by musculoskeletal humanoids. The musculoskeletal humanoid, which mimics the human body in detail, has redundant sensors and a flexible body structure. These characteristics are suitable for motions with complex environmental contact, and the robot is expected to sit down on the car seat, step on the acceleration and brake pedals, and operate the steering wheel by both arms. We reconsider the developed hardware and software of the musculoskeletal humanoid Musashi in the context of autonomous driving. The respective components of autonomous driving are conducted using the benefits of the hardware and software. Finally, Musashi succeeded in the pedal and steering wheel operations with recognition.

Submitted 8 June, 2024; originally announced June 2024.

Comments: Accepted at IEEE Robotics and Automation Magazine

arXiv:2405.17423 [pdf, other]

Privacy-Aware Visual Language Models

Authors: Laurens Samson, Nimrod Barazani, Sennay Ghebreab, Yuki M. Asano

Abstract: This paper aims to advance our understanding of how Visual Language Models (VLMs) handle privacy-sensitive information, a crucial concern as these technologies become integral to everyday life. To this end, we introduce a new benchmark PrivBench, which contains images from 8 sensitive categories such as passports or fingerprints. We evaluate 10 state-of-the-art VLMs on this benchmark and observe a generally limited understanding of privacy, highlighting a significant area for model improvement. Based on this we introduce PrivTune, a new instruction-tuning dataset aimed at equipping VLMs with knowledge about visual privacy. By tuning two pretrained VLMs, TinyLLaVa and MiniGPT-v2, on this small dataset, we achieve strong gains in their ability to recognize sensitive content, outperforming even GPT4-V. At the same time, we show that privacy-tuning only minimally affects the VLMs' performance on standard benchmarks such as VQA. Overall, this paper lays out a crucial challenge for making VLMs effective in handling real-world data safely and provides a simple recipe that takes the first step towards building privacy-aware VLMs.

Submitted 27 May, 2024; originally announced May 2024.

Comments: preprint

arXiv:2405.14862 [pdf, other]

Bitune: Bidirectional Instruction-Tuning

Authors: Dawid J. Kopiczko, Tijmen Blankevoort, Yuki M. Asano

Abstract: We introduce Bitune, a method that improves instruction-tuning of pretrained decoder-only large language models, leading to consistent gains on downstream tasks. Bitune applies both causal and bidirectional attention to the prompt, to obtain a better representation of the query or instruction. We realize this by introducing two sets of parameters, for which we apply parameter-efficient finetuning techniques. These causal and bidirectional features are then combined into a weighted average with trainable coefficients, which is subsequently used to generate new tokens. We demonstrate significant improvements in zero-shot performance on commonsense reasoning, arithmetic, and language understanding tasks, while extensive ablation studies validate the role of each component and demonstrate the method's agnosticism to different PEFT techniques.

Submitted 23 May, 2024; originally announced May 2024.
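
Editorial illustration (not from the paper): the Bitune abstract above says causal and bidirectional features are "combined into a weighted average with trainable coefficients". A minimal sketch of such a combination is given below, assuming a single scalar coefficient squashed by a sigmoid. The function name `mix_features`, the parameter `theta`, and the sigmoid parameterization are all hypothetical; the abstract does not specify how the coefficients are defined or applied.

```python
import math


def mix_features(causal, bidirectional, theta):
    """Weighted average of two equal-length feature vectors.

    `theta` stands in for a trainable scalar (an assumption of this sketch);
    the sigmoid keeps the mixing weight strictly between 0 and 1.
    """
    if len(causal) != len(bidirectional):
        raise ValueError("feature vectors must have the same length")
    weight = 1.0 / (1.0 + math.exp(-theta))  # weight given to bidirectional features
    return [weight * b + (1.0 - weight) * c for c, b in zip(causal, bidirectional)]


# With theta = 0 the sigmoid gives 0.5, so both feature sets contribute equally.
mixed = mix_features([0.0, 2.0], [2.0, 0.0], theta=0.0)
```

In a real training setup the coefficient would be updated by gradient descent together with the other finetuned parameters; this sketch only shows the mixing step.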

arXiv:2405.11092 [pdf, other]

What metrics of participation balance predict outcomes of collaborative learning with a robot?

Authors: Yuya Asano, Diane Litman, Quentin King-Shepard, Tristan Maidment, Tyree Langley, Teresa Davison, Timothy Nokes-Malach, Adriana Kovashka, Erin Walker

Abstract: One of the keys to the success of collaborative learning is balanced participation by all learners, but this does not always happen naturally. Pedagogical robots have the potential to facilitate balance. However, it remains unclear what participation balance robots should aim at; various metrics have been proposed, but it is still an open question whether we should balance human participation in human-human interactions (HHI) or human-robot interactions (HRI) and whether we should consider robots' participation in collaborative learning involving multiple humans and a robot. This paper examines collaborative learning between a pair of students and a teachable robot that acts as a peer tutee to answer the aforementioned question. Through an exploratory study, we hypothesize which balance metrics in the literature and which portions of dialogues (including vs. excluding robots' participation and human participation in HHI vs. HRI) will better predict learning as a group. We test the hypotheses with another study and replicate them with automatically obtained units of participation to simulate the information available to robots when they adaptively fix imbalances in real-time. Finally, we discuss recommendations on which metrics learning science researchers should choose when trying to understand how to facilitate collaboration.

Submitted 17 May, 2024; originally announced May 2024.

Comments: To appear in Seventeenth International Conference on Educational Data Mining (EDM 2024)

arXiv:2404.17202 [pdf, other]

Self-supervised visual learning in the low-data regime: a comparative evaluation

Authors: Sotirios Konstantakos, Despina Ioanna Chalkiadaki, Ioannis Mademlis, Yuki M. Asano, Efstratios Gavves, Georgios Th. Papadopoulos

Abstract: Self-Supervised Learning (SSL) is a valuable and robust training methodology for contemporary Deep Neural Networks (DNNs), enabling unsupervised pretraining on a 'pretext task' that does not require ground-truth labels/annotation. This allows efficient representation learning from massive amounts of unlabeled training data, which in turn leads to increased accuracy in a 'downstream task' by exploiting supervised transfer learning. Despite the relatively straightforward conceptualization and applicability of SSL, it is not always feasible to collect and/or to utilize very large pretraining datasets, especially when it comes to real-world application settings. In particular, in cases of specialized and domain-specific application scenarios, it may not be achievable or practical to assemble a relevant image pretraining dataset in the order of millions of instances or it could be computationally infeasible to pretrain at this scale. This motivates an investigation on the effectiveness of common SSL pretext tasks, when the pretraining dataset is of relatively limited/constrained size. In this context, this work introduces a taxonomy of modern visual SSL methods, accompanied by detailed explanations and insights regarding the main categories of approaches, and, subsequently, conducts a thorough comparative experimental evaluation in the low-data regime, targeting to identify: a) what is learnt via low-data SSL pretraining, and b) how do different SSL categories behave in such training scenarios.
Interestingly, for domain-specific downstream tasks, in-domain low-data SSL pretraining outperforms the common approach of large-scale pretraining on general datasets. Grounded on the obtained results, valuable insights are highlighted regarding the performance of each category of SSL methods, which in turn suggest straightforward future research directions in the field.

Submitted 26 April, 2024; originally announced April 2024.

arXiv:2404.14100 [pdf, other]

doi 10.1109/HUMANOIDS.2018.8625002

A Method of Joint Angle Estimation Using Only Relative Changes in Muscle Lengths for Tendon-driven Humanoids with Complex Musculoskeletal Structures

Authors: Kento Kawaharazuka, Shogo Makino, Masaya Kawamura, Yuki Asano, Kei Okada, Masayuki Inaba

Abstract: Tendon-driven musculoskeletal humanoids typically have complex structures similar to those of human beings, such as ball joints and the scapula, in which encoders cannot be installed. Therefore, joint angles cannot be directly obtained and need to be estimated using the changes in muscle lengths. In previous studies, methods using table-search and extended Kalman filter have been developed. These methods express the joint-muscle mapping, which is the nonlinear relationship between joint angles and muscle lengths, by using a data table, polynomials, or a neural network. However, due to computational complexity, these methods cannot consider the effects of polyarticular muscles. In this study, considering the limitation of the computational cost, we reduce unnecessary degrees of freedom, divide joints and muscles into several groups, and formulate a joint angle estimation method that takes into account polyarticular muscles. Also, we extend the estimation method to propose a joint angle estimation method using only the relative changes in muscle lengths. By this extension, which does not use absolute muscle lengths, we do not need to execute a difficult calibration of muscle lengths for tendon-driven musculoskeletal humanoids. Finally, we conduct experiments in simulation and actual environments, and verify the effectiveness of this study.

Submitted 22 April, 2024; originally announced April 2024.

Comments: Accepted at Humanoids2018

arXiv:2404.14080 [pdf, other]

doi 10.1109/HUMANOIDS.2018.8624923

TWIMP: Two-Wheel Inverted Musculoskeletal Pendulum as a Learning Control Platform in the Real World with Environmental Physical Contact

Authors: Kento Kawaharazuka, Tasuku Makabe, Shogo Makino, Kei Tsuzuki, Yuya Nagamatsu, Yuki Asano, Takuma Shirai, Fumihito Sugai, Kei Okada, Koji Kawasaki, Masayuki Inaba

Abstract: By the recent spread of machine learning in the robotics field, a humanoid that can act, perceive, and learn in the real world through contact with the environment needs to be developed. In this study, as one of the choices, we propose a novel humanoid TWIMP, which combines a human mimetic musculoskeletal upper limb with a two-wheel inverted pendulum. By combining the benefit of a musculoskeletal humanoid, which can achieve soft contact with the external environment, and the benefit of a two-wheel inverted pendulum with a small footprint and high mobility, we can easily investigate learning control systems in environments with contact and sudden impact. We reveal our whole concept and system details of TWIMP, and execute several preliminary experiments to show its potential ability.

Submitted 22 April, 2024; originally announced April 2024.

Comments: Accepted at Humanoids2018

arXiv:2404.14045 [pdf, ps, other]

Defining the type IIB matrix model without breaking Lorentz symmetry

Authors: Yuhma Asano, Jun Nishimura, Worapat Piensuk, Naoyuki Yamamori

Abstract: The type IIB matrix model is a promising nonperturbative formulation of superstring theory, which may elucidate the emergence of (3+1)-dimensional space-time. However, the partition function is divergent due to the Lorentz symmetry, which is represented by a noncompact group. This divergence has been regularized conventionally by introducing some infrared cutoff, which breaks the Lorentz symmetry. Here we point out that Lorentz invariant observables become classical as one removes the infrared cutoff and that this "classicalization" is actually an artifact of the Lorentz symmetry breaking cutoff. In order to overcome this problem, we propose a natural way to "gauge-fix" the Lorentz symmetry in a fully nonperturbative manner. This also enables us to perform numerical simulations in such a way that the time-evolution can be extracted directly from the matrix configurations.

Submitted 10 June, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

Comments: 4 pages, no figure, (v2) minor changes with improved presentation

Report number: UTHEP-787, KEK-TH-2617

arXiv:2404.13381 [pdf, other]

DNA: Differentially private Neural Augmentation for contact tracing

Authors: Rob Romijnders, Christos Louizos, Yuki M. Asano, Max Welling

Abstract: The COVID19 pandemic had enormous economic and societal consequences. Contact tracing is an effective way to reduce infection rates by detecting potential virus carriers early. However, this was not generally adopted in the recent pandemic, and privacy concerns are cited as the most important reason. We substantially improve the privacy guarantees of the current state of the art in decentralized contact tracing. Whereas previous work was based on statistical inference only, we augment the inference with a learned neural network and ensure that this neural augmentation satisfies differential privacy. In a simulator for COVID19, even at epsilon=1 per message, this can significantly improve the detection of potentially infected individuals and, as a result of targeted testing, reduce infection rates. This work marks an important first step in integrating deep learning into contact tracing while maintaining essential privacy guarantees.

Submitted 20 April, 2024; originally announced April 2024.

Comments: Privacy Regulation and Protection in Machine Learning Workshop at ICLR 2024

arXiv:2404.05295 [pdf, other]

doi 10.1109/LRA.2018.2789849

Online Learning of Joint-Muscle Mapping Using Vision in Tendon-driven Musculoskeletal Humanoids

Authors: Kento Kawaharazuka, Shogo Makino, Masaya Kawamura, Yuki Asano, Kei Okada, Masayuki Inaba

Abstract: The body structures of tendon-driven musculoskeletal humanoids are complex, and accurate modeling is difficult, because they are made by imitating the body structures of human beings. For this reason, we have not been able to move them accurately like ordinary humanoids driven by actuators in each axis, and large internal muscle tension and slack of tendon wires have emerged by the model error between its geometric model and the actual robot. Therefore, we construct a joint-muscle mapping (JMM) using a neural network (NN), which expresses a nonlinear relationship between joint angles and muscle lengths, and aim to move tendon-driven musculoskeletal humanoids accurately by updating the JMM online from data of the actual robot. In this study, the JMM is updated online by using the vision of the robot so that it moves to the correct position (Vision Updater). Also, we execute another update to modify muscle antagonisms correctly (Antagonism Updater). By using these two updaters, the error between the target and actual joint angles decreases to about 40% in 5 minutes, and we show through a manipulation experiment that the tendon-driven musculoskeletal humanoid Kengoro becomes able to move as intended. This novel system can adapt to the state change and growth of robots, because it updates the JMM online successively.

Submitted 8 April, 2024; originally announced April 2024.

Comments: Accepted at IEEE Robotics and Automation Letters, 2018

arXiv:2404.05293 [pdf, other]

doi 10.1109/LRA.2019.2923968

Long-time Self-body Image Acquisition and its Application to the Control of Musculoskeletal Structures

Authors: Kento Kawaharazuka, Kei Tsuzuki, Shogo Makino, Moritaka Onitsuka, Yuki Asano, Kei Okada, Koji Kawasaki, Masayuki Inaba

Abstract: The tendon-driven musculoskeletal humanoid has many benefits that human beings have, but the modeling of its complex muscle and bone structures is difficult and conventional model-based controls cannot realize intended movements. Therefore, a learning control mechanism that acquires nonlinear relationships between joint angles, muscle tensions, and muscle lengths from the actual robot is necessary. In this study, we propose a system which runs the learning control mechanism for a long time to keep the self-body image of the musculoskeletal humanoid correct at all times. Also, we show that the musculoskeletal humanoid can conduct position control, torque control, and variable stiffness control using this self-body image. We conduct a long-time self-body image acquisition experiment lasting 3 hours, evaluate variable stiffness control using the self-body image, etc., and discuss the superiority and practicality of the self-body image acquisition of musculoskeletal structures, comprehensively.

Submitted 8 April, 2024; originally announced April 2024.

Comments: Accepted at IEEE Robotics and Automation Letters, 2019

arXiv:2404.05286 [pdf, other]

doi 10.1109/IROS.2018.8593428

Online Self-body Image Acquisition Considering Changes in Muscle Routes Caused by Softness of Body Tissue for Tendon-driven Musculoskeletal Humanoids

Authors: Kento Kawaharazuka, Shogo Makino, Masaya Kawamura, Ayaka Fujii, Yuki Asano, Kei Okada, Masayuki Inaba

Abstract: Tendon-driven musculoskeletal humanoids have many benefits in terms of the flexible spine, multiple degrees of freedom, and variable stiffness. At the same time, because of its body complexity, there are problems in controllability. First, due to the large difference between the actual robot and its geometric model, it cannot move as intended and large internal muscle tension may emerge. Second, movements which do not appear as changes in muscle lengths may emerge, because of the muscle route changes caused by softness of body tissue. To solve these problems, we construct two models: ideal joint-muscle model and muscle-route change model, using a neural network. We initialize these models by a man-made geometric model and update them online using the sensor information of the actual robot. We validate that the tendon-driven musculoskeletal humanoid Kengoro is able to obtain a correct self-body image through several experiments.

Submitted 8 April, 2024; originally announced April 2024.

Comments: Accepted at IROS2018

arXiv:2404.00890 [pdf, other]

doi 10.1109/HUMANOIDS47582.2021.9555807

Development of Musculoskeletal Legs with Planar Interskeletal Structures to Realize Human Comparable Moving Function

Authors: Moritaka Onitsuka, Manabu Nishiura, Kento Kawaharazuka, Kei Tsuzuki, Yasunori Toshimitsu, Yusuke Omura, Yuki Asano, Kei Okada, Koji Kawasaki, Masayuki Inaba

Abstract: Musculoskeletal humanoids have been developed by imitating humans and are expected to perform natural and dynamic motions as well as humans. Achieving desired motions stably in current musculoskeletal humanoids is not easy because they cannot maintain a sufficient moment arm of muscles in various postures. In this research, we discuss planar structures that spread across joint structures, such as ligaments and planar muscles, and the application of planar interskeletal structures to humanoid robots. Next, we develop MusashiOLegs, musculoskeletal legs with planar interskeletal structures, and conduct several experiments to verify the importance of planar interskeletal structures.

Submitted 31 March, 2024; originally announced April 2024.

Comments: accepted at Humanoids2020

arXiv:2403.17459 [pdf, other]

doi 10.1109/IROS.2017.8202291

High-Power, Flexible, Robust Hand: Development of Musculoskeletal Hand Using Machined Springs and Realization of Self-Weight Supporting Motion with Humanoid

Authors: Shogo Makino, Kento Kawaharazuka, Masaya Kawamura, Yuki Asano, Kei Okada, Masayuki Inaba

Abstract: Humans can not only support their bodies while standing or walking, but also support them by hand, so that they can, for example, dangle from a bar. But most humanoid robots support their bodies only with their feet and use their hands just to manipulate objects, because their hands are too weak to support their bodies. Strong hands are supposed to enable humanoid robots to act in a much broader range of scenes. Therefore, we developed a new life-size five-fingered hand that can support the body of a life-size humanoid robot. It is a tendon-driven, underactuated hand, and actuators in the forearms produce a large gripping force. This hand has flexible joints using machined springs, which can be designed integrally with the attachment. Thus, it has both structural strength and impact resistance in spite of its small size. As other characteristics, this hand has force sensors to measure external force, and the fingers can be flexed along objects even though the number of actuators that flex the fingers is less than the number of fingers. We installed the developed hand on the musculoskeletal humanoid "Kengoro" and achieved two self-weight supporting motions: a push-up motion and a dangling motion.

Submitted 26 March, 2024; originally announced March 2024.

Comments: accepted at IROS2017

arXiv:2403.17452 [pdf, other]

doi 10.1109/IROS.2018.8594316

Five-fingered Hand with Wide Range of Thumb Using Combination of Machined Springs and Variable Stiffness Joints

Authors: Shogo Makino, Kento Kawaharazuka, Ayaka Fujii, Masaya Kawamura, Tasuku Makabe, Moritaka Onitsuka, Yuki Asano, Kei Okada, Koji Kawasaki, Masayuki Inaba

Abstract: Human hands can not only grasp objects of various shapes and sizes and manipulate them in hand, but also exert such a large gripping force that they can support the body in situations such as dangling from a bar or climbing a ladder. In contrast, it is difficult for most robot hands to manage both. Therefore, in this paper we developed a hand that can grasp various objects and exert a large gripping force. To develop such a hand, we focused on a thumb CM joint with a wide range of motion and on MP joints of the four fingers with the DOF of abduction and adduction. Starting from a hand with large gripping force and flexibility using machined springs, we applied the above-mentioned joint mechanisms to it. The thumb CM joint has a wide range of motion because of the combination of three machined springs, and the MP joints of the four fingers have a variable rigidity mechanism instead of driving each joint independently, so that the joints can move in limited space with limited actuators. Using the developed hand, we achieved grasping of various objects, supporting a large load, and several motions with an arm.

Submitted 26 March, 2024; originally announced March 2024.

Comments: accepted at IROS2018

arXiv:2403.01502 [pdf, ps, other]

Oscillating-charged Andreev Bound States and Their Appearance in UTe$_2$

Authors: Satoshi Ando, Shingo Kobayashi, Andreas P. Schnyder, Yasuhiro Asano, Satoshi Ikegaya

Abstract: In a superconductor with a sublattice degree of freedom, we find unconventional Andreev bound states whose charge density oscillates in sign between the two sublattices. The appearance of these oscillating-charged Andreev bound states is characterized by a Zak phase, rather than a conventional topological invariant. In contrast to conventional Andreev bound states, for oscillating-charged Andreev bound states the proportionality between the electron-like spectral function, the local density of states and the tunneling conductance is broken. We examine the possible appearance of these novel Andreev bound states in UTe$_2$ and locally noncentrosymmetric superconductors.

Submitted 3 March, 2024; originally announced March 2024.

arXiv:2402.16844 [pdf, other]

Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

Authors: Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi

Abstract: Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following. However, their enormous size and reliance on autoregressive decoding increase deployment costs and complicate their use in latency-critical applications. In this work, we propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding while maintaining high performance. Our method utilizes a pretrained frozen LLM that encodes all prompt tokens once in parallel, and uses the resulting representations to condition and guide a small language model (SLM), which then generates the response more efficiently. We investigate the combination of encoder-decoder LLMs with both encoder-decoder and decoder-only SLMs from different model families and only require fine-tuning of the SLM. Experiments with various benchmarks show substantial speedups of up to $4\times$, with minor performance penalties of $1-2\%$ for translation and summarization tasks compared to the LLM.

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2402.14957 [pdf, other]

The Common Stability Mechanism behind most Self-Supervised Learning Approaches

Authors: Abhishek Jha, Matthew B. Blaschko, Yuki M. Asano, Tinne Tuytelaars

Abstract: The last couple of years have witnessed tremendous progress in self-supervised learning (SSL), the success of which can be attributed to the introduction of useful inductive biases in the learning process to learn meaningful visual representations while avoiding collapse. These inductive biases and constraints manifest themselves in the form of different optimization formulations in the SSL techniques, e.g. by utilizing negative examples in a contrastive formulation, or exponential moving average and predictor in BYOL and SimSiam. In this paper, we provide a framework to explain the stability mechanism of these different SSL techniques: i) we discuss the working mechanism of contrastive techniques like SimCLR, non-contrastive techniques like BYOL, SWAV, SimSiam, Barlow Twins, and DINO; ii) we provide an argument that despite different formulations these methods implicitly optimize a similar objective function, i.e. minimizing the magnitude of the expected representation over all data samples, or the mean of the data distribution, while maximizing the magnitude of the expected representation of individual samples over different data augmentations; iii) we provide mathematical and empirical evidence to support our framework. We formulate different hypotheses and test them using the Imagenet100 dataset.

Submitted 22 February, 2024; originally announced February 2024.

Comments: Additional visualizations (.gif): https://github.com/abskjha/CenterVectorSSL

arXiv:2402.08657 [pdf, other]

PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs

Authors: Michael Dorkenwald, Nimrod Barazani, Cees G. M. Snoek, Yuki M. Asano

Abstract: Vision-Language Models (VLMs), such as Flamingo and GPT-4V, have shown immense potential by integrating large language models with vision systems. Nevertheless, these models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions without explicit spatial grounding. While it is possible to construct custom, supervised training pipelines with bounding box annotations that integrate with VLMs, these result in specialized and hard-to-scale models. In this paper, we aim to explore the limits of caption-based VLMs and instead propose to tackle the challenge in a simpler manner by i) keeping the weights of a caption-based VLM frozen and ii) not using any supervised detection data. To this end, we introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing a minimal set of parameters that are slid inside the frozen VLM, unlocking object localisation capabilities. Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads. Our experiments demonstrate strong zero-shot localisation performances on a variety of images, including Pascal VOC, COCO, LVIS, and diverse images like paintings or cartoons.

Submitted 13 February, 2024; originally announced February 2024.

arXiv:2401.11485 [pdf, other]

doi 10.1145/3658144

ColorVideoVDP: A visual difference predictor for image, video and display distortions

Authors: Rafal K. Mantiuk, Param Hanji, Maliha Ashraf, Yuta Asano, Alexandre Chapiro

Abstract: ColorVideoVDP is a video and image quality metric that models spatial and temporal aspects of vision, for both luminance and color. The metric is built on novel psychophysical models of chromatic spatiotemporal contrast sensitivity and cross-channel contrast masking. It accounts for the viewing conditions, geometric, and photometric characteristics of the display. It was trained to predict common video streaming distortions (e.g. video compression, rescaling, and transmission errors), and also 8 new distortion types related to AR/VR displays (e.g. light source and waveguide non-uniformities). To address the latter application, we collected our novel XR-Display-Artifact-Video quality dataset (XR-DAVID), comprised of 336 distorted videos. Extensive testing on XR-DAVID, as well as several datasets from the literature, indicates a significant gain in prediction performance compared to existing metrics. ColorVideoVDP opens the doors to many novel applications which require the joint automated spatiotemporal assessment of luminance and color distortions, including video streaming, display specification and design, visual comparison of results, and perceptually-guided quality optimization.

Submitted 2 July, 2024; v1 submitted 21 January, 2024; originally announced January 2024.

Comments: 28 pages

Journal ref: SIGGRAPH 2024 Technical Papers, Article 129

arXiv:2401.05735 [pdf, other]

Object-Centric Diffusion for Efficient Video Editing

Authors: Kumara Kahatapitiya, Adil Karjauv, Davide Abati, Fatih Porikli, Yuki M. Asano, Amirhossein Habibian

Abstract: Diffusion-based video editing methods have reached impressive quality and can transform the global style, local structure, and attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally-coherent frames, either in the form of diffusion inversion and/or cross-frame attention. In this paper, we conduct an analysis of such inefficiencies, and suggest simple yet effective modifications that allow significant speed-ups whilst maintaining quality. Moreover, we introduce Object-Centric Diffusion, coined as OCD, to further reduce latency by allocating computations more towards foreground edited regions that are arguably more important for perceptual quality. We achieve this by two novel proposals: i) Object-Centric Sampling, decoupling the diffusion steps spent on salient regions or background, allocating most of the model capacity to the former, and ii) Object-Centric 3D Token Merging, which reduces cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining, and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction up to 10x for a comparable synthesis quality.

Submitted 11 January, 2024; originally announced January 2024.

arXiv:2312.17244 [pdf, other]

The LLM Surgeon

Authors: Tycho F. A. van der Ouderaa, Markus Nagel, Mart van Baalen, Yuki M. Asano, Tijmen Blankevoort

Abstract: State-of-the-art language models are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data. However, the sheer size of the Transformer architectures makes it difficult to deploy models within computational, environmental or device-specific constraints. We explore data-driven compression of existing pretrained models as an alternative to training smaller models from scratch. To do so, we scale Kronecker-factored curvature approximations of the target loss landscape to large language models. In doing so, we can compute both the dynamic allocation of structures that can be removed as well as updates of remaining weights that account for the removal. We provide a general framework for unstructured, semi-structured and structured pruning and improve upon weight updates to capture more correlations between weights, while remaining computationally efficient. Experimentally, our method can prune rows and columns from a range of OPT models and Llamav2-7B by 20%-30%, with a negligible loss in performance, and achieve state-of-the-art results in unstructured and semi-structured pruning of large language models.

Submitted 20 March, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

arXiv:2312.11581 [pdf, other]

Protect Your Score: Contact Tracing With Differential Privacy Guarantees

Authors: Rob Romijnders, Christos Louizos, Yuki M. Asano, Max Welling

Abstract: The pandemic in 2020 and 2021 had enormous economic and societal consequences, and studies show that contact tracing algorithms can be key in the early containment of the virus. While large strides have been made towards more effective contact tracing algorithms, we argue that privacy concerns currently hold deployment back. The essence of a contact tracing algorithm constitutes the communication of a risk score. Yet, it is precisely the communication and release of this score to a user that an adversary can leverage to gauge the private health status of an individual. We pinpoint a realistic attack scenario and propose a contact tracing algorithm with differential privacy guarantees against this attack. The algorithm is tested on the two most widely used agent-based COVID19 simulators and demonstrates superior performance in a wide range of settings. Especially for realistic test scenarios and while releasing each risk score with epsilon=1 differential privacy, we achieve a two to ten-fold reduction in the infection rate of the virus. To the best of our knowledge, this presents the first contact tracing algorithm with differential privacy guarantees when revealing risk scores for COVID19.

Submitted 15 February, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

Comments: Accepted to The 38th Annual AAAI Conference on Artificial Intelligence (AAAI 2024)

arXiv:2312.08895 [pdf, other]

Motion Flow Matching for Human Motion Synthesis and Editing

Authors: Vincent Tao Hu, Wenzhe Yin, Pingchuan Ma, Yunlu Chen, Basura Fernando, Yuki M Asano, Efstratios Gavves, Pascal Mettes, Bjorn Ommer, Cees G. M. Snoek

Abstract: Human motion synthesis is a fundamental task in computer animation. Recent methods based on diffusion models or GPT structure demonstrate commendable performance but exhibit drawbacks in terms of slow sampling speeds and error accumulation. In this paper, we propose Motion Flow Matching, a novel generative model designed for human motion generation featuring efficient sampling and effectiveness in motion editing applications. Our method reduces the sampling complexity from a thousand steps in previous diffusion models to just ten steps, while achieving comparable performance in text-to-motion and action-to-motion generation benchmarks. Notably, our approach establishes a new state-of-the-art Fréchet Inception Distance on the KIT-ML dataset. What is more, we tailor a straightforward motion editing paradigm named sampling trajectory rewriting, leveraging the ODE-style generative models, and apply it to various editing scenarios including motion prediction, motion in-between prediction, motion interpolation, and upper-body editing. Our code will be released.

Submitted 14 December, 2023; originally announced December 2023.

Comments: WIP

arXiv:2312.08892 [pdf, other]

VaLID: Variable-Length Input Diffusion for Novel View Synthesis

Authors: Shijie Li, Farhad G. Zanjani, Haitam Ben Yahia, Yuki M. Asano, Juergen Gall, Amirhossein Habibian

Abstract: Novel View Synthesis (NVS), which tries to produce a realistic image at the target view given source view images and their corresponding poses, is a fundamental problem in 3D Vision. As this task is heavily under-constrained, some recent work, like Zero123, tries to solve this problem with generative modeling, specifically using pre-trained diffusion models. Although this strategy generalizes well to new scenes, compared to neural radiance field-based methods, it offers low levels of flexibility. For example, it can only accept a single-view image as input, despite realistic applications often offering multiple input images. This is because the source-view images and corresponding poses are processed separately and injected into the model at different stages. Thus it is not trivial to generalize the model to multi-view source images, once they are available. To solve this issue, we try to process each pose-image pair separately and then fuse them as a unified visual representation which will be injected into the model to guide image synthesis at the target views. However, inconsistency and computation costs increase as the number of input source-view images increases. To solve these issues, the Multi-view Cross Former module is proposed, which maps variable-length input data to fixed-size output data. A two-stage training strategy is introduced to further improve the efficiency during training time. Qualitative and quantitative evaluation over multiple datasets demonstrates the effectiveness of the proposed method against previous approaches. The code will be released upon acceptance.

Submitted 14 December, 2023; originally announced December 2023.

Comments: paper and supplementary material

arXiv:2312.08825 [pdf, other]

Guided Diffusion from Self-Supervised Diffusion Features

Authors: Vincent Tao Hu, Yunlu Chen, Mathilde Caron, Yuki M. Asano, Cees G. M. Snoek, Bjorn Ommer

Abstract: Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or classifier pretraining. That is why guidance was harnessed from self-supervised learning backbones, like DINO. However, recent studies have revealed that the feature representation derived from diffusion model itself is discriminative for numerous downstream tasks as well, which prompts us to propose a framework to extract guidance from, and specifically for, diffusion models. Our research has yielded several significant contributions. Firstly, the guidance signals from diffusion models are on par with those from class-conditioned diffusion models. Secondly, feature regularization, when based on the Sinkhorn-Knopp algorithm, can further enhance feature discriminability in comparison to unconditional diffusion models. Thirdly, we have constructed an online training approach that can concurrently derive guidance from diffusion models for diffusion models. Lastly, we have extended the application of diffusion models along the constant velocity path of ODE to achieve a more favorable balance between sampling steps and fidelity. The performance of our methods has been outstanding, outperforming related baseline comparisons in large-resolution datasets, such as ImageNet256, ImageNet256-100 and LSUN-Churches. Our code will be released.

Submitted 14 December, 2023; originally announced December 2023.

Comments: Work In Progress

arXiv:2312.04539 [pdf, other]

Auto-Vocabulary Semantic Segmentation

Authors: Osman Ülger, Maksymilian Kulicki, Yuki Asano, Martin R. Oswald

Abstract: Open-ended image understanding tasks gained significant attention from the research community, particularly with the emergence of Vision-Language Models. Open-Vocabulary Segmentation (OVS) methods are capable of performing semantic segmentation without relying on a fixed vocabulary, and in some cases, they operate without the need for training or fine-tuning. However, OVS methods typically require users to specify the vocabulary based on the task or dataset at hand. In this paper, we introduce Auto-Vocabulary Semantic Segmentation (AVS), advancing open-ended image understanding by eliminating the necessity to predefine object categories for segmentation. Our approach, \ours, presents a framework that autonomously identifies relevant class names using enhanced BLIP embeddings, which are utilized for segmentation afterwards. Given that open-ended object category predictions cannot be directly compared with a fixed ground truth, we develop a Large Language Model-based Auto-Vocabulary Evaluator (LAVE) to efficiently evaluate the automatically generated class names and their corresponding segments. Our method sets new benchmarks on datasets such as PASCAL VOC and Context, ADE20K, and Cityscapes for AVS and showcases competitive performance to OVS methods that require specified class names.

Submitted 20 March, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

arXiv:2311.17299 [pdf, other]

Federated Fine-Tuning of Foundation Models via Probabilistic Masking

Authors: Vasileios Tsouvalas, Yuki Asano, Aaqib Saeed

Abstract: Foundation Models (FMs) have revolutionized machine learning with their adaptability and high performance across tasks; yet, their integration into Federated Learning (FL) is challenging due to substantial communication overhead from their extensive parameterization. Current communication-efficient FL strategies, such as gradient compression, reduce bitrates to around $1$ bit-per-parameter (bpp). However, these approaches fail to harness the characteristics of FMs, with their large number of parameters still posing a challenge to communication efficiency, even at these bitrate regimes. In this work, we present DeltaMask, a novel method that efficiently fine-tunes FMs in FL at an ultra-low bitrate, well below 1 bpp. DeltaMask employs stochastic masking to detect highly effective subnetworks within FMs and leverages stochasticity and sparsity in client masks to compress updates into a compact grayscale image using probabilistic filters, deviating from traditional weight training approaches. Our comprehensive evaluations across various datasets and architectures demonstrate that DeltaMask efficiently achieves bitrates as low as 0.09 bpp, enhancing communication efficiency while maintaining FM performance, as measured on 8 datasets and 5 pre-trained models of various network architectures.

Submitted 28 November, 2023; originally announced November 2023.

Comments: 19 pages, 9 figures

arXiv:2310.11454 [pdf, other]

VeRA: Vector-based Random Matrix Adaptation

Authors: Dawid J. Kopiczko, Tijmen Blankevoort, Yuki M. Asano

Abstract: Low-rank adaptation (LoRA) is a popular method that reduces the number of trainable parameters when finetuning large language models, but still faces acute storage challenges when scaling to even larger models or deploying numerous per-user or per-task adapted models. In this work, we present Vector-based Random Matrix Adaptation (VeRA), which significantly reduces the number of trainable parameters compared to LoRA, yet maintains the same performance. It achieves this by using a single pair of low-rank matrices shared across all layers and learning small scaling vectors instead. We demonstrate its effectiveness on the GLUE and E2E benchmarks, image classification tasks, and show its application in instruction-tuning of 7B and 13B language models.

Submitted 16 January, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

Comments: Accepted at ICLR 2024, website: https://dkopi.github.io/vera

arXiv:2310.08584 [pdf, other]

Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video

Authors: Shashanka Venkataramanan, Mamshad Nayeem Rizve, João Carreira, Yuki M. Asano, Yannis Avrithis

Abstract: Self-supervised learning has unlocked the potential of scaling up pretraining to billions of images, since annotation is unnecessary. But are we making the best use of data? How much more economical can we be? In this work, we attempt to answer this question by making two contributions. First, we investigate first-person videos and introduce a "Walking Tours" dataset. These videos are high-resolution, hours-long, captured in a single uninterrupted take, depicting a large number of objects and actions with natural scene transitions. They are unlabeled and uncurated, thus realistic for self-supervision and comparable with human learning. Second, we introduce a novel self-supervised image pretraining method tailored for learning from continuous videos. Existing methods typically adapt image-based pretraining approaches to incorporate more frames. Instead, we advocate a "tracking to learn to recognize" approach. Our method, called DoRA, leads to attention maps that Discover and tRAck objects over time in an end-to-end manner, using transformer cross-attention. We derive multiple views from the tracks and use them in a classical self-supervised distillation loss. Using our novel approach, a single Walking Tours video remarkably becomes a strong competitor to ImageNet for several image and video downstream tasks.

Submitted 23 May, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

Comments: Accepted to ICLR 2024 (Best paper honorable mention). Project Page: https://shashankvkt.github.io/dora

arXiv:2310.00500 [pdf, other]

Self-Supervised Open-Ended Classification with Small Visual Language Models

Authors: Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Cees G. M. Snoek, Marcel Worring, Yuki M. Asano

Abstract: We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks few-shot abilities for open-ended classification with small visual language models. Our approach imitates image captions in a self-supervised way based on clustering a large pool of images followed by assigning semantically-unrelated names to clusters. By doing so, we construct a training signal consisting of interleaved sequences of image and pseudo-caption pairs and a query image, which we denote as the 'self-context' sequence. Based on this signal the model is trained to produce the right pseudo-caption. We demonstrate the performance and flexibility of SeCAt on several multimodal few-shot datasets, spanning various granularities. By using models with approximately 1B parameters we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe. SeCAt opens new possibilities for research and applications in open-ended few-shot learning that otherwise requires access to large or proprietary models.

Submitted 6 December, 2023; v1 submitted 30 September, 2023; originally announced October 2023.

arXiv:2308.11796 [pdf, other]

Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations

Authors: Mohammadreza Salehi, Efstratios Gavves, Cees G. M. Snoek, Yuki M. Asano

Abstract: Spatially dense self-supervised learning is a rapidly growing problem domain with promising applications for unsupervised segmentation and pretraining for dense downstream tasks. Despite the abundance of temporal data in the form of videos, this information-rich source has been largely overlooked. Our paper aims to address this gap by proposing a novel approach that incorporates temporal consistency in dense self-supervised learning. While methods designed solely for images face difficulties in achieving even the same performance on videos, our method improves the representation quality not only for videos but also for images. Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos. This effectively facilitates the transfer of high-level information from videos to image representations. Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images. We believe this method paves the way for further self-supervised scaling by leveraging the abundant availability of videos. The implementation can be found here: https://github.com/SMSD75/Timetuning

Submitted 22 August, 2023; originally announced August 2023.

arXiv:2308.07350 [pdf, other]

Efficient Neural PDE-Solvers using Quantization Aware Training

Authors: Winfried van den Dool, Tijmen Blankevoort, Max Welling, Yuki M. Asano

Abstract: In the past years, the application of neural networks as an alternative to classical numerical methods to solve Partial Differential Equations has emerged as a potential paradigm shift in this century-old mathematical field. However, in terms of practical applicability, computational cost remains a substantial bottleneck. Classical approaches try to mitigate this challenge by limiting the spatial resolution on which the PDEs are defined. For neural PDE solvers, we can do better: Here, we investigate the potential of state-of-the-art quantization methods on reducing computational costs. We show that quantizing the network weights and activations can successfully lower the computational cost of inference while maintaining performance. Our results on four standard PDE datasets and three network architectures show that quantization-aware training works across settings and three orders of magnitude of FLOPs. Finally, we empirically demonstrate that Pareto-optimality of computational cost vs performance is almost always achieved only by incorporating quantization.

Submitted 14 August, 2023; originally announced August 2023.

Comments: Accepted at the ICCV 2023 Workshop on Resource Efficient Deep Learning for Computer Vision

arXiv:2308.02211 [pdf, ps, other]

Discontinuous Transition to Superconducting Phase

Authors: Takumi Sato, Shingo Kobayashi, Yasuhiro Asano

Abstract: We discuss the instability of uniform superconducting states that contain the pairing correlations belonging to the odd-frequency symmetry class. The instability originates from the paramagnetic response of odd-frequency Cooper pairs and is considerable at finite temperatures. As a result, the pair potential varies discontinuously at the transition temperature when the amplitude of the odd-frequency pairing correlation functions is sufficiently large. The discontinuous transition to the superconducting phase is a general feature of superconductors that include odd-frequency Cooper pairs.

Submitted 1 July, 2024; v1 submitted 4 August, 2023; originally announced August 2023.

Comments: 16 pages, 2 figures

arXiv:2307.08727 [pdf, other]

Learning to Count without Annotations

Authors: Lukas Knobel, Tengda Han, Yuki M. Asano

Abstract: While recent supervised methods for reference-based object counting continue to improve the performance on benchmark datasets, they have to rely on small datasets due to the cost associated with manually annotating dozens of objects in images. We propose UnCounTR, a model that can learn this task without requiring any manual annotations. To this end, we construct "Self-Collages", images with various pasted objects as training samples, that provide a rich learning signal covering arbitrary object types and counts. Our method builds on existing unsupervised representations and segmentation techniques to successfully demonstrate for the first time the ability of reference-based counting without manual supervision. Our experiments show that our method not only outperforms simple baselines and generic models such as FasterRCNN and DETR, but also matches the performance of supervised counting models in some domains.

Submitted 29 March, 2024; v1 submitted 17 July, 2023; originally announced July 2023.

Comments: Accepted at CVPR'24. Code available at https://github.com/lukasknobel/SelfCollages

arXiv:2306.13291 [pdf, ps, other]

Multi-locational Majorana Zero Modes

Authors: Yutaro Nagae, Andreas P. Schnyder, Yukio Tanaka, Yasuhiro Asano, Satoshi Ikegaya

Abstract: We show the appearance of an unconventional Majorana zero mode whose wave function splits into multiple parts located at different ends of different topological superconductors, hereinafter referred to as a multi-locational Majorana zero mode. Specifically, we discuss the multi-locational Majorana zero modes in a three-terminal Josephson junction consisting of topological superconductors, which forms an elemental qubit of fault-tolerant topological quantum computers. We also demonstrate anomalously long-ranged nonlocal resonant transport phenomena caused by the multi-locational Majorana zero mode.

Submitted 11 March, 2024; v1 submitted 23 June, 2023; originally announced June 2023.

Comments: 6+6 pages, 2+2 figures

arXiv:2306.09643 [pdf, other]

BISCUIT: Causal Representation Learning from Binary Interactions

Authors: Phillip Lippe, Sara Magliacane, Sindy Löwe, Yuki M. Asano, Taco Cohen, Efstratios Gavves

Abstract: Identifying the causal variables of an environment and how to intervene on them is of core value in applications such as robotics and embodied AI. While an agent can commonly interact with the environment and may implicitly perturb the behavior of some of these causal variables, often the targets it affects remain unknown. In this paper, we show that causal variables can still be identified for many common setups, e.g., additive Gaussian noise models, if the agent's interactions with a causal variable can be described by an unknown binary variable. This happens when each causal variable has two different mechanisms, e.g., an observational and an interventional one. Using this identifiability result, we propose BISCUIT, a method for simultaneously learning causal variables and their corresponding binary interaction variables. On three robotic-inspired datasets, BISCUIT accurately identifies causal variables and can even be scaled to complex, realistic environments for embodied AI.

Submitted 16 June, 2023; originally announced June 2023.

Comments: Published in: Uncertainty in Artificial Intelligence (UAI 2023). Project page: https://phlippe.github.io/BISCUIT/

arXiv:2306.07302 [pdf, other]

Impact of Experiencing Misrecognition by Teachable Agents on Learning and Rapport

Authors: Yuya Asano, Diane Litman, Mingzhi Yu, Nikki Lobczowski, Timothy Nokes-Malach, Adriana Kovashka, Erin Walker

Abstract: While speech-enabled teachable agents have some advantages over typing-based ones, they are vulnerable to errors stemming from misrecognition by automatic speech recognition (ASR). These errors may propagate, resulting in unexpected changes in the flow of conversation. We analyzed how such changes are linked with learning gains and learners' rapport with the agents. Our results show they are not related to learning gains or rapport, regardless of the types of responses the agents should have returned given the correct input from learners without ASR errors. We also discuss the implications for optimal error-recovery policies for teachable agents that can be drawn from these findings.

Submitted 11 June, 2023; originally announced June 2023.

Comments: Accepted to AIED 2023

arXiv:2305.06015 [pdf, other]

doi 10.1103/PhysRevB.108.064509

Fulde-Ferrell-Larkin-Ovchinnikov state in a superconducting thin film attached to a ferromagnetic cluster

Authors: Shu-Ichiro Suzuki, Takumi Sato, Alexander A. Golubov, Yasuhiro Asano

Abstract: We study theoretically the Fulde-Ferrell-Larkin-Ovchinnikov (FFLO) states appearing locally in a superconducting thin film with a small circular magnetic cluster. The pair potential, the pairing correlations, the free-energy density, and the quasiparticle density of states are calculated for several cluster sizes and the exchange potentials by solving the Eilenberger equation in two dimensions. The number of nodes in the pair potential increases with increasing the exchange potential and cluster size. The local FFLO states are stabilized by the superconducting condensate away from the magnetic cluster even though the free-energy density beneath the ferromagnet exceeds locally the normal-state value. The analysis of the pairing-correlation functions shows that the spatial variation of the spin-singlet $s$-wave pair potential generates $p$-wave Cooper pairs, and that odd-frequency Cooper pairs govern the inhomogeneous subgap spectra in the local density of states. We also discuss a way of detecting the local FFLO states based on the calculated quasiparticle density of states.

Submitted 27 July, 2023; v1 submitted 10 May, 2023; originally announced May 2023.

Comments: 9 pages, 6 figures

Journal ref: Phys. Rev. B 108, 064509 (2023)

arXiv:2304.08102 [pdf, other]

doi 10.1103/PhysRevB.108.144505

Supercurrent reversal in Zeeman-split Josephson junctions

Authors: Shu-Ichiro Suzuki, Yasuhiro Asano, Alexander A. Golubov

Abstract: We study theoretically the shape of the current-phase relation in a Josephson junction comprising the Zeeman-split superconductors (ZSs) and a normal metal (N). We show that at low temperatures the Josephson current in the ZS/N/ZS junctions exhibits an additional reversal in direction at a certain phase difference $\varphi_c \in (0, \pi)$. Calculating the spectral Josephson current, the band-splitting due to the Zeeman interaction is shown to cause the level crossing in the spectra of the Andreev bound states and the sign reversal in the Josephson current. Additionally, we propose an alternative method to electrically control the critical phase difference $\varphi_c$ by tuning the Rashba spin-orbit coupling, eliminating the need for manipulating magnetizations.

Submitted 16 June, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

Comments: 8 pages, 10 figures

Journal ref: Phys. Rev. B 108, 144505 (2023)

arXiv:2304.00961 [pdf, other]

Self-Ordering Point Clouds

Authors: Pengwan Yang, Cees G. M. Snoek, Yuki M. Asano

Abstract: In this paper we address the task of finding representative subsets of points in a 3D point cloud by means of a point-wise ordering. Only a few works have tried to address this challenging vision problem, all with the help of hard-to-obtain point and cloud labels. Different from these works, we introduce the task of point-wise ordering in 3D point clouds through self-supervision, which we call self-ordering. We further contribute the first end-to-end trainable network that learns a point-wise ordering in a self-supervised fashion. It utilizes a novel differentiable point scoring-sorting strategy and it constructs a hierarchical contrastive scheme to obtain self-supervision signals. We extensively ablate the method and show its scalability and superior performance even compared to supervised ordering methods on multiple datasets and tasks including zero-shot ordering of point clouds from unseen categories.

Submitted 10 April, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

arXiv:2303.01008 [pdf, other]

The dynamics of zero modes in lattice gauge theory---difference between SU(2) and SU(3) in 4D

Authors: Yuhma Asano, Jun Nishimura

Abstract: The dynamics of zero modes in gauge theory is highly nontrivial due to its nonperturbative nature even in the case where the other modes can be treated perturbatively. One of the related issues concerns the possible instability of the trivial vacuum $A_\mu(x)=0$ due to the existence of nontrivial degenerate vacua known as "torons". Here we investigate this issue for the 4D SU(2) and SU(3) pure Yang-Mills theories on the lattice by explicit Monte Carlo calculation of the Wilson loops and the Polyakov line at large $\beta$. While we confirm the leading $1/\beta$ predictions obtained around the trivial vacuum in both SU(2) and SU(3) cases, we find that the subleading term vanishes only logarithmically in the SU(2) case unlike the power-law decay in the SU(3) case. In fact, the 4D SU(2) case is marginal according to the criterion by Coste et al. Here we show that the trivial vacuum dominates in this case due to large fluctuations of the zero modes around it, thereby providing a clear understanding of the observed behaviors.

Submitted 7 March, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

Comments: 17 pages, 30 figures; v2: references added and typos corrected

Report number: UTHEP-778, KEK-TH-2500

arXiv:2302.00353 [pdf, other]

Towards Label-Efficient Incremental Learning: A Survey

Authors: Mert Kilickaya, Joost van de Weijer, Yuki M. Asano

Abstract: The current dominant paradigm when building a machine learning model is to iterate over a dataset over and over until convergence. Such an approach is non-incremental, as it assumes access to all images of all categories at once. However, for many applications, non-incremental learning is unrealistic. To that end, researchers study incremental learning, where a learner is required to adapt to an incoming stream of data with a varying distribution while preventing forgetting of past knowledge. Significant progress has been made; however, the vast majority of works focus on the fully supervised setting, making these algorithms label-hungry and thus limiting their real-life deployment. To that end, in this paper, we make the first attempt to survey recently growing interest in label-efficient incremental learning. We identify three subdivisions, namely semi-, few-shot- and self-supervised learning to reduce labeling efforts. Finally, we identify novel directions that can further enhance label-efficiency and improve incremental learning scalability. Project website: https://github.com/kilickaya/label-efficient-il.

Submitted 11 February, 2023; v1 submitted 1 February, 2023; originally announced February 2023.

arXiv:2301.02240 [pdf, other]

Skip-Attention: Improving Vision Transformers by Paying Less Attention

Authors: Shashanka Venkataramanan, Amir Ghodrati, Yuki M. Asano, Fatih Porikli, Amirhossein Habibian

Abstract: This work aims to improve the efficiency of vision transformers (ViT). While ViTs use computationally expensive self-attention operations in every layer, we identify that these operations are highly correlated across layers -- a key redundancy that causes unnecessary computations. Based on this observation, we propose SkipAt, a method to reuse self-attention computation from preceding layers to approximate attention at one or more subsequent layers. To ensure that reusing self-attention blocks across layers does not degrade the performance, we introduce a simple parametric function, which outperforms the baseline transformer's performance while running computationally faster. We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS. We achieve improved throughput at the same-or-higher accuracy levels in all these tasks.

Submitted 17 January, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

arXiv:2211.13716 [pdf, other]

doi 10.1093/ptep/ptad042

On the existence of the NS5-brane limit of the plane wave matrix model

Authors: Yuhma Asano, Goro Ishiki, Takaki Matsumoto, Shinji Shimasaki, Hiromasa Watanabe

Abstract: We consider a double scaling limit of the plane wave matrix model (PWMM), in which the gravity dual geometry of PWMM reduces to a class of spherical NS5-brane solutions. We identify the form of the scaling limit for the dual geometry of PWMM around a general vacuum and then translate the limit into the field theoretic language. We also show that the limit indeed exists at least in a certain planar 1/4-BPS sector of PWMM by using the localization computation analytically. In addition, we employ the hybrid Monte Carlo method to compute the matrix integral obtained by the localization method, near the parameter region where the supergravity approximation is valid. Our numerical results, which are considered to be the first computation of quantum loop correction to the Lin-Maldacena geometry, suggest that the double scaling limit exists beyond the planar sector.

Submitted 7 May, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

Comments: 35 pages, 7 figures; v2: a reference added; v3: some explanations and expressions elaborated

Report number: DIAS-STP-22-17, UTHEP-776, YITP-22-132

Journal ref: PTEP 2023 (2023) 4, 043B01

arXiv:2211.03085 [pdf, other]

doi 10.1103/PhysRevB.107.064511

Nuclear spin relaxation rate of nonunitary Dirac and Weyl superconductors

Authors: Koki Maeno, Yuki Kawaguchi, Yasuhiro Asano, Shingo Kobayashi

Abstract: Nonunitary superconductivity has attracted renewed interest as a novel gapless phase of matter. In this study, we investigate the superconducting gap structure of nonunitary odd-parity chiral pairing states in a superconductor involving strong spin-orbit interactions. By applying a group theoretical classification of chiral states in terms of discrete rotation symmetry, we categorized all possible point-nodal gap structures in nonunitary chiral states into four types in terms of the topological number of nodes and node positions relative to the rotation axis. In addition to conventional Dirac and Weyl point nodes, we identify a novel type of Dirac point node unique to nonunitary chiral superconducting states. The node type can be identified experimentally based on the temperature dependence of the nuclear magnetic resonance longitudinal relaxation rate. The implication of our results for a nonunitary odd-parity superconductor in UTe$_2$ is also discussed.

Submitted 6 November, 2022; originally announced November 2022.

Comments: 18 pages, 4 figures

arXiv:2210.10820 [pdf, other]

VTC: Improving Video-Text Retrieval with User Comments

Authors: Laura Hanu, James Thewlis, Yuki M. Asano, Christian Rupprecht

Abstract: Multi-modal retrieval is an important problem for many applications, such as recommendation and search. Current benchmarks and even datasets are often manually constructed and consist of mostly clean samples where all modalities are well-correlated with the content. Thus, current video-text retrieval literature largely focuses on video titles or audio transcripts, while ignoring user comments, since users often tend to discuss topics only vaguely related to the video. Despite the ubiquity of user comments online, there are currently no multi-modal representation learning datasets that include comments. In this paper, we a) introduce a new dataset of videos, titles and comments; b) present an attention-based mechanism that allows the model to learn from sometimes irrelevant data such as comments; c) show that by using comments, our method is able to learn better, more contextualised representations for image, video and audio. Project page: https://unitaryai.github.io/vtc-paper.

Submitted 19 October, 2022; originally announced October 2022.

Comments: Accepted paper at the European Conference on Computer Vision (ECCV) 2022
