-
Modeling the Real World with High-Density Visual Particle Dynamics
Authors:
William F. Whitney,
Jacob Varley,
Deepali Jain,
Krzysztof Choromanski,
Sumeet Singh,
Vikas Sindhwani
Abstract:
We present High-Density Visual Particle Dynamics (HD-VPD), a learned world model that can emulate the physical dynamics of real scenes by processing massive latent point clouds containing 100K+ particles. To enable efficiency at this scale, we introduce a novel family of Point Cloud Transformers (PCTs) called Interlacers leveraging intertwined linear-attention Performer layers and graph-based neig…
▽ More
We present High-Density Visual Particle Dynamics (HD-VPD), a learned world model that can emulate the physical dynamics of real scenes by processing massive latent point clouds containing 100K+ particles. To enable efficiency at this scale, we introduce a novel family of Point Cloud Transformers (PCTs) called Interlacers leveraging intertwined linear-attention Performer layers and graph-based neighbour attention layers. We demonstrate the capabilities of HD-VPD by modeling the dynamics of high degree-of-freedom bi-manual robots with two RGB-D cameras. Compared to the previous graph neural network approach, our Interlacer dynamics is twice as fast with the same prediction quality, and can achieve higher quality using 4x as many particles. We illustrate how HD-VPD can evaluate motion plan quality with robotic box pushing and can gras** tasks. See videos and particle dynamics rendered by HD-VPD at https://sites.google.com/view/hd-vpd.
△ Less
Submitted 28 June, 2024;
originally announced June 2024.
-
Structured Unrestricted-Rank Matrices for Parameter Efficient Fine-tuning
Authors:
Arijit Sehanobish,
Avinava Dubey,
Krzysztof Choromanski,
Somnath Basu Roy Chowdhury,
Deepali Jain,
Vikas Sindhwani,
Snigdha Chaturvedi
Abstract:
Recent efforts to scale Transformer models have demonstrated rapid progress across a wide range of tasks (Wei et al., 2022). However, fine-tuning these models for downstream tasks is expensive due to their large parameter counts. Parameter-efficient fine-tuning (PEFT) approaches have emerged as a viable alternative by allowing us to fine-tune models by updating only a small number of parameters. I…
▽ More
Recent efforts to scale Transformer models have demonstrated rapid progress across a wide range of tasks (Wei et al., 2022). However, fine-tuning these models for downstream tasks is expensive due to their large parameter counts. Parameter-efficient fine-tuning (PEFT) approaches have emerged as a viable alternative by allowing us to fine-tune models by updating only a small number of parameters. In this work, we propose a general framework for parameter efficient fine-tuning (PEFT), based on structured unrestricted-rank matrices (SURM) which can serve as a drop-in replacement for popular approaches such as Adapters and LoRA. Unlike other methods like LoRA, SURMs provides more flexibility in finding the right balance between compactness and expressiveness. This is achieved by using low displacement rank matrices (LDRMs), which hasn't been used in this context before. SURMs remain competitive with baselines, often providing significant quality improvements while using a smaller parameter budget. SURMs achieve 5-7% accuracy gains on various image classification tasks while replacing low-rank matrices in LoRA. It also results in up to 12x reduction of the number of parameters in adapters (with virtually no loss in quality) on the GLUE benchmark.
△ Less
Submitted 25 June, 2024;
originally announced June 2024.
-
Improving and Evaluating Machine Learning Methods for Forensic Shoeprint Matching
Authors:
Divij Jain,
Saatvik Kher,
Lena Liang,
Yufeng Wu,
Ashley Zheng,
Xizhen Cai,
Anna Plantinga,
Elizabeth Upton
Abstract:
We propose a machine learning pipeline for forensic shoeprint pattern matching that improves on the accuracy and generalisability of existing methods. We extract 2D coordinates from shoeprint scans using edge detection and align the two shoeprints with iterative closest point (ICP). We then extract similarity metrics to quantify how well the two prints match and use these metrics to train a random…
▽ More
We propose a machine learning pipeline for forensic shoeprint pattern matching that improves on the accuracy and generalisability of existing methods. We extract 2D coordinates from shoeprint scans using edge detection and align the two shoeprints with iterative closest point (ICP). We then extract similarity metrics to quantify how well the two prints match and use these metrics to train a random forest that generates a probabilistic measurement of how likely two prints are to have originated from the same outsole. We assess the generalisability of machine learning methods trained on lab shoeprint scans to more realistic crime scene shoeprint data by evaluating the accuracy of our methods on several shoeprint scenarios: partial prints, prints with varying levels of blurriness, prints with different amounts of wear, and prints from different shoe models. We find that models trained on one type of shoeprint yield extremely high levels of accuracy when tested on shoeprint pairs of the same scenario but fail to generalise to other scenarios. We also discover that models trained on a variety of scenarios predict almost as accurately as models trained on specific scenarios.
△ Less
Submitted 2 April, 2024;
originally announced May 2024.
-
PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models
Authors:
Devansh Jain,
Priyanshu Kumar,
Samuel Gehman,
Xuhui Zhou,
Thomas Hartvigsen,
Maarten Sap
Abstract:
Recent advances in large language models (LLMs) have led to their extensive global deployment, and ensuring their safety calls for comprehensive and multilingual toxicity evaluations. However, existing toxicity benchmarks are overwhelmingly focused on English, posing serious risks to deploying LLMs in other languages. We address this by introducing PolygloToxicityPrompts (PTP), the first large-sca…
▽ More
Recent advances in large language models (LLMs) have led to their extensive global deployment, and ensuring their safety calls for comprehensive and multilingual toxicity evaluations. However, existing toxicity benchmarks are overwhelmingly focused on English, posing serious risks to deploying LLMs in other languages. We address this by introducing PolygloToxicityPrompts (PTP), the first large-scale multilingual toxicity evaluation benchmark of 425K naturally occurring prompts spanning 17 languages. We overcome the scarcity of naturally occurring toxicity in web-text and ensure coverage across languages with varying resources by automatically scra** over 100M web-text documents. Using PTP, we investigate research questions to study the impact of model size, prompt language, and instruction and preference-tuning methods on toxicity by benchmarking over 60 LLMs. Notably, we find that toxicity increases as language resources decrease or model size increases. Although instruction- and preference-tuning reduce toxicity, the choice of preference-tuning method does not have any significant impact. Our findings shed light on crucial shortcomings of LLM safeguarding and highlight areas for future research.
△ Less
Submitted 20 May, 2024; v1 submitted 15 May, 2024;
originally announced May 2024.
-
Embodied AI with Two Arms: Zero-shot Learning, Safety and Modularity
Authors:
Jake Varley,
Sumeet Singh,
Deepali Jain,
Krzysztof Choromanski,
Andy Zeng,
Somnath Basu Roy Chowdhury,
Avinava Dubey,
Vikas Sindhwani
Abstract:
We present an embodied AI system which receives open-ended natural language instructions from a human, and controls two arms to collaboratively accomplish potentially long-horizon tasks over a large workspace. Our system is modular: it deploys state of the art Large Language Models for task planning,Vision-Language models for semantic perception, and Point Cloud transformers for gras**. With sem…
▽ More
We present an embodied AI system which receives open-ended natural language instructions from a human, and controls two arms to collaboratively accomplish potentially long-horizon tasks over a large workspace. Our system is modular: it deploys state of the art Large Language Models for task planning,Vision-Language models for semantic perception, and Point Cloud transformers for gras**. With semantic and physical safety in mind, these modules are interfaced with a real-time trajectory optimizer and a compliant tracking controller to enable human-robot proximity. We demonstrate performance for the following tasks: bi-arm sorting, bottle opening, and trash disposal tasks. These are done zero-shot where the models used have not been trained with any real world data from this bi-arm robot, scenes or workspace.Composing both learning- and non-learning-based components in a modular fashion with interpretable inputs and outputs allows the user to easily debug points of failures and fragilities. One may also in-place swap modules to improve the robustness of the overall platform, for instance with imitation-learned policies.
△ Less
Submitted 4 April, 2024;
originally announced April 2024.
-
Studying Differential Mental Health Expressions in India
Authors:
Khushi Shelat,
Sunny Rai,
Devansh R Jain,
Kishen Sivabalan,
Young Min Cho,
Maitreyi Redkar,
Samindara Sawant,
Sharath Chandra Guntuku
Abstract:
Psychosocial stressors and the symptomatology of mental disorders vary across cultures. However, current understandings of mental health expressions on social media are predominantly derived from studies in WEIRD (Western, Educated, Industrialized, Rich, and Democratic) contexts. In this paper, we analyze mental health posts on Reddit made by individuals in India, to identify variations in online…
▽ More
Psychosocial stressors and the symptomatology of mental disorders vary across cultures. However, current understandings of mental health expressions on social media are predominantly derived from studies in WEIRD (Western, Educated, Industrialized, Rich, and Democratic) contexts. In this paper, we analyze mental health posts on Reddit made by individuals in India, to identify variations in online depression language specific to the Indian context compared to users from the Rest of the World (ROW). Unlike in Western samples, we observe that mental health discussions in India additionally express sadness, use negation, are present-focused, and are related to work and achievement. Illness is uniquely correlated to India, indicating the association between depression and physical health in Indian patients. Two clinical psychologists validated the findings from social media posts and found 95% of the top 20 topics associated with mental health discussions as prevalent in Indians. Significant linguistic variations in online mental health-related language in India compared to ROW, emphasize the importance of develo** precision-targeted interventions that are culturally appropriate.
△ Less
Submitted 16 June, 2024; v1 submitted 18 February, 2024;
originally announced February 2024.
-
SwissNYF: Tool Grounded LLM Agents for Black Box Setting
Authors:
Somnath Sendhil Kumar,
Dhruv Jain,
Eshaan Agarwal,
Raunak Pandey
Abstract:
While Large Language Models (LLMs) have demonstrated enhanced capabilities in function-calling, these advancements primarily rely on accessing the functions' responses. This methodology is practical for simpler APIs but faces scalability issues with irreversible APIs that significantly impact the system, such as a database deletion API. Similarly, processes requiring extensive time for each API ca…
▽ More
While Large Language Models (LLMs) have demonstrated enhanced capabilities in function-calling, these advancements primarily rely on accessing the functions' responses. This methodology is practical for simpler APIs but faces scalability issues with irreversible APIs that significantly impact the system, such as a database deletion API. Similarly, processes requiring extensive time for each API call and those necessitating forward planning, like automated action pipelines, present complex challenges. Furthermore, scenarios often arise where a generalized approach is needed because algorithms lack direct access to the specific implementations of these functions or secrets to use them. Traditional tool planning methods are inadequate in these cases, compelling the need to operate within black-box environments. Unlike their performance in tool manipulation, LLMs excel in black-box tasks, such as program synthesis. Therefore, we harness the program synthesis capabilities of LLMs to strategize tool usage in black-box settings, ensuring solutions are verified prior to implementation. We introduce TOPGUN, an ingeniously crafted approach leveraging program synthesis for black box tool planning. Accompanied by SwissNYF, a comprehensive suite that integrates black-box algorithms for planning and verification tasks, addressing the aforementioned challenges and enhancing the versatility and effectiveness of LLMs in complex API interactions. The public code for SwissNYF is available at https://github.com/iclr-dummy-user/SwissNYF.
△ Less
Submitted 15 February, 2024;
originally announced February 2024.
-
SoundShift: Exploring Sound Manipulations for Accessible Mixed-Reality Awareness
Authors:
Ruei-Che Chang,
Chia-Sheng Hung,
Bing-Yu Chen,
Dhruv Jain,
Anhong Guo
Abstract:
Mixed-reality (MR) soundscapes blend real-world sound with virtual audio from hearing devices, presenting intricate auditory information that is hard to discern and differentiate. This is particularly challenging for blind or visually impaired individuals, who rely on sounds and descriptions in their everyday lives. To understand how complex audio information is consumed, we analyzed online forum…
▽ More
Mixed-reality (MR) soundscapes blend real-world sound with virtual audio from hearing devices, presenting intricate auditory information that is hard to discern and differentiate. This is particularly challenging for blind or visually impaired individuals, who rely on sounds and descriptions in their everyday lives. To understand how complex audio information is consumed, we analyzed online forum posts within the blind community, identifying prevailing challenges, needs, and desired solutions. We synthesized the results and propose SoundShift for increasing MR sound awareness, which includes six sound manipulations: Transparency Shift, Envelope Shift, Position Shift, Style Shift, Time Shift, and Sound Append. To evaluate the effectiveness of SoundShift, we conducted a user study with 18 blind participants across three simulated MR scenarios, where participants identified specific sounds within intricate soundscapes. We found that SoundShift increased MR sound awareness and minimized cognitive load. Finally, we developed three real-world example applications to demonstrate the practicality of SoundShift.
△ Less
Submitted 26 May, 2024; v1 submitted 19 January, 2024;
originally announced January 2024.
-
Hunting imaging biomarkers in pulmonary fibrosis: Benchmarks of the AIIB23 challenge
Authors:
Yang Nan,
Xiaodan Xing,
Shiyi Wang,
Zeyu Tang,
Federico N Felder,
Sheng Zhang,
Roberta Eufrasia Ledda,
Xiaoliu Ding,
Ruiqi Yu,
Wei** Liu,
Feng Shi,
Tianyang Sun,
Zehong Cao,
Minghui Zhang,
Yun Gu,
Hanxiao Zhang,
Jian Gao,
**yu Wang,
Wen Tang,
Pengxin Yu,
Han Kang,
Junqiang Chen,
Xing Lu,
Boyu Zhang,
Michail Mamalakis
, et al. (16 additional authors not shown)
Abstract:
Airway-related quantitative imaging biomarkers are crucial for examination, diagnosis, and prognosis in pulmonary diseases. However, the manual delineation of airway trees remains prohibitively time-consuming. While significant efforts have been made towards enhancing airway modelling, current public-available datasets concentrate on lung diseases with moderate morphological variations. The intric…
▽ More
Airway-related quantitative imaging biomarkers are crucial for examination, diagnosis, and prognosis in pulmonary diseases. However, the manual delineation of airway trees remains prohibitively time-consuming. While significant efforts have been made towards enhancing airway modelling, current public-available datasets concentrate on lung diseases with moderate morphological variations. The intricate honeycombing patterns present in the lung tissues of fibrotic lung disease patients exacerbate the challenges, often leading to various prediction errors. To address this issue, the 'Airway-Informed Quantitative CT Imaging Biomarker for Fibrotic Lung Disease 2023' (AIIB23) competition was organized in conjunction with the official 2023 International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI). The airway structures were meticulously annotated by three experienced radiologists. Competitors were encouraged to develop automatic airway segmentation models with high robustness and generalization abilities, followed by exploring the most correlated QIB of mortality prediction. A training set of 120 high-resolution computerised tomography (HRCT) scans were publicly released with expert annotations and mortality status. The online validation set incorporated 52 HRCT scans from patients with fibrotic lung disease and the offline test set included 140 cases from fibrosis and COVID-19 patients. The results have shown that the capacity of extracting airway trees from patients with fibrotic lung disease could be enhanced by introducing voxel-wise weighted general union loss and continuity loss. In addition to the competitive image biomarkers for prognosis, a strong airway-derived biomarker (Hazard ratio>1.5, p<0.0001) was revealed for survival prognostication compared with existing clinical measurements, clinician assessment and AI-based biomarkers.
△ Less
Submitted 16 April, 2024; v1 submitted 21 December, 2023;
originally announced December 2023.
-
SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention
Authors:
Isabel Leal,
Krzysztof Choromanski,
Deepali Jain,
Avinava Dubey,
Jake Varley,
Michael Ryoo,
Yao Lu,
Frederick Liu,
Vikas Sindhwani,
Quan Vuong,
Tamas Sarlos,
Ken Oslund,
Karol Hausman,
Kanishka Rao
Abstract:
We present Self-Adaptive Robust Attention for Robotics Transformers (SARA-RT): a new paradigm for addressing the emerging challenge of scaling up Robotics Transformers (RT) for on-robot deployment. SARA-RT relies on the new method of fine-tuning proposed by us, called up-training. It converts pre-trained or already fine-tuned Transformer-based robotic policies of quadratic time complexity (includi…
▽ More
We present Self-Adaptive Robust Attention for Robotics Transformers (SARA-RT): a new paradigm for addressing the emerging challenge of scaling up Robotics Transformers (RT) for on-robot deployment. SARA-RT relies on the new method of fine-tuning proposed by us, called up-training. It converts pre-trained or already fine-tuned Transformer-based robotic policies of quadratic time complexity (including massive billion-parameter vision-language-action models or VLAs), into their efficient linear-attention counterparts maintaining high quality. We demonstrate the effectiveness of SARA-RT by speeding up: (a) the class of recently introduced RT-2 models, the first VLA robotic policies pre-trained on internet-scale data, as well as (b) Point Cloud Transformer (PCT) robotic policies operating on large point clouds. We complement our results with the rigorous mathematical analysis providing deeper insight into the phenomenon of SARA.
△ Less
Submitted 4 December, 2023;
originally announced December 2023.
-
Can Language Model Moderators Improve the Health of Online Discourse?
Authors:
Hyundong Cho,
Shuai Liu,
Taiwei Shi,
Darpan Jain,
Basem Rizk,
Yuyang Huang,
Zixun Lu,
Nuan Wen,
Jonathan Gratch,
Emilio Ferrara,
Jonathan May
Abstract:
Conversational moderation of online communities is crucial to maintaining civility for a constructive environment, but it is challenging to scale and harmful to moderators. The inclusion of sophisticated natural language generation modules as a force multiplier to aid human moderators is a tantalizing prospect, but adequate evaluation approaches have so far been elusive. In this paper, we establis…
▽ More
Conversational moderation of online communities is crucial to maintaining civility for a constructive environment, but it is challenging to scale and harmful to moderators. The inclusion of sophisticated natural language generation modules as a force multiplier to aid human moderators is a tantalizing prospect, but adequate evaluation approaches have so far been elusive. In this paper, we establish a systematic definition of conversational moderation effectiveness grounded on moderation literature and establish design criteria for conducting realistic yet safe evaluation. We then propose a comprehensive evaluation framework to assess models' moderation capabilities independently of human intervention. With our framework, we conduct the first known study of language models as conversational moderators, finding that appropriately prompted models that incorporate insights from social science can provide specific and fair feedback on toxic behavior but struggle to influence users to increase their levels of respect and cooperation.
△ Less
Submitted 6 May, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
Robotic Table Tennis: A Case Study into a High Speed Learning System
Authors:
David B. D'Ambrosio,
Jonathan Abelian,
Saminda Abeyruwan,
Michael Ahn,
Alex Bewley,
Justin Boyd,
Krzysztof Choromanski,
Omar Cortes,
Erwin Coumans,
Tianli Ding,
Wenbo Gao,
Laura Graesser,
Atil Iscen,
Navdeep Jaitly,
Deepali Jain,
Juhana Kangaspunta,
Satoshi Kataoka,
Gus Kouretas,
Yuheng Kuang,
Nevena Lazic,
Corey Lynch,
Reza Mahjourian,
Sherry Q. Moore,
Thinh Nguyen,
Ken Oslund
, et al. (10 additional authors not shown)
Abstract:
We present a deep-dive into a real-world robotic learning system that, in previous work, was shown to be capable of hundreds of table tennis rallies with a human and has the ability to precisely return the ball to desired targets. This system puts together a highly optimized perception subsystem, a high-speed low-latency robot controller, a simulation paradigm that can prevent damage in the real w…
▽ More
We present a deep-dive into a real-world robotic learning system that, in previous work, was shown to be capable of hundreds of table tennis rallies with a human and has the ability to precisely return the ball to desired targets. This system puts together a highly optimized perception subsystem, a high-speed low-latency robot controller, a simulation paradigm that can prevent damage in the real world and also train policies for zero-shot transfer, and automated real world environment resets that enable autonomous training and evaluation on physical robots. We complement a complete system description, including numerous design decisions that are typically not widely disseminated, with a collection of studies that clarify the importance of mitigating various sources of latency, accounting for training and deployment distribution shifts, robustness of the perception system, sensitivity to policy hyper-parameters, and choice of action space. A video demonstrating the components of the system and details of experimental results can be found at https://youtu.be/uFcnWjB42I0.
△ Less
Submitted 6 September, 2023;
originally announced September 2023.
-
Agile Catching with Whole-Body MPC and Blackbox Policy Learning
Authors:
Saminda Abeyruwan,
Alex Bewley,
Nicholas M. Boffi,
Krzysztof Choromanski,
David D'Ambrosio,
Deepali Jain,
Pannag Sanketi,
Anish Shankar,
Vikas Sindhwani,
Sumeet Singh,
Jean-Jacques Slotine,
Stephen Tu
Abstract:
We address a benchmark task in agile robotics: catching objects thrown at high-speed. This is a challenging task that involves tracking, intercepting, and cradling a thrown object with access only to visual observations of the object and the proprioceptive state of the robot, all within a fraction of a second. We present the relative merits of two fundamentally different solution strategies: (i) M…
▽ More
We address a benchmark task in agile robotics: catching objects thrown at high-speed. This is a challenging task that involves tracking, intercepting, and cradling a thrown object with access only to visual observations of the object and the proprioceptive state of the robot, all within a fraction of a second. We present the relative merits of two fundamentally different solution strategies: (i) Model Predictive Control using accelerated constrained trajectory optimization, and (ii) Reinforcement Learning using zeroth-order optimization. We provide insights into various performance trade-offs including sample efficiency, sim-to-real transfer, robustness to distribution shifts, and whole-body multimodality via extensive on-hardware experiments. We conclude with proposals on fusing "classical" and "learning-based" techniques for agile robot control. Videos of our experiments may be found at https://sites.google.com/view/agile-catching
△ Less
Submitted 19 October, 2023; v1 submitted 13 June, 2023;
originally announced June 2023.
-
Barkour: Benchmarking Animal-level Agility with Quadruped Robots
Authors:
Ken Caluwaerts,
Atil Iscen,
J. Chase Kew,
Wenhao Yu,
Tingnan Zhang,
Daniel Freeman,
Kuang-Huei Lee,
Lisa Lee,
Stefano Saliceti,
Vincent Zhuang,
Nathan Batchelor,
Steven Bohez,
Federico Casarini,
Jose Enrique Chen,
Omar Cortes,
Erwin Coumans,
Adil Dostmohamed,
Gabriel Dulac-Arnold,
Alejandro Escontrela,
Erik Frey,
Roland Hafner,
Deepali Jain,
Bauyrjan Jyenis,
Yuheng Kuang,
Edward Lee
, et al. (19 additional authors not shown)
Abstract:
Animals have evolved various agile locomotion strategies, such as sprinting, lea**, and jum**. There is a growing interest in develo** legged robots that move like their biological counterparts and show various agile skills to navigate complex environments quickly. Despite the interest, the field lacks systematic benchmarks to measure the performance of control policies and hardware in agili…
▽ More
Animals have evolved various agile locomotion strategies, such as sprinting, lea**, and jum**. There is a growing interest in develo** legged robots that move like their biological counterparts and show various agile skills to navigate complex environments quickly. Despite the interest, the field lacks systematic benchmarks to measure the performance of control policies and hardware in agility. We introduce the Barkour benchmark, an obstacle course to quantify agility for legged robots. Inspired by dog agility competitions, it consists of diverse obstacles and a time based scoring mechanism. This encourages researchers to develop controllers that not only move fast, but do so in a controllable and versatile way. To set strong baselines, we present two methods for tackling the benchmark. In the first approach, we train specialist locomotion skills using on-policy reinforcement learning methods and combine them with a high-level navigation controller. In the second approach, we distill the specialist skills into a Transformer-based generalist locomotion policy, named Locomotion-Transformer, that can handle various terrains and adjust the robot's gait based on the perceived environment and robot states. Using a custom-built quadruped robot, we demonstrate that our method can complete the course at half the speed of a dog. We hope that our work represents a step towards creating controllers that enable robots to reach animal-level agility.
△ Less
Submitted 23 May, 2023;
originally announced May 2023.
-
Mnemosyne: Learning to Train Transformers with Transformers
Authors:
Deepali Jain,
Krzysztof Marcin Choromanski,
Avinava Dubey,
Sumeet Singh,
Vikas Sindhwani,
Tingnan Zhang,
Jie Tan
Abstract:
In this work, we propose a new class of learnable optimizers, called \textit{Mnemosyne}. It is based on the novel spatio-temporal low-rank implicit attention Transformers that can learn to train entire neural network architectures, including other Transformers, without any task-specific optimizer tuning. We show that Mnemosyne: (a) outperforms popular LSTM optimizers (also with new feature enginee…
▽ More
In this work, we propose a new class of learnable optimizers, called \textit{Mnemosyne}. It is based on the novel spatio-temporal low-rank implicit attention Transformers that can learn to train entire neural network architectures, including other Transformers, without any task-specific optimizer tuning. We show that Mnemosyne: (a) outperforms popular LSTM optimizers (also with new feature engineering to mitigate catastrophic forgetting of LSTMs), (b) can successfully train Transformers while using simple meta-training strategies that require minimal computational resources, (c) matches accuracy-wise SOTA hand-designed optimizers with carefully tuned hyper-parameters (often producing top performing models). Furthermore, Mnemosyne provides space complexity comparable to that of its hand-designed first-order counterparts, which allows it to scale to training larger sets of parameters. We conduct an extensive empirical evaluation of Mnemosyne on: (a) fine-tuning a wide range of Vision Transformers (ViTs) from medium-size architectures to massive ViT-Hs (36 layers, 16 heads), (b) pre-training BERT models and (c) soft prompt-tuning large 11B+ T5XXL models. We complement our results with a comprehensive theoretical analysis of the compact associative memory used by Mnemosyne which we believe was never done before.
△ Less
Submitted 16 June, 2023; v1 submitted 2 February, 2023;
originally announced February 2023.
-
Efficient Graph Field Integrators Meet Point Clouds
Authors:
Krzysztof Choromanski,
Arijit Sehanobish,
Han Lin,
Yunfan Zhao,
Eli Berger,
Tetiana Parshakova,
Alvin Pan,
David Watkins,
Tianyi Zhang,
Valerii Likhosherstov,
Somnath Basu Roy Chowdhury,
Avinava Dubey,
Deepali Jain,
Tamas Sarlos,
Snigdha Chaturvedi,
Adrian Weller
Abstract:
We present two new classes of algorithms for efficient field integration on graphs encoding point clouds. The first class, SeparatorFactorization(SF), leverages the bounded genus of point cloud mesh graphs, while the second class, RFDiffusion(RFD), uses popular epsilon-nearest-neighbor graph representations for point clouds. Both can be viewed as providing the functionality of Fast Multipole Metho…
▽ More
We present two new classes of algorithms for efficient field integration on graphs encoding point clouds. The first class, SeparatorFactorization(SF), leverages the bounded genus of point cloud mesh graphs, while the second class, RFDiffusion(RFD), uses popular epsilon-nearest-neighbor graph representations for point clouds. Both can be viewed as providing the functionality of Fast Multipole Methods (FMMs), which have had a tremendous impact on efficient integration, but for non-Euclidean spaces. We focus on geometries induced by distributions of walk lengths between points (e.g., shortest-path distance). We provide an extensive theoretical analysis of our algorithms, obtaining new results in structural graph theory as a byproduct. We also perform exhaustive empirical evaluation, including on-surface interpolation for rigid and deformable objects (particularly for mesh-dynamics modeling), Wasserstein distance computations for point clouds, and the Gromov-Wasserstein variant.
△ Less
Submitted 4 October, 2023; v1 submitted 2 February, 2023;
originally announced February 2023.
-
CRC-RL: A Novel Visual Feature Representation Architecture for Unsupervised Reinforcement Learning
Authors:
Darshita Jain,
Anima Majumder,
Samrat Dutta,
Swagat Kumar
Abstract:
This paper addresses the problem of visual feature representation learning with an aim to improve the performance of end-to-end reinforcement learning (RL) models. Specifically, a novel architecture is proposed that uses a heterogeneous loss function, called CRC loss, to learn improved visual features which can then be used for policy learning in RL. The CRC-loss function is a combination of three…
▽ More
This paper addresses the problem of visual feature representation learning with an aim to improve the performance of end-to-end reinforcement learning (RL) models. Specifically, a novel architecture is proposed that uses a heterogeneous loss function, called CRC loss, to learn improved visual features which can then be used for policy learning in RL. The CRC-loss function is a combination of three individual loss functions, namely, contrastive, reconstruction and consistency loss. The feature representation is learned in parallel to the policy learning while sharing the weight updates through a Siamese Twin encoder model. This encoder model is augmented with a decoder network and a feature projection network to facilitate computation of the above loss components. Through empirical analysis involving latent feature visualization, an attempt is made to provide an insight into the role played by this loss function in learning new action-dependent features and how they are linked to the complexity of the problems being solved. The proposed architecture, called CRC-RL, is shown to outperform the existing state-of-the-art methods on the challenging Deep mind control suite environments by a significant margin thereby creating a new benchmark in this field.
△ Less
Submitted 28 February, 2023; v1 submitted 31 January, 2023;
originally announced January 2023.
-
GlobalFlowNet: Video Stabilization using Deep Distilled Global Motion Estimates
Authors:
Jerin Geo James,
Devansh Jain,
Ajit Rajwade
Abstract:
Videos shot by laymen using hand-held cameras contain undesirable shaky motion. Estimating the global motion between successive frames, in a manner not influenced by moving objects, is central to many video stabilization techniques, but poses significant challenges. A large body of work uses 2D affine transformations or homography for the global motion. However, in this work, we introduce a more g…
▽ More
Videos shot by laymen using hand-held cameras contain undesirable shaky motion. Estimating the global motion between successive frames, in a manner not influenced by moving objects, is central to many video stabilization techniques, but poses significant challenges. A large body of work uses 2D affine transformations or homography for the global motion. However, in this work, we introduce a more general representation scheme, which adapts any existing optical flow network to ignore the moving objects and obtain a spatially smooth approximation of the global motion between video frames. We achieve this by a knowledge distillation approach, where we first introduce a low pass filter module into the optical flow network to constrain the predicted optical flow to be spatially smooth. This becomes our student network, named as \textsc{GlobalFlowNet}. Then, using the original optical flow network as the teacher network, we train the student network using a robust loss function. Given a trained \textsc{GlobalFlowNet}, we stabilize videos using a two stage process. In the first stage, we correct the instability in affine parameters using a quadratic programming approach constrained by a user-specified crop** limit to control loss of field of view. In the second stage, we stabilize the video further by smoothing global motion parameters, expressed using a small number of discrete cosine transform coefficients. In extensive experiments on a variety of different videos, our technique outperforms state of the art techniques in terms of subjective quality and different quantitative measures of video stability. The source code is publicly available at \href{https://github.com/GlobalFlowNet/GlobalFlowNet}{https://github.com/GlobalFlowNet/GlobalFlowNet}
△ Less
Submitted 4 November, 2022; v1 submitted 25 October, 2022;
originally announced October 2022.
-
Implicit Two-Tower Policies
Authors:
Yunfan Zhao,
Qingkai Pan,
Krzysztof Choromanski,
Deepali Jain,
Vikas Sindhwani
Abstract:
We present a new class of structured reinforcement learning policy-architectures, Implicit Two-Tower (ITT) policies, where the actions are chosen based on the attention scores of their learnable latent representations with those of the input states. By explicitly disentangling action from state processing in the policy stack, we achieve two main goals: substantial computational gains and better pe…
▽ More
We present a new class of structured reinforcement learning policy-architectures, Implicit Two-Tower (ITT) policies, where the actions are chosen based on the attention scores of their learnable latent representations with those of the input states. By explicitly disentangling action from state processing in the policy stack, we achieve two main goals: substantial computational gains and better performance. Our architectures are compatible with both: discrete and continuous action spaces. By conducting tests on 15 environments from OpenAI Gym and DeepMind Control Suite, we show that ITT-architectures are particularly suited for blackbox/evolutionary optimization and the corresponding policy training algorithms outperform their vanilla unstructured implicit counterparts as well as commonly used explicit policies. We complement our analysis by showing how techniques such as hashing and lazy tower updates, critically relying on the two-tower structure of ITTs, can be applied to obtain additional computational improvements.
△ Less
Submitted 25 October, 2023; v1 submitted 1 August, 2022;
originally announced August 2022.
-
i-Sim2Real: Reinforcement Learning of Robotic Policies in Tight Human-Robot Interaction Loops
Authors:
Saminda Abeyruwan,
Laura Graesser,
David B. D'Ambrosio,
Avi Singh,
Anish Shankar,
Alex Bewley,
Deepali Jain,
Krzysztof Choromanski,
Pannag R. Sanketi
Abstract:
Sim-to-real transfer is a powerful paradigm for robotic reinforcement learning. The ability to train policies in simulation enables safe exploration and large-scale data collection quickly at low cost. However, prior works in sim-to-real transfer of robotic policies typically do not involve any human-robot interaction because accurately simulating human behavior is an open problem. In this work, o…
▽ More
Sim-to-real transfer is a powerful paradigm for robotic reinforcement learning. The ability to train policies in simulation enables safe exploration and large-scale data collection quickly at low cost. However, prior works in sim-to-real transfer of robotic policies typically do not involve any human-robot interaction because accurately simulating human behavior is an open problem. In this work, our goal is to leverage the power of simulation to train robotic policies that are proficient at interacting with humans upon deployment. But there is a chicken and egg problem -- how to gather examples of a human interacting with a physical robot so as to model human behavior in simulation without already having a robot that is able to interact with a human? Our proposed method, Iterative-Sim-to-Real (i-S2R), attempts to address this. i-S2R bootstraps from a simple model of human behavior and alternates between training in simulation and deploying in the real world. In each iteration, both the human behavior model and the policy are refined. For all training we apply a new evolutionary search algorithm called Blackbox Gradient Sensing (BGS). We evaluate our method on a real world robotic table tennis setting, where the objective for the robot is to play cooperatively with a human player for as long as possible. Table tennis is a high-speed, dynamic task that requires the two players to react quickly to each other's moves, making for a challenging test bed for research on human-robot interaction. We present results on an industrial robotic arm that is able to cooperatively play table tennis with human players, achieving rallies of 22 successive hits on average and 150 at best. Further, for 80% of players, rally lengths are 70% to 175% longer compared to the sim-to-real plus fine-tuning (S2R+FT) baseline. For videos of our system in action, please see https://sites.google.com/view/is2r.
△ Less
Submitted 21 November, 2022; v1 submitted 13 July, 2022;
originally announced July 2022.
-
Self-Labeling Refinement for Robust Representation Learning with Bootstrap Your Own Latent
Authors:
Siddhant Garg,
Dhruval Jain
Abstract:
In this work, we have worked towards two major goals. Firstly, we have investigated the importance of Batch Normalisation (BN) layers in a non-contrastive representation learning framework called Bootstrap Your Own Latent (BYOL). We conducted several experiments to conclude that BN layers are not necessary for representation learning in BYOL. Moreover, BYOL only learns from the positive pairs of i…
▽ More
In this work, we have worked towards two major goals. Firstly, we have investigated the importance of Batch Normalisation (BN) layers in a non-contrastive representation learning framework called Bootstrap Your Own Latent (BYOL). We conducted several experiments to conclude that BN layers are not necessary for representation learning in BYOL. Moreover, BYOL only learns from the positive pairs of images but ignores other semantically similar images in the same input batch. For the second goal, we have introduced two new loss functions to determine the semantically similar pairs in the same input batch of images and reduce the distance between their representations. These loss functions are Cross-Cosine Similarity Loss (CCSL) and Cross-Sigmoid Similarity Loss (CSSL). Using the proposed loss functions, we are able to surpass the performance of Vanilla BYOL (71.04%) by training the BYOL framework using CCSL loss (76.87%) on the STL10 dataset. BYOL trained using CSSL loss performs comparably with Vanilla BYOL.
△ Less
Submitted 9 April, 2022;
originally announced April 2022.
-
ProtoSound: A Personalized and Scalable Sound Recognition System for Deaf and Hard-of-Hearing Users
Authors:
Dhruv Jain,
Khoa Huynh Anh Nguyen,
Steven Goodman,
Rachel Grossman-Kahn,
Hung Ngo,
Aditya Kusupati,
Ruofei Du,
Alex Olwal,
Leah Findlater,
Jon E. Froehlich
Abstract:
Recent advances have enabled automatic sound recognition systems for deaf and hard of hearing (DHH) users on mobile devices. However, these tools use pre-trained, generic sound recognition models, which do not meet the diverse needs of DHH users. We introduce ProtoSound, an interactive system for customizing sound recognition models by recording a few examples, thereby enabling personalized and fi…
▽ More
Recent advances have enabled automatic sound recognition systems for deaf and hard of hearing (DHH) users on mobile devices. However, these tools use pre-trained, generic sound recognition models, which do not meet the diverse needs of DHH users. We introduce ProtoSound, an interactive system for customizing sound recognition models by recording a few examples, thereby enabling personalized and fine-grained categories. ProtoSound is motivated by prior work examining sound awareness needs of DHH people and by a survey we conducted with 472 DHH participants. To evaluate ProtoSound, we characterized performance on two real-world sound datasets, showing significant improvement over state-of-the-art (e.g., +9.7% accuracy on the first dataset). We then deployed ProtoSound's end-user training and real-time recognition through a mobile application and recruited 19 hearing participants who listened to the real-world sounds and rated the accuracy across 56 locations (e.g., homes, restaurants, parks). Results show that ProtoSound personalized the model on-device in real-time and accurately learned sounds across diverse acoustic contexts. We close by discussing open challenges in personalizable sound recognition, including the need for better recording interfaces and algorithmic improvements.
△ Less
Submitted 22 February, 2022;
originally announced February 2022.
-
Nonverbal Sound Detection for Disordered Speech
Authors:
Colin Lea,
Zifang Huang,
Dhruv Jain,
Lauren Tooley,
Zeinab Liaghat,
Shrinath Thelapurath,
Leah Findlater,
Jeffrey P. Bigham
Abstract:
Voice assistants have become an essential tool for people with various disabilities because they enable complex phone- or tablet-based interactions without the need for fine-grained motor control, such as with touchscreens. However, these systems are not tuned for the unique characteristics of individuals with speech disorders, including many of those who have a motor-speech disorder, are deaf or…
▽ More
Voice assistants have become an essential tool for people with various disabilities because they enable complex phone- or tablet-based interactions without the need for fine-grained motor control, such as with touchscreens. However, these systems are not tuned for the unique characteristics of individuals with speech disorders, including many of those who have a motor-speech disorder, are deaf or hard of hearing, have a severe stutter, or are minimally verbal. We introduce an alternative voice-based input system which relies on sound event detection using fifteen nonverbal mouth sounds like "pop," "click," or "eh." This system was designed to work regardless of ones' speech abilities and allows full access to existing technology. In this paper, we describe the design of a dataset, model considerations for real-world deployment, and efforts towards model personalization. Our fully-supervised model achieves segment-level precision and recall of 88.6% and 88.4% on an internal dataset of 710 adults, while achieving 0.31 false positives per hour on aggressors such as speech. Five-shot personalization enables satisfactory performance in 84.5% of cases where the generic model fails.
△ Less
Submitted 15 February, 2022;
originally announced February 2022.
-
Distilling Hypernymy Relations from Language Models: On the Effectiveness of Zero-Shot Taxonomy Induction
Authors:
Devansh Jain,
Luis Espinosa Anke
Abstract:
In this paper, we analyze zero-shot taxonomy learning methods which are based on distilling knowledge from language models via prompting and sentence scoring. We show that, despite their simplicity, these methods outperform some supervised strategies and are competitive with the current state-of-the-art under adequate conditions. We also show that statistical and linguistic properties of prompts d…
▽ More
In this paper, we analyze zero-shot taxonomy learning methods which are based on distilling knowledge from language models via prompting and sentence scoring. We show that, despite their simplicity, these methods outperform some supervised strategies and are competitive with the current state-of-the-art under adequate conditions. We also show that statistical and linguistic properties of prompts dictate downstream performance.
△ Less
Submitted 10 February, 2022;
originally announced February 2022.
-
Hybrid Random Features
Authors:
Krzysztof Choromanski,
Haoxian Chen,
Han Lin,
Yuanzhe Ma,
Arijit Sehanobish,
Deepali Jain,
Michael S Ryoo,
Jake Varley,
Andy Zeng,
Valerii Likhosherstov,
Dmitry Kalashnikov,
Vikas Sindhwani,
Adrian Weller
Abstract:
We propose a new class of random feature methods for linearizing softmax and Gaussian kernels called hybrid random features (HRFs) that automatically adapt the quality of kernel estimation to provide most accurate approximation in the defined regions of interest. Special instantiations of HRFs lead to well-known methods such as trigonometric (Rahimi and Recht, 2007) or (recently introduced in the…
▽ More
We propose a new class of random feature methods for linearizing softmax and Gaussian kernels called hybrid random features (HRFs) that automatically adapt the quality of kernel estimation to provide most accurate approximation in the defined regions of interest. Special instantiations of HRFs lead to well-known methods such as trigonometric (Rahimi and Recht, 2007) or (recently introduced in the context of linear-attention Transformers) positive random features (Choromanski et al., 2021). By generalizing Bochner's Theorem for softmax/Gaussian kernels and leveraging random features for compositional kernels, the HRF-mechanism provides strong theoretical guarantees - unbiased approximation and strictly smaller worst-case relative errors than its counterparts. We conduct exhaustive empirical evaluation of HRF ranging from pointwise kernel estimation experiments, through tests on data admitting clustering structure to benchmarking implicit-attention Transformers (also for downstream Robotics applications), demonstrating its quality in a wide spectrum of machine learning problems.
△ Less
Submitted 30 January, 2022; v1 submitted 8 October, 2021;
originally announced October 2021.
-
The CORSMAL benchmark for the prediction of the properties of containers
Authors:
Alessio Xompero,
Santiago Donaher,
Vladimir Iashin,
Francesca Palermo,
Gökhan Solak,
Claudio Coppola,
Reina Ishikawa,
Yuichi Nagao,
Ryo Hachiuma,
Qi Liu,
Fan Feng,
Chuanlin Lan,
Rosa H. M. Chan,
Guilherme Christmann,
Jyun-Ting Song,
Gonuguntla Neeharika,
Chinnakotla Krishna Teja Reddy,
Dinesh Jain,
Bakhtawar Ur Rehman,
Andrea Cavallaro
Abstract:
The contactless estimation of the weight of a container and the amount of its content manipulated by a person are key pre-requisites for safe human-to-robot handovers. However, opaqueness and transparencies of the container and the content, and variability of materials, shapes, and sizes, make this estimation difficult. In this paper, we present a range of methods and an open framework to benchmar…
▽ More
The contactless estimation of the weight of a container and the amount of its content manipulated by a person are key pre-requisites for safe human-to-robot handovers. However, opaqueness and transparencies of the container and the content, and variability of materials, shapes, and sizes, make this estimation difficult. In this paper, we present a range of methods and an open framework to benchmark acoustic and visual perception for the estimation of the capacity of a container, and the type, mass, and amount of its content. The framework includes a dataset, specific tasks and performance measures. We conduct an in-depth comparative analysis of methods that used this framework and audio-only or vision-only baselines designed from related works. Based on this analysis, we can conclude that audio-only and audio-visual classifiers are suitable for the estimation of the type and amount of the content using different types of convolutional neural networks, combined with either recurrent neural networks or a majority voting strategy, whereas computer vision methods are suitable to determine the capacity of the container using regression and geometric approaches. Classifying the content type and level using only audio achieves a weighted average F1-score up to 81% and 97%, respectively. Estimating the container capacity with vision-only approaches and estimating the filling mass with audio-visual multi-stage approaches reach up to 65% weighted average capacity and mass scores. These results show that there is still room for improvement on the design of new methods. These new methods can be ranked and compared on the individual leaderboards provided by our open framework.
△ Less
Submitted 21 April, 2022; v1 submitted 27 July, 2021;
originally announced July 2021.
-
The State of Infodemic on Twitter
Authors:
Drishti Jain,
Tavpritesh Sethi
Abstract:
Following the wave of misinterpreted, manipulated and malicious information growing on the Internet, the misinformation surrounding COVID-19 has become a paramount issue. In the context of the current COVID-19 pandemic, social media posts and platforms are at risk of rumors and misinformation in the face of the serious uncertainty surrounding the virus itself. At the same time, the uncertainty and…
▽ More
Following the wave of misinterpreted, manipulated and malicious information growing on the Internet, the misinformation surrounding COVID-19 has become a paramount issue. In the context of the current COVID-19 pandemic, social media posts and platforms are at risk of rumors and misinformation in the face of the serious uncertainty surrounding the virus itself. At the same time, the uncertainty and new nature of COVID-19 means that other unconfirmed information that may appear "rumored" may be an important indicator of the behavior and impact of this new virus. Twitter, in particular, has taken a center stage in this storm where Covid-19 has been a much talked about subject. We have presented an exploratory analysis of the tweets and the users who are involved in spreading misinformation and then delved into machine learning models and natural language processing techniques to identify if a tweet contains misinformation.
△ Less
Submitted 17 May, 2021;
originally announced May 2021.
-
A bibliometric analysis of citation diversity in accessibility and HCI research
Authors:
Lucy Lu Wang,
Kelly Mack,
Emma McDonnell,
Dhruv Jain,
Leah Findlater,
Jon E. Froehlich
Abstract:
Accessibility research sits at the junction of several disciplines, drawing influence from HCI, disability studies, psychology, education, and more. To characterize the influences and extensions of accessibility research, we undertake a study of citation trends for accessibility and related HCI communities. We assess the diversity of venues and fields of study represented among the referenced and…
▽ More
Accessibility research sits at the junction of several disciplines, drawing influence from HCI, disability studies, psychology, education, and more. To characterize the influences and extensions of accessibility research, we undertake a study of citation trends for accessibility and related HCI communities. We assess the diversity of venues and fields of study represented among the referenced and citing papers of 836 accessibility research papers from ASSETS and CHI, finding that though publications in computer science dominate these citation relationships, the relative proportion of citations from papers on psychology and medicine has grown over time. Though ASSETS is a more niche venue than CHI in terms of citational diversity, both conferences display standard levels of diversity among their incoming and outgoing citations when analyzed in the context of 53K papers from 13 accessibility and HCI conference venues.
△ Less
Submitted 11 March, 2021;
originally announced March 2021.
-
Unlocking Pixels for Reinforcement Learning via Implicit Attention
Authors:
Krzysztof Marcin Choromanski,
Deepali Jain,
Wenhao Yu,
Xingyou Song,
Jack Parker-Holder,
Tingnan Zhang,
Valerii Likhosherstov,
Aldo Pacchiano,
Anirban Santara,
Yunhao Tang,
Jie Tan,
Adrian Weller
Abstract:
There has recently been significant interest in training reinforcement learning (RL) agents in vision-based environments. This poses many challenges, such as high dimensionality and the potential for observational overfitting through spurious correlations. A promising approach to solve both of these problems is an attention bottleneck, which provides a simple and effective framework for learning h…
▽ More
There has recently been significant interest in training reinforcement learning (RL) agents in vision-based environments. This poses many challenges, such as high dimensionality and the potential for observational overfitting through spurious correlations. A promising approach to solve both of these problems is an attention bottleneck, which provides a simple and effective framework for learning high performing policies, even in the presence of distractions. However, due to poor scalability of attention architectures, these methods cannot be applied beyond low resolution visual inputs, using large patches (thus small attention matrices). In this paper we make use of new efficient attention algorithms, recently shown to be highly effective for Transformers, and demonstrate that these techniques can be successfully adopted for the RL setting. This allows our attention-based controllers to scale to larger visual inputs, and facilitate the use of smaller patches, even individual pixels, improving generalization. We show this on a range of tasks from the Distracting Control Suite to vision-based quadruped robots locomotion. We provide rigorous theoretical analysis of the proposed algorithm.
△ Less
Submitted 1 October, 2021; v1 submitted 8 February, 2021;
originally announced February 2021.
-
ES-ENAS: Efficient Evolutionary Optimization for Large Hybrid Search Spaces
Authors:
Xingyou Song,
Krzysztof Choromanski,
Jack Parker-Holder,
Yunhao Tang,
Qiuyi Zhang,
Daiyi Peng,
Deepali Jain,
Wenbo Gao,
Aldo Pacchiano,
Tamas Sarlos,
Yuxiang Yang
Abstract:
In this paper, we approach the problem of optimizing blackbox functions over large hybrid search spaces consisting of both combinatorial and continuous parameters. We demonstrate that previous evolutionary algorithms which rely on mutation-based approaches, while flexible over combinatorial spaces, suffer from a curse of dimensionality in high dimensional continuous spaces both theoretically and e…
▽ More
In this paper, we approach the problem of optimizing blackbox functions over large hybrid search spaces consisting of both combinatorial and continuous parameters. We demonstrate that previous evolutionary algorithms which rely on mutation-based approaches, while flexible over combinatorial spaces, suffer from a curse of dimensionality in high dimensional continuous spaces both theoretically and empirically, which thus limits their scope over hybrid search spaces as well. In order to combat this curse, we propose ES-ENAS, a simple and modular joint optimization procedure combining the class of sample-efficient smoothed gradient techniques, commonly known as Evolutionary Strategies (ES), with combinatorial optimizers in a highly scalable and intuitive way, inspired by the one-shot or supernet paradigm introduced in Efficient Neural Architecture Search (ENAS). By doing so, we achieve significantly more sample efficiency, which we empirically demonstrate over synthetic benchmarks, and are further able to apply ES-ENAS for architecture search over popular RL benchmarks.
△ Less
Submitted 15 March, 2023; v1 submitted 18 January, 2021;
originally announced January 2021.
-
What Do We Mean by "Accessibility Research"? A Literature Survey of Accessibility Papers in CHI and ASSETS from 1994 to 2019
Authors:
Kelly Mack,
Emma McDonnell,
Dhruv Jain,
Lucy Lu Wang,
Jon E. Froehlich,
Leah Findlater
Abstract:
Accessibility research has grown substantially in the past few decades, yet there has been no literature review of the field. To understand current and historical trends, we created and analyzed a dataset of accessibility papers appearing at CHI and ASSETS since ASSETS' founding in 1994. We qualitatively coded areas of focus and methodological decisions for the past 10 years (2010-2019, N=506 pape…
▽ More
Accessibility research has grown substantially in the past few decades, yet there has been no literature review of the field. To understand current and historical trends, we created and analyzed a dataset of accessibility papers appearing at CHI and ASSETS since ASSETS' founding in 1994. We qualitatively coded areas of focus and methodological decisions for the past 10 years (2010-2019, N=506 papers), and analyzed paper counts and keywords over the full 26 years (N=836 papers). Our findings highlight areas that have received disproportionate attention and those that are underserved--for example, over 43% of papers in the past 10 years are on accessibility for blind and low vision people. We also capture common study characteristics, such as the roles of disabled and nondisabled participants as well as sample sizes (e.g., a median of 13 for participant groups with disabilities and older adults). We close by critically reflecting on gaps in the literature and offering guidance for future work in the field.
△ Less
Submitted 3 February, 2021; v1 submitted 11 January, 2021;
originally announced January 2021.
-
Disentangled Planning and Control in Vision Based Robotics via Reward Machines
Authors:
Alberto Camacho,
Jacob Varley,
Deepali Jain,
Atil Iscen,
Dmitry Kalashnikov
Abstract:
In this work we augment a Deep Q-Learning agent with a Reward Machine (DQRM) to increase speed of learning vision-based policies for robot tasks, and overcome some of the limitations of DQN that prevent it from converging to good-quality policies. A reward machine (RM) is a finite state machine that decomposes a task into a discrete planning graph and equips the agent with a reward function to gui…
▽ More
In this work we augment a Deep Q-Learning agent with a Reward Machine (DQRM) to increase speed of learning vision-based policies for robot tasks, and overcome some of the limitations of DQN that prevent it from converging to good-quality policies. A reward machine (RM) is a finite state machine that decomposes a task into a discrete planning graph and equips the agent with a reward function to guide it toward task completion. The reward machine can be used for both reward sha**, and informing the policy what abstract state it is currently at. An abstract state is a high level simplification of the current state, defined in terms of task relevant features. These two supervisory signals of reward sha** and knowledge of current abstract state coming from the reward machine complement each other and can both be used to improve policy performance as demonstrated on several vision based robotic pick and place tasks. Particularly for vision based robotics applications, it is often easier to build a reward machine than to try and get a policy to learn the task without this structure.
△ Less
Submitted 28 December, 2020;
originally announced December 2020.
-
Codeswitched Sentence Creation using Dependency Parsing
Authors:
Dhruval Jain,
Arun D Prabhu,
Shubham Vatsal,
Gopi Ramena,
Naresh Purre
Abstract:
Codeswitching has become one of the most common occurrences across multilingual speakers of the world, especially in countries like India which encompasses around 23 official languages with the number of bilingual speakers being around 300 million. The scarcity of Codeswitched data becomes a bottleneck in the exploration of this domain with respect to various Natural Language Processing (NLP) task…
▽ More
Codeswitching has become one of the most common occurrences across multilingual speakers of the world, especially in countries like India which encompasses around 23 official languages with the number of bilingual speakers being around 300 million. The scarcity of Codeswitched data becomes a bottleneck in the exploration of this domain with respect to various Natural Language Processing (NLP) tasks. We thus present a novel algorithm which harnesses the syntactic structure of English grammar to develop grammatically sensible Codeswitched versions of English-Hindi, English-Marathi and English-Kannada data. Apart from maintaining the grammatical sanity to a great extent, our methodology also guarantees abundant generation of data from a minuscule snapshot of given data. We use multiple datasets to showcase the capabilities of our algorithm while at the same time we assess the quality of generated Codeswitched data using some qualitative metrics along with providing baseline results for couple of NLP tasks.
△ Less
Submitted 5 December, 2020;
originally announced December 2020.
-
From Pixels to Legs: Hierarchical Learning of Quadruped Locomotion
Authors:
Deepali Jain,
Atil Iscen,
Ken Caluwaerts
Abstract:
Legged robots navigating crowded scenes and complex terrains in the real world are required to execute dynamic leg movements while processing visual input for obstacle avoidance and path planning. We show that a quadruped robot can acquire both of these skills by means of hierarchical reinforcement learning (HRL). By virtue of their hierarchical structure, our policies learn to implicitly break do…
▽ More
Legged robots navigating crowded scenes and complex terrains in the real world are required to execute dynamic leg movements while processing visual input for obstacle avoidance and path planning. We show that a quadruped robot can acquire both of these skills by means of hierarchical reinforcement learning (HRL). By virtue of their hierarchical structure, our policies learn to implicitly break down this joint problem by concurrently learning High Level (HL) and Low Level (LL) neural network policies. These two levels are connected by a low dimensional hidden layer, which we call latent command. HL receives a first-person camera view, whereas LL receives the latent command from HL and the robot's on-board sensors to control its actuators. We train policies to walk in two different environments: a curved cliff and a maze. We show that hierarchical policies can concurrently learn to locomote and navigate in these environments, and show they are more efficient than non-hierarchical neural network policies. This architecture also allows for knowledge reuse across tasks. LL networks trained on one task can be transferred to a new task in a new environment. Finally HL, which processes camera images, can be evaluated at much lower and varying frequencies compared to LL, thus reducing computation times and bandwidth requirements.
△ Less
Submitted 23 November, 2020;
originally announced November 2020.
-
On-Device Text Image Super Resolution
Authors:
Dhruval Jain,
Arun D Prabhu,
Gopi Ramena,
Manoj Goyal,
Debi Prasanna Mohanty,
Sukumar Moharana,
Naresh Purre
Abstract:
Recent research on super-resolution (SR) has witnessed major developments with the advancements of deep convolutional neural networks. There is a need for information extraction from scenic text images or even document images on device, most of which are low-resolution (LR) images. Therefore, SR becomes an essential pre-processing step as Bicubic Upsampling, which is conventionally present in smar…
▽ More
Recent research on super-resolution (SR) has witnessed major developments with the advancements of deep convolutional neural networks. There is a need for information extraction from scenic text images or even document images on device, most of which are low-resolution (LR) images. Therefore, SR becomes an essential pre-processing step as Bicubic Upsampling, which is conventionally present in smartphones, performs poorly on LR images. To give the user more control over his privacy, and to reduce the carbon footprint by reducing the overhead of cloud computing and hours of GPU usage, executing SR models on the edge is a necessity in the recent times. There are various challenges in running and optimizing a model on resource-constrained platforms like smartphones. In this paper, we present a novel deep neural network that reconstructs sharper character edges and thus boosts OCR confidence. The proposed architecture not only achieves significant improvement in PSNR over bicubic upsampling on various benchmark datasets but also runs with an average inference time of 11.7 ms per image. We have outperformed state-of-the-art on the Text330 dataset. We also achieve an OCR accuracy of 75.89% on the ICDAR 2015 TextSR dataset, where ground truth has an accuracy of 78.10%.
△ Less
Submitted 20 November, 2020;
originally announced November 2020.
-
Learning Agile Locomotion Skills with a Mentor
Authors:
Atil Iscen,
George Yu,
Alejandro Escontrela,
Deepali Jain,
Jie Tan,
Ken Caluwaerts
Abstract:
Develo** agile behaviors for legged robots remains a challenging problem. While deep reinforcement learning is a promising approach, learning truly agile behaviors typically requires tedious reward sha** and careful curriculum design. We formulate agile locomotion as a multi-stage learning problem in which a mentor guides the agent throughout the training. The mentor is optimized to place a ch…
▽ More
Develo** agile behaviors for legged robots remains a challenging problem. While deep reinforcement learning is a promising approach, learning truly agile behaviors typically requires tedious reward sha** and careful curriculum design. We formulate agile locomotion as a multi-stage learning problem in which a mentor guides the agent throughout the training. The mentor is optimized to place a checkpoint to guide the movement of the robot's center of mass while the student (i.e. the robot) learns to reach these checkpoints. Once the student can solve the task, we teach the student to perform the task without the mentor. We evaluate our proposed learning system with a simulated quadruped robot on a course consisting of randomly generated gaps and hurdles. Our method significantly outperforms a single-stage RL baseline without a mentor, and the quadruped robot can agilely run and jump across gaps and obstacles. Finally, we present a detailed analysis of the learned behaviors' feasibility and efficiency.
△ Less
Submitted 10 November, 2020;
originally announced November 2020.
-
On-Device Language Identification of Text in Images using Diacritic Characters
Authors:
Shubham Vatsal,
Nikhil Arora,
Gopi Ramena,
Sukumar Moharana,
Dhruval Jain,
Naresh Purre,
Rachit S Munjal
Abstract:
Diacritic characters can be considered as a unique set of characters providing us with adequate and significant clue in identifying a given language with considerably high accuracy. Diacritics, though associated with phonetics often serve as a distinguishing feature for many languages especially the ones with a Latin script. In this proposed work, we aim to identify language of text in images usin…
▽ More
Diacritic characters can be considered as a unique set of characters providing us with adequate and significant clue in identifying a given language with considerably high accuracy. Diacritics, though associated with phonetics often serve as a distinguishing feature for many languages especially the ones with a Latin script. In this proposed work, we aim to identify language of text in images using the presence of diacritic characters in order to improve Optical Character Recognition (OCR) performance in any given automated environment. We showcase our work across 13 Latin languages encompassing 85 diacritic characters. We use an architecture similar to Squeezedet for object detection of diacritic characters followed by a shallow network to finally identify the language. OCR systems when accompanied with identified language parameter tends to produce better results than sole deployment of OCR systems. The discussed work apart from guaranteeing an improvement in OCR results also takes on-device (mobile phone) constraints into consideration in terms of model size and inference time.
△ Less
Submitted 10 November, 2020;
originally announced November 2020.
-
Surveys without Questions: A Reinforcement Learning Approach
Authors:
Atanu R Sinha,
Deepali Jain,
Nikhil Sheoran,
Sopan Khosla,
Reshmi Sasidharan
Abstract:
The 'old world' instrument, survey, remains a tool of choice for firms to obtain ratings of satisfaction and experience that customers realize while interacting online with firms. While avenues for survey have evolved from emails and links to pop-ups while browsing, the deficiencies persist. These include - reliance on ratings of very few respondents to infer about all customers' online interactio…
▽ More
The 'old world' instrument, survey, remains a tool of choice for firms to obtain ratings of satisfaction and experience that customers realize while interacting online with firms. While avenues for survey have evolved from emails and links to pop-ups while browsing, the deficiencies persist. These include - reliance on ratings of very few respondents to infer about all customers' online interactions; failing to capture a customer's interactions over time since the rating is a one-time snapshot; and inability to tie back customers' ratings to specific interactions because ratings provided relate to all interactions. To overcome these deficiencies we extract proxy ratings from clickstream data, typically collected for every customer's online interactions, by develo** an approach based on Reinforcement Learning (RL). We introduce a new way to interpret values generated by the value function of RL, as proxy ratings. Our approach does not need any survey data for training. Yet, on validation against actual survey data, proxy ratings yield reasonable performance results. Additionally, we offer a new way to draw insights from values of the value function, which allow associating specific interactions to their proxy ratings. We introduce two new metrics to represent ratings - one, customer-level and the other, aggregate-level for click actions across customers. Both are defined around proportion of all pairwise, successive actions that show increase in proxy ratings. This intuitive customer-level metric enables gauging the dynamics of ratings over time and is a better predictor of purchase than customer ratings from survey. The aggregate-level metric allows pinpointing actions that help or hurt experience. In sum, proxy ratings computed unobtrusively from clickstream, for every action, for each customer, and for every session can offer interpretable and more insightful alternative to surveys.
△ Less
Submitted 11 June, 2020;
originally announced June 2020.
-
On-device Filtering of Social Media Images for Efficient Storage
Authors:
Dhruval Jain,
DP Mohanty,
Sanjeev Roy,
Naresh Purre,
Sukumar Moharana
Abstract:
Artificially crafted images such as memes, seasonal greetings, etc are flooding the social media platforms today. These eventually start occupying a lot of internal memory of smartphones and it gets cumbersome for the user to go through hundreds of images and delete these synthetic images. To address this, we propose a novel method based on Convolutional Neural Networks (CNNs) for the on-device fi…
▽ More
Artificially crafted images such as memes, seasonal greetings, etc are flooding the social media platforms today. These eventually start occupying a lot of internal memory of smartphones and it gets cumbersome for the user to go through hundreds of images and delete these synthetic images. To address this, we propose a novel method based on Convolutional Neural Networks (CNNs) for the on-device filtering of social media images by classifying these synthetic images and allowing the user to delete them in one go. The custom model uses depthwise separable convolution layers to achieve low inference time on smartphones. We have done an extensive evaluation of our model on various camera image datasets to cover most aspects of images captured by a camera. Various sorts of synthetic social media images have also been tested. The proposed solution achieves an accuracy of 98.25% on the Places-365 dataset and 95.81% on the Synthetic image dataset that we have prepared containing 30K instances.
△ Less
Submitted 14 May, 2020; v1 submitted 6 April, 2020;
originally announced April 2020.
-
Reinforcement Learning with Chromatic Networks for Compact Architecture Search
Authors:
Xingyou Song,
Krzysztof Choromanski,
Jack Parker-Holder,
Yunhao Tang,
Wenbo Gao,
Aldo Pacchiano,
Tamas Sarlos,
Deepali Jain,
Yuxiang Yang
Abstract:
We present a neural architecture search algorithm to construct compact reinforcement learning (RL) policies, by combining ENAS and ES in a highly scalable and intuitive way. By defining the combinatorial search space of NAS to be the set of different edge-partitionings (colorings) into same-weight classes, we represent compact architectures via efficient learned edge-partitionings. For several RL…
▽ More
We present a neural architecture search algorithm to construct compact reinforcement learning (RL) policies, by combining ENAS and ES in a highly scalable and intuitive way. By defining the combinatorial search space of NAS to be the set of different edge-partitionings (colorings) into same-weight classes, we represent compact architectures via efficient learned edge-partitionings. For several RL tasks, we manage to learn colorings translating to effective policies parameterized by as few as $17$ weight parameters, providing >90% compression over vanilla policies and 6x compression over state-of-the-art compact policies based on Toeplitz matrices, while still maintaining good reward. We believe that our work is one of the first attempts to propose a rigorous approach to training structured neural network architectures for RL problems that are of interest especially in mobile robotics with limited storage and computational resources.
△ Less
Submitted 6 April, 2021; v1 submitted 10 July, 2019;
originally announced July 2019.
-
Hierarchical Reinforcement Learning for Quadruped Locomotion
Authors:
Deepali Jain,
Atil Iscen,
Ken Caluwaerts
Abstract:
Legged locomotion is a challenging task for learning algorithms, especially when the task requires a diverse set of primitive behaviors. To solve these problems, we introduce a hierarchical framework to automatically decompose complex locomotion tasks. A high-level policy issues commands in a latent space and also selects for how long the low-level policy will execute the latent command. Concurren…
▽ More
Legged locomotion is a challenging task for learning algorithms, especially when the task requires a diverse set of primitive behaviors. To solve these problems, we introduce a hierarchical framework to automatically decompose complex locomotion tasks. A high-level policy issues commands in a latent space and also selects for how long the low-level policy will execute the latent command. Concurrently, the low-level policy uses the latent command and only the robot's on-board sensors to control the robot's actuators. Our approach allows the high-level policy to run at a lower frequency than the low-level one. We test our framework on a path-following task for a dynamic quadruped robot and we show that steering behaviors automatically emerge in the latent command space as low-level skills are needed for this task. We then show efficient adaptation of the trained policy to a different task by transfer of the trained low-level policy. Finally, we validate the policies on a real quadruped robot. To the best of our knowledge, this is the first application of end-to-end hierarchical learning to a real robotic locomotion task.
△ Less
Submitted 21 May, 2019;
originally announced May 2019.
-
Predicting How to Distribute Work Between Algorithms and Humans to Segment an Image Batch
Authors:
Danna Gurari,
Yinan Zhao,
Suyog Dutt Jain,
Margrit Betke,
Kristen Grauman
Abstract:
Foreground object segmentation is a critical step for many image analysis tasks. While automated methods can produce high-quality results, their failures disappoint users in need of practical solutions. We propose a resource allocation framework for predicting how best to allocate a fixed budget of human annotation effort in order to collect higher quality segmentations for a given batch of images…
▽ More
Foreground object segmentation is a critical step for many image analysis tasks. While automated methods can produce high-quality results, their failures disappoint users in need of practical solutions. We propose a resource allocation framework for predicting how best to allocate a fixed budget of human annotation effort in order to collect higher quality segmentations for a given batch of images and automated methods. The framework is based on a prediction module that estimates the quality of given algorithm-drawn segmentations. We demonstrate the value of the framework for two novel tasks related to predicting how to distribute annotation efforts between algorithms and humans. Specifically, we develop two systems that automatically decide, for a batch of images, when to recruit humans versus computers to create 1) coarse segmentations required to initialize segmentation tools and 2) final, fine-grained segmentations. Experiments demonstrate the advantage of relying on a mix of human and computer efforts over relying on either resource alone for segmenting objects in images coming from three diverse modalities (visible, phase contrast microscopy, and fluorescence microscopy).
△ Less
Submitted 30 April, 2019;
originally announced May 2019.
-
Provably Robust Blackbox Optimization for Reinforcement Learning
Authors:
Krzysztof Choromanski,
Aldo Pacchiano,
Jack Parker-Holder,
Yunhao Tang,
Deepali Jain,
Yuxiang Yang,
Atil Iscen,
Jasmine Hsu,
Vikas Sindhwani
Abstract:
Interest in derivative-free optimization (DFO) and "evolutionary strategies" (ES) has recently surged in the Reinforcement Learning (RL) community, with growing evidence that they can match state of the art methods for policy optimization problems in Robotics. However, it is well known that DFO methods suffer from prohibitively high sampling complexity. They can also be very sensitive to noisy rew…
▽ More
Interest in derivative-free optimization (DFO) and "evolutionary strategies" (ES) has recently surged in the Reinforcement Learning (RL) community, with growing evidence that they can match state of the art methods for policy optimization problems in Robotics. However, it is well known that DFO methods suffer from prohibitively high sampling complexity. They can also be very sensitive to noisy rewards and stochastic dynamics. In this paper, we propose a new class of algorithms, called Robust Blackbox Optimization (RBO). Remarkably, even if up to $23\%$ of all the measurements are arbitrarily corrupted, RBO can provably recover gradients to high accuracy. RBO relies on learning gradient flows using robust regression methods to enable off-policy updates. On several MuJoCo robot control tasks, when all other RL approaches collapse in the presence of adversarial noise, RBO is able to train policies effectively. We also show that RBO can be applied to legged locomotion tasks including path tracking for quadruped robots.
△ Less
Submitted 8 July, 2019; v1 submitted 7 March, 2019;
originally announced March 2019.
-
Pixel Objectness: Learning to Segment Generic Objects Automatically in Images and Videos
Authors:
Bo Xiong,
Suyog Dutt Jain,
Kristen Grauman
Abstract:
We propose an end-to-end learning framework for segmenting generic objects in both images and videos. Given a novel image or video, our approach produces a pixel-level mask for all "object-like" regions---even for object categories never seen during training. We formulate the task as a structured prediction problem of assigning an object/background label to each pixel, implemented using a deep ful…
▽ More
We propose an end-to-end learning framework for segmenting generic objects in both images and videos. Given a novel image or video, our approach produces a pixel-level mask for all "object-like" regions---even for object categories never seen during training. We formulate the task as a structured prediction problem of assigning an object/background label to each pixel, implemented using a deep fully convolutional network. When applied to a video, our model further incorporates a motion stream, and the network learns to combine both appearance and motion and attempts to extract all prominent objects whether they are moving or not. Beyond the core model, a second contribution of our approach is how it leverages varying strengths of training annotations. Pixel-level annotations are quite difficult to obtain, yet crucial for training a deep network approach for segmentation. Thus we propose ways to exploit weakly labeled data for learning dense foreground segmentation. For images, we show the value in mixing object category examples with image-level labels together with relatively few images with boundary-level annotations. For video, we show how to bootstrap weakly annotated videos together with the network trained for image segmentation. Through experiments on multiple challenging image and video segmentation benchmarks, our method offers consistently strong results and improves the state-of-the-art for fully automatic segmentation of generic (unseen) objects. In addition, we demonstrate how our approach benefits image retrieval and image retargeting, both of which flourish when given our high-quality foreground maps. Code, models, and videos are at:http://vision.cs.utexas.edu/projects/pixelobjectness/
△ Less
Submitted 17 December, 2018; v1 submitted 11 August, 2018;
originally announced August 2018.
-
An Interval Type-2 Fuzzy Approach to Automatic PDF Generation for Histogram Specification
Authors:
Vishal Agarwal,
Diwanshu Jain,
A. Vamshi Krishna Reddy,
Frank Chung-Hoon Rhee
Abstract:
Image enhancement plays an important role in several application in the field of computer vision and image processing. Histogram specification (HS) is one of the most widely used techniques for contrast enhancement of an image, which requires an appropriate probability density function for the transformation. In this paper, we propose a fuzzy method to find a suitable PDF automatically for histogr…
▽ More
Image enhancement plays an important role in several application in the field of computer vision and image processing. Histogram specification (HS) is one of the most widely used techniques for contrast enhancement of an image, which requires an appropriate probability density function for the transformation. In this paper, we propose a fuzzy method to find a suitable PDF automatically for histogram specification using interval type - 2 (IT2) fuzzy approach, based on the fuzzy membership values obtained from the histogram of input image. The proposed algorithm works in 5 stages which includes - symmetric Gaussian fitting on the histogram, extraction of IT2 fuzzy membership functions (MFs) and therefore, footprint of uncertainty (FOU), obtaining membership value (MV), generating PDF and application of HS. We have proposed 4 different methods to find membership values - point-wise method, center of weight method, area method, and karnik-mendel (KM) method. The framework is sensitive to local variations in the histogram and chooses the best PDF so as to improve contrast enhancement. Experimental validity of the methods used is illustrated by qualitative and quantitative analysis on several images using the image quality index - Average Information Content (AIC) or Entropy, and by comparison with the commonly used algorithms such as Histogram Equalization (HE), Recursive Mean-Separate Histogram Equalization (RMSHE) and Brightness Preserving Fuzzy Histogram Equalization (BPFHE). It has been found out that on an average, our algorithm improves the AIC index by 11.5% as compared to the index obtained by histogram equalisation.
△ Less
Submitted 6 May, 2018;
originally announced May 2018.
-
A Multi-Modal Approach to Infer Image Affect
Authors:
Ashok Sundaresan,
Sugumar Murugesan,
Sean Davis,
Karthik Kappaganthu,
ZhongYi **,
Divya Jain,
Anurag Maunder
Abstract:
The group affect or emotion in an image of people can be inferred by extracting features about both the people in the picture and the overall makeup of the scene. The state-of-the-art on this problem investigates a combination of facial features, scene extraction and even audio tonality. This paper combines three additional modalities, namely, human pose, text-based tagging and CNN extracted featu…
▽ More
The group affect or emotion in an image of people can be inferred by extracting features about both the people in the picture and the overall makeup of the scene. The state-of-the-art on this problem investigates a combination of facial features, scene extraction and even audio tonality. This paper combines three additional modalities, namely, human pose, text-based tagging and CNN extracted features / predictions. To the best of our knowledge, this is the first time all of the modalities were extracted using deep neural networks. We evaluate the performance of our approach against baselines and identify insights throughout this paper.
△ Less
Submitted 13 March, 2018;
originally announced March 2018.
-
Predicting Foreground Object Ambiguity and Efficiently Crowdsourcing the Segmentation(s)
Authors:
Danna Gurari,
Kun He,
Bo Xiong,
Jianming Zhang,
Mehrnoosh Sameki,
Suyog Dutt Jain,
Stan Sclaroff,
Margrit Betke,
Kristen Grauman
Abstract:
We propose the ambiguity problem for the foreground object segmentation task and motivate the importance of estimating and accounting for this ambiguity when designing vision systems. Specifically, we distinguish between images which lead multiple annotators to segment different foreground objects (ambiguous) versus minor inter-annotator differences of the same object. Taking images from eight wid…
▽ More
We propose the ambiguity problem for the foreground object segmentation task and motivate the importance of estimating and accounting for this ambiguity when designing vision systems. Specifically, we distinguish between images which lead multiple annotators to segment different foreground objects (ambiguous) versus minor inter-annotator differences of the same object. Taking images from eight widely used datasets, we crowdsource labeling the images as "ambiguous" or "not ambiguous" to segment in order to construct a new dataset we call STATIC. Using STATIC, we develop a system that automatically predicts which images are ambiguous. Experiments demonstrate the advantage of our prediction system over existing saliency-based methods on images from vision benchmarks and images taken by blind people who are trying to recognize objects in their environment. Finally, we introduce a crowdsourcing system to achieve cost savings for collecting the diversity of all valid "ground truth" foreground object segmentations by collecting extra segmentations only when ambiguity is expected. Experiments show our system eliminates up to 47% of human effort compared to existing crowdsourcing methods with no loss in capturing the diversity of ground truths.
△ Less
Submitted 30 April, 2017;
originally announced May 2017.
-
FusionSeg: Learning to combine motion and appearance for fully automatic segmention of generic objects in videos
Authors:
Suyog Dutt Jain,
Bo Xiong,
Kristen Grauman
Abstract:
We propose an end-to-end learning framework for segmenting generic objects in videos. Our method learns to combine appearance and motion information to produce pixel level segmentation masks for all prominent objects in videos. We formulate this task as a structured prediction problem and design a two-stream fully convolutional neural network which fuses together motion and appearance in a unified…
▽ More
We propose an end-to-end learning framework for segmenting generic objects in videos. Our method learns to combine appearance and motion information to produce pixel level segmentation masks for all prominent objects in videos. We formulate this task as a structured prediction problem and design a two-stream fully convolutional neural network which fuses together motion and appearance in a unified framework. Since large-scale video datasets with pixel level segmentations are problematic, we show how to bootstrap weakly annotated videos together with existing image recognition datasets for training. Through experiments on three challenging video segmentation benchmarks, our method substantially improves the state-of-the-art for segmenting generic (unseen) objects. Code and pre-trained models are available on the project website.
△ Less
Submitted 12 April, 2017; v1 submitted 19 January, 2017;
originally announced January 2017.
-
Pixel Objectness
Authors:
Suyog Dutt Jain,
Bo Xiong,
Kristen Grauman
Abstract:
We propose an end-to-end learning framework for generating foreground object segmentations. Given a single novel image, our approach produces pixel-level masks for all "object-like" regions---even for object categories never seen during training. We formulate the task as a structured prediction problem of assigning foreground/background labels to all pixels, implemented using a deep fully convolut…
▽ More
We propose an end-to-end learning framework for generating foreground object segmentations. Given a single novel image, our approach produces pixel-level masks for all "object-like" regions---even for object categories never seen during training. We formulate the task as a structured prediction problem of assigning foreground/background labels to all pixels, implemented using a deep fully convolutional network. Key to our idea is training with a mix of image-level object category examples together with relatively few images with boundary-level annotations. Our method substantially improves the state-of-the-art on foreground segmentation for ImageNet and MIT Object Discovery datasets. Furthermore, on over 1 million images, we show that it generalizes well to segment object categories unseen in the foreground maps used for training. Finally, we demonstrate how our approach benefits image retrieval and image retargeting, both of which flourish when given our high-quality foreground maps.
△ Less
Submitted 12 April, 2017; v1 submitted 19 January, 2017;
originally announced January 2017.
-
Click Carving: Segmenting Objects in Video with Point Clicks
Authors:
Suyog Dutt Jain,
Kristen Grauman
Abstract:
We present a novel form of interactive video object segmentation where a few clicks by the user helps the system produce a full spatio-temporal segmentation of the object of interest. Whereas conventional interactive pipelines take the user's initialization as a starting point, we show the value in the system taking the lead even in initialization. In particular, for a given video frame, the syste…
▽ More
We present a novel form of interactive video object segmentation where a few clicks by the user helps the system produce a full spatio-temporal segmentation of the object of interest. Whereas conventional interactive pipelines take the user's initialization as a starting point, we show the value in the system taking the lead even in initialization. In particular, for a given video frame, the system precomputes a ranked list of thousands of possible segmentation hypotheses (also referred to as object region proposals) using image and motion cues. Then, the user looks at the top ranked proposals, and clicks on the object boundary to carve away erroneous ones. This process iterates (typically 2-3 times), and each time the system revises the top ranked proposal set, until the user is satisfied with a resulting segmentation mask. Finally, the mask is propagated across the video to produce a spatio-temporal object tube. On three challenging datasets, we provide extensive comparisons with both existing work and simpler alternative methods. In all, the proposed Click Carving approach strikes an excellent balance of accuracy and human effort. It outperforms all similarly fast methods, and is competitive or better than those requiring 2 to 12 times the effort.
△ Less
Submitted 5 July, 2016;
originally announced July 2016.