Search | arXiv e-print repository

Real-World Robot Learning with Masked Visual Pre-training

Authors: Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, Trevor Darrell

Abstract: In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic ta… ▽ More In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%). Finally, we train a 307M parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and demonstrate clearly the benefits of scaling visual pre-training for robot learning. △ Less

Submitted 6 October, 2022; originally announced October 2022.

Comments: CoRL 2022; Project page: https://tetexiao.com/projects/real-mvp

arXiv:2209.12892 [pdf, other]

Learning to Learn with Generative Models of Neural Network Checkpoints

Authors: William Peebles, Ilija Radosavovic, Tim Brooks, Alexei A. Efros, Jitendra Malik

Abstract: We explore a data-driven approach for learning to optimize neural networks. We construct a dataset of neural network checkpoints and train a generative model on the parameters. In particular, our model is a conditional diffusion transformer that, given an initial input parameter vector and a prompted loss, error, or return, predicts the distribution over parameter updates that achieve the desired… ▽ More We explore a data-driven approach for learning to optimize neural networks. We construct a dataset of neural network checkpoints and train a generative model on the parameters. In particular, our model is a conditional diffusion transformer that, given an initial input parameter vector and a prompted loss, error, or return, predicts the distribution over parameter updates that achieve the desired metric. At test time, it can optimize neural networks with unseen parameters for downstream tasks in just one update. We find that our approach successfully generates parameters for a wide range of loss prompts. Moreover, it can sample multimodal parameter solutions and has favorable scaling properties. We apply our method to different neural network architectures and tasks in supervised and reinforcement learning. △ Less

Submitted 26 September, 2022; originally announced September 2022.

Comments: Code available at https://www.github.com/wpeebles/G.pt . Project page and videos available at https://www.wpeebles.com/Gpt

arXiv:2209.09232 [pdf, other]

Learning a Single Near-hover Position Controller for Vastly Different Quadcopters

Authors: Dingqi Zhang, Antonio Loquercio, Xiangyu Wu, Ashish Kumar, Jitendra Malik, Mark W. Mueller

Abstract: This paper proposes an adaptive near-hover position controller for quadcopters, which can be deployed to quadcopters of very different mass, size and motor constants, and also shows rapid adaptation to unknown disturbances during runtime. The core algorithmic idea is to learn a single policy that can adapt online at test time not only to the disturbances applied to the drone, but also to the robot… ▽ More This paper proposes an adaptive near-hover position controller for quadcopters, which can be deployed to quadcopters of very different mass, size and motor constants, and also shows rapid adaptation to unknown disturbances during runtime. The core algorithmic idea is to learn a single policy that can adapt online at test time not only to the disturbances applied to the drone, but also to the robot dynamics and hardware in the same framework. We achieve this by training a neural network to estimate a latent representation of the robot and environment parameters, which is used to condition the behaviour of the controller, also represented as a neural network. We train both networks exclusively in simulation with the goal of flying the quadcopters to goal positions and avoiding crashes to the ground. We directly deploy the same controller trained in the simulation without any modifications on two quadcopters in the real world with differences in mass, size, motors, and propellers with mass differing by 4.5 times. In addition, we show rapid adaptation to sudden and large disturbances up to one-third of the mass of the quadcopters. We perform an extensive evaluation in both simulation and the physical world, where we outperform a state-of-the-art learning-based adaptive controller and a traditional PID controller specifically tuned to each platform individually. Video results can be found at https://youtu.be/U-c-LbTfvoA. △ Less

Submitted 2 May, 2023; v1 submitted 19 September, 2022; originally announced September 2022.

Comments: Comprehensive video results can be found on the project webpage https://dz298.github.io/universal-drone-controller/

arXiv:2209.02778 [pdf, other]

Multi-skill Mobile Manipulation for Object Rearrangement

Authors: Jiayuan Gu, Devendra Singh Chaplot, Hao Su, Jitendra Malik

Abstract: We study a modular approach to tackle long-horizon mobile manipulation tasks for object rearrangement, which decomposes a full task into a sequence of subtasks. To tackle the entire task, prior work chains multiple stationary manipulation skills with a point-goal navigation skill, which are learned individually on subtasks. Although more effective than monolithic end-to-end RL policies, this frame… ▽ More We study a modular approach to tackle long-horizon mobile manipulation tasks for object rearrangement, which decomposes a full task into a sequence of subtasks. To tackle the entire task, prior work chains multiple stationary manipulation skills with a point-goal navigation skill, which are learned individually on subtasks. Although more effective than monolithic end-to-end RL policies, this framework suffers from compounding errors in skill chaining, e.g., navigating to a bad location where a stationary manipulation skill can not reach its target to manipulate. To this end, we propose that the manipulation skills should include mobility to have flexibility in interacting with the target object from multiple locations and at the same time the navigation skill could have multiple end points which lead to successful manipulation. We operationalize these ideas by implementing mobile manipulation skills rather than stationary ones and training a navigation skill trained with region goal instead of point goal. We evaluate our multi-skill mobile manipulation method M3 on 3 challenging long-horizon mobile manipulation tasks in the Home Assistant Benchmark (HAB), and show superior performance as compared to the baselines. △ Less

Submitted 6 September, 2022; originally announced September 2022.

Comments: Project website: https://sites.google.com/view/hab-m3

arXiv:2207.08139 [pdf]

2D Self-Organized ONN Model For Handwritten Text Recognition

Authors: Hanadi Hassen Mohammed, Junaid Malik, Somaya Al-Madeed, Serkan Kiranyaz

Abstract: Deep Convolutional Neural Networks (CNNs) have recently reached state-of-the-art Handwritten Text Recognition (HTR) performance. However, recent research has shown that typical CNNs' learning performance is limited since they are homogeneous networks with a simple (linear) neuron model. With their heterogeneous network structure incorporating non-linear neurons, Operational Neural Networks (ONNs)… ▽ More Deep Convolutional Neural Networks (CNNs) have recently reached state-of-the-art Handwritten Text Recognition (HTR) performance. However, recent research has shown that typical CNNs' learning performance is limited since they are homogeneous networks with a simple (linear) neuron model. With their heterogeneous network structure incorporating non-linear neurons, Operational Neural Networks (ONNs) have recently been proposed to address this drawback. Self-ONNs are self-organized variations of ONNs with the generative neuron model that can generate any non-linear function using the Taylor approximation. In this study, in order to improve the state-of-the-art performance level in HTR, the 2D Self-organized ONNs (Self-ONNs) in the core of a novel network model are proposed. Moreover, deformable convolutions, which have recently been demonstrated to tackle variations in the writing styles better, are utilized in this study. The results over the IAM English dataset and HADARA80P Arabic dataset show that the proposed model with the operational layers of Self-ONNs significantly improves Character Error Rate (CER) and Word Error Rate (WER). Compared with its counterpart CNNs, Self-ONNs reduce CER and WER by 1.2% and 3.4 % in the HADARA80P and 0.199% and 1.244% in the IAM dataset. The results over the benchmark IAM demonstrate that the proposed model with the operational layers of Self-ONNs outperforms recent deep CNN models by a significant margin while the use of Self-ONNs with deformable convolutions demonstrates exceptional results. △ Less

Submitted 17 July, 2022; originally announced July 2022.

Comments: To appear in in Applied Soft Computing Journal (Elsevier)

arXiv:2206.00888 [pdf, other]

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

Authors: Sehoon Kim, Amir Gholami, Albert Shaw, Nicholas Lee, Karttikeya Mangalam, Jitendra Malik, Michael W. Mahoney, Kurt Keutzer

Abstract: The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After re-examining the design choices for both the macro and mi… ▽ More The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After re-examining the design choices for both the macro and micro-architecture of Conformer, we propose Squeezeformer which consistently outperforms the state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure of multi-head attention or convolution modules followed up by feed-forward module instead of the Macaron structure proposed in Conformer. Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the activations in the convolutional block, (ii) removes redundant Layer Normalization operations, and (iii) incorporates an efficient depthwise down-sampling layer to efficiently sub-sample the input signal. Squeezeformer achieves state-of-the-art results of 7.5%, 6.5%, and 6.0% word-error-rate (WER) on LibriSpeech test-other without external language models, which are 3.1%, 1.4%, and 0.6% better than Conformer-CTC with the same number of FLOPs. Our code is open-sourced and available online. △ Less

Submitted 15 October, 2022; v1 submitted 2 June, 2022; originally announced June 2022.

Comments: NeurIPS 2022

arXiv:2205.15299 [pdf, other]

Adapting Rapid Motor Adaptation for Bipedal Robots

Authors: Ashish Kumar, Zhongyu Li, Jun Zeng, Deepak Pathak, Koushil Sreenath, Jitendra Malik

Abstract: Recent advances in legged locomotion have enabled quadrupeds to walk on challenging terrains. However, bipedal robots are inherently more unstable and hence it's harder to design walking controllers for them. In this work, we leverage recent advances in rapid adaptation for locomotion control, and extend them to work on bipedal robots. Similar to existing works, we start with a base policy which p… ▽ More Recent advances in legged locomotion have enabled quadrupeds to walk on challenging terrains. However, bipedal robots are inherently more unstable and hence it's harder to design walking controllers for them. In this work, we leverage recent advances in rapid adaptation for locomotion control, and extend them to work on bipedal robots. Similar to existing works, we start with a base policy which produces actions while taking as input an estimated extrinsics vector from an adaptation module. This extrinsics vector contains information about the environment and enables the walking controller to rapidly adapt online. However, the extrinsics estimator could be imperfect, which might lead to poor performance of the base policy which expects a perfect estimator. In this paper, we propose A-RMA (Adapting RMA), which additionally adapts the base policy for the imperfect extrinsics estimator by finetuning it using model-free RL. We demonstrate that A-RMA outperforms a number of RL-based baseline controllers and model-based controllers in simulation, and show zero-shot deployment of a single A-RMA policy to enable a bipedal robot, Cassie, to walk in a variety of different scenarios in the real world beyond what it has seen during training. Videos and results at https://ashish-kmr.github.io/a-rma/ △ Less

Submitted 6 September, 2022; v1 submitted 30 May, 2022; originally announced May 2022.

Comments: First two authors contributed equally. Website at https://ashish-kmr.github.io/a-rma/

arXiv:2204.06107 [pdf, other]

Open-World Instance Segmentation: Exploiting Pseudo Ground Truth From Learned Pairwise Affinity

Authors: Weiyao Wang, Matt Feiszli, Heng Wang, Jitendra Malik, Du Tran

Abstract: Open-world instance segmentation is the task of grou** pixels into object instances without any pre-determined taxonomy. This is challenging, as state-of-the-art methods rely on explicit class semantics obtained from large labeled datasets, and out-of-domain evaluation performance drops significantly. Here we propose a novel approach for mask proposals, Generic Grou** Networks (GGNs), construc… ▽ More Open-world instance segmentation is the task of grou** pixels into object instances without any pre-determined taxonomy. This is challenging, as state-of-the-art methods rely on explicit class semantics obtained from large labeled datasets, and out-of-domain evaluation performance drops significantly. Here we propose a novel approach for mask proposals, Generic Grou** Networks (GGNs), constructed without semantic supervision. Our approach combines a local measure of pixel affinity with instance-level mask supervision, producing a training regimen designed to make the model as generic as the data diversity allows. We introduce a method for predicting Pairwise Affinities (PA), a learned local relationship between pairs of pixels. PA generalizes very well to unseen categories. From PA we construct a large set of pseudo-ground-truth instance masks; combined with human-annotated instance masks we train GGNs and significantly outperform the SOTA on open-world instance segmentation on various benchmarks including COCO, LVIS, ADE20K, and UVO. Code is available on project website: https://sites.google.com/view/generic-grou**/. △ Less

Submitted 12 April, 2022; originally announced April 2022.

Comments: CVPR 2022

arXiv:2204.02722 [pdf, other]

doi 10.1007/s12043-023-02521-4

Symmetric Bidirectional Quantum Teleportation using a Six-Qubit Cluster State as a Quantum Channel

Authors: Javid A Malik, Muzaffar Qadir Lone, Rayees A Malla

Abstract: Bidirectional quantum teleportation is a fundamental protocol for exchanging quantum information between two quantum nodes. All bidirectional quantum teleportation protocols till now have achieved a maximum efficiency of $40\%$. Here, we propose a new scheme for symmetric bidirectional quantum teleportation using a six-qubit cluster state as the quantum channel, for symmetric ($3\leftrightarrow3$)… ▽ More Bidirectional quantum teleportation is a fundamental protocol for exchanging quantum information between two quantum nodes. All bidirectional quantum teleportation protocols till now have achieved a maximum efficiency of $40\%$. Here, we propose a new scheme for symmetric bidirectional quantum teleportation using a six-qubit cluster state as the quantum channel, for symmetric ($3\leftrightarrow3$) qubit bidirectional quantum teleportation of a special three-qubit entangled state. The novelty of our scheme lies in its generalization for ($N\leftrightarrow N$) qubit bidirectional quantum teleportation employing a $2N$-qubit cluster state. The efficiency of the proposed protocol is remarkably increased to $50\%$ which is highest till now. Interestingly, only GHZ-state measurements and four Toffoli-gate operations are necessary which is independent of the number $(N)$ of qubits to be teleported. △ Less

Submitted 6 April, 2022; originally announced April 2022.

arXiv:2203.06173 [pdf, other]

Masked Visual Pre-training for Motor Control

Authors: Tete Xiao, Ilija Radosavovic, Trevor Darrell, Jitendra Malik

Abstract: This paper shows that self-supervised visual pre-training from real-world images is effective for learning motor control tasks from pixels. We first train the visual representations by masked modeling of natural images. We then freeze the visual encoder and train neural network controllers on top with reinforcement learning. We do not perform any task-specific fine-tuning of the encoder; the same… ▽ More This paper shows that self-supervised visual pre-training from real-world images is effective for learning motor control tasks from pixels. We first train the visual representations by masked modeling of natural images. We then freeze the visual encoder and train neural network controllers on top with reinforcement learning. We do not perform any task-specific fine-tuning of the encoder; the same visual representations are used for all motor control tasks. To the best of our knowledge, this is the first self-supervised model to exploit real-world images at scale for motor control. To accelerate progress in learning from pixels, we contribute a benchmark suite of hand-designed tasks varying in movements, scenes, and robots. Without relying on labels, state-estimation, or expert demonstrations, we consistently outperform supervised encoders by up to 80% absolute success rate, sometimes even matching the oracle state performance. We also find that in-the-wild images, e.g., from YouTube or Egocentric videos, lead to better visual representations for various manipulation tasks than ImageNet images. △ Less

Submitted 11 March, 2022; originally announced March 2022.

Comments: Code and videos at: https://tetexiao.com/projects/mvp

arXiv:2202.09517 [pdf, other]

Deep Learning for Hate Speech Detection: A Comparative Study

Authors: Jitendra Singh Malik, Hezhe Qiao, Guansong Pang, Anton van den Hengel

Abstract: Automated hate speech detection is an important tool in combating the spread of hate speech, particularly in social media. Numerous methods have been developed for the task, including a recent proliferation of deep-learning based approaches. A variety of datasets have also been developed, exemplifying various manifestations of the hate-speech detection problem. We present here a large-scale empiri… ▽ More Automated hate speech detection is an important tool in combating the spread of hate speech, particularly in social media. Numerous methods have been developed for the task, including a recent proliferation of deep-learning based approaches. A variety of datasets have also been developed, exemplifying various manifestations of the hate-speech detection problem. We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods, mediated through the three most commonly used datasets. Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art. We particularly focus our analysis on measures of practical performance, including detection accuracy, computational efficiency, capability in using pre-trained models, and domain generalization. In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions. Code and dataset are available at https://github.com/jmjmalik22/Hate-Speech-Detection. △ Less

Submitted 6 December, 2023; v1 submitted 18 February, 2022; originally announced February 2022.

Comments: 18 pages, 4 figures, and 6 tables

arXiv:2202.05265 [pdf, other]

Image-to-Image Regression with Distribution-Free Uncertainty Quantification and Applications in Imaging

Authors: Anastasios N Angelopoulos, Amit P Kohli, Stephen Bates, Michael I Jordan, Jitendra Malik, Thayer Alshaabi, Srigokul Upadhyayula, Yaniv Romano

Abstract: Image-to-image regression is an important learning task, used frequently in biological imaging. Current algorithms, however, do not generally offer statistical guarantees that protect against a model's mistakes and hallucinations. To address this, we develop uncertainty quantification techniques with rigorous statistical guarantees for image-to-image regression problems. In particular, we show how… ▽ More Image-to-image regression is an important learning task, used frequently in biological imaging. Current algorithms, however, do not generally offer statistical guarantees that protect against a model's mistakes and hallucinations. To address this, we develop uncertainty quantification techniques with rigorous statistical guarantees for image-to-image regression problems. In particular, we show how to derive uncertainty intervals around each pixel that are guaranteed to contain the true value with a user-specified confidence probability. Our methods work in conjunction with any base machine learning model, such as a neural network, and endow it with formal mathematical guarantees -- regardless of the true unknown data distribution or choice of model. Furthermore, they are simple to implement and computationally inexpensive. We evaluate our procedure on three image-to-image regression tasks: quantitative phase microscopy, accelerated magnetic resonance imaging, and super-resolution transmission electron microscopy of a Drosophila melanogaster brain. △ Less

Submitted 10 February, 2022; originally announced February 2022.

Comments: Code available at https://github.com/aangelopoulos/im2im-uq

arXiv:2202.00589 [pdf]

Blind ECG Restoration by Operational Cycle-GANs

Authors: Serkan Kiranyaz, Ozer Can Devecioglu, Turker Ince, Junaid Malik, Muhammad Chowdhury, Tahir Hamid, Rashid Mazhar, Amith Khandakar, Anas Tahir, Tawsifur Rahman, Moncef Gabbouj

Abstract: Continuous long-term monitoring of electrocardiography (ECG) signals is crucial for the early detection of cardiac abnormalities such as arrhythmia. Non-clinical ECG recordings acquired by Holter and wearable ECG sensors often suffer from severe artifacts such as baseline wander, signal cuts, motion artifacts, variations on QRS amplitude, noise, and other interferences. Usually, a set of such arti… ▽ More Continuous long-term monitoring of electrocardiography (ECG) signals is crucial for the early detection of cardiac abnormalities such as arrhythmia. Non-clinical ECG recordings acquired by Holter and wearable ECG sensors often suffer from severe artifacts such as baseline wander, signal cuts, motion artifacts, variations on QRS amplitude, noise, and other interferences. Usually, a set of such artifacts occur on the same ECG signal with varying severity and duration, and this makes an accurate diagnosis by machines or medical doctors extremely difficult. Despite numerous studies that have attempted ECG denoising, they naturally fail to restore the actual ECG signal corrupted with such artifacts due to their simple and naive noise model. In this study, we propose a novel approach for blind ECG restoration using cycle-consistent generative adversarial networks (Cycle-GANs) where the quality of the signal can be improved to a clinical level ECG regardless of the type and severity of the artifacts corrupting the signal. To further boost the restoration performance, we propose 1D operational Cycle-GANs with the generative neuron model. The proposed approach has been evaluated extensively using one of the largest benchmark ECG datasets from the China Physiological Signal Challenge (CPSC-2020) with more than one million beats. Besides the quantitative and qualitative evaluations, a group of cardiologists performed medical evaluations to validate the quality and usability of the restored ECG, especially for an accurate arrhythmia diagnosis. △ Less

Submitted 29 January, 2022; originally announced February 2022.

Comments: 16 pages, 10 figures, journal article submission

arXiv:2201.10029 [pdf, other]

PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning

Authors: Santhosh Kumar Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, Kristen Grauman

Abstract: State-of-the-art approaches to ObjectGoal navigation rely on reinforcement learning and typically require significant computational resources and time for learning. We propose Potential functions for ObjectGoal Navigation with Interaction-free learning (PONI), a modular approach that disentangles the skills of `where to look?' for an object and `how to navigate to (x, y)?'. Our key insight is that… ▽ More State-of-the-art approaches to ObjectGoal navigation rely on reinforcement learning and typically require significant computational resources and time for learning. We propose Potential functions for ObjectGoal Navigation with Interaction-free learning (PONI), a modular approach that disentangles the skills of `where to look?' for an object and `how to navigate to (x, y)?'. Our key insight is that `where to look?' can be treated purely as a perception problem, and learned without environment interactions. To address this, we propose a network that predicts two complementary potential functions conditioned on a semantic map and uses them to decide where to look for an unseen object. We train the potential function network using supervised learning on a passive dataset of top-down semantic maps, and integrate it into a modular framework to perform ObjectGoal navigation. Experiments on Gibson and Matterport3D demonstrate that our method achieves the state-of-the-art for ObjectGoal navigation while incurring up to 1,600x less computational cost for training. Code and pre-trained models are available: https://vision.cs.utexas.edu/projects/poni/ △ Less

Submitted 17 June, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

Comments: 8 pages + supplementary. Accepted in CVPR 2022

arXiv:2201.08383 [pdf, other]

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

Authors: Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer

Abstract: While today's video recognition systems parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time yet. Most existing video architectures can only process <5 seconds of a video without hitting the computation or memory bottlenecks. In this paper, we propose a new strategy to overcome this challenge. Instead of trying to process more frames at… ▽ More While today's video recognition systems parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time yet. Most existing video architectures can only process <5 seconds of a video without hitting the computation or memory bottlenecks. In this paper, we propose a new strategy to overcome this challenge. Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache "memory" at each iteration. Through the memory, the model can reference prior context for long-term modeling, with only a marginal cost. Based on this idea, we build MeMViT, a Memory-augmented Multiscale Vision Transformer, that has a temporal support 30x longer than existing models with only 4.5% more compute; traditional methods need >3,000% more compute to do the same. On a wide range of settings, the increased temporal support enabled by MeMViT brings large gains in recognition accuracy consistently. MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets. Code and models are available at https://github.com/facebookresearch/memvit. △ Less

Submitted 30 November, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

Comments: Technical report. arXiv v2: add link to code

arXiv:2201.06091 [pdf, other]

Parallel transmit PUlse design for Saturation Homogeneity (PUSH) for Magnetization Transfer imaging at 7T

Authors: David Leitão, Raphael Tomi-Tricot, Pip Bridgen, Tom Wilkinson, Patrick Liebig, Rene Gumbrecht, Dieter Ritter, Sharon L. Giles, Ana Baburamani, Jan Sedlacik, Joseph V. Hajnal, Shaihan J. Malik

Abstract: Purpose: This work proposes a novel RF pulse design for parallel transmit (pTx) systems to obtain uniform saturation of semisolid magnetization for Magnetization Transfer (MT) contrast in the presence of transmit field ($B_1^+$) inhomogeneities. The semisolid magnetization is usually modeled as being purely longitudinal, with the applied $B_1^+$ field saturating but not rotating its magnetization,… ▽ More Purpose: This work proposes a novel RF pulse design for parallel transmit (pTx) systems to obtain uniform saturation of semisolid magnetization for Magnetization Transfer (MT) contrast in the presence of transmit field ($B_1^+$) inhomogeneities. The semisolid magnetization is usually modeled as being purely longitudinal, with the applied $B_1^+$ field saturating but not rotating its magnetization, thus standard pTx pulse design methods do not apply. Theory and Methods: Pulse design for Saturation Homogeneity (PUSH) optimizes pTx RF pulses by considering uniformity of root-mean squared $B_1^+$, $B_1^{rms}$, which relates to the rate of semisolid saturation. Here we considered designs consisting of a small number of spatially non-selective sub-pulses optimized over either a single 2D plane or 3D. Simulations and in vivo experiments on a 7T Terra system with an 8-TX Nova head coil in 5 subjects were carried out to study the homogenization of $B_1^{rms}$ and of the MT contrast by acquiring MT ratio maps. Results: Simulations and in vivo experiments showed up to 6 and 2 times more uniform $B_1^{rms}$ compared to circular polarized (CP) mode for 2D and 3D optimizations, respectively. This translated into 4 and 1.25 times more uniform MT contrast, consistently for all subjects, where 2 sub-pulses were enough for the implementation and coil used. Conclusion: The proposed PUSH method obtains more uniform and higher MT contrast than CP mode within the same SAR budget. △ Less

Submitted 16 January, 2022; originally announced January 2022.

Comments: 18 pages, 9 figures. Code available at: https://github.com/mriphysics/PUSH

arXiv:2201.02076 [pdf]

Universal pulses for homogeneous excitation using single channel coils

Authors: Ronald Mooiweer, Ian A. Clark, Eleanor A. Maguire, Martina F. Callaghan, Jospeh V. Hajnal, Shaihan J. Malik

Abstract: Purpose: Universal Pulses (UPs) are excitation pulses that reduce the flip angle inhomogeneity in high field MRI systems without subject-specific optimization, originally developed for parallel transmit (PTX) systems at 7T. We investigated the potential benefits of UPs for single channel (SC) transmit systems at 3T, which are widely used for clinical and research imaging, and for which flip angle… ▽ More Purpose: Universal Pulses (UPs) are excitation pulses that reduce the flip angle inhomogeneity in high field MRI systems without subject-specific optimization, originally developed for parallel transmit (PTX) systems at 7T. We investigated the potential benefits of UPs for single channel (SC) transmit systems at 3T, which are widely used for clinical and research imaging, and for which flip angle inhomogeneity can still be problematic. Methods: SC-UPs were designed using a spiral nonselective k-space trajectory for brain imaging at 3T using transmit field maps (B1+) and off-resonance maps (B0) acquired on two different scanner types: a 'standard' single channel transmit system and a system with a PTX body coil. The effect of training group size was investigated using data (200 subjects) from the standard system. The PTX system was used to compare SC-UPs to PTX-UPs (15 subjects). In two additional subjects, prospective imaging using SC-UP was studied. Results: Average flip angle error fell from 9.5+/-0.5% for 'default' excitation to 3.0+/-0.6% using SC-UPs trained over 50 subjects. Performance of the UPs was found to steadily improve as training group size increased, but stabilized after ~15 subjects. On the PTX-enabled system, SC-UPs again outperformed default excitation in simulations (4.8+/-0.6% error versus 10.6+/-0.8% respectively) though greater homogenization could be achieved with PTX-UPs (3.9+/-0.6%) and personalized pulses (SC-PP 3.6+/-1.0%, PTX-PP 2.9+/-0.6%). MP-RAGE imaging using SC-UP resulted in greater separation between grey and white matter signal intensities than default excitation. Conclusions: SC-UPs can improve excitation homogeneity in standard 3T systems without further calibration and could be used instead of a default excitation pulse for nonselective neuroimaging at 3T. △ Less

Submitted 6 January, 2022; originally announced January 2022.

Comments: Submitted to Magnetic Resonance Imaging

arXiv:2112.04477 [pdf, other]

Tracking People by Predicting 3D Appearance, Location & Pose

Authors: Jathushan Rajasegaran, Georgios Pavlakos, Angjoo Kanazawa, Jitendra Malik

Abstract: In this paper, we present an approach for tracking people in monocular videos, by predicting their future 3D representations. To achieve this, we first lift people to 3D from a single frame in a robust way. This lifting includes information about the 3D pose of the person, his or her location in the 3D space, and the 3D appearance. As we track a person, we collect 3D observations over time in a tr… ▽ More In this paper, we present an approach for tracking people in monocular videos, by predicting their future 3D representations. To achieve this, we first lift people to 3D from a single frame in a robust way. This lifting includes information about the 3D pose of the person, his or her location in the 3D space, and the 3D appearance. As we track a person, we collect 3D observations over time in a tracklet representation. Given the 3D nature of our observations, we build temporal models for each one of the previous attributes. We use these models to predict the future state of the tracklet, including 3D location, 3D appearance, and 3D pose. For a future frame, we compute the similarity between the predicted state of a tracklet and the single frame observations in a probabilistic manner. Association is solved with simple Hungarian matching, and the matches are used to update the respective tracklets. We evaluate our approach on various benchmarks and report state-of-the-art results. △ Less

Submitted 8 December, 2021; originally announced December 2021.

Comments: Project Page : https://brjathu.github.io/PHALP/

arXiv:2112.02094 [pdf, other]

Coupling Vision and Proprioception for Navigation of Legged Robots

Authors: Zipeng Fu, Ashish Kumar, Ananye Agarwal, Haozhi Qi, Jitendra Malik, Deepak Pathak

Abstract: We exploit the complementary strengths of vision and proprioception to develop a point-goal navigation system for legged robots, called VP-Nav. Legged systems are capable of traversing more complex terrain than wheeled robots, but to fully utilize this capability, we need a high-level path planner in the navigation system to be aware of the walking capabilities of the low-level locomotion policy i… ▽ More We exploit the complementary strengths of vision and proprioception to develop a point-goal navigation system for legged robots, called VP-Nav. Legged systems are capable of traversing more complex terrain than wheeled robots, but to fully utilize this capability, we need a high-level path planner in the navigation system to be aware of the walking capabilities of the low-level locomotion policy in varying environments. We achieve this by using proprioceptive feedback to ensure the safety of the planned path by sensing unexpected obstacles like glass walls, terrain properties like slipperiness or softness of the ground and robot properties like extra payload that are likely missed by vision. The navigation system uses onboard cameras to generate an occupancy map and a corresponding cost map to reach the goal. A fast marching planner then generates a target path. A velocity command generator takes this as input to generate the desired velocity for the walking policy. A safety advisor module adds sensed unexpected obstacles to the occupancy map and environment-determined speed limits to the velocity command generator. We show superior performance compared to wheeled robot baselines, and ablation studies which have disjoint high-level planning and low-level control. We also show the real-world deployment of VP-Nav on a quadruped robot with onboard sensors and computation. Videos at https://navigation-locomotion.github.io △ Less

Submitted 24 July, 2022; v1 submitted 3 December, 2021; originally announced December 2021.

Comments: CVPR 2022 final version. Website at https://navigation-locomotion.github.io

arXiv:2112.01526 [pdf, other]

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Authors: Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer

Abstract: In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and K… ▽ More In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms where it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as well as 86.1% on Kinetics-400 video classification. Code and models are available at https://github.com/facebookresearch/mvit. △ Less

Submitted 30 March, 2022; v1 submitted 2 December, 2021; originally announced December 2021.

Comments: CVPR 2022 Camera Ready

arXiv:2112.01010 [pdf, other]

Differentiable Spatial Planning using Transformers

Authors: Devendra Singh Chaplot, Deepak Pathak, Jitendra Malik

Abstract: We consider the problem of spatial path planning. In contrast to the classical solutions which optimize a new plan from scratch and assume access to the full map with ground truth obstacle locations, we learn a planner from the data in a differentiable manner that allows us to leverage statistical regularities from past data. We propose Spatial Planning Transformers (SPT), which given an obstacle… ▽ More We consider the problem of spatial path planning. In contrast to the classical solutions which optimize a new plan from scratch and assume access to the full map with ground truth obstacle locations, we learn a planner from the data in a differentiable manner that allows us to leverage statistical regularities from past data. We propose Spatial Planning Transformers (SPT), which given an obstacle map learns to generate actions by planning over long-range spatial dependencies, unlike prior data-driven planners that propagate information locally via convolutional structure in an iterative manner. In the setting where the ground truth map is not known to the agent, we leverage pre-trained SPTs in an end-to-end framework that has the structure of mapper and planner built into it which allows seamless generalization to out-of-distribution maps and goals. SPTs outperform prior state-of-the-art differentiable planners across all the setups for both manipulation and navigation tasks, leading to an absolute improvement of 7-19%. △ Less

Submitted 2 December, 2021; originally announced December 2021.

Comments: Published at ICML 2021. See project webpage at https://devendrachaplot.github.io/projects/spatial-planning-transformers

arXiv:2112.01001 [pdf, other]

SEAL: Self-supervised Embodied Active Learning using Exploration and 3D Consistency

Authors: Devendra Singh Chaplot, Murtaza Dalal, Saurabh Gupta, Jitendra Malik, Ruslan Salakhutdinov

Abstract: In this paper, we explore how we can build upon the data and models of Internet images and use them to adapt to robot vision without requiring any extra labels. We present a framework called Self-supervised Embodied Active Learning (SEAL). It utilizes perception models trained on internet images to learn an active exploration policy. The observations gathered by this exploration policy are labelle… ▽ More In this paper, we explore how we can build upon the data and models of Internet images and use them to adapt to robot vision without requiring any extra labels. We present a framework called Self-supervised Embodied Active Learning (SEAL). It utilizes perception models trained on internet images to learn an active exploration policy. The observations gathered by this exploration policy are labelled using 3D consistency and used to improve the perception model. We build and utilize 3D semantic maps to learn both action and perception in a completely self-supervised manner. The semantic map is used to compute an intrinsic motivation reward for training the exploration policy and for labelling the agent observations using spatio-temporal 3D consistency and label propagation. We demonstrate that the SEAL framework can be used to close the action-perception loop: it improves object detection and instance segmentation performance of a pretrained perception model by just moving around in training environments and the improved perception model can be used to improve Object Goal Navigation. △ Less

Submitted 2 December, 2021; originally announced December 2021.

Comments: Published at NeurIPS 2021. See project webpage at https://devendrachaplot.github.io/projects/seal

arXiv:2111.14948 [pdf]

Image denoising by Super Neurons: Why go deep?

Authors: Junaid Malik, Serkan Kiranyaz, Moncef Gabbouj

Abstract: Classical image denoising methods utilize the non-local self-similarity principle to effectively recover image content from noisy images. Current state-of-the-art methods use deep convolutional neural networks (CNNs) to effectively learn the map** from noisy to clean images. Deep denoising CNNs manifest a high learning capacity and integrate non-local information owing to the large receptive fie… ▽ More Classical image denoising methods utilize the non-local self-similarity principle to effectively recover image content from noisy images. Current state-of-the-art methods use deep convolutional neural networks (CNNs) to effectively learn the map** from noisy to clean images. Deep denoising CNNs manifest a high learning capacity and integrate non-local information owing to the large receptive field yielded by numerous cascade of hidden layers. However, deep networks are also computationally complex and require large data for training. To address these issues, this study draws the focus on the Self-organized Operational Neural Networks (Self-ONNs) empowered by a novel neuron model that can achieve a similar or better denoising performance with a compact and shallow model. Recently, the concept of super-neurons has been introduced which augment the non-linear transformations of generative neurons by utilizing non-localized kernel locations for an enhanced receptive field size. This is the key accomplishment which renders the need for a deep network configuration. As the integration of non-local information is known to benefit denoising, in this work we investigate the use of super neurons for both synthetic and real-world image denoising. We also discuss the practical issues in implementing the super neuron model on GPUs and propose a trade-off between the heterogeneity of non-localized operations and computational complexity. Our results demonstrate that with the same width and depth, Self-ONNs with super neurons provide a significant boost of denoising performance over the networks with generative and convolutional neurons for both denoising tasks. Moreover, results demonstrate that Self-ONNs with super neurons can achieve a competitive and superior synthetic denoising performances than well-known deep CNN denoisers for synthetic and real-world denoising, respectively. △ Less

Submitted 29 November, 2021; originally announced November 2021.

arXiv:2111.09887 [pdf, other]

PyTorchVideo: A Deep Learning Library for Video Understanding

Authors: Haoqi Fan, Tullie Murrell, Heng Wang, Kalyan Vasudev Alwala, Yanghao Li, Yilei Li, Bo Xiong, Nikhila Ravi, Meng Li, Haichuan Yang, Jitendra Malik, Ross Girshick, Matt Feiszli, Aaron Adcock, Wan-Yen Lo, Christoph Feichtenhofer

Abstract: We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing. The library covers a full stack of video understanding tools including multimodal data loading, transformations, and models tha… ▽ More We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing. The library covers a full stack of video understanding tools including multimodal data loading, transformations, and models that reproduce state-of-the-art performance. PyTorchVideo further supports hardware acceleration that enables real-time inference on mobile devices. The library is based on PyTorch and can be used by any training framework; for example, PyTorchLightning, PySlowFast, or Classy Vision. PyTorchVideo is available at https://pytorchvideo.org/ △ Less

Submitted 18 November, 2021; originally announced November 2021.

Comments: Technical report

arXiv:2111.07868 [pdf, other]

Tracking People with 3D Representations

Authors: Jathushan Rajasegaran, Georgios Pavlakos, Angjoo Kanazawa, Jitendra Malik

Abstract: We present a novel approach for tracking multiple people in video. Unlike past approaches which employ 2D representations, we focus on using 3D representations of people, located in three-dimensional space. To this end, we develop a method, Human Mesh and Appearance Recovery (HMAR) which in addition to extracting the 3D geometry of the person as a SMPL mesh, also extracts appearance as a texture m… ▽ More We present a novel approach for tracking multiple people in video. Unlike past approaches which employ 2D representations, we focus on using 3D representations of people, located in three-dimensional space. To this end, we develop a method, Human Mesh and Appearance Recovery (HMAR) which in addition to extracting the 3D geometry of the person as a SMPL mesh, also extracts appearance as a texture map on the triangles of the mesh. This serves as a 3D representation for appearance that is robust to viewpoint and pose changes. Given a video clip, we first detect bounding boxes corresponding to people, and for each one, we extract 3D appearance, pose, and location information using HMAR. These embedding vectors are then sent to a transformer, which performs spatio-temporal aggregation of the representations over the duration of the sequence. The similarity of the resulting representations is used to solve for associations that assigns each person to a tracklet. We evaluate our approach on the Posetrack, MuPoTs and AVA datasets. We find that 3D representations are more effective than 2D representations for tracking in these settings, and we obtain state-of-the-art performance. Code and results are available at: https://brjathu.github.io/T3DP. △ Less

Submitted 15 November, 2021; originally announced November 2021.

arXiv:2111.01674 [pdf, other]

Minimizing Energy Consumption Leads to the Emergence of Gaits in Legged Robots

Authors: Zipeng Fu, Ashish Kumar, Jitendra Malik, Deepak Pathak

Abstract: Legged locomotion is commonly studied and expressed as a discrete set of gait patterns, like walk, trot, gallop, which are usually treated as given and pre-programmed in legged robots for efficient locomotion at different speeds. However, fixing a set of pre-programmed gaits limits the generality of locomotion. Recent animal motor studies show that these conventional gaits are only prevalent in id… ▽ More Legged locomotion is commonly studied and expressed as a discrete set of gait patterns, like walk, trot, gallop, which are usually treated as given and pre-programmed in legged robots for efficient locomotion at different speeds. However, fixing a set of pre-programmed gaits limits the generality of locomotion. Recent animal motor studies show that these conventional gaits are only prevalent in ideal flat terrain conditions while real-world locomotion is unstructured and more like bouts of intermittent steps. What principles could lead to both structured and unstructured patterns across mammals and how to synthesize them in robots? In this work, we take an analysis-by-synthesis approach and learn to move by minimizing mechanical energy. We demonstrate that learning to minimize energy consumption plays a key role in the emergence of natural locomotion gaits at different speeds in real quadruped robots. The emergent gaits are structured in ideal terrains and look similar to that of horses and sheep. The same approach leads to unstructured gaits in rough terrains which is consistent with the findings in animal motor control. We validate our hypothesis in both simulation and real hardware across natural terrains. Videos at https://energy-locomotion.github.io △ Less

Submitted 25 October, 2021; originally announced November 2021.

Comments: CoRL 2021. Website at https://energy-locomotion.github.io

arXiv:2110.07058 [pdf, other]

Ego4D: Around the World in 3,000 Hours of Egocentric Video

Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do , et al. (60 additional authors not shown)

Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with cons… ▽ More We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/ △ Less

Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

arXiv:2110.06199 [pdf, other]

ABO: Dataset and Benchmarks for Real-World 3D Object Understanding

Authors: Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F. Yago Vicente, Thomas Dideriksen, Himanshu Arora, Matthieu Guillaumin, Jitendra Malik

Abstract: We introduce Amazon Berkeley Objects (ABO), a new large-scale dataset designed to help bridge the gap between real and virtual 3D worlds. ABO contains product catalog images, metadata, and artist-created 3D models with complex geometries and physically-based materials that correspond to real, household objects. We derive challenging benchmarks that exploit the unique properties of ABO and measure… ▽ More We introduce Amazon Berkeley Objects (ABO), a new large-scale dataset designed to help bridge the gap between real and virtual 3D worlds. ABO contains product catalog images, metadata, and artist-created 3D models with complex geometries and physically-based materials that correspond to real, household objects. We derive challenging benchmarks that exploit the unique properties of ABO and measure the current limits of the state-of-the-art on three open problems for real-world 3D object understanding: single-view 3D reconstruction, material estimation, and cross-domain multi-view object retrieval. △ Less

Submitted 24 June, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

arXiv:2110.05472 [pdf, other]

Differentiable Stereopsis: Meshes from multiple views using differentiable rendering

Authors: Shubham Goel, Georgia Gkioxari, Jitendra Malik

Abstract: We propose Differentiable Stereopsis, a multi-view stereo approach that reconstructs shape and texture from few input views and noisy cameras. We pair traditional stereopsis and modern differentiable rendering to build an end-to-end model which predicts textured 3D meshes of objects with varying topologies and shape. We frame stereopsis as an optimization problem and simultaneously update shape an… ▽ More We propose Differentiable Stereopsis, a multi-view stereo approach that reconstructs shape and texture from few input views and noisy cameras. We pair traditional stereopsis and modern differentiable rendering to build an end-to-end model which predicts textured 3D meshes of objects with varying topologies and shape. We frame stereopsis as an optimization problem and simultaneously update shape and cameras via simple gradient descent. We run an extensive quantitative analysis and compare to traditional multi-view stereo techniques and state-of-the-art learning based methods. We show compelling reconstructions on challenging real-world scenes and for an abundance of object types with complex shape, topology and texture. Project webpage: https://shubham-goel.github.io/ds/ △ Less

Submitted 23 September, 2022; v1 submitted 11 October, 2021; originally announced October 2021.

Comments: In CVPR2022. Project webpage: https://shubham-goel.github.io/ds/

Journal ref: In CVPR 2022 (pp. 8635-8644)

arXiv:2110.04994 [pdf, other]

Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans

Authors: Ainaz Eftekhar, Alexander Sax, Roman Bachmann, Jitendra Malik, Amir Zamir

Abstract: This paper introduces a pipeline to parametrically sample and render multi-task vision datasets from comprehensive 3D scans from the real world. Changing the sampling parameters allows one to "steer" the generated datasets to emphasize specific information. In addition to enabling interesting lines of research, we show the tooling and generated data suffice to train robust vision models. Common… ▽ More This paper introduces a pipeline to parametrically sample and render multi-task vision datasets from comprehensive 3D scans from the real world. Changing the sampling parameters allows one to "steer" the generated datasets to emphasize specific information. In addition to enabling interesting lines of research, we show the tooling and generated data suffice to train robust vision models. Common architectures trained on a generated starter dataset reached state-of-the-art performance on multiple common vision tasks and benchmarks, despite having seen no benchmark or non-pipeline data. The depth estimation network outperforms MiDaS and the surface normal estimation network is the first to achieve human-level performance for in-the-wild surface normal estimation -- at least according to one metric on the OASIS benchmark. The Dockerized pipeline with CLI, the (mostly python) code, PyTorch dataloaders for the generated data, the generated starter dataset, download scripts and other utilities are available through our project website, https://omnidata.vision. △ Less

Submitted 11 October, 2021; originally announced October 2021.

Comments: ICCV 2021: See project website https://omnidata.vision

arXiv:2110.02381 [pdf]

doi 10.1109/TNNLS.2022.3158867

Robust Peak Detection for Holter ECGs by Self-Organized Operational Neural Networks

Authors: Moncef Gabbouj, Serkan Kiranyaz, Junaid Malik, Muhammad Uzair Zahid, Turker Ince, Muhammad Chowdhury, Amith Khandakar, Anas Tahir

Abstract: Although numerous R-peak detectors have been proposed in the literature, their robustness and performance levels may significantly deteriorate in low-quality and noisy signals acquired from mobile electrocardiogram (ECG) sensors, such as Holter monitors. Recently, this issue has been addressed by deep 1-D convolutional neural networks (CNNs) that have achieved state-of-the-art performance levels i… ▽ More Although numerous R-peak detectors have been proposed in the literature, their robustness and performance levels may significantly deteriorate in low-quality and noisy signals acquired from mobile electrocardiogram (ECG) sensors, such as Holter monitors. Recently, this issue has been addressed by deep 1-D convolutional neural networks (CNNs) that have achieved state-of-the-art performance levels in Holter monitors; however, they pose a high complexity level that requires special parallelized hardware setup for real-time processing. On the other hand, their performance deteriorates when a compact network configuration is used instead. This is an expected outcome as recent studies have demonstrated that the learning performance of CNNs is limited due to their strictly homogenous configuration with the sole linear neuron model. In this study, to further boost the peak detection performance along with an elegant computational efficiency, we propose 1-D Self-Organized ONNs (Self-ONNs) with generative neurons. The most crucial advantage of 1-D Self-ONNs over the ONNs is their self-organization capability that voids the need to search for the best operator set per neuron since each generative neuron has the ability to create the optimal operator during training. The experimental results over the China Physiological Signal Challenge-2020 (CPSC) dataset with more than one million ECG beats show that the proposed 1-D Self-ONNs can significantly surpass the state-of-the-art deep CNN with less computational complexity. Results demonstrate that the proposed solution achieves a 99.10% F1-score, 99.79% sensitivity, and 98.42% positive predictivity in the CPSC dataset, which is the best R-peak detection performance ever achieved. △ Less

Submitted 12 January, 2024; v1 submitted 30 September, 2021; originally announced October 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2110.02215

Journal ref: in IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 11, pp. 9363-9374, Nov. 2023

arXiv:2110.02215 [pdf]

Real-Time Patient-Specific ECG Classification by 1D Self-Operational Neural Networks

Authors: Junaid Malik, Ozer Can Devecioglu, Serkan Kiranyaz, Turker Ince, Moncef Gabbouj

Abstract: Despite the proliferation of numerous deep learning methods proposed for generic ECG classification and arrhythmia detection, compact systems with the real-time ability and high accuracy for classifying patient-specific ECG are still few. Particularly, the scarcity of patient-specific data poses an ultimate challenge to any classifier. Recently, compact 1D Convolutional Neural Networks (CNNs) have… ▽ More Despite the proliferation of numerous deep learning methods proposed for generic ECG classification and arrhythmia detection, compact systems with the real-time ability and high accuracy for classifying patient-specific ECG are still few. Particularly, the scarcity of patient-specific data poses an ultimate challenge to any classifier. Recently, compact 1D Convolutional Neural Networks (CNNs) have achieved the state-of-the-art performance level for the accurate classification of ventricular and supraventricular ectopic beats. However, several studies have demonstrated the fact that the learning performance of the conventional CNNs is limited because they are homogenous networks with a basic (linear) neuron model. In order to address this deficiency and further boost the patient-specific ECG classification performance, in this study, we propose 1D Self-organized Operational Neural Networks (1D Self-ONNs). Due to its self-organization capability, Self-ONNs have the utmost advantage and superiority over conventional ONNs where the prior operator search within the operator set library to find the best possible set of operators is entirely avoided. As the first study where 1D Self-ONNs are ever proposed for a classification task, our results over the MIT-BIH arrhythmia benchmark database demonstrate that 1D Self-ONNs can surpass 1D CNNs with a significant margin while having a similar computational complexity. Under AAMI recommendations and with minimal common training data used, over the entire MIT-BIH dataset 1D Self-ONNs have achieved 98% and 99.04% average accuracies, 76.6% and 93.7% average F1 scores on supra-ventricular and ventricular ectopic beat (VEB) classifications, respectively, which is the highest performance level ever reported. △ Less

Submitted 30 September, 2021; originally announced October 2021.

arXiv:2109.14873 [pdf]

Early Bearing Fault Diagnosis of Rotating Machinery by 1D Self-Organized Operational Neural Networks

Authors: Turker Ince, Junaid Malik, Ozer Can Devecioglu, Serkan Kiranyaz, Onur Avci, Levent Eren, Moncef Gabbouj

Abstract: Preventive maintenance of modern electric rotating machinery (RM) is critical for ensuring reliable operation, preventing unpredicted breakdowns and avoiding costly repairs. Recently many studies investigated machine learning monitoring methods especially based on Deep Learning networks focusing mostly on detecting bearing faults; however, none of them addressed bearing fault severity classificati… ▽ More Preventive maintenance of modern electric rotating machinery (RM) is critical for ensuring reliable operation, preventing unpredicted breakdowns and avoiding costly repairs. Recently many studies investigated machine learning monitoring methods especially based on Deep Learning networks focusing mostly on detecting bearing faults; however, none of them addressed bearing fault severity classification for early fault diagnosis with high enough accuracy. 1D Convolutional Neural Networks (CNNs) have indeed achieved good performance for detecting RM bearing faults from raw vibration and current signals but did not classify fault severity. Furthermore, recent studies have demonstrated the limitation in terms of learning capability of conventional CNNs attributed to the basic underlying linear neuron model. Recently, Operational Neural Networks (ONNs) were proposed to enhance the learning capability of CNN by introducing non-linear neuron models and further heterogeneity in the network configuration. In this study, we propose 1D Self-organized ONNs (Self-ONNs) with generative neurons for bearing fault severity classification and providing continuous condition monitoring. Experimental results over the benchmark NSF/IMS bearing vibration dataset using both x- and y-axis vibration signals for inner race and rolling element faults demonstrate that the proposed 1D Self-ONNs achieve significant performance gap against the state-of-the-art (1D CNNs) with similar computational complexity. △ Less

Submitted 30 September, 2021; originally announced September 2021.

arXiv:2109.13604 [pdf]

Real-Time Glaucoma Detection from Digital Fundus Images using Self-ONNs

Authors: Ozer Can Devecioglu, Junaid Malik, Turker Ince, Serkan Kiranyaz, Eray Atalay, Moncef Gabbouj

Abstract: Glaucoma leads to permanent vision disability by damaging the optical nerve that transmits visual images to the brain. The fact that glaucoma does not show any symptoms as it progresses and cannot be stopped at the later stages, makes it critical to be diagnosed in its early stages. Although various deep learning models have been applied for detecting glaucoma from digital fundus images, due to th… ▽ More Glaucoma leads to permanent vision disability by damaging the optical nerve that transmits visual images to the brain. The fact that glaucoma does not show any symptoms as it progresses and cannot be stopped at the later stages, makes it critical to be diagnosed in its early stages. Although various deep learning models have been applied for detecting glaucoma from digital fundus images, due to the scarcity of labeled data, their generalization performance was limited along with high computational complexity and special hardware requirements. In this study, compact Self-Organized Operational Neural Networks (Self- ONNs) are proposed for early detection of glaucoma in fundus images and their performance is compared against the conventional (deep) Convolutional Neural Networks (CNNs) over three benchmark datasets: ACRIMA, RIM-ONE, and ESOGU. The experimental results demonstrate that Self-ONNs not only achieve superior detection performance but can also significantly reduce the computational complexity making it a potentially suitable network model for biomedical datasets especially when the data is scarce. △ Less

Submitted 28 September, 2021; originally announced September 2021.

arXiv:2109.01594 [pdf]

Super Neurons

Authors: Serkan Kiranyaz, Junaid Malik, Mehmet Yamac, Mert Duman, Ilke Adalioglu, Esin Guldogan, Turker Ince, Moncef Gabbouj

Abstract: Self-Organized Operational Neural Networks (Self-ONNs) have recently been proposed as new-generation neural network models with nonlinear learning units, i.e., the generative neurons that yield an elegant level of diversity; however, like its predecessor, conventional Convolutional Neural Networks (CNNs), they still have a common drawback: localized (fixed) kernel operations. This severely limits… ▽ More Self-Organized Operational Neural Networks (Self-ONNs) have recently been proposed as new-generation neural network models with nonlinear learning units, i.e., the generative neurons that yield an elegant level of diversity; however, like its predecessor, conventional Convolutional Neural Networks (CNNs), they still have a common drawback: localized (fixed) kernel operations. This severely limits the receptive field and information flow between layers and thus brings the necessity for deep and complex models. It is highly desired to improve the receptive field size without increasing the kernel dimensions. This requires a significant upgrade over the generative neurons to achieve the non-localized kernel operations for each connection between consecutive layers. In this article, we present superior (generative) neuron models (or super neurons in short) that allow random or learnable kernel shifts and thus can increase the receptive field size of each connection. The kernel localization process varies among the two super-neuron models. The first model assumes randomly localized kernels within a range and the second one learns (optimizes) the kernel locations during training. An extensive set of comparative evaluations against conventional and deformable convolutional, along with the generative neurons demonstrates that super neurons can empower Self-ONNs to achieve a superior learning and generalization capability with a minimal computational complexity burden. △ Less

Submitted 15 April, 2023; v1 submitted 3 August, 2021; originally announced September 2021.

Comments: 25 pages, 13 figures

arXiv:2107.09584 [pdf, other]

Active 3D Shape Reconstruction from Vision and Touch

Authors: Edward J. Smith, David Meger, Luis Pineda, Roberto Calandra, Jitendra Malik, Adriana Romero, Michal Drozdzal

Abstract: Humans build 3D understandings of the world through active object exploration, using jointly their senses of vision and touch. However, in 3D shape reconstruction, most recent progress has relied on static datasets of limited sensory data such as RGB images, depth maps or haptic readings, leaving the active exploration of the shape largely unexplored. Inactive touch sensing for 3D reconstruction,… ▽ More Humans build 3D understandings of the world through active object exploration, using jointly their senses of vision and touch. However, in 3D shape reconstruction, most recent progress has relied on static datasets of limited sensory data such as RGB images, depth maps or haptic readings, leaving the active exploration of the shape largely unexplored. Inactive touch sensing for 3D reconstruction, the goal is to actively select the tactile readings that maximize the improvement in shape reconstruction accuracy. However, the development of deep learning-based active touch models is largely limited by the lack of frameworks for shape exploration. In this paper, we focus on this problem and introduce a system composed of: 1) a haptic simulator leveraging high spatial resolution vision-based tactile sensors for active touching of 3D objects; 2)a mesh-based 3D shape reconstruction model that relies on tactile or visuotactile signals; and 3) a set of data-driven solutions with either tactile or visuotactile priors to guide the shape exploration. Our framework enables the development of the first fully data-driven solutions to active touch on top of learned models for object understanding. Our experiments show the benefits of such solutions in the task of 3D shape understanding where our models consistently outperform natural baselines. We provide our framework as a tool to foster future research in this direction. △ Less

Submitted 26 October, 2021; v1 submitted 20 July, 2021; originally announced July 2021.

Journal ref: Published at Neurips 2021

arXiv:2107.04034 [pdf, other]

RMA: Rapid Motor Adaptation for Legged Robots

Authors: Ashish Kumar, Zipeng Fu, Deepak Pathak, Jitendra Malik

Abstract: Successful real-world deployment of legged robots would require them to adapt in real-time to unseen scenarios like changing terrains, changing payloads, wear and tear. This paper presents Rapid Motor Adaptation (RMA) algorithm to solve this problem of real-time online adaptation in quadruped robots. RMA consists of two components: a base policy and an adaptation module. The combination of these c… ▽ More Successful real-world deployment of legged robots would require them to adapt in real-time to unseen scenarios like changing terrains, changing payloads, wear and tear. This paper presents Rapid Motor Adaptation (RMA) algorithm to solve this problem of real-time online adaptation in quadruped robots. RMA consists of two components: a base policy and an adaptation module. The combination of these components enables the robot to adapt to novel situations in fractions of a second. RMA is trained completely in simulation without using any domain knowledge like reference trajectories or predefined foot trajectory generators and is deployed on the A1 robot without any fine-tuning. We train RMA on a varied terrain generator using bioenergetics-inspired rewards and deploy it on a variety of difficult terrains including rocky, slippery, deformable surfaces in environments with grass, long vegetation, concrete, pebbles, stairs, sand, etc. RMA shows state-of-the-art performance across diverse real-world as well as simulation experiments. Video results at https://ashish-kmr.github.io/rma-legged-robots/ △ Less

Submitted 8 July, 2021; originally announced July 2021.

Comments: RSS 2021. Webpage at https://ashish-kmr.github.io/rma-legged-robots/

arXiv:2107.01205 [pdf, other]

HandVoxNet++: 3D Hand Shape and Pose Estimation using Voxel-Based Neural Networks

Authors: Jameel Malik, Soshi Shimada, Ahmed Elhayek, Sk Aziz Ali, Christian Theobalt, Vladislav Golyanik, Didier Stricker

Abstract: 3D hand shape and pose estimation from a single depth map is a new and challenging computer vision problem with many applications. Existing methods addressing it directly regress hand meshes via 2D convolutional neural networks, which leads to artefacts due to perspective distortions in the images. To address the limitations of the existing methods, we develop HandVoxNet++, i.e., a voxel-based dee… ▽ More 3D hand shape and pose estimation from a single depth map is a new and challenging computer vision problem with many applications. Existing methods addressing it directly regress hand meshes via 2D convolutional neural networks, which leads to artefacts due to perspective distortions in the images. To address the limitations of the existing methods, we develop HandVoxNet++, i.e., a voxel-based deep network with 3D and graph convolutions trained in a fully supervised manner. The input to our network is a 3D voxelized-depth-map-based on the truncated signed distance function (TSDF). HandVoxNet++ relies on two hand shape representations. The first one is the 3D voxelized grid of hand shape, which does not preserve the mesh topology and which is the most accurate representation. The second representation is the hand surface that preserves the mesh topology. We combine the advantages of both representations by aligning the hand surface to the voxelized hand shape either with a new neural Graph-Convolutions-based Mesh Registration (GCN-MeshReg) or classical segment-wise Non-Rigid Gravitational Approach (NRGA++) which does not rely on training data. In extensive evaluations on three public benchmarks, i.e., SynHand5M, depth-based HANDS19 challenge and HO-3D, the proposed HandVoxNet++ achieves state-of-the-art performance. In this journal extension of our previous approach presented at CVPR 2020, we gain 41.09% and 13.7% higher shape alignment accuracy on SynHand5M and HANDS19 datasets, respectively. Our method is ranked first on the HANDS19 challenge dataset (Task 1: Depth-Based 3D Hand Pose Estimation) at the moment of the submission of our results to the portal in August 2020. △ Less

Submitted 5 December, 2021; v1 submitted 2 July, 2021; originally announced July 2021.

Comments: 13 pages, 6 tables, 7 figures; project webpage: http://4dqv.mpi-inf.mpg.de/HandVoxNet++/. arXiv admin note: text overlap with arXiv:2004.01588

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021

arXiv:2106.14405 [pdf, other]

Habitat 2.0: Training Home Assistants to Rearrange their Habitat

Authors: Andrew Szot, Alex Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir Vondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, Dhruv Batra

Abstract: We introduce Habitat 2.0 (H2.0), a simulation platform for training virtual robots in interactive 3D environments and complex physics-enabled scenarios. We make comprehensive contributions to all levels of the embodied AI stack - data, simulation, and benchmark tasks. Specifically, we present: (i) ReplicaCAD: an artist-authored, annotated, reconfigurable 3D dataset of apartments (matching real spa… ▽ More We introduce Habitat 2.0 (H2.0), a simulation platform for training virtual robots in interactive 3D environments and complex physics-enabled scenarios. We make comprehensive contributions to all levels of the embodied AI stack - data, simulation, and benchmark tasks. Specifically, we present: (i) ReplicaCAD: an artist-authored, annotated, reconfigurable 3D dataset of apartments (matching real spaces) with articulated objects (e.g. cabinets and drawers that can open/close); (ii) H2.0: a high-performance physics-enabled 3D simulator with speeds exceeding 25,000 simulation steps per second (850x real-time) on an 8-GPU node, representing 100x speed-ups over prior work; and, (iii) Home Assistant Benchmark (HAB): a suite of common tasks for assistive robots (tidy the house, prepare groceries, set the table) that test a range of mobile manipulation capabilities. These large-scale engineering contributions allow us to systematically compare deep reinforcement learning (RL) at scale and classical sense-plan-act (SPA) pipelines in long-horizon structured tasks, with an emphasis on generalization to new objects, receptacles, and layouts. We find that (1) flat RL policies struggle on HAB compared to hierarchical ones; (2) a hierarchy with independent skills suffers from 'hand-off problems', and (3) SPA pipelines are more brittle than RL policies. △ Less

Submitted 1 July, 2022; v1 submitted 28 June, 2021; originally announced June 2021.

arXiv:2105.14926 [pdf, other]

Self-Organized Residual Blocks for Image Super-Resolution

Authors: Onur Keleş, A. Murat Tekalp, Junaid Malik, Serkan Kıranyaz

Abstract: It has become a standard practice to use the convolutional networks (ConvNet) with RELU non-linearity in image restoration and super-resolution (SR). Although the universal approximation theorem states that a multi-layer neural network can approximate any non-linear function with the desired precision, it does not reveal the best network architecture to do so. Recently, operational neural networks… ▽ More It has become a standard practice to use the convolutional networks (ConvNet) with RELU non-linearity in image restoration and super-resolution (SR). Although the universal approximation theorem states that a multi-layer neural network can approximate any non-linear function with the desired precision, it does not reveal the best network architecture to do so. Recently, operational neural networks (ONNs) that choose the best non-linearity from a set of alternatives, and their "self-organized" variants (Self-ONN) that approximate any non-linearity via Taylor series have been proposed to address the well-known limitations and drawbacks of conventional ConvNets such as network homogeneity using only the McCulloch-Pitts neuron model. In this paper, we propose the concept of self-organized operational residual (SOR) blocks, and present hybrid network architectures combining regular residual and SOR blocks to strike a balance between the benefits of stronger non-linearity and the overall number of parameters. The experimental results demonstrate that the~proposed architectures yield performance improvements in both PSNR and perceptual metrics. △ Less

Submitted 31 May, 2021; originally announced May 2021.

Comments: Accepted for publication in IEEE International Conference on Image Processing (ICIP) 2021

arXiv:2105.12651 [pdf, other]

The Usability and Trustworthiness of Medical Eye Images

Authors: Daniel Diethei, Ashley Colley, Lisa Dannenberg, Muhammad Fawad Jawaid Malik, Johannes Schöning

Abstract: The majority of blindness is preventable, and is located in develo** countries. While mHealth applications for retinal imaging in combination with affordable smartphone lens adaptors are a step towards better eye care access, the expert knowledge and additional hardware needed are often unavailable in develo** countries. Eye screening apps without lens adaptors exist, but we do not know much a… ▽ More The majority of blindness is preventable, and is located in develo** countries. While mHealth applications for retinal imaging in combination with affordable smartphone lens adaptors are a step towards better eye care access, the expert knowledge and additional hardware needed are often unavailable in develo** countries. Eye screening apps without lens adaptors exist, but we do not know much about the experience of guiding users to take medical eye images. Additionally, when an AI based diagnosis is provided, trust plays an important role in ensuring in the adoption. This work addresses factors that impact the usability and trustworthiness dimensions of mHealth applications. We present the design, development and evaluation of EyeGuide, a mobile app that assists users in taking medical eye images using only their smartphone camera. In a study (n=28) we observed that users of an interactive tutorial captured images faster compared to audible tone based guidance. In a second study (n=40) we found out that providing disease-specific background information was the most effective factor to increase trustworthiness in the AI based diagnosis. Application areas of EyeGuide are AI based disease detection and telemedicine examinations. △ Less

Submitted 26 May, 2021; originally announced May 2021.

Journal ref: 2021 IEEE International Conference on Healthcare Informatics (ICHI)

arXiv:2105.12107 [pdf, other]

Self-Organized Variational Autoencoders (Self-VAE) for Learned Image Compression

Authors: M. Akın Yılmaz, Onur Keleş, Hilal Güven, A. Murat Tekalp, Junaid Malik, Serkan Kıranyaz

Abstract: In end-to-end optimized learned image compression, it is standard practice to use a convolutional variational autoencoder with generalized divisive normalization (GDN) to transform images into a latent space. Recently, Operational Neural Networks (ONNs) that learn the best non-linearity from a set of alternatives, and their self-organized variants, Self-ONNs, that approximate any non-linearity via… ▽ More In end-to-end optimized learned image compression, it is standard practice to use a convolutional variational autoencoder with generalized divisive normalization (GDN) to transform images into a latent space. Recently, Operational Neural Networks (ONNs) that learn the best non-linearity from a set of alternatives, and their self-organized variants, Self-ONNs, that approximate any non-linearity via Taylor series have been proposed to address the limitations of convolutional layers and a fixed nonlinear activation. In this paper, we propose to replace the convolutional and GDN layers in the variational autoencoder with self-organized operational layers, and propose a novel self-organized variational autoencoder (Self-VAE) architecture that benefits from stronger non-linearity. The experimental results demonstrate that the proposed Self-VAE yields improvements in both rate-distortion performance and perceptual image quality. △ Less

Submitted 28 May, 2021; v1 submitted 25 May, 2021; originally announced May 2021.

Comments: Accepted for publication in IEEE International Conference on Image Processing (ICIP) 2021

arXiv:2104.11227 [pdf, other]

Multiscale Vision Transformers

Authors: Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer

Abstract: We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution.… ▽ More We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. Code is available at: https://github.com/facebookresearch/SlowFast △ Less

Submitted 22 April, 2021; originally announced April 2021.

Comments: Technical report

arXiv:2104.02406 [pdf]

doi 10.1002/mrm.28984

Magnetization Transfer-Mediated MR Fingerprinting

Authors: Daniel J. West, Gastao Cruz, Rui P. A. G. Teixeira, Torben Schneider, Jacques-Donald Tournier, Joseph V. Hajnal, Claudia Prieto, Shaihan J. Malik

Abstract: Purpose: Magnetization transfer (MT) and inhomogeneous MT (ihMT) contrasts are used in MRI to provide information about macromolecular tissue content. In particular, MT is sensitive to macromolecules and ihMT appears to be specific to myelinated tissue. This study proposes a technique to characterize MT and ihMT properties from a single acquisition, producing both semiquantitative contrast ratios,… ▽ More Purpose: Magnetization transfer (MT) and inhomogeneous MT (ihMT) contrasts are used in MRI to provide information about macromolecular tissue content. In particular, MT is sensitive to macromolecules and ihMT appears to be specific to myelinated tissue. This study proposes a technique to characterize MT and ihMT properties from a single acquisition, producing both semiquantitative contrast ratios, and quantitative parameter maps. Theory and Methods: Building upon previous work that uses multiband radiofrequency (RF) pulses to efficiently generate ihMT contrast, we propose a cyclic-steady-state approach that cycles between multiband and single-band pulses to boost the achieved contrast. Resultant time-variable signals are reminiscent of a magnetic resonance fingerprinting (MRF) acquisition, except that the signal fluctuations are entirely mediated by magnetization transfer effects. A dictionary-based low-rank inversion method is used to reconstruct the resulting images and to produce both semiquantitative MT ratio (MTR) and ihMT ratio (ihMTR) maps, as well as quantitative parameter estimates corresponding to an ihMT tissue model. Results: Phantom and in vivo brain data acquired at 1.5T demonstrate the expected contrast trends, with ihMTR maps showing contrast more specific to white matter (WM), as has been reported by others. Quantitative estimation of semisolid fraction and dipolar T1 was also possible and yielded measurements consistent with literature values in the brain. Conclusions: By cycling between multiband and single-band pulses, an entirely magnetization transfer mediated 'fingerprinting' method was demonstrated. This proof-of-concept approach can be used to generate semiquantitative maps and quantitatively estimate some macromolecular specific tissue parameters. △ Less

Submitted 23 August, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

Comments: 34 Pages and 15 Figures (Including Supporting Information), Submitted to Magnetic Resonance in Medicine (MRM). Updated to include link to final published article

arXiv:2103.14580 [pdf, other]

Correcting Automated and Manual Speech Transcription Errors using Warped Language Models

Authors: Mahdi Namazifar, John Malik, Li Erran Li, Gokhan Tur, Dilek Hakkani Tür

Abstract: Masked language models have revolutionized natural language processing systems in the past few years. A recently introduced generalization of masked language models called warped language models are trained to be more robust to the types of errors that appear in automatic or manual transcriptions of spoken language by exposing the language model to the same types of errors during training. In this… ▽ More Masked language models have revolutionized natural language processing systems in the past few years. A recently introduced generalization of masked language models called warped language models are trained to be more robust to the types of errors that appear in automatic or manual transcriptions of spoken language by exposing the language model to the same types of errors during training. In this work we propose a novel approach that takes advantage of the robustness of warped language models to transcription noise for correcting transcriptions of spoken language. We show that our proposed approach is able to achieve up to 10% reduction in word error rates of both automatic and manual transcriptions of spoken language. △ Less

Submitted 26 March, 2021; originally announced March 2021.

Comments: Submitted to INTERSPEECH

arXiv:2103.03070 [pdf]

Convolutional versus Self-Organized Operational Neural Networks for Real-World Blind Image Denoising

Authors: Junaid Malik, Serkan Kiranyaz, Mehmet Yamac, Esin Guldogan, Moncef Gabbouj

Abstract: Real-world blind denoising poses a unique image restoration challenge due to the non-deterministic nature of the underlying noise distribution. Prevalent discriminative networks trained on synthetic noise models have been shown to generalize poorly to real-world noisy images. While curating real-world noisy images and improving ground truth estimation procedures remain key points of interest, a po… ▽ More Real-world blind denoising poses a unique image restoration challenge due to the non-deterministic nature of the underlying noise distribution. Prevalent discriminative networks trained on synthetic noise models have been shown to generalize poorly to real-world noisy images. While curating real-world noisy images and improving ground truth estimation procedures remain key points of interest, a potential research direction is to explore extensions to the widely used convolutional neuron model to enable better generalization with fewer data and lower network complexity, as opposed to simply using deeper Convolutional Neural Networks (CNNs). Operational Neural Networks (ONNs) and their recent variant, Self-organized ONNs (Self-ONNs), propose to embed enhanced non-linearity into the neuron model and have been shown to outperform CNNs across a variety of regression tasks. However, all such comparisons have been made for compact networks and the efficacy of deploying operational layers as a drop-in replacement for convolutional layers in contemporary deep architectures remains to be seen. In this work, we tackle the real-world blind image denoising problem by employing, for the first time, a deep Self-ONN. Extensive quantitative and qualitative evaluations spanning multiple metrics and four high-resolution real-world noisy image datasets against the state-of-the-art deep CNN network, DnCNN, reveal that deep Self-ONNs consistently achieve superior results with performance gains of up to 1.76dB in PSNR. Furthermore, Self-ONNs with half and even quarter the number of layers that require only a fraction of computational resources as that of DnCNN can still achieve similar or better results compared to the state-of-the-art. △ Less

Submitted 5 May, 2021; v1 submitted 4 March, 2021; originally announced March 2021.

Comments: Submitted for review in IEEE TIP

arXiv:2103.03060 [pdf]

BM3D vs 2-Layer ONN

Authors: Junaid Malik, Serkan Kiranyaz, Mehmet Yamac, Moncef Gabbouj

Abstract: Despite their recent success on image denoising, the need for deep and complex architectures still hinders the practical usage of CNNs. Older but computationally more efficient methods such as BM3D remain a popular choice, especially in resource-constrained scenarios. In this study, we aim to find out whether compact neural networks can learn to produce competitive results as compared to BM3D for… ▽ More Despite their recent success on image denoising, the need for deep and complex architectures still hinders the practical usage of CNNs. Older but computationally more efficient methods such as BM3D remain a popular choice, especially in resource-constrained scenarios. In this study, we aim to find out whether compact neural networks can learn to produce competitive results as compared to BM3D for AWGN image denoising. To this end, we configure networks with only two hidden layers and employ different neuron models and layer widths for comparing the performance with BM3D across different AWGN noise levels. Our results conclusively show that the recently proposed self-organized variant of operational neural networks based on a generative neuron model (Self-ONNs) is not only a better choice as compared to CNNs, but also provide competitive results as compared to BM3D and even significantly surpass it for high noise levels. △ Less

Submitted 4 March, 2021; originally announced March 2021.

Comments: Submitted for review in ICIP 2021

arXiv:2101.02703 [pdf, other]

Distribution-Free, Risk-Controlling Prediction Sets

Authors: Stephen Bates, Anastasios Angelopoulos, Lihua Lei, Jitendra Malik, Michael I. Jordan

Abstract: While improving prediction accuracy has been the focus of machine learning in recent years, this alone does not suffice for reliable decision-making. Deploying learning systems in consequential settings also requires calibrating and communicating the uncertainty of predictions. To convey instance-wise uncertainty for prediction tasks, we show how to generate set-valued predictions from a black-box… ▽ More While improving prediction accuracy has been the focus of machine learning in recent years, this alone does not suffice for reliable decision-making. Deploying learning systems in consequential settings also requires calibrating and communicating the uncertainty of predictions. To convey instance-wise uncertainty for prediction tasks, we show how to generate set-valued predictions from a black-box predictor that control the expected loss on future test points at a user-specified level. Our approach provides explicit finite-sample guarantees for any dataset by using a holdout set to calibrate the size of the prediction sets. This framework enables simple, distribution-free, rigorous error control for many tasks, and we demonstrate it in five large-scale machine learning problems: (1) classification problems where some mistakes are more costly than others; (2) multi-label classification, where each observation has multiple associated labels; (3) classification problems where the labels have a hierarchical structure; (4) image segmentation, where we wish to predict a set of pixels containing an object of interest; and (5) protein structure prediction. Lastly, we discuss extensions to uncertainty quantification for ranking, metric learning and distributionally robust learning. △ Less

Submitted 4 August, 2021; v1 submitted 7 January, 2021; originally announced January 2021.

Comments: Project website available at http://www.angelopoulos.ai/blog/posts/rcps/ and codebase available at https://github.com/aangelopoulos/rcps

arXiv:2012.09856 [pdf, other]

Reconstructing Hand-Object Interactions in the Wild

Authors: Zhe Cao, Ilija Radosavovic, Angjoo Kanazawa, Jitendra Malik

Abstract: In this work we explore reconstructing hand-object interactions in the wild. The core challenge of this problem is the lack of appropriate 3D labeled data. To overcome this issue, we propose an optimization-based procedure which does not require direct 3D supervision. The general strategy we adopt is to exploit all available related data (2D bounding boxes, 2D hand keypoints, 2D instance masks, 3D… ▽ More In this work we explore reconstructing hand-object interactions in the wild. The core challenge of this problem is the lack of appropriate 3D labeled data. To overcome this issue, we propose an optimization-based procedure which does not require direct 3D supervision. The general strategy we adopt is to exploit all available related data (2D bounding boxes, 2D hand keypoints, 2D instance masks, 3D object models, 3D in-the-lab MoCap) to provide constraints for the 3D reconstruction. Rather than optimizing the hand and object individually, we optimize them jointly which allows us to impose additional constraints based on hand-object contact, collision, and occlusion. Our method produces compelling reconstructions on the challenging in-the-wild data from the EPIC Kitchens and the 100 Days of Hands datasets, across a range of object categories. Quantitatively, we demonstrate that our approach compares favorably to existing approaches in the lab settings where ground truth 3D annotations are available. △ Less

Submitted 30 December, 2021; v1 submitted 17 December, 2020; originally announced December 2020.

Comments: Project page: https://people.eecs.berkeley.edu/~zhecao/rhoi/

arXiv:2012.09843 [pdf, other]

Human Mesh Recovery from Multiple Shots

Authors: Georgios Pavlakos, Jitendra Malik, Angjoo Kanazawa

Abstract: Videos from edited media like movies are a useful, yet under-explored source of information. The rich variety of appearance and interactions between humans depicted over a large temporal context in these films could be a valuable source of data. However, the richness of data comes at the expense of fundamental challenges such as abrupt shot changes and close up shots of actors with heavy truncatio… ▽ More Videos from edited media like movies are a useful, yet under-explored source of information. The rich variety of appearance and interactions between humans depicted over a large temporal context in these films could be a valuable source of data. However, the richness of data comes at the expense of fundamental challenges such as abrupt shot changes and close up shots of actors with heavy truncation, which limits the applicability of existing human 3D understanding methods. In this paper, we address these limitations with an insight that while shot changes of the same scene incur a discontinuity between frames, the 3D structure of the scene still changes smoothly. This allows us to handle frames before and after the shot change as multi-view signal that provide strong cues to recover the 3D state of the actors. We propose a multi-shot optimization framework, which leads to improved 3D reconstruction and mining of long sequences with pseudo ground truth 3D human mesh. We show that the resulting data is beneficial in the training of various human mesh recovery models: for single image, we achieve improved robustness; for video we propose a pure transformer-based temporal encoder, which can naturally handle missing observations due to shot changes in the input frames. We demonstrate the importance of the insight and proposed models through extensive experiments. The tools we develop open the door to processing and analyzing in 3D content from a large library of edited media, which could be helpful for many downstream applications. Project page: https://geopavlakos.github.io/multishot △ Less

Submitted 17 December, 2020; originally announced December 2020.

Showing 51–100 of 233 results for author: Malik, J