Showing 1–41 of 41 results for author: Russell, B

  1. arXiv:2405.03190

    cs.CV

    Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval

    Authors: Jiacheng Cheng, Hijung Valentina Shin, Nuno Vasconcelos, Bryan Russell, Fabian Caba Heilbron

    Abstract: In recent years, dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually produce very different retrievals for a pair of paraphrased queries. Such behavior might render the retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased te…

    Submitted 6 May, 2024; originally announced May 2024.

  2. arXiv:2404.04346

    cs.CV

    Koala: Key frame-conditioned long video-LLM

    Authors: Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A. Plummer, Bryan Russell, Kate Saenko

    Abstract: Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Models (vLLMs) hold promise as a viable solution due to their demonstrated emergent capabilities on new tasks. However, despite being trained on millions of short seconds-long videos, vLLMs are unable to unde…

    Submitted 3 May, 2024; v1 submitted 5 April, 2024; originally announced April 2024.

    Comments: Accepted at CVPR 2024 as a poster highlight

  3. arXiv:2312.04966

    cs.CV

    Customizing Motion in Text-to-Video Diffusion Models

    Authors: Joanna Materzynska, Josef Sivic, Eli Shechtman, Antonio Torralba, Richard Zhang, Bryan Russell

    Abstract: We introduce an approach for augmenting text-to-video generation models with customized motions, extending their capabilities beyond the motions depicted in the original training data. By leveraging a few video samples demonstrating specific movements as input, our method learns and generalizes the input motion patterns for diverse, text-specified scenarios. Our contributions are threefold. First,…

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: Project page: https://joaanna.github.io/customizing_motion/

  4. arXiv:2312.02985

    cs.CV

    FocalPose++: Focal Length and Object Pose Estimation via Render and Compare

    Authors: Martin Cífka, Georgy Ponimatkin, Yann Labbé, Bryan Russell, Mathieu Aubry, Vladimir Petrik, Josef Sivic

    Abstract: We introduce FocalPose++, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length given a single RGB input image depicting a known object. The contributions of this work are threefold. First, we derive a focal length update rule that extends an existing state-of-the-art render-and-compare 6D pose estimator to address the joint estimation task. Se…

    Submitted 15 November, 2023; originally announced December 2023.

    Comments: 21 pages, 18 figures. arXiv admin note: substantial text overlap with arXiv:2204.05145

  5. arXiv:2306.10169

    cs.CV cs.CL cs.LG

    Meta-Personalizing Vision-Language Models to Find Named Instances in Video

    Authors: Chun-Hsiao Yeh, Bryan Russell, Josef Sivic, Fabian Caba Heilbron, Simon Jenni

    Abstract: Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications. While these models allow category-level queries, they currently struggle with personalized searches for moments in a video where a specific object instance such as "My dog Biscuit" appears. We present the following three contributions to address this problem. First, we describe a metho…

    Submitted 16 June, 2023; originally announced June 2023.

    Comments: Accepted to CVPR 2023. Project webpage: https://danielchyeh.github.io/metaper/

  6. arXiv:2306.09327

    cs.CV

    Language-Guided Music Recommendation for Video via Prompt Analogies

    Authors: Daniel McKee, Justin Salamon, Josef Sivic, Bryan Russell

    Abstract: We propose a method to recommend music for an input video while allowing a user to guide music selection with free-form natural language. A key challenge of this problem setting is that existing music video datasets provide the needed (video, music) training pairs, but lack text descriptions of the music. This work addresses this challenge with the following three contributions. First, we propose…

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: CVPR 2023 (Highlight paper). Project page: https://www.danielbmckee.com/language-guided-music-for-video

  7. arXiv:2304.08490

    cs.CV cs.MM cs.SD eess.AS

    Conditional Generation of Audio from Video via Foley Analogies

    Authors: Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens

    Abstract: The sound effects that designers add to videos are designed to convey a particular artistic effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges of creating a soundtrack for a video that differs from its true sound, but that nonetheless matches the actions occurring on screen, we propose the problem of conditional Foley. We present the following contributi…

    Submitted 17 April, 2023; originally announced April 2023.

    Comments: CVPR 2023

  8. arXiv:2303.16342

    cs.CV cs.AI cs.CL

    Language-Guided Audio-Visual Source Separation via Trimodal Consistency

    Authors: Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko

    Abstract: We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data. A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform, all without access to…

    Submitted 23 September, 2023; v1 submitted 28 March, 2023; originally announced March 2023.

    Comments: Accepted at CVPR 2023

  9. arXiv:2212.14693

    cs.CY cs.AI cs.LG

    A Learned Simulation Environment to Model Student Engagement and Retention in Automated Online Courses

    Authors: N. Imstepf, S. Senn, A. Fortin, B. Russell, C. Horn

    Abstract: We developed a simulator to quantify the effect of exercise ordering on both student engagement and retention. Our approach combines the construction of neural network representations for users and exercises using a dynamic matrix factorization method. We further created machine learning models of success and dropout prediction. As a result, our system is able to predict student engagement and r…

    Submitted 22 December, 2022; originally announced December 2022.

    Comments: 6 pages, 3 figures, 1 table

    ACM Class: J.4

  10. arXiv:2210.13445

    cs.CV

    Monocular Dynamic View Synthesis: A Reality Check

    Authors: Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, Angjoo Kanazawa

    Abstract: We study the recent progress on dynamic view synthesis (DVS) from monocular video. Though existing approaches have demonstrated impressive results, we show a discrepancy between the practical capture process and the existing experimental protocols, which effectively leaks in multi-view signals during training. We define effective multi-view factors (EMFs) to quantify the amount of multi-view signa…

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: NeurIPS 2022. Project page: https://hangg7.com/dycheck. Code: https://github.com/KAIR-BAIR/dycheck

  11. arXiv:2206.07148

    cs.MM cs.CV

    It's Time for Artistic Correspondence in Music and Video

    Authors: Didac Suris, Carl Vondrick, Bryan Russell, Justin Salamon

    Abstract: We present an approach for recommending a music track for a given video, and vice versa, based on both their temporal alignment and their correspondence at an artistic level. We propose a self-supervised approach that learns this correspondence directly from data, without any need of human annotations. In order to capture the high-level concepts that are required to solve the task, we propose mode…

    Submitted 14 June, 2022; originally announced June 2022.

    Comments: CVPR 2022

  12. arXiv:2205.14929

    cs.CV cs.AI cs.GR cs.LG

    Neural Volumetric Object Selection

    Authors: Zhongzheng Ren, Aseem Agarwala, Bryan Russell, Alexander G. Schwing, Oliver Wang

    Abstract: We introduce an approach for selecting objects in neural volumetric 3D representations, such as multi-plane images (MPI) and neural radiance fields (NeRF). Our approach takes a set of foreground and background 2D user scribbles in one view and automatically estimates a 3D segmentation of the desired object, which can be rendered into novel views. To achieve this result, we propose a novel voxel fe…

    Submitted 30 May, 2022; originally announced May 2022.

    Comments: CVPR 2022 camera ready

  13. arXiv:2204.05145

    cs.CV

    Focal Length and Object Pose Estimation via Render and Compare

    Authors: Georgy Ponimatkin, Yann Labbé, Bryan Russell, Mathieu Aubry, Josef Sivic

    Abstract: We introduce FocalPose, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length given a single RGB input image depicting a known object. The contributions of this work are twofold. First, we derive a focal length update rule that extends an existing state-of-the-art render-and-compare 6D pose estimator to address the joint estimation task. Second…

    Submitted 11 April, 2022; originally announced April 2022.

    Comments: Accepted to CVPR 2022. Code available at http://github.com/ponimatkin/focalpose

  14. arXiv:2111.06934

    cs.CV cs.LG

    Contrastive Feature Loss for Image Prediction

    Authors: Alex Andonian, Taesung Park, Bryan Russell, Phillip Isola, Jun-Yan Zhu, Richard Zhang

    Abstract: Training supervised image synthesis models requires a critic to compare two images: the ground truth to the result. Yet, this basic functionality remains an open problem. A popular line of approaches uses the L1 (mean absolute error) loss, either in the pixel or the feature space of pretrained deep networks. However, we observe that these losses tend to produce overly blurry and grey images, and o…

    Submitted 12 November, 2021; originally announced November 2021.

    Comments: Appeared in Advances in Image Manipulation Workshop at ICCV 2021. GitHub: https://github.com/alexandonian/contrastive-feature-loss

  15. arXiv:2110.10596

    cs.CV cs.LG

    Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

    Authors: Reuben Tan, Bryan A. Plummer, Kate Saenko, Hailin Jin, Bryan Russell

    Abstract: We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. To achieve this goal, we propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training.…

    Submitted 2 December, 2021; v1 submitted 20 October, 2021; originally announced October 2021.

    Comments: Accepted at NeurIPS 2021 (Spotlight)

  16. arXiv:2110.03562

    cs.CV

    Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions

    Authors: Shuang Li, Yilun Du, Antonio Torralba, Josef Sivic, Bryan Russell

    Abstract: We introduce the task of weakly supervised learning for detecting human and object interactions in videos. Our task poses unique challenges as a system does not know what types of human-object interactions are present in a video or the actual spatiotemporal location of the human and the object. To address these challenges, we introduce a contrastive weakly supervised training loss that aims to joi…

    Submitted 7 October, 2021; originally announced October 2021.

  17. arXiv:2105.06466

    cs.CV cs.GR cs.LG

    Editing Conditional Radiance Fields

    Authors: Steven Liu, Xiuming Zhang, Zhoutong Zhang, Richard Zhang, Jun-Yan Zhu, Bryan Russell

    Abstract: A neural radiance field (NeRF) is a scene model supporting high-quality view synthesis, optimized per scene. In this paper, we explore enabling user editing of a category-level NeRF - also known as a conditional radiance field - trained on a shape category. Specifically, we introduce a method for propagating coarse 2D user scribbles to the 3D space, to modify the color or shape of a local region.…

    Submitted 4 June, 2021; v1 submitted 13 May, 2021; originally announced May 2021.

    Comments: Code: https://github.com/stevliu/editnerf Website: http://editnerf.csail.mit.edu/, v2 updated figure 8 and included additional details

  18. arXiv:2007.11678

    cs.CV

    Contact and Human Dynamics from Monocular Video

    Authors: Davis Rempe, Leonidas J. Guibas, Aaron Hertzmann, Bryan Russell, Ruben Villegas, Jimei Yang

    Abstract: Existing deep models predict 2D and 3D kinematic poses from video that are approximately accurate, but contain visible errors that violate physical constraints, such as feet penetrating the ground and bodies leaning at extreme angles. In this paper, we present a physics-based method for inferring 3D human motion from video sequences that takes initial 2D and 3D pose estimates as input. We first es…

    Submitted 24 July, 2020; v1 submitted 22 July, 2020; originally announced July 2020.

    Comments: ECCV 2020

  19. arXiv:2006.06175

    cs.CV cs.SD eess.AS

    Telling Left from Right: Learning Spatial Correspondence of Sight and Sound

    Authors: Karren Yang, Bryan Russell, Justin Salamon

    Abstract: Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs. Existing approaches have focused primarily on matching semantic information between the sensory streams. We propose a novel self-supervised task to leverage an orthogonal principle: matching spatial information in the audio stream to the positions of…

    Submitted 11 June, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

    Comments: CVPR 2020

    MSC Class: 68T45 ACM Class: I.4.0

  20. arXiv:1909.04349

    cs.CV cs.LG cs.RO

    FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape from Single RGB Images

    Authors: Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan Russell, Max Argus, Thomas Brox

    Abstract: Estimating 3D hand pose from single RGB images is a highly ambiguous problem that relies on an unbiased training dataset. In this paper, we analyze cross-dataset generalization when training on existing datasets. We find that approaches perform well on the datasets they are trained on, but do not generalize to other datasets or in-the-wild scenarios. As a consequence, we introduce the first large-…

    Submitted 13 September, 2019; v1 submitted 10 September, 2019; originally announced September 2019.

    Comments: Accepted to ICCV 2019, Project page: https://lmb.informatik.uni-freiburg.de/projects/freihand/

  21. arXiv:1908.06217

    cs.CV

    Neural Re-Simulation for Generating Bounces in Single Images

    Authors: Carlo Innamorati, Bryan Russell, Danny M. Kaufman, Niloy J. Mitra

    Abstract: We introduce a method to generate videos of dynamic virtual objects plausibly interacting via collisions with a still image's environment. Given a starting trajectory, physically simulated with the estimated geometry of a single, static input image, we learn to 'correct' this trajectory to a visually plausible one via a neural network. The neural network can then be seen as learning to 'correct' t…

    Submitted 24 August, 2019; v1 submitted 16 August, 2019; originally announced August 2019.

    Comments: Accepted to ICCV 2019

    MSC Class: 68T45 ACM Class: I.4.9

  22. arXiv:1908.04725

    cs.CV cs.AI

    Learning elementary structures for 3D shape generation and matching

    Authors: Theo Deprelle, Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, Mathieu Aubry

    Abstract: We propose to represent shapes as the deformation and combination of learnable elementary 3D structures, which are primitives resulting from training over a collection of shapes. We demonstrate that the learned elementary 3D structures lead to clear improvements in 3D shape generation and matching. More precisely, we present two complementary approaches for learning elementary structures: (i) patch…

    Submitted 14 August, 2019; v1 submitted 13 August, 2019; originally announced August 2019.

  23. arXiv:1907.12763

    cs.CV cs.CL

    Finding Moments in Video Collections Using Natural Language

    Authors: Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, Bryan Russell

    Abstract: We introduce the task of retrieving relevant video moments from a large corpus of untrimmed, unsegmented videos given a natural language query. Our task poses unique challenges as a system must efficiently identify both the relevant videos and localize the relevant moments in the videos. To address these challenges, we propose SpatioTemporal Alignment with Language (STAL), a model that represents…

    Submitted 23 February, 2022; v1 submitted 30 July, 2019; originally announced July 2019.

  24. arXiv:1907.03165

    cs.CV

    Unsupervised cycle-consistent deformation for shape matching

    Authors: Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, Mathieu Aubry

    Abstract: We propose a self-supervised approach to deep surface deformation. Given a pair of shapes, our algorithm directly predicts a parametric transformation from one shape to the other respecting correspondences. Our insight is to use cycle-consistency to define a notion of good correspondences in groups of objects and use it as a supervisory signal to train our network. Our method does not rely on a te…

    Submitted 6 July, 2019; originally announced July 2019.

  25. arXiv:1904.06827

    cs.CV

    Bounce and Learn: Modeling Scene Dynamics with Real-World Bounces

    Authors: Senthil Purushwalkam, Abhinav Gupta, Danny M. Kaufman, Bryan Russell

    Abstract: We introduce an approach to model surface properties governing bounces in everyday scenes. Our model learns end-to-end, starting from sensor inputs, to predict post-bounce trajectories and infer two underlying physical properties that govern bouncing - restitution and effective collision normals. Our model, Bounce and Learn, comprises two modules -- a Physics Inference Module (PIM) and a Visual In…

    Submitted 14 April, 2019; originally announced April 2019.

    Comments: Accepted for publication at the International Conference on Learning Representations (ICLR) 2019

  26. arXiv:1903.08642

    cs.CV cs.GR

    Photometric Mesh Optimization for Video-Aligned 3D Object Reconstruction

    Authors: Chen-Hsuan Lin, Oliver Wang, Bryan C. Russell, Eli Shechtman, Vladimir G. Kim, Matthew Fisher, Simon Lucey

    Abstract: In this paper, we address the problem of 3D object mesh reconstruction from RGB videos. Our approach combines the best of multi-view geometric and data-driven methods for 3D reconstruction by optimizing object meshes for multi-view photometric consistency while constraining mesh deformations with a shape prior. We pose this as a piecewise image alignment problem for each mesh face projection. Our…

    Submitted 20 March, 2019; originally announced March 2019.

    Comments: Accepted to CVPR 2019 (project page & code: https://chenhsuanlin.bitbucket.io/photometric-mesh-optim/)

  27. arXiv:1902.11216

    cs.HC cs.GR cs.LG

    B-Script: Transcript-based B-roll Video Editing with Recommendations

    Authors: Bernd Huber, Hijung Valentina Shin, Bryan Russell, Oliver Wang, Gautham J. Mysore

    Abstract: In video production, inserting B-roll is a widely used technique to enrich the story and make a video more engaging. However, determining the right content and positions of B-roll and actually inserting it within the main footage can be challenging, and novice producers often struggle to get both timing and content right. We present B-Script, a system that supports B-roll video editing via interac…

    Submitted 28 February, 2019; originally announced February 2019.

    Comments: 11 pages, 10 figures, CHI 2019

    ACM Class: H.5.2

  28. arXiv:1809.01337

    cs.CV cs.CL

    Localizing Moments in Video with Temporal Language

    Authors: Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell

    Abstract: Localizing moments in a longer video via natural language queries is a new, challenging task at the intersection of language and video understanding. Though moment localization with natural language is similar to other language and vision tasks like natural language object retrieval in images, moment localization offers an interesting opportunity to model temporal dependencies and reasoning in tex…

    Submitted 5 September, 2018; originally announced September 2018.

    Comments: EMNLP 2018

  29. arXiv:1808.07269

    hep-ex cs.CV physics.data-an physics.ins-det

    A Deep Neural Network for Pixel-Level Electromagnetic Particle Identification in the MicroBooNE Liquid Argon Time Projection Chamber

    Authors: MicroBooNE collaboration, C. Adams, M. Alrashed, R. An, J. Anthony, J. Asaadi, A. Ashkenazi, M. Auger, S. Balasubramanian, B. Baller, C. Barnes, G. Barr, M. Bass, F. Bay, A. Bhat, K. Bhattacharya, M. Bishai, A. Blake, T. Bolton, L. Camilleri, D. Caratelli, I. Caro Terrazas, R. Carr, R. Castillo Fernandez, F. Cavanna, et al. (148 additional authors not shown)

    Abstract: We have developed a convolutional neural network (CNN) that can make a pixel-level prediction of objects in image data recorded by a liquid argon time projection chamber (LArTPC) for the first time. We describe the network design, training techniques, and software tools developed to train this network. The goal of this work is to develop a complete deep neural network based data reconstruction cha…

    Submitted 22 August, 2018; originally announced August 2018.

    Journal ref: Phys. Rev. D 99, 092001 (2019)

  30. arXiv:1806.05228

    cs.CV

    3D-CODED : 3D Correspondences by Deep Deformation

    Authors: Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, Mathieu Aubry

    Abstract: We present a new deep learning approach for matching deformable shapes by introducing Shape Deformation Networks which jointly encode 3D shapes and correspondences. This is achieved by factoring the surface representation into (i) a template, that parameterizes the surface, and (ii) a learnt global feature vector that parameterizes the transformation of the template into the input surface. B…

    Submitted 27 July, 2018; v1 submitted 13 June, 2018; originally announced June 2018.

  31. arXiv:1804.04875

    cs.CV

    BodyNet: Volumetric Inference of 3D Human Body Shapes

    Authors: Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, Cordelia Schmid

    Abstract: Human shape estimation is an important task for video editing, animation and fashion industry. Predicting 3D human body shape from natural images, however, is highly challenging due to factors such as variation in human bodies, clothing and viewpoint. Prior methods addressing this problem typically attempt to fit parametric body models with certain priors on pose and shape. In this work we argue f…

    Submitted 18 August, 2018; v1 submitted 13 April, 2018; originally announced April 2018.

    Comments: Appears in: European Conference on Computer Vision 2018 (ECCV 2018). 27 pages

  32. arXiv:1802.05384

    cs.CV

    AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation

    Authors: Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, Mathieu Aubry

    Abstract: We introduce a method for learning to generate the surface of 3D shapes. Our approach represents a 3D shape as a collection of parametric surface elements and, in contrast to methods generating voxel grids or point clouds, naturally infers a surface representation of the shape. Beyond its novelty, our new shape generation framework, AtlasNet, comes with significant advantages, such as improved pre…

    Submitted 20 July, 2018; v1 submitted 14 February, 2018; originally announced February 2018.

  33. Learning Visual Importance for Graphic Designs and Data Visualizations

    Authors: Zoya Bylinskii, Nam Wook Kim, Peter O'Donovan, Sami Alsheikh, Spandan Madan, Hanspeter Pfister, Fredo Durand, Bryan Russell, Aaron Hertzmann

    Abstract: Knowing where people look and click on visual designs can provide clues about how the designs are perceived, and where the most important or relevant content lies. The most important content of a visual design can be used for effective summarization or to facilitate retrieval from a database. We present automated models that predict the relative importance of different elements in data visualizati…

    Submitted 8 August, 2017; originally announced August 2017.

    ACM Class: H.5.1

    Journal ref: UIST 2017

  34. arXiv:1708.01641

    cs.CV

    Localizing Moments in Video with Natural Language

    Authors: Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell

    Abstract: We consider retrieving a specific temporal segment, or moment, from a video given a natural language text description. Methods designed to retrieve whole video clips with natural language determine what occurs in a video but not when. To address this issue, we propose the Moment Context Network (MCN) which effectively localizes natural language queries in videos by integrating local and global vid…

    Submitted 4 August, 2017; originally announced August 2017.

    Comments: ICCV 2017

  35. arXiv:1704.02895

    cs.CV

    ActionVLAD: Learning spatio-temporal aggregation for action classification

    Authors: Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell

    Abstract: In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks with learnable spatio-temporal feature aggregation. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different…

    Submitted 10 April, 2017; originally announced April 2017.

    Comments: Accepted to CVPR 2017. Project page: https://rohitgirdhar.github.io/ActionVLAD/

  36. arXiv:1702.06506

    cs.CV cs.LG cs.RO

    PixelNet: Representation of the pixels, by the pixels, and for the pixels

    Authors: Aayush Bansal, Xinlei Chen, Bryan Russell, Abhinav Gupta, Deva Ramanan

    Abstract: We explore design principles for general pixel-level prediction problems, from low-level edge detection to mid-level surface normal estimation to high-level semantic segmentation. Convolutional predictors, such as the fully-convolutional network (FCN), have achieved remarkable success by exploiting the spatial redundancy of neighboring pixels through convolutional processing. Though computationall…

    Submitted 21 February, 2017; originally announced February 2017.

    Comments: Project Page: http://www.cs.cmu.edu/~aayushb/pixelNet/. arXiv admin note: substantial text overlap with arXiv:1609.06694

  37. arXiv:1609.06694

    cs.CV cs.LG

    PixelNet: Towards a General Pixel-level Architecture

    Authors: Aayush Bansal, Xinlei Chen, Bryan Russell, Abhinav Gupta, Deva Ramanan

    Abstract: We explore architectures for general pixel-level prediction problems, from low-level edge detection to mid-level surface normal estimation to high-level semantic segmentation. Convolutional predictors, such as the fully-convolutional network (FCN), have achieved remarkable success by exploiting the spatial redundancy of neighboring pixels through convolutional processing. Though computationally ef…

    Submitted 21 September, 2016; originally announced September 2016.

  38. arXiv:1604.01347

    cs.CV

    Marr Revisited: 2D-3D Alignment via Surface Normal Prediction

    Authors: Aayush Bansal, Bryan Russell, Abhinav Gupta

    Abstract: We introduce an approach that leverages surface normal predictions, along with appearance cues, to retrieve 3D models for objects depicted in 2D still images from a large CAD object library. Critical to the success of our approach is the ability to recover accurate surface normals for objects in the depicted scene. We introduce a skip-network model built on the pre-trained Oxford VGG convolutional…

    Submitted 5 April, 2016; originally announced April 2016.

  39. arXiv:1512.02497

    cs.CV cs.LG cs.NE

    Deep Exemplar 2D-3D Detection by Adapting from Real to Rendered Views

    Authors: Francisco Massa, Bryan Russell, Mathieu Aubry

    Abstract: This paper presents an end-to-end convolutional neural network (CNN) for 2D-3D exemplar detection. We demonstrate that the ability to adapt the features of natural images to better align with those of CAD rendered views is critical to the success of our technique. We show that the adaptation can be learned by compositing rendered views of textured object models on natural images. Our approach can…

    Submitted 18 April, 2016; v1 submitted 8 December, 2015; originally announced December 2015.

    Comments: To appear in CVPR 2016

  40. arXiv:1506.01151

    cs.CV

    Understanding deep features with computer-generated imagery

    Authors: Mathieu Aubry, Bryan Russell

    Abstract: We introduce an approach for analyzing the variation of features generated by convolutional neural networks (CNNs) with respect to scene factors that occur in natural images. Such factors may include object style, 3D viewpoint, color, and scene lighting configuration. Our approach analyzes CNN feature responses corresponding to different scene factors by controlling for them via rendering using a…

    Submitted 3 June, 2015; originally announced June 2015.

  41. Zermelo Navigation and a Speed Limit to Quantum Information Processing

    Authors: Benjamin Russell, Susan Stepney

    Abstract: We use a specific geometric method to determine speed limits to the implementation of quantum gates in controlled quantum systems that have a specific class of constrained control functions. We achieve this by applying a recent theorem of Shen, which provides a connection between time optimal navigation on Riemannian manifolds and the geodesics of a certain Finsler metric of Randers type. We use t…

    Submitted 13 June, 2014; v1 submitted 24 October, 2013; originally announced October 2013.

    Comments: 7 pages, accepted by Phys.Rev.A

    Journal ref: Phys. Rev. A 90, 012303 (2014)