Flow Guided Transformable Bottleneck Networks for Motion Retargeting

Authors: Jian Ren, Menglei Chai, Oliver J. Woodford, Kyle Olszewski, Sergey Tulyakov

Abstract: Human motion retargeting aims to transfer the motion of one person in a "driving" video or set of images to another person. Existing efforts leverage a long training video from each target person to train a subject-specific motion transfer model. However, the scalability of such methods is limited, as each model can only generate videos for the given target subject, and such training videos are labor-intensive to acquire and process. Few-shot motion transfer techniques, which only require one or a few images from a target, have recently drawn considerable attention. Methods addressing this task generally use either 2D or explicit 3D representations to transfer motion, and in doing so, sacrifice either accurate geometric modeling or the flexibility of an end-to-end learned representation. Inspired by the Transformable Bottleneck Network, which renders novel views and manipulations of rigid objects, we propose an approach based on an implicit volumetric representation of the image content, which can then be spatially manipulated using volumetric flow fields. We address the challenging question of how to aggregate information across different body poses, learning flow fields that allow for combining content from the appropriate regions of input images of highly non-rigid human subjects performing complex motions into a single implicit volumetric representation. This allows us to learn our 3D representation solely from videos of moving people. Armed with both 3D object understanding and end-to-end learned rendering, this categorically novel representation delivers state-of-the-art image generation quality, as shown by our quantitative and qualitative evaluations.

Submitted 14 June, 2021; originally announced June 2021.

Comments: CVPR 2021

arXiv:2104.11280

Motion Representations for Articulated Animation

Authors: Aliaksandr Siarohin, Oliver J. Woodford, Jian Ren, Menglei Chai, Sergey Tulyakov

Abstract: We propose novel motion representations for animating articulated objects consisting of distinct parts. In a completely unsupervised manner, our method identifies object parts, tracks them in a driving video, and infers their motions by considering their principal axes. In contrast to the previous keypoint-based works, our method extracts meaningful and consistent regions, describing locations, shape, and pose. The regions correspond to semantically relevant and distinct object parts, that are more easily detected in frames of the driving video. To force decoupling of foreground from background, we model non-object related global motion with an additional affine transformation. To facilitate animation and prevent the leakage of the shape of the driving object, we disentangle shape and pose of objects in the region space. Our model can animate a variety of objects, surpassing previous methods by a large margin on existing benchmarks. We present a challenging new benchmark with high-resolution videos and show that the improvement is particularly pronounced when articulated objects are considered, reaching 96.6% user preference vs. the state of the art.

Submitted 22 April, 2021; originally announced April 2021.

Journal ref: CVPR 2021

arXiv:2103.03467

Teachers Do More Than Teach: Compressing Image-to-Image Models

Authors: Qing Jin, Jian Ren, Oliver J. Woodford, Jiazhuo Wang, Geng Yuan, Yanzhi Wang, Sergey Tulyakov

Abstract: Generative Adversarial Networks (GANs) have achieved huge success in generating high-fidelity images, however, they suffer from low efficiency due to tremendous computational cost and bulky memory usage. Recent efforts on compression GANs show noticeable progress in obtaining smaller generators by sacrificing image quality or involving a time-consuming searching process. In this work, we aim to address these issues by introducing a teacher network that provides a search space in which efficient network architectures can be found, in addition to performing knowledge distillation. First, we revisit the search space of generative models, introducing an inception-based residual block into generators. Second, to achieve target computation cost, we propose a one-step pruning algorithm that searches a student architecture from the teacher model and substantially reduces searching cost. It requires no l1 sparsity regularization and its associated hyper-parameters, simplifying the training procedure. Finally, we propose to distill knowledge through maximizing feature similarity between teacher and student via an index named Global Kernel Alignment (GKA). Our compressed networks achieve similar or even better image fidelity (FID, mIoU) than the original models with much-reduced computational cost, e.g., MACs. Code will be released at https://github.com/snap-research/CAT.

Submitted 18 August, 2021; v1 submitted 4 March, 2021; originally announced March 2021.

Comments: 18 pages, 10 figures, accepted by CVPR 2021

arXiv:2010.10968

Progressive Batching for Efficient Non-linear Least Squares

Authors: Huu Le, Christopher Zach, Edward Rosten, Oliver J. Woodford

Abstract: Non-linear least squares solvers are used across a broad range of offline and real-time model fitting problems. Most improvements of the basic Gauss-Newton algorithm tackle convergence guarantees or leverage the sparsity of the underlying problem structure for computational speedup. With the success of deep learning methods leveraging large datasets, stochastic optimization methods received recently a lot of attention. Our work borrows ideas from both stochastic machine learning and statistics, and we present an approach for non-linear least-squares that guarantees convergence while at the same time significantly reduces the required amount of computation. Empirical results show that our proposed method achieves competitive convergence rates compared to traditional second-order approaches on common computer vision problems, such as image alignment and essential matrix estimation, with very large numbers of residuals.

Submitted 21 October, 2020; originally announced October 2020.

Comments: Accepted to ACCV 2020

arXiv:2008.11762

Large Scale Photometric Bundle Adjustment

Authors: Oliver J. Woodford, Edward Rosten

Abstract: Direct methods have shown promise on visual odometry and SLAM, leading to greater accuracy and robustness over feature-based methods. However, offline 3-d reconstruction from internet images has not yet benefited from a joint, photometric optimization over dense geometry and camera parameters. Issues such as the lack of brightness constancy, and the sheer volume of data, make this a more challenging task. This work presents a framework for jointly optimizing millions of scene points and hundreds of camera poses and intrinsics, using a photometric cost that is invariant to local lighting changes. The improvement in metric reconstruction accuracy that it confers over feature-based bundle adjustment is demonstrated on the large-scale Tanks & Temples benchmark. We further demonstrate qualitative reconstruction improvements on an internet photo collection, with challenging diversity in lighting and camera intrinsics.

Submitted 10 September, 2020; v1 submitted 26 August, 2020; originally announced August 2020.

Comments: Presented at BMVC 2020. Fixed errors: intrinsic regularization corrected, and added to the cost

arXiv:1810.04320

Least Squares Normalized Cross Correlation

Authors: Oliver J. Woodford

Abstract: Direct methods are widely used for alignment of models to images, due to their accuracy, since they minimize errors in the domain of measurement noise. They have leveraged least squares minimizations, for simple, efficient, variational optimization, since the seminal 1981 work of Lucas & Kanade, and normalized cross correlation (NCC), for robustness to intensity variations, since at least 1972. Despite the complementary benefits of these two well known methods, they have not been effectively combined to address local variations in intensity. Many ad-hoc NCC frameworks, sub-optimal least squares methods and image transformation approaches have thus been proposed instead, each with their own limitations. This work shows that a least squares optimization of NCC without approximation is not only possible, but straightforward and efficient. A robust, locally normalized formulation is introduced to mitigate local intensity variations and partial occlusions. Finally, sparse features with oriented patches are proposed for further efficiency. The resulting framework is simple to implement, computationally efficient and robust to local intensity variations. It is evaluated on the image alignment problem, showing improvements in both convergence rate and computation time over existing lighting invariant methods.

Submitted 29 March, 2022; v1 submitted 9 October, 2018; originally announced October 2018.

Comments: Final version. Rejected from TPAMI twice

arXiv:1607.00273

Noise Models in Feature-based Stereo Visual Odometry

Authors: Pablo F. Alcantarilla, Oliver J. Woodford

Abstract: Feature-based visual structure and motion reconstruction pipelines, common in visual odometry and large-scale reconstruction from photos, use the location of corresponding features in different images to determine the 3D structure of the scene, as well as the camera parameters associated with each image. The noise model, which defines the likelihood of the location of each feature in each image, is a key factor in the accuracy of such pipelines, alongside optimization strategy. Many different noise models have been proposed in the literature; in this paper we investigate the performance of several. We evaluate these models specifically w.r.t. stereo visual odometry, as this task is both simple (camera intrinsics are constant and known; geometry can be initialized reliably) and has datasets with ground truth readily available (KITTI Odometry and New Tsukuba Stereo Dataset). Our evaluation shows that noise models which are more adaptable to the varying nature of noise generally perform better.

Submitted 1 July, 2016; originally announced July 2016.

Showing 1–7 of 7 results for author: Woodford, O J