Showing 1–50 of 55 results for author: Nevatia, R

  1. arXiv:2406.11309  [pdf, other]

    cs.CV

    BaFTA: Backprop-Free Test-Time Adaptation For Zero-Shot Vision-Language Models

    Authors: Xuefeng Hu, Ke Zhang, Min Sun, Albert Chen, Cheng-Hao Kuo, Ram Nevatia

    Abstract: Large-scale pretrained vision-language models like CLIP have demonstrated remarkable zero-shot image classification capabilities across diverse domains. To enhance CLIP's performance while preserving the zero-shot paradigm, various test-time prompt tuning methods have been introduced to refine class embeddings through unsupervised learning objectives during inference. However, these methods often… ▽ More

    Submitted 18 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: Preprint updated from our earlier manuscript submitted to ICLR 2024 (https://openreview.net/forum?id=KNtcoAM5Gy)

  2. arXiv:2404.02345  [pdf, other]

    cs.CV

    GaitSTR: Gait Recognition with Sequential Two-stream Refinement

    Authors: Wanrong Zheng, Haidong Zhu, Zhaoheng Zheng, Ram Nevatia

    Abstract: Gait recognition aims to identify a person based on their walking sequences, serving as a useful biometric modality as it can be observed from long distances without requiring cooperation from the subject. In representing a person's walking sequence, silhouettes and skeletons are the two primary modalities used. Silhouette sequences lack detailed part information when overlapping occurs between di… ▽ More

    Submitted 2 April, 2024; originally announced April 2024.

  3. arXiv:2312.04076  [pdf, other]

    cs.CV

    Large Language Models are Good Prompt Learners for Low-Shot Image Classification

    Authors: Zhaoheng Zheng, Jingmin Wei, Xuefeng Hu, Haidong Zhu, Ram Nevatia

    Abstract: Low-shot image classification, where training images are limited or inaccessible, has benefited from recent progress on pre-trained vision-language (VL) models with strong generalizability, e.g. CLIP. Prompt learning methods built with VL models generate text features from the class names that only have confined class-specific information. Large Language Models (LLMs), with their vast encyclopedic… ▽ More

    Submitted 2 April, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

    Comments: CVPR 2024

  4. arXiv:2311.15510  [pdf, other]

    cs.CV

    CaesarNeRF: Calibrated Semantic Representation for Few-shot Generalizable Neural Rendering

    Authors: Haidong Zhu, Tianyu Ding, Tianyi Chen, Ilya Zharkov, Ram Nevatia, Luming Liang

    Abstract: Generalizability and few-shot learning are key challenges in Neural Radiance Fields (NeRF), often due to the lack of a holistic understanding in pixel-level rendering. We introduce CaesarNeRF, an end-to-end approach that leverages scene-level CAlibratEd SemAntic Representation along with pixel-level representations to advance few-shot, generalizable neural rendering, facilitating a holistic unders… ▽ More

    Submitted 9 July, 2024; v1 submitted 26 November, 2023; originally announced November 2023.

    Comments: Accepted to ECCV 2024. Project available at https://haidongz-usc.github.io/project/caesarnerf

  5. arXiv:2310.15946  [pdf, other]

    cs.CV

    ShARc: Shape and Appearance Recognition for Person Identification In-the-wild

    Authors: Haidong Zhu, Wanrong Zheng, Zhaoheng Zheng, Ram Nevatia

    Abstract: Identifying individuals in unconstrained video settings is a valuable yet challenging task in biometric analysis due to variations in appearances, environments, degradations, and occlusions. In this paper, we present ShARc, a multimodal approach for video-based person identification in uncontrolled environments that emphasizes 3-D body shape, pose, and appearance. We introduce two encoders: a Pose… ▽ More

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: WACV 2024

  6. arXiv:2308.03793  [pdf, other]

    cs.CV cs.LG

    ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

    Authors: Xuefeng Hu, Ke Zhang, Lu Xia, Albert Chen, Jiajia Luo, Yuyin Sun, Ken Wang, Nan Qiao, Xiao Zeng, Min Sun, Cheng-Hao Kuo, Ram Nevatia

    Abstract: Large-scale Pre-Training Vision-Language Model such as CLIP has demonstrated outstanding performance in zero-shot classification, e.g. achieving 76.3% top-1 accuracy on ImageNet without seeing any example, which leads to potential benefits to many tasks that have no labeled data. However, while applying CLIP to a downstream target domain, the presence of visual and text domain gaps and cross-modal… ▽ More

    Submitted 13 December, 2023; v1 submitted 4 August, 2023; originally announced August 2023.

    Comments: Accepted as Oral Paper by 2024 IEEE CVF Winter Conference on Applications of Computer Vision (WACV)

  7. arXiv:2305.16681  [pdf, other]

    cs.CV

    CAILA: Concept-Aware Intra-Layer Adapters for Compositional Zero-Shot Learning

    Authors: Zhaoheng Zheng, Haidong Zhu, Ram Nevatia

    Abstract: In this paper, we study the problem of Compositional Zero-Shot Learning (CZSL), which is to recognize novel attribute-object combinations with pre-existing concepts. Recent researchers focus on applying large-scale Vision-Language Pre-trained (VLP) models like CLIP with strong generalization ability. However, these methods treat the pre-trained model as a black box and focus on pre- and post-CLIP… ▽ More

    Submitted 7 November, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: WACV 2024 Camera Ready

  8. arXiv:2304.07916  [pdf, other]

    cs.CV

    GaitRef: Gait Recognition with Refined Sequential Skeletons

    Authors: Haidong Zhu, Wanrong Zheng, Zhaoheng Zheng, Ram Nevatia

    Abstract: Identifying humans with their walking sequences, known as gait recognition, is a useful biometric understanding task as it can be observed from a long distance and does not require cooperation from the subject. Two common modalities used for representing the walking sequence of a person are silhouettes and joint skeletons. Silhouette sequences, which record the boundary of the walking person in ea… ▽ More

    Submitted 8 August, 2023; v1 submitted 16 April, 2023; originally announced April 2023.

    Comments: IJCB 2023 oral. Code is available at https://github.com/haidongz-usc/GaitRef

  9. arXiv:2304.07915  [pdf, other]

    cs.CV

    CAT-NeRF: Constancy-Aware Tx$^2$Former for Dynamic Body Modeling

    Authors: Haidong Zhu, Zhaoheng Zheng, Wanrong Zheng, Ram Nevatia

    Abstract: This paper addresses the problem of human rendering in the video with temporal appearance constancy. Reconstructing dynamic body shapes with volumetric neural rendering methods, such as NeRF, requires finding the correspondence of the points in the canonical and observation space, which demands understanding human body shape and motion. Some methods use rigid transformation, such as SE(3), which c… ▽ More

    Submitted 16 April, 2023; originally announced April 2023.

  10. arXiv:2303.12145  [pdf, other]

    cs.CV

    Efficient Feature Distillation for Zero-shot Annotation Object Detection

    Authors: Zhuoming Liu, Xuefeng Hu, Ram Nevatia

    Abstract: We propose a new setting for detecting unseen objects called Zero-shot Annotation object Detection (ZAD). It expands the zero-shot object detection setting by allowing the novel objects to exist in the training images and restricts the additional information the detector uses to novel category names. Recently, to detect unseen objects, large-scale vision-language models (e.g., CLIP) are leveraged… ▽ More

    Submitted 1 November, 2023; v1 submitted 21 March, 2023; originally announced March 2023.

    Comments: WACV2024 accepted paper

  11. arXiv:2212.09042  [pdf, other]

    cs.CV

    Gait Recognition Using 3-D Human Body Shape Inference

    Authors: Haidong Zhu, Zhaoheng Zheng, Ram Nevatia

    Abstract: Gait recognition, which identifies individuals based on their walking patterns, is an important biometric technique since it can be observed from a distance and does not require the subject's cooperation. Recognizing a person's gait is difficult because of the appearance variants in human silhouette sequences produced by varying viewing angles, carrying objects, and clothing. Recent research has p… ▽ More

    Submitted 18 December, 2022; originally announced December 2022.

    Comments: Accepted to WACV 2023

  12. arXiv:2207.01795  [pdf, other]

    cs.CV cs.CR cs.LG

    PatchZero: Defending against Adversarial Patch Attacks by Detecting and Zeroing the Patch

    Authors: Ke Xu, Yao Xiao, Zhaoheng Zheng, Kaijie Cai, Ram Nevatia

    Abstract: Adversarial patch attacks mislead neural networks by injecting adversarial pixels within a local region. Patch attacks can be highly effective in a variety of tasks and physically realizable via attachment (e.g. a sticker) to the real-world objects. Despite the diversity in attack patterns, adversarial patches tend to be highly textured and different in appearance from natural images. We exploit t… ▽ More

    Submitted 5 September, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

    Comments: Accepted to WACV 2023

  13. arXiv:2110.11478  [pdf, other]

    cs.CV

    MixNorm: Test-Time Adaptation Through Online Normalization Estimation

    Authors: Xuefeng Hu, Gokhan Uzunbas, Sirius Chen, Rui Wang, Ashish Shah, Ram Nevatia, Ser-Nam Lim

    Abstract: We present a simple and effective way to estimate the batch-norm statistics during test time, to fast adapt a source model to target test samples. Known as Test-Time Adaptation, most prior works studying this task follow two assumptions in their evaluation where (1) test samples come together as a large batch, and (2) all from a single test distribution. However, in practice, these two assumptions… ▽ More

    Submitted 21 October, 2021; originally announced October 2021.

  14. arXiv:2108.11501  [pdf, other]

    cs.CV

    Improving Object Detection and Attribute Recognition by Feature Entanglement Reduction

    Authors: Zhaoheng Zheng, Arka Sadhu, Ram Nevatia

    Abstract: We explore object detection with two attributes: color and material. The task aims to simultaneously detect objects and infer their color and material. A straight-forward approach is to add attribute heads at the very end of a usual object detection pipeline. However, we observe that the two goals are in conflict: Object detection should be attribute-independent and attributes be largely object-in… ▽ More

    Submitted 25 August, 2021; originally announced August 2021.

    Comments: Camera-ready for ICIP 2021

  15. arXiv:2104.03762  [pdf, other]

    cs.CV cs.CL

    Video Question Answering with Phrases via Semantic Roles

    Authors: Arka Sadhu, Kan Chen, Ram Nevatia

    Abstract: Video Question Answering (VidQA) evaluation metrics have been limited to a single-word answer or selecting a phrase from a fixed set of phrases. These metrics limit the VidQA models' application scenario. In this work, we leverage semantic roles derived from video descriptions to mask out certain phrases, to introduce VidQAP which poses VidQA as a fill-in-the-phrase task. To enable evaluation of a… ▽ More

    Submitted 8 April, 2021; originally announced April 2021.

    Comments: NAACL21 Camera Ready including appendix

  16. arXiv:2104.00990  [pdf, other]

    cs.CV cs.CL

    Visual Semantic Role Labeling for Video Understanding

    Authors: Arka Sadhu, Tanmay Gupta, Mark Yatskar, Ram Nevatia, Aniruddha Kembhavi

    Abstract: We propose a new framework for understanding and representing related salient events in a video using visual semantic role labeling. We represent videos as a set of related events, wherein each event consists of a verb and multiple entities that fulfill various roles relevant to that event. To study the challenging task of semantic role labeling in videos or VidSRL, we introduce the VidSitu benchm… ▽ More

    Submitted 2 April, 2021; originally announced April 2021.

    Comments: CVPR21 camera-ready including appendix. Project Page at https://vidsitu.org/

  17. SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised Classification

    Authors: Zijian Hu, Zhengyu Yang, Xuefeng Hu, Ram Nevatia

    Abstract: A common classification task situation is where one has a large amount of data available for training, but only a small portion is annotated with class labels. The goal of semi-supervised training, in this context, is to improve classification accuracy by leveraging information not only from labeled data but also from a large amount of unlabeled data. Recent works have developed significant improvem… ▽ More

    Submitted 29 June, 2022; v1 submitted 30 March, 2021; originally announced March 2021.

    Comments: Accepted to CVPR 2021. First two authors contributed equally

  18. arXiv:2011.02655  [pdf, other]

    cs.CV

    Utilizing Every Image Object for Semi-supervised Phrase Grounding

    Authors: Haidong Zhu, Arka Sadhu, Zhaoheng Zheng, Ram Nevatia

    Abstract: Phrase grounding models localize an object in the image given a referring expression. The annotated language queries available during training are limited, which also limits the variations of language combinations that a model can see during training. In this paper, we study the case applying objects without labeled queries for training the semi-supervised phrase grounding. We propose to use learn… ▽ More

    Submitted 4 November, 2020; originally announced November 2020.

  19. SPAN: Spatial Pyramid Attention Network for Image Manipulation Localization

    Authors: Xuefeng Hu, Zhihan Zhang, Zhenye Jiang, Syomantak Chaudhuri, Zhenheng Yang, Ram Nevatia

    Abstract: We present a novel framework, Spatial Pyramid Attention Network (SPAN) for detection and localization of multiple types of image manipulations. The proposed architecture efficiently and effectively models the relationship between image patches at multiple scales by constructing a pyramid of local self-attention blocks. The design includes a novel position projection to encode the spatial positions… ▽ More

    Submitted 13 January, 2021; v1 submitted 1 September, 2020; originally announced September 2020.

    Comments: Accepted at ECCV 2020 (https://link.springer.com/chapter/10.1007%2F978-3-030-58589-1_19) Code Available at https://github.com/ZhiHanZ/IRIS0-SPAN/

    ACM Class: I.4.9

  20. arXiv:2005.00785  [pdf, other]

    cs.CL

    Visually Grounded Continual Learning of Compositional Phrases

    Authors: Xisen Jin, Junyi Du, Arka Sadhu, Ram Nevatia, Xiang Ren

    Abstract: Humans acquire language continually with much more limited access to data samples at a time, as compared to contemporary NLP systems. To study this human-like language acquisition ability, we present VisCOLL, a visually grounded language learning task, which simulates the continual acquisition of compositional phrases from streaming visual scenes. In the task, models are trained on a paired image-… ▽ More

    Submitted 16 November, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

    Comments: EMNLP 2020; Fixed typos

  21. arXiv:2004.08028  [pdf, other]

    cs.CV

    CPARR: Category-based Proposal Analysis for Referring Relationships

    Authors: Chuanzi He, Haidong Zhu, Jiyang Gao, Kan Chen, Ram Nevatia

    Abstract: The task of referring relationships is to localize subject and object entities in an image satisfying a relationship query, which is given in the form of <subject, predicate, object>. This requires simultaneous localization of the subject and object entities in a specified relationship. We introduce a simple yet effective proposal-based method for referring relationships. Different from t… ▽ More

    Submitted 16 April, 2020; originally announced April 2020.

    Comments: CVPR 2020 Workshop on Multimodal Learning

  22. arXiv:2003.10606  [pdf, other]

    cs.CV cs.CL

    Video Object Grounding using Semantic Roles in Language Description

    Authors: Arka Sadhu, Kan Chen, Ram Nevatia

    Abstract: We explore the task of Video Object Grounding (VOG), which grounds objects in videos referred to in natural language descriptions. Previous methods apply image grounding based algorithms to address VOG, fail to explore the object relation information and suffer from limited generalization. Here, we investigate the role of object relations in VOG and propose a novel framework VOGNet to encode multi… ▽ More

    Submitted 23 March, 2020; originally announced March 2020.

    Comments: CVPR20 camera-ready including appendix

  23. arXiv:2003.08593  [pdf, other]

    cs.CV

    Curriculum DeepSDF

    Authors: Yueqi Duan, Haidong Zhu, He Wang, Li Yi, Ram Nevatia, Leonidas J. Guibas

    Abstract: When learning to sketch, beginners start with simple and flexible shapes, and then gradually strive for more complex and accurate ones in the subsequent training sessions. In this paper, we design a "shape curriculum" for learning continuous Signed Distance Function (SDF) on shapes, namely Curriculum DeepSDF. Inspired by how humans learn, Curriculum DeepSDF organizes the learning task in ascending… ▽ More

    Submitted 16 July, 2020; v1 submitted 19 March, 2020; originally announced March 2020.

    Comments: ECCV 2020

  24. arXiv:1908.07129  [pdf, other]

    cs.CV cs.CL

    Zero-Shot Grounding of Objects from Natural Language Queries

    Authors: Arka Sadhu, Kan Chen, Ram Nevatia

    Abstract: A phrase grounding system localizes a particular object in an image referred to by a natural language query. In previous work, the phrases were restricted to have nouns that were encountered in training; we extend the task to Zero-Shot Grounding (ZSG), which can include novel, "unseen" nouns. Current phrase grounding systems use an explicit object detection network in a 2-stage framework where one s… ▽ More

    Submitted 19 August, 2019; originally announced August 2019.

    Comments: ICCV19 oral, camera-ready version

  25. arXiv:1907.10202  [pdf, other]

    cs.CV

    Pose-variant 3D Facial Attribute Generation

    Authors: Feng-Ju Chang, Xiang Yu, Ram Nevatia, Manmohan Chandraker

    Abstract: We address the challenging problem of generating facial attributes using a single image in an unconstrained pose. In contrast to prior works that largely consider generation on 2D near-frontal images, we propose a GAN-based framework to generate attributes directly on a dense 3D representation given by UV texture and position maps, resulting in photorealistic, geometrically-consistent and identity… ▽ More

    Submitted 23 July, 2019; originally announced July 2019.

  26. arXiv:1904.01665  [pdf, other]

    cs.CV

    Activity Driven Weakly Supervised Object Detection

    Authors: Zhenheng Yang, Dhruv Mahajan, Deepti Ghadiyaram, Ram Nevatia, Vignesh Ramanathan

    Abstract: Weakly supervised object detection aims at reducing the amount of supervision required to train detection models. Such models are traditionally learned from images/videos labelled only with the object class and not the object bounding box. In our work, we try to leverage not only the object class labels but also the action labels associated with the data. We show that the action depicted in the im… ▽ More

    Submitted 2 April, 2019; originally announced April 2019.

    Comments: CVPR'19 camera ready

  27. arXiv:1812.03213  [pdf, other]

    cs.CV

    PIRC Net : Using Proposal Indexing, Relationships and Context for Phrase Grounding

    Authors: Rama Kovvuri, Ram Nevatia

    Abstract: Phrase Grounding aims to detect and localize objects in images that are referred to and are queried by natural language phrases. Phrase grounding finds applications in tasks such as Visual Dialog, Visual Search and Image-text co-reference resolution. In this paper, we present a framework that leverages information such as phrase category, relationships among neighboring phrases in a sentence and c… ▽ More

    Submitted 7 December, 2018; originally announced December 2018.

    Comments: Accepted in ACCV 2018

  28. arXiv:1812.00124  [pdf, other]

    cs.CV

    NOTE-RCNN: NOise Tolerant Ensemble RCNN for Semi-Supervised Object Detection

    Authors: Jiyang Gao, Jiang Wang, Shengyang Dai, Li-Jia Li, Ram Nevatia

    Abstract: The labeling cost of large number of bounding boxes is one of the main challenges for training modern object detectors. To reduce the dependence on expensive bounding box annotations, we propose a new semi-supervised object detection formulation, in which a few seed box level annotations and a large scale of image level annotations are used to train the detector. We adopt a training-mining framewo… ▽ More

    Submitted 30 November, 2018; originally announced December 2018.

    Comments: 8 pages

  29. arXiv:1811.08925  [pdf, other]

    cs.CV

    MAC: Mining Activity Concepts for Language-based Temporal Localization

    Authors: Runzhou Ge, Jiyang Gao, Kan Chen, Ram Nevatia

    Abstract: We address the problem of language-based temporal localization in untrimmed videos. Compared to temporal localization with fixed categories, this problem is more challenging as the language-based queries not only have no pre-defined activity list but also may contain complex descriptions. Previous methods address the problem by considering features from video sliding windows and language queries a… ▽ More

    Submitted 21 November, 2018; originally announced November 2018.

    Comments: WACV 2019

  30. arXiv:1810.06125  [pdf, other]

    cs.CV

    Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding

    Authors: Chenxu Luo, Zhenheng Yang, Peng Wang, Yang Wang, Wei Xu, Ram Nevatia, Alan Yuille

    Abstract: Learning to estimate 3D geometry in a single frame and optical flow from consecutive frames by watching unlabeled videos via deep convolutional network has made significant progress recently. Current state-of-the-art (SoTA) methods treat the two tasks independently. One typical assumption of the existing depth estimation methods is that the scenes contain no independent moving objects, while objec… ▽ More

    Submitted 10 July, 2019; v1 submitted 14 October, 2018; originally announced October 2018.

    Comments: Chenxu Luo, Zhenheng Yang, and Peng Wang contributed equally, TPAMI submission

  31. arXiv:1807.04821  [pdf, other]

    cs.CV

    CTAP: Complementary Temporal Action Proposal Generation

    Authors: Jiyang Gao, Kan Chen, Ram Nevatia

    Abstract: Temporal action proposal generation is an important task: akin to object proposals, temporal action proposals are intended to capture "clips" or temporal intervals in videos that are likely to contain an action. Previous methods can be divided into two groups: sliding window ranking and actionness score grouping. Sliding windows uniformly cover all segments in videos, but the temporal boundaries are… ▽ More

    Submitted 18 July, 2018; v1 submitted 12 July, 2018; originally announced July 2018.

    Comments: ECCV 2018 main conference paper (camera ready version). Code is available in http://www.github.com/jiyanggao/CTAP

  32. arXiv:1806.10556  [pdf, other]

    cs.CV

    Every Pixel Counts: Unsupervised Geometry Learning with Holistic 3D Motion Understanding

    Authors: Zhenheng Yang, Peng Wang, Yang Wang, Wei Xu, Ram Nevatia

    Abstract: Learning to estimate 3D geometry in a single image by watching unlabeled videos via deep convolutional network has made significant progress recently. Current state-of-the-art (SOTA) methods are based on the learning framework of rigid structure-from-motion, where only 3D camera ego motion is modeled for geometry estimation. However, moving objects also exist in many videos, e.g. moving cars in a s… ▽ More

    Submitted 15 August, 2018; v1 submitted 27 June, 2018; originally announced June 2018.

    Comments: ECCV18' submission

  33. arXiv:1805.02104  [pdf, other]

    cs.CV

    Revisiting Temporal Modeling for Video-based Person ReID

    Authors: Jiyang Gao, Ram Nevatia

    Abstract: Video-based person reID is an important task, which has received much attention in recent years due to the increasing demand in surveillance and camera networks. A typical video-based person reID system consists of three parts: an image-level feature extractor (e.g. CNN), a temporal modeling method to aggregate temporal features and a loss function. Although many methods on temporal modeling have… ▽ More

    Submitted 7 May, 2018; v1 submitted 5 May, 2018; originally announced May 2018.

    Comments: codes available at https://github.com/jiyanggao/Video-Person-ReID

  34. arXiv:1803.10906  [pdf, other]

    cs.CV

    Motion-Appearance Co-Memory Networks for Video Question Answering

    Authors: Jiyang Gao, Runzhou Ge, Kan Chen, Ram Nevatia

    Abstract: Video Question Answering (QA) is an important task in understanding video temporal structure. We observe that there are three unique attributes of video QA compared with image QA: (1) it deals with long sequences of images containing richer information not only in quantity but also in variety; (2) motion and appearance information are usually correlated with each other and able to provide useful a… ▽ More

    Submitted 28 March, 2018; originally announced March 2018.

    Comments: CVPR 2018

  35. arXiv:1803.05648  [pdf, other]

    cs.CV

    LEGO: Learning Edge with Geometry all at Once by Watching Videos

    Authors: Zhenheng Yang, Peng Wang, Yang Wang, Wei Xu, Ram Nevatia

    Abstract: Learning to estimate 3D geometry in a single image by watching unlabeled videos via deep convolutional network is attracting significant attention. In this paper, we introduce a "3D as-smooth-as-possible (3D-ASAP)" prior inside the pipeline, which enables joint estimation of edges and 3D scene, yielding results with significant improvement in accuracy for fine detailed structures. Specifically, we… ▽ More

    Submitted 23 March, 2018; v1 submitted 15 March, 2018; originally announced March 2018.

    Comments: Accepted to CVPR 2018 as spotlight; Camera ready plus supplementary material. Code will come

  36. arXiv:1803.03879  [pdf, other]

    cs.CV

    Knowledge Aided Consistency for Weakly Supervised Phrase Grounding

    Authors: Kan Chen, Jiyang Gao, Ram Nevatia

    Abstract: Given a natural language query, a phrase grounding system aims to localize mentioned objects in an image. In the weakly supervised scenario, mapping between image regions (i.e., proposals) and language is not available in the training set. Previous methods address this deficiency by training a grounding system via learning to reconstruct language information contained in input queries from predicted p… ▽ More

    Submitted 10 March, 2018; originally announced March 2018.

    Comments: CVPR 2018 conference paper

  37. arXiv:1802.00542  [pdf, other]

    cs.CV

    ExpNet: Landmark-Free, Deep, 3D Facial Expressions

    Authors: Feng-Ju Chang, Anh Tuan Tran, Tal Hassner, Iacopo Masi, Ram Nevatia, Gerard Medioni

    Abstract: We describe a deep learning based method for estimating 3D facial expression coefficients. Unlike previous work, our process does not rely on facial landmark detection methods as a proxy step. Recent methods have shown that a CNN can be trained to regress accurate and discriminative 3D morphable model (3DMM) representations, directly from image intensities. By foregoing facial landmark detection,… ▽ More

    Submitted 1 February, 2018; originally announced February 2018.

    Comments: Accepted to the IEEE International Conference on Automatic Face and Gesture Recognition, 2018

  38. arXiv:1711.07607  [pdf, other]

    cs.CV

    Knowledge Concentration: Learning 100K Object Classifiers in a Single CNN

    Authors: Jiyang Gao, Zijian Guo, Zhen Li, Ram Nevatia

    Abstract: Fine-grained image labels are desirable for many computer vision applications, such as visual search or mobile AI assistant. These applications rely on image classification models that can produce hundreds of thousands (e.g. 100K) of diversified fine-grained image labels on input images. However, training a network at this vocabulary scale is challenging, and suffers from intolerable large model s… ▽ More

    Submitted 23 November, 2017; v1 submitted 20 November, 2017; originally announced November 2017.

  39. arXiv:1711.03665  [pdf, other]

    cs.CV

    Unsupervised Learning of Geometry with Edge-aware Depth-Normal Consistency

    Authors: Zhenheng Yang, Peng Wang, Wei Xu, Liang Zhao, Ramakant Nevatia

    Abstract: Learning to reconstruct depths in a single image by watching unlabeled videos via deep convolutional network (DCN) is attracting significant attention in recent years. In this paper, we introduce a surface normal representation for unsupervised depth estimation framework. Our estimated depths are constrained to be compatible with predicted normals, yielding more robust geometry results. Specifical… ▽ More

    Submitted 9 November, 2017; originally announced November 2017.

    Comments: Accepted at AAAI 2018

  40. arXiv:1708.07517  [pdf, other]

    cs.CV

    FacePoseNet: Making a Case for Landmark-Free Face Alignment

    Authors: Fengju Chang, Anh Tuan Tran, Tal Hassner, Iacopo Masi, Ram Nevatia, Gerard Medioni

    Abstract: We show how a simple convolutional neural network (CNN) can be trained to accurately and robustly regress 6 degrees of freedom (6DoF) 3D head pose, directly from image intensities. We further explain how this FacePoseNet (FPN) can be used to align faces in 2D and 3D as an alternative to explicit facial landmark detection for these tasks. We claim that in many cases the standard means of measuring… ▽ More

    Submitted 31 August, 2017; v1 submitted 24 August, 2017; originally announced August 2017.

  41. arXiv:1708.01676  [pdf, other]

    cs.CV

    Query-guided Regression Network with Context Policy for Phrase Grounding

    Authors: Kan Chen, Rama Kovvuri, Ram Nevatia

    Abstract: Given a textual description of an image, phrase grounding localizes objects in the image referred by query phrases in the description. State-of-the-art methods address the problem by ranking a set of proposals based on the relevance to each query, which are limited by the performance of independent proposal generation systems and ignore useful cues from context in the description. In this paper, w… ▽ More

    Submitted 4 August, 2017; originally announced August 2017.

    Comments: Spotlight in ICCV 2017

  42. arXiv:1708.00042  [pdf, other]

    cs.CV

    Spatio-Temporal Action Detection with Cascade Proposal and Location Anticipation

    Authors: Zhenheng Yang, Jiyang Gao, Ram Nevatia

    Abstract: In this work, we address the problem of spatio-temporal action detection in temporally untrimmed videos. It is an important and challenging task as finding accurate human actions in both temporal and spatial space is important for analyzing large-scale video data. To tackle this problem, we propose a cascade proposal and location anticipation (CPLA) model for frame-level action detection. There ar… ▽ More

    Submitted 31 July, 2017; originally announced August 2017.

    Comments: Accepted at BMVC 2017 (oral)

  43. arXiv:1707.04818  [pdf, other]

    cs.CV

    RED: Reinforced Encoder-Decoder Networks for Action Anticipation

    Authors: Jiyang Gao, Zhenheng Yang, Ram Nevatia

    Abstract: Action anticipation aims to detect an action before it happens. Many real world applications in robotics and surveillance are related to this predictive capability. Current methods address this problem by first anticipating visual representations of future frames and then categorizing the anticipated representations to actions. However, anticipation is based on a single past frame's representation… ▽ More

    Submitted 16 July, 2017; originally announced July 2017.

  44. arXiv:1705.02101  [pdf, other]

    cs.CV

    TALL: Temporal Activity Localization via Language Query

    Authors: Jiyang Gao, Chen Sun, Zhenheng Yang, Ram Nevatia

    Abstract: This paper focuses on temporal localization of actions in untrimmed videos. Existing methods typically train classifiers for a pre-defined list of actions and apply them in a sliding window fashion. However, activities in the wild consist of a wide combination of actors, actions and objects; it is difficult to design a proper activity list that meets users' needs. We propose to localize activities…

    Submitted 3 August, 2017; v1 submitted 5 May, 2017; originally announced May 2017.

    Comments: ICCV 2017 camera ready (with supplemental material)

  45. arXiv:1705.01180  [pdf, other]

    cs.CV

    Cascaded Boundary Regression for Temporal Action Detection

    Authors: Jiyang Gao, Zhenheng Yang, Ram Nevatia

    Abstract: Temporal action detection in long videos is an important problem. State-of-the-art methods address this problem by applying action classifiers on sliding windows. Although sliding windows may contain an identifiable portion of the actions, they may not necessarily cover the entire action instance, which would lead to inferior performance. We adapt a two-stage temporal action detection pipeline wit…

    Submitted 2 May, 2017; originally announced May 2017.

  46. arXiv:1704.00763  [pdf, other]

    cs.CV

    AMC: Attention guided Multi-modal Correlation Learning for Image Search

    Authors: Kan Chen, Trung Bui, Fang Chen, Zhaowen Wang, Ram Nevatia

    Abstract: Given a user's query, traditional image search systems rank images according to their relevance to a single modality (e.g., image content or surrounding text). Nowadays, an increasing number of images on the Internet are available with associated metadata in rich modalities (e.g., titles, keywords, tags, etc.), which can be exploited for better similarity measure with queries. In this paper, we lev…

    Submitted 3 April, 2017; originally announced April 2017.

    Comments: CVPR 2017

  47. arXiv:1703.06189  [pdf, other]

    cs.CV

    TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals

    Authors: Jiyang Gao, Zhenheng Yang, Chen Sun, Kan Chen, Ram Nevatia

    Abstract: Temporal Action Proposal (TAP) generation is an important problem, as fast and accurate extraction of semantically important (e.g. human actions) segments from untrimmed videos is an important step for large-scale video analysis. We propose a novel Temporal Unit Regression Network (TURN) model. There are two salient aspects of TURN: (1) TURN jointly predicts action proposals and refines the tempor…

    Submitted 4 August, 2017; v1 submitted 17 March, 2017; originally announced March 2017.

    Comments: ICCV 2017 camera ready

  48. arXiv:1609.03536  [pdf, other]

    cs.CV

    A Multi-Scale Cascade Fully Convolutional Network Face Detector

    Authors: Zhenheng Yang, Ram Nevatia

    Abstract: Face detection is challenging as faces in images could be present at arbitrary locations and in different scales. We propose a three-stage cascade structure based on fully convolutional neural networks (FCNs). It first proposes the approximate locations where the faces may be, then aims to find the accurate location by zooming on to the faces. Each level of the FCN cascade is a multi-scale fully-c…

    Submitted 12 September, 2016; originally announced September 2016.

    Comments: Accepted to ICPR '16

  49. arXiv:1609.02284  [pdf, other]

    cs.CV

    Learning Action Concept Trees and Semantic Alignment Networks from Image-Description Data

    Authors: Jiyang Gao, Ram Nevatia

    Abstract: Action classification in still images has been a popular research topic in computer vision. Labelling large scale datasets for action classification requires tremendous manual work, which is hard to scale up. Besides, the action categories in such datasets are pre-defined and vocabularies are fixed. However, humans may describe the same action with different phrases, which leads to the difficulty o…

    Submitted 8 September, 2016; originally announced September 2016.

    Comments: 16 pages, 5 figures

  50. arXiv:1604.04784  [pdf, other]

    cs.CV

    ACD: Action Concept Discovery from Image-Sentence Corpora

    Authors: Jiyang Gao, Chen Sun, Ram Nevatia

    Abstract: Action classification in still images is an important task in computer vision. It is challenging as the appearances of actions may vary depending on their context (e.g. associated objects). Manual labeling of context information would be time consuming and difficult to scale up. To address this challenge, we propose a method to automatically discover and cluster action concepts, and learn thei…

    Submitted 16 April, 2016; originally announced April 2016.

    Comments: 8 pages, accepted by ICMR 2016