
Showing 1–39 of 39 results for author: Hauptmann, A G

Searching in archive cs.
  1. arXiv:2406.19859  [pdf, other]

    cs.AI cs.HC cs.MM

    MetaDesigner: Advancing Artistic Typography through AI-Driven, User-Centric, and Multilingual WordArt Synthesis

    Authors: Jun-Yan He, Zhi-Qi Cheng, Chenyang Li, Jingdong Sun, Qi He, Wangmeng Xiang, Hanyuan Chen, Jin-Peng Lan, Xianhui Lin, Kang Zhu, Bin Luo, Yifeng Geng, Xuansong Xie, Alexander G. Hauptmann

    Abstract: MetaDesigner revolutionizes artistic typography synthesis by leveraging the strengths of Large Language Models (LLMs) to drive a design paradigm centered around user engagement. At the core of this framework lies a multi-agent system comprising the Pipeline, Glyph, and Texture agents, which collectively enable the creation of customized WordArt, ranging from semantic enhancements to the imposition…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: 18 pages, 16 figures, Project: https://modelscope.cn/studios/WordArt/WordArt

  2. arXiv:2406.19236  [pdf, other]

    cs.AI cs.CV cs.RO

    Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions

    Authors: Minghan Li, Heng Li, Zhi-Qi Cheng, Yifei Dong, Yuxuan Zhou, Jun-Yan He, Qi Dai, Teruko Mitamura, Alexander G. Hauptmann

    Abstract: Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activitie…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 30 pages, 18 figures, Project Page: https://lpercc.github.io/HA3D_simulator/

  3. arXiv:2404.18398  [pdf, other]

    cs.CL cs.MM

    MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis

    Authors: Xiang Li, Zhi-Qi Cheng, Jun-Yan He, Xiaojiang Peng, Alexander G. Hauptmann

    Abstract: Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, primarily relying on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-…

    Submitted 28 April, 2024; originally announced April 2024.

  4. arXiv:2310.05737  [pdf, other]

    cs.CV cs.AI cs.MM

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Authors: Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang

    Abstract: While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer…

    Submitted 29 March, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

  5. arXiv:2306.17842  [pdf, other]

    cs.CV cs.CL cs.MM

    SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs

    Authors: Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. Hauptmann, Lu Jiang

    Abstract: In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details n…

    Submitted 28 October, 2023; v1 submitted 30 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023 spotlight

  6. arXiv:2306.08937  [pdf, other]

    cs.CL cs.IR

    DocumentNet: Bridging the Data Gap in Document Pre-Training

    Authors: Lijun Yu, Jin Miao, Xiaoyu Sun, Jiayi Chen, Alexander G. Hauptmann, Hanjun Dai, Wei Wei

    Abstract: Document understanding tasks, in particular, Visually-rich Document Entity Retrieval (VDER), have gained significant attention in recent years thanks to their broad applications in enterprise AI. However, publicly available data have been scarce for these tasks due to strict privacy constraints and high annotation costs. To make things worse, the non-overlapping entity spaces from different datase…

    Submitted 26 October, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: EMNLP 2023

  7. arXiv:2304.02173  [pdf, other]

    cs.CV cs.AI cs.MM

    ChartReader: A Unified Framework for Chart Derendering and Comprehension without Heuristic Rules

    Authors: Zhi-Qi Cheng, Qi Dai, Siyao Li, Jingdong Sun, Teruko Mitamura, Alexander G. Hauptmann

    Abstract: Charts are a powerful tool for visually conveying complex data, but their comprehension poses a challenge due to the diverse chart types and intricate components. Existing chart comprehension methods suffer from either heuristic rules or an over-reliance on OCR systems, resulting in suboptimal performance. To address these issues, we present ChartReader, a unified framework that seamlessly integra…

    Submitted 4 April, 2023; originally announced April 2023.

  8. arXiv:2212.05199  [pdf, other]

    cs.CV

    MAGVIT: Masked Generative Video Transformer

    Authors: Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang

    Abstract: We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MA…

    Submitted 4 April, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

    Comments: CVPR 2023 highlight

  9. arXiv:2208.08965  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement

    Authors: Zhi-Qi Cheng, Qi Dai, Siyao Li, Teruko Mitamura, Alexander G. Hauptmann

    Abstract: Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of images for "human-like" event understanding. Specifically, GSR task not only detects the salient activity verb (e.g. buying), but also predicts all corresponding semantic roles (e.g. agent and goods). Inspired by object detection and image captioning tasks, existing methods typically employ a two-stage framework…

    Submitted 28 November, 2022; v1 submitted 18 August, 2022; originally announced August 2022.

    Comments: ACM Multimedia 2022 (Oral), Code: https://github.com/zhiqic/GSRFormer

  10. arXiv:2206.05253  [pdf, other]

    cs.CV cs.AI cs.LG stat.AP

    Rethinking Spatial Invariance of Convolutional Networks for Object Counting

    Authors: Zhi-Qi Cheng, Qi Dai, Hong Li, JingKuan Song, Xiao Wu, Alexander G. Hauptmann

    Abstract: Previous work generally believes that improving the spatial invariance of convolutional networks is the key to object counting. However, after verifying several mainstream counting networks, we surprisingly found too strict pixel-level spatial invariance would cause overfit noise in the density map generation. In this paper, we try to use locally connected Gaussian kernels to replace the original…

    Submitted 18 August, 2022; v1 submitted 10 June, 2022; originally announced June 2022.

    Comments: Accepted to CVPR 2022, Code: https://github.com/zhiqic/Rethinking-Counting

  11. arXiv:2201.05290  [pdf, other]

    cs.CV cs.MM

    Argus++: Robust Real-time Activity Detection for Unconstrained Video Streams with Overlapping Cube Proposals

    Authors: Lijun Yu, Yijun Qian, Wenhe Liu, Alexander G. Hauptmann

    Abstract: Activity detection is one of the attractive computer vision tasks to exploit the video streams captured by widely installed cameras. Although achieving impressive performance, conventional activity detection algorithms are usually designed under certain constraints, such as using trimmed and/or object-centered video clips as inputs. Therefore, they failed to deal with the multi-scale multi-instanc…

    Submitted 13 January, 2022; originally announced January 2022.

  12. arXiv:2105.00379  [pdf, other]

    cs.CV

    Subspace Representation Learning for Few-shot Image Classification

    Authors: Ting-Yao Hu, Zhi-Qi Cheng, Alexander G. Hauptmann

    Abstract: In this paper, we propose a subspace representation learning (SRL) framework to tackle few-shot image classification tasks. It exploits a subspace in local CNN feature space to represent an image, and measures the similarity between two images according to a weighted subspace distance (WSD). When K images are available for each class, we develop two types of template subspaces to aggregate K-shot…

    Submitted 4 May, 2021; v1 submitted 1 May, 2021; originally announced May 2021.

  13. arXiv:2102.10033  [pdf, other]

    cs.CV

    Pose Guided Person Image Generation with Hidden p-Norm Regression

    Authors: Ting-Yao Hu, Alexander G. Hauptmann

    Abstract: In this paper, we propose a novel approach to solve the pose guided person image generation task. We assume that the relation between pose and appearance information can be described by a simple matrix operation in hidden space. Based on this assumption, our method estimates a pose-invariant feature matrix for each identity, and uses it to predict the target appearance conditioned on the target po…

    Submitted 19 February, 2021; originally announced February 2021.

    Journal ref: ICIP 2021

  14. arXiv:2011.00147  [pdf, other]

    cs.CV

    Pixel-Level Cycle Association: A New Perspective for Domain Adaptive Semantic Segmentation

    Authors: Guoliang Kang, Yunchao Wei, Yi Yang, Yueting Zhuang, Alexander G. Hauptmann

    Abstract: Domain adaptive semantic segmentation aims to train a model performing satisfactory pixel-level predictions on the target with only out-of-domain (source) annotations. The conventional solution to this task is to minimize the discrepancy between source and target to enable effective knowledge transfer. Previous domain discrepancy minimization methods are mainly based on the adversarial training. T…

    Submitted 30 October, 2020; originally announced November 2020.

    Comments: Accepted by NeurIPS 2020 (oral). Code: https://github.com/kgl-prml/Pixel-Level-Cycle-Association

  15. arXiv:2002.00137  [pdf, other]

    cs.CV

    Training-free Monocular 3D Event Detection System for Traffic Surveillance

    Authors: Lijun Yu, Peng Chen, Wenhe Liu, Guoliang Kang, Alexander G. Hauptmann

    Abstract: We focus on the problem of detecting traffic events in a surveillance scenario, including the detection of both vehicle actions and traffic collisions. Existing event detection systems are mostly learning-based and have achieved convincing performance when a large amount of training data is available. However, in real-world scenarios, collecting sufficient labeled training data is expensive and so…

    Submitted 31 January, 2020; originally announced February 2020.

    Comments: To be published in 2019 IEEE International Conference on Big Data (Big Data), IEEE

  16. arXiv:1901.00976  [pdf, other]

    cs.CV

    Contrastive Adaptation Network for Unsupervised Domain Adaptation

    Authors: Guoliang Kang, Lu Jiang, Yi Yang, Alexander G Hauptmann

    Abstract: Unsupervised Domain Adaptation (UDA) makes predictions for the target domain data while manual annotations are only available in the source domain. Previous methods minimize the domain discrepancy neglecting the class information, which may lead to misalignment and poor generalization performance. To address this issue, this paper proposes Contrastive Adaptation Network (CAN) optimizing a new metr…

    Submitted 10 April, 2019; v1 submitted 3 January, 2019; originally announced January 2019.

    Comments: Accepted by CVPR 2019

  17. arXiv:1808.01119  [pdf, other]

    cs.CV

    Multi-shot Person Re-identification through Set Distance with Visual Distributional Representation

    Authors: Ting-Yao Hu, Xiaojun Chang, Alexander G. Hauptmann

    Abstract: Person re-identification aims to identify a specific person at distinct times and locations. It is challenging because of occlusion, illumination, and viewpoint change in camera views. Recently, multi-shot person re-id task receives more attention since it is closer to real-world application. A key point of a good algorithm for multi-shot person re-id is the temporal aggregation of the person appe…

    Submitted 8 November, 2018; v1 submitted 3 August, 2018; originally announced August 2018.

  18. arXiv:1804.09288  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    A Closer Look at Weak Label Learning for Audio Events

    Authors: Ankit Shah, Anurag Kumar, Alexander G. Hauptmann, Bhiksha Raj

    Abstract: Audio content analysis in terms of sound events is an important research problem for a variety of applications. Recently, the development of weak labeling approaches for audio or sound event detection (AED) and availability of large scale weakly labeled dataset have finally opened up the possibility of large scale AED. However, a deeper understanding of how weak labels affect the learning for soun…

    Submitted 24 April, 2018; originally announced April 2018.

    Comments: 10 pages

  19. arXiv:1712.06679  [pdf, other]

    cs.CV

    DecideNet: Counting Varying Density Crowds Through Attention Guided Detection and Density Estimation

    Authors: Jiang Liu, Chenqiang Gao, Deyu Meng, Alexander G. Hauptmann

    Abstract: In real-world crowd counting applications, the crowd densities vary greatly in spatial and temporal domains. A detection based counting method will estimate crowds accurately in low density scenes, while its reliability in congested areas is downgraded. A regression based approach, on the other hand, captures the general density information in crowded regions. Without knowing the location of each…

    Submitted 6 March, 2018; v1 submitted 18 December, 2017; originally announced December 2017.

    Comments: CVPR 2018

  20. arXiv:1707.07791  [pdf, other]

    cs.CV

    Deep Feature Learning via Structured Graph Laplacian Embedding for Person Re-Identification

    Authors: De Cheng, Yihong Gong, Zhihui Li, Weiwei Shi, Alexander G. Hauptmann, Nanning Zheng

    Abstract: Learning the distance metric between pairs of examples is of great importance for visual recognition, especially for person re-identification (Re-Id). Recently, the contrastive and triplet loss are proposed to enhance the discriminative power of the deeply learned features, and have achieved remarkable success. As can be seen, either the contrastive or triplet loss is just one special case of the…

    Submitted 24 July, 2017; originally announced July 2017.

    Comments: 9 pages, 4 figures

  21. arXiv:1707.01408  [pdf, other]

    cs.CV

    Video Representation Learning and Latent Concept Mining for Large-scale Multi-label Video Classification

    Authors: Po-Yao Huang, Ye Yuan, Zhenzhong Lan, Lu Jiang, Alexander G. Hauptmann

    Abstract: We report on CMU Informedia Lab's system used in Google's YouTube 8 Million Video Understanding Challenge. In this multi-label video classification task, our pipeline achieved 84.675% and 84.662% GAP on our evaluation split and the official test set. We attribute the good performance to three components: 1) Refined video representation learning with residual links and hypercolumns 2) Latent concep…

    Submitted 25 July, 2017; v1 submitted 5 July, 2017; originally announced July 2017.

  22. arXiv:1704.00389  [pdf, other]

    cs.CV cs.LG cs.MM

    Hidden Two-Stream Convolutional Networks for Action Recognition

    Authors: Yi Zhu, Zhenzhong Lan, Shawn Newsam, Alexander G. Hauptmann

    Abstract: Analyzing videos of human actions involves understanding the temporal relationships among video frames. State-of-the-art action recognition approaches rely on traditional optical flow estimation methods to pre-compute motion information for CNNs. Such a two-stage approach is computationally expensive, storage demanding, and not end-to-end trainable. In this paper, we present a novel CNN architectu…

    Submitted 30 October, 2018; v1 submitted 2 April, 2017; originally announced April 2017.

    Comments: Accepted at ACCV 2018, camera ready. Code available at https://github.com/bryanyzhu/Hidden-Two-Stream

  23. arXiv:1702.02295  [pdf, other]

    cs.CV

    Guided Optical Flow Learning

    Authors: Yi Zhu, Zhenzhong Lan, Shawn Newsam, Alexander G. Hauptmann

    Abstract: We study the unsupervised learning of CNNs for optical flow estimation using proxy ground truth data. Supervised CNNs, due to their immense learning capacity, have shown superior performance on a range of computer vision problems including optical flow prediction. They however require the ground truth flow which is usually not accessible except on limited synthetic data. Without the guidance of gr…

    Submitted 1 July, 2017; v1 submitted 8 February, 2017; originally announced February 2017.

    Comments: CVPR17 Workshop. Code available at https://github.com/bryanyzhu/GuidedNet

  24. arXiv:1702.01229  [pdf, other]

    cs.LG stat.ML

    Simple to Complex Cross-modal Learning to Rank

    Authors: Minnan Luo, Xiaojun Chang, Zhihui Li, Liqiang Nie, Alexander G. Hauptmann, Qinghua Zheng

    Abstract: The heterogeneity-gap between different modalities brings a significant challenge to multimedia information retrieval. Some studies formalize the cross-modal retrieval tasks as a ranking problem and learn a shared multi-modal embedding space to measure the cross-modality similarity. However, previous methods often establish the shared embedding space based on linear mapping functions which might n…

    Submitted 7 July, 2017; v1 submitted 3 February, 2017; originally announced February 2017.

    Comments: 14 pages; Accepted by Computer Vision and Image Understanding

  25. arXiv:1701.07368  [pdf, ps, other]

    cs.CV

    Deep Local Video Feature for Action Recognition

    Authors: Zhenzhong Lan, Yi Zhu, Alexander G. Hauptmann

    Abstract: We investigate the problem of representing an entire video using CNN features for human action recognition. Currently, limited by GPU memory, we have not been able to feed a whole video into CNN/RNNs for end-to-end learning. A common practice is to use sampled frames as inputs and video labels as supervision. One major problem of this popular approach is that the local samples may not contain the…

    Submitted 28 January, 2017; v1 submitted 25 January, 2017; originally announced January 2017.

  26. arXiv:1610.02984  [pdf, other]

    cs.CV

    Person Re-identification: Past, Present and Future

    Authors: Liang Zheng, Yi Yang, Alexander G. Hauptmann

    Abstract: Person re-identification (re-ID) has become increasingly popular in the community due to its application and research significance. It aims at spotting a person of interest in other cameras. In the early days, hand-crafted algorithms and small-scale evaluation were predominantly reported. Recent years have witnessed the emergence of large-scale datasets and deep learning systems which make use of…

    Submitted 10 October, 2016; originally announced October 2016.

    Comments: 20 pages, 5 tables, 10 images

  27. arXiv:1608.03748  [pdf, other]

    cs.CV

    Self-paced Learning for Weakly Supervised Evidence Discovery in Multimedia Event Search

    Authors: Mengyi Liu, Lu Jiang, Shiguang Shan, Alexander G. Hauptmann

    Abstract: Multimedia event detection has been receiving increasing attention in recent years. Besides recognizing an event, the discovery of evidences (which is referred to as "recounting") is also crucial for user to better understand the searching result. Due to the difficulty of evidence annotation, only limited supervision of event labels are available for training a recounting model. To deal with the pr…

    Submitted 23 October, 2017; v1 submitted 12 August, 2016; originally announced August 2016.

    Comments: This paper has been withdrawn by the author due to a crucial error in tables

  28. arXiv:1606.05705  [pdf, other]

    cs.IR cs.CV cs.MM

    Strategies for Searching Video Content with Text Queries or Video Examples

    Authors: Shoou-I Yu, Yi Yang, Zhongwen Xu, Shicheng Xu, Deyu Meng, Zexi Mao, Zhigang Ma, Ming Lin, Xuanchong Li, Huan Li, Zhenzhong Lan, Lu Jiang, Alexander G. Hauptmann, Chuang Gan, Xingzhong Du, Xiaojun Chang

    Abstract: The large number of user-generated videos uploaded on to the Internet everyday has led to many commercial video search engines, which mainly rely on text metadata for search. However, metadata is often lacking for user-generated videos, thus these videos are unsearchable by current search engines. Therefore, content-based video retrieval (CBVR) tackles this metadata-scarcity problem by directly an…

    Submitted 17 June, 2016; originally announced June 2016.

  29. arXiv:1604.07468  [pdf, other]

    cs.CV

    Long-Term Identity-Aware Multi-Person Tracking for Surveillance Video Summarization

    Authors: Shoou-I Yu, Yi Yang, Xuanchong Li, Alexander G. Hauptmann

    Abstract: Multi-person tracking plays a critical role in the analysis of surveillance video. However, most existing work focus on shorter-term (e.g. minute-long or hour-long) video sequences. Therefore, we propose a multi-person tracking algorithm for very long-term (e.g. month-long) multi-camera surveillance scenarios. Long-term tracking is challenging because 1) the apparel/appearance of the same person w…

    Submitted 10 April, 2017; v1 submitted 25 April, 2016; originally announced April 2016.

  30. arXiv:1601.03679  [pdf, other]

    cs.CV

    Dynamic Concept Composition for Zero-Example Event Detection

    Authors: Xiaojun Chang, Yi Yang, Guodong Long, Chengqi Zhang, Alexander G. Hauptmann

    Abstract: In this paper, we focus on automatically detecting events in unconstrained videos without the use of any visual training exemplars. In principle, zero-shot learning makes it possible to train an event detection model based on the assumption that events (e.g. birthday party) can be described by multiple mid-level semantic concepts (e.g. "blowing candle", "birthday cake"). Towards this goal,…

    Submitted 14 January, 2016; originally announced January 2016.

    Comments: 7 pages, AAAI 2016

  31. arXiv:1512.03740  [pdf, other]

    cs.CV

    Improving Human Activity Recognition Through Ranking and Re-ranking

    Authors: Zhenzhong Lan, Shoou-I Yu, Alexander G. Hauptmann

    Abstract: We propose two well-motivated ranking-based methods to enhance the performance of current state-of-the-art human activity recognition systems. First, as an improvement over the classic power normalization method, we propose a parameter-free ranking technique called rank normalization (RaN). RaN normalizes each dimension of the video features to address the sparse and bursty distribution problems o…

    Submitted 11 December, 2015; originally announced December 2015.

  32. arXiv:1511.05045  [pdf, other]

    cs.CV

    Handcrafted Local Features are Convolutional Neural Networks

    Authors: Zhenzhong Lan, Shoou-I Yu, Ming Lin, Bhiksha Raj, Alexander G. Hauptmann

    Abstract: Image and video classification research has made great progress through the development of handcrafted local features and learning based features. These two architectures were proposed roughly at the same time and have flourished at overlapping stages of history. However, they are typically viewed as distinct approaches. In this paper, we emphasize their structural similarities and show how such a…

    Submitted 19 November, 2015; v1 submitted 16 November, 2015; originally announced November 2015.

  33. arXiv:1511.04670  [pdf, other]

    cs.CV

    Uncovering Temporal Context for Video Question and Answering

    Authors: Linchao Zhu, Zhongwen Xu, Yi Yang, Alexander G. Hauptmann

    Abstract: In this work, we introduce Video Question Answering in temporal domain to infer the past, describe the present and predict the future. We present an encoder-decoder approach using Recurrent Neural Networks to learn temporal structures of videos and introduce a dual-channel ranking loss to answer multiple-choice questions. We explore approaches for finer understanding of video content using questio…

    Submitted 15 November, 2015; originally announced November 2015.

  34. arXiv:1510.04565  [pdf, other]

    cs.CV

    Beyond Spatial Pyramid Matching: Space-time Extended Descriptor for Action Recognition

    Authors: Zhenzhong Lan, Alexander G. Hauptmann

    Abstract: We address the problem of generating video features for action recognition. The spatial pyramid and its variants have been very popular feature models due to their success in balancing spatial location encoding and spatial invariance. Although it seems straightforward to extend spatial pyramid to the temporal domain (spatio-temporal pyramid), the large spatio-temporal diversity of unconstrained vi…

    Submitted 15 October, 2015; originally announced October 2015.

  35. arXiv:1502.04132  [pdf, other]

    cs.CV

    Long-short Term Motion Feature for Action Classification and Retrieval

    Authors: Zhenzhong Lan, Xuanchong Li, Ming Lin, Alexander G. Hauptmann

    Abstract: We propose a method for representing motion information for video classification and retrieval. We improve upon local descriptor based methods that have been among the most popular and successful models for representing videos. The desired local descriptors need to satisfy two requirements: 1) to be representative, 2) to be discriminative. Therefore, they need to occur frequently enough in the vid…

    Submitted 13 February, 2015; originally announced February 2015.

    Comments: arXiv admin note: text overlap with arXiv:1411.6660

  36. arXiv:1411.6660  [pdf, other]

    cs.CV

    Beyond Gaussian Pyramid: Multi-skip Feature Stacking for Action Recognition

    Authors: Zhenzhong Lan, Ming Lin, Xuanchong Li, Alexander G. Hauptmann, Bhiksha Raj

    Abstract: Most state-of-the-art action feature extractors involve differential operators, which act as highpass filters and tend to attenuate low frequency action information. This attenuation introduces bias to the resulting features and generates ill-conditioned feature matrices. The Gaussian Pyramid has been used as a feature enhancing technique that encodes scale-invariant characteristics into the featu…

    Submitted 19 April, 2015; v1 submitted 24 November, 2014; originally announced November 2014.

  37. arXiv:1411.4006  [pdf, other]

    cs.CV

    A Discriminative CNN Video Representation for Event Detection

    Authors: Zhongwen Xu, Yi Yang, Alexander G. Hauptmann

    Abstract: In this paper, we propose a discriminative video representation for event detection over a large scale video dataset when only limited hardware resources are available. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where only frame level static descriptors can be extracted by the existing CNN toolkit. This paper makes two c…

    Submitted 14 November, 2014; originally announced November 2014.

  38. arXiv:1408.7071  [pdf, other]

    cs.CV

    Temporal Extension of Scale Pyramid and Spatial Pyramid Matching for Action Recognition

    Authors: Zhenzhong Lan, Xuanchong Li, Alexander G. Hauptmann

    Abstract: Historically, researchers in the field have spent a great deal of effort to create image representations that have scale invariance and retain spatial location information. This paper proposes to encode equivalent temporal characteristics in video representations for action recognition. To achieve temporal scale invariance, we develop a method called temporal scale pyramid (TSP). To encode tempora…

    Submitted 29 August, 2014; originally announced August 2014.

  39. arXiv:1207.1423  [pdf]

    cs.LG cs.DB stat.ML

    Mining Associated Text and Images with Dual-Wing Harmoniums

    Authors: Eric P. Xing, Rong Yan, Alexander G. Hauptmann

    Abstract: We propose a multi-wing harmonium model for mining multimedia data that extends and improves on earlier models based on two-layer random fields, which capture bidirectional dependencies between hidden topic aspects and observed inputs. This model can be viewed as an undirected counterpart of the two-layer directed models such as LDA for similar tasks, but bears significant difference in inference/…

    Submitted 4 July, 2012; originally announced July 2012.

    Comments: Appears in Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI2005)

    Report number: UAI-P-2005-PG-633-641