Search | arXiv e-print repository

Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model

Authors: Haogeng Liu, Quanzeng You, Xiaotian Han, Yongfei Liu, Huaibo Huang, Ran He, Hongxia Yang

Abstract: In the realm of Multimodal Large Language Models (MLLMs), vision-language connector plays a crucial role to link the pre-trained vision encoders with Large Language Models (LLMs). Despite its importance, the vision-language connector has been relatively less explored. In this study, we aim to propose a strong vision-language connector that enables MLLMs to achieve high accuracy while maintain low… ▽ More In the realm of Multimodal Large Language Models (MLLMs), vision-language connector plays a crucial role to link the pre-trained vision encoders with Large Language Models (LLMs). Despite its importance, the vision-language connector has been relatively less explored. In this study, we aim to propose a strong vision-language connector that enables MLLMs to achieve high accuracy while maintain low computation cost. We first reveal the existence of the visual anchors in Vision Transformer and propose a cost-effective search algorithm to extract them. Building on these findings, we introduce the Anchor Former (AcFormer), a novel vision-language connector designed to leverage the rich prior knowledge obtained from these visual anchors during pretraining, guiding the aggregation of information. Through extensive experimentation, we demonstrate that the proposed method significantly reduces computational costs by nearly two-thirds compared with baseline, while simultaneously outperforming baseline methods. This highlights the effectiveness and efficiency of AcFormer. △ Less

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2403.18361 [pdf, other]

ViTAR: Vision Transformer with Any Resolution

Authors: Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang

Abstract: This paper tackles a significant challenge faced by Vision Transformers (ViTs): their constrained scalability across different image resolutions. Typically, ViTs experience a performance decline when processing resolutions different from those seen during training. Our work introduces two key innovations to address this issue. Firstly, we propose a novel module for dynamic resolution adjustment, d… ▽ More This paper tackles a significant challenge faced by Vision Transformers (ViTs): their constrained scalability across different image resolutions. Typically, ViTs experience a performance decline when processing resolutions different from those seen during training. Our work introduces two key innovations to address this issue. Firstly, we propose a novel module for dynamic resolution adjustment, designed with a single Transformer block, specifically to achieve highly efficient incremental token integration. Secondly, we introduce fuzzy positional encoding in the Vision Transformer to provide consistent positional awareness across multiple resolutions, thereby preventing overfitting to any single training resolution. Our resulting model, ViTAR (Vision Transformer with Any Resolution), demonstrates impressive adaptability, achieving 83.3\% top-1 accuracy at a 1120x1120 resolution and 80.4\% accuracy at a 4032x4032 resolution, all while reducing computational costs. ViTAR also shows strong performance in downstream tasks such as instance and semantic segmentation and can easily combined with self-supervised learning techniques like Masked AutoEncoder. Our work provides a cost-effective solution for enhancing the resolution scalability of ViTs, paving the way for more versatile and efficient high-resolution image processing. △ Less

Submitted 28 March, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.01487 [pdf, other]

InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding

Authors: Haogeng Liu, Quanzeng You, Xiaotian Han, Yiqi Wang, Bohan Zhai, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang

Abstract: Multimodal Large Language Models (MLLMs) have experienced significant advancements recently. Nevertheless, challenges persist in the accurate recognition and comprehension of intricate details within high-resolution images. Despite being indispensable for the development of robust MLLMs, this area remains underinvestigated. To tackle this challenge, our work introduces InfiMM-HD, a novel architect… ▽ More Multimodal Large Language Models (MLLMs) have experienced significant advancements recently. Nevertheless, challenges persist in the accurate recognition and comprehension of intricate details within high-resolution images. Despite being indispensable for the development of robust MLLMs, this area remains underinvestigated. To tackle this challenge, our work introduces InfiMM-HD, a novel architecture specifically designed for processing images of different resolutions with low computational overhead. This innovation facilitates the enlargement of MLLMs to higher-resolution capabilities. InfiMM-HD incorporates a cross-attention module and visual windows to reduce computation costs. By integrating this architectural design with a four-stage training pipeline, our model attains improved visual perception efficiently and cost-effectively. Empirical study underscores the robustness and effectiveness of InfiMM-HD, opening new avenues for exploration in related areas. Codes and models can be found at https://huggingface.co/Infi-MM/infimm-hd △ Less

Submitted 3 March, 2024; originally announced March 2024.

arXiv:2401.08968 [pdf, other]

COCO is "ALL'' You Need for Visual Instruction Fine-tuning

Authors: Xiaotian Han, Yiqi Wang, Bohan Zhai, Quanzeng You, Hongxia Yang

Abstract: Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. Visual instruction fine-tuning (IFT) is a vital process for aligning MLLMs' output with user's intentions. High-quality and diversified instruction following data is the key to this fine-tuning process. Recent studies propose to construct visual IFT datasets through a multifaceted approach… ▽ More Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. Visual instruction fine-tuning (IFT) is a vital process for aligning MLLMs' output with user's intentions. High-quality and diversified instruction following data is the key to this fine-tuning process. Recent studies propose to construct visual IFT datasets through a multifaceted approach: transforming existing datasets with rule-based templates, employing GPT-4 for rewriting annotations, and utilizing GPT-4V for visual dataset pseudo-labeling. LLaVA-1.5 adopted similar approach and construct LLaVA-mix-665k, which is one of the simplest, most widely used, yet most effective IFT datasets today. Notably, when properly fine-tuned with this dataset, MLLMs can achieve state-of-the-art performance on several benchmarks. However, we noticed that models trained with this dataset often struggle to follow user instructions properly in multi-round dialog. In addition, tradition caption and VQA evaluation benchmarks, with their closed-form evaluation structure, are not fully equipped to assess the capabilities of modern open-ended generative MLLMs. This problem is not unique to the LLaVA-mix-665k dataset, but may be a potential issue in all IFT datasets constructed from image captioning or VQA sources, though the extent of this issue may vary. We argue that datasets with diverse and high-quality detailed instruction following annotations are essential and adequate for MLLMs IFT. In this work, we establish a new IFT dataset, with images sourced from the COCO dataset along with more diverse instructions. Our experiments show that when fine-tuned with out proposed dataset, MLLMs achieve better performance on open-ended evaluation benchmarks in both single-round and multi-round dialog setting. △ Less

Submitted 16 January, 2024; originally announced January 2024.

arXiv:2401.06805 [pdf, other]

Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning

Authors: Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, Hongxia Yang

Abstract: Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract reasoning ability is the goal of next-generation AI. Recent advancements in Large Language Models (LLMs), along with the emerging field of Multimodal Large Language Models (MLLMs), have demonstrated impressive capabilities across a wide range of multimodal tasks and applications. Particularly, various… ▽ More Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract reasoning ability is the goal of next-generation AI. Recent advancements in Large Language Models (LLMs), along with the emerging field of Multimodal Large Language Models (MLLMs), have demonstrated impressive capabilities across a wide range of multimodal tasks and applications. Particularly, various MLLMs, each with distinct model architectures, training data, and training stages, have been evaluated across a broad range of MLLM benchmarks. These studies have, to varying degrees, revealed different aspects of the current capabilities of MLLMs. However, the reasoning abilities of MLLMs have not been systematically investigated. In this survey, we comprehensively review the existing evaluation protocols of multimodal reasoning, categorize and illustrate the frontiers of MLLMs, introduce recent trends in applications of MLLMs on reasoning-intensive tasks, and finally discuss current practices and future directions. We believe our survey establishes a solid base and sheds light on this important topic, multimodal reasoning. △ Less

Submitted 18 January, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

arXiv:2312.05742 [pdf, other]

The Generalization Gap in Offline Reinforcement Learning

Authors: Ishita Mediratta, Qingfei You, Minqi Jiang, Roberta Raileanu

Abstract: Despite recent progress in offline learning, these methods are still trained and tested on the same environment. In this paper, we compare the generalization abilities of widely used online and offline learning methods such as online reinforcement learning (RL), offline RL, sequence modeling, and behavioral cloning. Our experiments show that offline learning algorithms perform worse on new environ… ▽ More Despite recent progress in offline learning, these methods are still trained and tested on the same environment. In this paper, we compare the generalization abilities of widely used online and offline learning methods such as online reinforcement learning (RL), offline RL, sequence modeling, and behavioral cloning. Our experiments show that offline learning algorithms perform worse on new environments than online learning ones. We also introduce the first benchmark for evaluating generalization in offline learning, collecting datasets of varying sizes and skill-levels from Procgen (2D video games) and WebShop (e-commerce websites). The datasets contain trajectories for a limited number of game levels or natural language instructions and at test time, the agent has to generalize to new levels or instructions. Our experiments reveal that existing offline learning algorithms struggle to match the performance of online RL on both train and test environments. Behavioral cloning is a strong baseline, outperforming state-of-the-art offline RL and sequence modeling approaches when trained on data from multiple environments and tested on new ones. Finally, we find that increasing the diversity of the data, rather than its size, improves performance on new environments for all offline learning algorithms. Our study demonstrates the limited generalization of current offline learning algorithms highlighting the need for more research in this area. △ Less

Submitted 14 March, 2024; v1 submitted 9 December, 2023; originally announced December 2023.

Comments: Published as a conference paper at ICLR 2024; First two authors contributed equally

arXiv:2312.01408 [pdf, other]

Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts

Authors: Tianqi Chen, Yongfei Liu, Zhendong Wang, Jianbo Yuan, Quanzeng You, Hongxia Yang, Mingyuan Zhou

Abstract: In light of the remarkable success of in-context learning in large language models, its potential extension to the vision domain, particularly with visual foundation models like Stable Diffusion, has sparked considerable interest. Existing approaches in visual in-context learning frequently face hurdles such as expensive pretraining, limiting frameworks, inadequate visual comprehension, and limite… ▽ More In light of the remarkable success of in-context learning in large language models, its potential extension to the vision domain, particularly with visual foundation models like Stable Diffusion, has sparked considerable interest. Existing approaches in visual in-context learning frequently face hurdles such as expensive pretraining, limiting frameworks, inadequate visual comprehension, and limited adaptability to new tasks. In response to these challenges, we introduce improved Prompt Diffusion (iPromptDiff) in this study. iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector. This vector is subsequently used to modulate the token embeddings of text prompts. We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks and excels in in-context learning for novel vision tasks, such as normal-to-image or image-to-line transformations. The effectiveness of these capabilities relies heavily on a deep visual understanding, which is achieved through relevant visual demonstrations processed by our proposed in-context learning architecture. △ Less

Submitted 3 December, 2023; originally announced December 2023.

arXiv:2311.17126 [pdf, other]

Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis

Authors: Xiaohui Chen, Yongfei Liu, Yingxiang Yang, Jianbo Yuan, Quanzeng You, Li-** Liu, Hongxia Yang

Abstract: Recent advancements in text-to-image (T2I) generative models have shown remarkable capabilities in producing diverse and imaginative visuals based on text prompts. Despite the advancement, these diffusion models sometimes struggle to translate the semantic content from the text into images entirely. While conditioning on the layout has shown to be effective in improving the compositional ability o… ▽ More Recent advancements in text-to-image (T2I) generative models have shown remarkable capabilities in producing diverse and imaginative visuals based on text prompts. Despite the advancement, these diffusion models sometimes struggle to translate the semantic content from the text into images entirely. While conditioning on the layout has shown to be effective in improving the compositional ability of T2I diffusion models, they typically require manual layout input. In this work, we introduce a novel approach to improving T2I diffusion models using Large Language Models (LLMs) as layout generators. Our method leverages the Chain-of-Thought prompting of LLMs to interpret text and generate spatially reasonable object layouts. The generated layout is then used to enhance the generated images' composition and spatial accuracy. Moreover, we propose an efficient adapter based on a cross-attention mechanism, which explicitly integrates the layout information into the stable diffusion models. Our experiments demonstrate significant improvements in image quality and layout accuracy, showcasing the potential of LLMs in augmenting generative image models. △ Less

Submitted 28 November, 2023; originally announced November 2023.

Comments: preprint

arXiv:2311.11567 [pdf, other]

InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models

Authors: Xiaotian Han, Quanzeng You, Yongfei Liu, Wentao Chen, Huangjie Zheng, Khalil Mrini, Xudong Lin, Yiqi Wang, Bohan Zhai, Jianbo Yuan, Heng Wang, Hongxia Yang

Abstract: Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. These models not only excel in traditional vision-language tasks but also demonstrate impressive performance in contemporary multi-modal benchmarks. Although many of these benchmarks attempt to holistically evaluate MLLMs, they typically concentrate on basic reasoning tasks, often yielding… ▽ More Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. These models not only excel in traditional vision-language tasks but also demonstrate impressive performance in contemporary multi-modal benchmarks. Although many of these benchmarks attempt to holistically evaluate MLLMs, they typically concentrate on basic reasoning tasks, often yielding only simple yes/no or multi-choice responses. These methods naturally lead to confusion and difficulties in conclusively determining the reasoning capabilities of MLLMs. To mitigate this issue, we manually curate a benchmark dataset specifically designed for MLLMs, with a focus on complex reasoning tasks. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. The queries in our dataset are intentionally constructed to engage the reasoning capabilities of MLLMs in the process of generating answers. For a fair comparison across various MLLMs, we incorporate intermediate reasoning steps into our evaluation criteria. In instances where an MLLM is unable to produce a definitive answer, its reasoning ability is evaluated by requesting intermediate reasoning steps. If these steps align with our manual annotations, appropriate scores are assigned. This evaluation scheme resembles methods commonly used in human assessments, such as exams or assignments, and represents what we consider a more effective assessment technique compared with existing benchmarks. We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark, designed to challenge and accurately measure their reasoning capabilities. The code and data will be released at https://infimm.github.io/InfiMM-Eval/ △ Less

Submitted 4 December, 2023; v1 submitted 20 November, 2023; originally announced November 2023.

arXiv:2310.06389 [pdf, other]

Learning Stackable and Skippable LEGO Bricks for Efficient, Reconfigurable, and Variable-Resolution Diffusion Modeling

Authors: Huangjie Zheng, Zhendong Wang, Jianbo Yuan, Guanghan Ning, Pengcheng He, Quanzeng You, Hongxia Yang, Mingyuan Zhou

Abstract: Diffusion models excel at generating photo-realistic images but come with significant computational costs in both training and sampling. While various techniques address these computational challenges, a less-explored issue is designing an efficient and adaptable network backbone for iterative refinement. Current options like U-Net and Vision Transformer often rely on resource-intensive deep netwo… ▽ More Diffusion models excel at generating photo-realistic images but come with significant computational costs in both training and sampling. While various techniques address these computational challenges, a less-explored issue is designing an efficient and adaptable network backbone for iterative refinement. Current options like U-Net and Vision Transformer often rely on resource-intensive deep networks and lack the flexibility needed for generating images at variable resolutions or with a smaller network than used in training. This study introduces LEGO bricks, which seamlessly integrate Local-feature Enrichment and Global-content Orchestration. These bricks can be stacked to create a test-time reconfigurable diffusion backbone, allowing selective skip** of bricks to reduce sampling costs and generate higher-resolution images than the training data. LEGO bricks enrich local regions with an MLP and transform them using a Transformer block while maintaining a consistent full-resolution image across all bricks. Experimental results demonstrate that LEGO bricks enhance training efficiency, expedite convergence, and facilitate variable-resolution image generation while maintaining strong generative performance. Moreover, LEGO significantly reduces sampling time compared to other methods, establishing it as a valuable enhancement for diffusion models. Our code and project page are available at https://jegzheng.github.io/LEGODiffusion. △ Less

Submitted 27 June, 2024; v1 submitted 10 October, 2023; originally announced October 2023.

arXiv:2306.04774 [pdf, other]

RefineVIS: Video Instance Segmentation with Temporal Attention Refinement

Authors: Andre Abrantes, Jiang Wang, Peng Chu, Quanzeng You, Zicheng Liu

Abstract: We introduce a novel framework called RefineVIS for Video Instance Segmentation (VIS) that achieves good object association between frames and accurate segmentation masks by iteratively refining the representations using sequence context. RefineVIS learns two separate representations on top of an off-the-shelf frame-level image instance segmentation model: an association representation responsible… ▽ More We introduce a novel framework called RefineVIS for Video Instance Segmentation (VIS) that achieves good object association between frames and accurate segmentation masks by iteratively refining the representations using sequence context. RefineVIS learns two separate representations on top of an off-the-shelf frame-level image instance segmentation model: an association representation responsible for associating objects across frames and a segmentation representation that produces accurate segmentation masks. Contrastive learning is utilized to learn temporally stable association representations. A Temporal Attention Refinement (TAR) module learns discriminative segmentation representations by exploiting temporal relationships and a novel temporal contrastive denoising technique. Our method supports both online and offline inference. It achieves state-of-the-art video instance segmentation accuracy on YouTube-VIS 2019 (64.4 AP), Youtube-VIS 2021 (61.4 AP), and OVIS (46.1 AP) datasets. The visualization shows that the TAR module can generate more accurate instance segmentation masks, particularly for challenging cases such as highly occluded objects. △ Less

Submitted 7 June, 2023; originally announced June 2023.

arXiv:2208.11663 [pdf, other]

PEER: A Collaborative Language Model

Authors: Timo Schick, Jane Dwivedi-Yu, Zhengbao Jiang, Fabio Petroni, Patrick Lewis, Gautier Izacard, Qingfei You, Christoforos Nalmpantis, Edouard Grave, Sebastian Riedel

Abstract: Textual content is often the output of a collaborative writing process: We start with an initial draft, ask for suggestions, and repeatedly make changes. Agnostic of this process, today's language models are trained to generate only the final result. As a consequence, they lack several abilities crucial for collaborative writing: They are unable to update existing texts, difficult to control and i… ▽ More Textual content is often the output of a collaborative writing process: We start with an initial draft, ask for suggestions, and repeatedly make changes. Agnostic of this process, today's language models are trained to generate only the final result. As a consequence, they lack several abilities crucial for collaborative writing: They are unable to update existing texts, difficult to control and incapable of verbally planning or explaining their actions. To address these shortcomings, we introduce PEER, a collaborative language model that is trained to imitate the entire writing process itself: PEER can write drafts, add suggestions, propose edits and provide explanations for its actions. Crucially, we train multiple instances of PEER able to infill various parts of the writing process, enabling the use of self-training techniques for increasing the quality, amount and diversity of training data. This unlocks PEER's full potential by making it applicable in domains for which no edit histories are available and improving its ability to follow instructions, to write useful comments, and to explain its actions. We show that PEER achieves strong performance across various domains and editing tasks. △ Less

Submitted 24 August, 2022; originally announced August 2022.

arXiv:2206.07011 [pdf, other]

Consistent Video Instance Segmentation with Inter-Frame Recurrent Attention

Authors: Quanzeng You, Jiang Wang, Peng Chu, Andre Abrantes, Zicheng Liu

Abstract: Video instance segmentation aims at predicting object segmentation masks for each frame, as well as associating the instances across multiple frames. Recent end-to-end video instance segmentation methods are capable of performing object segmentation and instance association together in a direct parallel sequence decoding/prediction framework. Although these methods generally predict higher quality… ▽ More Video instance segmentation aims at predicting object segmentation masks for each frame, as well as associating the instances across multiple frames. Recent end-to-end video instance segmentation methods are capable of performing object segmentation and instance association together in a direct parallel sequence decoding/prediction framework. Although these methods generally predict higher quality object segmentation masks, they can fail to associate instances in challenging cases because they do not explicitly model the temporal instance consistency for adjacent frames. We propose a consistent end-to-end video instance segmentation framework with Inter-Frame Recurrent Attention to model both the temporal instance consistency for adjacent frames and the global temporal context. Our extensive experiments demonstrate that the Inter-Frame Recurrent Attention significantly improves temporal instance consistency while maintaining the quality of the object segmentation masks. Our model achieves state-of-the-art accuracy on both YouTubeVIS-2019 (62.1\%) and YouTubeVIS-2021 (54.7\%) datasets. In addition, quantitative and qualitative results show that the proposed methods predict more temporally consistent instance segmentation masks. △ Less

Submitted 14 June, 2022; originally announced June 2022.

Comments: 11 pages, 5 figures, 4 tables

arXiv:2203.12198 [pdf, other]

Deep Frequency Filtering for Domain Generalization

Authors: Shiqi Lin, Zhizheng Zhang, Zhipeng Huang, Yan Lu, Cuiling Lan, Peng Chu, Quanzeng You, Jiang Wang, Zicheng Liu, Amey Parulkar, Viraj Navkal, Zhibo Chen

Abstract: Improving the generalization ability of Deep Neural Networks (DNNs) is critical for their practical uses, which has been a longstanding challenge. Some theoretical studies have uncovered that DNNs have preferences for some frequency components in the learning process and indicated that this may affect the robustness of learned features. In this paper, we propose Deep Frequency Filtering (DFF) for… ▽ More Improving the generalization ability of Deep Neural Networks (DNNs) is critical for their practical uses, which has been a longstanding challenge. Some theoretical studies have uncovered that DNNs have preferences for some frequency components in the learning process and indicated that this may affect the robustness of learned features. In this paper, we propose Deep Frequency Filtering (DFF) for learning domain-generalizable features, which is the first endeavour to explicitly modulate the frequency components of different transfer difficulties across domains in the latent space during training. To achieve this, we perform Fast Fourier Transform (FFT) for the feature maps at different layers, then adopt a light-weight module to learn attention masks from the frequency representations after FFT to enhance transferable components while suppressing the components not conducive to generalization. Further, we empirically compare the effectiveness of adopting different types of attention designs for implementing DFF. Extensive experiments demonstrate the effectiveness of our proposed DFF and show that applying our DFF on a plain baseline outperforms the state-of-the-art methods on different domain generalization tasks, including close-set classification and open-set retrieval. △ Less

Submitted 25 March, 2023; v1 submitted 23 March, 2022; originally announced March 2022.

Comments: Accepted by CVPR2023

arXiv:2201.10654 [pdf, ps, other]

SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering

Authors: Peixi Xiong, Quanzeng You, Pei Yu, Zicheng Liu, Ying Wu

Abstract: Visual Question Answering (VQA) attracts much attention from both industry and academia. As a multi-modality task, it is challenging since it requires not only visual and textual understanding, but also the ability to align cross-modality representations. Previous approaches extensively employ entity-level alignments, such as the correlations between the visual regions and their semantic labels, o… ▽ More Visual Question Answering (VQA) attracts much attention from both industry and academia. As a multi-modality task, it is challenging since it requires not only visual and textual understanding, but also the ability to align cross-modality representations. Previous approaches extensively employ entity-level alignments, such as the correlations between the visual regions and their semantic labels, or the interactions across question words and object features. These attempts aim to improve the cross-modality representations, while ignoring their internal relations. Instead, we propose to apply structured alignments, which work with graph representation of visual and textual content, aiming to capture the deep connections between the visual and textual modalities. Nevertheless, it is nontrivial to represent and integrate graphs for structured alignments. In this work, we attempt to solve this issue by first converting different modality entities into sequential nodes and the adjacency graph, then incorporating them for structured alignments. As demonstrated in our experimental results, such a structured alignment improves reasoning performance. In addition, our model also exhibits better interpretability for each generated answer. The proposed model, without any pretraining, outperforms the state-of-the-art methods on GQA dataset, and beats the non-pretrained state-of-the-art methods on VQA-v2 dataset. △ Less

Submitted 25 January, 2022; originally announced January 2022.

arXiv:2112.06632 [pdf, other]

Lifelong Unsupervised Domain Adaptive Person Re-identification with Coordinated Anti-forgetting and Adaptation

Authors: Zhipeng Huang, Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Peng Chu, Quanzeng You, Jiang Wang, Zicheng Liu, Zheng-jun Zha

Abstract: Unsupervised domain adaptive person re-identification (ReID) has been extensively investigated to mitigate the adverse effects of domain gaps. Those works assume the target domain data can be accessible all at once. However, for the real-world streaming data, this hinders the timely adaptation to changing data statistics and sufficient exploitation of increasing samples. In this paper, to address… ▽ More Unsupervised domain adaptive person re-identification (ReID) has been extensively investigated to mitigate the adverse effects of domain gaps. Those works assume the target domain data can be accessible all at once. However, for the real-world streaming data, this hinders the timely adaptation to changing data statistics and sufficient exploitation of increasing samples. In this paper, to address more practical scenarios, we propose a new task, Lifelong Unsupervised Domain Adaptive (LUDA) person ReID. This is challenging because it requires the model to continuously adapt to unlabeled data in the target environments while alleviating catastrophic forgetting for such a fine-grained person retrieval task. We design an effective scheme for this task, dubbed CLUDA-ReID, where the anti-forgetting is harmoniously coordinated with the adaptation. Specifically, a meta-based Coordinated Data Replay strategy is proposed to replay old data and update the network with a coordinated optimization direction for both adaptation and memorization. Moreover, we propose Relational Consistency Learning for old knowledge distillation/inheritance in line with the objective of retrieval-based tasks. We set up two evaluation settings to simulate the practical application scenarios. Extensive experiments demonstrate the effectiveness of our CLUDA-ReID for both scenarios with stationary target streams and scenarios with dynamic target streams. △ Less

Submitted 29 March, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

Comments: Accepted by CVPR2022

arXiv:2111.15157 [pdf, other]

MMPTRACK: Large-scale Densely Annotated Multi-camera Multiple People Tracking Benchmark

Authors: Xiaotian Han, Quanzeng You, Chunyu Wang, Zhizheng Zhang, Peng Chu, Houdong Hu, Jiang Wang, Zicheng Liu

Abstract: Multi-camera tracking systems are gaining popularity in applications that demand high-quality tracking results, such as frictionless checkout because monocular multi-object tracking (MOT) systems often fail in cluttered and crowded environments due to occlusion. Multiple highly overlapped cameras can significantly alleviate the problem by recovering partial 3D information. However, the cost of cre… ▽ More Multi-camera tracking systems are gaining popularity in applications that demand high-quality tracking results, such as frictionless checkout because monocular multi-object tracking (MOT) systems often fail in cluttered and crowded environments due to occlusion. Multiple highly overlapped cameras can significantly alleviate the problem by recovering partial 3D information. However, the cost of creating a high-quality multi-camera tracking dataset with diverse camera settings and backgrounds has limited the dataset scale in this domain. In this paper, we provide a large-scale densely-labeled multi-camera tracking dataset in five different environments with the help of an auto-annotation system. The system uses overlapped and calibrated depth and RGB cameras to build a high-performance 3D tracker that automatically generates the 3D tracking results. The 3D tracking results are projected to each RGB camera view using camera parameters to create 2D tracking results. Then, we manually check and correct the 3D tracking results to ensure the label quality, which is much cheaper than fully manual annotation. We have conducted extensive experiments using two real-time multi-camera trackers and a person re-identification (ReID) model with different settings. This dataset provides a more reliable benchmark of multi-camera, multi-object tracking systems in cluttered and crowded environments. Also, our results demonstrate that adapting the trackers and ReID models on this dataset significantly improves their performance. Our dataset will be publicly released upon the acceptance of this work. △ Less

Submitted 30 November, 2021; originally announced November 2021.

arXiv:2106.13802 [pdf, other]

Efficient Document Image Classification Using Region-Based Graph Neural Network

Authors: Jaya Krishna Mandivarapu, Eric Bunch, Qian You, Glenn Fung

Abstract: Document image classification remains a popular research area because it can be commercialized in many enterprise applications across different industries. Recent advancements in large pre-trained computer vision and language models and graph neural networks has lent document image classification many tools. However using large pre-trained models usually requires substantial computing resources wh… ▽ More Document image classification remains a popular research area because it can be commercialized in many enterprise applications across different industries. Recent advancements in large pre-trained computer vision and language models and graph neural networks has lent document image classification many tools. However using large pre-trained models usually requires substantial computing resources which could defeat the cost-saving advantages of automatic document image classification. In the paper we propose an efficient document image classification framework that uses graph convolution neural networks and incorporates textual, visual and layout information of the document. We have rigorously benchmarked our proposed algorithm against several state-of-art vision and language models on both publicly available dataset and a real-life insurance document classification dataset. Empirical results on both publicly available and real-world data show that our methods achieve near SOTA performance yet require much less computing resources and time for model training and inference. This results in solutions than offer better cost advantages, especially in scalable deployment for enterprise applications. The results showed that our algorithm can achieve classification performance quite close to SOTA. We also provide comprehensive comparisons of computing resources, model sizes, train and inference time between our proposed methods and baselines. In addition we delineate the cost per image using our method and other baselines. △ Less

Submitted 25 June, 2021; originally announced June 2021.

arXiv:2106.06471 [pdf, other]

Writing by Memorizing: Hierarchical Retrieval-based Medical Report Generation

Authors: Xingyi Yang, Muchao Ye, Quanzeng You, Fenglong Ma

Abstract: Medical report generation is one of the most challenging tasks in medical image analysis. Although existing approaches have achieved promising results, they either require a predefined template database in order to retrieve sentences or ignore the hierarchical nature of medical report generation. To address these issues, we propose MedWriter that incorporates a novel hierarchical retrieval mechani… ▽ More Medical report generation is one of the most challenging tasks in medical image analysis. Although existing approaches have achieved promising results, they either require a predefined template database in order to retrieve sentences or ignore the hierarchical nature of medical report generation. To address these issues, we propose MedWriter that incorporates a novel hierarchical retrieval mechanism to automatically extract both report and sentence-level templates for clinically accurate report generation. MedWriter first employs the Visual-Language Retrieval~(VLR) module to retrieve the most relevant reports for the given images. To guarantee the logical coherence between sentences, the Language-Language Retrieval~(LLR) module is introduced to retrieve relevant sentences based on the previous generated description. At last, a language decoder fuses image features and features from retrieved reports and sentences to generate meaningful medical reports. We verified the effectiveness of our model by automatic evaluation and human evaluation on two datasets, i.e., Open-I and MIMIC-CXR. △ Less

Submitted 25 May, 2021; originally announced June 2021.

Comments: Accepted by ACL 2021, Camera-ready version

arXiv:2104.00194 [pdf, other]

TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking

Authors: Peng Chu, Jiang Wang, Quanzeng You, Haibin Ling, Zicheng Liu

Abstract: Tracking multiple objects in videos relies on modeling the spatial-temporal interactions of the objects. In this paper, we propose a solution named TransMOT, which leverages powerful graph transformers to efficiently model the spatial and temporal interactions among the objects. TransMOT effectively models the interactions of a large number of objects by arranging the trajectories of the tracked o… ▽ More Tracking multiple objects in videos relies on modeling the spatial-temporal interactions of the objects. In this paper, we propose a solution named TransMOT, which leverages powerful graph transformers to efficiently model the spatial and temporal interactions among the objects. TransMOT effectively models the interactions of a large number of objects by arranging the trajectories of the tracked objects as a set of sparse weighted graphs, and constructing a spatial graph transformer encoder layer, a temporal transformer encoder layer, and a spatial graph transformer decoder layer based on the graphs. TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy. To further improve the tracking speed and accuracy, we propose a cascade association framework to handle low-score detections and long-term occlusions that require large computational resources to model in TransMOT. The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20, and it achieves state-of-the-art performance on all the datasets. △ Less

Submitted 3 April, 2021; v1 submitted 31 March, 2021; originally announced April 2021.

arXiv:2103.13917 [pdf, other]

Disentanglement-based Cross-Domain Feature Augmentation for Effective Unsupervised Domain Adaptive Person Re-identification

Authors: Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Quanzeng You, Zicheng Liu, Kecheng Zheng, Zhibo Chen

Abstract: Unsupervised domain adaptive (UDA) person re-identification (ReID) aims to transfer the knowledge from the labeled source domain to the unlabeled target domain for person matching. One challenge is how to generate target domain samples with reliable labels for training. To address this problem, we propose a Disentanglement-based Cross-Domain Feature Augmentation (DCDFA) strategy, where the augment… ▽ More Unsupervised domain adaptive (UDA) person re-identification (ReID) aims to transfer the knowledge from the labeled source domain to the unlabeled target domain for person matching. One challenge is how to generate target domain samples with reliable labels for training. To address this problem, we propose a Disentanglement-based Cross-Domain Feature Augmentation (DCDFA) strategy, where the augmented features characterize well the target and source domain data distributions while inheriting reliable identity labels. Particularly, we disentangle each sample feature into a robust domain-invariant/shared feature and a domain-specific feature, and perform cross-domain feature recomposition to enhance the diversity of samples used in the training, with the constraints of cross-domain ReID loss and domain classification loss. Each recomposed feature, obtained based on the domain-invariant feature (which enables a reliable inheritance of identity) and an enhancement from a domain specific feature (which enables the approximation of real distributions), is thus an "ideal" augmentation. Extensive experimental results demonstrate the effectiveness of our method, which achieves the state-of-the-art performance. △ Less

Submitted 25 March, 2021; originally announced March 2021.

arXiv:2102.00426 [pdf, other]

A Simple yet Brisk and Efficient Active Learning Platform for Text Classification

Authors: Teja Kanchinadam, Qian You, Keith Westpfahl, James Kim, Siva Gunda, Sebastian Seith, Glenn Fung

Abstract: In this work, we propose the use of a fully managed machine learning service, which utilizes active learning to directly build models from unstructured data. With this tool, business users can quickly and easily build machine learning models and then directly deploy them into a production ready hosted environment without much involvement from data scientists. Our approach leverages state-of-the-ar… ▽ More In this work, we propose the use of a fully managed machine learning service, which utilizes active learning to directly build models from unstructured data. With this tool, business users can quickly and easily build machine learning models and then directly deploy them into a production ready hosted environment without much involvement from data scientists. Our approach leverages state-of-the-art text representation like OpenAI's GPT2 and a fast implementation of the active learning workflow that relies on a simple construction of incremental learning using linear models, thus providing a brisk and efficient labeling experience for the users. Experiments on both publicly available and real-life insurance datasets empirically show why our choices of simple and fast classification algorithms are ideal for the task at hand. △ Less

Submitted 31 January, 2021; originally announced February 2021.

arXiv:2012.02420 [pdf, other]

Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation

Authors: Junyu Luo, Zifei Zheng, Hanzhong Ye, Muchao Ye, Yaqing Wang, Quanzeng You, Cao Xiao, Fenglong Ma

Abstract: Patients with low health literacy usually have difficulty understanding medical jargon and the complex structure of professional medical language. Although some studies are proposed to automatically translate expert language into layperson-understandable language, only a few of them focus on both accuracy and readability aspects simultaneously in the clinical domain. Thus, simplification of the cl… ▽ More Patients with low health literacy usually have difficulty understanding medical jargon and the complex structure of professional medical language. Although some studies are proposed to automatically translate expert language into layperson-understandable language, only a few of them focus on both accuracy and readability aspects simultaneously in the clinical domain. Thus, simplification of the clinical language is still a challenging task, but unfortunately, it is not yet fully addressed in previous work. To benchmark this task, we construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches. Besides, we propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance compared with eight strong baselines. To fairly evaluate the performance, we also propose three specific evaluation metrics. Experimental results demonstrate the utility of the annotated MedLane dataset and the effectiveness of the proposed model DECLARE. △ Less

Submitted 21 September, 2023; v1 submitted 4 December, 2020; originally announced December 2020.

Comments: COLING 2022

Journal ref: 2022.coling-1.313

arXiv:2006.05480 [pdf]

DcardNet: Diabetic Retinopathy Classification at Multiple Levels Based on Structural and Angiographic Optical Coherence Tomography

Authors: Pengxiao Zang, Liqin Gao, Tristan T. Hormel, Jie Wang, Qisheng You, Thomas S. Hwang, Yali Jia

Abstract: Objective: Optical coherence tomography (OCT) and its angiography (OCTA) have several advantages for the early detection and diagnosis of diabetic retinopathy (DR). However, automated, complete DR classification frameworks based on both OCT and OCTA data have not been proposed. In this study, a convolutional neural network (CNN) based method is proposed to fulfill a DR classification framework usi… ▽ More Objective: Optical coherence tomography (OCT) and its angiography (OCTA) have several advantages for the early detection and diagnosis of diabetic retinopathy (DR). However, automated, complete DR classification frameworks based on both OCT and OCTA data have not been proposed. In this study, a convolutional neural network (CNN) based method is proposed to fulfill a DR classification framework using en face OCT and OCTA. Methods: A densely and continuously connected neural network with adaptive rate dropout (DcardNet) is designed for the DR classification. In addition, adaptive label smoothing was proposed and used to suppress overfitting. Three separate classification levels are generated for each case based on the International Clinical Diabetic Retinopathy scale. At the highest level the network classifies scans as referable or non-referable for DR. The second level classifies the eye as non-DR, non-proliferative DR (NPDR), or proliferative DR (PDR). The last level classifies the case as no DR, mild and moderate NPDR, severe NPDR, and PDR. Results: We used 10-fold cross-validation with 10% of the data to assess the networks performance. The overall classification accuracies of the three levels were 95.7%, 85.0%, and 71.0% respectively. Conclusion/Significance: A reliable, sensitive and specific automated classification framework for referral to an ophthalmologist can be a key technology for reducing vision loss related to DR. △ Less

Submitted 24 September, 2020; v1 submitted 9 June, 2020; originally announced June 2020.

Comments: Accepted for publication by IEEE Transactions on Biomedical Engineering

arXiv:2003.11753 [pdf, other]

Real-time 3D Deep Multi-Camera Tracking

Authors: Quanzeng You, Hao Jiang

Abstract: Tracking a crowd in 3D using multiple RGB cameras is a challenging task. Most previous multi-camera tracking algorithms are designed for offline setting and have high computational complexity. Robust real-time multi-camera 3D tracking is still an unsolved problem. In this work, we propose a novel end-to-end tracking pipeline, Deep Multi-Camera Tracking (DMCT), which achieves reliable real-time mul… ▽ More Tracking a crowd in 3D using multiple RGB cameras is a challenging task. Most previous multi-camera tracking algorithms are designed for offline setting and have high computational complexity. Robust real-time multi-camera 3D tracking is still an unsolved problem. In this work, we propose a novel end-to-end tracking pipeline, Deep Multi-Camera Tracking (DMCT), which achieves reliable real-time multi-camera people tracking. Our DMCT consists of 1) a fast and novel perspective-aware Deep GroudPoint Network, 2) a fusion procedure for ground-plane occupancy heatmap estimation, 3) a novel Deep Glimpse Network for person detection and 4) a fast and accurate online tracker. Our design fully unleashes the power of deep neural network to estimate the "ground point" of each person in each color image, which can be optimized to run efficiently and robustly. Our fusion procedure, glimpse network and tracker merge the results from different views, find people candidates using multiple video frames and then track people on the fused heatmap. Our system achieves the state-of-the-art tracking results while maintaining real-time performance. Apart from evaluation on the challenging WILDTRACK dataset, we also collect two more tracking datasets with high-quality labels from two different environments and camera settings. Our experimental results confirm that our proposed real-time pipeline gives superior results to previous approaches. △ Less

Submitted 26 March, 2020; originally announced March 2020.

Comments: 17 pages, 8 figures

arXiv:1903.01695 [pdf, other]

Real-time Multiple People Hand Localization in 4D Point Clouds

Authors: Hao Jiang, Quanzeng You

Abstract: We propose novel real-time algorithm to localize hands and find their associations with multiple people in the cluttered 4D volumetric data (dynamic 3D volumes). Different from the traditional multiple view approaches, which find key points in 2D and then triangulate to recover the 3D locations, our method directly processes the dynamic 3D data that involve both clutter and crowd. The volumetric r… ▽ More We propose novel real-time algorithm to localize hands and find their associations with multiple people in the cluttered 4D volumetric data (dynamic 3D volumes). Different from the traditional multiple view approaches, which find key points in 2D and then triangulate to recover the 3D locations, our method directly processes the dynamic 3D data that involve both clutter and crowd. The volumetric representation is more desirable than the partial observations from different view points and enables more robust and accurate results. However, due to the large amount of data in the volumetric representation brute force 3D schemes are slow. In this paper, we propose novel real-time methods to tackle the problem to achieve both higher accuracy and faster speed than previous approaches. Our method detects the 3D bounding box of each subject and localizes the hands of each person. We develop new 2D features for fast candidate proposals and optimize the trajectory linking using a new max-covering bipartite matching formulation, which is critical for robust performance. We propose a novel decomposition method to reduce the key point localization in each person 3D volume to a sequence of efficient 2D problems. Our experiments show that the proposed method is faster than different competing methods and it gives almost half the localization error. △ Less

Submitted 5 March, 2019; originally announced March 2019.

arXiv:1807.07961 [pdf, other]

doi 10.1145/3240508.3240533

Twitter Sentiment Analysis via Bi-sense Emoji Embedding and Attention-based LSTM

Authors: Yuxiao Chen, Jianbo Yuan, Quanzeng You, Jiebo Luo

Abstract: Sentiment analysis on large-scale social media data is important to bridge the gaps between social media contents and real world activities including political election prediction, individual and public emotional status monitoring and analysis, and so on. Although textual sentiment analysis has been well studied based on platforms such as Twitter and Instagram, analysis of the role of extensive em… ▽ More Sentiment analysis on large-scale social media data is important to bridge the gaps between social media contents and real world activities including political election prediction, individual and public emotional status monitoring and analysis, and so on. Although textual sentiment analysis has been well studied based on platforms such as Twitter and Instagram, analysis of the role of extensive emoji uses in sentiment analysis remains light. In this paper, we propose a novel scheme for Twitter sentiment analysis with extra attention on emojis. We first learn bi-sense emoji embeddings under positive and negative sentimental tweets individually, and then train a sentiment classifier by attending on these bi-sense emoji embeddings with an attention-based long short-term memory network (LSTM). Our experiments show that the bi-sense embedding is effective for extracting sentiment-aware embeddings of emojis and outperforms the state-of-the-art models. We also visualize the attentions to show that the bi-sense emoji embedding provides better guidance on the attention mechanism to obtain a more robust understanding of the semantics and sentiments. △ Less

Submitted 6 August, 2018; v1 submitted 20 July, 2018; originally announced July 2018.

arXiv:1807.03871 [pdf, other]

"Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention

Authors: Tianlang Chen, Zhong** Zhang, Quanzeng You, Chen Fang, Zhaowen Wang, Hailin **, Jiebo Luo

Abstract: Generating stylized captions for an image is an emerging topic in image captioning. Given an image as input, it requires the system to generate a caption that has a specific style (e.g., humorous, romantic, positive, and negative) while describing the image content semantically accurately. In this paper, we propose a novel stylized image captioning model that effectively takes both requirements in… ▽ More Generating stylized captions for an image is an emerging topic in image captioning. Given an image as input, it requires the system to generate a caption that has a specific style (e.g., humorous, romantic, positive, and negative) while describing the image content semantically accurately. In this paper, we propose a novel stylized image captioning model that effectively takes both requirements into consideration. To this end, we first devise a new variant of LSTM, named style-factual LSTM, as the building block of our model. It uses two groups of matrices to capture the factual and stylized knowledge, respectively, and automatically learns the word-level weights of the two groups based on previous context. In addition, when we train the model to capture stylized elements, we propose an adaptive learning approach based on a reference factual model, it provides factual knowledge to the model as the model learns from stylized caption labels, and can adaptively compute how much information to supply at each time step. We evaluate our model on two stylized image captioning datasets, which contain humorous/romantic captions and positive/negative captions, respectively. Experiments shows that our proposed model outperforms the state-of-the-art approaches, without using extra ground truth supervision. △ Less

Submitted 29 July, 2018; v1 submitted 10 July, 2018; originally announced July 2018.

Comments: 17 pages, 7 figures, ECCV 2018

arXiv:1806.02424 [pdf, other]

Action4D: Real-time Action Recognition in the Crowd and Clutter

Authors: Quanzeng You, Hao Jiang

Abstract: Recognizing every person's action in a crowded and cluttered environment is a challenging task. In this paper, we propose a real-time action recognition method, Action4D, which gives reliable and accurate results in the real-world settings. We propose to tackle the action recognition problem using a holistic 4D "scan" of a cluttered scene to include every detail about the people and environment. R… ▽ More Recognizing every person's action in a crowded and cluttered environment is a challenging task. In this paper, we propose a real-time action recognition method, Action4D, which gives reliable and accurate results in the real-world settings. We propose to tackle the action recognition problem using a holistic 4D "scan" of a cluttered scene to include every detail about the people and environment. Recognizing multiple people's actions in the cluttered 4D representation is a new problem. In this paper, we propose novel methods to solve this problem. We propose a new method to track people in 4D, which can reliably detect and follow each person in real time. We propose a new deep neural network, the Action4D-Net, to recognize the action of each tracked person. The Action4D-Net's novel structure uses both the global feature and the focused attention to achieve state-of-the-art result. Our real-time method is invariant to camera view angles, resistant to clutter and able to handle crowd. The experimental results show that the proposed method is fast, reliable and accurate. Our method paves the way to action recognition in the real-world applications and is ready to be deployed to enable smart homes, smart factories and smart stores. △ Less

Submitted 6 June, 2018; originally announced June 2018.

arXiv:1803.02952 [pdf, other]

Touch Your Heart: A Tone-aware Chatbot for Customer Care on Social Media

Authors: Tianran Hu, Anbang Xu, Zhe Liu, Quanzeng You, Yufan Guo, Vibha Sinha, Jiebo Luo, Rama Akkiraju

Abstract: Chatbot has become an important solution to rapidly increasing customer care demands on social media in recent years. However, current work on chatbot for customer care ignores a key to impact user experience - tones. In this work, we create a novel tone-aware chatbot that generates toned responses to user requests on social media. We first conduct a formative research, in which the effects of ton… ▽ More Chatbot has become an important solution to rapidly increasing customer care demands on social media in recent years. However, current work on chatbot for customer care ignores a key to impact user experience - tones. In this work, we create a novel tone-aware chatbot that generates toned responses to user requests on social media. We first conduct a formative research, in which the effects of tones are studied. Significant and various influences of different tones on user experience are uncovered in the study. With the knowledge of effects of tones, we design a deep learning based chatbot that takes tone information into account. We train our system on over 1.5 million real customer care conversations collected from Twitter. The evaluation reveals that our tone-aware chatbot generates as appropriate responses to user requests as human agents. More importantly, our chatbot is perceived to be even more empathetic than human agents. △ Less

Submitted 14 March, 2018; v1 submitted 7 March, 2018; originally announced March 2018.

Comments: 12 pages, CHI 2018

ACM Class: H.5.3

arXiv:1801.10121 [pdf, other]

Image Captioning at Will: A Versatile Scheme for Effectively Injecting Sentiments into Image Descriptions

Authors: Quanzeng You, Hailin **, Jiebo Luo

Abstract: Automatic image captioning has recently approached human-level performance due to the latest advances in computer vision and natural language understanding. However, most of the current models can only generate plain factual descriptions about the content of a given image. However, for human beings, image caption writing is quite flexible and diverse, where additional language dimensions, such as… ▽ More Automatic image captioning has recently approached human-level performance due to the latest advances in computer vision and natural language understanding. However, most of the current models can only generate plain factual descriptions about the content of a given image. However, for human beings, image caption writing is quite flexible and diverse, where additional language dimensions, such as emotion, humor and language styles, are often incorporated to produce diverse, emotional, or appealing captions. In particular, we are interested in generating sentiment-conveying image descriptions, which has received little attention. The main challenge is how to effectively inject sentiments into the generated captions without altering the semantic matching between the visual content and the generated descriptions. In this work, we propose two different models, which employ different schemes for injecting sentiments into image captions. Compared with the few existing approaches, the proposed models are much simpler and yet more effective. The experimental results show that our model outperform the state-of-the-art models in generating sentimental (i.e., sentiment-bearing) image captions. In addition, we can also easily manipulate the model by assigning different sentiments to the testing image to generate captions with the corresponding sentiments. △ Less

Submitted 30 January, 2018; originally announced January 2018.

Comments: 8 pages, 5 figures and 4 tables

arXiv:1801.02713 [pdf, ps, other]

Duality of Channel Encoding and Decoding - Part II: Rate-1 Non-binary Convolutional Codes

Authors: Qimin You, Yonghui Li, Soung Chang Liew, Branka Vucetic

Abstract: This is the second part of a series of papers on a revisit to the bidirectional Bahl-Cocke-Jelinek-Raviv (BCJR) soft-in-soft-out (SISO) maximum a posteriori probability (MAP) decoding algorithm. Part I revisited the BCJR MAP decoding algorithm for rate-1 binary convolutional codes and proposed a linear complexity decoder using shift registers in the complex number field. Part II proposes a low com… ▽ More This is the second part of a series of papers on a revisit to the bidirectional Bahl-Cocke-Jelinek-Raviv (BCJR) soft-in-soft-out (SISO) maximum a posteriori probability (MAP) decoding algorithm. Part I revisited the BCJR MAP decoding algorithm for rate-1 binary convolutional codes and proposed a linear complexity decoder using shift registers in the complex number field. Part II proposes a low complexity decoder for rate-1 non-binary convolutional codes that achieves the same error performance as the bidirectional BCJR SISO MAP decoding algorithm. We observe an explicit relationship between the encoding and decoding of rate-1 convolutional codes in $GF(q)$. Based on this relationship, the BCJR forward and backward decoding are implemented by dual encoders using shift registers whose contents are vectors of complex numbers. The input to the dual encoders is the probability mass function (pmf) of the received symbols and the output of the dual encoders is the pmf of the information symbols. The bidirectional BCJR MAP decoding is implemented by linearly combining the shift register contents of the dual encoders for forward and backward decoding. The proposed decoder significantly reduces the computational complexity of the bidirectional BCJR MAP algorithm from exponential to linear with constraint length of convolutional codes. To further reduce complexity, fast Fourier transform (FFT) is applied. Mathematical proofs and simulation results are provided to validate our proposed decoder. △ Less

Submitted 8 January, 2018; originally announced January 2018.

Comments: A linear complexity MAP decoder of rate-1 Non-binary Convolutional Codes. The proposed decoder significantly reduces the computational complexity of the bidirectional BCJR MAP algorithm from exponential to linear with constraint length of convolutional codes.For binary decoder, please refer to part I paper: arXiv:1201.2483

Journal ref: European Transactions on Emerging Telecommunications Technologies (ETT), vol. 27, no. 5, May 2016, pp. 685-697

arXiv:1706.05764 [pdf, ps, other]

doi 10.1145/3097983.3098088

Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks

Authors: Fenglong Ma, Radha Chitta, **g Zhou, Quanzeng You, Tong Sun, **g Gao

Abstract: Predicting the future health information of patients from the historical Electronic Health Records (EHR) is a core research task in the development of personalized healthcare. Patient EHR data consist of sequences of visits over time, where each visit contains multiple medical codes, including diagnosis, medication, and procedure codes. The most important challenges for this task are to model the… ▽ More Predicting the future health information of patients from the historical Electronic Health Records (EHR) is a core research task in the development of personalized healthcare. Patient EHR data consist of sequences of visits over time, where each visit contains multiple medical codes, including diagnosis, medication, and procedure codes. The most important challenges for this task are to model the temporality and high dimensionality of sequential EHR data and to interpret the prediction results. Existing work solves this problem by employing recurrent neural networks (RNNs) to model EHR data and utilizing simple attention mechanism to interpret the results. However, RNN-based approaches suffer from the problem that the performance of RNNs drops when the length of sequences is large, and the relationships between subsequent visits are ignored by current RNN-based approaches. To address these issues, we propose {\sf Dipole}, an end-to-end, simple and robust model for predicting patients' future health information. Dipole employs bidirectional recurrent neural networks to remember all the information of both the past visits and the future visits, and it introduces three attention mechanisms to measure the relationships of different visits for the prediction. With the attention mechanisms, Dipole can interpret the prediction results effectively. Dipole also allows us to interpret the learned medical code representations which are confirmed positively by medical experts. Experimental results on two real world EHR datasets show that the proposed Dipole can significantly improve the prediction accuracy compared with the state-of-the-art diagnosis prediction approaches and provide clinically meaningful interpretation. △ Less

Submitted 18 June, 2017; originally announced June 2017.

arXiv:1705.08974 [pdf, other]

Cultural Diffusion and Trends in Facebook Photographs

Authors: Quanzeng You, Darío García-García, Mahohar Paluri, Jiebo Luo, Jungseock Joo

Abstract: Online social media is a social vehicle in which people share various moments of their lives with their friends, such as playing sports, cooking dinner or just taking a selfie for fun, via visual means, that is, photographs. Our study takes a closer look at the popular visual concepts illustrating various cultural lifestyles from aggregated, de-identified photographs. We perform analysis both at m… ▽ More Online social media is a social vehicle in which people share various moments of their lives with their friends, such as playing sports, cooking dinner or just taking a selfie for fun, via visual means, that is, photographs. Our study takes a closer look at the popular visual concepts illustrating various cultural lifestyles from aggregated, de-identified photographs. We perform analysis both at macroscopic and microscopic levels, to gain novel insights about global and local visual trends as well as the dynamics of interpersonal cultural exchange and diffusion among Facebook friends. We processed images by automatically classifying the visual content by a convolutional neural network (CNN). Through various statistical tests, we find that socially tied individuals more likely post images showing similar cultural lifestyles. To further identify the main cause of the observed social correlation, we use the Shuffle test and the Preference-based Matched Estimation (PME) test to distinguish the effects of influence and homophily. The results indicate that the visual content of each user's photographs are temporally, although not necessarily causally, correlated with the photographs of their friends, which may suggest the effect of influence. Our paper demonstrates that Facebook photographs exhibit diverse cultural lifestyles and preferences and that the social interaction mediated through the visual channel in social media can be an effective mechanism for cultural diffusion. △ Less

Submitted 24 May, 2017; originally announced May 2017.

Comments: 10 pages, To appear in ICWSM 2017 (Full Paper)

arXiv:1611.09180 [pdf, other]

doi 10.1109/TMM.2017.2710804

Image Based Appraisal of Real Estate Properties

Authors: Quanzeng You, Ran Pang, Liangliang Cao, Jiebo Luo

Abstract: Real estate appraisal, which is the process of estimating the price for real estate properties, is crucial for both buys and sellers as the basis for negotiation and transaction. Traditionally, the repeat sales model has been widely adopted to estimate real estate price. However, it depends the design and calculation of a complex economic related index, which is challenging to estimate accurately.… ▽ More Real estate appraisal, which is the process of estimating the price for real estate properties, is crucial for both buys and sellers as the basis for negotiation and transaction. Traditionally, the repeat sales model has been widely adopted to estimate real estate price. However, it depends the design and calculation of a complex economic related index, which is challenging to estimate accurately. Today, real estate brokers provide easy access to detailed online information on real estate properties to their clients. We are interested in estimating the real estate price from these large amounts of easily accessed data. In particular, we analyze the prediction power of online house pictures, which is one of the key factors for online users to make a potential visiting decision. The development of robust computer vision algorithms makes the analysis of visual content possible. In this work, we employ a Recurrent Neural Network (RNN) to predict real estate price using the state-of-the-art visual features. The experimental results indicate that our model outperforms several of other state-of-the-art baseline algorithms in terms of both mean absolute error (MAE) and mean absolute percentage error (MAPE). △ Less

Submitted 27 July, 2017; v1 submitted 28 November, 2016; originally announced November 2016.

Comments: 8 pages, 8 figures

arXiv:1610.09002 [pdf, other]

The Effect of Pets on Happiness: A Data-Driven Approach via Large-Scale Social Media

Authors: Yuchen Wu, Jianbo Yuan, Quanzeng You, Jiebo Luo

Abstract: Psychologists have demonstrated that pets have a positive impact on owners' happiness. For example, lonely people are often advised to have a dog or cat to quell their social isolation. Conventional psychological research methods of analyzing this phenomenon are mostly based on surveys or self-reported questionnaires, which are time-consuming and lack of scalability. Utilizing social media as an a… ▽ More Psychologists have demonstrated that pets have a positive impact on owners' happiness. For example, lonely people are often advised to have a dog or cat to quell their social isolation. Conventional psychological research methods of analyzing this phenomenon are mostly based on surveys or self-reported questionnaires, which are time-consuming and lack of scalability. Utilizing social media as an alternative and complimentary resource could potentially address both issues and provide different perspectives on this psychological investigation. In this paper, we propose a novel and effective approach that exploits social media to study the effect of pets on owners' happiness. The proposed framework includes three major components: 1) collecting user-level data from Instagram consisting of about 300,000 images from 2905 users; 2) constructing a convolutional neural network (CNN) for pets classification, and combined with timeline information, further identifying pet owners and the control group; 3) measuring the confidence score of happiness by detecting and analyzing selfie images. Furthermore, various factors of demographics are employed to analyze the fine-grained effects of pets on happiness. Our experimental results demonstrate the effectiveness of the proposed approach and we believe that this approach can be applied to other related domains as a large-scale, high-confidence methodology of user activity analysis through social media. △ Less

Submitted 27 October, 2016; originally announced October 2016.

Comments: In Big Data (Big Data), 2015 IEEE International Conference on. IEEE, 2016

arXiv:1605.02677 [pdf, other]

Building a Large Scale Dataset for Image Emotion Recognition: The Fine Print and The Benchmark

Authors: Quanzeng You, Jiebo Luo, Hailin **, Jianchao Yang

Abstract: Psychological research results have confirmed that people can have different emotional reactions to different visual stimuli. Several papers have been published on the problem of visual emotion analysis. In particular, attempts have been made to analyze and predict people's emotional reaction towards images. To this end, different kinds of hand-tuned features are proposed. The results reported on… ▽ More Psychological research results have confirmed that people can have different emotional reactions to different visual stimuli. Several papers have been published on the problem of visual emotion analysis. In particular, attempts have been made to analyze and predict people's emotional reaction towards images. To this end, different kinds of hand-tuned features are proposed. The results reported on several carefully selected and labeled small image data sets have confirmed the promise of such features. While the recent successes of many computer vision related tasks are due to the adoption of Convolutional Neural Networks (CNNs), visual emotion analysis has not achieved the same level of success. This may be primarily due to the unavailability of confidently labeled and relatively large image data sets for visual emotion analysis. In this work, we introduce a new data set, which started from 3+ million weakly labeled images of different emotions and ended up 30 times as large as the current largest publicly available visual emotion data set. We hope that this data set encourages further research on visual emotion analysis. We also perform extensive benchmarking analyses on this large data set using the state of the art methods including CNNs. △ Less

Submitted 9 May, 2016; originally announced May 2016.

Comments: 7 pages, 7 figures, AAAI 2016

arXiv:1604.07103 [pdf, other]

Voting with Feet: Who are Leaving Hillary Clinton and Donald Trump?

Authors: Yu Wang, Yuncheng Li, Quanzeng You, Xiyang Zhang, Richard Niemi, Jiebo Luo

Abstract: From a crowded field with 17 candidates, Hillary Clinton and Donald Trump have emerged as the two front-runners in the 2016 U.S. presidential campaign. The two candidates each boast more than 5 million followers on Twitter, and at the same time both have witnessed hundreds of thousands of people leave their camps. In this paper we attempt to characterize individuals who have left Hillary Clinton a… ▽ More From a crowded field with 17 candidates, Hillary Clinton and Donald Trump have emerged as the two front-runners in the 2016 U.S. presidential campaign. The two candidates each boast more than 5 million followers on Twitter, and at the same time both have witnessed hundreds of thousands of people leave their camps. In this paper we attempt to characterize individuals who have left Hillary Clinton and Donald Trump between September 2015 and March 2016. Our study focuses on three dimensions of social demographics: social capital, gender, and age. Within each camp, we compare the characteristics of the current followers with former followers, i.e., individuals who have left since September 2015. We use the number of followers to measure social capital, and profile images to infer gender and age. For classifying gender, we train a convolutional neural network (CNN). For age, we use the Face++ API. Our study shows that for both candidates followers with more social capital are more likely to leave (or switch camps). For both candidates females make up a larger presence among unfollowers than among current followers. Somewhat surprisingly, the effect is particularly pronounced for Clinton. Lastly, middle-aged individuals are more likely to leave Trump, and the young are more likely to leave Hillary Clinton. △ Less

Submitted 25 April, 2016; v1 submitted 24 April, 2016; originally announced April 2016.

Comments: 4 pages, 8 figures, under review

arXiv:1603.03925 [pdf, other]

Image Captioning with Semantic Attention

Authors: Quanzeng You, Hailin **, Zhaowen Wang, Chen Fang, Jiebo Luo

Abstract: Automatically generating a natural language description of an image has attracted interests recently both because of its importance in practical applications and because it connects two major artificial intelligence fields: computer vision and natural language processing. Existing approaches are either top-down, which start from a gist of an image and convert it into words, or bottom-up, which com… ▽ More Automatically generating a natural language description of an image has attracted interests recently both because of its importance in practical applications and because it connects two major artificial intelligence fields: computer vision and natural language processing. Existing approaches are either top-down, which start from a gist of an image and convert it into words, or bottom-up, which come up with words describing various aspects of an image and then combine them. In this paper, we propose a new algorithm that combines both approaches through a model of semantic attention. Our algorithm learns to selectively attend to semantic concept proposals and fuse them into hidden states and outputs of recurrent neural networks. The selection and fusion form a feedback connecting the top-down and bottom-up computation. We evaluate our algorithm on two public benchmarks: Microsoft COCO and Flickr30K. Experimental results show that our algorithm significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics. △ Less

Submitted 12 March, 2016; originally announced March 2016.

Comments: 10 pages, 5 figures, CVPR16

arXiv:1509.06041 [pdf, other]

Robust Image Sentiment Analysis Using Progressively Trained and Domain Transferred Deep Networks

Authors: Quanzeng You, Jiebo Luo, Hailin **, Jianchao Yang

Abstract: Sentiment analysis of online user generated content is important for many social media analytics tasks. Researchers have largely relied on textual sentiment analysis to develop systems to predict political elections, measure economic indicators, and so on. Recently, social media users are increasingly using images and videos to express their opinions and share their experiences. Sentiment analysis… ▽ More Sentiment analysis of online user generated content is important for many social media analytics tasks. Researchers have largely relied on textual sentiment analysis to develop systems to predict political elections, measure economic indicators, and so on. Recently, social media users are increasingly using images and videos to express their opinions and share their experiences. Sentiment analysis of such large scale visual content can help better extract user sentiments toward events or topics, such as those in image tweets, so that prediction of sentiment from visual content is complementary to textual sentiment analysis. Motivated by the needs in leveraging large scale yet noisy training data to solve the extremely challenging problem of image sentiment analysis, we employ Convolutional Neural Networks (CNN). We first design a suitable CNN architecture for image sentiment analysis. We obtain half a million training samples by using a baseline sentiment algorithm to label Flickr images. To make use of such noisy machine labeled data, we employ a progressive strategy to fine-tune the deep network. Furthermore, we improve the performance on Twitter images by inducing domain transfer with a small number of manually labeled Twitter images. We have conducted extensive experiments on manually labeled Twitter images. The results show that the proposed CNN can achieve better performance in image sentiment analysis than competing algorithms. △ Less

Submitted 20 September, 2015; originally announced September 2015.

Comments: 9 pages, 5 figures, AAAI 2015

arXiv:1504.04558 [pdf, other]

A Picture Tells a Thousand Words -- About You! User Interest Profiling from User Generated Visual Content

Authors: Quanzeng You, Sumit Bhatia, Jiebo Luo

Abstract: Inference of online social network users' attributes and interests has been an active research topic. Accurate identification of users' attributes and interests is crucial for improving the performance of personalization and recommender systems. Most of the existing works have focused on textual content generated by the users and have successfully used it for predicting users' interests and other… ▽ More Inference of online social network users' attributes and interests has been an active research topic. Accurate identification of users' attributes and interests is crucial for improving the performance of personalization and recommender systems. Most of the existing works have focused on textual content generated by the users and have successfully used it for predicting users' interests and other identifying attributes. However, little attention has been paid to user generated visual content (images) that is becoming increasingly popular and pervasive in recent times. We posit that images posted by users on online social networks are a reflection of topics they are interested in and propose an approach to infer user attributes from images posted by them. We analyze the content of individual images and then aggregate the image-level knowledge to infer user-level interest distribution. We employ image-level similarity to propagate the label information between images, as well as utilize the image category information derived from the user created organization structure to further propagate the category-level knowledge for all images. A real life social network dataset created from Pinterest is used for evaluation and the experimental results demonstrate the effectiveness of our proposed approach. △ Less

Submitted 17 April, 2015; originally announced April 2015.

Comments: 7 pages, 6 Figures, 4 Tables

arXiv:1209.4405 [pdf, ps, other]

Strongly Convex Programming for Principal Component Pursuit

Authors: Qingshan You, Qun Wan, Yipeng Liu

Abstract: In this paper, we address strongly convex programming for princi- pal component pursuit with reduced linear measurements, which decomposes a superposition of a low-rank matrix and a sparse matrix from a small set of linear measurements. We first provide sufficient conditions under which the strongly convex models lead to the exact low-rank and sparse matrix recov- ery; Second, we also give suggest… ▽ More In this paper, we address strongly convex programming for princi- pal component pursuit with reduced linear measurements, which decomposes a superposition of a low-rank matrix and a sparse matrix from a small set of linear measurements. We first provide sufficient conditions under which the strongly convex models lead to the exact low-rank and sparse matrix recov- ery; Second, we also give suggestions on how to choose suitable parameters in practical algorithms. △ Less

Submitted 19 September, 2012; originally announced September 2012.

Comments: 10 pages

arXiv:1201.2483 [pdf, ps, other]

doi 10.1002/ett.3018

Duality of Channel Encoding and Decoding - Part I: Rate-1 Binary Convolutional Codes

Authors: Yonghui Li, Qimin You, Soung C. Liew, Branka Vucetic

Abstract: In this paper, we revisit the forward, backward and bidirectional Bahl-Cocke-Jelinek-Raviv (BCJR) soft-input soft-output (SISO) maximum a posteriori probability (MAP) decoding process of rate-1 binary convolutional codes. From this we establish some interesting explicit relationships between encoding and decoding of rate-1 convolutional codes. We observe that the forward and backward BCJR SISO MAP… ▽ More In this paper, we revisit the forward, backward and bidirectional Bahl-Cocke-Jelinek-Raviv (BCJR) soft-input soft-output (SISO) maximum a posteriori probability (MAP) decoding process of rate-1 binary convolutional codes. From this we establish some interesting explicit relationships between encoding and decoding of rate-1 convolutional codes. We observe that the forward and backward BCJR SISO MAP decoders can be simply represented by their dual SISO channel encoders using shift registers in the complex number field. Similarly, the bidirectional MAP decoding can be implemented by linearly combining the shift register contents of the dual SISO encoders of the respective forward and backward decoders. The dual encoder structures for various recursive and non-recursive rate-1 convolutional codes are derived. △ Less

Submitted 10 November, 2016; v1 submitted 12 January, 2012; originally announced January 2012.

Comments: 32 pages, 20 figures, to appear in ETT

Showing 1–43 of 43 results for author: You, Q