-
Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model
Authors:
Haogeng Liu,
Quanzeng You,
Xiaotian Han,
Yongfei Liu,
Huaibo Huang,
Ran He,
Hongxia Yang
Abstract:
In the realm of Multimodal Large Language Models (MLLMs), vision-language connector plays a crucial role to link the pre-trained vision encoders with Large Language Models (LLMs). Despite its importance, the vision-language connector has been relatively less explored. In this study, we aim to propose a strong vision-language connector that enables MLLMs to achieve high accuracy while maintain low…
▽ More
In the realm of Multimodal Large Language Models (MLLMs), vision-language connector plays a crucial role to link the pre-trained vision encoders with Large Language Models (LLMs). Despite its importance, the vision-language connector has been relatively less explored. In this study, we aim to propose a strong vision-language connector that enables MLLMs to achieve high accuracy while maintain low computation cost. We first reveal the existence of the visual anchors in Vision Transformer and propose a cost-effective search algorithm to extract them. Building on these findings, we introduce the Anchor Former (AcFormer), a novel vision-language connector designed to leverage the rich prior knowledge obtained from these visual anchors during pretraining, guiding the aggregation of information. Through extensive experimentation, we demonstrate that the proposed method significantly reduces computational costs by nearly two-thirds compared with baseline, while simultaneously outperforming baseline methods. This highlights the effectiveness and efficiency of AcFormer.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
ViTAR: Vision Transformer with Any Resolution
Authors:
Qihang Fan,
Quanzeng You,
Xiaotian Han,
Yongfei Liu,
Yunzhe Tao,
Huaibo Huang,
Ran He,
Hongxia Yang
Abstract:
This paper tackles a significant challenge faced by Vision Transformers (ViTs): their constrained scalability across different image resolutions. Typically, ViTs experience a performance decline when processing resolutions different from those seen during training. Our work introduces two key innovations to address this issue. Firstly, we propose a novel module for dynamic resolution adjustment, d…
▽ More
This paper tackles a significant challenge faced by Vision Transformers (ViTs): their constrained scalability across different image resolutions. Typically, ViTs experience a performance decline when processing resolutions different from those seen during training. Our work introduces two key innovations to address this issue. Firstly, we propose a novel module for dynamic resolution adjustment, designed with a single Transformer block, specifically to achieve highly efficient incremental token integration. Secondly, we introduce fuzzy positional encoding in the Vision Transformer to provide consistent positional awareness across multiple resolutions, thereby preventing overfitting to any single training resolution. Our resulting model, ViTAR (Vision Transformer with Any Resolution), demonstrates impressive adaptability, achieving 83.3\% top-1 accuracy at a 1120x1120 resolution and 80.4\% accuracy at a 4032x4032 resolution, all while reducing computational costs. ViTAR also shows strong performance in downstream tasks such as instance and semantic segmentation and can easily combined with self-supervised learning techniques like Masked AutoEncoder. Our work provides a cost-effective solution for enhancing the resolution scalability of ViTs, paving the way for more versatile and efficient high-resolution image processing.
△ Less
Submitted 28 March, 2024; v1 submitted 27 March, 2024;
originally announced March 2024.
-
InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding
Authors:
Haogeng Liu,
Quanzeng You,
Xiaotian Han,
Yiqi Wang,
Bohan Zhai,
Yongfei Liu,
Yunzhe Tao,
Huaibo Huang,
Ran He,
Hongxia Yang
Abstract:
Multimodal Large Language Models (MLLMs) have experienced significant advancements recently. Nevertheless, challenges persist in the accurate recognition and comprehension of intricate details within high-resolution images. Despite being indispensable for the development of robust MLLMs, this area remains underinvestigated. To tackle this challenge, our work introduces InfiMM-HD, a novel architect…
▽ More
Multimodal Large Language Models (MLLMs) have experienced significant advancements recently. Nevertheless, challenges persist in the accurate recognition and comprehension of intricate details within high-resolution images. Despite being indispensable for the development of robust MLLMs, this area remains underinvestigated. To tackle this challenge, our work introduces InfiMM-HD, a novel architecture specifically designed for processing images of different resolutions with low computational overhead. This innovation facilitates the enlargement of MLLMs to higher-resolution capabilities. InfiMM-HD incorporates a cross-attention module and visual windows to reduce computation costs. By integrating this architectural design with a four-stage training pipeline, our model attains improved visual perception efficiently and cost-effectively. Empirical study underscores the robustness and effectiveness of InfiMM-HD, opening new avenues for exploration in related areas. Codes and models can be found at https://huggingface.co/Infi-MM/infimm-hd
△ Less
Submitted 3 March, 2024;
originally announced March 2024.
-
COCO is "ALL'' You Need for Visual Instruction Fine-tuning
Authors:
Xiaotian Han,
Yiqi Wang,
Bohan Zhai,
Quanzeng You,
Hongxia Yang
Abstract:
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. Visual instruction fine-tuning (IFT) is a vital process for aligning MLLMs' output with user's intentions. High-quality and diversified instruction following data is the key to this fine-tuning process. Recent studies propose to construct visual IFT datasets through a multifaceted approach…
▽ More
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. Visual instruction fine-tuning (IFT) is a vital process for aligning MLLMs' output with user's intentions. High-quality and diversified instruction following data is the key to this fine-tuning process. Recent studies propose to construct visual IFT datasets through a multifaceted approach: transforming existing datasets with rule-based templates, employing GPT-4 for rewriting annotations, and utilizing GPT-4V for visual dataset pseudo-labeling. LLaVA-1.5 adopted similar approach and construct LLaVA-mix-665k, which is one of the simplest, most widely used, yet most effective IFT datasets today. Notably, when properly fine-tuned with this dataset, MLLMs can achieve state-of-the-art performance on several benchmarks. However, we noticed that models trained with this dataset often struggle to follow user instructions properly in multi-round dialog. In addition, tradition caption and VQA evaluation benchmarks, with their closed-form evaluation structure, are not fully equipped to assess the capabilities of modern open-ended generative MLLMs. This problem is not unique to the LLaVA-mix-665k dataset, but may be a potential issue in all IFT datasets constructed from image captioning or VQA sources, though the extent of this issue may vary. We argue that datasets with diverse and high-quality detailed instruction following annotations are essential and adequate for MLLMs IFT. In this work, we establish a new IFT dataset, with images sourced from the COCO dataset along with more diverse instructions. Our experiments show that when fine-tuned with out proposed dataset, MLLMs achieve better performance on open-ended evaluation benchmarks in both single-round and multi-round dialog setting.
△ Less
Submitted 16 January, 2024;
originally announced January 2024.
-
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning
Authors:
Yiqi Wang,
Wentao Chen,
Xiaotian Han,
Xudong Lin,
Haiteng Zhao,
Yongfei Liu,
Bohan Zhai,
Jianbo Yuan,
Quanzeng You,
Hongxia Yang
Abstract:
Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract reasoning ability is the goal of next-generation AI. Recent advancements in Large Language Models (LLMs), along with the emerging field of Multimodal Large Language Models (MLLMs), have demonstrated impressive capabilities across a wide range of multimodal tasks and applications. Particularly, various…
▽ More
Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract reasoning ability is the goal of next-generation AI. Recent advancements in Large Language Models (LLMs), along with the emerging field of Multimodal Large Language Models (MLLMs), have demonstrated impressive capabilities across a wide range of multimodal tasks and applications. Particularly, various MLLMs, each with distinct model architectures, training data, and training stages, have been evaluated across a broad range of MLLM benchmarks. These studies have, to varying degrees, revealed different aspects of the current capabilities of MLLMs. However, the reasoning abilities of MLLMs have not been systematically investigated. In this survey, we comprehensively review the existing evaluation protocols of multimodal reasoning, categorize and illustrate the frontiers of MLLMs, introduce recent trends in applications of MLLMs on reasoning-intensive tasks, and finally discuss current practices and future directions. We believe our survey establishes a solid base and sheds light on this important topic, multimodal reasoning.
△ Less
Submitted 18 January, 2024; v1 submitted 10 January, 2024;
originally announced January 2024.
-
The Generalization Gap in Offline Reinforcement Learning
Authors:
Ishita Mediratta,
Qingfei You,
Minqi Jiang,
Roberta Raileanu
Abstract:
Despite recent progress in offline learning, these methods are still trained and tested on the same environment. In this paper, we compare the generalization abilities of widely used online and offline learning methods such as online reinforcement learning (RL), offline RL, sequence modeling, and behavioral cloning. Our experiments show that offline learning algorithms perform worse on new environ…
▽ More
Despite recent progress in offline learning, these methods are still trained and tested on the same environment. In this paper, we compare the generalization abilities of widely used online and offline learning methods such as online reinforcement learning (RL), offline RL, sequence modeling, and behavioral cloning. Our experiments show that offline learning algorithms perform worse on new environments than online learning ones. We also introduce the first benchmark for evaluating generalization in offline learning, collecting datasets of varying sizes and skill-levels from Procgen (2D video games) and WebShop (e-commerce websites). The datasets contain trajectories for a limited number of game levels or natural language instructions and at test time, the agent has to generalize to new levels or instructions. Our experiments reveal that existing offline learning algorithms struggle to match the performance of online RL on both train and test environments. Behavioral cloning is a strong baseline, outperforming state-of-the-art offline RL and sequence modeling approaches when trained on data from multiple environments and tested on new ones. Finally, we find that increasing the diversity of the data, rather than its size, improves performance on new environments for all offline learning algorithms. Our study demonstrates the limited generalization of current offline learning algorithms highlighting the need for more research in this area.
△ Less
Submitted 14 March, 2024; v1 submitted 9 December, 2023;
originally announced December 2023.
-
Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts
Authors:
Tianqi Chen,
Yongfei Liu,
Zhendong Wang,
Jianbo Yuan,
Quanzeng You,
Hongxia Yang,
Mingyuan Zhou
Abstract:
In light of the remarkable success of in-context learning in large language models, its potential extension to the vision domain, particularly with visual foundation models like Stable Diffusion, has sparked considerable interest. Existing approaches in visual in-context learning frequently face hurdles such as expensive pretraining, limiting frameworks, inadequate visual comprehension, and limite…
▽ More
In light of the remarkable success of in-context learning in large language models, its potential extension to the vision domain, particularly with visual foundation models like Stable Diffusion, has sparked considerable interest. Existing approaches in visual in-context learning frequently face hurdles such as expensive pretraining, limiting frameworks, inadequate visual comprehension, and limited adaptability to new tasks. In response to these challenges, we introduce improved Prompt Diffusion (iPromptDiff) in this study. iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector. This vector is subsequently used to modulate the token embeddings of text prompts. We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks and excels in in-context learning for novel vision tasks, such as normal-to-image or image-to-line transformations. The effectiveness of these capabilities relies heavily on a deep visual understanding, which is achieved through relevant visual demonstrations processed by our proposed in-context learning architecture.
△ Less
Submitted 3 December, 2023;
originally announced December 2023.
-
Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis
Authors:
Xiaohui Chen,
Yongfei Liu,
Yingxiang Yang,
Jianbo Yuan,
Quanzeng You,
Li-** Liu,
Hongxia Yang
Abstract:
Recent advancements in text-to-image (T2I) generative models have shown remarkable capabilities in producing diverse and imaginative visuals based on text prompts. Despite the advancement, these diffusion models sometimes struggle to translate the semantic content from the text into images entirely. While conditioning on the layout has shown to be effective in improving the compositional ability o…
▽ More
Recent advancements in text-to-image (T2I) generative models have shown remarkable capabilities in producing diverse and imaginative visuals based on text prompts. Despite the advancement, these diffusion models sometimes struggle to translate the semantic content from the text into images entirely. While conditioning on the layout has shown to be effective in improving the compositional ability of T2I diffusion models, they typically require manual layout input. In this work, we introduce a novel approach to improving T2I diffusion models using Large Language Models (LLMs) as layout generators. Our method leverages the Chain-of-Thought prompting of LLMs to interpret text and generate spatially reasonable object layouts. The generated layout is then used to enhance the generated images' composition and spatial accuracy. Moreover, we propose an efficient adapter based on a cross-attention mechanism, which explicitly integrates the layout information into the stable diffusion models. Our experiments demonstrate significant improvements in image quality and layout accuracy, showcasing the potential of LLMs in augmenting generative image models.
△ Less
Submitted 28 November, 2023;
originally announced November 2023.
-
InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models
Authors:
Xiaotian Han,
Quanzeng You,
Yongfei Liu,
Wentao Chen,
Huangjie Zheng,
Khalil Mrini,
Xudong Lin,
Yiqi Wang,
Bohan Zhai,
Jianbo Yuan,
Heng Wang,
Hongxia Yang
Abstract:
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. These models not only excel in traditional vision-language tasks but also demonstrate impressive performance in contemporary multi-modal benchmarks. Although many of these benchmarks attempt to holistically evaluate MLLMs, they typically concentrate on basic reasoning tasks, often yielding…
▽ More
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. These models not only excel in traditional vision-language tasks but also demonstrate impressive performance in contemporary multi-modal benchmarks. Although many of these benchmarks attempt to holistically evaluate MLLMs, they typically concentrate on basic reasoning tasks, often yielding only simple yes/no or multi-choice responses. These methods naturally lead to confusion and difficulties in conclusively determining the reasoning capabilities of MLLMs. To mitigate this issue, we manually curate a benchmark dataset specifically designed for MLLMs, with a focus on complex reasoning tasks. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. The queries in our dataset are intentionally constructed to engage the reasoning capabilities of MLLMs in the process of generating answers. For a fair comparison across various MLLMs, we incorporate intermediate reasoning steps into our evaluation criteria. In instances where an MLLM is unable to produce a definitive answer, its reasoning ability is evaluated by requesting intermediate reasoning steps. If these steps align with our manual annotations, appropriate scores are assigned. This evaluation scheme resembles methods commonly used in human assessments, such as exams or assignments, and represents what we consider a more effective assessment technique compared with existing benchmarks. We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark, designed to challenge and accurately measure their reasoning capabilities. The code and data will be released at https://infimm.github.io/InfiMM-Eval/
△ Less
Submitted 4 December, 2023; v1 submitted 20 November, 2023;
originally announced November 2023.
-
Learning Stackable and Skippable LEGO Bricks for Efficient, Reconfigurable, and Variable-Resolution Diffusion Modeling
Authors:
Huangjie Zheng,
Zhendong Wang,
Jianbo Yuan,
Guanghan Ning,
Pengcheng He,
Quanzeng You,
Hongxia Yang,
Mingyuan Zhou
Abstract:
Diffusion models excel at generating photo-realistic images but come with significant computational costs in both training and sampling. While various techniques address these computational challenges, a less-explored issue is designing an efficient and adaptable network backbone for iterative refinement. Current options like U-Net and Vision Transformer often rely on resource-intensive deep netwo…
▽ More
Diffusion models excel at generating photo-realistic images but come with significant computational costs in both training and sampling. While various techniques address these computational challenges, a less-explored issue is designing an efficient and adaptable network backbone for iterative refinement. Current options like U-Net and Vision Transformer often rely on resource-intensive deep networks and lack the flexibility needed for generating images at variable resolutions or with a smaller network than used in training. This study introduces LEGO bricks, which seamlessly integrate Local-feature Enrichment and Global-content Orchestration. These bricks can be stacked to create a test-time reconfigurable diffusion backbone, allowing selective skip** of bricks to reduce sampling costs and generate higher-resolution images than the training data. LEGO bricks enrich local regions with an MLP and transform them using a Transformer block while maintaining a consistent full-resolution image across all bricks. Experimental results demonstrate that LEGO bricks enhance training efficiency, expedite convergence, and facilitate variable-resolution image generation while maintaining strong generative performance. Moreover, LEGO significantly reduces sampling time compared to other methods, establishing it as a valuable enhancement for diffusion models. Our code and project page are available at https://jegzheng.github.io/LEGODiffusion.
△ Less
Submitted 27 June, 2024; v1 submitted 10 October, 2023;
originally announced October 2023.
-
RefineVIS: Video Instance Segmentation with Temporal Attention Refinement
Authors:
Andre Abrantes,
Jiang Wang,
Peng Chu,
Quanzeng You,
Zicheng Liu
Abstract:
We introduce a novel framework called RefineVIS for Video Instance Segmentation (VIS) that achieves good object association between frames and accurate segmentation masks by iteratively refining the representations using sequence context. RefineVIS learns two separate representations on top of an off-the-shelf frame-level image instance segmentation model: an association representation responsible…
▽ More
We introduce a novel framework called RefineVIS for Video Instance Segmentation (VIS) that achieves good object association between frames and accurate segmentation masks by iteratively refining the representations using sequence context. RefineVIS learns two separate representations on top of an off-the-shelf frame-level image instance segmentation model: an association representation responsible for associating objects across frames and a segmentation representation that produces accurate segmentation masks. Contrastive learning is utilized to learn temporally stable association representations. A Temporal Attention Refinement (TAR) module learns discriminative segmentation representations by exploiting temporal relationships and a novel temporal contrastive denoising technique. Our method supports both online and offline inference. It achieves state-of-the-art video instance segmentation accuracy on YouTube-VIS 2019 (64.4 AP), Youtube-VIS 2021 (61.4 AP), and OVIS (46.1 AP) datasets. The visualization shows that the TAR module can generate more accurate instance segmentation masks, particularly for challenging cases such as highly occluded objects.
△ Less
Submitted 7 June, 2023;
originally announced June 2023.
-
PEER: A Collaborative Language Model
Authors:
Timo Schick,
Jane Dwivedi-Yu,
Zhengbao Jiang,
Fabio Petroni,
Patrick Lewis,
Gautier Izacard,
Qingfei You,
Christoforos Nalmpantis,
Edouard Grave,
Sebastian Riedel
Abstract:
Textual content is often the output of a collaborative writing process: We start with an initial draft, ask for suggestions, and repeatedly make changes. Agnostic of this process, today's language models are trained to generate only the final result. As a consequence, they lack several abilities crucial for collaborative writing: They are unable to update existing texts, difficult to control and i…
▽ More
Textual content is often the output of a collaborative writing process: We start with an initial draft, ask for suggestions, and repeatedly make changes. Agnostic of this process, today's language models are trained to generate only the final result. As a consequence, they lack several abilities crucial for collaborative writing: They are unable to update existing texts, difficult to control and incapable of verbally planning or explaining their actions. To address these shortcomings, we introduce PEER, a collaborative language model that is trained to imitate the entire writing process itself: PEER can write drafts, add suggestions, propose edits and provide explanations for its actions. Crucially, we train multiple instances of PEER able to infill various parts of the writing process, enabling the use of self-training techniques for increasing the quality, amount and diversity of training data. This unlocks PEER's full potential by making it applicable in domains for which no edit histories are available and improving its ability to follow instructions, to write useful comments, and to explain its actions. We show that PEER achieves strong performance across various domains and editing tasks.
△ Less
Submitted 24 August, 2022;
originally announced August 2022.
-
Consistent Video Instance Segmentation with Inter-Frame Recurrent Attention
Authors:
Quanzeng You,
Jiang Wang,
Peng Chu,
Andre Abrantes,
Zicheng Liu
Abstract:
Video instance segmentation aims at predicting object segmentation masks for each frame, as well as associating the instances across multiple frames. Recent end-to-end video instance segmentation methods are capable of performing object segmentation and instance association together in a direct parallel sequence decoding/prediction framework. Although these methods generally predict higher quality…
▽ More
Video instance segmentation aims at predicting object segmentation masks for each frame, as well as associating the instances across multiple frames. Recent end-to-end video instance segmentation methods are capable of performing object segmentation and instance association together in a direct parallel sequence decoding/prediction framework. Although these methods generally predict higher quality object segmentation masks, they can fail to associate instances in challenging cases because they do not explicitly model the temporal instance consistency for adjacent frames. We propose a consistent end-to-end video instance segmentation framework with Inter-Frame Recurrent Attention to model both the temporal instance consistency for adjacent frames and the global temporal context. Our extensive experiments demonstrate that the Inter-Frame Recurrent Attention significantly improves temporal instance consistency while maintaining the quality of the object segmentation masks. Our model achieves state-of-the-art accuracy on both YouTubeVIS-2019 (62.1\%) and YouTubeVIS-2021 (54.7\%) datasets. In addition, quantitative and qualitative results show that the proposed methods predict more temporally consistent instance segmentation masks.
△ Less
Submitted 14 June, 2022;
originally announced June 2022.
-
Deep Frequency Filtering for Domain Generalization
Authors:
Shiqi Lin,
Zhizheng Zhang,
Zhipeng Huang,
Yan Lu,
Cuiling Lan,
Peng Chu,
Quanzeng You,
Jiang Wang,
Zicheng Liu,
Amey Parulkar,
Viraj Navkal,
Zhibo Chen
Abstract:
Improving the generalization ability of Deep Neural Networks (DNNs) is critical for their practical uses, which has been a longstanding challenge. Some theoretical studies have uncovered that DNNs have preferences for some frequency components in the learning process and indicated that this may affect the robustness of learned features. In this paper, we propose Deep Frequency Filtering (DFF) for…
▽ More
Improving the generalization ability of Deep Neural Networks (DNNs) is critical for their practical uses, which has been a longstanding challenge. Some theoretical studies have uncovered that DNNs have preferences for some frequency components in the learning process and indicated that this may affect the robustness of learned features. In this paper, we propose Deep Frequency Filtering (DFF) for learning domain-generalizable features, which is the first endeavour to explicitly modulate the frequency components of different transfer difficulties across domains in the latent space during training. To achieve this, we perform Fast Fourier Transform (FFT) for the feature maps at different layers, then adopt a light-weight module to learn attention masks from the frequency representations after FFT to enhance transferable components while suppressing the components not conducive to generalization. Further, we empirically compare the effectiveness of adopting different types of attention designs for implementing DFF. Extensive experiments demonstrate the effectiveness of our proposed DFF and show that applying our DFF on a plain baseline outperforms the state-of-the-art methods on different domain generalization tasks, including close-set classification and open-set retrieval.
△ Less
Submitted 25 March, 2023; v1 submitted 23 March, 2022;
originally announced March 2022.
-
SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering
Authors:
Peixi Xiong,
Quanzeng You,
Pei Yu,
Zicheng Liu,
Ying Wu
Abstract:
Visual Question Answering (VQA) attracts much attention from both industry and academia. As a multi-modality task, it is challenging since it requires not only visual and textual understanding, but also the ability to align cross-modality representations. Previous approaches extensively employ entity-level alignments, such as the correlations between the visual regions and their semantic labels, o…
▽ More
Visual Question Answering (VQA) attracts much attention from both industry and academia. As a multi-modality task, it is challenging since it requires not only visual and textual understanding, but also the ability to align cross-modality representations. Previous approaches extensively employ entity-level alignments, such as the correlations between the visual regions and their semantic labels, or the interactions across question words and object features. These attempts aim to improve the cross-modality representations, while ignoring their internal relations. Instead, we propose to apply structured alignments, which work with graph representation of visual and textual content, aiming to capture the deep connections between the visual and textual modalities. Nevertheless, it is nontrivial to represent and integrate graphs for structured alignments. In this work, we attempt to solve this issue by first converting different modality entities into sequential nodes and the adjacency graph, then incorporating them for structured alignments. As demonstrated in our experimental results, such a structured alignment improves reasoning performance. In addition, our model also exhibits better interpretability for each generated answer. The proposed model, without any pretraining, outperforms the state-of-the-art methods on GQA dataset, and beats the non-pretrained state-of-the-art methods on VQA-v2 dataset.
△ Less
Submitted 25 January, 2022;
originally announced January 2022.
-
Lifelong Unsupervised Domain Adaptive Person Re-identification with Coordinated Anti-forgetting and Adaptation
Authors:
Zhipeng Huang,
Zhizheng Zhang,
Cuiling Lan,
Wenjun Zeng,
Peng Chu,
Quanzeng You,
Jiang Wang,
Zicheng Liu,
Zheng-jun Zha
Abstract:
Unsupervised domain adaptive person re-identification (ReID) has been extensively investigated to mitigate the adverse effects of domain gaps. Those works assume the target domain data can be accessible all at once. However, for the real-world streaming data, this hinders the timely adaptation to changing data statistics and sufficient exploitation of increasing samples. In this paper, to address…
▽ More
Unsupervised domain adaptive person re-identification (ReID) has been extensively investigated to mitigate the adverse effects of domain gaps. Those works assume the target domain data can be accessible all at once. However, for the real-world streaming data, this hinders the timely adaptation to changing data statistics and sufficient exploitation of increasing samples. In this paper, to address more practical scenarios, we propose a new task, Lifelong Unsupervised Domain Adaptive (LUDA) person ReID. This is challenging because it requires the model to continuously adapt to unlabeled data in the target environments while alleviating catastrophic forgetting for such a fine-grained person retrieval task. We design an effective scheme for this task, dubbed CLUDA-ReID, where the anti-forgetting is harmoniously coordinated with the adaptation. Specifically, a meta-based Coordinated Data Replay strategy is proposed to replay old data and update the network with a coordinated optimization direction for both adaptation and memorization. Moreover, we propose Relational Consistency Learning for old knowledge distillation/inheritance in line with the objective of retrieval-based tasks. We set up two evaluation settings to simulate the practical application scenarios. Extensive experiments demonstrate the effectiveness of our CLUDA-ReID for both scenarios with stationary target streams and scenarios with dynamic target streams.
△ Less
Submitted 29 March, 2022; v1 submitted 13 December, 2021;
originally announced December 2021.
-
MMPTRACK: Large-scale Densely Annotated Multi-camera Multiple People Tracking Benchmark
Authors:
Xiaotian Han,
Quanzeng You,
Chunyu Wang,
Zhizheng Zhang,
Peng Chu,
Houdong Hu,
Jiang Wang,
Zicheng Liu
Abstract:
Multi-camera tracking systems are gaining popularity in applications that demand high-quality tracking results, such as frictionless checkout because monocular multi-object tracking (MOT) systems often fail in cluttered and crowded environments due to occlusion. Multiple highly overlapped cameras can significantly alleviate the problem by recovering partial 3D information. However, the cost of cre…
▽ More
Multi-camera tracking systems are gaining popularity in applications that demand high-quality tracking results, such as frictionless checkout because monocular multi-object tracking (MOT) systems often fail in cluttered and crowded environments due to occlusion. Multiple highly overlapped cameras can significantly alleviate the problem by recovering partial 3D information. However, the cost of creating a high-quality multi-camera tracking dataset with diverse camera settings and backgrounds has limited the dataset scale in this domain. In this paper, we provide a large-scale densely-labeled multi-camera tracking dataset in five different environments with the help of an auto-annotation system. The system uses overlapped and calibrated depth and RGB cameras to build a high-performance 3D tracker that automatically generates the 3D tracking results. The 3D tracking results are projected to each RGB camera view using camera parameters to create 2D tracking results. Then, we manually check and correct the 3D tracking results to ensure the label quality, which is much cheaper than fully manual annotation. We have conducted extensive experiments using two real-time multi-camera trackers and a person re-identification (ReID) model with different settings. This dataset provides a more reliable benchmark of multi-camera, multi-object tracking systems in cluttered and crowded environments. Also, our results demonstrate that adapting the trackers and ReID models on this dataset significantly improves their performance. Our dataset will be publicly released upon the acceptance of this work.
△ Less
Submitted 30 November, 2021;
originally announced November 2021.
-
Efficient Document Image Classification Using Region-Based Graph Neural Network
Authors:
Jaya Krishna Mandivarapu,
Eric Bunch,
Qian You,
Glenn Fung
Abstract:
Document image classification remains a popular research area because it can be commercialized in many enterprise applications across different industries. Recent advancements in large pre-trained computer vision and language models and graph neural networks has lent document image classification many tools. However using large pre-trained models usually requires substantial computing resources wh…
▽ More
Document image classification remains a popular research area because it can be commercialized in many enterprise applications across different industries. Recent advancements in large pre-trained computer vision and language models and graph neural networks has lent document image classification many tools. However using large pre-trained models usually requires substantial computing resources which could defeat the cost-saving advantages of automatic document image classification. In the paper we propose an efficient document image classification framework that uses graph convolution neural networks and incorporates textual, visual and layout information of the document. We have rigorously benchmarked our proposed algorithm against several state-of-art vision and language models on both publicly available dataset and a real-life insurance document classification dataset. Empirical results on both publicly available and real-world data show that our methods achieve near SOTA performance yet require much less computing resources and time for model training and inference. This results in solutions than offer better cost advantages, especially in scalable deployment for enterprise applications. The results showed that our algorithm can achieve classification performance quite close to SOTA. We also provide comprehensive comparisons of computing resources, model sizes, train and inference time between our proposed methods and baselines. In addition we delineate the cost per image using our method and other baselines.
△ Less
Submitted 25 June, 2021;
originally announced June 2021.
-
Writing by Memorizing: Hierarchical Retrieval-based Medical Report Generation
Authors:
Xingyi Yang,
Muchao Ye,
Quanzeng You,
Fenglong Ma
Abstract:
Medical report generation is one of the most challenging tasks in medical image analysis. Although existing approaches have achieved promising results, they either require a predefined template database in order to retrieve sentences or ignore the hierarchical nature of medical report generation. To address these issues, we propose MedWriter that incorporates a novel hierarchical retrieval mechani…
▽ More
Medical report generation is one of the most challenging tasks in medical image analysis. Although existing approaches have achieved promising results, they either require a predefined template database in order to retrieve sentences or ignore the hierarchical nature of medical report generation. To address these issues, we propose MedWriter that incorporates a novel hierarchical retrieval mechanism to automatically extract both report and sentence-level templates for clinically accurate report generation. MedWriter first employs the Visual-Language Retrieval~(VLR) module to retrieve the most relevant reports for the given images. To guarantee the logical coherence between sentences, the Language-Language Retrieval~(LLR) module is introduced to retrieve relevant sentences based on the previous generated description. At last, a language decoder fuses image features and features from retrieved reports and sentences to generate meaningful medical reports. We verified the effectiveness of our model by automatic evaluation and human evaluation on two datasets, i.e., Open-I and MIMIC-CXR.
△ Less
Submitted 25 May, 2021;
originally announced June 2021.
-
TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking
Authors:
Peng Chu,
Jiang Wang,
Quanzeng You,
Haibin Ling,
Zicheng Liu
Abstract:
Tracking multiple objects in videos relies on modeling the spatial-temporal interactions of the objects. In this paper, we propose a solution named TransMOT, which leverages powerful graph transformers to efficiently model the spatial and temporal interactions among the objects. TransMOT effectively models the interactions of a large number of objects by arranging the trajectories of the tracked o…
▽ More
Tracking multiple objects in videos relies on modeling the spatial-temporal interactions of the objects. In this paper, we propose a solution named TransMOT, which leverages powerful graph transformers to efficiently model the spatial and temporal interactions among the objects. TransMOT effectively models the interactions of a large number of objects by arranging the trajectories of the tracked objects as a set of sparse weighted graphs, and constructing a spatial graph transformer encoder layer, a temporal transformer encoder layer, and a spatial graph transformer decoder layer based on the graphs. TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy. To further improve the tracking speed and accuracy, we propose a cascade association framework to handle low-score detections and long-term occlusions that require large computational resources to model in TransMOT. The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20, and it achieves state-of-the-art performance on all the datasets.
△ Less
Submitted 3 April, 2021; v1 submitted 31 March, 2021;
originally announced April 2021.
-
Disentanglement-based Cross-Domain Feature Augmentation for Effective Unsupervised Domain Adaptive Person Re-identification
Authors:
Zhizheng Zhang,
Cuiling Lan,
Wenjun Zeng,
Quanzeng You,
Zicheng Liu,
Kecheng Zheng,
Zhibo Chen
Abstract:
Unsupervised domain adaptive (UDA) person re-identification (ReID) aims to transfer the knowledge from the labeled source domain to the unlabeled target domain for person matching. One challenge is how to generate target domain samples with reliable labels for training. To address this problem, we propose a Disentanglement-based Cross-Domain Feature Augmentation (DCDFA) strategy, where the augment…
▽ More
Unsupervised domain adaptive (UDA) person re-identification (ReID) aims to transfer the knowledge from the labeled source domain to the unlabeled target domain for person matching. One challenge is how to generate target domain samples with reliable labels for training. To address this problem, we propose a Disentanglement-based Cross-Domain Feature Augmentation (DCDFA) strategy, where the augmented features characterize well the target and source domain data distributions while inheriting reliable identity labels. Particularly, we disentangle each sample feature into a robust domain-invariant/shared feature and a domain-specific feature, and perform cross-domain feature recomposition to enhance the diversity of samples used in the training, with the constraints of cross-domain ReID loss and domain classification loss. Each recomposed feature, obtained based on the domain-invariant feature (which enables a reliable inheritance of identity) and an enhancement from a domain specific feature (which enables the approximation of real distributions), is thus an "ideal" augmentation. Extensive experimental results demonstrate the effectiveness of our method, which achieves the state-of-the-art performance.
△ Less
Submitted 25 March, 2021;
originally announced March 2021.
-
A Simple yet Brisk and Efficient Active Learning Platform for Text Classification
Authors:
Teja Kanchinadam,
Qian You,
Keith Westpfahl,
James Kim,
Siva Gunda,
Sebastian Seith,
Glenn Fung
Abstract:
In this work, we propose the use of a fully managed machine learning service, which utilizes active learning to directly build models from unstructured data. With this tool, business users can quickly and easily build machine learning models and then directly deploy them into a production ready hosted environment without much involvement from data scientists. Our approach leverages state-of-the-ar…
▽ More
In this work, we propose the use of a fully managed machine learning service, which utilizes active learning to directly build models from unstructured data. With this tool, business users can quickly and easily build machine learning models and then directly deploy them into a production ready hosted environment without much involvement from data scientists. Our approach leverages state-of-the-art text representation like OpenAI's GPT2 and a fast implementation of the active learning workflow that relies on a simple construction of incremental learning using linear models, thus providing a brisk and efficient labeling experience for the users. Experiments on both publicly available and real-life insurance datasets empirically show why our choices of simple and fast classification algorithms are ideal for the task at hand.
△ Less
Submitted 31 January, 2021;
originally announced February 2021.
-
Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation
Authors:
Junyu Luo,
Zifei Zheng,
Hanzhong Ye,
Muchao Ye,
Yaqing Wang,
Quanzeng You,
Cao Xiao,
Fenglong Ma
Abstract:
Patients with low health literacy usually have difficulty understanding medical jargon and the complex structure of professional medical language. Although some studies are proposed to automatically translate expert language into layperson-understandable language, only a few of them focus on both accuracy and readability aspects simultaneously in the clinical domain. Thus, simplification of the cl…
▽ More
Patients with low health literacy usually have difficulty understanding medical jargon and the complex structure of professional medical language. Although some studies are proposed to automatically translate expert language into layperson-understandable language, only a few of them focus on both accuracy and readability aspects simultaneously in the clinical domain. Thus, simplification of the clinical language is still a challenging task, but unfortunately, it is not yet fully addressed in previous work. To benchmark this task, we construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches. Besides, we propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance compared with eight strong baselines. To fairly evaluate the performance, we also propose three specific evaluation metrics. Experimental results demonstrate the utility of the annotated MedLane dataset and the effectiveness of the proposed model DECLARE.
△ Less
Submitted 21 September, 2023; v1 submitted 4 December, 2020;
originally announced December 2020.
-
DcardNet: Diabetic Retinopathy Classification at Multiple Levels Based on Structural and Angiographic Optical Coherence Tomography
Authors:
Pengxiao Zang,
Liqin Gao,
Tristan T. Hormel,
Jie Wang,
Qisheng You,
Thomas S. Hwang,
Yali Jia
Abstract:
Objective: Optical coherence tomography (OCT) and its angiography (OCTA) have several advantages for the early detection and diagnosis of diabetic retinopathy (DR). However, automated, complete DR classification frameworks based on both OCT and OCTA data have not been proposed. In this study, a convolutional neural network (CNN) based method is proposed to fulfill a DR classification framework usi…
▽ More
Objective: Optical coherence tomography (OCT) and its angiography (OCTA) have several advantages for the early detection and diagnosis of diabetic retinopathy (DR). However, automated, complete DR classification frameworks based on both OCT and OCTA data have not been proposed. In this study, a convolutional neural network (CNN) based method is proposed to fulfill a DR classification framework using en face OCT and OCTA. Methods: A densely and continuously connected neural network with adaptive rate dropout (DcardNet) is designed for the DR classification. In addition, adaptive label smoothing was proposed and used to suppress overfitting. Three separate classification levels are generated for each case based on the International Clinical Diabetic Retinopathy scale. At the highest level the network classifies scans as referable or non-referable for DR. The second level classifies the eye as non-DR, non-proliferative DR (NPDR), or proliferative DR (PDR). The last level classifies the case as no DR, mild and moderate NPDR, severe NPDR, and PDR. Results: We used 10-fold cross-validation with 10% of the data to assess the networks performance. The overall classification accuracies of the three levels were 95.7%, 85.0%, and 71.0% respectively. Conclusion/Significance: A reliable, sensitive and specific automated classification framework for referral to an ophthalmologist can be a key technology for reducing vision loss related to DR.
△ Less
Submitted 24 September, 2020; v1 submitted 9 June, 2020;
originally announced June 2020.
-
Real-time 3D Deep Multi-Camera Tracking
Authors:
Quanzeng You,
Hao Jiang
Abstract:
Tracking a crowd in 3D using multiple RGB cameras is a challenging task. Most previous multi-camera tracking algorithms are designed for offline setting and have high computational complexity. Robust real-time multi-camera 3D tracking is still an unsolved problem. In this work, we propose a novel end-to-end tracking pipeline, Deep Multi-Camera Tracking (DMCT), which achieves reliable real-time mul…
▽ More
Tracking a crowd in 3D using multiple RGB cameras is a challenging task. Most previous multi-camera tracking algorithms are designed for offline setting and have high computational complexity. Robust real-time multi-camera 3D tracking is still an unsolved problem. In this work, we propose a novel end-to-end tracking pipeline, Deep Multi-Camera Tracking (DMCT), which achieves reliable real-time multi-camera people tracking. Our DMCT consists of 1) a fast and novel perspective-aware Deep GroudPoint Network, 2) a fusion procedure for ground-plane occupancy heatmap estimation, 3) a novel Deep Glimpse Network for person detection and 4) a fast and accurate online tracker. Our design fully unleashes the power of deep neural network to estimate the "ground point" of each person in each color image, which can be optimized to run efficiently and robustly. Our fusion procedure, glimpse network and tracker merge the results from different views, find people candidates using multiple video frames and then track people on the fused heatmap. Our system achieves the state-of-the-art tracking results while maintaining real-time performance. Apart from evaluation on the challenging WILDTRACK dataset, we also collect two more tracking datasets with high-quality labels from two different environments and camera settings. Our experimental results confirm that our proposed real-time pipeline gives superior results to previous approaches.
△ Less
Submitted 26 March, 2020;
originally announced March 2020.
-
Real-time Multiple People Hand Localization in 4D Point Clouds
Authors:
Hao Jiang,
Quanzeng You
Abstract:
We propose novel real-time algorithm to localize hands and find their associations with multiple people in the cluttered 4D volumetric data (dynamic 3D volumes). Different from the traditional multiple view approaches, which find key points in 2D and then triangulate to recover the 3D locations, our method directly processes the dynamic 3D data that involve both clutter and crowd. The volumetric r…
▽ More
We propose novel real-time algorithm to localize hands and find their associations with multiple people in the cluttered 4D volumetric data (dynamic 3D volumes). Different from the traditional multiple view approaches, which find key points in 2D and then triangulate to recover the 3D locations, our method directly processes the dynamic 3D data that involve both clutter and crowd. The volumetric representation is more desirable than the partial observations from different view points and enables more robust and accurate results. However, due to the large amount of data in the volumetric representation brute force 3D schemes are slow. In this paper, we propose novel real-time methods to tackle the problem to achieve both higher accuracy and faster speed than previous approaches. Our method detects the 3D bounding box of each subject and localizes the hands of each person. We develop new 2D features for fast candidate proposals and optimize the trajectory linking using a new max-covering bipartite matching formulation, which is critical for robust performance. We propose a novel decomposition method to reduce the key point localization in each person 3D volume to a sequence of efficient 2D problems. Our experiments show that the proposed method is faster than different competing methods and it gives almost half the localization error.
△ Less
Submitted 5 March, 2019;
originally announced March 2019.
-
Twitter Sentiment Analysis via Bi-sense Emoji Embedding and Attention-based LSTM
Authors:
Yuxiao Chen,
Jianbo Yuan,
Quanzeng You,
Jiebo Luo
Abstract:
Sentiment analysis on large-scale social media data is important to bridge the gaps between social media contents and real world activities including political election prediction, individual and public emotional status monitoring and analysis, and so on. Although textual sentiment analysis has been well studied based on platforms such as Twitter and Instagram, analysis of the role of extensive em…
▽ More
Sentiment analysis on large-scale social media data is important to bridge the gaps between social media contents and real world activities including political election prediction, individual and public emotional status monitoring and analysis, and so on. Although textual sentiment analysis has been well studied based on platforms such as Twitter and Instagram, analysis of the role of extensive emoji uses in sentiment analysis remains light. In this paper, we propose a novel scheme for Twitter sentiment analysis with extra attention on emojis. We first learn bi-sense emoji embeddings under positive and negative sentimental tweets individually, and then train a sentiment classifier by attending on these bi-sense emoji embeddings with an attention-based long short-term memory network (LSTM). Our experiments show that the bi-sense embedding is effective for extracting sentiment-aware embeddings of emojis and outperforms the state-of-the-art models. We also visualize the attentions to show that the bi-sense emoji embedding provides better guidance on the attention mechanism to obtain a more robust understanding of the semantics and sentiments.
△ Less
Submitted 6 August, 2018; v1 submitted 20 July, 2018;
originally announced July 2018.
-
"Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention
Authors:
Tianlang Chen,
Zhong** Zhang,
Quanzeng You,
Chen Fang,
Zhaowen Wang,
Hailin **,
Jiebo Luo
Abstract:
Generating stylized captions for an image is an emerging topic in image captioning. Given an image as input, it requires the system to generate a caption that has a specific style (e.g., humorous, romantic, positive, and negative) while describing the image content semantically accurately. In this paper, we propose a novel stylized image captioning model that effectively takes both requirements in…
▽ More
Generating stylized captions for an image is an emerging topic in image captioning. Given an image as input, it requires the system to generate a caption that has a specific style (e.g., humorous, romantic, positive, and negative) while describing the image content semantically accurately. In this paper, we propose a novel stylized image captioning model that effectively takes both requirements into consideration. To this end, we first devise a new variant of LSTM, named style-factual LSTM, as the building block of our model. It uses two groups of matrices to capture the factual and stylized knowledge, respectively, and automatically learns the word-level weights of the two groups based on previous context. In addition, when we train the model to capture stylized elements, we propose an adaptive learning approach based on a reference factual model, it provides factual knowledge to the model as the model learns from stylized caption labels, and can adaptively compute how much information to supply at each time step. We evaluate our model on two stylized image captioning datasets, which contain humorous/romantic captions and positive/negative captions, respectively. Experiments shows that our proposed model outperforms the state-of-the-art approaches, without using extra ground truth supervision.
△ Less
Submitted 29 July, 2018; v1 submitted 10 July, 2018;
originally announced July 2018.
-
Action4D: Real-time Action Recognition in the Crowd and Clutter
Authors:
Quanzeng You,
Hao Jiang
Abstract:
Recognizing every person's action in a crowded and cluttered environment is a challenging task. In this paper, we propose a real-time action recognition method, Action4D, which gives reliable and accurate results in the real-world settings. We propose to tackle the action recognition problem using a holistic 4D "scan" of a cluttered scene to include every detail about the people and environment. R…
▽ More
Recognizing every person's action in a crowded and cluttered environment is a challenging task. In this paper, we propose a real-time action recognition method, Action4D, which gives reliable and accurate results in the real-world settings. We propose to tackle the action recognition problem using a holistic 4D "scan" of a cluttered scene to include every detail about the people and environment. Recognizing multiple people's actions in the cluttered 4D representation is a new problem. In this paper, we propose novel methods to solve this problem. We propose a new method to track people in 4D, which can reliably detect and follow each person in real time. We propose a new deep neural network, the Action4D-Net, to recognize the action of each tracked person. The Action4D-Net's novel structure uses both the global feature and the focused attention to achieve state-of-the-art result. Our real-time method is invariant to camera view angles, resistant to clutter and able to handle crowd. The experimental results show that the proposed method is fast, reliable and accurate. Our method paves the way to action recognition in the real-world applications and is ready to be deployed to enable smart homes, smart factories and smart stores.
△ Less
Submitted 6 June, 2018;
originally announced June 2018.
-
Touch Your Heart: A Tone-aware Chatbot for Customer Care on Social Media
Authors:
Tianran Hu,
Anbang Xu,
Zhe Liu,
Quanzeng You,
Yufan Guo,
Vibha Sinha,
Jiebo Luo,
Rama Akkiraju
Abstract:
Chatbot has become an important solution to rapidly increasing customer care demands on social media in recent years. However, current work on chatbot for customer care ignores a key to impact user experience - tones. In this work, we create a novel tone-aware chatbot that generates toned responses to user requests on social media. We first conduct a formative research, in which the effects of ton…
▽ More
Chatbot has become an important solution to rapidly increasing customer care demands on social media in recent years. However, current work on chatbot for customer care ignores a key to impact user experience - tones. In this work, we create a novel tone-aware chatbot that generates toned responses to user requests on social media. We first conduct a formative research, in which the effects of tones are studied. Significant and various influences of different tones on user experience are uncovered in the study. With the knowledge of effects of tones, we design a deep learning based chatbot that takes tone information into account. We train our system on over 1.5 million real customer care conversations collected from Twitter. The evaluation reveals that our tone-aware chatbot generates as appropriate responses to user requests as human agents. More importantly, our chatbot is perceived to be even more empathetic than human agents.
△ Less
Submitted 14 March, 2018; v1 submitted 7 March, 2018;
originally announced March 2018.
-
Image Captioning at Will: A Versatile Scheme for Effectively Injecting Sentiments into Image Descriptions
Authors:
Quanzeng You,
Hailin **,
Jiebo Luo
Abstract:
Automatic image captioning has recently approached human-level performance due to the latest advances in computer vision and natural language understanding. However, most of the current models can only generate plain factual descriptions about the content of a given image. However, for human beings, image caption writing is quite flexible and diverse, where additional language dimensions, such as…
▽ More
Automatic image captioning has recently approached human-level performance due to the latest advances in computer vision and natural language understanding. However, most of the current models can only generate plain factual descriptions about the content of a given image. However, for human beings, image caption writing is quite flexible and diverse, where additional language dimensions, such as emotion, humor and language styles, are often incorporated to produce diverse, emotional, or appealing captions. In particular, we are interested in generating sentiment-conveying image descriptions, which has received little attention. The main challenge is how to effectively inject sentiments into the generated captions without altering the semantic matching between the visual content and the generated descriptions. In this work, we propose two different models, which employ different schemes for injecting sentiments into image captions. Compared with the few existing approaches, the proposed models are much simpler and yet more effective. The experimental results show that our model outperform the state-of-the-art models in generating sentimental (i.e., sentiment-bearing) image captions. In addition, we can also easily manipulate the model by assigning different sentiments to the testing image to generate captions with the corresponding sentiments.
△ Less
Submitted 30 January, 2018;
originally announced January 2018.
-
Duality of Channel Encoding and Decoding - Part II: Rate-1 Non-binary Convolutional Codes
Authors:
Qimin You,
Yonghui Li,
Soung Chang Liew,
Branka Vucetic
Abstract:
This is the second part of a series of papers on a revisit to the bidirectional Bahl-Cocke-Jelinek-Raviv (BCJR) soft-in-soft-out (SISO) maximum a posteriori probability (MAP) decoding algorithm. Part I revisited the BCJR MAP decoding algorithm for rate-1 binary convolutional codes and proposed a linear complexity decoder using shift registers in the complex number field. Part II proposes a low com…
▽ More
This is the second part of a series of papers on a revisit to the bidirectional Bahl-Cocke-Jelinek-Raviv (BCJR) soft-in-soft-out (SISO) maximum a posteriori probability (MAP) decoding algorithm. Part I revisited the BCJR MAP decoding algorithm for rate-1 binary convolutional codes and proposed a linear complexity decoder using shift registers in the complex number field. Part II proposes a low complexity decoder for rate-1 non-binary convolutional codes that achieves the same error performance as the bidirectional BCJR SISO MAP decoding algorithm. We observe an explicit relationship between the encoding and decoding of rate-1 convolutional codes in $GF(q)$. Based on this relationship, the BCJR forward and backward decoding are implemented by dual encoders using shift registers whose contents are vectors of complex numbers. The input to the dual encoders is the probability mass function (pmf) of the received symbols and the output of the dual encoders is the pmf of the information symbols. The bidirectional BCJR MAP decoding is implemented by linearly combining the shift register contents of the dual encoders for forward and backward decoding. The proposed decoder significantly reduces the computational complexity of the bidirectional BCJR MAP algorithm from exponential to linear with constraint length of convolutional codes. To further reduce complexity, fast Fourier transform (FFT) is applied. Mathematical proofs and simulation results are provided to validate our proposed decoder.
△ Less
Submitted 8 January, 2018;
originally announced January 2018.
-
Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks
Authors:
Fenglong Ma,
Radha Chitta,
**g Zhou,
Quanzeng You,
Tong Sun,
**g Gao
Abstract:
Predicting the future health information of patients from the historical Electronic Health Records (EHR) is a core research task in the development of personalized healthcare. Patient EHR data consist of sequences of visits over time, where each visit contains multiple medical codes, including diagnosis, medication, and procedure codes. The most important challenges for this task are to model the…
▽ More
Predicting the future health information of patients from the historical Electronic Health Records (EHR) is a core research task in the development of personalized healthcare. Patient EHR data consist of sequences of visits over time, where each visit contains multiple medical codes, including diagnosis, medication, and procedure codes. The most important challenges for this task are to model the temporality and high dimensionality of sequential EHR data and to interpret the prediction results. Existing work solves this problem by employing recurrent neural networks (RNNs) to model EHR data and utilizing simple attention mechanism to interpret the results. However, RNN-based approaches suffer from the problem that the performance of RNNs drops when the length of sequences is large, and the relationships between subsequent visits are ignored by current RNN-based approaches. To address these issues, we propose {\sf Dipole}, an end-to-end, simple and robust model for predicting patients' future health information. Dipole employs bidirectional recurrent neural networks to remember all the information of both the past visits and the future visits, and it introduces three attention mechanisms to measure the relationships of different visits for the prediction. With the attention mechanisms, Dipole can interpret the prediction results effectively. Dipole also allows us to interpret the learned medical code representations which are confirmed positively by medical experts. Experimental results on two real world EHR datasets show that the proposed Dipole can significantly improve the prediction accuracy compared with the state-of-the-art diagnosis prediction approaches and provide clinically meaningful interpretation.
△ Less
Submitted 18 June, 2017;
originally announced June 2017.
-
Cultural Diffusion and Trends in Facebook Photographs
Authors:
Quanzeng You,
Darío García-García,
Mahohar Paluri,
Jiebo Luo,
Jungseock Joo
Abstract:
Online social media is a social vehicle in which people share various moments of their lives with their friends, such as playing sports, cooking dinner or just taking a selfie for fun, via visual means, that is, photographs. Our study takes a closer look at the popular visual concepts illustrating various cultural lifestyles from aggregated, de-identified photographs. We perform analysis both at m…
▽ More
Online social media is a social vehicle in which people share various moments of their lives with their friends, such as playing sports, cooking dinner or just taking a selfie for fun, via visual means, that is, photographs. Our study takes a closer look at the popular visual concepts illustrating various cultural lifestyles from aggregated, de-identified photographs. We perform analysis both at macroscopic and microscopic levels, to gain novel insights about global and local visual trends as well as the dynamics of interpersonal cultural exchange and diffusion among Facebook friends. We processed images by automatically classifying the visual content by a convolutional neural network (CNN). Through various statistical tests, we find that socially tied individuals more likely post images showing similar cultural lifestyles. To further identify the main cause of the observed social correlation, we use the Shuffle test and the Preference-based Matched Estimation (PME) test to distinguish the effects of influence and homophily. The results indicate that the visual content of each user's photographs are temporally, although not necessarily causally, correlated with the photographs of their friends, which may suggest the effect of influence. Our paper demonstrates that Facebook photographs exhibit diverse cultural lifestyles and preferences and that the social interaction mediated through the visual channel in social media can be an effective mechanism for cultural diffusion.
△ Less
Submitted 24 May, 2017;
originally announced May 2017.
-
Image Based Appraisal of Real Estate Properties
Authors:
Quanzeng You,
Ran Pang,
Liangliang Cao,
Jiebo Luo
Abstract:
Real estate appraisal, which is the process of estimating the price for real estate properties, is crucial for both buys and sellers as the basis for negotiation and transaction. Traditionally, the repeat sales model has been widely adopted to estimate real estate price. However, it depends the design and calculation of a complex economic related index, which is challenging to estimate accurately.…
▽ More
Real estate appraisal, which is the process of estimating the price for real estate properties, is crucial for both buys and sellers as the basis for negotiation and transaction. Traditionally, the repeat sales model has been widely adopted to estimate real estate price. However, it depends the design and calculation of a complex economic related index, which is challenging to estimate accurately. Today, real estate brokers provide easy access to detailed online information on real estate properties to their clients. We are interested in estimating the real estate price from these large amounts of easily accessed data. In particular, we analyze the prediction power of online house pictures, which is one of the key factors for online users to make a potential visiting decision. The development of robust computer vision algorithms makes the analysis of visual content possible. In this work, we employ a Recurrent Neural Network (RNN) to predict real estate price using the state-of-the-art visual features. The experimental results indicate that our model outperforms several of other state-of-the-art baseline algorithms in terms of both mean absolute error (MAE) and mean absolute percentage error (MAPE).
△ Less
Submitted 27 July, 2017; v1 submitted 28 November, 2016;
originally announced November 2016.
-
The Effect of Pets on Happiness: A Data-Driven Approach via Large-Scale Social Media
Authors:
Yuchen Wu,
Jianbo Yuan,
Quanzeng You,
Jiebo Luo
Abstract:
Psychologists have demonstrated that pets have a positive impact on owners' happiness. For example, lonely people are often advised to have a dog or cat to quell their social isolation. Conventional psychological research methods of analyzing this phenomenon are mostly based on surveys or self-reported questionnaires, which are time-consuming and lack of scalability. Utilizing social media as an a…
▽ More
Psychologists have demonstrated that pets have a positive impact on owners' happiness. For example, lonely people are often advised to have a dog or cat to quell their social isolation. Conventional psychological research methods of analyzing this phenomenon are mostly based on surveys or self-reported questionnaires, which are time-consuming and lack of scalability. Utilizing social media as an alternative and complimentary resource could potentially address both issues and provide different perspectives on this psychological investigation. In this paper, we propose a novel and effective approach that exploits social media to study the effect of pets on owners' happiness. The proposed framework includes three major components: 1) collecting user-level data from Instagram consisting of about 300,000 images from 2905 users; 2) constructing a convolutional neural network (CNN) for pets classification, and combined with timeline information, further identifying pet owners and the control group; 3) measuring the confidence score of happiness by detecting and analyzing selfie images. Furthermore, various factors of demographics are employed to analyze the fine-grained effects of pets on happiness. Our experimental results demonstrate the effectiveness of the proposed approach and we believe that this approach can be applied to other related domains as a large-scale, high-confidence methodology of user activity analysis through social media.
△ Less
Submitted 27 October, 2016;
originally announced October 2016.
-
Building a Large Scale Dataset for Image Emotion Recognition: The Fine Print and The Benchmark
Authors:
Quanzeng You,
Jiebo Luo,
Hailin **,
Jianchao Yang
Abstract:
Psychological research results have confirmed that people can have different emotional reactions to different visual stimuli. Several papers have been published on the problem of visual emotion analysis. In particular, attempts have been made to analyze and predict people's emotional reaction towards images. To this end, different kinds of hand-tuned features are proposed. The results reported on…
▽ More
Psychological research results have confirmed that people can have different emotional reactions to different visual stimuli. Several papers have been published on the problem of visual emotion analysis. In particular, attempts have been made to analyze and predict people's emotional reaction towards images. To this end, different kinds of hand-tuned features are proposed. The results reported on several carefully selected and labeled small image data sets have confirmed the promise of such features. While the recent successes of many computer vision related tasks are due to the adoption of Convolutional Neural Networks (CNNs), visual emotion analysis has not achieved the same level of success. This may be primarily due to the unavailability of confidently labeled and relatively large image data sets for visual emotion analysis. In this work, we introduce a new data set, which started from 3+ million weakly labeled images of different emotions and ended up 30 times as large as the current largest publicly available visual emotion data set. We hope that this data set encourages further research on visual emotion analysis. We also perform extensive benchmarking analyses on this large data set using the state of the art methods including CNNs.
△ Less
Submitted 9 May, 2016;
originally announced May 2016.
-
Voting with Feet: Who are Leaving Hillary Clinton and Donald Trump?
Authors:
Yu Wang,
Yuncheng Li,
Quanzeng You,
Xiyang Zhang,
Richard Niemi,
Jiebo Luo
Abstract:
From a crowded field with 17 candidates, Hillary Clinton and Donald Trump have emerged as the two front-runners in the 2016 U.S. presidential campaign. The two candidates each boast more than 5 million followers on Twitter, and at the same time both have witnessed hundreds of thousands of people leave their camps. In this paper we attempt to characterize individuals who have left Hillary Clinton a…
▽ More
From a crowded field with 17 candidates, Hillary Clinton and Donald Trump have emerged as the two front-runners in the 2016 U.S. presidential campaign. The two candidates each boast more than 5 million followers on Twitter, and at the same time both have witnessed hundreds of thousands of people leave their camps. In this paper we attempt to characterize individuals who have left Hillary Clinton and Donald Trump between September 2015 and March 2016.
Our study focuses on three dimensions of social demographics: social capital, gender, and age. Within each camp, we compare the characteristics of the current followers with former followers, i.e., individuals who have left since September 2015. We use the number of followers to measure social capital, and profile images to infer gender and age. For classifying gender, we train a convolutional neural network (CNN). For age, we use the Face++ API.
Our study shows that for both candidates followers with more social capital are more likely to leave (or switch camps). For both candidates females make up a larger presence among unfollowers than among current followers. Somewhat surprisingly, the effect is particularly pronounced for Clinton. Lastly, middle-aged individuals are more likely to leave Trump, and the young are more likely to leave Hillary Clinton.
△ Less
Submitted 25 April, 2016; v1 submitted 24 April, 2016;
originally announced April 2016.
-
Image Captioning with Semantic Attention
Authors:
Quanzeng You,
Hailin **,
Zhaowen Wang,
Chen Fang,
Jiebo Luo
Abstract:
Automatically generating a natural language description of an image has attracted interests recently both because of its importance in practical applications and because it connects two major artificial intelligence fields: computer vision and natural language processing. Existing approaches are either top-down, which start from a gist of an image and convert it into words, or bottom-up, which com…
▽ More
Automatically generating a natural language description of an image has attracted interests recently both because of its importance in practical applications and because it connects two major artificial intelligence fields: computer vision and natural language processing. Existing approaches are either top-down, which start from a gist of an image and convert it into words, or bottom-up, which come up with words describing various aspects of an image and then combine them. In this paper, we propose a new algorithm that combines both approaches through a model of semantic attention. Our algorithm learns to selectively attend to semantic concept proposals and fuse them into hidden states and outputs of recurrent neural networks. The selection and fusion form a feedback connecting the top-down and bottom-up computation. We evaluate our algorithm on two public benchmarks: Microsoft COCO and Flickr30K. Experimental results show that our algorithm significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.
△ Less
Submitted 12 March, 2016;
originally announced March 2016.
-
Robust Image Sentiment Analysis Using Progressively Trained and Domain Transferred Deep Networks
Authors:
Quanzeng You,
Jiebo Luo,
Hailin **,
Jianchao Yang
Abstract:
Sentiment analysis of online user generated content is important for many social media analytics tasks. Researchers have largely relied on textual sentiment analysis to develop systems to predict political elections, measure economic indicators, and so on. Recently, social media users are increasingly using images and videos to express their opinions and share their experiences. Sentiment analysis…
▽ More
Sentiment analysis of online user generated content is important for many social media analytics tasks. Researchers have largely relied on textual sentiment analysis to develop systems to predict political elections, measure economic indicators, and so on. Recently, social media users are increasingly using images and videos to express their opinions and share their experiences. Sentiment analysis of such large scale visual content can help better extract user sentiments toward events or topics, such as those in image tweets, so that prediction of sentiment from visual content is complementary to textual sentiment analysis. Motivated by the needs in leveraging large scale yet noisy training data to solve the extremely challenging problem of image sentiment analysis, we employ Convolutional Neural Networks (CNN). We first design a suitable CNN architecture for image sentiment analysis. We obtain half a million training samples by using a baseline sentiment algorithm to label Flickr images. To make use of such noisy machine labeled data, we employ a progressive strategy to fine-tune the deep network. Furthermore, we improve the performance on Twitter images by inducing domain transfer with a small number of manually labeled Twitter images. We have conducted extensive experiments on manually labeled Twitter images. The results show that the proposed CNN can achieve better performance in image sentiment analysis than competing algorithms.
△ Less
Submitted 20 September, 2015;
originally announced September 2015.
-
A Picture Tells a Thousand Words -- About You! User Interest Profiling from User Generated Visual Content
Authors:
Quanzeng You,
Sumit Bhatia,
Jiebo Luo
Abstract:
Inference of online social network users' attributes and interests has been an active research topic. Accurate identification of users' attributes and interests is crucial for improving the performance of personalization and recommender systems. Most of the existing works have focused on textual content generated by the users and have successfully used it for predicting users' interests and other…
▽ More
Inference of online social network users' attributes and interests has been an active research topic. Accurate identification of users' attributes and interests is crucial for improving the performance of personalization and recommender systems. Most of the existing works have focused on textual content generated by the users and have successfully used it for predicting users' interests and other identifying attributes. However, little attention has been paid to user generated visual content (images) that is becoming increasingly popular and pervasive in recent times. We posit that images posted by users on online social networks are a reflection of topics they are interested in and propose an approach to infer user attributes from images posted by them. We analyze the content of individual images and then aggregate the image-level knowledge to infer user-level interest distribution. We employ image-level similarity to propagate the label information between images, as well as utilize the image category information derived from the user created organization structure to further propagate the category-level knowledge for all images. A real life social network dataset created from Pinterest is used for evaluation and the experimental results demonstrate the effectiveness of our proposed approach.
△ Less
Submitted 17 April, 2015;
originally announced April 2015.
-
Strongly Convex Programming for Principal Component Pursuit
Authors:
Qingshan You,
Qun Wan,
Yipeng Liu
Abstract:
In this paper, we address strongly convex programming for princi- pal component pursuit with reduced linear measurements, which decomposes a superposition of a low-rank matrix and a sparse matrix from a small set of linear measurements. We first provide sufficient conditions under which the strongly convex models lead to the exact low-rank and sparse matrix recov- ery; Second, we also give suggest…
▽ More
In this paper, we address strongly convex programming for princi- pal component pursuit with reduced linear measurements, which decomposes a superposition of a low-rank matrix and a sparse matrix from a small set of linear measurements. We first provide sufficient conditions under which the strongly convex models lead to the exact low-rank and sparse matrix recov- ery; Second, we also give suggestions on how to choose suitable parameters in practical algorithms.
△ Less
Submitted 19 September, 2012;
originally announced September 2012.
-
Duality of Channel Encoding and Decoding - Part I: Rate-1 Binary Convolutional Codes
Authors:
Yonghui Li,
Qimin You,
Soung C. Liew,
Branka Vucetic
Abstract:
In this paper, we revisit the forward, backward and bidirectional Bahl-Cocke-Jelinek-Raviv (BCJR) soft-input soft-output (SISO) maximum a posteriori probability (MAP) decoding process of rate-1 binary convolutional codes. From this we establish some interesting explicit relationships between encoding and decoding of rate-1 convolutional codes. We observe that the forward and backward BCJR SISO MAP…
▽ More
In this paper, we revisit the forward, backward and bidirectional Bahl-Cocke-Jelinek-Raviv (BCJR) soft-input soft-output (SISO) maximum a posteriori probability (MAP) decoding process of rate-1 binary convolutional codes. From this we establish some interesting explicit relationships between encoding and decoding of rate-1 convolutional codes. We observe that the forward and backward BCJR SISO MAP decoders can be simply represented by their dual SISO channel encoders using shift registers in the complex number field. Similarly, the bidirectional MAP decoding can be implemented by linearly combining the shift register contents of the dual SISO encoders of the respective forward and backward decoders. The dual encoder structures for various recursive and non-recursive rate-1 convolutional codes are derived.
△ Less
Submitted 10 November, 2016; v1 submitted 12 January, 2012;
originally announced January 2012.