Showing 1–50 of 58 results for author: Jie, Z

  1. arXiv:2407.02842  [pdf, other]

    cs.CV cs.AI cs.CL

    MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition and Analysis

    Authors: Lei Chen, Feng Yan, Yujie Zhong, Shaoxiang Chen, Zequn Jie, Lin Ma

    Abstract: Multimodal Large Language Models (MLLM) have made significant progress in the field of document analysis. Despite this, existing benchmarks typically focus only on extracting text and simple layout information, neglecting the complex interactions between elements in structured documents such as mind maps and flowcharts. To address this issue, we introduce the new benchmark named MindBench, which n…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: technical report

  2. arXiv:2406.08024  [pdf, other]

    cs.CV cs.AI

    Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models

    Authors: Shimin Chen, Yitian Yuan, Shaoxiang Chen, Zequn Jie, Lin Ma

    Abstract: Amidst the advancements in image-based Large Vision-Language Models (image-LVLM), the transition to video-based models (video-LVLM) is hindered by the limited availability of quality video data. This paper addresses the challenge by leveraging the visual commonalities between images and videos to efficiently evolve image-LVLMs into video-LVLMs. We present a cost-effective video-LVLM that enhances…

    Submitted 12 June, 2024; originally announced June 2024.

  3. arXiv:2406.00480  [pdf, other]

    cs.CV

    AlignSAM: Aligning Segment Anything Model to Open Context via Reinforcement Learning

    Authors: Duojun Huang, Xinyu Xiong, Jie Ma, Jichang Li, Zequn Jie, Lin Ma, Guanbin Li

    Abstract: Powered by massive curated training data, Segment Anything Model (SAM) has demonstrated its impressive generalization capabilities in open-world scenarios with the guidance of prompts. However, the vanilla SAM is class agnostic and heavily relies on user-provided prompts to segment objects of interest. Adapting this method to diverse tasks is crucial for accurate target identification and to avoid…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: CVPR2024

  4. arXiv:2405.03025  [pdf, other]

    cs.CV

    Matten: Video Generation with Mamba-Attention

    Authors: Yu Gao, Jiancheng Huang, Xiaopeng Sun, Zequn Jie, Yujie Zhong, Lin Ma

    Abstract: In this paper, we introduce Matten, a cutting-edge latent diffusion model with Mamba-Attention architecture for video generation. With minimal computational cost, Matten employs spatial-temporal attention for local video content modeling and bidirectional Mamba for global video content modeling. Our comprehensive experimental evaluation demonstrates that Matten has competitive performance with the…

    Submitted 10 May, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

  5. arXiv:2403.07304  [pdf, other]

    cs.CV

    Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models

    Authors: Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang

    Abstract: Large Multimodal Model (LMM) is a hot research topic in the computer vision area and has also demonstrated remarkable potential across multiple disciplinary fields. A recent trend is to further extend and enhance the perception capabilities of LMMs. The current methods follow the paradigm of adapting the visual task outputs to the format of the language model, which is the main component of an LMM.…

    Submitted 28 May, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

    Comments: Technical Report

  6. arXiv:2402.05937  [pdf, other]

    cs.CV

    InstaGen: Enhancing Object Detection by Training on Synthetic Dataset

    Authors: Chengjian Feng, Yujie Zhong, Zequn Jie, Weidi Xie, Lin Ma

    Abstract: In this paper, we present a novel paradigm to enhance the ability of an object detector, e.g., expanding categories or improving detection performance, by training on a synthetic dataset generated from diffusion models. Specifically, we integrate an instance-level grounding head into a pre-trained, generative diffusion model, to augment it with the ability of localising instances in the generated image…

    Submitted 8 April, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

    Comments: CVPR2024

  7. arXiv:2401.16160  [pdf, other]

    cs.CV

    LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs

    Authors: Shaoxiang Chen, Zequn Jie, Lin Ma

    Abstract: Instruction finetuning on a variety of image-text instruction data is the key to obtaining a versatile Multimodal Large Language Model (MLLM), and different configurations of the instruction data can lead to finetuned models with different capabilities. However, we have discovered that data conflicts are inevitable when mixing instruction data from distinct domains, which can result in performance…

    Submitted 30 January, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

  8. arXiv:2401.08967  [pdf, other]

    cs.CL

    ReFT: Reasoning with Reinforced Fine-Tuning

    Authors: Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, Hang Li

    Abstract: One way to enhance the reasoning capability of Large Language Models (LLMs) is to conduct Supervised Fine-Tuning (SFT) using Chain-of-Thought (CoT) annotations. This approach does not show sufficiently strong generalization ability, however, because the training only relies on the given CoT data. In math problem-solving, for example, there is usually only one annotated reasoning path for each ques…

    Submitted 27 June, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

    Comments: ACL 2024 main conference; adjust with reviewer comments; 13 pages

  9. arXiv:2312.09625  [pdf, other]

    cs.CV cs.CL

    Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment

    Authors: Xiaoxu Xu, Yitian Yuan, Qiudan Zhang, Wenhui Wu, Zequn Jie, Lin Ma, Xu Wang

    Abstract: Learning to ground natural language queries to target objects or regions in 3D point clouds is quite essential for 3D scene understanding. Nevertheless, existing 3D visual grounding approaches require a substantial number of bounding box annotations for text queries, which is time-consuming and labor-intensive to obtain. In this paper, we propose 3D-VLA, a weakly supervised approach for \…

    Submitted 13 April, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

  10. arXiv:2312.08004  [pdf, other]

    cs.CV

    Instance-aware Multi-Camera 3D Object Detection with Structural Priors Mining and Self-Boosting Learning

    Authors: Yang Jiao, Zequn Jie, Shaoxiang Chen, Lechao Cheng, Jingjing Chen, Lin Ma, Yu-Gang Jiang

    Abstract: Camera-based bird-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field. Under such a paradigm, accurate BEV representation construction relies on reliable depth estimation for multi-camera images. However, existing approaches exhaustively predict depths for every pixel without prioritizing objects, which are precisely the entities requiring detection in…

    Submitted 13 December, 2023; originally announced December 2023.

    Comments: Accepted to AAAI 2024

  11. arXiv:2309.11054  [pdf, other]

    cs.CL cs.AI cs.LG cs.PL

    Design of Chain-of-Thought in Math Problem Solving

    Authors: Zhanming Jie, Trung Quoc Luong, Xinbo Zhang, Xiaoran Jin, Hang Li

    Abstract: Chain-of-Thought (CoT) plays a crucial role in reasoning for math problem solving. We conduct a comprehensive examination of methods for designing CoT, comparing conventional natural language CoT with various program CoTs, including the self-describing program, the comment-describing program, and the non-describing program. Furthermore, we investigate the impact of programming language on program…

    Submitted 30 September, 2023; v1 submitted 20 September, 2023; originally announced September 2023.

    Comments: 15 pages

  12. arXiv:2306.00813  [pdf, other]

    cs.CV

    UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning

    Authors: Xiao Dong, Runhui Huang, Xiaoyong Wei, Zequn Jie, Jianxing Yu, Jian Yin, Xiaodan Liang

    Abstract: Recent advances in vision-language pre-training have enabled machines to perform better in multimodal object discrimination (e.g., image-text semantic alignment) and image synthesis (e.g., text-to-image generation). On the other hand, fine-tuning pre-trained models with discriminative or generative capabilities such as CLIP and Stable Diffusion on domain-specific datasets has been shown to be effective…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: NA

  13. arXiv:2305.18170  [pdf, other]

    cs.CL

    Leveraging Training Data in Few-Shot Prompting for Numerical Reasoning

    Authors: Zhanming Jie, Wei Lu

    Abstract: Chain-of-thought (CoT) prompting with large language models has proven effective in numerous natural language processing tasks, but designing prompts that generalize well to diverse problem types can be challenging, especially in the context of math word problem (MWP) solving. Additionally, it is common to have a large amount of training data that have a better diversity coverage but CoT annotatio…

    Submitted 9 June, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

    Comments: ACL 2023 Findings

  14. arXiv:2305.10448  [pdf, other]

    cs.CL cs.AI

    Sequence-to-Sequence Pre-training with Unified Modality Masking for Visual Document Understanding

    Authors: Shuwei Feng, Tianyang Zhan, Zhanming Jie, Trung Quoc Luong, Xiaoran Jin

    Abstract: This paper presents GenDoc, a general sequence-to-sequence document understanding model pre-trained with unified masking across three modalities: text, image, and layout. The proposed model utilizes an encoder-decoder architecture, which allows for increased adaptability to a wide range of downstream tasks with diverse output formats, in contrast to the encoder-only models commonly employed in doc…

    Submitted 16 May, 2023; originally announced May 2023.

  15. arXiv:2302.02367  [pdf, other]

    cs.CV cs.RO

    FastPillars: A Deployment-friendly Pillar-based 3D Detector

    Authors: Sifan Zhou, Zhi Tian, Xiangxiang Chu, Xinyu Zhang, Bo Zhang, Xiaobo Lu, Chengjian Feng, Zequn Jie, Patrick Yin Chiang, Lin Ma

    Abstract: The deployment of 3D detectors poses one of the major challenges in real-world self-driving scenarios. Existing BEV-based (i.e., Bird Eye View) detectors favor sparse convolutions (known as SPConv) to speed up training and inference, which puts a hard barrier for deployment, especially for on-device applications. In this paper, to tackle the challenge of efficient 3D object detection from an ind…

    Submitted 13 December, 2023; v1 submitted 5 February, 2023; originally announced February 2023.

    Comments: Submitted to AAAI2024

  16. arXiv:2212.03586  [pdf, other]

    cs.CV

    Multiple Object Tracking Challenge Technical Report for Team MT_IoT

    Authors: Feng Yan, Zhiheng Li, Weixin Luo, Zequn Jie, Fan Liang, Xiaolin Wei, Lin Ma

    Abstract: This is a brief technical report of our proposed method for Multiple-Object Tracking (MOT) Challenge in Complex Environments. In this paper, we treat the MOT task as a two-stage task including human detection and trajectory matching. Specifically, we designed an improved human detector and associated most of the detections to guarantee the integrity of the motion trajectory. We also propose a location-…

    Submitted 7 December, 2022; originally announced December 2022.

    Comments: This is a brief technical report for Multiple Object Tracking Challenge of ECCV workshop 2022

  17. arXiv:2211.12501  [pdf, other]

    cs.CV

    AeDet: Azimuth-invariant Multi-view 3D Object Detection

    Authors: Chengjian Feng, Zequn Jie, Yujie Zhong, Xiangxiang Chu, Lin Ma

    Abstract: Recent LSS-based multi-view 3D object detection has made tremendous progress, by processing the features in Bird-Eye-View (BEV) via the convolutional detector. However, the typical convolution ignores the radial symmetry of the BEV features and increases the difficulty of the detector optimization. To preserve the inherent property of the BEV features and ease the optimization, we propose an azimu…

    Submitted 4 April, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

    Comments: CVPR2023

  18. arXiv:2209.07828  [pdf, other]

    cs.CV

    Weakly Supervised Semantic Segmentation via Progressive Patch Learning

    Authors: Jinlong Li, Zequn Jie, Xu Wang, Yu Zhou, Xiaolin Wei, Lin Ma

    Abstract: Most of the existing semantic segmentation approaches with image-level class labels as supervision rely heavily on the initial class activation map (CAM) generated from the standard classification network. In this paper, a novel "Progressive Patch Learning" approach is proposed to improve the local details extraction of the classification, producing the CAM better covering the whole object rather…

    Submitted 16 September, 2022; originally announced September 2022.

    Comments: TMM2022 accepted

  19. arXiv:2209.07761  [pdf, other]

    cs.CV

    Expansion and Shrinkage of Localization for Weakly-Supervised Semantic Segmentation

    Authors: Jinlong Li, Zequn Jie, Xu Wang, Xiaolin Wei, Lin Ma

    Abstract: Generating precise class-aware pseudo ground-truths, a.k.a. class activation maps (CAMs), is essential for weakly-supervised semantic segmentation. The original CAM method usually produces incomplete and inaccurate localization maps. To tackle this issue, this paper proposes an Expansion and Shrinkage scheme based on the offset learning in the deformable convolution, to sequentially improve t…

    Submitted 19 September, 2022; v1 submitted 16 September, 2022; originally announced September 2022.

    Comments: NeurIPS2022 accepted

  20. arXiv:2209.03102  [pdf, other]

    cs.CV

    MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection

    Authors: Yang Jiao, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, Yu-Gang Jiang

    Abstract: Fusing LiDAR and camera information is essential for achieving accurate and reliable 3D object detection in autonomous driving systems. This is challenging due to the difficulty of combining multi-granularity geometric and semantic features from two drastically different modalities. Recent approaches aim at exploring the semantic densities of camera features through lifting points in 2D camera ima…

    Submitted 3 March, 2023; v1 submitted 7 September, 2022; originally announced September 2022.

    Comments: Accepted by CVPR 2023

  21. ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal Fashion Design

    Authors: Xujie Zhang, Yu Sha, Michael C. Kampffmeyer, Zhenyu Xie, Zequn Jie, Chengwen Huang, Jianqing Peng, Xiaodan Liang

    Abstract: Cross-modal fashion image synthesis has emerged as one of the most promising directions in the generation domain due to the vast untapped potential of incorporating multiple modalities and the wide range of fashion image applications. To facilitate accurate generation, cross-modal synthesis methods typically rely on Contrastive Language-Image Pre-training (CLIP) to align textual and garment inform…

    Submitted 10 August, 2022; originally announced August 2022.

    Comments: Accepted by ACMMM22

  22. arXiv:2207.04781  [pdf, other]

    cs.CV

    MT-Net Submission to the Waymo 3D Detection Leaderboard

    Authors: Shaoxiang Chen, Zequn Jie, Xiaolin Wei, Lin Ma

    Abstract: In this technical report, we introduce our submission to the Waymo 3D Detection leaderboard. Our network is based on the Centerpoint architecture, but with significant improvements. We design a 2D backbone to utilize multi-scale features for better detecting objects with various sizes, together with an optimal transport-based target assignment strategy, which dynamically assigns richer supervision…

    Submitted 11 July, 2022; originally announced July 2022.

  23. arXiv:2203.16513  [pdf, other]

    cs.CV

    PromptDet: Towards Open-vocabulary Detection using Uncurated Images

    Authors: Chengjian Feng, Yujie Zhong, Zequn Jie, Xiangxiang Chu, Haibing Ren, Xiaolin Wei, Weidi Xie, Lin Ma

    Abstract: The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations. To achieve that, we make the following four contributions: (i) in pursuit of generalisation, we propose a two-stage open-vocabulary object detector, where the class-agnostic object proposals are classified with a text encoder from pre-trained vi…

    Submitted 18 July, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: ECCV2022

  24. arXiv:2203.10316  [pdf, other]

    cs.CL cs.LG

    Learning to Reason Deductively: Math Word Problem Solving as Complex Relation Extraction

    Authors: Zhanming Jie, Jierui Li, Wei Lu

    Abstract: Solving math word problems requires deductive reasoning over the quantities in the text. Various recent research efforts mostly relied on sequence-to-sequence or sequence-to-tree models to generate mathematical expressions without explicitly performing relational reasoning between quantities in the given context. While empirically effective, such approaches typically do not provide explanations fo…

    Submitted 15 September, 2022; v1 submitted 19 March, 2022; originally announced March 2022.

    Comments: 12 pages, 7 figures, ACL-2022, additional experiments for math23k and large-LM

  25. arXiv:2203.05203  [pdf, other]

    cs.CV cs.AI cs.CL

    MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes

    Authors: Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang

    Abstract: 3D dense captioning is a recently-proposed novel task, where point clouds contain more geometric information than the 2D counterpart. However, it is also more challenging due to the higher complexity and wider variety of inter-object relations contained in point clouds. Existing methods only treat such relations as by-products of object feature learning in graphs without specifically encoding them…

    Submitted 20 July, 2022; v1 submitted 10 March, 2022; originally announced March 2022.

    Comments: Accepted by ECCV 2022

  26. arXiv:2203.05186  [pdf, other]

    cs.CV cs.AI

    Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding

    Authors: Yang Jiao, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang

    Abstract: Recently, one-stage visual grounders have attracted considerable attention due to their comparable accuracy but significantly higher efficiency than two-stage grounders. However, inter-object relation modeling has not been well studied for one-stage grounders. Inter-object relationship modeling, though important, is not necessarily performed among all objects, as only part of them are related to the text query a…

    Submitted 21 August, 2023; v1 submitted 10 March, 2022; originally announced March 2022.

    Comments: Accepted to ACM MM 23

  27. arXiv:2110.04435  [pdf, other]

    cs.CV cs.CL

    Two-stage Visual Cues Enhancement Network for Referring Image Segmentation

    Authors: Yang Jiao, Zequn Jie, Weixin Luo, Jingjing Chen, Yu-Gang Jiang, Xiaolin Wei, Lin Ma

    Abstract: Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred by one given natural language expression. The diverse and flexible expressions as well as complex visual contents in the images place higher demands on the RIS model for investigating fine-grained matching behaviors between words in expressions and objects presented in images. However, such matching be…

    Submitted 8 October, 2021; originally announced October 2021.

    Comments: Accepted by ACM MM 2021

  28. arXiv:2109.08382  [pdf, other]

    cs.CL

    To be Closer: Learning to Link up Aspects with Opinions

    Authors: Yuxiang Zhou, Lejian Liao, Yang Gao, Zhanming Jie, Wei Lu

    Abstract: Dependency parse trees are helpful for discovering the opinion words in aspect-based sentiment analysis (ABSA). However, the trees obtained from off-the-shelf dependency parsers are static, and could be sub-optimal in ABSA. This is because the syntactic trees are not designed for capturing the interactions between opinion words and aspect words. In this work, we aim to shorten the distance between…

    Submitted 17 September, 2021; originally announced September 2021.

    Comments: Accepted as a long paper in the main conference of EMNLP 2021

  29. arXiv:2104.05316  [pdf, other]

    cs.CL cs.LG

    Better Feature Integration for Named Entity Recognition

    Authors: Lu Xu, Zhanming Jie, Wei Lu, Lidong Bing

    Abstract: It has been shown that named entity recognition (NER) could benefit from incorporating the long-distance structured information captured by dependency trees. We believe this is because both types of features - the contextual information captured by the linear sequences and the structured information captured by the dependency trees may complement each other. However, existing approaches largely fo…

    Submitted 12 April, 2021; originally announced April 2021.

    Comments: Accepted by NAACL 2021. 13 pages, 9 Figures, 10 Tables

  30. arXiv:2005.11472  [pdf, other]

    cs.CV

    Delving into the Imbalance of Positive Proposals in Two-stage Object Detection

    Authors: Zheng Ge, Zequn Jie, Xin Huang, Chengzheng Li, Osamu Yoshie

    Abstract: The imbalance issue is a major yet unsolved bottleneck for the current object detection models. In this work, we observe two crucial yet never discussed imbalance issues. The first imbalance lies in the large number of low-quality RPN proposals, which makes the R-CNN module (i.e., post-classification layers) become highly biased towards the negative proposals in the early training stage. The second im…

    Submitted 23 May, 2020; originally announced May 2020.

  31. arXiv:2004.14813  [pdf, other]

    cs.CL

    ENT-DESC: Entity Description Generation by Exploring Knowledge Graph

    Authors: Liying Cheng, Dekun Wu, Lidong Bing, Yan Zhang, Zhanming Jie, Wei Lu, Luo Si

    Abstract: Previous works on knowledge-to-text generation take as input a few RDF triples or key-value pairs conveying the knowledge of some entities to generate a natural language description. Existing datasets, such as WIKIBIO, WebNLG, and E2E, basically have a good alignment between an input triple/pair set and its output text. However, in practice, the input knowledge could be more than enough, since the…

    Submitted 26 October, 2020; v1 submitted 30 April, 2020; originally announced April 2020.

    Comments: 11 pages, 6 figures, accepted by EMNLP 2020

  32. arXiv:2003.14058  [pdf, other]

    cs.LG cs.CV stat.ML

    MTL-NAS: Task-Agnostic Neural Architecture Search towards General-Purpose Multi-Task Learning

    Authors: Yuan Gao, Haoping Bai, Zequn Jie, Jiayi Ma, Kui Jia, Wei Liu

    Abstract: We propose to incorporate neural architecture search (NAS) into general-purpose multi-task learning (GP-MTL). Existing NAS methods typically define different search spaces according to different tasks. In order to adapt to different task combinations (i.e., task sets), we disentangle the GP-MTL networks into single-task backbones (optionally encode the task priors), and a hierarchical and layerwis…

    Submitted 31 March, 2020; originally announced March 2020.

    Comments: Accepted to CVPR2020. The first two authors contribute equally

    Journal ref: IEEE Conference on Computer Vision and Pattern Recognition, 2020

  33. arXiv:2003.12729  [pdf, other]

    cs.CV

    NMS by Representative Region: Towards Crowded Pedestrian Detection by Proposal Pairing

    Authors: Xin Huang, Zheng Ge, Zequn Jie, Osamu Yoshie

    Abstract: Although significant progress has been made in pedestrian detection recently, pedestrian detection in crowded scenes is still challenging. The heavy occlusion between pedestrians imposes great challenges to the standard Non-Maximum Suppression (NMS). A relatively low threshold of intersection over union (IoU) leads to missing highly overlapped pedestrians, while a higher one brings in plenty of fals…

    Submitted 21 April, 2020; v1 submitted 28 March, 2020; originally announced March 2020.

    Comments: Accepted by CVPR2020. The first two authors contributed equally, and are listed in alphabetical order

  34. arXiv:2003.07080  [pdf, other]

    cs.CV cs.LG eess.IV

    PS-RCNN: Detecting Secondary Human Instances in a Crowd via Primary Object Suppression

    Authors: Zheng Ge, Zequn Jie, Xin Huang, Rong Xu, Osamu Yoshie

    Abstract: Detecting human bodies in highly crowded scenes is a challenging problem. Two main reasons result in such a problem: (1) weak visual cues of heavily occluded instances can hardly provide sufficient information for accurate detection; (2) heavily occluded instances are more easily suppressed by Non-Maximum Suppression (NMS). To address these two issues, we introduce a variant of two-stage detector…

    Submitted 16 March, 2020; originally announced March 2020.

    Comments: 6 pages, accepted by ICME2020

  35. arXiv:1909.10148  [pdf, ps, other]

    cs.CL

    Dependency-Guided LSTM-CRF for Named Entity Recognition

    Authors: Zhanming Jie, Wei Lu

    Abstract: Dependency tree structures capture long-distance and syntactic relationships between words in a sentence. The syntactic relations (e.g., nominal subject, object) can potentially infer the existence of certain named entities. In addition, the performance of a named entity recognizer could benefit from the long-distance dependencies between the words in dependency trees. In this work, we propose a s…

    Submitted 23 September, 2019; originally announced September 2019.

    Comments: 13 pages, 6 figures, accepted by EMNLP 2019

  36. arXiv:1908.00347  [pdf, other]

    cs.CV

    Central Similarity Quantization for Efficient Image and Video Retrieval

    Authors: Li Yuan, Tao Wang, Xiaopeng Zhang, Francis EH Tay, Zequn Jie, Wei Liu, Jiashi Feng

    Abstract: Existing data-dependent hashing methods usually learn hash functions from pairwise or triplet data relationships, which only capture the data similarity locally, and often suffer from low learning efficiency and low collision rate. In this work, we propose a new global similarity metric, termed as central similarity, with which the hash codes of similar data pairs are encouraged to a…

    Submitted 30 March, 2020; v1 submitted 1 August, 2019; originally announced August 2019.

    Comments: CVPR2020, Codes: https://github.com/yuanli2333/Hadamard-Matrix-for-hashing

  37. arXiv:1812.03426  [pdf, other]

    cs.CV

    Real-Time Referring Expression Comprehension by Single-Stage Grounding Network

    Authors: Xinpeng Chen, Lin Ma, Jingyuan Chen, Zequn Jie, Wei Liu, Jiebo Luo

    Abstract: In this paper, we propose a novel end-to-end model, namely Single-Stage Grounding network (SSG), to localize the referent given a referring expression within an image. Different from previous multi-stage models which rely on object proposals or detected regions, our proposed model aims to comprehend a referring expression through one single stage without resorting to region proposals as well as th…

    Submitted 8 December, 2018; originally announced December 2018.

  38. arXiv:1811.09358  [pdf, ps, other]

    cs.LG cs.CV math.NA math.OC stat.ML

    A Sufficient Condition for Convergences of Adam and RMSProp

    Authors: Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu

    Abstract: Adam and RMSProp are two of the most influential adaptive stochastic algorithms for training deep neural networks, which have been pointed out to be divergent even in the convex setting via a few simple counterexamples. Many attempts, such as decreasing an adaptive learning rate, adopting a big batch size, incorporating a temporal decorrelation technique, seeking an analogous surrogate, etc., have…

    Submitted 24 June, 2019; v1 submitted 22 November, 2018; originally announced November 2018.

    Comments: Accepted by CVPR2019 as an Oral presentation

  39. arXiv:1810.08436  [pdf, other]

    cs.CL

    Efficient Dependency-Guided Named Entity Recognition

    Authors: Zhanming Jie, Aldrian Obaja Muis, Wei Lu

    Abstract: Named entity recognition (NER), which focuses on the extraction of semantically meaningful named entities and their semantic classes from text, serves as an indispensable component for several down-stream natural language processing (NLP) tasks such as relation extraction and event extraction. Dependency trees, on the other hand, also convey crucial semantic-level information. It has been shown pr…

    Submitted 22 October, 2018; v1 submitted 19 October, 2018; originally announced October 2018.

    Comments: 8+1 pages, 9 pages supplementary. Published in The 31st AAAI Conference on Artificial Intelligence (AAAI 2017). This version fixes the errors in two equations. arXiv admin note: text overlap with arXiv:1711.07010 by other authors

  40. arXiv:1810.05456  [pdf, other]

    cs.CV cs.RO

    Modeling Varying Camera-IMU Time Offset in Optimization-Based Visual-Inertial Odometry

    Authors: Yonggen Ling, Linchao Bao, Zequn Jie, Fengming Zhu, Ziyang Li, Shanmin Tang, Yongsheng Liu, Wei Liu, Tong Zhang

    Abstract: Combining cameras and inertial measurement units (IMUs) has been proven effective in motion tracking, as these two sensing modalities offer complementary characteristics that are suitable for fusion. While most works focus on global-shutter cameras and synchronized sensor measurements, consumer-grade devices are mostly equipped with rolling-shutter cameras and suffer from imperfect sensor synchron…

    Submitted 12 October, 2018; originally announced October 2018.

    Comments: European Conference on Computer Vision 2018

  41. arXiv:1809.00107  [pdf, other]

    cs.CL

    Dependency-based Hybrid Trees for Semantic Parsing

    Authors: Zhanming Jie, Wei Lu

    Abstract: We propose a novel dependency-based hybrid tree model for semantic parsing, which converts natural language utterance into machine interpretable meaning representations. Unlike previous state-of-the-art models, the semantic information is interpreted as the latent dependency between the natural language words in our joint representation. Such dependency information can capture the interactions bet…

    Submitted 31 August, 2018; originally announced September 2018.

    Comments: Accepted by EMNLP 2018

  42. arXiv:1808.03408  [pdf, ps, other]

    cs.LG math.NA math.OC stat.ML

    A Unified Analysis of AdaGrad with Weighted Aggregation and Momentum Acceleration

    Authors: Li Shen, Congliang Chen, Fangyu Zou, Zequn Jie, Ju Sun, Wei Liu

    Abstract: Integrating adaptive learning rate and momentum techniques into SGD leads to a large class of efficiently accelerated adaptive stochastic algorithms, such as AdaGrad, RMSProp, Adam, AccAdaGrad, etc. In spite of their effectiveness in practice, there is still a large gap in their theories of convergences, especially in the difficult non-convex stochastic setting. To fill this gap, we propo…

    Submitted 15 May, 2023; v1 submitted 10 August, 2018; originally announced August 2018.

    Comments: IEEE TNNLS

  43. arXiv:1805.04574  [pdf, other]

    cs.CV

    Revisiting Dilated Convolution: A Simple Approach for Weakly- and Semi- Supervised Semantic Segmentation

    Authors: Yunchao Wei, Huaxin Xiao, Honghui Shi, Zequn Jie, Jiashi Feng, Thomas S. Huang

    Abstract: Despite the remarkable progress, weakly supervised segmentation approaches are still inferior to their fully supervised counterparts. We observe that the performance gap mainly comes from their limitation on learning to produce high-quality dense object localization maps from image-level supervision. To mitigate such a gap, we revisit the dilated convolution [1] and reveal how it can be utilized in a n…

    Submitted 27 May, 2018; v1 submitted 11 May, 2018; originally announced May 2018.

    Comments: Accepted by CVPR 2018 as Spotlight

  44. arXiv:1804.03343  [pdf, other]

    cs.CV

    Modular Generative Adversarial Networks

    Authors: Bo Zhao, Bo Chang, Zequn Jie, Leonid Sigal

    Abstract: Existing methods for multi-domain image-to-image translation (or generation) attempt to directly map an input image (or a random vector) to an image in one of the output domains. However, most existing methods have limited scalability and robustness, since they require building independent models for each pair of domains in question. This leads to two significant shortcomings: (1) the need to trai…

    Submitted 10 April, 2018; originally announced April 2018.

  45. arXiv:1804.00796  [pdf, other]

    cs.CV

    Left-Right Comparative Recurrent Model for Stereo Matching

    Authors: Zequn Jie, Pengfei Wang, Yonggen Ling, Bo Zhao, Yunchao Wei, Jiashi Feng, Wei Liu

    Abstract: Leveraging the disparity information from both left and right views is crucial for stereo disparity estimation. Left-right consistency check is an effective way to enhance the disparity estimation by referring to the information from the opposite view. However, the conventional left-right consistency check is an isolated post-processing step and heavily hand-crafted. This paper proposes a novel le…

    Submitted 2 April, 2018; originally announced April 2018.

    Comments: Accepted by CVPR 2018

  46. arXiv:1711.03270  [pdf, other]

    cs.CV

    Predicting Scene Parsing and Motion Dynamics in the Future

    Authors: Xiaojie Jin, Huaxin Xiao, Xiaohui Shen, Jimei Yang, Zhe Lin, Yunpeng Chen, Zequn Jie, Jiashi Feng, Shuicheng Yan

    Abstract: The ability to predict the future is important for intelligent systems, e.g., autonomous vehicles and robots, to plan early and make decisions accordingly. Future scene parsing and optical flow estimation are two key tasks that help agents better understand their environments, as the former provides dense semantic information, i.e., what objects will be present and where they will appear, while the…

    Submitted 9 November, 2017; originally announced November 2017.

    Comments: To appear in NIPS 2017

  47. arXiv:1708.04483  [pdf, other]

    cs.CV

    Learning with Rethinking: Recurrently Improving Convolutional Neural Networks through Feedback

    Authors: Xin Li, Zequn Jie, Jiashi Feng, Changsong Liu, Shuicheng Yan

    Abstract: Recent years have witnessed the great success of convolutional neural network (CNN) based models in the field of computer vision. CNN is able to learn hierarchically abstracted features from images in an end-to-end training manner. However, most of the existing CNN models only learn features through a feedforward structure and no feedback information from top to bottom layers is exploited to enabl…

    Submitted 15 August, 2017; originally announced August 2017.

  48. arXiv:1708.02421  [pdf, other]

    cs.CV

    FoveaNet: Perspective-aware Urban Scene Parsing

    Authors: Xin Li, Zequn Jie, Wei Wang, Changsong Liu, Jimei Yang, Xiaohui Shen, Zhe Lin, Qiang Chen, Shuicheng Yan, Jiashi Feng

    Abstract: Parsing urban scene images benefits many applications, especially self-driving. Most of the current solutions employ generic image parsing models that treat all scales and locations in the images equally and do not consider the geometric properties of car-captured urban scene images. Thus, they suffer from heterogeneous object scales caused by perspective projection of cameras on actual scenes and in…

    Submitted 8 August, 2017; originally announced August 2017.

  49. arXiv:1707.06777  [pdf, other]

    cs.CV

    Neural Person Search Machines

    Authors: Hao Liu, Jiashi Feng, Zequn Jie, Karlekar Jayashree, Bo Zhao, Meibin Qi, Jianguo Jiang, Shuicheng Yan

    Abstract: We investigate the problem of person search in the wild in this work. Instead of comparing the query against all candidate regions generated in a query-blind manner, we propose to recursively shrink the search area from the whole image until achieving precise localization of the target person, by fully exploiting information from the query and contextual cues in every recursive search step. We deve…

    Submitted 21 July, 2017; originally announced July 2017.

    Comments: ICCV 2017 camera ready

  50. arXiv:1704.05188  [pdf, other]

    cs.CV

    Deep Self-Taught Learning for Weakly Supervised Object Localization

    Authors: Zequn Jie, Yunchao Wei, Xiaojie Jin, Jiashi Feng, Wei Liu

    Abstract: Most existing weakly supervised localization (WSL) approaches learn detectors by finding positive bounding boxes based on features learned with image-level supervision. However, those features do not contain spatial location related information and usually provide poor-quality positive samples for training a detector. To overcome this issue, we propose a deep self-taught learning approach, which m…

    Submitted 30 April, 2017; v1 submitted 17 April, 2017; originally announced April 2017.

    Comments: Accepted as spotlight paper by CVPR 2017