Showing 1–15 of 15 results for author: Wang, A J

  1. arXiv:2406.02547  [pdf, ps, other]

    cs.CV

    Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

    Authors: Alex Jinpeng Wang, Linjie Li, Yiqi Lin, Min Li, Lijuan Wang, Mike Zheng Shou

    Abstract: Training models with longer in-context lengths is a significant challenge for multimodal models due to substantial GPU memory and computational costs. This exploratory study does not present state-of-the-art models; rather, it introduces an innovative method designed to increase in-context text length in multi-modality large language models (MLLMs) efficiently. We present Visualized In-Context Text…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: 12 pages. Website: https://fingerrec.github.io/visincontext

  2. arXiv:2401.00849  [pdf, other]

    cs.CV

    COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

    Authors: Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

    Abstract: In the evolution of Vision-Language Pre-training, shifting from short-text comprehension to encompassing extended textual contexts is pivotal. Recent autoregressive vision-language models like \cite{flamingo, palme}, leveraging the long-context capability of Large Language Models, have excelled in few-shot text generation tasks but face challenges in alignment tasks. Addressing this gap, we introd…

    Submitted 1 January, 2024; originally announced January 2024.

    Comments: 16 pages; Website: http://fingerrec.github.io/cosmo

  3. arXiv:2312.14232  [pdf, other]

    cs.CV cs.AI

    Parrot Captions Teach CLIP to Spot Text

    Authors: Yiqi Lin, Conghui He, Alex Jinpeng Wang, Bin Wang, Weijia Li, Mike Zheng Shou

    Abstract: Despite CLIP being the foundation model in numerous vision-language applications, CLIP suffers from a severe text spotting bias. Such bias causes CLIP models to 'Parrot' the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in the most popular image-text dataset LAION-2B, the captions also densely parrot (spell) the text embedded in images. O…

    Submitted 1 February, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: project page: https://linyq17.github.io/CLIP-Parrot-Bias/. Add more analysis and ablation studies. Update Figure 3 with a more precise metric

  4. arXiv:2307.16715  [pdf, other]

    cs.CV

    UniVTG: Towards Unified Video-Language Temporal Grounding

    Authors: Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou

    Abstract: Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detect…

    Submitted 18 August, 2023; v1 submitted 31 July, 2023; originally announced July 2023.

    Comments: Accepted by ICCV 2023. 16 pages, 10 figures, 13 tables. Code: https://github.com/showlab/UniVTG

  5. arXiv:2305.20087  [pdf, other]

    cs.CV

    Too Large; Data Reduction for Vision-Language Pre-Training

    Authors: Alex Jinpeng Wang, Kevin Qinghong Lin, David Junhao Zhang, Stan Weixian Lei, Mike Zheng Shou

    Abstract: This paper examines the problems of severe image-text misalignment and high redundancy in the widely-used large-scale Vision-Language Pre-Training (VLP) datasets. To address these issues, we propose an efficient and straightforward Vision-Language learning algorithm called TL;DR, which aims to compress the existing large VLP data into a small, high-quality set. Our approach consists of two major s…

    Submitted 18 August, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

    Comments: ICCV2023. Code: https://github.com/showlab/datacentric.vlp

  6. arXiv:2212.09737  [pdf, other]

    cs.CV

    Position-guided Text Prompt for Vision-Language Pre-training

    Authors: Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan

    Abstract: Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to…

    Submitted 7 June, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: Camera-ready version. Code: https://github.com/sail-sg/ptp

  7. arXiv:2207.01622  [pdf, other]

    cs.CV

    Egocentric Video-Language Pretraining @ Ego4D Challenge 2022

    Authors: Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou

    Abstract: In this report, we propose a video-language pretraining (VLP) based solution \cite{kevin2022egovlp} for four Ego4D challenge tasks, including Natural Language Query (NLQ), Moment Query (MQ), Object State Change Classification (OSCC), and PNR Localization (PNR). In particular, we exploit the recently released Ego4D dataset \cite{grauman2021ego4d} to pioneer Egocentric VLP from pretraining dataset, pre…

    Submitted 3 August, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

    Comments: Preprint. 4 pages, 2 figures, 5 tables. Code: https://github.com/showlab/EgoVLP. The Ego4D challenge technical report of EgoVLP arXiv:2206.01670. See EPIC challenge technical report arXiv:2207.01334 for overlap

  8. arXiv:2207.01334  [pdf, other]

    cs.CV

    Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022

    Authors: Kevin Qinghong Lin, Alex Jinpeng Wang, Rui Yan, Eric Zhongcong Xu, Rongcheng Tu, Yanru Zhu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Wei Liu, Mike Zheng Shou

    Abstract: In this report, we propose a video-language pretraining (VLP) based solution \cite{kevin2022egovlp} for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge. In particular, we exploit the recently released Ego4D dataset \cite{grauman2021ego4d} to pioneer Egocentric VLP from pretraining dataset, pretraining objective, and development set. Based on the above three designs, we develop a pretra…

    Submitted 3 August, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

    Comments: To appear in CVPRW22. 5 pages, 2 figures, 2 tables. Code: https://github.com/showlab/EgoVLP. The EPIC challenge technical report of EgoVLP arXiv:2206.01670. See Ego4D challenge technical report arXiv:2207.01622

  9. arXiv:2206.01670  [pdf, other]

    cs.CV cs.AI

    Egocentric Video-Language Pretraining

    Authors: Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou

    Abstract: Video-Language Pretraining (VLP), which aims to learn transferable representation to advance a wide range of video-text downstream tasks, has recently received increasing attention. Best performing works rely on large-scale, 3rd-person video-text datasets, such as HowTo100M. In this work, we exploit the recently released Ego4D dataset to pioneer Egocentric VLP along three directions. (i) We create…

    Submitted 12 October, 2022; v1 submitted 3 June, 2022; originally announced June 2022.

    Comments: Accepted by NeurIPS 2022. Double champions at Ego4D and EPIC-Kitchens, CVPR 2022 challenges. 23 pages, 13 figures, 12 tables. Code: https://github.com/showlab/EgoVLP

  10. arXiv:2204.12408  [pdf, other]

    cs.CV

    MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

    Authors: Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying Shan, Xiaohu Qie, Ping Luo

    Abstract: Dominant pre-training works for video-text retrieval mainly adopt "dual-encoder" architectures to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations, but ignore detailed local semantics. The recent success of image BERT pre-training with masked visual modeling, which promotes the learning of local visual context, motivates a possible…

    Submitted 26 April, 2022; originally announced April 2022.

  11. arXiv:2203.07720  [pdf, other]

    cs.CV

    Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval

    Authors: Guanyu Cai, Yixiao Ge, Binjie Zhang, Alex Jinpeng Wang, Rui Yan, Xudong Lin, Ying Shan, Lianghua He, Xiaohu Qie, Jianping Wu, Mike Zheng Shou

    Abstract: Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval. Despite the impressive results, VLP research becomes extremely expensive with the need for massive data and a long training time, preventing further explorations. In this work, we revital…

    Submitted 7 February, 2023; v1 submitted 15 March, 2022; originally announced March 2022.

  12. arXiv:2203.07303  [pdf, other]

    cs.CV

    All in One: Exploring Unified Video-Language Pre-training

    Authors: Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

    Abstract: Mainstream Video-Language Pre-training models \cite{actbert,clipbert,violet} consist of three parts, a video encoder, a text encoder, and a video-text fusion Transformer. They pursue better performance via utilizing heavier unimodal encoders or multimodal fusion Transformers, resulting in increased parameters with lower efficiency in downstream tasks. In this work, we for the first time introduce…

    Submitted 14 March, 2022; originally announced March 2022.

    Comments: 18 pages. 11 figures. Code: https://github.com/showlab/all-in-one

  13. arXiv:2201.09159  [pdf]

    physics.flu-dyn cond-mat.soft physics.app-ph physics.ins-det physics.optics

    Calibration-Free Travel Time After Photobleaching Velocimetry

    Authors: Audrey J. Wang, Jianyu Deng, David Westbury, Yi Wang, Guiren Wang

    Abstract: In interfacial science, there is an increasing need to measure flow velocity fields at interfaces with ultrahigh spatial and temporal resolution to study transport phenomena. Although laser-induced fluorescence photobleaching anemometry (LIFPA) has achieved nanoscopic resolution for flow measurement, it requires pre-calibration, which is unavailable for unknown flows. We present a novel, calibrati…

    Submitted 22 January, 2022; originally announced January 2022.

  14. arXiv:2112.01194  [pdf, other]

    cs.CV cs.MM

    Video-Text Pre-training with Learned Regions

    Authors: Rui Yan, Mike Zheng Shou, Yixiao Ge, Alex Jinpeng Wang, Xudong Lin, Guanyu Cai, Jinhui Tang

    Abstract: Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs via aligning the semantics between visual and textual information. State-of-the-art approaches extract visual features from raw pixels in an end-to-end fashion. However, these methods operate at frame-level directly and thus overlook the spatio-temporal structure of objects in video, which yet h…

    Submitted 6 December, 2021; v1 submitted 2 December, 2021; originally announced December 2021.

  15. arXiv:2112.00656  [pdf, other]

    cs.CV cs.CL

    Object-aware Video-language Pre-training for Retrieval

    Authors: Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, Xiaohu Qie, Mike Zheng Shou

    Abstract: Recently, by introducing large-scale datasets and strong transformer networks, video-language pre-training has shown great success, especially for retrieval. Yet, existing video-language transformer models do not explicitly perform fine-grained semantic alignment. In this work, we present Object-aware Transformers, an object-centric approach that extends video-language transformer to incorporate object represent…

    Submitted 18 May, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

    Comments: CVPR2022; Code: https://github.com/FingerRec/OA-Transformer