Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
Authors:
Sukmin Yun,
Haokun Lin,
Rusiru Thushara,
Mohammad Qazim Bhat,
Yongxin Wang,
Zutao Jiang,
Mingkai Deng,
Jinhong Wang,
Tianhua Tao,
Junbo Li,
Haonan Li,
Preslav Nakov,
Timothy Baldwin,
Zhengzhong Liu,
Eric P. Xing,
Xiaodan Liang,
Zhiqiang Shen
Abstract:
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage's HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs' abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain, while previous datasets result in worse performance. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code will be available at https://github.com/MBZUAI-LLM/web2code.
Submitted 28 June, 2024;
originally announced June 2024.
AutoVideo: An Automated Video Action Recognition System
Authors:
Daochen Zha,
Zaid Pervaiz Bhat,
Yi-Wei Chen,
Yicheng Wang,
Sirui Ding,
Jiaben Chen,
Kwei-Herng Lai,
Mohammad Qazim Bhat,
Anmoll Kumar Jain,
Alfredo Costilla Reyes,
Na Zou,
Xia Hu
Abstract:
Action recognition is an important task for video understanding with broad applications. However, developing an effective action recognition solution often requires extensive engineering effort in building and testing different combinations of modules and their hyperparameters. In this demo, we present AutoVideo, a Python system for automated video action recognition. AutoVideo features 1) a highly modular and extendable infrastructure following the standard pipeline language, 2) an exhaustive list of primitives for pipeline construction, 3) data-driven tuners that reduce the effort of pipeline tuning, and 4) an easy-to-use Graphical User Interface (GUI). AutoVideo is released under the MIT license at https://github.com/datamllab/autovideo
Submitted 16 July, 2022; v1 submitted 9 August, 2021;
originally announced August 2021.