Showing 1–50 of 140 results for author: Duan, H

  1. arXiv:2406.17770  [pdf, other]

    cs.CV

    MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

    Authors: Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang

    Abstract: Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks. However, the majority of these models are constrained to process low-resolution images, which limits their effectiveness in perception tasks that necessitate detailed visual information. In our study, we present MG-LLaVA, an innovative MLLM that enhances the model's visual processing capa…

    Submitted 26 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

  2. arXiv:2406.17279  [pdf, other]

    cs.RO cs.AI

    Learning Decentralized Multi-Biped Control for Payload Transport

    Authors: Bikram Pandit, Ashutosh Gupta, Mohitvishnu S. Gadde, Addison Johnson, Aayam Kumar Shrestha, Helei Duan, Jeremy Dao, Alan Fern

    Abstract: Payload transport over flat terrain via multi-wheel robot carriers is well-understood, highly effective, and configurable. In this paper, our goal is to provide similar effectiveness and configurability for transport over rough terrain that is more suitable for legs rather than wheels. For this purpose, we consider multi-biped robot carriers, where wheels are replaced by multiple bipedal robots at…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Submitted to CoRL 2024, Project website: decmbc.github.io

  3. arXiv:2406.15848  [pdf, other]

    cs.CV

    Quality-guided Skin Tone Enhancement for Portrait Photography

    Authors: Shiqi Gao, Huiyu Duan, Xinyue Li, Kang Fu, Yicong Peng, Qihang Xu, Yuanyuan Chang, Jia Wang, Xiongkuo Min, Guangtao Zhai

    Abstract: In recent years, learning-based color and tone enhancement methods for photos have become increasingly popular. However, most learning-based image enhancement methods just learn a mapping from one distribution to another based on one dataset, lacking the ability to adjust images continuously and controllably. It is important to enable the learning-based enhancement models to adjust an image contin…

    Submitted 22 June, 2024; originally announced June 2024.

  4. arXiv:2406.14544  [pdf, other]

    cs.CV cs.CL

    Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

    Authors: Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, Kai Chen

    Abstract: Vision Language Models (VLMs) demonstrate remarkable proficiency in addressing a wide array of visual questions, which requires strong perception and reasoning faculties. Assessing these two competencies independently is crucial for model refinement, despite the inherent difficulty due to the intertwined nature of seeing and reasoning in existing VLMs. To tackle this issue, we present Prism, an in…

    Submitted 20 June, 2024; originally announced June 2024.

  5. arXiv:2406.14515  [pdf, other]

    cs.CV cs.MM

    MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

    Authors: Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, Kai Chen

    Abstract: The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding. Traditional VideoQA benchmarks, despite providing quantitative metrics, often fail to encompass the full spectrum of video content and inadequately assess models' temporal comprehension. To address these limitations, we introduce MMBench-Vide…

    Submitted 20 June, 2024; originally announced June 2024.

  6. arXiv:2406.09394  [pdf, other]

    cs.CV cs.GR

    WonderWorld: Interactive 3D Scene Generation from a Single Image

    Authors: Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T. Freeman, Jiajun Wu

    Abstract: We present WonderWorld, a novel framework for interactive 3D scene extrapolation that enables users to explore and shape virtual environments based on a single input image and user-specified text. While significant improvements have been made to the visual quality of scene generation, existing methods are run offline, taking tens of minutes to hours to generate a scene. By leveraging Fast Gaussian…

    Submitted 14 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Project website: https://WonderWorld-2024.github.io/

  7. arXiv:2406.05054  [pdf, other]

    cs.CV

    Prototype Correlation Matching and Class-Relation Reasoning for Few-Shot Medical Image Segmentation

    Authors: Yumin Zhang, Hongliu Li, Yajun Gao, Haoran Duan, Yawen Huang, Yefeng Zheng

    Abstract: Few-shot medical image segmentation has achieved great progress in improving accuracy and efficiency of medical analysis in the biomedical imaging field. However, most existing methods cannot explore inter-class relations among base and novel medical classes to reason unseen novel classes. Moreover, the same kind of medical class has large intra-class variations brought by diverse appearances, sha…

    Submitted 7 June, 2024; originally announced June 2024.

  8. arXiv:2406.04325  [pdf, other]

    cs.CV

    ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

    Authors: Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang

    Abstract: We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V annotated dense captions of videos with various lengths and sources, developed through carefully designed data filtering and annotating st…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Project Page: https://sharegpt4video.github.io/

  9. arXiv:2406.04027  [pdf, other]

    cs.CR cs.SE

    PowerPeeler: A Precise and General Dynamic Deobfuscation Method for PowerShell Scripts

    Authors: Ruijie Li, Chenyang Zhang, Huajun Chai, Lingyun Ying, Haixin Duan, Jun Tao

    Abstract: PowerShell is a powerful and versatile task automation tool. Unfortunately, it is also widely abused by cyber attackers. To bypass malware detection and hinder threat analysis, attackers often employ diverse techniques to obfuscate malicious PowerShell scripts. Existing deobfuscation tools suffer from the limitation of static analysis, which fails to simulate the real deobfuscation process accurat…

    Submitted 19 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

    Comments: To appear in the ACM CCS 2024

  10. arXiv:2406.02470  [pdf, other]

    quant-ph cs.LG

    Meta-Designing Quantum Experiments with Language Models

    Authors: Sören Arlt, Haonan Duan, Felix Li, Sang Michael Xie, Yuhuai Wu, Mario Krenn

    Abstract: Artificial Intelligence (AI) has the potential to significantly advance scientific discovery by finding solutions beyond human capabilities. However, these super-human solutions are often unintuitive and require considerable effort to uncover underlying principles, if possible at all. Here, we show how a code-generating language model trained on synthetic data can not only find solutions to specif…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: 10+3 pages, 5 figures

  11. arXiv:2405.18933  [pdf, other]

    cs.LG

    LSPI: Heterogeneous Graph Neural Network Classification Aggregation Algorithm Based on Size Neighbor Path Identification

    Authors: Yufei Zhao, Shiduo Wang, Hua Duan

    Abstract: Existing heterogeneous graph neural network algorithms (HGNNs) mostly rely on meta-paths to capture the rich semantic information contained in heterogeneous graphs (also known as heterogeneous information networks (HINs)), but most of these HGNNs focus on different ways of feature aggregation and ignore the properties of the meta-paths themselves. This paper studies meta-paths in three commonly u…

    Submitted 31 May, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

  12. Wearable-based behaviour interpolation for semi-supervised human activity recognition

    Authors: Haoran Duan, Shidong Wang, Varun Ojha, Shizheng Wang, Yawen Huang, Yang Long, Rajiv Ranjan, Yefeng Zheng

    Abstract: While traditional feature engineering for Human Activity Recognition (HAR) involves a trial-and-error process, deep learning has emerged as a preferred method for high-level representations of sensor-based human activities. However, most deep learning-based HAR requires a large amount of labelled data and extracting HAR features from unlabelled data for effective deep learning training remains chal…

    Submitted 24 May, 2024; originally announced May 2024.

  13. arXiv:2405.15914  [pdf, other]

    cs.CV

    ExactDreamer: High-Fidelity Text-to-3D Content Creation via Exact Score Matching

    Authors: Yumin Zhang, Xingyu Miao, Haoran Duan, Bo Wei, Tejal Shah, Yang Long, Rajiv Ranjan

    Abstract: Text-to-3D content creation is a rapidly evolving research area. Given the scarcity of 3D data, current approaches often adapt pre-trained 2D diffusion models for 3D synthesis. Among these approaches, Score Distillation Sampling (SDS) has been widely adopted. However, the issue of over-smoothing poses a significant limitation on the high-fidelity generation of 3D models. To address this challenge,…

    Submitted 24 May, 2024; originally announced May 2024.

  14. arXiv:2405.13900  [pdf, other]

    cs.LG cs.CV

    Rehearsal-free Federated Domain-incremental Learning

    Authors: Rui Sun, Haoran Duan, Jiahua Dong, Varun Ojha, Tejal Shah, Rajiv Ranjan

    Abstract: We introduce a rehearsal-free federated domain incremental learning framework, RefFiL, based on a global prompt-sharing paradigm to alleviate catastrophic forgetting challenges in federated domain-incremental learning, where unseen domains are continually learned. Typical methods for mitigating forgetting, such as the use of additional datasets and the retention of private data from earlier tasks,…

    Submitted 22 May, 2024; originally announced May 2024.

  15. arXiv:2405.12209  [pdf, other]

    cs.CL

    MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark

    Authors: Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei Zhang, Songyang Zhang, Dahua Lin, Kai Chen

    Abstract: Recent advancements in large language models (LLMs) have showcased significant improvements in mathematics. However, traditional math benchmarks like GSM8k offer a unidimensional perspective, falling short in providing a holistic assessment of the LLMs' math capabilities. To address this gap, we introduce MathBench, a new benchmark that rigorously assesses the mathematical capabilities of large la…

    Submitted 20 May, 2024; originally announced May 2024.

    Comments: Project: https://github.com/open-compass/MathBench

  16. arXiv:2405.11252  [pdf, other]

    cs.CV

    Dreamer XL: Towards High-Resolution Text-to-3D Generation via Trajectory Score Matching

    Authors: Xingyu Miao, Haoran Duan, Varun Ojha, Jun Song, Tejal Shah, Yang Long, Rajiv Ranjan

    Abstract: In this work, we propose a novel Trajectory Score Matching (TSM) method that aims to solve the pseudo ground truth inconsistency problem caused by the accumulated error in Interval Score Matching (ISM) when using the Denoising Diffusion Implicit Models (DDIM) inversion process. Unlike ISM which adopts the inversion process of DDIM to calculate on a single path, our TSM method leverages the inversi…

    Submitted 18 May, 2024; originally announced May 2024.

  17. arXiv:2405.10674  [pdf, other]

    cs.CV cs.AI

    From Sora What We Can See: A Survey of Text-to-Video Generation

    Authors: Rui Sun, Yumin Zhang, Tejal Shah, Jiahao Sun, Shuoying Zhang, Wenqi Li, Haoran Duan, Bo Wei, Rajiv Ranjan

    Abstract: With impressive achievements made, artificial intelligence is on the path forward to artificial general intelligence. Sora, developed by OpenAI, which is capable of minute-level world-simulative abilities can be considered as a milestone on this developmental path. However, despite its notable successes, Sora still encounters various obstacles that need to be resolved. In this survey, we embark fr…

    Submitted 17 May, 2024; originally announced May 2024.

    Comments: A comprehensive list of text-to-video generation studies in this survey is available at https://github.com/soraw-ai/Awesome-Text-to-Video-Generation

  18. arXiv:2405.09286  [pdf, other]

    cs.MM cs.CV

    MVBIND: Self-Supervised Music Recommendation For Videos Via Embedding Space Binding

    Authors: Jiajie Teng, Huiyu Duan, Yucheng Zhu, Sijing Wu, Guangtao Zhai

    Abstract: Recent years have witnessed the rapid development of short videos, which usually contain both visual and audio modalities. Background music is important to the short videos, which can significantly influence the emotions of the viewers. However, at present, the background music of short videos is generally chosen by the video producer, and there is a lack of automatic music recommendation methods…

    Submitted 15 May, 2024; originally announced May 2024.

  19. arXiv:2405.07346  [pdf, other]

    cs.CV

    Understanding and Evaluating Human Preferences for AI Generated Images with Instruction Tuning

    Authors: Jiarui Wang, Huiyu Duan, Guangtao Zhai, Xiongkuo Min

    Abstract: Artificial Intelligence Generated Content (AIGC) has grown rapidly in recent years, among which AI-based image generation has gained widespread attention due to its efficient and imaginative image creation ability. However, AI-generated Images (AIGIs) may not satisfy human preferences due to their unique distortions, which highlights the necessity to understand and evaluate human preferences for A…

    Submitted 12 May, 2024; originally announced May 2024.

  20. arXiv:2404.19173  [pdf, other]

    cs.RO

    Revisiting Reward Design and Evaluation for Robust Humanoid Standing and Walking

    Authors: Bart van Marum, Aayam Shrestha, Helei Duan, Pranay Dugar, Jeremy Dao, Alan Fern

    Abstract: A necessary capability for humanoid robots is the ability to stand and walk while rejecting natural disturbances. Recent progress has been made using sim-to-real reinforcement learning (RL) to train such locomotion controllers, with approaches differing mainly in their reward functions. However, prior works lack a clear method to systematically test new reward functions and compare controller perf…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: 8 pages, 5 figs

  21. arXiv:2404.09681  [pdf, other]

    cs.CR

    An Empirical Study of Open Edge Computing Platforms: Ecosystem, Usage, and Security Risks

    Authors: Yu Bi, Mingshuo Yang, Yong Fang, Xianghang Mi, Shanqing Guo, Shujun Tang, Haixin Duan

    Abstract: Emerging in recent years, open edge computing platforms (OECPs) claim large-scale edge nodes, the extensive usage and adoption, as well as the openness to any third parties to join as edge nodes. For instance, OneThingCloud, a major OECP operated in China, advertises 5 million edge nodes, 70TB bandwidth, and 1,500PB storage. However, little information is publicly available for such OECPs with reg…

    Submitted 15 April, 2024; originally announced April 2024.

  22. arXiv:2404.07537  [pdf, other]

    cs.CV

    How is Visual Attention Influenced by Text Guidance? Database and Model

    Authors: Yinan Sun, Xiongkuo Min, Huiyu Duan, Guangtao Zhai

    Abstract: The analysis and prediction of visual attention have long been crucial tasks in the fields of computer vision and image processing. In practical applications, images are generally accompanied by various text descriptions, however, few studies have explored the influence of text descriptions on visual attention, let alone developed visual saliency prediction models considering text guidance. In thi…

    Submitted 12 April, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

  23. arXiv:2404.06512  [pdf, other]

    cs.CV cs.CL

    InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

    Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and constrained to a relatively narrow reso…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: Code and models are publicly available at https://github.com/InternLM/InternLM-XComposer

  24. arXiv:2404.06480  [pdf, other]

    cs.CL cs.AI

    Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

    Authors: Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen

    Abstract: Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs' capability to handle extremely long documents. As various long-text techniques and model architectures emerge, the precise and detailed evaluation of models' long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct lo…

    Submitted 10 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    Comments: NAACL 2024

  25. arXiv:2404.01024  [pdf, other]

    cs.CV eess.IV

    AIGCOIQA2024: Perceptual Quality Assessment of AI Generated Omnidirectional Images

    Authors: Liu Yang, Huiyu Duan, Long Teng, Yucheng Zhu, Xiaohong Liu, Menghan Hu, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet

    Abstract: In recent years, the rapid advancement of Artificial Intelligence Generated Content (AIGC) has attracted widespread attention. Among the AIGC, AI generated omnidirectional images hold significant potential for Virtual Reality (VR) and Augmented Reality (AR) applications, hence omnidirectional AIGC techniques have also been widely studied. AI-generated omnidirectional images exhibit unique distorti…

    Submitted 1 April, 2024; originally announced April 2024.

  26. arXiv:2403.20330  [pdf, other]

    cs.CV

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Authors: Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao

    Abstract: Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or the world knowledge embedded in LLMs. This phenomeno…

    Submitted 9 April, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

    Comments: Project page: https://mmstar-benchmark.github.io/

  27. arXiv:2403.17297  [pdf, other]

    cs.CL cs.AI

    InternLM2 Technical Report

    Authors: Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, et al. (75 additional authors not shown)

    Abstract: The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context m…

    Submitted 25 March, 2024; originally announced March 2024.

  28. arXiv:2403.15426  [pdf, other]

    cs.LG cs.AI cs.CL

    A Three-Phases SFT Hybrid Model Integrated Strong Prior Module and Data Overlap Estimation in the Eduation Context

    Authors: Zhangquan Chen, Chunjiang Liu, Haobin Duan

    Abstract: In this paper, we propose an end-to-end prior-based three-phases supervised fine-tuned model, which is proved more competitive than traditional fine-tuning method. More specifically, our model realizes the structural disassembly and incremental guided output of educational knowledge. To this end, we robustify data classification of three types via a sampler and overlap estimation neural network, a…

    Submitted 13 March, 2024; originally announced March 2024.

    Comments: 9 pages, 2 figures

    ACM Class: I.2.7

  29. arXiv:2403.09363  [pdf, other]

    cs.CV

    Sentinel-Guided Zero-Shot Learning: A Collaborative Paradigm without Real Data Exposure

    Authors: Fan Wan, Xingyu Miao, Haoran Duan, Jingjing Deng, Rui Gao, Yang Long

    Abstract: With increasing concerns over data privacy and model copyrights, especially in the context of collaborations between AI service providers and data owners, an innovative SG-ZSL paradigm is proposed in this work. SG-ZSL is designed to foster efficient collaboration without the need to exchange models or sensitive data. It consists of a teacher model, a student model and a generator that links both m…

    Submitted 14 March, 2024; originally announced March 2024.

  30. arXiv:2402.09733  [pdf, other]

    cs.CL

    Do LLMs Know about Hallucination? An Empirical Investigation of LLM's Hidden States

    Authors: Hanyu Duan, Yi Yang, Kar Yan Tam

    Abstract: Large Language Models (LLMs) can make up answers that are not real, and this is known as hallucination. This research aims to see if, how, and to what extent LLMs are aware of hallucination. More specifically, we check whether and how an LLM reacts differently in its hidden states when it answers a question right versus when it hallucinates. To do this, we introduce an experimental framework which…

    Submitted 15 February, 2024; originally announced February 2024.

    Comments: 9 pages, 8 figures, 2 tables (13 pages, 12 figures, 13 tables including references and appendices)

  31. arXiv:2402.08183  [pdf, other]

    cs.CL cs.CV

    Pixel Sentence Representation Learning

    Authors: Chenghao Xiao, Zhuoxu Huang, Danlu Chen, G Thomas Hudson, Yizhi Li, Haoran Duan, Chenghua Lin, Jie Fu, Jungong Han, Noura Al Moubayed

    Abstract: Pretrained language models are long known to be subpar in capturing sentence and document-level semantics. Though heavily investigated, transferring perturbation-based methods from unsupervised visual representation learning to NLP remains an unsolved problem. This is largely due to the discreteness of subword units brought by tokenization of language models, limiting small perturbations of inputs…

    Submitted 12 February, 2024; originally announced February 2024.

  32. arXiv:2402.03413  [pdf, other]

    cs.MM cs.CV eess.IV

    Perceptual Video Quality Assessment: A Survey

    Authors: Xiongkuo Min, Huiyu Duan, Wei Sun, Yucheng Zhu, Guangtao Zhai

    Abstract: Perceptual video quality assessment plays a vital role in the field of video processing due to the existence of quality degradations introduced in various stages of video signal acquisition, compression, transmission and display. With the advancement of internet communication and cloud service technology, video content and traffic are growing exponentially, which further emphasizes the requirement…

    Submitted 5 February, 2024; originally announced February 2024.

  33. arXiv:2402.01950  [pdf, other]

    cs.CV

    ConRF: Zero-shot Stylization of 3D Scenes with Conditioned Radiation Fields

    Authors: Xingyu Miao, Yang Bai, Haoran Duan, Fan Wan, Yawen Huang, Yang Long, Yefeng Zheng

    Abstract: Most of the existing works on arbitrary 3D NeRF style transfer required retraining on each single style condition. This work aims to achieve zero-shot controlled stylization in 3D scenes utilizing text or visual input as conditioning factors. We introduce ConRF, a novel method of zero-shot stylization. Specifically, due to the ambiguity of CLIP features, we employ a conversion process that maps th…

    Submitted 6 March, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

  34. arXiv:2401.16420  [pdf, other]

    cs.CV cs.CL

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XCo…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: Code and models are available at https://github.com/InternLM/InternLM-XComposer

  35. arXiv:2401.04861  [pdf, other]

    cs.CV

    CTNeRF: Cross-Time Transformer for Dynamic Neural Radiance Field from Monocular Video

    Authors: Xingyu Miao, Yang Bai, Haoran Duan, Yawen Huang, Fan Wan, Yang Long, Yefeng Zheng

    Abstract: The goal of our work is to generate high-quality novel views from monocular videos of complex and dynamic scenes. Prior methods, such as DynamicNeRF, have shown impressive performance by leveraging time-varying dynamic radiation fields. However, these methods have limitations when it comes to accurately modeling the motion of complex objects, which can lead to inaccurate and blurry renderings of d…

    Submitted 26 June, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

    Comments: Accepted by Pattern Recognition

  36. arXiv:2312.07981  [pdf]

    cs.LG cs.SD eess.SP

    Time Series Diffusion Method: A Denoising Diffusion Probabilistic Model for Vibration Signal Generation

    Authors: Haiming Yi, Lei Hou, Yuhong Jin, Nasser A. Saeed, Ali Kandil, Hao Duan

    Abstract: Diffusion models have demonstrated powerful data generation capabilities in various research fields such as image generation. However, in the field of vibration signal generation, the criteria for evaluating the quality of the generated signal are different from that of image generation and there is a fundamental difference between them. At present, there is no research on the ability of diffusion…

    Submitted 30 June, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

    Journal ref: Mechanical Systems and Signal Processing, 2024, 216: 111481

  37. arXiv:2312.03884  [pdf, other]

    cs.CV cs.GR

    WonderJourney: Going from Anywhere to Everywhere

    Authors: Hong-Xing Yu, Haoyi Duan, Junhwa Hur, Kyle Sargent, Michael Rubinstein, William T. Freeman, Forrester Cole, Deqing Sun, Noah Snavely, Jiajun Wu, Charles Herrmann

    Abstract: We introduce WonderJourney, a modularized framework for perpetual 3D scene generation. Unlike prior work on view generation that focuses on a single type of scenes, we start at any user-provided location (by a text description or an image) and generate a journey through a long sequence of diverse yet coherently connected 3D scenes. We leverage an LLM to generate textual descriptions of the scenes…

    Submitted 12 April, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

    Comments: Project website with video results: https://kovenyu.com/WonderJourney/

  38. arXiv:2311.18482  [pdf, other]

    cs.CV cs.GR

    Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

    Authors: Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, Shao-Hua Guan

    Abstract: Open-vocabulary querying in 3D space is challenging but essential for scene understanding tasks such as object localization and segmentation. Language-embedded scene representations have made progress by incorporating language features into 3D spaces. However, their efficacy heavily depends on neural networks that are resource-intensive in training and rendering. Although recent 3D Gaussians offer…

    Submitted 30 November, 2023; originally announced November 2023.

  39. arXiv:2311.18377  [pdf]

    physics.chem-ph cs.LG q-bio.BM

    Transfer Learning across Different Chemical Domains: Virtual Screening of Organic Materials with Deep Learning Models Pretrained on Small Molecule and Chemical Reaction Data

    Authors: Chengwei Zhang, Yushuang Zhai, Ziyang Gong, Hongliang Duan, Yuan-Bin She, Yun-Fang Yang, An Su

    Abstract: Machine learning is becoming a preferred method for the virtual screening of organic materials due to its cost-effectiveness over traditional computationally demanding techniques. However, the scarcity of labeled data for organic materials poses a significant challenge for training advanced machine learning models. This study showcases the potential of utilizing databases of drug-like small molecu…

    Submitted 5 March, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

  40. arXiv:2311.18213  [pdf, other]

    cs.IR cs.AI

    Beyond Two-Tower Matching: Learning Sparse Retrievable Cross-Interactions for Recommendation

    Authors: Liangcai Su, Fan Yan, Jieming Zhu, Xi Xiao, Haoyi Duan, Zhou Zhao, Zhenhua Dong, Ruiming Tang

    Abstract: Two-tower models are a prevalent matching framework for recommendation, which have been widely deployed in industrial applications. The success of two-tower matching attributes to its efficiency in retrieval among a large number of items, since the item tower can be precomputed and used for fast Approximate Nearest Neighbor (ANN) search. However, it suffers two main challenges, including limited f…

    Submitted 29 November, 2023; originally announced November 2023.

    Comments: Accepted by SIGIR 2023. Code will be available at https://reczoo.github.io/SparCode

  41. arXiv:2311.15637  [pdf, other]

    cs.CV cs.GR

    Neural 3D Strokes: Creating Stylized 3D Scenes with Vectorized 3D Strokes

    Authors: Hao-Bin Duan, Miao Wang, Yan-Xun Li, Yong-Liang Yang

    Abstract: We present Neural 3D Strokes, a novel technique to generate stylized images of a 3D scene at arbitrary novel views from multi-view 2D images. Different from existing methods which apply stylization to trained neural radiance fields at the voxel level, our approach draws inspiration from image-to-painting methods, simulating the progressive painting process of human artwork with vector strokes. We…

    Submitted 12 March, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: Accepted to CVPR 2024

  42. arXiv:2311.10395  [pdf, other]

    cs.CL

    Bias A-head? Analyzing Bias in Transformer-Based Language Model Attention Heads

    Authors: Yi Yang, Hanyu Duan, Ahmed Abbasi, John P. Lalor, Kar Yan Tam

    Abstract: Transformer-based pretrained large language models (PLMs) such as BERT and GPT have achieved remarkable success in NLP tasks. However, PLMs are prone to encoding stereotypical biases. Although a burgeoning literature has emerged on stereotypical bias mitigation in PLMs, such as work on debiasing gender and racial stereotyping, how such biases manifest and behave internally within PLMs remains large…

    Submitted 15 June, 2024; v1 submitted 17 November, 2023; originally announced November 2023.

    Comments: 14 pages, 7 figures, 3 tables including references and appendices

  43. arXiv:2311.10367  [pdf, other]

    cs.CL

    Exploring the Relationship between In-Context Learning and Instruction Tuning

    Authors: Hanyu Duan, Yixuan Tang, Yi Yang, Ahmed Abbasi, Kar Yan Tam

    Abstract: In-Context Learning (ICL) and Instruction Tuning (IT) are two primary paradigms of adopting Large Language Models (LLMs) to downstream applications. However, they are significantly different. In ICL, a set of demonstrations is provided at inference time but the LLM's parameters are not updated. In IT, a set of demonstrations is used to tune the LLM's parameters at training time but no demonstrations…

    Submitted 17 November, 2023; originally announced November 2023.

  44. arXiv:2311.05521  [pdf, other]

    cs.GR cs.CV

    BakedAvatar: Baking Neural Fields for Real-Time Head Avatar Synthesis

    Authors: Hao-Bin Duan, Miao Wang, Jin-Chuan Shi, Xu-Chuan Chen, Yan-Pei Cao

    Abstract: Synthesizing photorealistic 4D human head avatars from videos is essential for VR/AR, telepresence, and video game applications. Although existing Neural Radiance Fields (NeRF)-based methods achieve high-fidelity results, the computational expense limits their use in real-time applications. To overcome this limitation, we introduce BakedAvatar, a novel representation for real-time neural head avat…

    Submitted 28 November, 2023; v1 submitted 9 November, 2023; originally announced November 2023.

    Comments: ACM Transactions on Graphics (SIGGRAPH Asia 2023). Project Page: https://buaavrcg.github.io/BakedAvatar

    Journal ref: ACM Trans. Graph. 42, 6, Article 225 (December 2023), 14 pages

  45. arXiv:2311.05190  [pdf, other]

    cs.CV

    Audio-visual Saliency for Omnidirectional Videos

    Authors: Yuxin Zhu, Xilei Zhu, Huiyu Duan, Jie Li, Kaiwei Zhang, Yucheng Zhu, Li Chen, Xiongkuo Min, Guangtao Zhai

    Abstract: Visual saliency prediction for omnidirectional videos (ODVs) is of great significance for ODV coding, ODV transmission, ODV rendering, etc. However, most studies only consider visual information for ODV saliency prediction, while audio is rarely considered despite its significant influence on the viewing behavior of ODVs. This is mainly due to the la…

    Submitted 9 November, 2023; originally announced November 2023.

    Comments: 13 pages, 5 figures, conference

  46. arXiv:2311.05152  [pdf, other]

    cs.LG cs.AI cs.CV cs.MM

    Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

    Authors: Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao

    Abstract: In recent years, the deployment of large-scale pre-trained models in audio-visual downstream tasks has yielded remarkable outcomes. However, these models, primarily trained on single-modality unconstrained datasets, still encounter challenges in feature extraction for multi-modal tasks, leading to suboptimal performance. This limitation arises due to the introduction of irrelevant modality-specifi…

    Submitted 20 December, 2023; v1 submitted 9 November, 2023; originally announced November 2023.

    Comments: Accepted to NeurIPS 2023

  47. arXiv:2310.13650  [pdf, other]

    cs.CL

    BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues

    Authors: Haodong Duan, Jueqi Wei, Chonghua Wang, Hongwei Liu, Yixiao Fang, Songyang Zhang, Dahua Lin, Kai Chen

    Abstract: Interacting with humans via high-quality multi-turn dialogues is a key feature of large language models (LLMs). However, human-based evaluation of such capability involves intensive manual labor. This report provides a preliminary evaluation of existing large language models for human-style multi-turn chatting, through an LLM-based approach. We start from real-world human dialogues and keep the ver…

    Submitted 20 October, 2023; originally announced October 2023.

  48. arXiv:2310.03202  [pdf, other]

    cs.CR cs.NI cs.SE

    ResolverFuzz: Automated Discovery of DNS Resolver Vulnerabilities with Query-Response Fuzzing

    Authors: Qifan Zhang, Xuesong Bai, Xiang Li, Haixin Duan, Qi Li, Zhou Li

    Abstract: The Domain Name System (DNS) is a critical component of the Internet. DNS resolvers, which act as the cache between DNS clients and DNS nameservers, are the central piece of the DNS infrastructure, essential to the scalability of DNS. However, finding resolver vulnerabilities is non-trivial, and this problem is not well addressed by the existing tools. To list a few reasons, first, most of the kno…

    Submitted 4 October, 2023; originally announced October 2023.

    Comments: Extended version. Accepted by USENIX Security 2024

  49. arXiv:2310.03191  [pdf, other]

    cs.RO

    Sim-to-Real Learning for Humanoid Box Loco-Manipulation

    Authors: Jeremy Dao, Helei Duan, Alan Fern

    Abstract: In this work we propose a learning-based approach to box loco-manipulation for a humanoid robot. This is a particularly challenging problem due to the need for whole-body coordination in order to lift boxes of varying weight, position, and orientation while maintaining balance. To address this challenge, we present a sim-to-real reinforcement learning approach for training general box pickup and c…

    Submitted 4 October, 2023; originally announced October 2023.

  50. arXiv:2310.01884  [pdf, other]

    cs.LG cs.AI

    Enhanced LFTSformer: A Novel Long-Term Financial Time Series Prediction Model Using Advanced Feature Engineering and the DS Encoder Informer Architecture

    Authors: Jianan Zhang, Hongyi Duan

    Abstract: This study presents a groundbreaking model for forecasting long-term financial time series, termed the Enhanced LFTSformer. The model distinguishes itself through several significant innovations: (1) VMD-MIC+FE Feature Engineering: The incorporation of sophisticated feature engineering techniques, specifically through the integration of Variational Mode Decomposition (VMD), Maximal Information C…

    Submitted 18 April, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

    Comments: The methodology, experiments, and language of the original version have been completely updated. Detailed adjustments will be made in the future