Search | arXiv e-print repository

STEVE Series: Step-by-Step Construction of Agent Systems in Minecraft

Authors: Zhonghan Zhao, Wenhao Chai, Xuan Wang, Ke Ma, Kewei Chen, Dongxu Guo, Tian Ye, Yanting Zhang, Hongwei Wang, Gaoang Wang

Abstract: Building an embodied agent system with a large language model (LLM) as its core is a promising direction. Due to the significant costs and uncontrollable factors associated with deploying and training such agents in the real world, we have decided to begin our exploration within the Minecraft environment. Our STEVE Series agents can complete basic tasks in a virtual environment and more challenging tasks such as navigation and even creative tasks, with an efficiency far exceeding previous state-of-the-art methods by a factor of $2.5\times$ to $7.3\times$. We begin our exploration with a vanilla large language model, augmenting it with a vision encoder and an action codebase trained on our collected high-quality dataset STEVE-21K. Subsequently, we enhanced it with a Critic and memory to transform it into a complex system. Finally, we constructed a hierarchical multi-agent system. Our recent work explored how to prune the agent system through knowledge distillation. In the future, we will explore more potential applications of STEVE agents in the real world.

Submitted 17 June, 2024; originally announced June 2024.

Comments: CVPR 2024 Embodied AI Workshop

arXiv:2406.04983

CityCraft: A Real Crafter for 3D City Generation

Authors: Jie Deng, Wenhao Chai, Junsheng Huang, Zhonghan Zhao, Qixuan Huang, Mingyan Gao, Jianshu Guo, Shengyu Hao, Wenhao Hu, Jenq-Neng Hwang, Xi Li, Gaoang Wang

Abstract: City scene generation has gained significant attention in autonomous driving, smart city development, and traffic simulation. It helps enhance infrastructure planning and monitoring solutions. Existing methods have employed a two-stage process involving city layout generation, typically using Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), or Transformers, followed by neural rendering. These techniques often exhibit limited diversity and noticeable artifacts in the rendered city scenes. The rendered scenes lack variety, resembling the training images, resulting in monotonous styles. Additionally, these methods lack planning capabilities, leading to less realistic generated scenes. In this paper, we introduce CityCraft, an innovative framework designed to enhance both the diversity and quality of urban scene generation. Our approach integrates three key stages: initially, a diffusion transformer (DiT) model is deployed to generate diverse and controllable 2D city layouts. Subsequently, a Large Language Model (LLM) is utilized to strategically make land-use plans within these layouts based on user prompts and language guidelines. Based on the generated layout and city plan, we utilize the asset retrieval module and Blender for precise asset placement and scene construction. Furthermore, we contribute two new datasets to the field: 1) the CityCraft-OSM dataset, including 2D semantic layouts of urban areas, corresponding satellite images, and detailed annotations; 2) the CityCraft-Buildings dataset, featuring thousands of diverse, high-quality 3D building assets. CityCraft achieves state-of-the-art performance in generating realistic 3D cities.

Submitted 7 June, 2024; originally announced June 2024.

Comments: 20 pages, 9 figures

arXiv:2404.17176

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering

Authors: Enxin Song, Wenhao Chai, Tian Ye, Jenq-Neng Hwang, Xi Li, Gaoang Wang

Abstract: Recently, integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. Yet, existing methods either employ complex spatial-temporal modules or rely heavily on additional perception models to extract temporal features for video understanding, and they only perform well on short videos. For long videos, the computational complexity and memory costs associated with long-term temporal connections are significantly increased, posing additional challenges. Taking advantage of the Atkinson-Shiffrin memory model, with tokens in Transformers being employed as the carriers of memory in combination with our specially designed memory mechanism, we propose MovieChat to overcome these challenges. We lift pre-trained multi-modal large language models for understanding long videos without incorporating additional trainable temporal modules, employing a zero-shot approach. MovieChat achieves state-of-the-art performance in long video understanding, along with the released MovieChat-1K benchmark with 1K long videos, 2K temporal grounding labels, and 14K manual annotations for validation of the effectiveness of our method. The code along with the dataset can be accessed at https://github.com/rese1f/MovieChat.

Submitted 26 April, 2024; originally announced April 2024.

arXiv:2404.12871

Expanding the Katz Index for Link Prediction: A Case Study on a Live Fish Movement Network

Authors: Michael-Sam Vidza, Marcin Budka, Wei Koong Chai, Mark Thrush, Mickael Teixeira Alves

Abstract: In aquaculture, disease spread models often neglect the dynamic interactions between farms, hindering accuracy. This study enhances the Katz index (KI) to incorporate spatial and temporal patterns of fish movement, improving the prediction of farms susceptible to disease via live fish transfers. We modified the Katz index to create models like the Weighted Katz Index (WKI), Edge Weighted Katz Index (EWKI), and combined models (e.g., KIEWKI). These incorporate spatial distances and temporal movement patterns for a comprehensive aquaculture network connection prediction framework. Model performance was evaluated using precision, recall, F1-scores, AUPR, and AUROC. The EWKI model significantly outperformed the traditional KI and other variations. It achieved high precision (0.988), recall (0.712), F1-score (0.827), and AUPR (0.970). Combined models (KIEWKI, WKIEWKI) approached, but couldn't surpass, EWKI performance. This study highlights the value of extending Katz index models to improve disease spread predictions in aquaculture networks. The EWKI model's performance demonstrates an innovative and flexible approach to tackling spatial challenges within network analysis.

Submitted 19 April, 2024; originally announced April 2024.

Comments: 15 pages, 3 figures, submitted to Expert Systems with Applications
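
Illustrative note on the entry above: the abstract builds on the Katz index, which scores a node pair by summing the walks between them, with a walk of length l weighted by beta^l. The sketch below shows only the standard, truncated Katz index on an unweighted adjacency matrix. It is not the authors' WKI or EWKI, whose definitions the abstract does not give. The function names, the toy graph, and the values of `beta` and `max_len` are assumptions chosen for illustration.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def katz_index(adj, beta, max_len):
    """Truncated Katz index: S = sum over l = 1..max_len of beta**l * A**l."""
    n = len(adj)
    power = [row[:] for row in adj]          # holds A**l, starting at l = 1
    scores = [[0.0] * n for _ in range(n)]
    weight = beta                            # holds beta**l
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * power[i][j]
        power = mat_mul(power, adj)
        weight *= beta
    return scores


# Toy undirected path graph 0 - 1 - 2 (hypothetical data).
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
scores = katz_index(adjacency, beta=0.1, max_len=3)
print(scores[0][2])  # score for the unlinked pair (0, 2)
```

Nodes 0 and 2 share no edge, yet they receive a positive score through the length-2 walk via node 1. This is the property that makes the index usable for link prediction. For the untruncated sum to converge, beta must be smaller than the reciprocal of the largest eigenvalue of the adjacency matrix.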

arXiv:2404.04910

MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection

Authors: Hou-I Liu, Christine Wu, Jen-Hao Cheng, Wenhao Chai, Shian-Yun Wang, Gaowen Liu, Jenq-Neng Hwang, Hong-Han Shuai, Wen-Huang Cheng

Abstract: Monocular 3D object detection (Mono3D) is an indispensable research topic in autonomous driving, thanks to the cost-effective monocular camera sensors and its wide range of applications. Since the image perspective has depth ambiguity, the challenges of Mono3D lie in understanding 3D scene geometry and reconstructing 3D object information from a single image. Previous methods attempted to transfer 3D information directly from the LiDAR-based teacher to the camera-based student. However, a considerable gap in feature representation makes direct cross-modal distillation inefficient, resulting in a significant performance deterioration between the LiDAR-based teacher and the camera-based student. To address this issue, we propose the Teaching Assistant Knowledge Distillation (MonoTAKD) to break down the learning objective by integrating intra-modal distillation with cross-modal residual distillation. In particular, we employ a strong camera-based teaching assistant model to distill powerful visual knowledge effectively through intra-modal distillation. Subsequently, we introduce the cross-modal residual distillation to transfer the 3D spatial cues. By acquiring both visual knowledge and 3D spatial cues, the predictions of our approach are rigorously evaluated on the KITTI 3D object detection benchmark and achieve state-of-the-art performance in Mono3D.

Submitted 7 April, 2024; originally announced April 2024.

Comments: 14 pages

arXiv:2404.04619

Do We Really Need a Complex Agent System? Distill Embodied Agent into a Single Model

Authors: Zhonghan Zhao, Ke Ma, Wenhao Chai, Xuan Wang, Kewei Chen, Dongxu Guo, Yanting Zhang, Hongwei Wang, Gaoang Wang

Abstract: With the power of large language models (LLMs), open-ended embodied agents can flexibly understand human instructions, generate interpretable guidance strategies, and output executable actions. Nowadays, Multi-modal Language Models (MLMs) integrate multi-modal signals into LLMs, further bringing richer perception to entity agents and allowing embodied agents to perceive world-understanding tasks more delicately. However, existing works: 1) operate independently by agents, each containing multiple LLMs, from perception to action, resulting in gaps between complex tasks and execution; 2) train MLMs on static data, struggling with dynamics in open-ended scenarios; 3) input prior knowledge directly as prompts, suppressing application flexibility. We propose STEVE-2, a hierarchical knowledge distillation framework for open-ended embodied tasks, characterized by 1) a hierarchical system for multi-granular task division, 2) a mirrored distillation method for parallel simulation data, and 3) an extra expert model for bringing additional knowledge into parallel simulation. After distillation, embodied agents can complete complex, open-ended tasks without additional expert guidance, utilizing the performance and knowledge of a versatile MLM. Extensive evaluations on navigation and creation tasks highlight the superior performance of STEVE-2 in open-ended tasks, with $1.4 \times$ - $7.3 \times$ in performance.

Submitted 6 April, 2024; originally announced April 2024.

Comments: arXiv admin note: text overlap with arXiv:2403.08282

arXiv:2403.18493

VersaT2I: Improving Text-to-Image Models with Versatile Reward

Authors: Jianshu Guo, Wenhao Chai, Jie Deng, Hsiang-Wei Huang, Tian Ye, Yichen Xu, Jiawei Zhang, Jenq-Neng Hwang, Gaoang Wang

Abstract: Recent text-to-image (T2I) models have benefited from large-scale and high-quality data, demonstrating impressive performance. However, these T2I models still struggle to produce images that are aesthetically pleasing, geometrically accurate, faithful to text, and of good low-level quality. We present VersaT2I, a versatile training framework that can boost the performance with multiple rewards of any T2I model. We decompose the quality of the image into several aspects such as aesthetics, text-image alignment, geometry, low-level quality, etc. Then, for every quality aspect, we select high-quality images in this aspect generated by the model as the training set to finetune the T2I model using the Low-Rank Adaptation (LoRA). Furthermore, we introduce a gating function to combine multiple quality aspects, which can avoid conflicts between different quality aspects. Our method is easy to extend and does not require any manual annotation, reinforcement learning, or model architecture changes. Extensive experiments demonstrate that VersaT2I outperforms the baseline methods across various quality criteria.

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.10826

Exploring Learning-based Motion Models in Multi-Object Tracking

Authors: Hsiang-Wei Huang, Cheng-Yen Yang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang

Abstract: In the field of multi-object tracking (MOT), traditional methods often rely on the Kalman Filter for motion prediction, leveraging its strengths in linear motion scenarios. However, the inherent limitations of these methods become evident when confronted with complex, nonlinear motions and occlusions prevalent in dynamic environments like sports and dance. This paper explores the possibilities of replacing the Kalman Filter with various learning-based motion models that effectively enhance tracking accuracy and adaptability beyond the constraints of Kalman Filter-based systems. In this paper, we propose MambaTrack, an online motion-based tracker that outperforms all existing motion-based trackers on the challenging DanceTrack and SportsMOT datasets. Moreover, we further exploit the potential of the state-space model in trajectory feature extraction to boost the tracking performance and propose MambaTrack+, which achieves state-of-the-art performance on the DanceTrack dataset with 56.1 HOTA and 54.9 IDF1.

Submitted 16 March, 2024; originally announced March 2024.
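
Illustrative note on the entry above: the abstract contrasts learning-based motion models with the Kalman Filter that traditional trackers use for linear motion. The sketch below shows one predict-and-update cycle of a 1D constant-velocity Kalman filter, the kind of baseline being replaced. It is not MambaTrack and is not taken from the paper. The function name, the noise variances `q` and `r`, and the simplification of process noise to `q` times the identity are assumptions.

```python
def kalman_step(state, cov, z, dt=1.0, q=0.0, r=1.0):
    """One predict+update cycle of a 1D constant-velocity Kalman filter.

    state: (position, velocity); cov: 2x2 covariance as nested lists;
    z: measured position; q, r: assumed process and measurement noise variances.
    """
    p, v = state
    (p00, p01), (p10, p11) = cov

    # Predict: x' = F x and P' = F P F^T + Q, with F = [[1, dt], [0, 1]], Q = q * I.
    p_pred = p + dt * v
    v_pred = v
    a00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q
    a01 = p01 + dt * p11
    a10 = p10 + dt * p11
    a11 = p11 + q

    # Update with H = [1, 0]: only the position is observed.
    innovation = z - p_pred
    s = a00 + r                      # innovation variance
    k0, k1 = a00 / s, a10 / s        # Kalman gain
    new_state = (p_pred + k0 * innovation, v_pred + k1 * innovation)
    new_cov = [
        [(1 - k0) * a00, (1 - k0) * a01],
        [a10 - k1 * a00, a11 - k1 * a01],
    ]
    return new_state, new_cov


# Track a target whose measured position advances by one unit per frame.
state, cov = (0.0, 1.0), [[1.0, 0.0], [0.0, 1.0]]
for measurement in (1.0, 2.0, 3.0):
    state, cov = kalman_step(state, cov, measurement)
print(state)
```

The prediction step assumes the target keeps its current velocity. That assumption is what fails for the abrupt, nonlinear motion in dance and sports footage that the abstract describes.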

arXiv:2403.08282

Hierarchical Auto-Organizing System for Open-Ended Multi-Agent Navigation

Authors: Zhonghan Zhao, Kewei Chen, Dongxu Guo, Wenhao Chai, Tian Ye, Yanting Zhang, Gaoang Wang

Abstract: Due to the dynamic and unpredictable open-world setting, navigating complex environments in Minecraft poses significant challenges for multi-agent systems. Agents must interact with the environment and coordinate their actions with other agents to achieve common objectives. However, traditional approaches often struggle to efficiently manage inter-agent communication and task distribution, crucial for effective multi-agent navigation. Furthermore, processing and integrating multi-modal information (such as visual, textual, and auditory data) is essential for agents to comprehend their goals and navigate the environment successfully and fully. To address this issue, we design the HAS framework to auto-organize groups of LLM-based agents to complete navigation tasks. In our approach, we devise a hierarchical auto-organizing navigation system, which is characterized by 1) a hierarchical system for multi-agent organization, ensuring centralized planning and decentralized execution; 2) an auto-organizing and intra-communication mechanism, enabling dynamic group adjustment under subtasks; 3) a multi-modal information platform, facilitating multi-modal perception to perform the three navigation tasks with one system. To assess organizational behavior, we design a series of navigation tasks in the Minecraft environment, which includes searching and exploring. We aim to develop embodied organizations that push the boundaries of embodied AI, moving it towards a more human-like organizational structure.

Submitted 18 March, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

Comments: ICLR 2024 Workshop on LLM Agents

arXiv:2403.05227

doi 10.1088/1674-1056/ad1c5e

Superconductivity in kagome metal ThRu3Si2

Authors: Yi Liu, Jing Li, Wu-Zhang Yang, Jia-Yi Lu, Bo-Ya Cao, Hua-Xun Li, Wan-Li Chai, Si-Qi Wu, Bai-Zhuo Li, Yun-Lei Sun, Wen-He Jiao, Wang Cao, Xiao-Feng Xu, Ren Zhi, Guang-Han Cao

Abstract: We report the physical properties of ThRu$_3$Si$_2$, which features a distorted Ru kagome lattice. The combined experiments of resistivity, magnetization and specific heat reveal bulk superconductivity with $T_{\rm{c}}$ = 3.8 K. The specific heat jump and calculated electron-phonon coupling indicate a moderately coupled BCS superconductor. In comparison with LaRu$_3$Si$_2$, the calculated electronic structure in ThRu$_3$Si$_2$ shows an electron-doping effect with electron filling lifted from 100 meV below flat bands to 300 meV above it. This explains the lower superconducting transition temperature and weaker electron correlations observed in ThRu$_3$Si$_2$. Our work suggests the $T_{\rm{c}}$ and electronic correlations in kagome superconductors could have an intimate connection with the flat bands.

Submitted 8 March, 2024; originally announced March 2024.

Comments: 7 pages, 5 figures

Journal ref: Chinese Physics B (2024)

arXiv:2402.09316

Only My Model On My Data: A Privacy Preserving Approach Protecting one Model and Deceiving Unauthorized Black-Box Models

Authors: Weiheng Chai, Brian Testa, Huantao Ren, Asif Salekin, Senem Velipasalar

Abstract: Deep neural networks are extensively applied to real-world tasks, such as face recognition and medical image classification, where privacy and data protection are critical. Image data, if not protected, can be exploited to infer personal or contextual information. Existing privacy preservation methods, like encryption, generate perturbed images that are unrecognizable to even humans. Adversarial attack approaches prohibit automated inference even for authorized stakeholders, limiting practical incentives for commercial and widespread adaptation. This pioneering study tackles an unexplored practical privacy preservation use case by generating human-perceivable images that maintain accurate inference by an authorized model while evading other unauthorized black-box models of similar or dissimilar objectives, and addresses the previous research gaps. The datasets employed are ImageNet, for image classification, Celeba-HQ dataset, for identity classification, and AffectNet, for emotion classification. Our results show that the generated images can successfully maintain the accuracy of a protected model and degrade the average accuracy of the unauthorized black-box models to 11.97%, 6.63%, and 55.51% on ImageNet, Celeba-HQ, and AffectNet datasets, respectively.

Submitted 14 February, 2024; originally announced February 2024.

arXiv:2312.08887

SpeedUpNet: A Plug-and-Play Hyper-Network for Accelerating Text-to-Image Diffusion Models

Authors: Weilong Chai, DanDan Zheng, Jiajiong Cao, Zhiquan Chen, Changbao Wang, Chenguang Ma

Abstract: Text-to-image diffusion models (SD) exhibit significant advancements while requiring extensive computational resources. Though many acceleration methods have been proposed, they suffer from generation quality degradation or extra training cost generalizing to new fine-tuned models. To address these limitations, we propose a novel and universal Stable-Diffusion (SD) acceleration module called SpeedUpNet (SUN). SUN can be directly plugged into various fine-tuned SD models without extra training. This technique utilizes cross-attention layers to learn the relative offsets in the generated image results between negative and positive prompts achieving classifier-free guidance distillation with negative prompts controllable, and introduces a Multi-Step Consistency (MSC) loss to ensure a harmonious balance between reducing inference steps and maintaining consistency in the generated output. Consequently, SUN significantly reduces the number of inference steps to just 4 steps and eliminates the need for classifier-free guidance. It leads to an overall speedup of more than 10 times for SD models compared to the state-of-the-art 25-step DPM-solver++, and offers two extra advantages: (1) classifier-free guidance distillation with controllable negative prompts and (2) seamless integration into various fine-tuned Stable-Diffusion models without training. The effectiveness of the SUN has been verified through extensive experimentation. Project Page: https://williechai.github.io/speedup-plugin-for-stable-diffusions.github.io

Submitted 20 December, 2023; v1 submitted 13 December, 2023; originally announced December 2023.

Comments: Table 1. shows the comparison with existing methods, but the lack of experimental data of the LCM method under 12-step makes the table incomplete. We need to temporarily withdraw the manuscript and conduct corresponding experiments before resubmitting it

arXiv:2312.04793

User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning

Authors: Xuan Wang, Guanhong Wang, Wenhao Chai, Jiayu Zhou, Gaoang Wang

Abstract: Image captioning bridges the gap between vision and language by automatically generating natural language descriptions for images. Traditional image captioning methods often overlook the preferences and characteristics of users. Personalized image captioning solves this problem by incorporating user prior knowledge into the model, such as writing styles and preferred vocabularies. Most existing methods emphasize the user context fusion process by memory networks or transformers. However, these methods ignore the distinct domains of each dataset. Therefore, they need to update the entire caption model parameters when meeting new samples, which is time-consuming and calculation-intensive. To address this challenge, we propose a novel personalized image captioning framework that leverages user context to consider personality factors. Additionally, our framework utilizes the prefix-tuning paradigm to extract knowledge from a frozen large language model, reducing the gap between different language domains. Specifically, we employ CLIP to extract the visual features of an image and align the semantic space using a query-guided mapping network. By incorporating the transformer layer, we merge the visual features with the user's contextual prior knowledge to generate informative prefixes. Moreover, we employ GPT-2 as the frozen large language model. With a small number of parameters to be trained, our model performs efficiently and effectively. Our model outperforms existing baseline models on Instagram and YFCC100M datasets across five evaluation metrics, demonstrating its superiority, including twofold improvements in metrics such as BLEU-4 and CIDEr.

Submitted 7 December, 2023; originally announced December 2023.

arXiv:2312.01508

CityGen: Infinite and Controllable 3D City Layout Generation

Authors: Jie Deng, Wenhao Chai, Jianshu Guo, Qixuan Huang, Wenhao Hu, Jenq-Neng Hwang, Gaoang Wang

Abstract: City layout generation has recently gained significant attention. The goal of this task is to automatically generate the layout of a city scene, including elements such as roads, buildings, vegetation, as well as other urban infrastructures. Previous methods using VAEs or GANs for 3D city layout generation offer limited diversity and constrained interactivity, only allowing users to selectively regenerate parts of the layout, which greatly limits customization. In this paper, we propose CityGen, a novel end-to-end framework for infinite, diverse and controllable 3D city layout generation. First, we propose an outpainting pipeline to extend the local layout to an infinite city layout. Then, we utilize a multi-scale diffusion model to generate diverse and controllable local semantic layout patches. The extensive experiments show that CityGen achieves state-of-the-art (SOTA) performance under FID and KID in generating an infinite and controllable 3D city layout. CityGen demonstrates promising applicability in fields like smart cities, urban planning, and digital simulation.

Submitted 3 December, 2023; originally announced December 2023.

Comments: 12 pages, 9 figures

arXiv:2311.16477

UniHPE: Towards Unified Human Pose Estimation via Contrastive Learning

Authors: Zhongyu Jiang, Wenhao Chai, Lei Li, Zhuoran Zhou, Cheng-Yen Yang, Jenq-Neng Hwang

Abstract: In recent times, there has been a growing interest in developing effective perception techniques for combining information from multiple modalities. This involves aligning features obtained from diverse sources to enable more efficient training with larger datasets and constraints, as well as leveraging the wealth of information contained in each modality. 2D and 3D Human Pose Estimation (HPE) are two critical perceptual tasks in computer vision, which have numerous downstream applications, such as Action Recognition, Human-Computer Interaction, Object tracking, etc. Yet, there are limited instances where the correlation between Image and 2D/3D human pose has been clearly researched using a contrastive paradigm. In this paper, we propose UniHPE, a unified Human Pose Estimation pipeline, which aligns features from all three modalities, i.e., 2D human pose estimation, lifting-based and image-based 3D human pose estimation, in the same pipeline. To align more than two modalities at the same time, we propose a novel singular value based contrastive learning loss, which better aligns different modalities and further boosts the performance. In our evaluation, UniHPE achieves remarkable performance metrics: MPJPE $50.5$mm on the Human3.6M dataset and PAMPJPE $51.6$mm on the 3DPW dataset. Our proposed method holds immense potential to advance the field of computer vision and contribute to various applications.

Submitted 24 November, 2023; originally announced November 2023.
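
Illustrative note on the entry above: the abstract reports MPJPE (mean per-joint position error), which is the Euclidean distance between each predicted joint and its ground-truth position, averaged over joints. The sketch below computes it for a single pose. The function name and the toy coordinates are assumptions, and the Procrustes alignment that precedes the PA-MPJPE variant is not shown.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean per-joint position error for one pose.

    Both arguments are equal-length sequences of (x, y, z) joint coordinates
    in the same unit (millimetres in the abstract above).
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("poses must be non-empty and have the same joint count")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)


# Hypothetical two-joint pose: one joint is exact, the other is off by 5 mm.
predicted_pose = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
true_pose = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mpjpe(predicted_pose, true_pose))  # 2.5
```

Benchmark figures such as those in the abstract are this quantity averaged over every pose in the evaluation set.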

arXiv:2311.15209

See and Think: Embodied Agent in Virtual Environment

Authors: Zhonghan Zhao, Wenhao Chai, Xuan Wang, Li Boyi, Shengyu Hao, Shidong Cao, Tian Ye, Jenq-Neng Hwang, Gaoang Wang

Abstract: Large language models (LLMs) have achieved impressive progress on several open-world tasks. Recently, using LLMs to build embodied agents has been a hotspot. In this paper, we propose STEVE, a comprehensive and visionary embodied agent in the Minecraft virtual environment. STEVE consists of three key components: vision perception, language instruction, and code action. Vision perception involves the interpretation of visual information in the environment, which is then integrated into the LLMs component with agent state and task instruction. Language instruction is responsible for iterative reasoning and decomposing complex tasks into manageable guidelines. Code action generates executable skill actions based on retrieval in skill database, enabling the agent to interact effectively within the Minecraft environment. We also collect STEVE-21K dataset, which includes 600$+$ vision-environment pairs, 20K knowledge question-answering pairs, and 200$+$ skill-code pairs. We conduct continuous block search, knowledge question and answering, and tech tree mastery to evaluate the performance. Extensive experiments show that STEVE achieves at most $1.5 \times$ faster unlocking key tech trees and $2.5 \times$ quicker in block search tasks compared to previous state-of-the-art methods.

Submitted 2 December, 2023; v1 submitted 26 November, 2023; originally announced November 2023.

Comments: Preprint. First three authors contribute equally to this work. Project Website https://rese1f.github.io/STEVE/

arXiv:2311.12043 [pdf, other]

Efficient Domain Adaptation via Generative Prior for 3D Infant Pose Estimation

Authors: Zhuoran Zhou, Zhongyu Jiang, Wenhao Chai, Cheng-Yen Yang, Lei Li, Jenq-Neng Hwang

Abstract: Although 3D human pose estimation has gained impressive development in recent years, only a few works focus on infants, who have different bone lengths and for whom only limited data are available. Directly applying adult pose estimation models typically achieves low performance in the infant domain and suffers from out-of-distribution issues. Moreover, the limitation of infant pose data collection also heavily constrains the efficiency of learning-based models to lift 2D poses to 3D. To deal with the issues of small datasets, domain adaptation and data augmentation are commonly used techniques. Following this paradigm, we take advantage of an optimization-based method that utilizes generative priors to predict 3D infant keypoints from 2D keypoints without the need for large training data. We further apply a guided diffusion model to domain adapt 3D adult pose to infant pose to supplement small datasets. Besides, we also prove that our method, ZeDO-i, could attain efficient domain adaptation, even if only a small amount of data is given. Quantitatively, we claim that our model attains state-of-the-art MPJPE performance of 43.6 mm on the SyRIP dataset and 21.2 mm on the MINI-RGBD dataset.

Submitted 17 November, 2023; originally announced November 2023.

Comments: WACVW 2024

arXiv:2309.13770 [pdf, other]

Devil in the Number: Towards Robust Multi-modality Data Filter

Authors: Yichen Xu, Zihan Xu, Wenhao Chai, Zhonghan Zhao, Enxin Song, Gaoang Wang

Abstract: In order to appropriately filter multi-modality data sets on a web-scale, it becomes crucial to employ suitable filtering methods to boost performance and reduce training costs. For instance, the LAION papers employ the CLIP score filter to select data with CLIP scores surpassing a certain threshold. On the other hand, T-MARS achieves high-quality data filtering by detecting and masking text within images and then filtering by CLIP score. Through analyzing the dataset, we observe a significant proportion of redundant information, such as numbers, present in the textual content. Our experiments on a subset of the data unveil the profound impact of these redundant elements on the CLIP scores. A logical approach would involve reevaluating the CLIP scores after eliminating these influences. Experimentally, our text-based CLIP filter outperforms the top-ranked method on the "small scale" of DataComp (a data filtering benchmark) on ImageNet distribution shifts, achieving a 3.6% performance improvement. The results also demonstrate that our proposed text-masked filter outperforms the original CLIP score filter when selecting the top 40% of the data. The impact of numbers on CLIP and their handling provide valuable insights for improving the effectiveness of CLIP training, including language rewrite techniques.
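
As a rough illustration of the idea in this abstract (re-scoring captions after removing numbers, then keeping the top 40%), here is a minimal sketch. It is not the authors' implementation: `strip_numbers`, `filter_top_fraction`, the `toy_score` function and the `TAGS` table are hypothetical names introduced here, and `toy_score` is only a word-overlap stand-in for a real CLIP image-text similarity.

```python
import re
from typing import Callable, Dict, List, Set, Tuple

# Digit runs, optionally with thousands/decimal separators (e.g. "2019", "1,299.00").
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def strip_numbers(caption: str) -> str:
    """Remove digit runs from a caption and collapse the leftover whitespace."""
    return " ".join(NUMBER.sub(" ", caption).split())


def filter_top_fraction(
    pairs: List[Tuple[str, str]],
    score_fn: Callable[[str, str], float],
    keep: float = 0.4,
) -> List[Tuple[str, str]]:
    """Score each (image_id, caption) pair on its number-free caption and
    keep the highest-scoring fraction of the pairs."""
    scored = [(score_fn(img, strip_numbers(cap)), img, cap) for img, cap in pairs]
    scored.sort(key=lambda item: item[0], reverse=True)
    n_keep = max(1, round(len(scored) * keep)) if scored else 0
    return [(img, cap) for _, img, cap in scored[:n_keep]]


# Hypothetical stand-in for an image-text similarity model: the fraction of
# caption words that match a hand-written tag set for the image.
TAGS: Dict[str, Set[str]] = {
    "img1": {"red", "car"},
    "img2": {"dog"},
    "img3": {"beach"},
    "img4": {"sunset", "beach"},
    "img5": {"shoes"},
}


def toy_score(image_id: str, caption: str) -> float:
    words = set(caption.lower().split())
    return len(words & TAGS[image_id]) / max(len(words), 1)


pairs = [
    ("img1", "red car 2019 model 4567"),
    ("img2", "dog"),
    ("img3", "IMG 00123 4455"),
    ("img4", "sunset beach 2021"),
    ("img5", "cheap shoes 50% off"),
]
kept = filter_top_fraction(pairs, toy_score, keep=0.4)
print([img for img, _ in kept])
```

With this toy scorer, the caption for `img1` scores 2/5 with its numbers left in and 2/3 once they are stripped, which mirrors the kind of score shift the abstract attributes to numeric tokens. A real pipeline would replace `toy_score` with an actual CLIP similarity computation.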

Submitted 24 September, 2023; originally announced September 2023.

Comments: ICCV 2023 Workshop: TNGCV-DataComp

arXiv:2309.13514 [pdf, ps, other]

Superconductivity emerging from density-wave-like order in a correlated kagome metal

Authors: Yi Liu, Zi-Yi Liu, Jin-Ke Bao, Peng-Tao Yang, Liang-Wen Ji, Si-Qi Wu, Qin-Xin Shen, Jun Luo, Jie Yang, Ji-Yong Liu, Chen-Chao Xu, Wu-Zhang Yang, Wan-Li Chai, Jia-Yi Lu, Chang-Chao Liu, Bo-Sen Wang, Hao Jiang, Qian Tao, Zhi Ren, Xiao-Feng Xu, Chao Cao, Zhu-An Xu, Rui Zhou, Jin-Guang Cheng, Guang-Han Cao

Abstract: Unconventional superconductivity (USC) in a highly correlated kagome system has been theoretically proposed for years, yet the experimental realization is hard to achieve. The recently discovered vanadium-based kagome materials, which exhibit both superconductivity and charge density wave (CDW) orders, are nonmagnetic and weakly correlated, and are thus unlikely to host USC as theories proposed. Here we report the discovery of a chromium-based kagome metal, CsCr$_3$Sb$_5$, which is contrastingly characterised by strong electron correlations, frustrated magnetism, and characteristic flat bands close to the Fermi level. Under ambient pressure, it undergoes a concurrent structural and magnetic phase transition at 55 K, accompanied by a stripe-like $4a_0$ structural modulation. At high pressure, the phase transition evolves into two transitions, probably associated with CDW and antiferromagnetic spin-density-wave orderings, respectively. These density-wave (DW)-like orders are gradually suppressed with pressure and, remarkably, a superconducting dome emerges at 3.65-8.0 GPa. The maximum of the superconducting transition temperature, $T_\mathrm{c}^{\mathrm{max}}=$ 6.4 K, appears when the DW-like orders are completely suppressed at 4.2 GPa, and the normal state exhibits a non-Fermi-liquid behaviour, reminiscent of USC and quantum criticality in iron-based superconductors. Our work offers an unprecedented platform for investigating possible USC in a correlated kagome system.

Submitted 16 March, 2024; v1 submitted 23 September, 2023; originally announced September 2023.

Comments: 32 pages, 14 figures

arXiv:2309.03599 [pdf, other]

Chasing Consistency in Text-to-3D Generation from a Single Image

Authors: Yichen Ouyang, Wenhao Chai, Jiayi Ye, Dapeng Tao, Yibing Zhan, Gaoang Wang

Abstract: Text-to-3D generation from a single-view image is a popular but challenging task in 3D vision. Although numerous methods have been proposed, existing works still suffer from inconsistency issues, including 1) semantic inconsistency, 2) geometric inconsistency, and 3) saturation inconsistency, resulting in distorted, overfitted, and over-saturated generations. In light of the above issues, we present Consist3D, a three-stage framework Chasing for semantic-, geometric-, and saturation-Consistent Text-to-3D generation from a single image, in which the first two stages aim to learn parameterized consistency tokens, and the last stage is for optimization. Specifically, the semantic encoding stage learns a token independent of views and estimations, promoting semantic consistency and robustness. Meanwhile, the geometric encoding stage learns another token with comprehensive geometry and reconstruction constraints under novel-view estimations, reducing overfitting and encouraging geometric consistency. Finally, the optimization stage benefits from the semantic and geometric tokens, allowing a low classifier-free guidance scale and therefore preventing oversaturation. Experimental results demonstrate that Consist3D produces more consistent, faithful, and photo-realistic 3D assets compared to previous state-of-the-art methods. Furthermore, Consist3D also allows background and object editing through text prompts.

Submitted 7 September, 2023; originally announced September 2023.

Comments: 9 pages, 11 figures

arXiv:2308.09953 [pdf, other]

UniAP: Towards Universal Animal Perception in Vision via Few-shot Learning

Authors: Meiqi Sun, Zhonghan Zhao, Wenhao Chai, Hanjun Luo, Shidong Cao, Yanting Zhang, Jenq-Neng Hwang, Gaoang Wang

Abstract: Animal visual perception is an important technique for automatically monitoring animal health, understanding animal behaviors, and assisting animal-related research. However, it is challenging to design a deep learning-based perception model that can freely adapt to different animals across various perception tasks, due to the varying poses of a large diversity of animals, the lack of data on rare species, and the semantic inconsistency of different tasks. We introduce UniAP, a novel Universal Animal Perception model that leverages few-shot learning to enable cross-species perception among various visual tasks. Our proposed model takes support images and labels as prompt guidance for a query image. Images and labels are processed through a Transformer-based encoder and a lightweight label encoder, respectively. Then a matching module is designed for aggregating information between prompt guidance and the query image, followed by a multi-head label decoder to generate outputs for various tasks. By capitalizing on the shared visual characteristics among different animals and tasks, UniAP enables the transfer of knowledge from well-studied species to those with limited labeled data or even unseen species. We demonstrate the effectiveness of UniAP through comprehensive experiments in pose estimation, segmentation, and classification tasks on diverse animal species, showcasing its ability to generalize and adapt to new classes with minimal labeled examples.

Submitted 19 August, 2023; originally announced August 2023.

arXiv:2308.09678 [pdf, other]

PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Robust 3D Human Pose Estimation

Authors: Hanbing Liu, Jun-Yan He, Zhi-Qi Cheng, Wangmeng Xiang, Qize Yang, Wenhao Chai, Gaoang Wang, Xu Bao, Bin Luo, Yifeng Geng, Xuansong Xie

Abstract: Existing 3D human pose estimators face challenges in adapting to new datasets due to the lack of 2D-3D pose pairs in training sets. To overcome this issue, we propose the Multi-Hypothesis Pose Synthesis Domain Adaptation (PoSynDA) framework to bridge this data disparity gap in the target domain. Typically, PoSynDA uses a diffusion-inspired structure to simulate 3D pose distribution in the target domain. By incorporating a multi-hypothesis network, PoSynDA generates diverse pose hypotheses and aligns them with the target domain. To do this, it first utilizes target-specific source augmentation to obtain the target domain distribution data from the source domain by decoupling the scale and position parameters. The process is then further refined through the teacher-student paradigm and low-rank adaptation. With extensive comparison of benchmarks such as Human3.6M and MPI-INF-3DHP, PoSynDA demonstrates competitive performance, even comparable to the target-trained MixSTE model (Zhang et al., 2022). This work paves the way for the practical application of 3D human pose estimation in unseen domains. The code is available at https://github.com/hbing-l/PoSynDA.

Submitted 16 October, 2023; v1 submitted 18 August, 2023; originally announced August 2023.

Comments: Accepted to ACM Multimedia 2023; 10 pages, 4 figures, 8 tables; the code is at https://github.com/hbing-l/PoSynDA

arXiv:2308.09592 [pdf, other]

StableVideo: Text-driven Consistency-aware Diffusion Video Editing

Authors: Wenhao Chai, Xun Guo, Gaoang Wang, Yan Lu

Abstract: Diffusion-based methods can generate realistic images and videos, but they struggle to edit existing objects in a video while preserving their appearance over time. This prevents diffusion models from being applied to natural video editing in practical scenarios. In this paper, we tackle this problem by introducing temporal dependency to existing text-driven diffusion models, which allows them to generate consistent appearance for the edited objects. Specifically, we develop a novel inter-frame propagation mechanism for diffusion video editing, which leverages the concept of layered representations to propagate the appearance information from one frame to the next. We then build up a text-driven video editing framework based on this mechanism, namely StableVideo, which can achieve consistency-aware video editing. Extensive experiments demonstrate the strong editing capability of our approach. Compared with state-of-the-art video editing methods, our approach shows superior qualitative and quantitative results. Our code is available at https://github.com/rese1f/StableVideo.

Submitted 18 August, 2023; originally announced August 2023.

Comments: ICCV 2023

arXiv:2308.01555 [pdf, other]

Mani-GPT: A Generative Model for Interactive Robotic Manipulation

Authors: Zhe Zhang, Wei Chai, Jiankun Wang

Abstract: In real-world scenarios, human dialogues are multi-round and diverse. Furthermore, human instructions can be unclear and human responses are unrestricted. Interactive robots face difficulties in understanding human intents and generating suitable strategies for assisting individuals through manipulation. In this article, we propose Mani-GPT, a Generative Pre-trained Transformer (GPT) for interactive robotic manipulation. The proposed model has the ability to understand the environment through object information, understand human intent through dialogues, generate natural language responses to human input, and generate appropriate manipulation plans to assist the human. This makes the human-robot interaction more natural and humanized. In our experiment, Mani-GPT outperforms existing algorithms with an accuracy of 84.6% in intent recognition and decision-making for actions. Furthermore, it demonstrates satisfying performance in real-world dialogue tests with users, achieving an average response accuracy of 70%.

Submitted 7 August, 2023; v1 submitted 3 August, 2023; originally announced August 2023.

arXiv:2308.01164 [pdf, other]

Virtual Reality Based Robot Teleoperation via Human-Scene Interaction

Authors: Lingxiao Meng, Jiangshan Liu, Wei Chai, Jiankun Wang, Max Q.-H. Meng

Abstract: Robot teleoperation has achieved great success in various situations, including chemical pollution rescue, disaster relief, and long-distance manipulation. In this article, we propose a virtual reality (VR) based robot teleoperation system to achieve more efficient and natural interaction with humans in different scenes. A user-friendly VR interface is designed to help users interact with a desktop scene using their hands efficiently and intuitively. To improve user experience and reduce workload, we simulate the process in the physics engine to help build a preview of the scene after manipulation in the virtual scene before execution. We conduct experiments with different users and compare our system with a direct control method across several teleoperation tasks. The user study demonstrates that the proposed system enables users to perform operations more instinctively with a lighter mental workload. Users, even beginners, can perform pick-and-place and object-stacking tasks in a considerably short time. Our code is available at https://github.com/lingxiaomeng/VR_Teleoperation_Gen3.

Submitted 2 August, 2023; originally announced August 2023.

arXiv:2307.16449 [pdf, other]

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Authors: Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang

Abstract: Recently, integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. Yet, existing systems can only handle videos with very few frames. For long videos, the computation complexity, memory cost, and long-term temporal connection impose additional challenges. Taking advantage of the Atkinson-Shiffrin memory model, with tokens in Transformers being employed as the carriers of memory in combination with our specially designed memory mechanism, we propose MovieChat to overcome these challenges. MovieChat achieves state-of-the-art performance in long video understanding, along with the released MovieChat-1K benchmark with 1K long videos and 14K manual annotations for validation of the effectiveness of our method.

Submitted 9 March, 2024; v1 submitted 31 July, 2023; originally announced July 2023.

Comments: CVPR 2024. First three authors contribute equally to this work. Project Website https://rese1f.github.io/MovieChat/

arXiv:2307.07075 [pdf, ps, other]

Adaptive Coding and Modulation Aided Mobile Relaying for Millimeter-Wave Flying Ad-Hoc Networks

Authors: Jiankang Zhang, Sheng Chen, Wei Koong Chai, Lajos Hanzo

Abstract: The emerging drone swarms are capable of carrying out sophisticated tasks in support of demanding Internet-of-Things (IoT) applications by synergistically working together. However, the target area may be out of the coverage of the ground station and it may be impractical to deploy a large number of drones in the target area due to cost, electromagnetic interference and flight-safety regulations. By exploiting the innate agility and mobility of unmanned aerial vehicles (UAVs), we conceive a mobile relaying-assisted drone swarm network architecture, which is capable of extending the coverage of the ground station and enhancing the effective end-to-end throughput. Explicitly, a swarm of drones forms a data-collecting drone swarm (DCDS) designed for sensing and collecting data with the aid of their mounted cameras and/or sensors, and a powerful relay-UAV (RUAV) acts as a mobile relay for conveying data between the DCDS and a ground station (GS). Given a time period, in order to maximize the data delivered whilst minimizing the delay imposed, we harness an $ε$-multiple objective genetic algorithm ($ε$-MOGA) assisted Pareto-optimization scheme. Our simulation results demonstrate that the proposed mobile relaying is capable of delivering more data. As specific examples investigated in our simulations, our mobile relaying-assisted drone swarm network is capable of delivering $45.38\%$ more data than the benchmark solutions, when a stationary relay is available, and it is capable of delivering $26.86\%$ more data than the benchmark solutions when no stationary relay is available.
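
The trade-off named in this abstract (maximize data delivered, minimize delay) can be illustrated with a plain Pareto-dominance filter. This is only a sketch of the dominance test that any Pareto-optimization scheme builds on. It is not the $ε$-MOGA genetic algorithm itself, and the candidate values, their units and the names `dominates` and `pareto_front` are assumptions made for illustration.

```python
from typing import List, Tuple

# (data delivered, delay imposed): the first is maximized, the second minimized.
# Units are left abstract; the numbers below are made up for the example.
Candidate = Tuple[float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both objectives and strictly
    better on at least one of them."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


candidates = [(120.0, 8.0), (150.0, 12.0), (110.0, 9.0), (150.0, 15.0), (90.0, 5.0)]
print(pareto_front(candidates))  # [(120.0, 8.0), (150.0, 12.0), (90.0, 5.0)]
```

Here `(110.0, 9.0)` is dropped because `(120.0, 8.0)` delivers more data with less delay, and `(150.0, 15.0)` is dropped because `(150.0, 12.0)` delivers the same data sooner. The remaining three points are mutually non-dominated trade-offs.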

Submitted 13 July, 2023; originally announced July 2023.

arXiv:2307.03833 [pdf, other]

Back to Optimization: Diffusion-based Zero-Shot 3D Human Pose Estimation

Authors: Zhongyu Jiang, Zhuoran Zhou, Lei Li, Wenhao Chai, Cheng-Yen Yang, Jenq-Neng Hwang

Abstract: Learning-based methods have dominated the 3D human pose estimation (HPE) tasks with significantly better performance in most benchmarks than traditional optimization-based methods. Nonetheless, 3D HPE in the wild is still the biggest challenge for learning-based models, whether with 2D-3D lifting, image-to-3D, or diffusion-based methods, since the trained networks implicitly learn camera intrinsic parameters and domain-based 3D human pose distributions and estimate poses by statistical average. On the other hand, the optimization-based methods estimate results case-by-case, which can predict more diverse and sophisticated human poses in the wild. By combining the advantages of optimization-based and learning-based methods, we propose the Zero-shot Diffusion-based Optimization (ZeDO) pipeline for 3D HPE to solve the problem of cross-domain and in-the-wild 3D HPE. Our multi-hypothesis ZeDO achieves state-of-the-art (SOTA) performance on Human3.6M, with minMPJPE $51.4$mm, without training with any 2D-3D or image-3D pairs. Moreover, our single-hypothesis ZeDO achieves SOTA performance on the 3DPW dataset with PA-MPJPE $40.3$mm on cross-dataset evaluation, which even outperforms learning-based methods trained on 3DPW.
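
For readers unfamiliar with the metrics quoted in this abstract, the sketch below shows how MPJPE (mean per-joint position error) and a best-of-several-hypotheses variant are commonly computed for a single frame. It reflects the usual definition rather than this paper's evaluation code. Benchmark protocols typically also root-align the poses and average over all frames, which is omitted here, and the function names and toy poses are assumptions.

```python
import math
from typing import Sequence

# One (x, y, z) coordinate triple per joint, in millimetres.
Pose = Sequence[Sequence[float]]


def mpjpe(pred: Pose, gt: Pose) -> float:
    """Mean Euclidean distance between corresponding predicted and
    ground-truth joints."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("poses must be non-empty and have the same number of joints")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def min_mpjpe(hypotheses: Sequence[Pose], gt: Pose) -> float:
    """Error of the best candidate pose when several hypotheses are
    produced for the same frame."""
    return min(mpjpe(h, gt) for h in hypotheses)


gt = [(0.0, 0.0, 0.0), (0.0, 100.0, 0.0)]
hyp_a = [(3.0, 4.0, 0.0), (0.0, 100.0, 12.0)]  # joint errors: 5 mm and 12 mm
hyp_b = [(0.0, 0.0, 1.0), (0.0, 103.0, 0.0)]   # joint errors: 1 mm and 3 mm

print(mpjpe(hyp_a, gt))               # 8.5
print(min_mpjpe([hyp_a, hyp_b], gt))  # 2.0
```

PA-MPJPE additionally aligns the prediction to the ground truth with a similarity transform (Procrustes analysis) before measuring the error. That step is not shown in this sketch.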

Submitted 24 October, 2023; v1 submitted 7 July, 2023; originally announced July 2023.

Comments: WACV 2024

arXiv:2307.03353 [pdf, other]

A Survey of Deep Learning in Sports Applications: Perception, Comprehension, and Decision

Authors: Zhonghan Zhao, Wenhao Chai, Shengyu Hao, Wenhao Hu, Guanhong Wang, Shidong Cao, Mingli Song, Jenq-Neng Hwang, Gaoang Wang

Abstract: Deep learning has the potential to revolutionize sports performance, with applications ranging from perception and comprehension to decision. This paper presents a comprehensive survey of deep learning in sports performance, focusing on three main aspects: algorithms, datasets and virtual environments, and challenges. Firstly, we discuss the hierarchical structure of deep learning algorithms in sports performance which includes perception, comprehension and decision while comparing their strengths and weaknesses. Secondly, we list widely used existing datasets in sports and highlight their characteristics and limitations. Finally, we summarize current challenges and point out future trends of deep learning in sports. Our survey provides valuable reference material for researchers interested in deep learning in sports applications.

Submitted 6 July, 2023; originally announced July 2023.

arXiv:2306.17201 [pdf, other]

MPM: A Unified 2D-3D Human Pose Representation via Masked Pose Modeling

Authors: Zhenyu Zhang, Wenhao Chai, Zhongyu Jiang, Tian Ye, Mingli Song, Jenq-Neng Hwang, Gaoang Wang

Abstract: Estimating 3D human poses only from a 2D human pose sequence has been thoroughly explored in recent years. Yet, prior to this, no such work has attempted to unify 2D and 3D pose representations in the shared feature space. In this paper, we propose MPM, a unified 2D-3D human pose representation framework via masked pose modeling. We treat 2D and 3D poses as two different modalities like vision and language and build a single-stream transformer-based architecture. We apply three pretext tasks, which are masked 2D pose modeling, masked 3D pose modeling, and masked 2D pose lifting to pre-train our network and use full-supervision to perform further fine-tuning. A high masking ratio of 72.5% in total with a spatio-temporal mask sampling strategy leads to better relation modeling in both spatial and temporal domains. MPM can handle multiple tasks including 3D human pose estimation, 3D pose estimation from occluded 2D pose, and 3D pose completion in a single framework. We conduct extensive experiments and ablation studies on several widely used human pose datasets and achieve state-of-the-art performance on Human3.6M and MPI-INF-3DHP. Codes and model checkpoints are available at https://github.com/vvirgooo2/MPM

Submitted 29 June, 2023; originally announced June 2023.

Comments: Codes and model checkpoints are available at https://github.com/vvirgooo2/MPM

arXiv:2305.08824 [pdf, other]

Five A$^{+}$ Network: You Only Need 9K Parameters for Underwater Image Enhancement

Authors: Jingxia Jiang, Tian Ye, Jinbin Bai, Sixiang Chen, Wenhao Chai, Shi Jun, Yun Liu, Erkang Chen

Abstract: A lightweight underwater image enhancement network is of great significance for resource-constrained platforms, but balancing model size, computational efficiency, and enhancement performance has proven difficult for previous approaches. In this work, we propose the Five A$^{+}$ Network (FA$^{+}$Net), a highly efficient and lightweight real-time underwater image enhancement network with only $\sim$ 9k parameters and $\sim$ 0.01s processing time. The FA$^{+}$Net employs a two-stage enhancement structure. The strong prior stage aims to decompose challenging underwater degradations into sub-problems, while the fine-grained stage incorporates a multi-branch color enhancement module and a pixel attention module to amplify the network's perception of details. To the best of our knowledge, FA$^{+}$Net is the only network with the capability of real-time enhancement of 1080P images. Through extensive experiments and comprehensive visual comparison, we show that FA$^{+}$Net outperforms previous approaches by obtaining state-of-the-art performance on multiple datasets while significantly reducing both parameter count and computational complexity. The code is open source at https://github.com/Owen718/FiveAPlus-Network.

Submitted 15 May, 2023; originally announced May 2023.

arXiv:2303.16456 [pdf, other]

Global Adaptation meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation

Authors: Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang, Gaoang Wang

Abstract: When applying a pre-trained 2D-to-3D human pose lifting model to a target unseen dataset, large performance degradation is commonly encountered due to domain shift issues. We observe that the degradation is caused by two factors: 1) the large distribution gap over global positions of poses between the source and target datasets due to variant camera parameters and settings, and 2) the deficient diversity of local structures of poses in training. To this end, we combine global adaptation and local generalization in PoseDA, a simple yet effective framework of unsupervised domain adaptation for 3D human pose estimation. Specifically, global adaptation aims to align global positions of poses from the source domain to the target domain with a proposed global position alignment (GPA) module. And local generalization is designed to enhance the diversity of 2D-3D pose mapping with a local pose augmentation (LPA) module. These modules bring significant performance improvement without introducing additional learnable parameters. In addition, we propose local pose augmentation (LPA) to enhance the diversity of 3D poses following an adversarial training scheme consisting of 1) an augmentation generator that generates the parameters of pre-defined pose transformations and 2) an anchor discriminator to ensure the reality and quality of the augmented data. Our approach is applicable to almost all 2D-3D lifting models. PoseDA achieves 61.3 mm of MPJPE on MPI-INF-3DHP under a cross-dataset evaluation setup, improving upon the previous state-of-the-art method by 10.2%.

Submitted 17 August, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

Comments: ICCV 2023

arXiv:2303.15124 [pdf, other]

Blind Inpainting with Object-aware Discrimination for Artificial Marker Removal

Authors: Xuechen Guo, Wenhao Hu, Chiming Ni, Wenhao Chai, Shiyan Li, Gaoang Wang

Abstract: Medical images often contain artificial markers added by doctors, which can negatively affect the accuracy of AI-based diagnosis. To address this issue and recover the missing visual contents, inpainting techniques are highly needed. However, existing inpainting methods require manual mask input, limiting their application scenarios. In this paper, we introduce a novel blind inpainting method that automatically completes visual contents without specifying masks for target areas in an image. Our proposed model includes a mask-free reconstruction network and an object-aware discriminator. The reconstruction network consists of two branches that predict the corrupted regions with artificial markers and simultaneously recover the missing visual contents. The object-aware discriminator relies on the powerful recognition capabilities of the dense object detector to ensure that the markers of reconstructed images cannot be detected in any local regions. As a result, the reconstructed image can be close to the clean one as much as possible. Our proposed method is evaluated on different medical image datasets, covering multiple imaging modalities such as ultrasound (US), magnetic resonance imaging (MRI), and electron microscopy (EM), demonstrating that our method is effective and robust against various unknown missing region patterns.

Submitted 27 March, 2023; originally announced March 2023.

arXiv:2303.00313 [pdf, other]

Deep Learning Methods for Small Molecule Drug Discovery: A Survey

Authors: Wenhao Hu, Yingying Liu, Xuanyu Chen, Wenhao Chai, Hangyue Chen, Hongwei Wang, Gaoang Wang

Abstract: With the development of computer-assisted techniques, research communities including biochemistry and deep learning have been devoted to the drug discovery field for over a decade. Various applications of deep learning have drawn great attention in drug discovery, such as molecule generation, molecular property prediction, retrosynthesis prediction, and reaction prediction. However, most existing surveys focus on only one of these applications, limiting the view of researchers in the community. In this paper, we present a comprehensive review of the aforementioned four aspects, and discuss the relationships among different applications. The latest literature and classical benchmarks are presented for better understanding the development of a variety of approaches. We commence by summarizing the molecule representation format in these works, followed by an introduction of recently proposed approaches for each of the four tasks. Furthermore, we review a variety of commonly used datasets and evaluation metrics and compare the performance of deep learning-based models. Finally, we conclude by identifying remaining challenges and discussing the future trend for deep learning methods in drug discovery.

Submitted 5 March, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

arXiv:2302.06826 [pdf, other]

DiffFashion: Reference-based Fashion Design with Structure-aware Transfer by Diffusion Models

Authors: Shidong Cao, Wenhao Chai, Shengyu Hao, Yanting Zhang, Hangyue Chen, Gaoang Wang

Abstract: Image-based fashion design with AI techniques has attracted increasing attention in recent years. We focus on a new fashion design task, where we aim to transfer a reference appearance image onto a clothing image while preserving the structure of the clothing image. It is a challenging task since there are no reference images available for the newly designed output fashion images. Although diffusion-based image translation or neural style transfer (NST) has enabled flexible style transfer, it is often difficult to maintain the original structure of the image realistically during the reverse diffusion, especially when the referenced appearance image greatly differs from the common clothing appearance. To tackle this issue, we present a novel diffusion model-based unsupervised structure-aware transfer method to semantically generate new clothes from a given clothing image and a reference appearance image. Specifically, we decouple the foreground clothing with automatically generated semantic masks by conditioned labels. The mask is further used as guidance in the denoising process to preserve the structure information. Moreover, we use the pre-trained vision Transformer (ViT) for both appearance and structure guidance. Our experimental results show that the proposed method outperforms state-of-the-art baseline models, generating more realistic images in the fashion design task. Code and demo can be found at https://github.com/Rem105-210/DiffFashion.

Submitted 13 February, 2023; originally announced February 2023.

arXiv:2210.14426 [pdf]

Liquid Metal Printed Ultrathin Oxides for Monolayer WS2 Top-Gate Transistors

Authors: Yiyu Zhang, Dasari Venkatakrishnarao, Michel Bosman, Wei Fu, Sarthak Das, Fabio Bussolotti, Rainer Lee, Siew Lang Teo, Ding Huang, Ivan Verzhbitskiy, Zhuojun Jiang, Zhuoling Jiang, Jian Wei Chai, Shi Wun Tong, Zi-En Ooi, Calvin Pei Yu Wong, Yee Sin Ang, Kuan Eng Johnson Goh, Chit Siong Lau

Abstract: Two-dimensional (2D) semiconductors are promising channel materials for continued downscaling of complementary metal-oxide-semiconductor (CMOS) logic circuits. However, their full potential continues to be limited by a lack of scalable high-k dielectrics that can achieve atomically smooth interfaces, small equivalent oxide thicknesses (EOT), excellent gate control, and low leakage currents. Here, we report liquid metal printed ultrathin and scalable Ga2O3 dielectric for 2D electronics and electro-optical devices. We directly visualize the atomically smooth Ga2O3/WS2 interfaces enabled by the conformal nature of liquid metal printing. We demonstrate atomic layer deposition compatibility with high-k Ga2O3/HfO2 top-gate dielectric stacks on chemical vapour deposition grown monolayer WS2, achieving EOTs of ~1 nm and subthreshold swings down to 84.9 mV/dec. Gate leakage currents are well within requirements for ultra-scaled low-power logic circuits. Our results show that liquid metal printed oxides can bridge a crucial gap in scalable dielectric integration of 2D materials for next-generation nano-electronics.

Submitted 25 October, 2022; originally announced October 2022.

arXiv:2209.11477 [pdf, other]

Weakly Supervised Two-Stage Training Scheme for Deep Video Fight Detection Model

Authors: Zhenting Qi, Ruike Zhu, Zheyu Fu, Wenhao Chai, Volodymyr Kindratenko

Abstract: Fight detection in videos is an emerging deep learning application with today's prevalence of surveillance systems and streaming media. Previous work has largely relied on action recognition techniques to tackle this problem. In this paper, we propose a simple but effective method that solves the task from a new perspective: we design the fight detection model as a composition of an action-aware feature extractor and an anomaly score generator. Also, considering that collecting frame-level labels for videos is too laborious, we design a weakly supervised two-stage training scheme, where we utilize multiple-instance-learning loss calculated on video-level labels to train the score generator, and adopt the self-training technique to further improve its performance. Extensive experiments on a publicly available large-scale dataset, UBI-Fights, demonstrate the effectiveness of our method, and the performance on the dataset exceeds several previous state-of-the-art approaches. Furthermore, we collect a new dataset, VFD-2000, that specializes in video fight detection, with a larger scale and more scenarios than existing datasets. The implementation of our method and the proposed dataset will be publicly available at https://github.com/Hepta-Col/VideoFightDetection.

Submitted 23 September, 2022; originally announced September 2022.

Comments: Accepted by ICTAI 2022

arXiv:2207.03586 [pdf, other]

CausalAgents: A Robustness Benchmark for Motion Forecasting using Causal Relationships

Authors: Rebecca Roelofs, Liting Sun, Ben Caine, Khaled S. Refaat, Ben Sapp, Scott Ettinger, Wei Chai

Abstract: As machine learning models become increasingly prevalent in motion forecasting for autonomous vehicles (AVs), it is critical to ensure that model predictions are safe and reliable. However, exhaustively collecting and labeling the data necessary to fully test the long tail of rare and challenging scenarios is difficult and expensive. In this work, we construct a new benchmark for evaluating and improving model robustness by applying perturbations to existing data. Specifically, we conduct an extensive labeling effort to identify causal agents, or agents whose presence influences human drivers' behavior in any format, in the Waymo Open Motion Dataset (WOMD), and we use these labels to perturb the data by deleting non-causal agents from the scene. We evaluate a diverse set of state-of-the-art deep-learning model architectures on our proposed benchmark and find that all models exhibit large shifts under even non-causal perturbation: we observe a 25-38% relative change in minADE as compared to the original. We also investigate techniques to improve model robustness, including increasing the training dataset size and using targeted data augmentations that randomly drop non-causal agents throughout training. Finally, we release the causal agent labels (at https://github.com/google-research/causal-agents) as an additional attribute to WOMD and the robustness benchmarks to aid the community in building more reliable and safe deep-learning models for motion forecasting.

Submitted 6 October, 2022; v1 submitted 7 July, 2022; originally announced July 2022.

Comments: Rebecca Roelofs and Liting Sun contributed equally to the work

arXiv:2111.09515 [pdf, other]

Range-Aware Attention Network for LiDAR-based 3D Object Detection with Auxiliary Point Density Level Estimation

Authors: Yantao Lu, Xuetao Hao, Yilan Li, Weiheng Chai, Shiqi Sun, Senem Velipasalar

Abstract: 3D object detection from LiDAR data for autonomous driving has been making remarkable strides in recent years. Among the state-of-the-art methodologies, encoding point clouds into a bird's eye view (BEV) has been demonstrated to be both effective and efficient. Different from perspective views, BEV preserves rich spatial and distance information between objects. Yet, while farther objects of the same type do not appear smaller in the BEV, they contain sparser point cloud features. This fact weakens BEV feature extraction using shared-weight convolutional neural networks (CNNs). In order to address this challenge, we propose Range-Aware Attention Network (RAANet), which extracts effective BEV features and generates superior 3D object detection outputs. The range-aware attention (RAA) convolutions significantly improve feature extraction for near as well as far objects. Moreover, we propose a novel auxiliary loss for point density estimation to further enhance the detection accuracy of RAANet for occluded objects. It is worth noting that our proposed RAA convolution is lightweight and can be integrated into any CNN architecture used for detection from a BEV. Extensive experiments on the nuScenes and KITTI datasets demonstrate that our proposed approach outperforms the state-of-the-art methods for LiDAR-based 3D object detection, with real-time inference speed of 16 Hz for the full version and 22 Hz for the lite version tested on nuScenes lidar frames. The code is publicly available at our GitHub repository https://github.com/erbloo/RAAN.

Submitted 8 August, 2022; v1 submitted 17 November, 2021; originally announced November 2021.

arXiv:1911.11616 [pdf, other]

Enhancing Cross-task Black-Box Transferability of Adversarial Examples with Dispersion Reduction

Authors: Yantao Lu, Yunhan Jia, Jianyu Wang, Bai Li, Weiheng Chai, Lawrence Carin, Senem Velipasalar

Abstract: Neural networks are known to be vulnerable to carefully crafted adversarial examples, and these malicious samples often transfer, i.e., they remain adversarial even against other models. Although great effort has been devoted to transferability across models, surprisingly little attention has been paid to cross-task transferability, which represents the real-world cybercriminal's situation, where an ensemble of different defense/detection mechanisms needs to be evaded all at once. In this paper, we investigate the transferability of adversarial examples across a wide range of real-world computer vision tasks, including image classification, object detection, semantic segmentation, explicit content detection, and text detection. Our proposed attack minimizes the "dispersion" of the internal feature map, which overcomes existing attacks' limitation of requiring task-specific loss functions and/or probing a target model. We conduct evaluation on open source detection and segmentation models as well as four different computer vision tasks provided by Google Cloud Vision (GCV) APIs, to show how our approach outperforms existing attacks by degrading performance of multiple CV tasks by a large margin with only modest perturbations ($\ell_\infty = 16$).

Submitted 22 November, 2019; originally announced November 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1905.03333

arXiv:1805.11761 [pdf, other]

Collaborative Learning for Deep Neural Networks

Authors: Guocong Song, Wei Chai

Abstract: We introduce collaborative learning in which multiple classifier heads of the same network are simultaneously trained on the same training data to improve generalization and robustness to label noise with no extra inference cost. It acquires the strengths from auxiliary training, multi-task learning and knowledge distillation. There are two important mechanisms involved in collaborative learning. First, the consensus of multiple views from different classifier heads on the same example provides supplementary information as well as regularization to each classifier, thereby improving generalization. Second, intermediate-level representation (ILR) sharing with backpropagation rescaling aggregates the gradient flows from all heads, which not only reduces training computational complexity, but also facilitates supervision to the shared layers. The empirical results on CIFAR and ImageNet datasets demonstrate that deep neural networks learned as a group in a collaborative way significantly reduce the generalization error and increase the robustness to label noise.

Submitted 6 November, 2018; v1 submitted 29 May, 2018; originally announced May 2018.

Comments: To appear in NIPS 2018

arXiv:1609.09165 [pdf, ps, other]

doi 10.1088/1674-1137/40/12/125101

Reevaluation of thermonuclear reaction rate of 50Fe(p,gamma)51Co

Authors: L. P. Zhang, J. J. He, W. D. Chai, S. Q. Hou, L. Y. Zhang

Abstract: The thermonuclear rate of the 50Fe(p,gamma)51Co reaction in the Type I X-ray bursts (XRBs) temperature range has been reevaluated based on a recent precise mass measurement at CSRe, Lanzhou, where the proton separation energy Sp=142+/-77 keV has been determined for the first time for the 51Co nucleus. Compared with the previous theoretical predictions, the experimental Sp value has much smaller uncertainty. Based on the nuclear shell model and mirror nuclear structure information, we have calculated two sets of thermonuclear rates for the 50Fe(p,gamma)51Co reaction by utilizing the experimental Sp value. It shows that the statistical-model calculations are not ideally applicable for this reaction primarily because of the low density of low-lying excited states in 51Co. In this work, we recommend that a new set of reaction rates based on the mirror structure of 51Cr should be incorporated in the future astrophysical network calculations.

Submitted 28 September, 2016; originally announced September 2016.

Comments: 7 pages, 2 figures and 5 tables

arXiv:1606.07792 [pdf, other]

Wide & Deep Learning for Recommender Systems

Authors: Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, Hemal Shah

Abstract: Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations is effective and interpretable, while generalization requires more feature engineering effort. With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features. However, deep neural networks with embeddings can over-generalize and recommend less relevant items when the user-item interactions are sparse and high-rank. In this paper, we present Wide & Deep learning---jointly trained wide linear models and deep neural networks---to combine the benefits of memorization and generalization for recommender systems. We productionized and evaluated the system on Google Play, a commercial mobile app store with over one billion active users and over one million apps. Online experiment results show that Wide & Deep significantly increased app acquisitions compared with wide-only and deep-only models. We have also open-sourced our implementation in TensorFlow.

Submitted 24 June, 2016; originally announced June 2016.

arXiv:1601.00395 [pdf, ps, other]

doi 10.1088/1674-1137/40/8/087004

Injection method of barrier bucket supported by off-aligned electron cooling for CRing of HIAF

Authors: Guo-Dong Shen, Jian-Cheng Yang, Jia-Wen Xia, Li-Jun Mao, Da-Yu Yin, Wei-Ping Chai, Jian Shi, Li-Na Sheng, A. Smirnov, Bo Wu, He Zhao

Abstract: A new accelerator complex, HIAF (the High Intensity Heavy Ion Accelerator Facility), has been approved in China. It is designed to provide intense primary and radioactive ion beams for research in high energy density physics, nuclear physics, atomic physics as well as other applications. In order to achieve a high intensity of up to 5e11 ppp 238U34+, the Compression Ring (CRing) needs to stack more than 5 bunches transferred from the Booster Ring (BRing). However, the normal bucket to bucket injection scheme can only achieve an intensity gain of 2, so an injection method, fixed barrier bucket (BB) supported by electron cooling, is proposed. To suppress the severe space charge effect during the stacking process, off-alignment is adopted in the cooler to control the transverse emittance. In this paper, simulation and optimization with the BETACOOL program are presented.

Submitted 28 March, 2016; v1 submitted 4 January, 2016; originally announced January 2016.

arXiv:1507.03224 [pdf]

Concept for a Future Super Proton-Proton Collider

Authors: Jingyu Tang, J. Scott Berg, Weiping Chai, Fusan Chen, Nian Chen, Weiren Chou, Haiyi Dong, Jie Gao, Tao Han, Yongbin Leng, Guangrui Li, Ramesh Gupta, Peng Li, Zhihui Li, Baiqi Liu, Yudong Liu, Xinchou Lou, Qing Luo, Ernie Malamud, Lijun Mao, Robert B. Palmer, Quanling Peng, Yuemei Peng, Manqi Ruan, GianLuca Sabbi, et al. (26 additional authors not shown)

Abstract: Following the discovery of the Higgs boson at LHC, new large colliders are being studied by the international high-energy community to explore Higgs physics in detail and new physics beyond the Standard Model. In China, a two-stage circular collider project CEPC-SPPC is proposed, with the first stage CEPC (Circular Electron Positron Collider, a so-called Higgs factory) focused on Higgs physics, and the second stage SPPC (Super Proton-Proton Collider) focused on new physics beyond the Standard Model. This paper discusses this second stage.

Submitted 19 July, 2015; v1 submitted 12 July, 2015; originally announced July 2015.

Comments: 34 pages, 8 figures, 5 tables

arXiv:1305.4997 [pdf]

doi 10.1088/1674-1137/38/4/047002

The SHER-HIAF Ring Lattice Design

Authors: X. Gao, J. C. Yang, J. W. Xia, W. P. Chai, J. Shi, G. D. Shen

Abstract: Super Heavy Experimental Ring (SHER) is one of the rings of the next accelerator complex High Intensity Heavy Ion Accelerator Facility (HIAF) at IMP[4]. Here, present ideas of the lattice design for the operation of the large acceptance ring are presented. The SHER ring has to be optimized for e-cooling and the lattice is designed for different modes. First of all, it is designed in the so-called isochronous mode as time-of-flight mass spectrometer for short-lived secondary nuclei. Secondly, SHER can also be used as a storage ring for collecting and cooling the secondary rare isotope beams from the transport line. In order to fulfil its purpose, the ion optics can be set to different ion optical modes.

Submitted 21 May, 2013; originally announced May 2013.

arXiv:1212.0365 [pdf]

Design and Implementation of Flight Visual Simulation System

Authors: Feng Tian, Wenjian Chai, Chuanyun Wang, ** Sun

Abstract: The design requirement for flight visual simulation system is studied and the overall structure and development process are proposed in this paper. Through the construction of 3D scene model library and aircraft model, the rendering and interaction of visual scene are implemented. The changes of aircraft flight attitude in visual system are controlled by real-time calculation of aircraft aerodynamic and dynamic equations and flight simulation effect is enhanced by this kind of control. Several key techniques for optimizing 3D model and relative methods for large terrain modeling are explored for improving loading ability and rendering speed of the system. Experiment shows that, with specific function and performance guaranteed as a premise, the system achieves expected results, that is, precise real-time calculation of flight attitude and smooth realistic screen effect.

Submitted 3 December, 2012; originally announced December 2012.

Showing 1–47 of 47 results for author: Chai, W