Search | arXiv e-print repository

PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry

Authors: Linqing Chen, Weilei Wang, Zilong Bai, Peng Xu, Yan Fang, Jie Fang, Wentao Wu, Lizhi Zhou, Ruiji Zhang, Yubin Xia, Chaobo Xu, Ran Hu, Licong Xu, Qijun Cai, Haoran Hua, **g Sun, ** Liu, Tian Qiu, Haowen Liu, Meng Hu, Xiuwen Li, Fei Gao, Yufu Wang, Lin Tie, Chaochao Wang , et al. (11 additional authors not shown)

Abstract: Large language models (LLMs) have revolutionized Natural Language Processing (NLP) by by minimizing the need for complex feature engineering. However, the application of LLMs in specialized domains like biopharmaceuticals and chemistry remains largely unexplored. These fields are characterized by intricate terminologies, specialized knowledge, and a high demand for precision areas where general pu… ▽ More Large language models (LLMs) have revolutionized Natural Language Processing (NLP) by by minimizing the need for complex feature engineering. However, the application of LLMs in specialized domains like biopharmaceuticals and chemistry remains largely unexplored. These fields are characterized by intricate terminologies, specialized knowledge, and a high demand for precision areas where general purpose LLMs often fall short. In this study, we introduce PharmGPT, a suite of multilingual LLMs with 13 billion and 70 billion parameters, specifically trained on a comprehensive corpus of hundreds of billions of tokens tailored to the Bio-Pharmaceutical and Chemical sectors. Our evaluation shows that PharmGPT matches or surpasses existing general models on key benchmarks, such as NAPLEX, demonstrating its exceptional capability in domain-specific tasks. This advancement establishes a new benchmark for LLMs in the Bio-Pharmaceutical and Chemical fields, addressing the existing gap in specialized language modeling. Furthermore, this suggests a promising path for enhanced research and development in these specialized areas, paving the way for more precise and effective applications of NLP in specialized domains. △ Less

Submitted 3 July, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.18035 [pdf, other]

Local Linear Recovery Guarantee of Deep Neural Networks at Overparameterization

Authors: Yaoyu Zhang, Leyang Zhang, Zhongwang Zhang, Zhiwei Bai

Abstract: Determining whether deep neural network (DNN) models can reliably recover target functions at overparameterization is a critical yet complex issue in the theory of deep learning. To advance understanding in this area, we introduce a concept we term "local linear recovery" (LLR), a weaker form of target function recovery that renders the problem more amenable to theoretical analysis. In the sense o… ▽ More Determining whether deep neural network (DNN) models can reliably recover target functions at overparameterization is a critical yet complex issue in the theory of deep learning. To advance understanding in this area, we introduce a concept we term "local linear recovery" (LLR), a weaker form of target function recovery that renders the problem more amenable to theoretical analysis. In the sense of LLR, we prove that functions expressible by narrower DNNs are guaranteed to be recoverable from fewer samples than model parameters. Specifically, we establish upper limits on the optimistic sample sizes, defined as the smallest sample size necessary to guarantee LLR, for functions in the space of a given DNN. Furthermore, we prove that these upper bounds are achieved in the case of two-layer tanh neural networks. Our research lays a solid groundwork for future investigations into the recovery capabilities of DNNs in overparameterized scenarios. △ Less

Submitted 25 June, 2024; originally announced June 2024.

Comments: arXiv admin note: text overlap with arXiv:2211.11623

arXiv:2406.09196 [pdf, other]

Adaptive Slot Attention: Object Discovery with Dynamic Slot Number

Authors: Ke Fan, Zechen Bai, Tianjun Xiao, Tong He, Max Horn, Yanwei Fu, Francesco Locatello, Zheng Zhang

Abstract: Object-centric learning (OCL) extracts the representation of objects with slots, offering an exceptional blend of flexibility and interpretability for abstracting low-level perceptual features. A widely adopted method within OCL is slot attention, which utilizes attention mechanisms to iteratively refine slot representations. However, a major drawback of most object-centric models, including slot… ▽ More Object-centric learning (OCL) extracts the representation of objects with slots, offering an exceptional blend of flexibility and interpretability for abstracting low-level perceptual features. A widely adopted method within OCL is slot attention, which utilizes attention mechanisms to iteratively refine slot representations. However, a major drawback of most object-centric models, including slot attention, is their reliance on predefining the number of slots. This not only necessitates prior knowledge of the dataset but also overlooks the inherent variability in the number of objects present in each instance. To overcome this fundamental limitation, we present a novel complexity-aware object auto-encoder framework. Within this framework, we introduce an adaptive slot attention (AdaSlot) mechanism that dynamically determines the optimal number of slots based on the content of the data. This is achieved by proposing a discrete slot sampling module that is responsible for selecting an appropriate number of slots from a candidate list. Furthermore, we introduce a masked slot decoder that suppresses unselected slots during the decoding process. Our framework, tested extensively on object discovery tasks with various datasets, shows performance matching or exceeding top fixed-slot models. Moreover, our analysis substantiates that our method exhibits the capability to dynamically adapt the slot number according to each instance's complexity, offering the potential for further exploration in slot attention research. Project will be available at https://kfan21.github.io/AdaSlot/ △ Less

Submitted 13 June, 2024; originally announced June 2024.

Comments: CVPR 2024

arXiv:2406.06543 [pdf, other]

SparrowSNN: A Hardware/software Co-design for Energy Efficient ECG Classification

Authors: Zhanglu Yan, Zhenyu Bai, Tulika Mitra, Weng-Fai Wong

Abstract: Heart disease is one of the leading causes of death worldwide. Given its high risk and often asymptomatic nature, real-time continuous monitoring is essential. Unlike traditional artificial neural networks (ANNs), spiking neural networks (SNNs) are well-known for their energy efficiency, making them ideal for wearable devices and energy-constrained edge computing platforms. However, current energy… ▽ More Heart disease is one of the leading causes of death worldwide. Given its high risk and often asymptomatic nature, real-time continuous monitoring is essential. Unlike traditional artificial neural networks (ANNs), spiking neural networks (SNNs) are well-known for their energy efficiency, making them ideal for wearable devices and energy-constrained edge computing platforms. However, current energy measurement of SNN implementations for detecting heart diseases typically rely on empirical values, often overlooking hardware overhead. Additionally, the integer and fire activations in SNNs require multiple memory accesses and repeated computations, which can further compromise energy efficiency. In this paper, we propose sparrowSNN, a redesign of the standard SNN workflow from a hardware perspective, and present a dedicated ASIC design for SNNs, optimized for ultra-low power wearable devices used in heartbeat classification. Using the MIT-BIH dataset, our SNN achieves a state-of-the-art accuracy of 98.29% for SNNs, with energy consumption of 31.39nJ per inference and power usage of 6.1uW, making sparrowSNN the highest accuracy with the lowest energy use among comparable systems. We also compare the energy-to-accuracy trade-offs between SNNs and quantized ANNs, offering recommendations on insights on how best to use SNNs. △ Less

Submitted 6 May, 2024; originally announced June 2024.

arXiv:2406.01375 [pdf, other]

D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models

Authors: Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang, Xu Tan, Jie Fu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng

Abstract: Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely used to expand the model's fundamental understanding of specific downstream domains (e.g., math and code). For the CPT on domain-specific LLMs, one important question is how to choose the optimal mixture ratio between the general-corpus (e.g., Dolma, Slim-pajama) and the downstream domain-corpus. Existing methods usually… ▽ More Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely used to expand the model's fundamental understanding of specific downstream domains (e.g., math and code). For the CPT on domain-specific LLMs, one important question is how to choose the optimal mixture ratio between the general-corpus (e.g., Dolma, Slim-pajama) and the downstream domain-corpus. Existing methods usually adopt laborious human efforts by grid-searching on a set of mixture ratios, which require high GPU training consumption costs. Besides, we cannot guarantee the selected ratio is optimal for the specific domain. To address the limitations of existing methods, inspired by the Scaling Law for performance prediction, we propose to investigate the Scaling Law of the Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal mixture ratio with acceptable training costs for LLMs of different sizes. Specifically, by fitting the D-CPT Law, we can easily predict the general and downstream performance of arbitrary mixture ratios, model sizes, and dataset sizes using small-scale training costs on limited experiments. Moreover, we also extend our standard D-CPT Law on cross-domain settings and propose the Cross-Domain D-CPT Law to predict the D-CPT law of target domains, where very small training costs (about 1% of the normal training costs) are needed for the target domains. Comprehensive experimental results on six downstream domains demonstrate the effectiveness and generalizability of our proposed D-CPT Law and Cross-Domain D-CPT Law. △ Less

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2405.17025 [pdf, other]

SWAT: Scalable and Efficient Window Attention-based Transformers Acceleration on FPGAs

Authors: Zhenyu Bai, Pranav Dangi, Huize Li, Tulika Mitra

Abstract: Efficiently supporting long context length is crucial for Transformer models. The quadratic complexity of the self-attention computation plagues traditional Transformers. Sliding window-based static sparse attention mitigates the problem by limiting the attention scope of the input tokens, reducing the theoretical complexity from quadratic to linear. Although the sparsity induced by window attenti… ▽ More Efficiently supporting long context length is crucial for Transformer models. The quadratic complexity of the self-attention computation plagues traditional Transformers. Sliding window-based static sparse attention mitigates the problem by limiting the attention scope of the input tokens, reducing the theoretical complexity from quadratic to linear. Although the sparsity induced by window attention is highly structured, it does not align perfectly with the microarchitecture of the conventional accelerators, leading to suboptimal implementation. In response, we propose a dataflow-aware FPGA-based accelerator design, SWAT, that efficiently leverages the sparsity to achieve scalable performance for long input. The proposed microarchitecture is based on a design that maximizes data reuse by using a combination of row-wise dataflow, kernel fusion optimization, and an input-stationary design considering the distributed memory and computation resources of FPGA. Consequently, it achieves up to 22$\times$ and 5.7$\times$ improvement in latency and energy efficiency compared to the baseline FPGA-based accelerator and 15$\times$ energy efficiency compared to GPU-based solution. △ Less

Submitted 27 May, 2024; originally announced May 2024.

Comments: Accepeted paper for DAC'22

arXiv:2405.16127 [pdf, other]

Finetuning Large Language Model for Personalized Ranking

Authors: Zhuoxi Bai, Ning Wu, Fengyu Cai, Xinyi Zhu, Yun Xiong

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various domains, motivating researchers to investigate their potential use in recommendation systems. However, directly applying LLMs to recommendation tasks has proven challenging due to the significant disparity between the data used for pre-training LLMs and the specific requirements of recommendation tasks. In this st… ▽ More Large Language Models (LLMs) have demonstrated remarkable performance across various domains, motivating researchers to investigate their potential use in recommendation systems. However, directly applying LLMs to recommendation tasks has proven challenging due to the significant disparity between the data used for pre-training LLMs and the specific requirements of recommendation tasks. In this study, we introduce Direct Multi-Preference Optimization (DMPO), a streamlined framework designed to bridge the gap and enhance the alignment of LLMs for recommendation tasks. DMPO enhances the performance of LLM-based recommenders by simultaneously maximizing the probability of positive samples and minimizing the probability of multiple negative samples. We conducted experimental evaluations to compare DMPO against traditional recommendation methods and other LLM-based recommendation approaches. The results demonstrate that DMPO significantly improves the recommendation capabilities of LLMs across three real-world public datasets in few-shot scenarios. Additionally, the experiments indicate that DMPO exhibits superior generalization ability in cross-domain recommendations. A case study elucidates the reasons behind these consistent improvements and also underscores DMPO's potential as an explainable recommendation system. △ Less

Submitted 20 June, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

arXiv:2405.13787 [pdf, other]

Disentangle Sample Size and Initialization Effect on Perfect Generalization for Single-Neuron Target

Authors: Jiajie Zhao, Zhiwei Bai, Yaoyu Zhang

Abstract: Overparameterized models like deep neural networks have the intriguing ability to recover target functions with fewer sampled data points than parameters (see arXiv:2307.08921). To gain insights into this phenomenon, we concentrate on a single-neuron target recovery scenario, offering a systematic examination of how initialization and sample size influence the performance of two-layer neural netwo… ▽ More Overparameterized models like deep neural networks have the intriguing ability to recover target functions with fewer sampled data points than parameters (see arXiv:2307.08921). To gain insights into this phenomenon, we concentrate on a single-neuron target recovery scenario, offering a systematic examination of how initialization and sample size influence the performance of two-layer neural networks. Our experiments reveal that a smaller initialization scale is associated with improved generalization, and we identify a critical quantity called the "initial imbalance ratio" that governs training dynamics and generalization under small initialization, supported by theoretical proofs. Additionally, we empirically delineate two critical thresholds in sample size--termed the "optimistic sample size" and the "separation sample size"--that align with the theoretical frameworks established by (see arXiv:2307.08921 and arXiv:2309.00508). Our results indicate a transition in the model's ability to recover the target function: below the optimistic sample size, recovery is unattainable; at the optimistic sample size, recovery becomes attainable albeit with a set of initialization of zero measure. Upon reaching the separation sample size, the set of initialization that can successfully recover the target function shifts from zero to positive measure. These insights, derived from a simplified context, provide a perspective on the intricate yet decipherable complexities of perfect generalization in overparameterized neural networks. △ Less

Submitted 22 May, 2024; originally announced May 2024.

Comments: 22 pages, 11 figures

arXiv:2405.13721 [pdf, other]

Connectivity Shapes Implicit Regularization in Matrix Factorization Models for Matrix Completion

Authors: Zhiwei Bai, Jiajie Zhao, Yaoyu Zhang

Abstract: Matrix factorization models have been extensively studied as a valuable test-bed for understanding the implicit biases of overparameterized models. Although both low nuclear norm and low rank regularization have been studied for these models, a unified understanding of when, how, and why they achieve different implicit regularization effects remains elusive. In this work, we systematically investi… ▽ More Matrix factorization models have been extensively studied as a valuable test-bed for understanding the implicit biases of overparameterized models. Although both low nuclear norm and low rank regularization have been studied for these models, a unified understanding of when, how, and why they achieve different implicit regularization effects remains elusive. In this work, we systematically investigate the implicit regularization of matrix factorization for solving matrix completion problems. We empirically discover that the connectivity of observed data plays a crucial role in the implicit bias, with a transition from low nuclear norm to low rank as data shifts from disconnected to connected with increased observations. We identify a hierarchy of intrinsic invariant manifolds in the loss landscape that guide the training trajectory to evolve from low-rank to higher-rank solutions. Based on this finding, we theoretically characterize the training trajectory as following the hierarchical invariant manifold traversal process, generalizing the characterization of Li et al. (2020) to include the disconnected case. Furthermore, we establish conditions that guarantee minimum nuclear norm, closely aligning with our experimental findings, and we provide a dynamics characterization condition for ensuring minimum rank. Our work reveals the intricate interplay between data connectivity, training dynamics, and implicit regularization in matrix factorization models. △ Less

Submitted 22 May, 2024; originally announced May 2024.

Comments: 34 pages

arXiv:2404.18930 [pdf, other]

Hallucination of Multimodal Large Language Models: A Survey

Authors: Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, Mike Zheng Shou

Abstract: This survey presents a comprehensive analysis of the phenomenon of hallucination in multimodal large language models (MLLMs), also known as Large Vision-Language Models (LVLMs), which have demonstrated significant advancements and remarkable abilities in multimodal tasks. Despite these promising developments, MLLMs often generate outputs that are inconsistent with the visual content, a challenge k… ▽ More This survey presents a comprehensive analysis of the phenomenon of hallucination in multimodal large language models (MLLMs), also known as Large Vision-Language Models (LVLMs), which have demonstrated significant advancements and remarkable abilities in multimodal tasks. Despite these promising developments, MLLMs often generate outputs that are inconsistent with the visual content, a challenge known as hallucination, which poses substantial obstacles to their practical deployment and raises concerns regarding their reliability in real-world applications. This problem has attracted increasing attention, prompting efforts to detect and mitigate such inaccuracies. We review recent advances in identifying, evaluating, and mitigating these hallucinations, offering a detailed overview of the underlying causes, evaluation benchmarks, metrics, and strategies developed to address this issue. Additionally, we analyze the current challenges and limitations, formulating open questions that delineate potential pathways for future research. By drawing the granular classification and landscapes of hallucination causes, evaluation benchmarks, and mitigation methods, this survey aims to deepen the understanding of hallucinations in MLLMs and inspire further advancements in the field. Through our thorough and in-depth review, we contribute to the ongoing dialogue on enhancing the robustness and reliability of MLLMs, providing valuable insights and resources for researchers and practitioners alike. Resources are available at: https://github.com/showlab/Awesome-MLLM-Hallucination. △ Less

Submitted 29 April, 2024; originally announced April 2024.

Comments: 140 references

arXiv:2404.18255 [pdf, other]

PatentGPT: A Large Language Model for Intellectual Property

Authors: Zilong Bai, Ruiji Zhang, Linqing Chen, Qijun Cai, Yuan Zhong, Cong Wang, Yan Fang, Jie Fang, **g Sun, Weikuan Wang, Lizhi Zhou, Haoran Hua, Tian Qiu, Chaochao Wang, Cheng Sun, Jian** Lu, Yixin Wang, Yubin Xia, Meng Hu, Haowen Liu, Peng Xu, Licong Xu, Fu Bian, Xiaolong Gu, Lisha Zhang , et al. (2 additional authors not shown)

Abstract: In recent years, large language models(LLMs) have attracted significant attention due to their exceptional performance across a multitude of natural language process tasks, and have been widely applied in various fields. However, the application of large language models in the Intellectual Property (IP) domain is challenging due to the strong need for specialized knowledge, privacy protection, pro… ▽ More In recent years, large language models(LLMs) have attracted significant attention due to their exceptional performance across a multitude of natural language process tasks, and have been widely applied in various fields. However, the application of large language models in the Intellectual Property (IP) domain is challenging due to the strong need for specialized knowledge, privacy protection, processing of extremely long text in this field. In this technical report, we present for the first time a low-cost, standardized procedure for training IP-oriented LLMs, meeting the unique requirements of the IP domain. Using this standard process, we have trained the PatentGPT series models based on open-source pretrained models. By evaluating them on the open-source IP-oriented benchmark MOZIP, our domain-specific LLMs outperforms GPT-4, indicating the effectiveness of the proposed training procedure and the expertise of the PatentGPT models in the IP domain. Remarkably, our model surpassed GPT-4 on the 2019 China Patent Agent Qualification Examination, scoring 65 and matching human expert levels. Additionally, the PatentGPT model, which utilizes the SMoE architecture, achieves performance comparable to that of GPT-4 in the IP domain and demonstrates a better cost-performance ratio on long-text tasks, potentially serving as an alternative to GPT-4 within the IP domain. △ Less

Submitted 4 June, 2024; v1 submitted 28 April, 2024; originally announced April 2024.

Comments: 19 pages, 9 figures

ACM Class: I.2.7

arXiv:2404.17466 [pdf, other]

FTL: Transfer Learning Nonlinear Plasma Dynamic Transitions in Low Dimensional Embeddings via Deep Neural Networks

Authors: Zhe Bai, Xishuo Wei, William Tang, Leonid Oliker, Zhihong Lin, Samuel Williams

Abstract: Deep learning algorithms provide a new paradigm to study high-dimensional dynamical behaviors, such as those in fusion plasma systems. Development of novel model reduction methods, coupled with detection of abnormal modes with plasma physics, opens a unique opportunity for building efficient models to identify plasma instabilities for real-time control. Our Fusion Transfer Learning (FTL) model dem… ▽ More Deep learning algorithms provide a new paradigm to study high-dimensional dynamical behaviors, such as those in fusion plasma systems. Development of novel model reduction methods, coupled with detection of abnormal modes with plasma physics, opens a unique opportunity for building efficient models to identify plasma instabilities for real-time control. Our Fusion Transfer Learning (FTL) model demonstrates success in reconstructing nonlinear kink mode structures by learning from a limited amount of nonlinear simulation data. The knowledge transfer process leverages a pre-trained neural encoder-decoder network, initially trained on linear simulations, to effectively capture nonlinear dynamics. The low-dimensional embeddings extract the coherent structures of interest, while preserving the inherent dynamics of the complex system. Experimental results highlight FTL's capacity to capture transitional behaviors and dynamical features in plasma dynamics -- a task often challenging for conventional methods. The model developed in this study is generalizable and can be extended broadly through transfer learning to address various magnetohydrodynamics (MHD) modes. △ Less

Submitted 26 April, 2024; originally announced April 2024.

Comments: 18 pages, 10 figures

MSC Class: 76W05; 68T45 ACM Class: J.2; I.2.10

arXiv:2404.15067 [pdf, other]

Enhancing Textual Personality Detection toward Social Media: Integrating Long-term and Short-term Perspectives

Authors: Haohao Zhu, Xiaokun Zhang, Junyu Lu, Youlin Wu, Zewen Bai, Changrong Min, Liang Yang, Bo Xu, Dongyu Zhang, Hongfei Lin

Abstract: Textual personality detection aims to identify personality characteristics by analyzing user-generated content toward social media platforms. Numerous psychological literature highlighted that personality encompasses both long-term stable traits and short-term dynamic states. However, existing studies often concentrate only on either long-term or short-term personality representations, without eff… ▽ More Textual personality detection aims to identify personality characteristics by analyzing user-generated content toward social media platforms. Numerous psychological literature highlighted that personality encompasses both long-term stable traits and short-term dynamic states. However, existing studies often concentrate only on either long-term or short-term personality representations, without effectively combining both aspects. This limitation hinders a comprehensive understanding of individuals' personalities, as both stable traits and dynamic states are vital. To bridge this gap, we propose a Dual Enhanced Network(DEN) to jointly model users' long-term and short-term personality for textual personality detection. In DEN, a Long-term Personality Encoding is devised to effectively model long-term stable personality traits. Short-term Personality Encoding is presented to capture short-term dynamic personality states. The Bi-directional Interaction component facilitates the integration of both personality aspects, allowing for a comprehensive representation of the user's personality. Experimental results on two personality detection datasets demonstrate the effectiveness of the DEN model and the benefits of considering both the dynamic and stable nature of personality characteristics for textual personality detection. △ Less

Submitted 23 April, 2024; originally announced April 2024.

Comments: 11 pages, 9 figures

arXiv:2404.13896 [pdf, other]

CT-NeRF: Incremental Optimizing Neural Radiance Field and Poses with Complex Trajectory

Authors: Yunlong Ran, Yanxu Li, Qi Ye, Yuchi Huo, Zechun Bai, Jiahao Sun, Jiming Chen

Abstract: Neural radiance field (NeRF) has achieved impressive results in high-quality 3D scene reconstruction. However, NeRF heavily relies on precise camera poses. While recent works like BARF have introduced camera pose optimization within NeRF, their applicability is limited to simple trajectory scenes. Existing methods struggle while tackling complex trajectories involving large rotations. To address t… ▽ More Neural radiance field (NeRF) has achieved impressive results in high-quality 3D scene reconstruction. However, NeRF heavily relies on precise camera poses. While recent works like BARF have introduced camera pose optimization within NeRF, their applicability is limited to simple trajectory scenes. Existing methods struggle while tackling complex trajectories involving large rotations. To address this limitation, we propose CT-NeRF, an incremental reconstruction optimization pipeline using only RGB images without pose and depth input. In this pipeline, we first propose a local-global bundle adjustment under a pose graph connecting neighboring frames to enforce the consistency between poses to escape the local minima caused by only pose consistency with the scene structure. Further, we instantiate the consistency between poses as a reprojected geometric image distance constraint resulting from pixel-level correspondences between input image pairs. Through the incremental reconstruction, CT-NeRF enables the recovery of both camera poses and scene structure and is capable of handling scenes with complex trajectories. We evaluate the performance of CT-NeRF on two real-world datasets, NeRFBuster and Free-Dataset, which feature complex trajectories. Results show CT-NeRF outperforms existing methods in novel view synthesis and pose estimation accuracy. △ Less

Submitted 23 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.01543 [pdf, other]

Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes

Authors: Ziqian Bai, Feitong Tan, Sean Fanello, Rohit Pandey, Mingsong Dou, Shichen Liu, ** Tan, Yinda Zhang

Abstract: 3D head avatars built with neural implicit volumetric representations have achieved unprecedented levels of photorealism. However, the computational cost of these methods remains a significant barrier to their widespread adoption, particularly in real-time applications such as virtual reality and teleconferencing. While attempts have been made to develop fast neural rendering approaches for static… ▽ More 3D head avatars built with neural implicit volumetric representations have achieved unprecedented levels of photorealism. However, the computational cost of these methods remains a significant barrier to their widespread adoption, particularly in real-time applications such as virtual reality and teleconferencing. While attempts have been made to develop fast neural rendering approaches for static scenes, these methods cannot be simply employed to support realistic facial expressions, such as in the case of a dynamic facial performance. To address these challenges, we propose a novel fast 3D neural implicit head avatar model that achieves real-time rendering while maintaining fine-grained controllability and high rendering quality. Our key idea lies in the introduction of local hash table blendshapes, which are learned and attached to the vertices of an underlying face parametric model. These per-vertex hash-tables are linearly merged with weights predicted via a CNN, resulting in expression dependent embeddings. Our novel representation enables efficient density and color predictions using a lightweight MLP, which is further accelerated by a hierarchical nearest neighbor search method. Extensive experiments show that our approach runs in real-time while achieving comparable rendering quality to state-of-the-arts and decent results on challenging expressions. △ Less

Submitted 1 April, 2024; originally announced April 2024.

Comments: In CVPR2024. Project page: https://augmentedperception.github.io/monoavatar-plus

arXiv:2403.05428 [pdf, other]

Towards Real-World Stickers Use: A New Dataset for Multi-Tag Sticker Recognition

Authors: Bingbing Wang, Bin Liang, Chun-Mei Feng, Wangmeng Zuo, Zhixin Bai, Shijue Huang, Kam-Fai Wong, Xi Zeng, Ruifeng Xu

Abstract: In real-world conversations, the diversity and ambiguity of stickers often lead to varied interpretations based on the context, necessitating the requirement for comprehensively understanding stickers and supporting multi-tagging. To address this challenge, we introduce StickerTAG, the first multi-tag sticker dataset comprising a collected tag set with 461 tags and 13,571 sticker-tag pairs, design… ▽ More In real-world conversations, the diversity and ambiguity of stickers often lead to varied interpretations based on the context, necessitating the requirement for comprehensively understanding stickers and supporting multi-tagging. To address this challenge, we introduce StickerTAG, the first multi-tag sticker dataset comprising a collected tag set with 461 tags and 13,571 sticker-tag pairs, designed to provide a deeper understanding of stickers. Recognizing multiple tags for stickers becomes particularly challenging due to sticker tags usually are fine-grained attribute aware. Hence, we propose an Attentive Attribute-oriented Prompt Learning method, ie, Att$^2$PL, to capture informative features of stickers in a fine-grained manner to better differentiate tags. Specifically, we first apply an Attribute-oriented Description Generation (ADG) module to obtain the description for stickers from four attributes. Then, a Local Re-attention (LoR) module is designed to perceive the importance of local information. Finally, we use prompt learning to guide the recognition process and adopt confidence penalty optimization to penalize the confident output distribution. Extensive experiments show that our method achieves encouraging results for all commonly used metrics. △ Less

Submitted 16 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2403.05427 [pdf, other]

Reply with Sticker: New Dataset and Model for Sticker Retrieval

Authors: Bin Liang, Bingbing Wang, Zhixin Bai, Qiwei Lang, Mingwei Sun, Kaiheng Hou, Kam-Fai Wong, Ruifeng Xu

Abstract: Using stickers in online chatting is very prevalent on social media platforms, where the stickers used in the conversation can express someone's intention/emotion/attitude in a vivid, tactful, and intuitive way. Existing sticker retrieval research typically retrieves stickers based on context and the current utterance delivered by the user. That is, the stickers serve as a supplement to the curren… ▽ More Using stickers in online chatting is very prevalent on social media platforms, where the stickers used in the conversation can express someone's intention/emotion/attitude in a vivid, tactful, and intuitive way. Existing sticker retrieval research typically retrieves stickers based on context and the current utterance delivered by the user. That is, the stickers serve as a supplement to the current utterance. However, in the real-world scenario, using stickers to express what we want to say rather than as a supplement to our words only is also important. Therefore, in this paper, we create a new dataset for sticker retrieval in conversation, called StickerInt, where stickers are used to reply to previous conversations or supplement our words. Based on the created dataset, we present a simple yet effective framework for sticker retrieval in conversation based on the learning of intention and the cross-modal relationships between conversation context and stickers, coined as \textbf{Int-RA}. Specifically, we first devise a knowledge-enhanced intention predictor to introduce the intention information into the conversation representations. Subsequently, a relation-aware sticker selector is devised to retrieve the response sticker via cross-modal relationships. Extensive experiments on the created dataset show that the proposed model achieves state-of-the-art performance in sticker retrieval. △ Less

Submitted 8 March, 2024; originally announced March 2024.

arXiv:2402.15627 [pdf, other]

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Authors: Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao , et al. (7 additional authors not shown)

Abstract: We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model bl… ▽ More We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlap**, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research. △ Less

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2402.14660 [pdf, other]

ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models

Authors: Yanan Wu, Jie Liu, Xingyuan Bu, Jiaheng Liu, Zhanhui Zhou, Yuanxing Zhang, Chenchen Zhang, Zhiqi Bai, Haibin Chen, Tiezheng Ge, Wanli Ouyang, Wenbo Su, Bo Zheng

Abstract: This paper introduces ConceptMath, a bilingual (English and Chinese), fine-grained benchmark that evaluates concept-wise mathematical reasoning of Large Language Models (LLMs). Unlike traditional benchmarks that evaluate general mathematical reasoning with an average accuracy, ConceptMath systematically organizes math problems under a hierarchy of math concepts, so that mathematical reasoning can… ▽ More This paper introduces ConceptMath, a bilingual (English and Chinese), fine-grained benchmark that evaluates concept-wise mathematical reasoning of Large Language Models (LLMs). Unlike traditional benchmarks that evaluate general mathematical reasoning with an average accuracy, ConceptMath systematically organizes math problems under a hierarchy of math concepts, so that mathematical reasoning can be evaluated at different granularity with concept-wise accuracies. Based on our ConcepthMath, we evaluate a broad range of LLMs, and we observe existing LLMs, though achieving high average accuracies on traditional benchmarks, exhibit significant performance variations across different math concepts and may even fail catastrophically on the most basic ones. Besides, we also introduce an efficient fine-tuning strategy to enhance the weaknesses of existing LLMs. Finally, we hope ConceptMath could guide the developers to understand the fine-grained mathematical abilities of their models and facilitate the growth of foundation models. △ Less

Submitted 23 February, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

Comments: The benchmark dataset will be released soon

arXiv:2402.13724 [pdf, other]

Bring Your Own Character: A Holistic Solution for Automatic Facial Animation Generation of Customized Characters

Authors: Zechen Bai, Peng Chen, Xiaolan Peng, Lu Liu, Hui Chen, Mike Zheng Shou, Feng Tian

Abstract: Animating virtual characters has always been a fundamental research problem in virtual reality (VR). Facial animations play a crucial role as they effectively convey emotions and attitudes of virtual humans. However, creating such facial animations can be challenging, as current methods often involve utilization of expensive motion capture devices or significant investments of time and effort from… ▽ More Animating virtual characters has always been a fundamental research problem in virtual reality (VR). Facial animations play a crucial role as they effectively convey emotions and attitudes of virtual humans. However, creating such facial animations can be challenging, as current methods often involve utilization of expensive motion capture devices or significant investments of time and effort from human animators in tuning animation parameters. In this paper, we propose a holistic solution to automatically animate virtual human faces. In our solution, a deep learning model was first trained to retarget the facial expression from input face images to virtual human faces by estimating the blendshape coefficients. This method offers the flexibility of generating animations with characters of different appearances and blendshape topologies. Second, a practical toolkit was developed using Unity 3D, making it compatible with the most popular VR applications. The toolkit accepts both image and video as input to animate the target virtual human faces and enables users to manipulate the animation results. Furthermore, inspired by the spirit of Human-in-the-loop (HITL), we leveraged user feedback to further improve the performance of the model and toolkit, thereby increasing the customization properties to suit user preferences. The whole solution, for which we will make the code public, has the potential to accelerate the generation of facial animations for use in VR applications. △ Less

Submitted 21 February, 2024; originally announced February 2024.

Comments: 9 pages. To appear in IEEE-VR

arXiv:2402.11909 [pdf, other]

One2Avatar: Generative Implicit Head Avatar For Few-shot User Adaptation

Authors: Zhixuan Yu, Ziqian Bai, Abhimitra Meka, Feitong Tan, Qiangeng Xu, Rohit Pandey, Sean Fanello, Hyun Soo Park, Yinda Zhang

Abstract: Traditional methods for constructing high-quality, personalized head avatars from monocular videos demand extensive face captures and training time, posing a significant challenge for scalability. This paper introduces a novel approach to create high quality head avatar utilizing only a single or a few images per user. We learn a generative model for 3D animatable photo-realistic head avatar from… ▽ More Traditional methods for constructing high-quality, personalized head avatars from monocular videos demand extensive face captures and training time, posing a significant challenge for scalability. This paper introduces a novel approach to create high quality head avatar utilizing only a single or a few images per user. We learn a generative model for 3D animatable photo-realistic head avatar from a multi-view dataset of expressions from 2407 subjects, and leverage it as a prior for creating personalized avatar from few-shot images. Different from previous 3D-aware face generative models, our prior is built with a 3DMM-anchored neural radiance field backbone, which we show to be more effective for avatar creation through auto-decoding based on few-shot inputs. We also handle unstable 3DMM fitting by jointly optimizing the 3DMM fitting and camera calibration that leads to better few-shot adaptation. Our method demonstrates compelling results and outperforms existing state-of-the-art methods for few-shot avatar adaptation, paving the way for more efficient and personalized avatar creation. △ Less

Submitted 19 February, 2024; originally announced February 2024.

arXiv:2402.03933 [pdf]

Development of a Evaluation Tool for Age-Appropriate Software in Aging Environments: A Delphi Study

Authors: Zhenggang Bai, Yougxiang Fang, Hongtu Chen, Xinru Chen, Ning An, Min Zhang, Guoxin Rui, **g **

Abstract: Objective: We aimed to develop a dependable reliable tool for assessing software ageappropriateness. Methods: We conducted a systematic review to get the indicators of technology ageappropriateness from studies from January 2000 to April 2023.This study engaged 25 experts from the fields of anthropology, sociology,and social technology research across, three rounds of Delphi consultations were con… ▽ More Objective: We aimed to develop a dependable reliable tool for assessing software ageappropriateness. Methods: We conducted a systematic review to get the indicators of technology ageappropriateness from studies from January 2000 to April 2023.This study engaged 25 experts from the fields of anthropology, sociology,and social technology research across, three rounds of Delphi consultations were conducted. Experts were asked to screen, assess, add and provide feedback on the preliminary indicators identified in the initial indicator pool. Result: We found 76 criterias for evaluating quality criteria was extracted, grouped into 11 distinct domains. After completing three rounds of Delphi consultations,experts drew upon their personal experiences,theoretical frameworks,and industry insights to arrive at a three-dimensional structure for the evaluation tooluser experience,product quality,and social promotion.These metrics were further distilled into a 16-item scale, and a corresponding questionnaire was formulated.The developed tool exhibited strong internal reliability(Cronbach's Alpha is 0.867)and content validity(S-CVI is 0.93). Conclusion: This tool represents a straightforward,objective,and reliable mechanism for evaluating software's appropriateness across age groups. Moreover,it offers valuable insights and practical guidance for designing and develo** of high-quality age-appropriate software,and assisst age groups to select software they like. △ Less

Submitted 4 February, 2024; originally announced February 2024.

arXiv:2402.02791 [pdf, other]

Rethinking Optimization and Architecture for Tiny Language Models

Authors: Yehui Tang, Fangcheng Liu, Yunsheng Ni, Yuchuan Tian, Zheyuan Bai, Yi-Qi Hu, Sichao Liu, Shangling Jui, Kai Han, Yunhe Wang

Abstract: The power of large language models (LLMs) has been demonstrated through numerous data and computing resources. However, the application of language models on mobile devices is facing huge challenge on the computation and memory costs, that is, tiny language models with high performance are urgently required. Limited by the highly complex training process, there are many details for optimizing lang… ▽ More The power of large language models (LLMs) has been demonstrated through numerous data and computing resources. However, the application of language models on mobile devices is facing huge challenge on the computation and memory costs, that is, tiny language models with high performance are urgently required. Limited by the highly complex training process, there are many details for optimizing language models that are seldom studied carefully. In this study, based on a tiny language model with 1B parameters, we carefully design a series of empirical study to analyze the effect of each component. Three perspectives are mainly discussed, \ie, neural architecture, parameter initialization, and optimization strategy. Several design formulas are empirically proved especially effective for tiny language models, including tokenizer compression, architecture tweaking, parameter inheritance and multiple-round training. Then we train PanGu-$π$-1B Pro and PanGu-$π$-1.5B Pro on 1.6T multilingual corpora, following the established formulas. Experimental results demonstrate the improved optimization and architecture yield a notable average improvement of 8.87 on benchmark evaluation sets for PanGu-$π$-1B Pro. Besides, PanGu-$π$-1.5B Pro surpasses a range of SOTA models with larger model sizes, validating its superior performance. The code is available at https://github.com/YuchuanTian/RethinkTinyLM. △ Less

Submitted 6 February, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

arXiv:2402.01345 [pdf, other]

Skip \n: A Simple Method to Reduce Hallucination in Large Vision-Language Models

Authors: Zongbo Han, Zechen Bai, Haiyang Mei, Qianli Xu, Changqing Zhang, Mike Zheng Shou

Abstract: Recent advancements in large vision-language models (LVLMs) have demonstrated impressive capability in visual information understanding with human language. Despite these advances, LVLMs still face challenges with multimodal hallucination, such as generating text descriptions of objects that are not present in the visual information. However, the underlying fundamental reasons of multimodal halluc… ▽ More Recent advancements in large vision-language models (LVLMs) have demonstrated impressive capability in visual information understanding with human language. Despite these advances, LVLMs still face challenges with multimodal hallucination, such as generating text descriptions of objects that are not present in the visual information. However, the underlying fundamental reasons of multimodal hallucinations remain poorly explored. In this paper, we propose a new perspective, suggesting that the inherent biases in LVLMs might be a key factor in hallucinations. Specifically, we systematically identify a semantic shift bias related to paragraph breaks (\n\n), where the content before and after '\n\n' in the training data frequently exhibit significant semantic changes. This pattern leads the model to infer that the contents following '\n\n' should be obviously different from the preceding contents with less hallucinatory descriptions, thereby increasing the probability of hallucinatory descriptions subsequent to the '\n\n'. We have validated this hypothesis on multiple publicly available LVLMs. Besides, we find that deliberately inserting '\n\n' at the generated description can induce more hallucinations. A simple method is proposed to effectively mitigate the hallucination of LVLMs by skip** the output of '\n'. △ Less

Submitted 7 May, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

arXiv:2401.14856 [pdf, other]

Memory-Inspired Temporal Prompt Interaction for Text-Image Classification

Authors: Xinyao Yu, Hao Sun, Ziwei Niu, Rui Qin, Zhenjia Bai, Yen-Wei Chen, Lanfen Lin

Abstract: In recent years, large-scale pre-trained multimodal models (LMM) generally emerge to integrate the vision and language modalities, achieving considerable success in various natural language processing and computer vision tasks. The growing size of LMMs, however, results in a significant computational cost for fine-tuning these models for downstream tasks. Hence, prompt-based interaction strategy i… ▽ More In recent years, large-scale pre-trained multimodal models (LMM) generally emerge to integrate the vision and language modalities, achieving considerable success in various natural language processing and computer vision tasks. The growing size of LMMs, however, results in a significant computational cost for fine-tuning these models for downstream tasks. Hence, prompt-based interaction strategy is studied to align modalities more efficiently. In this contex, we propose a novel prompt-based multimodal interaction strategy inspired by human memory strategy, namely Memory-Inspired Temporal Prompt Interaction (MITP). Our proposed method involves in two stages as in human memory strategy: the acquiring stage, and the consolidation and activation stage. We utilize temporal prompts on intermediate layers to imitate the acquiring stage, leverage similarity-based prompt interaction to imitate memory consolidation, and employ prompt generation strategy to imitate memory activation. The main strength of our paper is that we interact the prompt vectors on intermediate layers to leverage sufficient information exchange between modalities, with compressed trainable parameters and memory usage. We achieve competitive results on several datasets with relatively small memory usage and 2.0M of trainable parameters (about 1% of the pre-trained foundation model). △ Less

Submitted 26 January, 2024; originally announced January 2024.

arXiv:2401.06951 [pdf, other]

E^2-LLM: Efficient and Extreme Length Extension of Large Language Models

Authors: Jiaheng Liu, Zhiqi Bai, Yuanxing Zhang, Chenchen Zhang, Yu Zhang, Ge Zhang, Jiakai Wang, Haoran Que, Yukang Chen, Wenbo Su, Tiezheng Ge, Jie Fu, Wenhu Chen, Bo Zheng

Abstract: Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. Existing long-context extension methods usually need additional training procedures to support corresponding long-context windows, where the long-context training data (e.g., 32k) is needed, and high GPU training costs are assumed. To address the aforementioned issue… ▽ More Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. Existing long-context extension methods usually need additional training procedures to support corresponding long-context windows, where the long-context training data (e.g., 32k) is needed, and high GPU training costs are assumed. To address the aforementioned issues, we propose an Efficient and Extreme length extension method for Large Language Models, called E 2 -LLM, with only one training procedure and dramatically reduced computation cost, which also removes the need to collect long-context data. Concretely, first, the training data of our E 2 -LLM only requires a short length (e.g., 4k), which reduces the tuning cost greatly. Second, the training procedure on the short training context window is performed only once time, and we can support different evaluation context windows at inference. Third, in E 2 - LLM, based on RoPE position embeddings, we introduce two different augmentation methods on the scale and position index parameters for different samples in training. It aims to make the model more robust to the different relative differences when directly interpolating the arbitrary context length at inference. Comprehensive experimental results on multiple benchmark datasets demonstrate the effectiveness of our E 2 -LLM on challenging long-context tasks. △ Less

Submitted 22 February, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

arXiv:2401.01529 [pdf, other]

Glance and Focus: Memory Prompting for Multi-Event Video Question Answering

Authors: Ziyi Bai, Rui** Wang, Xilin Chen

Abstract: Video Question Answering (VideoQA) has emerged as a vital tool to evaluate agents' ability to understand human daily behaviors. Despite the recent success of large vision language models in many multi-modal tasks, complex situation reasoning over videos involving multiple human-object interaction events still remains challenging. In contrast, humans can easily tackle it by using a series of episod… ▽ More Video Question Answering (VideoQA) has emerged as a vital tool to evaluate agents' ability to understand human daily behaviors. Despite the recent success of large vision language models in many multi-modal tasks, complex situation reasoning over videos involving multiple human-object interaction events still remains challenging. In contrast, humans can easily tackle it by using a series of episode memories as anchors to quickly locate question-related key moments for reasoning. To mimic this effective reasoning strategy, we propose the Glance-Focus model. One simple way is to apply an action detection model to predict a set of actions as key memories. However, these actions within a closed set vocabulary are hard to generalize to various video domains. Instead of that, we train an Encoder-Decoder to generate a set of dynamic event memories at the glancing stage. Apart from using supervised bipartite matching to obtain the event memories, we further design an unsupervised memory generation method to get rid of dependence on event annotations. Next, at the focusing stage, these event memories act as a bridge to establish the correlation between the questions with high-level event concepts and low-level lengthy video content. Given the question, the model first focuses on the generated key event memory, then focuses on the most relevant moment for reasoning through our designed multi-level cross-attention mechanism. We conduct extensive experiments on four Multi-Event VideoQA benchmarks including STAR, EgoTaskQA, AGQA, and NExT-QA. Our proposed model achieves state-of-the-art results, surpassing current large models in various challenging reasoning tasks. The code and models are available at https://github.com/ByZ0e/Glance-Focus. △ Less

Submitted 2 January, 2024; originally announced January 2024.

Comments: Accepted in NeurIPS 2023

arXiv:2312.17276 [pdf, other]

PanGu-$π$: Enhancing Language Model Architectures via Nonlinearity Compensation

Authors: Yunhe Wang, Hanting Chen, Yehui Tang, Tianyu Guo, Kai Han, Ying Nie, Xutao Wang, Hailin Hu, Zheyuan Bai, Yun Wang, Fangcheng Liu, Zhicheng Liu, Jianyuan Guo, Sinan Zeng, Yinchen Zhang, Qinghua Xu, Qun Liu, Jun Yao, Chao Xu, Dacheng Tao

Abstract: The recent trend of large language models (LLMs) is to increase the scale of both model size (\aka the number of parameters) and dataset to achieve better generative ability, which is definitely proved by a lot of work such as the famous GPT and Llama. However, large models often involve massive computational costs, and practical applications cannot afford such high prices. However, the method of… ▽ More The recent trend of large language models (LLMs) is to increase the scale of both model size (\aka the number of parameters) and dataset to achieve better generative ability, which is definitely proved by a lot of work such as the famous GPT and Llama. However, large models often involve massive computational costs, and practical applications cannot afford such high prices. However, the method of constructing a strong model architecture for LLMs is rarely discussed. We first analyze the state-of-the-art language model architectures and observe the feature collapse problem. Based on the theoretical analysis, we propose that the nonlinearity is also very important for language models, which is usually studied in convolutional neural networks for vision tasks. The series informed activation function is then introduced with tiny calculations that can be ignored, and an augmented shortcut is further used to enhance the model nonlinearity. We then demonstrate that the proposed approach is significantly effective for enhancing the model nonlinearity through carefully designed ablations; thus, we present a new efficient model architecture for establishing modern, namely, PanGu-$π$. Experiments are then conducted using the same dataset and training strategy to compare PanGu-$π$ with state-of-the-art LLMs. The results show that PanGu-$π$-7B can achieve a comparable performance to that of benchmarks with about 10\% inference speed-up, and PanGu-$π$-1B can achieve state-of-the-art performance in terms of accuracy and efficiency. In addition, we have deployed PanGu-$π$-7B in the high-value domains of finance and law, develo** an LLM named YunShan for practical application. The results show that YunShan can surpass other models with similar scales on benchmarks. △ Less

Submitted 27 December, 2023; originally announced December 2023.

arXiv:2312.13108 [pdf, other]

ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation

Authors: Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou

Abstract: Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby boosting human productivity. Existing works leveraging Large Language Model (LLM) or LLM-based AI agents have shown capabilities in automating tasks on Android and Web platforms. However, these tasks are primarily aimed at simple device usage and entertainment operations. This paper… ▽ More Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby boosting human productivity. Existing works leveraging Large Language Model (LLM) or LLM-based AI agents have shown capabilities in automating tasks on Android and Web platforms. However, these tasks are primarily aimed at simple device usage and entertainment operations. This paper presents a novel benchmark, AssistGUI, to evaluate whether models are capable of manipulating the mouse and keyboard on the Windows platform in response to user-requested tasks. We carefully collected a set of 100 tasks from nine widely-used software applications, such as, After Effects and MS Word, each accompanied by the necessary project files for better evaluation. Moreover, we propose an advanced Actor-Critic Embodied Agent framework, which incorporates a sophisticated GUI parser driven by an LLM-agent and an enhanced reasoning mechanism adept at handling lengthy procedural tasks. Our experimental results reveal that our GUI Parser and Reasoning mechanism outshine existing methods in performance. Nevertheless, the potential remains substantial, with the best model attaining only a 46% success rate on our benchmark. We conclude with a thorough analysis of the current methods' limitations, setting the stage for future breakthroughs in this domain. △ Less

Submitted 1 January, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

Comments: Project Page: https://showlab.github.io/assistgui/

arXiv:2312.04008 [pdf, other]

Natural-language-driven Simulation Benchmark and Copilot for Efficient Production of Object Interactions in Virtual Road Scenes

Authors: Kairui Yang, Zihao Guo, Gengjie Lin, Haotian Dong, Die Zuo, Jibin Peng, Zhao Huang, Zhecheng Xu, Fupeng Li, Ziyun Bai, Di Lin

Abstract: We advocate the idea of the natural-language-driven(NLD) simulation to efficiently produce the object interactions between multiple objects in the virtual road scenes, for teaching and testing the autonomous driving systems that should take quick action to avoid collision with obstacles with unpredictable motions. The NLD simulation allows the brief natural-language description to control the obje… ▽ More We advocate the idea of the natural-language-driven(NLD) simulation to efficiently produce the object interactions between multiple objects in the virtual road scenes, for teaching and testing the autonomous driving systems that should take quick action to avoid collision with obstacles with unpredictable motions. The NLD simulation allows the brief natural-language description to control the object interactions, significantly reducing the human efforts for creating a large amount of interaction data. To facilitate the research of NLD simulation, we collect the Language-to-Interaction(L2I) benchmark dataset with 120,000 natural-language descriptions of object interactions in 6 common types of road topologies. Each description is associated with the programming code, which the graphic render can use to visually reconstruct the object interactions in the virtual scenes. As a methodology contribution, we design SimCopilot to translate the interaction descriptions to the renderable code. We use the L2I dataset to evaluate SimCopilot's abilities to control the object motions, generate complex interactions, and generalize interactions across road topologies. The L2I dataset and the evaluation results motivate the relevant research of the NLD simulation. △ Less

Submitted 15 December, 2023; v1 submitted 6 December, 2023; originally announced December 2023.

Comments: 12 pages, 6 figures

arXiv:2312.03548 [pdf, other]

doi 10.1109/TCSII.2023.3333436

Texture-Semantic Collaboration Network for ORSI Salient Object Detection

Authors: Gongyang Li, Zhen Bai, Zhi Liu

Abstract: Salient object detection (SOD) in optical remote sensing images (ORSIs) has become increasingly popular recently. Due to the characteristics of ORSIs, ORSI-SOD is full of challenges, such as multiple objects, small objects, low illuminations, and irregular shapes. To address these challenges, we propose a concise yet effective Texture-Semantic Collaboration Network (TSCNet) to explore the collabor… ▽ More Salient object detection (SOD) in optical remote sensing images (ORSIs) has become increasingly popular recently. Due to the characteristics of ORSIs, ORSI-SOD is full of challenges, such as multiple objects, small objects, low illuminations, and irregular shapes. To address these challenges, we propose a concise yet effective Texture-Semantic Collaboration Network (TSCNet) to explore the collaboration of texture cues and semantic cues for ORSI-SOD. Specifically, TSCNet is based on the generic encoder-decoder structure. In addition to the encoder and decoder, TSCNet includes a vital Texture-Semantic Collaboration Module (TSCM), which performs valuable feature modulation and interaction on basic features extracted from the encoder. The main idea of our TSCM is to make full use of the texture features at the lowest level and the semantic features at the highest level to achieve the expression enhancement of salient regions on features. In the TSCM, we first enhance the position of potential salient regions using semantic features. Then, we render and restore the object details using the texture features. Meanwhile, we also perceive regions of various scales, and construct interactions between different regions. Thanks to the perfect combination of TSCM and generic structure, our TSCNet can take care of both the position and details of salient objects, effectively handling various scenes. Extensive experiments on three datasets demonstrate that our TSCNet achieves competitive performance compared to 14 state-of-the-art methods. The code and results of our method are available at https://github.com/MathLee/TSCNet. △ Less

Submitted 6 December, 2023; originally announced December 2023.

Comments: 5 pages, 3 figures, Accepted by IEEE Transactions on Circuits and Systems II: Express Briefs 2023

arXiv:2311.16568 [pdf, ps, other]

Active Reconfigurable Intelligent Surface Enhanced Spectrum Sensing for Cognitive Radio Networks

Authors: Jungang Ge, Ying-Chang Liang, Sumei Sun, Yonghong Zeng, Zhidong Bai

Abstract: In opportunistic cognitive radio networks, when the primary signal is very weak compared to the background noise, the secondary user requires long sensing time to achieve a reliable spectrum sensing performance, leading to little remaining time for the secondary transmission. To tackle this issue, we propose an active reconfigurable intelligent surface (RIS) assisted spectrum sensing system, where… ▽ More In opportunistic cognitive radio networks, when the primary signal is very weak compared to the background noise, the secondary user requires long sensing time to achieve a reliable spectrum sensing performance, leading to little remaining time for the secondary transmission. To tackle this issue, we propose an active reconfigurable intelligent surface (RIS) assisted spectrum sensing system, where the received signal strength from the interested primary user can be enhanced and underlying interference within the background noise can be mitigated as well. In comparison with the passive RIS, the active RIS can not only adapt the phase shift of each reflecting element but also amplify the incident signals. Notably, we study the reflecting coefficient matrix (RCM) optimization problem to improve the detection probability given a maximum tolerable false alarm probability and limited sensing time. Then, we show that the formulated problem can be equivalently transformed to a weighted mean square error minimization problem using the principle of the well-known weighted minimum mean square error (WMMSE) algorithm, and an iterative optimization approach is proposed to obtain the optimal RCM. In addition, to fairly compare passive RIS and active RIS, we study the required power budget of the RIS to achieve a target detection probability under a special case where the direct links are neglected and the RIS-related channels are line-of-sight. Via extensive simulations, the effectiveness of the WMMSE-based RCM optimization approach is demonstrated. Furthermore, the results reveal that the active RIS can outperform the passive RIS when the underlying interference within the background noise is relatively weak, whereas the passive RIS performs better in strong interference scenarios because the same power budget can support a vast number of passive reflecting elements for interference mitigation. △ Less

Submitted 26 April, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

arXiv:2311.01689 [pdf, other]

Data-Free Distillation of Language Model by Text-to-Text Transfer

Authors: Zheyuan Bai, Xinduo Liu, Hailin Hu, Tianyu Guo, Qinghua Zhang, Yunhe Wang

Abstract: Data-Free Knowledge Distillation (DFKD) plays a vital role in compressing the model when original training data is unavailable. Previous works for DFKD in NLP mainly focus on distilling encoder-only structures like BERT on classification tasks, which overlook the notable progress of generative language modeling. In this work, we propose a novel DFKD framework, namely DFKD-T$^{3}$, where the pretra… ▽ More Data-Free Knowledge Distillation (DFKD) plays a vital role in compressing the model when original training data is unavailable. Previous works for DFKD in NLP mainly focus on distilling encoder-only structures like BERT on classification tasks, which overlook the notable progress of generative language modeling. In this work, we propose a novel DFKD framework, namely DFKD-T$^{3}$, where the pretrained generative language model can also serve as a controllable data generator for model compression. This novel framework DFKD-T$^{3}$ leads to an end-to-end learnable text-to-text framework to transform the general domain corpus to compression-friendly task data, targeting to improve both the \textit{specificity} and \textit{diversity}. Extensive experiments show that our method can boost the distillation performance in various downstream tasks such as sentiment analysis, linguistic acceptability, and information extraction. Furthermore, we show that the generated texts can be directly used for distilling other language models and outperform the SOTA methods, making our method more appealing in a general DFKD setting. Our code is available at https://gitee.com/mindspore/models/tree/master/research/nlp/DFKD\_T3. △ Less

Submitted 2 November, 2023; originally announced November 2023.

arXiv:2310.17780 [pdf, other]

AutoCT: Automated CT registration, segmentation, and quantification

Authors: Zhe Bai, Abdelilah Essiari, Talita Perciano, Kristofer E. Bouchard

Abstract: The processing and analysis of computed tomography (CT) imaging is important for both basic scientific development and clinical applications. In AutoCT, we provide a comprehensive pipeline that integrates an end-to-end automatic preprocessing, registration, segmentation, and quantitative analysis of 3D CT scans. The engineered pipeline enables atlas-based CT segmentation and quantification leverag… ▽ More The processing and analysis of computed tomography (CT) imaging is important for both basic scientific development and clinical applications. In AutoCT, we provide a comprehensive pipeline that integrates an end-to-end automatic preprocessing, registration, segmentation, and quantitative analysis of 3D CT scans. The engineered pipeline enables atlas-based CT segmentation and quantification leveraging diffeomorphic transformations through efficient forward and inverse map**s. The extracted localized features from the deformation field allow for downstream statistical learning that may facilitate medical diagnostics. On a lightweight and portable software platform, AutoCT provides a new toolkit for the CT imaging community to underpin the deployment of artificial intelligence-driven applications. △ Less

Submitted 26 October, 2023; originally announced October 2023.

arXiv:2310.04677 [pdf, other]

AG-CRC: Anatomy-Guided Colorectal Cancer Segmentation in CT with Imperfect Anatomical Knowledge

Authors: Rongzhao Zhang, Zhian Bai, Ruoying Yu, Wenrao Pang, Lingyun Wang, Lifeng Zhu, Xiaofan Zhang, Huan Zhang, Weiguo Hu

Abstract: When delineating lesions from medical images, a human expert can always keep in mind the anatomical structure behind the voxels. However, although high-quality (though not perfect) anatomical information can be retrieved from computed tomography (CT) scans with modern deep learning algorithms, it is still an open problem how these automatically generated organ masks can assist in addressing challe… ▽ More When delineating lesions from medical images, a human expert can always keep in mind the anatomical structure behind the voxels. However, although high-quality (though not perfect) anatomical information can be retrieved from computed tomography (CT) scans with modern deep learning algorithms, it is still an open problem how these automatically generated organ masks can assist in addressing challenging lesion segmentation tasks, such as the segmentation of colorectal cancer (CRC). In this paper, we develop a novel Anatomy-Guided segmentation framework to exploit the auto-generated organ masks to aid CRC segmentation from CT, namely AG-CRC. First, we obtain multi-organ segmentation (MOS) masks with existing MOS models (e.g., TotalSegmentor) and further derive a more robust organ of interest (OOI) mask that may cover most of the colon-rectum and CRC voxels. Then, we propose an anatomy-guided training patch sampling strategy by optimizing a heuristic gain function that considers both the proximity of important regions (e.g., the tumor or organs of interest) and sample diversity. Third, we design a novel self-supervised learning scheme inspired by the topology of tubular organs like the colon to boost the model performance further. Finally, we employ a masked loss scheme to guide the model to focus solely on the essential learning region. We extensively evaluate the proposed method on two CRC segmentation datasets, where substantial performance improvement (5% to 9% in Dice) is achieved over current state-of-the-art medical image segmentation models, and the ablation studies further evidence the efficacy of every proposed component. △ Less

Submitted 30 November, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

Comments: under review

arXiv:2309.12865 [pdf, other]

Bridging Sensor Gaps via Single-Direction Tuning for Hyperspectral Image Classification

Authors: Xizhe Xue, Haokui Zhang, Ying Li, Liuwei Wan, Zongwen Bai, Mike Zheng Shou

Abstract: Recently, some researchers started exploring the use of ViTs in tackling HSI classification and achieved remarkable results. However, the training of ViT models requires a considerable number of training samples, while hyperspectral data, due to its high annotation costs, typically has a relatively small number of training samples. This contradiction has not been effectively addressed. In this pap… ▽ More Recently, some researchers started exploring the use of ViTs in tackling HSI classification and achieved remarkable results. However, the training of ViT models requires a considerable number of training samples, while hyperspectral data, due to its high annotation costs, typically has a relatively small number of training samples. This contradiction has not been effectively addressed. In this paper, aiming to solve this problem, we propose the single-direction tuning (SDT) strategy, which serves as a bridge, allowing us to leverage existing labeled HSI datasets even RGB datasets to enhance the performance on new HSI datasets with limited samples. The proposed SDT inherits the idea of prompt tuning, aiming to reuse pre-trained models with minimal modifications for adaptation to new tasks. But unlike prompt tuning, SDT is custom-designed to accommodate the characteristics of HSIs. The proposed SDT utilizes a parallel architecture, an asynchronous cold-hot gradient update strategy, and unidirectional interaction. It aims to fully harness the potent representation learning capabilities derived from training on heterologous, even cross-modal datasets. In addition, we also introduce a novel Triplet-structured transformer (Tri-Former), where spectral attention and spatial attention modules are merged in parallel to construct the token mixing component for reducing computation cost and a 3D convolution-based channel mixer module is integrated to enhance stability and keep structure information. Comparison experiments conducted on three representative HSI datasets captured by different sensors demonstrate the proposed Tri-Former achieves better performance compared to several state-of-the-art methods. Homologous, heterologous and cross-modal tuning experiments verified the effectiveness of the proposed SDT. △ Less

Submitted 22 September, 2023; originally announced September 2023.

arXiv:2309.09858 [pdf, other]

Unsupervised Open-Vocabulary Object Localization in Videos

Authors: Ke Fan, Zechen Bai, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, Mike Zheng Shou, Francesco Locatello, Bernt Schiele, Thomas Brox, Zheng Zhang, Yanwei Fu, Tong He

Abstract: In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via an object-centric approach with slot attention and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to… ▽ More In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via an object-centric approach with slot attention and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks. △ Less

Submitted 26 June, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

Comments: Accepted by ICCV 2023; Presented on CVPR 2024 Workshop CORR; Project Page:https://github.com/amazon-science/object-centric-vol

arXiv:2309.08206 [pdf, other]

doi 10.1109/TIP.2023.3314285

Salient Object Detection in Optical Remote Sensing Images Driven by Transformer

Authors: Gongyang Li, Zhen Bai, Zhi Liu, Xinpeng Zhang, Haibin Ling

Abstract: Existing methods for Salient Object Detection in Optical Remote Sensing Images (ORSI-SOD) mainly adopt Convolutional Neural Networks (CNNs) as the backbone, such as VGG and ResNet. Since CNNs can only extract features within certain receptive fields, most ORSI-SOD methods generally follow the local-to-contextual paradigm. In this paper, we propose a novel Global Extraction Local Exploration Networ… ▽ More Existing methods for Salient Object Detection in Optical Remote Sensing Images (ORSI-SOD) mainly adopt Convolutional Neural Networks (CNNs) as the backbone, such as VGG and ResNet. Since CNNs can only extract features within certain receptive fields, most ORSI-SOD methods generally follow the local-to-contextual paradigm. In this paper, we propose a novel Global Extraction Local Exploration Network (GeleNet) for ORSI-SOD following the global-to-local paradigm. Specifically, GeleNet first adopts a transformer backbone to generate four-level feature embeddings with global long-range dependencies. Then, GeleNet employs a Direction-aware Shuffle Weighted Spatial Attention Module (D-SWSAM) and its simplified version (SWSAM) to enhance local interactions, and a Knowledge Transfer Module (KTM) to further enhance cross-level contextual interactions. D-SWSAM comprehensively perceives the orientation information in the lowest-level features through directional convolutions to adapt to various orientations of salient objects in ORSIs, and effectively enhances the details of salient objects with an improved attention mechanism. SWSAM discards the direction-aware part of D-SWSAM to focus on localizing salient objects in the highest-level features. KTM models the contextual correlation knowledge of two middle-level features of different scales based on the self-attention mechanism, and transfers the knowledge to the raw features to generate more discriminative features. Finally, a saliency predictor is used to generate the saliency map based on the outputs of the above three modules. Extensive experiments on three public datasets demonstrate that the proposed GeleNet outperforms relevant state-of-the-art methods. The code and results of our method are available at https://github.com/MathLee/GeleNet. △ Less

Submitted 15 September, 2023; originally announced September 2023.

Comments: 13 pages, 6 figures, Accepted by IEEE Transactions on Image Processing 2023

arXiv:2309.00233 [pdf, other]

Object-Centric Multiple Object Tracking

Authors: Zixu Zhao, Jiaze Wang, Max Horn, Yizhuo Ding, Tong He, Zechen Bai, Dominik Zietlow, Carl-Johann Simon-Gabriel, Bing Shuai, Zhuowen Tu, Thomas Brox, Bernt Schiele, Yanwei Fu, Francesco Locatello, Zheng Zhang, Tianjun Xiao

Abstract: Unsupervised object-centric learning methods allow the partitioning of scenes into entities without additional localization information and are excellent candidates for reducing the annotation burden of multiple-object tracking (MOT) pipelines. Unfortunately, they lack two key properties: objects are often split into parts and are not consistently tracked over time. In fact, state-of-the-art model… ▽ More Unsupervised object-centric learning methods allow the partitioning of scenes into entities without additional localization information and are excellent candidates for reducing the annotation burden of multiple-object tracking (MOT) pipelines. Unfortunately, they lack two key properties: objects are often split into parts and are not consistently tracked over time. In fact, state-of-the-art models achieve pixel-level accuracy and temporal consistency by relying on supervised object detection with additional ID labels for the association through time. This paper proposes a video object-centric model for MOT. It consists of an index-merge module that adapts the object-centric slots into detection outputs and an object memory module that builds complete object prototypes to handle occlusions. Benefited from object-centric learning, we only require sparse detection labels (0%-6.25%) for object localization and feature binding. Relying on our self-supervised Expectation-Maximization-inspired loss for object association, our approach requires no ID labels. Our experiments significantly narrow the gap between the existing object-centric model and the fully supervised state-of-the-art and outperform several unsupervised trackers. △ Less

Submitted 5 September, 2023; v1 submitted 31 August, 2023; originally announced September 2023.

Comments: ICCV 2023 camera-ready version

arXiv:2307.08921 [pdf, other]

Optimistic Estimate Uncovers the Potential of Nonlinear Models

Authors: Yaoyu Zhang, Zhongwang Zhang, Leyang Zhang, Zhiwei Bai, Tao Luo, Zhi-Qin John Xu

Abstract: We propose an optimistic estimate to evaluate the best possible fitting performance of nonlinear models. It yields an optimistic sample size that quantifies the smallest possible sample size to fit/recover a target function using a nonlinear model. We estimate the optimistic sample sizes for matrix factorization models, deep models, and deep neural networks (DNNs) with fully-connected or convoluti… ▽ More We propose an optimistic estimate to evaluate the best possible fitting performance of nonlinear models. It yields an optimistic sample size that quantifies the smallest possible sample size to fit/recover a target function using a nonlinear model. We estimate the optimistic sample sizes for matrix factorization models, deep models, and deep neural networks (DNNs) with fully-connected or convolutional architecture. For each nonlinear model, our estimates predict a specific subset of targets that can be fitted at overparameterization, which are confirmed by our experiments. Our optimistic estimate reveals two special properties of the DNN models -- free expressiveness in width and costly expressiveness in connection. These properties suggest the following architecture design principles of DNNs: (i) feel free to add neurons/kernels; (ii) restrain from connecting neurons. Overall, our optimistic estimate theoretically unveils the vast potential of nonlinear models in fitting at overparameterization. Based on this framework, we anticipate gaining a deeper understanding of how and why numerous nonlinear models such as DNNs can effectively realize their potential in practice in the near future. △ Less

Submitted 17 July, 2023; originally announced July 2023.

arXiv:2305.18149 [pdf, other]

Multiscale Positive-Unlabeled Detection of AI-Generated Texts

Authors: Yuchuan Tian, Hanting Chen, Xutao Wang, Zheyuan Bai, Qinghua Zhang, Ruifeng Li, Chao Xu, Yunhe Wang

Abstract: Recent releases of Large Language Models (LLMs), e.g. ChatGPT, are astonishing at generating human-like texts, but they may impact the authenticity of texts. Previous works proposed methods to detect these AI-generated texts, including simple ML classifiers, pretrained-model-based zero-shot methods, and finetuned language classification models. However, mainstream detectors always fail on short te… ▽ More Recent releases of Large Language Models (LLMs), e.g. ChatGPT, are astonishing at generating human-like texts, but they may impact the authenticity of texts. Previous works proposed methods to detect these AI-generated texts, including simple ML classifiers, pretrained-model-based zero-shot methods, and finetuned language classification models. However, mainstream detectors always fail on short texts, like SMSes, Tweets, and reviews. In this paper, a Multiscale Positive-Unlabeled (MPU) training framework is proposed to address the difficulty of short-text detection without sacrificing long-texts. Firstly, we acknowledge the human-resemblance property of short machine texts, and rephrase AI text detection as a partial Positive-Unlabeled (PU) problem by regarding these short machine texts as partially ``unlabeled". Then in this PU context, we propose the length-sensitive Multiscale PU Loss, where a recurrent model in abstraction is used to estimate positive priors of scale-variant corpora. Additionally, we introduce a Text Multiscaling module to enrich training corpora. Experiments show that our MPU method augments detection performance on long AI-generated texts, and significantly improves short-text detection of language model detectors. Language Models trained with MPU could outcompete existing detectors on various short-text and long-text detection benchmarks. The codes are available at https://github.com/mindspore-lab/mindone/tree/master/examples/detect_chatgpt and https://github.com/YuchuanTian/AIGC_text_detector. △ Less

Submitted 5 March, 2024; v1 submitted 29 May, 2023; originally announced May 2023.

Comments: ICLR2024 (Spotlight)

arXiv:2304.09091 [pdf, other]

Participatory Design of AI with Children: Reflections on IDC Design Challenge

Authors: Zhen Bai, Frances Judd, Naomi Polinsky, Elmira Yadollahi

Abstract: Children growing up in the era of Artificial Intelligence (AI) will be most impacted by the technology across their life span. Participatory Design (PD) is widely adopted by the Interaction Design and Children (IDC) community, which empowers children to bring their interests, needs, and creativity to the design process of future technologies. While PD has drawn increasing attention to human-center… ▽ More Children growing up in the era of Artificial Intelligence (AI) will be most impacted by the technology across their life span. Participatory Design (PD) is widely adopted by the Interaction Design and Children (IDC) community, which empowers children to bring their interests, needs, and creativity to the design process of future technologies. While PD has drawn increasing attention to human-centered AI design, it remains largely untapped in facilitating the design process of AI technologies relevant to children and their community. In this paper, we report intriguing children's design ideas on AI technologies resulting from the "Research and Design Challenge" of the 22nd ACM Interaction Design and Children (IDC 2023) conference. The diversity of design problems, AI applications and capabilities revealed by the children's design ideas shed light on the potential of engaging children in PD activities for future AI technologies. We discuss opportunities and challenges for accessible and inclusive PD experiences with children in sha** the future of AI-powered society. △ Less

Submitted 18 April, 2023; originally announced April 2023.

arXiv:2304.08173 [pdf]

A Corpus-based Analysis of Attitudinal Changes in Lin Yutang's Self-translation of Between Tears and Laughter

Authors: Zhi** Bai

Abstract: Attitude is omnipresent in almost every type of text. There has yet to be any relevant research on attitudinal shifts in self-translation. The Chinese version of Between Tears and Laughter is a rare case of self-translation and co-translation in that the first 11 chapters are self-translated by Lin Yutang, and the last 12 chapters by Xu Chengbin. The current study conducted a word frequency analys… ▽ More Attitude is omnipresent in almost every type of text. There has yet to be any relevant research on attitudinal shifts in self-translation. The Chinese version of Between Tears and Laughter is a rare case of self-translation and co-translation in that the first 11 chapters are self-translated by Lin Yutang, and the last 12 chapters by Xu Chengbin. The current study conducted a word frequency analysis of this book's English and Chinese versions with LIWC and AntConc, and made comparative research into Lin Yutang's attitudinal changes. The results show that due to different writing purposes and readerships, there is less anger in Lin's self-translation (M=0.7755, SD=0.2775) than in the first 11 chapters of the English original (M=1.1036, SD=0.3861), which is a significant difference (t=2.2892, p=0.0331). This attitudinal change is also reflected in the translations of some n-grams containing anger words. In contrast, there is no significant difference (t=1.88, p=0.07) between Xu's co-translation and the corresponding part of the original in attitude "anger". This paper believes that corpus tools can help co-translators keep their translation consistent in attitude. △ Less

Submitted 11 April, 2023; originally announced April 2023.

Comments: 30 pages, 5 tables

arXiv:2304.01436 [pdf, other]

Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos

Authors: Ziqian Bai, Feitong Tan, Zeng Huang, Kripasindhu Sarkar, Danhang Tang, Di Qiu, Abhimitra Meka, Ruofei Du, Mingsong Dou, Sergio Orts-Escolano, Rohit Pandey, ** Tan, Thabo Beeler, Sean Fanello, Yinda Zhang

Abstract: We propose a method to learn a high-quality implicit 3D head avatar from a monocular RGB video captured in the wild. The learnt avatar is driven by a parametric face model to achieve user-controlled facial expressions and head poses. Our hybrid pipeline combines the geometry prior and dynamic tracking of a 3DMM with a neural radiance field to achieve fine-grained control and photorealism. To reduc… ▽ More We propose a method to learn a high-quality implicit 3D head avatar from a monocular RGB video captured in the wild. The learnt avatar is driven by a parametric face model to achieve user-controlled facial expressions and head poses. Our hybrid pipeline combines the geometry prior and dynamic tracking of a 3DMM with a neural radiance field to achieve fine-grained control and photorealism. To reduce over-smoothing and improve out-of-model expressions synthesis, we propose to predict local features anchored on the 3DMM geometry. These learnt features are driven by 3DMM deformation and interpolated in 3D space to yield the volumetric radiance at a designated query point. We further show that using a Convolutional Neural Network in the UV space is critical in incorporating spatial context and producing representative local features. Extensive experiments show that we are able to reconstruct high-quality avatars, with more accurate expression-dependent details, good generalization to out-of-training expressions, and quantitatively superior renderings compared to other state-of-the-art approaches. △ Less

Submitted 3 April, 2023; originally announced April 2023.

Comments: In CVPR2023. Project page: https://augmentedperception.github.io/monoavatar/

arXiv:2302.03128 [pdf, other]

Cooperverse: A Mobile-Edge-Cloud Framework for Universal Cooperative Perception with Mixed Connectivity and Automation

Authors: Zhengwei Bai, Guoyuan Wu, Matthew J. Barth, Yongkang Liu, Emrah Akin Sisbot, Kentaro Oguchi

Abstract: Cooperative perception (CP) is attracting increasing attention and is regarded as the core foundation to support cooperative driving automation, a potential key solution to addressing the safety, mobility, and sustainability issues of contemporary transportation systems. However, current research on CP is still at the beginning stages where a systematic problem formulation of CP is still missing,… ▽ More Cooperative perception (CP) is attracting increasing attention and is regarded as the core foundation to support cooperative driving automation, a potential key solution to addressing the safety, mobility, and sustainability issues of contemporary transportation systems. However, current research on CP is still at the beginning stages where a systematic problem formulation of CP is still missing, acting as the essential guideline of the system design of a CP system under real-world situations. In this paper, we formulate a universal CP system into an optimization problem and a mobile-edge-cloud framework called Cooperverse. This system addresses CP in a mixed connectivity and automation environment. A Dynamic Feature Sharing (DFS) methodology is introduced to support this CP system under certain constraints and a Random Priority Filtering (RPF) method is proposed to conduct DFS with high performance. Experiments have been conducted based on a high-fidelity CP platform, and the results show that the Cooperverse framework is effective for dynamic node engagement and the proposed DFS methodology can improve system CP performance by 14.5% and the RPF method can reduce the communication cost for mobile nodes by 90% with only 1.7% drop for average precision. △ Less

Submitted 6 February, 2023; originally announced February 2023.

Comments: 6 pages, 7 figures

arXiv:2212.07060 [pdf, other]

VINet: Lightweight, Scalable, and Heterogeneous Cooperative Perception for 3D Object Detection

Authors: Zhengwei Bai, Guoyuan Wu, Matthew J. Barth, Yongkang Liu, Emrah Akin Sisbot, Kentaro Oguchi

Abstract: Utilizing the latest advances in Artificial Intelligence (AI), the computer vision community is now witnessing an unprecedented evolution in all kinds of perception tasks, particularly in object detection. Based on multiple spatially separated perception nodes, Cooperative Perception (CP) has emerged to significantly advance the perception of automated driving. However, current cooperative object… ▽ More Utilizing the latest advances in Artificial Intelligence (AI), the computer vision community is now witnessing an unprecedented evolution in all kinds of perception tasks, particularly in object detection. Based on multiple spatially separated perception nodes, Cooperative Perception (CP) has emerged to significantly advance the perception of automated driving. However, current cooperative object detection methods mainly focus on ego-vehicle efficiency without considering the practical issues of system-wide costs. In this paper, we introduce VINet, a unified deep learning-based CP network for scalable, lightweight, and heterogeneous cooperative 3D object detection. VINet is the first CP method designed from the standpoint of large-scale system-level implementation and can be divided into three main phases: 1) Global Pre-Processing and Lightweight Feature Extraction which prepare the data into global style and extract features for cooperation in a lightweight manner; 2) Two-Stream Fusion which fuses the features from scalable and heterogeneous perception nodes; and 3) Central Feature Backbone and 3D Detection Head which further process the fused features and generate cooperative detection results. An open-source data experimental platform is designed and developed for CP dataset acquisition and model evaluation. The experimental analysis shows that VINet can reduce 84% system-level computational cost and 94% system-level communication cost while improving the 3D detection accuracy. △ Less

Submitted 21 March, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

arXiv:2211.11891 [pdf, other]

A Bi-level Nonlinear Eigenvector Algorithm for Wasserstein Discriminant Analysis

Authors: Dong Min Roh, Zhaojun Bai, Ren-Cang Li

Abstract: Much like the classical Fisher linear discriminant analysis (LDA), the recently proposed Wasserstein discriminant analysis (WDA) is a linear dimensionality reduction method that seeks a projection matrix to maximize the dispersion of different data classes and minimize the dispersion of same data classes via a bi-level optimization. In contrast to LDA, WDA can account for both global and local int… ▽ More Much like the classical Fisher linear discriminant analysis (LDA), the recently proposed Wasserstein discriminant analysis (WDA) is a linear dimensionality reduction method that seeks a projection matrix to maximize the dispersion of different data classes and minimize the dispersion of same data classes via a bi-level optimization. In contrast to LDA, WDA can account for both global and local interconnections between data classes by using the underlying principles of optimal transport. In this paper, a bi-level nonlinear eigenvector algorithm (WDA-nepv) is presented to fully exploit the structures of the bi-level optimization of WDA. The inner level of WDA-nepv for computing the optimal transport matrices is formulated as an eigenvector-dependent nonlinear eigenvalue problem (NEPv), and meanwhile, the outer level for trace ratio optimizations is formulated as another NEPv. Both NEPvs can be computed efficiently under the self-consistent field (SCF) framework. WDA-nepv is derivative-free and surrogate-model-free when compared with existing algorithms. Convergence analysis of the proposed WDA-nepv justifies the utilization of the SCF for solving the bi-level optimization of WDA. Numerical experiments with synthetic and real-life datasets demonstrate the classification accuracy and scalability of WDA-nepv. △ Less

Submitted 27 July, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

arXiv:2211.11623 [pdf, other]

Linear Stability Hypothesis and Rank Stratification for Nonlinear Models

Authors: Yaoyu Zhang, Zhongwang Zhang, Leyang Zhang, Zhiwei Bai, Tao Luo, Zhi-Qin John Xu

Abstract: Models with nonlinear architectures/parameterizations such as deep neural networks (DNNs) are well known for their mysteriously good generalization performance at overparameterization. In this work, we tackle this mystery from a novel perspective focusing on the transition of the target recovery/fitting accuracy as a function of the training data size. We propose a rank stratification for general… ▽ More Models with nonlinear architectures/parameterizations such as deep neural networks (DNNs) are well known for their mysteriously good generalization performance at overparameterization. In this work, we tackle this mystery from a novel perspective focusing on the transition of the target recovery/fitting accuracy as a function of the training data size. We propose a rank stratification for general nonlinear models to uncover a model rank as an "effective size of parameters" for each function in the function space of the corresponding model. Moreover, we establish a linear stability theory proving that a target function almost surely becomes linearly stable when the training data size equals its model rank. Supported by our experiments, we propose a linear stability hypothesis that linearly stable functions are preferred by nonlinear training. By these results, model rank of a target function predicts a minimal training data size for its successful recovery. Specifically for the matrix factorization model and DNNs of fully-connected or convolutional architectures, our rank stratification shows that the model rank for specific target functions can be much lower than the size of model parameters. This result predicts the target recovery capability even at heavy overparameterization for these nonlinear models as demonstrated quantitatively by our experiments. Overall, our work provides a unified framework with quantitative prediction power to understand the mysterious target recovery behavior at overparameterization for general nonlinear models. △ Less

Submitted 21 November, 2022; originally announced November 2022.

arXiv:2210.16435 [pdf, other]

Scalable Spectral Clustering with Group Fairness Constraints

Authors: Ji Wang, Ding Lu, Ian Davidson, Zhaojun Bai

Abstract: There are synergies of research interests and industrial efforts in modeling fairness and correcting algorithmic bias in machine learning. In this paper, we present a scalable algorithm for spectral clustering (SC) with group fairness constraints. Group fairness is also known as statistical parity where in each cluster, each protected group is represented with the same proportion as in the entiret… ▽ More There are synergies of research interests and industrial efforts in modeling fairness and correcting algorithmic bias in machine learning. In this paper, we present a scalable algorithm for spectral clustering (SC) with group fairness constraints. Group fairness is also known as statistical parity where in each cluster, each protected group is represented with the same proportion as in the entirety. While FairSC algorithm (Kleindessner et al., 2019) is able to find the fairer clustering, it is compromised by high costs due to the kernels of computing nullspaces and the square roots of dense matrices explicitly. We present a new formulation of underlying spectral computation by incorporating nullspace projection and Hotelling's deflation such that the resulting algorithm, called s-FairSC, only involves the sparse matrix-vector products and is able to fully exploit the sparsity of the fair SC model. The experimental results on the modified stochastic block model demonstrate that s-FairSC is comparable with FairSC in recovering fair clustering. Meanwhile, it is sped up by a factor of 12 for moderate model sizes. s-FairSC is further demonstrated to be scalable in the sense that the computational costs of s-FairSC only increase marginally compared to the SC without fairness constraints. △ Less

Submitted 14 April, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

Journal ref: Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:6613-6629, 2023

arXiv:2210.05677 [pdf]

Application of Deep Learning on Single-Cell RNA-sequencing Data Analysis: A Review

Authors: Matthew Brendel, Chang Su, Zilong Bai, Hao Zhang, Olivier Elemento, Fei Wang

Abstract: Single-cell RNA-sequencing (scRNA-seq) has become a routinely used technique to quantify the gene expression profile of thousands of single cells simultaneously. Analysis of scRNA-seq data plays an important role in the study of cell states and phenotypes, and has helped elucidate biological processes, such as those occurring during development of complex organisms and improved our understanding o… ▽ More Single-cell RNA-sequencing (scRNA-seq) has become a routinely used technique to quantify the gene expression profile of thousands of single cells simultaneously. Analysis of scRNA-seq data plays an important role in the study of cell states and phenotypes, and has helped elucidate biological processes, such as those occurring during development of complex organisms and improved our understanding of disease states, such as cancer, diabetes, and COVID, among others. Deep learning, a recent advance of artificial intelligence that has been used to address many problems involving large datasets, has also emerged as a promising tool for scRNA-seq data analysis, as it has a capacity to extract informative, compact features from noisy, heterogeneous, and high-dimensional scRNA-seq data to improve downstream analysis. The present review aims at surveying recently developed deep learning techniques in scRNA-seq data analysis, identifying key steps within the scRNA-seq data analysis pipeline that have been advanced by deep learning, and explaining the benefits of deep learning over more conventional analysis tools. Finally, we summarize the challenges in current deep learning approaches faced within scRNA-seq data and discuss potential directions for improvements in deep algorithms for scRNA-seq data analysis. △ Less

Submitted 11 October, 2022; originally announced October 2022.

Showing 1–50 of 104 results for author: Bai, Z