Search | arXiv e-print repository

arXiv:2406.19070 [pdf, other]

FAGhead: Fully Animate Gaussian Head from Monocular Videos

Authors: Yixin Xuan, Xinyang Li, Gongxin Yao, Shiwei Zhou, Donghui Sun, Xiaoxin Chen, Yu Pan

Abstract: High-fidelity reconstruction of 3D human avatars has a wild application in visual reality. In this paper, we introduce FAGhead, a method that enables fully controllable human portraits from monocular videos. We explicit the traditional 3D morphable meshes (3DMM) and optimize the neutral 3D Gaussians to reconstruct with complex expressions. Furthermore, we employ a novel Point-based Learnable Repre… ▽ More High-fidelity reconstruction of 3D human avatars has a wild application in visual reality. In this paper, we introduce FAGhead, a method that enables fully controllable human portraits from monocular videos. We explicit the traditional 3D morphable meshes (3DMM) and optimize the neutral 3D Gaussians to reconstruct with complex expressions. Furthermore, we employ a novel Point-based Learnable Representation Field (PLRF) with learnable Gaussian point positions to enhance reconstruction performance. Meanwhile, to effectively manage the edges of avatars, we introduced the alpha rendering to supervise the alpha value of each pixel. Extensive experimental results on the open-source datasets and our capturing datasets demonstrate that our approach is able to generate high-fidelity 3D head avatars and fully control the expression and pose of the virtual avatars, which is outperforming than existing works. △ Less

Submitted 28 June, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

arXiv:2406.18549 [pdf]

Advancements in Feature Extraction Recognition of Medical Imaging Systems Through Deep Learning Technique

Authors: Qishi Zhan, Dan Sun, Erdi Gao, Yuhan Ma, Yaxin Liang, Haowei Yang

Abstract: This study introduces a novel unsupervised medical image feature extraction method that employs spatial stratification techniques. An objective function based on weight is proposed to achieve the purpose of fast image recognition. The algorithm divides the pixels of the image into multiple subdomains and uses a quadtree to access the image. A technique for threshold optimization utilizing a simple… ▽ More This study introduces a novel unsupervised medical image feature extraction method that employs spatial stratification techniques. An objective function based on weight is proposed to achieve the purpose of fast image recognition. The algorithm divides the pixels of the image into multiple subdomains and uses a quadtree to access the image. A technique for threshold optimization utilizing a simplex algorithm is presented. Aiming at the nonlinear characteristics of hyperspectral images, a generalized discriminant analysis algorithm based on kernel function is proposed. In this project, a hyperspectral remote sensing image is taken as the object, and we investigate its mathematical modeling, solution methods, and feature extraction techniques. It is found that different types of objects are independent of each other and compact in image processing. Compared with the traditional linear discrimination method, the result of image segmentation is better. This method can not only overcome the disadvantage of the traditional method which is easy to be affected by light, but also extract the features of the object quickly and accurately. It has important reference significance for clinical diagnosis. △ Less

Submitted 23 May, 2024; originally announced June 2024.

Comments: conference

arXiv:2406.08838 [pdf]

Research on Optimization of Natural Language Processing Model Based on Multimodal Deep Learning

Authors: Dan Sun, Yaxin Liang, Yining Yang, Yuhan Ma, Qishi Zhan, Erdi Gao

Abstract: This project intends to study the image representation based on attention mechanism and multimodal data. By adding multiple pattern layers to the attribute model, the semantic and hidden layers of image content are integrated. The word vector is quantified by the Word2Vec method and then evaluated by a word embedding convolutional neural network. The published experimental results of the two group… ▽ More This project intends to study the image representation based on attention mechanism and multimodal data. By adding multiple pattern layers to the attribute model, the semantic and hidden layers of image content are integrated. The word vector is quantified by the Word2Vec method and then evaluated by a word embedding convolutional neural network. The published experimental results of the two groups were tested. The experimental results show that this method can convert discrete features into continuous characters, thus reducing the complexity of feature preprocessing. Word2Vec and natural language processing technology are integrated to achieve the goal of direct evaluation of missing image features. The robustness of the image feature evaluation model is improved by using the excellent feature analysis characteristics of a convolutional neural network. This project intends to improve the existing image feature identification methods and eliminate the subjective influence in the evaluation process. The findings from the simulation indicate that the novel approach has developed is viable, effectively augmenting the features within the produced representations. △ Less

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.08837 [pdf]

Research on Deep Learning Model of Feature Extraction Based on Convolutional Neural Network

Authors: Houze Liu, Iris Li, Yaxin Liang, Dan Sun, Yining Yang, Haowei Yang

Abstract: Neural networks with relatively shallow layers and simple structures may have limited ability in accurately identifying pneumonia. In addition, deep neural networks also have a large demand for computing resources, which may cause convolutional neural networks to be unable to be implemented on terminals. Therefore, this paper will carry out the optimal classification of convolutional neural networ… ▽ More Neural networks with relatively shallow layers and simple structures may have limited ability in accurately identifying pneumonia. In addition, deep neural networks also have a large demand for computing resources, which may cause convolutional neural networks to be unable to be implemented on terminals. Therefore, this paper will carry out the optimal classification of convolutional neural networks. Firstly, according to the characteristics of pneumonia images, AlexNet and InceptionV3 were selected to obtain better image recognition results. Combining the features of medical images, the forward neural network with deeper and more complex structure is learned. Finally, knowledge extraction technology is used to extract the obtained data into the AlexNet model to achieve the purpose of improving computing efficiency and reducing computing costs. The results showed that the prediction accuracy, specificity, and sensitivity of the trained AlexNet model increased by 4.25 percentage points, 7.85 percentage points, and 2.32 percentage points, respectively. The graphics processing usage has decreased by 51% compared to the InceptionV3 mode. △ Less

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.07268 [pdf, other]

Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation

Authors: **yuan Li, Ziyan Li, Han Li, Jianfei Yu, Rui Xia, Di Sun, Gang Pan

Abstract: Grounded Multimodal Named Entity Recognition (GMNER) task aims to identify named entities, entity types and their corresponding visual regions. GMNER task exhibits two challenging attributes: 1) The tenuous correlation between images and text on social media contributes to a notable proportion of named entities being ungroundable. 2) There exists a distinction between coarse-grained noun phrases u… ▽ More Grounded Multimodal Named Entity Recognition (GMNER) task aims to identify named entities, entity types and their corresponding visual regions. GMNER task exhibits two challenging attributes: 1) The tenuous correlation between images and text on social media contributes to a notable proportion of named entities being ungroundable. 2) There exists a distinction between coarse-grained noun phrases used in similar tasks (e.g., phrase localization) and fine-grained named entities. In this paper, we propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as connecting bridges. This reformulation brings two benefits: 1) It enables us to optimize the MNER module for optimal MNER performance and eliminates the need to pre-extract region features using object detection methods, thus naturally addressing the two major limitations of existing GMNER methods. 2) The introduction of Entity Expansion Expression module and Visual Entailment (VE) module unifies Visual Grounding (VG) and Entity Grounding (EG). This endows the proposed framework with unlimited data and model scalability. Furthermore, to address the potential ambiguity stemming from the coarse-grained bounding box output in GMNER, we further construct the new Segmented Multimodal Named Entity Recognition (SMNER) task and corresponding Twitter-SMNER dataset aimed at generating fine-grained segmentation masks, and experimentally demonstrate the feasibility and effectiveness of using box prompt-based Segment Anything Model (SAM) to empower any GMNER model with the ability to accomplish the SMNER task. Extensive experiments demonstrate that RiVEG significantly outperforms SoTA methods on four datasets across the MNER, GMNER, and SMNER tasks. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: Extension of our Findings of EMNLP 2023 & ACL 2024 paper

arXiv:2406.04299 [pdf, other]

NoisyGL: A Comprehensive Benchmark for Graph Neural Networks under Label Noise

Authors: Zhonghao Wang, Danyu Sun, Sheng Zhou, Haobo Wang, Jiapei Fan, Longtao Huang, Jiajun Bu

Abstract: Graph Neural Networks (GNNs) exhibit strong potential in node classification task through a message-passing mechanism. However, their performance often hinges on high-quality node labels, which are challenging to obtain in real-world scenarios due to unreliable sources or adversarial attacks. Consequently, label noise is common in real-world graph data, negatively impacting GNNs by propagating inc… ▽ More Graph Neural Networks (GNNs) exhibit strong potential in node classification task through a message-passing mechanism. However, their performance often hinges on high-quality node labels, which are challenging to obtain in real-world scenarios due to unreliable sources or adversarial attacks. Consequently, label noise is common in real-world graph data, negatively impacting GNNs by propagating incorrect information during training. To address this issue, the study of Graph Neural Networks under Label Noise (GLN) has recently gained traction. However, due to variations in dataset selection, data splitting, and preprocessing techniques, the community currently lacks a comprehensive benchmark, which impedes deeper understanding and further development of GLN. To fill this gap, we introduce NoisyGL in this paper, the first comprehensive benchmark for graph neural networks under label noise. NoisyGL enables fair comparisons and detailed analyses of GLN methods on noisy labeled graph data across various datasets, with unified experimental settings and interface. Our benchmark has uncovered several important insights that were missed in previous research, and we believe these findings will be highly beneficial for future studies. We hope our open-source benchmark library will foster further advancements in this field. The code of the benchmark can be found in https://github.com/eaglelab-zju/NoisyGL. △ Less

Submitted 6 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

Comments: 28 pages, 15 figures

arXiv:2405.19194 [pdf, other]

LOGO: Video Text Spotting with Language Collaboration and Glyph Perception Model

Authors: Hongen Liu, Di Sun, Jiahao Wang, Yi Liu, Gang Pan

Abstract: Video text spotting (VTS) aims to simultaneously localize, recognize and track text instances in videos. To address the limited recognition capability of end-to-end methods, recent methods track the zero-shot results of state-of-the-art image text spotters directly, and achieve impressive performance. However, owing to the domain gap between different datasets, these methods usually obtain limited… ▽ More Video text spotting (VTS) aims to simultaneously localize, recognize and track text instances in videos. To address the limited recognition capability of end-to-end methods, recent methods track the zero-shot results of state-of-the-art image text spotters directly, and achieve impressive performance. However, owing to the domain gap between different datasets, these methods usually obtain limited tracking trajectories on extreme dataset. Fine-tuning transformer-based text spotters on specific datasets could yield performance enhancements, albeit at the expense of considerable training resources. In this paper, we propose a Language Collaboration and Glyph Perception Model, termed LOGO, an innovative framework designed to enhance the performance of conventional text spotters. To achieve this goal, we design a language synergy classifier (LSC) to explicitly discern text instances from background noise in the recognition stage. Specially, the language synergy classifier can output text content or background code based on the legibility of text regions, thus computing language scores. Subsequently, fusion scores are computed by taking the average of detection scores and language scores, and are utilized to re-score the detection results before tracking. By the re-scoring mechanism, the proposed LSC facilitates the detection of low-resolution text instances while filtering out text-like regions. Moreover, the glyph supervision is introduced to enhance the recognition accuracy of noisy text regions. In addition, we propose the visual position mixture module, which can merge the position information and visual features efficiently, and acquire more discriminative tracking features. Extensive experiments on public benchmarks validate the effectiveness of the proposed method. △ Less

Submitted 10 June, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.13290 [pdf, other]

Theoretical Analysis of Meta Reinforcement Learning: Generalization Bounds and Convergence Guarantees

Authors: Cangqing Wang, Mingxiu Sui, Dan Sun, Zecheng Zhang, Yan Zhou

Abstract: This research delves deeply into Meta Reinforcement Learning (Meta RL) through a exploration focusing on defining generalization limits and ensuring convergence. By employing a approach this article introduces an innovative theoretical framework to meticulously assess the effectiveness and performance of Meta RL algorithms. We present an explanation of generalization limits measuring how well thes… ▽ More This research delves deeply into Meta Reinforcement Learning (Meta RL) through a exploration focusing on defining generalization limits and ensuring convergence. By employing a approach this article introduces an innovative theoretical framework to meticulously assess the effectiveness and performance of Meta RL algorithms. We present an explanation of generalization limits measuring how well these algorithms can adapt to learning tasks while maintaining consistent results. Our analysis delves into the factors that impact the adaptability of Meta RL revealing the relationship, between algorithm design and task complexity. Additionally we establish convergence assurances by proving conditions under which Meta RL strategies are guaranteed to converge towards solutions. We examine the convergence behaviors of Meta RL algorithms across scenarios providing a comprehensive understanding of the driving forces behind their long term performance. This exploration covers both convergence and real time efficiency offering a perspective, on the capabilities of these algorithms. △ Less

Submitted 21 May, 2024; originally announced May 2024.

Comments: This paper has been accepted by the 2024 International Conference on Modeling, Natural Language Processing and Machine Learning(CMNM 2024)

arXiv:2405.12850 [pdf, other]

Weakly supervised alignment and registration of MR-CT for cervical cancer radiotherapy

Authors: Jjahao Zhang, Yin Gu, Deyu Sun, Yuhua Gao, Ming Gao, Ming Cui, Teng Zhang, He Ma

Abstract: Cervical cancer is one of the leading causes of death in women, and brachytherapy is currently the primary treatment method. However, it is important to precisely define the extent of paracervical tissue invasion to improve cancer diagnosis and treatment options. The fusion of the information characteristics of both computed tomography (CT) and magnetic resonance imaging(MRI) modalities may be use… ▽ More Cervical cancer is one of the leading causes of death in women, and brachytherapy is currently the primary treatment method. However, it is important to precisely define the extent of paracervical tissue invasion to improve cancer diagnosis and treatment options. The fusion of the information characteristics of both computed tomography (CT) and magnetic resonance imaging(MRI) modalities may be useful in achieving a precise outline of the extent of paracervical tissue invasion. Registration is the initial step in information fusion. However, when aligning multimodal images with varying depths, manual alignment is prone to large errors and is time-consuming. Furthermore, the variations in the size of the Region of Interest (ROI) and the shape of multimodal images pose a significant challenge for achieving accurate registration.In this paper, we propose a preliminary spatial alignment algorithm and a weakly supervised multimodal registration network. The spatial position alignment algorithm efficiently utilizes the limited annotation information in the two modal images provided by the doctor to automatically align multimodal images with varying depths. By utilizing aligned multimodal images for weakly supervised registration and incorporating pyramidal features and cost volume to estimate the optical flow, the results indicate that the proposed method outperforms traditional volume rendering alignment methods and registration networks in various evaluation metrics. This demonstrates the effectiveness of our model in multimodal image registration. △ Less

Submitted 21 May, 2024; originally announced May 2024.

arXiv:2405.06598 [pdf, other]

A Lightweight Transformer for Remote Sensing Image Change Captioning

Authors: Dongwei Sun, Yajie Bao, Xiangyong Cao

Abstract: Remote sensing image change captioning (RSICC) aims to automatically generate sentences that describe content differences in remote sensing bitemporal images. Recently, attention-based transformers have become a prevalent idea for capturing the features of global change. However, existing transformer-based RSICC methods face challenges, e.g., high parameters and high computational complexity cause… ▽ More Remote sensing image change captioning (RSICC) aims to automatically generate sentences that describe content differences in remote sensing bitemporal images. Recently, attention-based transformers have become a prevalent idea for capturing the features of global change. However, existing transformer-based RSICC methods face challenges, e.g., high parameters and high computational complexity caused by the self-attention operation in the transformer encoder component. To alleviate these issues, this paper proposes a Sparse Focus Transformer (SFT) for the RSICC task. Specifically, the SFT network consists of three main components, i.e. a high-level features extractor based on a convolutional neural network (CNN), a sparse focus attention mechanism-based transformer encoder network designed to locate and capture changing regions in dual-temporal images, and a description decoder that embeds images and words to generate sentences for captioning differences. The proposed SFT network can reduce the parameter number and computational complexity by incorporating a sparse attention mechanism within the transformer encoder network. Experimental results on various datasets demonstrate that even with a reduction of over 90\% in parameters and computational complexity for the transformer encoder, our proposed network can still obtain competitive performance compared to other state-of-the-art RSICC methods. The code can be available at △ Less

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2404.14647 [pdf, other]

Human Behavior Modeling via Identification of Task Objective and Variability

Authors: Sooyung Byeon, Dawei Sun, Inseok Hwang

Abstract: Human behavior modeling is important for the design and implementation of human-automation interactive control systems. In this context, human behavior refers to a human's control input to systems. We propose a novel method for human behavior modeling that uses human demonstrations for a given task to infer the unknown task objective and the variability. The task objective represents the human's i… ▽ More Human behavior modeling is important for the design and implementation of human-automation interactive control systems. In this context, human behavior refers to a human's control input to systems. We propose a novel method for human behavior modeling that uses human demonstrations for a given task to infer the unknown task objective and the variability. The task objective represents the human's intent or desire. It can be inferred by the inverse optimal control and improve the understanding of human behavior by providing an explainable objective function behind the given human behavior. Meanwhile, the variability denotes the intrinsic uncertainty in human behavior. It can be described by a Gaussian mixture model and capture the uncertainty in human behavior which cannot be encoded by the task objective. The proposed method can improve the prediction accuracy of human behavior by leveraging both task objective and variability. The proposed method is demonstrated through human-subject experiments using an illustrative quadrotor remote control example. △ Less

Submitted 22 April, 2024; originally announced April 2024.

Comments: 10 pages

arXiv:2404.08636 [pdf, other]

Probing the 3D Awareness of Visual Foundation Models

Authors: Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, Varun Jampani

Abstract: Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can recent models generalize to arbitrary images for their training task, their intermediate representations are useful for other visual tasks such as detection and segmentation. Given that such models can classify, delineate, and localize objects in 2D, we ask whether they also repr… ▽ More Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can recent models generalize to arbitrary images for their training task, their intermediate representations are useful for other visual tasks such as detection and segmentation. Given that such models can classify, delineate, and localize objects in 2D, we ask whether they also represent their 3D structure? In this work, we analyze the 3D awareness of visual foundation models. We posit that 3D awareness implies that representations (1) encode the 3D structure of the scene and (2) consistently represent the surface across views. We conduct a series of experiments using task-specific probes and zero-shot inference procedures on frozen features. Our experiments reveal several limitations of the current models. Our code and analysis can be found at https://github.com/mbanani/probe3d. △ Less

Submitted 12 April, 2024; originally announced April 2024.

Comments: Accepted to CVPR 2024. Project page: https://github.com/mbanani/probe3d

arXiv:2404.04997 [pdf, other]

Adapting LLMs for Efficient Context Processing through Soft Prompt Compression

Authors: Cangqing Wang, Yutian Yang, Ruisi Li, Dan Sun, Ruicong Cai, Yuzhu Zhang, Chengqian Fu, Lillian Floyd

Abstract: The rapid advancement of Large Language Models (LLMs) has inaugurated a transformative epoch in natural language processing, fostering unprecedented proficiency in text generation, comprehension, and contextual scrutiny. Nevertheless, effectively handling extensive contexts, crucial for myriad applications, poses a formidable obstacle owing to the intrinsic constraints of the models' context windo… ▽ More The rapid advancement of Large Language Models (LLMs) has inaugurated a transformative epoch in natural language processing, fostering unprecedented proficiency in text generation, comprehension, and contextual scrutiny. Nevertheless, effectively handling extensive contexts, crucial for myriad applications, poses a formidable obstacle owing to the intrinsic constraints of the models' context window sizes and the computational burdens entailed by their operations. This investigation presents an innovative framework that strategically tailors LLMs for streamlined context processing by harnessing the synergies among natural language summarization, soft prompt compression, and augmented utility preservation mechanisms. Our methodology, dubbed SoftPromptComp, amalgamates natural language prompts extracted from summarization methodologies with dynamically generated soft prompts to forge a concise yet semantically robust depiction of protracted contexts. This depiction undergoes further refinement via a weighting mechanism optimizing information retention and utility for subsequent tasks. We substantiate that our framework markedly diminishes computational overhead and enhances LLMs' efficacy across various benchmarks, while upholding or even augmenting the caliber of the produced content. By amalgamating soft prompt compression with sophisticated summarization, SoftPromptComp confronts the dual challenges of managing lengthy contexts and ensuring model scalability. Our findings point towards a propitious trajectory for augmenting LLMs' applicability and efficiency, rendering them more versatile and pragmatic for real-world applications. This research enriches the ongoing discourse on optimizing language models, providing insights into the potency of soft prompts and summarization techniques as pivotal instruments for the forthcoming generation of NLP solutions. △ Less

Submitted 18 April, 2024; v1 submitted 7 April, 2024; originally announced April 2024.

Comments: This paper has been accepted by the 2024 International Conference on Image Processing and Computer Applications (IPCA 2024)

arXiv:2404.01723 [pdf, other]

Contextual Embedding Learning to Enhance 2D Networks for Volumetric Image Segmentation

Authors: Zhuoyuan Wang, Dong Sun, Xiangyun Zeng, Ruodai Wu, Yi Wang

Abstract: The segmentation of organs in volumetric medical images plays an important role in computer-aided diagnosis and treatment/surgery planning. Conventional 2D convolutional neural networks (CNNs) can hardly exploit the spatial correlation of volumetric data. Current 3D CNNs have the advantage to extract more powerful volumetric representations but they usually suffer from occupying excessive memory a… ▽ More The segmentation of organs in volumetric medical images plays an important role in computer-aided diagnosis and treatment/surgery planning. Conventional 2D convolutional neural networks (CNNs) can hardly exploit the spatial correlation of volumetric data. Current 3D CNNs have the advantage to extract more powerful volumetric representations but they usually suffer from occupying excessive memory and computation nevertheless. In this study we aim to enhance the 2D networks with contextual information for better volumetric image segmentation. Accordingly, we propose a contextual embedding learning approach to facilitate 2D CNNs capturing spatial information properly. Our approach leverages the learned embedding and the slice-wisely neighboring matching as a soft cue to guide the network. In such a way, the contextual information can be transferred slice-by-slice thus boosting the volumetric representation of the network. Experiments on challenging prostate MRI dataset (PROMISE12) and abdominal CT dataset (CHAOS) show that our contextual embedding learning can effectively leverage the inter-slice context and improve segmentation performance. The proposed approach is a plug-and-play, and memory-efficient solution to enhance the 2D networks for volumetric segmentation. Our code is publicly available at https://github.com/JuliusWang-7/CE_Block. △ Less

Submitted 17 May, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: 15 pages, 9 figures

arXiv:2403.16082 [pdf, other]

SoK: Comprehensive Analysis of Rug Pull Causes, Datasets, and Detection Tools in DeFi

Authors: Dianxiang Sun, Wei Ma, Liming Nie, Yang Liu

Abstract: Rug pulls pose a grave threat to the cryptocurrency ecosystem, leading to substantial financial loss and undermining trust in decentralized finance (DeFi) projects. With the emergence of new rug pull patterns, research on rug pull is out of state. To fill this gap, we first conducted an extensive analysis of the literature review, encompassing both scholarly and industry sources. By examining exis… ▽ More Rug pulls pose a grave threat to the cryptocurrency ecosystem, leading to substantial financial loss and undermining trust in decentralized finance (DeFi) projects. With the emergence of new rug pull patterns, research on rug pull is out of state. To fill this gap, we first conducted an extensive analysis of the literature review, encompassing both scholarly and industry sources. By examining existing academic articles and industrial discussions on rug pull projects, we present a taxonomy inclusive of 34 root causes, introducing six new categories inspired by industry sources: burn, hidden owner, ownership transfer, unverified contract, external call, and fake LP lock. Based on the developed taxonomy, we evaluated current rug pull datasets and explored the effectiveness and limitations of existing detection mechanisms. Our evaluation indicates that the existing datasets, which document 2,448 instances, address only 7 of the 34 root causes, amounting to a mere 20% coverage. It indicates that existing open-source datasets need to be improved to study rug pulls. In response, we have constructed a more comprehensive dataset containing 2,360 instances, expanding the coverage to 54% with the best effort. In addition, the examination of 14 detection tools showed that they can identify 25 of the 34 root causes, achieving a coverage of 73.5%. There are nine root causes (Fake LP Lock, Hidden Fee, and Destroy Token, Fake Money Transfer, Ownership Transfer, Liquidity Pool Block, Freeze Account, Wash-Trading, Hedge) that the existing tools cannot cover. Our work indicates that there is a significant gap between current research and detection tools, and the actual situation of rug pulls. △ Less

Submitted 24 March, 2024; originally announced March 2024.

arXiv:2403.14727 [pdf, other]

Protected group bias and stereotypes in Large Language Models

Authors: Hadas Kotek, David Q. Sun, Zidi Xiu, Margit Bowler, Christopher Klein

Abstract: As modern Large Language Models (LLMs) shatter many state-of-the-art benchmarks in a variety of domains, this paper investigates their behavior in the domains of ethics and fairness, focusing on protected group bias. We conduct a two-part study: first, we solicit sentence continuations describing the occupations of individuals from different protected groups, including gender, sexuality, religion,… ▽ More As modern Large Language Models (LLMs) shatter many state-of-the-art benchmarks in a variety of domains, this paper investigates their behavior in the domains of ethics and fairness, focusing on protected group bias. We conduct a two-part study: first, we solicit sentence continuations describing the occupations of individuals from different protected groups, including gender, sexuality, religion, and race. Second, we have the model generate stories about individuals who hold different types of occupations. We collect >10k sentence completions made by a publicly available LLM, which we subject to human annotation. We find bias across minoritized groups, but in particular in the domains of gender and sexuality, as well as Western bias, in model generations. The model not only reflects societal biases, but appears to amplify them. The model is additionally overly cautious in replies to queries relating to minoritized groups, providing responses that strongly emphasize diversity and equity to an extent that other group characteristics are overshadowed. This suggests that artificially constraining potentially harmful outputs may itself lead to harm, and should be applied in a careful and controlled manner. △ Less

Submitted 20 March, 2024; originally announced March 2024.

arXiv:2403.03652 [pdf]

3D Printed Waveguide for Augmented Reality

Authors: Dechuan Sun, Gregory Tanyi, Alan Lee, Chris French, Younger Liang, Christina Lim, Ranjith R Unnithan

Abstract: Mass production of augmented reality (AR) waveguides has been challenging due to the intricate nature of the fabrication technique and the high precision required for its optical characteristics. In this paper, we have presented a novel and low-cost approach for fabricating geometric optical waveguides designed for AR applications utilizing 3D printing techniques. To strike a balance between optic… ▽ More Mass production of augmented reality (AR) waveguides has been challenging due to the intricate nature of the fabrication technique and the high precision required for its optical characteristics. In this paper, we have presented a novel and low-cost approach for fabricating geometric optical waveguides designed for AR applications utilizing 3D printing techniques. To strike a balance between optical performance and fabrication feasibility, we have optimized the conventional geometric waveguide design to facilitate easier fabrication. It is worth noting that our proposed method does not require molding, dicing, and post-surface polishing after printing. A prototype based on this method has been successfully fabricated, showing the immersion between the virtual image and the real-world scene. The proposed method has great potential for adaptation to mass production in various AR applications. △ Less

Submitted 6 March, 2024; originally announced March 2024.

arXiv:2402.18318 [pdf]

SD-SLAM: A Semantic SLAM Approach for Dynamic Scenes Based on LiDAR Point Clouds

Authors: Feiya Li, Chunyun Fu, Dongye Sun, Jian Li, Jianwen Wang

Abstract: Point cloud maps generated via LiDAR sensors using extensive remotely sensed data are commonly used by autonomous vehicles and robots for localization and navigation. However, dynamic objects contained in point cloud maps not only downgrade localization accuracy and navigation performance but also jeopardize the map quality. In response to this challenge, we propose in this paper a novel semantic… ▽ More Point cloud maps generated via LiDAR sensors using extensive remotely sensed data are commonly used by autonomous vehicles and robots for localization and navigation. However, dynamic objects contained in point cloud maps not only downgrade localization accuracy and navigation performance but also jeopardize the map quality. In response to this challenge, we propose in this paper a novel semantic SLAM approach for dynamic scenes based on LiDAR point clouds, referred to as SD-SLAM hereafter. The main contributions of this work are in three aspects: 1) introducing a semantic SLAM framework dedicatedly for dynamic scenes based on LiDAR point clouds, 2) Employing semantics and Kalman filtering to effectively differentiate between dynamic and semi-static landmarks, and 3) Making full use of semi-static and pure static landmarks with semantic information in the SD-SLAM process to improve localization and map** performance. To evaluate the proposed SD-SLAM, tests were conducted using the widely adopted KITTI odometry dataset. Results demonstrate that the proposed SD-SLAM effectively mitigates the adverse effects of dynamic objects on SLAM, improving vehicle localization and map** performance in dynamic scenes, and simultaneously constructing a static semantic map with multiple semantic classes for enhanced environment understanding. △ Less

Submitted 28 February, 2024; originally announced February 2024.

arXiv:2402.11841 [pdf, other]

ASGNet: Adaptive Semantic Gate Networks for Log-Based Anomaly Diagnosis

Authors: Haitian Yang, Degang Sun, Wen Liu, Yanshu Li, Yan Wang, Weiqing Huang

Abstract: Logs are widely used in the development and maintenance of software systems. Logs can help engineers understand the runtime behavior of systems and diagnose system failures. For anomaly diagnosis, existing methods generally use log event data extracted from historical logs to build diagnostic models. However, we find that existing methods do not make full use of two types of features, (1) statisti… ▽ More Logs are widely used in the development and maintenance of software systems. Logs can help engineers understand the runtime behavior of systems and diagnose system failures. For anomaly diagnosis, existing methods generally use log event data extracted from historical logs to build diagnostic models. However, we find that existing methods do not make full use of two types of features, (1) statistical features: some inherent statistical features in log data, such as word frequency and abnormal label distribution, are not well exploited. Compared with log raw data, statistical features are deterministic and naturally compatible with corresponding tasks. (2) semantic features: Logs contain the execution logic behind software systems, thus log statements share deep semantic relationships. How to effectively combine statistical features and semantic features in log data to improve the performance of log anomaly diagnosis is the key point of this paper. In this paper, we propose an adaptive semantic gate networks (ASGNet) that combines statistical features and semantic features to selectively use statistical features to consolidate log text semantic representation. Specifically, ASGNet encodes statistical features via a variational encoding module and fuses useful information through a well-designed adaptive semantic threshold mechanism. The threshold mechanism introduces the information flow into the classifier based on the confidence of the semantic features in the decision, which is conducive to training a robust classifier and can solve the overfitting problem caused by the use of statistical features. The experimental results on the real data set show that our method proposed is superior to all baseline methods in terms of various performance indicators. △ Less

Submitted 19 February, 2024; originally announced February 2024.

arXiv:2402.11139 [pdf, other]

LiGNN: Graph Neural Networks at LinkedIn

Authors: Fedor Borisyuk, Shihai He, Yunbo Ouyang, Morteza Ramezani, Peng Du, Xiaochen Hou, Chengming Jiang, Nitin Pasumarthy, Priya Bannur, Birjodh Tiwana, ** Liu, Siddharth Dangi, Daqi Sun, Zhoutao Pei, Xiao Shi, Sirou Zhu, Qianqi Shen, Kuang-Hsuan Lee, David Stein, Baolei Li, Haichao Wei, Amol Ghoting, Souvik Ghosh

Abstract: In this paper, we present LiGNN, a deployed large-scale Graph Neural Networks (GNNs) Framework. We share our insight on develo** and deployment of GNNs at large scale at LinkedIn. We present a set of algorithmic improvements to the quality of GNN representation learning including temporal graph architectures with long term losses, effective cold start solutions via graph densification, ID embedd… ▽ More In this paper, we present LiGNN, a deployed large-scale Graph Neural Networks (GNNs) Framework. We share our insight on develo** and deployment of GNNs at large scale at LinkedIn. We present a set of algorithmic improvements to the quality of GNN representation learning including temporal graph architectures with long term losses, effective cold start solutions via graph densification, ID embeddings and multi-hop neighbor sampling. We explain how we built and sped up by 7x our large-scale training on LinkedIn graphs with adaptive sampling of neighbors, grou** and slicing of training data batches, specialized shared-memory queue and local gradient optimization. We summarize our deployment lessons and learnings gathered from A/B test experiments. The techniques presented in this work have contributed to an approximate relative improvements of 1% of Job application hearing back rate, 2% Ads CTR lift, 0.5% of Feed engaged daily active users, 0.2% session lift and 0.1% weekly active user lift from people recommendation. We believe that this work can provide practical solutions and insights for engineers who are interested in applying Graph neural networks at large scale. △ Less

Submitted 16 February, 2024; originally announced February 2024.

arXiv:2402.09989 [pdf, other]

LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition

Authors: **yuan Li, Han Li, Di Sun, Jiahao Wang, Wenkun Zhang, Zan Wang, Gang Pan

Abstract: Grounded Multimodal Named Entity Recognition (GMNER) is a nascent multimodal task that aims to identify named entities, entity types and their corresponding visual regions. GMNER task exhibits two challenging properties: 1) The weak correlation between image-text pairs in social media results in a significant portion of named entities being ungroundable. 2) There exists a distinction between coars… ▽ More Grounded Multimodal Named Entity Recognition (GMNER) is a nascent multimodal task that aims to identify named entities, entity types and their corresponding visual regions. GMNER task exhibits two challenging properties: 1) The weak correlation between image-text pairs in social media results in a significant portion of named entities being ungroundable. 2) There exists a distinction between coarse-grained referring expressions commonly used in similar tasks (e.g., phrase localization, referring expression comprehension) and fine-grained named entities. In this paper, we propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as a connecting bridge. This reformulation brings two benefits: 1) It maintains the optimal MNER performance and eliminates the need for employing object detection methods to pre-extract regional features, thereby naturally addressing two major limitations of existing GMNER methods. 2) The introduction of entity expansion expression and Visual Entailment (VE) module unifies Visual Grounding (VG) and Entity Grounding (EG). It enables RiVEG to effortlessly inherit the Visual Entailment and Visual Grounding capabilities of any current or prospective multimodal pretraining models. Extensive experiments demonstrate that RiVEG outperforms state-of-the-art methods on the existing GMNER dataset and achieves absolute leads of 10.65%, 6.21%, and 8.83% in all three subtasks. △ Less

Submitted 29 May, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

Comments: Accepted to Findings of ACL 2024

arXiv:2402.07207 [pdf, other]

GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting

Authors: Xiaoyu Zhou, Xingjian Ran, Yajiao Xiong, **lin He, Zhiwei Lin, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang

Abstract: We present GALA3D, generative 3D GAussians with LAyout-guided control, for effective compositional text-to-3D generation. We first utilize large language models (LLMs) to generate the initial layout and introduce a layout-guided 3D Gaussian representation for 3D content generation with adaptive geometric constraints. We then propose an instance-scene compositional optimization mechanism with condi… ▽ More We present GALA3D, generative 3D GAussians with LAyout-guided control, for effective compositional text-to-3D generation. We first utilize large language models (LLMs) to generate the initial layout and introduce a layout-guided 3D Gaussian representation for 3D content generation with adaptive geometric constraints. We then propose an instance-scene compositional optimization mechanism with conditioned diffusion to collaboratively generate realistic 3D scenes with consistent geometry, texture, scale, and accurate interactions among multiple objects while simultaneously adjusting the coarse layout priors extracted from the LLMs to align with the generated scene. Experiments show that GALA3D is a user-friendly, end-to-end framework for state-of-the-art scene-level 3D content generation and controllable editing while ensuring the high fidelity of object-level entities within the scene. The source codes and models will be available at gala3d.github.io. △ Less

Submitted 11 June, 2024; v1 submitted 11 February, 2024; originally announced February 2024.

arXiv:2402.06859 [pdf, other]

LiRank: Industrial Large Scale Ranking Models at LinkedIn

Authors: Fedor Borisyuk, Mingzhou Zhou, Qingquan Song, Siyu Zhu, Birjodh Tiwana, Ganesh Parameswaran, Siddharth Dangi, Lars Hertel, Qiang Xiao, Xiaochen Hou, Yunbo Ouyang, Aman Gupta, Sheallika Singh, Dan Liu, Hailing Cheng, Lei Le, Jonathan Hung, Sathiya Keerthi, Ruoyan Wang, Fengyu Zhang, Mohit Kothari, Chen Zhu, Daqi Sun, Yun Dai, Xun Luan , et al. (9 additional authors not shown)

Abstract: We present LiRank, a large-scale ranking framework at LinkedIn that brings to production state-of-the-art modeling architectures and optimization methods. We unveil several modeling improvements, including Residual DCN, which adds attention and residual connections to the famous DCNv2 architecture. We share insights into combining and tuning SOTA architectures to create a unified model, including… ▽ More We present LiRank, a large-scale ranking framework at LinkedIn that brings to production state-of-the-art modeling architectures and optimization methods. We unveil several modeling improvements, including Residual DCN, which adds attention and residual connections to the famous DCNv2 architecture. We share insights into combining and tuning SOTA architectures to create a unified model, including Dense Gating, Transformers and Residual DCN. We also propose novel techniques for calibration and describe how we productionalized deep learning based explore/exploit methods. To enable effective, production-grade serving of large ranking models, we detail how to train and compress models using quantization and vocabulary compression. We provide details about the deployment setup for large-scale use cases of Feed ranking, Jobs Recommendations, and Ads click-through rate (CTR) prediction. We summarize our learnings from various A/B tests by elucidating the most effective technical approaches. These ideas have contributed to relative metrics improvements across the board at LinkedIn: +0.5% member sessions in the Feed, +1.76% qualified job applications for Jobs search and recommendations, and +4.3% for Ads CTR. We hope this work can provide practical insights and solutions for practitioners interested in leveraging large-scale deep ranking systems. △ Less

Submitted 9 February, 2024; originally announced February 2024.

ACM Class: H.3.3

arXiv:2401.15753 [pdf, other]

An objective comparison of methods for augmented reality in laparoscopic liver resection by preoperative-to-intraoperative image fusion

Authors: Sharib Ali, Yamid Espinel, Yueming **, Peng Liu, Bianca Güttner, Xukun Zhang, Lihua Zhang, Tom Dowrick, Matthew J. Clarkson, Shiting Xiao, Yifan Wu, Yijun Yang, Lei Zhu, Dai Sun, Lan Li, Micha Pfeiffer, Shahid Farid, Lena Maier-Hein, Emmanuel Buc, Adrien Bartoli

Abstract: Augmented reality for laparoscopic liver resection is a visualisation mode that allows a surgeon to localise tumours and vessels embedded within the liver by projecting them on top of a laparoscopic image. Preoperative 3D models extracted from CT or MRI data are registered to the intraoperative laparoscopic images during this process. In terms of 3D-2D fusion, most of the algorithms make use of an… ▽ More Augmented reality for laparoscopic liver resection is a visualisation mode that allows a surgeon to localise tumours and vessels embedded within the liver by projecting them on top of a laparoscopic image. Preoperative 3D models extracted from CT or MRI data are registered to the intraoperative laparoscopic images during this process. In terms of 3D-2D fusion, most of the algorithms make use of anatomical landmarks to guide registration. These landmarks include the liver's inferior ridge, the falciform ligament, and the occluding contours. They are usually marked by hand in both the laparoscopic image and the 3D model, which is time-consuming and may contain errors if done by a non-experienced user. Therefore, there is a need to automate this process so that augmented reality can be used effectively in the operating room. We present the Preoperative-to-Intraoperative Laparoscopic Fusion Challenge (P2ILF), held during the Medical Imaging and Computer Assisted Interventions (MICCAI 2022) conference, which investigates the possibilities of detecting these landmarks automatically and using them in registration. The challenge was divided into two tasks: 1) A 2D and 3D landmark detection task and 2) a 3D-2D registration task. The teams were provided with training data consisting of 167 laparoscopic images and 9 preoperative 3D models from 9 patients, with the corresponding 2D and 3D landmark annotations. A total of 6 teams from 4 countries participated, whose proposed methods were evaluated on 16 images and two preoperative 3D models from two patients. All the teams proposed deep learning-based methods for the 2D and 3D landmark segmentation tasks and differentiable rendering-based methods for the registration task. Based on the experimental outcomes, we propose three key hypotheses that determine current limitations and future directions for research in this domain. △ Less

Submitted 7 February, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

Comments: 24 pages

arXiv:2401.12945 [pdf, other]

Lumiere: A Space-Time Diffusion Model for Video Generation

Authors: Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, Inbar Mosseri

Abstract: We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synth… ▽ More We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation. △ Less

Submitted 5 February, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

Comments: Webpage: https://lumiere-video.github.io/ | Video: https://www.youtube.com/watch?v=wxLr02Dz2Sc

arXiv:2401.10171 [pdf, other]

SHINOBI: Shape and Illumination using Neural Object Decomposition via BRDF Optimization In-the-wild

Authors: Andreas Engelhardt, Amit Raj, Mark Boss, Yunzhi Zhang, Abhishek Kar, Yuanzhen Li, Deqing Sun, Ricardo Martin Brualla, Jonathan T. Barron, Hendrik P. A. Lensch, Varun Jampani

Abstract: We present SHINOBI, an end-to-end framework for the reconstruction of shape, material, and illumination from object images captured with varying lighting, pose, and background. Inverse rendering of an object based on unconstrained image collections is a long-standing challenge in computer vision and graphics and requires a joint optimization over shape, radiance, and pose. We show that an implicit… ▽ More We present SHINOBI, an end-to-end framework for the reconstruction of shape, material, and illumination from object images captured with varying lighting, pose, and background. Inverse rendering of an object based on unconstrained image collections is a long-standing challenge in computer vision and graphics and requires a joint optimization over shape, radiance, and pose. We show that an implicit shape representation based on a multi-resolution hash encoding enables faster and robust shape reconstruction with joint camera alignment optimization that outperforms prior work. Further, to enable the editing of illumination and object reflectance (i.e. material) we jointly optimize BRDF and illumination together with the object's shape. Our method is class-agnostic and works on in-the-wild image collections of objects to produce relightable 3D assets for several use cases such as AR/VR, movies, games, etc. Project page: https://shinobi.aengelhardt.com Video: https://www.youtube.com/watch?v=iFENQ6AcYd8&feature=youtu.be △ Less

Submitted 29 March, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

Comments: Accepted by IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024). Updated supplementary material and acknowledgements

arXiv:2401.05960 [pdf, other]

Machine Learning Insides OptVerse AI Solver: Design Principles and Applications

Authors: Xijun Li, Fangzhou Zhu, Hui-Ling Zhen, Weilin Luo, Meng Lu, Yimin Huang, Zhenan Fan, Zirui Zhou, Yufei Kuang, Zhihai Wang, Zijie Geng, Yang Li, Haoyang Liu, Zhiwu An, Muming Yang, Jianshu Li, Jie Wang, Junchi Yan, Defeng Sun, Tao Zhong, Yong Zhang, Jia Zeng, Mingxuan Yuan, Jianye Hao, Jun Yao , et al. (1 additional authors not shown)

Abstract: In an era of digital ubiquity, efficient resource management and decision-making are paramount across numerous industries. To this end, we present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI Solver, which aims to mitigate the scarcity of real-world mathematical programming instances, and to surpass the capabilities of traditional opt… ▽ More In an era of digital ubiquity, efficient resource management and decision-making are paramount across numerous industries. To this end, we present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI Solver, which aims to mitigate the scarcity of real-world mathematical programming instances, and to surpass the capabilities of traditional optimization techniques. We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror multifaceted structures of real-world problem. Furthermore, we introduce a training framework leveraging augmentation policies to maintain solvers' utility in dynamic environments. Besides the data generation and augmentation, our proposed approaches also include novel ML-driven policies for personalized solver strategies, with an emphasis on applications like graph convolutional networks for initial basis selection and reinforcement learning for advanced presolving and cut selection. Additionally, we detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance. Compared with traditional solvers such as Cplex and SCIP, our ML-augmented OptVerse AI Solver demonstrates superior speed and precision across both established benchmarks and real-world scenarios, reinforcing the practical imperative and effectiveness of machine learning techniques in mathematical programming solvers. △ Less

Submitted 17 January, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

arXiv:2401.03664 [pdf]

Dual-Channel Reliable Breast Ultrasound Image Classification Based on Explainable Attribution and Uncertainty Quantification

Authors: Shuge Lei, Haonan Hu, Dasheng Sun, Huabin Zhang, Kehong Yuan, Jian Dai, Jijun Tang, Yan Tong

Abstract: This paper focuses on the classification task of breast ultrasound images and researches on the reliability measurement of classification results. We proposed a dual-channel evaluation framework based on the proposed inference reliability and predictive reliability scores. For the inference reliability evaluation, human-aligned and doctor-agreed inference rationales based on the improved feature a… ▽ More This paper focuses on the classification task of breast ultrasound images and researches on the reliability measurement of classification results. We proposed a dual-channel evaluation framework based on the proposed inference reliability and predictive reliability scores. For the inference reliability evaluation, human-aligned and doctor-agreed inference rationales based on the improved feature attribution algorithm SP-RISA are gracefully applied. Uncertainty quantification is used to evaluate the predictive reliability via the Test Time Enhancement. The effectiveness of this reliability evaluation framework has been verified on our breast ultrasound clinical dataset YBUS, and its robustness is verified on the public dataset BUSI. The expected calibration errors on both datasets are significantly lower than traditional evaluation methods, which proves the effectiveness of our proposed reliability measurement. △ Less

Submitted 7 January, 2024; originally announced January 2024.

arXiv:2401.01674 [pdf, other]

Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens

Authors: Dengdi Sun, Yajie Pan, Andong Lu, Chenglong Li, Bin Luo

Abstract: Many RGBT tracking researches primarily focus on modal fusion design, while overlooking the effective handling of target appearance changes. While some approaches have introduced historical frames or fuse and replace initial templates to incorporate temporal information, they have the risk of disrupting the original target appearance and accumulating errors over time. To alleviate these limitation… ▽ More Many RGBT tracking researches primarily focus on modal fusion design, while overlooking the effective handling of target appearance changes. While some approaches have introduced historical frames or fuse and replace initial templates to incorporate temporal information, they have the risk of disrupting the original target appearance and accumulating errors over time. To alleviate these limitations, we propose a novel Transformer RGBT tracking approach, which mixes spatio-temporal multimodal tokens from the static multimodal templates and multimodal search regions in Transformer to handle target appearance changes, for robust RGBT tracking. We introduce independent dynamic template tokens to interact with the search region, embedding temporal information to address appearance changes, while also retaining the involvement of the initial static template tokens in the joint feature extraction process to ensure the preservation of the original reliable target appearance information that prevent deviations from the target appearance caused by traditional temporal updates. We also use attention mechanisms to enhance the target features of multimodal template tokens by incorporating supplementary modal cues, and make the multimodal search region tokens interact with multimodal dynamic template tokens via attention mechanisms, which facilitates the conveyance of multimodal-enhanced target change information. Our module is inserted into the transformer backbone network and inherits joint feature extraction, search-template matching, and cross-modal interaction. Extensive experiments on three RGBT benchmark datasets show that the proposed approach maintains competitive performance compared to other state-of-the-art tracking algorithms while running at 39.1 FPS. △ Less

Submitted 3 January, 2024; originally announced January 2024.

arXiv:2401.01461 [pdf, other]

Efficient Hybrid Zoom using Camera Fusion on Mobile Phones

Authors: Xiaotong Wu, Wei-Sheng Lai, YiChang Shih, Charles Herrmann, Michael Krainin, Deqing Sun, Chia-Kai Liang

Abstract: DSLR cameras can achieve multiple zoom levels via shifting lens distances or swap** lens types. However, these techniques are not possible on smartphone devices due to space constraints. Most smartphone manufacturers adopt a hybrid zoom system: commonly a Wide (W) camera at a low zoom level and a Telephoto (T) camera at a high zoom level. To simulate zoom levels between W and T, these systems cr… ▽ More DSLR cameras can achieve multiple zoom levels via shifting lens distances or swap** lens types. However, these techniques are not possible on smartphone devices due to space constraints. Most smartphone manufacturers adopt a hybrid zoom system: commonly a Wide (W) camera at a low zoom level and a Telephoto (T) camera at a high zoom level. To simulate zoom levels between W and T, these systems crop and digitally upsample images from W, leading to significant detail loss. In this paper, we propose an efficient system for hybrid zoom super-resolution on mobile devices, which captures a synchronous pair of W and T shots and leverages machine learning models to align and transfer details from T to W. We further develop an adaptive blending method that accounts for depth-of-field mismatches, scene occlusion, flow uncertainty, and alignment errors. To minimize the domain gap, we design a dual-phone camera rig to capture real-world inputs and ground-truths for supervised training. Our method generates a 12-megapixel image in 500ms on a mobile platform and compares favorably against state-of-the-art methods under extensive evaluation on real-world scenarios. △ Less

Submitted 2 January, 2024; originally announced January 2024.

Comments: Accepted to SIGGRAPH Asia 2023 (ACM TOG). Project website: https://www.wslai.net/publications/fusion_zoom

arXiv:2401.00935 [pdf, other]

Boundary Attention: Learning to Localize Boundaries under High Noise

Authors: Mia Gaia Polansky, Charles Herrmann, Junhwa Hur, Deqing Sun, Dor Verbin, Todd Zickler

Abstract: We present a differentiable model that infers explicit boundaries, including curves, corners and junctions, using a mechanism that we call boundary attention. Boundary attention is a boundary-aware local attention operation that, when applied densely and repeatedly, progressively refines a field of variables that specify an unrasterized description of the local boundary structure in every overlapp… ▽ More We present a differentiable model that infers explicit boundaries, including curves, corners and junctions, using a mechanism that we call boundary attention. Boundary attention is a boundary-aware local attention operation that, when applied densely and repeatedly, progressively refines a field of variables that specify an unrasterized description of the local boundary structure in every overlap** patch within an image. It operates in a bottom-up fashion, similar to classical methods for sub-pixel edge localization and edge-linking, but with a higher-dimensional description of local boundary structure, a notion of spatial consistency that is learned instead of designed, and a sequence of operations that is end-to-end differentiable. We train our model using simple synthetic data and then evaluate it using photographs that were captured under low-light conditions with variable amounts of noise. We find that our method generalizes to natural images corrupted by real sensor noise, and predicts consistent boundaries under increasingly noisy conditions where other state-of-the-art methods fail. △ Less

Submitted 18 March, 2024; v1 submitted 1 January, 2024; originally announced January 2024.

Comments: Project website at boundaryattention.github.io: http://boundaryattention.github.io

arXiv:2312.16787 [pdf]

L-LO: Enhancing Pose Estimation Precision via a Landmark-Based LiDAR Odometry

Authors: Feiya Li, Chunyun Fu, Dongye Sun

Abstract: The majority of existing LiDAR odometry solutions are based on simple geometric features such as points, lines or planes which cannot fully reflect the characteristics of surrounding environments. In this study, we propose a novel LiDAR odometry which effectively utilizes the overall exterior characteristics of environmental landmarks. The vehicle pose estimation is accomplished by means of two se… ▽ More The majority of existing LiDAR odometry solutions are based on simple geometric features such as points, lines or planes which cannot fully reflect the characteristics of surrounding environments. In this study, we propose a novel LiDAR odometry which effectively utilizes the overall exterior characteristics of environmental landmarks. The vehicle pose estimation is accomplished by means of two sequential pose estimation stages, namely, horizontal pose estimation and vertical pose estimation. To achieve effective landmark registration, a comprehensive index is proposed to evaluate the level of similarity between landmarks. This index takes into account two crucial aspects of landmarks, namely, dimension and shape in evaluating their similarity. To assess the performance of the proposed algorithm, we utilize the widely recognized KITTI dataset as well as experimental data collected by an unmanned ground vehicle platform. Both graphical and numerical results indicate that our algorithm outperforms leading LiDAR odometry solutions in terms of positioning accuracy. △ Less

Submitted 27 December, 2023; originally announced December 2023.

arXiv:2312.13252 [pdf, other]

Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model

Authors: Saurabh Saxena, Junhwa Hur, Charles Herrmann, Deqing Sun, David J. Fleet

Abstract: While methods for monocular depth estimation have made significant strides on standard benchmarks, zero-shot metric depth estimation remains unsolved. Challenges include the joint modeling of indoor and outdoor scenes, which often exhibit significantly different distributions of RGB and depth, and the depth-scale ambiguity due to unknown camera intrinsics. Recent work has proposed specialized mult… ▽ More While methods for monocular depth estimation have made significant strides on standard benchmarks, zero-shot metric depth estimation remains unsolved. Challenges include the joint modeling of indoor and outdoor scenes, which often exhibit significantly different distributions of RGB and depth, and the depth-scale ambiguity due to unknown camera intrinsics. Recent work has proposed specialized multi-head architectures for jointly modeling indoor and outdoor scenes. In contrast, we advocate a generic, task-agnostic diffusion model, with several advancements such as log-scale depth parameterization to enable joint modeling of indoor and outdoor scenes, conditioning on the field-of-view (FOV) to handle scale ambiguity and synthetically augmenting FOV during training to generalize beyond the limited camera intrinsics in training datasets. Furthermore, by employing a more diverse training mixture than is common, and an efficient diffusion parameterization, our method, DMD (Diffusion for Metric Depth) achieves a 25\% reduction in relative error (REL) on zero-shot indoor and 33\% reduction on zero-shot outdoor datasets over the current SOTA using only a small number of denoising steps. For an overview see https://diffusion-vision.github.io/dmd △ Less

Submitted 20 December, 2023; originally announced December 2023.

arXiv:2312.07920 [pdf, other]

DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes

Authors: Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang

Abstract: We present DrivingGaussian, an efficient and effective framework for surrounding dynamic autonomous driving scenes. For complex scenes with moving objects, we first sequentially and progressively model the static background of the entire scene with incremental static 3D Gaussians. We then leverage a composite dynamic Gaussian graph to handle multiple moving objects, individually reconstructing eac… ▽ More We present DrivingGaussian, an efficient and effective framework for surrounding dynamic autonomous driving scenes. For complex scenes with moving objects, we first sequentially and progressively model the static background of the entire scene with incremental static 3D Gaussians. We then leverage a composite dynamic Gaussian graph to handle multiple moving objects, individually reconstructing each object and restoring their accurate positions and occlusion relationships within the scene. We further use a LiDAR prior for Gaussian Splatting to reconstruct scenes with greater details and maintain panoramic consistency. DrivingGaussian outperforms existing methods in dynamic driving scene reconstruction and enables photorealistic surround-view synthesis with high-fidelity and multi-camera consistency. Our project page is at: https://github.com/VDIGPKU/DrivingGaussian. △ Less

Submitted 20 March, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

arXiv:2312.06553 [pdf, other]

HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models

Authors: Xiaogang Peng, Yiming Xie, Zizhao Wu, Varun Jampani, Deqing Sun, Huaizu Jiang

Abstract: We address the problem of generating realistic 3D human-object interactions (HOIs) driven by textual prompts. To this end, we take a modular design and decompose the complex task into simpler sub-tasks. We first develop a dual-branch diffusion model (HOI-DM) to generate both human and object motions conditioned on the input text, and encourage coherent motions by a cross-attention communication mo… ▽ More We address the problem of generating realistic 3D human-object interactions (HOIs) driven by textual prompts. To this end, we take a modular design and decompose the complex task into simpler sub-tasks. We first develop a dual-branch diffusion model (HOI-DM) to generate both human and object motions conditioned on the input text, and encourage coherent motions by a cross-attention communication module between the human and object motion generation branches. We also develop an affordance prediction diffusion model (APDM) to predict the contacting area between the human and object during the interactions driven by the textual prompt. The APDM is independent of the results by the HOI-DM and thus can correct potential errors by the latter. Moreover, it stochastically generates the contacting points to diversify the generated motions. Finally, we incorporate the estimated contacting points into the classifier-guidance to achieve accurate and close contact between humans and objects. To train and evaluate our approach, we annotate BEHAVE dataset with text descriptions. Experimental results on BEHAVE and OMOMO demonstrate that our approach produces realistic HOIs with various interactions and different types of objects. △ Less

Submitted 15 March, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

Comments: Project Page: https://neu-vi.github.io/HOI-Diff/

arXiv:2312.03884 [pdf, other]

WonderJourney: Going from Anywhere to Everywhere

Authors: Hong-Xing Yu, Haoyi Duan, Junhwa Hur, Kyle Sargent, Michael Rubinstein, William T. Freeman, Forrester Cole, Deqing Sun, Noah Snavely, Jiajun Wu, Charles Herrmann

Abstract: We introduce WonderJourney, a modularized framework for perpetual 3D scene generation. Unlike prior work on view generation that focuses on a single type of scenes, we start at any user-provided location (by a text description or an image) and generate a journey through a long sequence of diverse yet coherently connected 3D scenes. We leverage an LLM to generate textual descriptions of the scenes… ▽ More We introduce WonderJourney, a modularized framework for perpetual 3D scene generation. Unlike prior work on view generation that focuses on a single type of scenes, we start at any user-provided location (by a text description or an image) and generate a journey through a long sequence of diverse yet coherently connected 3D scenes. We leverage an LLM to generate textual descriptions of the scenes in this journey, a text-driven point cloud generation pipeline to make a compelling and coherent sequence of 3D scenes, and a large VLM to verify the generated scenes. We show compelling, diverse visual results across various scene types and styles, forming imaginary "wonderjourneys". Project website: https://kovenyu.com/WonderJourney/ △ Less

Submitted 12 April, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

Comments: Project website with video results: https://kovenyu.com/WonderJourney/

arXiv:2312.02919 [pdf, other]

Fine-grained Controllable Video Generation via Object Appearance and Context

Authors: Hsin-** Huang, Yu-Chuan Su, Deqing Sun, Lu Jiang, Xuhui Jia, Yukun Zhu, Ming-Hsuan Yang

Abstract: Text-to-video generation has shown promising results. However, by taking only natural languages as input, users often face difficulties in providing detailed information to precisely control the model's output. In this work, we propose fine-grained controllable video generation (FACTOR) to achieve detailed control. Specifically, FACTOR aims to control objects' appearances and context, including th… ▽ More Text-to-video generation has shown promising results. However, by taking only natural languages as input, users often face difficulties in providing detailed information to precisely control the model's output. In this work, we propose fine-grained controllable video generation (FACTOR) to achieve detailed control. Specifically, FACTOR aims to control objects' appearances and context, including their location and category, in conjunction with the text prompt. To achieve detailed control, we propose a unified framework to jointly inject control signals into the existing text-to-video model. Our model consists of a joint encoder and adaptive cross-attention layers. By optimizing the encoder and the inserted layer, we adapt the model to generate videos that are aligned with both text prompts and fine-grained control. Compared to existing methods relying on dense control signals such as edge maps, we provide a more intuitive and user-friendly interface to allow object-level fine-grained control. Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users. Extensive experiments on standard benchmark datasets and user-provided inputs validate that our model obtains a 70% improvement in controllability metrics over competitive baselines. △ Less

Submitted 5 December, 2023; originally announced December 2023.

Comments: Project page: https://hhsin**.github.io/factor

arXiv:2311.17776 [pdf, other]

One-Shot Open Affordance Learning with Foundation Models

Authors: Gen Li, Deqing Sun, Laura Sevilla-Lara, Varun Jampani

Abstract: We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category, but is expected to identify novel objects and affordances. While vision-language models excel at recognizing novel objects and scenes, they often struggle to understand finer levels of granularity such as affordances. To handle this issue, we conduct a comprehensive analy… ▽ More We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category, but is expected to identify novel objects and affordances. While vision-language models excel at recognizing novel objects and scenes, they often struggle to understand finer levels of granularity such as affordances. To handle this issue, we conduct a comprehensive analysis of existing foundation models, to explore their inherent understanding of affordances and assess the potential for data-limited affordance learning. We then propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings. Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data, and exhibits reasonable generalization capability on unseen objects and affordances. △ Less

Submitted 29 November, 2023; originally announced November 2023.

arXiv:2311.17034 [pdf, other]

Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence

Authors: Junyi Zhang, Charles Herrmann, Junhwa Hur, Eric Chen, Varun Jampani, Deqing Sun, Ming-Hsuan Yang

Abstract: While pre-trained large-scale vision models have shown significant promise for semantic correspondence, their features often struggle to grasp the geometry and orientation of instances. This paper identifies the importance of being geometry-aware for semantic correspondence and reveals a limitation of the features of current foundation models under simple post-processing. We show that incorporatin… ▽ More While pre-trained large-scale vision models have shown significant promise for semantic correspondence, their features often struggle to grasp the geometry and orientation of instances. This paper identifies the importance of being geometry-aware for semantic correspondence and reveals a limitation of the features of current foundation models under simple post-processing. We show that incorporating this information can markedly enhance semantic correspondence performance with simple but effective solutions in both zero-shot and supervised settings. We also construct a new challenging benchmark for semantic correspondence built from an existing animal pose estimation dataset, for both pre-training validating models. Our method achieves a [email protected] score of 65.4 (zero-shot) and 85.6 (supervised) on the challenging SPair-71k dataset, outperforming the state of the art by 5.5p and 11.0p absolute gains, respectively. Our code and datasets are publicly available at: https://telling-left-from-right.github.io/. △ Less

Submitted 24 March, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

Comments: Accepted by CVPR 24, project page: https://telling-left-from-right.github.io/

arXiv:2311.16081 [pdf, other]

ViT-Lens: Towards Omni-modal Representations

Authors: Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou

Abstract: Aiming to advance AI agents, large foundation models significantly improve reasoning and instruction execution, yet the current focus on vision and language neglects the potential of perceiving diverse modalities in open-world environments. However, the success of data-driven vision and language models is costly or even infeasible to be reproduced for rare modalities. In this paper, we present ViT… ▽ More Aiming to advance AI agents, large foundation models significantly improve reasoning and instruction execution, yet the current focus on vision and language neglects the potential of perceiving diverse modalities in open-world environments. However, the success of data-driven vision and language models is costly or even infeasible to be reproduced for rare modalities. In this paper, we present ViT-Lens-2 that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, the modality-specific lens is tuned to project any-modal signals to an intermediate embedding space, which are then processed by a strong ViT with pre-trained visual knowledge. The encoded representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models. ViT-Lens-2 provides a unified solution for representation learning of increasing modalities with two appealing advantages: (i) Unlocking the great potential of pretrained ViTs to novel modalities effectively with efficient data regime; (ii) Enabling emergent downstream capabilities through modality alignment and shared ViT parameters. We tailor ViT-Lens-2 to learn representations for 3D point cloud, depth, audio, tactile and EEG, and set new state-of-the-art results across various understanding tasks, such as zero-shot classification. By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation in a zero-shot manner. Code and models are available at https://github.com/TencentARC/ViT-Lens. △ Less

Submitted 26 March, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

Comments: This work is a follow-up of arXiv:2308.10185. Accepted to CVPR2024

arXiv:2311.12076 [pdf, other]

Towards Few-shot Out-of-Distribution Detection

Authors: Jiuqing Dong, Yongbin Gao, Heng Zhou, Jun Cen, Yifan Yao, Sook Yoon, Park Dong Sun

Abstract: Out-of-distribution (OOD) detection is critical for ensuring the reliability of open-world intelligent systems. Despite the notable advancements in existing OOD detection methodologies, our study identifies a significant performance drop under the scarcity of training samples. In this context, we introduce a novel few-shot OOD detection benchmark, carefully constructed to address this gap. Our emp… ▽ More Out-of-distribution (OOD) detection is critical for ensuring the reliability of open-world intelligent systems. Despite the notable advancements in existing OOD detection methodologies, our study identifies a significant performance drop under the scarcity of training samples. In this context, we introduce a novel few-shot OOD detection benchmark, carefully constructed to address this gap. Our empirical analysis reveals the superiority of ParameterEfficient Fine-Tuning (PEFT) strategies, such as visual prompt tuning and visual adapter tuning, over conventional techniques, including fully fine-tuning and linear probing tuning in the few-shot OOD detection task. Recognizing some crucial information from the pre-trained model, which is pivotal for OOD detection, may be lost during the fine-tuning process, we propose a method termed DomainSpecific and General Knowledge Fusion (DSGF). This approach is designed to be compatible with diverse fine-tuning frameworks. Our experiments show that the integration of DSGF significantly enhances the few-shot OOD detection capabilities across various methods and fine-tuning methodologies, including fully fine-tuning, visual adapter tuning, and visual prompt tuning. The code will be released. △ Less

Submitted 30 January, 2024; v1 submitted 19 November, 2023; originally announced November 2023.

arXiv:2311.07198 [pdf, other]

MonoDiffusion: Self-Supervised Monocular Depth Estimation Using Diffusion Model

Authors: Shuwei Shao, Zhongcai Pei, Weihai Chen, Dingchi Sun, Peter C. Y. Chen, Zhengguo Li

Abstract: Over the past few years, self-supervised monocular depth estimation that does not depend on ground-truth during the training phase has received widespread attention. Most efforts focus on designing different types of network architectures and loss functions or handling edge cases, e.g., occlusion and dynamic objects. In this work, we introduce a novel self-supervised depth estimation framework, du… ▽ More Over the past few years, self-supervised monocular depth estimation that does not depend on ground-truth during the training phase has received widespread attention. Most efforts focus on designing different types of network architectures and loss functions or handling edge cases, e.g., occlusion and dynamic objects. In this work, we introduce a novel self-supervised depth estimation framework, dubbed MonoDiffusion, by formulating it as an iterative denoising process. Because the depth ground-truth is unavailable in the training phase, we develop a pseudo ground-truth diffusion process to assist the diffusion in MonoDiffusion. The pseudo ground-truth diffusion gradually adds noise to the depth map generated by a pre-trained teacher model. Moreover,the teacher model allows applying a distillation loss to guide the denoised depth. Further, we develop a masked visual condition mechanism to enhance the denoising ability of model. Extensive experiments are conducted on the KITTI and Make3D datasets and the proposed MonoDiffusion outperforms prior state-of-the-art competitors. The source code will be available at https://github.com/ShuweiShao/MonoDiffusion. △ Less

Submitted 13 November, 2023; originally announced November 2023.

Comments: 10 pages, 8 figures

arXiv:2310.18130 [pdf, other]

DELPHI: Data for Evaluating LLMs' Performance in Handling Controversial Issues

Authors: David Q. Sun, Artem Abzaliev, Hadas Kotek, Zidi Xiu, Christopher Klein, Jason D. Williams

Abstract: Controversy is a reflection of our zeitgeist, and an important aspect to any discourse. The rise of large language models (LLMs) as conversational systems has increased public reliance on these systems for answers to their various questions. Consequently, it is crucial to systematically examine how these models respond to questions that pertaining to ongoing debates. However, few such datasets exi… ▽ More Controversy is a reflection of our zeitgeist, and an important aspect to any discourse. The rise of large language models (LLMs) as conversational systems has increased public reliance on these systems for answers to their various questions. Consequently, it is crucial to systematically examine how these models respond to questions that pertaining to ongoing debates. However, few such datasets exist in providing human-annotated labels reflecting the contemporary discussions. To foster research in this area, we propose a novel construction of a controversial questions dataset, expanding upon the publicly released Quora Question Pairs Dataset. This dataset presents challenges concerning knowledge recency, safety, fairness, and bias. We evaluate different LLMs using a subset of this dataset, illuminating how they handle controversial issues and the stances they adopt. This research ultimately contributes to our understanding of LLMs' interaction with controversial issues, paving the way for improvements in their comprehension and handling of complex societal debates. △ Less

Submitted 7 November, 2023; v1 submitted 27 October, 2023; originally announced October 2023.

Comments: Accepted to EMNLP Industry Track 2023

arXiv:2310.17994 [pdf, other]

ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image

Authors: Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry Lagun, Li Fei-Fei, Deqing Sun, Jiajun Wu

Abstract: We introduce a 3D-aware diffusion model, ZeroNVS, for single-image novel view synthesis for in-the-wild scenes. While existing methods are designed for single objects with masked backgrounds, we propose new techniques to address challenges introduced by in-the-wild multi-object scenes with complex backgrounds. Specifically, we train a generative prior on a mixture of data sources that capture obje… ▽ More We introduce a 3D-aware diffusion model, ZeroNVS, for single-image novel view synthesis for in-the-wild scenes. While existing methods are designed for single objects with masked backgrounds, we propose new techniques to address challenges introduced by in-the-wild multi-object scenes with complex backgrounds. Specifically, we train a generative prior on a mixture of data sources that capture object-centric, indoor, and outdoor scenes. To address issues from data mixture such as depth-scale ambiguity, we propose a novel camera conditioning parameterization and normalization scheme. Further, we observe that Score Distillation Sampling (SDS) tends to truncate the distribution of complex backgrounds during distillation of 360-degree scenes, and propose "SDS anchoring" to improve the diversity of synthesized novel views. Our model sets a new state-of-the-art result in LPIPS on the DTU dataset in the zero-shot setting, even outperforming methods specifically trained on DTU. We further adapt the challenging Mip-NeRF 360 dataset as a new benchmark for single-image novel view synthesis, and demonstrate strong performance in this setting. Our code and data are at http://kylesargent.github.io/zeronvs/ △ Less

Submitted 23 April, 2024; v1 submitted 27 October, 2023; originally announced October 2023.

Comments: Accepted to CVPR 2024. 12 pages

arXiv:2310.08580 [pdf, other]

OmniControl: Control Any Joint at Any Time for Human Motion Generation

Authors: Yiming Xie, Varun Jampani, Lei Zhong, Deqing Sun, Huaizu Jiang

Abstract: We present a novel approach named OmniControl for incorporating flexible spatial control signals into a text-conditioned human motion generation model based on the diffusion process. Unlike previous methods that can only control the pelvis trajectory, OmniControl can incorporate flexible spatial control signals over different joints at different times with only one model. Specifically, we propose… ▽ More We present a novel approach named OmniControl for incorporating flexible spatial control signals into a text-conditioned human motion generation model based on the diffusion process. Unlike previous methods that can only control the pelvis trajectory, OmniControl can incorporate flexible spatial control signals over different joints at different times with only one model. Specifically, we propose analytic spatial guidance that ensures the generated motion can tightly conform to the input control signals. At the same time, realism guidance is introduced to refine all the joints to generate more coherent motion. Both the spatial and realism guidance are essential and they are highly complementary for balancing control accuracy and motion realism. By combining them, OmniControl generates motions that are realistic, coherent, and consistent with the spatial constraints. Experiments on HumanML3D and KIT-ML datasets show that OmniControl not only achieves significant improvement over state-of-the-art methods on pelvis control but also shows promising results when incorporating the constraints over other joints. △ Less

Submitted 14 April, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

Comments: ICLR 2024. Project page: https://neu-vi.github.io/omnicontrol/

arXiv:2310.05717 [pdf, other]

STOPNet: Multiview-based 6-DoF Suction Detection for Transparent Objects on Production Lines

Authors: Yuxuan Kuang, Qin Han, Danshi Li, Qiyu Dai, Lian Ding, Dong Sun, Hanlin Zhao, He Wang

Abstract: In this work, we present STOPNet, a framework for 6-DoF object suction detection on production lines, with a focus on but not limited to transparent objects, which is an important and challenging problem in robotic systems and modern industry. Current methods requiring depth input fail on transparent objects due to depth cameras' deficiency in sensing their geometry, while we proposed a novel fram… ▽ More In this work, we present STOPNet, a framework for 6-DoF object suction detection on production lines, with a focus on but not limited to transparent objects, which is an important and challenging problem in robotic systems and modern industry. Current methods requiring depth input fail on transparent objects due to depth cameras' deficiency in sensing their geometry, while we proposed a novel framework to reconstruct the scene on the production line depending only on RGB input, based on multiview stereo. Compared to existing works, our method not only reconstructs the whole 3D scene in order to obtain high-quality 6-DoF suction poses in real time but also generalizes to novel environments, novel arrangements and novel objects, including challenging transparent objects, both in simulation and the real world. Extensive experiments in simulation and the real world show that our method significantly surpasses the baselines and has better generalizability, which caters to practical industrial needs. △ Less

Submitted 9 October, 2023; originally announced October 2023.

Comments: Under Review. ICRA 2024 submission

arXiv:2310.05551 [pdf, other]

PST: Improving Quantitative Trading via Program Sketch-based Tuning

Authors: Zhiming Li, Junzhe Jiang, Yushi Cao, Aixin Cui, Bozhi Wu, Bo Li, Yang Liu, Dongning Sun

Abstract: Deep reinforcement learning (DRL) has revolutionized quantitative finance by achieving decent performance without significant human expert knowledge. Despite its achievements, we observe that the current state-of-the-art DRL models are still ineffective in identifying the market trend, causing them to miss good trading opportunities or suffer from large drawdowns when encountering market crashes.… ▽ More Deep reinforcement learning (DRL) has revolutionized quantitative finance by achieving decent performance without significant human expert knowledge. Despite its achievements, we observe that the current state-of-the-art DRL models are still ineffective in identifying the market trend, causing them to miss good trading opportunities or suffer from large drawdowns when encountering market crashes. To tackle this limitation, a natural idea is to embed human expert knowledge regarding the market trend. Whereas, such knowledge is abstract and hard to be quantified. In this paper, we propose a universal neuro-symbolic tuning framework, called program sketch-based tuning (PST). Particularly, PST first proposes using a novel symbolic program sketch to embed the abstract human expert knowledge of market trends. Then we utilize the program sketch to tune a trained DRL policy according to the different market trend of the moment. Finally, in order to optimize this neural-symbolic framework, we propose a novel hybrid optimization method. Extensive evaluations on two popular quantitative trading tasks demonstrate that PST can significantly enhance the performance of previous state-of-the-art DRL strategies while being extremely lightweight. △ Less

Submitted 24 April, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

arXiv:2310.05262 [pdf, other]

Structure-Preserving Instance Segmentation via Skeleton-Aware Distance Transform

Authors: Zudi Lin, Donglai Wei, Aarush Gupta, Xingyu Liu, Deqing Sun, Hanspeter Pfister

Abstract: Objects with complex structures pose significant challenges to existing instance segmentation methods that rely on boundary or affinity maps, which are vulnerable to small errors around contacting pixels that cause noticeable connectivity change. While the distance transform (DT) makes instance interiors and boundaries more distinguishable, it tends to overlook the intra-object connectivity for in… ▽ More Objects with complex structures pose significant challenges to existing instance segmentation methods that rely on boundary or affinity maps, which are vulnerable to small errors around contacting pixels that cause noticeable connectivity change. While the distance transform (DT) makes instance interiors and boundaries more distinguishable, it tends to overlook the intra-object connectivity for instances with varying width and result in over-segmentation. To address these challenges, we propose a skeleton-aware distance transform (SDT) that combines the merits of object skeleton in preserving connectivity and DT in modeling geometric arrangement to represent instances with arbitrary structures. Comprehensive experiments on histopathology image segmentation demonstrate that SDT achieves state-of-the-art performance. △ Less

Submitted 8 October, 2023; originally announced October 2023.

Comments: MICCAI 2023 (Oral Presentation)

arXiv:2309.16071 [pdf, other]

Influence Pathway Discovery on Social Media

Authors: Xinyi Liu, Ruijie Wang, Dachun Sun, **ning Li, Christina Youn, You Lyu, Jianyuan Zhan, Dayou Wu, Xinhe Xu, Mingjun Liu, Xinshuo Lei, Zhihao Xu, Yutong Zhang, Zehao Li, Qikai Yang, Tarek Abdelzaher

Abstract: This paper addresses influence pathway discovery, a key emerging problem in today's online media. We propose a discovery algorithm that leverages recently published work on unsupervised interpretable ideological embedding, a map** of ideological beliefs (done in a self-supervised fashion) into interpretable low-dimensional spaces. Computing the ideological embedding at scale allows one to analyz… ▽ More This paper addresses influence pathway discovery, a key emerging problem in today's online media. We propose a discovery algorithm that leverages recently published work on unsupervised interpretable ideological embedding, a map** of ideological beliefs (done in a self-supervised fashion) into interpretable low-dimensional spaces. Computing the ideological embedding at scale allows one to analyze correlations between the ideological positions of leaders, influencers, news portals, or population segments, deriving potential influence pathways. The work is motivated by the importance of social media as the preeminent means for global interactions and collaborations on today's Internet, as well as their frequent (mis-)use to wield influence that targets social beliefs and attitudes of selected populations. Tools that enable the understanding and map** of influence propagation through population segments on social media are therefore increasingly important. In this paper, influence is measured by the perceived ideological shift over time that is correlated with influencers' activity. Correlated shifts in ideological embeddings indicate changes, such as swings/switching (among competing ideologies), polarization (depletion of neutral ideological positions), escalation/radicalization (shifts to more extreme versions of the ideology), or unification/cooldown (shifts towards more neutral stances). Case-studies are presented to explore selected influence pathways (i) in a recent French election, (ii) during political discussions in the Philippines, and (iii) for some Russian messaging during the Russia/Ukraine conflict. △ Less

Submitted 27 September, 2023; originally announced September 2023.

Comments: This paper is accepted by IEEE CIC as an invited vision paper

arXiv:2309.13515 [pdf, other]

Learning-based Inverse Perception Contracts and Applications

Authors: Dawei Sun, Benjamin C. Yang, Sayan Mitra

Abstract: Perception modules are integral in many modern autonomous systems, but their accuracy can be subject to the vagaries of the environment. In this paper, we propose a learning-based approach that can automatically characterize the error of a perception module from data and use this for safe control. The proposed approach constructs an inverse perception contract (IPC) which generates a set that cont… ▽ More Perception modules are integral in many modern autonomous systems, but their accuracy can be subject to the vagaries of the environment. In this paper, we propose a learning-based approach that can automatically characterize the error of a perception module from data and use this for safe control. The proposed approach constructs an inverse perception contract (IPC) which generates a set that contains the ground-truth value that is being estimated by the perception module, with high probability. We apply the proposed approach to study a vision pipeline deployed on a quadcopter. With the proposed approach, we successfully constructed an IPC for the vision pipeline. We then designed a control algorithm that utilizes the learned IPC, with the goal of landing the quadcopter safely on a landing pad. Experiments show that with the learned IPC, the control algorithm safely landed the quadcopter despite the error from the perception module, while the baseline algorithm without using the learned IPC failed to do so. △ Less

Submitted 3 March, 2024; v1 submitted 23 September, 2023; originally announced September 2023.

Showing 1–50 of 213 results for author: Sun, D