Search | arXiv e-print repository

Querying in Constant Time with Learned Indexes

Authors: Luis Croquevielle, Guang Yang, Liang Lian, Ali Hadian, Thomas Heinis

Abstract: Learned indexes leverage machine learning models to accelerate query answering in databases, showing impressive practical performance. However, theoretical understanding of these methods remains incomplete. Existing research suggests that learned indexes have superior asymptotic complexity compared to their non-learned counterparts, but these findings have been established under restrictive probab… ▽ More Learned indexes leverage machine learning models to accelerate query answering in databases, showing impressive practical performance. However, theoretical understanding of these methods remains incomplete. Existing research suggests that learned indexes have superior asymptotic complexity compared to their non-learned counterparts, but these findings have been established under restrictive probabilistic assumptions. Specifically, for a sorted array with $n$ elements, it has been shown that learned indexes can find a key in $O(\log(\log n))$ expected time using at most linear space, compared with $O(\log n)$ for non-learned methods. In this work, we prove $O(1)$ expected time can be achieved with at most linear space, thereby establishing the tightest upper bound so far for the time complexity of an asymptotically optimal learned index. Notably, we use weaker probabilistic assumptions than prior work, meaning our results generalize previous efforts. Furthermore, we introduce a new measure of statistical complexity for data. This metric exhibits an information-theoretical interpretation and can be estimated in practice. This characterization provides further theoretical understanding of learned indexes, by hel** to explain why some datasets seem to be particularly challenging for these methods. △ Less

Submitted 13 June, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

arXiv:2401.14391 [pdf, other]

Rethinking Patch Dependence for Masked Autoencoders

Authors: Letian Fu, Long Lian, Renhao Wang, Baifeng Shi, Xudong Wang, Adam Yala, Trevor Darrell, Alexei A. Efros, Ken Goldberg

Abstract: In this work, we re-examine inter-patch dependencies in the decoding mechanism of masked autoencoders (MAE). We decompose this decoding mechanism for masked patch reconstruction in MAE into self-attention and cross-attention. Our investigations suggest that self-attention between mask patches is not essential for learning good representations. To this end, we propose a novel pretraining framework:… ▽ More In this work, we re-examine inter-patch dependencies in the decoding mechanism of masked autoencoders (MAE). We decompose this decoding mechanism for masked patch reconstruction in MAE into self-attention and cross-attention. Our investigations suggest that self-attention between mask patches is not essential for learning good representations. To this end, we propose a novel pretraining framework: Cross-Attention Masked Autoencoders (CrossMAE). CrossMAE's decoder leverages only cross-attention between masked and visible tokens, with no degradation in downstream performance. This design also enables decoding only a small subset of mask tokens, boosting efficiency. Furthermore, each decoder block can now leverage different encoder features, resulting in improved representation learning. CrossMAE matches MAE in performance with 2.5 to 3.7$\times$ less decoding compute. It also surpasses MAE on ImageNet classification and COCO instance segmentation under the same compute. Code and models: https://crossmae.github.io △ Less

Submitted 25 January, 2024; originally announced January 2024.

arXiv:2312.17243 [pdf, other]

Unsupervised Universal Image Segmentation

Authors: Dantong Niu, Xudong Wang, Xinyang Han, Long Lian, Roei Herzig, Trevor Darrell

Abstract: Several unsupervised image segmentation approaches have been proposed which eliminate the need for dense manually-annotated segmentation masks; current models separately handle either semantic segmentation (e.g., STEGO) or class-agnostic instance segmentation (e.g., CutLER), but not both (i.e., panoptic segmentation). We propose an Unsupervised Universal Segmentation model (U2Seg) adept at perform… ▽ More Several unsupervised image segmentation approaches have been proposed which eliminate the need for dense manually-annotated segmentation masks; current models separately handle either semantic segmentation (e.g., STEGO) or class-agnostic instance segmentation (e.g., CutLER), but not both (i.e., panoptic segmentation). We propose an Unsupervised Universal Segmentation model (U2Seg) adept at performing various image segmentation tasks -- instance, semantic and panoptic -- using a novel unified framework. U2Seg generates pseudo semantic labels for these segmentation tasks via leveraging self-supervised models followed by clustering; each cluster represents different semantic and/or instance membership of pixels. We then self-train the model on these pseudo semantic labels, yielding substantial performance gains over specialized methods tailored to each task: a +2.6 AP$^{\text{box}}$ boost vs. CutLER in unsupervised instance segmentation on COCO and a +7.0 PixelAcc increase (vs. STEGO) in unsupervised semantic segmentation on COCOStuff. Moreover, our method sets up a new baseline for unsupervised panoptic segmentation, which has not been previously explored. U2Seg is also a strong pretrained model for few-shot segmentation, surpassing CutLER by +5.0 AP$^{\text{mask}}$ when trained on a low-data regime, e.g., only 1% COCO labels. We hope our simple yet effective method can inspire more research on unsupervised universal image segmentation. △ Less

Submitted 28 December, 2023; originally announced December 2023.

arXiv:2311.16090 [pdf, other]

Self-correcting LLM-controlled Diffusion Models

Authors: Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, Boyi Li, Trevor Darrell

Abstract: Text-to-image generation has witnessed significant progress with the advent of diffusion models. Despite the ability to generate photorealistic images, current text-to-image diffusion models still often struggle to accurately interpret and follow complex input text prompts. In contrast to existing models that aim to generate images only with their best effort, we introduce Self-correcting LLM-cont… ▽ More Text-to-image generation has witnessed significant progress with the advent of diffusion models. Despite the ability to generate photorealistic images, current text-to-image diffusion models still often struggle to accurately interpret and follow complex input text prompts. In contrast to existing models that aim to generate images only with their best effort, we introduce Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that generates an image from the input prompt, assesses its alignment with the prompt, and performs self-corrections on the inaccuracies in the generated image. Steered by an LLM controller, SLD turns text-to-image generation into an iterative closed-loop process, ensuring correctness in the resulting image. SLD is not only training-free but can also be seamlessly integrated with diffusion models behind API access, such as DALL-E 3, to further boost the performance of state-of-the-art diffusion models. Experimental results show that our approach can rectify a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships. Furthermore, by simply adjusting the instructions to the LLM, SLD can perform image editing tasks, bridging the gap between text-to-image generation and image editing pipelines. We will make our code available for future research and applications. △ Less

Submitted 27 November, 2023; originally announced November 2023.

Comments: 16 pages, 10 figures

arXiv:2309.17444 [pdf, other]

LLM-grounded Video Diffusion Models

Authors: Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li

Abstract: Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion. To address these limitations, we introduce LLM-grounded Video Diffusion (LVD). Instead of directly generating videos from the text inputs, LVD first leverages a large language… ▽ More Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion. To address these limitations, we introduce LLM-grounded Video Diffusion (LVD). Instead of directly generating videos from the text inputs, LVD first leverages a large language model (LLM) to generate dynamic scene layouts based on the text inputs and subsequently uses the generated layouts to guide a diffusion model for video generation. We show that LLMs are able to understand complex spatiotemporal dynamics from text alone and generate layouts that align closely with both the prompts and the object motion patterns typically observed in the real world. We then propose to guide video diffusion models with these layouts by adjusting the attention maps. Our approach is training-free and can be integrated into any video diffusion model that admits classifier guidance. Our results demonstrate that LVD significantly outperforms its base video diffusion model and several strong baseline methods in faithfully generating videos with the desired attributes and motion patterns. △ Less

Submitted 4 May, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

Comments: ICLR 2024. Project Page: https://llm-grounded-video-diffusion.github.io/

arXiv:2305.13655 [pdf, other]

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

Authors: Long Lian, Boyi Li, Adam Yala, Trevor Darrell

Abstract: Recent advancements in text-to-image diffusion models have yielded impressive results in generating realistic and diverse images. However, these models still struggle with complex prompts, such as those that involve numeracy and spatial reasoning. This work proposes to enhance prompt understanding capabilities in diffusion models. Our method leverages a pretrained large language model (LLM) for gr… ▽ More Recent advancements in text-to-image diffusion models have yielded impressive results in generating realistic and diverse images. However, these models still struggle with complex prompts, such as those that involve numeracy and spatial reasoning. This work proposes to enhance prompt understanding capabilities in diffusion models. Our method leverages a pretrained large language model (LLM) for grounded generation in a novel two-stage process. In the first stage, the LLM generates a scene layout that comprises captioned bounding boxes from a given prompt describing the desired image. In the second stage, a novel controller guides an off-the-shelf diffusion model for layout-grounded image generation. Both stages utilize existing pretrained models without additional model parameter optimization. Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images according to prompts that require various capabilities, doubling the generation accuracy across four tasks on average. Furthermore, our method enables instruction-based multi-round scene specification and can handle prompts in languages not supported by the underlying diffusion model. We anticipate that our method will unleash users' creativity by accurately following more complex prompts. Our code, demo, and benchmark are available at: https://llm-grounded-diffusion.github.io △ Less

Submitted 4 March, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: Transactions on Machine Learning Research (TMLR) 2024, with Featured Certification

arXiv:2304.08025 [pdf, other]

Bootstrap** Objectness from Videos by Relaxed Common Fate and Visual Grou**

Authors: Long Lian, Zhirong Wu, Stella X. Yu

Abstract: We study learning object segmentation from unlabeled videos. Humans can easily segment moving objects without knowing what they are. The Gestalt law of common fate, i.e., what move at the same speed belong together, has inspired unsupervised object discovery based on motion segmentation. However, common fate is not a reliable indicator of objectness: Parts of an articulated / deformable object may… ▽ More We study learning object segmentation from unlabeled videos. Humans can easily segment moving objects without knowing what they are. The Gestalt law of common fate, i.e., what move at the same speed belong together, has inspired unsupervised object discovery based on motion segmentation. However, common fate is not a reliable indicator of objectness: Parts of an articulated / deformable object may not move at the same speed, whereas shadows / reflections of an object always move with it but are not part of it. Our insight is to bootstrap objectness by first learning image features from relaxed common fate and then refining them based on visual appearance grou** within the image itself and across images statistically. Specifically, we learn an image segmenter first in the loop of approximating optical flow with constant segment flow plus small within-segment residual flow, and then by refining it for more coherent appearance and statistical figure-ground relevance. On unsupervised video object segmentation, using only ResNet and convolutional heads, our model surpasses the state-of-the-art by absolute gains of 7/9/5% on DAVIS16 / STv2 / FBMS59 respectively, demonstrating the effectiveness of our ideas. Our code is publicly available. △ Less

Submitted 17 April, 2023; originally announced April 2023.

Comments: Accepted by CVPR 2023. An extension of preprint 2212.08816. 19 pages, 11 figures

arXiv:2303.02960 [pdf, other]

Adaptive Multi-User Channel Estimation Based on Contrastive Feature Learning

Authors: Yihan Xu, Lixiang Lian

Abstract: Correlation exploitation is essential for efficient multi-user channel estimation (MUCE) in massive MIMO systems. However, the existing works either rely on presumed strong correlation or learn the correlation through large amount of labeled data, which are difficult to acquire in a real system. In this paper, we propose an adaptive MUCE algorithm based on contrastive feature learning. The contras… ▽ More Correlation exploitation is essential for efficient multi-user channel estimation (MUCE) in massive MIMO systems. However, the existing works either rely on presumed strong correlation or learn the correlation through large amount of labeled data, which are difficult to acquire in a real system. In this paper, we propose an adaptive MUCE algorithm based on contrastive feature learning. The contrastive learning (CL) is used to automatically learn the similarity within channels by extracting the channel state information (CSI) features based on location information. The similar features will be fed into the downstream network to explore the strong correlations among CSI features to improve the MUCE performance with a small number of labeled data. Simulation results show that the contrastive feature learning can enhance the overall MUCE performance with high training efficiency. △ Less

Submitted 6 March, 2023; originally announced March 2023.

arXiv:2302.04304 [pdf, other]

Q-Diffusion: Quantizing Diffusion Models

Authors: Xiuyu Li, Yijiang Liu, Long Lian, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, Kurt Keutzer

Abstract: Diffusion models have achieved great success in image synthesis through iterative noise estimation using deep neural networks. However, the slow inference, high memory consumption, and computation intensity of the noise estimation model hinder the efficient adoption of diffusion models. Although post-training quantization (PTQ) is considered a go-to compression method for other tasks, it does not… ▽ More Diffusion models have achieved great success in image synthesis through iterative noise estimation using deep neural networks. However, the slow inference, high memory consumption, and computation intensity of the noise estimation model hinder the efficient adoption of diffusion models. Although post-training quantization (PTQ) is considered a go-to compression method for other tasks, it does not work out-of-the-box on diffusion models. We propose a novel PTQ method specifically tailored towards the unique multi-timestep pipeline and model architecture of the diffusion models, which compresses the noise estimation network to accelerate the generation process. We identify the key difficulty of diffusion model quantization as the changing output distributions of noise estimation networks over multiple time steps and the bimodal activation distribution of the shortcut layers within the noise estimation network. We tackle these challenges with timestep-aware calibration and split shortcut quantization in this work. Experimental results show that our proposed method is able to quantize full-precision unconditional diffusion models into 4-bit while maintaining comparable performance (small FID change of at most 2.34 compared to >100 for traditional PTQ) in a training-free manner. Our approach can also be applied to text-guided image generation, where we can run stable diffusion in 4-bit weights with high generation quality for the first time. △ Less

Submitted 8 June, 2023; v1 submitted 8 February, 2023; originally announced February 2023.

Comments: The code is available at https://github.com/Xiuyu-Li/q-diffusion

arXiv:2301.03377 [pdf, other]

Machine Learning for Large-Scale Optimization in 6G Wireless Networks

Authors: Yandong Shi, Lixiang Lian, Yuanming Shi, Zixin Wang, Yong Zhou, Liqun Fu, Lin Bai, Jun Zhang, Wei Zhang

Abstract: The sixth generation (6G) wireless systems are envisioned to enable the paradigm shift from "connected things" to "connected intelligence", featured by ultra high density, large-scale, dynamic heterogeneity, diversified functional requirements and machine learning capabilities, which leads to a growing need for highly efficient intelligent algorithms. The classic optimization-based algorithms usua… ▽ More The sixth generation (6G) wireless systems are envisioned to enable the paradigm shift from "connected things" to "connected intelligence", featured by ultra high density, large-scale, dynamic heterogeneity, diversified functional requirements and machine learning capabilities, which leads to a growing need for highly efficient intelligent algorithms. The classic optimization-based algorithms usually require highly precise mathematical model of data links and suffer from poor performance with high computational cost in realistic 6G applications. Based on domain knowledge (e.g., optimization models and theoretical tools), machine learning (ML) stands out as a promising and viable methodology for many complex large-scale optimization problems in 6G, due to its superior performance, generalizability, computational efficiency and robustness. In this paper, we systematically review the most representative "learning to optimize" techniques in diverse domains of 6G wireless networks by identifying the inherent feature of the underlying optimization problem and investigating the specifically designed ML frameworks from the perspective of optimization. In particular, we will cover algorithm unrolling, learning to branch-and-bound, graph neural network for structured optimization, deep reinforcement learning for stochastic optimization, end-to-end learning for semantic optimization, as well as federated learning for distributed optimization, for solving challenging large-scale optimization problems arising from various important wireless applications. Through the in-depth discussion, we shed light on the excellent performance of ML-based optimization algorithms with respect to the classical methods, and provide insightful guidance to develop advanced ML techniques in 6G networks. △ Less

Submitted 3 January, 2023; originally announced January 2023.

arXiv:2212.08816 [pdf, other]

Improving Unsupervised Video Object Segmentation with Motion-Appearance Synergy

Authors: Long Lian, Zhirong Wu, Stella X. Yu

Abstract: We present IMAS, a method that segments the primary objects in videos without manual annotation in training or inference. Previous methods in unsupervised video object segmentation (UVOS) have demonstrated the effectiveness of motion as either input or supervision for segmentation. However, motion signals may be uninformative or even misleading in cases such as deformable objects and objects with… ▽ More We present IMAS, a method that segments the primary objects in videos without manual annotation in training or inference. Previous methods in unsupervised video object segmentation (UVOS) have demonstrated the effectiveness of motion as either input or supervision for segmentation. However, motion signals may be uninformative or even misleading in cases such as deformable objects and objects with reflections, causing unsatisfactory segmentation. In contrast, IMAS achieves Improved UVOS with Motion-Appearance Synergy. Our method has two training stages: 1) a motion-supervised object discovery stage that deals with motion-appearance conflicts through a learnable residual pathway; 2) a refinement stage with both low- and high-level appearance supervision to correct model misconceptions learned from misleading motion cues. Additionally, we propose motion-semantic alignment as a model-agnostic annotation-free hyperparam tuning method. We demonstrate its effectiveness in tuning critical hyperparams previously tuned with human annotation or hand-crafted hyperparam-specific metrics. IMAS greatly improves the segmentation quality on several common UVOS benchmarks. For example, we surpass previous methods by 8.3% on DAVIS16 benchmark with only standard ResNet and convolutional heads. We intend to release our code for future research and applications. △ Less

Submitted 17 December, 2022; originally announced December 2022.

Comments: 15 pages, 10 figures

arXiv:2211.03628 [pdf, other]

Decentralized Complete Dictionary Learning via $\ell^{4}$-Norm Maximization

Authors: Qiheng Lu, Lixiang Lian

Abstract: With the rapid development of information technologies, centralized data processing is subject to many limitations, such as computational overheads, communication delays, and data privacy leakage. Decentralized data processing over networked terminal nodes becomes an important technology in the era of big data. Dictionary learning is a powerful representation learning method to exploit the low-dim… ▽ More With the rapid development of information technologies, centralized data processing is subject to many limitations, such as computational overheads, communication delays, and data privacy leakage. Decentralized data processing over networked terminal nodes becomes an important technology in the era of big data. Dictionary learning is a powerful representation learning method to exploit the low-dimensional structure from the high-dimensional data. By exploiting the low-dimensional structure, the storage and the processing overhead of data can be effectively reduced. In this paper, we propose a novel decentralized complete dictionary learning algorithm, which is based on $\ell^{4}$-norm maximization. Compared with existing decentralized dictionary learning algorithms, comprehensive numerical experiments show that the novel algorithm has significant advantages in terms of per-iteration computational complexity, communication cost, and convergence rate in many scenarios. Moreover, a rigorous theoretical analysis shows that the dictionaries learned by the proposed algorithm can converge to the one learned by a centralized dictionary learning algorithm at a linear rate with high probability under certain conditions. △ Less

Submitted 26 November, 2022; v1 submitted 7 November, 2022; originally announced November 2022.

arXiv:2208.03928 [pdf, ps, other]

Reconfigurable Intelligent Surfaces Empowered Cooperative Rate Splitting with User Relaying

Authors: Kangchun Zhao, Yijie Mao, Zhaohui Yang, Lixiang Lian, Bruno Clerckx

Abstract: Cooperative rate splitting (CRS), built upon rate splitting multiple access (RSMA) and opportunistic user relaying, has been recognized as a promising transmission strategy to enhance the user fairness and spectral efficiency in multiantenna broadcast channels. To further boost its performance, the interplay of CRS and reconfigurable intelligent surface (RIS) is investigated in this work. Specific… ▽ More Cooperative rate splitting (CRS), built upon rate splitting multiple access (RSMA) and opportunistic user relaying, has been recognized as a promising transmission strategy to enhance the user fairness and spectral efficiency in multiantenna broadcast channels. To further boost its performance, the interplay of CRS and reconfigurable intelligent surface (RIS) is investigated in this work. Specifically, a novel RIS-aided CRS transmission framework is proposed and the corresponding resource allocation problem to maximize the minimum rate among users is investigated. An alternative optimization algorithm is then proposed to optimize the transmit beamforming, common rate allocation, and RIS phases, iteratively. Numerical results show that the proposed RIS-aided CRS transmission framework significantly improves the spectral efficiency compared with its non-cooperative counterpart and other schemes without RIS. △ Less

Submitted 8 August, 2022; originally announced August 2022.

Comments: 6 pages,5 figures

arXiv:2206.03596 [pdf, other]

Neural Network Compression via Effective Filter Analysis and Hierarchical Pruning

Authors: Ziqi Zhou, Li Lian, Yilong Yin, Ze Wang

Abstract: Network compression is crucial to making the deep networks to be more efficient, faster, and generalizable to low-end hardware. Current network compression methods have two open problems: first, there lacks a theoretical framework to estimate the maximum compression rate; second, some layers may get over-prunned, resulting in significant network performance drop. To solve these two problems, this… ▽ More Network compression is crucial to making the deep networks to be more efficient, faster, and generalizable to low-end hardware. Current network compression methods have two open problems: first, there lacks a theoretical framework to estimate the maximum compression rate; second, some layers may get over-prunned, resulting in significant network performance drop. To solve these two problems, this study propose a gradient-matrix singularity analysis-based method to estimate the maximum network redundancy. Guided by that maximum rate, a novel and efficient hierarchical network pruning algorithm is developed to maximally condense the neuronal network structure without sacrificing network performance. Substantial experiments are performed to demonstrate the efficacy of the new method for pruning several advanced convolutional neural network (CNN) architectures. Compared to existing pruning methods, the proposed pruning algorithm achieved state-of-the-art performance. At the same or similar compression ratio, the new method provided the highest network prediction accuracy as compared to other methods. △ Less

Submitted 7 June, 2022; originally announced June 2022.

arXiv:2205.04227 [pdf]

Mixed-UNet: Refined Class Activation Map** for Weakly-Supervised Semantic Segmentation with Multi-scale Inference

Authors: Yang Liu, Ersi Zhang, Lulu Xu, Chufan Xiao, Xiaoyun Zhong, Li** Lian, Fang Li, Bin Jiang, Yuhan Dong, Lan Ma, Qiming Huang, Ming Xu, Yongbing Zhang, Dongmei Yu, Chenggang Yan, Peiwu Qin

Abstract: Deep learning techniques have shown great potential in medical image processing, particularly through accurate and reliable image segmentation on magnetic resonance imaging (MRI) scans or computed tomography (CT) scans, which allow the localization and diagnosis of lesions. However, training these segmentation models requires a large number of manually annotated pixel-level labels, which are time-… ▽ More Deep learning techniques have shown great potential in medical image processing, particularly through accurate and reliable image segmentation on magnetic resonance imaging (MRI) scans or computed tomography (CT) scans, which allow the localization and diagnosis of lesions. However, training these segmentation models requires a large number of manually annotated pixel-level labels, which are time-consuming and labor-intensive, in contrast to image-level labels that are easier to obtain. It is imperative to resolve this problem through weakly-supervised semantic segmentation models using image-level labels as supervision since it can significantly reduce human annotation efforts. Most of the advanced solutions exploit class activation map** (CAM). However, the original CAMs rarely capture the precise boundaries of lesions. In this study, we propose the strategy of multi-scale inference to refine CAMs by reducing the detail loss in single-scale reasoning. For segmentation, we develop a novel model named Mixed-UNet, which has two parallel branches in the decoding phase. The results can be obtained after fusing the extracted features from two branches. We evaluate the designed Mixed-UNet against several prevalent deep learning-based segmentation approaches on our dataset collected from the local hospital and public datasets. The validation results demonstrate that our model surpasses available methods under the same supervision level in the segmentation of various lesions from brain imaging. △ Less

Submitted 6 May, 2022; originally announced May 2022.

Comments: 12 pages, 7 figures

arXiv:2204.03329 [pdf]

Information-driven Path Planning for Hybrid Aerial Underwater Vehicles

Authors: Zheng Zeng, Chengke Xiong, Xinyi Yuan, Yulin Bai, Yufei **, Di Lu, Lian Lian

Abstract: This paper presents a novel Rapidly-exploring Adaptive Sampling Tree (RAST) algorithm for the adaptive sampling mission of a hybrid aerial underwater vehicle (HAUV) in an air-sea 3D environment. This algorithm innovatively combines the tournament-based point selection sampling strategy, the information heuristic search process and the framework of Rapidly-exploring Random Tree (RRT) algorithm. Hen… ▽ More This paper presents a novel Rapidly-exploring Adaptive Sampling Tree (RAST) algorithm for the adaptive sampling mission of a hybrid aerial underwater vehicle (HAUV) in an air-sea 3D environment. This algorithm innovatively combines the tournament-based point selection sampling strategy, the information heuristic search process and the framework of Rapidly-exploring Random Tree (RRT) algorithm. Hence can guide the vehicle to the region of interest to scientists for sampling and generate a collision-free path for maximizing information collection by the HAUV under the constraints of environmental effects of currents or wind and limited budget. The simulation results show that the fast search adaptive sampling tree algorithm has higher optimization performance, faster solution speed and better stability than the Rapidly-exploring Information Gathering Tree (RIGT) algorithm and the particle swarm optimization (PSO) algorithm. △ Less

Submitted 8 April, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

arXiv:2201.01490 [pdf, other]

Debiased Learning from Naturally Imbalanced Pseudo-Labels

Authors: Xudong Wang, Zhirong Wu, Long Lian, Stella X. Yu

Abstract: Pseudo-labels are confident predictions made on unlabeled target data by a classifier trained on labeled source data. They are widely used for adapting a model to unlabeled data, e.g., in a semi-supervised learning setting. Our key insight is that pseudo-labels are naturally imbalanced due to intrinsic data similarity, even when a model is trained on balanced source data and evaluated on balance… ▽ More Pseudo-labels are confident predictions made on unlabeled target data by a classifier trained on labeled source data. They are widely used for adapting a model to unlabeled data, e.g., in a semi-supervised learning setting. Our key insight is that pseudo-labels are naturally imbalanced due to intrinsic data similarity, even when a model is trained on balanced source data and evaluated on balanced target data. If we address this previously unknown imbalanced classification problem arising from pseudo-labels instead of ground-truth training labels, we could remove model biases towards false majorities created by pseudo-labels. We propose a novel and effective debiased learning method with pseudo-labels, based on counterfactual reasoning and adaptive margins: The former removes the classifier response bias, whereas the latter adjusts the margin of each class according to the imbalance of pseudo-labels. Validated by extensive experimentation, our simple debiased learning delivers significant accuracy gains over the state-of-the-art on ImageNet-1K: 26% for semi-supervised learning with 0.2% annotations and 9% for zero-shot learning. Our code is available at: https://github.com/frank-xwang/debiased-pseudo-labeling. △ Less

Submitted 21 April, 2022; v1 submitted 5 January, 2022; originally announced January 2022.

Comments: Accepted by CVPR 2022

arXiv:2111.00470 [pdf, other]

Wireless Federated Learning over MIMO Networks: Joint Device Scheduling and Beamforming Design

Authors: Shaoming Huang, Pengfei Zhang, Yijie Mao, Lixiang Lian, Yuanming Shi

Abstract: Federated learning (FL) is recognized as a key enabling technology to support distributed artificial intelligence (AI) services in future 6G. By supporting decentralized data training and collaborative model training among devices, FL inherently tames privacy leakage and reduces transmission costs. Whereas, the performance of the wireless FL is typically restricted by the communication latency. Mu… ▽ More Federated learning (FL) is recognized as a key enabling technology to support distributed artificial intelligence (AI) services in future 6G. By supporting decentralized data training and collaborative model training among devices, FL inherently tames privacy leakage and reduces transmission costs. Whereas, the performance of the wireless FL is typically restricted by the communication latency. Multiple-input multiple-output (MIMO) technique is one promising solution to build up a communication-efficient edge FL system with limited radio resources. In this paper, we propose a novel joint device scheduling and receive beamforming design approach to reduce the FL convergence gap over shared wireless MIMO networks. Specifically, we theoretically establish the convergence analysis of the FL process, and then apply the proposed device scheduling policy to maximize the number of weighted devices under the FL system latency and sum power constraints. Numerical results verify the theoretical analysis of the FL convergence and exhibit the appealing learning performance of the proposed approach. △ Less

Submitted 31 October, 2021; originally announced November 2021.

Comments: Submitted to 2022 IEEE International Conference on Communications

arXiv:2110.03006 [pdf, other]

Unsupervised Selective Labeling for More Effective Semi-Supervised Learning

Authors: Xudong Wang, Long Lian, Stella X. Yu

Abstract: Given an unlabeled dataset and an annotation budget, we study how to selectively label a fixed number of instances so that semi-supervised learning (SSL) on such a partially labeled dataset is most effective. We focus on selecting the right data to label, in addition to usual SSL's propagating labels from labeled data to the rest unlabeled data. This instance selection task is challenging, as with… ▽ More Given an unlabeled dataset and an annotation budget, we study how to selectively label a fixed number of instances so that semi-supervised learning (SSL) on such a partially labeled dataset is most effective. We focus on selecting the right data to label, in addition to usual SSL's propagating labels from labeled data to the rest unlabeled data. This instance selection task is challenging, as without any labeled data we do not know what the objective of learning should be. Intuitively, no matter what the downstream task is, instances to be labeled must be representative and diverse: The former would facilitate label propagation to unlabeled data, whereas the latter would ensure coverage of the entire dataset. We capture this idea by selecting cluster prototypes, either in a pretrained feature space, or along with feature optimization, both without labels. Our unsupervised selective labeling consistently improves SSL methods over state-of-the-art active learning given labeled data, by 8 to 25 times in label efficiency. For example, it boosts FixMatch by 10% (14%) in accuracy on CIFAR-10 (ImageNet-1K) with 0.08% (0.2%) labeled data, demonstrating that small computation spent on selecting what data to label brings significant gain especially under a low annotation budget. Our work sets a new standard for practical and efficient SSL. △ Less

Submitted 23 August, 2023; v1 submitted 6 October, 2021; originally announced October 2021.

Comments: Accepted by ECCV 2022; Fixed a few typos

arXiv:2104.02921 [pdf, other]

Unsupervised Visual Attention and Invariance for Reinforcement Learning

Authors: Xudong Wang, Long Lian, Stella X. Yu

Abstract: Vision-based reinforcement learning (RL) is successful, but how to generalize it to unknown test environments remains challenging. Existing methods focus on training an RL policy that is universal to changing visual domains, whereas we focus on extracting visual foreground that is universal, feeding clean invariant vision to the RL policy learner. Our method is completely unsupervised, without man… ▽ More Vision-based reinforcement learning (RL) is successful, but how to generalize it to unknown test environments remains challenging. Existing methods focus on training an RL policy that is universal to changing visual domains, whereas we focus on extracting visual foreground that is universal, feeding clean invariant vision to the RL policy learner. Our method is completely unsupervised, without manual annotations or access to environment internals. Given videos of actions in a training environment, we learn how to extract foregrounds with unsupervised keypoint detection, followed by unsupervised visual attention to automatically generate a foreground mask per video frame. We can then introduce artificial distractors and train a model to reconstruct the clean foreground mask from noisy observations. Only this learned model is needed during test to provide distraction-free visual input to the RL policy learner. Our Visual Attention and Invariance (VAI) method significantly outperforms the state-of-the-art on visual domain generalization, gaining 15 to 49% (61 to 229%) more cumulative rewards per episode on DeepMind Control (our DrawerWorld Manipulation) benchmarks. Our results demonstrate that it is not only possible to learn domain-invariant vision without any supervision, but freeing RL from visual distractions also makes the policy more focused and thus far better. △ Less

Submitted 16 April, 2021; v1 submitted 7 April, 2021; originally announced April 2021.

Comments: Accepted at CVPR 2021

arXiv:2010.01809 [pdf, other]

Long-tailed Recognition by Routing Diverse Distribution-Aware Experts

Authors: Xudong Wang, Long Lian, Zhongqi Miao, Ziwei Liu, Stella X. Yu

Abstract: Natural data are often long-tail distributed over semantic classes. Existing recognition methods tackle this imbalanced classification by placing more emphasis on the tail data, through class re-balancing/re-weighting or ensembling over different data groups, resulting in increased tail accuracies but reduced head accuracies. We take a dynamic view of the training data and provide a principled m… ▽ More Natural data are often long-tail distributed over semantic classes. Existing recognition methods tackle this imbalanced classification by placing more emphasis on the tail data, through class re-balancing/re-weighting or ensembling over different data groups, resulting in increased tail accuracies but reduced head accuracies. We take a dynamic view of the training data and provide a principled model bias and variance analysis as the training data fluctuates: Existing long-tail classifiers invariably increase the model variance and the head-tail model bias gap remains large, due to more and larger confusion with hard negatives for the tail. We propose a new long-tailed classifier called RoutIng Diverse Experts (RIDE). It reduces the model variance with multiple experts, reduces the model bias with a distribution-aware diversity loss, reduces the computational cost with a dynamic expert routing module. RIDE outperforms the state-of-the-art by 5% to 7% on CIFAR100-LT, ImageNet-LT and iNaturalist 2018 benchmarks. It is also a universal framework that is applicable to various backbone networks, long-tailed algorithms, and training mechanisms for consistent performance gains. Our code is available at: https://github.com/frank-xwang/RIDE-LongTailRecognition. △ Less

Submitted 1 May, 2022; v1 submitted 5 October, 2020; originally announced October 2020.

Comments: Accepted at ICLR 2021 (Spotlight); Add experiments on Swin Transformer

Showing 1–21 of 21 results for author: Lian, L