
PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs

Abstract: Recent advancements in large language models (LLMs) have showcased their impressive capabilities. On mobile devices, the wealth of valuable, non-public data generated daily holds great promise for locally fine-tuning personalized LLMs, while maintaining privacy through on-device processing. However, the constraints of mobile device resources pose challenges to direct on-device LLM fine-tuning, mainly because derivative-based optimization is memory-intensive: it must store gradients and optimizer states. To tackle this, we propose employing derivative-free optimization techniques to enable on-device fine-tuning of LLMs, even on memory-limited mobile devices. Empirical results demonstrate that the RoBERTa-large model and OPT-1.3B can be fine-tuned locally on the OPPO Reno 6 smartphone with derivative-free optimization, using around 4GB and 6.5GB of memory respectively. This highlights the feasibility of on-device LLM fine-tuning on mobile devices, paving the way for personalized LLMs on resource-constrained devices while safeguarding data privacy.

Submitted 1 July, 2024; originally announced July 2024.

Comments: Accepted to the ACL 2024 Workshop on Privacy in Natural Language Processing (PrivateNLP)
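
Illustrative note: the abstract above does not say which derivative-free optimizer is used, so the following is only a minimal sketch of one common option, a two-point zeroth-order (SPSA-style) update. It needs two loss evaluations per step and keeps no gradient or optimizer-state buffers, which is the memory argument the abstract makes. The names `zo_step` and `quadratic_loss`, the toy loss, and all hyperparameters are assumptions made for this example, not details from the paper.

```python
import random


def zo_step(params, loss_fn, lr=0.05, eps=1e-3, seed=0):
    """One two-point zeroth-order update that uses only loss evaluations."""
    rng = random.Random(seed)
    # Random perturbation direction; reproducible from the seed alone.
    z = [rng.gauss(0.0, 1.0) for _ in params]
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    # Scalar estimate of the directional derivative along z.
    g = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
    # Move against the estimated gradient g * z.
    return [p - lr * g * zi for p, zi in zip(params, z)]


def quadratic_loss(w):
    """Toy objective (hypothetical stand-in for a model loss), minimized at all ones."""
    return sum((wi - 1.0) ** 2 for wi in w)


w = [0.0, 0.0, 0.0]
initial_loss = quadratic_loss(w)
for step in range(200):
    w = zo_step(w, quadratic_loss, seed=step)
final_loss = quadratic_loss(w)
```

In this toy run `final_loss` should end up well below `initial_loss`; a real fine-tuning setup would replace `quadratic_loss` with a forward pass over a mini-batch.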

arXiv:2406.18681 [pdf, other]

Data Sketching and Stacking: A Confluence of Two Strategies for Predictive Inference in Gaussian Process Regressions with High-Dimensional Features

Authors: Samuel Gailliot, Rajarshi Guhaniyogi, Roger D. Peng

Abstract: This article focuses on drawing computationally-efficient predictive inference from Gaussian process (GP) regressions with a large number of features when the response is conditionally independent of the features given the projection to a noisy low-dimensional manifold. Bayesian estimation of the regression relationship using Markov Chain Monte Carlo and subsequent predictive inference is computationally prohibitive and may lead to inferential inaccuracies since accurate variable selection is essentially impossible in such high-dimensional GP regressions. As an alternative, this article proposes a strategy to sketch the high-dimensional feature vector with a carefully constructed sketching matrix, before fitting a GP with the scalar outcome and the sketched feature vector to draw predictive inference. The analysis is performed in parallel with many different sketching matrices and smoothing parameters in different processors, and the predictive inferences are combined using Bayesian predictive stacking. Since the posterior predictive distribution in each processor is analytically tractable, the algorithm allows bypassing the robustness issues due to convergence and mixing of MCMC chains, leading to fast implementation with a very large number of features. Simulation studies show superior performance of the proposed approach over a wide variety of competitors. The approach outperforms competitors in drawing point prediction with predictive uncertainties of outdoor air pollution from satellite images.

Submitted 26 June, 2024; originally announced June 2024.

Comments: 32 Pages, 10 Figures

arXiv:2405.17732 [pdf, other]

C$^{3}$Bench: A Comprehensive Classical Chinese Understanding Benchmark for Large Language Models

Authors: Jiahuan Cao, Yongxin Shi, Dezhi Peng, Yang Liu, Lianwen Jin

Abstract: Classical Chinese Understanding (CCU) holds significant value in preserving and exploring the outstanding traditional Chinese culture. Recently, researchers have attempted to leverage the potential of Large Language Models (LLMs) for CCU by capitalizing on their remarkable comprehension and semantic capabilities. However, no comprehensive benchmark is available to assess the CCU capabilities of LLMs. To fill this gap, this paper introduces C$^{3}$bench, a Comprehensive Classical Chinese understanding benchmark, which comprises 50,000 text pairs for five primary CCU tasks, including classification, retrieval, named entity recognition, punctuation, and translation. Furthermore, the data in C$^{3}$bench originates from ten different domains, covering most of the categories in classical Chinese. Leveraging the proposed C$^{3}$bench, we extensively evaluate the quantitative performance of 15 representative LLMs on all five CCU tasks. Our results not only establish a public leaderboard of LLMs' CCU capabilities but also yield several findings. Specifically, existing LLMs struggle with CCU tasks and are still inferior to supervised models. Additionally, the results indicate that CCU is a task that requires special attention. We believe this study could provide a standard benchmark, comprehensive baselines, and valuable insights for the future advancement of LLM-based CCU research. The evaluation pipeline and dataset are available at https://github.com/SCUT-DLVCLab/C3bench.

Submitted 30 May, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.11336 [pdf, other]

UPAM: Unified Prompt Attack in Text-to-Image Generation Models Against Both Textual Filters and Visual Checkers

Authors: Duo Peng, Qiuhong Ke, Jun Liu

Abstract: Text-to-Image (T2I) models have raised security concerns due to their potential to generate inappropriate or harmful images. In this paper, we propose UPAM, a novel framework that investigates the robustness of T2I models from the attack perspective. Unlike most existing attack methods that focus on deceiving textual defenses, UPAM aims to deceive both textual and visual defenses in T2I models. UPAM enables gradient-based optimization, offering greater effectiveness and efficiency than previous methods. Given that T2I models might not return results due to defense mechanisms, we introduce a Sphere-Probing Learning (SPL) scheme to support gradient optimization even when no results are returned. Additionally, we devise a Semantic-Enhancing Learning (SEL) scheme to finetune UPAM for generating target-aligned images. Our framework also ensures attack stealthiness. Extensive experiments demonstrate UPAM's effectiveness and efficiency.

Submitted 25 May, 2024; v1 submitted 18 May, 2024; originally announced May 2024.

Comments: Accepted by ICML2024

ACM Class: I.2.6

arXiv:2405.08740 [pdf, other]

Reinformer: Max-Return Sequence Modeling for Offline RL

Authors: Zifeng Zhuang, Dengyun Peng, Jinxin Liu, Ziqi Zhang, Donglin Wang

Abstract: As a data-driven paradigm, offline reinforcement learning (RL) has been formulated as sequence modeling that conditions on hindsight information, including returns, goals, or future trajectories. Although promising, this supervised paradigm overlooks the core objective of RL, which is to maximize the return. This oversight directly leads to a lack of trajectory stitching capability, which affects the sequence model's learning from sub-optimal data. In this work, we introduce the concept of max-return sequence modeling, which integrates the goal of maximizing returns into existing sequence models. We propose Reinforced Transformer (Reinformer), indicating the sequence model is reinforced by the RL objective. Reinformer additionally incorporates the objective of maximizing returns in the training phase, aiming to predict the maximum future return within the distribution. During inference, this in-distribution maximum return will guide the selection of optimal actions. Empirically, Reinformer is competitive with classical RL methods on the D4RL benchmark and outperforms state-of-the-art sequence models, particularly in trajectory stitching ability. Code is public at https://github.com/Dragon-Zhuang/Reinformer.

Submitted 2 June, 2024; v1 submitted 14 May, 2024; originally announced May 2024.

Comments: ICML 2024

arXiv:2405.04408 [pdf, other]

DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks

Authors: Jiaxin Zhang, Dezhi Peng, Chongyu Liu, Peirong Zhang, Lianwen Jin

Abstract: Document image restoration is a crucial aspect of Document AI systems, as the quality of document images significantly influences the overall performance. Prevailing methods address distinct restoration tasks independently, leading to intricate systems and the incapability to harness the potential synergies of multi-task learning. To overcome this challenge, we propose DocRes, a generalist model that unifies five document image restoration tasks including dewarping, deshadowing, appearance enhancement, deblurring, and binarization. To instruct DocRes to perform various restoration tasks, we propose a novel visual prompt approach called Dynamic Task-Specific Prompt (DTSPrompt). The DTSPrompt for different tasks comprises distinct prior features, which are additional characteristics extracted from the input image. Beyond its role as a cue for task-specific execution, DTSPrompt can also serve as supplementary information to enhance the model's performance. Moreover, DTSPrompt is more flexible than prior visual prompt approaches as it can be seamlessly applied and adapted to inputs with high and variable resolutions. Experimental results demonstrate that DocRes achieves competitive or superior performance compared to existing state-of-the-art task-specific models. This underscores the potential of DocRes across a broader spectrum of document image restoration tasks. The source code is publicly available at https://github.com/ZZZHANG-jx/DocRes

Submitted 7 May, 2024; originally announced May 2024.

Comments: Accepted by CVPR 2024

arXiv:2404.12567 [pdf]

Impact of Vibrotactile Triggers on Mental Well-Being through ASMR Experience in VR

Authors: Danyang Peng, Tanner Person, Ximing Shen, Yun Suen Pai, Giulia Barbareschi, Shengyin Li, Kouta Minamizawa

Abstract: Watching Autonomous Sensory Meridian Response (ASMR) videos is a popular approach to support mental well-being, as the triggered ASMR tingling sensation supports de-stressing and regulating emotions. Therefore, there is increasing research on how to efficiently trigger ASMR tingling sensation. Tactile sensation remains unexplored because current popular ASMR approaches focus on the visual and audio channels. In this study, we explored the impact of tactile feedback on triggering ASMR tingling sensation in a Virtual Reality (VR) environment. Through two experimental studies, we investigated the relaxation effect of a tactile-enabled ASMR experience, as well as the impact of vibrotactile triggers on the ASMR experience. Our results showed that vibrotactile feedback is effective in increasing the likelihood of ASMR tingling sensation and enhancing the feeling of comfort, relaxation, and enjoyment.

Submitted 18 April, 2024; originally announced April 2024.

arXiv:2404.07503 [pdf, other]

Best Practices and Lessons Learned on Synthetic Data for Language Models

Authors: Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai

Abstract: The success of AI models relies on the availability of large, diverse, and high-quality datasets, which can be challenging to obtain due to data scarcity, privacy concerns, and high costs. Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns. This paper provides an overview of synthetic data research, discussing its applications, challenges, and future directions. We present empirical evidence from prior art to demonstrate its effectiveness and highlight the importance of ensuring its factuality, fidelity, and unbiasedness. We emphasize the need for responsible use of synthetic data to build more powerful, inclusive, and trustworthy language models.

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2403.19386 [pdf, other]

PointCloud-Text Matching: Benchmark Datasets and a Baseline

Authors: Yanglin Feng, Yang Qin, Dezhong Peng, Hongyuan Zhu, Xi Peng, Peng Hu

Abstract: In this paper, we present and study a new instance-level retrieval task: PointCloud-Text Matching (PTM), which aims to find the exact cross-modal instance that matches a given point-cloud query or text query. PTM could be applied to various scenarios, such as indoor/urban-canyon localization and scene retrieval. However, there exists no suitable and targeted dataset for PTM in practice. Therefore, we construct three new PTM benchmark datasets, namely 3D2T-SR, 3D2T-NR, and 3D2T-QA. We observe that the data is challenging and exhibits noisy correspondence due to the sparsity, noise, or disorder of point clouds and the ambiguity, vagueness, or incompleteness of texts, which make existing cross-modal matching methods ineffective for PTM. To tackle these challenges, we propose a PTM baseline, named Robust PointCloud-Text Matching method (RoMa). RoMa consists of two modules: a Dual Attention Perception module (DAP) and a Robust Negative Contrastive Learning module (RNCL). Specifically, DAP leverages token-level and feature-level attention to adaptively focus on useful local and global features, and aggregate them into common representations, thereby reducing the adverse impact of noise and ambiguity. To handle noisy correspondence, RNCL divides negative pairs, which are much less error-prone than positive pairs, into clean and noisy subsets, and assigns them forward and reverse optimization directions respectively, thus enhancing robustness against noisy correspondence. We conduct extensive experiments on our benchmarks and demonstrate the superiority of our RoMa.

Submitted 28 March, 2024; originally announced March 2024.

arXiv:2403.18802 [pdf, other]

Long-form factuality in large language models

Authors: Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le

Abstract: Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality. To do so, we balance the percentage of supported facts in a response (precision) with the percentage of provided facts relative to a hyperparameter representing a user's preferred response length (recall). Empirically, we demonstrate that LLM agents can outperform crowdsourced human annotators: on a set of ~16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times cheaper than human annotators. We also benchmark thirteen language models on LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding that larger language models generally achieve better long-form factuality. LongFact, SAFE, and all experimental code are available at https://github.com/google-deepmind/long-form-factuality.

Submitted 3 April, 2024; v1 submitted 27 March, 2024; originally announced March 2024.
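
Illustrative note: the extended F1 metric described in the abstract above can be sketched as follows. This is one plausible reading of the abstract's wording, not code from the paper's repository: precision is the share of supported facts among the facts judged, and recall is the number of supported facts divided by the preferred-length hyperparameter, capped at 1. The function name `f1_at_k`, its argument names, the cap on recall, and the zero score for a response with no supported facts are all assumptions made for this example.

```python
def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """Harmonic mean of factual precision and length-relative recall.

    k stands for the user's preferred number of supported facts (hypothetical name).
    """
    if num_supported == 0:
        # Assumed convention: a response with no supported facts scores zero.
        return 0.0
    precision = num_supported / (num_supported + num_not_supported)
    # Recall saturates once the response reaches the preferred length.
    recall = min(num_supported / k, 1.0)
    return 2.0 * precision * recall / (precision + recall)


balanced = f1_at_k(32, 32, 64)   # precision 0.5, recall 0.5
saturated = f1_at_k(200, 0, 64)  # precision 1.0, recall capped at 1.0
```

Here `balanced` evaluates to 0.5 and `saturated` to 1.0, which shows that adding supported facts beyond `k` no longer raises the score under these assumptions.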

arXiv:2403.13761 [pdf, other]

HierCode: A Lightweight Hierarchical Codebook for Zero-shot Chinese Text Recognition

Authors: Yuyi Zhang, Yuanzhi Zhu, Dezhi Peng, Peirong Zhang, Zhenhua Yang, Zhibo Yang, Cong Yao, Lianwen Jin

Abstract: Text recognition, especially for complex scripts like Chinese, faces unique challenges due to its intricate character structures and vast vocabulary. Traditional one-hot encoding methods struggle with the representation of hierarchical radicals, recognition of Out-Of-Vocabulary (OOV) characters, and on-device deployment due to their computational intensity. To address these challenges, we propose HierCode, a novel and lightweight codebook that exploits the innate hierarchical nature of Chinese characters. HierCode employs a multi-hot encoding strategy, leveraging hierarchical binary tree encoding and prototype learning to create distinctive, informative representations for each character. This approach not only facilitates zero-shot recognition of OOV characters by utilizing shared radicals and structures but also excels in line-level recognition tasks by computing similarity with visual features, a notable advantage over existing methods. Extensive experiments across diverse benchmarks, including handwritten, scene, document, web, and ancient text, have showcased HierCode's superiority for both conventional and zero-shot Chinese character or text recognition, exhibiting state-of-the-art performance with significantly fewer parameters and fast inference speed.

Submitted 20 March, 2024; originally announced March 2024.

arXiv:2402.14547 [pdf, other]

OmniPred: Language Models as Universal Regressors

Authors: Xingyou Song, Oscar Li, Chansoo Lee, Bangding Yang, Daiyi Peng, Sagi Perel, Yutian Chen

Abstract: Over the broad landscape of experimental design, regression has been a powerful tool to accurately predict the outcome metrics of a system or model given a set of parameters, but has been traditionally restricted to methods which are only applicable to a specific task. In this paper, we propose OmniPred, a framework for training language models as universal end-to-end regressors over $(x,y)$ evaluation data from diverse real world experiments. Using data sourced from Google Vizier, one of the largest blackbox optimization databases in the world, our extensive experiments demonstrate that through only textual representations of mathematical parameters and values, language models are capable of very precise numerical regression, and if given the opportunity to train over multiple tasks, can significantly outperform traditional regression models.

Submitted 4 March, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

Comments: 24 pages, 10 figures. Code can be found in https://github.com/google-research/optformer/tree/main/optformer/omnipred

arXiv:2402.08562 [pdf, other]

Higher Layers Need More LoRA Experts

Authors: Chongyang Gao, Kezhen Chen, Jinmeng Rao, Baochen Sun, Ruibo Liu, Daiyi Peng, Yawen Zhang, Xiaoyuan Guo, Jie Yang, VS Subrahmanian

Abstract: Parameter-efficient tuning (PEFT) techniques like low-rank adaptation (LoRA) offer training efficiency on Large Language Models, but their impact on model performance remains limited. Recent efforts integrate LoRA and Mixture-of-Experts (MoE) to improve the performance of PEFT methods. Despite promising results, research on improving the efficiency of LoRA with MoE is still in its early stages. Recent studies have shown that experts in the MoE architecture have different strengths and also exhibit some redundancy. Does this statement also apply to parameter-efficient MoE? In this paper, we introduce a novel parameter-efficient MoE method, MoE-LoRA with Layer-wise Expert Allocation (MoLA), for Transformer-based models, where each model layer has the flexibility to employ a varying number of LoRA experts. We investigate several architectures with varying layer-wise expert configurations. Experiments on six well-known NLP and commonsense QA benchmarks demonstrate that MoLA achieves equal or superior performance compared to all baselines. We find that allocating more LoRA experts to higher layers further enhances the effectiveness of models with a certain number of experts in total. With far fewer parameters, this allocation strategy outperforms the setting with the same number of experts in every layer. This work can be widely used as a plug-and-play parameter-efficient tuning approach for various applications. The code is available at https://github.com/GCYZSL/MoLA.

Submitted 13 February, 2024; originally announced February 2024.

Comments: The code is available at https://github.com/GCYZSL/MoLA

arXiv:2402.06512 [pdf, other]

Multimodal Clinical Trial Outcome Prediction with Large Language Models

Authors: Wenhao Zheng, Dongsheng Peng, Hongxia Xu, Yun Li, Hongtu Zhu, Tianfan Fu, Huaxiu Yao

Abstract: The clinical trial is a pivotal and costly process, often spanning multiple years and requiring substantial financial resources. Therefore, the development of clinical trial outcome prediction models aims to exclude drugs likely to fail and holds the potential for significant cost savings. Recent data-driven attempts leverage deep learning methods to integrate multimodal data for predicting clinical trial outcomes. However, these approaches rely on manually designed modal-specific encoders, which limits both the extensibility to adapt new modalities and the ability to discern similar information patterns across different modalities. To address these issues, we propose a multimodal mixture-of-experts (LIFTED) approach for clinical trial outcome prediction. Specifically, LIFTED unifies different modality data by transforming them into natural language descriptions. Then, LIFTED constructs unified noise-resilient encoders to extract information from modal-specific language descriptions. Subsequently, a sparse Mixture-of-Experts framework is employed to further refine the representations, enabling LIFTED to identify similar information patterns across different modalities and extract more consistent representations from those patterns using the same expert model. Finally, a mixture-of-experts module is further employed to dynamically integrate different modality representations for prediction, which gives LIFTED the ability to automatically weigh different modalities and pay more attention to critical information. The experiments demonstrate that LIFTED significantly enhances performance in predicting clinical trial outcomes across all three phases compared to the best baseline, showcasing the effectiveness of our proposed key components.

Submitted 8 May, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

arXiv:2402.02988 [pdf, other]

Ultrafast Nuclear Dynamics in Double-Core Ionized Water Molecules

Authors: Iyas Ismail, Ludger Inhester, Tatiana Marchenko, Florian Trinter, Abhishek Verma, Alberto De Fanis, Anthony Ferte, Daniel E. Rivas, Dawei Peng, Dimitris Koulentianos, Edwin Kukk, Francis Penent, Gilles Doumy, Giuseppe Sansone, John D. Bozek, Kai Li, Linda Young, Markus Ilchen, Maria Novella Piancastelli, Michael Meyer, Nicolas Velasquez, Oksana Travnikova, Rebecca Boll, Renaud Guillemin, Reinhard Dorner, et al. (8 additional authors not shown)

Abstract: Double-core-hole (DCH) states in isolated water and heavy water molecules, resulting from the sequential absorption of two x-ray photons, have been investigated. A comparison of the subsequent Auger emission spectra from the two isotopes provides direct evidence of ultrafast nuclear motion during the 1.5 fs lifetime of these DCH states. Our numerical results align well with the experimental data, providing for various DCH states an in-depth study of the dynamics responsible for the observed isotope effect.

Submitted 11 March, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

arXiv:2402.00585 [pdf, other]

SATac: A Thermoluminescence Enabled Tactile Sensor for Concurrent Perception of Temperature, Pressure, and Shear

Authors: Ziwu Song, Ran Yu, Xuan Zhang, Kit Wa Sou, Shilong Mu, Dengfeng Peng, Xiao-Ping Zhang, Wenbo Ding

Abstract: Most vision-based tactile sensors use elastomer deformation to infer tactile information, which cannot sense some modalities, like temperature. As an important part of human tactile perception, temperature sensing can help robots better interact with the environment. In this work, we propose a novel multimodal vision-based tactile sensor, SATac, which can simultaneously perceive temperature, pressure, and shear. SATac utilizes the thermoluminescence of strontium aluminate (SA) to sense a wide range of temperatures with exceptional resolution. Additionally, pressure and shear can be perceived by analyzing a Voronoi diagram. A series of experiments are conducted to verify the performance of our proposed sensor. We also discuss the possible application scenarios and demonstrate how SATac could benefit robot perception capabilities.

Submitted 1 February, 2024; originally announced February 2024.

arXiv:2401.12754 [pdf, other]

Research on the knee region of cosmic ray by using a novel type of electron-neutron detector array

Authors: Bing-Bing Li, Xin-Hua Ma, Shu-Wang Cui, Hao-Kun Chen, Tian-Lu Chen, Danzengluobu, Wei Gao, Hai-Bing Hu, Denis Kuleshov, Kirill Kurinov, Hu Liu, Mao-Yuan Liu, Ye Liu, Da-Yu Peng, Yao-Hui Qi, Oleg Shchegolev, Yuri Stenkin, Li-Qiao Yin, Heng-Yu Zhang, Liang-Wei Zhang

Abstract: By accurately measuring the composition and energy spectrum of cosmic rays, the origin problem of the so-called "knee" region (energy > 1 PeV) can be solved. However, up to the present, the results of the spectrum in the knee region obtained by several previous experiments have shown obvious differences, so they cannot give effective evidence for judging the theoretical models on the origin of the knee. Recently, the Large High Altitude Air Shower Observatory (LHAASO) has reported several major breakthroughs and important results in the astro-particle physics field. Relying on its advantages of wide-sky survey, high-altitude location and large-area detector arrays, the research content of the LHAASO experiment mainly includes ultra high-energy gamma-ray astronomy, measurement of cosmic ray spectra in the knee region, and searching for dark matter and new phenomena of particle physics at higher energy. The electron and Thermal Neutron detector (EN-Detector) is a new scintillator detector which applies thermal neutron detection technology to measure cosmic ray extensive air showers (EAS). This technology is an extension of LHAASO. The EN-Detector Array (ENDA) can highly efficiently measure thermal neutrons generated by secondary hadrons, the so-called "skeleton" of EAS. In this paper, we perform the optimization of the ENDA configuration, and obtain expectations on the ENDA results, including thermal neutron distribution, trigger efficiency and capability of cosmic ray composition separation. The obtained real data results are consistent with those from the Monte Carlo simulation.

Submitted 23 January, 2024; originally announced January 2024.

arXiv:2401.07641 [pdf, other]

SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting

Authors: Mingxin Huang, Dezhi Peng, Hongliang Li, Zhenghao Peng, Chongyu Liu, Dahua Lin, Yuliang Liu, Xiang Bai, Lianwen Jin

Abstract: End-to-end scene text spotting, which aims to read the text in natural images, has garnered significant attention in recent years. However, recent state-of-the-art methods usually incorporate detection and recognition simply by sharing the backbone, which does not directly take advantage of the feature interaction between the two tasks. In this paper, we propose a new end-to-end scene text spotting framework termed SwinTextSpotter v2, which seeks to find a better synergy between text detection and recognition. Specifically, we enhance the relationship between two tasks using novel Recognition Conversion and Recognition Alignment modules. Recognition Conversion explicitly guides text localization through recognition loss, while Recognition Alignment dynamically extracts text features for recognition through the detection predictions. This simple yet effective design results in a concise framework that requires neither an additional rectification module nor character-level annotations for the arbitrarily-shaped text. Furthermore, the parameters of the detector are greatly reduced without performance degradation by introducing a Box Selection Schedule. Qualitative and quantitative experiments demonstrate that SwinTextSpotter v2 achieved state-of-the-art performance on various multilingual (English, Chinese, and Vietnamese) benchmarks. The code will be available at https://github.com/mxin262/SwinTextSpotterv2.

Submitted 15 January, 2024; originally announced January 2024.

Comments: arXiv admin note: text overlap with arXiv:2203.10209

arXiv:2401.03486 [pdf, ps, other]

Nanofabrication beyond optical diffraction limit: Optical driven assembly enabled by superlubricity

Authors: Liu Jiang-tao, Deli Peng, Qin Yang, Ze Liu, Zhenhua Wu

Abstract: The optical manipulation of nanoparticles on superlubricity surfaces is investigated. The research revealed that, due to the near-zero static friction and extremely low dynamic friction at superlubricity interfaces, the maximum intensity for controlling the optical field can be less than 100 W/cm$^2$, which is nine orders of magnitude lower than controlling nanoparticles on traditional interfaces. The controlled nanoparticle radius can be as small as 5 nm, which is more than one order of magnitude smaller than nanoparticles controlled through traditional optical manipulation. Manipulation can be achieved in sub-microsecond to microsecond timescales. Furthermore, the manipulation takes place on solid surfaces and in non-liquid environments, with minimal impact from Brownian motion. By appropriately increasing dynamic friction, controlling light intensity, or reducing pressure, the effects of Brownian motion can be eliminated, allowing for the construction of microstructures with a size as small as 1/75 of the wavelength of light. This enables the control of super-resolution optical microstructures. The optical super-resolution manipulation of nanoparticles on superlubricity surfaces will find important applications in fields such as nanofabrication, photolithography, optical metasurface, and biochemical analysis.

Submitted 7 January, 2024; originally announced January 2024.

arXiv:2401.01100 [pdf]

Scalable manifold learning by uniform landmark sampling and constrained locally linear embedding

Authors: Dehua Peng, Zhipeng Gui, Wenzhang Wei, Huayi Wu

Abstract: As a pivotal approach in machine learning and data science, manifold learning aims to uncover the intrinsic low-dimensional structure within complex nonlinear manifolds in high-dimensional space. By exploiting the manifold hypothesis, various techniques for nonlinear dimension reduction have been developed to facilitate visualization, classification, clustering, and gaining key insights. Although existing manifold learning methods have achieved remarkable successes, they still suffer from extensive distortions incurred in the global structure, which hinders the understanding of underlying patterns. Scalability issues also limit their applicability for handling large-scale data. Here, we propose a scalable manifold learning (scML) method that can manipulate large-scale and high-dimensional data in an efficient manner. It starts by seeking a set of landmarks to construct the low-dimensional skeleton of the entire data, and then incorporates the non-landmarks into the learned space based on the constrained locally linear embedding (CLLE). We empirically validated the effectiveness of scML on synthetic datasets and real-world benchmarks of different types, and applied it to analyze the single-cell transcriptomics and detect anomalies in electrocardiogram (ECG) signals. scML scales well with increasing data sizes and embedding dimensions, and exhibits promising performance in preserving the global structure. The experiments demonstrate notable robustness in embedding quality as the sample rate decreases.

Submitted 5 January, 2024; v1 submitted 2 January, 2024; originally announced January 2024.

Comments: 33 pages, 10 figures

ACM Class: I.5.3

arXiv:2401.00422 [pdf]

Interpreting the Curse of Dimensionality from Distance Concentration and Manifold Effect

Authors: Dehua Peng, Zhipeng Gui, Huayi Wu

Abstract: The characteristics of data, such as distribution and heterogeneity, become more complex and counterintuitive as the dimensionality increases. This phenomenon is known as the curse of dimensionality, where common patterns and relationships (e.g., internal and boundary patterns) that hold in low-dimensional space may be invalid in higher-dimensional space. It leads to decreasing performance for regression, classification or clustering models and algorithms. The curse of dimensionality can be attributed to many causes. In this paper, we first summarize five challenges associated with manipulating high-dimensional data, and explain the potential causes for the failure of regression, classification or clustering tasks. Subsequently, we delve into two major causes of the curse of dimensionality, distance concentration and manifold effect, by performing theoretical and empirical analyses. The results demonstrate that nearest neighbor search (NNS) using three typical distance measurements, Minkowski distance, Chebyshev distance, and cosine distance, becomes meaningless as the dimensionality increases. Meanwhile, the data incorporates more redundant features, and the variance contribution of principal component analysis (PCA) is skewed towards a few dimensions. By interpreting the causes of the curse of dimensionality, we can better understand the limitations of current models and algorithms, and work to improve the performance of data analysis and machine learning tasks in high-dimensional space.

Submitted 7 January, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

Comments: 17 pages, 11 figures
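
Illustrative note (not from the paper): the distance concentration effect described in the abstract above can be observed with a small experiment. The sketch below assumes points drawn uniformly from the unit hypercube and uses the Euclidean distance only; the function name `relative_contrast` and the sample size are hypothetical choices made for this illustration, not the authors' experimental setup.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for distances from one random query
    point to n_points random points in the unit hypercube [0, 1]^dim."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


# As the dimension grows, the nearest and farthest neighbours end up at
# nearly the same distance, so the relative contrast shrinks.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

A small relative contrast means the nearest neighbour is barely closer than the farthest one, which is the sense in which nearest neighbor search loses discriminative power.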

arXiv:2312.17024 [pdf, other]

Selective Run-Length Encoding

Authors: Xutan Peng, Yi Zhang, Dejia Peng, Jiafa Zhu

Abstract: Run-Length Encoding (RLE) is one of the most fundamental tools in data compression. However, its compression power drops significantly if the sequence lacks runs of consecutive identical elements. In extreme cases, the output of the encoder may require more space than the input (aka size inflation). To alleviate this issue, using combinatorics, we quantify RLE's space savings for a given input distribution. With this insight, we develop the first algorithm that automatically identifies suitable symbols, then selectively encodes these symbols with RLE while directly storing the others without RLE. Through experiments on real-world datasets of various modalities, we empirically validate that our method, which maintains RLE's efficiency advantage, can effectively mitigate the size inflation dilemma.

Submitted 28 December, 2023; originally announced December 2023.

Comments: Accepted at DCC 2024
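
Illustrative note (not from the paper): the size inflation mentioned in the abstract above, and the general idea of run-length encoding only some symbols, can be sketched as follows. This is a toy under stated assumptions: each (symbol, count) pair is assumed to cost 2 units and each raw symbol 1 unit, and the set of symbols to encode is chosen by hand. The names `selective_rle_encode` and `cost` are hypothetical, and this is not the symbol-selection algorithm the authors propose.

```python
from itertools import groupby


def rle_encode(data):
    """Plain RLE: every run becomes a (symbol, run_length) pair."""
    return [(sym, len(list(run))) for sym, run in groupby(data)]


def selective_rle_encode(data, rle_symbols):
    """Run-length encode only symbols in rle_symbols; store others raw."""
    out = []
    for sym, run in groupby(data):
        length = len(list(run))
        if sym in rle_symbols:
            out.append((sym, length))   # one pair for the whole run
        else:
            out.extend(sym * length)    # raw symbols, one token each
    return out


def cost(tokens):
    """Assumed cost model: a pair costs 2 units, a raw symbol costs 1."""
    return sum(2 if isinstance(tok, tuple) else 1 for tok in tokens)


data = "aaaaaabcbc"                       # 10 raw symbols
plain = rle_encode(data)                  # 5 pairs, no saving
selective = selective_rle_encode(data, {"a"})
print(len(data), cost(plain), cost(selective))
```

Under this cost model, plain RLE spends two units on every isolated `b` and `c`, while the selective variant pays the pair overhead only for the long run of `a`.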

arXiv:2312.16012 [pdf, other]

Detection-based Intermediate Supervision for Visual Question Answering

Authors: Yuhang Liu, Daowan Peng, Wei Wei, Yuanyuan Fu, Wenfeng Xie, Dangyang Chen

Abstract: Recently, neural module networks (NMNs) have yielded ongoing success in answering compositional visual questions, especially those involving multi-hop visual and logical reasoning. NMNs decompose the complex question into several sub-tasks using instance-modules from the reasoning paths of that question and then exploit intermediate supervisions to guide answer prediction, thereby improving inference interpretability. However, their performance may be hindered due to sketchy modeling of intermediate supervisions. For instance, (1) a prior assumption that each instance-module refers to only one grounded object overlooks other potentially associated grounded objects, impeding full cross-modal alignment learning; (2) IoU-based intermediate supervisions may introduce noise signals as the bounding box overlap issue might guide the model's focus towards irrelevant objects. To address these issues, a novel method, Detection-based Intermediate Supervision (DIS), is proposed, which adopts a generative detection framework to facilitate multiple grounding supervisions via sequence generation. As such, DIS offers more comprehensive and accurate intermediate supervisions, thereby boosting answer prediction performance. Furthermore, by considering intermediate results, DIS enhances the consistency in answering compositional questions and their sub-questions. Extensive experiments demonstrate the superiority of our proposed DIS, showcasing both improved accuracy and state-of-the-art reasoning consistency compared to prior approaches.

Submitted 26 December, 2023; originally announced December 2023.

Comments: Accepted by AAAI24

arXiv:2312.12142 [pdf, other]

FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning

Authors: Zhenhua Yang, Dezhi Peng, Yuxin Kong, Yuyi Zhang, Cong Yao, Lianwen Jin

Abstract: Automatic font generation is an imitation task, which aims to create a font library that mimics the style of reference images while preserving the content from source images. Although existing font generation methods have achieved satisfactory performance, they still struggle with complex characters and large style variations. To address these issues, we propose FontDiffuser, a diffusion-based image-to-image one-shot font generation method, which innovatively models the font imitation task as a noise-to-denoise paradigm. In our method, we introduce a Multi-scale Content Aggregation (MCA) block, which effectively combines global and local content cues across different scales, leading to enhanced preservation of intricate strokes of complex characters. Moreover, to better manage the large variations in style transfer, we propose a Style Contrastive Refinement (SCR) module, which is a novel structure for style representation learning. It utilizes a style extractor to disentangle styles from images, subsequently supervising the diffusion model via a meticulously designed style contrastive loss. Extensive experiments demonstrate FontDiffuser's state-of-the-art performance in generating diverse characters and styles. It consistently excels on complex characters and large style changes compared to previous methods. The code is available at https://github.com/yeungchenwa/FontDiffuser.

Submitted 19 December, 2023; originally announced December 2023.

Comments: Accepted to AAAI 2024; Github Page: https://github.com/yeungchenwa/FontDiffuser

Journal ref: 38th AAAI Conference on Artificial Intelligence (AAAI2024), Vancouver, BC, Canada, 2024

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.07616 [pdf, other]

Evaluating the Alignment of a Data Analysis between Analyst and Audience

Authors: Lucy D'Agostino McGowan, Roger D. Peng, Stephanie C. Hicks

Abstract: A challenge that data analysts face is building a data analysis that is useful for a given consumer. Previously, we defined a set of principles for describing data analyses that can be used to create a data analysis and to characterize the variation between analyses. Here, we introduce a concept that we call the alignment of a data analysis between the data analyst and a consumer. We define a successfully aligned data analysis as the matching of principles between the analyst and the consumer for whom the analysis is developed. In this paper, we propose a statistical model for evaluating the alignment of a data analysis and describe some of its properties. We argue that this framework provides a language for characterizing alignment and can be used as a guide for practicing data scientists and students in data science courses for how to build better data analyses.

Submitted 11 December, 2023; originally announced December 2023.

arXiv:2312.04067 [pdf]

MeanCut: A Greedy-Optimized Graph Clustering via Path-based Similarity and Degree Descent Criterion

Authors: Dehua Peng, Zhipeng Gui, Huayi Wu

Abstract: As the most typical graph clustering method, spectral clustering is popular and attractive due to its remarkable performance, easy implementation, and strong adaptability. Classical spectral clustering measures the edge weights of the graph using a pairwise Euclidean-based metric, and solves the optimal graph partition by relaxing the constraints of the indicator matrix and performing Laplacian decomposition. However, Euclidean-based similarity might cause skewed graph cuts when handling non-spherical data distributions, and the relaxation strategy introduces information loss. Meanwhile, spectral clustering requires specifying the number of clusters, which is hard to determine without enough prior knowledge. In this work, we leverage the path-based similarity to enhance intra-cluster associations, and propose MeanCut as the objective function and greedily optimize it in degree descending order for a nondestructive graph partition. This algorithm enables the identification of arbitrarily shaped clusters and is robust to noise. To reduce the computational complexity of similarity calculation, we transform optimal path search into generating the maximum spanning tree (MST), and develop a fast MST (FastMST) algorithm to further improve its time-efficiency. Moreover, we define a density gradient factor (DGF) for separating the weakly connected clusters. The validity of our algorithm is demonstrated by testing on real-world benchmarks and an application to face recognition. The source code of MeanCut is available at https://github.com/ZPGuiGroupWhu/MeanCut-Clustering.

Submitted 7 December, 2023; originally announced December 2023.

Comments: 17 pages, 8 figures, 6 tables

ACM Class: I.5.3

arXiv:2312.04065 [pdf]

A Robust and Efficient Boundary Point Detection Method by Measuring Local Direction Dispersion

Authors: Dehua Peng, Zhipeng Gui, Huayi Wu

Abstract: Boundary points pose a significant challenge for machine learning tasks, including classification, clustering, and dimensionality reduction. Due to the similarity of features, boundary areas can result in mixed-up classes or clusters, leading to a crowding problem in dimensionality reduction. To address this challenge, numerous boundary point detection methods have been developed, but they are insufficient to accurately and efficiently identify the boundary points in non-convex structures and high-dimensional manifolds. In this work, we propose a robust and efficient method for detecting boundary points using Local Direction Dispersion (LoDD). LoDD considers that internal points are surrounded by neighboring points in all directions, while neighboring points of a boundary point tend to be distributed only in a certain directional range. LoDD adopts a density-independent K-Nearest Neighbors (KNN) method to determine neighboring points, and defines a statistic-based metric using the eigenvalues of the covariance matrix of KNN coordinates to measure the centrality of a query point. We demonstrated the validity of LoDD on five synthetic datasets (2-D and 3-D) and ten real-world benchmarks, and tested its clustering performance by equipping it with two typical clustering methods, K-means and Ncut. Our results show that LoDD achieves promising and robust detection accuracy in a time-efficient manner.

Submitted 7 December, 2023; originally announced December 2023.

Comments: 11 pages, 6 figures, 3 tables

ACM Class: I.5.2

arXiv:2312.02694 [pdf, other]

UPOCR: Towards Unified Pixel-Level OCR Interface

Authors: Dezhi Peng, Zhenhua Yang, Jiaxin Zhang, Chongyu Liu, Yongxin Shi, Kai Ding, Fengjun Guo, Lianwen Jin

Abstract: In recent years, the optical character recognition (OCR) field has been proliferating with plentiful cutting-edge approaches for a wide spectrum of tasks. However, these approaches are task-specifically designed with divergent paradigms, architectures, and training strategies, which significantly increases the complexity of research and maintenance and hinders the fast deployment in applications. To this end, we propose UPOCR, a simple-yet-effective generalist model for Unified Pixel-level OCR interface. Specifically, the UPOCR unifies the paradigm of diverse OCR tasks as image-to-image transformation and the architecture as a vision Transformer (ViT)-based encoder-decoder. Learnable task prompts are introduced to push the general feature representations extracted by the encoder toward task-specific spaces, endowing the decoder with task awareness. Moreover, the model training is uniformly aimed at minimizing the discrepancy between the generated and ground-truth images regardless of the inhomogeneity among tasks. Experiments are conducted on three pixel-level OCR tasks including text removal, text segmentation, and tampered text detection. Without bells and whistles, the experimental results showcase that the proposed method can simultaneously achieve state-of-the-art performance on three tasks with a unified single model, which provides valuable strategies and insights for future research on generalist OCR models. Code will be publicly available.

Submitted 5 December, 2023; originally announced December 2023.

arXiv:2311.16610 [pdf, other]

The Empathic Metaverse: An Assistive Bioresponsive Platform For Emotional Experience Sharing

Authors: Yun Suen Pai, Mark Armstrong, Kinga Skiers, Anish Kundu, Danyang Peng, Yixin Wang, Tamil Selvan Gunasekaran, Chi-Lan Yang, Kouta Minamizawa

Abstract: The Metaverse is poised to be a future platform that redefines what it means to communicate, socialize, and interact with each other. Yet, it is important for us to consider avoiding the pitfalls of the social media platforms we use today: cyberbullying, lack of transparency and an overall false mental model of society. In this paper, we propose the Empathic Metaverse, a virtual platform that prioritizes emotional sharing for assistance. It aims to cultivate prosocial behaviour, either egoistically or altruistically, so that our future society can better feel for each other and assist one another. To achieve this, we propose the platform to be bioresponsive; it reacts and adapts to an individual's physiological and cognitive state and reflects this via carefully designed avatars, environments, and interactions. We explore this concept in terms of three research directions: bioresponsive avatars, mediated communications and assistive tools.

Submitted 28 November, 2023; originally announced November 2023.

Comments: 5 pages including references, 4 figures, presented at the Towards an Inclusive and Accessible Metaverse (TIAM) Workshop at CHI 2023

arXiv:2311.09622 [pdf]

Homography Initialization and Dynamic Weighting Algorithm Based on a Downward-Looking Camera and IMU

Authors: Bo Dong, Yongkang Tao, Deng Peng, Zhigang Fu

Abstract: In recent years, the technology in visual-inertial odometry (VIO) has matured considerably and has been widely used in many applications. However, we still encounter challenges when applying VIO to a micro air vehicle (MAV) equipped with a downward-looking camera. Specifically, VIO cannot compute the correct initialization results during take-off and the cumulative drift is large when the MAV is flying in the air. To overcome these problems, we propose a homography-based initialization method, which utilizes the fact that the features detected by the downward-looking camera during take-off are approximately on the same plane. Then we introduce the prior normal vector and motion field to make states more accurate. In addition, to deal with the cumulative drift, a strategy for dynamically weighting visual residuals is proposed. Finally, we evaluate our method on the collected real-world datasets. The results demonstrate that our system can be successfully initialized no matter how the MAV takes off, and that the positioning errors are also greatly reduced.

Submitted 16 November, 2023; originally announced November 2023.

arXiv:2311.08001 [pdf]

doi 10.3389/fpubh.2023.1281259

A Comparative Analysis of the COVID-19 Infodemic in English and Chinese: Insights from Social Media Textual Data

Authors: Jia Luo, Daiyun Peng, Lei Shi, Didier El Baz, Xinran Liu

Abstract: The COVID-19 infodemic, characterized by the rapid spread of misinformation and unverified claims related to the pandemic, presents a significant challenge. This paper presents a comparative analysis of the COVID-19 infodemic in the English and Chinese languages, utilizing textual data extracted from social media platforms. To ensure a balanced representation, two infodemic datasets were created by augmenting previously collected social media textual data. Through word frequency analysis, the thirty-five most frequently occurring infodemic words are identified, shedding light on prevalent discussions surrounding the infodemic. Moreover, topic clustering analysis uncovers thematic structures and provides a deeper understanding of primary topics within each language context. Additionally, sentiment analysis enables comprehension of the emotional tone associated with COVID-19 information on social media platforms in English and Chinese. This research contributes to a better understanding of the COVID-19 infodemic phenomenon and can guide the development of strategies to combat misinformation during public health crises across different languages.

Submitted 14 November, 2023; originally announced November 2023.

Journal ref: Frontiers in Public Health, 2023, 11

arXiv:2311.07353 [pdf]

Superconductivity in trilayer nickelate La$_4$Ni$_3$O$_{10}$ single crystals

Authors: Yinghao Zhu, Enkang Zhang, Bingying Pan, Xu Chen, Di Peng, Lixing Chen, Huifen Ren, Feiyang Liu, Nana Li, Zhenfang Xing, Jiyuan Han, Junjie Wang, Donghan Jia, Hongliang Wo, Yiqing Gu, Yimeng Gu, Li Ji, Wenbin Wang, Huiyang Gou, Yao Shen, Tianping Ying, Xiaolong Chen, Wenge Yang, Changlin Zheng, Qiaoshi Zeng, et al. (2 additional authors not shown)

Abstract: The pursuit of discovering new high-temperature superconductors that diverge from the copper-based paradigm carries profound implications for elucidating mechanisms behind superconductivity and may also enable new applications. Here, our investigation reveals that application of pressure effectively suppresses the spin and charge order in trilayer nickelate La$_4$Ni$_3$O$_{10}$ single crystals, leading to the emergence of superconductivity with a maximum critical temperature (Tc) of around 30 K. In the normal state, we observe a "strange metal" behavior, characterized by a linear temperature-dependent resistance extending up to 300 K. These results could be interpreted as the pressure's influence, inducing damping on the density-wave gap and spin order, while promoting spin fluctuations and bringing the associated flat d$_{z^2}$ band into close proximity with the Fermi surface. This, in turn, fosters strong correlations and "strange metal" behavior, thus setting the stage for the eventual emergence of superconductivity. Furthermore, the layer-dependent superconductivity observed hints at a unique interlayer coupling mechanism specific to nickelates, setting them apart from cuprates in this regard. Our findings provide crucial insights into the fundamental mechanisms underpinning superconductivity, while also introducing a new material platform to explore the intricate interplay between the spin/charge order, flat band structures, interlayer coupling, strange metal behavior and high-temperature superconductivity.

Submitted 2 January, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

Comments: Updated: Zero Resistance, Meissner Effect, and Detailed Structural Property Measurements

arXiv:2310.17506 [pdf, other]

Predicting Patient No-Shows in Community Health Clinics: A Case Study in Designing a Data Analytic Product

Authors: Roger D. Peng

Abstract: The data science revolution has highlighted the varying roles that data analytic products can play in different industries and applications. There has been particular interest in using analytic products coupled with algorithmic prediction models to aid in human decision-making. However, detailed descriptions of the decision-making process that leads to the design and development of analytic products are lacking in the statistical literature, making it difficult to accumulate a body of knowledge where students interested in the field of data science may look to learn about this process. In this paper, we present a case study describing the development of an analytic product for predicting whether patients will show up for scheduled appointments at a community health clinic. We consider the stakeholders involved and their interests, along with the real-world analytical and technical trade-offs involved in developing and deploying the product. Our goal here is to highlight the decisions made and evaluate them in the context of possible alternatives. We find that although this case study has some unique characteristics, there are lessons to be learned that could translate to other settings and applications.

Submitted 26 October, 2023; originally announced October 2023.

arXiv:2310.17468 [pdf, other]

Cross-modal Active Complementary Learning with Self-refining Correspondence

Authors: Yang Qin, Yuan Sun, Dezhong Peng, Joey Tianyi Zhou, Xi Peng, Peng Hu

Abstract: Recently, image-text matching has attracted more and more attention from academia and industry, which is fundamental to understanding the latent correspondence across visual and textual modalities. However, most existing methods implicitly assume the training pairs are well-aligned while ignoring the ubiquitous annotation noise, a.k.a. noisy correspondence (NC), thereby inevitably leading to a performance drop. Although some methods attempt to address such noise, they still face two challenging problems: excessive memorizing/overfitting and unreliable correction for NC, especially under high noise. To address the two problems, we propose a generalized Cross-modal Robust Complementary Learning framework (CRCL), which benefits from a novel Active Complementary Loss (ACL) and an efficient Self-refining Correspondence Correction (SCC) to improve the robustness of existing methods. Specifically, ACL exploits active and complementary learning losses to reduce the risk of providing erroneous supervision, leading to theoretically and experimentally demonstrated robustness against NC. SCC utilizes multiple self-refining processes with momentum correction to enlarge the receptive field for correcting correspondences, thereby alleviating error accumulation and achieving accurate and stable corrections. We carry out extensive experiments on three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, to verify the superior robustness of our CRCL against synthetic and real-world noisy correspondences.

Submitted 7 January, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

Comments: This paper is accepted by NeurIPS 2023

arXiv:2310.16809 [pdf, other]

Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation

Authors: Yongxin Shi, Dezhi Peng, Wenhui Liao, Zening Lin, Xinhong Chen, Chongyu Liu, Yuyi Zhang, Lianwen **

Abstract: This paper presents a comprehensive evaluation of the Optical Character Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model (LMM). We assess the model's performance across a range of OCR tasks, including scene text recognition, handwritten text recognition, handwritten mathematical expression recognition, table structure recognition, and information extraction from visually-rich documents. The evaluation reveals that GPT-4V performs well in recognizing and understanding Latin contents, but struggles with multilingual scenarios and complex tasks. Specifically, it showed limitations when dealing with non-Latin languages and complex tasks such as handwritten mathematical expression recognition, table structure recognition, and end-to-end semantic entity recognition and pair extraction from document images. Based on these observations, we affirm the necessity and continued research value of specialized OCR models. In general, despite its versatility in handling diverse OCR tasks, GPT-4V does not outperform existing state-of-the-art OCR models. How to fully utilize pre-trained general-purpose LMMs such as GPT-4V for OCR downstream tasks remains an open problem. The study offers a critical reference for future research in OCR with LMMs. Evaluation pipeline and results are available at https://github.com/SCUT-DLVCLab/GPT-4V_OCR.

Submitted 29 October, 2023; v1 submitted 25 October, 2023; originally announced October 2023.

arXiv:2310.11989 [pdf, other]

Image Clustering with External Guidance

Authors: Yunfan Li, Peng Hu, Dezhong Peng, Jiancheng Lv, Jian** Fan, Xi Peng

Abstract: The core of clustering is incorporating prior knowledge to construct supervision signals. From classic k-means based on data compactness to recent contrastive clustering guided by self-supervision, the evolution of clustering methods intrinsically corresponds to the progression of supervision signals. At present, substantial efforts have been devoted to mining internal supervision signals from data. Nevertheless, the abundant external knowledge such as semantic descriptions, which naturally conduces to clustering, is regrettably overlooked. In this work, we propose leveraging external knowledge as a new supervision signal to guide clustering, even though it seems irrelevant to the given data. To implement and validate our idea, we design an externally guided clustering method (Text-Aided Clustering, TAC), which leverages the textual semantics of WordNet to facilitate image clustering. Specifically, TAC first selects and retrieves WordNet nouns that best distinguish images to enhance the feature discriminability. Then, to improve image clustering performance, TAC collaborates text and image modalities by mutually distilling cross-modal neighborhood information. Experiments demonstrate that TAC achieves state-of-the-art performance on five widely used and three more challenging image clustering benchmarks, including the full ImageNet-1K dataset.

Submitted 16 May, 2024; v1 submitted 18 October, 2023; originally announced October 2023.

Journal ref: ICML 2024

arXiv:2310.08269 [pdf, ps, other]

The Lattice of Group Topologies on a Group

Authors: Dekui Peng

Abstract: For an infinite group $G$, the set of group topologies, $\mathcal{L}_G$, forms a complete lattice. It is known that $\mathcal{L}_G$ is modular if $G$ is abelian but the same result does not hold for nilpotent groups. We prove that the lattice $\mathcal{L}_G$ is semi-modular if and only if $\mathcal{L}_{G/Z}$ is semi-modular, where $Z$ is the centre. As a corollary, for every nilpotent group $G$, the lattice $\mathcal{L}_G$ is semi-modular. Moreover, in the famous Kourovka Notebook, several questions about $\mathcal{L}_G$ were formalized by Arnautov. We answer two of them in this note.

Submitted 12 October, 2023; originally announced October 2023.

Comments: 11 pages

MSC Class: 54A10; 20F18

arXiv:2310.06276 [pdf, other]

GPI 2.0: Performance Evaluation of the Wavefront Sensor's EMCCD

Authors: Clarissa R. Do Ó, Saavidra Perera, Jérôme Maire, Jayke S. Nguyen, Vincent Chambouleyron, Quinn M. Konopacky, Jeffrey Chilcote, Joeleff Fitzsimmons, Randall Hamper, Dan Kerley, Bruce Macintosh, Christian Marois, Fredrik Rantakyrö, Dmitry Savranksy, Jean-Pierre Veran, Guido Agapito, S. Mark Ammons, Marco Bonaglia, Marc-Andre Boucher, Jennifer Dunn, Simone Esposito, Guillaume Filion, Jean Thomas Landry, Olivier Lardiere, Duan Li , et al. (4 additional authors not shown)

Abstract: The Gemini Planet Imager (GPI) is a high contrast imaging instrument that aims to detect and characterize extrasolar planets. GPI is being upgraded to GPI 2.0, with several subsystems receiving a re-design to improve the instrument's contrast. To enable observations on fainter targets and increase stability on brighter ones, one of the upgrades is to the adaptive optics system. The current Shack-Hartmann wavefront sensor (WFS) is being replaced by a pyramid WFS with a low-noise electron multiplying CCD (EMCCD). EMCCDs are detectors capable of counting single photon events at high speed and high sensitivity. In this work, we characterize the performance of the HNü 240 EMCCD from Nüvü Cameras, which was custom-built for GPI 2.0. The HNü 240 EMCCD's characteristics make it well suited for extreme AO: it has low dark current ($<$ 0.01 e-/pix/fr), low readout noise (0.1 e-/pix/fr at a gain of 5000), high quantum efficiency ( 90% at wavelengths from 600-800 nm; 70% from 800-900 nm), and fast readout (up to 3000 fps full frame). Here we present test results on the EMCCD's noise contributors, such as the readout noise, pixel-to-pixel variability and CCD bias. We also tested the linearity and EM gain calibration of the detector. All camera tests were conducted before its integration into the GPI 2.0 PWFS system.

Submitted 9 October, 2023; originally announced October 2023.

Comments: 16 pages, 14 figures. Conference Proceedings for AO4ELT7, held in June 2023 in Avignon, France

arXiv:2309.17065 [pdf, other]

Minimality of the inner automorphism group

Authors: Dekui Peng, Menachem Shlossberg

Abstract: By [6], a minimal group $G$ is called $z$-minimal if $G/Z(G)$ is minimal. In this paper, we present the $z$-Minimality Criterion for dense subgroups with some applications to topological matrix groups. For a locally compact group $G$, let $\operatorname{Inn}(G)$ be the group of all inner automorphisms of $G,$ endowed with the Birkhoff topology. Using a theorem by Goto [14], we obtain our main result which asserts that if $G$ is a connected Lie group and $H\in\{G/Z(G), \operatorname{Inn}(G)\},$ then $H$ is minimal if and only if it is centre-free and topologically isomorphic to $\operatorname{Inn}(G/Z(G)).$ In particular, if $G$ is a connected Lie group with discrete centre, then $\operatorname{Inn}(G)$ is minimal. We prove that a connected locally compact nilpotent group is $z$-minimal if and only if it is compact abelian. In contrast, we show that there exists a connected metabelian $z$-minimal Lie group that is neither compact nor abelian.

Submitted 28 June, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

arXiv:2309.08494 [pdf, other]

Modeling Data Analytic Iteration With Probabilistic Outcome Sets

Authors: Roger D. Peng, Stephanie C. Hicks

Abstract: In 1977 John Tukey described how in exploratory data analysis, data analysts use tools, such as data visualizations, to separate their expectations from what they observe. In contrast to statistical theory, an underappreciated aspect of data analysis is that a data analyst must make decisions by comparing the observed data or output from a statistical tool to what the analyst previously expected from the data. However, there is little formal guidance for how to make these data analytic decisions as statistical theory generally omits a discussion of who is using these statistical methods. In this paper, we propose a model for the iterative process of data analysis based on the analyst's expectations, using what we refer to as expected and anomaly probabilistic outcome sets, and the concept of statistical information gain. Here, we extend the basic idea of comparing an analyst's expectations to what is observed in a data visualization to more general analytic situations. Our model posits that the analyst's goal is to increase the amount of information the analyst has relative to what the analyst already knows, through successive analytic iterations. We introduce two criteria--expected information gain and anomaly information gain--to provide guidance about analytic decision-making and ultimately to improve the practice of data analysis. Finally, we show how our framework can be used to characterize common situations in practical data analysis.

Submitted 1 February, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

Comments: 30 pages

arXiv:2309.08154 [pdf, other]

Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking

Authors: Wenzhang Wei, Zhipeng Gui, Changguang Wu, Anqi Zhao, Dehua Peng, Huayi Wu

Abstract: The core of cross-modal matching is to accurately measure the similarity between different modalities in a unified representation space. However, compared to textual descriptions of a certain perspective, the visual modality has more semantic variations. So, images are usually associated with multiple textual captions in databases. Although popular symmetric embedding methods have explored numerous modal interaction approaches, they often learn toward increasing the average expression probability of multiple semantic variations within image embeddings. Consequently, information entropy in embeddings is increased, resulting in redundancy and decreased accuracy. In this work, we propose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce the information entropy. Specifically, we obtain a set of heterogeneous visual sub-embeddings through dynamic orthogonal constraint loss. To encourage the generated candidate embeddings to capture various semantic variations, we construct a mixed distribution and employ a variance-aware weighting loss to assign different weights to the optimization process. In addition, we develop a Fast Re-ranking strategy (FR) to efficiently evaluate the retrieval results and enhance the performance. We compare the performance with existing set-based methods using four image feature encoders and two text feature encoders on three benchmark datasets: MSCOCO, Flickr30K and CUB Captions. We also show the role of different components by ablation studies and perform a sensitivity analysis of the hyperparameters. The qualitative analysis of visualized bidirectional retrieval and attention maps further demonstrates the ability of our method to encode semantic variations.

Submitted 20 December, 2023; v1 submitted 15 September, 2023; originally announced September 2023.

arXiv:2309.01429 [pdf, other]

doi 10.1109/TGRS.2024.3368168

Adapting Segment Anything Model for Change Detection in HR Remote Sensing Images

Authors: Lei Ding, Kun Zhu, Daifeng Peng, Hao Tang, Kuiwu Yang, Lorenzo Bruzzone

Abstract: Vision Foundation Models (VFMs) such as the Segment Anything Model (SAM) allow zero-shot or interactive segmentation of visual contents, thus they are quickly applied in a variety of visual scenes. However, their direct use in many Remote Sensing (RS) applications is often unsatisfactory due to the special imaging characteristics of RS images. In this work, we aim to utilize the strong visual recognition capabilities of VFMs to improve the change detection of high-resolution Remote Sensing Images (RSIs). We employ the visual encoder of FastSAM, an efficient variant of the SAM, to extract visual representations in RS scenes. To adapt FastSAM to focus on some specific ground objects in the RS scenes, we propose a convolutional adaptor to aggregate the task-oriented change information. Moreover, to utilize the semantic representations that are inherent to SAM features, we introduce a task-agnostic semantic learning branch to model the semantic latent in bi-temporal RSIs. The resulting method, SAMCD, obtains superior accuracy compared to the SOTA methods and exhibits a sample-efficient learning ability that is comparable to semi-supervised CD methods. To the best of our knowledge, this is the first work that adapts VFMs for the CD of HR RSIs.

Submitted 25 January, 2024; v1 submitted 4 September, 2023; originally announced September 2023.

arXiv:2308.13893 [pdf, other]

Unsupervised Domain Adaptation via Domain-Adaptive Diffusion

Authors: Duo Peng, Qiuhong Ke, Yinjie Lei, Jun Liu

Abstract: Unsupervised Domain Adaptation (UDA) is quite challenging due to the large distribution discrepancy between the source domain and the target domain. Inspired by diffusion models, which have a strong capability to gradually convert data distributions across a large gap, we explore the diffusion technique to handle the challenging UDA task. However, using diffusion models to convert data distribution across different domains is a non-trivial problem as the standard diffusion models generally perform conversion from the Gaussian distribution instead of from a specific domain distribution. Besides, during the conversion, the semantics of the source-domain data needs to be preserved for classification in the target domain. To tackle these problems, we propose a novel Domain-Adaptive Diffusion (DAD) module accompanied by a Mutual Learning Strategy (MLS), which can gradually convert data distribution from the source domain to the target domain while enabling the classification model to learn along the domain transition process. Consequently, our method successfully eases the challenge of UDA by decomposing the large domain gap into small ones and gradually enhancing the capacity of the classification model to finally adapt to the target domain. Our method outperforms the current state-of-the-art methods by a large margin on three widely used UDA datasets.

Submitted 26 August, 2023; originally announced August 2023.

Comments: 11 pages, 4 figures

arXiv:2308.13241 [pdf, other]

WSTac: Interactive Surface Perception based on Whisker-Inspired and Self-Illuminated Vision-Based Tactile Sensor

Authors: Kai Chong Lei, Kit Wa Sou, Wang Sing Chan, Jiayi Yan, Siqi **, Dengfeng Peng, Wenbo Ding, Xiao-** Zhang

Abstract: Modern Visual-Based Tactile Sensors (VBTSs) use cost-effective cameras to track elastomer deformation, but struggle with ambient light interference. Solutions typically involve using internal LEDs and blocking external light, thus adding complexity. Creating a VBTS resistant to ambient light with just a camera and an elastomer remains a challenge. In this work, we introduce WStac, a self-illuminating VBTS comprising a mechanoluminescence (ML) whisker elastomer, camera, and 3D printed parts. The ML whisker elastomer, inspired by the touch sensitivity of vibrissae, offers both light isolation and high ML intensity under stress, thereby removing the necessity for additional LED modules. With the incorporation of machine learning, the sensor effectively utilizes the dynamic contact variations of 25 whiskers to successfully perform tasks like speed regression, directional identification, and texture classification. Videos are available at: https://sites.google.com/view/wstac/.

Submitted 25 August, 2023; originally announced August 2023.

arXiv:2308.12350 [pdf, other]

Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation

Authors: Duo Peng, ** Hu, Qiuhong Ke, Jun Liu

Abstract: Translating images from a source domain to a target domain for learning target models is one of the most common strategies in domain adaptive semantic segmentation (DASS). However, existing methods still struggle to preserve semantically-consistent local details between the original and translated images. In this work, we present an innovative approach that addresses this challenge by using source-domain labels as explicit guidance during image translation. Concretely, we formulate cross-domain image translation as a denoising diffusion process and utilize a novel Semantic Gradient Guidance (SGG) method to constrain the translation process, conditioning it on the pixel-wise source labels. Additionally, a Progressive Translation Learning (PTL) strategy is devised to enable the SGG method to work reliably across domains with large gaps. Extensive experiments demonstrate the superiority of our approach over state-of-the-art methods.

Submitted 23 August, 2023; originally announced August 2023.

Comments: Accepted to ICCV2023

arXiv:2308.11164 [pdf, other]

Decoupled Contrastive Multi-View Clustering with High-Order Random Walks

Authors: Yiding Lu, Yijie Lin, Mouxing Yang, Dezhong Peng, Peng Hu, Xi Peng

Abstract: Recently, some robust contrastive multi-view clustering (MvC) methods have been proposed, which construct data pairs from neighborhoods to alleviate the false negative issue, i.e., some intra-cluster samples are wrongly treated as negative pairs. Although promising performance has been achieved by these methods, the false negative issue is still far from addressed and the false positive issue emerges because all in- and out-of-neighborhood samples are simply treated as positive and negative, respectively. To address the issues, we propose a novel robust method, dubbed decoupled contrastive multi-view clustering with high-order random walks (DIVIDE). In brief, DIVIDE leverages random walks to progressively identify data pairs in a global instead of local manner. As a result, DIVIDE could identify in-neighborhood negatives and out-of-neighborhood positives. Moreover, DIVIDE embraces a novel MvC architecture to perform inter- and intra-view contrastive learning in different embedding spaces, thus boosting clustering performance and embracing the robustness against missing views. To verify the efficacy of DIVIDE, we carry out extensive experiments on four benchmark datasets comparing with nine state-of-the-art MvC methods in both complete and incomplete MvC settings.

Submitted 18 January, 2024; v1 submitted 21 August, 2023; originally announced August 2023.

Comments: Accepted by AAAI 2024

arXiv:2308.10147 [pdf, other]

ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer

Authors: Mingxin Huang, Jiaxin Zhang, Dezhi Peng, Hao Lu, Can Huang, Yuliang Liu, Xiang Bai, Lianwen **

Abstract: In recent years, end-to-end scene text spotting approaches have been evolving toward the Transformer-based framework. While previous studies have shown the crucial importance of the intrinsic synergy between text detection and recognition, recent advances in Transformer-based methods usually adopt an implicit synergy strategy with shared query, which cannot fully realize the potential of these two interactive tasks. In this paper, we argue that the explicit synergy considering distinct characteristics of text detection and recognition can significantly improve the performance of text spotting. To this end, we introduce a new model named Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter), which achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder. Specifically, we decompose the conventional shared query into task-aware queries for text polygon and content, respectively. Through the decoder with the proposed vision-language communication module, the queries interact with each other in an explicit manner while preserving discriminative patterns of text detection and recognition, thus improving performance significantly. Additionally, we propose a task-aware query initialization scheme to ensure stable training. Experimental results demonstrate that our model significantly outperforms previous state-of-the-art methods. Code is available at https://github.com/mxin262/ESTextSpotter.

Submitted 19 August, 2023; originally announced August 2023.

Comments: Accepted to ICCV 2023

arXiv:2308.09911 [pdf, other]

Noisy-Correspondence Learning for Text-to-Image Person Re-identification

Authors: Yang Qin, Yingke Chen, Dezhong Peng, Xi Peng, Joey Tianyi Zhou, Peng Hu

Abstract: Text-to-image person re-identification (TIReID) is a compelling topic in the cross-modal community, which aims to retrieve the target person based on a textual query. Although numerous TIReID methods have been proposed and achieved promising performance, they implicitly assume the training image-text pairs are correctly aligned, which is not always the case in real-world scenarios. In practice, some image-text pairs are inevitably under-correlated or even false-correlated, a.k.a. noisy correspondence (NC), due to the low quality of the images and annotation errors. To address this problem, we propose a novel Robust Dual Embedding method (RDE) that can learn robust visual-semantic associations even with NC. Specifically, RDE consists of two main components: 1) A Confident Consensus Division (CCD) module that leverages the dual-grained decisions of dual embedding modules to obtain a consensus set of clean training data, which enables the model to learn correct and reliable visual-semantic associations. 2) A Triplet Alignment Loss (TAL) that relaxes the conventional Triplet Ranking loss with the hardest negative samples to a log-exponential upper bound over all negative ones, thus preventing model collapse under NC while still focusing on hard-negative samples for promising performance. We conduct extensive experiments on three public benchmarks, namely CUHK-PEDES, ICFG-PEDES, and RSTPReID, to evaluate the performance and robustness of our RDE. Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on all three datasets. Code is available at https://github.com/QinYang79/RDE.

Submitted 28 March, 2024; v1 submitted 19 August, 2023; originally announced August 2023.

arXiv:2308.07405 [pdf, ps, other]

More on Rainbow Cliques in Edge-Colored Graphs

Authors: Xiao-Chuan Liu, Danni Peng, Xu Yang

Abstract: In an edge-colored graph $G$, a rainbow clique $K_k$ is a $k$-complete subgraph in which all the edges have distinct colors. Let $e(G)$ and $c(G)$ be the number of edges and colors in $G$, respectively. In this paper, we show that for any $\varepsilon>0$, if $e(G)+c(G) \geq (1+\frac{k-3}{k-2}+2\varepsilon) {n\choose 2}$ and $k\geq 3$, then for sufficiently large $n$, the number of rainbow cliques $K_k$ in $G$ is $\Omega(n^k)$. We also characterize the extremal graphs $G$ without a rainbow clique $K_k$, for $k=4,5$, when $e(G)+c(G)$ is maximum. Our results not only address existing questions but also complete the findings of Ehard and Mohr (Ehard and Mohr, Rainbow triangles and cliques in edge-colored graphs. {\it European Journal of Combinatorics, 84:103037,2020}).

Submitted 14 August, 2023; originally announced August 2023.

Comments: 16 pages

Showing 1–50 of 161 results for author: Peng, D