-
Bulk high-temperature superconductivity in the high-pressure tetragonal phase of bilayer La2PrNi2O7
Authors:
Ningning Wang,
Gang Wang,
Xiaoling Shen,
Jun Hou,
Jun Luo,
** Ma,
Huaixin Yang,
Lifen Shi,
Jie Dou,
Jie Feng,
Jie Yang,
Yunqing Shi,
Zhian Ren,
Hanming Ma,
Pengtao Yang,
Ziyi Liu,
Yue Liu,
Hua Zhang,
Xiaoli Dong,
Yuxin Wang,
Kun Jiang,
Jiangping Hu,
Stuart Calder,
Jiaqiang Yan,
Jianping Sun
, et al. (4 additional authors not shown)
Abstract:
The Ruddlesden-Popper (R-P) bilayer nickelate, La3Ni2O7, was recently found to show signatures of high-temperature superconductivity (HTSC) at pressures above 14 GPa. Subsequent investigations achieved zero resistance in single- and poly-crystalline samples under hydrostatic pressure conditions. Yet, obvious diamagnetic signals, the other hallmark of superconductors, are still lacking owing to the filamentary nature with low superconducting volume fraction. The presence of a novel "1313" polymorph and competing R-P phases obscured proper identification of the phase for HTSC. Thus, achieving bulk HTSC and identifying the phase at play are the most prominent tasks at present. Here, we address these issues in praseodymium (Pr)-doped La2PrNi2O7 polycrystalline samples. We find that the substitution of Pr for La effectively inhibits the intergrowth of different R-P phases, resulting in a nearly pure bilayer structure. For La2PrNi2O7, a pressure-induced orthorhombic-to-tetragonal structural transition takes place at $P_c$ ~ 11 GPa, above which HTSC emerges gradually upon further compression. The superconducting transition temperatures at 18-20 GPa reach $T_c^{\rm onset}$ = 82.5 K and $T_c^{\rm zero}$ = 60 K, which are the highest values among known nickelate superconductors. More importantly, bulk HTSC was verified by detecting clear diamagnetic signals below ~75 K, corresponding to an estimated superconducting volume fraction of ~57(5)% at 20 GPa. Our results not only resolve the existing controversies but also illuminate directions for exploring bulk HTSC in the bilayer nickelates.
Submitted 8 July, 2024;
originally announced July 2024.
-
Language Models Encode Collaborative Signals in Recommendation
Authors:
Leheng Sheng,
An Zhang,
Yi Zhang,
Yuxin Chen,
Xiang Wang,
Tat-Seng Chua
Abstract:
Recent studies empirically indicate that language models (LMs) encode rich world knowledge beyond mere semantics, attracting significant attention across various fields. However, in the recommendation domain, it remains uncertain whether LMs implicitly encode user preference information. Contrary to the prevailing understanding that LMs and traditional recommender models learn two distinct representation spaces due to a huge gap in language and behavior modeling objectives, this work rethinks such understanding and explores extracting a recommendation space directly from the language representation space. Surprisingly, our findings demonstrate that item representations, when linearly mapped from advanced LM representations, yield superior recommendation performance. This outcome suggests the homomorphism between the language representation space and an effective recommendation space, implying that collaborative signals may indeed be encoded within advanced LMs. Motivated by these findings, we propose a simple yet effective collaborative filtering (CF) model named AlphaRec, which utilizes language representations of item textual metadata (e.g., titles) instead of traditional ID-based embeddings. Specifically, AlphaRec is comprised of three main components: a multilayer perceptron (MLP), graph convolution, and contrastive learning (CL) loss function, making it extremely easy to implement and train. Our empirical results show that AlphaRec outperforms leading ID-based CF models on multiple datasets, marking the first instance of such a recommender with text embeddings achieving this level of performance. Moreover, AlphaRec introduces a new language-representation-based CF paradigm with several desirable advantages: being easy to implement, lightweight, rapid convergence, superior zero-shot recommendation abilities in new domains, and being aware of user intention.
Submitted 7 July, 2024;
originally announced July 2024.
-
Leveraging Data Mining, Active Learning, and Domain Adaptation in a Multi-Stage, Machine Learning-Driven Approach for the Efficient Discovery of Advanced Acidic Oxygen Evolution Electrocatalysts
Authors:
Rui Ding,
Jianguo Liu,
Kang Hua,
Xuebin Wang,
Xiaoben Zhang,
Minhua Shao,
Yuxin Chen,
Junhong Chen
Abstract:
Developing advanced catalysts for acidic oxygen evolution reaction (OER) is crucial for sustainable hydrogen production. This study introduces a novel, multi-stage machine learning (ML) approach to streamline the discovery and optimization of complex multi-metallic catalysts. Our method integrates data mining, active learning, and domain adaptation throughout the materials discovery process. Unlike traditional trial-and-error methods, this approach systematically narrows the exploration space using domain knowledge with minimized reliance on subjective intuition. Then the active learning module efficiently refines element composition and synthesis conditions through iterative experimental feedback. The process culminated in the discovery of a promising Ru-Mn-Ca-Pr oxide catalyst. Our workflow also enhances theoretical simulations with a domain adaptation strategy, providing deeper mechanistic insights aligned with experimental findings. By leveraging diverse data sources and multiple ML strategies, we establish an efficient pathway for electrocatalyst discovery and optimization. This comprehensive, data-driven approach represents a paradigm shift and potentially a new benchmark in electrocatalyst research.
Submitted 5 July, 2024;
originally announced July 2024.
-
Corki: Enabling Real-time Embodied AI Robots via Algorithm-Architecture Co-Design
Authors:
Yiyang Huang,
Yuhui Hao,
Bo Yu,
Feng Yan,
Yuxin Yang,
Feng Min,
Yinhe Han,
Lin Ma,
Shaoshan Liu,
Qiang Liu,
Yiming Gan
Abstract:
Embodied AI robots have the potential to fundamentally improve the way human beings live and manufacture. Continued progress in the burgeoning field of using large language models to control robots depends critically on an efficient computing substrate. In particular, today's computing systems for embodied AI robots are designed purely based on the interest of algorithm developers, where robot actions are divided into a discrete frame-basis. Such an execution pipeline creates high latency and energy consumption. This paper proposes Corki, an algorithm-architecture co-design framework for real-time embodied AI robot control. Our idea is to decouple LLM inference, robotic control, and data communication in the embodied AI robot compute pipeline. Instead of predicting the action for a single frame, Corki predicts the trajectory for the near future to reduce the frequency of LLM inference. The algorithm is coupled with hardware that accelerates transforming the trajectory into the actual torque signals used to control robots and an execution pipeline that parallelizes data communication with computation. Corki reduces LLM inference frequency by up to 8.0x, resulting in up to 3.6x speedup. The success rate improvement can be up to 17.3%. Code is provided for re-implementation. https://github.com/hyy0613/Corki
Submitted 5 July, 2024;
originally announced July 2024.
-
Quantum spectral method for gradient and Hessian estimation
Authors:
Yuxin Zhang,
Changpeng Shao
Abstract:
Gradient descent is one of the most basic algorithms for solving continuous optimization problems. In [Jordan, PRL, 95(5):050501, 2005], Jordan proposed the first quantum algorithm for estimating gradients of functions close to linear, with exponential speedup in the black-box model. This quantum algorithm was greatly enhanced and developed by [Gilyén, Arunachalam, and Wiebe, SODA, pp. 1425-1444, 2019], providing a quantum algorithm with optimal query complexity $\widetilde{\Theta}(\sqrt{d}/\varepsilon)$ for a class of smooth functions of $d$ variables, where $\varepsilon$ is the accuracy. This is quadratically faster than classical algorithms for the same problem.
In this work, we continue this research by proposing a new quantum algorithm for another class of functions, namely, analytic functions $f(\boldsymbol{x})$ which are well-defined over the complex field. Given phase oracles to query the real and imaginary parts of $f(\boldsymbol{x})$ respectively, we propose a quantum algorithm that returns an $\varepsilon$-approximation of its gradient with query complexity $\widetilde{O}(1/\varepsilon)$. This achieves exponential speedup over classical algorithms in terms of the dimension $d$. As an extension, we also propose two quantum algorithms for Hessian estimation, aiming to improve quantum analogs of Newton's method. The two algorithms have query complexity $\widetilde{O}(d/\varepsilon)$ and $\widetilde{O}(d^{1.5}/\varepsilon)$, respectively, under different assumptions. Moreover, if the Hessian is promised to be $s$-sparse, we then have two new quantum algorithms with query complexity $\widetilde{O}(s/\varepsilon)$ and $\widetilde{O}(sd/\varepsilon)$, respectively. The former achieves exponential speedup over classical algorithms. We also prove a lower bound of $\widetilde{\Omega}(d)$ for Hessian estimation in the general case.
Submitted 4 July, 2024;
originally announced July 2024.
-
SplitLoRA: A Split Parameter-Efficient Fine-Tuning Framework for Large Language Models
Authors:
Zheng Lin,
Xuanjie Hu,
Yuxin Zhang,
Zhe Chen,
Zihan Fang,
Xianhao Chen,
Ang Li,
Praneeth Vepakomma,
Yue Gao
Abstract:
The scalability of large language models (LLMs) in handling high-complexity models and large-scale datasets has led to tremendous successes in pivotal domains. While there is an urgent need to acquire more training data for LLMs, a concerning reality is the depletion of high-quality public datasets within a few years. In view of this, the federated learning (FL) LLM fine-tuning paradigm recently has been proposed to facilitate collaborative LLM fine-tuning on distributed private data, where multiple data owners collaboratively fine-tune a shared LLM without sharing raw data. However, the staggering model size of LLMs imposes heavy computing and communication burdens on clients, posing significant barriers to the democratization of the FL LLM fine-tuning paradigm. To address this issue, split learning (SL) has emerged as a promising solution by offloading the primary training workload to a server via model partitioning while exchanging activation/activation's gradients with smaller data sizes rather than the entire LLM. Unfortunately, research on the SL LLM fine-tuning paradigm is still in its nascent stage. To fill this gap, in this paper, we propose the first SL LLM fine-tuning framework, named SplitLoRA. SplitLoRA is built on the split federated learning (SFL) framework, amalgamating the advantages of parallel training from FL and model splitting from SL and thus greatly enhancing the training efficiency. It is worth noting that SplitLoRA is the inaugural open-source benchmark for SL LLM fine-tuning, providing a foundation for research efforts dedicated to advancing SL LLM fine-tuning. Extensive simulations validate that SplitLoRA achieves target accuracy in significantly less time than state-of-the-art LLM fine-tuning frameworks, demonstrating the superior training performance of SplitLoRA. The project page is available at https://fduinc.github.io/splitlora/.
Submitted 1 July, 2024;
originally announced July 2024.
-
Critical fluctuation and noise spectra in two-dimensional Fe$_{3}$GeTe$_{2}$ magnets
Authors:
Yuxin Li,
Zhe Ding,
Chen Wang,
Haoyu Sun,
Zhousheng Chen,
Pengfei Wang,
Ya Wang,
Ming Gong,
Hualing Zeng,
Fazhan Shi,
Jiangfeng Du
Abstract:
Critical fluctuations play a fundamental role in determining the spin orders for low-dimensional quantum materials, especially for recently discovered two-dimensional (2D) magnets. Here we employ the quantum decoherence imaging technique utilizing nitrogen-vacancy centers in diamond to explore the critical magnetic fluctuations and the associated temporal spin noise in van der Waals magnet $\rm{Fe_{3}GeTe_{2}}$. We show that the critical fluctuation contributes to a random magnetic field characterized by the noise spectra, which can be changed dramatically near the critical temperature $T_c$. A theoretical model to describe this phenomenon is developed, showing that the spectral density is characterized by a $1/f$ noise near the $T_c$, while away from this point it behaves like a white noise. The crossover at a certain temperature between these two situations is determined by changing of the distance between the sample and the diamond. This work provides a new way to study critical fluctuation and to extract some of the critical exponents, which may greatly deepen our understanding of criticality in a wide range of physical systems.
Submitted 30 June, 2024;
originally announced July 2024.
-
SimTxtSeg: Weakly-Supervised Medical Image Segmentation with Simple Text Cues
Authors:
Yuxin Xie,
Tao Zhou,
Yi Zhou,
Geng Chen
Abstract:
Weakly-supervised medical image segmentation is a challenging task that aims to reduce the annotation cost while keeping the segmentation performance. In this paper, we present a novel framework, SimTxtSeg, that leverages simple text cues to generate high-quality pseudo-labels and simultaneously studies cross-modal fusion in training segmentation models. Our contribution consists of two key components: an effective Textual-to-Visual Cue Converter that produces visual prompts from text prompts on medical images, and a text-guided segmentation model with Text-Vision Hybrid Attention that fuses text and image features. We evaluate our framework on two medical image segmentation tasks, colonic polyp segmentation and MRI brain tumor segmentation, and achieve consistent state-of-the-art performance.
Submitted 28 June, 2024; v1 submitted 27 June, 2024;
originally announced June 2024.
-
Advancing Cross-domain Discriminability in Continual Learning of Vision-Language Models
Authors:
Yicheng Xu,
Yuxin Chen,
Jiahao Nie,
Yusong Wang,
Huiping Zhuang,
Manabu Okumura
Abstract:
Continual learning (CL) with Vision-Language Models (VLMs) has overcome the constraints of traditional CL, which only focuses on previously encountered classes. During the CL of VLMs, we need not only to prevent the catastrophic forgetting on incrementally learned knowledge but also to preserve the zero-shot ability of VLMs. However, existing methods require additional reference datasets to maintain such zero-shot ability and rely on domain-identity hints to classify images across different domains. In this study, we propose Regression-based Analytic Incremental Learning (RAIL), which utilizes a recursive ridge regression-based adapter to learn from a sequence of domains in a non-forgetting manner and decouple the cross-domain correlations by projecting features to a higher-dimensional space. Cooperating with a training-free fusion module, RAIL absolutely preserves the VLM's zero-shot ability on unseen domains without any reference data. Additionally, we introduce Cross-domain Task-Agnostic Incremental Learning (X-TAIL) setting. In this setting, a CL learner is required to incrementally learn from multiple domains and classify test images from both seen and unseen domains without any domain-identity hint. We theoretically prove RAIL's absolute memorization on incrementally learned domains. Experiment results affirm RAIL's state-of-the-art performance in both X-TAIL and existing Multi-domain Task-Incremental Learning settings. The code will be released upon acceptance.
Submitted 26 June, 2024;
originally announced June 2024.
-
FFN: a Fine-grained Chinese-English Financial Domain Parallel Corpus
Authors:
Yuxin Fu,
Shijing Si,
Leyi Mai,
Xi-ang Li
Abstract:
Large Language Models (LLMs) have stunningly advanced the field of machine translation, though their effectiveness within the financial domain remains largely underexplored. To probe this issue, we constructed a fine-grained Chinese-English parallel corpus of financial news called FFN. We acquired financial news articles published from January 1, 2014, to December 31, 2023, from mainstream media websites such as CNN, FOX, and China Daily. The dataset consists of 1,013 main texts and 809 titles, all of which have been manually corrected. We measured the translation quality of two LLMs -- ChatGPT and ERNIE-bot -- using BLEU, TER, and chrF scores as the evaluation metrics. For comparison, we also trained an OpenNMT model based on our dataset. We detail problems of LLMs and provide in-depth analysis, intending to stimulate further research and solutions in this largely uncharted territory. Our research underlines the need to optimize LLMs within the specific field of financial translation to ensure accuracy and quality.
Submitted 26 June, 2024;
originally announced June 2024.
-
Banishing LLM Hallucinations Requires Rethinking Generalization
Authors:
Johnny Li,
Saksham Consul,
Eda Zhou,
James Wong,
Naila Farooqui,
Yuxin Ye,
Nithyashree Manohar,
Zhuxiaona Wei,
Tian Wu,
Ben Echols,
Sharon Zhou,
Gregory Diamos
Abstract:
Despite their powerful chat, coding, and reasoning abilities, Large Language Models (LLMs) frequently hallucinate. Conventional wisdom suggests that hallucinations are a consequence of a balance between creativity and factuality, which can be mitigated, but not eliminated, by grounding the LLM in external knowledge sources. Through extensive systematic experiments, we show that these traditional approaches fail to explain why LLMs hallucinate in practice. Specifically, we show that LLMs augmented with a massive Mixture of Memory Experts (MoME) can easily memorize large datasets of random numbers. We corroborate these experimental findings with a theoretical construction showing that simple neural networks trained to predict the next token hallucinate when the training loss is above a threshold as it usually does in practice when training on internet scale data. We interpret our findings by comparing against traditional retrieval methods for mitigating hallucinations. We use our findings to design a first generation model for removing hallucinations -- Lamini-1 -- that stores facts in a massive mixture of millions of memory experts that are retrieved dynamically.
Submitted 25 June, 2024;
originally announced June 2024.
-
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
Authors:
Yuang Peng,
Yuxin Cui,
Haomiao Tang,
Zekun Qi,
Runpei Dong,
Jing Bai,
Chunrui Han,
Zheng Ge,
Xiangyu Zhang,
Shu-Tao Xia
Abstract:
Personalized image generation holds great promise in assisting humans in everyday work and life due to its impressive function in creatively generating personalized content. However, current evaluations either are automated but misaligned with humans or require human evaluations that are time-consuming and expensive. In this work, we present DreamBench++, a human-aligned benchmark automated by advanced multimodal GPT models. Specifically, we systematically design the prompts to let GPT be both human-aligned and self-aligned, empowered with task reinforcement. Further, we construct a comprehensive dataset comprising diverse images and prompts. By benchmarking 7 modern generative models, we demonstrate that DreamBench++ results in significantly more human-aligned evaluation, helping boost the community with innovative findings.
Submitted 24 June, 2024;
originally announced June 2024.
-
Mixed precision iterative refinement for least squares with linear equality constraints and generalized least squares problems
Authors:
Bowen Gao,
Yuxin Ma,
Meiyue Shao
Abstract:
Recent developments in mixed precision techniques have greatly enhanced the performance of various linear algebra solvers, including those for the least squares problem $\min_{x}\lVert b-Ax\rVert_{2}$. By transforming the least squares problem into an augmented linear system, mixed precision techniques are capable of refining the lower precision solution to the working precision. In this paper, we propose mixed precision iterative refinement algorithms for two variants of the least squares problem -- the least squares problem with linear equality constraints (LSE) and the generalized least squares problem (GLS). Both classical and GMRES-based iterative refinement can be applied to augmented systems of these two problems to improve the accuracy of the solution. For reasonably well-conditioned problems, our algorithms reduce the execution time by 40% on average compared to the fixed precision ones from LAPACK on the x86-64 architecture.
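As a rough illustration of the classical iterative refinement idea mentioned in this abstract, the sketch below refines a low-precision solution of a tiny 2x2 linear system using residuals computed in double precision. It is not the authors' LSE/GLS algorithm: the helper names (`to_f32`, `solve_2x2_low_precision`, `iterative_refinement`) and the example matrix are hypothetical, and rounding the solver output to binary32 is only a crude stand-in for a genuine low-precision factorization.

```python
import struct


def to_f32(value):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def solve_2x2_low_precision(a, rhs):
    """Solve a 2x2 system by Cramer's rule, then round the result to binary32.

    Assumption: rounding the output mimics the error of a low-precision solver.
    """
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (rhs[0] * a[1][1] - a[0][1] * rhs[1]) / det
    x1 = (a[0][0] * rhs[1] - rhs[0] * a[1][0]) / det
    return [to_f32(x0), to_f32(x1)]


def residual(a, x, rhs):
    """Compute r = rhs - a @ x in working (double) precision."""
    return [rhs[i] - (a[i][0] * x[0] + a[i][1] * x[1]) for i in range(2)]


def iterative_refinement(a, rhs, steps=3):
    """Classical refinement: cheap solve, accurate residual, cheap correction."""
    x = solve_2x2_low_precision(a, rhs)
    for _ in range(steps):
        r = residual(a, x, rhs)
        d = solve_2x2_low_precision(a, r)  # correction from the cheap solver
        x = [x[0] + d[0], x[1] + d[1]]
    return x


A = [[4.0, 1.0], [1.0, 3.0]]  # hypothetical well-conditioned example
b = [1.0, 2.0]                # exact solution is [1/11, 7/11]
x_low = solve_2x2_low_precision(A, b)
x_refined = iterative_refinement(A, b)
```

For this well-conditioned example, `x_refined` should agree with the exact solution far more closely than `x_low`, even though every solve is only single-precision accurate.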
Submitted 24 June, 2024;
originally announced June 2024.
-
MIRReS: Multi-bounce Inverse Rendering using Reservoir Sampling
Authors:
Yuxin Dai,
Qi Wang,
Jingsen Zhu,
Dianbing Xi,
Yuchi Huo,
Chen Qian,
Ying He
Abstract:
We present MIRReS, a novel two-stage inverse rendering framework that jointly reconstructs and optimizes the explicit geometry, material, and lighting from multi-view images. Unlike previous methods that rely on implicit irradiance fields or simplified path tracing algorithms, our method extracts an explicit geometry (triangular mesh) in stage one, and introduces a more realistic physically-based inverse rendering model that utilizes multi-bounce path tracing and Monte Carlo integration. By leveraging multi-bounce path tracing, our method effectively estimates indirect illumination, including self-shadowing and internal reflections, which improves the intrinsic decomposition of shape, material, and lighting. Moreover, we incorporate reservoir sampling into our framework to address the noise in Monte Carlo integration, enhancing convergence and facilitating gradient-based optimization with low sample counts. Through qualitative and quantitative evaluation of several scenarios, especially in challenging scenarios with complex shadows, we demonstrate that our method achieves state-of-the-art performance on decomposition results. Additionally, our optimized explicit geometry enables applications such as scene editing, relighting, and material editing with modern graphics engines or CAD software. The source code is available at https://brabbitdousha.github.io/MIRReS/
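For readers unfamiliar with reservoir sampling, the sketch below shows the generic textbook form of single-sample weighted reservoir sampling over a stream of candidates. It is not the MIRReS renderer; the `Reservoir` class, the candidate labels, and the weights are all hypothetical and chosen only to illustrate the streaming update.

```python
import random


class Reservoir:
    """Keep one candidate from a stream, chosen with probability proportional to its weight."""

    def __init__(self):
        self.sample = None
        self.weight_sum = 0.0
        self.count = 0

    def update(self, candidate, weight, rng):
        """Offer one candidate; zero-weight candidates are counted but never kept."""
        self.count += 1
        if weight <= 0.0:
            return
        self.weight_sum += weight
        # Replace the stored sample with probability weight / weight_sum.
        if rng.random() < weight / self.weight_sum:
            self.sample = candidate


rng = random.Random(0)
reservoir = Reservoir()
# Hypothetical stream of (candidate, weight) pairs, e.g. light samples with contributions.
for candidate, weight in [("a", 0.0), ("b", 2.0), ("c", 0.0)]:
    reservoir.update(candidate, weight, rng)
```

Because the reservoir stores only the running weight sum and one sample, the stream never has to be held in memory, which is what makes the technique attractive at low sample counts.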
Submitted 24 June, 2024; v1 submitted 24 June, 2024;
originally announced June 2024.
-
MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention
Authors:
Yuxin Chen,
Chen Tang,
Chenran Li,
Ran Tian,
Peter Stone,
Masayoshi Tomizuka,
Wei Zhan
Abstract:
Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy's execution and provides interventions as feedback. However, existing methods often fail to utilize the prior policy efficiently to facilitate learning, thus hindering sample efficiency. In this work, we introduce MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning), designed for sample-efficient alignment from human intervention. Instead of inferring the complete human behavior characteristics, MEReQ infers a residual reward function that captures the discrepancy between the human expert's and the prior policy's underlying reward functions. It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function. Extensive evaluations on simulated and real-world tasks demonstrate that MEReQ achieves sample-efficient policy alignment from human intervention.
Submitted 23 June, 2024;
originally announced June 2024.
-
SimCE: Simplifying Cross-Entropy Loss for Collaborative Filtering
Authors:
Xiaodong Yang,
Huiyuan Chen,
Yuchen Yan,
Yuxin Tang,
Yuying Zhao,
Eric Xu,
Yiwei Cai,
Hanghang Tong
Abstract:
The learning objective is integral to collaborative filtering systems, where the Bayesian Personalized Ranking (BPR) loss is widely used for learning informative backbones. However, BPR often experiences slow convergence and suboptimal local optima, partially because it only considers one negative item for each positive item, neglecting the potential impacts of other unobserved items. To address this issue, the recently proposed Sampled Softmax Cross-Entropy (SSM) compares one positive sample with multiple negative samples, leading to better performance. Our comprehensive experiments confirm that recommender systems consistently benefit from multiple negative samples during training. Furthermore, we introduce a Simplified Sampled Softmax Cross-Entropy Loss (SimCE), which simplifies the SSM using its upper bound. Our validation on 12 benchmark datasets, using both MF and LightGCN backbones, shows that SimCE significantly outperforms both BPR and SSM.
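To make the contrast between BPR and sampled softmax concrete, here is a minimal sketch of the two losses for a single positive item. It assumes the common formulation in which the positive score is included in the softmax denominator; the function names and scores are hypothetical, and SimCE itself is not implemented because the abstract does not give its exact form.

```python
import math


def bpr_loss(pos_score, neg_score):
    """BPR for one (positive, negative) pair: -log(sigmoid(pos - neg))."""
    return math.log1p(math.exp(-(pos_score - neg_score)))


def sampled_softmax_loss(pos_score, neg_scores):
    """Sampled softmax cross-entropy: one positive against several negatives.

    Assumption: the positive score is part of the softmax denominator.
    """
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - pos_score


one_negative_bpr = bpr_loss(2.0, 0.5)
one_negative_ssm = sampled_softmax_loss(2.0, [0.5])
two_negative_ssm = sampled_softmax_loss(2.0, [0.5, 1.0])
```

Under this formulation the two losses coincide when there is exactly one negative, so sampled softmax can be read as BPR extended to several negatives per positive.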
Submitted 23 June, 2024;
originally announced June 2024.
-
GenQA: Generating Millions of Instructions from a Handful of Prompts
Authors:
Jiuhai Chen,
Rifaa Qadri,
Yuxin Wen,
Neel Jain,
John Kirchenbauer,
Tianyi Zhou,
Tom Goldstein
Abstract:
Most public instruction finetuning datasets are relatively small compared to the closed source datasets used to train industry models. To study questions about finetuning at scale, such as curricula and learning rate cooldown schedules, there is a need for industrial-scale datasets. However, this scale necessitates a data generation process that is almost entirely automated. In this work, we study methods for generating large instruction datasets from a single prompt. With little human oversight, we get LLMs to write diverse sets of instruction examples ranging from simple completion tasks to complex multi-turn dialogs across a variety of subject areas. When finetuning a Llama-3 8B base model, our dataset meets or exceeds both WizardLM and Ultrachat on both knowledge-intensive leaderboard tasks as well as conversational evaluations. We release our dataset, the "generator" prompts that created it, and our finetuned model checkpoints.
Submitted 14 June, 2024;
originally announced June 2024.
-
Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs
Authors:
Abhimanyu Hans,
Yuxin Wen,
Neel Jain,
John Kirchenbauer,
Hamid Kazemi,
Prajwal Singhania,
Siddharth Singh,
Gowthami Somepalli,
Jonas Geiping,
Abhinav Bhatele,
Tom Goldstein
Abstract:
Large language models can memorize and repeat their training data, causing privacy and copyright risks. To mitigate memorization, we introduce a subtle modification to the next-token training objective that we call the goldfish loss. During training, a randomly sampled subset of tokens are excluded from the loss computation. These dropped tokens are not memorized by the model, which prevents verbatim reproduction of a complete chain of tokens from the training set. We run extensive experiments training billion-scale Llama-2 models, both pre-trained and trained from scratch, and demonstrate significant reductions in extractable memorization with little to no impact on downstream benchmarks.
Submitted 14 June, 2024;
originally announced June 2024.
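The idea in the abstract above, excluding a random subset of tokens from the next-token loss, can be sketched as follows. This is an illustration only. The independent Bernoulli mask, the `drop_prob` parameter, and both function names are assumptions, and the abstract does not specify how the authors actually choose which tokens to drop.

```python
import numpy as np

def masked_next_token_loss(logits, targets, keep_mask):
    """Mean cross-entropy over kept positions only; dropped tokens contribute nothing."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    keep = np.asarray(keep_mask, dtype=bool)
    if not keep.any():
        return 0.0  # nothing kept: no training signal for this sequence
    return float(token_nll[keep].mean())

def random_keep_mask(num_tokens, drop_prob, rng):
    """Hypothetical masking rule: drop each token independently with probability drop_prob."""
    return rng.random(num_tokens) >= drop_prob
```

Because dropped positions never enter the average, the model gets no gradient pushing it to reproduce those exact tokens.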
-
How Does Distribution Matching Help Domain Generalization: An Information-theoretic Analysis
Authors:
Yuxin Dong,
Tieliang Gong,
Hong Chen,
Shuangyong Song,
Weizhan Zhang,
Chen Li
Abstract:
Domain generalization aims to learn invariance across multiple training domains, thereby enhancing generalization against out-of-distribution data. While gradient or representation matching algorithms have achieved remarkable success, these methods generally lack generalization guarantees or depend on strong assumptions, leaving a gap in understanding the underlying mechanism of distribution matching. In this work, we formulate domain generalization from a novel probabilistic perspective, ensuring robustness while avoiding overly conservative solutions. Through comprehensive information-theoretic analysis, we provide key insights into the roles of gradient and representation matching in promoting generalization. Our results reveal the complementary relationship between these two components, indicating that existing works focusing solely on either gradient or representation alignment are insufficient to solve the domain generalization problem. In light of these theoretical findings, we introduce IDM to simultaneously align the inter-domain gradients and representations. Integrated with the proposed PDM method for complex distribution matching, IDM achieves superior performance over various baseline methods.
Submitted 14 June, 2024;
originally announced June 2024.
-
Non-$μ$-ordinary smooth cyclic covers of $\mathbb{P}^1$
Authors:
Yuxin Lin,
Elena Mantovan,
Deepesh Singhal
Abstract:
Given a family of cyclic covers of $\mathbb{P}^1$ and a prime $p$ of good reduction, by [12] the generic Newton polygon (resp. Ekedahl--Oort type) in the family ($μ$-ordinary) is known. In this paper, we investigate the existence of non-$μ$-ordinary smooth curves in the family. In particular, under some auxiliary conditions, we show that when $p$ is sufficiently large the complement of the $μ$-ordinary locus is always non empty, and for $1$-dimensional families with condition on signature type, we obtain a lower bound for the number of non-$μ$-ordinary smooth curves. In specific examples, for small $m$, the above general statement can be improved, and we establish the non emptiness of all codimension 1 non-$μ$-ordinary Newton/Ekedahl--Oort strata ({\em almost} $μ$-ordinary). Our method relies on further study of the extended Hasse-Witt matrix initiated in [12].
Submitted 13 June, 2024;
originally announced June 2024.
-
On Softmax Direct Preference Optimization for Recommendation
Authors:
Yuxin Chen,
Junfei Tan,
An Zhang,
Zhengyi Yang,
Leheng Sheng,
Enzhi Zhang,
Xiang Wang,
Tat-Seng Chua
Abstract:
Recommender systems aim to predict personalized rankings based on user preference data. With the rise of Language Models (LMs), LM-based recommenders have been widely explored due to their extensive world knowledge and powerful reasoning abilities. Most of the LM-based recommenders convert historical interactions into language prompts, pairing with a positive item as the target response and fine-tuning LM with a language modeling loss. However, the current objective fails to fully leverage preference data and is not optimized for personalized ranking tasks, which hinders the performance of LM-based recommenders. Inspired by the current advancement of Direct Preference Optimization (DPO) in human preference alignment and the success of softmax loss in recommendations, we propose Softmax-DPO (S-DPO) to instill ranking information into the LM to help LM-based recommenders distinguish preferred items from negatives, rather than solely focusing on positives. Specifically, we incorporate multiple negatives in user preference data and devise an alternative version of DPO loss tailored for LM-based recommenders, connected to softmax sampling strategies. Theoretically, we bridge S-DPO with the softmax loss over negative sampling and find that it has a side effect of mining hard negatives, which assures its exceptional capabilities in recommendation tasks. Empirically, extensive experiments conducted on three real-world datasets demonstrate the superiority of S-DPO to effectively model user preference and further boost recommendation performance while mitigating the data likelihood decline issue of DPO. Our codes are available at https://github.com/chenyuxin1999/S-DPO.
Submitted 14 June, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
-
FlamePINN-1D: Physics-informed neural networks to solve forward and inverse problems of 1D laminar flames
Authors:
Jiahao Wu,
Su Zhang,
Yuxin Wu,
Guihua Zhang,
Xin Li,
Hai Zhang
Abstract:
Given the existence of various forward and inverse problems in combustion studies and applications that necessitate distinct methods for resolution, a framework to solve them in a unified way is critically needed. A promising approach is the integration of machine learning methods with governing equations of combustion systems, which exhibits superior generality and few-shot learning ability compared to purely data-driven methods. In this work, the FlamePINN-1D framework is proposed to solve the forward and inverse problems of 1D laminar flames based on physics-informed neural networks. Three cases with increasing complexity have been tested: Case 1 are freely-propagating premixed (FPP) flames with simplified physical models, while Case 2 and Case 3 are FPP and counterflow premixed (CFP) flames with detailed models, respectively. For forward problems, FlamePINN-1D aims to solve the flame fields and infer the unknown eigenvalues (such as laminar flame speeds) under the constraints of governing equations and boundary conditions. For inverse problems, FlamePINN-1D aims to reconstruct the continuous fields and infer the unknown parameters (such as transport and chemical kinetics parameters) from noisy sparse observations of the flame. Our results strongly validate these capabilities of FlamePINN-1D across various flames and working conditions. Compared to traditional methods, FlamePINN-1D is differentiable and mesh-free, exhibits no discretization errors, and is easier to implement for inverse problems. The inverse problem results also indicate the possibility of optimizing chemical mechanisms from measurements of laboratory 1D flames. Furthermore, some proposed strategies, such as hard constraints and thin-layer normalization, are proven to be essential for the robust learning of FlamePINN-1D. The code for this paper is partially available at https://github.com/CAME-THU/FlamePINN-1D.
Submitted 7 June, 2024;
originally announced June 2024.
-
Kinematics and star formation of hub-filament systems in W49A
Authors:
WenJun Zhang,
Jianjun Zhou,
Jarken Esimbek,
Willem Baan,
Yuxin He,
Xindi Tang,
Dalei Li,
Weiguang Ji,
Gang Wu,
Yingxiu Ma,
Jiasheng Li,
Dongdong Zhou,
Kadirya Tursun,
Toktarkhan Komesh
Abstract:
W49A is a prominent giant molecular cloud (GMC) that exhibits strong star formation activities, yet its structural and kinematic properties remain uncertain. Our study aims to investigate the large-scale structure and kinematics of W49A, and elucidate the role of filaments and hub-filament systems (HFSs) in its star formation activity. We utilized continuum data from Herschel and the James Clerk Maxwell Telescope (JCMT) as well as the molecular lines 12CO (3-2), 13CO (3-2), and C18O (3-2) to identify filaments and HFS structures within W49A. Further analysis focused on the physical properties, kinematics, and mass transport within these structures. Additionally, recombination line emission from the H I/OH/Recombination (THOR) line survey was employed to trace the central H II region and ionized gas. Our findings reveal that W49A comprises one blue-shifted (B-S) HFS and one red-shifted (R-S) HFS, each with multiple filaments and dense hubs. Notably, significant velocity gradients were detected along these filaments, indicative of material transport toward the hubs. High mass accretion rates along the filaments facilitate the formation of massive stars in the HFSs. Furthermore, the presence of V-shaped structures around clumps in position-velocity diagrams suggests ongoing gravitational collapse and local star formation within the filaments. Our results indicate that W49A consists of one R-S HFS and one B-S HFS, and that the material transport from filaments to the hub promotes the formation of massive stars in the hub. These findings underscore the significance of HFSs in shaping the star formation history of W49A.
Submitted 13 June, 2024;
originally announced June 2024.
-
TraceMesh: Scalable and Streaming Sampling for Distributed Traces
Authors:
Zhuangbin Chen,
Zhihan Jiang,
Yuxin Su,
Michael R. Lyu,
Zibin Zheng
Abstract:
Distributed tracing serves as a fundamental element in the monitoring of cloud-based and datacenter systems. It provides visibility into the full lifecycle of a request or operation across multiple services, which is essential for understanding system dependencies and performance bottlenecks. To mitigate computational and storage overheads, most tracing frameworks adopt a uniform sampling strategy, which inevitably captures overlapping and redundant information. More advanced methods employ learning-based approaches to bias the sampling toward more informative traces. However, existing methods fall short of considering the high-dimensional and dynamic nature of trace data, which is essential for the production deployment of trace sampling. To address these practical challenges, in this paper we present TraceMesh, a scalable and streaming sampler for distributed traces. TraceMesh employs Locality-Sensitivity Hashing (LSH) to improve sampling efficiency by projecting traces into a low-dimensional space while preserving their similarity. In this process, TraceMesh accommodates previously unseen trace features in a unified and streamlined way. Subsequently, TraceMesh samples traces through evolving clustering, which dynamically adjusts the sampling decision to avoid over-sampling of recurring traces. The proposed method is evaluated with trace data collected from both open-source microservice benchmarks and production service systems. Experimental results demonstrate that TraceMesh outperforms state-of-the-art methods by a significant margin in both sampling accuracy and efficiency.
Submitted 11 June, 2024;
originally announced June 2024.
-
Artificial Intelligence for Neuro MRI Acquisition: A Review
Authors:
Hongjia Yang,
Guanhua Wang,
Ziyu Li,
Haoxiang Li,
Jialan Zheng,
Yuxin Hu,
Xiaozhi Cao,
Congyu Liao,
Huihui Ye,
Qiyuan Tian
Abstract:
Magnetic resonance imaging (MRI) has significantly benefited from the resurgence of artificial intelligence (AI). By leveraging AI's capabilities in large-scale optimization and pattern recognition, innovative methods are transforming the MRI acquisition workflow, including planning, sequence design, and correction of acquisition artifacts. These emerging algorithms demonstrate substantial potential in enhancing the efficiency and throughput of acquisition steps. This review discusses several pivotal AI-based methods in neuro MRI acquisition, focusing on their technological advances, impact on clinical practice, and potential risks.
Submitted 9 June, 2024;
originally announced June 2024.
-
Evolution-aware VAriance (EVA) Coreset Selection for Medical Image Classification
Authors:
Yuxin Hong,
Xiao Zhang,
Xin Zhang,
Joey Tianyi Zhou
Abstract:
In the medical field, managing high-dimensional massive medical imaging data and performing reliable medical analysis from it is a critical challenge, especially in resource-limited environments such as remote medical facilities and mobile devices. This necessitates effective dataset compression techniques to reduce storage, transmission, and computational cost. However, existing coreset selection methods are primarily designed for natural image datasets, and exhibit doubtful effectiveness when applied to medical image datasets due to challenges such as intra-class variation and inter-class similarity. In this paper, we propose a novel coreset selection strategy termed as Evolution-aware VAriance (EVA), which captures the evolutionary process of model training through a dual-window approach and reflects the fluctuation of sample importance more precisely through variance measurement. Extensive experiments on medical image datasets demonstrate the effectiveness of our strategy over previous SOTA methods, especially at high compression rates. EVA achieves 98.27% accuracy with only 10% training data, compared to 97.20% for the full training set. None of the compared baseline methods can exceed Random at 5% selection rate, while EVA outperforms Random by 5.61%, showcasing its potential for efficient medical image analysis.
Submitted 9 June, 2024;
originally announced June 2024.
-
VerilogReader: LLM-Aided Hardware Test Generation
Authors:
Ruiyang Ma,
Yuxin Yang,
Ziqian Liu,
Jiaxi Zhang,
Min Li,
Junhua Huang,
Guojie Luo
Abstract:
Test generation has been a critical and labor-intensive process in hardware design verification. Recently, the emergence of Large Language Model (LLM) with their advanced understanding and inference capabilities, has introduced a novel approach. In this work, we investigate the integration of LLM into the Coverage Directed Test Generation (CDG) process, where the LLM functions as a Verilog Reader. It accurately grasps the code logic, thereby generating stimuli that can reach unexplored code branches. We compare our framework with random testing, using our self-designed Verilog benchmark suite. Experiments demonstrate that our framework outperforms random testing on designs within the LLM's comprehension scope. Our work also proposes prompt engineering optimizations to augment LLM's understanding scope and accuracy.
Submitted 3 June, 2024;
originally announced June 2024.
-
On Affine Homotopy between Language Encoders
Authors:
Robin SM Chan,
Reda Boumasmoud,
Anej Svete,
Yuxin Ren,
Qipeng Guo,
Zhijing Jin,
Shauli Ravfogel,
Mrinmaya Sachan,
Bernhard Schölkopf,
Mennatallah El-Assady,
Ryan Cotterell
Abstract:
Pre-trained language encoders -- functions that represent text as vectors -- are an integral component of many NLP tasks. We tackle a natural question in language encoder analysis: What does it mean for two encoders to be similar? We contend that a faithful measure of similarity needs to be \emph{intrinsic}, that is, task-independent, yet still be informative of \emph{extrinsic} similarity -- the performance on downstream tasks. It is common to consider two encoders similar if they are \emph{homotopic}, i.e., if they can be aligned through some transformation. In this spirit, we study the properties of \emph{affine} alignment of language encoders and its implications on extrinsic similarity. We find that while affine alignment is fundamentally an asymmetric notion of similarity, it is still informative of extrinsic similarity. We confirm this on datasets of natural language representations. Beyond providing useful bounds on extrinsic similarity, affine intrinsic similarity also allows us to begin uncovering the structure of the space of pre-trained encoders by defining an order over them.
Submitted 4 June, 2024;
originally announced June 2024.
-
Effects of Exponential Gaussian Distribution on (Double Sampling) Randomized Smoothing
Authors:
Youwei Shu,
Xi Xiao,
Derui Wang,
Yuxin Cao,
Siji Chen,
Jason Xue,
Linyi Li,
Bo Li
Abstract:
Randomized Smoothing (RS) is currently a scalable certified defense method providing robustness certification against adversarial examples. Although significant progress has been achieved in providing defenses against $\ell_p$ adversaries, the interaction between the smoothing distribution and the robustness certification still remains vague. In this work, we comprehensively study the effect of two families of distributions, named Exponential Standard Gaussian (ESG) and Exponential General Gaussian (EGG) distributions, on Randomized Smoothing and Double Sampling Randomized Smoothing (DSRS). We derive an analytic formula for ESG's certified radius, which converges to the origin formula of RS as the dimension $d$ increases. Additionally, we prove that EGG can provide tighter constant factors than DSRS in providing $Ω(\sqrt{d})$ lower bounds of $\ell_2$ certified radius, and thus further addresses the curse of dimensionality in RS. Our experiments on real-world datasets confirm our theoretical analysis of the ESG distributions, that they provide almost the same certification under different exponents $η$ for both RS and DSRS. In addition, EGG brings a significant improvement to the DSRS certification, but the mechanism can be different when the classifier properties are different. Compared to the primitive DSRS, the increase in certified accuracy provided by EGG is prominent, up to 6.4% on ImageNet.
Submitted 5 June, 2024; v1 submitted 4 June, 2024;
originally announced June 2024.
-
On the Nonlinearity of Layer Normalization
Authors:
Yunhao Ni,
Yuxin Guo,
Junlong Jia,
Lei Huang
Abstract:
Layer normalization (LN) is a ubiquitous technique in deep learning, but our theoretical understanding of it remains elusive. This paper investigates a new theoretical direction for LN, regarding its nonlinearity and representation capacity. We investigate the representation capacity of a network with layerwise composition of linear and LN transformations, referred to as LN-Net. We theoretically show that, given $m$ samples with any label assignment, an LN-Net with only 3 neurons in each layer and $O(m)$ LN layers can correctly classify them. We further show the lower bound of the VC dimension of an LN-Net. The nonlinearity of LN can be amplified by group partition, which is also theoretically demonstrated under mild assumptions and empirically supported by our experiments. Based on our analyses, we consider designing neural architectures that exploit and amplify the nonlinearity of LN, and their effectiveness is supported by our experiments.
Submitted 3 June, 2024;
originally announced June 2024.
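The nonlinearity of LN that the abstract above refers to can be checked numerically with a minimal sketch. It assumes the plain form of layer normalization (subtract the mean, divide by the standard deviation, no learnable gain or bias). The function name `layer_norm` is illustrative, and the paper's exact definition may differ.

```python
import numpy as np

def layer_norm(x, eps=0.0):
    """Plain layer normalization of one vector: zero mean, unit variance."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return centered / np.sqrt((centered ** 2).mean() + eps)

x = np.array([1.0, 2.0, 4.0])
y = np.array([3.0, -1.0, 0.0])

# A linear map would satisfy f(2x) == 2 f(x) and f(x + y) == f(x) + f(y).
scale_invariant = np.allclose(layer_norm(2 * x), layer_norm(x))
additive = np.allclose(layer_norm(x + y), layer_norm(x) + layer_norm(y))
```

Here `scale_invariant` is true and `additive` is false. LN ignores positive rescaling of its input and does not distribute over sums, so it cannot be a linear map.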
-
Optimizing Age of Information in Random Access Networks: A Second-Order Approach for Active/Passive Users
Authors:
Siqi Fan,
Yuxin Zhong,
I-Hong Hou,
Clement K Kam
Abstract:
In this paper, we study the moments of the Age of Information (AoI) for both active and passive users in a random access network. In this network, active users broadcast sensing data, while passive users detect in-band radio activities from out-of-network devices, such as jammers. Collisions occur when multiple active users transmit simultaneously. Passive users can detect radio activities only when no active user transmits. Each active user's transmission behavior follows a Markov process. We aim to minimize the weighted sum of any moments of AoI for both user types. To achieve this, we employ a second-order analysis of system behavior. Specifically, we characterize an active user's transmission Markov process using its mean and temporal variance. We show that any moment of the AoI can be approximated by a function of these two parameters. This insight enables us to analyze and optimize the transmission Markov process for active users. We apply this strategy to two different random access models. Simulation results show that policies derived from this strategy outperform other baseline policies.
Submitted 1 June, 2024;
originally announced June 2024.
-
You Only Scan Once: Efficient Multi-dimension Sequential Modeling with LightNet
Authors:
Zhen Qin,
Yuxin Mao,
Xuyang Shen,
Dong Li,
Jing Zhang,
Yuchao Dai,
Yiran Zhong
Abstract:
Linear attention mechanisms have gained prominence in causal language models due to their linear computational complexity and enhanced speed. However, the inherent decay mechanism in linear attention presents challenges when applied to multi-dimensional sequence modeling tasks, such as image processing and multi-modal learning. In these scenarios, the utilization of sequential scanning to establish a global receptive field necessitates multiple scans for multi-dimensional data, thereby leading to inefficiencies. This paper identifies the inefficiency caused by a multiplicative linear recurrence and proposes an efficient alternative additive linear recurrence to avoid the issue, as it can handle multi-dimensional data within a single scan. We further develop an efficient multi-dimensional sequential modeling framework called LightNet based on the new recurrence. Moreover, we present two new multi-dimensional linear relative positional encoding methods, MD-TPE and MD-LRPE to enhance the model's ability to discern positional information in multi-dimensional scenarios. Our empirical evaluations across various tasks, including image classification, image generation, bidirectional language modeling, and autoregressive language modeling, demonstrate the efficacy of LightNet, showcasing its potential as a versatile and efficient solution for multi-dimensional sequential modeling.
Submitted 31 May, 2024;
originally announced May 2024.
-
SPARE: Symmetrized Point-to-Plane Distance for Robust Non-Rigid Registration
Authors:
Yuxin Yao,
Bailin Deng,
Junhui Hou,
Juyong Zhang
Abstract:
Existing optimization-based methods for non-rigid registration typically minimize an alignment error metric based on the point-to-point or point-to-plane distance between corresponding point pairs on the source surface and target surface. However, these metrics can result in slow convergence or a loss of detail. In this paper, we propose SPARE, a novel formulation that utilizes a symmetrized point-to-plane distance for robust non-rigid registration. The symmetrized point-to-plane distance relies on both the positions and normals of the corresponding points, resulting in a more accurate approximation of the underlying geometry and can achieve higher accuracy than existing methods. To solve this optimization problem efficiently, we propose an alternating minimization solver using a majorization-minimization strategy. Moreover, for effective initialization of the solver, we incorporate a deformation graph-based coarse alignment that improves registration quality and efficiency. Extensive experiments show that the proposed method greatly improves the accuracy of non-rigid registration problems and maintains relatively high solution efficiency. The code is publicly available at https://github.com/yaoyx689/spare.
Submitted 30 May, 2024;
originally announced May 2024.
-
Unbending strategies shepherd cooperation and suppress extortion in spatial populations
Authors:
Zijie Chen,
Yuxin Geng,
Xingru Chen,
Feng Fu
Abstract:
Evolutionary game dynamics on networks typically consider the competition among simple strategies such as cooperation and defection in the Prisoner's Dilemma and summarize the effect of population structure as network reciprocity. However, it remains largely unknown regarding the evolutionary dynamics involving multiple powerful strategies typically considered in repeated games, such as the zero-determinant (ZD) strategies that are able to enforce a linear payoff relationship between them and their co-players. Here, we consider the evolutionary dynamics of always cooperate (AllC), extortionate ZD (extortioners), and unbending players in lattice populations based on the commonly used death-birth updating. Out of the class of unbending strategies, we consider a particular candidate, PSO Gambler, a machine-learning-optimized memory-one strategy, which can foster reciprocal cooperation and fairness among extortionate players. We derive analytical results under weak selection and rare mutations, including pairwise fixation probabilities and long-term frequencies of strategies. In the absence of the third unbending type, extortioners can achieve a half-half split in equilibrium with unconditional cooperators for sufficiently large extortion factors. However, the presence of unbending players fundamentally changes the dynamics and tilts the system to favor unbending cooperation. Most surprisingly, extortioners cannot dominate at all regardless of how large their extortion factor is, and the long-term frequency of unbending players is maintained almost as a constant. Our analytical method is applicable to studying the evolutionary dynamics of multiple strategies in structured populations. Our work provides insights into the interplay between network reciprocity and direct reciprocity, revealing the role of unbending strategies in enforcing fairness and suppressing extortion.
Submitted 29 May, 2024;
originally announced May 2024.
-
UniPTS: A Unified Framework for Proficient Post-Training Sparsity
Authors:
**g**g Xie,
Yuxin Zhang,
Mingbao Lin,
Zhihang Lin,
Liujuan Cao,
Rongrong Ji
Abstract:
Post-training Sparsity (PTS) is a recently emerged avenue that chases efficient network sparsity with limited data in need. Existing PTS methods, however, undergo significant performance degradation compared with traditional methods that retrain the sparse networks via the whole dataset, especially at high sparsity ratios. In this paper, we attempt to reconcile this disparity by transposing three cardinal factors that profoundly alter the performance of conventional sparsity into the context of PTS. Our endeavors particularly comprise (1) A base-decayed sparsity objective that promotes efficient knowledge transferring from dense network to the sparse counterpart. (2) A reducing-regrowing search algorithm designed to ascertain the optimal sparsity distribution while circumventing overfitting to the small calibration set in PTS. (3) The employment of dynamic sparse training predicated on the preceding aspects, aimed at comprehensively optimizing the sparsity structure while ensuring training stability. Our proposed framework, termed UniPTS, is validated to be much superior to existing PTS methods across extensive benchmarks. As an illustration, it amplifies the performance of POT, a recently proposed recipe, from 3.9% to 68.6% when pruning ResNet-50 at 90% sparsity ratio on ImageNet. We release the code of our paper at https://github.com/xjjxmu/UniPTS.
Submitted 29 May, 2024;
originally announced May 2024.
-
De Bruijn Polyominoes
Authors:
D. Condon,
Yuxin Wang,
E. Yang
Abstract:
We introduce the notions of de Bruijn polyominoes and prismatic polyominoes, which generalize the notions of de Bruijn sequences and arrays. Given a small fixed polyomino $p$ and a set of colors $[n]$, a de Bruijn polyomino for $(p,n)$ is a colored fixed polyomino $P$ with cells colored from $[n]$ such that every possible coloring of $p$ from $[n]$ exists as a subset of $P$. We call de Bruijn polyominoes for $(p,n)$ of minimum size $(p,n)$-prismatic. We discuss for some values of $p$ and $n$ the shape of a $(p,n)$-prismatic polyomino $P$, the construction of a coloring of $P$, and the enumeration of the colorings of $P$. We find evidence that the difficulty of these problems may depend on the parity of the size of $p$.
Submitted 28 May, 2024;
originally announced May 2024.
-
An empirical study of bloated dependencies in CommonJS packages
Authors:
Yuxin Liu,
Deepika Tiwari,
Cristian Bogdan,
Benoit Baudry
Abstract:
JavaScript packages are notoriously prone to bloat, a factor that significantly impacts the performance and maintainability of web applications. While web bundlers and tree-shaking can mitigate this issue in client-side applications at the function level, they cannot effectively detect and remove bloat in server-side applications. In this paper, we conduct an empirical study to investigate the bloated dependencies that are entirely unused within server-side applications. Our study focuses on applications built with the widely used and highly dynamic CommonJS module system. We propose a trace-based dynamic analysis that monitors file access, to determine which dependencies are not accessed during runtime. To conduct our study, we curate an original dataset of 92 CommonJS packages with a median test coverage of 96.9% and a total of 50,661 dependencies. Our dynamic analysis identifies and successfully removes 50.7% of these dependencies while maintaining the correct build of all packages. Furthermore, we find that 14.9% of directly used dependencies and 51.3% of indirect dependencies are bloated. A key insight is that focusing on removing only the direct bloated dependencies by cleaning the package.json file, also removes a significant share of unnecessary bloated indirect dependencies. Compared to the state-of-the-art dynamic debloating technique, our analysis based on file accesses has fewer false positives, and demonstrates higher accuracy in detecting bloated dependencies. Our findings suggest that native support for dependency debloating in package managers could significantly alleviate the burden of maintaining dependencies.
Submitted 28 May, 2024;
originally announced May 2024.
-
MentalManip: A Dataset For Fine-grained Analysis of Mental Manipulation in Conversations
Authors:
Yuxin Wang,
Ivory Yang,
Saeed Hassanpour,
Soroush Vosoughi
Abstract:
Mental manipulation, a significant form of abuse in interpersonal conversations, presents a challenge to identify due to its context-dependent and often subtle nature. The detection of manipulative language is essential for protecting potential victims, yet the field of Natural Language Processing (NLP) currently faces a scarcity of resources and research on this topic. Our study addresses this gap by introducing a new dataset, named MentalManip, which consists of 4,000 annotated movie dialogues. This dataset enables a comprehensive analysis of mental manipulation, pinpointing both the techniques utilized for manipulation and the vulnerabilities targeted in victims. Our research further explores the effectiveness of leading-edge models in recognizing manipulative dialogue and its components through a series of experiments with various configurations. The results demonstrate that these models inadequately identify and categorize manipulative content. Attempts to improve their performance by fine-tuning with existing datasets on mental health and toxicity have not overcome these limitations. We anticipate that MentalManip will stimulate further research, leading to progress in both understanding and mitigating the impact of mental manipulation in conversations.
Submitted 26 May, 2024;
originally announced May 2024.
-
TURNIP: A "Nondeterministic" GPU Runtime with CPU RAM Offload
Authors:
Zhimin Ding,
Jiawen Yao,
Brianna Barrow,
Tania Lorido Botran,
Christopher Jermaine,
Yuxin Tang,
Jiehui Li,
Xinyu Yao,
Sleem Mahmoud Abdelghafar,
Daniel Bourgeois
Abstract:
An obvious way to alleviate memory difficulties in GPU-based AI computing is via CPU offload, where data are moved between GPU and CPU RAM, so inexpensive CPU RAM is used to increase the amount of storage available. While CPU offload is an obvious idea, it can greatly slow down a computation, due to the relatively slow transfer rate between CPU RAM and GPU RAM. Thus, any system for CPU offload needs to ensure that when such a transfer needs to happen, no computation is blocked waiting for the transfer to finish. One of the key challenges when using CPU offload is that memory transfers introduce nondeterminacy into the system: it is not possible to know before runtime when the transfers will finish, and hence what is the best order of operations to run to ensure there is no blocking. In this paper, we describe TURNIP, which is a system for running AI computations using CPU offload. The key innovation in TURNIP is the compilation of the AI computation into a dependency graph that gives the TURNIP runtime freedom to run operations such as GPU kernel calls in many different orders; at runtime, TURNIP chooses the best order in response to real-time events.
Submitted 27 May, 2024; v1 submitted 25 May, 2024;
originally announced May 2024.
-
Improving 3D Occupancy Prediction through Class-balancing Loss and Multi-scale Representation
Authors:
Huizhou Chen,
Jiangyi Wang,
Yuxin Li,
Na Zhao,
Jun Cheng,
Xulei Yang
Abstract:
3D environment recognition is essential for autonomous driving systems, as autonomous vehicles require a comprehensive understanding of surrounding scenes. Recently, the predominant approach to define this real-life problem is through 3D occupancy prediction. It attempts to predict the occupancy states and semantic labels for all voxels in 3D space, which enhances the perception capability. Birds-Eye-View(BEV)-based perception has achieved the SOTA performance for this task. Nonetheless, this architecture fails to represent various scales of BEV features. In this paper, inspired by the success of UNet in semantic segmentation tasks, we introduce a novel UNet-like Multi-scale Occupancy Head module to relieve this issue. Furthermore, we propose the class-balancing loss to compensate for rare classes in the dataset. The experimental results on nuScenes 3D occupancy challenge dataset show the superiority of our proposed approach over baseline and SOTA methods.
Submitted 25 May, 2024;
originally announced May 2024.
-
Dense Connector for MLLMs
Authors:
Huan** Yao,
Wenhao Wu,
Taojiannan Yang,
YuXin Song,
Mengxi Zhang,
Haocheng Feng,
Yifan Sun,
Zhiheng Li,
Wanli Ouyang,
**gdong Wang
Abstract:
Do we fully leverage the potential of visual encoder in Multimodal Large Language Models (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side. We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs. Yet, scant attention has been directed towards the visual signals utilized by MLLMs, often assumed to be the final high-level features extracted by a frozen visual encoder. In this paper, we introduce the Dense Connector - a simple, effective, and plug-and-play vision-language connector that significantly enhances existing MLLMs by leveraging multi-layer visual features, with minimal additional computational overhead. Furthermore, our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well. Experimental results across various vision encoders, image resolutions, training dataset scales, varying sizes of LLMs (2.7B->70B), and diverse architectures of MLLMs (e.g., LLaVA and Mini-Gemini) validate the versatility and scalability of our approach, achieving state-of-the-art performance across 19 image and video benchmarks. We hope that this work will provide valuable experience and serve as a basic module for future MLLM development.
Submitted 22 May, 2024;
originally announced May 2024.
-
Alterations of electrocortical activity during hand movements induced by motor cortex glioma
Authors:
Yihan Wu,
Tao Chang,
Siliang Chen,
Xiaodong Niu,
Yu Li,
Yuan Fang,
Lei Yang,
Yixuan Zong,
Yaoxin Yang,
Yuehua Li,
Mengsong Wang,
Wen Yang,
Yixuan Wu,
Chen Fu,
Xia Fang,
Yuxin Quan,
Xilin Peng,
Qiang Sun,
Marc M. Van Hulle,
Yanhui Liu,
Ning Jiang,
Dario Farina,
Yuan Yang,
Jiayuan He,
Qing Mao
Abstract:
Glioma cells can reshape functional neuronal networks by hijacking neuronal synapses, leading to partial or complete neurological dysfunction. These mechanisms have been previously explored for language functions. However, the impact of glioma on sensorimotor functions is still unknown. Therefore, we recruited a control group of patients with unaffected motor cortex and a group of patients with glioma-infiltrated motor cortex, and recorded high-density electrocortical signals during finger movement tasks. The results showed that glioma suppresses task-related synchronization in the high-gamma band and reduces the power across all frequency bands. The resulting atypical motor information transmission model with discrete signaling pathways and delayed responses disrupts the stability of neuronal encoding patterns for finger movement kinematics across various temporal-spatial scales. These findings demonstrate that gliomas functionally invade neural circuits within the motor cortex. This result advances our understanding of motor function processing in chronic disease states, which is important to advance the surgical strategies and neurorehabilitation approaches for patients with malignant gliomas.
Submitted 20 May, 2024;
originally announced May 2024.
-
Automated Multi-level Preference for MLLMs
Authors:
Mengxi Zhang,
Wenhao Wu,
Yu Lu,
Yuxin Song,
Kang Rong,
Huan** Yao,
Jianbo Zhao,
Fanglong Liu,
Yifan Sun,
Haocheng Feng,
**gdong Wang
Abstract:
Current multimodal Large Language Models (MLLMs) suffer from "hallucination", occasionally generating responses that are not grounded in the input images. To tackle this challenge, one promising path is to utilize reinforcement learning from human feedback (RLHF), which steers MLLMs towards learning superior responses while avoiding inferior ones. We rethink the common practice of using binary preferences (i.e., superior, inferior), and find that adopting multi-level preferences (e.g., superior, medium, inferior) is better for two benefits: 1) It narrows the gap between adjacent levels, thereby encouraging MLLMs to discern subtle differences. 2) It further integrates cross-level comparisons (beyond adjacent-level comparisons), thus providing a broader range of comparisons with hallucination examples. To verify our viewpoint, we present the Automated Multi-level Preference (AMP) framework for MLLMs. To facilitate this framework, we first develop an automated dataset generation pipeline that provides high-quality multi-level preference datasets without any human annotators. Furthermore, we design the Multi-level Direct Preference Optimization (MDPO) algorithm to robustly conduct complex multi-level preference learning. Additionally, we propose a new hallucination benchmark, MRHal-Bench. Extensive experiments across public hallucination and general benchmarks, as well as our MRHal-Bench, demonstrate the effectiveness of our proposed method. Code is available at https://github.com/takomc/amp.
Submitted 28 May, 2024; v1 submitted 17 May, 2024;
originally announced May 2024.
-
Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer
Authors:
Weifei **,
Yuxin Cao,
Junjie Su,
Qi Shen,
Kai Ye,
Derui Wang,
Jie Hao,
Ziyao Liu
Abstract:
In light of the widespread application of Automatic Speech Recognition (ASR) systems, their security concerns have received much more attention than ever before, primarily due to the susceptibility of Deep Neural Networks. Previous studies have illustrated that surreptitiously crafting adversarial perturbations enables the manipulation of speech recognition systems, resulting in the production of malicious commands. These attack methods mostly require adding noise perturbations under $\ell_p$ norm constraints, inevitably leaving behind artifacts of manual modifications. Recent research has alleviated this limitation by manipulating style vectors to synthesize adversarial examples based on Text-to-Speech (TTS) synthesis audio. However, style modifications based on optimization objectives significantly reduce the controllability and editability of audio styles. In this paper, we propose an attack on ASR systems based on user-customized style transfer. We first test the effect of Style Transfer Attack (STA) which combines style transfer and adversarial attack in sequential order. And then, as an improvement, we propose an iterative Style Code Attack (SCA) to maintain audio quality. Experimental results show that our method can meet the need for user-customized styles and achieve a success rate of 82% in attacks, while keeping sound naturalness according to our user study.
Submitted 15 May, 2024;
originally announced May 2024.
-
No-Regret Learning of Nash Equilibrium for Black-Box Games via Gaussian Processes
Authors:
Minbiao Han,
Fengxue Zhang,
Yuxin Chen
Abstract:
This paper investigates the challenge of learning in black-box games, where the underlying utility function is unknown to any of the agents. While there is an extensive body of literature on the theoretical analysis of algorithms for computing the Nash equilibrium with complete information about the game, studies on Nash equilibrium in black-box games are less common. In this paper, we focus on learning the Nash equilibrium when the only available information about an agent's payoff comes in the form of empirical queries. We provide a no-regret learning algorithm that utilizes Gaussian processes to identify the equilibrium in such games. Our approach not only ensures a theoretical convergence rate but also demonstrates effectiveness across a varied collection of games through experimental validation.
Submitted 14 May, 2024;
originally announced May 2024.
-
FineParser: A Fine-grained Spatio-temporal Action Parser for Human-centric Action Quality Assessment
Authors:
**glin Xu,
Sibo Yin,
Guohao Zhao,
Zishuo Wang,
Yuxin Peng
Abstract:
Existing action quality assessment (AQA) methods mainly learn deep representations at the video level for scoring diverse actions. Due to the lack of a fine-grained understanding of actions in videos, they harshly suffer from low credibility and interpretability, thus insufficient for stringent applications, such as Olympic diving events. We argue that a fine-grained understanding of actions requires the model to perceive and parse actions in both time and space, which is also the key to the credibility and interpretability of the AQA technique. Based on this insight, we propose a new fine-grained spatial-temporal action parser named FineParser. It learns human-centric foreground action representations by focusing on target action regions within each frame and exploiting their fine-grained alignments in time and space to minimize the impact of invalid backgrounds during the assessment. In addition, we construct fine-grained annotations of human-centric foreground action masks for the FineDiving dataset, called FineDiving-HM. With refined annotations on diverse target action procedures, FineDiving-HM can promote the development of real-world AQA systems. Through extensive experiments, we demonstrate the effectiveness of FineParser, which outperforms state-of-the-art methods while supporting more tasks of fine-grained action understanding. Data and code are available at https://github.com/PKU-ICST-MIPL/FineParser_CVPR2024.
Submitted 10 May, 2024;
originally announced May 2024.
-
Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition
Authors:
Zuan Gao,
Yuxin Wang,
Yadong Qu,
Boqiang Zhang,
Zixiao Wang,
Jianjun Xu,
Hongtao Xie
Abstract:
In text recognition, self-supervised pre-training emerges as a good solution to reduce dependence on expansive annotated real data. Previous studies primarily focus on local visual representation by leveraging mask image modeling or sequence contrastive learning. However, they omit modeling the linguistic information in text images, which is crucial for recognizing text. To simultaneously capture local character features and linguistic information in visual space, we propose Symmetric Superimposition Modeling (SSM). The objective of SSM is to reconstruct the direction-specific pixel and feature signals from the symmetrically superimposed input. Specifically, we add the original image with its inverted views to create the symmetrically superimposed inputs. At the pixel level, we reconstruct the original and inverted images to capture character shapes and texture-level linguistic context. At the feature level, we reconstruct the feature of the same original image and inverted image with different augmentations to model the semantic-level linguistic context and the local character discrimination. In our design, we disrupt the character shape and linguistic rules. Consequently, the dual-level reconstruction facilitates understanding character shapes and linguistic information from the perspective of visual texture and feature semantics. Experiments on various text recognition benchmarks demonstrate the effectiveness and generality of SSM, with 4.1% average performance gains and 86.6% new state-of-the-art average word accuracy on Union14M benchmarks. The code is available at https://github.com/FaltingsA/SSM.
Submitted 10 May, 2024; v1 submitted 9 May, 2024;
originally announced May 2024.
-
FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models
Authors:
**glin Xu,
Yijie Guo,
Yuxin Peng
Abstract:
The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to predict human joint coordinates in 3D space. Despite recent advancements in deep learning-based methods, they mostly ignore the capability of coupling accessible texts and naturally feasible knowledge of humans, missing out on valuable implicit supervision to guide the 3D HPE task. Moreover, previous efforts often study this task from the perspective of the whole human body, neglecting fine-grained guidance hidden in different body parts. To this end, we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model for 3D HPE, named FinePOSE. It consists of three core blocks enhancing the reverse process of the diffusion model: (1) Fine-grained Part-aware Prompt learning (FPP) block constructs fine-grained part-aware prompts via coupling accessible texts and naturally feasible knowledge of body parts with learnable prompts to model implicit guidance. (2) Fine-grained Prompt-pose Communication (FPC) block establishes fine-grained communications between learned part-aware prompts and poses to improve the denoising quality. (3) Prompt-driven Timestamp Stylization (PTS) block integrates learned prompt embedding and temporal information related to the noise level to enable adaptive adjustment at each denoising step. Extensive experiments on public single-human pose estimation datasets show that FinePOSE outperforms state-of-the-art methods. We further extend FinePOSE to multi-human pose estimation. Achieving 34.3mm average MPJPE on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with complex multi-human scenarios. Code is available at https://github.com/PKU-ICST-MIPL/FinePOSE_CVPR2024.
Submitted 8 May, 2024;
originally announced May 2024.
-
Non-Abelian Braiding of Topological Edge Bands
Authors:
Yang Long,
Zihao Wang,
Chen Zhang,
Haoran Xue,
Yuxin Zhao,
Baile Zhang
Abstract:
Braiding is a geometric concept that manifests itself in a variety of scientific contexts from biology to physics, and has been employed to classify bulk band topology in topological materials. Topological edge states can also form braiding structures, as demonstrated recently in a type of topological insulators known as Möbius insulators, whose topological edge states form two braided bands exhibiting a Möbius twist. While the formation of Möbius twist is inspiring, it belongs to the simple Abelian braid group $\mathbb{B}_2$. The most fascinating features about topological braids rely on the non-Abelianness in the higher-order braid group $\mathbb{B}_N$ ($N \geq 3$), which necessitates multiple edge bands, but so far it has not been discussed. Here, based on the gauge enriched symmetry, we develop a scheme to realize non-Abelian braiding of multiple topological edge bands. We propose tight-binding models of topological insulators that are able to generate topological edge states forming non-Abelian braiding structures. Experimental demonstrations are conducted in two acoustic crystals, which carry three and four braided acoustic edge bands, respectively. The observed braiding structure can correspond to the topological winding in the complex eigenvalue space of projective translation operator, akin to the previously established point-gap winding topology in the bulk of the Hatano-Nelson model. Our work also constitutes the realization of non-Abelian braiding topology on an actual crystal platform, but not based on the "virtual" synthetic dimensions.
Submitted 8 May, 2024;
originally announced May 2024.
-
Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing
Authors:
Boqiang Zhang,
Hongtao Xie,
Zuan Gao,
Yuxin Wang
Abstract:
Scene text images contain not only style information (font, background) but also content information (character, texture). Different scene text tasks need different information, but previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance. We propose a Disentangled Representation Learning framework (DARLING) aimed at disentangling these two types of features for improved adaptability in better addressing various downstream tasks (choose what you really need). Specifically, we synthesize a dataset of image pairs with identical style but different content. Based on the dataset, we decouple the two types of features by the supervision design. Clearly, we directly split the visual representation into style and content features, the content features are supervised by a text recognition loss, while an alignment loss aligns the style features in the image pairs. Then, style features are employed in reconstructing the counterpart image via an image decoder with a prompt that indicates the counterpart's content. Such an operation effectively decouples the features based on their distinctive properties. To the best of our knowledge, this is the first time in the field of scene text that disentangles the inherent properties of the text images. Our method achieves state-of-the-art performance in Scene Text Recognition, Removal, and Editing.
Submitted 7 May, 2024;
originally announced May 2024.