
EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark

Authors: Ziyang Ma, Mingjie Chen, Hezhao Zhang, Zhisheng Zheng, Wenxi Chen, Xiquan Li, Jiaxin Ye, Xie Chen, Thomas Hain

Abstract: Speech emotion recognition (SER) is an important part of human-computer interaction, receiving extensive attention from both industry and academia. However, the current research field of SER has long suffered from the following problems: 1) There are few reasonable and universal splits of the datasets, making comparing different models and methods difficult. 2) No commonly used benchmark covers numerous corpus and languages for researchers to refer to, making reproduction a burden. In this paper, we propose EmoBox, an out-of-the-box multilingual multi-corpus speech emotion recognition toolkit, along with a benchmark for both intra-corpus and cross-corpus settings. For intra-corpus settings, we carefully designed the data partitioning for different datasets. For cross-corpus settings, we employ a foundation SER model, emotion2vec, to mitigate annotation errors and obtain a test set that is fully balanced in speakers and emotions distributions. Based on EmoBox, we present the intra-corpus SER results of 10 pre-trained speech models on 32 emotion datasets with 14 languages, and the cross-corpus SER results on 4 datasets with the fully balanced test sets. To the best of our knowledge, this is the largest SER benchmark, across language scopes and quantity scales. We hope that our toolkit and benchmark can facilitate the research of SER in the community.

Submitted 11 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024. GitHub Repository: https://github.com/emo-box/EmoBox

arXiv:2406.07077 [pdf, other]

Meta-Backscatter: A New ISAC Paradigm for Battery-Free Internet of Things

Authors: Xu Liu, Hongliang Zhang, Kaigui Bian, Xi Weng, Lingyang Song

Abstract: The meta-material sensor has been regarded as a next-generation sensing technology for the battery-free Internet of Things (IoT) due to its battery-free characteristic and improved sensing performance. The meta-material sensors function as backscatter tags that change their reflection coefficients with the conditions of sensing targets such as temperature and gas concentration, allowing transceivers to perform sensing by analyzing the reflected signals from the sensors. Simultaneously, the sensors also function as environmental scatterers, creating additional signal paths to enhance communication performance. Therefore, the meta-material sensor potentially provides a new paradigm of Integrated Sensing and Communication (ISAC) for the battery-free IoT system. In this article, we first propose a Meta-Backscatter system that utilizes meta-material sensors to achieve diverse sensing functionalities and improved communication performance. We begin with the introduction of the metamaterial sensor and further elaborate on the Meta-Backscatter system. Subsequently, we present optimization strategies for meta-material sensors, transmitters, and receivers to strike a balance between sensing and communication. Furthermore, this article provides a case study of the system and examines the feasibility and trade-off through the simulation results. Finally, potential extensions of the system and their related research challenges are addressed.

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.07048 [pdf, other]

GPU-Accelerated Optimization-Based Collision Avoidance

Authors: Zeming Wu, Zhu** Wang, Hao Zhang

Abstract: This paper proposes a GPU-accelerated optimization framework for collision avoidance problems where the controlled objects and the obstacles can be modeled as the finite union of convex polyhedra. A novel collision avoidance constraint is proposed based on scale-based collision detection and the strong duality of convex optimization. Under this constraint, the high-dimensional non-convex optimization problems of collision avoidance can be decomposed into several low-dimensional quadratic programmings (QPs) following the paradigm of alternating direction method of multipliers (ADMM). Furthermore, these low-dimensional QPs can be solved parallel with GPUs, significantly reducing computational time. High-fidelity simulations are conducted to validate the proposed method's effectiveness and practicality.

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.07036 [pdf, other]

Paying More Attention to Source Context: Mitigating Unfaithful Translations from Large Language Model

Authors: Hongbin Zhang, Kehai Chen, Xuefeng Bai, Yang Xiang, Min Zhang

Abstract: Large language models (LLMs) have showcased impressive multilingual machine translation ability. However, unlike encoder-decoder style models, decoder-only LLMs lack an explicit alignment between source and target contexts. Analyzing contribution scores during generation processes revealed that LLMs can be biased towards previously generated tokens over corresponding source tokens, leading to unfaithful translations. To address this issue, we propose to encourage LLMs to pay more attention to the source context from both source and target perspectives in zeroshot prompting: 1) adjust source context attention weights; 2) suppress irrelevant target prefix influence; Additionally, we propose 3) avoiding over-reliance on the target prefix in instruction tuning. Experimental results from both human-collected unfaithfulness test sets focusing on LLM-generated unfaithful translations and general test sets, verify our methods' effectiveness across multiple language pairs. Further human evaluation shows our method's efficacy in reducing hallucinatory translations and facilitating faithful translation generation.

Submitted 11 June, 2024; originally announced June 2024.

Comments: Accepted by ACL2024 Findings

arXiv:2406.06999 [pdf, other]

Teaching with Uncertainty: Unleashing the Potential of Knowledge Distillation in Object Detection

Authors: Junfei Yi, Jianxu Mao, Tengfei Liu, Mingjie Li, Hanyu Gu, Hui Zhang, Xiaojun Chang, Yaonan Wang

Abstract: Knowledge distillation (KD) is a widely adopted and effective method for compressing models in object detection tasks. Particularly, feature-based distillation methods have shown remarkable performance. Existing approaches often ignore the uncertainty in the teacher model's knowledge, which stems from data noise and imperfect training. This limits the student model's ability to learn latent knowledge, as it may overly rely on the teacher's imperfect guidance. In this paper, we propose a novel feature-based distillation paradigm with knowledge uncertainty for object detection, termed "Uncertainty Estimation-Discriminative Knowledge Extraction-Knowledge Transfer (UET)", which can seamlessly integrate with existing distillation methods. By leveraging the Monte Carlo dropout technique, we introduce knowledge uncertainty into the training process of the student model, facilitating deeper exploration of latent knowledge. Our method performs effectively during the KD process without requiring intricate structures or extensive computational resources. Extensive experiments validate the effectiveness of our proposed approach across various distillation strategies, detectors, and backbone architectures. Specifically, following our proposed paradigm, the existing FGD method achieves state-of-the-art (SoTA) performance, with ResNet50-based GFL achieving 44.1% mAP on the COCO dataset, surpassing the baselines by 3.9%.

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.06918 [pdf, other]

Towards more realistic evaluation of LLM-based code generation: an experimental study and beyond

Authors: Dewu Zheng, Yanlin Wang, Ensheng Shi, Ruikai Zhang, Yuchi Ma, Hongyu Zhang, Zibin Zheng

Abstract: To evaluate the code generation capabilities of Large Language Models (LLMs) in complex real-world software development scenarios, many evaluation approaches have been developed. They typically leverage contextual code from the latest version of a project to facilitate LLMs in accurately generating the desired function. However, such evaluation approaches fail to consider the dynamic evolution of software projects over time, which we refer to as evolving-ignored situation, leading to issues of future context leakage and useful context missing. This in turn results in inaccurate evaluation of LLMs' performance. In this paper, we conduct an empirical study to deeply understand LLMs' code generation performance within settings that reflect the evolving nature of software development. To achieve this, we first construct an evolving-aware repository-level code generation dataset, namely HumanEvo, equipped with an automated execution-based evaluation tool. Second, we manually categorize HumanEvo according to dependency levels to more comprehensively analyze the model's performance in generating functions with different dependency levels. Third, we conduct extensive experiments on HumanEvo with seven representative and diverse LLMs to verify the effectiveness of the proposed benchmark. We obtain many important findings through our experimental study. For example, we find that previous evolving-ignored evaluation approaches lead to inflated performance of the LLMs, ranging from 10.0% to 61.1%. Based on the findings, we give actionable suggestions on more realistic evaluation of LLMs on code generation. We also build a shared evolving-aware code generation toolbox to facilitate future research. Replication package including source code, datasets and appendix is available at https://github.com/DeepSoftwareAnalytics/EvoEval.

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.06697 [pdf, other]

A quasar-galaxy merger at $z\sim 6.2$: rapid host growth via accretion of two massive satellite galaxies

Authors: Roberto Decarli, Federica Loiacono, Emanuele Paolo Farina, Massimo Dotti, Alessandro Lupi, Romain A. Meyer, Marco Mignoli, Antonio Pensabene, Michael A. Strauss, Bram Venemans, **yi Yang, Fabian Walter, Julien Wolf, Eduardo Bañados, Laura Blecha, Sarah Bosman, Chris L. Carilli, Andrea Comastri, Thomas Connor, Tiago Costa, Anna-Christina Eilers, Xiaohui Fan, Roberto Gilli, Hyunsung D. Jun, Weizhe Liu, et al. (16 additional authors not shown)

Abstract: We present JWST/NIRSpec Integral Field Spectroscopy in the rest-frame optical bands of the system PJ308-21, a quasar at $z=6.2342$ caught as its host galaxy interacts with companion galaxies. We detect spatially extended emission of several emission lines (H$α$, H$β$, [OIII], [NII], [SII], HeII), which we use to study the properties of the ionized phase of the interstellar medium: the source and hardness of the photoionizing radiation field, metallicity, dust reddening, electron density and temperature, and star formation. We also marginally detect continuum starlight emission associated with the companion sources. We find that at least two independent satellite galaxies are part of the system. While the quasar host appears highly enriched and obscured, with AGN-like photoionization conditions, the western companion shows minimal dust extinction, low metallicity ($Z\sim0.4$ Z$_\odot$), and star-formation driven photoionization. The eastern companion shows higher extinction and metallicity ($Z\sim0.8$ Z$_\odot$) compared to the western companion, and it is at least partially photoionized by the nearby quasar. We do not find any indication of AGN in the companion sources. Our study shows that while the quasar host galaxy is already very massive ($M_{\rm dyn}>10^{11}$ M$_\odot$), it is still rapidly building up by accreting two relatively massive ($M_{\rm star}\sim 10^{10}$ M$_\odot$) companion sources. This dataset showcases the power of JWST in exposing the build-up of massive galaxies in the first Gyr of the Universe.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 15 pages, 16 figures. Accepted for publication in A&A

arXiv:2406.06678 [pdf, other]

Kinematics and Dynamics of the Galactic Bar revealed by Gaia Long Period Variables

Authors: Han-Yuan Zhang, Vasily Belokurov, N. Wyn Evans, Sarah G. Kane, Jason L. Sanders

Abstract: We take low-amplitude, long period variable (LA-LPV) candidates in Gaia DR3 as tracers of the kinematics and dynamics of the Milky Way bar. LA-LPVs, like other LPVs, have high luminosities and follow a tight period-luminosity relation, but unlike e.g. Mira variables, their radial velocity measurements are reliable due to their smaller pulsation amplitudes. We supplement the Gaia astrometric and radial velocity measurements with distance moduli assigned using a period-luminosity relation to acquire full 6D phase space information. The assigned distances are validated by comparing to geometric distances and StarHorse distances, which shows biases less than $\sim5\%$. Our sample provides an unprecedented panoramic picture of the inner Galaxy with minimal selection function effects. We map the kinematics of the inner Milky Way and find a significant kinematic signature corresponding to the Galactic bar. We measure the pattern speed of the Galactic bar using the continuity equation and find $Ω_{\rm b}=34.1\pm2.4$ km s$^{-1}$ kpc$^{-1}$. We develop a simple, robust and model-independent method to measure the dynamical length of the bar using only kinematics and find $R_{\rm b}\sim4.0$ kpc. We validate both measurements using N-body simulations. Assuming knowledge of the gravitational potential of the inner Milky Way, we analyse the orbital structure of the Galactic bar using orbital frequency ratios. The $x_1$ orbits are the dominant bar-supporting orbital family in our sample. Amongst the selected bar stars, the $x_1 v_1$ or "banana" orbits constitute a larger fraction ($\sim 15\%$) than other orbital families in the bar, implying that they are the dominant family contributing to the Galactic X-shape, although contributions from other orbital families are present.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 20 pages, 11 figures, submitted to MNRAS. Comments welcome

arXiv:2406.06517 [pdf, other]

Genomics-guided Representation Learning for Pathologic Pan-cancer Tumor Microenvironment Subtype Prediction

Authors: Fangliangzi Meng, Hongrun Zhang, Ruodan Yan, Guohui Chuai, Chao Li, Qi Liu

Abstract: The characterization of Tumor MicroEnvironment (TME) is challenging due to its complexity and heterogeneity. Relatively consistent TME characteristics embedded within highly specific tissue features, render them difficult to predict. The capability to accurately classify TME subtypes is of critical significance for clinical tumor diagnosis and precision medicine. Based on the observation that tumors with different origins share similar microenvironment patterns, we propose PathoTME, a genomics-guided Siamese representation learning framework employing Whole Slide Image (WSI) for pan-cancer TME subtypes prediction. Specifically, we utilize Siamese network to leverage genomic information as a regularization factor to assist WSI embeddings learning during the training phase. Additionally, we employ Domain Adversarial Neural Network (DANN) to mitigate the impact of tissue type variations. To eliminate domain bias, a dynamic WSI prompt is designed to further unleash the model's capabilities. Our model achieves better performance than other state-of-the-art methods across 23 cancer types on TCGA dataset. Our code is available at https://github.com/Mengflz/PathoTME.

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.06367 [pdf, other]

MVGamba: Unify 3D Content Generation as State Space Sequence Modeling

Authors: Xuanyu Yi, Zike Wu, Qiuhong Shen, Qingshan Xu, Pan Zhou, Joo-Hwee Lim, Shuicheng Yan, Xinchao Wang, Hanwang Zhang

Abstract: Recent 3D large reconstruction models (LRMs) can generate high-quality 3D content in sub-seconds by integrating multi-view diffusion models with scalable multi-view reconstructors. Current works further leverage 3D Gaussian Splatting as 3D representation for improved visual quality and rendering efficiency. However, we observe that existing Gaussian reconstruction models often suffer from multi-view inconsistency and blurred textures. We attribute this to the compromise of multi-view information propagation in favor of adopting powerful yet computationally intensive architectures (e.g., Transformers). To address this issue, we introduce MVGamba, a general and lightweight Gaussian reconstruction model featuring a multi-view Gaussian reconstructor based on the RNN-like State Space Model (SSM). Our Gaussian reconstructor propagates causal context containing multi-view information for cross-view self-refinement while generating a long sequence of Gaussians for fine-detail modeling with linear complexity. With off-the-shelf multi-view diffusion models integrated, MVGamba unifies 3D generation tasks from a single image, sparse images, or text prompts. Extensive experiments demonstrate that MVGamba outperforms state-of-the-art baselines in all 3D content generation scenarios with approximately only $0.1\times$ of the model size.

Submitted 20 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.06156 [pdf, other]

Stronger, Cheaper and Demonstration-Free Log Parsing with LLMs

Authors: Yi Xiao, Van-Hoang Le, Hongyu Zhang

Abstract: Log parsing, the process of converting raw log messages into structured formats, is an important initial step for automated analysis of logs of large-scale software systems. Traditional log parsers often rely on heuristics or handcrafted features, which may not generalize well across diverse log sources or require extensive model tuning. Recently, some log parsers have utilized powerful generative capabilities of large language models (LLMs). However, they heavily rely on demonstration examples, resulting in substantial overhead in LLM invocations. To address these issues, we propose LogBatcher, a cost-effective LLM-based log parser that requires no training process or labeled data. To leverage latent characteristics of log data and reduce the overhead, we divide logs into several partitions through clustering. Then we perform a cache matching process to match logs with previously parsed log templates. Finally, we provide LLMs with better prompt context specialized for log parsing by batching a group of logs from each partition. We have conducted experiments on 16 public log datasets and the results show that LogBatcher is effective and efficient for log parsing.

Submitted 12 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.06118 [pdf, other]

Strong and weak $CP$ tests in sequential decays of polarized $Σ^0$ hyperons

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, O. Afedulidis, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, I. Balossino, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, et al. (644 additional authors not shown)

Abstract: The $J/ψ, ψ(3686) \to Σ^0 \barΣ^{0}$ processes and subsequent decays are studied using the world's largest $J/ψ$ and $ψ(3686)$ data samples collected with the BESIII detector. The strong-$CP$ symmetry is tested in the decays of the $Σ^0$ hyperons for the first time by measuring the decay parameters, $α_{Σ^0} = -0.0017 \pm 0.0021 \pm 0.0018$ and $\barα_{Σ^0} = 0.0021 \pm 0.0020 \pm 0.0022$. The weak-$CP$ test is performed in the subsequent decays of their daughter particles $Λ$ and $\barΛ$. Also for the first time, the transverse polarizations of the $Σ^0$ hyperons in $J/ψ$ and $ψ(3686)$ decays are observed with opposite directions, and the ratios between the S-wave and D-wave contributions of the $J/ψ, ψ(3686) \to Σ^0 \barΣ^{0}$ decays are obtained. These results are crucial to understand the decay dynamics of the charmonium states and the production mechanism of the $Σ^0-\barΣ^0$ pairs.

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.06063 [pdf, other]

Enabling Large-Scale and High-Precision Fluid Simulations on Near-Term Quantum Computers

Authors: Zhao-Yun Chen, Teng-Yang Ma, Chuang-Chao Ye, Liang Xu, Ming-Yang Tan, Xi-Ning Zhuang, Xiao-Fan Xu, Yun-Jie Wang, Tai-** Sun, Yong Chen, Lei Du, Liang-Liang Guo, Hai-Feng Zhang, Hao-Ran Tao, Tian-Le Wang, Xiao-Yan Yang, Ze-An Zhao, Peng Wang, Sheng Zhang, Chi Zhang, Ren-Ze Zhao, Zhi-Long Jia, Wei-Cheng Kong, Meng-Han Dou, Jun-Chao Wang, et al. (7 additional authors not shown)

Abstract: Quantum computational fluid dynamics (QCFD) offers a promising alternative to classical computational fluid dynamics (CFD) by leveraging quantum algorithms for higher efficiency. This paper introduces a comprehensive QCFD method, including an iterative method "Iterative-QLS" that suppresses error in quantum linear solver, and a subspace method to scale the solution to a larger size. We implement our method on a superconducting quantum computer, demonstrating successful simulations of steady Poiseuille flow and unsteady acoustic wave propagation. The Poiseuille flow simulation achieved a relative error of less than $0.2\%$, and the unsteady acoustic wave simulation solved a 5043-dimensional matrix. We emphasize the utilization of the quantum-classical hybrid approach in applications of near-term quantum computers. By adapting to quantum hardware constraints and offering scalable solutions for large-scale CFD problems, our method paves the way for practical applications of near-term quantum computers in computational science.

Submitted 19 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: 31 pages, 10 figures

arXiv:2406.06040 [pdf, other]

Vript: A Video Is Worth Thousands of Words

Authors: Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, Hai Zhao

Abstract: Advancements in multimodal learning, particularly in video understanding and generation, require high-quality video-text datasets for improved model performance. Vript addresses this issue with a meticulously annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips. Each clip has a caption of ~145 words, which is over 10x longer than most video-text datasets. Unlike captions only documenting static content in previous datasets, we enhance video captioning to video scripting by documenting not just the content, but also the camera operations, which include the shot types (medium shot, close-up, etc) and camera movements (panning, tilting, etc). By utilizing the Vript, we explore three training paradigms of aligning more text with the video modality rather than clip-caption pairs. This results in Vriptor, a top-performing video captioning model among open-source models, comparable to GPT-4V in performance. Vriptor is also a powerful model capable of end-to-end generation of dense and detailed captions for long videos. Moreover, we introduce Vript-Hard, a benchmark consisting of three video understanding tasks that are more challenging than existing benchmarks: Vript-HAL is the first benchmark evaluating action and object hallucinations in video LLMs, Vript-RR combines reasoning with retrieval resolving question ambiguity in long-video QAs, and Vript-ERO is a new task to evaluate the temporal understanding of events in long videos rather than actions in short videos in previous works. All code, models, and datasets are available in https://github.com/mutonix/Vript.

Submitted 10 June, 2024; originally announced June 2024.

Comments: submitted to NeurIPS Dataset & Benchmark track

arXiv:2406.06022 [pdf, other]

GraphStorm: all-in-one graph machine learning framework for industry applications

Authors: Da Zheng, Xiang Song, Qi Zhu, Jian Zhang, Theodore Vasiloudis, Runjie Ma, Houyu Zhang, Zichen Wang, Soji Adeshina, Israt Nisa, Alejandro Mottini, Qingjun Cui, Huzefa Rangwala, Belinda Zeng, Christos Faloutsos, George Karypis

Abstract: Graph machine learning (GML) is effective in many business applications. However, making GML easy to use and applicable to industry applications with massive datasets remain challenging. We developed GraphStorm, which provides an end-to-end solution for scalable graph construction, graph model training and inference. GraphStorm has the following desirable properties: (a) Easy to use: it can perform graph construction and model training and inference with just a single command; (b) Expert-friendly: GraphStorm contains many advanced GML modeling techniques to handle complex graph data and improve model performance; (c) Scalable: every component in GraphStorm can operate on graphs with billions of nodes and can scale model training and inference to different hardware without changing any code. GraphStorm has been used and deployed for over a dozen billion-scale industry applications after its release in May 2023. It is open-sourced in Github: https://github.com/awslabs/graphstorm.

Submitted 10 June, 2024; originally announced June 2024.

Journal ref: KDD 2024

arXiv:2406.05827 [pdf, ps, other]

Measurement of the integrated luminosity of the data collected at 3.773 GeV by BESIII from 2021 to 2024

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, O. Afedulidis, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, I. Balossino, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, et al. (634 additional authors not shown)

Abstract: We present a measurement of the integrated luminosity of $e^+e^-$ collision data collected with the BESIII detector at the BEPCII collider at a center-of-mass energy of $E_{\rm cm} = 3.773$~GeV. The integrated luminosities of the data sets taken from December 2021 to June 2022, from November 2022 to June 2023, and from October 2023 to February 2024 are determined to be $4.995 \pm 0.019$~fb$^{-1}$, $8.157 \pm 0.031$~fb$^{-1}$, and $4.191 \pm 0.016$~fb$^{-1}$, respectively, by analyzing large angle Bhabha scattering events. The uncertainties are dominated by systematic effects and the statistical uncertainties are negligible. Our results provide essential input for future analyses and precision measurements.

Submitted 9 June, 2024; originally announced June 2024.

arXiv:2406.05780 [pdf, ps, other]

Two-Stage Resource Allocation in Reconfigurable Intelligent Surface Assisted Hybrid Networks via Multi-Player Bandits

Authors: Jingwen Tong, Hongliang Zhang, Liqun Fu, Amir Leshem, Zhu Han

Abstract: This paper considers a resource allocation problem where several Internet-of-Things (IoT) devices send data to a base station (BS) with or without the help of the reconfigurable intelligent surface (RIS) assisted cellular network. The objective is to maximize the sum rate of all IoT devices by finding the optimal RIS and spreading factor (SF) for each device. Since these IoT devices lack prior information on the RISs or the channel state information (CSI), a distributed resource allocation framework with low complexity and learning features is required to achieve this goal. Therefore, we model this problem as a two-stage multi-player multi-armed bandit (MPMAB) framework to learn the optimal RIS and SF sequentially. Then, we put forth an exploration and exploitation boosting (E2Boost) algorithm to solve this two-stage MPMAB problem by combining the $ε$-greedy algorithm, Thompson sampling (TS) algorithm, and non-cooperation game method. We derive an upper regret bound for the proposed algorithm, i.e., $\mathcal{O}(\log^{1+δ}_2 T)$, increasing logarithmically with the time horizon $T$. Numerical results show that the E2Boost algorithm has the best performance among the existing methods and exhibits a fast convergence rate. More importantly, the proposed algorithm is not sensitive to the number of combinations of the RISs and SFs thanks to the two-stage allocation mechanism, which can benefit high-density networks.

Submitted 9 June, 2024; originally announced June 2024.

Comments: This paper was published in IEEE Transactions on Communications
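
Illustrative note: the abstract above names the $ε$-greedy algorithm as one ingredient of E2Boost. The sketch below shows only plain single-player $ε$-greedy on Bernoulli arms, as a reminder of that building block. It is not the paper's E2Boost algorithm; the function name `epsilon_greedy`, the arm means, and all parameter values are hypothetical choices made for illustration.

```python
import random


def epsilon_greedy(arm_means, horizon=5000, epsilon=0.1, seed=0):
    """Plain epsilon-greedy on Bernoulli arms (illustrative, not E2Boost).

    arm_means: hypothetical success probabilities, unknown to the learner.
    Returns (pull counts per arm, estimated mean reward per arm).
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(horizon):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        # Simulated Bernoulli reward for the chosen arm.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running sample mean.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


if __name__ == "__main__":
    counts, estimates = epsilon_greedy([0.2, 0.5, 0.8])
    print("pulls per arm:", counts)
    print("estimated means:", [round(e, 3) for e in estimates])
```

With a fixed exploration rate, a constant fraction of rounds is spent exploring, so this plain variant does not by itself achieve the logarithmic-type regret bound stated in the abstract.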

arXiv:2406.05685 [pdf, other]

Understanding Open Source Contributor Profiles in Popular Machine Learning Libraries

Authors: Jiawen Liu, Haoxiang Zhang, Ying Zou

Abstract: With the increasing popularity of machine learning (ML), many open-source software (OSS) contributors are attracted to developing and adopting ML approaches. Comprehensive understanding of ML contributors is crucial for successful ML OSS development and maintenance. Without such knowledge, there is a risk of inefficient resource allocation and hindered collaboration in ML OSS projects. Existing research focuses on understanding the difficulties and challenges perceived by ML contributors through user surveys. There is a lack of understanding of ML contributors based on their activities tracked from software repositories. In this paper, we aim to understand ML contributors by identifying contributor profiles in ML libraries. We further study contributors' OSS engagement from three aspects: workload composition, work preferences, and technical importance. By investigating 7,640 contributors from 6 popular ML libraries (TensorFlow, PyTorch, Keras, MXNet, Theano, and ONNX), we identify four contributor profiles: Core-Afterhour, Core-Workhour, Peripheral-Afterhour, and Peripheral-Workhour. We find that: 1) project experience, authored files, collaborations, and geographical location are significant features of all profiles; 2) contributors in Core profiles exhibit significantly different OSS engagement compared to Peripheral profiles; 3) contributors' work preferences and workload compositions significantly impact project popularity; 4) long-term contributors evolve towards making fewer, constant, balanced and less technical contributions.

Submitted 9 June, 2024; originally announced June 2024.

arXiv:2406.05678 [pdf, other]

SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models

Authors: Hengyu Zhang

Abstract: Extending the functionality of the Transformer model to accommodate longer sequence lengths has become a critical challenge. This extension is crucial not only for improving tasks such as language translation and long-context processing but also for enabling novel applications like chatbots, code generation, and multimedia content creation. The primary obstacle is the self-attention mechanism, which scales quadratically with sequence length in terms of computation time and memory requirements. LongLoRA proposed shifted sparse attention (S$^2$-Attn), effectively enabling context extension and leading to non-trivial computation savings with similar performance to fine-tuning with vanilla attention. However, LongLoRA is still not as efficient as vanilla attention, reaching only 39% of the perplexity improvement compared to full attention. This inefficiency is due to the cyclic shift applied within different attention head patterns, causing either chaos in the attention head structure or unnecessary information exchange between token groups. To address these issues, we propose SinkLoRA, which features better work partitioning. Specifically, (1) we developed SF-Attn with a segmentation and reassembly algorithm to proportionally return cyclically shifted groups of attention heads to their un-shifted state together with global attention of "sink attention tokens", achieving 92% of the perplexity improvement compared to full attention after fine-tuning, and (2) applied a SOTA KV cache compression algorithm H$_2$O to accelerate inference.
Furthermore, we conducted supervised fine-tuning with SinkLoRA using a self-collected LongAlpaca-plus dataset. All our code, models, datasets, and demos are available at https://github.com/Dexter-GT-86/SinkLoRA.

Submitted 9 June, 2024; originally announced June 2024.

Comments: A rethinking of Short Shifted Attention

arXiv:2406.05555 [pdf, ps, other]

doi 10.1109/MCOM.001.2100704

OAM-SWIPT for IoE-Driven 6G

Authors: Runyu Lyu, Wenchi Cheng, Bazhong Shen, Zhiyuan Ren, Hailin Zhang

Abstract: Simultaneous wireless information and power transfer (SWIPT), which achieves both wireless energy transfer (WET) and information transfer, is an attractive technique for future Internet of Everything (IoE) in the sixth-generation (6G) mobile communications. With SWIPT, battery-less IoE devices can be powered while communicating with other devices. Line-of-sight (LOS) RF transmission and near-field inductive coupling based transmission are typical SWIPT scenarios, both of which are LOS channels without enough degrees of freedom for high spectrum efficiency as well as high energy efficiency. Due to the orthogonal wavefronts, orbital angular momentum (OAM) can facilitate the SWIPT in LOS channels. In this article, we introduce the OAM-based SWIPT as well as discuss some basic advantages and challenges for it. After introducing the OAM-based SWIPT for IoE, we first propose an OAM-based SWIPT system model with the OAM-modes assisted dynamic power splitting (DPS). Then, four basic advantages regarding the OAM-based SWIPT are reviewed with some numerical analyses for further demonstrating the advantages. Next, four challenges regarding integrating OAM into SWIPT and possible solutions are discussed. OAM technology provides multiple orthogonal streams to increase both spectrum and energy efficiencies for SWIPT, thus creating many opportunities for future WET and SWIPT research.

Submitted 8 June, 2024; originally announced June 2024.

Comments: 7 pages, 6 figures

Journal ref: in IEEE Communications Magazine, vol. 60, no. 3, pp. 19-25, March 2022

arXiv:2406.05514 [pdf, other]

RAG-Enhanced Commit Message Generation

Authors: Linghao Zhang, Hongyi Zhang, Chong Wang, Peng Liang

Abstract: Commit messages are among the most important textual information in software development and maintenance. However, it is time-consuming and labor-intensive to write commit messages manually. Commit Message Generation (CMG) has become a research hotspot in automated software engineering. Researchers have proposed several methods for CMG and achieved great results. In recent years, CodeBERT, CodeT5, and other Pre-trained Language Models (PLMs) for code have been proposed. These models can be easily transferred to code-related downstream tasks including CMG with simple fine-tuning and can achieve impressive performance. Moreover, Large Language Models (LLMs) with code capabilities (e.g., ChatGPT, Llama 3, Gemma) can be directly applied to various tasks by designing instruct prompts without training. This brings new possibilities to the CMG task. In this work, we propose REACT, a novel REtrieval-Augmented framework for CommiT message generation, which effectively integrates advanced retrieval techniques with different PLMs and LLMs and can broadly enhance the performance of various models on the CMG task. Specifically, we design and build a hybrid retriever to retrieve the most relevant code diff and commit message pair from the code base as an "exemplar". Then, the retrieved pair is utilized to guide and enhance the generation of commit messages by PLMs and LLMs through fine-tuning and in-context learning. Our approach is evaluated on a widely-used dataset.
The experimental results show that REACT significantly enhances the performance of various models on the CMG task, improving the BLEU score of CodeT5 by up to 55%, boosting Llama 3's BLEU score by 102%, and substantially surpassing all baselines, achieving a new SOTA. This demonstrates the effectiveness and broad applicability of our framework that can enhance CMG by a large margin.

Submitted 14 June, 2024; v1 submitted 8 June, 2024; originally announced June 2024.

arXiv:2406.05412 [pdf]

Select-Mosaic: Data Augmentation Method for Dense Small Object Scenes

Authors: Hao Zhang, Shuaijie Zhang, Renbin Zou

Abstract: Data augmentation refers to the process of applying a series of transformations or expansions to original data to generate new samples, thereby increasing the diversity and quantity of the data, effectively improving the performance and robustness of models. As a common data augmentation method, Mosaic data augmentation technique stitches multiple images together to increase the diversity and complexity of training data, thereby reducing the risk of overfitting. Although Mosaic data augmentation achieves excellent results in general detection tasks by stitching images together, it still has certain limitations for specific detection tasks. This paper addresses the challenge of detecting a large number of densely distributed small objects in aerial images by proposing the Select-Mosaic data augmentation method, which is improved with a fine-grained region selection strategy. The improved Select-Mosaic method demonstrates superior performance in handling dense small object detection tasks, significantly enhancing the accuracy and stability of detection models. Code is available at https://github.com/malagoutou/Select-Mosaic.

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2406.05391 [pdf, other]

DUPLEX: Dual GAT for Complex Embedding of Directed Graphs

Authors: Zhaoru Ke, Hang Yu, Jianguo Li, Haipeng Zhang

Abstract: Current directed graph embedding methods build upon undirected techniques but often inadequately capture directed edge information, leading to challenges such as: (1) Suboptimal representations for nodes with low in/out-degrees, due to the insufficient neighbor interactions; (2) Limited inductive ability for representing new nodes post-training; (3) Narrow generalizability, as training is overly coupled with specific tasks. In response, we propose DUPLEX, an inductive framework for complex embeddings of directed graphs. It (1) leverages Hermitian adjacency matrix decomposition for comprehensive neighbor integration, (2) employs a dual GAT encoder for directional neighbor modeling, and (3) features two parameter-free decoders to decouple training from particular tasks. DUPLEX outperforms state-of-the-art models, especially for nodes with sparse connectivity, and demonstrates robust inductive capability and adaptability across various tasks. The code is available at https://github.com/alipay/DUPLEX.

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2406.05261 [pdf, other]

Split-and-Fit: Learning B-Reps via Structure-Aware Voronoi Partitioning

Authors: Yilin Liu, Jiale Chen, Shanshan Pan, Daniel Cohen-Or, Hao Zhang, Hui Huang

Abstract: We introduce a novel method for acquiring boundary representations (B-Reps) of 3D CAD models which involves a two-step process: it first applies a spatial partitioning, referred to as the "split", followed by a "fit" operation to derive a single primitive within each partition. Specifically, our partitioning aims to produce the classical Voronoi diagram of the set of ground-truth (GT) B-Rep primitives. In contrast to prior B-Rep constructions which were bottom-up, either via direct primitive fitting or point clustering, our Split-and-Fit approach is top-down and structure-aware, since a Voronoi partition explicitly reveals both the number of and the connections between the primitives. We design a neural network to predict the Voronoi diagram from an input point cloud or distance field via a binary classification. We show that our network, coined NVD-Net for neural Voronoi diagrams, can effectively learn Voronoi partitions for CAD models from training data and exhibits superior generalization capabilities. Extensive experiments and evaluation demonstrate that the resulting B-Reps, consisting of parametric surfaces, curves, and vertices, are more plausible than those obtained by existing alternatives, with significant improvements in reconstruction quality. Code will be released on https://github.com/yilinliu77/NVDNet.

Submitted 7 June, 2024; originally announced June 2024.

Comments: ACM Transactions on Graphics (SIGGRAPH 2024); Project page: https://vcc.tech/research/2024/BRepVP; Code: https://github.com/yilinliu77/NVDNet

arXiv:2406.05167 [pdf, other]

Universal Critical Holography and Domain Wall Formation

Authors: Tian-Chi Ma, Han-Qing Shi, Hai-Qing Zhang, Adolfo del Campo

Abstract: Using holography, we study the universal scaling laws governing the coarsening dynamics of strongly coupled domain walls. Specifically, we study the universal dependence of the length of the domain wall interfaces on the quench rate. The relation satisfies the Kibble-Zurek scaling shortly after the critical point. However, as time goes by, the coarsening dynamics suppresses the Kibble-Zurek scaling in favor of a universal dynamical scaling of the characteristic length and the adiabatic growth of the system. Theoretical predictions of the universal scaling laws are consistent with numerical findings in both regimes for both weakly and strongly coupled systems.

Submitted 7 June, 2024; originally announced June 2024.

Comments: 9 pages, 9 figures

arXiv:2406.05127 [pdf, other]

Towards Semantic Equivalence of Tokenization in Multimodal LLM

Authors: Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in processing vision-language tasks. One crux of MLLMs lies in vision tokenization, which involves efficiently transforming input visual signals into feature representations that are most beneficial for LLMs. However, existing vision tokenizers, essential for semantic alignment between vision and language, remain problematic. Existing methods aggressively fragment visual input, corrupting the visual semantic integrity. To address this, this paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok), which groups visual features into semantic units via a dynamic clustering algorithm, flexibly determining the number of tokens based on image complexity. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features. The proposed MLLM (Setokim) equipped with SeTok demonstrates superior performance across various tasks, as evidenced by our experimental results. The project page is at https://chocowu.github.io/SeTok-web/.

Submitted 27 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

Comments: Technical Report. The project page: https://chocowu.github.io/SeTok-web/

arXiv:2406.05031 [pdf, other]

Unified view of scalar and vector dark matter solitons

Authors: Hong-Yi Zhang

Abstract: The existence of solitons -- stable, long-lived, and localized field configurations -- is a generic prediction for ultralight dark matter. These solitons, known by various names such as boson stars, axion stars, oscillons, and Q-balls depending on the context, are typically treated as distinct entities in the literature. This study aims to provide a unified perspective on these solitonic objects for real or complex, scalar or vector dark matter, considering self-interactions and nonminimal gravitational interactions. We demonstrate that these solitons share universal nonrelativistic properties, such as conserved charges, mass-radius relations, stability and profiles. Without accounting for alternative interactions or relativistic effects, distinguishing between real and complex scalar dark matter is challenging. However, self-interactions differentiate real and complex vector dark matter due to their different dependencies on the macroscopic spin density of dark matter waves. Furthermore, gradient-dependent nonminimal gravitational interactions impose an upper bound on soliton amplitudes, influencing their mass distribution and phenomenology in the present-day universe.

Submitted 7 June, 2024; originally announced June 2024.

Comments: 17+3 pages, 5 figures

arXiv:2406.04949 [pdf, other]

Nacala-Roof-Material: Drone Imagery for Roof Detection, Classification, and Segmentation to Support Mosquito-borne Disease Risk Assessment

Authors: Venkanna Babu Guthula, Stefan Oehmcke, Remigio Chilaule, Hui Zhang, Nico Lang, Ankit Kariryaa, Johan Mottelson, Christian Igel

Abstract: As low-quality housing and in particular certain roof characteristics are associated with an increased risk of malaria, classification of roof types based on remote sensing imagery can support the assessment of malaria risk and thereby help prevent the disease. To support research in this area, we release the Nacala-Roof-Material dataset, which contains high-resolution drone images from Mozambique with corresponding labels delineating houses and specifying their roof types. The dataset defines a multi-task computer vision problem, comprising object detection, classification, and segmentation. In addition, we benchmarked various state-of-the-art approaches on the dataset. Canonical U-Nets, YOLOv8, and a custom decoder on pretrained DINOv2 served as baselines. We show that each of the methods has its advantages but none is superior on all tasks, which highlights the potential of our dataset for future research in multi-task learning. While the tasks are closely related, accurate segmentation of objects does not necessarily imply accurate instance separation, and vice versa. We address this general issue by introducing a variant of the deep ordinal watershed (DOW) approach that additionally separates the interior of objects, allowing for improved object delineation and separation. We show that our DOW variant is a generic approach that improves the performance of both U-Net and DINOv2 backbones, leading to a better trade-off between semantic segmentation and instance segmentation.

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.04762 [pdf, other]

Holographic Intelligence Surface Assisted Integrated Sensing and Communication

Authors: Zhuoyang Liu, Yuchen Zhang, Haiyang Zhang, Feng Xu, Yonina C. Eldar

Abstract: Traditional discrete-array-based systems fail to exploit interactions between closely spaced antennas, resulting in inadequate utilization of the aperture resource. In this paper, we propose a holographic intelligence surface (HIS) assisted integrated sensing and communication (HISAC) system, wherein both the transmitter and receiver are fabricated using a continuous-aperture array. A continuous-discrete transformation of the HIS pattern based on the Fourier transform is proposed, converting the continuous pattern design into a discrete beamforming design. We formulate a joint transmit-receive beamforming optimization problem for the HISAC system, aiming to balance the performance of multi-target sensing while fulfilling the performance requirement of multi-user communication. To solve the non-convex problem with coupled variables, an alternating optimization-based algorithm is proposed to optimize the HISAC transmit-receive beamforming in an alternate manner. Specifically, the transmit beamforming design is solved by decoupling into a series of feasibility-checking sub-problems while the receive beamforming is determined by the Rayleigh quotient-based method. Simulation results demonstrate the superiority of the proposed HISAC system over traditional discrete-array-based ISAC systems, achieving significantly higher sensing performance while guaranteeing predetermined communication performance.

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.04584 [pdf, other]

CLoG: Benchmarking Continual Learning of Image Generation Models

Authors: Haotian Zhang, Junting Zhou, Haowei Lin, Hang Ye, Jianhua Zhu, Zihao Wang, Liangcai Gao, Yizhou Wang, Yitao Liang

Abstract: Continual Learning (CL) poses a significant challenge in Artificial Intelligence, aiming to mirror the human ability to incrementally acquire knowledge and skills. While extensive research has focused on CL within the context of classification tasks, the advent of increasingly powerful generative models necessitates the exploration of Continual Learning of Generative models (CLoG). This paper advocates for shifting the research focus from classification-based CL to CLoG. We systematically identify the unique challenges presented by CLoG compared to traditional classification-based CL. We adapt three types of existing CL methodologies, replay-based, regularization-based, and parameter-isolation-based methods to generative tasks and introduce comprehensive benchmarks for CLoG that feature great diversity and broad task coverage. Our benchmarks and results yield intriguing insights that can be valuable for developing future CLoG methods. Additionally, we will release a codebase designed to facilitate easy benchmarking and experimentation in CLoG publicly at https://github.com/linhaowei1/CLoG. We believe that shifting the research focus to CLoG will benefit the continual learning community and illuminate the path for next-generation AI-generated content (AIGC) in a lifelong learning paradigm.