The background model of the CUPID-Mo $0νββ$ experiment

Authors: CUPID-Mo Collaboration, C. Augier, A. S. Barabash, F. Bellini, G. Benato, M. Beretta, L. Bergé, J. Billard, Yu. A. Borovlev, L. Cardani, N. Casali, A. Cazes, E. Celi, M. Chapellier, D. Chiesa, I. Dafinei, F. A. Danevich, M. De Jesus, P. de Marcillac, T. Dixon, L. Dumoulin, K. Eitel, F. Ferri, B. K. Fujikawa, et al. (58 additional authors not shown)

Abstract: CUPID-Mo, located in the Laboratoire Souterrain de Modane (France), was a demonstrator for the next generation $0νββ$ decay experiment, CUPID. It consisted of an array of 20 enriched Li$_{2}$$^{100}$MoO$_4$ bolometers and 20 Ge light detectors and has demonstrated that the technology of scintillating bolometers with particle identification capabilities is mature. Furthermore, CUPID-Mo can inform and validate the background prediction for CUPID. In this paper, we present a detailed model of the CUPID-Mo backgrounds. This model is able to describe well the features of the experimental data and enables studies of the $2νββ$ decay and other processes with high precision. We also measure the radio-purity of the Li$_{2}$$^{100}$MoO$_4$ crystals which are found to be sufficient for the CUPID goals. Finally, we also obtain a background index in the region of interest of 3.7$^{+0.9}_{-0.8}$(stat)$^{+1.5}_{-0.7}$(syst)$\times10^{-3}$ counts/$Δ$E$_{FWHM}$/mol$_{iso}$/yr, the lowest in a bolometric $0νββ$ decay experiment.

Submitted 2 May, 2023; originally announced May 2023.

arXiv:2305.00787 [pdf, other]

GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation

Authors: Zhenhui Ye, Jinzheng He, Ziyue Jiang, Rongjie Huang, Jiawei Huang, Jinglin Liu, Yi Ren, Xiang Yin, Zejun Ma, Zhou Zhao

Abstract: Generating talking person portraits with arbitrary speech audio is a crucial problem in the field of digital humans and the metaverse. A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency. Recently, neural radiance field (NeRF) has become a popular rendering technique in this field since it could achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video. However, there still exist several challenges for NeRF-based methods: 1) as for the lip synchronization, it is hard to generate a long facial motion sequence of high temporal consistency and audio-lip accuracy; 2) as for the video quality, due to the limited data used to train the renderer, it is vulnerable to out-of-domain input conditions and occasionally produces bad rendering results; 3) as for the system efficiency, the slow training and inference speed of the vanilla NeRF severely obstruct its usage in real-world applications. In this paper, we propose GeneFace++ to handle these challenges by 1) utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process; 2) proposing a landmark locally linear embedding method to regulate the outliers in the predicted motion sequence to avoid robustness issues; 3) designing a computationally efficient NeRF-based motion-to-video renderer to achieve fast training and real-time inference. With these settings, GeneFace++ becomes the first NeRF-based method that achieves stable and real-time talking face generation with generalized audio-lip synchronization. Extensive experiments show that our method outperforms state-of-the-art baselines in terms of subjective and objective evaluation. Video samples are available at https://genefaceplusplus.github.io.

Submitted 1 May, 2023; originally announced May 2023.

Comments: 18 Pages, 7 figures

arXiv:2305.00601 [pdf, ps, other]

doi 10.1007/s00208-024-02865-1

$G$-invariant Bergman kernel and geometric quantization on complex manifolds with boundary

Authors: Chin-Yu Hsiao, Rung-Tzung Huang, Xiaoshan Li, Guokuan Shao

Abstract: Let $M$ be a complex manifold with boundary $X$, which admits a holomorphic Lie group $G$-action preserving $X$. We establish a full asymptotic expansion for the $G$-invariant Bergman kernel under certain assumptions. As an application, we get a $G$-invariant version of Fefferman's result about regularity of biholomorphic maps on strongly pseudoconvex domains of $\mathbb C^n$. Moreover, we show that the Guillemin-Sternberg map on a complex manifold with boundary is Fredholm by developing a reduction-to-boundary technique, which establishes "quantization commutes with reduction" in this case.

Submitted 30 April, 2023; originally announced May 2023.

Comments: 36 pages

Journal ref: Math. Ann. 2024

arXiv:2305.00427 [pdf, other]

An overview of Web3.0 Technology: Infrastructure, Applications, and Popularity

Authors: Renke Huang, Jiachi Chen, Yanlin Wang, Tingting Bi, Zibin Zheng

Abstract: Web3, the next generation of the Internet, represents a decentralized and democratized web. Although it has garnered significant public interest and found numerous real-world applications, there is a limited understanding of people's perceptions and experiences with Web3. In this study, we conducted an empirical study to investigate the categories of Web3 application and their popularity, as well as the potential challenges and opportunities within this emerging landscape. Our research was carried out in two phases. In the first phase, we analyzed 200 popular Web3 projects associated with 10 leading Web3 venture capital firms. In the second phase, we collected and examined code-related data from GitHub and market-related data from blockchain browsers (e.g., Etherscan) for these projects. Our analysis revealed that the Web3 ecosystem can be categorized into two groups, i.e., Web3 infrastructure and Web3 applications, with each consisting of several subcategories or subdomains. We also gained insights into the popularity of these Web3 projects at both the code and market levels and pointed out the challenges in the Web3 ecosystem at the system, developer, and user levels, as well as the opportunities it presents. Our findings contribute to a better understanding of Web3 for researchers and developers, promoting further exploration and advancement in this innovative field.

Submitted 30 April, 2023; originally announced May 2023.

Comments: 25 pages, 5 figures

arXiv:2305.00113 [pdf, other]

doi 10.1103/PhysRevB.95.014111

Lattice dynamics and ferroelectric properties of the nitride perovskite ${\mathrm{LaWN}}_{3}$

Authors: Yue-Wen Fang, Craig A. J. Fisher, Akihide Kuwabara, Xin-Wei Shen, Takafumi Ogawa, Hiroki Moriwake, Rong Huang, Chun-Gang Duan

Abstract: Using first-principles calculations we examine the crystal structures and phase transitions of nitride perovskite LaWN$_3$. Lattice dynamics calculations indicate that the ground-state structure belongs to space group $R3c$. Two competitive phase transition pathways are identified which are characterized by symmetry-adapted distortion modes. The results suggest that $R3c$ LaWN$_3$ should be an excellent ferroelectric semiconductor: its large spontaneous polarization of around 61 $μ$C/cm$^2$ is comparable to that of PbTiO$_3$, and its band gap is about 1.72 eV. Ferroelectricity is found to result from the \emph{B}-site instability driven by hybridization between W-5$d$ and N-2$p$ orbitals. These properties make LaWN$_3$ an attractive candidate material for use in ferroelectric memory devices and photovoltaic cells.

Submitted 28 April, 2023; originally announced May 2023.

Comments: 13 pages, 8 figures in main text and 5 figures in supplementary

Journal ref: Phys. Rev. B 95, 014111 (2017)

arXiv:2304.12995 [pdf, other]

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Authors: Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe

Abstract: Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue. With an increasing demand to evaluate multi-modal LLMs of human intention understanding and cooperation with foundation models, we outline the principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the capabilities of AudioGPT in solving AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, which empower humans to create rich and diverse audio content with unprecedented ease. Our system is publicly available at https://github.com/AIGC-Audio/AudioGPT.

Submitted 25 April, 2023; originally announced April 2023.

arXiv:2304.11053 [pdf, other]

A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale

Authors: Cal Peyser, Michael Picheny, Kyunghyun Cho, Rohit Prabhavalkar, Ronny Huang, Tara Sainath

Abstract: Unpaired text and audio injection have emerged as dominant methods for improving ASR performance in the absence of a large labeled corpus. However, little guidance exists on deploying these methods to improve production ASR systems that are trained on very large supervised corpora and with realistic requirements like a constrained model size and CPU budget, streaming capability, and a rich lattice for rescoring and for downstream NLU tasks. In this work, we compare three state-of-the-art semi-supervised methods encompassing both unpaired text and audio as well as several of their combinations in a controlled setting using joint training. We find that in our setting these methods offer many improvements beyond raw WER, including substantial gains in tail-word WER, decoder computation during inference, and lattice density.

Submitted 19 April, 2023; originally announced April 2023.

Journal ref: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:2304.07999

Everyone Can Be Picasso? A Computational Framework into the Myth of Human versus AI Painting

Authors: Yilin Ye, Rong Huang, Kang Zhang, Wei Zeng

Abstract: The recent advances of AI technology, particularly in AI-Generated Content (AIGC), have enabled everyone to easily generate beautiful paintings with simple text description. With the stunning quality of AI paintings, it is widely questioned whether there still exists difference between human and AI paintings and whether human artists will be replaced by AI. To answer these questions, we develop a computational framework combining neural latent space and aesthetics features with visual analytics to investigate the difference between human and AI paintings. First, with categorical comparison of human and AI painting collections, we find that AI artworks show distributional difference from human artworks in both latent space and some aesthetic features like strokes and sharpness, while in other aesthetic features like color and composition there is less difference. Second, with individual artist analysis of Picasso, we show human artists' strength in evolving new styles compared to AI. Our findings provide concrete evidence for the existing discrepancies between human and AI paintings and further suggest improvements of AI art with more consideration of aesthetics and human artists' involvement.

Submitted 22 February, 2024; v1 submitted 17 April, 2023; originally announced April 2023.

Comments: The results in Figure 3 in Section 4 have error due to my mistakes in feature calculation. Particularly the error is in the classification accuracy

ACM Class: I.2.0; J.5; H.5.2

arXiv:2304.07036 [pdf, other]

Hierarchical Agent-based Reinforcement Learning Framework for Automated Quality Assessment of Fetal Ultrasound Video

Authors: Sijing Liu, Qilong Ying, Shuangchi He, Xin Yang, Dong Ni, Ruobing Huang

Abstract: Ultrasound is the primary modality to examine fetal growth during pregnancy, while the image quality could be affected by various factors. Quality assessment is essential for controlling the quality of ultrasound images to guarantee both the perceptual and diagnostic values. Existing automated approaches often require heavy structural annotations and the predictions may not necessarily be consistent with the assessment results by human experts. Furthermore, the overall quality of a scan and the correlation between the quality of frames should not be overlooked. In this work, we propose a reinforcement learning framework powered by two hierarchical agents that collaboratively learn to perform both frame-level and video-level quality assessments. It is equipped with a specially-designed reward mechanism that considers temporal dependency among frame quality and only requires sparse binary annotations to train. Experimental results on a challenging fetal brain dataset verify that the proposed framework could perform dual-level quality assessment and its predictions correlate well with the subjective assessment results.

Submitted 14 April, 2023; originally announced April 2023.

arXiv:2304.06590 [pdf, other]

Maximizing temporal quantum correlation by approaching an exceptional point

Authors: Chun-Wang Wu, Man-Chao Zhang, Yan-Li Zhou, Ting Chen, Ran Huang, Yi Xie, Bao-Quan Ou, Wei Wu, Adam Miranowicz, Jie Zhang, Hui Jing, Ping-Xing Chen

Abstract: Quantum correlations, both spatial and temporal, are the central pillars of quantum mechanics. Over the last two decades, a big breakthrough in quantum physics is its complex extension to the non-Hermitian realm, and dizzying varieties of novel phenomena and applications beyond the Hermitian framework have been uncovered. However, unique features of non-Hermitian quantum correlations, especially in the time domain, still remain to be explored. Here, for the first time, we experimentally achieve this goal by using a parity-time (PT)-symmetric trapped-ion system. The upper limit of temporal quantum correlations, known as the algebraic bound, which has so far not been achieved in the standard measurement scenario, is reached here by approaching the exceptional point (EP), thus showing the unexpected ability of EPs in tuning temporal quantum correlation effects. Our study, unveiling the fundamental interplay of non-Hermiticity, nonlinearity, and temporal quantum correlations, provides the first step towards exploring and utilizing various non-Hermitian temporal quantum effects by operating a wide range of EP devices, which are important for both fundamental studies and applications of quantum EP systems.

Submitted 13 April, 2023; originally announced April 2023.

Comments: 4 figures and 8 pages

arXiv:2303.17997 [pdf, other]

Switching classical and quantum nonreciprocities with spinning photonics

Authors: Yonglin Xiang, Yunlan Zuo, Xun-Wei Xu, Ran Huang, Hui Jing

Abstract: We study how to achieve, manipulate, and switch classical or quantum nonreciprocal effects of light with a spinning Kerr resonator. In particular, we show that even when there is no classical nonreciprocity (i.e., with the same mean number of photons for both clockwise and counterclockwise propagating modes), it is still possible to realize nonreciprocity of quantum correlations of photons in such a device. Also, by tuning the angular velocity and the optical backscattering strength, higher-order quantum nonreciprocity can appear, featuring qualitatively different third-order optical correlations, even in the absence of any nonreciprocity for both the mean photon number and its second-order correlations. The possibility to switch a single device between a classical isolator and a purely quantum directional system can provide more functions for nonreciprocal materials and new opportunities to realize novel quantum effects and applications, such as nonreciprocal multi-photon blockade, one-way photon bundles, and backaction-immune quantum communications.

Submitted 28 August, 2023; v1 submitted 31 March, 2023; originally announced March 2023.

arXiv:2303.17007 [pdf]

doi 10.1103/PhysRevD.107.112012

Impact of cross-section uncertainties on supernova neutrino spectral parameter fitting in the Deep Underground Neutrino Experiment

Authors: DUNE Collaboration, A. Abed Abud, B. Abi, R. Acciarri, M. A. Acero, M. R. Adames, G. Adamov, M. Adamowski, D. Adams, M. Adinolfi, C. Adriano, A. Aduszkiewicz, J. Aguilar, Z. Ahmad, J. Ahmed, B. Aimard, F. Akbar, K. Allison, S. Alonso Monsalve, M. Alrashed, A. Alton, R. Alvarez, P. Amedo, J. Anderson, D. A. Andrade, et al. (1294 additional authors not shown)

Abstract: A primary goal of the upcoming Deep Underground Neutrino Experiment (DUNE) is to measure the $\mathcal{O}(10)$ MeV neutrinos produced by a Galactic core-collapse supernova if one should occur during the lifetime of the experiment. The liquid-argon-based detectors planned for DUNE are expected to be uniquely sensitive to the $ν_e$ component of the supernova flux, enabling a wide variety of physics and astrophysics measurements. A key requirement for a correct interpretation of these measurements is a good understanding of the energy-dependent total cross section $σ(E_ν)$ for charged-current $ν_e$ absorption on argon. In the context of a simulated extraction of supernova $ν_e$ spectral parameters from a toy analysis, we investigate the impact of $σ(E_ν)$ modeling uncertainties on DUNE's supernova neutrino physics sensitivity for the first time. We find that the currently large theoretical uncertainties on $σ(E_ν)$ must be substantially reduced before the $ν_e$ flux parameters can be extracted reliably: in the absence of external constraints, a measurement of the integrated neutrino luminosity with less than 10% bias with DUNE requires $σ(E_ν)$ to be known to about 5%. The neutrino spectral shape parameters can be known to better than 10% for a 20% uncertainty on the cross-section scale, although they will be sensitive to uncertainties on the shape of $σ(E_ν)$. A direct measurement of low-energy $ν_e$-argon scattering would be invaluable for improving the theoretical precision to the needed level.

Submitted 7 July, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

Comments: 25 pages, 21 figures

Report number: FERMILAB-PUB-23-132-CSAID-LBNF-ND-T

Journal ref: Phys. Rev. D 107, 112012 (2023)

arXiv:2303.14491 [pdf, other]

Is It the End? Guidelines for Cinematic Endings in Data Videos

Authors: Xian Xu, Aoyu Wu, Leni Yang, Zheng Wei, Rong Huang, David Yip, Huamin Qu

Abstract: Data videos are becoming increasingly popular in society and academia. Yet little is known about how to create endings that strengthen a lasting impression and persuasion. To fill this gap, this work aims to develop guidelines for data video endings by drawing inspiration from cinematic arts. To contextualize cinematic endings in data videos, 111 film endings and 105 data video endings are first analyzed to identify four common styles using the framework of ending punctuation marks. We conducted expert interviews (N=11) and formulated 20 guidelines for creating cinematic endings in data videos. To validate our guidelines, we conducted a user study where 24 participants were invited to design endings with and without our guidelines, which are evaluated by experts and the general public. The participants praise the clarity and usability of the guidelines, and results show that the endings with guidelines are perceived to be more understandable, impressive, and reflective.

Submitted 25 March, 2023; originally announced March 2023.

arXiv:2303.12583 [pdf, other]

doi 10.1103/PhysRevMaterials.7.064408

Enhanced functional reversibility in lead-free ferroelectric material over long cycle pyroelectric energy conversion

Authors: Chenbo Zhang, Zeyuan Zhu, Ka Hung Chan, Ruhao Huang, Xian Chen

Abstract: The ferroelectric material usually exhibits temperature dependent spontaneous polarization, known as pyroelectricity, which can be used to directly convert thermal energy to electricity from ambient low-grade waste heat. When utilizing the structural phase transformations of the material, the conversion capability can be magnified; consequently, the device performance can be strongly boosted by orders of magnitude. However, common ferroelectric oxides suffer mechanical fatigue and functional degradation over cyclic phase transformations, hindering widespread applications of the energy conversion device. In this paper, we investigate the mechanical and functional reversibility of the material by lattice tuning and grain coarsening. We discover the lead-free compound Ba(Ce$_{0.005}$Zr$_{0.005}$)Ti$_{0.99}$O$_3$-0.10(Ba$_{0.7}$Ca$_{0.3}$)TiO$_3$ (BCZT-0.10BCT) satisfying the compatibility condition among all present phases by its lattice parameters, making the phase transformations highly reversible. We demonstrated that the energy conversion device with the equiaxial coarse grains exhibits exceptional fatigue-resistance, with stable pyroelectric current output at 4 $μ$A/cm$^2$ over 3,000 energy conversion cycles. Our work opens a new way to fabricate high-performance material that advances the pyroelectric energy conversion for practical application in engineering.

Submitted 22 March, 2023; originally announced March 2023.

Comments: 18 pages, 5 figures, 1 table

arXiv:2303.12270 [pdf, other]

EBSR: Enhanced Binary Neural Network for Image Super-Resolution

Authors: Renjie Wei, Shuwen Zhang, Zechun Liu, Meng Li, Yuchen Fan, Runsheng Wang, Ru Huang

Abstract: While the performance of deep convolutional neural networks for image super-resolution (SR) has improved significantly, the rapid increase of memory and computation requirements hinders their deployment on resource-constrained devices. Quantized networks, especially binary neural networks (BNN) for SR have been proposed to significantly improve the model inference efficiency but suffer from large performance degradation. We observe the activation distribution of SR networks demonstrates very large pixel-to-pixel, channel-to-channel, and image-to-image variation, which is important for high performance SR but gets lost during binarization. To address the problem, we propose two effective methods, including the spatial re-scaling as well as channel-wise shifting and re-scaling, which augments binary convolutions by retaining more spatial and channel-wise information. Our proposed models, dubbed EBSR, demonstrate superior performance over prior art methods both quantitatively and qualitatively across different datasets and different model sizes. Specifically, for x4 SR on Set5 and Urban100, EBSRlight improves the PSNR by 0.31 dB and 0.28 dB compared to SRResNet-E2FIF, respectively, while EBSR outperforms EDSR-E2FIF by 0.29 dB and 0.32 dB PSNR, respectively.

Submitted 21 March, 2023; originally announced March 2023.

arXiv:2303.10859 [pdf, other]

Improved Sample Complexity for Reward-free Reinforcement Learning under Low-rank MDPs

Authors: Yuan Cheng, Ruiquan Huang, Jing Yang, Yingbin Liang

Abstract: In reward-free reinforcement learning (RL), an agent explores the environment first without any reward information, in order to achieve certain learning goals afterwards for any given reward. In this paper we focus on reward-free RL under low-rank MDP models, in which both the representation and linear weight vectors are unknown. Although various algorithms have been proposed for reward-free low-rank MDPs, the corresponding sample complexity is still far from being satisfactory. In this work, we first provide the first known sample complexity lower bound that holds for any algorithm under low-rank MDPs. This lower bound implies it is strictly harder to find a near-optimal policy under low-rank MDPs than under linear MDPs. We then propose a novel model-based algorithm, coined RAFFLE, and show it can both find an $ε$-optimal policy and achieve an $ε$-accurate system identification via reward-free exploration, with a sample complexity significantly improving the previous results. Such a sample complexity matches our lower bound in the dependence on $ε$, as well as on $K$ in the large $d$ regime, where $d$ and $K$ respectively denote the representation dimension and action space cardinality. Finally, we provide a planning algorithm (without further interaction with true environment) for RAFFLE to learn a near-accurate representation, which is the first known representation learning guarantee under the same setting.

Submitted 20 March, 2023; originally announced March 2023.

Comments: Accepted by ICLR 2023

arXiv:2303.05309 [pdf, other]

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition

Authors: Xize Cheng, Linjun Li, Tao Jin, Rongjie Huang, Wang Lin, Zehan Wang, Huangdai Liu, Ye Wang, Aoxiong Yin, Zhou Zhao

Abstract: Multi-media communications facilitate global interaction among people. However, despite researchers exploring cross-lingual translation techniques such as machine translation and audio speech translation to overcome language barriers, there is still a shortage of cross-lingual studies on visual speech. This lack of research is mainly due to the absence of datasets containing visual speech and translated text pairs. In this paper, we present AVMuST-TED, the first dataset for Audio-Visual Multilingual Speech Translation, derived from TED talks. Nonetheless, visual speech is not as distinguishable as audio speech, making it difficult to develop a mapping from source speech phonemes to the target language text. To address this issue, we propose MixSpeech, a cross-modality self-learning framework that utilizes audio speech to regularize the training of visual speech tasks. To further minimize the cross-modality gap and its impact on knowledge transfer, we suggest adopting mixed speech, which is created by interpolating audio and visual streams, along with a curriculum learning strategy to adjust the mixing ratio as needed. MixSpeech enhances speech translation in noisy environments, improving BLEU scores for four languages on AVMuST-TED by +1.4 to +4.2. Moreover, it achieves state-of-the-art performance in lip reading on CMLR (11.1%), LRS2 (25.5%), and LRS3 (28.0%).

Submitted 9 March, 2023; originally announced March 2023.

Comments: https://github.com/Exgc/AVMuST-TED
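The MixSpeech abstract above describes mixed speech as an interpolation of audio and visual streams with a mixing ratio adjusted by a curriculum. A minimal sketch of that idea follows. It is not the authors' implementation: the function names `mix_streams` and `linear_ratio`, the assumption that both streams are already aligned frame by frame with equal feature width, the element-wise linear blend, and the linear schedule running from audio-dominant to visual-dominant are all illustrative assumptions.

```python
def mix_streams(audio_feats, visual_feats, lam):
    """Blend two aligned feature sequences (hypothetical helper).

    audio_feats, visual_feats: lists of frames, each frame a list of floats.
    lam: weight on the audio stream, in [0, 1]; (1 - lam) goes to the visual stream.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(audio_feats) != len(visual_feats):
        raise ValueError("streams must have the same number of frames")
    return [
        [lam * a + (1.0 - lam) * v for a, v in zip(a_frame, v_frame)]
        for a_frame, v_frame in zip(audio_feats, visual_feats)
    ]


def linear_ratio(step, total_steps, start=1.0, end=0.0):
    """Assumed curriculum: move the audio weight linearly from start to end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


# One frame with two features, blended with 25% audio and 75% visual.
mixed = mix_streams([[1.0, 2.0]], [[3.0, 6.0]], lam=0.25)
```

Under this assumed schedule, early training steps see mostly audio features and later steps see mostly visual features; the actual schedule and mixing operator in the paper may differ.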

arXiv:2303.02299 [pdf, other]

doi 10.3389/fphy.2023.1215468

Qubit Energy Tuner Based on Single Flux Quantum Circuits

Authors: Xiao Geng, Rutian Huang, Yongcheng He, Kaiyong He, Genting Dai, Liangliang Yang, Xinyu Wu, Qing Yu, Mingjun Cheng, Guodong Chen, Jianshe Liu, Wei Chen

Abstract: A device called qubit energy tuner (QET) based on single flux quantum (SFQ) circuits is proposed for Z control of superconducting qubits. Created from the improvement of flux digital-to-analog converters (flux DACs), a QET is able to set the energy levels or the frequencies of qubits, especially flux-tunable transmons, and perform gate operations requiring Z control. The circuit structure of QET is elucidated, which consists of an inductor loop and flux bias units for coarse tuning or fine tuning. The key feature of a QET is analyzed to understand how SFQ pulses change the inductor loop current, which provides external flux for qubits. To verify the functionality of the QET, three simulations are carried out. The first one verifies the responses of the inductor loop current to SFQ pulses. The results show that there is about 4.2% relative deviation between analytical solutions of the inductor loop current and the solutions from WRSpice time-domain simulation. The second and the third simulations with QuTiP show how a Z gate and an iSWAP gate can be performed by this QET, respectively, with corresponding fidelities 99.99884% and 99.93906% for a single gate operation on specific initial states. These simulations indicate that the SFQ-based QET could act as an efficient component of SFQ-based quantum-classical interfaces for digital Z control of large-scale superconducting quantum computers.

Submitted 3 March, 2023; originally announced March 2023.

arXiv:2303.01038 [pdf, other]

Neural Intrinsic Embedding for Non-rigid Point Cloud Matching

Authors: Puhua Jiang, Mingze Sun, Ruqi Huang

Abstract: As a primitive 3D data representation, point clouds are prevailing in 3D sensing, yet short of intrinsic structural information of the underlying objects. Such discrepancy poses great challenges on directly establishing correspondences between point clouds sampled from deformable shapes. In light of this, we propose Neural Intrinsic Embedding (NIE) to embed each vertex into a high-dimensional space in a way that respects the intrinsic structure. Based upon NIE, we further present a weakly-supervised learning framework for non-rigid point cloud registration. Unlike the prior works, we do not require expansive and sensitive off-line basis construction (e.g., eigen-decomposition of Laplacians), nor do we require ground-truth correspondence labels for supervision. We empirically show that our framework performs on par with or even better than the state-of-the-art baselines, which generally require more supervision and/or more structural geometric input.

Submitted 2 March, 2023; originally announced March 2023.

Comments: To appear at CVPR 2023

arXiv:2303.00802 [pdf, other]

Synthetic Cross-accent Data Augmentation for Automatic Speech Recognition

Authors: Philipp Klumpp, Pooja Chitkara, Leda Sarı, Prashant Serai, Jilong Wu, Irina-Elena Veliche, Rongqing Huang, Qing He

Abstract: The awareness for biased ASR datasets or models has increased notably in recent years. Even for English, despite a vast amount of available training data, systems perform worse for non-native speakers. In this work, we improve an accent-conversion model (ACM) which transforms native US-English speech into accented pronunciation. We include phonetic knowledge in the ACM training to provide accurate feedback about how well certain pronunciation patterns were recovered in the synthesized waveform. Furthermore, we investigate the feasibility of learned accent representations instead of static embeddings. Generated data was then used to train two state-of-the-art ASR systems. We evaluated our approach on native and non-native English datasets and found that synthetically accented data helped the ASR to better understand speech from seen accents. This observation did not translate to unseen accents, and it was not observed for a model that had been pre-trained exclusively with native speech.

Submitted 1 March, 2023; originally announced March 2023.

arXiv:2303.00772 [pdf, other]

doi 10.1103/PhysRevB.109.094416

Demonstrating the wormhole mechanism of the entanglement spectrum via a perturbed boundary

Authors: Zenan Liu, Rui-Zhen Huang, Zheng Yan, Dao-Xin Yao

Abstract: The Li-Haldane conjecture is one of the most famous conjectures in physics and opens a new research area in the quantum entanglement and topological phase. Although a lot of theoretical and numerical works have confirmed the conjecture in topological states with bulk-boundary correspondence, the cases with gapped boundary and the systems in high dimension are widely unknown. What is the valid scope of the Li-Haldane conjecture? Via the newly developed quantum Monte Carlo scheme, we are now able to extract the large-scale entanglement spectrum (ES) and study its relation with the edge energy spectrum generally. Taking the two-dimensional Affleck-Kennedy-Lieb-Tasaki model with a tunable boundary on the square-octagon lattice as an example, we find several counterexamples which cannot be explained by the Li-Haldane conjecture; e.g., the low-lying entanglement spectrum does not always show similar behaviors as the energy spectrum on the virtual boundary, and sometimes the ES resembles the energy spectrum of the edge even if it is gapped. Finally, we demonstrate that the newly proposed wormhole mechanism on the path integral of a reduced density matrix is the formation principle of the general ES. We find that the Li-Haldane conjecture is a particular case in some limit of the wormhole picture while all the examples of the conjecture we have studied can totally be explained within the wormhole mechanism framework. Our results provide important evidence for demonstrating that the wormhole mechanism is the fundamental principle to explain the ES.

Submitted 9 April, 2024; v1 submitted 1 March, 2023; originally announced March 2023.

Comments: 11 pages, 11 figures

Journal ref: Phys. Rev. B 109,094416 (2024)

arXiv:2302.14177 [pdf, other]

Soft-Search: Two Datasets to Study the Identification and Production of Research Software

Authors: Eva Maxfield Brown, Lindsey Schwartz, Richard Lewei Huang, Nicholas Weber

Abstract: Software is an important tool for scholarly work, but software produced for research is in many cases not easily identifiable or discoverable. A potential first step in linking research and software is software identification. In this paper we present two datasets to study the identification and production of research software. The first dataset contains almost 1000 human labeled annotations of software production from National Science Foundation (NSF) awarded research projects. We use this dataset to train models that predict software production. Our second dataset is created by applying the trained predictive models across the abstracts and project outcomes reports for all NSF funded projects between the years of 2010 and 2023. The result is an inferred dataset of software production for over 150,000 NSF awards. We release the Soft-Search dataset to aid in identifying and understanding research software production: https://github.com/si2-urssi/eager

Submitted 27 February, 2023; originally announced February 2023.

arXiv:2302.13737 [pdf, ps, other]

On Coresets for Clustering in Small Dimensional Euclidean Spaces

Authors: Lingxiao Huang, Ruiyuan Huang, Zengfeng Huang, Xuan Wu

Abstract: We consider the problem of constructing small coresets for $k$-Median in Euclidean spaces. Given a large set of data points $P\subset \mathbb{R}^d$, a coreset is a much smaller set $S\subset \mathbb{R}^d$, so that the $k$-Median costs of any $k$ centers w.r.t. $P$ and $S$ are close. Existing literature mainly focuses on the high-dimension case and there has been great success in obtaining dimension-independent bounds, whereas the case for small $d$ is largely unexplored. Considering many applications of Euclidean clustering algorithms are in small dimensions and the lack of systematic studies in the current literature, this paper investigates coresets for $k$-Median in small dimensions. For small $d$, a natural question is whether existing near-optimal dimension-independent bounds can be significantly improved. We provide affirmative answers to this question for a range of parameters. Moreover, new lower bound results are also proved, which are the highest for small $d$. In particular, we completely settle the coreset size bound for $1$-d $k$-Median (up to log factors). Interestingly, our results imply a strong separation between $1$-d $1$-Median and $1$-d $2$-Median. As far as we know, this is the first such separation between $k=1$ and $k=2$ in any dimension.

Submitted 27 February, 2023; originally announced February 2023.
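The coreset abstract above says that the $k$-Median costs with respect to $P$ and $S$ should be "close" for any $k$ centers. One common way to make this precise is sketched below. The error parameter $\varepsilon$ and the point weights $w$ are assumptions taken from the standard $\varepsilon$-coreset formulation, not symbols that appear in the abstract.

```latex
% k-Median cost of a center set C, |C| = k, with respect to the full data set P:
\[
  \operatorname{cost}(P, C) \;=\; \sum_{p \in P} \min_{c \in C} \lVert p - c \rVert_2 .
\]
% A weighted set S with weights w : S \to \mathbb{R}_{\ge 0} is an
% \varepsilon-coreset if, for every C \subset \mathbb{R}^d with |C| = k,
\[
  \Bigl|\, \sum_{s \in S} w(s) \min_{c \in C} \lVert s - c \rVert_2
        \;-\; \operatorname{cost}(P, C) \Bigr|
  \;\le\; \varepsilon \cdot \operatorname{cost}(P, C).
\]
```

The size bounds discussed in the abstract then concern how small $|S|$ can be as a function of $k$, $d$, and the chosen error parameter.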

arXiv:2302.12471 [pdf, other]

Cubic singularities in binary linear electromechanical oscillators

Authors: Xin Zhou, Hui Jing, Xingjing Ren, Jianqi Zhang, Ran Huang, Zhipeng Li, Xiaopeng Sun, Xuezhong Wu, Cheng-Wei Qiu, Franco Nori, Dingbang Xiao

Abstract: Singularities arise in diverse disciplines and play a key role in both exploring fundamental laws of physics and making highly-sensitive sensors. Higher-order (>3) singularities, with further improved performance, however, usually require exquisite tuning of multiple (>3) coupled degrees of freedom or nonlinear control, thus severely limiting their applications in practice. Here we propose theoretically and confirm using mechanics experiments that, cubic singularities can be realized in a coupled binary system without any nonlinearity, only by observing the phase tomography of the driven response. By steering the cubic phase-tomographic singularities in an electrostatically-tunable micromechanical system, enhanced cubic-root response to frequency perturbation and voltage-controlled nonreciprocity are demonstrated. Our work opens up a new phase-tomographic method for interacted-system research and sheds new light on building and engineering advanced singular devices with simple and well-controllable elements, with a wide range of applications including precision metrology, portable nonreciprocal devices, and on-chip mechanical computing.

Submitted 24 February, 2023; originally announced February 2023.

arXiv:2302.10463 [pdf, other]

Multimodal Trajectory Prediction: A Survey

Authors: Renhao Huang, Hao Xue, Maurice Pagnucco, Flora Salim, Yang Song

Abstract: Trajectory prediction is an important task to support safe and intelligent behaviours in autonomous systems. Many advanced approaches have been proposed over the years with improved spatial and temporal feature extraction. However, human behaviour is naturally multimodal and uncertain: given the past trajectory and surrounding environment information, an agent can have multiple plausible trajectories in the future. To tackle this problem, an essential task named multimodal trajectory prediction (MTP) has recently been studied, which aims to generate a diverse, acceptable and explainable distribution of future predictions for each agent. In this paper, we present the first survey for MTP with our unique taxonomies and comprehensive analysis of frameworks, datasets and evaluation metrics. In addition, we discuss multiple future directions that can help researchers develop novel multimodal trajectory prediction systems.

Submitted 21 February, 2023; originally announced February 2023.

arXiv:2301.13662 [pdf, other]

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

Authors: Dongchao Yang, Songxiang Liu, Rongjie Huang, Chao Weng, Helen Meng

Abstract: Expressive text-to-speech (TTS) aims to synthesize different speaking style speech according to human's demands. Nowadays, there are two common ways to control speaking styles: (1) Pre-defining a group of speaking style and using categorical index to denote different speaking style. However, there are limitations in the diversity of expressiveness, as these models can only generate the pre-defined styles. (2) Using reference speech as style input, which results in a problem that the extracted style information is not intuitive or interpretable. In this study, we attempt to use natural language as style prompt to control the styles in the synthetic speech, e.g., "Sigh tone in full of sad mood with some helpless feeling". Considering that there is no existing TTS corpus which is proper to benchmark this novel task, we first construct a speech corpus, whose speech samples are annotated with not only content transcriptions but also style descriptions in natural language. Then we propose an expressive TTS model, named as InstructTTS, which is novel in the sense of following aspects: (1) We fully take the advantage of self-supervised learning and cross-modal metric learning, and propose a novel three-stage training procedure to obtain a robust sentence embedding model, which can effectively capture semantic information from the style prompts and control the speaking style in the generated speech.
(2) We propose to model acoustic features in discrete latent space and train a novel discrete diffusion probabilistic model to generate vector-quantized (VQ) acoustic tokens rather than the commonly-used mel spectrogram. (3) We jointly apply mutual information (MI) estimation and minimization during acoustic model training to minimize style-speaker and style-content MI, avoiding possible content and speaker information leakage from the style prompt.

Submitted 25 June, 2023; v1 submitted 31 January, 2023; originally announced January 2023.

Comments: Submit to TASLP

arXiv:2301.12661 [pdf, other]

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Authors: Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao

Abstract: Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. In this work, we propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps by 1) introducing pseudo prompt enhancement with a distill-then-reprogram approach, it alleviates data scarcity with orders of magnitude concept compositions by using language-free audios; 2) leveraging spectrogram autoencoder to predict the self-supervised audio representation instead of waveforms. Together with robust contrastive language-audio pretraining (CLAP) representations, Make-An-Audio achieves state-of-the-art results in both objective and subjective benchmark evaluation. Moreover, we present its controllability and generalization for X-to-Audio with "No Modality Left Behind", for the first time unlocking the ability to generate high-definition, high-fidelity audios given a user-defined modality input. Audio samples are available at https://Text-to-Audio.github.io

Submitted 29 January, 2023; originally announced January 2023.

Comments: Audio samples are available at https://Text-to-Audio.github.io

arXiv:2301.12520 [pdf, other]

Producing Usable Taxonomies Cheaply and Rapidly at Pinterest Using Discovered Dynamic $μ$-Topics

Authors: Abhijit Mahabal, Jiyun Luo, Rui Huang, Michael Ellsworth, Rui Li

Abstract: Creating a taxonomy of interests is expensive and human-effort intensive: not only do we need to identify nodes and interconnect them, in order to use the taxonomy, we must also connect the nodes to relevant entities such as users, pins, and queries. Connecting to entities is challenging because of ambiguities inherent to language but also because individual interests are dynamic and evolve. Here, we offer an alternative approach that begins with bottom-up discovery of $μ$-topics called pincepts. The discovery process itself connects these $μ$-topics dynamically with relevant queries, pins, and users at high precision, automatically adapting to shifting interests. Pincepts cover all areas of user interest and automatically adjust to the specificity of user interests and are thus suitable for the creation of various kinds of taxonomies. Human experts associate taxonomy nodes with $μ$-topics (on average, 3 $μ$-topics per node), and the $μ$-topics offer a high-level data layer that allows quick definition, immediate inspection, and easy modification. Even more powerfully, $μ$-topics allow easy exploration of nearby semantic space, enabling curators to spot and fill gaps. Curators' domain knowledge is heavily leveraged and we thus don't need untrained mechanical Turks, allowing further cost reduction. These $μ$-topics thus offer a satisfactory "symbolic" stratum over which to define taxonomies.
We have successfully applied this technique for very rapidly iterating on and launching the home decor and fashion styles taxonomy for style-based personalization, prominently featured at the top of Pinterest search results, at 94% precision, improving search success rate by 34.8% as well as boosting long clicks and pin saves.

Submitted 29 January, 2023; originally announced January 2023.

arXiv:2301.07584 [pdf, other]

Joint Representation Learning for Text and 3D Point Cloud

Authors: Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang

Abstract: Recent advancements in vision-language pre-training (e.g. CLIP) have shown that vision models can benefit from language supervision. While many models using language modality have achieved great success on 2D vision tasks, the joint representation learning of 3D point cloud with text remains under-explored due to the difficulty of 3D-Text data pair acquisition and the irregularity of 3D data structure. In this paper, we propose a novel Text4Point framework to construct language-guided 3D point cloud models. The key idea is utilizing 2D images as a bridge to connect the point cloud and the language modalities. The proposed Text4Point follows the pre-training and fine-tuning paradigm. During the pre-training stage, we establish the correspondence of images and point clouds based on the readily available RGB-D data and use contrastive learning to align the image and point cloud representations. Together with the well-aligned image and text features achieved by CLIP, the point cloud features are implicitly aligned with the text embeddings. Further, we propose a Text Querying Module to integrate language information into 3D representation learning by querying text embeddings with point cloud features. For fine-tuning, the model learns task-specific 3D representations under informative language guidance from the label set without 2D images. Extensive experiments demonstrate that our model shows consistent improvement on various downstream tasks, such as point cloud semantic segmentation, instance segmentation, and object detection.
The code will be available here: https://github.com/LeapLabTHU/Text4Point

Submitted 18 January, 2023; originally announced January 2023.

arXiv:2301.04327 [pdf, other]

Dual Learning for Large Vocabulary On-Device ASR

Authors: Cal Peyser, Ronny Huang, Tara Sainath, Rohit Prabhavalkar, Michael Picheny, Kyunghyun Cho

Abstract: Dual learning is a paradigm for semi-supervised machine learning that seeks to leverage unsupervised data by solving two opposite tasks at once. In this scheme, each model is used to generate pseudo-labels for unlabeled examples that are used to train the other model. Dual learning has seen some use in speech processing by pairing ASR and TTS as dual tasks. However, these results mostly address only the case of using unpaired examples to compensate for very small supervised datasets, and mostly on large, non-streaming models. Dual learning has not yet been proven effective for using unsupervised data to improve realistic on-device streaming models that are already trained on large supervised corpora. We provide this missing piece through an analysis of an on-device-sized streaming conformer trained on the entirety of Librispeech, showing relative WER improvements of 10.7%/5.2% without an LM and 11.7%/16.4% with an LM.

Submitted 11 January, 2023; originally announced January 2023.

arXiv:2301.03398 [pdf, other]

Asynchronous Multi-Agent Reinforcement Learning for Efficient Real-Time Multi-Robot Cooperative Exploration

Authors: Chao Yu, Xinyi Yang, Jiaxuan Gao, Jiayu Chen, Yunfei Li, Jijia Liu, Yunfei Xiang, Ruixin Huang, Huazhong Yang, Yi Wu, Yu Wang

Abstract: We consider the problem of cooperative exploration where multiple robots need to cooperatively explore an unknown region as fast as possible. Multi-agent reinforcement learning (MARL) has recently become a trending paradigm for solving this challenge. However, existing MARL-based methods adopt action-making steps as the metric for exploration efficiency by assuming all the agents are acting in a fully synchronous manner: i.e., every single agent produces an action simultaneously and every single action is executed instantaneously at each time step. Despite its mathematical simplicity, such a synchronous MARL formulation can be problematic for real-world robotic applications. It can be typical that different robots may take slightly different wall-clock times to accomplish an atomic action or even periodically get lost due to hardware issues. Simply waiting for every robot to be ready for the next action can be particularly time-inefficient. Therefore, we propose an asynchronous MARL solution, Asynchronous Coordination Explorer (ACE), to tackle this real-world challenge. We first extend a classical MARL algorithm, multi-agent PPO (MAPPO), to the asynchronous setting and additionally apply action-delay randomization to enforce the learned policy to generalize better to varying action delays in the real world.
Moreover, each navigation agent is represented as a team-size-invariant CNN-based policy, which greatly benefits real-robot deployment by handling possible robot loss and allows bandwidth-efficient intra-agent communication through low-dimensional CNN features. We first validate our approach in a grid-based scenario. Both simulation and real-robot results show that ACE reduces actual exploration time by over 10% compared with classical approaches. We also apply our framework to a high-fidelity visual-based environment, Habitat, achieving 28% improvement in exploration efficiency.

Submitted 11 April, 2023; v1 submitted 9 January, 2023; originally announced January 2023.

Comments: This paper is accepted by AAMAS 2023. The source code can be found in https://github.com/yang-xy20/async_mappo

arXiv:2301.02814 [pdf, ps, other]

Randomized Greedy Algorithms and Composable Coreset for k-Center Clustering with Outliers

Authors: Hu Ding, Ruomin Huang, Kai Liu, Haikuo Yu, Zixiu Wang

Abstract: In this paper, we study the problem of $k$-center clustering with outliers. The problem has many important applications in the real world, but the presence of outliers can significantly increase the computational complexity. Though a number of methods have been developed in the past decades, it is still quite challenging to design a quality-guaranteed algorithm with low complexity for this problem. Our idea is inspired by the greedy method, Gonzalez's algorithm, that was developed for solving the ordinary $k$-center clustering problem. Based on some novel observations, we show that a simple randomized version of this greedy strategy actually can handle outliers efficiently. We further show that this randomized greedy approach also yields a small coreset for the problem in doubling metrics (even if the doubling dimension is not given), which can greatly reduce the computational complexity. Moreover, together with the partial clustering framework proposed in arXiv:1703.01539, we prove that our coreset method can be applied to distributed data with a low communication complexity. The experimental results suggest that our algorithms can achieve near optimal solutions and yield lower complexities compared with the existing methods.

Submitted 7 January, 2023; originally announced January 2023.
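The abstract above builds on Gonzalez's greedy method for the ordinary $k$-center problem. As background only, here is a minimal sketch of that classical farthest-point heuristic. It is not the randomized, outlier-aware variant the paper proposes; the function name `gonzalez_k_center`, the seeded choice of the first center, and the Euclidean metric are assumptions made for illustration.

```python
import math
import random


def gonzalez_k_center(points, k, seed=0):
    """Classical farthest-point greedy for k-center (no outlier handling).

    points: non-empty list of equal-length coordinate tuples.
    Returns (centers, radius), where radius is the largest distance from
    any point to its nearest chosen center.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    rng = random.Random(seed)
    # Start from an arbitrary point (seeded here for reproducibility).
    centers = [points[rng.randrange(len(points))]]
    # dist[i] = distance from points[i] to its nearest center so far.
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Greedy step: the point farthest from all current centers becomes a center.
        far = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[far])
        dist = [min(d, math.dist(p, points[far])) for d, p in zip(dist, points)]
    return centers, max(dist)


# Two well-separated pairs of points: one center lands in each pair.
centers, radius = gonzalez_k_center([(0, 0), (0, 1), (10, 0), (10, 1)], k=2)
```

Because the greedy step always picks the single farthest point, one distant outlier can capture a center, which is the weakness the randomized version described in the abstract is meant to address.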

arXiv:2301.02497 [pdf, ps, other]

Some asymptotic formulae for torsion in homotopy groups

Authors: Guy Boyde, Ruizhi Huang

Abstract: Inspired by a remarkable work of Félix, Halperin and Thomas on the asymptotic estimation of the ranks of rational homotopy groups, and more recent works of Wu and the authors on local hyperbolicity, we prove two asymptotic formulae for torsion rank of homotopy groups, one using ordinary homology and one using $K$-theory. We use these to obtain explicit quantitative asymptotic lower bounds on the torsion rank of the homotopy groups for many interesting spaces after suspension, including Moore spaces, Eilenberg-MacLane spaces, complex projective spaces, complex Grassmannians, Milnor hypersurfaces and unitary groups.

Submitted 6 January, 2023; originally announced January 2023.

Comments: 19 pages; comments are very welcome

MSC Class: 55Q52; 55Q05 (Primary) 55Q15; 55P40 (Secondary)

arXiv:2301.01995 [pdf, other]

doi 10.1093/mnras/stad108

Exploring the Intrinsic Scatter of the Star-Forming Galaxy Main Sequence at redshift 0.5 to 3.0

Authors: Rongjun Huang, Andrew J. Battisti, Kathryn Grasha, Elisabete da Cunha, Claudia del P Lagos, Sarah K. Leslie, Emily Wisnioski

Abstract: Previous studies have shown that the normalization and scatter of the galaxy 'main sequence' (MS), the relation between star formation rate (SFR) and stellar mass ($M_*$), evolves over cosmic time. However, such studies often rely on photometric redshifts and/or only rest-frame UV to near-IR data, which may underestimate the SFR and $M_*$ uncertainties. We use MAGPHYS+photo-z to fit the UV to radio spectral energy distributions of 12,380 galaxies in the COSMOS field at $0.5<z<3.0$ and self-consistently include photometric redshift uncertainties on the derived SFR and $M_*$. We quantify the effect on the observed MS scatter from (1) photometric redshift uncertainties (which are minor) and (2) fitting only rest-frame ultraviolet to near-infrared observations (which are severe). At fixed redshift and $M_*$, we find that the intrinsic MS scatter for our sample of galaxies is 1.4 to 2.6 times larger than the measurement uncertainty. The average intrinsic MS scatter has decreased by 0.1 dex from $z=0.5$ to $\sim2.0$. At low-$z$, the trend between the intrinsic MS scatter and $M_*$ follows a functional form similar to an inverse stellar mass-halo mass relation (SMHM; $M_*$/$M_{\rm halo}$ vs $M_*$), with a minimum in intrinsic MS scatter at log($M_*/M_{\odot})\sim10.25$ and larger scatter at both lower and higher $M_*$; while this distribution becomes flatter for high-$z$. The SMHM is thought to be a consequence of feedback effects and this similarity may suggest a link between galaxy feedback and the intrinsic MS scatter. These results favor a slight evolution in the intrinsic MS scatter with both redshift and mass.

Submitted 10 January, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

Comments: 16 pages, 15 figures, 3 tables. The paper has been accepted in MNRAS on January 3rd, 2023

arXiv:2212.09807 [pdf, other]

Highly-parallelized simulation of a pixelated LArTPC on a GPU

Authors: DUNE Collaboration, A. Abed Abud, B. Abi, R. Acciarri, M. A. Acero, M. R. Adames, G. Adamov, M. Adamowski, D. Adams, M. Adinolfi, C. Adriano, A. Aduszkiewicz, J. Aguilar, Z. Ahmad, J. Ahmed, B. Aimard, F. Akbar, K. Allison, S. Alonso Monsalve, M. Alrashed, C. Alt, A. Alton, R. Alvarez, P. Amedo, J. Anderson, et al. (1282 additional authors not shown)

Abstract: The rapid development of general-purpose computing on graphics processing units (GPGPU) is allowing the implementation of highly-parallelized Monte Carlo simulation chains for particle physics experiments. This technique is particularly suitable for the simulation of a pixelated charge readout for time projection chambers, given the large number of channels that this technology employs. Here we present the first implementation of a full microphysical simulator of a liquid argon time projection chamber (LArTPC) equipped with light readout and pixelated charge readout, developed for the DUNE Near Detector. The software is implemented with an end-to-end set of GPU-optimized algorithms. The algorithms have been written in Python and translated into CUDA kernels using Numba, a just-in-time compiler for a subset of Python and NumPy instructions. The GPU implementation achieves a speed up of four orders of magnitude compared with the equivalent CPU version. The simulation of the current induced on $10^3$ pixels takes around 1 ms on the GPU, compared with approximately 10 s on the CPU. The results of the simulation are compared against data from a pixel-readout LArTPC prototype.

Submitted 28 February, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

Comments: 26 pages, 15 figures

Report number: FERMILAB-PUB-22-926-LBNF

arXiv:2212.08568 [pdf, other]

Biomedical image analysis competitions: The state of current participation practice

Authors: Matthias Eisenmann, Annika Reinke, Vivienn Weru, Minu Dietlinde Tizabi, Fabian Isensee, Tim J. Adler, Patrick Godau, Veronika Cheplygina, Michal Kozubek, Sharib Ali, Anubha Gupta, Jan Kybic, Alison Noble, Carlos Ortiz de Solórzano, Samiksha Pachade, Caroline Petitjean, Daniel Sage, Donglai Wei, Elizabeth Wilden, Deepak Alapatt, Vincent Andrearczyk, Ujjwal Baid, Spyridon Bakas, Niranjan Balu, Sophia Bano, et al. (331 additional authors not shown)

Abstract: The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.

Submitted 12 September, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

arXiv:2212.07086 [pdf, other]

NLIP: Noise-robust Language-Image Pre-training

Authors: Runhui Huang, Yanxin Long, Jianhua Han, Hang Xu, Xiwen Liang, Chunjing Xu, Xiaodan Liang

Abstract: Large-scale cross-modal pre-training paradigms have recently shown ubiquitous success on a wide range of downstream tasks, e.g., zero-shot classification, retrieval and image captioning. However, their successes highly rely on the scale and quality of web-crawled data that naturally contain incomplete and noisy information (e.g., wrong or irrelevant content). Existing works either design manual rules to clean data or generate pseudo-targets as auxiliary signals for reducing noise impact, which do not explicitly tackle both the incorrect and incomplete challenges simultaneously. In this paper, to automatically mitigate the impact of noise by solely mining over existing data, we propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion. First, in noise-harmonization scheme, NLIP estimates the noise probability of each pair according to the memorization effect of cross-modal transformers, then adopts noise-adaptive regularization to harmonize the cross-modal alignments with varying degrees. Second, in noise-completion scheme, to enrich the missing object information of text, NLIP injects a concept-conditioned cross-modal decoder to obtain semantic-consistent synthetic captions to complete noisy ones, which uses the retrieved visual concepts (i.e., objects' names) for the corresponding image to guide captioning generation. By collaboratively optimizing noise-harmonization and noise-completion schemes, our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way. Extensive experiments show the significant performance improvements of our NLIP using only 26M data over existing pre-trained models (e.g., CLIP, FILIP and BLIP) on 12 zero-shot classification datasets, MSCOCO image captioning and zero-shot image-text retrieval tasks.

Submitted 4 January, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

Comments: AAAI 2023

arXiv:2212.05524 [pdf, other]

Bayesian inference for partial orders from random linear extensions: power relations from 12th Century Royal Acta

Authors: Geoff K. Nicholls, Jeong Eun Lee, Nicholas Karn, David Johnson, Rukuang Huang, Alexis Muir-Watt

Abstract: We give a new class of models for time series data in which actors are listed in order of precedence. We model the lists as a realisation of a queue in which queue-position is constrained by an underlying social hierarchy. We model the hierarchy as a partial order so that the lists are random linear extensions. We account for noise via a random queue-jumping process. We give a marginally consistent prior for the stochastic process of partial orders based on a latent variable representation for the partial order. This allows us to introduce a parameter controlling partial order depth and incorporate actor-covariates informing the position of actors in the hierarchy. We fit the model to witness lists from Royal Acta from England, Wales and Normandy in the eleventh and twelfth centuries. Witnesses are listed in order of social rank, with any bishops present listed as a group. Do changes in the order in which the bishops appear reflect changes in their personal authority? The underlying social order which constrains the positions of bishops within lists need not be a complete order and so we model the evolving social order as an evolving partial order. The status of an Anglo-Norman bishop was at the time partly determined by the length of time they had been in office. This enters our model as a time-dependent covariate. We fit the model, estimate partial orders and find evidence for changes in status over time. We interpret our results in terms of court politics. Simpler models, based on Bucket Orders and vertex-series-parallel orders, are rejected. We compare our results with a time-series extension of the Plackett-Luce model.

Submitted 1 August, 2023; v1 submitted 11 December, 2022; originally announced December 2022.

Comments: 57 pages, 37 figures and 2 tables including appendices

MSC Class: 62M05 (Primary) 06A06; 62P25 (Secondary)

arXiv:2212.02715 [pdf, other]

Efficient Learning of Voltage Control Strategies via Model-based Deep Reinforcement Learning

Authors: Ramij R. Hossain, Tianzhixi Yin, Yan Du, Renke Huang, Jie Tan, Wenhao Yu, Yuan Liu, Qiuhua Huang

Abstract: This article proposes a model-based deep reinforcement learning (DRL) method to design emergency control strategies for short-term voltage stability problems in power systems. Recent advances show promising results in model-free DRL-based methods for power systems, but model-free methods suffer from poor sample efficiency and training time, both critical for making state-of-the-art DRL algorithms practically applicable. DRL-agent learns an optimal policy via a trial-and-error method while interacting with the real-world environment. And it is desirable to minimize the direct interaction of the DRL agent with the real-world power grid due to its safety-critical nature. Additionally, state-of-the-art DRL-based policies are mostly trained using a physics-based grid simulator where dynamic simulation is computationally intensive, lowering the training efficiency. We propose a novel model-based-DRL framework where a deep neural network (DNN)-based dynamic surrogate model, instead of a real-world power-grid or physics-based simulation, is utilized with the policy learning framework, making the process faster and sample efficient. However, stabilizing model-based DRL is challenging because of the complex system dynamics of large-scale power systems. We solved these issues by incorporating imitation learning to have a warm start in policy learning, reward-shaping, and multi-step surrogate loss. Finally, we achieved 97.5% sample efficiency and 87.7% training efficiency for an application to the IEEE 300-bus test system.

Submitted 5 December, 2022; originally announced December 2022.

arXiv:2211.15432 [pdf, other]

E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

Authors: W. Ronny Huang, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, David Rybach, Robert David, Rohit Prabhavalkar, Cyril Allauzen, Cal Peyser, Trevor D. Strohman

Abstract: We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key challenge is allowing the segmenter (which runs in real-time, synchronously with the decoder) to finalize the 2nd pass (which runs 900 ms behind real-time) without introducing user-perceived latency or deletion errors during inference. We propose a design where the neural segmenter is integrated with the causal 1st pass decoder to emit a end-of-segment (EOS) signal in real-time. The EOS signal is then used to finalize the non-causal 2nd pass. We experiment with different ways to finalize the 2nd pass, and find that a novel dummy frame injection strategy allows for simultaneous high quality 2nd pass results and low finalization latency. On a real-world long-form captioning task (YouTube), we achieve 2.4% relative WER and 140 ms EOS latency gains over a baseline VAD-based segmenter with the same cascaded encoder.

Submitted 5 March, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

Comments: ICASSP 2023

arXiv:2211.15117 [pdf, other]

doi 10.1364/JOSAB.481956

Analysis and design of transition radiation in layered uniaxial crystals using Tandem neural networks

Authors: Xiaoke Gao, Xiaoyu Zhao, Ruoyu Huang, Siyuan Ma, Xikui Ma, Tianyu Dong

Abstract: With the flourishing development of nanophotonics, Cherenkov radiation pattern can be designed to achieve superior performance in particle detection by fine-tuning the properties of metamaterials such as photonic crystals (PCs) surrounding the swift particle. However, the radiation pattern can be sensitive to the geometry and material properties of PCs, such as periodicity, unit thickness, and dielectric fraction, making direct analysis and inverse design difficult. In this article, we propose a systematic method to analyze and design PC-based transition radiation, which is assisted by deep learning neural networks. By matching boundary conditions at the interfaces, Cherenkov-like radiation of multilayered structures can be resolved analytically using the cascading scattering matrix method, despite the optical axes not being aligned with the swift electron trajectory. Once well trained, forward deep learning neural networks can be utilized to predict the radiation pattern without further direct electromagnetic simulations; moreover, Tandem neural networks have been proposed to inversely design the geometry and/or material properties for desired Cherenkov radiation pattern. Our proposal demonstrates a promising strategy for dealing with layered-medium-based Cherenkov radiation detectors, and it can be extended for other emerging metamaterials, such as photonic time crystals.

Submitted 28 November, 2022; originally announced November 2022.

arXiv:2211.14864 [pdf, other]

A Faster, Lighter and Stronger Deep Learning-Based Approach for Place Recognition

Authors: Rui Huang, Ze Huang, Songzhi Su

Abstract: Visual Place Recognition is an essential component of systems for camera localization and loop closure detection, and it has attracted widespread interest in multiple domains such as computer vision, robotics and AR/VR. In this work, we propose a faster, lighter and stronger approach that can generate models with fewer parameters and can spend less time in the inference stage. We designed RepVGG-lite as the backbone network in our architecture, it is more discriminative than other general networks in the Place Recognition task. RepVGG-lite has more speed advantages while achieving higher performance. We extract only one scale patch-level descriptors from global descriptors in the feature extraction stage. Then we design a trainable feature matcher to exploit both spatial relationships of the features and their visual appearance, which is based on the attention mechanism. Comprehensive experiments on challenging benchmark datasets demonstrate the proposed method outperforming recent other state-of-the-art learned approaches, and achieving even higher inference speed. Our system has 14 times less params than Patch-NetVLAD, 6.8 times lower theoretical FLOPs, and run faster 21 and 33 times in feature extraction and feature matching. Moreover, the performance of our approach is 0.5\% better than Patch-NetVLAD in Recall@1. We used subsets of Mapillary Street Level Sequences dataset to conduct experiments for all other challenging conditions.

Submitted 27 November, 2022; originally announced November 2022.

Comments: CCF Conference on Computer Supported Cooperative Work and Social Computing (ChineseCSCW)

arXiv:2211.13955 [pdf, other]

MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention

Authors: Wenxuan Zeng, Meng Li, Wenjie Xiong, Tong Tong, Wen-jie Lu, Jin Tan, Runsheng Wang, Ru Huang

Abstract: Secure multi-party computation (MPC) enables computation directly on encrypted data and protects both data and model privacy in deep learning inference. However, existing neural network architectures, including Vision Transformers (ViTs), are not designed or optimized for MPC and incur significant latency overhead. We observe Softmax accounts for the major latency bottleneck due to a high communication complexity, but can be selectively replaced or linearized without compromising the model accuracy. Hence, in this paper, we propose an MPC-friendly ViT, dubbed MPCViT, to enable accurate yet efficient ViT inference in MPC. Based on a systematic latency and accuracy evaluation of the Softmax attention and other attention variants, we propose a heterogeneous attention optimization space. We also develop a simple yet effective MPC-aware neural architecture search algorithm for fast Pareto optimization. To further boost the inference efficiency, we propose MPCViT+, to jointly optimize the Softmax attention and other network components, including GeLU, matrix multiplication, etc. With extensive experiments, we demonstrate that MPCViT achieves 1.9%, 1.3% and 3.6% higher accuracy with 6.2x, 2.9x and 1.9x latency reduction compared with baseline ViT, MPCFormer and THE-X on the Tiny-ImageNet dataset, respectively. MPCViT+ further achieves a better Pareto front compared with MPCViT. The code and models for evaluation are available at https://github.com/PKU-SEC-Lab/mpcvit.

Submitted 19 August, 2023; v1 submitted 25 November, 2022; originally announced November 2022.

Comments: Accepted by ICCV 2023 conference

arXiv:2211.11478 [pdf, other]

Background-Mixed Augmentation for Weakly Supervised Change Detection

Authors: Rui Huang, Ruofei Wang, Qing Guo, Jieda Wei, Yuxiang Zhang, Wei Fan, Yang Liu

Abstract: Change detection (CD) is to decouple object changes (i.e., object missing or appearing) from background changes (i.e., environment variations) like light and season variations in two images captured in the same scene over a long time span, presenting critical applications in disaster management, urban development, etc. In particular, the endless patterns of background changes require detectors to have a high generalization against unseen environment variations, making this task significantly challenging. Recent deep learning-based methods develop novel network architectures or optimization strategies with paired-training examples, which do not handle the generalization issue explicitly and require huge manual pixel-level annotation efforts. In this work, for the first attempt in the CD community, we study the generalization issue of CD from the perspective of data augmentation and develop a novel weakly supervised training algorithm that only needs image-level labels. Different from general augmentation techniques for classification, we propose the background-mixed augmentation that is specifically designed for change detection by augmenting examples under the guidance of a set of background-changing images and letting deep CD models see diverse environment variations. Moreover, we propose the augmented & real data consistency loss that encourages the generalization increase significantly. Our method as a general framework can enhance a wide range of existing deep learning-based detectors. We conduct extensive experiments in two public datasets and enhance four state-of-the-art methods, demonstrating the advantages of our method. We release the code at https://github.com/tsingqguo/bgmix.

Submitted 19 June, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: AAAI 2023 Accepted

arXiv:2211.11255 [pdf, other]

Diffusion Denoising Process for Perceptron Bias in Out-of-distribution Detection

Authors: Luping Liu, Yi Ren, Xize Cheng, Rongjie Huang, Chongxuan Li, Zhou Zhao

Abstract: Out-of-distribution (OOD) detection is a crucial task for ensuring the reliability and safety of deep learning. Currently, discriminator models outperform other methods in this regard. However, the feature extraction process used by discriminator models suffers from the loss of critical information, leaving room for bad cases and malicious attacks. In this paper, we introduce a new perceptron bias assumption that suggests discriminator models are more sensitive to certain features of the input, leading to the overconfidence problem. To address this issue, we propose a novel framework that combines discriminator and generation models and integrates diffusion models (DMs) into OOD detection. We demonstrate that the diffusion denoising process (DDP) of DMs serves as a novel form of asymmetric interpolation, which is well-suited to enhance the input and mitigate the overconfidence problem. The discriminator model features of OOD data exhibit sharp changes under DDP, and we utilize the norm of this change as the indicator score. Our experiments on CIFAR10, CIFAR100, and ImageNet show that our method outperforms SOTA approaches. Notably, for the challenging InD ImageNet and OOD species datasets, our method achieves an AUROC of 85.7, surpassing the previous SOTA method's score of 77.4. Our implementation is available at \url{https://github.com/lu**-liu/DiffOOD}.

Submitted 3 June, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

arXiv:2211.10666 [pdf, other]

VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement

Authors: Chenye Cui, Yi Ren, Jinglin Liu, Rongjie Huang, Zhou Zhao

Abstract: Video to sound generation aims to generate realistic and natural sound given a video input. However, previous video-to-sound generation methods can only generate a random or average timbre without any controls or specializations of the generated sound timbre, leading to the problem that people cannot obtain the desired timbre under these methods sometimes. In this paper, we pose the task of generating sound with a specific timbre given a video input and a reference audio sample. To solve this task, we disentangle each target sound audio into three components: temporal information, acoustic information, and background information. We first use three encoders to encode these components respectively: 1) a temporal encoder to encode temporal information, which is fed with video frames since the input video shares the same temporal information as the original audio; 2) an acoustic encoder to encode timbre information, which takes the original audio as input and discards its temporal information by a temporal-corrupting operation; and 3) a background encoder to encode the residual or background sound, which uses the background part of the original audio as input. To make the generated result achieve better quality and temporal alignment, we also adopt a mel discriminator and a temporal discriminator for the adversarial training. Our experimental results on the VAS dataset demonstrate that our method can generate high-quality audio samples with good synchronization with events in video and high timbre similarity with the reference audio.

Submitted 19 November, 2022; originally announced November 2022.

arXiv:2211.09623 [pdf, other]

Cross-Modal Adapter for Text-Video Retrieval

Authors: Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Jiwen Lu, Jie Zhou, Shiji Song, Gao Huang

Abstract: Text-video retrieval is an important multi-modal learning task, where the goal is to retrieve the most relevant video for a given text query. Recently, pre-trained models, e.g., CLIP, show great potential on this task. However, as pre-trained models are scaling up, fully fine-tuning them on text-video retrieval datasets has a high risk of overfitting. Moreover, in practice, it would be costly to train and store a large model for each task. To overcome the above issues, we present a novel $\textbf{Cross-Modal Adapter}$ for parameter-efficient fine-tuning. Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterization layers. However, there are two notable differences. First, our method is designed for the multi-modal domain. Secondly, it allows early cross-modal interactions between CLIP's two encoders. Although surprisingly simple, our approach has three notable benefits: (1) reduces $\textbf{99.6}\%$ of fine-tuned parameters, and alleviates the problem of overfitting, (2) saves approximately 30% of training time, and (3) allows all the pre-trained parameters to be fixed, enabling the pre-trained model to be shared across datasets. Extensive experiments demonstrate that, without bells and whistles, it achieves superior or comparable performance compared to fully fine-tuned methods on MSR-VTT, MSVD, VATEX, ActivityNet, and DiDeMo datasets. The code will be available at \url{https://github.com/LeapLabTHU/Cross-Modal-Adapter}.

Submitted 17 November, 2022; originally announced November 2022.

Comments: Tech Report

arXiv:2211.08743 [pdf, other]

Yield Evaluation of Citrus Fruits based on the YoloV5 compressed by Knowledge Distillation

Authors: Yuqi Li, Yuting He, Yihang Zhou, Zirui Gong, Renjie Huang

Abstract: In the field of planting fruit trees, pre-harvest estimation of fruit yield is important for fruit storage and price evaluation. However, considering the cost, the yield of each tree cannot be assessed by directly picking the immature fruit. Therefore, the problem is a very difficult task. In this paper, a fruit counting and yield assessment method based on computer vision is proposed for citrus fruit trees as an example. Firstly, images of single fruit trees from different angles are acquired and the number of fruits is detected using a deep Convolutional Neural Network model YOLOv5, and the model is compressed using a knowledge distillation method. Then, a linear regression method is used to model yield-related features and evaluate yield. Experiments show that the proposed method can accurately count fruits and approximate the yield.

Submitted 16 November, 2022; originally announced November 2022.

arXiv:2211.06693 [pdf, other]

Smoluchowski coagulation equation with velocity dependence

Authors: Franco Flandoli, Ruojun Huang, Andrea Papini

Abstract: In the present article we introduce a variant of Smoluchowski's coagulation equation with both position and velocity variables taking a kinetic viewpoint arising as the scaling limit of a system of second-order (microscopic) coagulating particles. We focus on the rigorous study of the PDE system in the spatially-homogeneous case proving existence and uniqueness under different initial conditions in suitable weighted space, investigating also the regularity of such solutions.

Submitted 12 November, 2022; originally announced November 2022.

Comments: 38 pages, single column

MSC Class: 35Q70; 82C22; 40K05

arXiv:2211.03624 [pdf]

Extremely-Fast, Energy-Efficient Massive MIMO Precoding with Analog RRAM Matrix Computing

Authors: Pushen Zuo, Zhong Sun, Ru Huang

Abstract: Signal processing in wireless communications, such as precoding, detection, and channel estimation, are basically about solving inverse matrix problems, which, however, are slow and inefficient in conventional digital computers, thus requiring a radical paradigm shift to achieve fast, real-time solutions. Here, for the first time, we apply the emerging analog matrix computing (AMC) to the linear precoding of massive MIMO. The real-valued AMC concept is extended to process complex-valued signals. In order to adapt the MIMO channel models to RRAM conductance mapping, a new matrix inversion circuit is developed. In addition, fully analog dataflow and optimized operational amplifiers are designed to support AMC precoding implementation. Simulation results show that the zero-forcing precoding is solved within 20 ns for a 16x128 MIMO system, which is two orders of magnitude faster than the conventional digital approach. Meanwhile, the energy efficiency is improved by 50x.

Submitted 7 November, 2022; originally announced November 2022.

Comments: Submitted to an IEEE journal for possible publication

Showing 201–250 of 775 results for author: Huang, R