Search | arXiv e-print repository

Searching for Best Practices in Retrieval-Augmented Generation

Authors: Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang

Abstract: Retrieval-augmented generation (RAG) techniques have proven to be effective in integrating up-to-date information, mitigating hallucinations, and enhancing response quality, particularly in specialized domains. While many RAG approaches have been proposed to enhance large language models through query-dependent retrievals, these approaches still suffer from their complex implementation and prolonged response times. Typically, a RAG workflow involves multiple processing steps, each of which can be executed in various ways. Here, we investigate existing RAG approaches and their potential combinations to identify optimal RAG practices. Through extensive experiments, we suggest several strategies for deploying RAG that balance both performance and efficiency. Moreover, we demonstrate that multimodal retrieval techniques can significantly enhance question-answering capabilities about visual inputs and accelerate the generation of multimodal content using a "retrieval as generation" strategy.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.00608 [pdf, other]

Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace

Authors: Shian Du, Xiaotian Cheng, Qi Qian, Henglu Wei, Yi Xu, Xiangyang Ji

Abstract: Personalized text-to-image generation has attracted unprecedented attention in recent years due to its unique capability of generating highly personalized images from an input concept dataset and a novel textual prompt. However, previous methods focus solely on the performance of the reconstruction task, degrading their ability to combine with different textual prompts. Besides, optimizing in the high-dimensional embedding space usually leads to an unnecessarily time-consuming training process and slow convergence. To address these issues, we propose an efficient method to explore the target embedding in a textual subspace, drawing inspiration from the self-expressiveness property. Additionally, we propose an efficient selection strategy for determining the basis vectors of the textual subspace. The experimental evaluations demonstrate that the learned embedding can not only faithfully reconstruct the input image, but also significantly improve its alignment with a novel input textual prompt. Furthermore, we observe that optimizing in the textual subspace leads to a significant improvement in robustness to the initial word, relaxing the constraint that requires users to input the most relevant initial word. Our method opens the door to more efficient representation learning for personalized text-to-image generation.

Submitted 30 June, 2024; originally announced July 2024.

arXiv:2406.17006 [pdf, other]

Probing the nature of the $χ_{c1}(3872)$ state using radiative decays

Authors: LHCb collaboration, R. Aaij, A. S. W. Abdelmotteleb, C. Abellan Beteta, F. Abudinén, T. Ackernley, A. A. Adefisoye, B. Adeva, M. Adinolfi, P. Adlarson, C. Agapopoulou, C. A. Aidala, Z. Ajaltouni, S. Akar, K. Akiba, P. Albicocco, J. Albrecht, F. Alessio, M. Alexander, Z. Aliouche, P. Alvarez Cartelle, R. Amalric, S. Amato, J. L. Amey, Y. Amhis , et al. (1094 additional authors not shown)

Abstract: The radiative decays $χ_{c1}(3872)\rightarrow ψ(2S)γ$ and $χ_{c1}(3872)\rightarrow J/ψγ$ are used to probe the nature of the $χ_{c1}(3872)$ state using proton-proton collision data collected with the LHCb detector, corresponding to an integrated luminosity of 9 fb$^{-1}$. Using the $B^+\rightarrow χ_{c1}(3872)K^+$ decay, the $χ_{c1}(3872)\rightarrow ψ(2S)γ$ process is observed for the first time and the ratio of its partial width to that of the $χ_{c1}(3872)\rightarrow J/ψγ$ decay is measured to be $$ \frac{Γ_{χ_{c1}(3872)\rightarrow ψ(2S)γ}} {Γ_{χ_{c1}(3872)\rightarrow J/ψγ}} = 1.67 \pm 0.21 \pm 0.12 \pm 0.04 , $$ where the first uncertainty is statistical, the second systematic and the third is due to the uncertainties on the branching fractions of the $ψ(2S)$ and $J/ψ$ mesons. The measured ratio makes the interpretation of the $χ_{c1}(3872)$ state as a pure $D^0\bar{D}^{*0}+\bar{D}^0D^{*0}$ molecule questionable and strongly indicates a sizeable compact charmonium or tetraquark component within the $χ_{c1}(3872)$ state.

Submitted 24 June, 2024; originally announced June 2024.

Comments: 31 pages, 2 figures. All figures and tables, along with any supplementary material and additional information, are available at https://cern.ch/lhcbproject/Publications/p/LHCb-PAPER-2024-015.html (LHCb public pages)

Report number: LHCb-PAPER-2024-015, CERN-EP-2024-157
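
The three uncertainties quoted for the partial-width ratio in the abstract above can be folded into a single figure for quick comparisons. The following is a minimal sketch that assumes the three components are independent and may be added in quadrature; the abstract does not state this, and the helper name `combine_in_quadrature` is purely illustrative.

```python
import math


def combine_in_quadrature(*uncertainties):
    """Return the quadrature sum of independent uncertainty components."""
    return math.sqrt(sum(u * u for u in uncertainties))


# Values quoted in the abstract: ratio, then statistical, systematic,
# and branching-fraction uncertainties.
ratio = 1.67
stat, syst, bf = 0.21, 0.12, 0.04

# Assumption: the components are uncorrelated, so they add in quadrature.
total = combine_in_quadrature(stat, syst, bf)
print(f"ratio = {ratio:.2f} +/- {total:.2f} (combined, assuming independence)")
```

The combined value is dominated by the statistical term, which is why the quoted result is often summarized by that component alone.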

arXiv:2406.12111 [pdf, other]

Precision measurement of the $Ξ^-_b$ baryon lifetime

Authors: LHCb collaboration, R. Aaij, A. S. W. Abdelmotteleb, C. Abellan Beteta, F. Abudinén, T. Ackernley, A. A. Adefisoye, B. Adeva, M. Adinolfi, P. Adlarson, C. Agapopoulou, C. A. Aidala, Z. Ajaltouni, S. Akar, K. Akiba, P. Albicocco, J. Albrecht, F. Alessio, M. Alexander, Z. Aliouche, P. Alvarez Cartelle, R. Amalric, S. Amato, J. L. Amey, Y. Amhis , et al. (1064 additional authors not shown)

Abstract: A sample of $pp$ collision data, corresponding to an integrated luminosity of 5.5 fb$^{-1}$ and collected by the LHCb experiment during Run 2, is used to measure the ratio of the lifetime of the $Ξ^-_b$ baryon to that of the $Λ^0_b$ baryon, $r_τ\equivτ_{Ξ^-_b}/τ_{Λ^0_b}$. The value ${r_τ^{\rm Run\,2}=1.076\pm0.013\pm0.006}$ is obtained, where the first uncertainty is statistical and the second systematic. This value is averaged with the corresponding value from Run 1 to obtain ${r_τ^{\rm Run\,1,2} = 1.078\pm0.012\pm0.007}$. Multiplying by the world-average value of the $Λ^0_b$ lifetime yields $τ_{Ξ^-_b}^{\rm Run~1,2} = 1.578\pm0.018\pm0.010\pm0.011$ ps, where the uncertainties are statistical, systematic, and due to the limited knowledge of the $Λ^0_b$ lifetime. This measurement improves the precision of the current world average of the $Ξ^-_b$ lifetime by about a factor of two, and is in good agreement with the most recent theoretical predictions.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 12 pages, 5 figures. All figures and tables, along with any supplementary material and additional information, are available at https://cern.ch/lhcbproject/Publications/p/LHCb-PAPER-2024-010.html (LHCb public pages)

Report number: LHCb-PAPER-2024-010, CERN-EP-2024-139
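
The step from the lifetime ratio to the absolute $Ξ^-_b$ lifetime in the abstract above is a simple rescaling, and it can be checked with a few lines of arithmetic. The sketch below does not use the world-average $Λ^0_b$ lifetime itself, which the abstract does not quote; instead it backs out the implied scale factor from the two published central values, which is an assumption of this illustration.

```python
# Values quoted in the abstract (Run 1 + Run 2 combination).
ratio, ratio_stat, ratio_syst = 1.078, 0.012, 0.007
tau_xib = 1.578  # ps

# Assumption: the scale factor is inferred from the quoted central values,
# not taken from an external world-average table.
implied_tau_lambdab = tau_xib / ratio  # ps

# Multiplying the ratio uncertainties by the same factor gives the
# statistical and systematic uncertainties on the lifetime.
stat = ratio_stat * implied_tau_lambdab
syst = ratio_syst * implied_tau_lambdab

print(f"implied Lambda_b lifetime: {implied_tau_lambdab:.3f} ps")
print(f"stat: {stat:.3f} ps, syst: {syst:.3f} ps")
```

The rescaled statistical and systematic uncertainties round to the 0.018 ps and 0.010 ps quoted in the abstract; the third quoted uncertainty comes from the $Λ^0_b$ lifetime itself and is not reproduced here.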

arXiv:2406.09248 [pdf, other]

Wigner non-negative states that verify the Wigner entropy conjecture

Authors: Qipeng Qian, Christos N. Gagatsos

Abstract: We present further progress, in the form of analytical results, on the Wigner entropy conjecture set forth in https://link.aps.org/doi/10.1103/PhysRevA.104.042211 and https://iopscience.iop.org/article/10.1088/1751-8121/aa852f/meta. Said conjecture asserts that the differential entropy defined for non-negative, yet physical, Wigner functions is minimized by pure Gaussian states while the minimum entropy is equal to $1+\lnπ$. We prove this conjecture for the qubits formed by Fock states $|0\rangle$ and $|1\rangle$ that correspond to non-negative Wigner functions. In particular, we derive an explicit form of the Wigner entropy for those states lying on the boundary of the set of Wigner non-negative qubits. We then consider general mixed states and derive a sufficient condition for Wigner non-negativity. For states satisfying our condition we verify that the conjecture is true. Lastly, we elaborate on the states of the set which is in accordance with our condition.

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2405.17347 [pdf, other]

Comprehensive analysis of local and nonlocal amplitudes in the $B^0\rightarrow K^{*0}μ^+μ^-$ decay

Authors: LHCb collaboration, R. Aaij, A. S. W. Abdelmotteleb, C. Abellan Beteta, F. Abudinén, T. Ackernley, A. A. Adefisoye, B. Adeva, M. Adinolfi, P. Adlarson, C. Agapopoulou, C. A. Aidala, Z. Ajaltouni, S. Akar, K. Akiba, P. Albicocco, J. Albrecht, F. Alessio, M. Alexander, Z. Aliouche, P. Alvarez Cartelle, R. Amalric, S. Amato, J. L. Amey, Y. Amhis , et al. (1070 additional authors not shown)

Abstract: A comprehensive study of the local and nonlocal amplitudes contributing to the decay $B^0\rightarrow K^{*0}(\to K^+π^-) μ^+μ^-$ is performed by analysing the phase-space distribution of the decay products. The analysis is based on proton-proton collision data corresponding to an integrated luminosity of 8.4 fb$^{-1}$ collected by the LHCb experiment. This measurement employs for the first time a model of both one-particle and two-particle nonlocal amplitudes, and utilises the complete dimuon mass spectrum without any veto regions around the narrow charmonium resonances. In this way it is possible to explicitly isolate the local and nonlocal contributions and capture the interference between them. The results show that interference with nonlocal contributions, although larger than predicted, only has a minor impact on the Wilson Coefficients determined from the fit to the data. For the local contributions, the Wilson Coefficient $C_9$, responsible for vector dimuon currents, exhibits a $2.1σ$ deviation from the Standard Model expectation. The Wilson Coefficients $C_{10}$, $C_{9}'$ and $C_{10}'$ are all in better agreement than $C_{9}$ with the Standard Model and the global significance is at the level of $1.5σ$. The model used also accounts for nonlocal contributions from $B^{0}\to K^{*0}\left[τ^+τ^-\to μ^+μ^-\right]$ rescattering, resulting in the first direct measurement of the $b sττ$ vector effective-coupling $C_{9τ}$.

Submitted 27 May, 2024; originally announced May 2024.

Comments: All figures and tables, along with any supplementary material and additional information, are available at https://cern.ch/lhcbproject/Publications/p/LHCb-PAPER-2024-011.html (LHCb public pages)

Report number: LHCb-PAPER-2024-011, CERN-EP-2024-122

arXiv:2405.12688 [pdf, other]

Study of $b$-hadron decays to $Λ_c^+ h^- h^{\prime -}$ final states

Authors: LHCb collaboration, R. Aaij, A. S. W. Abdelmotteleb, C. Abellan Beteta, F. Abudinén, T. Ackernley, A. A. Adefisoye, B. Adeva, M. Adinolfi, P. Adlarson, C. Agapopoulou, C. A. Aidala, Z. Ajaltouni, S. Akar, K. Akiba, P. Albicocco, J. Albrecht, F. Alessio, M. Alexander, Z. Aliouche, P. Alvarez Cartelle, R. Amalric, S. Amato, J. L. Amey, Y. Amhis , et al. (1072 additional authors not shown)

Abstract: Decays of $Ξ_b^-$ and $Ω_b^-$ baryons to $Λ_c^+ h^- h^{\prime -}$ final states, with $h^- h^{\prime -}$ being $π^-π^-$, $K^-π^-$ and $K^-K^-$ meson pairs, are searched for using data collected with the LHCb detector. The data sample studied corresponds to an integrated luminosity of $8.7\,\mathrm{fb}^{-1}$ of $pp$ collisions collected at centre-of-mass energies $\sqrt{s} = 7$, $8$ and $13\,\mathrm{TeV}$. The products of the relative branching fractions and fragmentation fractions for each signal mode, relative to the $B^- \to Λ_c^+ \overline{p} π^-$ mode, are measured, with $Ξ_{b}^- \to Λ_{c}^+ K^- π^-$, $Ξ_{b}^- \to Λ_{c}^+ K^- K^-$ and $Ω_{b}^- \to Λ_{c}^+ K^- K^-$ decays being observed at over $5\,σ$ significance. The $Ξ_{b}^- \to Λ_{c}^+ K^- π^-$ mode is also used to measure the $Ξ_{b}^-$ production asymmetry, which is found to be consistent with zero. In addition, the $B^- \to Λ_{c}^+ \overline{p} K^-$ decay is observed for the first time, and its branching fraction is measured relative to that of the $B^- \to Λ_{c}^+ \overline{p} π^-$ mode.

Submitted 22 May, 2024; v1 submitted 21 May, 2024; originally announced May 2024.

Comments: All figures and tables, along with any supplementary material and additional information, are available at https://cern.ch/lhcbproject/Publications/p/LHCb-PAPER-2024-013.html

Report number: CERN-EP-2024-116, LHCb-PAPER-2024-013

arXiv:2405.11324 [pdf, other]

Transverse polarization measurement of $Λ$ hyperons in $p$Ne collisions at $\sqrt{s_{NN}}$ = 68.4 GeV with the $\mbox{LHCb}$ detector

Authors: LHCb collaboration, R. Aaij, A. S. W. Abdelmotteleb, C. Abellan Beteta, F. Abudinén, T. Ackernley, A. A. Adefisoye, B. Adeva, M. Adinolfi, P. Adlarson, C. Agapopoulou, C. A. Aidala, Z. Ajaltouni, S. Akar, K. Akiba, P. Albicocco, J. Albrecht, F. Alessio, M. Alexander, Z. Aliouche, P. Alvarez Cartelle, R. Amalric, S. Amato, J. L. Amey, Y. Amhis , et al. (1065 additional authors not shown)

Abstract: A measurement of the transverse polarization of the $Λ$ and $\barΛ$ hyperons in $p$Ne fixed-target collisions at $\sqrt{s_{NN}}$ = 68.4 GeV is presented using data collected by the LHCb detector. The polarization is studied using the decay $Λ\rightarrow p π^-$ together with its charge-conjugated process. The integrated values measured are $$ P_Λ = 0.029 \pm 0.019 \, (\rm{stat}) \pm 0.012 \, (\rm{syst}) \, , $$ $$ P_{\barΛ} = 0.003 \pm 0.023 \, (\rm{stat}) \pm 0.014 \,(\rm{syst}) \,. $$ Furthermore, the results are shown as a function of the Feynman $x$ variable, transverse momentum, pseudorapidity and rapidity of the hyperons, and are compared with previous measurements.

Submitted 24 May, 2024; v1 submitted 18 May, 2024; originally announced May 2024.

Comments: All figures and tables, along with any supplementary material and additional information, are available at https://lbfence.cern.ch/alcm/public/analysis/full-details/3120 (LHCb public pages)

Report number: CERN-EP-2024-121, LHCb-PAPER-2024-009

arXiv:2404.15655 [pdf, other]

Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering

Authors: Jiawei Yao, Qi Qian, Juhua Hu

Abstract: Multiple clustering has gained significant attention in recent years due to its potential to reveal multiple hidden structures of data from different perspectives. The advent of deep multiple clustering techniques has notably advanced the performance by uncovering complex patterns and relationships within large datasets. However, a major challenge arises as users often do not need all the clusterings that algorithms generate, and figuring out the one needed requires a substantial understanding of each clustering result. Traditionally, aligning a user's brief keyword of interest with the corresponding vision components was challenging, but the emergence of multi-modal and large language models (LLMs) has begun to bridge this gap. In response, given unlabeled target visual data, we propose Multi-MaP, a novel method employing a multi-modal proxy learning process. It leverages CLIP encoders to extract coherent text and image embeddings, with GPT-4 integrating users' interests to formulate effective textual contexts. Moreover, reference word constraint and concept-level constraint are designed to learn the optimal text proxy according to the user's interest. Multi-MaP not only adeptly captures a user's interest via a keyword but also facilitates identifying relevant clusterings. Our extensive experiments show that Multi-MaP consistently outperforms state-of-the-art methods in all benchmark multi-clustering vision tasks. Our code is available at https://github.com/Alexander-Yao/Multi-MaP.

Submitted 24 April, 2024; originally announced April 2024.

Comments: Accepted by CVPR 2024. Project page: https://github.com/Alexander-Yao/Multi-MaP

arXiv:2401.06040 [pdf, other]

Wavelet-Inspired Multiscale Graph Convolutional Recurrent Network for Traffic Forecasting

Authors: Qipeng Qian, Tanwi Mallick

Abstract: Traffic forecasting is the foundation for intelligent transportation systems. Spatiotemporal graph neural networks have demonstrated state-of-the-art performance in traffic forecasting. However, these methods do not explicitly model some of the natural characteristics in traffic data, such as the multiscale structure that encompasses spatial and temporal variations at different levels of granularity or scale. To that end, we propose a Wavelet-Inspired Graph Convolutional Recurrent Network (WavGCRN) which combines multiscale analysis (MSA)-based method with Deep Learning (DL)-based method. In WavGCRN, the traffic data is decomposed into time-frequency components with Discrete Wavelet Transformation (DWT), constructing a multi-stream input structure; then Graph Convolutional Recurrent networks (GCRNs) are employed as encoders for each stream, extracting spatiotemporal features in different scales; and finally the learnable Inversed DWT and GCRN are combined as the decoder, fusing the information from all streams for traffic metrics reconstruction and prediction. Furthermore, road-network-informed graphs and data-driven graph learning are combined to accurately capture spatial correlation. The proposed method can offer well-defined interpretability, powerful learning capability, and competitive forecasting performance on real-world traffic data sets.

Submitted 4 March, 2024; v1 submitted 11 January, 2024; originally announced January 2024.
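
The decomposition step described in the WavGCRN abstract above (splitting a traffic series into components with a discrete wavelet transform, then recombining them with an inverse transform) can be illustrated with a one-level Haar transform. This is only a sketch under stated assumptions: the abstract does not say which wavelet family is used, the inverse here is the fixed analytic one rather than the learnable inverse the paper mentions, and the function names and sample data are hypothetical.

```python
import numpy as np


def haar_dwt(signal):
    """One-level Haar DWT: return (approximation, detail) coefficients."""
    x = np.asarray(signal, dtype=float)
    if x.size % 2:
        raise ValueError("signal length must be even")
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)  # low-frequency stream
    detail = (even - odd) / np.sqrt(2.0)  # high-frequency stream
    return approx, detail


def haar_idwt(approx, detail):
    """Invert haar_dwt, interleaving the recovered even and odd samples."""
    out = np.empty(2 * approx.size)
    out[0::2] = (approx + detail) / np.sqrt(2.0)
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out


# Hypothetical traffic-flow readings at one sensor (arbitrary units).
flow = np.array([12.0, 14.0, 30.0, 34.0, 33.0, 29.0, 15.0, 11.0])
approx, detail = haar_dwt(flow)
reconstructed = haar_idwt(approx, detail)
print("approx:", approx)
print("detail:", detail)
```

In a multi-stream setup of the kind the abstract describes, each coefficient array would feed its own encoder; here the two streams are simply recombined to show that the transform loses no information.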

arXiv:2312.08635 [pdf, other]

OGLE-2017-BLG-0448Lb: A Low Mass-Ratio Wide-Orbit Microlensing Planet?

Authors: Ruocheng Zhai, Radosław Poleski, Weicheng Zang, Youn Kil Jung, Andrzej Udalski, Renkun Kuang, Michael D. Albrow, Sun-Ju Chung, Andrew Gould, Cheongho Han, Kyu-Ha Hwang, Yoon-Hyun Ryu, In-Gu Shin, Yossi Shvartzvald, Hongjing Yang, Jennifer C. Yee, Sang-Mok Cha, Dong-Jin Kim, Hyoun-Woo Kim, Seung-Lee Kim, Chung-Uk Lee, Dong-Joo Lee, Yongseok Lee, Byeong-Gon Park, Richard W. Pogge , et al. (16 additional authors not shown)

Abstract: The gravitational microlensing technique is most sensitive to planets in a Jupiter-like orbit and has detected more than 200 planets. However, only a few wide-orbit ($s > 2$) microlensing planets have been discovered, where $s$ is the planet-to-host separation normalized to the angular Einstein ring radius, $θ_{\rm E}$. Here we present the discovery and analysis of a strong candidate wide-orbit microlensing planet in the event, OGLE-2017-BLG-0448. The whole light curve exhibits long-term residuals to the static binary-lens single-source model, so we investigate the residuals by adding the microlensing parallax, microlensing xallarap, an additional lens, or an additional source. For the first time, we observe a complex degeneracy between all four effects. The wide-orbit models with $s \sim 2.5$ and a planet-to-host mass-ratio of $q \sim 10^{-4}$ are significantly preferred, but we cannot rule out the close models with $s \sim 0.35$ and $q \sim 10^{-3}$. A Bayesian analysis based on a Galactic model indicates that, despite the complicated degeneracy, the surviving wide-orbit models all contain a super-Earth-mass to Neptune-mass planet at a projected planet-host separation of $\sim 6$ au and the surviving close-orbit models all consist of a Jovian-mass planet at $\sim 1$ au. The host star is probably an M or K dwarf. We discuss the implications of this dimension-degeneracy disaster on microlensing light-curve analysis and its potential impact on statistical studies.

Submitted 13 December, 2023; originally announced December 2023.

Comments: submitted to AJ

arXiv:2311.18248 [pdf, other]

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

Authors: Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, Fei Huang

Abstract: Recently, the strong text creation ability of Large Language Models (LLMs) has given rise to many tools for assisting paper reading or even writing. However, the weak diagram analysis abilities of LLMs or Multimodal LLMs greatly limit their application scenarios, especially for scientific academic paper writing. In this work, towards a more versatile copilot for academic paper writing, we mainly focus on strengthening the multi-modal diagram analysis ability of Multimodal LLMs. By parsing LaTeX source files of high-quality papers, we carefully build a multi-modal diagram understanding dataset M-Paper. By aligning diagrams in the paper with related paragraphs, we construct professional diagram analysis samples for training and evaluation. M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables in the format of images or LaTeX code. Besides, to better align the copilot with the user's intention, we introduce the 'outline' as the control signal, which could be directly given by the user or revised based on auto-generated ones. Comprehensive experiments with a state-of-the-art Multimodal LLM demonstrate that training on our dataset yields stronger scientific diagram understanding performance, including diagram captioning, diagram analysis, and outline recommendation. The dataset, code, and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/PaperOwl.

Submitted 9 January, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

Comments: 20 pages, 12 figures

arXiv:2311.15800 [pdf]

Public sentiment analysis and topic modeling regarding ChatGPT in mental health on Reddit: Negative sentiments increase over time

Authors: Yunna Cai, Fan Wang, Haowei Wang, Qianwen Qian

Abstract: In order to uncover users' attitudes towards ChatGPT in mental health, this study examines public opinions about ChatGPT in mental health discussions on Reddit. Researchers used the bert-base-multilingual-uncased-sentiment techniques for sentiment analysis and the BERTopic model for topic modeling. It was found that overall, negative sentiments prevail, followed by positive ones, with neutral sentiments being the least common. The prevalence of negative emotions has increased over time. Negative emotions encompass discussions on ChatGPT providing bad mental health advice, debates on machine vs. human value, the fear of AI, and concerns about Universal Basic Income (UBI). In contrast, positive emotions highlight ChatGPT's effectiveness in counseling, with mentions of keywords like "time" and "wallet." Neutral discussions center around private data concerns. These findings shed light on public attitudes toward ChatGPT in mental health, potentially contributing to the development of trustworthy AI in mental health from the public perspective.

Submitted 27 November, 2023; originally announced November 2023.

Comments: 11 pages, 8 figures, 2 tables

arXiv:2311.14310 [pdf, other]

Stable Cluster Discrimination for Deep Clustering

Authors: Qi Qian

Abstract: Deep clustering can optimize representations of instances (i.e., representation learning) and explore the inherent data distribution (i.e., clustering) simultaneously, which demonstrates a superior performance over conventional clustering methods with given features. However, the coupled objective implies a trivial solution that all instances collapse to the uniform features. To tackle the challenge, a two-stage training strategy is developed for decoupling, where it introduces an additional pre-training stage for representation learning and then fine-tunes the obtained model for clustering. Meanwhile, one-stage methods are developed mainly for representation learning rather than clustering, where various constraints for cluster assignments are designed to avoid collapsing explicitly. Despite the success of these methods, an appropriate learning objective tailored for deep clustering has not been investigated sufficiently. In this work, we first show that the prevalent discrimination task in supervised learning is unstable for one-stage clustering due to the lack of ground-truth labels and positive instances for certain clusters in each mini-batch. To mitigate the issue, a novel stable cluster discrimination (SeCu) task is proposed and a new hardness-aware clustering criterion can be obtained accordingly. Moreover, a global entropy constraint for cluster assignments is studied with efficient optimization. Extensive experiments are conducted on benchmark data sets and ImageNet. SeCu achieves state-of-the-art performance on all of them, which demonstrates the effectiveness of one-stage deep clustering. Code is available at https://github.com/idstcv/SeCu.

Submitted 24 November, 2023; originally announced November 2023.

Comments: accepted by ICCV'23

arXiv:2311.07577 [pdf, ps, other]

Algorithms for Object Detection in Substations

Authors: Bingying Jin, Yadong Liu, Qinlin Qian

Abstract: Inspection of high-voltage power equipment is an effective way to ensure power supply reliability. Object recognition, one of the key technologies in automatic power equipment inspection, attracts the attention of many researchers and engineers. Although quite a few existing models have their own advantages, the object relationship between pieces of equipment, which is very important in this task, is scarcely considered. This paper combines object relationship modeling with the Transformer model to propose a Relation Transformer Model. It has four parts -- backbone, encoder, decoder and prediction heads. With this structure, the proposed method shows in experiments a much better performance than three other commonly used models for object recognition in substations, largely promoting the development of automatic power equipment inspection.

Submitted 23 September, 2023; originally announced November 2023.

arXiv:2311.04876 [pdf]

Systematic Reanalysis of KMTNet microlensing events, Paper I: Updates of the Photometry Pipeline and a New Planet Candidate

Authors: Hongjing Yang, Jennifer C. Yee, Kyu-Ha Hwang, Qiyue Qian, Ian A. Bond, Andrew Gould, Zhecheng Hu, Jiyuan Zhang, Shude Mao, Wei Zhu, Michael D. Albrow, Sun-Ju Chung, Cheongho Han, Youn Kil Jung, Yoon-Hyun Ryu, In-Gu Shin, Yossi Shvartzvald, Sang-Mok Cha, Dong-Jin Kim, Hyoun-Woo Kim, Seung-Lee Kim, Chung-Uk Lee, Dong-Joo Lee, Yongseok Lee, Byeong-Gon Park , et al. (30 additional authors not shown)

Abstract: In this work, we update and develop algorithms for KMTNet tender-love care (TLC) photometry in order to create a new, mostly automated, TLC pipeline. We then start a project to systematically apply the new TLC pipeline to the historic KMTNet microlensing events, and search for buried planetary signals. We report the discovery of such a planet candidate in the microlensing event MOA-2019-BLG-421/KMT-2019-BLG-2991. The anomalous signal can be explained by either a planet around the lens star or the orbital motion of the source star. For the planetary interpretation, despite many degenerate solutions, the planet is most likely to be a Jovian planet orbiting an M or K dwarf, which is a typical microlensing planet. The discovery proves that the project can indeed increase the sensitivity of historic events and find previously undiscovered signals.

Submitted 8 November, 2023; originally announced November 2023.

Comments: 19 pages, 13 figures, 7 tables. Submitted to MNRAS

arXiv:2311.04257 [pdf, other]

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Authors: Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou

Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal tasks and achieving state-of-the-art performances with a single generic model. Notably, mPLUG-Owl2 is the first MLLM model that demonstrates the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models.

Submitted 8 November, 2023; v1 submitted 7 November, 2023; originally announced November 2023.

arXiv:2310.19752 [pdf, other]

Intra-Modal Proxy Learning for Zero-Shot Visual Categorization with CLIP

Authors: Qi Qian, Yuanhong Xu, Juhua Hu

Abstract: Vision-language pre-training methods, e.g., CLIP, demonstrate an impressive zero-shot performance on visual categorizations with the class proxy from the text embedding of the class name. However, the modality gap between the text and vision space can result in a sub-optimal performance. We theoretically show that the gap cannot be reduced sufficiently by minimizing the contrastive loss in CLIP and the optimal proxy for vision tasks may reside only in the vision space. Therefore, given unlabeled target vision data, we propose to learn the vision proxy directly with the help of the text proxy for zero-shot transfer. Moreover, according to our theoretical analysis, strategies are developed to further refine the pseudo label obtained by the text proxy to facilitate the intra-modal proxy learning (InMaP) for vision. Experiments on extensive downstream tasks confirm the effectiveness and efficiency of our proposal. Concretely, InMaP can obtain the vision proxy within one minute on a single GPU while improving the zero-shot accuracy from 77.02% to 80.21% on ImageNet with ViT-L/14@336 pre-trained by CLIP. Code is available at https://github.com/idstcv/InMaP.

Submitted 30 October, 2023; originally announced October 2023.

Comments: accepted by NeurIPS'23

arXiv:2310.05126 [pdf, other]

UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

Authors: Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, Qin Jin, Liang He, Xin Alex Lin, Fei Huang

Abstract: Text is ubiquitous in our visual world, conveying crucial information, such as in documents, websites, and everyday photographs. In this work, we propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM). By leveraging the shallow text recognition ability of the MLLM, we finetune only 1.2% of the parameters, and the training cost is much lower than that of previous work following domain-specific pretraining and finetuning paradigms. Concretely, UReader is jointly finetuned on a wide range of Visually-situated Language Understanding tasks via a unified instruction format. To enhance the visual text and semantic understanding, we further apply two auxiliary tasks with the same format, namely text reading and key points generation tasks. We design a shape-adaptive cropping module before the encoder-decoder architecture of MLLM to leverage the frozen low-resolution vision encoder for processing high-resolution images. Without downstream finetuning, our single model achieves state-of-the-art OCR-free performance in 8 out of 10 visually-situated language understanding tasks, across 5 domains: documents, tables, charts, natural images, and webpage screenshots. Codes and instruction-tuning datasets will be released.

Submitted 8 October, 2023; originally announced October 2023.

arXiv:2310.04257 [pdf, other]

On Solving Close Enough Orienteering Problems with Overlapped Neighborhoods

Authors: Qiuchen Qian, Yanran Wang, David Boyle

Abstract: Close Enough Traveling Salesman Problem (CETSP) is a well-known variant of TSP whereby the agent may complete its mission at any point within a target neighborhood. Heuristics based on overlapped neighborhoods, known as Steiner Zones (SZ), have gained attention in addressing CETSP. While SZs offer effective approximations to the original graph, their inherent overlap imposes constraints on search space, potentially conflicting with global optimization objectives. Here we show how such limitations can be converted into advantages in a Close Enough Orienteering Problem (CEOP) by aggregating prizes across overlapped neighborhoods. We further extend classic CEOP with Non-uniform Neighborhoods (CEOP-N) by introducing non-uniform costs for prize collection. To tackle CEOP and CEOP-N, we develop a new approach featuring a Randomized Steiner Zone Discretization (RSZD) scheme coupled with a hybrid algorithm based on Particle Swarm Optimization (PSO) and Ant Colony System (ACS), CRaSZe-AntS. The RSZD scheme identifies sub-regions for PSO exploration, and ACS determines the discrete visiting sequence. We evaluate the RSZD's discretization performance on CEOP instances derived from established CETSP instances and compare CRaSZe-AntS against the most relevant state-of-the-art heuristic focused on single-neighborhood optimization for CEOP instances. We also compare the performance of the interior search within SZs and the boundary search on individual neighborhoods in the context of CEOP-N.
Our experimental results show that CRaSZe-AntS can yield comparable solution quality with significantly reduced computation time compared to the single neighborhood strategy, where we observe an average 140.44% increase in prize collection and a 55.18% reduction in algorithm execution time. CRaSZe-AntS is thus highly effective in solving emerging CEOP-N, examples of which include truck-and-drone delivery scenarios.

Submitted 15 May, 2024; v1 submitted 6 October, 2023; originally announced October 2023.

Comments: 30 pages, 11 figures

arXiv:2309.04145 [pdf, other]

Depth Completion with Multiple Balanced Bases and Confidence for Dense Monocular SLAM

Authors: Weijian Xie, Guanyi Chu, Quanhao Qian, Yihao Yu, Hai Li, Danpeng Chen, Shangjin Zhai, Nan Wang, Hujun Bao, Guofeng Zhang

Abstract: Dense SLAM based on monocular cameras does indeed have immense application value in the field of AR/VR, especially when it is performed on a mobile device. In this paper, we propose a novel method that integrates a light-weight depth completion network into a sparse SLAM system using a multi-basis depth representation, so that dense mapping can be performed online even on a mobile phone. Specifically, we present a specially optimized multi-basis depth completion network, called BBC-Net, tailored to the characteristics of traditional sparse SLAM systems. BBC-Net can predict multiple balanced bases and a confidence map from a monocular image with sparse points generated by off-the-shelf keypoint-based SLAM systems. The final depth is a linear combination of predicted depth bases that can be optimized by tuning the corresponding weights. To seamlessly incorporate the weights into traditional SLAM optimization and ensure efficiency and robustness, we design a set of depth weight factors, which makes our network a versatile plug-in module, facilitating easy integration into various existing sparse SLAM systems and significantly enhancing global depth consistency through bundle adjustment. To verify the portability of our method, we integrate BBC-Net into two representative SLAM systems. The experimental results on various datasets show that the proposed method achieves better performance in monocular dense mapping than the state-of-the-art methods.
We provide an online demo running on a mobile phone, which verifies the efficiency and mapping quality of the proposed method in real-world scenarios.

Submitted 20 September, 2023; v1 submitted 8 September, 2023; originally announced September 2023.

arXiv:2309.01280 [pdf, ps, other]

KMT-2021-BLG-1547Lb: Giant microlensing planet detected through a signal deformed by source binarity

Authors: Cheongho Han, Weicheng Zang, Youn Kil Jung, Ian A. Bond, Sun-Ju Chung, Michael D. Albrow, Andrew Gould, Kyu-Ha Hwang, Yoon-Hyun Ryu, In-Gu Shin, Yossi Shvartzvald, Hongjing Yang, Jennifer C. Yee, Sang-Mok Cha, Doeon Kim, Dong-Jin Kim, Seung-Lee Kim, Chung-Uk Lee, Dong-Joo Lee, Yongseok Lee, Byeong-Gon Park, Richard W. Pogge, L. A. G. Monard, Qiyue Qian, Zhuokai Liu, et al. (30 additional authors not shown)

Abstract: We investigate the previous microlensing data collected by the KMTNet survey in search of anomalous events for which no precise interpretations of the anomalies have been suggested. From this investigation, we find that the anomaly in the lensing light curve of the event KMT-2021-BLG-1547 is approximately described by a binary-lens (2L1S) model with a lens possessing a giant planet, but the model leaves unexplained residuals. We investigate the origin of the residuals by testing more sophisticated models that include either an extra lens component (3L1S model) or an extra source star (2L2S model) to the 2L1S configuration of the lens system. From these analyses, we find that the residuals from the 2L1S model originate from the existence of a faint companion to the source. The 2L2S solution substantially reduces the residuals and improves the model fit by $\Delta\chi^2=67.1$ with respect to the 2L1S solution. The 3L1S solution also improves the fit, but its fit is worse than that of the 2L2S solution by $\Delta\chi^2=24.7$. According to the 2L2S solution, the lens of the event is a planetary system with planet and host masses $(M_{\rm p}/M_{\rm J}, M_{\rm h}/M_\odot)=\left( 1.47^{+0.64}_{-0.77}, 0.72^{+0.32}_{-0.38}\right)$ lying at a distance $D_{\rm L}=5.07^{+0.98}_{-1.50}$ kpc, and the source is a binary composed of a subgiant primary of a late G or an early K spectral type and a main-sequence companion of a K spectral type.
The event demonstrates the need for sophisticated modeling of unexplained anomalies in the construction of a complete microlensing planet sample.

Submitted 3 September, 2023; originally announced September 2023.

Comments: 9 pages, 4 tables, 7 figures

arXiv:2307.07084 [pdf, other]

Probabilistic Constrained Reinforcement Learning with Formal Interpretability

Authors: Yanran Wang, Qiuchen Qian, David Boyle

Abstract: Reinforcement learning can provide effective reasoning for sequential decision-making problems with variable dynamics. Such reasoning in practical implementation, however, poses a persistent challenge in interpreting the reward function and the corresponding optimal policy. Consequently, representing sequential decision-making problems as probabilistic inference can have considerable value, as, in principle, the inference offers diverse and powerful mathematical tools to infer the stochastic dynamics whilst suggesting a probabilistic interpretation of policy optimization. In this study, we propose a novel Adaptive Wasserstein Variational Optimization, namely AWaVO, to tackle these interpretability challenges. Our approach uses formal methods to achieve the interpretability for convergence guarantee, training transparency, and intrinsic decision-interpretation. To demonstrate its practicality, we showcase guaranteed interpretability with an optimal global convergence rate in simulation and in practical quadrotor tasks. In comparison with state-of-the-art benchmarks including TRPO-IPO, PCPO and CRPO, we empirically verify that AWaVO offers a reasonable trade-off between high performance and sufficient interpretability.

Submitted 17 June, 2024; v1 submitted 13 July, 2023; originally announced July 2023.

Comments: 25 pages, 9 figures, containing Appendix

arXiv:2307.05305 [pdf, other]

doi 10.1103/PhysRevA.108.012414

Visualization of all two-qubit states via partial-transpose-moments

Authors: Lin Zhang, Yi Shen, Hua Xiang, Quan Qian, Bo Li

Abstract: Efficiently detecting entanglement based on measurable quantities is a basic problem for quantum information processing. Recently, the measurable quantities called partial-transpose (PT)-moments have been proposed to detect and characterize entanglement. In the recently published paper [L. Zhang et al., Ann. Phys. (Berlin) 534, 2200289 (2022), https://doi.org/10.1002/andp.202200289], we have already identified the 2-dimensional (2D) region, comprised of the second and third PT-moments, corresponding to two-qubit entangled states, and described the whole region for all two-qubit states. In the present paper, we visualize the 3D region corresponding to all two-qubit states by further involving the fourth PT-moment (the last one for two-qubit states). The characterization of this 3D region can finally be achieved by optimizing some polynomials. Furthermore, we identify the dividing surface which separates the two parts of the whole 3D region corresponding to entangled and separable states respectively. Due to the measurability of PT-moments, we obtain a complete and operational criterion for the detection of two-qubit entanglement.

Submitted 11 July, 2023; originally announced July 2023.

Comments: 29 pages, LaTeX, 8 figures, 2 tables

Journal ref: Phys. Rev. A 108, 012414 (2023)

arXiv:2306.16706 [pdf, other]

Parametric study of the polarization dependence of nonlinear Breit-Wheeler pair creation process using two laser pulses

Authors: Qian Qian, Daniel Seipt, Marija Vranic, Thomas E. Grismayer, Tom G. Blackburn, Christopher P. Ridgers, Alexander G. R. Thomas

Abstract: With the rapid development of high-power petawatt class lasers worldwide, exploring physics in the strong field QED regime will become one of the frontiers for laser-plasma interactions research. Particle-in-cell codes, including quantum emission processes, are powerful tools for predicting and analyzing future experiments where the physics of relativistic plasma is strongly affected by strong-field QED processes. The spin/polarization dependence of these quantum processes has been of recent interest. In this article, we perform a parametric study of the interaction of two laser pulses with an ultrarelativistic electron beam. The first pulse is optimized to generate high-energy photons by nonlinear Compton scattering and efficiently decelerate the electron beam through quantum radiation reaction. The second pulse is optimized to generate electron-positron pairs by nonlinear Breit-Wheeler decay of the photons with the maximum polarization dependence. This may be experimentally realized as a verification of the strong field QED framework, including the spin/polarization rates.

Submitted 16 October, 2023; v1 submitted 29 June, 2023; originally announced June 2023.

Comments: 15 pages, 13 figures

arXiv:2306.08792 [pdf, other]

Graph Convolution Based Efficient Re-Ranking for Visual Retrieval

Authors: Yuqi Zhang, Qi Qian, Hongsong Wang, Chong Liu, Weihua Chen, Fan Wang

Abstract: Visual retrieval tasks such as image retrieval and person re-identification (Re-ID) aim at effectively and thoroughly searching images with similar content or the same identity. After obtaining retrieved examples, re-ranking is a widely adopted post-processing step to reorder and improve the initial retrieval results by making use of the contextual information from semantically neighboring samples. Prevailing re-ranking approaches update distance metrics and mostly rely on inefficient crosscheck set comparison operations while computing expanded neighbors based distances. In this work, we present an efficient re-ranking method which refines initial retrieval results by updating features. Specifically, we reformulate re-ranking based on Graph Convolution Networks (GCN) and propose a novel Graph Convolution based Re-ranking (GCR) for visual retrieval tasks via feature propagation. To accelerate computation for large-scale retrieval, a decentralized and synchronous feature propagation algorithm which supports parallel or distributed computing is introduced. In particular, the plain GCR is extended for cross-camera retrieval and an improved feature propagation formulation is presented to leverage affinity relationships across different cameras. It is also extended for video-based retrieval, and Graph Convolution based Re-ranking for Video (GCRV) is proposed by mathematically deriving a novel profile vector generation method for the tracklet.
Without bells and whistles, the proposed approaches achieve state-of-the-art performances on seven benchmark datasets from three different tasks, i.e., image retrieval, person Re-ID and video-based person Re-ID.

Submitted 14 June, 2023; originally announced June 2023.

Comments: Code is publicly available: https://github.com/WesleyZhang1991/GCN_rerank

arXiv:2306.04362 [pdf, other]

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks

Authors: Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Guangwei Xu, Chenliang Li, Qi Qian, Maofei Que, Ji Zhang, Xiao Zeng, Fei Huang

Abstract: To promote the development of Vision-Language Pre-training (VLP) and multimodal Large Language Model (LLM) in the Chinese community, we release the largest public Chinese high-quality video-language dataset named Youku-mPLUG, which is collected from Youku, a well-known Chinese video-sharing website, with strict criteria of safety, diversity, and quality. Youku-mPLUG contains 10 million Chinese video-text pairs filtered from 400 million raw videos across a wide range of 45 diverse categories for large-scale pre-training. In addition, to facilitate a comprehensive evaluation of video-language models, we carefully build the largest human-annotated Chinese benchmarks covering three popular video-language tasks of cross-modal retrieval, video captioning, and video category classification. Youku-mPLUG can enable researchers to conduct more in-depth multimodal research and develop better applications in the future. Furthermore, we release popular video-language pre-training models, ALPRO and mPLUG-2, and our proposed modularized decoder-only model mPLUG-video pre-trained on Youku-mPLUG. Experiments show that models pre-trained on Youku-mPLUG gain up to 23.1% improvement in video category classification. Besides, mPLUG-video achieves a new state-of-the-art result on these benchmarks with 80.5% top-1 accuracy in video category classification and 68.9 CIDEr score in video captioning, respectively.
Finally, we scale up mPLUG-video based on the frozen Bloomz with only 1.7% trainable parameters as Chinese multimodal LLM, and demonstrate impressive instruction and video understanding ability. The zero-shot instruction understanding experiment indicates that pretraining with Youku-mPLUG can enhance the ability to comprehend overall and detailed visual semantics, recognize scene text, and leverage open-domain knowledge.

Submitted 7 June, 2023; originally announced June 2023.

Comments: Work in progress

arXiv:2304.14178 [pdf, other]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Authors: Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou

Abstract: Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of LLM while maintaining and even improving the generation abilities of LLM. In the first stage, the visual knowledge module and abstractor module are trained with a frozen LLM module to align the image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaptation (LoRA) module on LLM and the abstractor module by freezing the visual knowledge module. We carefully build a visually-related instruction evaluation set OwlEval. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability.
Besides, we observe some unexpected and exciting abilities such as multi-image correlation and scene text understanding, which makes it possible to leverage it for harder real scenarios, such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl.

Submitted 29 March, 2024; v1 submitted 27 April, 2023; originally announced April 2023.

Comments: Work in progress

arXiv:2304.07849 [pdf, other]

ChatPLUG: Open-Domain Generative Dialogue System with Internet-Augmented Instruction Tuning for Digital Human

Authors: Junfeng Tian, Hehong Chen, Guohai Xu, Ming Yan, Xing Gao, Jianhai Zhang, Chenliang Li, Jiayi Liu, Wenshen Xu, Haiyang Xu, Qi Qian, Wei Wang, Qinghao Ye, Jiejing Zhang, Ji Zhang, Fei Huang, Jingren Zhou

Abstract: In this paper, we present ChatPLUG, a Chinese open-domain dialogue system for digital human applications that instruction finetunes on a wide range of dialogue tasks in a unified internet-augmented format. Different from other open-domain dialogue models that focus on large-scale pre-training and scaling up model size or dialogue corpus, we aim to build a powerful and practical dialogue system for digital human with diverse skills and good multi-task generalization by internet-augmented instruction tuning. To this end, we first conduct large-scale pre-training on both common document corpus and dialogue data with curriculum learning, so as to inject various world knowledge and dialogue abilities into ChatPLUG. Then, we collect a wide range of dialogue tasks spanning diverse features of knowledge, personality, multi-turn memory, and empathy, on which we further instruction tune ChatPLUG via unified natural language instruction templates. External knowledge from an internet search is also used during instruction finetuning for alleviating the problem of knowledge hallucinations. We show that ChatPLUG outperforms state-of-the-art Chinese dialogue systems on both automatic and human evaluation, and demonstrates strong multi-task generalization on a variety of text understanding and generation tasks. In addition, we deploy ChatPLUG to real-world applications such as Smart Speaker and Instant Message applications with fast inference.
Our models and code will be made publicly available on ModelScope: https://modelscope.cn/models/damo/ChatPLUG-3.7B and GitHub: https://github.com/X-PLUG/ChatPLUG.

Submitted 15 May, 2023; v1 submitted 16 April, 2023; originally announced April 2023.

Comments: 36 pages

arXiv:2304.01489 [pdf, other]

Improved Visual Fine-tuning with Natural Language Supervision

Authors: Junyang Wang, Yuanhong Xu, Juhua Hu, Ming Yan, Jitao Sang, Qi Qian

Abstract: Fine-tuning a visual pre-trained model can leverage the semantic information from large-scale pre-training data and mitigate the over-fitting problem on downstream vision tasks with limited training examples. While the problem of catastrophic forgetting in the pre-trained backbone has been extensively studied for fine-tuning, the potential bias from the corresponding pre-training task and data has attracted less attention. In this work, we investigate this problem by demonstrating that the obtained classifier after fine-tuning will be close to that induced by the pre-trained model. To reduce the bias in the classifier effectively, we introduce a reference distribution obtained from a fixed text classifier, which can help regularize the learned vision classifier. The proposed method, Text Supervised fine-tuning (TeS), is evaluated with diverse pre-trained vision models including ResNet and ViT, and text encoders including BERT and CLIP, on 11 downstream tasks. The consistent improvement with a clear margin over distinct scenarios confirms the effectiveness of our proposal. Code is available at https://github.com/idstcv/TeS.

Submitted 14 August, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

Comments: accepted by ICCV'23

arXiv:2304.01290 [pdf, other]

A Simple Approach for General Task-Oriented Picking using Placing constraints

Authors: Jen-Wei Wang, Lingfeng Sun, Xinghao Zhu, Qiyang Qian, Masayoshi Tomizuka

Abstract: Pick-and-place is an important manipulation task in domestic and manufacturing applications. Many works focus on grasp detection with a high picking success rate but do not consider downstream manipulation tasks (e.g., placing). Although some research has proposed methods to incorporate task conditions into grasp selection, most of them are data-driven and are therefore hard to adapt to arbitrary operating environments. Observing this challenge, we propose a general task-oriented pick-place framework that incorporates the target task and operating environment as placing constraints in grasping optimization. Combined with existing grasp detectors, our framework is able to generate feasible grasps for different downstream tasks and adapt to environmental changes without time-consuming re-training processes. Moreover, the framework can accept different definitions of placing constraints, so it is easy to integrate with other modules. Experiments in simulation and in the real world on multiple pick-place tasks are conducted to evaluate the performance of our framework. The results show that our framework achieves a high and robust task success rate on a wide variety of pick-place tasks.

Submitted 3 April, 2023; originally announced April 2023.

arXiv:2302.05078 [pdf]

Signatures of Chiral Superconductivity in Chiral Molecule Intercalated Tantalum Disulfide

Authors: Zhong Wan, Gang Qiu, Huaying Ren, Qi Qian, Dong Xu, Jingyuan Zhou, Jingxuan Zhou, Boxuan Zhou, Laiyuan Wang, Yu Huang, Kang L. Wang, Xiangfeng Duan

Abstract: Chiral superconductors, a unique class of unconventional superconductors in which the complex superconducting order parameter winds clockwise or counter-clockwise in the momentum space, represent a topologically non-trivial system with direct implications for topological quantum computing. Intrinsic chiral superconductors are extremely rare, with only a few arguable examples including heavy fermion metals (UTe$_2$, UPt$_3$) and perovskite superconductor Sr$_2$RuO$_4$. Chiral molecules with neither mirror nor inversion symmetry have been widely investigated, in which the spin degeneracy may be lifted by the molecular chirality. Thus, a combination of superconductivity with chiral molecules may lead to a spin-polarized ground state for realizing chiral superconductivity. Herein we report the first investigation of unconventional superconductivity in chiral molecule intercalated tantalum disulfide (TaS$_2$) and reveal key signatures of chiral superconductivity. Little-Parks measurements demonstrate a robust and reproducible half-flux quantum phase shift in both left- and right-handed chiral molecule intercalated TaS$_2$, which is absent in pristine TaS$_2$ or achiral molecule intercalated TaS$_2$, highlighting the essential role of molecular chirality in inducing unconventional superconductivity. The robust half-flux quantum phase shift demonstrates unconventional superconductivity and constitutes strong evidence supporting a chiral superconducting ordering parameter. Critical current measurements at lower temperature reveal a peculiar asymmetric phase shift under opposite supercurrent, with a relative phase difference approaching the unity of π below 0.5 K, further supporting topologically non-trivial superconductivity. Our study signifies the potential of hybrid superlattices with intriguing coupling between the crystalline atomic layers and the self-assembled molecular layers.

Submitted 10 February, 2023; originally announced February 2023.

Comments: 22 Pages, 5 figures, 6 Extended Data figures

arXiv:2302.00402 [pdf, other]

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

Authors: Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou

Abstract: Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized design for multi-modal pretraining, which can benefit from modality collaboration while addressing the problem of modality entanglement. In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement. It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 shows new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video caption tasks with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released at https://github.com/alibaba/AliceMind.

Submitted 1 February, 2023; originally announced February 2023.

Journal ref: ICML2023

arXiv:2301.06779 [pdf]

doi 10.1093/mnras/stad1398

KMT-2022-BLG-0440Lb: A New $q < 10^{-4}$ Microlensing Planet with the Central-Resonant Caustic Degeneracy Broken

Authors: Jiyuan Zhang, Weicheng Zang, Youn Kil Jung, Hongjing Yang, Andrew Gould, Takahiro Sumi, Shude Mao, Subo Dong, Michael D. Albrow, Sun-Ju Chung, Cheongho Han, Kyu-Ha Hwang, Yoon-Hyun Ryu, In-Gu Shin, Yossi Shvartzvald, Jennifer C. Yee, Sang-Mok Cha, Dong-Jin Kim, Hyoun-Woo Kim, Seung-Lee Kim, Chung-Uk Lee, Dong-Joo Lee, Yongseok Lee, Byeong-Gon Park, Richard W. Pogge, et al. (35 additional authors not shown)

Abstract: We present the observations and analysis of a high-magnification microlensing planetary event, KMT-2022-BLG-0440, for which the weak and short-lived planetary signal was covered by both the KMTNet survey and follow-up observations. The binary-lens models with a central caustic provide the best fits, with a planet/host mass ratio, $q = 0.75$--$1.00 \times 10^{-4}$ at $1\sigma$. The binary-lens models with a resonant caustic and a brown-dwarf mass ratio are both excluded by $\Delta\chi^2 > 70$. The binary-source model can fit the anomaly well but is rejected by the "color argument" on the second source. From Bayesian analyses, it is estimated that the host star is likely a K or M dwarf located in the Galactic disk, the planet probably has a Neptune mass, and the projected planet-host separation is $1.9^{+0.6}_{-0.7}$ or $4.6^{+1.4}_{-1.7}$ au, subject to the close/wide degeneracy. This is the third $q < 10^{-4}$ planet from a high-magnification planetary signal ($A \gtrsim 65$). Together with another such planet, KMT-2021-BLG-0171Lb, the ongoing follow-up program for the KMTNet high-magnification events has demonstrated its ability to detect high-magnification planetary signals for $q < 10^{-4}$ planets, which are challenging for the current microlensing surveys.

Submitted 2 May, 2023; v1 submitted 17 January, 2023; originally announced January 2023.

Comments: MNRAS accepted

arXiv:2212.14546 [pdf, other]

HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training

Authors: Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, Fei Huang

Abstract: Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pre-training, thus not fully exploiting the unique characteristic of video, i.e., its temporal dimension. In this paper, we propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with two novel pre-training tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task to explore moments in videos, which results in detailed video moment representation. Besides, the inherent temporal relations are captured by aligning video-text pairs as a whole in different time resolutions with a multi-modal temporal relation exploration task. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label) with 8.6% and 11.1% improvement respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.

Submitted 29 December, 2022; originally announced December 2022.

arXiv:2208.02803 [pdf, other]

Semantic Data Augmentation based Distance Metric Learning for Domain Generalization

Authors: Mengzhu Wang, Jianlong Yuan, Qi Qian, Zhibin Wang, Hao Li

Abstract: Domain generalization (DG) aims to learn a model on one or more different but related source domains that can generalize to an unseen target domain. Existing DG methods try to promote the diversity of source domains to improve the model's generalization ability, but they may have to introduce auxiliary networks or incur striking computational costs. In contrast, this work applies implicit semantic augmentation in feature space to capture the diversity of source domains. Concretely, an additional distance metric learning (DML) loss function is included to optimize the local geometry of the data distribution. In addition, the logits from the cross-entropy loss with infinite augmentations are adopted as input features for the DML loss in lieu of the deep features. We also provide a theoretical analysis to show that the logits can approximate the distances defined on the original features well. Further, we provide an in-depth analysis of the mechanism and rationale behind our approach, which gives a better understanding of why leveraging logits in lieu of features can help domain generalization. The proposed DML loss with the implicit augmentation is incorporated into a recent DG method, the Fourier Augmented Co-Teacher framework (FACT). Our method can also be easily plugged into various DG methods. Extensive experiments on three benchmarks (Digits-DG, PACS and Office-Home) demonstrate that the proposed method achieves state-of-the-art performance.

Submitted 13 September, 2022; v1 submitted 2 August, 2022; originally announced August 2022.

Comments: Accept to ACMMM2022

arXiv:2207.07789 [pdf, other]

QuaDUE-CCM: Interpretable Distributional Reinforcement Learning using Uncertain Contraction Metrics for Precise Quadrotor Trajectory Tracking

Authors: Yanran Wang, James O'Keeffe, Qiuchen Qian, David Boyle

Abstract: Accuracy and stability are common requirements for Quadrotor trajectory tracking systems. Designing an accurate and stable tracking controller remains challenging, particularly in unknown and dynamic environments with complex aerodynamic disturbances. We propose a Quantile-approximation-based Distributional-reinforced Uncertainty Estimator (QuaDUE) to accurately identify the effects of aerodynamic disturbances, i.e., the uncertainties between the true and estimated Control Contraction Metrics (CCMs). Taking inspiration from contraction theory and integrating the QuaDUE for uncertainties, our novel CCM-based trajectory tracking framework tracks any feasible reference trajectory precisely whilst guaranteeing exponential convergence. More importantly, the convergence and training acceleration of the distributional RL are guaranteed and analyzed, respectively, from theoretical perspectives. We also demonstrate our system under unknown and diverse aerodynamic forces. Under large aerodynamic forces (>2 m/s^2), compared with the classic data-driven approach, our QuaDUE-CCM achieves at least a 56.6% improvement in tracking error. Compared with QuaDRED-MPC, a distributional RL-based approach, QuaDUE-CCM achieves at least a 3 times improvement in contraction rate.

Submitted 15 July, 2022; originally announced July 2022.

Comments: 18 pages, 9 figures, Quadrotor trajectory tracking, Learning-based control

arXiv:2205.12753 [pdf, other]

An Empirical Study on Distribution Shift Robustness From the Perspective of Pre-Training and Data Augmentation

Authors: Ziquan Liu, Yi Xu, Yuanhong Xu, Qi Qian, Hao Li, Rong Jin, Xiangyang Ji, Antoni B. Chan

Abstract: The performance of machine learning models under distribution shift has been the focus of the community in recent years. Most current methods have been proposed to improve the robustness to distribution shift from the algorithmic perspective, i.e., designing better training algorithms to help the generalization in shifted test distributions. This paper studies the distribution shift problem from the perspective of pre-training and data augmentation, two important factors in the practice of deep learning that have not been systematically investigated by existing work. By evaluating seven pre-trained models, including ResNets and ViTs with self-supervision and supervision modes, on five important distribution-shift datasets from the WILDS and DomainBed benchmarks, with five different learning algorithms, we provide the first comprehensive empirical study focusing on pre-training and data augmentation. With our empirical results obtained from 1,330 models, we provide the following main observations: 1) ERM combined with data augmentation can achieve state-of-the-art performance if we choose a proper pre-trained model respecting the data property; 2) specialized algorithms further improve the robustness on top of ERM when handling a specific type of distribution shift, e.g., GroupDRO for spurious correlation and CORAL for large-scale out-of-distribution data; 3) comparing different pre-training modes, architectures and data sizes, we provide novel observations about pre-training on distribution shift, which shed light on designing or selecting a pre-training strategy for different kinds of distribution shifts. In summary, our empirical study provides a comprehensive baseline for a wide range of pre-training models fine-tuned with data augmentation, which potentially inspires research exploiting the power of pre-training and data augmentation in the future of distribution shift study.

Submitted 25 May, 2022; originally announced May 2022.

arXiv:2205.08924 [pdf, other]

Financial Time Series Data Augmentation with Generative Adversarial Networks and Extended Intertemporal Return Plots

Authors: Justin Hellermann, Qinzhuan Qian, Ankit Shah

Abstract: Data augmentation is a key regularization method to support the forecast and classification performance of highly parameterized models in computer vision. In the time series domain, however, regularization in terms of augmentation is not equally common even though these methods have proven to mitigate effects from small sample size or non-stationarity. In this paper we apply state-of-the-art image-based generative models for the task of data augmentation and introduce the extended intertemporal return plot (XIRP), a new image representation for time series. Multiple tests are conducted to assess the quality of the augmentation technique regarding its ability to synthesize time series effectively and improve forecast results on a subset of the M4 competition. We further investigate the relationship between data set characteristics and sampling results via Shapley values for feature attribution on the performance metrics and the optimal ratio of augmented data. Over all data sets, our approach proves to be effective in reducing the return forecast error by 7% on 79% of the financial data sets with varying statistical properties and frequencies.

Submitted 19 May, 2022; v1 submitted 18 May, 2022; originally announced May 2022.

arXiv:2205.07150 [pdf, other]

Interpretable Stochastic Model Predictive Control using Distributional Reinforced Estimation for Quadrotor Tracking Systems

Authors: Yanran Wang, James O'Keeffe, Qiuchen Qian, David Boyle

Abstract: This paper presents a novel trajectory tracker for autonomous quadrotor navigation in dynamic and complex environments. The proposed framework integrates a distributional Reinforcement Learning (RL) estimator for unknown aerodynamic effects into a Stochastic Model Predictive Controller (SMPC) for trajectory tracking. Aerodynamic effects derived from drag forces and moment variations are difficult to model directly and accurately. Most current quadrotor tracking systems therefore treat them as simple 'disturbances' in conventional control approaches. We propose Quantile-approximation-based Distributional Reinforced-disturbance-estimator, an aerodynamic disturbance estimator, to accurately identify disturbances, i.e., uncertainties between the true and estimated values of aerodynamic effects. Simplified Affine Disturbance Feedback is employed for control parameterization to guarantee convexity, which we then integrate with a SMPC to achieve sufficient and non-conservative control signals. We demonstrate our system to improve the cumulative tracking errors by at least 66% with unknown and diverse aerodynamic forces compared with recent state-of-the-art. Concerning traditional Reinforcement Learning's non-interpretability, we provide convergence and stability guarantees of Distributional RL and SMPC, respectively, with non-zero mean disturbances.

Submitted 14 May, 2022; originally announced May 2022.

Comments: 8 pages, 4 figures

arXiv:2204.02251 [pdf, other]

RBGNet: Ray-based Grouping for 3D Object Detection

Authors: Haiyang Wang, Shaoshuai Shi, Ze Yang, Rongyao Fang, Qi Qian, Hongsheng Li, Bernt Schiele, Liwei Wang

Abstract: As a fundamental problem in computer vision, 3D object detection is experiencing rapid growth. To extract the point-wise features from the irregularly and sparsely distributed points, previous methods usually take a feature grouping module to aggregate the point features to an object candidate. However, these methods have not yet leveraged the surface geometry of foreground objects to enhance grouping and 3D box generation. In this paper, we propose the RBGNet framework, a voting-based 3D detector for accurate 3D object detection from point clouds. In order to learn better representations of object shape to enhance cluster features for predicting 3D boxes, we propose a ray-based feature grouping module, which aggregates the point-wise features on object surfaces using a group of determined rays uniformly emitted from cluster centers. Considering the fact that foreground points are more meaningful for box estimation, we design a novel foreground biased sampling strategy in the downsample process to sample more points on object surfaces and further boost the detection performance. Our model achieves state-of-the-art 3D detection performance on ScanNet V2 and SUN RGB-D with remarkable performance gains. Code will be available at https://github.com/Haiyang-W/RBGNet.

Submitted 5 April, 2022; originally announced April 2022.

arXiv:2203.04595 [pdf, other]

Practical Mission Planning for Optimized UAV-Sensor Wireless Recharging

Authors: Qiuchen Qian, James O'Keeffe, Yanran Wang, David Boyle

Abstract: Optimal maintenance of sensor nodes in a Wireless Rechargeable Sensor Network (WRSN) requires effective scheduling of power delivery vehicles by solving the Charging Scheduling Problem (CSP). Deploying Unmanned Aerial Vehicles (UAVs) as mobile chargers has emerged as a promising solution due to their mobility and flexibility. The CSP can be formulated as a Mixed-Integer Non-Linear Programming problem whose optimization objective is maximizing the recharged energy of sensor nodes within the UAV battery constraint. While many studies have demonstrated satisfactory performance of heuristic algorithms in addressing specific routing problems, few studies explore online updating (i.e., mission re-planning 'on the fly') in the CSP context. Here we present a new offline and online mission planner leveraging a first-principles power consumption model that uses real-time state information and environmental information. The planner, namely Rapid Online Metaheuristic-based Planner (ROMP), supplements solutions from a Guided Local Search (GLS) with our Context-aware Black Hole Algorithm. Our results demonstrate that ROMP outperforms GLS in most cases tested. We developed and proposed FastROMP to speed up the online mission (re-)planning algorithm by introducing a new online adjustment operator that uses the latest state information as input, eliminating the need for re-initialization. FastROMP not only provides a better quality route, but it also significantly reduces computational time. The reduction ranges from 39.57% in sparse deployment to 93.3% in denser deployments.

Submitted 14 April, 2023; v1 submitted 9 March, 2022; originally announced March 2022.

Comments: 15 pages, 13 figures

arXiv:2202.12419 [pdf, other]

KinoJGM: A framework for efficient and accurate quadrotor trajectory generation and tracking in dynamic environments

Authors: Yanran Wang, James O'Keeffe, Qiuchen Qian, David Boyle

Abstract: Unmapped areas and aerodynamic disturbances render autonomous navigation with quadrotors extremely challenging. To fly safely and efficiently, trajectory planners and trackers must be able to navigate unknown environments with unpredictable aerodynamic effects in real-time. When encountering aerodynamic effects such as strong winds, most current approaches to quadrotor trajectory planning and tracking will not attempt to deviate from a determined plan, even if it is risky, in the hope that any aerodynamic disturbances can be resisted by a robust controller. This paper presents a novel systematic trajectory planning and tracking framework for autonomous quadrotors. We propose a Kinodynamic Jump Space Search (Kino-JSS) to generate a safe and efficient route in unknown environments with aerodynamic disturbances. A real-time Gaussian Process is employed to model the effects of aerodynamic disturbances, which we then integrate with a Model Predictive Controller to achieve efficient and accurate trajectory optimization and tracking. We demonstrate our system to improve the efficiency of trajectory generation in unknown environments by up to 75% in the cases tested, compared with recent state-of-the-art. We also demonstrate that our system improves the accuracy of tracking in selected environments with unpredictable aerodynamic effects.

Submitted 11 March, 2022; v1 submitted 24 February, 2022; originally announced February 2022.

Comments: 7 pages, 8 figures, IEEE International Conference on Robotics and Automation 2022, accepted

arXiv:2202.11484 [pdf, other]

Reconstruction Task Finds Universal Winning Tickets

Authors: Ruichen Li, Binghui Li, Qi Qian, Liwei Wang

Abstract: Pruning well-trained neural networks is an effective way to achieve a promising accuracy-efficiency trade-off in computer vision regimes. However, most existing pruning algorithms focus only on the classification task defined on the source domain. Unlike the original model, which transfers well, a pruned network is hard to transfer to complicated downstream tasks such as object detection arXiv:arch-ive/2012.04643. In this paper, we show that the image-level pretrain task is not capable of pruning models for diverse downstream tasks. To mitigate this problem, we introduce image reconstruction, a pixel-level task, into the traditional pruning framework. Concretely, an autoencoder is trained based on the original model, and then the pruning process is optimized with both autoencoder and classification losses. The empirical study on benchmark downstream tasks shows that the proposed method can outperform state-of-the-art results explicitly.

Submitted 23 February, 2022; originally announced February 2022.

Comments: Under review

arXiv:2202.03321 [pdf, ps, other]

doi 10.3390/e24020247

A characterization of maximally entangled two-qubit states

Authors: Junjun Duan, Lin Zhang, Quan Qian, Shao-Ming Fei

Abstract: As already known from Rana's result [Phys. Rev. A 87, 054301 (2013), https://doi.org/10.1103/PhysRevA.87.054301], all eigenvalues of any partial-transposed bipartite state fall within the closed interval $[-\frac12,1]$. In this note, we study a family of bipartite quantum states whose partial-transposed states have minimal eigenvalue $-\frac12$. For a two-qubit system, we find that the minimal eigenvalue of its partial-transposed state is $-\frac12$ if and only if the two-qubit state is maximally entangled. However, this result does not hold in general for a two-qudit system when the dimensions of the underlying space are larger than two.

Submitted 7 February, 2022; originally announced February 2022.

Comments: 11 pages, LaTeX

Journal ref: Entropy 24(2),247(2022)
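
The two-qubit statement in the abstract above can be checked numerically. The following is a minimal sketch, not code from the paper: the helper names `partial_transpose_b` and `min_pt_eigenvalue` are hypothetical, and the three test states (a Bell state, a product state, and a partially entangled state with an arbitrarily chosen angle) are illustrative choices. It computes the smallest eigenvalue of the partial transpose for each state.

```python
import numpy as np


def partial_transpose_b(rho):
    """Partial transpose of a 4x4 two-qubit density matrix over the second qubit."""
    # View rho as rho[a, b, a', b'], swap b and b', then flatten back to 4x4.
    return rho.reshape(2, 2, 2, 2).transpose(0, 3, 2, 1).reshape(4, 4)


def min_pt_eigenvalue(rho):
    """Smallest eigenvalue of the partial-transposed state."""
    return float(np.linalg.eigvalsh(partial_transpose_b(rho)).min())


# Maximally entangled Bell state (|00> + |11>)/sqrt(2).
phi = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
bell = np.outer(phi, phi)

# Product state |00>, which has no entanglement.
zero = np.array([1.0, 0.0, 0.0, 0.0])
product = np.outer(zero, zero)

# Partially entangled state cos(t)|00> + sin(t)|11> with an arbitrary angle t.
t = np.pi / 6
psi = np.array([np.cos(t), 0.0, 0.0, np.sin(t)])
partial = np.outer(psi, psi)

for name, rho in [("bell", bell), ("product", product), ("partial", partial)]:
    print(name, min_pt_eigenvalue(rho))
```

Under these assumptions, only the Bell state reaches the lower bound $-\frac12$; the product state gives zero and the partially entangled state gives a value strictly between $-\frac12$ and zero.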

arXiv:2111.12292 [pdf, other]

Improved Fine-Tuning by Better Leveraging Pre-Training Data

Authors: Ziquan Liu, Yi Xu, Yuanhong Xu, Qi Qian, Hao Li, Xiangyang Ji, Antoni Chan, Rong Jin

Abstract: As a dominant paradigm, fine-tuning a pre-trained model on the target data is widely used in many deep learning applications, especially for small data sets. However, recent studies have empirically shown that training from scratch has the final performance that is no worse than this pre-training strategy once the number of training samples is increased in some vision tasks. In this work, we revisit this phenomenon from the perspective of generalization analysis by using excess risk bound which is popular in learning theory. The result reveals that the excess risk bound may have a weak dependency on the pre-trained model. The observation inspires us to leverage pre-training data for fine-tuning, since this data is also available for fine-tuning. The generalization result of using pre-training data shows that the excess risk bound on a target task can be improved when the appropriate pre-training data is included in fine-tuning. With the theoretical motivation, we propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task. Extensive experimental results for image classification tasks on 8 benchmark data sets verify the effectiveness of the proposed data selection based fine-tuning pipeline.

Submitted 25 May, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

arXiv:2109.00650 [pdf, other]

Dash: Semi-Supervised Learning with Dynamic Thresholding

Authors: Yi Xu, Lei Shang, Jinxing Ye, Qi Qian, Yu-Feng Li, Baigui Sun, Hao Li, Rong Jin

Abstract: While semi-supervised learning (SSL) has received tremendous attentions in many machine learning tasks due to its successful use of unlabeled data, existing SSL algorithms use either all unlabeled examples or the unlabeled examples with a fixed high-confidence prediction during the training progress. However, it is possible that too many correct/wrong pseudo labeled examples are eliminated/selected. In this work we develop a simple yet powerful framework, whose key idea is to select a subset of training examples from the unlabeled data when performing existing SSL methods so that only the unlabeled examples with pseudo labels related to the labeled data will be used to train models. The selection is performed at each updating iteration by only keeping the examples whose losses are smaller than a given threshold that is dynamically adjusted through the iteration. Our proposed approach, Dash, enjoys its adaptivity in terms of unlabeled data selection and its theoretical guarantee. Specifically, we theoretically establish the convergence rate of Dash from the view of non-convex optimization. Finally, we empirically demonstrate the effectiveness of the proposed method in comparison with state-of-the-art over benchmarks.

Submitted 1 September, 2021; originally announced September 2021.

Comments: ICML 2021
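
The selection rule described in the Dash abstract (keep only the unlabeled examples whose loss falls below a threshold that changes with the iteration) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the geometric decay schedule, the parameter names `base` and `decay`, and both function names are assumptions made here, since the abstract does not specify how the threshold is adjusted.

```python
def dynamic_threshold(step, base=1.0, decay=1.1):
    """Hypothetical schedule: the threshold shrinks geometrically with the step.

    step is 1-based; base and decay are illustrative values, not taken
    from the paper.
    """
    return base * decay ** (-(step - 1))


def select_by_loss(losses, threshold):
    """Return indices of unlabeled examples whose loss is below the threshold."""
    return [i for i, loss in enumerate(losses) if loss < threshold]


# Per-example losses on pseudo-labeled data (made-up numbers for illustration).
losses = [0.2, 0.9, 0.5, 1.5]

early = select_by_loss(losses, dynamic_threshold(step=1))
late = select_by_loss(losses, dynamic_threshold(step=10))
print(early, late)
```

With these made-up losses the early threshold admits three of the four examples and the later, tighter threshold admits only the lowest-loss one, which is the adaptive filtering behaviour the abstract describes.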

arXiv:2105.11527 [pdf, other]

Unsupervised Visual Representation Learning by Online Constrained K-Means

Authors: Qi Qian, Yuanhong Xu, Juhua Hu, Hao Li, Rong Jin

Abstract: Cluster discrimination is an effective pretext task for unsupervised representation learning, which often consists of two phases: clustering and discrimination. Clustering is to assign each instance a pseudo label that will be used to learn representations in discrimination. The main challenge resides in clustering since prevalent clustering methods (e.g., k-means) have to run in a batch mode. Besides, there can be a trivial solution consisting of a dominating cluster. To address these challenges, we first investigate the objective of clustering-based representation learning. Based on this, we propose a novel clustering-based pretext task with online \textbf{Co}nstrained \textbf{K}-m\textbf{e}ans (\textbf{CoKe}). Compared with the balanced clustering that each cluster has exactly the same size, we only constrain the minimal size of each cluster to flexibly capture the inherent data structure. More importantly, our online assignment method has a theoretical guarantee to approach the global optimum. By decoupling clustering and discrimination, CoKe can achieve competitive performance when optimizing with only a single view from each instance. Extensive experiments on ImageNet and other benchmark data sets verify both the efficacy and efficiency of our proposal. Code is available at \url{https://github.com/idstcv/CoKe}.

Submitted 28 March, 2022; v1 submitted 24 May, 2021; originally announced May 2021.

Comments: accepted by CVPR'22

arXiv:2105.06015 [pdf, ps, other]

Why Does Multi-Epoch Training Help?

Authors: Yi Xu, Qi Qian, Hao Li, Rong Jin

Abstract: Stochastic gradient descent (SGD) has become the most attractive optimization method in training large-scale deep neural networks due to its simplicity, low computational cost in each updating step, and good performance. Standard excess risk bounds show that SGD only needs to take one pass over the training data and more passes could not help to improve the performance. Empirically, it has been observed that SGD taking more than one pass over the training data (multi-pass SGD) has much better excess risk bound performance than the SGD only taking one pass over the training data (one-pass SGD). However, it is not very clear that how to explain this phenomenon in theory. In this paper, we provide some theoretical evidences for explaining why multiple passes over the training data can help improve performance under certain circumstance. Specifically, we consider smooth risk minimization problems whose objective function is non-convex least squared loss. Under Polyak-Lojasiewicz (PL) condition, we establish faster convergence rate of excess risk bound for multi-pass SGD than that for one-pass SGD.

Submitted 12 May, 2021; originally announced May 2021.

arXiv:2104.04114 [pdf, ps, other]

A Theoretical Analysis of Learning with Noisily Labeled Data

Authors: Yi Xu, Qi Qian, Hao Li, Rong Jin

Abstract: Noisy labels are very common in deep supervised learning. Although many studies tend to improve the robustness of deep training for noisy labels, rare works focus on theoretically explaining the training behaviors of learning with noisily labeled data, which is a fundamental principle in understanding its generalization. In this draft, we study its two phenomena, clean data first and phase transition, by explaining them from a theoretical viewpoint. Specifically, we first show that in the first epoch training, the examples with clean labels will be learned first. We then show that after the learning from clean data stage, continuously training model can achieve further improvement in testing error when the rate of corrupted class labels is smaller than a certain threshold; otherwise, extensively training could lead to an increasing testing error.

Submitted 8 April, 2021; originally announced April 2021.

Showing 1–50 of 92 results for author: Qian, Q