-
Exploring Self- and Cross-Triplet Correlations for Human-Object Interaction Detection
Authors:
Weibo Jiang,
Weihong Ren,
Jiandong Tian,
Liangqiong Qu,
Zhiyong Wang,
Honghai Liu
Abstract:
Human-Object Interaction (HOI) detection, which aims to predict HOI triplets of the form <human, object, action>, plays a vital role in scene understanding. Existing methods mainly extract multi-modal features (e.g., appearance, object semantics, human pose) and then fuse them to directly predict HOI triplets. However, most of these methods focus on self-triplet aggregation and ignore potential cross-triplet dependencies, resulting in ambiguous action predictions. In this work, we propose to explore Self- and Cross-Triplet Correlations (SCTC) for HOI detection. Specifically, to aggregate self-triplet correlations, we regard each triplet proposal as a graph in which Human and Object are nodes and Action is the edge. We also explore cross-triplet dependencies by jointly considering instance-level, semantic-level, and layout-level relations. In addition, we leverage the CLIP model to help SCTC obtain interaction-aware features through knowledge distillation, which provides useful action clues for HOI detection. Extensive experiments on the HICO-DET and V-COCO datasets verify the effectiveness of the proposed SCTC.
Submitted 11 January, 2024;
originally announced January 2024.
-
CrossDiff: Exploring Self-Supervised Representation of Pansharpening via Cross-Predictive Diffusion Model
Authors:
Yinghui Xing,
Litao Qu,
Shizhou Zhang,
Kai Zhang,
Yanning Zhang
Abstract:
Pansharpening is the fusion of a panchromatic (PAN) image with the corresponding multispectral (MS) image, with the aim of combining the abundant spatial details of PAN with the spectral information of MS. Because high-resolution MS images are unavailable, existing deep-learning-based methods usually train at reduced resolution and test at both reduced and full resolution. When taking the original MS and PAN images as inputs, they always obtain sub-optimal results due to the scale variation. In this paper, we propose to explore the self-supervised representation of pansharpening by designing a cross-predictive diffusion model, named CrossDiff. Training has two stages. In the first stage, we introduce a cross-predictive pretext task to pre-train the UNet structure based on conditional DDPM. In the second stage, the encoders of the UNets are frozen to directly extract spatial and spectral features from PAN and MS, and only the fusion head is trained to adapt to the pansharpening task. Extensive experiments show the effectiveness and superiority of the proposed model compared with state-of-the-art supervised and unsupervised methods. In addition, cross-sensor experiments verify that the proposed self-supervised representation learners generalize to datasets from other satellites. We will release our code for reproducibility.
Submitted 13 January, 2024; v1 submitted 10 January, 2024;
originally announced January 2024.
-
On-Device Recommender Systems: A Tutorial on The New-Generation Recommendation Paradigm
Authors:
Hongzhi Yin,
Tong Chen,
Liang Qu,
Bin Cui
Abstract:
Given the sheer volume of contemporary e-commerce applications, recommender systems (RSs) have gained significant attention in both academia and industry. However, traditional cloud-based RSs face inevitable challenges, such as resource-intensive computation, reliance on network access, and privacy breaches. In response, a new paradigm called on-device recommender systems (ODRSs) has recently emerged at companies such as Taobao, Google, and Kuaishou. ODRSs unleash the computational capacity of user devices with lightweight recommendation models tailored for resource-constrained environments, enabling real-time inference with users' local data. This tutorial aims to systematically introduce the methodologies of ODRSs, including (1) an overview of existing research on ODRSs; (2) a comprehensive taxonomy of ODRSs, where the core technical content spans three major ODRS research directions: on-device deployment and inference, on-device training, and privacy/security of ODRSs; and (3) limitations and future directions of ODRSs. The tutorial is intended to lay the foundation and spark new insights for follow-up research and applications concerning this new recommendation paradigm.
Submitted 17 December, 2023;
originally announced December 2023.
-
Oriented Supersingular Elliptic Curves and Eichler Orders
Authors:
Guanju Xiao,
Zijian Zhou,
Longjiang Qu
Abstract:
Let $p>3$ be a prime and $E$ be a supersingular elliptic curve defined over $\mathbb{F}_{p^2}$. Let $c$ be a prime with $c < 3p/16$ and $G$ be a subgroup of $E[c]$ of order $c$. The pair $(E,G)$ is called a supersingular elliptic curve with level-$c$ structure, and the endomorphism ring $\text{End}(E,G)$ is isomorphic to an Eichler order with level $c$. We construct two kinds of Eichler orders $\mathcal{O}_c(q,r)$ and $\mathcal{O}'_c(q,r')$ with level $c$. Interestingly, we can reduce each $\mathcal{O}_c(q,r)$ or $\mathcal{O}'_c(q,r')$ to a primitive reduced binary quadratic form with discriminant $-16cp$ or $-cp$ respectively. If the curve $E$ is $\mathbb{Z}[\sqrt{-cp}]$-oriented, then we prove that $\text{End}(E,G)$ is isomorphic to $\mathcal{O}_c(q,r)$ or $\mathcal{O}'_c(q,r')$. Due to the fact that $\mathbb{Z}[\sqrt{-cp}]$-oriented isogenies between $\mathbb{Z}[\sqrt{-cp}]$-oriented elliptic curves could be represented by quadratic forms, we show that these isogenies are reflected in the corresponding Eichler orders via the composition law for their corresponding quadratic forms.
Submitted 14 December, 2023;
originally announced December 2023.
-
TMID: A Comprehensive Real-world Dataset for Trademark Infringement Detection in E-Commerce
Authors:
Tongxin Hu,
Zhuang Li,
Xin Jin,
Lizhen Qu,
Xin Zhang
Abstract:
Annually, e-commerce platforms incur substantial financial losses due to trademark infringements, making it crucial to identify and mitigate potential legal risks tied to merchant information registered to the platforms. However, the absence of high-quality datasets hampers research in this area. To address this gap, our study introduces TMID, a novel dataset to detect trademark infringement in merchant registrations. This is a real-world dataset sourced directly from Alipay, one of the world's largest e-commerce and digital payment platforms. As infringement detection is a legal reasoning task requiring an understanding of the contexts and legal rules, we offer a thorough collection of legal rules and merchant and trademark-related contextual information with annotations from legal experts. We ensure the data quality by performing an extensive statistical analysis. Furthermore, we conduct an empirical study on this dataset to highlight its value and the key challenges. Through this study, we aim to contribute valuable resources to advance research into legal compliance related to trademark infringement within the e-commerce sphere. The dataset is available at https://github.com/emnlpTMID/emnlpTMID.github.io .
Submitted 8 December, 2023;
originally announced December 2023.
-
Hide Your Model: A Parameter Transmission-free Federated Recommender System
Authors:
Wei Yuan,
Chaoqun Yang,
Liang Qu,
Quoc Viet Hung Nguyen,
Jianxin Li,
Hongzhi Yin
Abstract:
With growing concerns regarding user data privacy, Federated Recommender Systems (FedRecs) have garnered significant attention recently due to their privacy-preserving capabilities. Existing FedRecs generally adhere to a learning protocol in which a central server shares a global recommendation model with clients, and participants achieve collaborative learning by frequently communicating the model's public parameters. Nevertheless, this learning framework has two drawbacks that limit its practical usability: (1) It requires a globally shared recommendation model; however, in real-world scenarios, information related to the recommender model, including its algorithm and parameters, constitutes the platforms' intellectual property. Hence, service providers are unlikely to release such information actively. (2) The communication cost of transmitting model parameters is high, since the model parameters are usually high-dimensional matrices. As the model size increases, the communication burden becomes the bottleneck for such traditional FedRecs.
Given the above limitations, this paper introduces a novel parameter transmission-free federated recommendation framework, namely PTF-FedRec, that balances the protection of users' data privacy and platforms' model privacy. Specifically, participants in PTF-FedRec collaboratively exchange knowledge by sharing their predictions within a privacy-preserving mechanism. In this way, the central server can learn a recommender model without disclosing its model parameters or accessing clients' raw data, preserving both the server's model privacy and users' data privacy. Moreover, since clients and the central server only need to communicate prediction scores, which are just a few real numbers, the overhead is significantly reduced compared to traditional FedRecs. The code is available at https://github.com/hi-weiyuan/PTF-FedRec.
Submitted 12 February, 2024; v1 submitted 25 November, 2023;
originally announced November 2023.
-
A Phase-resolved View of the Low-frequency Quasiperiodic Oscillations from the Black Hole Binary MAXI J1820+070
Authors:
Qing C. Shui,
S. Zhang,
Shuang N. Zhang,
Yu P. Chen,
Ling D. Kong,
Peng J. Wang,
Jing Q. Peng,
L. Ji,
A. Santangelo,
Hong X. Yin,
Jin L. Qu,
L. Tao,
Ming Y. Ge,
Y. Huang,
L. Zhang,
Hong H. Liu,
P. Zhang,
W. Yu,
Z. Chang,
J. Li,
Wen T. Ye,
Pan P. Li,
Zhuo L. Yu,
Z. Yan
Abstract:
Although low-frequency quasiperiodic oscillations (LFQPOs) are commonly detected in the X-ray light curves of accreting black hole X-ray binaries, their origin still remains elusive. In this study, we conduct phase-resolved spectroscopy in a broad energy band for LFQPOs in MAXI J1820+070 during its 2018 outburst, utilizing Insight-HXMT observations. By employing the Hilbert-Huang transform method, we extract the intrinsic quasiperiodic oscillation (QPO) variability, and obtain the corresponding instantaneous amplitude, phase, and frequency functions for each data point. With well-defined phases, we construct QPO waveforms and phase-resolved spectra. By comparing the phase-folded waveform with that obtained from the Fourier method, we find that phase folding on the phase of the QPO fundamental frequency leads to a slight reduction in the contribution of the harmonic component. This suggests that the phase difference between QPO harmonics exhibits time variability. Phase-resolved spectral analysis reveals strong concurrent modulations of the spectral index and flux across the bright hard state. The modulation of the spectral index could potentially be explained by both the corona and jet precession models, with the latter requiring efficient acceleration within the jet. Furthermore, significant modulations in the reflection fraction are detected exclusively during the later stages of the bright hard state. These findings provide support for the geometric origin of LFQPOs and offer valuable insights into the evolution of the accretion geometry during the outburst in MAXI J1820+070.
Submitted 8 November, 2023; v1 submitted 6 November, 2023;
originally announced November 2023.
-
How Robust is Federated Learning to Communication Error? A Comparison Study Between Uplink and Downlink Channels
Authors:
Linping Qu,
Shenghui Song,
Chi-Ying Tsui,
Yuyi Mao
Abstract:
Because of its privacy-preserving capability, federated learning (FL) has attracted significant attention from both academia and industry. However, when FL is implemented over wireless networks, it is not clear how much communication error it can tolerate. This paper investigates the robustness of FL to uplink and downlink communication errors. Our theoretical analysis reveals that the robustness depends on two critical parameters, namely the number of clients and the numerical range of the model parameters. It is also shown that uplink communication in FL can tolerate a higher bit error rate (BER) than downlink communication, and this difference is quantified by a proposed formula. The findings and theoretical analyses are further validated by extensive experiments.
Submitted 12 January, 2024; v1 submitted 25 October, 2023;
originally announced October 2023.
-
Can ChatGPT Perform Reasoning Using the IRAC Method in Analyzing Legal Scenarios Like a Lawyer?
Authors:
Xiaoxi Kang,
Lizhen Qu,
Lay-Ki Soon,
Adnan Trakic,
Terry Yue Zhuo,
Patrick Charles Emerton,
Genevieve Grant
Abstract:
Large Language Models (LLMs), such as ChatGPT, have recently drawn considerable attention in the legal domain due to their emergent ability to tackle a variety of legal tasks. However, it is still unknown whether LLMs are able to analyze a legal case and perform reasoning in the same manner as lawyers. Therefore, we constructed a novel corpus consisting of scenarios pertaining to Contract Acts Malaysia and the Australian Social Act for Dependent Child. ChatGPT is applied to analyze the corpus using the IRAC method, a framework widely used by legal professionals for organizing legal analysis. Each scenario in the corpus is annotated with a complete IRAC analysis in a semi-structured format so that both machines and legal professionals are able to interpret and understand the annotations. In addition, we conducted the first empirical assessment of ChatGPT for IRAC analysis in order to understand how well it aligns with the analysis of legal professionals. Our experimental results shed light on possible future research directions for improving the alignment between LLMs and legal experts in terms of legal reasoning.
Submitted 2 November, 2023; v1 submitted 23 October, 2023;
originally announced October 2023.
-
FedConv: Enhancing Convolutional Neural Networks for Handling Data Heterogeneity in Federated Learning
Authors:
Peiran Xu,
Zeyu Wang,
Jieru Mei,
Liangqiong Qu,
Alan Yuille,
Cihang Xie,
Yuyin Zhou
Abstract:
Federated learning (FL) is an emerging paradigm in machine learning, where a shared model is collaboratively learned using data from multiple devices to mitigate the risk of data leakage. While recent studies posit that Vision Transformer (ViT) outperforms Convolutional Neural Networks (CNNs) in addressing data heterogeneity in FL, the specific architectural components that underpin this advantage have yet to be elucidated. In this paper, we systematically investigate the impact of different architectural elements, such as activation functions and normalization layers, on the performance within heterogeneous FL. Through rigorous empirical analyses, we are able to offer the first-of-its-kind general guidance on micro-architecture design principles for heterogeneous FL.
Intriguingly, our findings indicate that with strategic architectural modifications, pure CNNs can achieve a level of robustness that either matches or even exceeds that of ViTs when handling heterogeneous data clients in FL. Additionally, our approach is compatible with existing FL techniques and delivers state-of-the-art solutions across a broad spectrum of FL benchmarks. The code is publicly available at https://github.com/UCSC-VLAA/FedConv
Submitted 6 October, 2023;
originally announced October 2023.
-
Timing properties of the X-ray accreting pulsar RX J0440.9+4431 studied with Insight-HXMT and NICER
Authors:
P. P. Li,
L. Tao,
Y. L. Tuo,
M. Y. Ge,
L. D. Kong,
L. Zhang,
Q. C. Bu,
L. Ji,
J. L. Qu,
S. Zhang,
S. N. Zhang,
Y. Huang,
X. Ma,
W. T. Ye,
Q. C. Zhao,
R. C. Ma,
S. J. Zhao,
X. Hou,
Z. X. Yang,
P. J. Wang,
S. M. Jia,
Q. C. Shui,
J. Guan
Abstract:
RX J0440.9+4431, a Be/X-ray binary, underwent in 2022 its brightest outburst since its discovery, with a peak X-ray flux of 2.25 Crab (as recorded by Swift/BAT, 15-50 keV). We analyze the timing properties of this giant outburst using data from Insight-HXMT and NICER, focusing on the evolution of the pulse profile and pulse fraction. We observe that when the luminosity reached $\sim 3\times10^{37}$ erg s$^{-1}$, a transition from double-peaked to single-peaked pulse profiles occurred across the energy range, with the peak of the low-energy profile aligning gradually with the peak of the high-energy profile. This change indicates a transition from subcritical to supercritical accretion. Additionally, we find a concave feature in the pulse fraction as a function of energy around 20-30 keV throughout the entire outburst period. At high luminosities the concave feature is weaker than at low luminosities, and the overall pulse fraction is higher. We propose that this concave feature could be caused by the scattering of high-energy photons by the atmosphere of the neutron star, leading to a dilution of the pulse fraction. As the accretion reaches the supercritical state, the accretion column height increases, resulting in a larger direct component of strongly beamed X-ray flux and an elevated pulse fraction.
Submitted 27 September, 2023; v1 submitted 26 September, 2023;
originally announced September 2023.
-
NExT-GPT: Any-to-Any Multimodal LLM
Authors:
Shengqiong Wu,
Hao Fei,
Leigang Qu,
Wei Ji,
Tat-Seng Chua
Abstract:
While Multimodal Large Language Models (MM-LLMs) have recently made exciting strides, they are mostly limited to input-side multimodal understanding, without the ability to produce content in multiple modalities. As we humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging existing well-trained, high-performing encoders and decoders, NExT-GPT is tuned with only a small fraction of parameters (1%) in certain projection layers, which not only enables low-cost training but also facilitates convenient expansion to more potential modalities. Moreover, we introduce modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community. Project page: https://next-gpt.github.io/
Submitted 25 June, 2024; v1 submitted 11 September, 2023;
originally announced September 2023.
-
Residual Denoising Diffusion Models
Authors:
Jiawei Liu,
Qiang Wang,
Huijie Fan,
Yinong Wang,
Yandong Tang,
Liangqiong Qu
Abstract:
We propose residual denoising diffusion models (RDDM), a novel dual diffusion process that decouples the traditional single denoising diffusion process into residual diffusion and noise diffusion. This dual diffusion framework expands the denoising-based diffusion models, initially uninterpretable for image restoration, into a unified and interpretable model for both image generation and restoration by introducing residuals. Specifically, our residual diffusion represents directional diffusion from the target image to the degraded input image and explicitly guides the reverse generation process for image restoration, while noise diffusion represents random perturbations in the diffusion process. The residual prioritizes certainty, while the noise emphasizes diversity, enabling RDDM to effectively unify tasks with varying certainty or diversity requirements, such as image generation and restoration. We demonstrate that our sampling process is consistent with that of DDPM and DDIM through coefficient transformation, and propose a partially path-independent generation process to better understand the reverse process. Notably, our RDDM enables a generic UNet, trained with only an L1 loss and a batch size of 1, to compete with state-of-the-art image restoration methods. We provide code and pre-trained models to encourage further exploration, application, and development of our innovative framework (https://github.com/nachifur/RDDM).
Submitted 22 March, 2024; v1 submitted 25 August, 2023;
originally announced August 2023.
-
LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation
Authors:
Leigang Qu,
Shengqiong Wu,
Hao Fei,
Liqiang Nie,
Tat-Seng Chua
Abstract:
In the text-to-image generation field, recent remarkable progress in Stable Diffusion makes it possible to generate rich kinds of novel photorealistic images. However, current models still face misalignment issues (e.g., problematic spatial relation understanding and numeration failure) in complex natural scenes, which impedes the high-faithfulness text-to-image generation. Although recent efforts have been made to improve controllability by giving fine-grained guidance (e.g., sketch and scribbles), this issue has not been fundamentally tackled since users have to provide such guidance information manually. In this work, we strive to synthesize high-fidelity images that are semantically aligned with a given textual prompt without any guidance. Toward this end, we propose a coarse-to-fine paradigm to achieve layout planning and image generation. Concretely, we first generate the coarse-grained layout conditioned on a given textual prompt via in-context learning based on Large Language Models. Afterward, we propose a fine-grained object-interaction diffusion method to synthesize high-faithfulness images conditioned on the prompt and the automatically generated layout. Extensive experiments demonstrate that our proposed method outperforms the state-of-the-art models in terms of layout and image generation. Our code and settings are available at https://layoutllm-t2i.github.io.
Submitted 12 August, 2023; v1 submitted 9 August, 2023;
originally announced August 2023.
-
AvatarVerse: High-quality & Stable 3D Avatar Creation from Text and Pose
Authors:
Huichao Zhang,
Bowen Chen,
Hao Yang,
Liao Qu,
Xu Wang,
Li Chen,
Chao Long,
Feida Zhu,
Kang Du,
Min Zheng
Abstract:
Creating expressive, diverse and high-quality 3D avatars from highly customized text descriptions and pose guidance is a challenging task because of the intricacy of the 3D modeling and texturing needed to ensure detail and a variety of styles (realistic, fictional, etc.). We present AvatarVerse, a stable pipeline for generating expressive high-quality 3D avatars from nothing but text descriptions and pose guidance. Specifically, we introduce a 2D diffusion model conditioned on the DensePose signal to establish 3D pose control of avatars through 2D images, which enhances view consistency from partially observed scenarios. It addresses the infamous Janus Problem and significantly stabilizes the generation process. Moreover, we propose a progressive high-resolution 3D synthesis strategy, which substantially improves the quality of the created 3D avatars. As a result, the proposed AvatarVerse pipeline achieves zero-shot 3D modeling of avatars that are not only more expressive but also of higher quality and fidelity than those of previous works. Rigorous qualitative evaluations and user studies showcase AvatarVerse's superiority in synthesizing high-fidelity 3D avatars, leading to a new standard in high-quality and stable 3D avatar creation. Our project page is: https://avatarverse3d.github.io
Submitted 7 August, 2023;
originally announced August 2023.
-
Bright Second Harmonic Emission from Photonic Crystal Vertical Cavity
Authors:
Lun Qu,
Zhidong Gu,
Chenyang Li,
Yuan Qin,
Yiting Zhang,
Di Zhang,
Jiaxian Zhao,
Qiang Liu,
Chunyan Jin,
Lishuan Wang,
Wei Wu,
Wei Cai,
Huasong Liu,
Mengxin Ren,
Jingjun Xu
Abstract:
We present a study on photonic vertical cavities consisting of nonlinear materials embedded in photonic crystals (PhCs) for resonantly enhancing second harmonic generation (SHG). Previous attempts at SHG in such structures have been limited to efficiencies of 10$^{-7}$ to 10$^{-5}$, but we demonstrate here a high SHG efficiency of 0.28% by constructing a vertical cavity with a lithium niobate membrane placed between two PhCs, which exhibits high quality resonances. Our results open up new possibilities for compact laser frequency converters that could have a revolutionary impact on the fields of nonlinear optics and photonics.
Submitted 29 July, 2023;
originally announced July 2023.
-
Detection of a strong ~2.5 Hz modulation in the Newly Discovered Millisecond Pulsar MAXI J1816-195
Authors:
P. P. Li,
L. Tao,
L. Zhang,
Q. C. Bu,
J. L. Qu,
L. Ji,
P. J. Wang,
Y. P. Chen,
S. Zhang,
R. C. Ma,
Z. X. Yang,
W. T. Ye,
S. J. Zhao,
Q. C. Zhao,
Y. Huang,
X. Ma,
E. L. Qiao,
S. M. Jia,
S. N. Zhang
Abstract:
MAXI J1816-195 is a newly discovered accreting millisecond X-ray pulsar that went into outburst in June 2022. Through timing analysis with NICER and NuSTAR observations, we find a transient modulation at ~2.5 Hz during the decay period of MAXI J1816-195. The modulation is strongly correlated with a spectral hardening, and its fractional rms amplitude increases with energy. These results suggest that the modulation is likely to be produced in an unstable corona. In addition, the presence of the modulation during thermonuclear bursts indicates that it may originate from a disk-corona where the optical depth is likely the main factor affecting the modulation, rather than temperature. Moreover, we find significant reflection features in the spectra observed simultaneously by NICER and NuSTAR, including a relativistically broadened Fe-K line around 6-7 keV, and a Compton hump in the 10-30 keV energy band. The radius of the inner disc is constrained to be $R_{\rm in} = (1.04-1.23)\,R_{\rm ISCO}$ based on reflection modeling of the broadband spectra. Assuming that the inner disc is truncated at the magnetosphere radius, we estimate that the magnetic field strength is $< 4.67 \times 10^{8}$ G.
Submitted 26 July, 2023;
originally announced July 2023.
-
The Hidden Dance of Phonemes and Visage: Unveiling the Enigmatic Link between Phonemes and Facial Features
Authors:
Liao Qu,
Xianwei Zou,
Xiang Li,
Yandong Wen,
Rita Singh,
Bhiksha Raj
Abstract:
This work unveils the enigmatic link between phonemes and facial features. Traditional studies on voice-face correlations typically involve using a long period of voice input, including generating face images from voices and reconstructing 3D face meshes from voices. However, in situations like voice-based crimes, the available voice evidence may be short and limited. Additionally, from a physiological perspective, each segment of speech -- phoneme -- corresponds to different types of airflow and movements in the face. Therefore, it is advantageous to discover the hidden link between phonemes and face attributes. In this paper, we propose an analysis pipeline to help us explore the voice-face relationship in a fine-grained manner, i.e., phonemes vs. facial anthropometric measurements (AM). We build an estimator for each phoneme-AM pair and evaluate the correlation through hypothesis testing. Our results indicate that AMs are more predictable from vowels than from consonants, particularly plosives. Additionally, we observe that if a specific AM exhibits more movement during phoneme pronunciation, it is more predictable. Our findings support those in physiology regarding correlation and lay the groundwork for future research on speech-face multimodal learning.
Submitted 26 July, 2023;
originally announced July 2023.
-
HeteFedRec: Federated Recommender Systems with Model Heterogeneity
Authors:
Wei Yuan,
Liang Qu,
Lizhen Cui,
Yongxin Tong,
Xiaofang Zhou,
Hongzhi Yin
Abstract:
Owing to the nature of privacy protection, federated recommender systems (FedRecs) have garnered increasing interest in the realm of on-device recommender systems. However, most existing FedRecs only allow participating clients to collaboratively train a recommendation model of the same public parameter size. Training a model of the same size for all clients can lead to suboptimal performance since clients possess varying resources. For example, clients with limited training data may prefer to train a smaller recommendation model to avoid excessive data consumption, while clients with sufficient data would benefit from a larger model to achieve higher recommendation accuracy. To address the above challenge, this paper introduces HeteFedRec, a novel FedRec framework that enables the assignment of personalized model sizes to participants. In HeteFedRec, we present a heterogeneous recommendation model aggregation strategy, including a unified dual-task learning mechanism and a dimensional decorrelation regularization, to allow knowledge aggregation among recommender models of different sizes. Additionally, a relation-based ensemble knowledge distillation method is proposed to effectively distil knowledge from heterogeneous item embeddings. Extensive experiments conducted on three real-world recommendation datasets demonstrate the effectiveness and efficiency of HeteFedRec in training federated recommender systems under heterogeneous settings.
Submitted 5 December, 2023; v1 submitted 24 July, 2023;
originally announced July 2023.
-
Intermittent QPO properties of MAXI J1820+070 revealed by Insight-HXMT
Authors:
P. Zhang,
R. Soria,
S. Zhang,
L. Ji,
L. D. Kong,
Y. P. Chen,
S. N. Zhang,
Z. Chang,
M. Y. Ge,
J. Li,
G. C. Liu,
Q. Z. Liu,
X. Ma,
J. Q. Peng,
J. L. Qu,
Q. C. Shui,
L. Tao,
H. J. Tian,
P. J. Wang,
J. Z. Yan,
X. Y. Zeng
Abstract:
We investigate the dynamical properties of low frequency quasi-periodic oscillations (QPOs) observed from the black hole X-ray binary MAXI J1820+070 during the early part of its 2018 outburst, when the system was in a bright hard state. To this aim, we use a series of observations from the Hard X-ray Modulation Telescope Insight-HXMT, and apply a wavelet decomposition (weighted wavelet Z-transforms) to the X-ray light-curve. We find that the QPO phenomenon is intermittent within each individual observation, with some sub-intervals where the oscillation is strongly detected (high root-mean-square amplitude) and others where it is weak or absent. The average lifetime of individual QPO segments is ~ 5 oscillation cycles, with a 3 sigma tail up to ~ 20 cycles. There is no substantial difference between the energy spectra during intervals with strong and weak/absent QPOs. We discuss two possible reasons for the intermittent QPO strength, within the precessing jet model previously proposed for MAXI J1820+070. In the rigid precession model, intermittent QPOs are predicted to occur with a coherence Q ~ a few when the disk alignment time-scale is only a few times the precession time-scale. Alternatively, we suggest that changes in oscillation amplitude can be caused by changes in the jet speed. We argue that a misaligned, precessing jet scenario is also consistent with other recent observational findings that suggest an oscillation of the Compton reflection component in phase with the QPOs.
Submitted 15 July, 2023;
originally announced July 2023.
-
OpenAL: An Efficient Deep Active Learning Framework for Open-Set Pathology Image Classification
Authors:
Linhao Qu,
Yingfan Ma,
Zhiwei Yang,
Manning Wang,
Zhijian Song
Abstract:
Active learning (AL) is an effective approach to select the most informative samples to label so as to reduce the annotation cost. Existing AL methods typically work under the closed-set assumption, i.e., all classes existing in the unlabeled sample pool need to be classified by the target model. However, in some practical clinical tasks, the unlabeled pool may contain not only the target classes that require fine-grained classification, but also non-target classes that are irrelevant to the clinical tasks. Existing AL methods cannot work well in this scenario because they tend to select a large number of non-target samples. In this paper, we formulate this scenario as an open-set AL problem and propose an efficient framework, OpenAL, to address the challenge of querying samples from an unlabeled pool with both target class and non-target class samples. Experiments on fine-grained classification of pathology images show that OpenAL can significantly improve the query quality of target class samples and achieve higher performance than current state-of-the-art AL methods. Code is available at https://github.com/miccaiif/OpenAL.
Submitted 11 July, 2023;
originally announced July 2023.
-
Rethinking Multiple Instance Learning for Whole Slide Image Classification: A Good Instance Classifier is All You Need
Authors:
Linhao Qu,
Yingfan Ma,
Xiaoyuan Luo,
Manning Wang,
Zhijian Song
Abstract:
Weakly supervised whole slide image classification is usually formulated as a multiple instance learning (MIL) problem, where each slide is treated as a bag, and the patches cut out of it are treated as instances. Existing methods either train an instance classifier through pseudo-labeling or aggregate instance features into a bag feature through attention mechanisms and then train a bag classifier, where the attention scores can be used for instance-level classification. However, the pseudo instance labels constructed by the former usually contain a lot of noise, and the attention scores constructed by the latter are not accurate enough, both of which affect their performance. In this paper, we propose an instance-level MIL framework based on contrastive learning and prototype learning to effectively accomplish both instance classification and bag classification tasks. To this end, we propose an instance-level weakly supervised contrastive learning algorithm for the first time under the MIL setting to effectively learn instance feature representation. We also propose an accurate pseudo label generation method through prototype learning. We then develop a joint training strategy for weakly supervised contrastive learning, prototype learning, and instance classifier training. Extensive experiments and visualizations on four datasets demonstrate the powerful performance of our method. Codes are available at https://github.com/miccaiif/INS.
Submitted 11 May, 2024; v1 submitted 5 July, 2023;
originally announced July 2023.
-
Multiple magnetoplasmon polaritons of magneto-optical graphene in near-field radiative heat transfer
Authors:
Ming-Jian He,
Lei Qu,
Ya-Tao Ren,
Hong Qi,
Mauro Antezza,
He-Ping Tan
Abstract:
Graphene, as a two-dimensional magneto-optical material, supports magnetoplasmon polaritons (MPP) when exposed to an applied magnetic field. Recently, MPP of a single-layer graphene has shown an excellent capability in the modulation of near-field radiative heat transfer (NFRHT). In this study, we present a comprehensive theoretical analysis of NFRHT between two multilayered graphene structures, with a particular focus on the multiple MPP effect. We reveal the physical mechanism and evolution law of the multiple MPP, and we demonstrate that the multiple MPP allow one to mediate, enhance, and tune the NFRHT by appropriately engineering the properties of graphene, the number of graphene sheets, the intensity of magnetic fields, as well as the geometric structure of systems. We show that the multiple MPP have a quite significant distinction relative to the single MPP or multiple surface plasmon polaritons (SPPs) in terms of modulating and manipulating NFRHT.
Submitted 24 June, 2023;
originally announced June 2023.
-
Personalized Elastic Embedding Learning for On-Device Recommendation
Authors:
Ruiqi Zheng,
Liang Qu,
Tong Chen,
Kai Zheng,
Yuhui Shi,
Hongzhi Yin
Abstract:
To address privacy concerns and reduce network latency, there has been a recent trend of compressing cumbersome recommendation models trained on the cloud and deploying compact recommender models to resource-limited devices for the real-time recommendation. Existing solutions generally overlook device heterogeneity and user heterogeneity. They require devices with the same budget to share the same model and assume the available device resources (e.g., memory) are constant, which is not reflective of reality. Considering device and user heterogeneities as well as dynamic resource constraints, this paper proposes a Personalized Elastic Embedding Learning framework (PEEL) for the on-device recommendation, which generates Personalized Elastic Embeddings (PEEs) for devices with various memory budgets in a once-for-all manner, adapting to new or dynamic budgets, and addressing user preference diversity by assigning personalized embeddings for different groups of users. Specifically, it pretrains a global embedding table with collected user-item interaction instances and clusters users into groups. Then, it refines the embedding tables with local interaction instances within each group. PEEs are generated from the group-wise embedding blocks and their weights that indicate the contribution of each embedding block to the local recommendation performance. Given a memory budget, PEEL efficiently generates PEEs by selecting embedding blocks with the largest weights, making it adaptable to dynamic memory budgets on devices. Furthermore, a diversity-driven regularizer is implemented to encourage the expressiveness of embedding blocks, and a controller is utilized to optimize the weights. Extensive experiments are conducted on two public datasets, and the results show that PEEL yields superior performance on devices with heterogeneous and dynamic memory budgets.
Submitted 16 November, 2023; v1 submitted 18 June, 2023;
originally announced June 2023.
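The budget-driven selection step described in the abstract above (choosing the embedding blocks with the largest weights for a given memory budget) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_blocks`, the per-block sizes, and the greedy skip-if-too-large rule are all assumptions made for illustration.

```python
from typing import List


def select_blocks(weights: List[float], block_sizes: List[int], budget: int) -> List[int]:
    """Greedily pick embedding blocks in descending weight order.

    Hypothetical helper: a block is taken only if it still fits in the
    remaining memory budget; otherwise it is skipped.
    """
    # Indices of blocks sorted from largest to smallest weight.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    chosen: List[int] = []
    used = 0
    for i in order:
        if used + block_sizes[i] <= budget:
            chosen.append(i)
            used += block_sizes[i]
    # Return indices in their original order for readability.
    return sorted(chosen)


if __name__ == "__main__":
    weights = [0.9, 0.1, 0.5, 0.7]  # illustrative block weights
    sizes = [4, 4, 4, 4]            # illustrative memory cost per block
    print(select_blocks(weights, sizes, budget=8))   # [0, 3]
    print(select_blocks(weights, sizes, budget=12))  # [0, 2, 3]
```

Because the blocks are ranked once, a change in the budget only requires re-running the cheap selection loop, which is one way to read the abstract's claim of adapting to dynamic memory budgets.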
-
The Rise of AI Language Pathologists: Exploring Two-level Prompt Learning for Few-shot Weakly-supervised Whole Slide Image Classification
Authors:
Linhao Qu,
Xiaoyuan Luo,
Kexue Fu,
Manning Wang,
Zhijian Song
Abstract:
This paper introduces the novel concept of few-shot weakly supervised learning for pathology Whole Slide Image (WSI) classification, denoted as FSWC. A solution is proposed based on prompt learning and the utilization of a large language model, GPT-4. Since a WSI is too large and needs to be divided into patches for processing, WSI classification is commonly approached as a Multiple Instance Learning (MIL) problem. In this context, each WSI is considered a bag, and the obtained patches are treated as instances. The objective of FSWC is to classify both bags and instances with only a limited number of labeled bags. Unlike conventional few-shot learning problems, FSWC poses additional challenges due to its weak bag labels within the MIL framework. Drawing inspiration from the recent achievements of vision-language models (V-L models) in downstream few-shot classification tasks, we propose a two-level prompt learning MIL framework tailored for pathology, incorporating language prior knowledge. Specifically, we leverage CLIP to extract instance features for each patch, and introduce a prompt-guided pooling strategy to aggregate these instance features into a bag feature. Subsequently, we employ a small number of labeled bags to facilitate few-shot prompt learning based on the bag features. Our approach incorporates the utilization of GPT-4 in a question-and-answer mode to obtain language prior knowledge at both the instance and bag levels, which are then integrated into the instance and bag level language prompts. Additionally, a learnable component of the language prompts is trained using the available few-shot labeled data. We conduct extensive experiments on three real WSI datasets encompassing breast cancer, lung cancer, and cervical cancer, demonstrating the notable performance of the proposed method in bag and instance classification. All codes will be available.
Submitted 28 January, 2024; v1 submitted 29 May, 2023;
originally announced May 2023.
-
FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing
Authors:
Zhuang Li,
Yuyang Chai,
Terry Yue Zhuo,
Lizhen Qu,
Gholamreza Haffari,
Fei Li,
Donghong Ji,
Quan Hung Tran
Abstract:
Textual scene graph parsing has become increasingly important in various vision-language applications, including image caption evaluation and image retrieval. However, existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors. First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness. Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
To address these challenges, we propose a novel dataset, which involves re-annotating the captions in Visual Genome (VG) using a new intermediate representation called FACTUAL-MR. FACTUAL-MR can be directly converted into faithful and consistent scene graph annotations. Our experimental results clearly demonstrate that the parser trained on our dataset outperforms existing approaches in terms of faithfulness and consistency. This improvement leads to a significant performance boost in both image caption evaluation and zero-shot image retrieval tasks. Furthermore, we introduce a novel metric for measuring scene graph similarity, which, when combined with the improved scene graph parser, achieves state-of-the-art (SOTA) results on multiple benchmark datasets for the aforementioned tasks. The code and dataset are available at https://github.com/zhuang-li/FACTUAL .
Submitted 1 June, 2023; v1 submitted 27 May, 2023;
originally announced May 2023.
-
The Best of Both Worlds: Combining Human and Machine Translations for Multilingual Semantic Parsing with Active Learning
Authors:
Zhuang Li,
Lizhen Qu,
Philip R. Cohen,
Raj V. Tumuluri,
Gholamreza Haffari
Abstract:
Multilingual semantic parsing aims to leverage the knowledge from the high-resource languages to improve low-resource semantic parsing, yet commonly suffers from the data imbalance problem. Prior works propose to utilize the translations by either humans or machines to alleviate such issues. However, human translations are expensive, while machine translations are cheap but prone to error and bias. In this work, we propose an active learning approach that exploits the strengths of both human and machine translations by iteratively adding small batches of human translations into the machine-translated training set. Besides, we propose novel aggregated acquisition criteria that help our active learning method select utterances to be manually translated. Our experiments demonstrate that an ideal utterance selection can significantly reduce the error and bias in the translated data, resulting in higher parser accuracies than the parsers merely trained on the machine-translated data.
Submitted 22 May, 2023;
originally announced May 2023.
-
Language Independent Neuro-Symbolic Semantic Parsing for Form Understanding
Authors:
Bhanu Prakash Voutharoja,
Lizhen Qu,
Fatemeh Shiri
Abstract:
Recent works on form understanding mostly employ multimodal transformers or large-scale pre-trained language models. These models need ample data for pre-training. In contrast, humans can usually identify key-value pairings from a form only by looking at layouts, even if they don't comprehend the language used. No prior research has been conducted to investigate how helpful layout information alone is for form understanding. Hence, we propose a unique entity-relation graph parsing method for scanned forms called LAGNN, a language-independent Graph Neural Network model. Our model parses a form into a word-relation graph in order to identify entities and relations jointly and reduce the time complexity of inference. This graph is then transformed by deterministic rules into a fully connected entity-relation graph. Our model simply takes into account relative spacing between bounding boxes from layout information to facilitate easy transfer across languages. To further improve the performance of LAGNN, and achieve isomorphism between entity-relation graphs and word-relation graphs, we use integer linear programming (ILP) based inference. Code is publicly available at https://github.com/Bhanu068/LAGNN
Submitted 8 May, 2023;
originally announced May 2023.
-
Turning Flowchart into Dialog: Augmenting Flowchart-grounded Troubleshooting Dialogs via Synthetic Data Generation
Authors:
Haolan Zhan,
Sameen Maruf,
Lizhen Qu,
Yufei Wang,
Ingrid Zukerman,
Gholamreza Haffari
Abstract:
Flowchart-grounded troubleshooting dialogue (FTD) systems, which follow the instructions of a flowchart to diagnose users' problems in specific domains (e.g., vehicle, laptop), have been gaining research interest in recent years. However, collecting sufficient dialogues that are naturally grounded on flowcharts is costly, so FTD systems are impeded by scarce training data. To mitigate the data sparsity issue, we propose a plan-based synthetic data generation (PlanSDG) approach that generates diverse synthetic dialog data at scale by transforming concise flowcharts into dialogues. Specifically, its generative model employs a variational-based framework with a hierarchical planning strategy that includes global and local latent planning variables. Experiments on the FloDial dataset show that synthetic dialogue produced by PlanSDG improves the performance of downstream tasks, including flowchart path retrieval and response generation, in particular in the Out-of-Flowchart settings. In addition, further analysis demonstrates the quality of synthetic data generated by PlanSDG in paths that are covered by current sample dialogues and paths that are not covered.
Submitted 29 October, 2023; v1 submitted 2 May, 2023;
originally announced May 2023.
-
Learnable Pillar-based Re-ranking for Image-Text Retrieval
Authors:
Leigang Qu,
Meng Liu,
Wenjie Wang,
Zhedong Zheng,
Liqiang Nie,
Tat-Seng Chua
Abstract:
Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarities. Prior work usually focuses on the pairwise relations (i.e., whether a data sample matches another) but ignores the higher-order neighbor relations (i.e., a matching structure among multiple data samples). Re-ranking, a popular post-processing practice, has revealed the superiority of capturing neighbor relations in single-modality retrieval tasks. However, it is ineffective to directly extend existing re-ranking algorithms to image-text retrieval. In this paper, we analyze the reason from four perspectives, i.e., generalization, flexibility, sparsity, and asymmetry, and propose a novel learnable pillar-based re-ranking paradigm. Concretely, we first select top-ranked intra- and inter-modal neighbors as pillars, and then reconstruct data samples with the neighbor relations between them and the pillars. In this way, each sample can be mapped into a multimodal pillar space only using similarities, ensuring generalization. After that, we design a neighbor-aware graph reasoning module to flexibly exploit the relations and excavate the sparse positive items within a neighborhood. We also present a structure alignment constraint to promote cross-modal collaboration and align the asymmetric modalities. On top of various base backbones, we carry out extensive experiments on two benchmark datasets, i.e., Flickr30K and MS-COCO, demonstrating the effectiveness, superiority, generalization, and transferability of our proposed re-ranking paradigm.
Submitted 25 April, 2023;
originally announced April 2023.
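The pillar idea in the abstract above (pick top-ranked neighbors as pillars, then re-express every sample through its similarities to those pillars) can be sketched in a few lines. This is a simplified illustration only: it assumes a single precomputed similarity matrix rather than separate intra- and inter-modal neighbors, omits the learnable graph reasoning module, and the name `pillar_representation` is hypothetical.

```python
from typing import List, Tuple


def pillar_representation(
    sim: List[List[float]], query: int, k: int
) -> Tuple[List[int], List[List[float]]]:
    """Map every sample into a 'pillar space' built around one query.

    sim[i][j] is an assumed precomputed similarity between samples i and j.
    The pillars are the query's k most similar samples (query excluded);
    each sample is then described by its similarities to those pillars.
    """
    candidates = [j for j in range(len(sim)) if j != query]
    ranked = sorted(candidates, key=lambda j: sim[query][j], reverse=True)
    pillars = ranked[:k]
    features = [[sim[i][p] for p in pillars] for i in range(len(sim))]
    return pillars, features


if __name__ == "__main__":
    # Illustrative symmetric similarity matrix for four samples.
    sim = [
        [1.0, 0.8, 0.2, 0.6],
        [0.8, 1.0, 0.3, 0.5],
        [0.2, 0.3, 1.0, 0.1],
        [0.6, 0.5, 0.1, 1.0],
    ]
    pillars, feats = pillar_representation(sim, query=0, k=2)
    print(pillars)  # [1, 3]
    print(feats)    # [[0.8, 0.6], [1.0, 0.5], [0.3, 0.1], [0.5, 1.0]]
```

Since the new representation is built only from similarity scores, it does not depend on the feature dimensionality of any particular backbone, which matches the abstract's point about generalization.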
-
SocialDial: A Benchmark for Socially-Aware Dialogue Systems
Authors:
Haolan Zhan,
Zhuang Li,
Yufei Wang,
Linhao Luo,
Tao Feng,
Xiaoxi Kang,
Yuncheng Hua,
Lizhen Qu,
Lay-Ki Soon,
Suraj Sharma,
Ingrid Zukerman,
Zhaleh Semnani-Azad,
Gholamreza Haffari
Abstract:
Dialogue systems have been widely applied in many scenarios and are now more powerful and ubiquitous than ever before. With large neural models and massive available data, current dialogue systems have access to more knowledge than any person acquires in a lifetime. However, current dialogue systems still do not perform at a human level. One major gap between conversational agents and humans lies in their abilities to be aware of social norms. The development of socially-aware dialogue systems is impeded by the lack of resources. In this paper, we present the first socially-aware dialogue corpus - SocialDial, based on Chinese social culture. SocialDial consists of two parts: 1,563 multi-turn dialogues between two human speakers with fine-grained labels, and 4,870 synthetic conversations generated by ChatGPT. The human corpus covers five categories of social norms, which have 14 sub-categories in total. Specifically, it contains social factor annotations including social relation, context, social distance, and social norms. However, collecting sufficient socially-aware dialogues is costly. Thus, we harness the power of ChatGPT and devise an ontology-based synthetic data generation framework. This framework is able to generate synthetic data at scale. To ensure the quality of synthetic dialogues, we design several mechanisms for quality control during data collection. Finally, we evaluate our dataset using several pre-trained models, such as BERT and RoBERTa. Comprehensive empirical results based on state-of-the-art neural models demonstrate that modeling of social norms for dialogue systems is a promising research direction. To the best of our knowledge, SocialDial is the first socially-aware dialogue dataset that covers multiple social factors and has fine-grained labels.
Submitted 24 April, 2023;
originally announced April 2023.
-
Towards Arbitrary-scale Histopathology Image Super-resolution: An Efficient Dual-branch Framework based on Implicit Self-texture Enhancement
Authors:
Linhao Qu,
Minghong Duan,
Zhiwei Yang,
Manning Wang,
Zhijian Song
Abstract:
Existing super-resolution models for pathology images can only work in fixed integer magnifications and have limited performance. Though implicit neural network-based methods have shown promising results in arbitrary-scale super-resolution of natural images, it is not effective to directly apply them in pathology images, because pathology images have special fine-grained image textures different from natural images. To address this challenge, we propose a dual-branch framework with an efficient self-texture enhancement mechanism for arbitrary-scale super-resolution of pathology images. Extensive experiments on two public datasets show that our method outperforms both existing fixed-scale and arbitrary-scale algorithms. To the best of our knowledge, this is the first work to achieve arbitrary-scale super-resolution in the field of pathology images. Codes will be available.
Submitted 9 April, 2023;
originally announced April 2023.
-
Time-varying $β$-model for dynamic directed networks
Authors:
Yuqing Du,
Lianqiang Qu,
Ting Yan,
Yuan Zhang
Abstract:
We extend the well-known $β$-model for directed graphs to the dynamic network setting, where we observe snapshots of adjacency matrices at different time points. We propose a kernel-smoothed likelihood approach for estimating $2n$ time-varying parameters in a network with $n$ nodes, from $N$ snapshots. We establish consistency and asymptotic normality properties of our kernel-smoothed estimators as either $n$ or $N$ diverges. Our results contrast their counterparts in single-network analyses, where $n\to\infty$ is invariably required in asymptotic studies. We conduct comprehensive simulation studies that confirm our theory's prediction and illustrate the performance of our method from various angles. We apply our method to an email data set and obtain meaningful results.
Submitted 5 April, 2023;
originally announced April 2023.
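For readers unfamiliar with the model named in this abstract, one common way to write a time-varying directed $β$-model and a kernel-smoothed log-likelihood is sketched below. The symbols $α_i(t)$, $β_j(t)$, $A_{ij}(t_k)$, the kernel $K_h$, and the bandwidth $h$ are notation assumed here for illustration; the paper's own formulation may differ in its details.

```latex
% Edge probability at time t: node i carries an out-going parameter \alpha_i(t)
% and node j an in-coming parameter \beta_j(t), giving 2n time-varying parameters.
\[
  P\bigl(A_{ij}(t) = 1\bigr)
    = \frac{\exp\{\alpha_i(t) + \beta_j(t)\}}{1 + \exp\{\alpha_i(t) + \beta_j(t)\}},
  \qquad i \neq j .
\]
% Kernel-smoothed log-likelihood at time t, built from N snapshots at t_1, ..., t_N.
% Each snapshot contributes a Bernoulli log-likelihood weighted by K_h(t_k - t).
\[
  \ell_h(\alpha, \beta; t)
    = \sum_{k=1}^{N} K_h(t_k - t) \sum_{i \neq j}
      \Bigl[ A_{ij}(t_k)\,(\alpha_i + \beta_j)
             - \log\bigl(1 + e^{\alpha_i + \beta_j}\bigr) \Bigr].
\]
```

Maximizing $\ell_h$ over $(α, β)$ at a fixed $t$ would give local estimates of the $2n$ parameters, with snapshots close to $t$ weighted most heavily.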
-
Less is More: Mitigate Spurious Correlations for Open-Domain Dialogue Response Generation Models by Causal Discovery
Authors:
Tao Feng,
Lizhen Qu,
Gholamreza Haffari
Abstract:
In this paper, we conduct the first study on spurious correlations for open-domain response generation models based on a corpus CGDIALOG curated in our work. The current models indeed suffer from spurious correlations and tend to generate irrelevant and generic responses. Inspired by causal discovery algorithms, we propose a novel model-agnostic method for training and inference of response generation models using a conditional independence classifier. The classifier is trained by a constrained self-training method, coined CONSTRAIN, to overcome data scarcity. The experimental results based on both human and automatic evaluation show that our method significantly outperforms the competitive baselines in terms of relevance, informativeness, and fluency.
Submitted 2 March, 2023;
originally announced March 2023.
-
A detailed view of low-frequency quasi-periodic oscillation in the broadband 0.2-200 keV with Insight-HXMT and NICER
Authors:
X. Ma,
L. Zhang,
L. Tao,
Q. C. Bu,
J. L. Qu,
S. N. Zhang,
D. K. Zhou,
Y. Huang,
S. M. Jia,
L. M. Song,
S. Zhang,
M. Y. Ge,
H. X. Liu,
Z. X. Yang,
W. Yu,
E. S. Yorgancioglu
Abstract:
We report the X-ray timing results of the black hole candidate MAXI J1820+070 during its 2018 outburst using the Hard X-ray Modulation Telescope (Insight-HXMT) and Neutron Star Interior Composition Explorer Mission (NICER) observations. Low frequency quasi-periodic oscillations (LFQPOs) are detected in the low/hard state and the hard intermediate state, which lasted for about 90 days. Thanks to the large effective area of Insight-HXMT at high energies and NICER at low energies, we are able to present the energy dependence of the LFQPO characteristics and phase lags from 0.2 keV to 200 keV, which has never been explored by previous missions. We find that the centroid frequency of the LFQPOs does not change significantly with energy, while the full width at half maximum (FWHM) and fractional rms show a complex evolution with energy. The LFQPO phase lags at high energies and low energies show consistent energy-dependence relations taking ~2 keV as the reference. Our results suggest that the LFQPOs from high energy come from the LT precession of the relativistic jet, while the low-energy radiation is mainly from the perpendicular innermost regions of the accretion disk.
Submitted 1 March, 2023;
originally announced March 2023.
-
Towards more precise automatic analysis: a comprehensive survey of deep learning-based multi-organ segmentation
Authors:
Xiaoyu Liu,
Linhao Qu,
Ziyue Xie,
Jiayue Zhao,
Yonghong Shi,
Zhijian Song
Abstract:
Accurate segmentation of multiple organs of the head, neck, chest, and abdomen from medical images is an essential step in computer-aided diagnosis, surgical navigation, and radiation therapy. In the past few years, with a data-driven feature extraction approach and end-to-end training, automatic deep learning-based multi-organ segmentation methods have far outperformed traditional methods and become a new research topic. This review systematically summarizes the latest research in this field. For the first time, from the perspective of full and imperfect annotation, we comprehensively compile 161 studies on deep learning-based multi-organ segmentation in multiple regions such as the head and neck, chest, and abdomen, containing a total of 214 related references. The method based on full annotation summarizes the existing methods from four aspects: network architecture, network dimension, network dedicated modules, and network loss function. The method based on imperfect annotation summarizes the existing methods from two aspects: weak annotation-based methods and semi annotation-based methods. We also summarize frequently used datasets for multi-organ segmentation and discuss new challenges and new research trends in this field.
Submitted 2 March, 2023; v1 submitted 28 February, 2023;
originally announced March 2023.
-
Semi-decentralized Federated Ego Graph Learning for Recommendation
Authors:
Liang Qu,
Ningzhi Tang,
Ruiqi Zheng,
Quoc Viet Hung Nguyen,
Zi Huang,
Yuhui Shi,
Hongzhi Yin
Abstract:
Collaborative filtering (CF) based recommender systems are typically trained based on personal interaction data (e.g., clicks and purchases) that could be naturally represented as ego graphs. However, most existing recommendation methods collect these ego graphs from all users to compose a global graph to obtain high-order collaborative information between users and items, and these centralized CF recommendation methods inevitably lead to a high risk of user privacy leakage. Although recently proposed federated recommendation systems can mitigate the privacy problem, they either restrict the on-device local training to an isolated ego graph or rely on an additional third-party server to access other ego graphs resulting in a cumbersome pipeline, which is hard to work in practice. In addition, existing federated recommendation systems require resource-limited devices to maintain the entire embedding tables resulting in high communication costs.
In light of this, we propose a semi-decentralized federated ego graph learning framework for on-device recommendations, named SemiDFEGL, which introduces new device-to-device collaborations to improve scalability and reduce communication costs and innovatively utilizes predicted interacted item nodes to connect isolated ego graphs to augment local subgraphs such that the high-order user-item collaborative information could be used in a privacy-preserving manner. Furthermore, the proposed framework is model-agnostic, meaning that it could be seamlessly integrated with existing graph neural network-based recommendation methods and privacy protection techniques. To validate the effectiveness of the proposed SemiDFEGL, extensive experiments are conducted on three public datasets, and the results demonstrate the superiority of the proposed SemiDFEGL compared to other federated recommendation methods.
Submitted 9 February, 2023;
originally announced February 2023.
-
Emphasizing Unseen Words: New Vocabulary Acquisition for End-to-End Speech Recognition
Authors:
Leyuan Qu,
Cornelius Weber,
Stefan Wermter
Abstract:
Due to the dynamic nature of human language, automatic speech recognition (ASR) systems need to continuously acquire new vocabulary. Out-Of-Vocabulary (OOV) words, such as trending words and new named entities, pose problems to modern ASR systems that require long training times to adapt their large numbers of parameters. Different from most previous research focusing on language model post-processing, we tackle this problem on an earlier processing level and eliminate the bias in acoustic modeling to recognize OOV words acoustically. We propose to generate OOV words using text-to-speech systems and to rescale losses to encourage neural networks to pay more attention to OOV words. Specifically, we enlarge the classification loss used for training neural networks' parameters of utterances containing OOV words (sentence-level), or rescale the gradient used for back-propagation for OOV words (word-level), when fine-tuning a previously trained model on synthetic audio. To overcome catastrophic forgetting, we also explore the combination of loss rescaling and model regularization, i.e. L2 regularization and elastic weight consolidation (EWC). Compared with previous methods that just fine-tune synthetic audio with EWC, the experimental results on the LibriSpeech benchmark reveal that our proposed loss rescaling approach can achieve significant improvement on the recall rate with only a slight decrease on word error rate. Moreover, word-level rescaling is more stable than utterance-level rescaling and leads to higher recall rates and precision on OOV word recognition. Furthermore, our proposed combined loss rescaling and weight consolidation methods can support continual learning of an ASR system.
Submitted 21 February, 2023; v1 submitted 19 February, 2023;
originally announced February 2023.
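The sentence-level loss rescaling described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `rescale_losses`, the `scale` factor, and the whitespace tokenization are assumptions made for illustration, and a real system would apply the scaling to the network's classification loss during fine-tuning rather than to plain floats.

```python
def rescale_losses(utterance_losses, transcripts, oov_words, scale=2.0):
    """Enlarge the loss of every utterance whose transcript contains an OOV word.

    utterance_losses: per-utterance loss values (plain floats in this sketch)
    transcripts:      reference transcripts, one per utterance
    oov_words:        the new vocabulary the model should pay more attention to
    scale:            hypothetical rescaling factor (> 1 emphasizes OOV utterances)
    """
    oov = {word.lower() for word in oov_words}
    rescaled = []
    for loss, text in zip(utterance_losses, transcripts):
        # Naive whitespace tokenization; an assumption of this sketch.
        has_oov = any(token in oov for token in text.lower().split())
        rescaled.append(loss * scale if has_oov else loss)
    return rescaled


# Only the first utterance contains the OOV word, so only its loss is scaled.
losses = rescale_losses(
    [1.0, 0.5],
    ["we flew a quadcopter", "hello world"],
    {"quadcopter"},
    scale=3.0,
)
```

The word-level variant mentioned in the abstract would instead rescale the gradient contribution of the OOV tokens themselves, leaving the other words of the utterance untouched.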
-
Document Flattening: Beyond Concatenating Context for Document-Level Neural Machine Translation
Authors:
Minghao Wu,
George Foster,
Lizhen Qu,
Gholamreza Haffari
Abstract:
Existing work in document-level neural machine translation commonly concatenates several consecutive sentences as a pseudo-document, and then learns inter-sentential dependencies. This strategy limits the model's ability to leverage information from distant context. We overcome this limitation with a novel Document Flattening (DocFlat) technique that integrates Flat-Batch Attention (FBA) and Neural Context Gate (NCG) into Transformer model to utilize information beyond the pseudo-document boundaries. FBA allows the model to attend to all the positions in the batch and learns the relationships between positions explicitly and NCG identifies the useful information from the distant context. We conduct comprehensive experiments and analyses on three benchmark datasets for English-German translation, and validate the effectiveness of two variants of DocFlat. Empirical results show that our approach outperforms strong baselines with statistical significance on BLEU, COMET and accuracy on the contrastive test set. The analyses highlight that DocFlat is highly effective in capturing the long-range information.
Submitted 15 February, 2023;
originally announced February 2023.
-
Timing analysis of EXO 2030+375 during its 2021 giant outburst observed with Insight-HXMT
Authors:
Yu-Cong Fu,
L. M. Song,
G. Q. Ding,
M. Y. Ge,
Y. L. Tuo,
S. Zhang,
S. N. Zhang,
X. Hou,
J. L. Qu,
J. Zhang,
L. Zhang,
Q. C. Bu,
Y. Huang,
X. Ma,
X. Zhou,
W. M. Yan,
Z. X. Yang,
X. F. Lu,
T. M. Li,
Y. C. Xu,
P. J. Wang,
S. H. Xiao,
H. X. Liu,
X. Q. Ren,
Y. F. Du
, et al. (2 additional authors not shown)
Abstract:
We report the evolution of the X-ray pulsations of EXO 2030+375 during its 2021 outburst using the observations from Insight-HXMT. Based on the accretion torque model, we study the correlation between the spin frequency derivatives and the luminosity. Pulsations can be detected in the energy band of 1--160 keV. The pulse profile evolves significantly with luminosity during the outburst, so that the whole outburst can be divided into several parts with different characteristics. The evolution of the pulse profile reveals the transition between the super-critical (fan-beam dominated) and the sub-critical accretion (pencil-beam dominated) mode. From the accretion torque model and the critical luminosity model, based on a distance of 7.1 kpc, the inferred magnetic fields are $(0.41-0.74) \times 10^{12}$ G and $(3.48-3.96) \times 10^{12}$ G, respectively, or based on a distance of 3.6 kpc, the estimated magnetic fields are $(2.4-4.3) \times 10^{13}$ G and $(0.98-1.11)\times 10^{12}$ G, respectively. Two different sets of magnetic fields both support the presence of multipole magnetic fields of the NS.
Submitted 25 February, 2023; v1 submitted 4 February, 2023;
originally announced February 2023.
-
Reanalysis of the X-ray burst associated FRB 200428 with Insight-HXMT observations
Authors:
M. Y. Ge,
C. Z. Liu,
S. N. Zhang,
F. J. Lu,
Z. Zhang,
Z. Chang,
Y. L. Tuo,
X. B. Li,
C. K. Li,
S. L. Xiong,
C. Cai,
X. F. Li,
R. Zhang,
Z. G. Dai,
J. L. Qu,
L. M. Song,
S. Zhang,
L. J. Wang
Abstract:
A double-peak X-ray burst from the Galactic magnetar SGR J1935+2154 was discovered as associated with the two radio pulses of FRB 200428 separated by 28.97+-0.02 ms. Precise measurements of the timing and spectral properties of the X-ray bursts are helpful for understanding the physical origin of fast radio bursts (FRBs). In this paper, we have reconstructed some information about the hard X-ray events, which were lost because the High Energy X-ray Telescope (HE) onboard the Insight-HXMT mission was saturated by this extremely bright burst, and used the information to improve the temporal and spectral analyses of the X-ray burst. The arrival times of the two X-ray peaks by fitting the new Insight-HXMT/HE lightcurve with multi-Gaussian profiles are 2.77+-0.45 ms and 34.30+-0.56 ms after the first peak of FRB 200428, respectively, while these two parameters are 2.57+-0.52 ms and 32.5+-1.4 ms if the fitting profile is a fast rise and exponential decay function. The spectrum of the two X-ray peaks could be described by a cutoff power-law with cutoff energy ~60 keV and photon index ~1.4, the latter is softer than that of the underlying bright and broader X-ray burst when the two X-ray peaks appeared.
Submitted 31 January, 2023;
originally announced February 2023.
-
A degree-corrected Cox model for dynamic networks
Authors:
Yuguo Chen,
Lianqiang Qu,
**feng Xu,
Ting Yan,
Yunpeng Zhou
Abstract:
Continuous time network data have been successfully modeled by multivariate counting processes, in which the intensity function is characterized by covariate information. However, degree heterogeneity has not been incorporated into the model which may lead to large biases for the estimation of homophily effects. In this paper, we propose a degree-corrected Cox network model to simultaneously analyze the dynamic degree heterogeneity and homophily effects for continuous time directed network data. Since each node has individual-specific in- and out-degree effects in the model, the dimension of the time-varying parameter vector grows with the number of nodes, which makes the estimation problem non-standard. We develop a local estimating equations approach to estimate unknown time-varying parameters, and establish consistency and asymptotic normality of the proposed estimators by using the powerful martingale process theories. We further propose test statistics to test for trend and degree heterogeneity in dynamic networks. Simulation studies are provided to assess the finite sample performance of the proposed method and a real data analysis is used to illustrate its practical utility.
Submitted 10 January, 2023;
originally announced January 2023.
-
When Federated Learning Meets Pre-trained Language Models' Parameter-Efficient Tuning Methods
Authors:
Zhuo Zhang,
Yuanhang Yang,
Yong Dai,
Lizhen Qu,
Zenglin Xu
Abstract:
With increasing privacy concerns on data, recent studies have made significant progress using federated learning (FL) on privacy-sensitive natural language processing (NLP) tasks. Much literature suggests fully fine-tuning pre-trained language models (PLMs) in the FL paradigm can mitigate the data heterogeneity problem and close the performance gap with centralized training. However, large PLMs bring the curse of prohibitive communication overhead and local model adaptation costs for the FL system. To this end, we introduce various parameter-efficient tuning (PETuning) methods into federated learning. Specifically, we provide a holistic empirical study of representative PLMs tuning methods in FL. The experimental results cover the analysis of data heterogeneity levels, data scales, and different FL scenarios. Overall communication overhead can be significantly reduced by locally tuning and globally aggregating lightweight model parameters while maintaining acceptable performance in various FL settings. To facilitate the research of PETuning in FL, we also develop a federated tuning framework FedPETuning, which allows practitioners to exploit different PETuning methods under the FL training paradigm conveniently. The source code is available at \url{https://github.com/iezhuozhuo/FedETuning/tree/deltaTuning}.
Submitted 2 June, 2023; v1 submitted 20 December, 2022;
originally announced December 2022.
-
Let's Negotiate! A Survey of Negotiation Dialogue Systems
Authors:
Haolan Zhan,
Yufei Wang,
Tao Feng,
Yuncheng Hua,
Suraj Sharma,
Zhuang Li,
Lizhen Qu,
Gholamreza Haffari
Abstract:
Negotiation is one of the crucial abilities in human communication, and there has been a resurgent research interest in negotiation dialogue systems recently, whose goal is to empower intelligent agents with this ability so that they can efficiently help humans resolve conflicts or reach beneficial agreements. Although there have been many explorations in negotiation dialogue systems, a systematic review of this task has to date remained notably absent. To this end, we aim to fill this gap by reviewing contemporary studies in the emerging field of negotiation dialogue systems, covering benchmarks, evaluations, and methodologies. Furthermore, we also discuss potential future directions, including multi-modal, multi-party, and cross-cultural negotiation scenarios. Our goal is to provide the community with a systematic overview of negotiation dialogue systems and to inspire future research.
Submitted 18 December, 2022;
originally announced December 2022.
-
Disentangling Prosody Representations with Unsupervised Speech Reconstruction
Authors:
Leyuan Qu,
Taihao Li,
Cornelius Weber,
Theresa Pekarek-Rosin,
Fuji Ren,
Stefan Wermter
Abstract:
Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in Automatic Speech Recognition (ASR) and speaker verification tasks respectively. However, it is still an open challenging research question to extract prosodic information because of the intrinsic association of different attributes, such as timbre and rhythm, and because of the need for supervised training schemes to achieve robust large-scale and speaker-independent ASR. The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction. Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units for semantic content, (2) a pretrained speaker verification model to generate speaker identity embeddings, and (3) a trainable prosody encoder to learn prosody representations. We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks. Both objective (weighted and unweighted accuracies) and subjective (mean opinion score) evaluations on the EVC task suggest that Prosody2Vec effectively captures general prosodic features that can be smoothly transferred to other emotional speech. In addition, our SER experiments on the IEMOCAP dataset reveal that the prosody features learned by Prosody2Vec are complementary and beneficial for the performance of widely used speech pretraining models and surpass the state-of-the-art methods when combining Prosody2Vec with HuBERT representations.
Submitted 25 September, 2023; v1 submitted 13 December, 2022;
originally announced December 2022.
-
Trace the Accretion Geometry of H 1743--322 with Type C Quasi-periodic Oscillations in Multiple Outbursts
Authors:
Qing-Cang Shui,
Shu Zhang,
Yu-Peng P. Chen,
Shuang-Nan Zhang,
Ling-Da Kong,
Peng-Ju Wang,
Long Ji,
Hong-Xing Yin,
J. L. Qu,
L. Tao,
M. Y. Ge,
**g-Qiang Peng,
Zhi Chang,
Jian Li,
Peng Zhang
Abstract:
We present a systematic analysis of type C quasi-periodic oscillation (QPO) observations of H 1743--322 throughout the Rossi X-ray Timing Explorer (RXTE) era. We find that, while different outbursts have significant flux differences, they show consistent positive correlations between the QPO fractional root-mean-square (rms) amplitude and non-thermal fraction of the emission, which indicate an independence of the intrinsic QPO rms on individual outburst brightness in H 1743--322. However, the dependence of the QPO rms on frequency is different between the outburst rise and decay phases, where QPO fractional rms of the decay phase is significantly lower than that of the rise phase at low frequencies. The spectral analysis also reveals different ranges of coronal temperature between the two outburst stages. A semi-quantitative analysis shows that the Lense-Thirring precession model could be responsible for the QPO rms differences, requiring a variable coronal geometric shape. However, the variable-Comptonization model could also account for the findings. The fact that the rms differences and the hysteresis traces in the hardness-intensity diagram (HID) accompany each other indicates a connection between the two phenomena. By correlating the findings with QPO phase lags and the quasi-simultaneous radio flux previously published, we propose there could be corona-jet transitions in H 1743--322 similar to those that have been recently reported in GRS 1915+105.
Submitted 29 November, 2022;
originally announced November 2022.
-
Reducing Domain Gap in Frequency and Spatial domain for Cross-modality Domain Adaptation on Medical Image Segmentation
Authors:
Shaolei Liu,
Siqi Yin,
Linhao Qu,
Manning Wang
Abstract:
Unsupervised domain adaptation (UDA) aims to learn a model that is trained on a source domain and performs well on an unlabeled target domain. In the medical image segmentation field, most existing UDA methods depend on adversarial learning to address the domain gap between different image modalities, which is ineffective due to its complicated training process. In this paper, we propose a simple yet effective UDA method based on frequency and spatial domain transfer under a multi-teacher distillation framework. In the frequency domain, we first introduce non-subsampled contourlet transform for identifying domain-invariant and domain-variant frequency components (DIFs and DVFs), and then keep the DIFs unchanged while replacing the DVFs of the source domain images with those of the target domain images to narrow the domain gap. In the spatial domain, we propose a batch momentum update-based histogram matching strategy to reduce the domain-variant image style bias. Experiments on two cross-modality medical image segmentation datasets (cardiac, abdominal) show that our proposed method achieves superior performance compared to state-of-the-art methods.
Submitted 28 November, 2022;
originally announced November 2022.
-
Learning Object-Language Alignments for Open-Vocabulary Object Detection
Authors:
Chuang Lin,
Peize Sun,
Yi Jiang,
** Luo,
Lizhen Qu,
Gholamreza Haffari,
Zehuan Yuan,
Jianfei Cai
Abstract:
Existing object detection methods are bounded in a fixed-set vocabulary by costly labeled data. When dealing with novel categories, the model has to be retrained with more bounding box annotations. Natural language supervision is an attractive alternative for its annotation-free attributes and broader object concepts. However, learning open-vocabulary object detection from language is challenging since image-text pairs do not contain fine-grained object-language alignments. Previous solutions rely on either expensive grounding annotations or distilling classification-oriented vision models. In this paper, we propose a novel open-vocabulary object detection framework directly learning from image-text pair data. We formulate object-language alignment as a set matching problem between a set of image region features and a set of word embeddings. It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way. Extensive experiments on two benchmark datasets, COCO and LVIS, demonstrate our superior performance over the competing approaches on novel categories, e.g. achieving 32.0% mAP on COCO and 21.7% mask mAP on LVIS. Code is available at: https://github.com/clin1223/VLDet.
Submitted 27 November, 2022;
originally announced November 2022.
-
Modeling Multivariate Biosignals With Graph Neural Networks and Structured State Space Models
Authors:
Siyi Tang,
Jared A. Dunnmon,
Liangqiong Qu,
Khaled K. Saab,
Tina Baykaner,
Christopher Lee-Messer,
Daniel L. Rubin
Abstract:
Multivariate biosignals are prevalent in many medical domains, such as electroencephalography, polysomnography, and electrocardiography. Modeling spatiotemporal dependencies in multivariate biosignals is challenging due to (1) long-range temporal dependencies and (2) complex spatial correlations between the electrodes. To address these challenges, we propose representing multivariate biosignals as time-dependent graphs and introduce GraphS4mer, a general graph neural network (GNN) architecture that improves performance on biosignal classification tasks by modeling spatiotemporal dependencies in biosignals. Specifically, (1) we leverage the Structured State Space architecture, a state-of-the-art deep sequence model, to capture long-range temporal dependencies in biosignals and (2) we propose a graph structure learning layer in GraphS4mer to learn dynamically evolving graph structures in the data. We evaluate our proposed model on three distinct biosignal classification tasks and show that GraphS4mer consistently improves over existing models: (1) on seizure detection from electroencephalographic signals, it outperforms a previous GNN with self-supervised pre-training by 3.1 points in AUROC; (2) on sleep staging from polysomnographic signals, it improves macro-F1 score by 4.1 points over existing sleep staging models; and (3) on 12-lead electrocardiogram classification, it outperforms previous state-of-the-art models by 2.7 points in macro-F1 score.
Submitted 29 April, 2023; v1 submitted 20 November, 2022;
originally announced November 2022.
-
Improving Speech Emotion Recognition with Unsupervised Speaking Style Transfer
Authors:
Leyuan Qu,
Wei Wang,
Cornelius Weber,
Pengcheng Yue,
Taihao Li,
Stefan Wermter
Abstract:
Humans can effortlessly modify various prosodic attributes, such as the placement of stress and the intensity of sentiment, to convey a specific emotion while maintaining consistent linguistic content. Motivated by this capability, we propose EmoAug, a novel style transfer model designed to enhance emotional expression and tackle the data scarcity issue in speech emotion recognition (SER) tasks. EmoAug consists of a semantic encoder and a paralinguistic encoder that represent verbal and non-verbal information, respectively. Additionally, a decoder reconstructs speech signals by conditioning on these two information flows in an unsupervised fashion. Once training is completed, EmoAug enriches expressions of emotional speech with different prosodic attributes, such as stress, rhythm and intensity, by feeding different styles into the paralinguistic encoder. EmoAug also enables us to generate similar numbers of samples for each class to tackle the data imbalance issue. Experimental results on the IEMOCAP dataset demonstrate that EmoAug can successfully transfer different speaking styles while retaining the speaker identity and semantic content. Furthermore, we train an SER model with data augmented by EmoAug and show that the augmented model not only surpasses the state-of-the-art supervised and self-supervised methods but also overcomes overfitting problems caused by data imbalance. Some audio samples can be found on our demo website.
Submitted 28 December, 2023; v1 submitted 16 November, 2022;
originally announced November 2022.