Search | arXiv e-print repository

MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents

Authors: Luyuan Wang, Yongyu Deng, Yiwei Zha, Guodong Mao, Qinmin Wang, Tianchen Min, Wei Chen, Shoufa Chen

Abstract: Large language model (LLM)-based mobile agents are increasingly popular due to their capability to interact directly with mobile phone Graphic User Interfaces (GUIs) and their potential to autonomously manage daily tasks. Despite their promising prospects in both academic and industrial sectors, little research has focused on benchmarking the performance of existing mobile agents, due to the inexh… ▽ More Large language model (LLM)-based mobile agents are increasingly popular due to their capability to interact directly with mobile phone Graphic User Interfaces (GUIs) and their potential to autonomously manage daily tasks. Despite their promising prospects in both academic and industrial sectors, little research has focused on benchmarking the performance of existing mobile agents, due to the inexhaustible states of apps and the vague definition of feasible action sequences. To address this challenge, we propose an efficient and user-friendly benchmark, MobileAgentBench, designed to alleviate the burden of extensive manual testing. We initially define 100 tasks across 10 open-source apps, categorized by multiple levels of difficulty. Subsequently, we evaluate several existing mobile agents, including AppAgent and MobileAgent, to thoroughly and systematically compare their performance. All materials are accessible on our project webpage: https://MobileAgentBench.github.io, contributing to the advancement of both academic and industrial fields. △ Less

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.07385 [pdf, other]

Disrupting Bipartite Trading Networks: Matching for Revenue Maximization

Authors: Luca D'Amico-Wong, Yannai A. Gonczarowski, Gary Qiurui Ma, David C. Parkes

Abstract: We model the role of an online platform disrupting a market with unit-demand buyers and unit-supply sellers. Each seller can transact with a subset of the buyers whom she already knows, as well as with any additional buyers to whom she is introduced by the platform. Given these constraints on trade, prices and transactions are induced by a competitive equilibrium. The platform's revenue is proport… ▽ More We model the role of an online platform disrupting a market with unit-demand buyers and unit-supply sellers. Each seller can transact with a subset of the buyers whom she already knows, as well as with any additional buyers to whom she is introduced by the platform. Given these constraints on trade, prices and transactions are induced by a competitive equilibrium. The platform's revenue is proportional to the total price of all trades between platform-introduced buyers and sellers. In general, we show that the platform's revenue-maximization problem is computationally intractable. We provide structural results for revenue-optimal matchings and isolate special cases in which the platform can efficiently compute them. Furthermore, in a market where the maximum increase in social welfare that the platform can create is $ΔW$, we prove that the platform can attain revenue $Ω(ΔW/\log(\min\{n,m\}))$, where $n$ and $m$ are the numbers of buyers and sellers, respectively. When $ΔW$ is large compared to welfare without the platform, this gives a polynomial-time algorithm that guarantees a logarithmic approximation of the optimal welfare as revenue. We also show that even when the platform optimizes for revenue, the social welfare is at least an $O(\log(\min\{n,m\}))$-approximation to the optimal welfare. Finally, we prove significantly stronger bounds for revenue and social welfare in homogeneous-goods markets. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: Accepted at the Twenty-Fifth ACM Conference on Economics and Computation (EC'24), 2024

arXiv:2406.05426 [pdf, other]

Baking Symmetry into GFlowNets

Authors: George Ma, Emmanuel Bengio, Yoshua Bengio, Dinghuai Zhang

Abstract: GFlowNets have exhibited promising performance in generating diverse candidates with high rewards. These networks generate objects incrementally and aim to learn a policy that assigns probability of sampling objects in proportion to rewards. However, the current training pipelines of GFlowNets do not consider the presence of isomorphic actions, which are actions resulting in symmetric or isomorphi… ▽ More GFlowNets have exhibited promising performance in generating diverse candidates with high rewards. These networks generate objects incrementally and aim to learn a policy that assigns probability of sampling objects in proportion to rewards. However, the current training pipelines of GFlowNets do not consider the presence of isomorphic actions, which are actions resulting in symmetric or isomorphic states. This lack of symmetry increases the amount of samples required for training GFlowNets and can result in inefficient and potentially incorrect flow functions. As a consequence, the reward and diversity of the generated objects decrease. In this study, our objective is to integrate symmetries into GFlowNets by identifying equivalent actions during the generation process. Experimental results using synthetic data demonstrate the promising performance of our proposed approaches. △ Less

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2406.02224 [pdf, other]

FedMKT: Federated Mutual Knowledge Transfer for Large and Small Language Models

Authors: Tao Fan, Guoqiang Ma, Yan Kang, Hanlin Gu, Yuanfeng Song, Lixin Fan, Kai Chen, Qiang Yang

Abstract: Recent research in federated large language models (LLMs) has primarily focused on enabling clients to fine-tune their locally deployed homogeneous LLMs collaboratively or on transferring knowledge from server-based LLMs to small language models (SLMs) at downstream clients. However, a significant gap remains in the simultaneous mutual enhancement of both the server's LLM and clients' SLMs. To bri… ▽ More Recent research in federated large language models (LLMs) has primarily focused on enabling clients to fine-tune their locally deployed homogeneous LLMs collaboratively or on transferring knowledge from server-based LLMs to small language models (SLMs) at downstream clients. However, a significant gap remains in the simultaneous mutual enhancement of both the server's LLM and clients' SLMs. To bridge this gap, we propose FedMKT, a parameter-efficient federated mutual knowledge transfer framework for large and small language models. This framework is designed to adaptively transfer knowledge from the server's LLM to clients' SLMs while concurrently enriching the LLM with clients' unique domain insights. We facilitate token alignment using minimum edit distance (MinED) and then selective mutual knowledge transfer between client-side SLMs and a server-side LLM, aiming to collectively enhance their performance. Through extensive experiments across three distinct scenarios, we evaluate the effectiveness of FedMKT using various public LLMs and SLMs on a range of NLP text generation tasks. Empirical results demonstrate that FedMKT simultaneously boosts the performance of both LLMs and SLMs. △ Less

Submitted 18 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

arXiv:2405.18378 [pdf, other]

A Canonization Perspective on Invariant and Equivariant Learning

Authors: George Ma, Yifei Wang, Derek Lim, Stefanie Jegelka, Yisen Wang

Abstract: In many applications, we desire neural networks to exhibit invariance or equivariance to certain groups due to symmetries inherent in the data. Recently, frame-averaging methods emerged to be a unified framework for attaining symmetries efficiently by averaging over input-dependent subsets of the group, i.e., frames. What we currently lack is a principled understanding of the design of frames. In… ▽ More In many applications, we desire neural networks to exhibit invariance or equivariance to certain groups due to symmetries inherent in the data. Recently, frame-averaging methods emerged to be a unified framework for attaining symmetries efficiently by averaging over input-dependent subsets of the group, i.e., frames. What we currently lack is a principled understanding of the design of frames. In this work, we introduce a canonization perspective that provides an essential and complete view of the design of frames. Canonization is a classic approach for attaining invariance by map** inputs to their canonical forms. We show that there exists an inherent connection between frames and canonical forms. Leveraging this connection, we can efficiently compare the complexity of frames as well as determine the optimality of certain frames. Guided by this principle, we design novel frames for eigenvectors that are strictly superior to existing methods -- some are even optimal -- both theoretically and empirically. The reduction to the canonization perspective further uncovers equivalences between previous methods. These observations suggest that canonization provides a fundamental understanding of existing frame-averaging methods and unifies existing equivariant and invariant learning methods. △ Less

Submitted 29 May, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.14185 [pdf, other]

A structure-aware framework for learning device placements on computation graphs

Authors: Shukai Duan, Heng **, Nikos Kanakaris, Xiongye Xiao, Peiyu Zhang, Panagiotis Kyriakis, Nesreen K. Ahmed, Guixiang Ma, Mihai Capota, Shahin Nazarian, Theodore L. Willke, Paul Bogdan

Abstract: Existing approaches for device placement ignore the topological features of computation graphs and rely mostly on heuristic methods for graph partitioning. At the same time, they either follow a grouper-placer or an encoder-placer architecture, which requires understanding the interaction structure between code operations. To bridge the gap between encoder-placer and grouper-placer techniques, we… ▽ More Existing approaches for device placement ignore the topological features of computation graphs and rely mostly on heuristic methods for graph partitioning. At the same time, they either follow a grouper-placer or an encoder-placer architecture, which requires understanding the interaction structure between code operations. To bridge the gap between encoder-placer and grouper-placer techniques, we propose a novel framework for the task of device placement, relying on smaller computation graphs extracted from the OpenVINO toolkit using reinforcement learning. The framework consists of five steps, including graph coarsening, node representation learning and policy optimization. It facilitates end-to-end training and takes into consideration the directed and acyclic nature of the computation graphs. We also propose a model variant, inspired by graph parsing networks and complex network analysis, enabling graph representation learning and personalized graph partitioning jointly, using an unspecified number of groups. To train the entire framework, we utilize reinforcement learning techniques by employing the execution time of the suggested device placements to formulate the reward. We demonstrate the flexibility and effectiveness of our approach through multiple experiments with three benchmark models, namely Inception-V3, ResNet, and BERT. The robustness of the proposed framework is also highlighted through an ablation study. The suggested placements improve the inference speed for the benchmark models by up to $58.2\%$ over CPU execution and by up to $60.24\%$ compared to other commonly used baselines. △ Less

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.05672 [pdf, other]

Multi-Stream Keypoint Attention Network for Sign Language Recognition and Translation

Authors: Mo Guan, Yan Wang, Guangkun Ma, Jiarui Liu, Mingzu Sun

Abstract: Sign language serves as a non-vocal means of communication, transmitting information and significance through gestures, facial expressions, and bodily movements. The majority of current approaches for sign language recognition (SLR) and translation rely on RGB video inputs, which are vulnerable to fluctuations in the background. Employing a keypoint-based strategy not only mitigates the effects of… ▽ More Sign language serves as a non-vocal means of communication, transmitting information and significance through gestures, facial expressions, and bodily movements. The majority of current approaches for sign language recognition (SLR) and translation rely on RGB video inputs, which are vulnerable to fluctuations in the background. Employing a keypoint-based strategy not only mitigates the effects of background alterations but also substantially diminishes the computational demands of the model. Nevertheless, contemporary keypoint-based methodologies fail to fully harness the implicit knowledge embedded in keypoint sequences. To tackle this challenge, our inspiration is derived from the human cognition mechanism, which discerns sign language by analyzing the interplay between gesture configurations and supplementary elements. We propose a multi-stream keypoint attention network to depict a sequence of keypoints produced by a readily available keypoint estimator. In order to facilitate interaction across multiple streams, we investigate diverse methodologies such as keypoint fusion strategies, head fusion, and self-distillation. The resulting framework is denoted as MSKA-SLR, which is expanded into a sign language translation (SLT) model through the straightforward addition of an extra translation network. We carry out comprehensive experiments on well-known benchmarks like Phoenix-2014, Phoenix-2014T, and CSL-Daily to showcase the efficacy of our methodology. Notably, we have attained a novel state-of-the-art performance in the sign language translation task of Phoenix-2014T. The code and models can be accessed at: https://github.com/sutwangyan/MSKA. △ Less

Submitted 9 May, 2024; originally announced May 2024.

Comments: 15 pages

arXiv:2405.01918 [pdf, other]

An Onboard Framework for Staircases Modeling Based on Point Clouds

Authors: Chun Qing, Rongxiang Zeng, Xuan Wu, Yongliang Shi, Gan Ma

Abstract: The detection of traversable regions on staircases and the physical modeling constitutes pivotal aspects of the mobility of legged robots. This paper presents an onboard framework tailored to the detection of traversable regions and the modeling of physical attributes of staircases by point cloud data. To mitigate the influence of illumination variations and the overfitting due to the dataset dive… ▽ More The detection of traversable regions on staircases and the physical modeling constitutes pivotal aspects of the mobility of legged robots. This paper presents an onboard framework tailored to the detection of traversable regions and the modeling of physical attributes of staircases by point cloud data. To mitigate the influence of illumination variations and the overfitting due to the dataset diversity, a series of data augmentations are introduced to enhance the training of the fundamental network. A curvature suppression cross-entropy(CSCE) loss is proposed to reduce the ambiguity of prediction on the boundary between traversable and non-traversable regions. Moreover, a measurement correction based on the pose estimation of stairs is introduced to calibrate the output of raw modeling that is influenced by tilted perspectives. Lastly, we collect a dataset pertaining to staircases and introduce new evaluation criteria. Through a series of rigorous experiments conducted on this dataset, we substantiate the superior accuracy and generalization capabilities of our proposed method. Codes, models, and datasets will be available at https://github.com/szturobotics/Stair-detection-and-modeling-project. △ Less

Submitted 3 May, 2024; originally announced May 2024.

arXiv:2404.13842 [pdf, other]

On Support Relations Inference and Scene Hierarchy Graph Construction from Point Cloud in Clustered Environments

Authors: Gang Ma, Hui Wei

Abstract: Over the years, scene understanding has attracted a growing interest in computer vision, providing the semantic and physical scene information necessary for robots to complete some particular tasks autonomously. In 3D scenes, rich spatial geometric and topological information are often ignored by RGB-based approaches for scene understanding. In this study, we develop a bottom-up approach for scene… ▽ More Over the years, scene understanding has attracted a growing interest in computer vision, providing the semantic and physical scene information necessary for robots to complete some particular tasks autonomously. In 3D scenes, rich spatial geometric and topological information are often ignored by RGB-based approaches for scene understanding. In this study, we develop a bottom-up approach for scene understanding that infers support relations between objects from a point cloud. Our approach utilizes the spatial topology information of the plane pairs in the scene, consisting of three major steps. 1) Detection of pairwise spatial configuration: dividing primitive pairs into local support connection and local inner connection; 2) primitive classification: a combinatorial optimization method applied to classify primitives; and 3) support relations inference and hierarchy graph construction: bottom-up support relations inference and scene hierarchy graph construction containing primitive level and object level. Through experiments, we demonstrate that the algorithm achieves excellent performance in primitive classification and support relations inference. Additionally, we show that the scene hierarchy graph contains rich geometric and topological information of objects, and it possesses great scalability for scene understanding. △ Less

Submitted 21 April, 2024; originally announced April 2024.

arXiv:2404.07671 [pdf]

Deep learning-driven pulmonary arteries and veins segmentation reveals demography-associated pulmonary vasculature anatomy

Authors: Yuetan Chu, Gongning Luo, Longxi Zhou, Shaodong Cao, Guolin Ma, Xianglin Meng, Juexiao Zhou, Changchun Yang, Dexuan Xie, Ricardo Henao, Xigang Xiao, Lianming Wu, Zhaowen Qiu, Xin Gao

Abstract: Pulmonary artery-vein segmentation is crucial for diagnosing pulmonary diseases and surgical planning, and is traditionally achieved by Computed Tomography Pulmonary Angiography (CTPA). However, concerns regarding adverse health effects from contrast agents used in CTPA have constrained its clinical utility. In contrast, identifying arteries and veins using non-contrast CT, a conventional and low-… ▽ More Pulmonary artery-vein segmentation is crucial for diagnosing pulmonary diseases and surgical planning, and is traditionally achieved by Computed Tomography Pulmonary Angiography (CTPA). However, concerns regarding adverse health effects from contrast agents used in CTPA have constrained its clinical utility. In contrast, identifying arteries and veins using non-contrast CT, a conventional and low-cost clinical examination routine, has long been considered impossible. Here we propose a High-abundant Pulmonary Artery-vein Segmentation (HiPaS) framework achieving accurate artery-vein segmentation on both non-contrast CT and CTPA across various spatial resolutions. HiPaS first performs spatial normalization on raw CT scans via a super-resolution module, and then iteratively achieves segmentation results at different branch levels by utilizing the low-level vessel segmentation as a prior for high-level vessel segmentation. We trained and validated HiPaS on our established multi-centric dataset comprising 1,073 CT volumes with meticulous manual annotation. Both quantitative experiments and clinical evaluation demonstrated the superior performance of HiPaS, achieving a dice score of 91.8% and a sensitivity of 98.0%. Further experiments demonstrated the non-inferiority of HiPaS segmentation on non-contrast CT compared to segmentation on CTPA. Employing HiPaS, we have conducted an anatomical study of pulmonary vasculature on 10,613 participants in China (five sites), discovering a new association between pulmonary vessel abundance and sex and age: vessel abundance is significantly higher in females than in males, and slightly decreases with age, under the controlling of lung volumes (p < 0.0001). HiPaS realizing accurate artery-vein segmentation delineates a promising avenue for clinical diagnosis and understanding pulmonary physiology in a non-invasive manner. △ Less

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2403.10538 [pdf, other]

MATADOR: Automated System-on-Chip Tsetlin Machine Design Generation for Edge Applications

Authors: Tousif Rahman, Gang Mao, Sidharth Maheshwari, Rishad Shafik, Alex Yakovlev

Abstract: System-on-Chip Field-Programmable Gate Arrays (SoC-FPGAs) offer significant throughput gains for machine learning (ML) edge inference applications via the design of co-processor accelerator systems. However, the design effort for training and translating ML models into SoC-FPGA solutions can be substantial and requires specialist knowledge aware trade-offs between model performance, power consumpt… ▽ More System-on-Chip Field-Programmable Gate Arrays (SoC-FPGAs) offer significant throughput gains for machine learning (ML) edge inference applications via the design of co-processor accelerator systems. However, the design effort for training and translating ML models into SoC-FPGA solutions can be substantial and requires specialist knowledge aware trade-offs between model performance, power consumption, latency and resource utilization. Contrary to other ML algorithms, Tsetlin Machine (TM) performs classification by forming logic proposition between boolean actions from the Tsetlin Automata (the learning elements) and boolean input features. A trained TM model, usually, exhibits high sparsity and considerable overlap** of these logic propositions both within and among the classes. The model, thus, can be translated to RTL-level design using a miniscule number of AND and NOT gates. This paper presents MATADOR, an automated boolean-to-silicon tool with GUI interface capable of implementing optimized accelerator design of the TM model onto SoC-FPGA for inference at the edge. It offers automation of the full development pipeline: model training, system level design generation, design verification and deployment. It makes use of the logic sharing that ensues from propositional overlap and creates a compact design by effectively utilizing the TM model's sparsity. MATADOR accelerator designs are shown to be up to 13.4x faster, up to 7x more resource frugal and up to 2x more power efficient when compared to the state-of-the-art Quantized and Binary Deep Neural Network implementations. △ Less

Submitted 3 March, 2024; originally announced March 2024.

arXiv:2403.04158 [pdf, other]

DA-Net: A Disentangled and Adaptive Network for Multi-Source Cross-Lingual Transfer Learning

Authors: Ling Ge, Chunming Hu, Guanghui Ma, Jihong Liu, Hong Zhang

Abstract: Multi-Source cross-lingual transfer learning deals with the transfer of task knowledge from multiple labelled source languages to an unlabeled target language under the language shift. Existing methods typically focus on weighting the predictions produced by language-specific classifiers of different sources that follow a shared encoder. However, all source languages share the same encoder, which… ▽ More Multi-Source cross-lingual transfer learning deals with the transfer of task knowledge from multiple labelled source languages to an unlabeled target language under the language shift. Existing methods typically focus on weighting the predictions produced by language-specific classifiers of different sources that follow a shared encoder. However, all source languages share the same encoder, which is updated by all these languages. The extracted representations inevitably contain different source languages' information, which may disturb the learning of the language-specific classifiers. Additionally, due to the language gap, language-specific classifiers trained with source labels are unable to make accurate predictions for the target language. Both facts impair the model's performance. To address these challenges, we propose a Disentangled and Adaptive Network (DA-Net). Firstly, we devise a feedback-guided collaborative disentanglement method that seeks to purify input representations of classifiers, thereby mitigating mutual interference from multiple sources. Secondly, we propose a class-aware parallel adaptation method that aligns class-level distributions for each source-target language pair, thereby alleviating the language pairs' language gap. Experimental results on three different tasks involving 38 languages validate the effectiveness of our approach. △ Less

Submitted 6 March, 2024; originally announced March 2024.

Comments: AAAI 2024

arXiv:2403.02368 [pdf]

doi 10.1109/IECON51785.2023.10312491

A Novel Hybrid Feature Importance and Feature Interaction Detection Framework for Predictive Optimization in Industry 4.0 Applications

Authors: Zhipeng Ma, Bo Nørregaard Jørgensen, Zheng Grace Ma

Abstract: Advanced machine learning algorithms are increasingly utilized to provide data-based prediction and decision-making support in Industry 4.0. However, the prediction accuracy achieved by the existing models is insufficient to warrant practical implementation in real-world applications. This is because not all features present in real-world datasets possess a direct relevance to the predictive analy… ▽ More Advanced machine learning algorithms are increasingly utilized to provide data-based prediction and decision-making support in Industry 4.0. However, the prediction accuracy achieved by the existing models is insufficient to warrant practical implementation in real-world applications. This is because not all features present in real-world datasets possess a direct relevance to the predictive analysis being conducted. Consequently, the careful incorporation of select features has the potential to yield a substantial positive impact on the outcome. To address the research gap, this paper proposes a novel hybrid framework that combines the feature importance detector - local interpretable model-agnostic explanations (LIME) and the feature interaction detector - neural interaction detection (NID), to improve prediction accuracy. By applying the proposed framework, unnecessary features can be eliminated, and interactions are encoded to generate a more conducive dataset for predictive purposes. Subsequently, the proposed model is deployed to refine the prediction of electricity consumption in foundry processing. The experimental outcomes reveal an augmentation of up to 9.56% in the R2 score, and a diminution of up to 24.05% in the root mean square error. △ Less

Submitted 4 March, 2024; originally announced March 2024.

Journal ref: IECON 2023- 49th Annual Conference of the IEEE Industrial Electronics Society

arXiv:2402.07610 [pdf, other]

Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrap**

Authors: Haoyu Wang, Guozheng Ma, Ziqiao Meng, Zeyu Qin, Li Shen, Zhong Zhang, Bingzhe Wu, Liu Liu, Yatao Bian, Tingyang Xu, Xueqian Wang, Peilin Zhao

Abstract: Self-alignment is an effective way to reduce the cost of human annotation while ensuring promising model capability. However, most current methods complete the data collection and training steps in a single round, which may overlook the continuously improving ability of self-aligned models. This gives rise to a key query: What if we do multi-time bootstrap** self-alignment? Does this strategy en… ▽ More Self-alignment is an effective way to reduce the cost of human annotation while ensuring promising model capability. However, most current methods complete the data collection and training steps in a single round, which may overlook the continuously improving ability of self-aligned models. This gives rise to a key query: What if we do multi-time bootstrap** self-alignment? Does this strategy enhance model performance or lead to rapid degradation? In this paper, our pioneering exploration delves into the impact of bootstrap** self-alignment on large language models. Our findings reveal that bootstrap** self-alignment markedly surpasses the single-round approach, by guaranteeing data diversity from in-context learning. To further exploit the capabilities of bootstrap**, we investigate and adjust the training order of data, which yields improved performance of the model. Drawing on these findings, we propose Step-On-Feet Tuning (SOFT) which leverages model's continuously enhanced few-shot ability to boost zero or one-shot performance. Based on easy-to-hard training recipe, we propose SOFT+ which further boost self-alignment's performance. Our experiments demonstrate the efficiency of SOFT (SOFT+) across various classification and generation tasks, highlighting the potential of bootstrap** self-alignment on continually enhancing model alignment performance. △ Less

Submitted 27 June, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

arXiv:2402.02397 [pdf]

Multiplexed all-optical permutation operations using a reconfigurable diffractive optical network

Authors: Guangdong Ma, Xilin Yang, Bijie Bai, **gxi Li, Yuhang Li, Tianyi Gan, Che-Yung Shen, Yijie Zhang, Yuzhu Li, Mona Jarrahi, Aydogan Ozcan

Abstract: Large-scale and high-dimensional permutation operations are important for various applications in e.g., telecommunications and encryption. Here, we demonstrate the use of all-optical diffractive computing to execute a set of high-dimensional permutation operations between an input and output field-of-view through layer rotations in a diffractive optical network. In this reconfigurable multiplexed… ▽ More Large-scale and high-dimensional permutation operations are important for various applications in e.g., telecommunications and encryption. Here, we demonstrate the use of all-optical diffractive computing to execute a set of high-dimensional permutation operations between an input and output field-of-view through layer rotations in a diffractive optical network. In this reconfigurable multiplexed material designed by deep learning, every diffractive layer has four orientations: 0, 90, 180, and 270 degrees. Each unique combination of these rotatable layers represents a distinct rotation state of the diffractive design tailored for a specific permutation operation. Therefore, a K-layer rotatable diffractive material is capable of all-optically performing up to 4^K independent permutation operations. The original input information can be decrypted by applying the specific inverse permutation matrix to output patterns, while applying other inverse operations will lead to loss of information. We demonstrated the feasibility of this reconfigurable multiplexed diffractive design by approximating 256 randomly selected permutation matrices using K=4 rotatable diffractive layers. We also experimentally validated this reconfigurable diffractive network using terahertz radiation and 3D-printed diffractive layers, providing a decent match to our numerical results. The presented rotation-multiplexed diffractive processor design is particularly useful due to its mechanical reconfigurability, offering multifunctional representation through a single fabrication process. △ Less

Submitted 4 February, 2024; originally announced February 2024.

Comments: 37 Pages, 10 Figures

arXiv:2402.01718 [pdf]

doi 10.1007/978-3-031-48649-4_15

Business Models for Digitalization Enabled Energy Efficiency and Flexibility in Industry: A Survey with Nine Case Studies

Authors: Zhipeng Ma, Bo Nørregaard Jørgensen, Michelle Levesque, Mouloud Amazouz, Zheng Grace Ma

Abstract: Digitalization is challenging in heavy industrial sectors, and many pi-lot projects facing difficulties to be replicated and scaled. Case studies are strong pedagogical vehicles for learning and sharing experience & knowledge, but rarely available in the literature. Therefore, this paper conducts a survey to gather a diverse set of nine industry cases, which are subsequently subjected to analysis… ▽ More Digitalization is challenging in heavy industrial sectors, and many pi-lot projects facing difficulties to be replicated and scaled. Case studies are strong pedagogical vehicles for learning and sharing experience & knowledge, but rarely available in the literature. Therefore, this paper conducts a survey to gather a diverse set of nine industry cases, which are subsequently subjected to analysis using the business model canvas (BMC). The cases are summarized and compared based on nine BMC components, and a Value of Business Model (VBM) evaluation index is proposed to assess the business potential of industrial digital solutions. The results show that the main partners are industry stakeholders, IT companies and academic institutes. Their key activities for digital solutions include big-data analysis, machine learning algorithms, digital twins, and internet of things developments. The value propositions of most cases are improving energy efficiency and enabling energy flexibility. Moreover, the technology readiness levels of six industrial digital solutions are under level 7, indicating that they need further validation in real-world environments. Building upon these insights, this paper proposes six recommendations for future industrial digital solution development: fostering cross-sector collaboration, prioritizing comprehensive testing and validation, extending value propositions, enhancing product adaptability, providing user-friendly platforms, and adopting transparent recommendations. △ Less

Submitted 26 January, 2024; originally announced February 2024.

Journal ref: Energy Informatics. EI.A 2023. Lecture Notes in Computer Science, vol 14467

arXiv:2401.17268 [pdf, other]

Weaver: Foundation Models for Creative Writing

Authors: Tiannan Wang, Jiamin Chen, Qingrui Jia, Shuai Wang, Ruoyu Fang, Huilin Wang, Zhaowei Gao, Chunzhao Xie, Chuou Xu, Jihong Dai, Yibin Liu, Jialong Wu, Shengwei Ding, Long Li, Zhiwei Huang, Xinle Deng, Teng Yu, Gangan Ma, Han Xiao, Zixin Chen, Danjun Xiang, Yunxia Wang, Yuanyuan Zhu, Yi Xiao, **g Wang , et al. (21 additional authors not shown)

Abstract: This work introduces Weaver, our first family of large language models (LLMs) dedicated to content creation. Weaver is pre-trained on a carefully selected corpus that focuses on improving the writing capabilities of large language models. We then fine-tune Weaver for creative and professional writing purposes and align it to the preference of professional writers using a suit of novel methods for… ▽ More This work introduces Weaver, our first family of large language models (LLMs) dedicated to content creation. Weaver is pre-trained on a carefully selected corpus that focuses on improving the writing capabilities of large language models. We then fine-tune Weaver for creative and professional writing purposes and align it to the preference of professional writers using a suit of novel methods for instruction data synthesis and LLM alignment, making it able to produce more human-like texts and follow more diverse instructions for content creation. The Weaver family consists of models of Weaver Mini (1.8B), Weaver Base (6B), Weaver Pro (14B), and Weaver Ultra (34B) sizes, suitable for different applications and can be dynamically dispatched by a routing agent according to query complexity to balance response quality and computation cost. Evaluation on a carefully curated benchmark for assessing the writing capabilities of LLMs shows Weaver models of all sizes outperform generalist LLMs several times larger than them. Notably, our most-capable Weaver Ultra model surpasses GPT-4, a state-of-the-art generalist LLM, on various writing scenarios, demonstrating the advantage of training specialized LLMs for writing purposes. Moreover, Weaver natively supports retrieval-augmented generation (RAG) and function calling (tool usage). We present various use cases of these abilities for improving AI-assisted writing systems, including integration of external knowledge bases, tools, or APIs, and providing personalized writing assistance. Furthermore, we discuss and summarize a guideline and best practices for pre-training and fine-tuning domain-specific LLMs. △ Less

Submitted 30 January, 2024; originally announced January 2024.

arXiv:2401.14903 [pdf]

doi 10.1109/APPEEC53445.2022.10072200

Energy Flexibility Potential in the Brewery Sector: A Multi-agent Based Simulation of 239 Danish Breweries

Authors: Daniel Anthony Howard, Zheng Grace Ma, Jacob Alstrup Engvang, Morten Hagenau, Kathrine Lau Jorgensen, Jonas Fausing Olesen, Bo Nørregaard Jørgensen

Abstract: The beverage industry is a typical food processing industry, accounts for significant energy consumption, and has flexible demands. However, the deployment of energy flexibility in the beverage industry is complex and challenging. Furthermore, activation of energy flexibility from the whole brewery industry is necessary to ensure grid stability. Therefore, this paper assesses the energy flexibilit… ▽ More The beverage industry is a typical food processing industry, accounts for significant energy consumption, and has flexible demands. However, the deployment of energy flexibility in the beverage industry is complex and challenging. Furthermore, activation of energy flexibility from the whole brewery industry is necessary to ensure grid stability. Therefore, this paper assesses the energy flexibility potential of Denmark's brewery sector based on a multi-agent-based simulation. 239 individual brewery facilities are simulated, and each facility, as an agent, can interact with the energy system market and make decisions based on its underlying parameters and operational restrictions. The results show that the Danish breweries could save 1.56 % of electricity costs annually while maintaining operational security and reducing approximately 1745 tonnes of CO2 emissions. Furthermore, medium-size breweries could obtain higher relative benefits by providing energy flexibility, especially those producing lager and ale. The result also shows that the breweries' relative saving potential is electricity market-dependent. △ Less

Submitted 26 January, 2024; originally announced January 2024.

arXiv:2401.11248 [pdf, other]

Drop your Decoder: Pre-training with Bag-of-Word Prediction for Dense Passage Retrieval

Authors: Guangyuan Ma, Xing Wu, Zijia Lin, Songlin Hu

Abstract: Masked auto-encoder pre-training has emerged as a prevalent technique for initializing and enhancing dense retrieval systems. It generally utilizes additional Transformer decoder blocks to provide sustainable supervision signals and compress contextual information into dense representations. However, the underlying reasons for the effectiveness of such a pre-training technique remain unclear. The… ▽ More Masked auto-encoder pre-training has emerged as a prevalent technique for initializing and enhancing dense retrieval systems. It generally utilizes additional Transformer decoder blocks to provide sustainable supervision signals and compress contextual information into dense representations. However, the underlying reasons for the effectiveness of such a pre-training technique remain unclear. The usage of additional Transformer-based decoders also incurs significant computational costs. In this study, we aim to shed light on this issue by revealing that masked auto-encoder (MAE) pre-training with enhanced decoding significantly improves the term coverage of input tokens in dense representations, compared to vanilla BERT checkpoints. Building upon this observation, we propose a modification to the traditional MAE by replacing the decoder of a masked auto-encoder with a completely simplified Bag-of-Word prediction task. This modification enables the efficient compression of lexical signals into dense representations through unsupervised pre-training. Remarkably, our proposed method achieves state-of-the-art retrieval performance on several large-scale retrieval benchmarks without requiring any additional parameters, which provides a 67% training speed-up compared to standard masked auto-encoder pre-training with enhanced decoding. △ Less

Submitted 22 April, 2024; v1 submitted 20 January, 2024; originally announced January 2024.

Comments: Accepted by SIGIR24. Our code is available at https://github.com/ma787639046/bowdpr

arXiv:2401.07329 [pdf, other]

Attention-based UNet enabled Lightweight Image Semantic Communication System over Internet of Things

Authors: Guoxin Ma, Haonan Tong, Nuocheng Yang, Changchuan Yin

Abstract: This paper studies the problem of the lightweight image semantic communication system that is deployed on Internet of Things (IoT) devices. In the considered system model, devices must use semantic communication techniques to support user behavior recognition in ultimate video service with high data transmission efficiency. However, it is computationally expensive for IoT devices to deploy semanti… ▽ More This paper studies the problem of the lightweight image semantic communication system that is deployed on Internet of Things (IoT) devices. In the considered system model, devices must use semantic communication techniques to support user behavior recognition in ultimate video service with high data transmission efficiency. However, it is computationally expensive for IoT devices to deploy semantic codecs due to the complex calculation processes of deep learning (DL) based codec training and inference. To make it affordable for IoT devices to deploy semantic communication systems, we propose an attention-based UNet enabled lightweight image semantic communication (LSSC) system, which achieves low computational complexity and small model size. In particular, we first let the LSSC system train the codec at the edge server to reduce the training computation load on IoT devices. Then, we introduce the convolutional block attention module (CBAM) to extract the image semantic features and decrease the number of downsampling layers thus reducing the floating-point operations (FLOPs). Finally, we experimentally adjust the structure of the codec and find out the optimal number of downsampling layers. Simulation results show that the proposed LSSC system can reduce the semantic codec FLOPs by 14%, and reduce the model size by 55%, with a sacrifice of 3% accuracy, compared to the baseline. Moreover, the proposed scheme can achieve a higher transmission accuracy than the traditional communication scheme in the low channel signal-to-noise (SNR) region. △ Less

Submitted 14 January, 2024; originally announced January 2024.

Comments: 6 pages, 6 figures, accepted by IEEE WCNC 2024

arXiv:2401.06748 [pdf, other]

Measure Theoretic Reeb Graphs and Reeb Spaces

Authors: Qingsong Wang, Guanquan Ma, Raghavendra Sridharamurthy, Bei Wang

Abstract: A Reeb graph is a graphical representation of a scalar function on a topological space that encodes the topology of the level sets. A Reeb space is a generalization of the Reeb graph to a multiparameter function. In this paper, we propose novel constructions of Reeb graphs and Reeb spaces that incorporate the use of a measure. Specifically, we introduce measure-theoretic Reeb graphs and Reeb space… ▽ More A Reeb graph is a graphical representation of a scalar function on a topological space that encodes the topology of the level sets. A Reeb space is a generalization of the Reeb graph to a multiparameter function. In this paper, we propose novel constructions of Reeb graphs and Reeb spaces that incorporate the use of a measure. Specifically, we introduce measure-theoretic Reeb graphs and Reeb spaces when the domain or the range is modeled as a metric measure space (i.e.,~a metric space equipped with a measure). Our main goal is to enhance the robustness of the Reeb graph and Reeb space in representing the topological features of a scalar field while accounting for the distribution of the measure. We first introduce a Reeb graph with local smoothing and prove its stability with respect to the interleaving distance. We then prove the stability of a Reeb graph of a metric measure space with respect to the measure, defined using the distance to a measure or the kernel distance to a measure, respectively. △ Less

Submitted 22 March, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

arXiv:2401.06192 [pdf]

doi 10.1007/978-3-031-48652-4_1

Multi-Agent Based Simulation for Investigating Electric Vehicle Adoption and Its Impacts on Electricity Distribution Grids and CO2 Emissions

Authors: Kristoffer Christensen, Zheng Grace Ma, Bo Nørregaard Jørgensen

Abstract: Electric vehicles are expected to significantly contribute to CO2-eq. emissions reduction, but the increasing number of EVs also introduces chal-lenges to the energy system, and to what extent it contributes to achieving cli-mate goals remains unknown. Static modeling and assumption-based simula-tions have been used for such investigation, but they cannot capture the realistic ecosystem dynamics.… ▽ More Electric vehicles are expected to significantly contribute to CO2-eq. emissions reduction, but the increasing number of EVs also introduces chal-lenges to the energy system, and to what extent it contributes to achieving cli-mate goals remains unknown. Static modeling and assumption-based simula-tions have been used for such investigation, but they cannot capture the realistic ecosystem dynamics. To fill the gap, this paper investigates the impacts of two adoption curves of private EVs on the electricity distribution grids and national climate goals. This paper develops a multi-agent based simulation with two adoption curves, the Traditional EV charging strategy, various EV models, driv-ing patterns, and CO2-eq. emission data to capture the full ecosystem dynamics during a long-term period from 2020 to 2032. The Danish 2030 climate goal and a Danish distribution network with 126 residential consumers are chosen as the case study. The results show that both EV adoption curves of 1 million and 775k EVs by 2030 will not satisfy the Danish climate goal of reducing transport sector emissions by 30% by 2030. The results also show that the current resi-dential electricity distribution grids cannot handle the load from increasing EVs. The first grid overload will occur in 2031 (around 16 and 24 months later for the 1 million and 775k EVs adopted by 2030) with a 67% share of EVs in the grid. △ Less

Submitted 11 January, 2024; originally announced January 2024.

Journal ref: In: Energy Informatics. EI.A 2023. Lecture Notes in Computer Science, vol 14468

arXiv:2401.03888 [pdf]

doi 10.1007/978-3-031-48649-4_14

A Modifiable Architectural Design for Commercial Greenhouses Energy Economic Dispatch Testbed

Authors: Christian Skafte Beck Clausen, Bo Nørregaard Jørgensen, Zheng Grace Ma

Abstract: Facing economic challenges due to the diverse objectives of businesses, and consumers, commercial greenhouses strive to minimize energy costs while addressing CO2 emissions. This scenario is intensified by rising energy costs and the global imperative to curtail CO2 emissions. To address these dynamic economic challenges, this paper proposes an architectural design for an energy economic dispatch… ▽ More Facing economic challenges due to the diverse objectives of businesses, and consumers, commercial greenhouses strive to minimize energy costs while addressing CO2 emissions. This scenario is intensified by rising energy costs and the global imperative to curtail CO2 emissions. To address these dynamic economic challenges, this paper proposes an architectural design for an energy economic dispatch testbed for commercial greenhouses. Utilizing the Attribute-Driven De-sign method, core architectural components of a software-in-the-loop testbed are proposed which emphasizes modularity and careful consideration of the multi-objective optimization problem. This approach extends prior research by implementing a modular multi-objective optimization framework in Java. The results demonstrate the successful integration of the CO2 reduction objective within the modular architecture with minimal effort. The multi-objective optimization output can also be employed to examine cost and CO2 objectives, ultimately serving as a valuable decision-support tool. The novel testbed architecture and a modular approach can tackle the multi-objective optimization problem and enable commercial greenhouses to navigate the intricate landscape of energy cost and CO2 emissions management. △ Less

Submitted 8 January, 2024; originally announced January 2024.

Comments: 19 pages

Journal ref: In: Energy Informatics. EI.A 2023. Lecture Notes in Computer Science, vol 14467

arXiv:2401.01155 [pdf, ps, other]

Deep Learning-Based Detection for Marker Codes over Insertion and Deletion Channels

Authors: Guochen Ma, Xiaopeng Jiao, Jianjun Mu, Hui Han, Yaming Yang

Abstract: Marker code is an effective coding scheme to protect data from insertions and deletions. It has potential applications in future storage systems, such as DNA storage and racetrack memory. When decoding marker codes, perfect channel state information (CSI), i.e., insertion and deletion probabilities, are required to detect insertion and deletion errors. Sometimes, the perfect CSI is not easy to obt… ▽ More Marker code is an effective coding scheme to protect data from insertions and deletions. It has potential applications in future storage systems, such as DNA storage and racetrack memory. When decoding marker codes, perfect channel state information (CSI), i.e., insertion and deletion probabilities, are required to detect insertion and deletion errors. Sometimes, the perfect CSI is not easy to obtain or the accurate channel model is unknown. Therefore, it is deserved to develop detecting algorithms for marker code without the knowledge of perfect CSI. In this paper, we propose two CSI-agnostic detecting algorithms for marker code based on deep learning. The first one is a model-driven deep learning method, which deep unfolds the original iterative detecting algorithm of marker code. In this method, CSI become weights in neural networks and these weights can be learned from training data. The second one is a data-driven method which is an end-to-end system based on the deep bidirectional gated recurrent unit network. Simulation results show that error performances of the proposed methods are significantly better than that of the original detection algorithm with CSI uncertainty. Furthermore, the proposed data-driven method exhibits better error performances than other methods for unknown channel models. △ Less

Submitted 2 January, 2024; originally announced January 2024.

arXiv:2312.08628 [pdf]

YOLO-OB: An improved anchor-free real-time multiscale colon polyp detector in colonoscopy

Authors: Xiao Yang, Enmin Song, Guangzhi Ma, Yunfeng Zhu, Dongming Yu, Bowen Ding, Xianyuan Wang

Abstract: Colon cancer is expected to become the second leading cause of cancer death in the United States in 2023. Although colonoscopy is one of the most effective methods for early prevention of colon cancer, up to 30% of polyps may be missed by endoscopists, thereby increasing patients' risk of develo** colon cancer. Though deep neural networks have been proven to be an effective means of enhancing th… ▽ More Colon cancer is expected to become the second leading cause of cancer death in the United States in 2023. Although colonoscopy is one of the most effective methods for early prevention of colon cancer, up to 30% of polyps may be missed by endoscopists, thereby increasing patients' risk of develo** colon cancer. Though deep neural networks have been proven to be an effective means of enhancing the detection rate of polyps. However, the variation of polyp size brings the following problems: (1) it is difficult to design an efficient and sufficient multi-scale feature fusion structure; (2) matching polyps of different sizes with fixed-size anchor boxes is a hard challenge. These problems reduce the performance of polyp detection and also lower the model's training and detection efficiency. To address these challenges, this paper proposes a new model called YOLO-OB. Specifically, we developed a bidirectional multiscale feature fusion structure, BiSPFPN, which could enhance the feature fusion capability across different depths of a CNN. We employed the ObjectBox detection head, which used a center-based anchor-free box regression strategy that could detect polyps of different sizes on feature maps of any scale. Experiments on the public dataset SUN and the self-collected colon polyp dataset Union demonstrated that the proposed model significantly improved various performance metrics of polyp detection, especially the recall rate. Compared to the state-of-the-art results on the public dataset SUN, the proposed method achieved a 6.73% increase on recall rate from 91.5% to 98.23%. Furthermore, our YOLO-OB was able to achieve real-time polyp detection at a speed of 39 frames per second using a RTX3090 graphics card. The implementation of this paper can be found here: https://github.com/seanyan62/YOLO-OB. △ Less

Submitted 13 December, 2023; originally announced December 2023.

arXiv:2312.06718 [pdf, other]

Large Scale Foundation Models for Intelligent Manufacturing Applications: A Survey

Authors: Haotian Zhang, Semujju Stuart Dereck, Zhicheng Wang, Xianwei Lv, Kang Xu, Liang Wu, Ye Jia, **g Wu, Zhuo Long, Wensheng Liang, X. G. Ma, Ruiyan Zhuang

Abstract: Although the applications of artificial intelligence especially deep learning had greatly improved various aspects of intelligent manufacturing, they still face challenges for wide employment due to the poor generalization ability, difficulties to establish high-quality training datasets, and unsatisfactory performance of deep learning methods. The emergence of large scale foundational models(LSFM… ▽ More Although the applications of artificial intelligence especially deep learning had greatly improved various aspects of intelligent manufacturing, they still face challenges for wide employment due to the poor generalization ability, difficulties to establish high-quality training datasets, and unsatisfactory performance of deep learning methods. The emergence of large scale foundational models(LSFMs) had triggered a wave in the field of artificial intelligence, shifting deep learning models from single-task, single-modal, limited data patterns to a paradigm encompassing diverse tasks, multimodal, and pre-training on massive datasets. Although LSFMs had demonstrated powerful generalization capabilities, automatic high-quality training dataset generation and superior performance across various domains, applications of LSFMs on intelligent manufacturing were still in their nascent stage. A systematic overview of this topic was lacking, especially regarding which challenges of deep learning can be addressed by LSFMs and how these challenges can be systematically tackled. To fill this gap, this paper systematically expounded current statue of LSFMs and their advantages in the context of intelligent manufacturing. and compared comprehensively with the challenges faced by current deep learning models in various intelligent manufacturing applications. We also outlined the roadmaps for utilizing LSFMs to address these challenges. Finally, case studies of applications of LSFMs in real-world intelligent manufacturing scenarios were presented to illustrate how LSFMs could help industries, improve their efficiency. △ Less

Submitted 22 December, 2023; v1 submitted 10 December, 2023; originally announced December 2023.

arXiv:2312.05657 [pdf, other]

Leveraging Reinforcement Learning and Large Language Models for Code Optimization

Authors: Shukai Duan, Nikos Kanakaris, Xiongye Xiao, Heng **, Chenyu Zhou, Nesreen K. Ahmed, Guixiang Ma, Mihai Capota, Theodore L. Willke, Shahin Nazarian, Paul Bogdan

Abstract: Code optimization is a daunting task that requires a significant level of expertise from experienced programmers. This level of expertise is not sufficient when compared to the rapid development of new hardware architectures. Towards advancing the whole code optimization process, recent approaches rely on machine learning and artificial intelligence techniques. This paper introduces a new framewor… ▽ More Code optimization is a daunting task that requires a significant level of expertise from experienced programmers. This level of expertise is not sufficient when compared to the rapid development of new hardware architectures. Towards advancing the whole code optimization process, recent approaches rely on machine learning and artificial intelligence techniques. This paper introduces a new framework to decrease the complexity of code optimization. The proposed framework builds on large language models (LLMs) and reinforcement learning (RL) and enables LLMs to receive feedback from their environment (i.e., unit tests) during the fine-tuning process. We compare our framework with existing state-of-the-art models and show that it is more efficient with respect to speed and computational usage, as a result of the decrement in training steps and its applicability to models with fewer parameters. Additionally, our framework reduces the possibility of logical and syntactical errors. Toward evaluating our approach, we run several experiments on the PIE dataset using a CodeT5 language model and RRHF, a new reinforcement learning algorithm. We adopt a variety of evaluation metrics with regards to optimization quality, and speedup. The evaluation results demonstrate that the proposed framework has similar results in comparison with existing models using shorter training times and smaller pre-trained models. In particular, we accomplish an increase of 5.6% and 2.2 over the baseline models concerning the %OP T and SP metrics. △ Less

Submitted 9 December, 2023; originally announced December 2023.

arXiv:2311.18675 [pdf, other]

Cascaded Interaction with Eroded Deep Supervision for Salient Object Detection

Authors: Hewen Xiao, Jie Mei, Guangfu Ma, Weiren Wu

Abstract: Deep convolutional neural networks have been widely applied in salient object detection and have achieved remarkable results in this field. However, existing models suffer from information distortion caused by interpolation during up-sampling and down-sampling. In response to this drawback, this article starts from two directions in the network: feature and label. On the one hand, a novel cascaded… ▽ More Deep convolutional neural networks have been widely applied in salient object detection and have achieved remarkable results in this field. However, existing models suffer from information distortion caused by interpolation during up-sampling and down-sampling. In response to this drawback, this article starts from two directions in the network: feature and label. On the one hand, a novel cascaded interaction network with a guidance module named global-local aligned attention (GAA) is designed to reduce the negative impact of interpolation on the feature side. On the other hand, a deep supervision strategy based on edge erosion is proposed to reduce the negative guidance of label interpolation on lateral output. Extensive experiments on five popular datasets demonstrate the superiority of our method. △ Less

Submitted 30 November, 2023; originally announced November 2023.

arXiv:2310.10095 [pdf, other]

A Multi-Scale Spatial Transformer U-Net for Simultaneously Automatic Reorientation and Segmentation of 3D Nuclear Cardiac Images

Authors: Yangfan Ni, Duo Zhang, Gege Ma, Lijun Lu, Zhongke Huang, Wentao Zhu

Abstract: Accurate reorientation and segmentation of the left ventricular (LV) is essential for the quantitative analysis of myocardial perfusion imaging (MPI), in which one critical step is to reorient the reconstructed transaxial nuclear cardiac images into standard short-axis slices for subsequent image processing. Small-scale LV myocardium (LV-MY) region detection and the diverse cardiac structures of i… ▽ More Accurate reorientation and segmentation of the left ventricular (LV) is essential for the quantitative analysis of myocardial perfusion imaging (MPI), in which one critical step is to reorient the reconstructed transaxial nuclear cardiac images into standard short-axis slices for subsequent image processing. Small-scale LV myocardium (LV-MY) region detection and the diverse cardiac structures of individual patients pose challenges to LV segmentation operation. To mitigate these issues, we propose an end-to-end model, named as multi-scale spatial transformer UNet (MS-ST-UNet), that involves the multi-scale spatial transformer network (MSSTN) and multi-scale UNet (MSUNet) modules to perform simultaneous reorientation and segmentation of LV region from nuclear cardiac images. The proposed method is trained and tested using two different nuclear cardiac image modalities: 13N-ammonia PET and 99mTc-sestamibi SPECT. We use a multi-scale strategy to generate and extract image features with different scales. Our experimental results demonstrate that the proposed method significantly improves the reorientation and segmentation performance. This joint learning framework promotes mutual enhancement between reorientation and segmentation tasks, leading to cutting edge performance and an efficient image processing workflow. The proposed end-to-end deep network has the potential to reduce the burden of manual delineation for cardiac images, thereby providing multimodal quantitative analysis assistance for physicists. △ Less

Submitted 16 October, 2023; originally announced October 2023.

Comments: 17 pages, 7 figures

arXiv:2310.10049 [pdf, other]

FATE-LLM: A Industrial Grade Federated Learning Framework for Large Language Models

Authors: Tao Fan, Yan Kang, Guoqiang Ma, Wei**g Chen, Wenbin Wei, Lixin Fan, Qiang Yang

Abstract: Large Language Models (LLMs), such as ChatGPT, LLaMA, GLM, and PaLM, have exhibited remarkable performances across various tasks in recent years. However, LLMs face two main challenges in real-world applications. One challenge is that training LLMs consumes vast computing resources, preventing LLMs from being adopted by small and medium-sized enterprises with limited computing resources. Another i… ▽ More Large Language Models (LLMs), such as ChatGPT, LLaMA, GLM, and PaLM, have exhibited remarkable performances across various tasks in recent years. However, LLMs face two main challenges in real-world applications. One challenge is that training LLMs consumes vast computing resources, preventing LLMs from being adopted by small and medium-sized enterprises with limited computing resources. Another is that training LLM requires a large amount of high-quality data, which are often scattered among enterprises. To address these challenges, we propose FATE-LLM, an industrial-grade federated learning framework for large language models. FATE-LLM (1) facilitates federated learning for large language models (coined FedLLM); (2) promotes efficient training of FedLLM using parameter-efficient fine-tuning methods; (3) protects the intellectual property of LLMs; (4) preserves data privacy during training and inference through privacy-preserving mechanisms. We release the code of FATE-LLM at https://github.com/FederatedAI/FATE-LLM to facilitate the research of FedLLM and enable a broad range of industrial applications. △ Less

Submitted 16 October, 2023; originally announced October 2023.

arXiv:2310.07418 [pdf, other]

Revisiting Plasticity in Visual Reinforcement Learning: Data, Modules and Training Stages

Authors: Guozheng Ma, Lu Li, Sen Zhang, Zixuan Liu, Zhen Wang, Yixin Chen, Li Shen, Xueqian Wang, Dacheng Tao

Abstract: Plasticity, the ability of a neural network to evolve with new data, is crucial for high-performance and sample-efficient visual reinforcement learning (VRL). Although methods like resetting and regularization can potentially mitigate plasticity loss, the influences of various components within the VRL framework on the agent's plasticity are still poorly understood. In this work, we conduct a syst… ▽ More Plasticity, the ability of a neural network to evolve with new data, is crucial for high-performance and sample-efficient visual reinforcement learning (VRL). Although methods like resetting and regularization can potentially mitigate plasticity loss, the influences of various components within the VRL framework on the agent's plasticity are still poorly understood. In this work, we conduct a systematic empirical exploration focusing on three primary underexplored facets and derive the following insightful conclusions: (1) data augmentation is essential in maintaining plasticity; (2) the critic's plasticity loss serves as the principal bottleneck impeding efficient training; and (3) without timely intervention to recover critic's plasticity in the early stages, its loss becomes catastrophic. These insights suggest a novel strategy to address the high replay ratio (RR) dilemma, where exacerbated plasticity loss hinders the potential improvements of sample efficiency brought by increased reuse frequency. Rather than setting a static RR for the entire training process, we propose Adaptive RR, which dynamically adjusts the RR based on the critic's plasticity level. Extensive evaluations indicate that Adaptive RR not only avoids catastrophic plasticity loss in the early stages but also benefits from more frequent reuse in later phases, resulting in superior sample efficiency. △ Less

Submitted 19 May, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

Comments: ICLR 2024 poster

arXiv:2309.16178 [pdf, other]

LAE-ST-MoE: Boosted Language-Aware Encoder Using Speech Translation Auxiliary Task for E2E Code-switching ASR

Authors: Guodong Ma, Wenxuan Wang, Yuke Li, Yuting Yang, Binbin Du, Haoran Fu

Abstract: Recently, to mitigate the confusion between different languages in code-switching (CS) automatic speech recognition (ASR), the conditionally factorized models, such as the language-aware encoder (LAE), explicitly disregard the contextual information between different languages. However, this information may be helpful for ASR modeling. To alleviate this issue, we propose the LAE-ST-MoE framework.… ▽ More Recently, to mitigate the confusion between different languages in code-switching (CS) automatic speech recognition (ASR), the conditionally factorized models, such as the language-aware encoder (LAE), explicitly disregard the contextual information between different languages. However, this information may be helpful for ASR modeling. To alleviate this issue, we propose the LAE-ST-MoE framework. It incorporates speech translation (ST) tasks into LAE and utilizes ST to learn the contextual information between different languages. It introduces a task-based mixture of expert modules, employing separate feed-forward networks for the ASR and ST tasks. Experimental results on the ASRU 2019 Mandarin-English CS challenge dataset demonstrate that, compared to the LAE-based CTC, the LAE-ST-MoE model achieves a 9.26% mix error reduction on the CS test with the same decoding parameter. Moreover, the well-trained LAE-ST-MoE model can perform ST tasks from CS speech to Mandarin or English text. △ Less

Submitted 7 October, 2023; v1 submitted 28 September, 2023; originally announced September 2023.

Comments: Accepted to IEEE ASRU 2023

arXiv:2309.11166 [pdf, other]

Are Large Language Models Really Robust to Word-Level Perturbations?

Authors: Haoyu Wang, Guozheng Ma, Cong Yu, Ning Gui, Linrui Zhang, Zhiqi Huang, Suwei Ma, Yongzhe Chang, Sen Zhang, Li Shen, Xueqian Wang, Peilin Zhao, Dacheng Tao

Abstract: The swift advancement in the scales and capabilities of Large Language Models (LLMs) positions them as promising tools for a variety of downstream tasks. In addition to the pursuit of better performance and the avoidance of violent feedback on a certain prompt, to ensure the responsibility of the LLM, much attention is drawn to the robustness of LLMs. However, existing evaluation methods mostly re… ▽ More The swift advancement in the scales and capabilities of Large Language Models (LLMs) positions them as promising tools for a variety of downstream tasks. In addition to the pursuit of better performance and the avoidance of violent feedback on a certain prompt, to ensure the responsibility of the LLM, much attention is drawn to the robustness of LLMs. However, existing evaluation methods mostly rely on traditional question answering datasets with predefined supervised labels, which do not align with the superior generation capabilities of contemporary LLMs. To address this issue, we propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools to evaluate the longer conversation generated from more challenging open questions by LLMs, which we refer to as the Reward Model for Reasonable Robustness Evaluation (TREvaL). Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions, a capability not entirely encompassed by individual words or letters, which may exhibit oversimplification and inherent biases. Our extensive empirical experiments demonstrate that TREvaL provides an innovative method for evaluating the robustness of an LLM. Furthermore, our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage. Notably, we are surprised to discover that robustness tends to decrease as fine-tuning (SFT and RLHF) is conducted. The code of TREval is available in https://github.com/Harry-mic/TREvaL. △ Less

Submitted 27 September, 2023; v1 submitted 20 September, 2023; originally announced September 2023.

arXiv:2309.09528 [pdf]

Gesture Recognition in Millimeter-Wave Radar Based on Spatio-Temporal Feature Sequences

Authors: Qun Fang, YiHui Yan, GuoQing Ma

Abstract: Gesture recognition is a pivotal technology in the realm of intelligent education, and millimeter-wave (mmWave) signals possess advantages such as high resolution and strong penetration capability. This paper introduces a highly accurate and robust gesture recognition method using mmWave radar. The method involves capturing the raw signals of hand movements with the mmWave radar module and preproc… ▽ More Gesture recognition is a pivotal technology in the realm of intelligent education, and millimeter-wave (mmWave) signals possess advantages such as high resolution and strong penetration capability. This paper introduces a highly accurate and robust gesture recognition method using mmWave radar. The method involves capturing the raw signals of hand movements with the mmWave radar module and preprocessing the received radar signals, including Fourier transformation, distance compression, Doppler processing, and noise reduction through moving target indication (MTI). The preprocessed signals are then fed into the Convolutional Neural Network-Time Domain Convolutional Network (CNN-TCN) model to extract spatio-temporal features, with recognition performance evaluated through classification. Experimental results demonstrate that this method achieves an accuracy rate of 98.2% in domain-specific recognition and maintains a consistently high recognition rate across different neural networks, showcasing exceptional recognition performance and robustness. △ Less

Submitted 18 September, 2023; originally announced September 2023.

arXiv:2309.08781 [pdf, ps, other]

Platform Equilibrium: Analayzing Social Welfare in Online Market Places

Authors: Alon Eden, Gary Qiurui Ma, David C. Parkes

Abstract: We introduce the theoretical study of a Platform Equilibrium in a market with unit-demand buyers and unit-supply sellers. Each seller can join a platform and transact with any buyer or remain off-platform and transact with a subset of buyers whom she knows. Given the constraints on trade, prices form a competitive equilibrium and clears the market. The platform charges a transaction fee to all on-… ▽ More We introduce the theoretical study of a Platform Equilibrium in a market with unit-demand buyers and unit-supply sellers. Each seller can join a platform and transact with any buyer or remain off-platform and transact with a subset of buyers whom she knows. Given the constraints on trade, prices form a competitive equilibrium and clears the market. The platform charges a transaction fee to all on-platform sellers, in the form of a fraction of on-platform sellers' price. The platform chooses the fraction to maximize revenue. A Platform Equilibrium is a Nash equilibrium of the game where each seller decides whether or not to join the platform, balancing the effect of a larger pool of buyers to trade with, against the imposition of a transaction fee. Our main insights are: (i) In homogeneous-goods markets, pure equilibria always exist and can be found by a polynomial-time algorithm; (ii) When the platform is unregulated, the resulting Platform Equilibrium guarantees a tight $Θ(log(min(m, n)))$-approximation of the optimal welfare in homogeneous-goods markets, where $n$ and $m$ are the number of buyers and sellers respectively; (iii) Even light regulation helps: when the platform's fee is capped at $α\in[0,1)$, the price of anarchy is 2-$α$/1-$α$ for general markets. For example, if the platform takes 30 percent of the seller's revenue, a rather high fee, our analysis implies the welfare in a Platform Equilibrium is still a 0.412-fraction of the optimal welfare. Our main results extend to markets with multiple platforms, beyond unit-demand buyers, as well as to sellers with production costs. △ Less

Submitted 20 June, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

Comments: Accepted at the Twenty-Fifth ACM Conference on Economics and Computation (EC'24), 2024

arXiv:2309.02731 [pdf, other]

HC3 Plus: A Semantic-Invariant Human ChatGPT Comparison Corpus

Authors: Zhenpeng Su, Xing Wu, Wei Zhou, Guangyuan Ma, Songlin Hu

Abstract: ChatGPT has gained significant interest due to its impressive performance, but people are increasingly concerned about its potential risks, particularly around the detection of AI-generated content (AIGC), which is often difficult for untrained humans to identify. Current datasets utilized for detecting ChatGPT-generated text primarily center around question-answering, yet they tend to disregard t… ▽ More ChatGPT has gained significant interest due to its impressive performance, but people are increasingly concerned about its potential risks, particularly around the detection of AI-generated content (AIGC), which is often difficult for untrained humans to identify. Current datasets utilized for detecting ChatGPT-generated text primarily center around question-answering, yet they tend to disregard tasks that possess semantic-invariant properties, such as summarization, translation, and paraphrasing. Our primary studies demonstrate that detecting model-generated text on semantic-invariant tasks is more difficult. To fill this gap, we introduce a more extensive and comprehensive dataset that considers more types of tasks than previous work, including semantic-invariant tasks. In addition, the model after a large number of task instruction fine-tuning shows a strong powerful performance. Owing to its previous success, we further instruct fine-tuning T\textit{k}-instruct and build a more powerful detection system. △ Less

Submitted 25 January, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

Comments: This paper has been accepted by CIKM2023 workshop

arXiv:2308.08285 [pdf, other]

Pre-training with Large Language Model-based Document Expansion for Dense Passage Retrieval

Authors: Guangyuan Ma, Xing Wu, Peng Wang, Zijia Lin, Songlin Hu

Abstract: In this paper, we systematically study the potential of pre-training with Large Language Model(LLM)-based document expansion for dense passage retrieval. Concretely, we leverage the capabilities of LLMs for document expansion, i.e. query generation, and effectively transfer expanded knowledge to retrievers using pre-training strategies tailored for passage retrieval. These strategies include contr… ▽ More In this paper, we systematically study the potential of pre-training with Large Language Model(LLM)-based document expansion for dense passage retrieval. Concretely, we leverage the capabilities of LLMs for document expansion, i.e. query generation, and effectively transfer expanded knowledge to retrievers using pre-training strategies tailored for passage retrieval. These strategies include contrastive learning and bottlenecked query generation. Furthermore, we incorporate a curriculum learning strategy to reduce the reliance on LLM inferences. Experimental results demonstrate that pre-training with LLM-based document expansion significantly boosts the retrieval performance on large-scale web-search tasks. Our work shows strong zero-shot and out-of-domain retrieval abilities, making it more widely applicable for retrieval when initializing with no human-labeled data. △ Less

Submitted 16 August, 2023; originally announced August 2023.

Comments: 10 pages, 3 tables, 4 figures, under review

arXiv:2308.04663 [pdf, other]

doi 10.1016/j.media.2024.103199

Classification of lung cancer subtypes on CT images with synthetic pathological priors

Authors: Wentao Zhu, Yuan **, Gege Ma, Geng Chen, Jan Egger, Shaoting Zhang, Dimitris N. Metaxas

Abstract: The accurate diagnosis on pathological subtypes for lung cancer is of significant importance for the follow-up treatments and prognosis managements. In this paper, we propose self-generating hybrid feature network (SGHF-Net) for accurately classifying lung cancer subtypes on computed tomography (CT) images. Inspired by studies stating that cross-scale associations exist in the image patterns betwe… ▽ More The accurate diagnosis on pathological subtypes for lung cancer is of significant importance for the follow-up treatments and prognosis managements. In this paper, we propose self-generating hybrid feature network (SGHF-Net) for accurately classifying lung cancer subtypes on computed tomography (CT) images. Inspired by studies stating that cross-scale associations exist in the image patterns between the same case's CT images and its pathological images, we innovatively developed a pathological feature synthetic module (PFSM), which quantitatively maps cross-modality associations through deep neural networks, to derive the "gold standard" information contained in the corresponding pathological images from CT images. Additionally, we designed a radiological feature extraction module (RFEM) to directly acquire CT image information and integrated it with the pathological priors under an effective feature fusion framework, enabling the entire classification model to generate more indicative and specific pathologically related features and eventually output more accurate predictions. The superiority of the proposed model lies in its ability to self-generate hybrid features that contain multi-modality image information based on a single-modality input. To evaluate the effectiveness, adaptability, and generalization ability of our model, we performed extensive experiments on a large-scale multi-center dataset (i.e., 829 cases from three hospitals) to compare our model and a series of state-of-the-art (SOTA) classification models. The experimental results demonstrated the superiority of our model for lung cancer subtypes classification with significant accuracy improvements in terms of accuracy (ACC), area under the curve (AUC), and F1 score. △ Less

Submitted 8 August, 2023; originally announced August 2023.

Comments: 16 pages, 7 figures

Journal ref: Medical Image Analysis 95, July 2024, 103199

arXiv:2308.01531 [pdf]

doi 10.1038/s41467-024-45435-4

Optimizing multi-user indoor sound communications with acoustic reconfigurable metasurfaces

Authors: Hongkuan Zhang, Qiyuan Wang, Mathias Fink, Guancong Ma

Abstract: Sound in indoor spaces forms a complex wavefield due to multiple scattering encountered by the sound. Indoor acoustic communication involving multiple sources and receivers thus inevitably suffers from cross-talks. Here, we demonstrate the isolation of acoustic communication channels in a room by wavefield sha** using acoustic reconfigurable metasurfaces (ARMs) controlled by optimization protoco… ▽ More Sound in indoor spaces forms a complex wavefield due to multiple scattering encountered by the sound. Indoor acoustic communication involving multiple sources and receivers thus inevitably suffers from cross-talks. Here, we demonstrate the isolation of acoustic communication channels in a room by wavefield sha** using acoustic reconfigurable metasurfaces (ARMs) controlled by optimization protocols based on communication theories. The ARMs have 200 electrically switchable units, each selectively offering 0 or π phase shifts in the reflected waves. The sound field is reshaped for maximal Shannon capacity and minimal cross-talk simultaneously. We demonstrate diverse acoustic functionalities over a spectrum much larger than the coherence bandwidth of the room, including multi-channel, multi-spectral channel isolations, and frequency-multiplexed acoustic communication. Our work shows that wavefield sha** in complex media can offer new strategies for future acoustic engineering. △ Less

Submitted 10 February, 2024; v1 submitted 3 August, 2023; originally announced August 2023.

Journal ref: Nature Communications (2024)

arXiv:2308.00920 [pdf]

doi 10.1038/s41467-024-46077-2

Virtual histological staining of unlabeled autopsy tissue

Authors: Yuzhu Li, Nir Pillar, **gxi Li, Tairan Liu, Di Wu, Songyu Sun, Guangdong Ma, Kevin de Haan, Luzhe Huang, Sepehr Hamidi, Anatoly Urisman, Tal Keidar Haran, William Dean Wallace, Jonathan E. Zuckerman, Aydogan Ozcan

Abstract: Histological examination is a crucial step in an autopsy; however, the traditional histochemical staining of post-mortem samples faces multiple challenges, including the inferior staining quality due to autolysis caused by delayed fixation of cadaver tissue, as well as the resource-intensive nature of chemical staining procedures covering large tissue areas, which demand substantial labor, cost, a… ▽ More Histological examination is a crucial step in an autopsy; however, the traditional histochemical staining of post-mortem samples faces multiple challenges, including the inferior staining quality due to autolysis caused by delayed fixation of cadaver tissue, as well as the resource-intensive nature of chemical staining procedures covering large tissue areas, which demand substantial labor, cost, and time. These challenges can become more pronounced during global health crises when the availability of histopathology services is limited, resulting in further delays in tissue fixation and more severe staining artifacts. Here, we report the first demonstration of virtual staining of autopsy tissue and show that a trained neural network can rapidly transform autofluorescence images of label-free autopsy tissue sections into brightfield equivalent images that match hematoxylin and eosin (H&E) stained versions of the same samples, eliminating autolysis-induced severe staining artifacts inherent in traditional histochemical staining of autopsied tissue. Our virtual H&E model was trained using >0.7 TB of image data and a data-efficient collaboration scheme that integrates the virtual staining network with an image registration network. The trained model effectively accentuated nuclear, cytoplasmic and extracellular features in new autopsy tissue samples that experienced severe autolysis, such as COVID-19 samples never seen before, where the traditional histochemical staining failed to provide consistent staining quality. This virtual autopsy staining technique can also be extended to necrotic tissue, and can rapidly and cost-effectively generate artifact-free H&E stains despite severe autolysis and cell death, also reducing labor, cost and infrastructure requirements associated with the standard histochemical staining. △ Less

Submitted 1 August, 2023; originally announced August 2023.

Comments: 24 Pages, 7 Figures

Journal ref: Nature Communications (2024)

arXiv:2307.05956 [pdf, other]

Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition

Authors: Wenxuan Wang, Guodong Ma, Yuke Li, Binbin Du

Abstract: Multilingual speech recognition for both monolingual and code-switching speech is a challenging task. Recently, based on the Mixture of Experts (MoE), many works have made good progress in multilingual and code-switching ASR, but present huge computational complexity with the increase of supported languages. In this work, we propose a computation-efficient network named Language-Routing Mixture of… ▽ More Multilingual speech recognition for both monolingual and code-switching speech is a challenging task. Recently, based on the Mixture of Experts (MoE), many works have made good progress in multilingual and code-switching ASR, but present huge computational complexity with the increase of supported languages. In this work, we propose a computation-efficient network named Language-Routing Mixture of Experts (LR-MoE) for multilingual and code-switching ASR. LR-MoE extracts language-specific representations through the Mixture of Language Experts (MLE), which is guided to learn by a frame-wise language routing mechanism. The weight-shared frame-level language identification (LID) network is jointly trained as the shared pre-router of each MoE layer. Experiments show that the proposed method significantly improves multilingual and code-switching speech recognition performances over baseline with comparable computational efficiency. △ Less

Submitted 13 July, 2023; v1 submitted 12 July, 2023; originally announced July 2023.

Comments: To appear in Proc. INTERSPEECH 2023, August 20-24, 2023, Dublin, Ireland

arXiv:2307.04024 [pdf, other]

Robust Ranking Explanations

Authors: Chao Chen, Chenghua Guo, Guixiang Ma, Ming Zeng, Xi Zhang, Sihong Xie

Abstract: Robust explanations of machine learning models are critical to establish human trust in the models. Due to limited cognition capability, most humans can only interpret the top few salient features. It is critical to make top salient features robust to adversarial attacks, especially those against the more vulnerable gradient-based explanations. Existing defense measures robustness using $\ell_p$-n… ▽ More Robust explanations of machine learning models are critical to establish human trust in the models. Due to limited cognition capability, most humans can only interpret the top few salient features. It is critical to make top salient features robust to adversarial attacks, especially those against the more vulnerable gradient-based explanations. Existing defense measures robustness using $\ell_p$-norms, which have weaker protection power. We define explanation thickness for measuring salient features ranking stability, and derive tractable surrogate bounds of the thickness to design the \textit{R2ET} algorithm to efficiently maximize the thickness and anchor top salient features. Theoretically, we prove a connection between R2ET and adversarial training. Experiments with a wide spectrum of network architectures and data modalities, including brain networks, demonstrate that R2ET attains higher explanation robustness under stealthy attacks while retaining accuracy. △ Less

Submitted 8 July, 2023; originally announced July 2023.

Comments: Accepted to IMLH (Interpretable ML in Healthcare) workshop at ICML 2023. arXiv admin note: substantial text overlap with arXiv:2212.14106

arXiv:2306.12045 [pdf, other]

Temporal Conditioning Spiking Latent Variable Models of the Neural Response to Natural Visual Scenes

Authors: Gehua Ma, Runhao Jiang, Rui Yan, Hua** Tang

Abstract: Develo** computational models of neural response is crucial for understanding sensory processing and neural computations. Current state-of-the-art neural network methods use temporal filters to handle temporal dependencies, resulting in an unrealistic and inflexible processing paradigm. Meanwhile, these methods target trial-averaged firing rates and fail to capture important features in spike tr… ▽ More Develo** computational models of neural response is crucial for understanding sensory processing and neural computations. Current state-of-the-art neural network methods use temporal filters to handle temporal dependencies, resulting in an unrealistic and inflexible processing paradigm. Meanwhile, these methods target trial-averaged firing rates and fail to capture important features in spike trains. This work presents the temporal conditioning spiking latent variable models (TeCoS-LVM) to simulate the neural response to natural visual stimuli. We use spiking neurons to produce spike outputs that directly match the recorded trains. This approach helps to avoid losing information embedded in the original spike trains. We exclude the temporal dimension from the model parameter space and introduce a temporal conditioning operation to allow the model to adaptively explore and exploit temporal dependencies in stimuli sequences in a {\it natural paradigm}. We show that TeCoS-LVM models can produce more realistic spike activities and accurately fit spike statistics than powerful alternatives. Additionally, learned TeCoS-LVM models can generalize well to longer time scales. Overall, while remaining computationally tractable, our model effectively captures key features of neural coding systems. It thus provides a useful tool for building accurate predictive computational accounts for various sensory perception circuits. △ Less

Submitted 19 December, 2023; v1 submitted 21 June, 2023; originally announced June 2023.

Comments: Accepted at NeurIPS 2023 (https://openreview.net/forum?id=V4YeOvsQfu). 22 pages, 7 figures, 3 tables

arXiv:2306.04357 [pdf, other]

Dial-MAE: ConTextual Masked Auto-Encoder for Retrieval-based Dialogue Systems

Authors: Zhenpeng Su, Xing Wu, Wei Zhou, Guangyuan Ma, Songlin Hu

Abstract: Dialogue response selection aims to select an appropriate response from several candidates based on a given user and system utterance history. Most existing works primarily focus on post-training and fine-tuning tailored for cross-encoders. However, there are no post-training methods tailored for dense encoders in dialogue response selection. We argue that when the current language model, based on… ▽ More Dialogue response selection aims to select an appropriate response from several candidates based on a given user and system utterance history. Most existing works primarily focus on post-training and fine-tuning tailored for cross-encoders. However, there are no post-training methods tailored for dense encoders in dialogue response selection. We argue that when the current language model, based on dense dialogue systems (such as BERT), is employed as a dense encoder, it separately encodes dialogue context and response, leading to a struggle to achieve the alignment of both representations. Thus, we propose Dial-MAE (Dialogue Contextual Masking Auto-Encoder), a straightforward yet effective post-training technique tailored for dense encoders in dialogue response selection. Dial-MAE uses an asymmetric encoder-decoder architecture to compress the dialogue semantics into dense vectors, which achieves better alignment between the features of the dialogue context and response. Our experiments have demonstrated that Dial-MAE is highly effective, achieving state-of-the-art performance on two commonly evaluated benchmarks. △ Less

Submitted 26 March, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

Comments: This paper has been accepted by NAACL 2024

arXiv:2306.00656 [pdf, other]

Normalization Enhances Generalization in Visual Reinforcement Learning

Authors: Lu Li, Jiafei Lyu, Guozheng Ma, Zilin Wang, Zhenjie Yang, Xiu Li, Zhiheng Li

Abstract: Recent advances in visual reinforcement learning (RL) have led to impressive success in handling complex tasks. However, these methods have demonstrated limited generalization capability to visual disturbances, which poses a significant challenge for their real-world application and adaptability. Though normalization techniques have demonstrated huge success in supervised and unsupervised learning… ▽ More Recent advances in visual reinforcement learning (RL) have led to impressive success in handling complex tasks. However, these methods have demonstrated limited generalization capability to visual disturbances, which poses a significant challenge for their real-world application and adaptability. Though normalization techniques have demonstrated huge success in supervised and unsupervised learning, their applications in visual RL are still scarce. In this paper, we explore the potential benefits of integrating normalization into visual RL methods with respect to generalization performance. We find that, perhaps surprisingly, incorporating suitable normalization techniques is sufficient to enhance the generalization capabilities, without any additional special design. We utilize the combination of two normalization techniques, CrossNorm and SelfNorm, for generalizable visual RL. Extensive experiments are conducted on DMControl Generalization Benchmark and CARLA to validate the effectiveness of our method. We show that our method significantly improves generalization capability while only marginally affecting sample efficiency. In particular, when integrated with DrQ-v2, our method enhances the test performance of DrQ-v2 on CARLA across various scenarios, from 14% of the training performance to 97%. △ Less

Submitted 1 June, 2023; originally announced June 2023.

arXiv:2305.16379 [pdf, other]

Learning Better with Less: Effective Augmentation for Sample-Efficient Visual Reinforcement Learning

Authors: Guozheng Ma, Linrui Zhang, Haoyu Wang, Lu Li, Zilin Wang, Zhen Wang, Li Shen, Xueqian Wang, Dacheng Tao

Abstract: Data augmentation (DA) is a crucial technique for enhancing the sample efficiency of visual reinforcement learning (RL) algorithms. Notably, employing simple observation transformations alone can yield outstanding performance without extra auxiliary representation tasks or pre-trained encoders. However, it remains unclear which attributes of DA account for its effectiveness in achieving sample-eff… ▽ More Data augmentation (DA) is a crucial technique for enhancing the sample efficiency of visual reinforcement learning (RL) algorithms. Notably, employing simple observation transformations alone can yield outstanding performance without extra auxiliary representation tasks or pre-trained encoders. However, it remains unclear which attributes of DA account for its effectiveness in achieving sample-efficient visual RL. To investigate this issue and further explore the potential of DA, this work conducts comprehensive experiments to assess the impact of DA's attributes on its efficacy and provides the following insights and improvements: (1) For individual DA operations, we reveal that both ample spatial diversity and slight hardness are indispensable. Building on this finding, we introduce Random PadResize (Rand PR), a new DA operation that offers abundant spatial diversity with minimal hardness. (2) For multi-type DA fusion schemes, the increased DA hardness and unstable data distribution result in the current fusion schemes being unable to achieve higher sample efficiency than their corresponding individual operations. Taking the non-stationary nature of RL into account, we propose a RL-tailored multi-type DA fusion scheme called Cycling Augmentation (CycAug), which performs periodic cycles of different DA operations to increase type diversity while maintaining data distribution consistency. Extensive evaluations on the DeepMind Control suite and CARLA driving simulator demonstrate that our methods achieve superior sample efficiency compared with the prior state-of-the-art methods. △ Less

Submitted 27 October, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: NeurIPS 2023 poster

arXiv:2305.16044 [pdf, other]

doi 10.1016/j.patter.2023.100831

Exploiting Noise as a Resource for Computation and Learning in Spiking Neural Networks

Authors: Gehua Ma, Rui Yan, Hua** Tang

Abstract: $\textbf{Formal version available at}$ https://cell.com/patterns/fulltext/S2666-3899(23)00200-3 Networks of spiking neurons underpin the extraordinary information-processing capabilities of the brain and have become pillar models in neuromorphic artificial intelligence. Despite extensive research on spiking neural networks (SNNs), most studies are established on deterministic models, overlooking… ▽ More $\textbf{Formal version available at}$ https://cell.com/patterns/fulltext/S2666-3899(23)00200-3 Networks of spiking neurons underpin the extraordinary information-processing capabilities of the brain and have become pillar models in neuromorphic artificial intelligence. Despite extensive research on spiking neural networks (SNNs), most studies are established on deterministic models, overlooking the inherent non-deterministic, noisy nature of neural computations. This study introduces the noisy spiking neural network (NSNN) and the noise-driven learning rule (NDL) by incorporating noisy neuronal dynamics to exploit the computational advantages of noisy neural processing. NSNN provides a theoretical framework that yields scalable, flexible, and reliable computation. We demonstrate that NSNN leads to spiking neural models with competitive performance, improved robustness against challenging perturbations than deterministic SNNs, and better reproducing probabilistic computations in neural coding. This study offers a powerful and easy-to-use tool for machine learning, neuromorphic intelligence practitioners, and computational neuroscience researchers. △ Less

Submitted 14 September, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: 36 pages, 9 figures, 4 tables. Formal version available at https://cell.com/patterns/fulltext/S2666-3899(23)00200-3

arXiv:2305.01524 [pdf, other]

3D Laser-and-tissue Agnostic Data-driven Method for Robotic Laser Surgical Planning

Authors: Guangshen Ma, Ravi Prakash, Brian Mann, Weston Ross, Patrick Codd

Abstract: In robotic laser surgery, shape prediction of an one-shot ablation cavity is an important problem for minimizing errant overcutting of healthy tissue during the course of pathological tissue resection and precise tumor removal. Since it is difficult to physically model the laser-tissue interaction due to the variety of optical tissue properties, complicated process of heat transfer, and uncertaint… ▽ More In robotic laser surgery, shape prediction of an one-shot ablation cavity is an important problem for minimizing errant overcutting of healthy tissue during the course of pathological tissue resection and precise tumor removal. Since it is difficult to physically model the laser-tissue interaction due to the variety of optical tissue properties, complicated process of heat transfer, and uncertainty about the chemical reaction, we propose a 3D cavity prediction model based on an entirely data-driven method without any assumptions of laser settings and tissue properties. Based on the cavity prediction model, we formulate a novel robotic laser planning problem to determine the optimal laser incident configuration, which aims to create a cavity that aligns with the surface target (e.g. tumor, pathological tissue). To solve the one-shot ablation cavity prediction problem, we model the 3D geometric relation between the tissue surface and the laser energy profile as a non-linear regression problem that can be represented by a single-layer perceptron (SLP) network. The SLP network is encoded in a novel kinematic model to predict the shape of the post-ablation cavity with an arbitrary laser input. To estimate the SLP network parameters, we formulate a dataset of one-shot laser-phantom cavities reconstructed by the optical coherence tomography (OCT) B-scan images for the data-driven modelling. To verify the method. The learned cavity prediction model is applied to solve a simplified robotic laser planning problem modelled as a surface alignment error minimization problem. The initial results report (91.1 +- 3.0)% 3D-cavity-Intersection-over-Union (3D-cavity-IoU) for the 3D cavity prediction and an average of 97.9% success rate for the simulated surface alignment experiments. △ Less

Submitted 2 May, 2023; originally announced May 2023.

arXiv:2304.12633 [pdf, other]

PUNR: Pre-training with User Behavior Modeling for News Recommendation

Authors: Guangyuan Ma, Hongtao Liu, Xing Wu, Wanhui Qian, Zhepeng Lv, Qing Yang, Songlin Hu

Abstract: News recommendation aims to predict click behaviors based on user behaviors. How to effectively model the user representations is the key to recommending preferred news. Existing works are mostly focused on improvements in the supervised fine-tuning stage. However, there is still a lack of PLM-based unsupervised pre-training methods optimized for user representations. In this work, we propose an u… ▽ More News recommendation aims to predict click behaviors based on user behaviors. How to effectively model the user representations is the key to recommending preferred news. Existing works are mostly focused on improvements in the supervised fine-tuning stage. However, there is still a lack of PLM-based unsupervised pre-training methods optimized for user representations. In this work, we propose an unsupervised pre-training paradigm with two tasks, i.e. user behavior masking and user behavior generation, both towards effective user behavior modeling. Firstly, we introduce the user behavior masking pre-training task to recover the masked user behaviors based on their contextual behaviors. In this way, the model could capture a much stronger and more comprehensive user news reading pattern. Besides, we incorporate a novel auxiliary user behavior generation pre-training task to enhance the user representation vector derived from the user encoder. We use the above pre-trained user modeling encoder to obtain news and user representations in downstream fine-tuning. Evaluations on the real-world news benchmark show significant performance improvements over existing baselines. △ Less

Submitted 30 October, 2023; v1 submitted 25 April, 2023; originally announced April 2023.

Comments: Accepted by Findings of EMNLP23. Github Repo: https://github.com/ma787639046/punr

arXiv:2304.10195 [pdf, other]

CoT-MoTE: Exploring ConTextual Masked Auto-Encoder Pre-training with Mixture-of-Textual-Experts for Passage Retrieval

Authors: Guangyuan Ma, Xing Wu, Peng Wang, Songlin Hu

Abstract: Passage retrieval aims to retrieve relevant passages from large collections of the open-domain corpus. Contextual Masked Auto-Encoding has been proven effective in representation bottleneck pre-training of a monolithic dual-encoder for passage retrieval. Siamese or fully separated dual-encoders are often adopted as basic retrieval architecture in the pre-training and fine-tuning stages for encodin… ▽ More Passage retrieval aims to retrieve relevant passages from large collections of the open-domain corpus. Contextual Masked Auto-Encoding has been proven effective in representation bottleneck pre-training of a monolithic dual-encoder for passage retrieval. Siamese or fully separated dual-encoders are often adopted as basic retrieval architecture in the pre-training and fine-tuning stages for encoding queries and passages into their latent embedding spaces. However, simply sharing or separating the parameters of the dual-encoder results in an imbalanced discrimination of the embedding spaces. In this work, we propose to pre-train Contextual Masked Auto-Encoder with Mixture-of-Textual-Experts (CoT-MoTE). Specifically, we incorporate textual-specific experts for individually encoding the distinct properties of queries and passages. Meanwhile, a shared self-attention layer is still kept for unified attention modeling. Results on large-scale passage retrieval benchmarks show steady improvement in retrieval performances. The quantitive analysis also shows a more balanced discrimination of the latent embedding spaces. △ Less

Submitted 20 April, 2023; originally announced April 2023.

Showing 1–50 of 164 results for author: Ma, G