Search | arXiv e-print repository

HoloHisto: End-to-end Gigapixel WSI Segmentation with 4K Resolution Sequential Tokenization

Authors: Yucheng Tang, Yufan He, Vishwesh Nath, Pengfei Guo, Ruining Deng, Tianyuan Yao, Quan Liu, Can Cui, Mengmeng Yin, Ziyue Xu, Holger Roth, Daguang Xu, Haichun Yang, Yuankai Huo

Abstract: In digital pathology, the traditional method for deep learning-based image segmentation typically involves a two-stage process: initially segmenting high-resolution whole slide images (WSI) into smaller patches (e.g., 256x256, 512x512, 1024x1024) and subsequently reconstructing them to their original scale. This method often struggles to capture the complex details and vast scope of WSIs. In this paper, we propose the holistic histopathology (HoloHisto) segmentation method to achieve end-to-end segmentation on gigapixel WSIs, whose maximum resolution is above 80,000 × 70,000 pixels. HoloHisto fundamentally shifts the paradigm of WSI segmentation to an end-to-end learning fashion with 1) a large (4K) resolution base patch for elevated visual information inclusion and efficient processing, and 2) a novel sequential tokenization mechanism to properly model the contextual relationships and efficiently model the rich information from the 4K input. To the best of our knowledge, HoloHisto presents the first holistic approach for gigapixel resolution WSI segmentation, supporting direct I/O of complete WSIs and their corresponding gigapixel masks. Under the HoloHisto platform, we unveil a random 4K sampler that transcends ultra-high resolution, delivering 31 and 10 times more pixels than standard 2D and 3D patches, respectively, for advancing computational capabilities. To facilitate efficient 4K resolution dense prediction, we leverage sequential tokenization, utilizing a pre-trained image tokenizer to group image features into a discrete token grid. To assess the performance, our team curated a new kidney pathology image segmentation (KPIs) dataset with WSI-level glomeruli segmentation from whole mouse kidneys. From the results, HoloHisto-4K delivers remarkable performance gains over previous state-of-the-art models.

Submitted 3 July, 2024; originally announced July 2024.
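
The scale argument in the HoloHisto abstract can be made concrete with a little arithmetic. The following is a minimal sketch, not the authors' sampler: it assumes that "4K" means 3840 x 2160 pixels and that a "standard 2D patch" is 512 x 512, neither of which the abstract states explicitly, and it uses the 80,000 x 70,000 slide size quoted in the abstract. The helper name tiles_needed is hypothetical.

```python
import math


def tiles_needed(width, height, tile_w, tile_h):
    """Number of non-overlapping tiles needed to cover a width x height image."""
    return math.ceil(width / tile_w) * math.ceil(height / tile_h)


# Slide size taken from the abstract ("above 80,000 x 70,000 pixels").
SLIDE_W, SLIDE_H = 80_000, 70_000

# Assumed sizes (not given in the abstract): 4K as 3840 x 2160,
# and a conventional small patch as 512 x 512.
FOUR_K = (3840, 2160)
SMALL = (512, 512)

# How many more pixels one 4K patch holds than one small patch.
pixel_ratio = (FOUR_K[0] * FOUR_K[1]) / (SMALL[0] * SMALL[1])

# How many patches of each size are needed to cover the whole slide.
n_4k = tiles_needed(SLIDE_W, SLIDE_H, *FOUR_K)
n_small = tiles_needed(SLIDE_W, SLIDE_H, *SMALL)

print(f"pixel ratio: {pixel_ratio:.1f}")
print(f"4K tiles: {n_4k}, 512 x 512 tiles: {n_small}")
```

Under these assumed sizes the pixel ratio comes out near the 31-fold figure quoted for 2D patches, and the slide is covered by a few hundred 4K tiles instead of tens of thousands of small ones.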

arXiv:2407.02386 [pdf, other]

OpenSlot: Mixed Open-set Recognition with Object-centric Learning

Authors: Xu Yin, Fei Pan, Guoyuan An, Yuchi Huo, Zixuan Xie, Sung-Eui Yoon

Abstract: Existing open-set recognition (OSR) studies typically assume that each image contains only one class label, and the unknown test set (negative) has a disjoint label space from the known test set (positive), a scenario termed full-label shift. This paper introduces the mixed OSR problem, where test images contain multiple class semantics, with known and unknown classes co-occurring in negatives, leading to a more challenging super-label shift. Addressing the mixed OSR requires classification models to accurately distinguish different class semantics within images and measure their "knowness". In this study, we propose the OpenSlot framework, built upon object-centric learning. OpenSlot utilizes slot features to represent diverse class semantics and produce class predictions. Through our proposed anti-noise-slot (ANS) technique, we mitigate the impact of noise (invalid and background) slots during classification training, effectively addressing the semantic misalignment between class predictions and the ground truth. We conduct extensive experiments with OpenSlot on mixed & conventional OSR benchmarks. Without elaborate designs, OpenSlot not only exceeds existing OSR studies in detecting super-label shifts across single & multi-label mixed OSR tasks but also achieves state-of-the-art performance on conventional benchmarks. Remarkably, our method can localize class objects without using bounding boxes during training. The competitive performance in open-set object detection demonstrates OpenSlot's ability to explicitly explain label shifts and benefits in computational efficiency and generalization.

Submitted 2 July, 2024; originally announced July 2024.

Comments: This study is under IEEE TMM review

arXiv:2407.00596 [pdf, other]

HATs: Hierarchical Adaptive Taxonomy Segmentation for Panoramic Pathology Image Analysis

Authors: Ruining Deng, Quan Liu, Can Cui, Tianyuan Yao, Juming Xiong, Shunxing Bao, Hao Li, Mengmeng Yin, Yu Wang, Shilin Zhao, Yucheng Tang, Haichun Yang, Yuankai Huo

Abstract: Panoramic image segmentation in computational pathology presents a remarkable challenge due to the morphologically complex and variably scaled anatomy. For instance, the intricate organization in kidney pathology spans multiple layers, from regions like the cortex and medulla to functional units such as glomeruli, tubules, and vessels, down to various cell types. In this paper, we propose a novel Hierarchical Adaptive Taxonomy Segmentation (HATs) method, which is designed to thoroughly segment panoramic views of kidney structures by leveraging detailed anatomical insights. Our approach entails (1) the innovative HATs technique, which translates spatial relationships among 15 distinct object classes into a versatile "plug-and-play" loss function that spans regions, functional units, and cells, (2) the incorporation of anatomical hierarchies and scale considerations into a unified simple matrix representation for all panoramic entities, and (3) the adoption of the latest AI foundation model (EfficientSAM) as a feature extraction tool to boost the model's adaptability, while eliminating the need for the manual prompt generation required by the conventional Segment Anything Model (SAM). Experimental findings demonstrate that the HATs method offers an efficient and effective strategy for integrating clinical insights and imaging precedents into a unified segmentation model across more than 15 categories. The official implementation is publicly available at https://github.com/hrlblab/HATs.

Submitted 30 June, 2024; originally announced July 2024.

Comments: arXiv admin note: text overlap with arXiv:2402.19286

arXiv:2406.19540 [pdf, other]

Weighted Circle Fusion: Ensembling Circle Representation from Different Object Detection Results

Authors: Jialin Yue, Tianyuan Yao, Ruining Deng, Quan Liu, Juming Xiong, Haichun Yang, Yuankai Huo

Abstract: Recently, the use of circle representation has emerged as a method to improve the identification of spherical objects (such as glomeruli, cells, and nuclei) in medical imaging studies. In traditional bounding box-based object detection, combining results from multiple models improves accuracy, especially when real-time processing isn't crucial. Unfortunately, this widely adopted strategy is not readily available for combining circle representations. In this paper, we propose Weighted Circle Fusion (WCF), a simple approach for merging predictions from various circle detection models. Our method leverages confidence scores associated with each proposed bounding circle to generate averaged circles. Our method undergoes thorough evaluation on a proprietary dataset for glomerular detection within whole slide imaging (WSI). The findings reveal a performance gain of 5% compared to existing ensemble methods. Furthermore, the Weighted Circle Fusion technique not only improves the precision of object detection in medical images but also notably decreases false detections, presenting a promising direction for future research and application in pathological image analysis.

Submitted 27 June, 2024; originally announced June 2024.
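
The Weighted Circle Fusion abstract describes using confidence scores to generate averaged circles. The following is a minimal sketch of that averaging step only, assuming the circles passed in have already been grouped as detections of the same object. The grouping rule, the Circle type, the fuse_circles name, and the choice of the mean confidence as the fused score are all assumptions made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Circle:
    x: float      # centre x coordinate
    y: float      # centre y coordinate
    r: float      # radius
    score: float  # detector confidence in [0, 1]


def fuse_circles(circles):
    """Confidence-weighted average of circles assumed to cover one object."""
    total = sum(c.score for c in circles)
    if total <= 0:
        raise ValueError("at least one circle needs a positive score")
    x = sum(c.x * c.score for c in circles) / total
    y = sum(c.y * c.score for c in circles) / total
    r = sum(c.r * c.score for c in circles) / total
    # Assumption: report the mean confidence of the fused group.
    score = total / len(circles)
    return Circle(x, y, r, score)


# Two hypothetical detections of the same glomerulus from different models.
fused = fuse_circles([
    Circle(x=0.0, y=0.0, r=10.0, score=0.9),
    Circle(x=2.0, y=0.0, r=12.0, score=0.3),
])
print(fused)
```

The fused circle is pulled toward the higher-confidence detection, which is the intended effect of weighting by score instead of taking a plain mean.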

arXiv:2406.16386 [pdf, other]

Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach

Authors: Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, Michael R. Lyu

Abstract: Websites are critical in today's digital world, with over 1.11 billion currently active and approximately 252,000 new sites launched daily. Converting website layout design into functional UI code is a time-consuming yet indispensable step of website development. Manual methods of converting visual designs into functional code present significant challenges, especially for non-experts. To explore automatic design-to-code solutions, we first conduct a motivating study on GPT-4o and identify three types of issues in generating UI code: element omission, element distortion, and element misarrangement. We further reveal that a focus on smaller visual segments can help multimodal large language models (MLLMs) mitigate these failures in the generation process. In this paper, we propose DCGen, a divide-and-conquer-based approach to automate the translation of webpage design to UI code. DCGen starts by dividing screenshots into manageable segments, generating descriptions for each segment, and then reassembling them into complete UI code for the entire screenshot. We conduct extensive testing with a dataset comprised of real-world websites and various MLLMs and demonstrate that DCGen achieves up to a 14% improvement in visual similarity over competing methods. To the best of our knowledge, DCGen is the first segment-aware prompt-based approach for generating UI code directly from screenshots.

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.16360 [pdf, other]

MIRReS: Multi-bounce Inverse Rendering using Reservoir Sampling

Authors: Yuxin Dai, Qi Wang, Jingsen Zhu, Dianbing Xi, Yuchi Huo, Chen Qian, Ying He

Abstract: We present MIRReS, a novel two-stage inverse rendering framework that jointly reconstructs and optimizes the explicit geometry, material, and lighting from multi-view images. Unlike previous methods that rely on implicit irradiance fields or simplified path tracing algorithms, our method extracts an explicit geometry (triangular mesh) in stage one, and introduces a more realistic physically-based inverse rendering model that utilizes multi-bounce path tracing and Monte Carlo integration. By leveraging multi-bounce path tracing, our method effectively estimates indirect illumination, including self-shadowing and internal reflections, which improves the intrinsic decomposition of shape, material, and lighting. Moreover, we incorporate reservoir sampling into our framework to address the noise in Monte Carlo integration, enhancing convergence and facilitating gradient-based optimization with low sample counts. Through qualitative and quantitative evaluation of several scenarios, especially in challenging scenarios with complex shadows, we demonstrate that our method achieves state-of-the-art performance on decomposition results. Additionally, our optimized explicit geometry enables applications such as scene editing, relighting, and material editing with modern graphics engines or CAD software. The source code is available at https://brabbitdousha.github.io/MIRReS/

Submitted 24 June, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

Comments: 16 pages, 14 figures
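
The MIRReS abstract names reservoir sampling as its tool against Monte Carlo noise but does not say which variant is used. As a generic illustration only, here is a minimal sketch of single-sample weighted reservoir sampling over a stream of candidates, the textbook form that keeps one candidate with probability proportional to its weight while visiting each candidate once. The function name and the (item, weight) stream format are assumptions for this sketch and are not taken from the paper or its code.

```python
import random


def weighted_reservoir_sample(stream, rng=None):
    """Pick one item from (item, weight) pairs with probability
    proportional to weight, in a single pass and O(1) memory."""
    if rng is None:
        rng = random.Random()
    chosen = None
    weight_sum = 0.0
    for item, weight in stream:
        if weight <= 0:
            continue  # zero-weight candidates can never be selected
        weight_sum += weight
        # Replace the current choice with probability weight / weight_sum.
        if rng.random() < weight / weight_sum:
            chosen = item
    return chosen, weight_sum


# Hypothetical light-path candidates with unnormalised importance weights.
candidates = [("path_a", 0.2), ("path_b", 1.5), ("path_c", 0.3)]
sample, total_weight = weighted_reservoir_sample(candidates, random.Random(0))
print(sample, total_weight)
```

Because only the running weight sum and the current choice are stored, the candidate list never has to be held in memory, which is what makes this style of sampling attractive at low sample counts.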

arXiv:2406.15755 [pdf, other]

Fine-grained Background Representation for Weakly Supervised Semantic Segmentation

Authors: Xu Yin, Woobin Im, Dongbo Min, Yuchi Huo, Fei Pan, Sung-Eui Yoon

Abstract: Generating reliable pseudo masks from image-level labels is challenging in the weakly supervised semantic segmentation (WSSS) task due to the lack of spatial information. Prevalent class activation map (CAM)-based solutions are challenged to discriminate the foreground (FG) objects from the suspicious background (BG) pixels (a.k.a. co-occurring) and learn the integral object regions. This paper proposes a simple fine-grained background representation (FBR) method to discover and represent diverse BG semantics and address the co-occurring problems. We abandon using the class prototype or pixel-level features for BG representation. Instead, we develop a novel primitive, negative region of interest (NROI), to capture the fine-grained BG semantic information and conduct the pixel-to-NROI contrast to distinguish the confusing BG pixels. We also present an active sampling strategy to mine the FG negatives on-the-fly, enabling efficient pixel-to-pixel intra-foreground contrastive learning to activate the entire object region. Thanks to the simplicity of design and convenience in use, our proposed method can be seamlessly plugged into various models, yielding new state-of-the-art results under various WSSS settings across benchmarks. Leveraging solely image-level (I) labels as supervision, our method achieves 73.2 mIoU and 45.6 mIoU segmentation results on the PASCAL VOC and MS COCO test sets, respectively. Furthermore, by incorporating saliency maps as an additional supervision signal (I+S), we attain 74.9 mIoU on the PASCAL VOC test set. Concurrently, our FBR approach demonstrates meaningful performance gains in weakly-supervised instance segmentation (WSIS) tasks, showcasing its robustness and strong generalization capabilities across diverse domains.

Submitted 22 June, 2024; originally announced June 2024.

arXiv:2406.14129 [pdf, other]

Towards Event-oriented Long Video Understanding

Authors: Yifan Du, Kun Zhou, Yuqi Huo, Yifan Li, Wayne Xin Zhao, Haoyu Lu, Zijia Zhao, Bingning Wang, Weipeng Chen, Ji-Rong Wen

Abstract: With the rapid development of video Multimodal Large Language Models (MLLMs), numerous benchmarks have been proposed to assess their video understanding capability. However, due to the lack of rich events in the videos, these datasets may suffer from the short-cut bias that the answers can be deduced from a few frames, without the need to watch the entire video. To address this issue, we introduce Event-Bench, an event-oriented long video understanding benchmark built on existing datasets and human annotations. Event-Bench includes six event-related tasks and 2,190 test instances to comprehensively evaluate video event understanding ability. Additionally, we propose Video Instruction Merging (VIM), a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions, addressing the scarcity of human-annotated, event-intensive data. Extensive experiments show that the best-performing model, GPT-4o, achieves an overall accuracy of 53.33, significantly outperforming the best open-source model by 41.42%. Leveraging an effective instruction synthesis method and an adaptive model architecture, VIM surpasses both state-of-the-art open-source models and GPT-4V on the Event-Bench. All code, data, and models are publicly available at https://github.com/RUCAIBox/Event-Bench.

Submitted 20 June, 2024; originally announced June 2024.

Comments: Work in progress

arXiv:2406.11317 [pdf, other]

GUICourse: From General Vision Language Models to Versatile GUI Agents

Authors: Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun

Abstract: Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents to help humans finish GUI navigation tasks. However, current VLMs are challenged in terms of fundamental abilities (OCR and grounding) and GUI knowledge (the functions and control methods of GUI elements), preventing them from becoming practical GUI agents. To solve these challenges, we contribute GUICourse, a suite of datasets to train visual-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents have better performance on common GUI tasks than their baseline VLMs. Even the small-size GUI agent (with 3.1B parameters) can still work well on single-step and multi-step GUI tasks. Finally, we analyze the different varieties in the training stage of this agent by ablation study. Our source codes and datasets are released at https://github.com/yiye3/GUICourse.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11242 [pdf, other]

Accurate and Fast Pixel Retrieval with Spatial and Uncertainty Aware Hypergraph Diffusion

Authors: Guoyuan An, Yuchi Huo, Sung-Eui Yoon

Abstract: This paper presents a novel method designed to enhance the efficiency and accuracy of both image retrieval and pixel retrieval. Traditional diffusion methods struggle to propagate spatial information effectively in conventional graphs due to their reliance on scalar edge weights. To overcome this limitation, we introduce a hypergraph-based framework, uniquely capable of efficiently propagating spatial information using local features during query time, thereby accurately retrieving and localizing objects within a database. Additionally, we innovatively utilize the structural information of the image graph through a technique we term "community selection". This approach allows for the assessment of the initial search result's uncertainty and facilitates an optimal balance between accuracy and speed. This is particularly crucial in real-world applications where such trade-offs are often necessary. Our experimental results, conducted on the (P)ROxford and (P)RParis datasets, demonstrate the significant superiority of our method over existing diffusion techniques. We achieve state-of-the-art (SOTA) accuracy in both image-level and pixel-level retrieval, while also maintaining impressive processing speed. This dual achievement underscores the effectiveness of our hypergraph-based framework and community selection technique, marking a notable advancement in the field of content-based image retrieval.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.09367 [pdf, other]

Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs

Authors: Zijia Zhao, Haoyu Lu, Yuqi Huo, Yifan Du, Tongtian Yue, Longteng Guo, Bingning Wang, Weipeng Chen, Jing Liu

Abstract: Video understanding is a crucial next step for multimodal large language models (MLLMs). To probe specific aspects of video understanding ability, existing video benchmarks typically require careful video selection based on the target capability, along with laborious annotation of query-response pairs to match the specific video content. This process is both challenging and resource-intensive. In this paper, we propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation. VideoNIAH decouples test video content from their query-responses by inserting unrelated image/text 'needles' into original videos. It generates annotations solely from these needles, ensuring diversity in video sources and a variety of query-responses. Additionally, by inserting multiple needles, VideoNIAH rigorously evaluates the temporal understanding capabilities of models. We utilized VideoNIAH to compile a video benchmark VNBench, including tasks such as retrieval, ordering, and counting. VNBench can efficiently evaluate the fine-grained understanding ability and spatio-temporal modeling ability of a video model, while also supporting the long-context evaluation. Additionally, we evaluated recent video-centric multimodal large language models (MLLMs), both open-source and proprietary, providing a comprehensive analysis. We found that although proprietary models have significant advantages over open-source models, all existing video models still perform poorly on long-distance dependency tasks. VideoNIAH is a simple yet highly scalable benchmark construction framework, and we believe it will inspire future video benchmark works. The code and data are available at https://github.com/joez17/VideoNIAH.

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.02430 [pdf, other]

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Authors: Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, et al. (21 additional authors not shown)

Abstract: We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations. With fine-tuning, we achieve even higher subjective scores across these metrics. Seed-TTS offers superior controllability over various speech attributes such as emotion and is capable of generating highly expressive and diverse speech for speakers in the wild. Furthermore, we propose a self-distillation method for speech factorization, as well as a reinforcement learning approach to enhance model robustness, speaker similarity, and controllability. We additionally present a non-autoregressive (NAR) variant of the Seed-TTS model, named $\text{Seed-TTS}_\text{DiT}$, which utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS systems, $\text{Seed-TTS}_\text{DiT}$ does not depend on pre-estimated phoneme durations and performs speech generation through end-to-end processing. We demonstrate that this variant achieves comparable performance to the language model-based variant and showcase its effectiveness in speech editing. We encourage readers to listen to demos at https://bytedancespeech.github.io/seedtts_tech_report.

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2406.02045 [pdf, other]

Experimental single-photon quantum key distribution surpassing the fundamental coherent-state rate limit

Authors: Yang Zhang, Xing Ding, Yang Li, Likang Zhang, Yong-Peng Guo, Gao-Qiang Wang, Zhen Ning, Mo-Chi Xu, Run-Ze Liu, Jun-Yi Zhao, Geng-Yan Zou, Hui Wang, Yuan Cao, Yu-Ming He, Cheng-Zhi Peng, Yong-Heng Huo, Sheng-Kai Liao, Chao-Yang Lu, Feihu Xu, Jian-Wei Pan

Abstract: Single-photon sources are essential for quantum networks, enabling applications ranging from quantum key distribution (QKD) to the burgeoning quantum internet. Despite the remarkable advancements, the current reliance of QKD on attenuated coherent (laser) light sources has imposed a fundamental limit on the secret key rate (SKR). This constraint is primarily attributable to the scarcity of single-photon components within coherent light, confined by an inherent upper bound of 1/e. Here, we report high-rate QKD using a high-efficiency single-photon source, enabling an SKR transcending the fundamental rate limit of coherent light. We developed an on-demand, bright semiconductor quantum-dot single-photon source with an efficiency of 0.71(2), exceeding the inherent bound of coherent light by approximately 2.87 dB. Implementing narrow-bandwidth filtering and random polarization modulation, we conducted a field QKD trial over a 14.6(1.1)-dB-loss free-space urban channel, achieving an SKR of 0.00108 bits per pulse. This surpasses the practical limit of coherent-light-based QKD by 2.53 dB. Our findings conclusively demonstrate the superior performance of nanotechnology-based single-photon sources over coherent light for QKD applications, marking a pivotal stride towards the realization of a global quantum internet.

Submitted 4 June, 2024; originally announced June 2024.

Comments: 22 pages, 5 figures, 1 Table
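
The decibel figure in the QKD abstract can be checked from the two numbers it quotes: a source efficiency of 0.71(2) and the coherent-light bound of 1/e. The following is a minimal sketch of that conversion, assuming the comparison is a plain power ratio expressed as 10 log10; the function and variable names are chosen here for illustration. With the central value 0.71 the result is about 2.86 dB, close to the quoted 2.87 dB, and the stated uncertainty on the efficiency easily covers the difference.

```python
import math


def gain_db(value, reference):
    """Ratio of value to reference, expressed in decibels."""
    return 10.0 * math.log10(value / reference)


source_efficiency = 0.71     # central value of 0.71(2) from the abstract
coherent_bound = 1 / math.e  # single-photon fraction bound for coherent light

advantage_db = gain_db(source_efficiency, coherent_bound)
print(f"advantage over the coherent-light bound: {advantage_db:.2f} dB")
```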

arXiv:2405.17824 [pdf, other]

mTREE: Multi-Level Text-Guided Representation End-to-End Learning for Whole Slide Image Analysis

Authors: Quan Liu, Ruining Deng, Can Cui, Tianyuan Yao, Vishwesh Nath, Yucheng Tang, Yuankai Huo

Abstract: Multi-modal learning adeptly integrates visual and textual data, but its application to histopathology image and text analysis remains challenging, particularly with large, high-resolution images like gigapixel Whole Slide Images (WSIs). Current methods typically rely on manual region labeling or multi-stage learning to assemble local representations (e.g., patch-level) into global features (e.g., slide-level). However, there is no effective way to integrate multi-scale image representations with text data in a seamless end-to-end process. In this study, we introduce Multi-Level Text-Guided Representation End-to-End Learning (mTREE). This novel text-guided approach effectively captures multi-scale WSI representations by utilizing information from accompanying textual pathology information. mTREE innovatively combines the localization of key areas (global-to-local) and the development of a WSI-level image-text representation (local-to-global) into a unified, end-to-end learning framework. In this model, textual information serves a dual purpose: firstly, functioning as an attention map to accurately identify key areas, and secondly, acting as a conduit for integrating textual features into the comprehensive representation of the image. Our study demonstrates the effectiveness of mTREE through quantitative analyses in two image-related tasks: classification and survival prediction, showcasing its remarkable superiority over baselines.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.17568 [pdf, other]

ExtremeMETA: High-speed Lightweight Image Segmentation Model by Remodeling Multi-channel Metamaterial Imagers

Authors: Quan Liu, Brandon T. Swartz, Ivan Kravchenko, Jason G. Valentine, Yuankai Huo

Abstract: Deep neural networks (DNNs) have heavily relied on traditional computational units like CPUs and GPUs. However, this conventional approach brings significant computational burdens, latency issues, and high power consumption, limiting their effectiveness. This has sparked the need for lightweight networks like ExtremeC3Net. On the other hand, there have been notable advancements in optical computational units, particularly with metamaterials, offering the exciting prospect of energy-efficient neural networks operating at the speed of light. Yet, the digital design of metamaterial neural networks (MNNs) faces challenges such as precision, noise, and bandwidth, limiting their application to intuitive tasks and low-resolution images. In this paper, we propose a large kernel lightweight segmentation model, ExtremeMETA. Based on the ExtremeC3Net, the ExtremeMETA maximizes the ability of the first convolution layer by exploring a larger convolution kernel and multiple processing paths. With the proposed large kernel convolution model, we extend the optic neural network application boundary to the segmentation task. To further lighten the computation burden of the digital processing part, a set of model compression methods is applied to improve model efficiency in the inference stage. The experimental results on three publicly available datasets demonstrate that the optimized efficient design improved segmentation performance from 92.45 to 95.97 on mIoU while reducing computational FLOPs from 461.07 MMacs to 166.03 MMacs. The proposed large kernel lightweight model ExtremeMETA showcases the hybrid design's ability on complex tasks.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.16141 [pdf, other]

AIGB: Generative Auto-bidding via Diffusion Modeling

Authors: Jiayan Guo, Yusen Huo, Zhilin Zhang, Tianyu Wang, Chuan Yu, Jian Xu, Yan Zhang, Bo Zheng

Abstract: Auto-bidding plays a crucial role in facilitating online advertising by automatically providing bids for advertisers. Reinforcement learning (RL) has gained popularity for auto-bidding. However, most current RL auto-bidding methods are modeled through the Markovian Decision Process (MDP), which assumes the Markovian state transition. This assumption restricts the ability to perform in long horizon scenarios and makes the model unstable when dealing with highly random online advertising environments. To tackle this issue, this paper introduces AI-Generated Bidding (AIGB), a novel paradigm for auto-bidding through generative modeling. In this paradigm, we propose DiffBid, a conditional diffusion modeling approach for bid generation. DiffBid directly models the correlation between the return and the entire trajectory, effectively avoiding error propagation across time steps in long horizons. Additionally, DiffBid offers a versatile approach for generating trajectories that maximize given targets while adhering to specific constraints. Extensive experiments conducted on the real-world dataset and online A/B test on Alibaba advertising platform demonstrate the effectiveness of DiffBid, achieving 2.81% increase in GMV and 3.36% increase in ROI.

Submitted 27 June, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

Comments: Accepted by KDD 2024

arXiv:2405.14580 [pdf, other]

LDM: Large Tensorial SDF Model for Textured Mesh Generation

Authors: Rengan Xie, Wenting Zheng, Kai Huang, Yizheng Chen, Qi Wang, Qi Ye, Wei Chen, Yuchi Huo

Abstract: Previous efforts have managed to generate production-ready 3D assets from text or images. However, these methods primarily employ NeRF or 3D Gaussian representations, which are not adept at producing smooth, high-quality geometries required by modern rendering pipelines. In this paper, we propose LDM, a novel feed-forward framework capable of generating high-fidelity, illumination-decoupled textured mesh from a single image or text prompts. We firstly utilize a multi-view diffusion model to generate sparse multi-view inputs from single images or text prompts, and then a transformer-based model is trained to predict a tensorial SDF field from these sparse multi-view image inputs. Finally, we employ a gradient-based mesh optimization layer to refine this model, enabling it to produce an SDF field from which high-quality textured meshes can be extracted. Extensive experiments demonstrate that our method can generate diverse, high-quality 3D mesh assets with corresponding decomposed RGB textures within seconds.

Submitted 20 June, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.14097 [pdf, other]

Impact of gauge fixing precision on the continuum limit of non-local quark-bilinear lattice operators

Authors: Kuan Zhang, Yi-Kai Huo, Xiangdong Ji, Andreas Schaefer, Chun-Jiang Shi, Peng Sun, Wei Wang, Yi-Bo Yang, Jian-Hui Zhang

Abstract: We analyze the gauge fixing precision dependence of some non-local quark-bilinear lattice operators interesting in computing parton physics for several measurements, using 5 lattice spacings ranging from 0.032 fm to 0.121 fm. Our results show that gauge dependent non-local measurements are significantly more sensitive to the precision of gauge fixing than anticipated. The impact of imprecise gauge fixing is significant for fine lattices and long distances. For instance, even with the typically defined precision of Landau gauge fixing of $10^{-8}$, the deviation caused by imprecise gauge fixing can reach 12 percent, when calculating the trace of Wilson lines at 1.2 fm with a lattice spacing of approximately 0.03 fm. Similar behavior has been observed in $\xi$ gauge and Coulomb gauge as well. For both quasi PDFs and quasi TMD-PDFs operators renormalized using the RI/MOM scheme, convergence for different lattice spacings at long distance is only observed when the precision of Landau gauge fixing is sufficiently high. To describe these findings quantitatively, we propose an empirical formula to estimate the required precision.

Submitted 22 May, 2024; originally announced May 2024.

Comments: 16 pages, 15 figures

arXiv:2405.11270 [pdf, other]

HR Human: Modeling Human Avatars with Triangular Mesh and High-Resolution Textures from Videos

Authors: Qifeng Chen, Rengan Xie, Kai Huang, Qi Wang, Wenting Zheng, Rong Li, Yuchi Huo

Abstract: Recently, implicit neural representation has been widely used to generate animatable human avatars. However, the materials and geometry of those representations are coupled in the neural network and hard to edit, which hinders their application in traditional graphics engines. We present a framework for acquiring human avatars that are attached with high-resolution physically-based material textures and triangular mesh from monocular video. Our method introduces a novel information fusion strategy to combine the information from the monocular video and synthesize virtual multi-view images to tackle the sparsity of the input view. We reconstruct humans as deformable neural implicit surfaces and extract triangle mesh in a well-behaved pose as the initial mesh of the next stage. In addition, we introduce an approach to correct the bias for the boundary and size of the coarse mesh extracted. Finally, we adapt prior knowledge of the latent diffusion model at super-resolution in multi-view to distill the decomposed texture. Experiments show that our approach outperforms previous representations in terms of high fidelity, and this explicit result supports deployment on common renderers.

Submitted 18 May, 2024; originally announced May 2024.

arXiv:2405.09045 [pdf, other]

AMSNet: Netlist Dataset for AMS Circuits

Authors: Zhuofu Tao, Yichen Shi, Yiru Huo, Rui Ye, Zonghang Li, Li Huang, Chen Wu, Na Bai, Zhi** Yu, Ting-Jung Lin, Lei He

Abstract: Today's analog/mixed-signal (AMS) integrated circuit (IC) designs demand substantial manual intervention. The advent of multimodal large language models (MLLMs) has unveiled significant potential across various fields, suggesting their applicability in streamlining large-scale AMS IC design as well. A bottleneck in employing MLLMs for automatic AMS circuit generation is the absence of a comprehensive dataset delineating the schematic-netlist relationship. We therefore design an automatic technique for converting schematics into netlists, and create dataset AMSNet, encompassing transistor-level schematics and corresponding SPICE format netlists. With a growing size, AMSNet can significantly facilitate exploration of MLLM applications in AMS circuit design. We have made an initial set of netlists public, and will make both our netlist generation tool and the full dataset available upon publishing of this paper.

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2405.03652 [pdf]

Field-of-View Extension for Diffusion MRI via Deep Generative Models

Authors: Chenyu Gao, Shunxing Bao, Michael Kim, Nancy Newlin, Praitayini Kanakaraj, Tianyuan Yao, Gaurav Rudravaram, Yuankai Huo, Daniel Moyer, Kurt Schilling, Walter Kukull, Arthur Toga, Derek Archer, Timothy Hohman, Bennett Landman, Zhiyuan Li

Abstract: Purpose: In diffusion MRI (dMRI), the volumetric and bundle analyses of whole-brain tissue microstructure and connectivity can be severely impeded by an incomplete field-of-view (FOV). This work aims to develop a method for imputing the missing slices directly from existing dMRI scans with an incomplete FOV. We hypothesize that the imputed image with complete FOV can improve the whole-brain tractography for corrupted data with incomplete FOV. Therefore, our approach provides a desirable alternative to discarding the valuable dMRI data, enabling subsequent tractography analyses that would otherwise be challenging or unattainable with corrupted data. Approach: We propose a framework based on a deep generative model that estimates the absent brain regions in dMRI scans with incomplete FOV. The model is capable of learning both the diffusion characteristics in diffusion-weighted images (DWI) and the anatomical features evident in the corresponding structural images for efficiently imputing missing slices of DWI outside of incomplete FOV. Results: For evaluating the imputed slices, on the WRAP dataset the proposed framework achieved PSNRb0=22.397, SSIMb0=0.905, PSNRb1300=22.479, SSIMb1300=0.893; on the NACC dataset it achieved PSNRb0=21.304, SSIMb0=0.892, PSNRb1300=21.599, SSIMb1300=0.877. The proposed framework improved the tractography accuracy, as demonstrated by an increased average Dice score for 72 tracts (p < 0.001) on both the WRAP and NACC datasets.
Conclusions: Results suggest that the proposed framework achieved sufficient imputation performance in dMRI data with incomplete FOV for improving whole-brain tractography, thereby repairing the corrupted data. Our approach achieved more accurate whole-brain tractography results with extended and complete FOV and reduced the uncertainty when analyzing bundles associated with Alzheimer's Disease.

Submitted 6 May, 2024; originally announced May 2024.

Comments: 20 pages, 11 figures

arXiv:2404.13896 [pdf, other]

CT-NeRF: Incremental Optimizing Neural Radiance Field and Poses with Complex Trajectory

Authors: Yunlong Ran, Yanxu Li, Qi Ye, Yuchi Huo, Zechun Bai, Jiahao Sun, Jiming Chen

Abstract: Neural radiance field (NeRF) has achieved impressive results in high-quality 3D scene reconstruction. However, NeRF heavily relies on precise camera poses. While recent works like BARF have introduced camera pose optimization within NeRF, their applicability is limited to simple trajectory scenes. Existing methods struggle while tackling complex trajectories involving large rotations. To address this limitation, we propose CT-NeRF, an incremental reconstruction optimization pipeline using only RGB images without pose and depth input. In this pipeline, we first propose a local-global bundle adjustment under a pose graph connecting neighboring frames to enforce the consistency between poses to escape the local minima caused by only pose consistency with the scene structure. Further, we instantiate the consistency between poses as a reprojected geometric image distance constraint resulting from pixel-level correspondences between input image pairs. Through the incremental reconstruction, CT-NeRF enables the recovery of both camera poses and scene structure and is capable of handling scenes with complex trajectories. We evaluate the performance of CT-NeRF on two real-world datasets, NeRFBuster and Free-Dataset, which feature complex trajectories. Results show CT-NeRF outperforms existing methods in novel view synthesis and pose estimation accuracy.

Submitted 23 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.09707 [pdf, other]

Adaptive Patching for High-resolution Image Segmentation with Transformers

Authors: Enzhi Zhang, Isaac Lyngaas, Peng Chen, Xiao Wang, Jun Igarashi, Yuankai Huo, Mohamed Wahib, Masaharu Munetomo

Abstract: Attention-based models are proliferating in the space of image analytics, including segmentation. The standard method of feeding images to transformer encoders is to divide the images into patches and then feed the patches to the model as a linear sequence of tokens. For high-resolution images, e.g. microscopic pathology images, the quadratic compute and memory cost prohibits the use of an attention-based model, if we are to use smaller patch sizes that are favorable in segmentation. The solution is to either use custom complex multi-resolution models or approximate attention schemes. We take inspiration from Adaptive Mesh Refinement (AMR) methods in HPC by adaptively patching the images, as a pre-processing step, based on the image details to reduce the number of patches being fed to the model, by orders of magnitude. This method has a negligible overhead, and works seamlessly with any attention-based model, i.e. it is a pre-processing step that can be adopted by any attention-based model without friction. We demonstrate superior segmentation quality over SoTA segmentation models for real-world pathology datasets while gaining a geomean speedup of $6.9\times$ for resolutions up to $64K^2$, on up to $2,048$ GPUs.
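The adaptive patching idea in the abstract above can be illustrated with a small quadtree split. This is only a sketch under stated assumptions, not the authors' implementation: the function name `adaptive_patches`, the detail measure (intensity range inside a region), and the `min_size` and `threshold` parameters are all hypothetical, and the image is assumed to be a square grid whose side is a power of two.

```python
def adaptive_patches(image, x0, y0, size, min_size, threshold):
    """Return (x, y, size) square patches covering the region at (x0, y0).

    Hypothetical detail criterion: a region is split into four quadrants
    while its intensity range exceeds `threshold` and it is larger than
    `min_size`. Flat regions are kept as one large patch.
    """
    values = [image[y][x]
              for y in range(y0, y0 + size)
              for x in range(x0, x0 + size)]
    detail = max(values) - min(values)
    if size <= min_size or detail <= threshold:
        return [(x0, y0, size)]
    half = size // 2
    patches = []
    for dy in (0, half):
        for dx in (0, half):
            patches += adaptive_patches(image, x0 + dx, y0 + dy,
                                        half, min_size, threshold)
    return patches


# An 8x8 image that is flat except for one bright pixel in the corner.
image = [[0] * 8 for _ in range(8)]
image[0][0] = 255
patches = adaptive_patches(image, 0, 0, 8, min_size=2, threshold=10)
```

In this toy case only the quadrant containing the bright pixel is refined, so the image is covered by 7 patches where a uniform grid of 2x2 patches would need 16.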

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.00714 [pdf, other]

Neural Radiance Field-based Visual Rendering: A Comprehensive Review

Authors: Mingyuan Yao, Yukang Huo, Yang Ran, Qingbin Tian, Ruifeng Wang, Haihua Wang

Abstract: In recent years, Neural Radiance Fields (NeRF) has made remarkable progress in the field of computer vision and graphics, providing strong technical support for solving key tasks including 3D scene understanding, new perspective synthesis, human body reconstruction, robotics, and so on. The attention of academics to this research result is growing. As a revolutionary neural implicit field representation, NeRF has caused a continuous research boom in the academic community. Therefore, the purpose of this review is to provide an in-depth analysis of the research literature on NeRF within the past two years, to provide a comprehensive academic perspective for budding researchers. In this paper, the core architecture of NeRF is first elaborated in detail, followed by a discussion of various improvement strategies for NeRF, and case studies of NeRF in diverse application scenarios, demonstrating its practical utility in different domains. In terms of datasets and evaluation metrics, this paper details the key resources needed for NeRF model training. Finally, this paper provides a prospective discussion on the future development trends and potential challenges of NeRF, aiming to provide research inspiration for researchers in the field and to promote the further development of related technologies.

Submitted 31 March, 2024; originally announced April 2024.

Comments: 35 pages, 22 figures, 14 tables, 18 formulas

arXiv:2404.00640 [pdf, other]

doi 10.1145/3650212.3652106

Face It Yourselves: An LLM-Based Two-Stage Strategy to Localize Configuration Errors via Logs

Authors: Shiwen Shan, Yintong Huo, Yuxin Su, Yichen Li, Dan Li, Zibin Zheng

Abstract: Configurable software systems are prone to configuration errors, resulting in significant losses to companies. However, diagnosing these errors is challenging due to the vast and complex configuration space. These errors pose significant challenges for both experienced maintainers and new end-users, particularly those without access to the source code of the software systems. Given that logs are easily accessible to most end-users, we conduct a preliminary study to outline the challenges and opportunities of utilizing logs in localizing configuration errors. Based on the insights gained from the preliminary study, we propose an LLM-based two-stage strategy for end-users to localize the root-cause configuration properties based on logs. We further implement a tool, LogConfigLocalizer, aligned with the design of the aforementioned strategy, hoping to assist end-users in coping with configuration errors through log analysis. To the best of our knowledge, this is the first work to localize the root-cause configuration properties for end-users based on Large Language Models (LLMs) and logs. We evaluate the proposed strategy on Hadoop by LogConfigLocalizer and prove its efficiency with an average accuracy as high as 99.91%. Additionally, we also demonstrate the effectiveness and necessity of different phases of the methodology by comparing it with two other variants and a baseline tool. Moreover, we validate the proposed methodology through a practical case study to demonstrate its effectiveness and feasibility.

Submitted 2 April, 2024; v1 submitted 31 March, 2024; originally announced April 2024.

Comments: 13 pages, accepted by ISSTA 2024 (The 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis)

arXiv:2403.17574 [pdf, other]

SPES: Towards Optimizing Performance-Resource Trade-Off for Serverless Functions

Authors: Cheryl Lee, Zhouruixin Zhu, Tianyi Yang, Yintong Huo, Yuxin Su, Pinjia He, Michael R. Lyu

Abstract: As an emerging cloud computing deployment paradigm, serverless computing is gaining traction due to its efficiency and ability to harness on-demand cloud resources. However, a significant hurdle remains in the form of the cold start problem, causing latency when launching new function instances from scratch. Existing solutions tend to use over-simplistic strategies for function pre-loading/unloading without full invocation pattern exploitation, rendering unsatisfactory optimization of the trade-off between cold start latency and resource waste. To bridge this gap, we propose SPES, the first differentiated scheduler for runtime cold start mitigation by optimizing serverless function provision. Our insight is that the common architecture of serverless systems prompts the concentration of certain invocation patterns, leading to predictable invocation behaviors. This allows us to categorize functions and pre-load/unload proper function instances with finer-grained strategies based on accurate invocation prediction. Experiments demonstrate the success of SPES in optimizing serverless function provision on both sides: reducing the 75th-percentile cold start rates by 49.77% and the wasted memory time by 56.43%, compared to the state-of-the-art. By mitigating the cold start issue, SPES is a promising advancement in facilitating cloud services deployed on serverless architectures.

Submitted 26 March, 2024; originally announced March 2024.

Comments: 12 pages, accepted by ICDE 2024 (40th IEEE International Conference on Data Engineering)

arXiv:2403.11945 [pdf, other]

Kernel Modelling of Fading Memory Systems

Authors: Yongkang Huo, Thomas Chaffey, Rodolphe Sepulchre

Abstract: The paper introduces a kernel-based framework to model and identify time-invariant systems with the fading memory property. The key departure from the previous literature is to bypass the state-space representation of the model. Instead, a kernel representation is used to directly model the memory functional that maps past inputs to the present output. We explore the versatility of this approach to encode important system properties in the hyperparameters of the kernel. The approach is illustrated on the Hodgkin and Huxley model of neuronal excitability.

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.11626 [pdf, other]

QEAN: Quaternion-Enhanced Attention Network for Visual Dance Generation

Authors: Zhizhen Zhou, Ye**g Huo, Guoheng Huang, An Zeng, Xuhang Chen, Lian Huang, Zinuo Li

Abstract: The study of music-generated dance is a novel and challenging image generation task. It aims to input a piece of music and seed motions, then generate natural dance movements for the subsequent music. Transformer-based methods face challenges in time series prediction tasks related to human movements and music due to their struggle in capturing the nonlinear relationship and temporal aspects. This can lead to issues like joint deformation, role deviation, floating, and inconsistencies in dance movements generated in response to the music. In this paper, we propose a Quaternion-Enhanced Attention Network (QEAN) for visual dance synthesis from a quaternion perspective, which consists of a Spin Position Embedding (SPE) module and a Quaternion Rotary Attention (QRA) module. First, SPE embeds position information into self-attention in a rotational manner, leading to better learning of features of movement sequences and audio sequences, and improved understanding of the connection between music and dance. Second, QRA represents and fuses 3D motion features and audio features in the form of a series of quaternions, enabling the model to better learn the temporal coordination of music and dance under the complex temporal cycle conditions of dance generation. Finally, we conducted experiments on the dataset AIST++, and the results show that our approach achieves better and more robust performance in generating accurate, high-quality dance movements.
Our source code and dataset are available from https://github.com/MarasyZZ/QEAN and https://google.github.io/aistplusplus_dataset respectively.

Submitted 18 March, 2024; originally announced March 2024.

Comments: Accepted by The Visual Computer Journal

arXiv:2403.11507 [pdf, other]

Circle Representation for Medical Instance Object Segmentation

Authors: Juming Xiong, Ethan H. Nguyen, Yilin Liu, Ruining Deng, Regina N Tyree, Hernan Correa, Girish Hiremath, Yaohong Wang, Haichun Yang, Agnes B. Fogo, Yuankai Huo

Abstract: Recently, circle representation has been introduced for medical imaging, designed specifically to enhance the detection of instance objects that are spherically shaped (e.g., cells, glomeruli, and nuclei). Given its outstanding effectiveness in instance detection, it is compelling to consider the application of circle representation for segmenting instance medical objects. In this study, we introduce CircleSnake, a simple end-to-end segmentation approach that utilizes circle contour deformation for segmenting ball-shaped medical objects at the instance level. The innovation of CircleSnake lies in these three areas: (1) It substitutes the complex bounding box-to-octagon contour transformation with a more consistent and rotation-invariant bounding circle-to-circle contour adaptation. This adaptation specifically targets ball-shaped medical objects. (2) The circle representation employed in CircleSnake significantly reduces the degrees of freedom to two, compared to eight in the octagon representation. This reduction enhances both the robustness of the segmentation performance and the rotational consistency of the method. (3) CircleSnake is the first end-to-end deep instance segmentation pipeline to incorporate circle representation, encompassing consistent circle detection, circle contour proposal, and circular convolution in a unified framework. This integration is achieved through the novel application of circular graph convolution within the context of circle detection and instance segmentation.
In practical applications, such as the detection of glomeruli, nuclei, and eosinophils in pathological images, CircleSnake has demonstrated superior performance and greater rotation invariance when compared to benchmarks. The code has been made publicly available: https://github.com/hrlblab/CircleSnake.

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.07728 [pdf, other]

CAP: A General Algorithm for Online Selective Conformal Prediction with FCR Control

Authors: Yajie Bao, Yuyang Huo, Haojie Ren, Changliang Zou

Abstract: We study the problem of post-selection predictive inference in an online fashion. To avoid devoting resources to unimportant units, a preliminary selection of the current individual before reporting its prediction interval is common and meaningful in online predictive tasks. Since the online selection causes a temporal multiplicity in the selected prediction intervals, it is important to control the real-time false coverage-statement rate (FCR) which measures the overall miscoverage level. We develop a general framework named CAP (Calibration after Adaptive Pick) that performs an adaptive pick rule on historical data to construct a calibration set if the current individual is selected and then outputs a conformal prediction interval for the unobserved label. We provide tractable procedures for constructing the calibration set for popular online selection rules. We proved that CAP can achieve an exact selection-conditional coverage guarantee in the finite-sample and distribution-free regimes. To account for the distribution shift in online data, we also embed CAP into some recent dynamic conformal prediction algorithms and show that the proposed method can deliver long-run FCR control. Numerical results on both synthetic and real data corroborate that CAP can effectively control FCR around the target level and yield more narrowed prediction intervals over existing baselines across various settings.
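For readers unfamiliar with the conformal prediction interval that the abstract above builds on, the following is a minimal sketch of standard split conformal prediction with absolute-residual scores. It is not the CAP procedure: CAP additionally constructs the calibration set with an adaptive pick rule after selection, which is not shown here. The function name and argument names are hypothetical.

```python
import math


def split_conformal_interval(cal_preds, cal_labels, new_pred, alpha):
    """Standard split conformal interval around `new_pred`.

    Uses absolute residuals on a held-out calibration set as scores and
    the ceil((n + 1) * (1 - alpha))-th smallest score as the half-width.
    """
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this level: interval is unbounded.
        return (-math.inf, math.inf)
    q = scores[rank - 1]
    return (new_pred - q, new_pred + q)


# Nine calibration points with residuals 1..9 and a target miscoverage of 0.5.
interval = split_conformal_interval([0] * 9, list(range(1, 10)), 5, alpha=0.5)
```

Under exchangeability of calibration and test points, an interval built this way covers the unobserved label with probability at least 1 - alpha. Selection breaks that exchangeability, which is the problem the abstract addresses.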

Submitted 28 March, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.06640 [pdf, other]

doi 10.1109/LCSYS.2024.3408065

Passive iFIR Filters for Data-Driven Control

Authors: Zixing Wang, Yongkang Huo, Fulvio Forni

Abstract: We consider the design of a new class of passive iFIR controllers given by the parallel action of an integrator and a finite impulse response filter. iFIRs are more expressive than PID controllers but retain their features and simplicity. The paper provides a model-free data-driven design for passive iFIR controllers based on virtual reference feedback tuning. Passivity is enforced through constrained optimization (three different formulations are discussed). The proposed design does not rely on large datasets or accurate plant models.
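The "parallel action of an integrator and a finite impulse response filter" can be sketched in discrete time as u[k] = ki * (e[0] + ... + e[k]) + h[0] * e[k] + h[1] * e[k-1] + ... The class below is an illustrative assumption rather than the paper's formulation: the name `IFIRController`, the rectangular (running-sum) integrator, and the example gains are hypothetical, and neither the virtual reference feedback tuning step nor the passivity constraints are implemented.

```python
from collections import deque


class IFIRController:
    """Integrator in parallel with an FIR filter, both acting on the error."""

    def __init__(self, ki, fir_taps):
        self.ki = ki
        self.taps = list(fir_taps)
        self.integral = 0.0
        # Past errors, most recent first, initialised to zero.
        self.history = deque([0.0] * len(self.taps), maxlen=len(self.taps))

    def step(self, error):
        """Consume one error sample and return the control output."""
        self.integral += error
        self.history.appendleft(error)  # oldest sample drops off the right
        fir_out = sum(h * e for h, e in zip(self.taps, self.history))
        return self.ki * self.integral + fir_out


# Hypothetical gains: integrator gain 0.5 and a two-tap FIR filter.
ctrl = IFIRController(ki=0.5, fir_taps=[1.0, 2.0])
outputs = [ctrl.step(e) for e in (1.0, 0.0, 0.0)]
```

With a single-tap FIR filter this reduces to a PI controller. A two-tap filter whose taps are kd and -kd contributes a finite-difference derivative term on top of the proportional part, which is one way to see why the abstract calls iFIRs more expressive than PID.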

Submitted 29 June, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

Comments: 6 pages, 7 figures, Accepted by IEEE Control Systems Letters (L-CSS) with the option to present it to 2024 Conference on Decision and Control (CDC 2024)

Journal ref: IEEE Control Systems Letters, vol. 8, pp. 1289-1294, 2024

arXiv:2402.19286 [pdf, other]

PrPSeg: Universal Proposition Learning for Panoramic Renal Pathology Segmentation

Authors: Ruining Deng, Quan Liu, Can Cui, Tianyuan Yao, Jialin Yue, Juming Xiong, Lining Yu, Yifei Wu, Mengmeng Yin, Yu Wang, Shilin Zhao, Yucheng Tang, Haichun Yang, Yuankai Huo

Abstract: Understanding the anatomy of renal pathology is crucial for advancing disease diagnostics, treatment evaluation, and clinical research. The complex kidney system comprises various components across multiple levels, including regions (cortex, medulla), functional units (glomeruli, tubules), and cells (podocytes, mesangial cells in glomerulus). Prior studies have predominantly overlooked the intricate spatial interrelations among objects from clinical knowledge. In this research, we introduce a novel universal proposition learning approach, called panoramic renal pathology segmentation (PrPSeg), designed to segment comprehensively panoramic structures within kidney by integrating extensive knowledge of kidney anatomy. In this paper, we propose (1) the design of a comprehensive universal proposition matrix for renal pathology, facilitating the incorporation of classification and spatial relationships into the segmentation process; (2) a token-based dynamic head single network architecture, with the improvement of the partial label image segmentation and capability for future data enlargement; and (3) an anatomy loss function, quantifying the inter-object relationships across the kidney.

Submitted 20 March, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

Comments: IEEE / CVF Computer Vision and Pattern Recognition Conference 2024

arXiv:2402.15102 [pdf, other]

Trajectory-wise Iterative Reinforcement Learning Framework for Auto-bidding

Authors: Haoming Li, Yusen Huo, Shuai Dou, Zhenzhe Zheng, Zhilin Zhang, Chuan Yu, Jian Xu, Fan Wu

Abstract: In online advertising, advertisers participate in ad auctions to acquire ad opportunities, often by utilizing auto-bidding tools provided by demand-side platforms (DSPs). The current auto-bidding algorithms typically employ reinforcement learning (RL). However, due to safety concerns, most RL-based auto-bidding policies are trained in simulation, leading to a performance degradation when deployed in online environments. To narrow this gap, we can deploy multiple auto-bidding agents in parallel to collect a large interaction dataset. Offline RL algorithms can then be utilized to train a new policy. The trained policy can subsequently be deployed for further data collection, resulting in an iterative training framework, which we refer to as iterative offline RL. In this work, we identify the performance bottleneck of this iterative offline RL framework, which originates from the ineffective exploration and exploitation caused by the inherent conservatism of offline RL algorithms. To overcome this bottleneck, we propose Trajectory-wise Exploration and Exploitation (TEE), which introduces a novel data collecting and data utilization method for iterative offline RL from a trajectory perspective. Furthermore, to ensure the safety of online exploration while preserving the dataset quality for TEE, we propose Safe Exploration by Adaptive Action Selection (SEAS). Both offline experiments and real-world experiments on Alibaba display advertising platform demonstrate the effectiveness of our proposed method.

Submitted 8 April, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

Comments: Accepted by The Web Conference 2024 (WWW'24) as an oral paper

arXiv:2402.12958 [pdf, other]

Go Static: Contextualized Logging Statement Generation

Authors: Yichen Li, Yintong Huo, Renyi Zhong, Zhihan Jiang, Jinyang Liu, Junjie Huang, Jiazhen Gu, Pinjia He, Michael R. Lyu

Abstract: Logging practices have been extensively investigated to assist developers in writing appropriate logging statements for documenting software behaviors. Although numerous automatic logging approaches have been proposed, their performance remains unsatisfactory due to the constraint of the single-method input, without informative programming context outside the method. Specifically, we identify three inherent limitations with single-method context: limited static scope of logging statements, inconsistent logging styles, and missing type information of logging variables. To tackle these limitations, we propose SCLogger, the first contextualized logging statement generation approach with inter-method static contexts. First, SCLogger extracts inter-method contexts with static analysis to construct the contextualized prompt for language models to generate a tentative logging statement. The contextualized prompt consists of an extended static scope and sampled similar methods, ordered by the chain-of-thought (COT) strategy. Second, SCLogger refines the access of logging variables by formulating a new refinement prompt for language models, which incorporates detailed type information of variables in the tentative logging statement. The evaluation results show that SCLogger surpasses the state-of-the-art approach by 8.7% in logging position accuracy, 32.1% in level accuracy, 19.6% in variable precision, and 138.4% in text BLEU-4 score. Furthermore, SCLogger consistently boosts the performance of logging statement generation across a range of large language models, thereby showcasing the generalizability of this approach.

Submitted 20 February, 2024; originally announced February 2024.

Comments: This paper was accepted by The ACM International Conference on the Foundations of Software Engineering (FSE 2024)

arXiv:2402.10937 [pdf]

A Lightweight Inception Boosted U-Net Neural Network for Routability Prediction

Authors: Hailiang Li, Yan Huo, Yan Wang, Xu Yang, Miaohui Hao, Xiao Wang

Abstract: As the modern CPU, GPU, and NPU chip design complexity and transistor counts keep increasing, and with the relentless shrinking of semiconductor technology nodes to nearly 1 nanometer, the placement and routing have gradually become the two most pivotal processes in modern very-large-scale-integrated (VLSI) circuit back-end design. How to evaluate routability efficiently and accurately in advance (at the placement and global routing stages) has grown into a crucial research area in the field of artificial intelligence (AI) assisted electronic design automation (EDA). In this paper, we propose a novel U-Net variant model boosted by an Inception embedded module to predict Routing Congestion (RC) and Design Rule Checking (DRC) hotspots. Experimental results on the recently published CircuitNet dataset benchmark show that our proposed method achieves up to 5% (RC) and 20% (DRC) rate reduction in terms of Avg-NRMSE (Average Normalized Root Mean Square Error) compared to the classic architecture. Furthermore, our approach consistently outperforms the prior model on the SSIM (Structural Similarity Index Measure) metric.

Submitted 7 February, 2024; originally announced February 2024.

Comments: The paper is submitted to the International Symposium of EDA (2024, XiAn, China)

arXiv:2402.03630 [pdf, other]

Enhancing LLM-Based Coding Tools through Native Integration of IDE-Derived Static Context

Authors: Yichen Li, Yun Peng, Yintong Huo, Michael R. Lyu

Abstract: Large Language Models (LLMs) have achieved remarkable success in code completion, as evidenced by their essential roles in developing code assistant services such as Copilot. Being trained on in-file contexts, current LLMs are quite effective in completing code for single source files. However, it is challenging for them to conduct repository-level code completion for large software projects that require cross-file information. Existing research on LLM-based repository-level code completion identifies and integrates cross-file contexts, but it suffers from low accuracy and limited context length of LLMs. In this paper, we argue that Integrated Development Environments (IDEs) can provide direct, accurate and real-time cross-file information for repository-level code completion. We propose IDECoder, a practical framework that leverages IDE native static contexts for cross-context construction and diagnosis results for self-refinement. IDECoder utilizes the rich cross-context information available in IDEs to enhance the capabilities of LLMs of repository-level code completion. We conducted preliminary experiments to validate the performance of IDECoder and observed that this synergy represents a promising trend for future exploration.

Submitted 19 February, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

arXiv:2402.00028 [pdf, other]

Neural Rendering and Its Hardware Acceleration: A Review

Authors: Xinkai Yan, Jieting Xu, Yuchi Huo, Hujun Bao

Abstract: Neural rendering is a new image and video generation method based on deep learning. It combines the deep learning model with the physical knowledge of computer graphics, to obtain a controllable and realistic scene model, and realize the control of scene attributes such as lighting, camera parameters, posture and so on. On the one hand, neural rendering can not only make full use of the advantages of deep learning to accelerate the traditional forward rendering process, but also provide new solutions for specific tasks such as inverse rendering and 3D reconstruction. On the other hand, the design of innovative hardware structures that adapt to the neural rendering pipeline breaks through the parallel computing and power consumption bottleneck of existing graphics processors, which is expected to provide important support for future key areas such as virtual and augmented reality, film and television creation and digital entertainment, artificial intelligence and the metaverse. In this paper, we review the technical connotation, main challenges, and research progress of neural rendering. On this basis, we analyze the common requirements of neural rendering pipeline for hardware acceleration and the characteristics of the current hardware acceleration architecture, and then discuss the design challenges of neural rendering processor architecture. Finally, the future development trend of neural rendering processor architecture is prospected.

Submitted 6 January, 2024; originally announced February 2024.

arXiv:2401.17410 [pdf, other]

Multi-Species Cohesion: Humans, machinery, AI and beyond

Authors: Frank Yingjie Huo, Pedro D. Manrique, Neil F. Johnson

Abstract: What large-scale cohesive behaviors -- desirable or dangerous -- can suddenly emerge from systems with interacting humans, machinery and software including AI? When will they emerge? How will they evolve and be controlled? Here we offer some answers to these urgent questions by introducing an aggregation model that accounts for entities' inter- and intra-species diversities. It yields a novel multi-dimensional generalization of existing aggregation physics. We derive exact analytic solutions for the time-to-cohesion and growth-of-cohesion for two species, and some generalizations for an arbitrary number of species. These solutions reproduce -- and offer a microscopic explanation for -- an anomalous nonlinear growth feature observed in related real-world systems, e.g. Hamas-Hezbollah online support, human-machine team interactions, AI-determined topic coherence. A key takeaway is that good and bad 'surprises' will appear increasingly quickly as humans-machinery-AI etc. mix more -- but the theory offers a rigorous approach for understanding and controlling this.

Submitted 30 January, 2024; originally announced January 2024.

arXiv:2401.17022 [pdf, other]

doi 10.1126/science.ado3912

Realization of fractional quantum Hall state with interacting photons

Authors: Can Wang, Feng-Ming Liu, Ming-Cheng Chen, He Chen, Xian-He Zhao, Chong Ying, Zhong-Xia Shang, Jian-Wen Wang, Yong-Heng Huo, Cheng-Zhi Peng, Xiaobo Zhu, Chao-Yang Lu, Jian-Wei Pan

Abstract: Fractional quantum Hall (FQH) states, known for their robust topological order and the emergence of non-Abelian anyons, have captured significant interest due to the appealing applications in fault-tolerant quantum computing. Bottom-up approach on an engineered quantum platform will provide opportunities to operate FQH states without external magnetic field and enhance local and coherent manipulation of these exotic states. Here we demonstrate a lattice version of photon FQH state using a programmable on-chip platform based on photon blockade and engineering gauge fields on a novel two-dimensional circuit quantum electrodynamics (QED) system. We first observe the effective photon Lorentz force and butterfly spectrum in the artificial gauge field, a prerequisite for FQH states. After adiabatic assembly of Laughlin FQH wavefunction of 1/2 filling factor from localized photons, we observe strong density correlation and chiral topological flow among the FQH photons. We then verify the unique features of FQH states in response to external fields, including the incompressibility of generating quasiparticles and the smoking-gun signature of fractional quantum Hall conductivity. Our work represents a significant advance in the bottom-up creation and manipulation of novel strongly correlated topological quantum matter composed of photons and opens up possibilities for fault-tolerant quantum information devices.

Submitted 30 January, 2024; originally announced January 2024.

Comments: 8 pages, 6 figures

Journal ref: Science 384, 579-584 (2024)

arXiv:2401.16065 [pdf, other]

Atomistic-Level Analysis of Nanoindentation-Induced Plasticity in Arc-Melted NiFeCrCo Alloys: The role of stacking faults

Authors: F. J. Dominguez-Gutierrez, A. Olejarz, M. Landeiro Dos Reis, E. Wyszkowska, D. Kalita, W. Y. Huo, I. Jozwik, L. Kurpaska, S. Papanikolaou, M. J. Alava, K. Muszka

Abstract: Concentrated solid solution alloys (CSAs) have attracted attention for their promising properties; however, current manufacturing methods face challenges in complexity, high costs, and limited scalability, raising concerns about industrial viability. The prevalent technique, arc melting, yields high-purity samples with complex shapes. In this study, we explore nanoindentation tests at room temperature where arc-melted samples exhibit larger grain sizes, diminishing the effects of grain boundaries on the results. Motivated by these findings, our investigation focuses on the atomistic-level exploration of plasticity mechanisms, specifically dislocation nucleation and propagation during nanoindentation tests. The intricate chemistry of NiFeCrCo CSA influences pile-ups and slip traces, aiming to elucidate plastic deformation by considering both pristine and pre-existing stacking fault tetrahedra. Our analysis scrutinizes dynamic deformation processes, defect nucleation, and evolution, complemented by stress-strain and dislocation densities-strain curves illustrating the hardening mechanism of defective materials. Additionally, we examine surface morphology and plastic deformation through atomic shear strain and displacement mappings. This integrated approach provides insights into the complex interplay between material structure and mechanical behavior, paving the way for an enhanced understanding and potential advancements in CSA applications.

Submitted 29 January, 2024; originally announced January 2024.

arXiv:2401.15841 [pdf, other]

2L3: Lifting Imperfect Generated 2D Images into Accurate 3D

Authors: Yizheng Chen, Rengan Xie, Qi Ye, Sen Yang, Zixuan Xie, Tianxiao Chen, Rong Li, Yuchi Huo

Abstract: Reconstructing 3D objects from a single image is an intriguing but challenging problem. One promising solution is to utilize multi-view (MV) 3D reconstruction to fuse generated MV images into consistent 3D objects. However, the generated images usually suffer from inconsistent lighting, misaligned geometry, and sparse views, leading to poor reconstruction quality. To cope with these problems, we present a novel 3D reconstruction framework that leverages intrinsic decomposition guidance, transient-mono prior guidance, and view augmentation to cope with the three issues, respectively. Specifically, we first leverage intrinsic decomposition to decouple the shading information from the generated images to reduce the impact of inconsistent lighting; then, we introduce mono prior with view-dependent transient encoding to enhance the reconstructed normal; and finally, we design a view augmentation fusion strategy that minimizes pixel-level loss in generated sparse views and semantic loss in augmented random views, resulting in view-consistent geometry and detailed textures. Our approach, therefore, enables the integration of a pre-trained MV image generator and a neural network-based volumetric signed distance function (SDF) representation for a single image to 3D object reconstruction. We evaluate our framework on various datasets and demonstrate its superior performance in both quantitative and qualitative assessments, signifying a significant advancement in 3D object reconstruction. Compared with the latest state-of-the-art method Syncdreamer, we reduce the Chamfer Distance error by about 36% and improve PSNR by about 30%.

Submitted 28 January, 2024; originally announced January 2024.

arXiv:2401.14150 [pdf]

Polarized and bright telecom C-band single-photon source from InP-based quantum dots coupled to elliptical Bragg gratings

Authors: Zhenxuan Ge, Tunghsun Chung, Yu-Ming He, Mohamed Benyoucef, Yongheng Huo

Abstract: Bright, polarized, and high-purity single-photon sources in telecom wavelengths are crucial components in long-distance quantum communication, optical quantum computation and quantum networks. Semiconductor InAs/InP quantum dots (QDs) combined with photonic cavities provide a competitive path leading to optimal single-photon sources in this range. Here, we demonstrate a bright and polarized single-photon source operating in the telecom C-band based on an elliptical Bragg grating (EBG) cavity. With a significant Purcell enhancement of 5.25$\pm$0.05, the device achieves a polarization ratio of 0.986, single-photon purity of g^2 (0)=0.078$\pm$0.016 and single-polarized photon collection efficiency of ~ 24% at the first lens (NA=0.65) without blinking. These findings suggest that C-band QD-based single-photon sources are potential candidates for advancing quantum communication.

Submitted 25 January, 2024; originally announced January 2024.

arXiv:2401.07854 [pdf, other]

$M^{2}$Fusion: Bayesian-based Multimodal Multi-level Fusion on Colorectal Cancer Microsatellite Instability Prediction

Authors: Quan Liu, Jiawen Yao, Lisha Yao, Xin Chen, Jingren Zhou, Le Lu, Ling Zhang, Zaiyi Liu, Yuankai Huo

Abstract: Colorectal cancer (CRC) micro-satellite instability (MSI) prediction on histopathology images is a challenging weakly supervised learning task that involves multi-instance learning on gigapixel images. To date, radiology images have proven to have CRC MSI information and efficient patient imaging techniques. Different data modalities integration offers the opportunity to increase the accuracy and robustness of MSI prediction. Despite the progress in representation learning from the whole slide images (WSI) and exploring the potential of making use of radiology data, CRC MSI prediction remains a challenge to fuse the information from multiple data modalities (e.g., pathology WSI and radiology CT image). In this paper, we propose $M^{2}$Fusion: a Bayesian-based multimodal multi-level fusion pipeline for CRC MSI. The proposed fusion model $M^{2}$Fusion is capable of discovering more novel patterns within and across modalities that are beneficial for predicting MSI than using a single modality alone, as well as other fusion methods. The contribution of the paper is three-fold: (1) $M^{2}$Fusion is the first pipeline of multi-level fusion on pathology WSI and 3D radiology CT image for MSI prediction; (2) CT images are the first time integrated into multimodal fusion for CRC MSI prediction; (3) feature-level fusion strategy is evaluated on both Transformer-based and CNN-based method. Our approach is validated on cross-validation of 352 cases and outperforms either feature-level (0.8177 vs. 0.7908) or decision-level fusion strategy (0.8177 vs. 0.7289) on AUC score.

Submitted 15 January, 2024; originally announced January 2024.

arXiv:2401.07654 [pdf, other]

Foundation Models for Biomedical Image Segmentation: A Survey

Authors: Ho Hin Lee, Yu Gu, Theodore Zhao, Yanbo Xu, Jianwei Yang, Naoto Usuyama, Cliff Wong, Mu Wei, Bennett A. Landman, Yuankai Huo, Alberto Santamaria-Pang, Hoifung Poon

Abstract: Recent advancements in biomedical image analysis have been significantly driven by the Segment Anything Model (SAM). This transformative technology, originally developed for general-purpose computer vision, has found rapid application in medical image processing. Within the last year, marked by over 100 publications, SAM has demonstrated its prowess in zero-shot learning adaptations for medical imaging. The fundamental premise of SAM lies in its capability to segment or identify objects in images without prior knowledge of the object type or imaging modality. This approach aligns well with tasks achievable by the human visual system, though its application in non-biological vision contexts remains more theoretically challenging. A notable feature of SAM is its ability to adjust segmentation according to a specified resolution scale or area of interest, akin to semantic priming. This adaptability has spurred a wave of creativity and innovation in applying SAM to medical imaging. Our review focuses on the period from April 1, 2023, to September 30, 2023, a critical first six months post-initial publication. We examine the adaptations and integrations of SAM necessary to address longstanding clinical challenges, particularly in the context of 33 open datasets covered in our analysis. While SAM approaches or achieves state-of-the-art performance in numerous applications, it falls short in certain areas, such as segmentation of the carotid artery, adrenal glands, optic nerve, and mandible bone. Our survey delves into the innovative techniques where SAM's foundational approach excels and explores the core concepts in translating and applying these models effectively in diverse medical imaging scenarios.

Submitted 15 January, 2024; originally announced January 2024.

Comments: 22 pages, 4 figures, 7 tables

arXiv:2401.06798 [pdf]

Evaluation of Mean Shift, ComBat, and CycleGAN for Harmonizing Brain Connectivity Matrices Across Sites

Authors: Hanliang Xu, Nancy R. Newlin, Michael E. Kim, Chenyu Gao, Praitayini Kanakaraj, Aravind R. Krishnan, Lucas W. Remedios, Nazirah Mohd Khairi, Kimberly Pechman, Derek Archer, Timothy J. Hohman, Angela L. Jefferson, The BIOCARD Study Team, Ivana Isgum, Yuankai Huo, Daniel Moyer, Kurt G. Schilling, Bennett A. Landman

Abstract: Connectivity matrices derived from diffusion MRI (dMRI) provide an interpretable and generalizable way of understanding the human brain connectome. However, dMRI suffers from inter-site and between-scanner variation, which impedes analysis across datasets to improve robustness and reproducibility of results. To evaluate different harmonization approaches on connectivity matrices, we compared graph measures derived from these matrices before and after applying three harmonization techniques: mean shift, ComBat, and CycleGAN. The sample comprises 168 age-matched, sex-matched normal subjects from two studies: the Vanderbilt Memory and Aging Project (VMAP) and the Biomarkers of Cognitive Decline Among Normal Individuals (BIOCARD). First, we plotted the graph measures and used coefficient of variation (CoV) and the Mann-Whitney U test to evaluate different methods' effectiveness in removing site effects on the matrices and the derived graph measures. ComBat effectively eliminated site effects for global efficiency and modularity and outperformed the other two methods. However, all methods exhibited poor performance when harmonizing average betweenness centrality. Second, we tested whether our harmonization methods preserved correlations between age and graph measures. All methods except for CycleGAN in one direction improved correlations between age and global efficiency and between age and modularity from insignificant to significant with p-values less than 0.05.

Submitted 24 January, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

Comments: 11 pages, 5 figures, to be published in SPIE Medical Imaging 2024: Image Processing

arXiv:2401.05602 [pdf]

Nucleus subtype classification using inter-modality learning

Authors: Lucas W. Remedios, Shunxing Bao, Samuel W. Remedios, Ho Hin Lee, Leon Y. Cai, Thomas Li, Ruining Deng, Can Cui, Jia Li, Qi Liu, Ken S. Lau, Joseph T. Roland, Mary K. Washington, Lori A. Coburn, Keith T. Wilson, Yuankai Huo, Bennett A. Landman

Abstract: Understanding the way cells communicate, co-locate, and interrelate is essential to understanding human physiology. Hematoxylin and eosin (H&E) staining is ubiquitously available both for clinical studies and research. The Colon Nucleus Identification and Classification (CoNIC) Challenge has recently innovated on robust artificial intelligence labeling of six cell types on H&E stains of the colon. However, this is a very small fraction of the number of potential cell classification types. Specifically, the CoNIC Challenge is unable to classify epithelial subtypes (progenitor, endocrine, goblet), lymphocyte subtypes (B, helper T, cytotoxic T), or connective subtypes (fibroblasts, stromal). In this paper, we propose to use inter-modality learning to label previously un-labelable cell types on virtual H&E. We leveraged multiplexed immunofluorescence (MxIF) histology imaging to identify 14 subclasses of cell types. We performed style transfer to synthesize virtual H&E from MxIF and transferred the higher density labels from MxIF to these virtual H&E images. We then evaluated the efficacy of learning in this approach. We identified helper T and progenitor nuclei with positive predictive values of $0.34 \pm 0.15$ (prevalence $0.03 \pm 0.01$) and $0.47 \pm 0.1$ (prevalence $0.07 \pm 0.02$) respectively on virtual H&E. This approach represents a promising step towards automating annotation in digital pathology.

Submitted 28 January, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

arXiv:2401.03060 [pdf]

Super-resolution multi-contrast unbiased eye atlases with deep probabilistic refinement

Authors: Ho Hin Lee, Adam M. Saunders, Michael E. Kim, Samuel W. Remedios, Lucas W. Remedios, Yucheng Tang, Qi Yang, Xin Yu, Shunxing Bao, Chloe Cho, Louise A. Mawn, Tonia S. Rex, Kevin L. Schey, Blake E. Dewey, Jeffrey M. Spraggins, Jerry L. Prince, Yuankai Huo, Bennett A. Landman

Abstract: Purpose: Eye morphology varies significantly across the population, especially for the orbit and optic nerve. These variations limit the feasibility and robustness of generalizing population-wise features of eye organs to an unbiased spatial reference. Approach: To tackle these limitations, we propose a process for creating high-resolution unbiased eye atlases. First, to restore spatial details from scans with a low through-plane resolution compared to a high in-plane resolution, we apply a deep learning-based super-resolution algorithm. Then, we generate an initial unbiased reference with an iterative metric-based registration using a small portion of subject scans. We register the remaining scans to this template and refine the template using an unsupervised deep probabilistic approach that generates a more expansive deformation field to enhance the organ boundary alignment. We demonstrate this framework using magnetic resonance images across four different tissue contrasts, generating four atlases in separate spatial alignments. Results: For each tissue contrast, we find a significant improvement using the Wilcoxon signed-rank test in the average Dice score across four labeled regions compared to a standard registration framework consisting of rigid, affine, and deformable transformations. These results highlight the effective alignment of eye organs and boundaries using our proposed process.
Conclusions: By combining super-resolution preprocessing and deep probabilistic models, we address the challenge of generating an eye atlas to serve as a standardized reference across a largely variable population.

Submitted 14 June, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

Comments: Revised for submission to SPIE Journal of Medical Imaging. 26 pages, 6 figures

arXiv:2312.16425 [pdf, other]

In-Hand 3D Object Reconstruction from a Monocular RGB Video

Authors: Shijian Jiang, Qi Ye, Rengan Xie, Yuchi Huo, Xiang Li, Yang Zhou, Jiming Chen

Abstract: Our work aims to reconstruct a 3D object that is held and rotated by a hand in front of a static RGB camera. Previous methods that use implicit neural representations to recover the geometry of a generic hand-held object from multi-view images achieved compelling results in the visible part of the object. However, these methods falter in accurately capturing the shape within the hand-object contact region due to occlusion. In this paper, we propose a novel method that deals with surface reconstruction under occlusion by incorporating priors of 2D occlusion elucidation and physical contact constraints. For the former, we introduce an object amodal completion network to infer the 2D complete mask of objects under occlusion. To ensure the accuracy and view consistency of the predicted 2D amodal masks, we devise a joint optimization method for both amodal mask refinement and 3D reconstruction. For the latter, we impose penetration and attraction constraints on the local geometry in contact regions. We evaluate our approach on HO3D and HOD datasets and demonstrate that it outperforms the state-of-the-art methods in terms of reconstruction surface quality, with an improvement of $52\%$ on HO3D and $20\%$ on HOD. Project webpage: https://east-j.github.io/ihor.

Submitted 27 December, 2023; originally announced December 2023.

Comments: Accepted by AAAI2024

arXiv:2312.08609 [pdf, other]

Non-equilibrium physics of multi-species assembly: From inhibition of fibrils in biomolecular condensates to growth of online distrust

Authors: Pedro D. Manrique, Frank Yingjie Huo, Sara El Oud, Neil F. Johnson

Abstract: Self-assembly is a key process in living systems - from the microscopic biological level (e.g. assembly of proteins into fibrils within biomolecular condensates in a human cell) through to the macroscopic societal level (e.g. assembly of humans into common-interest communities across online social media platforms). The components in such systems (e.g. macromolecules, humans) are highly diverse, and so are the self-assembled structures that they form. However, there is no simple theory of how such structures assemble from a multi-species pool of components. Here we provide a very simple model which trades myriad chemical and human details for a transparent analysis, and yields results in good agreement with recent empirical data. It reveals a new inhibitory role for biomolecular condensates in the formation of dangerous amyloid fibrils, as well as a kinetic explanation of why so many diverse distrust movements are now emerging across social media. The nonlinear dependencies that we uncover suggest new real-world control strategies for such multi-species assembly.

Submitted 13 December, 2023; originally announced December 2023.

Comments: 20 pages, 4 Figures

arXiv:2311.17455 [pdf, other]

Experimental Generation of Spin-Photon Entanglement in Silicon Carbide

Authors: Ren-Zhou Fang, Xiao-Yi Lai, Tao Li, Ren-Zhu Su, Bo-Wei Lu, Chao-Wei Yang, Run-Ze Liu, Yu-Kun Qiao, Cheng Li, Zhi-Gang He, Jia Huang, Hao Li, Li-Xing You, Yong-Heng Huo, Xiao-Hui Bao, Jian-Wei Pan

Abstract: A solid-state approach for quantum networks is advantageous, as it allows the integration of nanophotonics to enhance the photon emission and the utilization of weakly coupled nuclear spins for long-lived storage. Silicon carbide, specifically point defects within it, shows great promise in this regard due to its ease of availability and well-established nanofabrication techniques. Despite the remarkable progress made, achieving spin-photon entanglement remains a crucial aspect to be realized. In this paper, we experimentally generate entanglement between a silicon vacancy defect in silicon carbide and a scattered single photon in the zero-phonon line. The spin state is measured by detecting photons scattered in the phonon sideband. The photonic qubit is encoded in the time-bin degree-of-freedom and measured using an unbalanced Mach-Zehnder interferometer. Photonic correlations not only reveal the quality of the entanglement but also verify the deterministic nature of the entanglement creation process. By harnessing two pairs of such spin-photon entanglement, it becomes straightforward to entangle remote quantum nodes at long distance.

Submitted 29 November, 2023; originally announced November 2023.

Comments: 8 pages in total, 4 figures in the main text, 1 figure in the supplemental material

Showing 1–50 of 321 results for author: Huo, Y