Search | arXiv e-print repository

High Spectral-Efficiency, Ultra-low MIMO SDM Transmission over a Field-Deployed Multi-Core OAM Fiber

Authors: Junyi Liu, Zengquan Xu, Shuqi Mo, Yuming Huang, Yining Huang, Zhenhua Li, Yuying Guo, Lei Shen, Shuo Xu, Ran Gao, Cheng Du, Qian Feng, Jie Luo, Jie Liu, Siyuan Yu

Abstract: Few-mode multi-core fiber (FM-MCF) based Space-Division Multiplexing (SDM) systems possess the potential to maximize the number of multiplexed spatial channels per fiber by harnessing both the space (fiber cores) and mode (optical mode per core) dimensions. However, to date, no SDM transmissions over field-deployed FM-MCFs in realistic outdoor settings have been reported, which contrasts with SDM schemes demonstrated using single-mode multi-core fibers (SM-MCFs) installed in practical fiber cable ducts. In this paper, we present the successful demonstration of bidirectional SDM transmission over a 5-km field-deployed seven ring-core fiber (7-RCF) with a cladding diameter of 178 $μ$m, achieving a Spectral Efficiency (SE) of 2$\times$201.6 bit/s/Hz. This work establishes a new record for the highest SE attained in SDM demonstrations utilizing field-deployed fiber cables, achieving an approximate 10x increase compared to the SE of reported field-deployed optical fiber cable transmission systems. Notably, these results are realized through the utilization of small-scale modular 4$\times$4 multiple-input multiple-output (MIMO) processing with a time-domain equalization (TDE) tap number not exceeding 15, maintaining a complexity per unit capacity comparable to that of MIMO equalization in SDM demonstrations employing weakly coupled SM-MCF cables. These results underscore the significant potential for achieving heightened SE and expanding capacity per individual fiber using SDM techniques in practical applications.

Submitted 29 April, 2024; originally announced July 2024.

Comments: 17 pages, 8 figures

arXiv:2406.09386 [pdf, other]

SimGen: Simulator-conditioned Driving Scene Generation

Authors: Yunsong Zhou, Michael Simon, Zhenghao Peng, Sicheng Mo, Hongzi Zhu, Minyi Guo, Bolei Zhou

Abstract: Controllable synthetic data generation can substantially lower the annotation cost of training data in autonomous driving research and development. Prior works use diffusion models to generate driving images conditioned on the 3D object layout. However, those models are trained on small-scale datasets like nuScenes, which lack appearance and layout diversity. Moreover, the trained models can only generate images based on the real-world layout data from the validation set of the same dataset, where overfitting might happen. In this work, we introduce a simulator-conditioned scene generation framework called SimGen that can learn to generate diverse driving scenes by mixing data from the simulator and the real world. It uses a novel cascade diffusion pipeline to address challenging sim-to-real gaps and multi-condition conflicts. A driving video dataset DIVA is collected to enhance the generative diversity of SimGen, which contains over 147.5 hours of real-world driving videos from 73 locations worldwide and simulated driving data from the MetaDrive simulator. SimGen achieves superior generation quality and diversity while preserving controllability based on the text prompt and the layout pulled from a simulator. We further demonstrate the improvements brought by SimGen for synthetic data augmentation on the BEV detection and segmentation task and showcase its capability in safety-critical data generation. Code, data, and models will be made available.

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.08114 [pdf]

Massive 1D Dirac Line, Solitons and Reversible Manipulation on the Surface of a Prototype Obstructed Atomic Insulator, Silicon

Authors: Zhongkai Liu, Peng Deng, Yuanfeng Xu, Haifeng Yang, Ding Pei, Cheng Chen, Shanmei He, Defa Liu, Sung-Kwan Mo, Timur Kim, Cephise Cacho, Hong Yao, Zhi-Da Song, Xi Chen, Zhong Wang, Binghai Yan, Lexian Yang, Bogdan A. Bernevig, Yulin Chen

Abstract: Topologically trivial insulators can be classified into atomic insulators (AIs) and obstructed atomic insulators (OAIs) depending on whether the Wannier charge centers are localized or not at spatial positions occupied by atoms. An OAI can possess unusual properties such as surface states along certain crystalline surfaces, which advantageously appear in materials with much larger bulk energy gap than topological insulators, making them more attractive for potential applications. In this work, we show that a well-known crystal, silicon (Si) is a model OAI, which naturally explains some of Si's unusual properties such as its famous (111) surface states. On this surface, using angle resolved photoemission spectroscopy (ARPES), we reveal sharp quasi-1D massive Dirac line dispersions; we also observe, using scanning tunneling microscopy/spectroscopy (STM/STS), topological solitons at the interface of the two atomic chains. Remarkably, we show that the different chain domains can be reversibly switched at the nanometer scale, suggesting the application potential in ultra-high density storage devices.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.07540 [pdf, other]

Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

Authors: Kuan Heng Lin, Sicheng Mo, Ben Klingher, Fangzhou Mu, Bolei Zhou

Abstract: Recent controllable generation approaches such as FreeControl and Diffusion Self-guidance bring fine-grained spatial and appearance control to text-to-image (T2I) diffusion models without training auxiliary modules. However, these methods optimize the latent embedding for each type of score function with longer diffusion steps, making the generation process time-consuming and limiting their flexibility and use. This work presents Ctrl-X, a simple framework for T2I diffusion controlling structure and appearance without additional training or guidance. Ctrl-X designs feed-forward structure control to enable the structure alignment with a structure image and semantic-aware appearance transfer to facilitate the appearance transfer from a user-input image. Extensive qualitative and quantitative experiments illustrate the superior performance of Ctrl-X on various condition inputs and model checkpoints. In particular, Ctrl-X supports novel structure and appearance control with arbitrary condition images of any modality, exhibits superior image quality and appearance transfer compared to existing works, and provides instant plug-and-play functionality to any T2I and text-to-video (T2V) diffusion model. See our project page for an overview of the results: https://genforce.github.io/ctrl-x

Submitted 11 June, 2024; originally announced June 2024.

Comments: 18 pages, 11 figures, see project page at https://genforce.github.io/ctrl-x

arXiv:2406.05038 [pdf, other]

Efficient 3D Shape Generation via Diffusion Mamba with Bidirectional SSMs

Authors: Shentong Mo

Abstract: Recent advancements in sequence modeling have led to the development of the Mamba architecture, noted for its selective state space approach, offering a promising avenue for efficient long sequence handling. However, its application in 3D shape generation, particularly at high resolutions, remains underexplored. Traditional diffusion transformers (DiT) with self-attention mechanisms, despite their potential, face scalability challenges due to the cubic complexity of attention operations as input length increases. This complexity becomes a significant hurdle when dealing with high-resolution voxel sizes. To address this challenge, we introduce a novel diffusion architecture tailored for 3D point cloud generation, Diffusion Mamba (DiM-3D). This architecture forgoes traditional attention mechanisms, instead utilizing the inherent efficiency of the Mamba architecture to maintain linear complexity with respect to sequence length. DiM-3D is characterized by fast inference times and substantially lower computational demands, quantified in reduced Gflops, thereby addressing the key scalability issues of prior models. Our empirical results on the ShapeNet benchmark demonstrate that DiM-3D achieves state-of-the-art performance in generating high-fidelity and diverse 3D shapes. Additionally, DiM-3D shows superior capabilities in tasks like 3D point cloud completion. This not only proves the model's scalability but also underscores its efficiency in generating detailed, high-resolution voxels necessary for advanced 3D shape modeling, particularly excelling in environments requiring high-resolution voxel sizes. Through these findings, we illustrate the exceptional scalability and efficiency of the Diffusion Mamba framework in 3D shape generation, setting a new standard for the field and paving the way for future explorations in high-resolution 3D modeling technologies.

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.04930 [pdf, other]

MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

Authors: Tanvir Mahmud, Shentong Mo, Yapeng Tian, Diana Marculescu

Abstract: Recent advances in pre-trained vision transformers have shown promise in parameter-efficient audio-visual learning without audio pre-training. However, few studies have investigated effective methods for aligning multimodal features in parameter-efficient audio-visual transformers. In this paper, we propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for corresponding multimodal semantic features. Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer. This allows the model to learn separate representations for each modality, while also attending to the cross-modal relationships between them. In addition, unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grain hierarchical features throughout the encoding phase. Furthermore, to suppress the background features in each modality from foreground matched audio-visual features, we introduce a robust discriminative foreground mining scheme. Through extensive experiments on benchmark AVE, VGGSound, and CREMA-D datasets, we achieve considerable performance improvements over SOTA methods.

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted in Efficient Deep Learning for Computer Vision CVPR Workshop 2024

arXiv:2405.17995 [pdf, other]

DMT-JEPA: Discriminative Masked Targets for Joint-Embedding Predictive Architecture

Authors: Shentong Mo, Sukmin Yun

Abstract: The joint-embedding predictive architecture (JEPA) recently has shown impressive results in extracting visual representations from unlabeled imagery under a masking strategy. However, we reveal its disadvantages, notably its insufficient understanding of local semantics. This deficiency originates from masked modeling in the embedding space, which reduces discriminative power and can even lead to the neglect of critical local semantics. To bridge this gap, we introduce DMT-JEPA, a novel masked modeling objective rooted in JEPA, specifically designed to generate discriminative latent targets from neighboring information. Our key idea is simple: we consider a set of semantically similar neighboring patches as a target of a masked patch. To be specific, the proposed DMT-JEPA (a) computes feature similarities between each masked patch and its corresponding neighboring patches to select patches having semantically meaningful relations, and (b) employs lightweight cross-attention heads to aggregate features of neighboring patches as the masked targets. Consequently, DMT-JEPA demonstrates strong discriminative power, offering benefits across a diverse spectrum of downstream tasks. Through extensive experiments, we demonstrate our effectiveness across various visual benchmarks, including ImageNet-1K image classification, ADE20K semantic segmentation, and COCO object detection tasks. Code is available at: https://github.com/DMTJEPA/DMTJEPA.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.15881 [pdf, other]

Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation

Authors: Shentong Mo, Yapeng Tian

Abstract: In recent developments, the Mamba architecture, known for its selective state space approach, has shown potential in the efficient modeling of long sequences. However, its application in image generation remains underexplored. Traditional diffusion transformers (DiT), which utilize self-attention blocks, are effective but their computational complexity scales quadratically with the input length, limiting their use for high-resolution images. To address this challenge, we introduce a novel diffusion architecture, Diffusion Mamba (DiM), which forgoes traditional attention mechanisms in favor of a scalable alternative. By harnessing the inherent efficiency of the Mamba architecture, DiM achieves rapid inference times and reduced computational load, maintaining linear complexity with respect to sequence length. Our architecture not only scales effectively but also outperforms existing diffusion transformers in both image and video generation tasks. The results affirm the scalability and efficiency of DiM, establishing a new benchmark for image and video generation techniques. This work advances the field of generative models and paves the way for further applications of scalable architectures.

Submitted 24 May, 2024; originally announced May 2024.
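
The quadratic-versus-linear scaling contrast in the DiM abstract above can be made concrete with a little arithmetic. The sketch below is only an illustration: the patch size of 16 and the abstract "cost" units (token pairs for self-attention, one step per token for a linear-time scan) are assumptions chosen for the example, not figures from the paper.

```python
def num_tokens(image_size: int, patch_size: int = 16) -> int:
    """Number of patch tokens for a square image (patch_size is a hypothetical choice)."""
    return (image_size // patch_size) ** 2


def attention_pairs(n: int) -> int:
    """Self-attention compares every token with every token: n * n pairs."""
    return n * n


def linear_steps(n: int) -> int:
    """A linear-time sequence model touches each token once: n steps."""
    return n


if __name__ == "__main__":
    for size in (256, 512, 1024):
        n = num_tokens(size)
        print(f"{size}px: tokens={n}, attention pairs={attention_pairs(n)}, "
              f"linear steps={linear_steps(n)}")
```

Under these assumptions, doubling the image side length quadruples the token count, so the attention pair count grows by a factor of 16 while the linear step count grows by a factor of 4.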

arXiv:2405.07202 [pdf, other]

Unified Video-Language Pre-training with Synchronized Audio

Authors: Shentong Mo, Haofan Wang, Huaxia Li, Xu Tang

Abstract: Video-language pre-training is a typical and challenging problem that aims at learning visual and textual representations from large-scale data in a self-supervised way. Existing pre-training approaches either captured the correspondence of image-text pairs or utilized temporal ordering of frames. However, they do not explicitly explore the natural synchronization between audio and the other two modalities. In this work, we propose an enhanced framework for Video-Language pre-training with Synchronized Audio, termed VLSA, that can learn tri-modal representations in a unified self-supervised transformer. Specifically, our VLSA jointly aggregates embeddings of local patches and global tokens for video, text, and audio. Furthermore, we utilize local-patch masked modeling to learn modality-aware features, and leverage global audio matching to capture audio-guided features for video and text. We conduct extensive experiments on retrieval across text, video, and audio. Our simple model pre-trained on only 0.9M data achieves improved results against state-of-the-art baselines. In addition, qualitative visualizations vividly showcase the superiority of our VLSA in learning discriminative visual-textual representations.

Submitted 12 May, 2024; originally announced May 2024.

arXiv:2404.17808 [pdf, other]

Scaffold-BPE: Enhancing Byte Pair Encoding with Simple and Effective Scaffold Token Removal

Authors: Haoran Lian, Yizhe Xiong, Jianwei Niu, Shasha Mo, Zhenpeng Su, Zijia Lin, Peng Liu, Hui Chen, Guiguang Ding

Abstract: Byte Pair Encoding (BPE) serves as a foundation method for text tokenization in the Natural Language Processing (NLP) field. Despite its wide adoption, the original BPE algorithm harbors an inherent flaw: it inadvertently introduces a frequency imbalance for tokens in the text corpus. Since BPE iteratively merges the most frequent token pair in the text corpus while keeping all tokens that have been merged in the vocabulary, it unavoidably holds tokens that primarily represent subwords of complete words and appear infrequently on their own in the text corpus. We term such tokens as Scaffold Tokens. Due to their infrequent appearance in the text corpus, Scaffold Tokens pose a learning imbalance issue for language models. To address that issue, we propose Scaffold-BPE, which incorporates a dynamic scaffold token removal mechanism by parameter-free, computation-light, and easy-to-implement modifications to the original BPE. This novel approach ensures the exclusion of low-frequency Scaffold Tokens from the token representations for the given texts, thereby mitigating the issue of frequency imbalance and facilitating model training. On extensive experiments across language modeling tasks and machine translation tasks, Scaffold-BPE consistently outperforms the original BPE, well demonstrating its effectiveness and superiority.

Submitted 27 April, 2024; originally announced April 2024.
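
The frequency imbalance described in the Scaffold-BPE abstract above can be reproduced with plain BPE on a toy corpus. The following is a minimal sketch of standard BPE merging only; it is not the authors' Scaffold-BPE removal mechanism, and the function names, the toy corpus, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def bpe_merges(words, num_merges):
    """Run plain BPE on a {word: count} corpus.

    Returns the list of merged pairs and the final segmentation
    as {tuple_of_tokens: count}.
    """
    # Each word starts as a tuple of single characters.
    corpus = {tuple(w): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair; ties broken by the pair itself (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the merge to every word in the corpus.
        merged = {}
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged[key] = merged.get(key, 0) + count
        corpus = merged
    return merges, corpus


def token_frequencies(corpus):
    """How often each token appears in the final segmentation."""
    freq = Counter()
    for symbols, count in corpus.items():
        for s in symbols:
            freq[s] += count
    return freq


if __name__ == "__main__":
    merges, corpus = bpe_merges({"token": 10, "to": 3}, num_merges=4)
    vocab = ["".join(pair) for pair in merges]
    freq = token_frequencies(corpus)
    for tok in vocab:
        print(tok, freq[tok])
```

On this toy corpus the four merges add "to", "tok", "toke", and "token" to the vocabulary. After the final merge, "tok" and "toke" no longer appear in any segmentation (frequency 0), although they remain in the vocabulary, while "to" and "token" still occur. Those leftover intermediate tokens are the kind of low-frequency entries the abstract calls Scaffold Tokens.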

arXiv:2404.13081 [pdf, other]

SuRe: Summarizing Retrievals using Answer Candidates for Open-domain QA of LLMs

Authors: Jaehyung Kim, Jaehyun Nam, Sangwoo Mo, Jongjin Park, Sang-Woo Lee, Minjoon Seo, Jung-Woo Ha, Jinwoo Shin

Abstract: Large language models (LLMs) have made significant advancements in various natural language processing tasks, including question answering (QA) tasks. While incorporating new information with the retrieval of relevant passages is a promising way to improve QA with LLMs, the existing methods often require additional fine-tuning which becomes infeasible with recent LLMs. Augmenting retrieved passages via prompting has the potential to address this limitation, but this direction has been limitedly explored. To this end, we design a simple yet effective framework to enhance open-domain QA (ODQA) with LLMs, based on the summarized retrieval (SuRe). SuRe helps LLMs predict more accurate answers for a given question, which are well-supported by the summarized retrieval that could be viewed as an explicit rationale extracted from the retrieved passages. Specifically, SuRe first constructs summaries of the retrieved passages for each of the multiple answer candidates. Then, SuRe confirms the most plausible answer from the candidate set by evaluating the validity and ranking of the generated summaries. Experimental results on diverse ODQA benchmarks demonstrate the superiority of SuRe, with improvements of up to 4.6% in exact match (EM) and 4.0% in F1 score over standard prompting approaches. SuRe also can be integrated with a broad range of retrieval methods and LLMs. Finally, the generated summaries from SuRe show additional advantages to measure the importance of retrieved passages and serve as more preferred rationales by models and humans.

Submitted 16 April, 2024; originally announced April 2024.

Comments: Accepted at ICLR 2024

arXiv:2404.12876 [pdf, other]

A Large-scale Medical Visual Task Adaptation Benchmark

Authors: Shentong Mo, Xufang Luo, Yansen Wang, Dongsheng Li

Abstract: Visual task adaptation has been demonstrated to be effective in adapting pre-trained Vision Transformers (ViTs) to general downstream visual tasks using specialized learnable layers or tokens. However, there is not yet a large-scale benchmark to fully explore the effect of visual task adaptation on the realistic and important medical domain, particularly across diverse medical visual modalities, such as color images, X-ray, and CT. To close this gap, we present Med-VTAB, a large-scale Medical Visual Task Adaptation Benchmark consisting of 1.68 million medical images for diverse organs, modalities, and adaptation approaches. Based on Med-VTAB, we explore the scaling law of medical prompt tuning concerning tunable parameters and the generalizability of medical visual adaptation using non-medical/medical pre-train weights. Besides, we study the impact of patient ID out-of-distribution on medical visual adaptation, which is a real and challenging scenario. Furthermore, results from Med-VTAB indicate that a single pre-trained model falls short in medical task adaptation. Therefore, we introduce GMoE-Adapter, a novel method that combines medical and general pre-training weights through a gated mixture-of-experts adapter, achieving state-of-the-art results in medical visual task adaptation.

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2404.11934 [pdf]

doi 10.1103/PhysRevB.109.L161102

Quantum simulation of honeycomb lattice model by high-order moiré pattern

Authors: Qiang Wan, Chunlong Wu, Xun-Jiang Luo, Shenghao Dai, Cao Peng, Renzhe Li, Shangkun Mo, Keming Zhao, Wen-Xuan Qiu, Hao Zhong, Yiwei Li, Chendong Zhang, Fengcheng Wu, Nan Xu

Abstract: Moiré superlattices have become an emergent solid-state platform for simulating quantum lattice models. However, in a single moiré device, Hamiltonian parameters like lattice constant, hopping and interaction terms can hardly be manipulated, limiting the controllability and accessibility of moiré quantum simulators. Here, by combining angle-resolved photoemission spectroscopy and theoretical analysis, we demonstrate that high-order moiré patterns in graphene-monolayered xenon/krypton heterostructures can simulate the honeycomb model at the mesoscale, with in-situ tunable Hamiltonian parameters. The length scale of the simulated lattice constant can be tuned by annealing processes, which in-situ adjusts intervalley interaction and hopping parameters in the simulated honeycomb lattice. The sign of the lattice constant can be switched by choosing xenon or krypton monolayer deposited on graphene, which controls sublattice degree of freedom and valley arrangement of Dirac fermions. Our work establishes a novel path for experimentally simulating the honeycomb model with tunable parameters by high-order moiré patterns.

Submitted 18 April, 2024; originally announced April 2024.

Comments: 19 pages, 5 figures

Journal ref: Phys. Rev. B 109, L161102 (2024)

arXiv:2404.10308 [pdf, other]

Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs

Authors: Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, Jinwoo Shin

Abstract: Large language models (LLMs) have shown remarkable performance in various natural language processing tasks. However, a primary constraint they face is the context limit, i.e., the maximum number of tokens they can process. Previous works have explored architectural changes and modifications in positional encoding to relax the constraint, but they often require expensive training or do not address the computational demands of self-attention. In this paper, we present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations. HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks. Each chunk is then processed collectively, employing a hierarchical strategy that merges adjacent chunks at progressive transformer layers. A token reduction technique precedes each merging, ensuring memory usage efficiency. We also propose an optimized computational order reducing the memory requirement to logarithmically scale with respect to input length, making it especially favorable for environments with tight memory restrictions. Our experiments demonstrate the proposed method's superior performance and memory efficiency, enabling the broader use of LLMs in contexts requiring extended context. Code is available at https://github.com/alinlab/HOMER.

Submitted 16 April, 2024; originally announced April 2024.

Comments: Accepted to ICLR 2024. The first two authors contributed equally

arXiv:2404.02257 [pdf, other]

SnAG: Scalable and Accurate Video Grounding

Authors: Fangzhou Mu, Sicheng Mo, Yin Li

Abstract: Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability -- they have been optimized for grounding only a few text queries within short videos, and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover, it leads us to a novel, video-centric sampling scheme for efficient training. Based on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding. Without bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a state of the art for long-form video grounding on the challenging MAD dataset, while achieving highly competitive results on short videos.

Submitted 5 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: Accepted to CVPR 2024. Code available at https://github.com/fmu2/snag_release

arXiv:2404.00509 [pdf, other]

DailyMAE: Towards Pretraining Masked Autoencoders in One Day

Authors: Jiantao Wu, Shentong Mo, Sara Atito, Zhenhua Feng, Josef Kittler, Muhammad Awais

Abstract: Recently, masked image modeling (MIM), an important self-supervised learning (SSL) method, has drawn attention for its effectiveness in learning data representation from unlabeled data. Numerous studies underscore the advantages of MIM, highlighting how models pretrained on extensive datasets can enhance the performance of downstream tasks. However, the high computational demands of pretraining pose significant challenges, particularly within academic environments, thereby impeding the SSL research progress. In this study, we propose efficient training recipes for MIM-based SSL that focus on mitigating data loading bottlenecks and employing progressive training techniques and other tricks to closely maintain pretraining performance. Our library enables the training of a MAE-Base/16 model on the ImageNet 1K dataset for 800 epochs within just 18 hours, using a single machine equipped with 8 A100 GPUs. By achieving speed gains of up to 5.8 times, this work not only demonstrates the feasibility of conducting high-efficiency SSL training but also paves the way for broader accessibility and promotes advancement in SSL research, particularly for prototyping and initial testing of SSL ideas. The code is available at https://github.com/erow/FastSSL.

Submitted 30 March, 2024; originally announced April 2024.

arXiv:2403.13596 [pdf]

Tailoring Physical Properties of Crystals through Synthetic Temperature Control: A Case Study for new Polymorphic NbFeTe2 phases

Authors: Hanlin Wu, Sheng Li, Yan Lyu, Yucheng Guo, Wenhao Liu, Ji Seop Oh, Yichen Zhang, Sung-Kwan Mo, Clarina dela Cruz, Robert J. Birgeneau, Keith M. Taddei, Ming Yi, Li Yang, Bing Lv

Abstract: Growth parameters play a significant role in the crystal quality and physical properties of layered materials. Here we present a case study on a van der Waals magnetic NbFeTe2 material. Two different types of polymorphic NbFeTe2 phases, synthesized at different temperatures, display significantly different behaviors in crystal symmetry, electronic structure, electrical transport, and magnetism. While the phase synthesized at low temperature shows behavior consistent with previous reports, the new phase synthesized at high temperature has completely different physical properties, such as metallic resistivity, long-range ferromagnetic order, anomalous Hall effect, negative magnetoresistance, and distinct electronic structures. Neutron diffraction reveals out-of-plane ferromagnetism below 70 K, consistent with the electrical transport and magnetic susceptibility studies. Our work suggests that simply tuning synthetic parameters in a controlled manner could be an effective route to alter the physical properties of existing materials, potentially unlocking new states of matter or even discovering new materials.

Submitted 20 March, 2024; originally announced March 2024.

Comments: 22 Pages, 6 figures

arXiv:2403.11416 [pdf]

doi 10.1103/PhysRevB.109.115415

Surface region band enhancement in noble gas adsorption assisted ARPES on kagome superconductor RbV3Sb5

Authors: Cao Peng, Yiwei Li, Xu Chen, Shenghao Dai, Zewen Wu, Chunlong Wu, Qiang Wan, Keming Zhao, Renzhe Li, Shangkun Mo, Dingkun Qin, Shuming Yu, Hao Zhong, Shengjun Yuan, Jiangang Guo, Nan Xu

Abstract: Electronic states near surface regions can be distinct from bulk states, which are paramount in understanding various physical phenomena occurring at surfaces and in applications in semiconductors, energy, and catalysis. Here, we report an abnormal surface region band enhancement effect in angle-resolved photoemission spectroscopy on kagome superconductor RbV3Sb5, by depositing noble gases with fine control. In contrast to conventional surface contamination, the intensity of surface region Sb band can be enhanced more than three times with noble gas adsorption. In the meantime, a hole-dope effect is observed for the enhanced surface region band, with other bands hardly changing. The doping effect is more pronounced with heavier noble gases. We propose that noble gas atoms selectively fill into alkali metal vacancy sites on the surface, which improves the surface condition, boosts surface region bands, and effectively dopes it with the Pauli repulsion mechanism. Our results provide a novel and reversible way to improve surface conditions and tune surface region bands by controlled surface noble gas deposition.

Submitted 17 March, 2024; originally announced March 2024.

Comments: 17 pages,4 figures

Journal ref: Phys. Rev. B 109, 115415 (2024)

arXiv:2403.09846 [pdf]

doi 10.1021/acs.nanolett.3c03203

Electronic structure of above-room-temperature van der Waals ferromagnet Fe$_3$GaTe$_2$

Authors: Ji-Eun Lee, Shaohua Yan, Sehoon Oh, Jinwoong Hwang, Jonathan D. Denlinger, Choongyu Hwang, Hechang Lei, Sung-Kwan Mo, Se Young Park, Hyejin Ryu

Abstract: Fe$_3$GaTe$_2$, a recently discovered van der Waals ferromagnet, demonstrates intrinsic ferromagnetism above room temperature, necessitating a comprehensive investigation of the microscopic origins of its high Curie temperature ($\textit{T}$$_C$). In this study, we reveal the electronic structure of Fe$_3$GaTe$_2$ in its ferromagnetic ground state using angle-resolved photoemission spectroscopy and density functional theory calculations. Our results establish a consistent correspondence between the measured band structure and theoretical calculations, underscoring the significant contributions of the Heisenberg exchange interaction ($\textit{J}$$_{ex}$) and magnetic anisotropy energy to the development of the high-$\textit{T}$$_C$ ferromagnetic ordering in Fe$_3$GaTe$_2$. Intriguingly, we observe substantial modifications to these crucial driving factors through doping, which we attribute to alterations in multiple spin-splitting bands near the Fermi level. These findings provide valuable insights into the underlying electronic structure and its correlation with the emergence of high-$\textit{T}$$_C$ ferromagnetic ordering in Fe$_3$GaTe$_2$.

Submitted 14 March, 2024; originally announced March 2024.

Comments: 25 pages, 4 figures

Journal ref: Nano Lett. 23 (2023) 11526-11532

arXiv:2403.07938 [pdf, other]

Text-to-Audio Generation Synchronized with Videos

Authors: Shentong Mo, Jing Shi, Yapeng Tian

Abstract: In recent times, the focus on text-to-audio (TTA) generation has intensified, as researchers strive to synthesize audio from textual descriptions. However, most existing methods, though leveraging latent diffusion models to learn the correlation between audio and text embeddings, fall short when it comes to maintaining a seamless synchronization between the produced audio and its video. This often results in discernible audio-visual mismatches. To bridge this gap, we introduce a groundbreaking benchmark for Text-to-Audio generation that aligns with Videos, named T2AV-Bench. This benchmark distinguishes itself with three novel metrics dedicated to evaluating visual alignment and temporal consistency. To complement this, we also present a simple yet effective video-aligned TTA generation model, namely T2AV. Moving beyond traditional methods, T2AV refines the latent diffusion approach by integrating visual-aligned text embeddings as its conditional foundation. It employs a temporal multi-head attention transformer to extract and understand temporal nuances from video data, a feat amplified by our Audio-Visual ControlNet that adeptly merges temporal visual representations with text embeddings. Further enhancing this integration, we weave in a contrastive learning objective, designed to ensure that the visual-aligned text embeddings resonate closely with the audio features. Extensive evaluations on the AudioCaps and T2AV-Bench demonstrate that our T2AV sets a new standard for video-aligned TTA generation in ensuring visual alignment and temporal consistency.

Submitted 8 March, 2024; originally announced March 2024.

Comments: arXiv admin note: text overlap with arXiv:2305.12903

arXiv:2403.05659 [pdf, other]

Audio-Synchronized Visual Animation

Authors: Lin Zhang, Shentong Mo, Yijing Zhang, Pedro Morgado

Abstract: Current visual generation methods can produce high quality videos guided by texts. However, effectively controlling object dynamics remains a challenge. This work explores audio as a cue to generate temporally synchronized image animations. We introduce Audio Synchronized Visual Animation (ASVA), a task animating a static image to demonstrate motion dynamics, temporally guided by audio clips across multiple classes. To this end, we present AVSync15, a dataset curated from VGGSound with videos featuring synchronized audio visual events across 15 categories. We also present a diffusion model, AVSyncD, capable of generating dynamic animations guided by audios. Extensive evaluations validate AVSync15 as a reliable benchmark for synchronized generation and demonstrate our model's superior performance. We further explore AVSyncD's potential in a variety of audio synchronized generation tasks, from generating full videos without a base image to controlling object motions with various sounds. We hope our established benchmark can open new avenues for controllable visual generation. More videos on project webpage https://lzhangbj.github.io/projects/asva/asva.html.

Submitted 8 March, 2024; originally announced March 2024.

Comments: 15 pages

arXiv:2402.17406 [pdf, other]

LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning

Authors: Shentong Mo, Yansen Wang, Xufang Luo, Dongsheng Li

Abstract: Visual Prompt Tuning (VPT) techniques have gained prominence for their capacity to adapt pre-trained Vision Transformers (ViTs) to downstream visual tasks using specialized learnable tokens termed as prompts. Contemporary VPT methodologies, especially when employed with self-supervised vision transformers, often default to the introduction of new learnable prompts or gated prompt tokens predominantly sourced from the model's previous block. A pivotal oversight in such approaches is their failure to harness the potential of long-range previous blocks as sources of prompts within each self-supervised ViT. To bridge this crucial gap, we introduce Long-term Spatial Prompt Tuning (LSPT) - a revolutionary approach to visual representation learning. Drawing inspiration from the intricacies of the human brain, LSPT ingeniously incorporates long-term gated prompts. This feature serves as temporal coding, curbing the risk of forgetting parameters acquired from earlier blocks. Further enhancing its prowess, LSPT brings into play patch tokens, serving as spatial coding. This is strategically designed to perpetually amass class-conscious features, thereby fortifying the model's prowess in distinguishing and identifying visual categories. To validate the efficacy of our proposed method, we engaged in rigorous experimentation across 5 FGVC and 19 VTAB-1K benchmarks. Our empirical findings underscore the superiority of LSPT, showcasing its ability to set new benchmarks in visual prompt tuning performance.

Submitted 27 February, 2024; originally announced February 2024.

arXiv:2402.14299 [pdf, other]

We Choose to Go to Space: Agent-driven Human and Multi-Robot Collaboration in Microgravity

Authors: Miao Xin, Zhongrui You, Zihan Zhang, Taoran Jiang, Tingjia Xu, Haotian Liang, Guojing Ge, Yuchen Ji, Shentong Mo, Jian Cheng

Abstract: We present SpaceAgents-1, a system for learning human and multi-robot collaboration (HMRC) strategies under microgravity conditions. Future space exploration requires humans to work together with robots. However, acquiring proficient robot skills and adept collaboration under microgravity conditions poses significant challenges within ground laboratories. To address this issue, we develop a microgravity simulation environment and present three typical configurations of intra-cabin robots. We propose a hierarchical heterogeneous multi-agent collaboration architecture: guided by foundation models, a Decision-Making Agent serves as a task planner for human-robot collaboration, while individual Skill-Expert Agents manage the embodied control of robots. This mechanism empowers the SpaceAgents-1 system to execute a range of intricate long-horizon HMRC tasks.

Submitted 22 February, 2024; originally announced February 2024.

arXiv:2402.07143 [pdf, ps, other]

Electronic structure of the alternating monolayer-trilayer phase of La3Ni2O7

Authors: Sebastien N. Abadi, Ke-Jun Xu, Eder G. Lomeli, Pascal Puphal, Masahiko Isobe, Yong Zhong, Alexei V. Fedorov, Sung-Kwan Mo, Makoto Hashimoto, Dong-Hui Lu, Brian Moritz, Bernhard Keimer, Thomas P. Devereaux, Matthias Hepting, Zhi-Xun Shen

Abstract: Recent studies of La$_3$Ni$_2$O$_7$ have identified a bilayer (2222) structure and an unexpected alternating monolayer-trilayer (1313) structure, both of which feature signatures of superconductivity near 80 K under high pressures. Using angle-resolved photoemission spectroscopy, we measure the electronic structure of 1313 samples. In contrast to the previously studied 2222 structure, we find that the 1313 structure hosts a flat band with a markedly different binding energy, as well as an additional electron pocket and band splittings. By comparison to local-density approximation calculations, we find renormalizations of the Ni-$d_{z^2}$ and Ni-$d_{x^2-y^2}$ derived bands to be about 5 to 7 and about 4 respectively, suggesting strong correlation effects. These results reveal important differences in the electronic structure brought about by the distinct structural motifs with the same stoichiometry. Such differences may be relevant to the putative high temperature superconductivity.

Submitted 25 June, 2024; v1 submitted 11 February, 2024; originally announced February 2024.

Comments: Version 2: Improved data quality of the small electron pocket at the zone center ($ε$ band). Also, observations of multilayer splitting effects in the flat band ($γ$ and $δ$ bands) and in the large cuprate-like pockets ($β$ bands). Band structure calculations now use LDA instead of LDA+U. Main text: 7 pages, 3 figures. SM: 9 pages, 6 figures

arXiv:2401.17607 [pdf, ps, other]

Engineering two-dimensional nodal semimetals in functionalized biphenylene by fluorine adatoms

Authors: Seongjun Mo, Jaeuk Seo, Seok-Kyun Son, Sejoong Kim, Jun-Won Rhim, Hoonkyung Lee

Abstract: We propose a new band engineering scheme on the biphenylene network, a newly synthesized carbon allotrope. First, we investigate the mechanism for the appearance of type II Dirac fermion in a pristine biphenylene network. We show that the essential ingredients are mirror symmetries and the stabilization of the compact localized eigenstates via destructive interference. While the former is used for the band-crossing point along high symmetry lines, the latter makes the obtained Dirac dispersion highly inclined. Then, we demonstrate that many other different kinds of Dirac fermions, such as type-I Dirac, gapped type-II Dirac, and nodal line semimetals, can be developed by fluorinating the biphenylene network periodically in various ways. In this program, the key role of the fluorine atoms is manipulating the condition of the destructive interference and mirror symmetries.

Submitted 31 January, 2024; originally announced January 2024.

arXiv:2401.12969 [pdf, ps, other]

Monadic transductions and definable classes of matroids

Authors: Susan Jowett, Dillon Mayhew, Songbao Mo, Christopher Tuffley

Abstract: A transduction provides us with a way of using the monadic second-order language of a structure to make statements about a derived structure. Any transduction induces a relation on the set of these structures. This article presents a self-contained presentation of the theory of transductions for the monadic second-order language of matroids. This includes a proof of the matroid version of the Backwards Translation Theorem, which lifts any formula applied to the images of the transduction into a formula which we can apply to the pre-images. Applications include proofs that the class of lattice-path matroids and the class of spike-minors can be defined by sentences in monadic second-order logic.

Submitted 23 January, 2024; originally announced January 2024.

arXiv:2401.03494 [pdf]

Pre-insertion resistors temperature prediction based on improved WOA-SVR

Authors: Honghe Dai, Site Mo, Haoxin Wang, Nan Yin, Songhai Fan, Bixiong Li

Abstract: The pre-insertion resistors (PIR) within high-voltage circuit breakers are critical components and warm up by generating Joule heat when an electric current flows through them. Elevated temperature can lead to temporary closure failure and, in severe cases, the rupture of PIR. To accurately predict the temperature of PIR, this study combines finite element simulation techniques with Support Vector Regression (SVR) optimized by an Improved Whale Optimization Algorithm (IWOA) approach. The IWOA includes Tent mapping, a convergence factor based on the sigmoid function, and the Ornstein-Uhlenbeck variation strategy. The IWOA-SVR model is compared with the SSA-SVR and WOA-SVR. The results reveal that the prediction accuracies of the IWOA-SVR model were 90.2% and 81.5% (above 100$^\circ$C) in the 3$^\circ$C temperature deviation range and 96.3% and 93.4% (above 100$^\circ$C) in the 4$^\circ$C temperature deviation range, surpassing the performance of the comparative models. This research demonstrates that the proposed method can realize online monitoring of the PIR temperature, which can effectively prevent thermal faults in the PIR and provide a basis for the opening and closing of the circuit breaker within a short period.

Submitted 7 January, 2024; originally announced January 2024.

arXiv:2312.11732 [pdf, other]

doi 10.1103/PhysRevB.109.045416

Two-Step Electronic Response to Magnetic Ordering in a van der Waals Ferromagnet

Authors: Han Wu, Jian-Xin Zhu, Lebing Chen, Matthew W Butcher, Ziqin Yue, Dongsheng Yuan, Yu He, Ji Seop Oh, Jianwei Huang, Shan Wu, Cheng Gong, Yucheng Guo, Sung-Kwan Mo, Jonathan D. Denlinger, Donghui Lu, Makoto Hashimoto, Matthew B. Stone, Alexander I. Kolesnikov, Songxue Chi, Junichiro Kono, Andriy H. Nevidomskyy, Robert J. Birgeneau, Pengcheng Dai, Ming Yi

Abstract: The two-dimensional (2D) material Cr$_2$Ge$_2$Te$_6$ is a member of the class of insulating van der Waals magnets. Here, using high resolution angle-resolved photoemission spectroscopy in a detailed temperature dependence study, we identify a clear response of the electronic structure to a dimensional crossover in the form of two distinct temperature scales marking onsets of modifications in the electronic structure. Specifically, we observe Te $p$-orbital-dominated bands to undergo changes at the Curie transition temperature T$_C$ while the Cr $d$-orbital-dominated bands begin evolving at a higher temperature scale. Combined with neutron scattering, density functional theory calculations, and Monte Carlo simulations, we find that the electronic system can be consistently understood to respond sequentially to the distinct temperatures at which in-plane and out-of-plane spin correlations exceed a characteristic length scale. Our findings reveal the sensitivity of the orbital-selective electronic structure for probing the dynamical evolution of local moment correlations in vdW insulating magnets.

Submitted 20 December, 2023; v1 submitted 18 December, 2023; originally announced December 2023.

Comments: PRB, in press

Journal ref: Physical Review B 109, 045416 (2024)

arXiv:2312.07536 [pdf, other]

FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition

Authors: Sicheng Mo, Fangzhou Mu, Kuan Heng Lin, Yanli Liu, Bochen Guan, Yin Li, Bolei Zhou

Abstract: Recent approaches such as ControlNet offer users fine-grained spatial control over text-to-image (T2I) diffusion models. However, auxiliary modules have to be trained for each type of spatial condition, model architecture, and checkpoint, putting them at odds with the diverse intents and preferences a human designer would like to convey to the AI models during the content creation process. In this work, we present FreeControl, a training-free approach for controllable T2I generation that supports multiple conditions, architectures, and checkpoints simultaneously. FreeControl designs structure guidance to facilitate the structure alignment with a guidance image, and appearance guidance to enable the appearance sharing between images generated using the same seed. Extensive qualitative and quantitative experiments demonstrate the superior performance of FreeControl across a variety of pre-trained T2I models. In particular, FreeControl facilitates convenient training-free control over many different architectures and checkpoints, allows the challenging input conditions on which most of the existing training-free methods fail, and achieves competitive synthesis quality with training-based approaches.

Submitted 12 December, 2023; originally announced December 2023.

Comments: Project Page: https://genforce.github.io/freecontrol/

arXiv:2312.07231 [pdf, other]

Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation

Authors: Shentong Mo, Enze Xie, Yue Wu, Junsong Chen, Matthias Nießner, Zhenguo Li

Abstract: Diffusion Transformers have recently shown remarkable effectiveness in generating high-quality 3D point clouds. However, training voxel-based diffusion models for high-resolution 3D voxels remains prohibitively expensive due to the cubic complexity of attention operators, which arises from the additional dimension of voxels. Motivated by the inherent redundancy of 3D compared to 2D, we propose FastDiT-3D, a novel masked diffusion transformer tailored for efficient 3D point cloud generation, which greatly reduces training costs. Specifically, we draw inspiration from masked autoencoders to dynamically operate the denoising process on masked voxelized point clouds. We also propose a novel voxel-aware masking strategy to adaptively aggregate background/foreground information from voxelized point clouds. Our method achieves state-of-the-art performance with an extreme masking ratio of nearly 99%. Moreover, to improve multi-category 3D generation, we introduce Mixture-of-Expert (MoE) in 3D diffusion model. Each category can learn a distinct diffusion path with different experts, relieving gradient conflict. Experimental results on the ShapeNet dataset demonstrate that our method achieves state-of-the-art high-fidelity and diverse 3D point cloud generation performance. Our FastDiT-3D improves 1-Nearest Neighbor Accuracy and Coverage metrics when generating 128-resolution voxel point clouds, using only 6.5% of the original training cost.

Submitted 12 December, 2023; originally announced December 2023.

Comments: Project Page: https://dit-3d.github.io/FastDiT-3D/

arXiv:2312.06220 [pdf, other]

Dance of Channel and Sequence: An Efficient Attention-Based Approach for Multivariate Time Series Forecasting

Authors: Haoxin Wang, Yipeng Mo, Nan Yin, Honghe Dai, Bixiong Li, Songhai Fan, Site Mo

Abstract: In recent developments, predictive models for multivariate time series analysis have exhibited commendable performance through the adoption of the prevalent principle of channel independence. Nevertheless, it is imperative to acknowledge the intricate interplay among channels, which fundamentally influences the outcomes of multivariate predictions. Consequently, the notion of channel independence, while offering utility to a certain extent, becomes increasingly impractical, leading to information degradation. In response to this pressing concern, we present CSformer, an innovative framework characterized by a meticulously engineered two-stage self-attention mechanism. This mechanism is purposefully designed to enable the segregated extraction of sequence-specific and channel-specific information, while sharing parameters to promote synergy and mutual reinforcement between sequences and channels. Simultaneously, we introduce sequence adapters and channel adapters, ensuring the model's ability to discern salient features across various dimensions. Rigorous experimentation, spanning multiple real-world datasets, underscores the robustness of our approach, consistently establishing its position at the forefront of predictive performance across all datasets. This augmentation substantially enhances the capacity for feature extraction inherent to multivariate time series data, facilitating a more comprehensive exploitation of the available information.

Submitted 11 December, 2023; originally announced December 2023.

arXiv:2312.01186 [pdf, other]

Linker-Tuning: Optimizing Continuous Prompts for Heterodimeric Protein Prediction

Authors: Shuxian Zou, Hui Li, Shentong Mo, Xingyi Cheng, Eric Xing, Le Song

Abstract: Predicting the structure of interacting chains is crucial for understanding biological systems and developing new drugs. Large-scale pre-trained Protein Language Models (PLMs), such as ESM2, have shown impressive abilities in extracting biologically meaningful representations for protein structure prediction. In this paper, we show that ESMFold, which has been successful in computing accurate atomic structures for single-chain proteins, can be adapted to predict the heterodimer structures in a lightweight manner. We propose Linker-tuning, which learns a continuous prompt to connect the two chains in a dimer before running it as a single sequence in ESMFold. Experiment results show that our method successfully predicts 56.98% of interfaces on the i.i.d. heterodimer test set, with an absolute improvement of +12.79% over the ESMFold-Linker baseline. Furthermore, our model can generalize well to the out-of-distribution (OOD) test set HeteroTest2 and two antibody test sets Fab and Fv while being $9\times$ faster than AF-Multimer.

Submitted 2 December, 2023; originally announced December 2023.

arXiv:2312.01118 [pdf, other]

Beyond Accuracy: Statistical Measures and Benchmark for Evaluation of Representation from Self-Supervised Learning

Authors: Jiantao Wu, Shentong Mo, Sara Atito, Josef Kittler, Zhenhua Feng, Muhammad Awais

Abstract: Recently, self-supervised metric learning has raised attention for the potential to learn a generic distance function. It overcomes the limitations of conventional supervised one, e.g., scalability and label biases. Despite progress in this domain, current benchmarks, incorporating a narrow scope of classes, stop the nuanced evaluation of semantic representations. To bridge this gap, we introduce a large-scale benchmark with diversity and granularity of classes, Statistical Metric Learning Benchmark (SMLB) built upon ImageNet-21K and WordNet. SMLB is designed to rigorously evaluate the discriminative discernment and generalizability across more than 14M images, 20K classes, and 16K taxonomic nodes. Alongside, we propose novel evaluation metrics -- `overlap' for separability and `aSTD' for consistency -- to measure distance statistical information, which are efficient and robust to the change of class number. Our benchmark offers a novel perspective of evaluating the quality of representations beyond accuracy. Our findings reveal the limitations of supervised learning and the class bias inherent in SSL models, offering insights into potential areas for future model enhancement.

Submitted 2 December, 2023; originally announced December 2023.

arXiv:2312.01017 [pdf, other]

Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling

Authors: Shentong Mo, Pedro Morgado

Abstract: Humans possess a remarkable ability to integrate auditory and visual information, enabling a deeper understanding of the surrounding environment. This early fusion of audio and visual cues, demonstrated through cognitive psychology and neuroscience research, offers promising potential for developing multimodal perception models. However, training early fusion architectures poses significant challenges, as the increased model expressivity requires robust learning frameworks to harness their enhanced capabilities. In this paper, we address this challenge by leveraging the masked reconstruction framework, previously successful in unimodal settings, to train audio-visual encoders with early fusion. Additionally, we propose an attention-based fusion module that captures interactions between local audio and visual representations, enhancing the model's ability to capture fine-grained interactions. While effective, this procedure can become computationally intractable, as the number of local representations increases. Thus, to address the computational complexity, we propose an alternative procedure that factorizes the local representations before representing audio-visual interactions. Extensive evaluations on a variety of datasets demonstrate the superiority of our approach in audio-event classification, visual sound localization, sound separation, and audio-visual segmentation. These contributions enable the efficient training of deeply integrated audio-visual models and significantly advance the usefulness of early fusion architectures.

Submitted 1 December, 2023; originally announced December 2023.

arXiv:2311.15080 [pdf, other]

Weakly-Supervised Audio-Visual Segmentation

Authors: Shentong Mo, Bhiksha Raj

Abstract: Audio-visual segmentation is a challenging task that aims to predict pixel-level masks for sound sources in a video. Previous work applied a comprehensive manually designed architecture with countless pixel-wise accurate masks as supervision. However, these pixel-level masks are expensive and not available in all cases. In this work, we aim to simplify the supervision as the instance-level annotation, i.e., weakly-supervised audio-visual segmentation. We present a novel Weakly-Supervised Audio-Visual Segmentation framework, namely WS-AVS, that can learn multi-scale audio-visual alignment with multi-scale multiple-instance contrastive learning for audio-visual segmentation. Extensive experiments on AVSBench demonstrate the effectiveness of our WS-AVS in the weakly-supervised audio-visual segmentation of single-source and multi-source scenarios.

Submitted 25 November, 2023; originally announced November 2023.

arXiv:2311.11285 [pdf, other]

TimeSQL: Improving Multivariate Time Series Forecasting with Multi-Scale Patching and Smooth Quadratic Loss

Authors: Site Mo, Haoxin Wang, Bixiong Li, Songhai Fan, Yuankai Wu, Xianggen Liu

Abstract: Time series is a special type of sequence data, a sequence of real-valued random variables collected at even intervals of time. The real-world multivariate time series comes with noises and contains complicated local and global temporal dynamics, making it difficult to forecast the future time series given the historical observations. This work proposes a simple and effective framework, coined as TimeSQL, which leverages multi-scale patching and smooth quadratic loss (SQL) to tackle the above challenges. The multi-scale patching transforms the time series into two-dimensional patches with different length scales, facilitating the perception of both locality and long-term correlations in time series. SQL is derived from the rational quadratic kernel and can dynamically adjust the gradients to avoid overfitting to the noises and outliers. Theoretical analysis demonstrates that, under mild conditions, the effect of the noises on the model with SQL is always smaller than that with MSE. Based on the two modules, TimeSQL achieves new state-of-the-art performance on the eight real-world benchmark datasets. Further ablation studies indicate that the key modules in TimeSQL could also enhance the results of other models for multivariate time series forecasting, standing as plug-and-play techniques.

Submitted 19 November, 2023; originally announced November 2023.

arXiv:2311.06217 [pdf, other]

MultiIoT: Towards Large-scale Multisensory Learning for the Internet of Things

Authors: Shentong Mo, Paul Pu Liang, Russ Salakhutdinov, Louis-Philippe Morency

Abstract: The Internet of Things (IoT), the network integrating billions of smart physical devices embedded with sensors, software, and communication technologies for the purpose of connecting and exchanging data with other devices and systems, is a critical and rapidly expanding component of our modern world. The IoT ecosystem provides a rich source of real-world modalities such as motion, thermal, geolocation, imaging, depth, sensors, video, and audio for prediction tasks involving the pose, gaze, activities, and gestures of humans as well as the touch, contact, pose, 3D of physical objects. Machine learning presents a rich opportunity to automatically process IoT data at scale, enabling efficient inference for impact in understanding human wellbeing, controlling physical devices, and interconnecting smart cities. To develop machine learning technologies for IoT, this paper proposes MultiIoT, the most expansive IoT benchmark to date, encompassing over 1.15 million samples from 12 modalities and 8 tasks. MultiIoT introduces unique challenges involving (1) learning from many sensory modalities, (2) fine-grained interactions across long temporal ranges, and (3) extreme heterogeneity due to unique structure and noise topologies in real-world sensors. We also release a set of strong modeling baselines, spanning modality and task-specific methods to multisensory and multitask models to encourage future research in multisensory representation learning for IoT.

Submitted 10 November, 2023; originally announced November 2023.

arXiv:2311.01144 [pdf, ps, other]

On Stable Rationality of Polytopes

Authors: Simen Westbye Moe

Abstract: Nicaise--Ottem introduced the notion of (stably) rational polytopes and studied this using a combinatorial description of the motivic volume. In this framework, we ask whether being non-stably rational is preserved under inclusions. We prove this holds for a large class of polytopes, leading to a combinatorial strategy for studying stable rationality of hypersurfaces in toric varieties. As a result, we obtain new bounds for non-stably rational hypersurfaces in projective space, improving the ones given by Schreieder when the field has characteristic 0. We also obtain similar bounds for double covers of projective space and some new classes of non-stably rational varieties in products of projective space.

Submitted 2 November, 2023; originally announced November 2023.

Comments: 34 pages

arXiv:2310.18850 [pdf, other]

Exploring Data Augmentations on Self-/Semi-/Fully- Supervised Pre-trained Models

Authors: Shentong Mo, Zhun Sun, Chao Li

Abstract: Data augmentation has become a standard component of vision pre-trained models to capture the invariance between augmented views. In practice, augmentation techniques that mask regions of a sample with zero/mean values or patches from other samples are commonly employed in pre-trained models with self-/semi-/fully-supervised contrastive losses. However, the underlying mechanism behind the effectiveness of these augmentation techniques remains poorly explored. To investigate the problem, we conduct an empirical study to quantify how data augmentation affects performance. Concretely, we apply 4 types of data augmentation, namely Random Erasing, CutOut, CutMix and MixUp, to a series of self-/semi-/fully-supervised pre-trained models. We report their performance on vision tasks such as image classification, object detection, instance segmentation, and semantic segmentation. We then explicitly evaluate the invariance and diversity of the feature embedding. We observe that: 1) Masking regions of the images decreases the invariance of the learned feature embedding while providing a more considerable diversity. 2) Manual annotations do not change the invariance or diversity of the learned feature embedding. 3) The MixUp approach improves the diversity significantly, with only a marginal decrease in terms of the invariance.

Submitted 28 October, 2023; originally announced October 2023.

arXiv:2309.15371 [pdf]

doi 10.1038/s41467-023-40997-1

From Stoner to Local Moment Magnetism in Atomically Thin Cr2Te3

Authors: Yong Zhong, Cheng Peng, Haili Huang, Dandan Guan, Jinwoong Hwang, Kuan H. Hsu, Yi Hu, Chunjing Jia, Brian Moritz, Donghui Lu, Jun-Sik Lee, Jin-Feng Jia, Thomas P. Devereaux, Sung-Kwan Mo, Zhi-Xun Shen

Abstract: The field of two-dimensional (2D) ferromagnetism has been proliferating over the past few years, with ongoing interests in basic science and potential applications in spintronic technology. However, a high-resolution spectroscopic study of the 2D ferromagnet is still lacking due to the small size and air sensitivity of the exfoliated nanoflakes. Here, we report a thickness-dependent ferromagnetism in epitaxially grown Cr2Te3 thin films and investigate the evolution of the underlying electronic structure by synergistic angle-resolved photoemission spectroscopy, scanning tunneling microscopy, x-ray absorption spectroscopy, and first-principle calculations. A conspicuous ferromagnetic transition from Stoner to Heisenberg-type is directly observed in the atomically thin limit, indicating that dimensionality is a powerful tuning knob to manipulate the novel properties of 2D magnetism. Monolayer Cr2Te3 retains robust ferromagnetism, but with a suppressed Curie temperature, due to the drastic drop in the density of states near the Fermi level. Our results establish atomically thin Cr2Te3 as an excellent platform to explore the dual nature of localized and itinerant ferromagnetism in 2D magnets.

Submitted 26 September, 2023; originally announced September 2023.

Comments: 32 pages, 4 + 10 figures

Journal ref: Nature Communications 14, 5340 (2023)

arXiv:2309.07694 [pdf, ps, other]

Tree of Uncertain Thoughts Reasoning for Large Language Models

Authors: Shentong Mo, Miao Xin

Abstract: While the recently introduced Tree of Thoughts (ToT) has heralded advancements in allowing Large Language Models (LLMs) to reason through foresight and backtracking for global decision-making, it has overlooked the inherent local uncertainties in intermediate decision points or "thoughts". These local uncertainties, intrinsic to LLMs given their potential for diverse responses, remain a significant concern in the reasoning process. Addressing this pivotal gap, we introduce the Tree of Uncertain Thoughts (TouT) - a reasoning framework tailored for LLMs. Our TouT effectively leverages Monte Carlo Dropout to quantify uncertainty scores associated with LLMs' diverse local responses at these intermediate steps. By marrying this local uncertainty quantification with global search algorithms, TouT enhances the model's precision in response generation. We substantiate our approach with rigorous experiments on two demanding planning tasks: Game of 24 and Mini Crosswords. The empirical evidence underscores TouT's superiority over both ToT and chain-of-thought prompting methods.

Submitted 14 September, 2023; originally announced September 2023.

arXiv:2309.05281 [pdf, other]

Class-Incremental Grouping Network for Continual Audio-Visual Learning

Authors: Shentong Mo, Weiguo Pian, Yapeng Tian

Abstract: Continual learning is a challenging problem in which models need to be trained on non-stationary data across sequential tasks for class-incremental learning. While previous methods have focused on using either regularization or rehearsal-based frameworks to alleviate catastrophic forgetting in image classification, they are limited to a single modality and cannot learn compact class-aware cross-modal representations for continual audio-visual learning. To address this gap, we propose a novel class-incremental grouping network (CIGN) that can learn category-wise semantic features to achieve continual audio-visual learning. Our CIGN leverages learnable audio-visual class tokens and audio-visual grouping to continually aggregate class-aware features. Additionally, it utilizes class tokens distillation and continual grouping to prevent forgetting parameters learned from previous tasks, thereby improving the model's ability to capture discriminative audio-visual categories. We conduct extensive experiments on VGGSound-Instruments, VGGSound-100, and VGG-Sound Sources benchmarks. Our experimental results demonstrate that the CIGN achieves state-of-the-art audio-visual class-incremental learning performance. Code is available at https://github.com/stoneMo/CIGN.

Submitted 11 September, 2023; originally announced September 2023.

Comments: ICCV 2023. arXiv admin note: text overlap with arXiv:2303.17056

arXiv:2308.11448 [pdf, other]

Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding

Authors: Jiantao Wu, Shentong Mo, Muhammad Awais, Sara Atito, Zhenhua Feng, Josef Kittler

Abstract: Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data. In the realm of computer vision, pretrained vision transformers (ViTs) have played a pivotal role in advancing transfer learning. Nonetheless, the escalating cost of finetuning these large models has posed a challenge due to the explosion of model size. This study endeavours to evaluate the effectiveness of pure self-supervised learning (SSL) techniques in computer vision tasks, obviating the need for finetuning, with the intention of emulating human-like capabilities in generalisation and recognition of unseen objects. To this end, we propose an evaluation protocol for zero-shot segmentation based on a prompting patch. Given a point on the target object as a prompt, the algorithm calculates the similarity map between the selected patch and the other patches; a simple thresholding is then applied to segment the target. Another evaluation uses intra-object and inter-object similarity to gauge the discriminatory ability of SSP ViTs. Insights from zero-shot segmentation from prompting and the discriminatory abilities of SSP led to the design of a simple SSP approach, termed MMC. This approach combines Masked image modelling for encouraging similarity of local features, Momentum based self-distillation for transferring semantics from global to local features, and global Contrast for promoting semantics of global features, to enhance discriminative representations of SSP ViTs. Consequently, our proposed method significantly reduces the overlap of intra-object and inter-object similarities, thereby facilitating effective object segmentation within an image. Our experiments reveal that MMC delivers top-tier results in zero-shot semantic segmentation across various datasets.
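
The prompt-based evaluation protocol described in this abstract (similarity map from one selected patch, then thresholding) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the use of cosine similarity, the function names `cosine` and `segment_from_prompt`, the threshold value 0.5, and the toy two-dimensional "patch features".

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def segment_from_prompt(features, prompt_index, threshold=0.5):
    """Return (similarity map, binary mask) for the patch chosen as the prompt.

    features: one feature vector per image patch (hypothetical ViT patch tokens).
    prompt_index: index of the patch containing the prompted point.
    threshold: hypothetical cut-off; patches at or above it are marked as target.
    """
    prompt = features[prompt_index]
    similarity = [cosine(prompt, f) for f in features]
    mask = [s >= threshold for s in similarity]
    return similarity, mask


# Toy example: patches 0 and 1 point in a similar direction, patches 2 and 3 do not.
features = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]
similarity, mask = segment_from_prompt(features, prompt_index=0, threshold=0.5)
print(mask)
```

In this toy input the mask selects the prompt patch and its near-duplicate only; with real features the threshold would need to be tuned, which the abstract does not specify.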

Submitted 22 August, 2023; originally announced August 2023.

arXiv:2308.11073 [pdf, other]

Audio-Visual Class-Incremental Learning

Authors: Weiguo Pian, Shentong Mo, Yunhui Guo, Yapeng Tian

Abstract: In this paper, we introduce audio-visual class-incremental learning, a class-incremental learning scenario for audio-visual video recognition. We demonstrate that joint audio-visual modeling can improve class-incremental learning, but current methods fail to preserve semantic similarity between audio and visual features as the number of incremental steps grows. Furthermore, we observe that audio-visual correlations learned in previous tasks can be forgotten as incremental steps progress, leading to poor performance. To overcome these challenges, we propose AV-CIL, which incorporates Dual-Audio-Visual Similarity Constraint (D-AVSC) to maintain both instance-aware and class-aware semantic similarity between audio-visual modalities and Visual Attention Distillation (VAD) to retain previously learned audio-guided visual attentive ability. We create three audio-visual class-incremental datasets, AVE-Class-Incremental (AVE-CI), Kinetics-Sounds-Class-Incremental (K-S-CI), and VGGSound100-Class-Incremental (VS100-CI) based on the AVE, Kinetics-Sounds, and VGGSound datasets, respectively. Our experiments on AVE-CI, K-S-CI, and VS100-CI demonstrate that AV-CIL significantly outperforms existing class-incremental learning methods in audio-visual class-incremental learning. Code and data are available at: https://github.com/weiguoPian/AV-CIL_ICCV2023.

Submitted 14 October, 2023; v1 submitted 21 August, 2023; originally announced August 2023.

Comments: Accepted at ICCV 2023

arXiv:2307.12679 [pdf, other]

An Estimator for the Sensitivity to Perturbations of Deep Neural Networks

Authors: Naman Maheshwari, Nicholas Malaya, Scott Moe, Jaydeep P. Kulkarni, Sudhanva Gurumurthi

Abstract: For Deep Neural Networks (DNNs) to become useful in safety-critical applications, such as self-driving cars and disease diagnosis, they must be stable to perturbations in input and model parameters. Characterizing the sensitivity of a DNN to perturbations is necessary to determine minimal bit-width precision that may be used to safely represent the network. However, no general result exists that is capable of predicting the sensitivity of a given DNN to round-off error, noise, or other perturbations in input. This paper derives an estimator that can predict such quantities. The estimator is derived via inequalities and matrix norms, and the resulting quantity is roughly analogous to a condition number for the entire neural network. An approximation of the estimator is tested on two Convolutional Neural Networks, AlexNet and VGG-19, using the ImageNet dataset. For each of these networks, the tightness of the estimator is explored via random perturbations and adversarial attacks.

Submitted 24 July, 2023; originally announced July 2023.

Comments: Actual work and paper concluded in January 2019

arXiv:2307.09062 [pdf]

doi 10.1103/PhysRevB.107.125125

Antiferromagnetic topological insulating state in Tb$_{0.02}$Bi$_{1.08}$Sb$_{0.9}$Te$_2$S single crystals

Authors: Lei Guo, Weiyao Zhao, Qile Li, Meng Xu, Lei Chen, Abdulhakim Bake, Thi-Hai-Yen Vu, Yahua He, Yong Fang, David Cortie, Sung-Kwan Mo, Mark Edmonds, Xiaolin Wang, Shuai Dong, Julie Karel, Ren-Kui Zheng

Abstract: Topological insulators are emerging materials with insulating bulk and symmetry protected nontrivial surface states. One of the most fascinating transport behaviors in a topological insulator is the quantized anomalous Hall insulator, which has been observed in magnetic-topological-insulator-based devices. In this work, we report a successful doping of rare earth element Tb into Bi$_{1.08}$Sb$_{0.9}$Te$_2$S topological insulator single crystals, in which the Tb moments are antiferromagnetically ordered below ~10 K. Benefiting from the in-bulk-gap Fermi level, transport behavior dominated by the topological surface states is observed below ~150 K. At low temperatures, strong Shubnikov-de Haas oscillations are observed, which exhibit 2D-like behavior. The topological insulator with long range magnetic ordering in rare earth doped Bi$_{1.08}$Sb$_{0.9}$Te$_2$S single crystal provides an ideal platform for quantum transport studies and potential applications.

Submitted 18 July, 2023; originally announced July 2023.

Comments: 15 pages, 3 figures

Journal ref: Physical Review B 107.12 (2023): 125125

arXiv:2307.03154 [pdf, other]

doi 10.1038/s41467-024-46862-z

Reversible Non-Volatile Electronic Switching in a Near Room Temperature van der Waals Ferromagnet

Authors: Han Wu, Lei Chen, Paul Malinowski, Jianwei Huang, Qinwen Deng, Kirsty Scott, Bo Gyu Jang, Jacob P. C. Ruff, Yu He, Xiang Chen, Chaowei Hu, Ziqin Yue, Ji Seop Oh, Xiaokun Teng, Yucheng Guo, Mason Klemm, Chuqiao Shi, Yue Shi, Chandan Setty, Tyler Werner, Makoto Hashimoto, Donghui Lu, T. Yilmaz, Elio Vescovo, Sung-Kwan Mo , et al. (15 additional authors not shown)

Abstract: The ability to reversibly toggle between two distinct states in a non-volatile method is important for information storage applications. Such devices have been realized for phase-change materials, which utilize local heating methods to toggle between a crystalline and an amorphous state with distinct electrical properties. To expand such kind of switching between two topologically distinct phases requires non-volatile switching between two crystalline phases with distinct symmetries. Here we report the observation of reversible and non-volatile switching between two stable and closely-related crystal structures with remarkably distinct electronic structures in the near room temperature van der Waals ferromagnet Fe$_{5-δ}$GeTe$_2$. From a combination of characterization techniques we show that the switching is enabled by the ordering and disordering of an Fe site vacancy that results in distinct crystalline symmetries of the two phases that can be controlled by a thermal annealing and quenching method. Furthermore, from symmetry analysis as well as first principle calculations, we provide understanding of the key distinction in the observed electronic structures of the two phases: topological nodal lines compatible with the preserved global inversion symmetry in the site-disordered phase, and flat bands resulting from quantum destructive interference on a bipartite crystalline lattice formed by the presence of the site order as well as the lifting of the topological degeneracy due to the broken inversion symmetry in the site-ordered phase. Our work not only reveals a rich variety of quantum phases emergent in the metallic van der Waals ferromagnets due to the presence of site ordering, but also demonstrates the potential of these highly tunable two-dimensional magnets for memory and spintronics applications.

Submitted 6 July, 2023; originally announced July 2023.

Journal ref: Nat Commun 15, 2739 (2024)

arXiv:2307.01831 [pdf, other]

DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation

Authors: Shentong Mo, Enze Xie, Ruihang Chu, Lewei Yao, Lanqing Hong, Matthias Nießner, Zhenguo Li

Abstract: Recent Diffusion Transformers (e.g., DiT) have demonstrated their powerful effectiveness in generating high-quality 2D images. However, it is still being determined whether the Transformer architecture performs equally well in 3D shape generation, as previous 3D diffusion methods mostly adopted the U-Net architecture. To bridge this gap, we propose a novel Diffusion Transformer for 3D shape generation, namely DiT-3D, which can directly operate the denoising process on voxelized point clouds using plain Transformers. Compared to existing U-Net approaches, our DiT-3D is more scalable in model size and produces much higher quality generations. Specifically, the DiT-3D adopts the design philosophy of DiT but modifies it by incorporating 3D positional and patch embeddings to adaptively aggregate input from voxelized point clouds. To reduce the computational cost of self-attention in 3D shape generation, we incorporate 3D window attention into Transformer blocks, as the increased 3D token length resulting from the additional dimension of voxels can lead to high computation. Finally, linear and devoxelization layers are used to predict the denoised point clouds. In addition, our transformer architecture supports efficient fine-tuning from 2D to 3D, where the pre-trained DiT-2D checkpoint on ImageNet can significantly improve DiT-3D on ShapeNet. Experimental results on the ShapeNet dataset demonstrate that the proposed DiT-3D achieves state-of-the-art performance in high-fidelity and diverse 3D point cloud generation. In particular, our DiT-3D decreases the 1-Nearest Neighbor Accuracy of the state-of-the-art method by 4.59 and increases the Coverage metric by 3.51 when evaluated on Chamfer Distance.

Submitted 4 July, 2023; originally announced July 2023.

Comments: Project Page: https://dit-3d.github.io/

arXiv:2306.16329 [pdf, other]

DiffComplete: Diffusion-based Generative 3D Shape Completion

Authors: Ruihang Chu, Enze Xie, Shentong Mo, Zhenguo Li, Matthias Nießner, Chi-Wing Fu, Jiaya Jia

Abstract: We introduce a new diffusion-based approach for shape completion on 3D range scans. Compared with prior deterministic and probabilistic methods, we strike a balance between realism, multi-modality, and high fidelity. We propose DiffComplete by casting shape completion as a generative task conditioned on the incomplete shape. Our key designs are two-fold. First, we devise a hierarchical feature aggregation mechanism to inject conditional features in a spatially-consistent manner. So, we can capture both local details and broader contexts of the conditional inputs to control the shape completion. Second, we propose an occupancy-aware fusion strategy in our model to enable the completion of multiple partial shapes and introduce higher flexibility on the input conditions. DiffComplete sets a new SOTA performance (e.g., 40% decrease on l_1 error) on two large-scale 3D shape completion benchmarks. Our completed shapes not only have a realistic outlook compared with the deterministic methods but also exhibit high similarity to the ground truths compared with the probabilistic alternatives. Further, DiffComplete has strong generalizability on objects of entirely unseen classes for both synthetic and real data, eliminating the need for model re-training in various applications.

Submitted 28 June, 2023; originally announced June 2023.

Comments: Project Page: https://ruihangchu.com/diffcomplete.html

arXiv:2306.14490 [pdf, other]

TaiChi Action Capture and Performance Analysis with Multi-view RGB Cameras

Authors: Jianwei Li, Siyu Mo, Yanfei Shen

Abstract: Recent advances in computer vision and deep learning have influenced the field of sports performance analysis, allowing researchers to track and reconstruct freely moving humans without any marker attachment. However, there are few works on vision-based motion capture and intelligent analysis for professional TaiChi movement. In this paper, we propose a framework for TaiChi performance capture and analysis with multi-view geometry and artificial intelligence technology. The main innovative work is as follows: 1) A multi-camera system suitable for TaiChi motion capture is built and the multi-view TaiChi data is collected and processed; 2) A combination of traditional visual method and implicit neural radiance field is proposed to achieve sparse 3D skeleton fusion and dense 3D surface reconstruction; 3) The normalization modeling of movement sequences is carried out based on motion transfer, so as to realize TaiChi performance analysis for different groups. We have carried out evaluation experiments, and the experimental results have shown the efficiency of our method.

Submitted 26 June, 2023; originally announced June 2023.

Showing 1–50 of 247 results for author: Moe, S