Search | arXiv e-print repository

SAM-LAD: Segment Anything Model Meets Zero-Shot Logic Anomaly Detection

Authors: Yun Peng, Xiao Lin, Nachuan Ma, Jiayuan Du, Chuangwei Liu, Chengju Liu, Qijun Chen

Abstract: Visual anomaly detection is vital in real-world applications, such as industrial defect detection and medical diagnosis. However, most existing methods focus on local structural anomalies and fail to detect higher-level functional anomalies under logical conditions. Although recent studies have explored logical anomaly detection, they can only address simple anomalies like missing or addition and… ▽ More Visual anomaly detection is vital in real-world applications, such as industrial defect detection and medical diagnosis. However, most existing methods focus on local structural anomalies and fail to detect higher-level functional anomalies under logical conditions. Although recent studies have explored logical anomaly detection, they can only address simple anomalies like missing or addition and show poor generalizability due to being heavily data-driven. To fill this gap, we propose SAM-LAD, a zero-shot, plug-and-play framework for logical anomaly detection in any scene. First, we obtain a query image's feature map using a pre-trained backbone. Simultaneously, we retrieve the reference images and their corresponding feature maps via the nearest neighbor search of the query image. Then, we introduce the Segment Anything Model (SAM) to obtain object masks of the query and reference images. Each object mask is multiplied with the entire image's feature map to obtain object feature maps. Next, an Object Matching Model (OMM) is proposed to match objects in the query and reference images. To facilitate object matching, we further propose a Dynamic Channel Graph Attention (DCGA) module, treating each object as a keypoint and converting its feature maps into feature vectors. Finally, based on the object matching relations, an Anomaly Measurement Model (AMM) is proposed to detect objects with logical anomalies. Structural anomalies in the objects can also be detected. We validate our proposed SAM-LAD using various benchmarks, including industrial datasets (MVTec Loco AD, MVTec AD), and the logical dataset (DigitAnatomy). Extensive experimental results demonstrate that SAM-LAD outperforms existing SoTA methods, particularly in detecting logical anomalies. △ Less

Submitted 5 June, 2024; v1 submitted 2 June, 2024; originally announced June 2024.

arXiv:2405.18003 [pdf, other]

MAVIN: Multi-Action Video Generation with Diffusion Models via Transition Video Infilling

Authors: Bowen Zhang, Xiaofei Xie, Haotian Lu, Na Ma, Tianlin Li, Qing Guo

Abstract: Diffusion-based video generation has achieved significant progress, yet generating multiple actions that occur sequentially remains a formidable task. Directly generating a video with sequential actions can be extremely challenging due to the scarcity of fine-grained action annotations and the difficulty in establishing temporal semantic correspondences and maintaining long-term consistency. To ta… ▽ More Diffusion-based video generation has achieved significant progress, yet generating multiple actions that occur sequentially remains a formidable task. Directly generating a video with sequential actions can be extremely challenging due to the scarcity of fine-grained action annotations and the difficulty in establishing temporal semantic correspondences and maintaining long-term consistency. To tackle this, we propose an intuitive and straightforward solution: splicing multiple single-action video segments sequentially. The core challenge lies in generating smooth and natural transitions between these segments given the inherent complexity and variability of action transitions. We introduce MAVIN (Multi-Action Video INfilling model), designed to generate transition videos that seamlessly connect two given videos, forming a cohesive integrated sequence. MAVIN incorporates several innovative techniques to address challenges in the transition video infilling task. Firstly, a consecutive noising strategy coupled with variable-length sampling is employed to handle large infilling gaps and varied generation lengths. Secondly, boundary frame guidance (BFG) is proposed to address the lack of semantic guidance during transition generation. Lastly, a Gaussian filter mixer (GFM) dynamically manages noise initialization during inference, mitigating train-test discrepancy while preserving generation flexibility. Additionally, we introduce a new metric, CLIP-RS (CLIP Relative Smoothness), to evaluate temporal coherence and smoothness, complementing traditional quality-based metrics. Experimental results on horse and tiger scenarios demonstrate MAVIN's superior performance in generating smooth and coherent video transitions compared to existing methods. △ Less

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.16383 [pdf, other]

Rewarded Region Replay (R3) for Policy Learning with Discrete Action Space

Authors: Bangzheng Li, Ningshan Ma, Zifan Wang

Abstract: We introduce a new on-policy algorithm called Rewarded Region Replay (R3), which significantly improves on PPO in solving environments with discrete action spaces. R3 improves sample efficiency by using a replay buffer which contains past successful trajectories with reward above a certain threshold, which are used to update a PPO agent with importance sampling. Crucially, we discard the importanc… ▽ More We introduce a new on-policy algorithm called Rewarded Region Replay (R3), which significantly improves on PPO in solving environments with discrete action spaces. R3 improves sample efficiency by using a replay buffer which contains past successful trajectories with reward above a certain threshold, which are used to update a PPO agent with importance sampling. Crucially, we discard the importance sampling factors which are above a certain ratio to reduce variance and stabilize training. We found that R3 significantly outperforms PPO in Minigrid environments with sparse rewards and discrete action space, such as DoorKeyEnv and CrossingEnv, and moreover we found that the improvement margin of our method versus baseline PPO increases with the complexity of the environment. We also benchmarked the performance of R3 against DDQN (Double Deep Q-Network), which is a standard baseline in off-policy methods for discrete actions, and found that R3 also outperforms DDQN agent in DoorKeyEnv. Lastly, we adapt the idea of R3 to dense reward setting to obtain the Dense R3 algorithm (or DR3) and benchmarked it against PPO on Cartpole-V1 environment. We found that DR3 outperforms PPO significantly on this dense reward environment. Our code can be found at https://github.com/chry-santhemum/R3. △ Less

Submitted 25 May, 2024; originally announced May 2024.

ACM Class: I.2.6

arXiv:2405.06959 [pdf, other]

AHPPEBot: Autonomous Robot for Tomato Harvesting based on Phenoty** and Pose Estimation

Authors: Xingxu Li, Nan Ma, Yiheng Han, Shun Yang, Siyi Zheng

Abstract: To address the limitations inherent to conventional automated harvesting robots specifically their suboptimal success rates and risk of crop damage, we design a novel bot named AHPPEBot which is capable of autonomous harvesting based on crop phenoty** and pose estimation. Specifically, In phenoty**, the detection, association, and maturity estimation of tomato trusses and individual fruits are… ▽ More To address the limitations inherent to conventional automated harvesting robots specifically their suboptimal success rates and risk of crop damage, we design a novel bot named AHPPEBot which is capable of autonomous harvesting based on crop phenoty** and pose estimation. Specifically, In phenoty**, the detection, association, and maturity estimation of tomato trusses and individual fruits are accomplished through a multi-task YOLOv5 model coupled with a detection-based adaptive DBScan clustering algorithm. In pose estimation, we employ a deep learning model to predict seven semantic keypoints on the pedicel. These keypoints assist in the robot's path planning, minimize target contact, and facilitate the use of our specialized end effector for harvesting. In autonomous tomato harvesting experiments conducted in commercial greenhouses, our proposed robot achieved a harvesting success rate of 86.67%, with an average successful harvest time of 32.46 s, showcasing its continuous and robust harvesting capabilities. The result underscores the potential of harvesting robots to bridge the labor gap in agriculture. △ Less

Submitted 11 May, 2024; originally announced May 2024.

Comments: Accepted by 2024 IEEE International Conference on Robotics and Automation (ICRA),7 pages, 3 figures

arXiv:2403.18058 [pdf, other]

COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning

Authors: Yuelin Bai, Xinrun Du, Yiming Liang, Yonggang **, Ziqiang Liu, Junting Zhou, Tianyu Zheng, Xincheng Zhang, Nuo Ma, Zekun Wang, Ruibin Yuan, Haihong Wu, Hongquan Lin, Wenhao Huang, Jiajun Zhang, Wenhu Chen, Chenghua Lin, Jie Fu, Min Yang, Shiwen Ni, Ge Zhang

Abstract: Recently, there have been significant advancements in large language models (LLMs), particularly focused on the English language. These advancements have enabled these LLMs to understand and execute complex instructions with unprecedented accuracy and fluency. However, despite these advancements, there remains a noticeable gap in the development of Chinese instruction tuning. The unique linguistic… ▽ More Recently, there have been significant advancements in large language models (LLMs), particularly focused on the English language. These advancements have enabled these LLMs to understand and execute complex instructions with unprecedented accuracy and fluency. However, despite these advancements, there remains a noticeable gap in the development of Chinese instruction tuning. The unique linguistic features and cultural depth of the Chinese language pose challenges for instruction tuning tasks. Existing datasets are either derived from English-centric LLMs or are ill-suited for aligning with the interaction patterns of real-world Chinese users. To bridge this gap, we introduce COIG-CQIA, a high-quality Chinese instruction tuning dataset. Our aim is to build a diverse, wide-ranging instruction-tuning dataset to better align model behavior with human interactions. To this end, we collect a high-quality human-written corpus from various sources on the Chinese Internet, including Q&A communities, Wikis, examinations, and existing NLP datasets. This corpus was rigorously filtered and carefully processed to form the COIG-CQIA dataset. Furthermore, we train models of various scales on different subsets of CQIA, following in-depth evaluation and analyses. The findings from our experiments offer valuable insights for selecting and develo** Chinese instruction-tuning datasets. We also find that models trained on CQIA-Subset achieve competitive results in human assessment as well as knowledge and security benchmarks. Data are available at https://huggingface.co/datasets/m-a-p/COIG-CQIA △ Less

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2402.17664 [pdf, other]

Bayesian Differentiable Physics for Cloth Digitalization

Authors: Deshan Gong, Ningtao Mao, He Wang

Abstract: We propose a new method for cloth digitalization. Deviating from existing methods which learn from data captured under relatively casual settings, we propose to learn from data captured in strictly tested measuring protocols, and find plausible physical parameters of the cloths. However, such data is currently absent, so we first propose a new dataset with accurate cloth measurements. Further, the… ▽ More We propose a new method for cloth digitalization. Deviating from existing methods which learn from data captured under relatively casual settings, we propose to learn from data captured in strictly tested measuring protocols, and find plausible physical parameters of the cloths. However, such data is currently absent, so we first propose a new dataset with accurate cloth measurements. Further, the data size is considerably smaller than the ones in current deep learning, due to the nature of the data capture process. To learn from small data, we propose a new Bayesian differentiable cloth model to estimate the complex material heterogeneity of real cloths. It can provide highly accurate digitalization from very limited data samples. Through exhaustive evaluation and comparison, we show our method is accurate in cloth digitalization, efficient in learning from limited data samples, and general in capturing material variations. Code and data are available https://github.com/realcrane/Bayesian-Differentiable-Physics-for-Cloth-Digitalization △ Less

Submitted 11 March, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: 9 pages, 8 figures, to be published in CVPR

ACM Class: F.4.8; I.6.8

arXiv:2402.13607 [pdf, other]

CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models

Authors: Fuwen Luo, Chi Chen, Zihao Wan, Zhaolu Kang, Qidong Yan, Yingjie Li, Xiaolong Wang, Siyu Wang, Ziyue Wang, Xiaoyue Mi, Peng Li, Ning Ma, Maosong Sun, Yang Liu

Abstract: Multimodal large language models (MLLMs) have demonstrated promising results in a variety of tasks that combine vision and language. As these models become more integral to research and applications, conducting comprehensive evaluations of their capabilities has grown increasingly important. However, most existing benchmarks fail to consider that, in certain situations, images need to be interpret… ▽ More Multimodal large language models (MLLMs) have demonstrated promising results in a variety of tasks that combine vision and language. As these models become more integral to research and applications, conducting comprehensive evaluations of their capabilities has grown increasingly important. However, most existing benchmarks fail to consider that, in certain situations, images need to be interpreted within a broader context. In this work, we introduce a new benchmark, named as CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension. Our findings indicate that MLLMs consistently fall short of human performance on this benchmark. Further analysis confirms that these models struggle to effectively extract and utilize contextual information to improve their understanding of images. This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner. View our project website at https://thunlp-mt.github.io/CODIS. △ Less

Submitted 4 June, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

arXiv:2402.04883 [pdf, other]

Toward Accurate Camera-based 3D Object Detection via Cascade Depth Estimation and Calibration

Authors: Chaoqun Wang, Yiran Qin, Zijian Kang, Ningning Ma, Ruimao Zhang

Abstract: Recent camera-based 3D object detection is limited by the precision of transforming from image to 3D feature spaces, as well as the accuracy of object localization within the 3D space. This paper aims to address such a fundamental problem of camera-based 3D object detection: How to effectively learn depth information for accurate feature lifting and object localization. Different from previous met… ▽ More Recent camera-based 3D object detection is limited by the precision of transforming from image to 3D feature spaces, as well as the accuracy of object localization within the 3D space. This paper aims to address such a fundamental problem of camera-based 3D object detection: How to effectively learn depth information for accurate feature lifting and object localization. Different from previous methods which directly predict depth distributions by using a supervised estimation model, we propose a cascade framework consisting of two depth-aware learning paradigms. First, a depth estimation (DE) scheme leverages relative depth information to realize the effective feature lifting from 2D to 3D spaces. Furthermore, a depth calibration (DC) scheme introduces depth reconstruction to further adjust the 3D object localization perturbation along the depth axis. In practice, the DE is explicitly realized by using both the absolute and relative depth optimization loss to promote the precision of depth prediction, while the capability of DC is implicitly embedded into the detection Transformer through a depth denoising mechanism in the training phase. The entire model training is accomplished through an end-to-end manner. We propose a baseline detector and evaluate the effectiveness of our proposal with +2.2%/+2.7% NDS/mAP improvements on NuScenes benchmark, and gain a comparable performance with 55.9%/45.7% NDS/mAP. Furthermore, we conduct extensive experiments to demonstrate its generality based on various detectors with about +2% NDS improvements. △ Less

Submitted 7 February, 2024; originally announced February 2024.

Comments: Accepted to ICRA2024

arXiv:2402.02950 [pdf, other]

Semantic Entropy Can Simultaneously Benefit Transmission Efficiency and Channel Security of Wireless Semantic Communications

Authors: Yankai Rong, Guoshun Nan, Minwei Zhang, Sihan Chen, Songtao Wang, Xuefei Zhang, Nan Ma, Shixun Gong, Zhaohui Yang, Qimei Cui, Xiaofeng Tao, Tony Q. S. Quek

Abstract: Recently proliferated deep learning-based semantic communications (DLSC) focus on how transmitted symbols efficiently convey a desired meaning to the destination. However, the sensitivity of neural models and the openness of wireless channels cause the DLSC system to be extremely fragile to various malicious attacks. This inspires us to ask a question: "Can we further exploit the advantages of tra… ▽ More Recently proliferated deep learning-based semantic communications (DLSC) focus on how transmitted symbols efficiently convey a desired meaning to the destination. However, the sensitivity of neural models and the openness of wireless channels cause the DLSC system to be extremely fragile to various malicious attacks. This inspires us to ask a question: "Can we further exploit the advantages of transmission efficiency in wireless semantic communications while also alleviating its security disadvantages?". Kee** this in mind, we propose SemEntropy, a novel method that answers the above question by exploring the semantics of data for both adaptive transmission and physical layer encryption. Specifically, we first introduce semantic entropy, which indicates the expectation of various semantic scores regarding the transmission goal of the DLSC. Equipped with such semantic entropy, we can dynamically assign informative semantics to Orthogonal Frequency Division Multiplexing (OFDM) subcarriers with better channel conditions in a fine-grained manner. We also use the entropy to guide semantic key generation to safeguard communications over open wireless channels. By doing so, both transmission efficiency and channel security can be simultaneously improved. Extensive experiments over various benchmarks show the effectiveness of the proposed SemEntropy. We discuss the reason why our proposed method benefits secure transmission of DLSC, and also give some interesting findings, e.g., SemEntropy can keep the semantic accuracy remain 95% with 60% less transmission. △ Less

Submitted 6 February, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: 13 pages, 12 figures

arXiv:2402.01269 [pdf, other]

Spectrum-guided Feature Enhancement Network for Event Person Re-Identification

Authors: Hongchen Tan, Yi Zhang, ** Liu, Baocai Yin, Nan Ma, Xin Li, Huchuan Lu

Abstract: As a cutting-edge biosensor, the event camera holds significant potential in the field of computer vision, particularly regarding privacy preservation. However, compared to traditional cameras, event streams often contain noise and possess extremely sparse semantics, posing a formidable challenge for event-based person re-identification (event Re-ID). To address this, we introduce a novel event pe… ▽ More As a cutting-edge biosensor, the event camera holds significant potential in the field of computer vision, particularly regarding privacy preservation. However, compared to traditional cameras, event streams often contain noise and possess extremely sparse semantics, posing a formidable challenge for event-based person re-identification (event Re-ID). To address this, we introduce a novel event person re-identification network: the Spectrum-guided Feature Enhancement Network (SFE-Net). This network consists of two innovative components: the Multi-grain Spectrum Attention Mechanism (MSAM) and the Consecutive Patch Dropout Module (CPDM). MSAM employs a fourier spectrum transform strategy to filter event noise, while also utilizing an event-guided multi-granularity attention strategy to enhance and capture discriminative person semantics. CPDM employs a consecutive patch dropout strategy to generate multiple incomplete feature maps, encouraging the deep Re-ID model to equally perceive each effective region of the person's body and capture robust person descriptors. Extensive experiments on Event Re-ID datasets demonstrate that our SFE-Net achieves the best performance in this task. △ Less

Submitted 2 February, 2024; originally announced February 2024.

arXiv:2401.15647 [pdf, other]

UP-CrackNet: Unsupervised Pixel-Wise Road Crack Detection via Adversarial Image Restoration

Authors: Nachuan Ma, Rui Fan, Lihua Xie

Abstract: Over the past decade, automated methods have been developed to detect cracks more efficiently, accurately, and objectively, with the ultimate goal of replacing conventional manual visual inspection techniques. Among these methods, semantic segmentation algorithms have demonstrated promising results in pixel-wise crack detection tasks. However, training such networks requires a large amount of huma… ▽ More Over the past decade, automated methods have been developed to detect cracks more efficiently, accurately, and objectively, with the ultimate goal of replacing conventional manual visual inspection techniques. Among these methods, semantic segmentation algorithms have demonstrated promising results in pixel-wise crack detection tasks. However, training such networks requires a large amount of human-annotated datasets with pixel-level annotations, which is a highly labor-intensive and time-consuming process. Moreover, supervised learning-based methods often struggle with poor generalizability in unseen datasets. Therefore, we propose an unsupervised pixel-wise road crack detection network, known as UP-CrackNet. Our approach first generates multi-scale square masks and randomly selects them to corrupt undamaged road images by removing certain regions. Subsequently, a generative adversarial network is trained to restore the corrupted regions by leveraging the semantic context learned from surrounding uncorrupted regions. During the testing phase, an error map is generated by calculating the difference between the input and restored images, which allows for pixel-wise crack detection. Our comprehensive experimental results demonstrate that UP-CrackNet outperforms other general-purpose unsupervised anomaly detection algorithms, and exhibits satisfactory performance and superior generalizability when compared with state-of-the-art supervised crack segmentation algorithms. Our source code is publicly available at mias.group/UP-CrackNet. △ Less

Submitted 6 May, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

arXiv:2401.08740 [pdf, other]

SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

Authors: Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, Saining Xie

Abstract: We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework, which allows for connecting two distributions in a more flexible way than standard diffusion models, makes possible a modular study of various design choices impacting generative models built on dynamical transport: using discrete vs. c… ▽ More We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework, which allows for connecting two distributions in a more flexible way than standard diffusion models, makes possible a modular study of various design choices impacting generative models built on dynamical transport: using discrete vs. continuous time learning, deciding the objective for the model to learn, choosing the interpolant connecting the distributions, and deploying a deterministic or stochastic sampler. By carefully introducing the above ingredients, SiT surpasses DiT uniformly across model sizes on the conditional ImageNet 256x256 benchmark using the exact same backbone, number of parameters, and GFLOPs. By exploring various diffusion coefficients, which can be tuned separately from learning, SiT achieves an FID-50K score of 2.06. △ Less

Submitted 16 January, 2024; originally announced January 2024.

Comments: Code available: https://github.com/willisma/SiT

arXiv:2310.19817 [pdf, other]

Intelligibility prediction with a pretrained noise-robust automatic speech recognition model

Authors: Zehai Tu, Ning Ma, Jon Barker

Abstract: This paper describes two intelligibility prediction systems derived from a pretrained noise-robust automatic speech recognition (ASR) model for the second Clarity Prediction Challenge (CPC2). One system is intrusive and leverages the hidden representations of the ASR model. The other system is non-intrusive and makes predictions with derived ASR uncertainty. The ASR model is only pretrained with a… ▽ More This paper describes two intelligibility prediction systems derived from a pretrained noise-robust automatic speech recognition (ASR) model for the second Clarity Prediction Challenge (CPC2). One system is intrusive and leverages the hidden representations of the ASR model. The other system is non-intrusive and makes predictions with derived ASR uncertainty. The ASR model is only pretrained with a simulated noisy speech corpus and does not take advantage of the CPC2 data. For that reason, the intelligibility prediction systems are robust to unseen scenarios given the accurate prediction performance on the CPC2 evaluation. △ Less

Submitted 20 October, 2023; originally announced October 2023.

arXiv:2310.14184 [pdf, other]

Partition Speeds Up Learning Implicit Neural Representations Based on Exponential-Increase Hypothesis

Authors: Ke Liu, Feng Liu, Haishuai Wang, Ning Ma, Jiajun Bu, Bo Han

Abstract: $\textit{Implicit neural representations}$ (INRs) aim to learn a $\textit{continuous function}$ (i.e., a neural network) to represent an image, where the input and output of the function are pixel coordinates and RGB/Gray values, respectively. However, images tend to consist of many objects whose colors are not perfectly consistent, resulting in the challenge that image is actually a $\textit{disc… ▽ More $\textit{Implicit neural representations}$ (INRs) aim to learn a $\textit{continuous function}$ (i.e., a neural network) to represent an image, where the input and output of the function are pixel coordinates and RGB/Gray values, respectively. However, images tend to consist of many objects whose colors are not perfectly consistent, resulting in the challenge that image is actually a $\textit{discontinuous piecewise function}$ and cannot be well estimated by a continuous function. In this paper, we empirically investigate that if a neural network is enforced to fit a discontinuous piecewise function to reach a fixed small error, the time costs will increase exponentially with respect to the boundaries in the spatial domain of the target signal. We name this phenomenon the $\textit{exponential-increase}$ hypothesis. Under the $\textit{exponential-increase}$ hypothesis, learning INRs for images with many objects will converge very slowly. To address this issue, we first prove that partitioning a complex signal into several sub-regions and utilizing piecewise INRs to fit that signal can significantly speed up the convergence. Based on this fact, we introduce a simple partition mechanism to boost the performance of two INR methods for image reconstruction: one for learning INRs, and the other for learning-to-learn INRs. In both cases, we partition an image into different sub-regions and dedicate smaller networks for each part. In addition, we further propose two partition rules based on regular grids and semantic segmentation maps, respectively. Extensive experiments validate the effectiveness of the proposed partitioning methods in terms of learning INR for a single image (ordinary learning framework) and the learning-to-learn framework. △ Less

Submitted 22 October, 2023; originally announced October 2023.

Comments: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023

arXiv:2309.08966 [pdf, other]

FF-LOGO: Cross-Modality Point Cloud Registration with Feature Filtering and Local to Global Optimization

Authors: Nan Ma, Mohan Wang, Yiheng Han, Yong-** Liu

Abstract: Cross-modality point cloud registration is confronted with significant challenges due to inherent differences in modalities between different sensors. We propose a cross-modality point cloud registration framework FF-LOGO: a cross-modality point cloud registration method with feature filtering and local-global optimization. The cross-modality feature correlation filtering module extracts geometric… ▽ More Cross-modality point cloud registration is confronted with significant challenges due to inherent differences in modalities between different sensors. We propose a cross-modality point cloud registration framework FF-LOGO: a cross-modality point cloud registration method with feature filtering and local-global optimization. The cross-modality feature correlation filtering module extracts geometric transformation-invariant features from cross-modality point clouds and achieves point selection by feature matching. We also introduce a cross-modality optimization process, including a local adaptive key region aggregation module and a global modality consistency fusion optimization module. Experimental results demonstrate that our two-stage optimization significantly improves the registration accuracy of the feature association and selection module. Our method achieves a substantial increase in recall rate compared to the current state-of-the-art methods on the 3DCSR dataset, improving from 40.59% to 75.74%. Our code will be available at https://github.com/wangmohan17/FFLOGO. △ Less

Submitted 12 April, 2024; v1 submitted 16 September, 2023; originally announced September 2023.

Comments: Accepted by 2024 IEEE International Conference on Robotics and Automation (ICRA),7 pages, 2 figures

arXiv:2309.07084 [pdf, other]

SupFusion: Supervised LiDAR-Camera Fusion for 3D Object Detection

Authors: Yiran Qin, Chaoqun Wang, Zijian Kang, Ningning Ma, Zhen Li, Ruimao Zhang

Abstract: In this paper, we propose a novel training strategy called SupFusion, which provides an auxiliary feature level supervision for effective LiDAR-Camera fusion and significantly boosts detection performance. Our strategy involves a data enhancement method named Polar Sampling, which densifies sparse objects and trains an assistant model to generate high-quality features as the supervision. These fea… ▽ More In this paper, we propose a novel training strategy called SupFusion, which provides an auxiliary feature level supervision for effective LiDAR-Camera fusion and significantly boosts detection performance. Our strategy involves a data enhancement method named Polar Sampling, which densifies sparse objects and trains an assistant model to generate high-quality features as the supervision. These features are then used to train the LiDAR-Camera fusion model, where the fusion feature is optimized to simulate the generated high-quality features. Furthermore, we propose a simple yet effective deep fusion module, which contiguously gains superior performance compared with previous fusion methods with SupFusion strategy. In such a manner, our proposal shares the following advantages. Firstly, SupFusion introduces auxiliary feature-level supervision which could boost LiDAR-Camera detection performance without introducing extra inference costs. Secondly, the proposed deep fusion could continuously improve the detector's abilities. Our proposed SupFusion and deep fusion module is plug-and-play, we make extensive experiments to demonstrate its effectiveness. Specifically, we gain around 2% 3D mAP improvements on KITTI benchmark based on multiple LiDAR-Camera 3D detectors. △ Less

Submitted 31 October, 2023; v1 submitted 13 September, 2023; originally announced September 2023.

Comments: Accepted to ICCV2023

arXiv:2309.02171 [pdf, other]

A Wideband MIMO Channel Model for Aerial Intelligent Reflecting Surface-Assisted Wireless Communications

Authors: Shaoyi Liu, Nan Ma, Yaning Chen, Ke Peng, Dongsheng Xue

Abstract: Compared to traditional intelligent reflecting surfaces(IRS), aerial IRS (AIRS) has unique advantages, such as more flexible deployment and wider service coverage. However, modeling AIRS in the channel presents new challenges due to their mobility. In this paper, a three-dimensional (3D) wideband channel model for AIRS and IRS joint-assisted multiple-input multiple-output (MIMO) communication syst… ▽ More Compared to traditional intelligent reflecting surfaces(IRS), aerial IRS (AIRS) has unique advantages, such as more flexible deployment and wider service coverage. However, modeling AIRS in the channel presents new challenges due to their mobility. In this paper, a three-dimensional (3D) wideband channel model for AIRS and IRS joint-assisted multiple-input multiple-output (MIMO) communication system is proposed, where considering the rotational degrees of freedom in three directions and the motion angles of AIRS in space. Based on the proposed model, the channel impulse response (CIR), correlation function, and channel capacity are derived, and several feasible joint phase shifts schemes for AIRS and IRS units are proposed. Simulation results show that the proposed model can capture the channel characteristics accurately, and the proposed phase shifts methods can effectively improve the channel statistical characteristics and increase the system capacity. Additionally, we observe that in certain scenarios, the paths involving the IRS and the line-of-sight (LoS) paths exhibit similar characteristics. These findings provide valuable insights for the future development of intelligent communication systems. △ Less

Submitted 5 September, 2023; originally announced September 2023.

Comments: 6 pages, 7 figures

arXiv:2308.05309 [pdf, other]

Homophily-enhanced Structure Learning for Graph Clustering

Authors: Ming Gu, Gaoming Yang, Sheng Zhou, Ning Ma, Jiawei Chen, Qiaoyu Tan, Meihan Liu, Jiajun Bu

Abstract: Graph clustering is a fundamental task in graph analysis, and recent advances in utilizing graph neural networks (GNNs) have shown impressive results. Despite the success of existing GNN-based graph clustering methods, they often overlook the quality of graph structure, which is inherent in real-world graphs due to their sparse and multifarious nature, leading to subpar performance. Graph structur… ▽ More Graph clustering is a fundamental task in graph analysis, and recent advances in utilizing graph neural networks (GNNs) have shown impressive results. Despite the success of existing GNN-based graph clustering methods, they often overlook the quality of graph structure, which is inherent in real-world graphs due to their sparse and multifarious nature, leading to subpar performance. Graph structure learning allows refining the input graph by adding missing links and removing spurious connections. However, previous endeavors in graph structure learning have predominantly centered around supervised settings, and cannot be directly applied to our specific clustering tasks due to the absence of ground-truth labels. To bridge the gap, we propose a novel method called \textbf{ho}mophily-enhanced structure \textbf{le}arning for graph clustering (HoLe). Our motivation stems from the observation that subtly enhancing the degree of homophily within the graph structure can significantly improve GNNs and clustering outcomes. To realize this objective, we develop two clustering-oriented structure learning modules, i.e., hierarchical correlation estimation and cluster-aware sparsification. The former module enables a more accurate estimation of pairwise node relationships by leveraging guidance from latent and clustering spaces, while the latter one generates a sparsified structure based on the similarity matrix and clustering assignments. Additionally, we devise a joint optimization approach alternating between training the homophily-enhanced structure learning and GNN-based clustering, thereby enforcing their reciprocal effects. Extensive experiments on seven benchmark datasets of various types and scales, across a range of clustering metrics, demonstrate the superiority of HoLe against state-of-the-art baselines. △ Less

Submitted 30 October, 2023; v1 submitted 9 August, 2023; originally announced August 2023.

Comments: 11 pages with 7 figures. Accepted by CIKM'23

arXiv:2305.19069 [pdf, other]

doi 10.1016/j.asoc.2023.110675

Multi-source adversarial transfer learning for ultrasound image segmentation with limited similarity

Authors: Yifu Zhang, Hongru Li, Tao Yang, Rui Tao, Zhengyuan Liu, Shimeng Shi, Jiansong Zhang, Ning Ma, Wu** Feng, Zhanhu Zhang, Xinyu Zhang

Abstract: Lesion segmentation of ultrasound medical images based on deep learning techniques is a widely used method for diagnosing diseases. Although there is a large amount of ultrasound image data in medical centers and other places, labeled ultrasound datasets are a scarce resource, and it is likely that no datasets are available for new tissues/organs. Transfer learning provides the possibility to solv… ▽ More Lesion segmentation of ultrasound medical images based on deep learning techniques is a widely used method for diagnosing diseases. Although there is a large amount of ultrasound image data in medical centers and other places, labeled ultrasound datasets are a scarce resource, and it is likely that no datasets are available for new tissues/organs. Transfer learning provides the possibility to solve this problem, but there are too many features in natural images that are not related to the target domain. As a source domain, redundant features that are not conducive to the task will be extracted. Migration between ultrasound images can avoid this problem, but there are few types of public datasets, and it is difficult to find sufficiently similar source domains. Compared with natural images, ultrasound images have less information, and there are fewer transferable features between different ultrasound images, which may cause negative transfer. To this end, a multi-source adversarial transfer learning network for ultrasound image segmentation is proposed. Specifically, to address the lack of annotations, the idea of adversarial transfer learning is used to adaptively extract common features between a certain pair of source and target domains, which provides the possibility to utilize unlabeled ultrasound data. To alleviate the lack of knowledge in a single source domain, multi-source transfer learning is adopted to fuse knowledge from multiple source domains. In order to ensure the effectiveness of the fusion and maximize the use of precious data, a multi-source domain independent strategy is also proposed to improve the estimation of the target domain data distribution, which further increases the learning ability of the multi-source adversarial migration learning network in multiple domains. △ Less

Submitted 30 May, 2023; originally announced May 2023.

Comments: Submitted to Applied Soft Computing Journal

arXiv:2304.03526 [pdf, other]

Lift3D: Synthesize 3D Training Data by Lifting 2D GAN to 3D Generative Radiance Field

Authors: Leheng Li, Qing Lian, Luozhou Wang, Ningning Ma, Ying-Cong Chen

Abstract: This work explores the use of 3D generative models to synthesize training data for 3D vision tasks. The key requirements of the generative models are that the generated data should be photorealistic to match the real-world scenarios, and the corresponding 3D attributes should be aligned with given sampling labels. However, we find that the recent NeRF-based 3D GANs hardly meet the above requiremen… ▽ More This work explores the use of 3D generative models to synthesize training data for 3D vision tasks. The key requirements of the generative models are that the generated data should be photorealistic to match the real-world scenarios, and the corresponding 3D attributes should be aligned with given sampling labels. However, we find that the recent NeRF-based 3D GANs hardly meet the above requirements due to their designed generation pipeline and the lack of explicit 3D supervision. In this work, we propose Lift3D, an inverted 2D-to-3D generation framework to achieve the data generation objectives. Lift3D has several merits compared to prior methods: (1) Unlike previous 3D GANs that the output resolution is fixed after training, Lift3D can generalize to any camera intrinsic with higher resolution and photorealistic output. (2) By lifting well-disentangled 2D GAN to 3D object NeRF, Lift3D provides explicit 3D information of generated objects, thus offering accurate 3D annotations for downstream tasks. We evaluate the effectiveness of our framework by augmenting autonomous driving datasets. Experimental results demonstrate that our data generation framework can effectively improve the performance of 3D object detectors. Project page: https://len-li.github.io/lift3d-web. △ Less

Submitted 7 April, 2023; originally announced April 2023.

Comments: CVPR 2023

arXiv:2303.07399 [pdf, other]

RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose

Authors: Tao Jiang, Peng Lu, Li Zhang, Ningsheng Ma, Rui Han, Chengqi Lyu, Yining Li, Kai Chen

Abstract: Recent studies on 2D pose estimation have achieved excellent performance on public benchmarks, yet its application in the industrial community still suffers from heavy model parameters and high latency. In order to bridge this gap, we empirically explore key factors in pose estimation including paradigm, model architecture, training strategy, and deployment, and present a high-performance real-tim… ▽ More Recent studies on 2D pose estimation have achieved excellent performance on public benchmarks, yet its application in the industrial community still suffers from heavy model parameters and high latency. In order to bridge this gap, we empirically explore key factors in pose estimation including paradigm, model architecture, training strategy, and deployment, and present a high-performance real-time multi-person pose estimation framework, RTMPose, based on MMPose. Our RTMPose-m achieves 75.8% AP on COCO with 90+ FPS on an Intel i7-11700 CPU and 430+ FPS on an NVIDIA GTX 1660 Ti GPU, and RTMPose-l achieves 67.0% AP on COCO-WholeBody with 130+ FPS. To further evaluate RTMPose's capability in critical real-time applications, we also report the performance after deploying on the mobile device. Our RTMPose-s achieves 72.2% AP on COCO with 70+ FPS on a Snapdragon 865 chip, outperforming existing open-source libraries. Code and models are released at https://github.com/open-mmlab/mmpose/tree/1.x/projects/rtmpose. △ Less

Submitted 2 July, 2023; v1 submitted 13 March, 2023; originally announced March 2023.

arXiv:2301.06826 [pdf, other]

Vis2Hap: Vision-based Haptic Rendering by Cross-modal Generation

Authors: Guanqun Cao, Jiaqi Jiang, Ningtao Mao, Danushka Bollegala, Min Li, Shan Luo

Abstract: To assist robots in teleoperation tasks, haptic rendering which allows human operators access a virtual touch feeling has been developed in recent years. Most previous haptic rendering methods strongly rely on data collected by tactile sensors. However, tactile data is not widely available for robots due to their limited reachable space and the restrictions of tactile sensors. To eliminate the nee… ▽ More To assist robots in teleoperation tasks, haptic rendering which allows human operators access a virtual touch feeling has been developed in recent years. Most previous haptic rendering methods strongly rely on data collected by tactile sensors. However, tactile data is not widely available for robots due to their limited reachable space and the restrictions of tactile sensors. To eliminate the need for tactile data, in this paper we propose a novel method named as Vis2Hap to generate haptic rendering from visual inputs that can be obtained from a distance without physical interaction. We take the surface texture of objects as key cues to be conveyed to the human operator. To this end, a generative model is designed to simulate the roughness and slipperiness of the object's surface. To embed haptic cues in Vis2Hap, we use height maps from tactile sensors and spectrograms from friction coefficients as the intermediate outputs of the generative model. Once Vis2Hap is trained, it can be used to generate height maps and spectrograms of new surface textures, from which a friction image can be obtained and displayed on a haptic display. The user study demonstrates that our proposed Vis2Hap method enables users to access a realistic haptic feeling similar to that of physical objects. The proposed vision-based haptic rendering has the potential to enhance human operators' perception of the remote environment and facilitate robotic manipulation. △ Less

Submitted 17 January, 2023; originally announced January 2023.

Comments: This paper is accepted at ICRA 2023

arXiv:2211.02701 [pdf, other]

MONAI: An open-source framework for deep learning in healthcare

Authors: M. Jorge Cardoso, Wenqi Li, Richard Brown, Nic Ma, Eric Kerfoot, Yiheng Wang, Benjamin Murrey, Andriy Myronenko, Can Zhao, Dong Yang, Vishwesh Nath, Yufan He, Ziyue Xu, Ali Hatamizadeh, Andriy Myronenko, Wentao Zhu, Yun Liu, Mingxin Zheng, Yucheng Tang, Isaac Yang, Michael Zephyr, Behrooz Hashemian, Sachidanand Alle, Mohammad Zalbagi Darestani, Charlie Budd , et al. (32 additional authors not shown)

Abstract: Artificial Intelligence (AI) is having a tremendous impact across most areas of science. Applications of AI in healthcare have the potential to improve our ability to detect, diagnose, prognose, and intervene on human disease. For AI models to be used clinically, they need to be made safe, reproducible and robust, and the underlying software framework must be aware of the particularities (e.g. geo… ▽ More Artificial Intelligence (AI) is having a tremendous impact across most areas of science. Applications of AI in healthcare have the potential to improve our ability to detect, diagnose, prognose, and intervene on human disease. For AI models to be used clinically, they need to be made safe, reproducible and robust, and the underlying software framework must be aware of the particularities (e.g. geometry, physiology, physics) of medical data being processed. This work introduces MONAI, a freely available, community-supported, and consortium-led PyTorch-based framework for deep learning in healthcare. MONAI extends PyTorch to support medical data, with a particular focus on imaging, and provide purpose-specific AI model architectures, transformations and utilities that streamline the development and deployment of medical AI models. MONAI follows best practices for software-development, providing an easy-to-use, robust, well-documented, and well-tested software framework. MONAI preserves the simple, additive, and compositional approach of its underlying PyTorch libraries. MONAI is being used by and receiving contributions from research, clinical and industrial teams from around the world, who are pursuing applications spanning nearly every aspect of healthcare. △ Less

Submitted 4 November, 2022; originally announced November 2022.

Comments: www.monai.io

arXiv:2211.01153 [pdf, other]

A Two Step Approach to Weighted Bipartite Link Recommendations

Authors: Nathan Ma

Abstract: Many real world person-person or person-product relationships can be modeled graphically. More specifically, bipartite graphs can be especially useful when modeling scenarios that involve two disjoint groups. As a result, many existing papers have utilized bipartite graphs for the classical link recommendation problem. In this paper, using the principle of bipartite graphs, we present another appr… ▽ More Many real world person-person or person-product relationships can be modeled graphically. More specifically, bipartite graphs can be especially useful when modeling scenarios that involve two disjoint groups. As a result, many existing papers have utilized bipartite graphs for the classical link recommendation problem. In this paper, using the principle of bipartite graphs, we present another approach to this problem with a two step algorithm that takes into account frequency and similarity between common edges to make recommendations. We test this approach with bipartite data gathered from the Epinions and Movielens data sources, and find it to perform with roughly 14 percent error, which improves upon baseline results. This is a promising result, and can be refined to generate even more accurate recommendations. △ Less

Submitted 29 October, 2022; originally announced November 2022.

arXiv:2210.17440 [pdf, other]

Semantic Novelty Detection and Characterization in Factual Text Involving Named Entities

Authors: Nianzu Ma, Sahisnu Mazumder, Alexander Politowicz, Bing Liu, Eric Robertson, Scott Grigsby

Abstract: Much of the existing work on text novelty detection has been studied at the topic level, i.e., identifying whether the topic of a document or a sentence is novel or not. Little work has been done at the fine-grained semantic level (or contextual level). For example, given that we know Elon Musk is the CEO of a technology company, the sentence "Elon Musk acted in the sitcom The Big Bang Theory" is… ▽ More Much of the existing work on text novelty detection has been studied at the topic level, i.e., identifying whether the topic of a document or a sentence is novel or not. Little work has been done at the fine-grained semantic level (or contextual level). For example, given that we know Elon Musk is the CEO of a technology company, the sentence "Elon Musk acted in the sitcom The Big Bang Theory" is novel and surprising because normally a CEO would not be an actor. Existing topic-based novelty detection methods work poorly on this problem because they do not perform semantic reasoning involving relations between named entities in the text and their background knowledge. This paper proposes an effective model (called PAT-SND) to solve the problem, which can also characterize the novelty. An annotated dataset is also created. Evaluation shows that PAT-SND outperforms 10 baselines by large margins. △ Less

Submitted 31 October, 2022; originally announced October 2022.

Comments: 28 pages, 2 figures

ACM Class: I.2.7

arXiv:2210.13291 [pdf, other]

doi 10.48550/arXiv.2210.13291

NVIDIA FLARE: Federated Learning from Simulation to Real-World

Authors: Holger R. Roth, Yan Cheng, Yuhong Wen, Isaac Yang, Ziyue Xu, Yuan-Ting Hsieh, Kristopher Kersten, Ahmed Harouni, Can Zhao, Kevin Lu, Zhihong Zhang, Wenqi Li, Andriy Myronenko, Dong Yang, Sean Yang, Nicola Rieke, Abood Quraini, Chester Chen, Daguang Xu, Nic Ma, Prerna Dogra, Mona Flores, Andrew Feng

Abstract: Federated learning (FL) enables building robust and generalizable AI models by leveraging diverse datasets from multiple collaborators without centralizing the data. We created NVIDIA FLARE as an open-source software development kit (SDK) to make it easier for data scientists to use FL in their research and real-world applications. The SDK includes solutions for state-of-the-art FL algorithms and… ▽ More Federated learning (FL) enables building robust and generalizable AI models by leveraging diverse datasets from multiple collaborators without centralizing the data. We created NVIDIA FLARE as an open-source software development kit (SDK) to make it easier for data scientists to use FL in their research and real-world applications. The SDK includes solutions for state-of-the-art FL algorithms and federated machine learning approaches, which facilitate building workflows for distributed learning across enterprises and enable platform developers to create a secure, privacy-preserving offering for multiparty collaboration utilizing homomorphic encryption or differential privacy. The SDK is a lightweight, flexible, and scalable Python package. It allows researchers to apply their data science workflows in any training libraries (PyTorch, TensorFlow, XGBoost, or even NumPy) in real-world FL settings. This paper introduces the key design principles of NVFlare and illustrates some use cases (e.g., COVID analysis) with customizable FL workflows that implement different privacy-preserving algorithms. Code is available at https://github.com/NVIDIA/NVFlare. △ Less

Submitted 28 April, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

Comments: Accepted at the International Workshop on Federated Learning, NeurIPS 2022, New Orleans, USA (https://federated-learning.org/fl-neurips-2022); Revised version v2: added Key Components list, system metrics for homomorphic encryption experiment; Extended v3 for journal submission

Journal ref: IEEE Data Eng. Bull., Vol. 46, No. 1, 2023

arXiv:2209.13259 [pdf, other]

Timeliness of Information for Computation-intensive Status Updates in Task-oriented Communications

Authors: Xiaoqi Qin, Yanlin Li, Xianxin Song, Nan Ma, Chuan Huang, ** Zhang

Abstract: Moving beyond just interconnected devices, the increasing interplay between communication and computation has fed the vision of real-time networked control systems. To obtain timely situational awareness, IoT devices continuously sample computation-intensive status updates, generate perception tasks and offload them to edge servers for processing. In this sense, the timeliness of information is co… ▽ More Moving beyond just interconnected devices, the increasing interplay between communication and computation has fed the vision of real-time networked control systems. To obtain timely situational awareness, IoT devices continuously sample computation-intensive status updates, generate perception tasks and offload them to edge servers for processing. In this sense, the timeliness of information is considered as one major contextual attribute of status updates. In this paper, we derive the closed-form expressions of timeliness of information for computation offloading at both edge tier and fog tier, where two stage tandem queues are exploited to abstract the transmission and computation process. Moreover, we exploit the statistical structure of Gauss-Markov process, which is widely adopted to model temporal dynamics of system states, and derive the closed-form expression for process-related timeliness of information. The obtained analytical formulas explicitly characterize the dependency among task generation, transmission and execution, which can serve as objective functions for system optimization. Based on the theoretical results, we formulate a computation offloading optimization problem at edge tier, where the timeliness of status updates is minimized among multiple devices by joint optimization of task generation, bandwidth allocation, and computation resource allocation. An iterative solution procedure is proposed to solve the formulated problem. Numerical results reveal the intertwined relationship among transmission and computation stages, and verify the necessity of factoring in the task generation process for computation offloading strategy design. △ Less

Submitted 27 September, 2022; originally announced September 2022.

arXiv:2205.09377 [pdf, other]

Coexistence between Task- and Data-Oriented Communications: A Whittle's Index Guided Multi-Agent Reinforcement Learning Approach

Authors: Ran Li, Chuan Huang, Xiaoqi Qin, Shengpei Jiang, Nan Ma, Shuguang Cui

Abstract: We investigate the coexistence of task-oriented and data-oriented communications in a IoT system that shares a group of channels, and study the scheduling problem to jointly optimize the weighted age of incorrect information (AoII) and throughput, which are the performance metrics of the two types of communications, respectively. This problem is formulated as a Markov decision problem, which is di… ▽ More We investigate the coexistence of task-oriented and data-oriented communications in a IoT system that shares a group of channels, and study the scheduling problem to jointly optimize the weighted age of incorrect information (AoII) and throughput, which are the performance metrics of the two types of communications, respectively. This problem is formulated as a Markov decision problem, which is difficult to solve due to the large discrete action space and the time-varying action constraints induced by the stochastic availability of channels. By exploiting the intrinsic properties of this problem and reformulating the reward function based on channel statistics, we first simplify the solution space, state space, and optimality criteria, and convert it to an equivalent Markov game, for which the large discrete action space issue is greatly relieved. Then, we propose a Whittle's index guided multi-agent proximal policy optimization (WI-MAPPO) algorithm to solve the considered game, where the embedded Whittle's index module further shrinks the action space, and the proposed offline training algorithm extends the training kernel of conventional MAPPO to address the issue of time-varying constraints. Finally, numerical results validate that the proposed algorithm significantly outperforms state-of-the-art age of information (AoI) based algorithms under scenarios with insufficient channel resources. △ Less

Submitted 19 May, 2022; originally announced May 2022.

arXiv:2205.00692 [pdf, other]

Energy-efficient Caching and Task offloading for Timely Status Updates in UAV-assisted VANETs

Authors: Nan Hu, Xiaoqi Qin, Nan Ma, Yiming Liu, Yuanyuan Yao, ** Zhang

Abstract: Intelligent edge network is maturing to enable smart and efficient transportation systems. In this letter, we consider unmanned aerial vehicle (UAV)-assisted vehicular networks where UAVs provide caching and computing services in complement with base station (BS). One major challenge is that vehicles need to obtain timely situational awareness via orchestration of ubiquitous caching and computing… ▽ More Intelligent edge network is maturing to enable smart and efficient transportation systems. In this letter, we consider unmanned aerial vehicle (UAV)-assisted vehicular networks where UAVs provide caching and computing services in complement with base station (BS). One major challenge is that vehicles need to obtain timely situational awareness via orchestration of ubiquitous caching and computing resources. Note that cached data for vehicles' perception tasks contains time-varying context information, thus freshness of cached data should be considered in conjunction with task execution to guarantee timeliness of obtained status updates. To this end, we propose a two-stage performance metric to quantify the impact of cache refreshing and computation offloading decisions on the age of status updates. We formulate an energy minimization problem by jointly considering cache refreshing, computation offloading and aging of status updates. To facilitate online decision making, we propose a deep deterministic policy gradient(DDPG)-based solution procedure and incorporate differentiated experience replay mechanism to accelerate convergence. Simulation results show that the performance of proposed solution is competitive in terms of energy consumption for obtaining fresh status updates. △ Less

Submitted 4 May, 2022; v1 submitted 2 May, 2022; originally announced May 2022.

arXiv:2204.13590 [pdf, other]

doi 10.1093/tse/tdac026

Computer Vision for Road Imaging and Pothole Detection: A State-of-the-Art Review of Systems and Algorithms

Authors: Nachuan Ma, Jiahe Fan, Wenshuo Wang, ** Wu, Yu Jiang, Lihua Xie, Rui Fan

Abstract: Computer vision algorithms have been prevalently utilized for 3-D road imaging and pothole detection for over two decades. Nonetheless, there is a lack of systematic survey articles on state-of-the-art (SoTA) computer vision techniques, especially deep learning models, developed to tackle these problems. This article first introduces the sensing systems employed for 2-D and 3-D road data acquisiti… ▽ More Computer vision algorithms have been prevalently utilized for 3-D road imaging and pothole detection for over two decades. Nonetheless, there is a lack of systematic survey articles on state-of-the-art (SoTA) computer vision techniques, especially deep learning models, developed to tackle these problems. This article first introduces the sensing systems employed for 2-D and 3-D road data acquisition, including camera(s), laser scanners, and Microsoft Kinect. Afterward, it thoroughly and comprehensively reviews the SoTA computer vision algorithms, including (1) classical 2-D image processing, (2) 3-D point cloud modeling and segmentation, and (3) machine/deep learning, developed for road pothole detection. This article also discusses the existing challenges and future development trends of computer vision-based road pothole detection approaches: classical 2-D image processing-based and 3-D point cloud modeling and segmentation-based approaches have already become history; and Convolutional neural networks (CNNs) have demonstrated compelling road pothole detection results and are promising to break the bottleneck with the future advances in self/un-supervised learning for multi-modal semantic segmentation. We believe that this survey can serve as practical guidance for develo** the next-generation road condition assessment systems. △ Less

Submitted 28 April, 2022; originally announced April 2022.

Comments: accepted to Transportation Safety and Environment

arXiv:2204.04288 [pdf, other]

Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction

Authors: Zehai Tu, Ning Ma, Jon Barker

Abstract: Non-intrusive intelligibility prediction is important for its application in realistic scenarios, where a clean reference signal is difficult to access. The construction of many non-intrusive predictors require either ground truth intelligibility labels or clean reference signals for supervised learning. In this work, we leverage an unsupervised uncertainty estimation method for predicting speech… ▽ More Non-intrusive intelligibility prediction is important for its application in realistic scenarios, where a clean reference signal is difficult to access. The construction of many non-intrusive predictors require either ground truth intelligibility labels or clean reference signals for supervised learning. In this work, we leverage an unsupervised uncertainty estimation method for predicting speech intelligibility, which does not require intelligibility labels or reference signals to train the predictor. Our experiments demonstrate that the uncertainty from state-of-the-art end-to-end automatic speech recognition (ASR) models is highly correlated with speech intelligibility. The proposed method is evaluated on two databases and the results show that the unsupervised uncertainty measures of ASR models are more correlated with speech intelligibility from listening results than the predictions made by widely used intrusive methods. △ Less

Submitted 6 July, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: Accepted to INTERSPEECH2022

arXiv:2204.04287 [pdf, other]

Exploiting Hidden Representations from a DNN-based Speech Recogniser for Speech Intelligibility Prediction in Hearing-impaired Listeners

Authors: Zehai Tu, Ning Ma, Jon Barker

Abstract: An accurate objective speech intelligibility prediction algorithms is of great interest for many applications such as speech enhancement for hearing aids. Most algorithms measures the signal-to-noise ratios or correlations between the acoustic features of clean reference signals and degraded signals. However, these hand-picked acoustic features are usually not explicitly correlated with recognitio… ▽ More An accurate objective speech intelligibility prediction algorithms is of great interest for many applications such as speech enhancement for hearing aids. Most algorithms measures the signal-to-noise ratios or correlations between the acoustic features of clean reference signals and degraded signals. However, these hand-picked acoustic features are usually not explicitly correlated with recognition. Meanwhile, deep neural network (DNN) based automatic speech recogniser (ASR) is approaching human performance in some speech recognition tasks. This work leverages the hidden representations from DNN-based ASR as features for speech intelligibility prediction in hearing-impaired listeners. The experiments based on a hearing aid intelligibility database show that the proposed method could make better prediction than a widely used short-time objective intelligibility (STOI) based binaural measure. △ Less

Submitted 6 July, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: Accepted to INTERSPEECH2022

arXiv:2204.04284 [pdf, other]

Auditory-Based Data Augmentation for End-to-End Automatic Speech Recognition

Authors: Zehai Tu, Jack Deadman, Ning Ma, Jon Barker

Abstract: End-to-end models have achieved significant improvement on automatic speech recognition. One common method to improve performance of these models is expanding the data-space through data augmentation. Meanwhile, human auditory inspired front-ends have also demonstrated improvement for automatic speech recognisers. In this work, a well-verified auditory-based model, which can simulate various heari… ▽ More End-to-end models have achieved significant improvement on automatic speech recognition. One common method to improve performance of these models is expanding the data-space through data augmentation. Meanwhile, human auditory inspired front-ends have also demonstrated improvement for automatic speech recognisers. In this work, a well-verified auditory-based model, which can simulate various hearing abilities, is investigated for the purpose of data augmentation for end-to-end speech recognition. By introducing the auditory model into the data augmentation process, end-to-end systems are encouraged to ignore variation from the signal that cannot be heard and thereby focus on robust features for speech recognition. Two mechanisms in the auditory model, spectral smearing and loudness recruitment, are studied on the LibriSpeech dataset with a transformer-based end-to-end model. The results show that the proposed augmentation methods can bring statistically significant improvement on the performance of the state-of-the-art SpecAugment. △ Less

Submitted 8 April, 2022; originally announced April 2022.

arXiv:2112.02706 [pdf, other]

Achieving Forgetting Prevention and Knowledge Transfer in Continual Learning

Authors: Zixuan Ke, Bing Liu, Nianzu Ma, Hu Xu, Lei Shu

Abstract: Continual learning (CL) learns a sequence of tasks incrementally with the goal of achieving two main objectives: overcoming catastrophic forgetting (CF) and encouraging knowledge transfer (KT) across tasks. However, most existing techniques focus only on overcoming CF and have no mechanism to encourage KT, and thus do not do well in KT. Although several papers have tried to deal with both CF and K… ▽ More Continual learning (CL) learns a sequence of tasks incrementally with the goal of achieving two main objectives: overcoming catastrophic forgetting (CF) and encouraging knowledge transfer (KT) across tasks. However, most existing techniques focus only on overcoming CF and have no mechanism to encourage KT, and thus do not do well in KT. Although several papers have tried to deal with both CF and KT, our experiments show that they suffer from serious CF when the tasks do not have much shared knowledge. Another observation is that most current CL methods do not use pre-trained models, but it has been shown that such models can significantly improve the end task performance. For example, in natural language processing, fine-tuning a BERT-like pre-trained language model is one of the most effective approaches. However, for CL, this approach suffers from serious CF. An interesting question is how to make the best use of pre-trained models for CL. This paper proposes a novel model called CTR to solve these problems. Our experimental results demonstrate the effectiveness of CTR △ Less

Submitted 5 December, 2021; originally announced December 2021.

Journal ref: NeurIPS 2021

arXiv:2108.02948 [pdf, other]

Deep Learning-based Biological Anatomical Landmark Detection in Colonoscopy Videos

Authors: Kaiwei Che, Chengwei Ye, Yibing Yao, Nachuan Ma, Ruo Zhang, Jiankun Wang, Max Q. -H. Meng

Abstract: Colonoscopy is a standard imaging tool for visualizing the entire gastrointestinal (GI) tract of patients to capture lesion areas. However, it takes the clinicians excessive time to review a large number of images extracted from colonoscopy videos. Thus, automatic detection of biological anatomical landmarks within the colon is highly demanded, which can help reduce the burden of clinicians by pro… ▽ More Colonoscopy is a standard imaging tool for visualizing the entire gastrointestinal (GI) tract of patients to capture lesion areas. However, it takes the clinicians excessive time to review a large number of images extracted from colonoscopy videos. Thus, automatic detection of biological anatomical landmarks within the colon is highly demanded, which can help reduce the burden of clinicians by providing guidance information for the locations of lesion areas. In this article, we propose a novel deep learning-based approach to detect biological anatomical landmarks in colonoscopy videos. First, raw colonoscopy video sequences are pre-processed to reject interference frames. Second, a ResNet-101 based network is used to detect three biological anatomical landmarks separately to obtain the intermediate detection results. Third, to achieve more reliable localization of the landmark periods within the whole video period, we propose to post-process the intermediate detection results by identifying the incorrectly predicted frames based on their temporal distribution and reassigning them back to the correct class. Finally, the average detection accuracy reaches 99.75\%. Meanwhile, the average IoU of 0.91 shows a high degree of similarity between our predicted landmark periods and ground truth. The experimental results demonstrate that our proposed model is capable of accurately detecting and localizing biological anatomical landmarks from colonoscopy videos. △ Less

Submitted 6 August, 2021; originally announced August 2021.

Comments: 9 pages, 7 figures

arXiv:2107.06782 [pdf]

Clustering and attention model based for intelligent trading

Authors: Mimansa Rana, Nanxiang Mao, Ming Ao, Xiaohui Wu, Poning Liang, Matloob Khushi

Abstract: The foreign exchange market has taken an important role in the global financial market. While foreign exchange trading brings high-yield opportunities to investors, it also brings certain risks. Since the establishment of the foreign exchange market in the 20th century, foreign exchange rate forecasting has become a hot issue studied by scholars from all over the world. Due to the complexity and n… ▽ More The foreign exchange market has taken an important role in the global financial market. While foreign exchange trading brings high-yield opportunities to investors, it also brings certain risks. Since the establishment of the foreign exchange market in the 20th century, foreign exchange rate forecasting has become a hot issue studied by scholars from all over the world. Due to the complexity and number of factors affecting the foreign exchange market, technical analysis cannot respond to administrative intervention or unexpected events. Our team chose several pairs of foreign currency historical data and derived technical indicators from 2005 to 2021 as the dataset and established different machine learning models for event-driven price prediction for oversold scenario. △ Less

Submitted 6 August, 2021; v1 submitted 6 July, 2021; originally announced July 2021.

arXiv:2107.06735 [pdf, other]

Semi-Supervised Hypothesis Transfer for Source-Free Domain Adaptation

Authors: Ning Ma, Jiajun Bu, Lixian Lu, Jun Wen, Zhen Zhang, Sheng Zhou, Xifeng Yan

Abstract: Domain Adaptation has been widely used to deal with the distribution shift in vision, language, multimedia etc. Most domain adaptation methods learn domain-invariant features with data from both domains available. However, such a strategy might be infeasible in practice when source data are unavailable due to data-privacy concerns. To address this issue, we propose a novel adaptation method via hy… ▽ More Domain Adaptation has been widely used to deal with the distribution shift in vision, language, multimedia etc. Most domain adaptation methods learn domain-invariant features with data from both domains available. However, such a strategy might be infeasible in practice when source data are unavailable due to data-privacy concerns. To address this issue, we propose a novel adaptation method via hypothesis transfer without accessing source data at adaptation stage. In order to fully use the limited target data, a semi-supervised mutual enhancement method is proposed, in which entropy minimization and augmented label propagation are used iteratively to perform inter-domain and intra-domain alignments. Compared with state-of-the-art methods, the experimental results on three public datasets demonstrate that our method gets up to 19.9% improvements on semi-supervised adaptation tasks. △ Less

Submitted 14 July, 2021; originally announced July 2021.

arXiv:2107.06707 [pdf, other]

doi 10.1016/j.knosys.2022.110208

Uncertainty-Guided Mixup for Semi-Supervised Domain Adaptation without Source Data

Authors: Ning Ma, Jiajun Bu, Zhen Zhang, Sheng Zhou

Abstract: Present domain adaptation methods usually perform explicit representation alignment by simultaneously accessing the source data and target data. However, the source data are not always available due to the privacy preserving consideration or bandwidth limitation. Source-free domain adaptation aims to solve the above problem by performing domain adaptation without accessing the source data. The ada… ▽ More Present domain adaptation methods usually perform explicit representation alignment by simultaneously accessing the source data and target data. However, the source data are not always available due to the privacy preserving consideration or bandwidth limitation. Source-free domain adaptation aims to solve the above problem by performing domain adaptation without accessing the source data. The adaptation paradigm is receiving more and more attention in recent years, and multiple works have been proposed for unsupervised source-free domain adaptation. However, without utilizing any supervised signal and source data at the adaptation stage, the optimization of the target model is unstable and fragile. To alleviate the problem, we focus on semi-supervised domain adaptation under source-free setting. More specifically, we propose uncertainty-guided Mixup to reduce the representation's intra-domain discrepancy and perform inter-domain alignment without directly accessing the source data. Finally, we conduct extensive semi-supervised domain adaptation experiments on various datasets. Our method outperforms the recent semi-supervised baselines and the unsupervised variant also achieves competitive performance. The experiment codes will be released in the future. △ Less

Submitted 14 July, 2021; originally announced July 2021.

Report number: 11

Journal ref: Volume 262, 28 February 2023, 110208

arXiv:2106.04639 [pdf, other]

Optimising Hearing Aid Fittings for Speech in Noise with a Differentiable Hearing Loss Model

Authors: Zehai Tu, Ning Ma, Jon Barker

Abstract: Current hearing aids normally provide amplification based on a general prescriptive fitting, and the benefits provided by the hearing aids vary among different listening environments despite the inclusion of noise suppression feature. Motivated by this fact, this paper proposes a data-driven machine learning technique to develop hearing aid fittings that are customised to speech in different noisy… ▽ More Current hearing aids normally provide amplification based on a general prescriptive fitting, and the benefits provided by the hearing aids vary among different listening environments despite the inclusion of noise suppression feature. Motivated by this fact, this paper proposes a data-driven machine learning technique to develop hearing aid fittings that are customised to speech in different noisy environments. A differentiable hearing loss model is proposed and used to optimise fittings with back-propagation. The customisation is reflected on the data of speech in different noise with also the consideration of noise suppression. The objective evaluation shows the advantages of optimised custom fittings over general prescriptive fittings. △ Less

Submitted 8 June, 2021; originally announced June 2021.

Comments: Accepted to Interspeech 2021

arXiv:2103.09030 [pdf, other]

A Large-Scale Dataset for Benchmarking Elevator Button Segmentation and Character Recognition

Authors: Jianbang Liu, Yuqi Fang, Delong Zhu, Nachuan Ma, ** Pan, Max Q. -H. Meng

Abstract: Human activities are hugely restricted by COVID-19, recently. Robots that can conduct inter-floor navigation attract much public attention, since they can substitute human workers to conduct the service work. However, current robots either depend on human assistance or elevator retrofitting, and fully autonomous inter-floor navigation is still not available. As the very first step of inter-floor n… ▽ More Human activities are hugely restricted by COVID-19, recently. Robots that can conduct inter-floor navigation attract much public attention, since they can substitute human workers to conduct the service work. However, current robots either depend on human assistance or elevator retrofitting, and fully autonomous inter-floor navigation is still not available. As the very first step of inter-floor navigation, elevator button segmentation and recognition hold an important position. Therefore, we release the first large-scale publicly available elevator panel dataset in this work, containing 3,718 panel images with 35,100 button labels, to facilitate more powerful algorithms on autonomous elevator operation. Together with the dataset, a number of deep learning based implementations for button segmentation and recognition are also released to benchmark future methods in the community. The dataset will be available at \url{https://github.com/zhudelong/elevator_button_recognition △ Less

Submitted 22 March, 2021; v1 submitted 16 March, 2021; originally announced March 2021.

arXiv:2103.08569 [pdf, other]

DHASP: Differentiable Hearing Aid Speech Processing

Authors: Zehai Tu, Ning Ma, Jon Barker

Abstract: Hearing aids are expected to improve speech intelligibility for listeners with hearing impairment. An appropriate amplification fitting tuned for the listener's hearing disability is critical for good performance. The developments of most prescriptive fittings are based on data collected in subjective listening experiments, which are usually expensive and time-consuming. In this paper, we explore… ▽ More Hearing aids are expected to improve speech intelligibility for listeners with hearing impairment. An appropriate amplification fitting tuned for the listener's hearing disability is critical for good performance. The developments of most prescriptive fittings are based on data collected in subjective listening experiments, which are usually expensive and time-consuming. In this paper, we explore an alternative approach to finding the optimal fitting by introducing a hearing aid speech processing framework, in which the fitting is optimised in an automated way using an intelligibility objective function based on the HASPI physiological auditory model. The framework is fully differentiable, thus can employ the back-propagation algorithm for efficient, data-driven optimisation. Our initial objective experiments show promising results for noise-free speech amplification, where the automatically optimised processors outperform one of the well recognised hearing aid prescriptions. △ Less

Submitted 15 March, 2021; originally announced March 2021.

Comments: To appear at ICASSP 2021

arXiv:2101.03697 [pdf, other]

RepVGG: Making VGG-style ConvNets Great Again

Authors: Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, Jian Sun

Abstract: We present a simple but powerful architecture of convolutional neural network, which has a VGG-like inference-time body composed of nothing but a stack of 3x3 convolution and ReLU, while the training-time model has a multi-branch topology. Such decoupling of the training-time and inference-time architecture is realized by a structural re-parameterization technique so that the model is named RepVGG… ▽ More We present a simple but powerful architecture of convolutional neural network, which has a VGG-like inference-time body composed of nothing but a stack of 3x3 convolution and ReLU, while the training-time model has a multi-branch topology. Such decoupling of the training-time and inference-time architecture is realized by a structural re-parameterization technique so that the model is named RepVGG. On ImageNet, RepVGG reaches over 80% top-1 accuracy, which is the first time for a plain model, to the best of our knowledge. On NVIDIA 1080Ti GPU, RepVGG models run 83% faster than ResNet-50 or 101% faster than ResNet-101 with higher accuracy and show favorable accuracy-speed trade-off compared to the state-of-the-art models like EfficientNet and RegNet. The code and trained models are available at https://github.com/megvii-model/RepVGG. △ Less

Submitted 29 March, 2021; v1 submitted 10 January, 2021; originally announced January 2021.

Comments: CVPR 2021

arXiv:2012.03166 [pdf, other]

Conditional Generative Adversarial Networks for Optimal Path Planning

Authors: Nachuan Ma, Jiankun Wang, Max Q. -H. Meng

Abstract: Path planning plays an important role in autonomous robot systems. Effective understanding of the surrounding environment and efficient generation of optimal collision-free path are both critical parts for solving path planning problem. Although conventional sampling-based algorithms, such as the rapidly-exploring random tree (RRT) and its improved optimal version (RRT*), have been widely used in… ▽ More Path planning plays an important role in autonomous robot systems. Effective understanding of the surrounding environment and efficient generation of optimal collision-free path are both critical parts for solving path planning problem. Although conventional sampling-based algorithms, such as the rapidly-exploring random tree (RRT) and its improved optimal version (RRT*), have been widely used in path planning problems because of their ability to find a feasible path in even complex environments, they fail to find an optimal path efficiently. To solve this problem and satisfy the two aforementioned requirements, we propose a novel learning-based path planning algorithm which consists of a novel generative model based on the conditional generative adversarial networks (CGAN) and a modified RRT* algorithm (denoted by CGANRRT*). Given the map information, our CGAN model can generate an efficient possibility distribution of feasible paths, which can be utilized by the CGAN-RRT* algorithm to find the optimal path with a non-uniform sampling strategy. The CGAN model is trained by learning from ground truth maps, each of which is generated by putting all the results of executing RRT algorithm 50 times on one raw map. We demonstrate the efficient performance of this CGAN model by testing it on two groups of maps and comparing CGAN-RRT* algorithm with conventional RRT* algorithm. △ Less

Submitted 5 December, 2020; originally announced December 2020.

arXiv:2012.00105 [pdf]

Using Data Analytics to predict students score

Authors: Nang Laik Ma, Gim Hong Chua

Abstract: Education is very important to Singapore, and the government has continued to invest heavily in our education system to become one of the world-class systems today. A strong foundation of Science, Technology, Engineering, and Mathematics (STEM) was what underpinned Singapore's development over the past 50 years. PISA is a triennial international survey that evaluates education systems worldwide by… ▽ More Education is very important to Singapore, and the government has continued to invest heavily in our education system to become one of the world-class systems today. A strong foundation of Science, Technology, Engineering, and Mathematics (STEM) was what underpinned Singapore's development over the past 50 years. PISA is a triennial international survey that evaluates education systems worldwide by testing the skills and knowledge of 15-year-old students who are nearing the end of compulsory education. In this paper, the authors used the PISA data from 2012 and 2015 and developed machine learning techniques to predictive the students' scores and understand the inter-relationships among social, economic, and education factors. The insights gained would be useful to have fresh perspectives on education, useful for policy formulation. △ Less

Submitted 19 November, 2020; originally announced December 2020.

arXiv:2009.04759 [pdf, other]

Activate or Not: Learning Customized Activation

Authors: Ningning Ma, Xiangyu Zhang, Ming Liu, Jian Sun

Abstract: We present a simple, effective, and general activation function we term ACON which learns to activate the neurons or not. Interestingly, we find Swish, the recent popular NAS-searched activation, can be interpreted as a smooth approximation to ReLU. Intuitively, in the same way, we approximate the more general Maxout family to our novel ACON family, which remarkably improves the performance and ma… ▽ More We present a simple, effective, and general activation function we term ACON which learns to activate the neurons or not. Interestingly, we find Swish, the recent popular NAS-searched activation, can be interpreted as a smooth approximation to ReLU. Intuitively, in the same way, we approximate the more general Maxout family to our novel ACON family, which remarkably improves the performance and makes Swish a special case of ACON. Next, we present meta-ACON, which explicitly learns to optimize the parameter switching between non-linear (activate) and linear (inactivate) and provides a new design space. By simply changing the activation function, we show its effectiveness on both small models and highly optimized large models (e.g. it improves the ImageNet top-1 accuracy rate by 6.7% and 1.8% on MobileNet-0.25 and ResNet-152, respectively). Moreover, our novel ACON can be naturally transferred to object detection and semantic segmentation, showing that ACON is an effective alternative in a variety of tasks. Code is available at https://github.com/nmaac/acon. △ Less

Submitted 16 April, 2021; v1 submitted 10 September, 2020; originally announced September 2020.

Comments: CVPR 2021

arXiv:2007.11824 [pdf, other]

Funnel Activation for Visual Recognition

Authors: Ningning Ma, Xiangyu Zhang, Jian Sun

Abstract: We present a conceptually simple but effective funnel activation for image recognition tasks, called Funnel activation (FReLU), that extends ReLU and PReLU to a 2D activation by adding a negligible overhead of spatial condition. The forms of ReLU and PReLU are y = max(x, 0) and y = max(x, px), respectively, while FReLU is in the form of y = max(x,T(x)), where T(x) is the 2D spatial condition. More… ▽ More We present a conceptually simple but effective funnel activation for image recognition tasks, called Funnel activation (FReLU), that extends ReLU and PReLU to a 2D activation by adding a negligible overhead of spatial condition. The forms of ReLU and PReLU are y = max(x, 0) and y = max(x, px), respectively, while FReLU is in the form of y = max(x,T(x)), where T(x) is the 2D spatial condition. Moreover, the spatial condition achieves a pixel-wise modeling capacity in a simple way, capturing complicated visual layouts with regular convolutions. We conduct experiments on ImageNet, COCO detection, and semantic segmentation tasks, showing great improvements and robustness of FReLU in the visual recognition tasks. Code is available at https://github.com/megvii-model/FunnelAct. △ Less

Submitted 24 July, 2020; v1 submitted 23 July, 2020; originally announced July 2020.

Comments: ECCV 2020

arXiv:2007.11823 [pdf, other]

WeightNet: Revisiting the Design Space of Weight Networks

Authors: Ningning Ma, Xiangyu Zhang, Jiawei Huang, Jian Sun

Abstract: We present a conceptually simple, flexible and effective framework for weight generating networks. Our approach is general that unifies two current distinct and extremely effective SENet and CondConv into the same framework on weight space. The method, called WeightNet, generalizes the two methods by simply adding one more grouped fully-connected layer to the attention activation layer. We use the… ▽ More We present a conceptually simple, flexible and effective framework for weight generating networks. Our approach is general that unifies two current distinct and extremely effective SENet and CondConv into the same framework on weight space. The method, called WeightNet, generalizes the two methods by simply adding one more grouped fully-connected layer to the attention activation layer. We use the WeightNet, composed entirely of (grouped) fully-connected layers, to directly output the convolutional weight. WeightNet is easy and memory-conserving to train, on the kernel space instead of the feature space. Because of the flexibility, our method outperforms existing approaches on both ImageNet and COCO detection tasks, achieving better Accuracy-FLOPs and Accuracy-Parameter trade-offs. The framework on the flexible weight space has the potential to further improve the performance. Code is available at https://github.com/megvii-model/WeightNet. △ Less

Submitted 24 July, 2020; v1 submitted 23 July, 2020; originally announced July 2020.

Comments: ECCV 2020

arXiv:2007.11806 [pdf, other]

Autonomous Removal of Perspective Distortion of Elevator Button Images based on Corner Detection

Authors: Nachuan Ma, Jianbang Liu, Delong Zhu

Abstract: Elevator button recognition is a critical function to realize the autonomous operation of elevators. However, challenging image conditions and various image distortions make it difficult to recognize buttons accurately. To fill this gap, we propose a novel deep learning-based approach, which aims to autonomously correct perspective distortions of elevator button images based on button corner detec… ▽ More Elevator button recognition is a critical function to realize the autonomous operation of elevators. However, challenging image conditions and various image distortions make it difficult to recognize buttons accurately. To fill this gap, we propose a novel deep learning-based approach, which aims to autonomously correct perspective distortions of elevator button images based on button corner detection results. First, we leverage a novel image segmentation model and the Hough Transform method to obtain button segmentation and button corner detection results. Then, pixel coordinates of standard button corners are used as reference features to estimate camera motions for correcting perspective distortions. Fifteen elevator button images are captured from different angles of view as the dataset. The experimental results demonstrate that our proposed approach is capable of estimating camera motions and removing perspective distortions of elevator button images with high accuracy. △ Less

Submitted 1 September, 2021; v1 submitted 23 July, 2020; originally announced July 2020.

arXiv:2003.08246 [pdf, other]

Adaptive-Step Graph Meta-Learner for Few-Shot Graph Classification

Authors: Ning Ma, Jiajun Bu, Jieyu Yang, Zhen Zhang, Chengwei Yao, Zhi Yu, Sheng Zhou, Xifeng Yan

Abstract: Graph classification aims to extract accurate information from graph-structured data for classification and is becoming more and more important in graph learning community. Although Graph Neural Networks (GNNs) have been successfully applied to graph classification tasks, most of them overlook the scarcity of labeled graph data in many applications. For example, in bioinformatics, obtaining protei… ▽ More Graph classification aims to extract accurate information from graph-structured data for classification and is becoming more and more important in graph learning community. Although Graph Neural Networks (GNNs) have been successfully applied to graph classification tasks, most of them overlook the scarcity of labeled graph data in many applications. For example, in bioinformatics, obtaining protein graph labels usually needs laborious experiments. Recently, few-shot learning has been explored to alleviate this problem with only given a few labeled graph samples of test classes. The shared sub-structures between training classes and test classes are essential in few-shot graph classification. Exiting methods assume that the test classes belong to the same set of super-classes clustered from training classes. However, according to our observations, the label spaces of training classes and test classes usually do not overlap in real-world scenario. As a result, the existing methods don't well capture the local structures of unseen test classes. To overcome the limitation, in this paper, we propose a direct method to capture the sub-structures with well initialized meta-learner within a few adaptation steps. More specifically, (1) we propose a novel framework consisting of a graph meta-learner, which uses GNNs based modules for fast adaptation on graph data, and a step controller for the robustness and generalization of meta-learner; (2) we provide quantitative analysis for the framework and give a graph-dependent upper bound of the generalization error based on our framework; (3) the extensive experiments on real-world datasets demonstrate that our framework gets state-of-the-art results on several few-shot graph classification tasks compared to baselines. △ Less

Submitted 23 June, 2020; v1 submitted 18 March, 2020; originally announced March 2020.

arXiv:2003.01294 [pdf, ps, other]

Accelerating Generalized Benders Decomposition for Wireless Resource Allocation

Authors: Mengyuan Lee, Ning Ma, Guanding Yu, Huaiyu Dai

Abstract: Generalized Benders decomposition (GBD) is a globally optimal algorithm for mixed integer nonlinear programming (MINLP) problems, which are NP-hard and can be widely found in the area of wireless resource allocation. The main idea of GBD is decomposing an MINLP problem into a primal problem and a master problem, which are iteratively solved until their solutions converge. However, a direct impleme… ▽ More Generalized Benders decomposition (GBD) is a globally optimal algorithm for mixed integer nonlinear programming (MINLP) problems, which are NP-hard and can be widely found in the area of wireless resource allocation. The main idea of GBD is decomposing an MINLP problem into a primal problem and a master problem, which are iteratively solved until their solutions converge. However, a direct implementation of GBD is time- and memory-consuming. The main bottleneck is the high complexity of the master problem, which increases over the iterations. Therefore, we propose to leverage machine learning (ML) techniques to accelerate GBD aiming at decreasing the complexity of the master problem. Specifically, we utilize two different ML techniques, classification and regression, to deal with this acceleration task. In this way, a cut classifier and a cut regressor are learned, respectively, to distinguish between useful and useless cuts. Only useful cuts are added to the master problem and thus the complexity of the master problem is reduced. By using a resource allocation problem in device-to-device communication networks as an example, we validate that the proposed method can reduce the computational complexity of GBD without loss of optimality and has strong generalization ability. The proposed method is applicable for solving various MINLP problems in wireless networks since the designs are invariant for different problems. △ Less

Submitted 14 October, 2020; v1 submitted 2 March, 2020; originally announced March 2020.

Comments: This paper was accpeted by the IEEE Transactions on Wireless Communications

Showing 1–50 of 77 results for author: Mao, N