Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models

Authors: Jierun Chen, Fangyun Wei, ****g Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S. -H. Gary Chan, Hongyang Zhang

Abstract: Referring expression comprehension (REC) involves localizing a target instance based on a textual description. Recent advancements in REC have been driven by large multimodal models (LMMs) like CogVLM, which achieved 92.44% accuracy on RefCOCO. However, this study questions whether existing benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg capture LMMs' comprehensive capabilities. We begin with a manual examination of these benchmarks, revealing high labeling error rates: 14% in RefCOCO, 24% in RefCOCO+, and 5% in RefCOCOg, which undermines the authenticity of evaluations. We address this by excluding problematic instances and reevaluating several LMMs capable of handling the REC task, showing significant accuracy improvements, thus highlighting the impact of benchmark noise. In response, we introduce Ref-L4, a comprehensive REC benchmark, specifically designed to evaluate modern REC models. Ref-L4 is distinguished by four key features: 1) a substantial sample size with 45,341 annotations; 2) a diverse range of object categories with 365 distinct types and varying instance scales from 30 to 3,767; 3) lengthy referring expressions averaging 24.2 words; and 4) an extensive vocabulary comprising 22,813 unique words. We evaluate a total of 24 large models on Ref-L4 and provide valuable insights. The cleaned versions of RefCOCO, RefCOCO+, and RefCOCOg, as well as our Ref-L4 benchmark and evaluation code, are available at https://github.com/JierunChen/Ref-L4.

Submitted 24 June, 2024; originally announced June 2024.
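
For readers unfamiliar with how REC accuracy figures such as the one quoted above are usually computed, here is a minimal sketch. It assumes the conventional protocol (a prediction counts as correct when its box overlaps the ground-truth box with IoU of at least 0.5), which the abstract itself does not state. The function names, the corner box format, and the default threshold are illustrative assumptions, not the Ref-L4 evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(preds: Sequence[Box], targets: Sequence[Box],
                 threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold.

    The 0.5 default is the commonly used convention, assumed here.
    """
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)
```

Under this convention a mislabeled ground-truth box turns a correct prediction into a counted miss, which is one way to see why removing erroneous annotations can raise measured accuracy.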

arXiv:2405.03218 [pdf, other]

Elevator, Escalator or Neither? Classifying Pedestrian Conveyor State Using Inertial Navigation System

Authors: Tianlang He, Zhiqiu Xia, S. -H. Gary Chan

Abstract: Classifying a pedestrian into one of the three conveyor states of "elevator," "escalator" and "neither" is fundamental to many applications such as indoor localization and people flow analysis. We estimate, for the first time, the pedestrian conveyor state given the inertial navigation system (INS) readings of accelerometer, gyroscope and magnetometer sampled from the phone. Our problem is challenging because the INS signals of the conveyor state are coupled and perturbed by unpredictable arbitrary human actions, confusing the decision process. We propose ELESON, a novel, effective and lightweight INS-based deep learning approach to classify whether a pedestrian is in an elevator, escalator or neither. ELESON utilizes a motion feature extractor to decouple the conveyor state from human action in the feature space, and a magnetic feature extractor to account for the speed difference between elevator and escalator. Given the results of the extractors, it employs an evidential state classifier to estimate the confidence of the pedestrian states. Based on extensive experiments conducted on twenty hours of real pedestrian data, we demonstrate that ELESON significantly outperforms the state-of-the-art approaches (where combined INS signals of both the conveyor state and human actions are processed together), with a 15% classification improvement in F1 score, stronger confidence discriminability with a 10% increase in AUROC (Area Under the Receiver Operating Characteristics), and low computational and memory requirements on smartphones.

Submitted 6 May, 2024; originally announced May 2024.

arXiv:2403.09124 [pdf, other]

Single Domain Generalization for Crowd Counting

Authors: Zhuoxuan Peng, S. -H. Gary Chan

Abstract: Due to its promising results, density map regression has been widely employed for image-based crowd counting. The approach, however, often suffers from severe performance degradation when tested on data from unseen scenarios, the so-called "domain shift" problem. To address the problem, we investigate in this work single domain generalization (SDG) for crowd counting. The existing SDG approaches are mainly for image classification and segmentation, and can hardly be extended to our case due to its regression nature and label ambiguity (i.e., ambiguous pixel-level ground truths). We propose MPCount, a novel SDG approach that is effective even for a narrow source distribution. MPCount stores diverse density values for density map regression and reconstructs domain-invariant features by means of only one memory bank, a content error mask and attention consistency loss. By partitioning the image into grids, it employs patch-wise classification as an auxiliary task to mitigate label ambiguity. Through extensive experiments on different datasets, MPCount is shown to significantly improve counting accuracy compared to the state of the art under diverse scenarios unobserved in the training data characterized by narrow source distribution. Code is available at https://github.com/Shimmer93/MPCount.

Submitted 5 April, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

Comments: Accepted by CVPR2024

arXiv:2312.00540 [pdf, other]

Target-agnostic Source-free Domain Adaptation for Regression Tasks

Authors: Tianlang He, Zhiqiu Xia, Jierun Chen, Haoliang Li, S. -H. Gary Chan

Abstract: Unsupervised domain adaptation (UDA) seeks to bridge the domain gap between the target and source using unlabeled target data. Source-free UDA removes the requirement for labeled source data at the target to preserve data privacy and storage. However, work on source-free UDA assumes knowledge of domain gap distribution, and hence is limited to either target-aware or classification tasks. To overcome this limitation, we propose TASFAR, a novel target-agnostic source-free domain adaptation approach for regression tasks. Using prediction confidence, TASFAR estimates a label density map as the target label distribution, which is then used to calibrate the source model on the target domain. We have conducted extensive experiments on four regression tasks with various domain gaps, namely, pedestrian dead reckoning for different users, image-based people counting in different scenes, housing-price prediction at different districts, and taxi-trip duration prediction from different departure points. TASFAR is shown to substantially outperform the state-of-the-art source-free UDA approaches, reducing errors by 22% on average across the four tasks, and to achieve accuracy comparable to source-based UDA without using source data.

Submitted 1 December, 2023; originally announced December 2023.

Comments: Accepted by ICDE 2024

arXiv:2310.11959 [pdf, other]

A Multi-Scale Decomposition MLP-Mixer for Time Series Analysis

Authors: Shuhan Zhong, Sizhe Song, Weipeng Zhuo, Guanyao Li, Yang Liu, S. -H. Gary Chan

Abstract: Time series data, including univariate and multivariate ones, are characterized by unique composition and complex multi-scale temporal variations. They often require special consideration of decomposition and multi-scale modeling to analyze. Existing deep learning methods for this are best suited to univariate time series only, and have not sufficiently considered sub-series modeling and decomposition completeness. To address these challenges, we propose MSD-Mixer, a Multi-Scale Decomposition MLP-Mixer, which learns to explicitly decompose and represent the input time series in its different layers. To handle the multi-scale temporal patterns and multivariate dependencies, we propose a novel temporal patching approach to model the time series as multi-scale patches, and employ MLPs to capture intra- and inter-patch variations and channel-wise correlations. In addition, we propose a novel loss function to constrain both the mean and the autocorrelation of the decomposition residual for better decomposition completeness. Through extensive experiments on various real-world datasets for five common time series analysis tasks, we demonstrate that MSD-Mixer consistently and significantly outperforms other state-of-the-art algorithms with better efficiency.

Submitted 24 March, 2024; v1 submitted 18 October, 2023; originally announced October 2023.

Comments: Accepted for VLDB 2024

arXiv:2307.12987 [pdf, other]

Efficient Behavior-consistent Calibration for Multi-agent Market Simulation

Authors: Tianlang He, Keyan Lu, Chang Xu, Yang Liu, Weiqing Liu, S. -H. Gary Chan, Jiang Bian

Abstract: Order-driven market simulation mimics trader behaviors to generate order streams to support interactive studies of financial strategies. In market simulators, the multi-agent approach is commonly adopted due to its explainability. Existing multi-agent systems employ heuristic search to generate order streams, which is inefficient for large-scale simulation. Furthermore, the search-based behavior calibration often leads to inconsistent trader actions under the same general market condition, making the simulation results unstable and difficult to interpret. We propose CaliSim, the first search-free calibration approach for multi-agent market simulators, which achieves large-scale efficiency and behavior consistency. CaliSim uses meta-learning and devises a surrogate trading system with a consistency loss function for the reproducibility of order stream and trader behaviors. Extensive experiments in the market replay and case studies show that CaliSim achieves state-of-the-art performance in terms of order stream reproduction with consistent trader behavior and can capture patterns of real markets.

Submitted 5 June, 2023; originally announced July 2023.

arXiv:2307.12219 [pdf, other]

Improving Out-of-Distribution Robustness of Classifiers via Generative Interpolation

Authors: Haoyue Bai, Ceyuan Yang, Yinghao Xu, S. -H. Gary Chan, Bolei Zhou

Abstract: Deep neural networks achieve superior performance for learning from independent and identically distributed (i.i.d.) data. However, their performance deteriorates significantly when handling out-of-distribution (OoD) data, where the training and test data are drawn from different distributions. In this paper, we explore utilizing the generative models as a data augmentation source for improving out-of-distribution robustness of neural classifiers. Specifically, we develop a simple yet effective method called Generative Interpolation to fuse generative models trained from multiple domains for synthesizing diverse OoD samples. Training a generative model directly on the source domains tends to suffer from mode collapse and sometimes amplifies the data bias. Instead, we first train a StyleGAN model on one source domain and then fine-tune it on the other domains, resulting in many correlated generators whose model parameters share the same initialization and are thus aligned. We then linearly interpolate the model parameters of the generators to spawn new sets of generators. Such interpolated generators are used as an extra data augmentation source to train the classifiers. The interpolation coefficients can flexibly control the augmentation direction and strength. In addition, a style-mixing mechanism is applied to further improve the diversity of the generated OoD samples. Our experiments show that the proposed method explicitly increases the diversity of training domains and achieves consistent improvements over baselines across datasets and multiple different distribution shifts.

Submitted 22 July, 2023; originally announced July 2023.
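
The linear interpolation of generator parameters described in the abstract above can be illustrated with a small sketch. It assumes two generators whose parameters are stored under identical names and shapes, represented here as plain lists of floats. The names `interpolate_params`, `theta_a`, `theta_b`, and `alpha` are hypothetical and are not taken from the paper's code.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def interpolate_params(theta_a: Params, theta_b: Params, alpha: float) -> Params:
    """Return (1 - alpha) * theta_a + alpha * theta_b, parameter by parameter.

    alpha = 0 reproduces generator A and alpha = 1 reproduces generator B;
    values in between blend the two.
    """
    if theta_a.keys() != theta_b.keys():
        raise ValueError("generators must share the same parameter names")
    mixed: Params = {}
    for name in theta_a:
        wa, wb = theta_a[name], theta_b[name]
        if len(wa) != len(wb):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        mixed[name] = [(1.0 - alpha) * x + alpha * y for x, y in zip(wa, wb)]
    return mixed
```

Blending weights this way is only meaningful when the two parameter sets are aligned, which is the role the abstract assigns to fine-tuning all generators from the same initialization.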

arXiv:2307.05914 [pdf, other]

FIS-ONE: Floor Identification System with One Label for Crowdsourced RF Signals

Authors: Weipeng Zhuo, Ka Ho Chiu, Jierun Chen, Ziqi Zhao, S. -H. Gary Chan, Sangtae Ha, Chul-Ho Lee

Abstract: Floor labels of crowdsourced RF signals are crucial for many smart-city applications, such as multi-floor indoor localization, geofencing, and robot surveillance. To build a prediction model to identify the floor number of a new RF signal upon its measurement, conventional approaches using the crowdsourced RF signals assume that at least a few labeled signal samples are available on each floor. In this work, we push the envelope further and demonstrate that it is technically feasible to enable such floor identification with only one floor-labeled signal sample on the bottom floor while having the rest of the signal samples unlabeled. We propose FIS-ONE, a novel floor identification system with only one labeled sample. FIS-ONE consists of two steps, namely signal clustering and cluster indexing. We first build a bipartite graph to model the RF signal samples and obtain a latent representation of each node (each signal sample) using our attention-based graph neural network model so that the RF signal samples can be clustered more accurately. Then, we tackle the problem of indexing the clusters with proper floor labels, by leveraging the observation that signals from an access point can be detected on different floors, i.e., signal spillover. Specifically, we formulate a cluster indexing problem as a combinatorial optimization problem and show that it is equivalent to solving a traveling salesman problem, whose (near-)optimal solution can be found efficiently. We have implemented FIS-ONE and validated its effectiveness on the Microsoft dataset and in three large shopping malls. Our results show that FIS-ONE outperforms other baseline algorithms significantly, with up to 23% improvement in adjusted rand index and 25% improvement in normalized mutual information using only one floor-labeled signal sample.

Submitted 12 July, 2023; originally announced July 2023.

Comments: Accepted by IEEE ICDCS 2023

arXiv:2303.03667 [pdf, other]

Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks

Authors: Jierun Chen, Shiu-hong Kao, Hao He, Weipeng Zhuo, Song Wen, Chul-Ho Lee, S. -H. Gary Chan

Abstract: To design fast neural networks, many works have been focusing on reducing the number of floating-point operations (FLOPs). We observe that such reduction in FLOPs, however, does not necessarily lead to a similar level of reduction in latency. This mainly stems from inefficiently low floating-point operations per second (FLOPS). To achieve faster networks, we revisit popular operators and demonstrate that such low FLOPS is mainly due to frequent memory access of the operators, especially the depthwise convolution. We hence propose a novel partial convolution (PConv) that extracts spatial features more efficiently, by cutting down redundant computation and memory access simultaneously. Building upon our PConv, we further propose FasterNet, a new family of neural networks, which attains substantially higher running speed than others on a wide range of devices, without compromising on accuracy for various vision tasks. For example, on ImageNet-1k, our tiny FasterNet-T0 is $2.8\times$, $3.3\times$, and $2.4\times$ faster than MobileViT-XXS on GPU, CPU, and ARM processors, respectively, while being 2.9% more accurate. Our large FasterNet-L achieves an impressive 83.5% top-1 accuracy, on par with the emerging Swin-B, while having 36% higher inference throughput on GPU, as well as saving 37% compute time on CPU. Code is available at https://github.com/JierunChen/FasterNet.

Submitted 21 May, 2023; v1 submitted 7 March, 2023; originally announced March 2023.

Comments: Accepted to CVPR 2023
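
The distinction between FLOPs (operation count) and FLOPS (achieved operations per second) in the abstract above follows from the identity latency = FLOPs / FLOPS. The sketch below works through it with made-up numbers: the operation counts and throughputs are hypothetical and are not measurements from the paper.

```python
def latency_seconds(flops: float, flops_per_second: float) -> float:
    """Latency implied by an operation count and an achieved throughput."""
    if flops_per_second <= 0:
        raise ValueError("achieved throughput must be positive")
    return flops / flops_per_second


# Hypothetical baseline: 4 GFLOPs executed at an achieved 2 TFLOPS.
baseline = latency_seconds(4e9, 2e12)

# Hypothetical "lighter" network: half the FLOPs, but a memory-bound
# operator only sustains 0.5 TFLOPS, so it ends up slower overall.
lighter = latency_seconds(2e9, 0.5e12)

print(f"baseline: {baseline * 1e3:.1f} ms, lighter: {lighter * 1e3:.1f} ms")
```

In this toy case halving the FLOPs doubles the latency because the achieved FLOPS drops by a factor of four, which is the kind of mismatch the abstract attributes to frequent memory access.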

arXiv:2210.07895 [pdf, other]

GRAFICS: Graph Embedding-based Floor Identification Using Crowdsourced RF Signals

Authors: Weipeng Zhuo, Ziqi Zhao, Ka Ho Chiu, Shiju Li, Sangtae Ha, Chul-Ho Lee, S. -H. Gary Chan

Abstract: We study the problem of floor identification for radio frequency (RF) signal samples obtained in a crowdsourced manner, where the signal samples are highly heterogeneous and most samples lack their floor labels. We propose GRAFICS, a graph embedding-based floor identification system. GRAFICS first builds a highly versatile bipartite graph model, having access points (APs) on one side and signal samples on the other. GRAFICS then learns the low-dimensional embeddings of signal samples via a novel graph embedding algorithm named E-LINE. GRAFICS finally clusters the node embeddings along with the embeddings of a few labeled samples through a proximity-based hierarchical clustering, which eases the floor identification of every new sample. We validate the effectiveness of GRAFICS based on two large-scale datasets that contain RF signal records from 204 buildings in Hangzhou, China, and five buildings in Hong Kong. Our experimental results show that GRAFICS achieves highly accurate prediction performance with only a few labeled samples (96% in both micro- and macro-F scores) and significantly outperforms several state-of-the-art algorithms (by about 45% improvement in micro-F score and 53% in macro-F score).

Submitted 14 October, 2022; originally announced October 2022.

Comments: Accepted by IEEE ICDCS 2022

arXiv:2210.07889 [pdf, other]

Semi-supervised Learning with Network Embedding on Ambient RF Signals for Geofencing Services

Authors: Weipeng Zhuo, Ka Ho Chiu, Jierun Chen, Jiajie Tan, Edmund Sumpena, S. -H. Gary Chan, Sangtae Ha, Chul-Ho Lee

Abstract: In applications such as elderly care, dementia anti-wandering and pandemic control, it is important to ensure that people are within a predefined area for their safety and well-being. We propose GEM, a practical, semi-supervised Geofencing system with network EMbedding, which is based only on ambient radio frequency (RF) signals. GEM models measured RF signal records as a weighted bipartite graph. With access points on one side and signal records on the other, it is able to precisely capture the relationships between signal records. GEM then learns node embeddings from the graph via a novel bipartite network embedding algorithm called BiSAGE, based on a Bipartite graph neural network with a novel bi-level SAmple and aggreGatE mechanism and non-uniform neighborhood sampling. Using the learned embeddings, GEM finally builds a one-class classification model via an enhanced histogram-based algorithm for in-out detection, i.e., to detect whether the user is inside the area or not. This model also keeps on improving with newly collected signal records. We demonstrate through extensive experiments in diverse environments that GEM shows state-of-the-art performance with up to 34% improvement in F-score. BiSAGE in GEM leads to a 54% improvement in F-score, as compared to the one without BiSAGE.

Submitted 8 March, 2023; v1 submitted 14 October, 2022; originally announced October 2022.

Comments: A conference version of this paper will appear in IEEE ICDE 2023

arXiv:2203.10489 [pdf, other]

TVConv: Efficient Translation Variant Convolution for Layout-aware Visual Processing

Authors: Jierun Chen, Tianlang He, Weipeng Zhuo, Li Ma, Sangtae Ha, S. -H. Gary Chan

Abstract: As convolution has empowered many smart applications, dynamic convolution further equips it with the ability to adapt to diverse inputs. However, the static and dynamic convolutions are either layout-agnostic or computation-heavy, making them inappropriate for layout-specific applications, e.g., face recognition and medical image segmentation. We observe that these applications naturally exhibit the characteristics of large intra-image (spatial) variance and small cross-image variance. This observation motivates our efficient translation variant convolution (TVConv) for layout-aware visual processing. Technically, TVConv is composed of affinity maps and a weight-generating block. While affinity maps depict pixel-paired relationships gracefully, the weight-generating block can be explicitly overparameterized for better training while maintaining efficient inference. Although conceptually simple, TVConv significantly improves the efficiency of the convolution and can be readily plugged into various network architectures. Extensive experiments on face recognition show that TVConv reduces the computational cost by up to 3.1x and improves the corresponding throughput by 2.3x while maintaining a high accuracy compared to the depthwise convolution. Moreover, for the same computation cost, we boost the mean accuracy by up to 4.21%. We also conduct experiments on the optic disc/cup segmentation task and obtain better generalization performance, which helps mitigate the critical data scarcity issue. Code is available at https://github.com/JierunChen/TVConv.

Submitted 22 March, 2022; v1 submitted 20 March, 2022; originally announced March 2022.

Comments: Accepted to CVPR 2022

arXiv:2201.03817 [pdf, other]

Tackling Multipath and Biased Training Data for IMU-Assisted BLE Proximity Detection

Authors: Tianlang He, Jiajie Tan, Weipeng Zhuo, Maximilian Printz, S. -H. Gary Chan

Abstract: Proximity detection determines whether an IoT receiver is within a certain distance from a signal transmitter. Due to its low cost and high popularity, Bluetooth low energy (BLE) has been used to detect proximity based on the received signal strength indicator (RSSI). To address the fact that RSSI can be markedly influenced by device carriage states, previous works have incorporated RSSI with inertial measurement unit (IMU) using deep learning. However, they have not sufficiently accounted for the impact of multipath. Furthermore, due to the special setup, the IMU data collected in the training process may be biased, which hampers the system's robustness and generalizability. This issue has not been studied before. We propose PRID, an IMU-assisted BLE proximity detection approach robust against RSSI fluctuation and IMU data bias. PRID histogramizes RSSI to extract multipath features and uses carriage state regularization to mitigate overfitting due to IMU data bias. We further propose PRID-lite based on a binarized neural network to substantially cut memory requirements for resource-constrained devices. We have conducted extensive experiments under different multipath environments, data bias levels, and a crowdsourced dataset. Our results show that PRID significantly reduces false detection cases compared with the existing approaches (by over 50%). PRID-lite further reduces the PRID model size by over 90% and extends battery life by 60%, with a minor compromise in accuracy (7%).

Submitted 11 January, 2022; v1 submitted 11 January, 2022; originally announced January 2022.

arXiv:2201.00008 [pdf, other]

A Lightweight and Accurate Spatial-Temporal Transformer for Traffic Forecasting

Authors: Guanyao Li, Shuhan Zhong, S. -H. Gary Chan, Ruiyuan Li, Chih-Chieh Hung, Wen-Chih Peng

Abstract: We study the forecasting problem for traffic with dynamic, possibly periodical, and joint spatial-temporal dependency between regions. Given the aggregated inflow and outflow traffic of regions in a city from time slots 0 to t-1, we predict the traffic at time t at any region. Prior approaches in the area often consider the spatial and temporal dependencies in a decoupled manner or are rather computationally intensive in training with a large number of hyper-parameters to tune. We propose ST-TIS, a novel, lightweight, and accurate Spatial-Temporal Transformer with information fusion and region sampling for traffic forecasting. ST-TIS extends the canonical Transformer with information fusion and region sampling. The information fusion module captures the complex spatial-temporal dependency between regions. The region sampling module improves the efficiency and prediction accuracy, cutting the computation complexity for dependency learning from $O(n^2)$ to $O(n\sqrt{n})$, where n is the number of regions. With far fewer parameters than state-of-the-art models, the offline training of our model is significantly faster in terms of tuning and computation (with a reduction of up to 90% on training time and network parameters). Notwithstanding such training efficiency, extensive experiments show that ST-TIS is substantially more accurate in online prediction than state-of-the-art approaches (with an average improvement of up to 9.5% on RMSE, and 12.4% on MAPE).

Submitted 3 May, 2022; v1 submitted 30 December, 2021; originally announced January 2022.
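
To give a feel for the complexity reduction quoted in the abstract above, the sketch below counts region-to-region interactions under two schemes: every region attending to all n regions, versus every region attending to about $\sqrt{n}$ sampled regions. The abstract does not describe how ST-TIS selects regions, so the second scheme is an assumption made purely to illustrate the $O(n^2)$ versus $O(n\sqrt{n})$ growth, and the function names are hypothetical.

```python
import math


def full_pair_count(n: int) -> int:
    """Interactions when each of n regions attends to all n regions."""
    return n * n


def sampled_pair_count(n: int) -> int:
    """Interactions when each region attends to ceil(sqrt(n)) regions.

    Assumed sampling budget for illustration only.
    """
    k = math.isqrt(n)
    if k * k < n:
        k += 1  # round the square root up
    return n * k


for n in (100, 1024, 10_000):
    print(n, full_pair_count(n), sampled_pair_count(n))
```

For n = 1024 the count falls from 1,048,576 to 32,768, a factor of $\sqrt{n} = 32$.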

arXiv:2109.02038 [pdf, other]

NAS-OoD: Neural Architecture Search for Out-of-Distribution Generalization

Authors: Haoyue Bai, Fengwei Zhou, Lanqing Hong, Nanyang Ye, S. -H. Gary Chan, Zhenguo Li

Abstract: Recent advances in Out-of-Distribution (OoD) generalization reveal the robustness of deep learning models against distribution shifts. However, existing works focus on OoD algorithms, such as invariant risk minimization, domain generalization, or stable learning, without considering the influence of deep model architectures on OoD generalization, which may lead to sub-optimal performance. Neural Architecture Search (NAS) methods search for architecture based on its performance on the training data, which may result in poor generalization for OoD tasks. In this work, we propose robust Neural Architecture Search for OoD generalization (NAS-OoD), which optimizes the architecture with respect to its performance on generated OoD data by gradient descent. Specifically, a data generator is learned to synthesize OoD data by maximizing losses computed by different neural architectures, while the goal for architecture search is to find the optimal architecture parameters that minimize the synthetic OoD data losses. The data generator and the neural architecture are jointly optimized in an end-to-end manner, and the minimax training process effectively discovers robust architectures that generalize well for different distribution shifts. Extensive experimental results show that NAS-OoD achieves superior performance on various OoD generalization benchmarks with deep models having far fewer parameters. In addition, on a real industry dataset, the proposed NAS-OoD method reduces the error rate by more than 70% compared with the state-of-the-art method, demonstrating the proposed method's practicality for real applications.

Submitted 5 September, 2021; originally announced September 2021.

Comments: Accepted by ICCV2021

arXiv:2105.09684 [pdf, other]

Crowd Counting by Self-supervised Transfer Colorization Learning and Global Prior Classification

Authors: Haoyue Bai, Song Wen, S. -H. Gary Chan

Abstract: Labeled crowd scene images are expensive and scarce. To significantly reduce the requirement for labeled images, we propose ColorCount, a novel CNN-based approach that combines self-supervised transfer colorization learning and global prior classification to leverage the abundantly available unlabeled data. The self-supervised colorization branch learns the semantics and surface texture of the image by using its color components as pseudo labels. The classification branch extracts global group priors by learning correlations among image clusters. Their fused resultant discriminative features (global priors, semantics and textures) provide ample priors for counting, hence significantly reducing the requirement for labeled images. We conduct extensive experiments on four challenging benchmarks. ColorCount achieves much better performance as compared with other unsupervised approaches. Its performance is close to the supervised baseline with substantially less labeled data (10% of the original one).

Submitted 20 May, 2021; originally announced May 2021.

arXiv:2104.13946 [pdf, other]

Motion-guided Non-local Spatial-Temporal Network for Video Crowd Counting

Authors: Haoyue Bai, S. -H. Gary Chan

Abstract: We study video crowd counting, which is to estimate the number of objects (people in this paper) in all the frames of a video sequence. Previous work on crowd counting is mostly on still images. There has been little work on how to properly extract and take advantage of the spatial-temporal correlation between neighboring frames in both short and long ranges to achieve high estimation accuracy for a video sequence. In this work, we propose Monet, a novel and highly accurate motion-guided non-local spatial-temporal network for video crowd counting. Monet first takes people flow (motion information) as guidance to coarsely segment the regions of pixels where a person may be. Given these regions, Monet then uses a non-local spatial-temporal network to extract both short- and long-range spatial-temporal contextual information. The whole network is finally trained end-to-end with a fused loss to generate a high-quality density map. Noting the scarcity and low quality (in terms of resolution and scene diversity) of the publicly available video crowd datasets, we have collected and built a large-scale video crowd counting dataset, VidCrowd, to contribute to the community. VidCrowd contains 9,000 frames of high resolution (2560 x 1440), with 1,150,239 head annotations captured in different scenes, crowd density and lighting in two cities. We have conducted extensive experiments on the challenging VidCrowd and two public video crowd counting datasets: UCSD and Mall. Our approach achieves substantially better performance in terms of MAE and MSE as compared with other state-of-the-art approaches.

Submitted 28 April, 2021; originally announced April 2021.

arXiv:2101.04442 [pdf, other]

Joint Demosaicking and Denoising in the Wild: The Case of Training Under Ground Truth Uncertainty

Authors: Jierun Chen, Song Wen, S. -H. Gary Chan

Abstract: Image demosaicking and denoising are the two key fundamental steps in digital camera pipelines, aiming to reconstruct clean color images from noisy luminance readings. In this paper, we propose and study Wild-JDD, a novel learning framework for joint demosaicking and denoising in the wild. In contrast to previous works which generally assume the ground truth of training data is a perfect reflection of the reality, we consider here the more common imperfect case of ground truth uncertainty in the wild. We first illustrate its manifestation as various kinds of artifacts including zipper effect, color moire and residual noise. Then we formulate a two-stage data degradation process to capture such ground truth uncertainty, where a conjugate prior distribution is imposed upon a base distribution. After that, we derive an evidence lower bound (ELBO) loss to train a neural network that approximates the parameters of the conjugate prior distribution conditioned on the degraded input. Finally, to further enhance the performance for out-of-distribution input, we design a simple but effective fine-tuning strategy by taking the input as a weakly informative prior. Taking into account ground truth uncertainty, Wild-JDD enjoys good interpretability during optimization. Extensive experiments validate that it outperforms state-of-the-art schemes on joint demosaicking and denoising tasks on both synthetic and realistic raw datasets.

Submitted 12 January, 2021; originally announced January 2021.

Comments: Accepted by AAAI2021

arXiv:2012.15685 [pdf, other]

A Survey on Deep Learning-based Single Image Crowd Counting: Network Design, Loss Function and Supervisory Signal

Authors: Haoyue Bai, Jiageng Mao, S. -H. Gary Chan

Abstract: Single image crowd counting is a challenging computer vision problem with wide applications in public safety, city planning, traffic management, etc. With the recent development of deep learning techniques, crowd counting has attracted much attention and achieved great success in recent years. This survey provides a comprehensive summary of recent advances in deep learning-based crowd counting techniques via density map estimation by systematically reviewing and summarizing more than 200 works in the area since 2015. Our goals are to provide an up-to-date review of recent approaches, and to educate new researchers in this field on the design principles and trade-offs. After presenting publicly available datasets and evaluation metrics, we review the recent advances with detailed comparisons on three major design modules for crowd counting: deep neural network designs, loss functions, and supervisory signals. We study and compare the approaches using the public datasets and evaluation metrics. We conclude the survey with some future directions.

Submitted 11 July, 2022; v1 submitted 31 December, 2020; originally announced December 2020.

Comments: Neurocomputing minor revision. Project page is at https://github.com/HaoyueBaiZJU/A-Recent-Systematic-Survey-for-Crowd-Counting

arXiv:2012.09382 [pdf, other]

DecAug: Out-of-Distribution Generalization via Decomposed Feature Representation and Semantic Augmentation

Authors: Haoyue Bai, Rui Sun, Lanqing Hong, Fengwei Zhou, Nanyang Ye, Han-Jia Ye, S. -H. Gary Chan, Zhenguo Li

Abstract: While deep learning demonstrates a strong ability to handle independent and identically distributed (IID) data, it often struggles with out-of-distribution (OoD) generalization, where the test data come from a different distribution than the training data. Designing a general OoD generalization framework for a wide range of applications is challenging, mainly due to possible correlation shift and diversity shift in the real world. Most of the previous approaches can only solve one specific distribution shift, such as shift across domains or the extrapolation of correlation. To address that, we propose DecAug, a novel decomposed feature representation and semantic augmentation approach for OoD generalization. DecAug disentangles the category-related and context-related features. Category-related features contain causal information of the target object, while context-related features describe the attributes, styles, backgrounds, or scenes, causing distribution shifts between training and test data. The decomposition is achieved by orthogonalizing the two gradients (w.r.t. intermediate features) of losses for predicting category and context labels. Furthermore, we perform gradient-based augmentation on context-related features to improve the robustness of the learned representations. Experimental results show that DecAug outperforms other state-of-the-art methods on various OoD datasets and is among the very few methods that can deal with different types of OoD generalization challenges.

Submitted 16 December, 2020; originally announced December 2020.

Comments: Accepted by AAAI2021

arXiv:2009.05944 [pdf, other]

vContact: Private WiFi-based Contact Tracing with Virus Lifespan

Authors: Guanyao Li, Siyan Hu, Shuhan Zhong, Wai Lun Tsui, S. -H. Gary Chan

Abstract: Covid-19 is primarily spread through contact with the virus, which may survive on surfaces with a lifespan of hours or longer. To curb its spread, it is hence of vital importance to detect and quarantine those who have been in contact with the virus for a sustained period of time, the so-called close contacts. In this work, we study, for the first time, automatic contact detection when the virus has a lifespan. Leveraging the ubiquity of WiFi signals, we propose a novel, private, and fully distributed WiFi-based approach called vContact. Users installing an app continuously scan WiFi and store its hashed IDs. Given a confirmed case, the signals of the major places he/she visited are then uploaded to a server and matched with the stored signals of users to detect contact. vContact is not based on phone pairing, and no information of any other users is stored locally. The confirmed case does not need to have installed the app for it to work properly. As WiFi data are sampled sporadically, we propose efficient signal processing approaches and a similarity metric to align and match signals of any time. We conduct extensive indoor and outdoor experiments to evaluate the performance of vContact. Our results demonstrate that vContact is efficient and robust for contact detection. The precision and recall of contact detection are high (in the range of 50-90%) for close contact proximity (2m). Its performance is robust with respect to signal lengths (AP numbers) and phone heterogeneity. By implementing vContact as an app, we present a case study to demonstrate the validity of our design in notifying its users of their exposure to a virus with a lifespan.

Submitted 26 January, 2021; v1 submitted 13 September, 2020; originally announced September 2020.

arXiv:1909.03839 [pdf, other]

Crowd Counting on Images with Scale Variation and Isolated Clusters

Authors: Haoyue Bai, Song Wen, S. -H. Gary Chan

Abstract: Crowd counting is to estimate the number of objects (e.g., people or vehicles) in an image of unconstrained congested scenes. Designing a general crowd counting algorithm applicable to a wide range of crowd images is challenging, mainly due to the possibly large variation in object scales and the presence of many isolated small clusters. Previous approaches based on convolution operations with multi-branch architecture are effective for only some narrow bands of scales and have not captured the long-range contextual relationship due to isolated clustering. To address that, we propose SACANet, a novel scale-adaptive long-range context-aware network for crowd counting. SACANet consists of three major modules: the pyramid contextual module which extracts long-range contextual information and enlarges the receptive field, a scale-adaptive self-attention multi-branch module to attain high scale sensitivity and detection accuracy of isolated clusters, and a hierarchical fusion module to fuse multi-level self-attention features. With group normalization, SACANet achieves better optimality in the training process. We have conducted extensive experiments using the VisDrone2019 People dataset, the VisDrone2019 Vehicle dataset, and some other challenging benchmarks. As compared with the state-of-the-art methods, SACANet is shown to be effective, especially for extremely crowded conditions with diverse scales and scattered clusters, and achieves much lower MAE as compared with baselines.

Submitted 9 September, 2019; originally announced September 2019.

Comments: Accepted at International Conference on Computer Vision (ICCV) 2019 Workshop

arXiv:1903.02082 [pdf, other]

DA-LSTM: A Long Short-Term Memory with Depth Adaptive to Non-uniform Information Flow in Sequential Data

Authors: Yifeng Zhang, Ka-Ho Chow, S. -H. Gary Chan

Abstract: Much sequential data exhibits highly non-uniform information distribution. This cannot be correctly modeled by traditional Long Short-Term Memory (LSTM). To address that, recent works have extended LSTM by adding more activations between adjacent inputs. However, the approaches often use a fixed depth, which is set by the step with the most information content. This one-size-fits-all worst-case approach is not satisfactory, because when little information is distributed to some steps, shallow structures can achieve faster convergence and consume fewer computation resources. In this paper, we develop a Depth-Adaptive Long Short-Term Memory (DA-LSTM) architecture, which can dynamically adjust the structure depending on information distribution without prior knowledge. Experimental results on real-world datasets show that DA-LSTM consumes far fewer computation resources and substantially reduces convergence time, by 41.78% and 46.01% compared with Stacked LSTM and Deep Transition LSTM, respectively.

Submitted 18 January, 2019; originally announced March 2019.

arXiv:1811.08069 [pdf, other]

Representation Learning of Pedestrian Trajectories Using Actor-Critic Sequence-to-Sequence Autoencoder

Authors: Ka-Ho Chow, Anish Hiranandani, Yifeng Zhang, S. -H. Gary Chan

Abstract: Representation learning of pedestrian trajectories transforms variable-length timestamp-coordinate tuples of a trajectory into a fixed-length vector representation that summarizes spatiotemporal characteristics. It is a crucial technique to connect feature-based data mining with trajectory data. Trajectory representation is a challenging problem, because both environmental constraints (e.g., wall partitions) and temporal user dynamics should be meticulously considered and accounted for. Furthermore, traditional sequence-to-sequence autoencoders using maximum log-likelihood often require a dataset covering all the possible spatiotemporal characteristics to perform well. This is infeasible or impractical in reality. We propose TREP, a practical pedestrian trajectory representation learning algorithm which captures the environmental constraints and the pedestrian dynamics without the need for any training dataset. By formulating a sequence-to-sequence autoencoder with a spatial-aware objective function under the paradigm of actor-critic reinforcement learning, TREP intelligently encodes spatiotemporal characteristics of trajectories with the capability of handling diverse trajectory patterns. Extensive experiments on both synthetic and real datasets validate the high fidelity of TREP to represent trajectories.

Submitted 19 November, 2018; originally announced November 2018.

arXiv:1211.4767 [pdf, ps, other]

Collaborative P2P Streaming of Interactive Live Free Viewpoint Video

Authors: Dongni Ren, S. -H. Gary Chan, Gene Cheung, Vicky Zhao, Pascal Frossard

Abstract: We study an interactive live streaming scenario where multiple peers pull streams of the same free viewpoint video that are synchronized in time but not necessarily in view. In free viewpoint video, each user can periodically select a virtual view between two anchor camera views for display. The virtual view is synthesized using texture and depth videos of the anchor views via depth-image-based rendering (DIBR). In general, the distortion of the virtual view increases with the distance to the anchor views, and hence it is beneficial for a peer to select the closest anchor views for synthesis. On the other hand, if peers interested in different virtual views are willing to tolerate larger distortion in using more distant anchor views, they can collectively share the access cost of common anchor views. Given anchor view access cost and synthesized distortion of virtual views between anchor views, we study the optimization of anchor view allocation for collaborative peers. We first show that, if the network reconfiguration costs due to view-switching are negligible, the problem can be optimally solved in polynomial time using dynamic programming. We then consider the case of non-negligible reconfiguration costs (e.g., large or frequent view-switching leading to anchor-view changes). In this case, the view allocation problem becomes NP-hard. We thus present a locally optimal and centralized allocation algorithm inspired by Lloyd's algorithm in non-uniform scalar quantization. We also propose a distributed algorithm with guaranteed convergence where each peer group independently makes merge-and-split decisions with a well-defined fairness criterion. The results show that, depending on the problem settings, our proposed algorithms achieve respective optimal and close-to-optimal performance in terms of total cost, and outperform a P2P scheme without collaborative anchor selection.

Submitted 20 November, 2012; originally announced November 2012.

Showing 1–25 of 25 results for author: Chan, S - G