-
FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs
Authors:
Haodong Chen,
Haojian Huang,
Junhao Dong,
Mingzhe Zheng,
Dian Shao
Abstract:
Dynamic Facial Expression Recognition (DFER) is crucial for understanding human behavior. However, current methods exhibit limited performance mainly due to the scarcity of high-quality data, the insufficient utilization of facial dynamics, and the ambiguity of expression semantics, etc. To this end, we propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (FineCLIPER), incorporating the following novel designs: 1) To better distinguish between similar facial expressions, we extend the class labels to textual descriptions from both positive and negative aspects, and obtain supervision by calculating the cross-modal similarity based on the CLIP model; 2) Our FineCLIPER adopts a hierarchical manner to effectively mine useful cues from DFE videos. Specifically, besides directly embedding video frames as input (low semantic level), we propose to extract the face segmentation masks and landmarks based on each frame (middle semantic level) and utilize the Multi-modal Large Language Model (MLLM) to further generate detailed descriptions of facial changes across frames with designed prompts (high semantic level). Additionally, we also adopt Parameter-Efficient Fine-Tuning (PEFT) to enable efficient adaptation of large pre-trained models (i.e., CLIP) for this task. Our FineCLIPER achieves SOTA performance on the DFEW, FERV39k, and MAFW datasets in both supervised and zero-shot settings with few tunable parameters. Analysis and ablation studies further validate its effectiveness.
Submitted 2 July, 2024;
originally announced July 2024.
-
Learning Decision Policies with Instrumental Variables through Double Machine Learning
Authors:
Daqian Shao,
Ashkan Soleymani,
Francesco Quinzan,
Marta Kwiatkowska
Abstract:
A common issue in learning decision-making policies in data-rich settings is spurious correlations in the offline dataset, which can be caused by hidden confounders. Instrumental variable (IV) regression, which utilises a key unconfounded variable known as the instrument, is a standard technique for learning causal relationships between confounded action, outcome, and context variables. Most recent IV regression algorithms use a two-stage approach, where a deep neural network (DNN) estimator learnt in the first stage is directly plugged into the second stage, in which another DNN is used to estimate the causal effect. Naively plugging the estimator can cause heavy bias in the second stage, especially when regularisation bias is present in the first stage estimator. We propose DML-IV, a non-linear IV regression method that reduces the bias in two-stage IV regressions and effectively learns high-performing policies. We derive a novel learning objective to reduce bias and design the DML-IV algorithm following the double/debiased machine learning (DML) framework. The learnt DML-IV estimator has strong convergence rate and $O(N^{-1/2})$ suboptimality guarantees that match those when the dataset is unconfounded. DML-IV outperforms state-of-the-art IV regression methods on IV regression benchmarks and learns high-performing policies in the presence of instruments.
Submitted 28 June, 2024; v1 submitted 14 May, 2024;
originally announced May 2024.
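As a rough illustration of why the instrument matters in the DML-IV abstract above, the sketch below simulates a hidden confounder and compares a naive regression slope with the classical single-instrument IV estimate. It is not the DML-IV algorithm: the linear data-generating process, the coefficients, and the helper names (`simulate`, `ols_slope`, `iv_slope`) are all assumptions made for this example.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)


def simulate(n, true_effect=2.0, seed=0):
    """Draw (instrument, action, outcome) triples from an assumed linear model.

    The confounder influences both action and outcome but is never returned,
    mimicking a hidden confounder in an offline dataset.
    """
    rng = random.Random(seed)
    instruments, actions, outcomes = [], [], []
    for _ in range(n):
        instrument = rng.gauss(0.0, 1.0)
        confounder = rng.gauss(0.0, 1.0)
        action = instrument + confounder + rng.gauss(0.0, 1.0)
        outcome = true_effect * action + 3.0 * confounder + rng.gauss(0.0, 1.0)
        instruments.append(instrument)
        actions.append(action)
        outcomes.append(outcome)
    return instruments, actions, outcomes


def ols_slope(actions, outcomes):
    """Naive regression slope; biased when a hidden confounder is present."""
    return covariance(actions, outcomes) / covariance(actions, actions)


def iv_slope(instruments, actions, outcomes):
    """Classical IV (Wald) estimate for a single scalar instrument."""
    return covariance(instruments, outcomes) / covariance(instruments, actions)


if __name__ == "__main__":
    z, x, y = simulate(20000)
    print("naive slope:", ols_slope(x, y))
    print("IV slope:   ", iv_slope(z, x, y))
```

In this toy setting the naive slope absorbs part of the confounder's effect, while the IV slope stays close to the simulated causal effect. The abstract's concern is the non-linear, two-stage neural case, where plugging in a regularised first-stage estimator introduces additional bias.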
-
GaussianVTON: 3D Human Virtual Try-ON via Multi-Stage Gaussian Splatting Editing with Image Prompting
Authors:
Haodong Chen,
Yongle Huang,
Haojian Huang,
Xiangsheng Ge,
Dian Shao
Abstract:
The increasing prominence of e-commerce has underscored the importance of Virtual Try-On (VTON). However, previous studies predominantly focus on the 2D realm and rely heavily on extensive data for training. Research on 3D VTON primarily centers on garment-body shape compatibility, a topic extensively covered in 2D VTON. Thanks to advances in 3D scene editing, a 2D diffusion model has now been adapted for 3D editing via multi-viewpoint editing. In this work, we propose GaussianVTON, an innovative 3D VTON pipeline integrating Gaussian Splatting (GS) editing with 2D VTON. To facilitate a seamless transition from 2D to 3D VTON, we propose, for the first time, the use of only images as editing prompts for 3D editing. To further address issues, e.g., face blurring, garment inaccuracy, and degraded viewpoint quality during editing, we devise a three-stage refinement strategy to gradually mitigate potential issues. Furthermore, we introduce a new editing strategy termed Edit Recall Reconstruction (ERR) to tackle the limitations of previous editing strategies in leading to complex geometric changes. Our comprehensive experiments demonstrate the superiority of GaussianVTON, offering a novel perspective on 3D VTON while also establishing a novel starting point for image-prompting 3D scene editing.
Submitted 23 May, 2024; v1 submitted 13 May, 2024;
originally announced May 2024.
-
Separate, Dynamic and Differentiable (SMART) Pruner for Block/Output Channel Pruning on Computer Vision Tasks
Authors:
Guanhua Ding,
Zexi Ye,
Zhen Zhong,
Gang Li,
David Shao
Abstract:
Deep Neural Network (DNN) pruning has emerged as a key strategy to reduce model size, improve inference latency, and lower power consumption on DNN accelerators. Among various pruning techniques, block and output channel pruning have shown significant potential in accelerating hardware performance. However, their accuracy often requires further improvement. In response to this challenge, we introduce a separate, dynamic and differentiable (SMART) pruner. This pruner stands out by utilizing a separate, learnable probability mask for weight importance ranking, employing a differentiable Top k operator to achieve target sparsity, and leveraging a dynamic temperature parameter trick to escape from non-sparse local minima. In our experiments, the SMART pruner consistently demonstrated its superiority over existing pruning methods across a wide range of tasks and models on block and output channel pruning. Additionally, we extend our testing to Transformer-based models in N:M pruning scenarios, where SMART pruner also yields state-of-the-art results, demonstrating its adaptability and robustness across various neural network architectures, and pruning types.
Submitted 29 March, 2024;
originally announced March 2024.
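The SMART pruner abstract above mentions a differentiable Top k operator and a dynamic temperature. One common way to relax Top-k, sketched below, is a sigmoid mask whose threshold is chosen so that the mask sums to k. This is an assumed stand-in for illustration, not the operator defined in the paper, and the function names are hypothetical.

```python
import math


def stable_sigmoid(x):
    """Logistic function written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_top_k_mask(scores, k, temperature):
    """Return a soft mask in (0, 1) whose entries sum to approximately k.

    Each entry is sigmoid((score - threshold) / temperature). The threshold is
    found by bisection, since the mask sum decreases as the threshold rises.
    """
    if not 0 < k < len(scores):
        raise ValueError("k must satisfy 0 < k < len(scores)")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    low = min(scores) - 50.0 * temperature   # mask sum is about len(scores) here
    high = max(scores) + 50.0 * temperature  # mask sum is about 0 here
    for _ in range(100):
        threshold = 0.5 * (low + high)
        total = sum(stable_sigmoid((s - threshold) / temperature) for s in scores)
        if total > k:
            low = threshold   # mask too dense: raise the threshold
        else:
            high = threshold  # mask too sparse: lower the threshold
    threshold = 0.5 * (low + high)
    return [stable_sigmoid((s - threshold) / temperature) for s in scores]


if __name__ == "__main__":
    importance = [0.1, 0.9, 0.4, 0.7, 0.2]
    # Lowering the temperature moves the mask from smooth toward a hard Top-k.
    for temperature in (1.0, 0.1, 0.01):
        mask = soft_top_k_mask(importance, k=2, temperature=temperature)
        print(temperature, [round(m, 3) for m in mask])
```

At a high temperature every weight keeps a non-trivial mask value, so gradients reach all of them. As the temperature is lowered, the mask approaches a hard selection of the k largest scores, which is the general intuition behind annealing the temperature during training.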
-
STR-Cert: Robustness Certification for Deep Text Recognition on Deep Learning Pipelines and Vision Transformers
Authors:
Daqian Shao,
Lukas Fesser,
Marta Kwiatkowska
Abstract:
Robustness certification, which aims to formally certify the predictions of neural networks against adversarial inputs, has become an important tool for safety-critical applications. Despite considerable progress, existing certification methods are limited to elementary architectures, such as convolutional networks, recurrent networks and recently Transformers, on benchmark datasets such as MNIST. In this paper, we focus on the robustness certification of scene text recognition (STR), which is a complex and extensively deployed image-based sequence prediction problem. We tackle three types of STR model architectures, including the standard STR pipelines and the Vision Transformer. We propose STR-Cert, the first certification method for STR models, by significantly extending the DeepPoly polyhedral verification framework via deriving novel polyhedral bounds and algorithms for key STR model components. Finally, we certify and compare STR models on six datasets, demonstrating the efficiency and scalability of robustness certification, particularly for the Vision Transformer.
Submitted 28 November, 2023;
originally announced January 2024.
-
Tracking without Label: Unsupervised Multiple Object Tracking via Contrastive Similarity Learning
Authors:
Sha Meng,
Dian Shao,
Jiacheng Guo,
Shan Gao
Abstract:
Unsupervised learning is a challenging task due to the lack of labels. Multiple Object Tracking (MOT), which inevitably suffers from mutual object interference, occlusion, etc., is even more difficult without label supervision. In this paper, we explore the latent consistency of sample features across video frames and propose an Unsupervised Contrastive Similarity Learning method, named UCSL, including three contrast modules: self-contrast, cross-contrast, and ambiguity contrast. Specifically, i) self-contrast uses intra-frame direct and inter-frame indirect contrast to obtain discriminative representations by maximizing self-similarity; ii) cross-contrast aligns cross- and continuous-frame matching results, mitigating the persistent negative effect caused by object occlusion; and iii) ambiguity contrast matches ambiguous objects with each other to further increase the certainty of subsequent object association in an implicit manner. On existing benchmarks, our method outperforms the existing unsupervised methods using only limited help from a ReID head, and even provides higher accuracy than many fully supervised methods.
Submitted 2 September, 2023;
originally announced September 2023.
-
A General-Purpose Self-Supervised Model for Computational Pathology
Authors:
Richard J. Chen,
Tong Ding,
Ming Y. Lu,
Drew F. K. Williamson,
Guillaume Jaume,
Bowen Chen,
Andrew Zhang,
Daniel Shao,
Andrew H. Song,
Muhammad Shaban,
Mane Williams,
Anurag Vaidya,
Sharifa Sahai,
Lukas Oldenburg,
Luca L. Weishaupt,
Judy J. Wang,
Walt Williams,
Long Phi Le,
Georg Gerber,
Faisal Mahmood
Abstract:
Tissue phenotyping is a fundamental computational pathology (CPath) task in learning objective characterizations of histopathologic biomarkers in anatomic pathology. However, whole-slide imaging (WSI) poses a complex computer vision problem in which the large-scale image resolutions of WSIs and the enormous diversity of morphological phenotypes preclude large-scale data annotation. Current efforts have proposed using pretrained image encoders with either transfer learning from natural image datasets or self-supervised pretraining on publicly-available histopathology datasets, but have not been extensively developed and evaluated across diverse tissue types at scale. We introduce UNI, a general-purpose self-supervised model for pathology, pretrained using over 100 million tissue patches from over 100,000 diagnostic haematoxylin and eosin-stained WSIs across 20 major tissue types, and evaluated on 33 representative CPath clinical tasks of varying diagnostic difficulty. In addition to outperforming previous state-of-the-art models, we demonstrate new modeling capabilities in CPath such as resolution-agnostic tissue classification, slide classification using few-shot class prototypes, and disease subtyping generalization in classifying up to 108 cancer types in the OncoTree code classification system. UNI advances unsupervised representation learning at scale in CPath in terms of both pretraining data and downstream evaluation, enabling data-efficient AI models that can generalize and transfer to a gamut of diagnostically-challenging tasks and clinical workflows in anatomic pathology.
Submitted 29 August, 2023;
originally announced August 2023.
-
Weighted Point Cloud Normal Estimation
Authors:
Weijia Wang,
Xuequan Lu,
Di Shao,
Xiao Liu,
Richard Dazeley,
Antonio Robles-Kelly,
Wei Pan
Abstract:
Existing normal estimation methods for point clouds are often less robust to severe noise and complex geometric structures. Also, they usually ignore the contributions of different neighbouring points during normal estimation, which leads to less accurate results. In this paper, we introduce a weighted normal estimation method for 3D point cloud data. We innovate in two key points: 1) we develop a novel weighted normal regression technique that predicts point-wise weights from local point patches and use them for robust, feature-preserving normal regression; 2) we propose to conduct contrastive learning between point patches and the corresponding ground-truth normals of the patches' central points as a pre-training process to facilitate normal regression. Comprehensive experiments demonstrate that our method can robustly handle noisy and complex point clouds, achieving state-of-the-art performance on both synthetic and real-world datasets.
Submitted 6 May, 2023;
originally announced May 2023.
-
Sample Efficient Model-free Reinforcement Learning from LTL Specifications with Optimality Guarantees
Authors:
Daqian Shao,
Marta Kwiatkowska
Abstract:
Linear Temporal Logic (LTL) is widely used to specify high-level objectives for system policies, and it is highly desirable for autonomous systems to learn the optimal policy with respect to such specifications. However, learning the optimal policy from LTL specifications is not trivial. We present a model-free Reinforcement Learning (RL) approach that efficiently learns an optimal policy for an unknown stochastic system, modelled using Markov Decision Processes (MDPs). We propose a novel and more general product MDP, reward structure and discounting mechanism that, when applied in conjunction with off-the-shelf model-free RL algorithms, efficiently learn the optimal policy that maximizes the probability of satisfying a given LTL specification with optimality guarantees. We also provide improved theoretical results on choosing the key parameters in RL to ensure optimality. To directly evaluate the learned policy, we adopt probabilistic model checker PRISM to compute the probability of the policy satisfying such specifications. Several experiments on various tabular MDP environments across different LTL tasks demonstrate the improved sample efficiency and optimal policy convergence.
Submitted 3 May, 2023; v1 submitted 2 May, 2023;
originally announced May 2023.
-
Group Equivariant BEV for 3D Object Detection
Authors:
Hongwei Liu,
Jian Yang,
Jianfeng Zhang,
Dongheng Shao,
Jielong Guo,
Shaobo Li,
Xuan Tang,
Xian Wei
Abstract:
Recently, 3D object detection has attracted significant attention and achieved continuous improvement in real road scenarios. The environmental information is collected from a single sensor or multi-sensor fusion to detect objects of interest. However, most of the current 3D object detection approaches focus on developing advanced network architectures to improve the detection precision of the object rather than considering the dynamic driving scenes, where data collected from sensors equipped in the vehicle contain various perturbation features. As a result, existing work still cannot tackle the perturbation issue. In order to solve this problem, we propose a group equivariant bird's eye view network (GeqBevNet) based on the group equivariant theory, which introduces the concept of group equivariance into the BEV fusion object detection network. The group equivariant network is embedded into the fused BEV feature map to facilitate the BEV-level rotational equivariant feature extraction, thus leading to lower average orientation error. In order to demonstrate the effectiveness of the GeqBevNet, the network is verified on the nuScenes validation dataset in which mAOE can be decreased to 0.325. Experimental results demonstrate that GeqBevNet can extract more rotational equivariant features in the 3D object detection of the actual road scene and improve the performance of object orientation prediction.
Submitted 28 June, 2023; v1 submitted 26 April, 2023;
originally announced April 2023.
-
Using Consensual Biterms from Text Structures of Requirements and Code to Improve IR-Based Traceability Recovery
Authors:
Hui Gao,
Hongyu Kuang,
Kexin Sun,
Xiaoxing Ma,
Alexander Egyed,
Patrick Mäder,
Guoping Rong,
Dong Shao,
He Zhang
Abstract:
Traceability approves trace links among software artifacts based on whether two artifacts are related by system functionalities. The traces are valuable for software development, but are difficult to obtain manually. To cope with the costly and fallible manual recovery, automated approaches are proposed to recover traces through textual similarities among software artifacts, such as those based on Information Retrieval (IR). However, the low quality & quantity of artifact texts negatively impact the calculated IR values, thus greatly hindering the performance of IR-based approaches. In this study, we propose to extract co-occurred word pairs from the text structures of both requirements and code (i.e., consensual biterms) to improve IR-based traceability recovery. We first collect a set of biterms based on the part-of-speech of requirement texts, and then filter them through the code texts. We then use these consensual biterms to both enrich the input corpus for IR techniques and enhance the calculations of IR values. A nine-system-based evaluation shows that in general, when solely used to enhance IR techniques, our approach can outperform pure IR-based approaches and another baseline by 21.9% & 21.8% in AP, and 9.3% & 7.2% in MAP, respectively. Moreover, when used to collaborate with another enhancing strategy from different perspectives, it can outperform this baseline by 5.9% in AP and 4.8% in MAP.
Submitted 4 September, 2022;
originally announced September 2022.
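The traceability abstract above reports gains in AP and MAP. The sketch below shows one standard way to compute average precision over a ranked list of candidate trace links and its mean over several queries. The exact definitions used in the paper may differ (for example, whether AP is computed over one global ranking), so treat this as an assumed, generic formulation with hypothetical function names.

```python
def average_precision(relevance):
    """Average precision of one ranked list.

    `relevance` holds one truthy/falsy flag per ranked candidate link, best
    ranked first. Precision is sampled at the rank of every relevant link and
    averaged over the relevant links found.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-query average precision values."""
    if not rankings:
        raise ValueError("at least one ranking is required")
    return sum(average_precision(r) for r in rankings) / len(rankings)


if __name__ == "__main__":
    # Each inner list is the ranked candidate list for one requirement.
    per_requirement = [
        [True, False, True, False],
        [False, True, False, False],
    ]
    print(mean_average_precision(per_requirement))
```

Because precision is sampled at the rank of each true link, moving true links toward the top of the list raises the score even when the set of retrieved links stays the same. That makes these metrics sensitive to the ranking quality that IR-based recovery tries to improve.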
-
A Cross-Company Ethnographic Study on Software Teams for DevOps and Microservices: Organization, Benefits, and Issues
Authors:
Xin Zhou,
Huang Huang,
He Zhang,
Xin Huang,
Dong Shao,
Chenxing Zhong
Abstract:
Context: DevOps and microservices are acknowledged to be important new paradigms to tackle contemporary software demands and provide capabilities for rapid and reliable software development. Industrial reports show that they are quickly adopted together in massive software companies. However, because of the technical and organizational requirements, many difficulties hindering the efficient implementation of both emerge in real software teams. Objectives: This study aims to discover the organization, benefits and issues of software teams using DevOps & microservices from an immersive perspective. Method: An ethnographic study was carried out in three companies with different business, size, products, customers, and degree of globalization. All three companies claimed their adoption of DevOps and microservices. Seven months (cumulative) of participant observations and nine interviews with practitioners were conducted to collect the data of software teams related to DevOps and microservices. A cross-company empirical investigation using grounded theory was done by analyzing the archive data. Results: The adoption of DevOps and microservices brings benefits to rapid delivery, ability improvements and burden reduction, whilst high cost and a lack of practical guidance emerged as issues. Moreover, our observations and interviews reflect that in software teams, the relationship between DevOps and microservices is not significant, which differs from the relationship described in the previous studies. Four lessons for practitioners and four implications for researchers were discussed based on our findings. Conclusion: Our findings contribute to the understanding of the organization, benefits and issues of adopting DevOps and microservices from an immersive perspective of software teams.
Submitted 3 May, 2022;
originally announced May 2022.
-
3D Intracranial Aneurysm Classification and Segmentation via Unsupervised Dual-branch Learning
Authors:
Di Shao,
Xuequan Lu,
Xiao Liu
Abstract:
Intracranial aneurysms are common nowadays and how to detect them intelligently is of great significance in digital health. While most existing deep learning research focused on medical images in a supervised way, we introduce an unsupervised method for the detection of intracranial aneurysms based on 3D point cloud data. In particular, our method consists of two stages: unsupervised pre-training and downstream tasks. As for the former, the main idea is to pair each point cloud with its jittered counterpart and maximise their correspondence. Then we design a dual-branch contrastive network with an encoder for each branch and a subsequent common projection head. As for the latter, we design simple networks for supervised classification and segmentation training. Experiments on the public dataset (IntrA) show that our unsupervised method achieves comparable or even better performance than some state-of-the-art supervised techniques, and it is most prominent in the detection of aneurysmal vessels. Experiments on ModelNet40 also show that our method achieves an accuracy of 90.79%, which outperforms existing state-of-the-art unsupervised models.
Submitted 16 January, 2022; v1 submitted 5 January, 2022;
originally announced January 2022.
-
Exploiting the Unique Expression for Improved Sentiment Analysis in Software Engineering Text
Authors:
Kexin Sun,
Hui Gao,
Hongyu Kuang,
Xiaoxing Ma,
Guoping Rong,
Dong Shao,
He Zhang
Abstract:
Sentiment analysis on software engineering (SE) texts has been widely used in SE research, such as evaluating app reviews or analyzing developers' sentiments in commit messages. To better support the use of automated sentiment analysis for SE tasks, researchers built an SE-domain-specified sentiment dictionary to further improve the accuracy of the results. Unfortunately, recent work reported that current mainstream tools for sentiment analysis still cannot provide reliable results when analyzing the sentiments in SE texts. We suggest that the reason for this situation is that the way of expressing sentiments in SE texts is largely different from the way in social network or movie comments. In this paper, we propose to improve sentiment analysis in SE texts by using sentence structures, a different perspective from building a domain dictionary. Specifically, we use sentence structures to first identify whether the author is expressing her sentiment in a given clause of an SE text, and to further adjust the calculation of sentiments which are confirmed in the clause. An empirical evaluation based on four different datasets shows that our approach can outperform two dictionary-based baseline approaches, and is more generalizable compared to a learning-based baseline approach.
Submitted 24 March, 2021;
originally announced March 2021.
-
DecAug: Augmenting HOI Detection via Decomposition
Authors:
Yichen Xie,
Hao-Shu Fang,
Dian Shao,
Yong-Lu Li,
Cewu Lu
Abstract:
Human-object interaction (HOI) detection requires a large amount of annotated data. Current algorithms suffer from insufficient training samples and category imbalance within datasets. To increase data efficiency, in this paper, we propose an efficient and effective data augmentation method called DecAug for HOI detection. Based on our proposed object state similarity metric, object patterns across different HOIs are shared to augment local object appearance features without changing their state. Further, we shift spatial correlation between humans and objects to other feasible configurations with the aid of a pose-guided Gaussian Mixture Model while preserving their interactions. Experiments show that our method brings up to 3.3 mAP and 1.6 mAP improvements on the V-COCO and HICO-DET datasets for two advanced models. Specifically, interactions with fewer samples enjoy more notable improvement. Our method can be easily integrated into various HOI detection models with negligible extra computational consumption. Our code will be made publicly available.
Submitted 2 October, 2020;
originally announced October 2020.
-
DIRV: Dense Interaction Region Voting for End-to-End Human-Object Interaction Detection
Authors:
Hao-Shu Fang,
Yichen Xie,
Dian Shao,
Cewu Lu
Abstract:
In recent years, human-object interaction (HOI) detection has achieved impressive advances. However, conventional two-stage methods are usually slow in inference. On the other hand, existing one-stage methods mainly focus on the union regions of interactions, which introduce unnecessary visual information as disturbances to HOI detection. To tackle the problems above, we propose a novel one-stage HOI detection approach DIRV in this paper, based on a new concept called interaction region for the HOI problem. Unlike previous methods, our approach concentrates on the densely sampled interaction regions across different scales for each human-object pair, so as to capture the subtle visual features that are most essential to the interaction. Moreover, in order to compensate for the detection flaws of a single interaction region, we introduce a novel voting strategy that makes full use of those overlapped interaction regions in place of conventional Non-Maximal Suppression (NMS). Extensive experiments on two popular benchmarks, V-COCO and HICO-DET, show that our approach outperforms existing state-of-the-art methods by a large margin with the highest inference speed and lightest network architecture. We achieved 56.1 mAP on V-COCO without additional input. Our code is publicly available at: https://github.com/MVIG-SJTU/DIRV
Submitted 19 January, 2021; v1 submitted 2 October, 2020;
originally announced October 2020.
-
A validated multi-agent simulation test bed to evaluate congestion pricing policies on population segments by time-of-day in New York City
Authors:
Brian Yueshuai He,
Jinkai Zhou,
Ziyi Ma,
Ding Wang,
Di Sha,
Mina Lee,
Joseph Y. J. Chow,
Kaan Ozbay
Abstract:
Evaluation of the demand for emerging transportation technologies and policies can vary by time of day due to spillbacks on roadways, rescheduling of travelers' activity patterns, and shifting to other modes that affect the level of congestion. These effects are not well captured by static travel demand models. We calibrate and validate the first open-source multi-agent simulation model for New York City, called MATSim-NYC, to support agencies in evaluating policies such as congestion pricing. The simulation-based virtual test bed is loaded with an 8M+ synthetic 2016 population calibrated in a prior study. The road network is calibrated to INRIX speed data and average annual daily traffic for a screenline along the East River crossings, resulting in average speed differences of 7.2% on freeways and 17.1% on arterials, and an average difference of +1.8% from the East River screenline. Validation against transit stations shows an 8% difference from observed counts and a median difference of 29% for select road link counts. The model is used to evaluate a congestion pricing plan proposed by the Regional Plan Association and suggests a much higher (127K) car trip reduction compared to their report (59K). The pricing policy would impact the population segment making trips within Manhattan differently from the population segment of trips outside Manhattan. The multi-agent simulation shows that 37.3% of the Manhattan segment would be negatively impacted by the pricing compared to 39.9% of the non-Manhattan segment, which has implications for redistribution of congestion pricing revenues. The citywide travel consumer surplus decreases when the congestion pricing goes up from $9.18 to $14 both ways even as it increases for the Charging-related population segment. This implies that increasing pricing from $9.18 to $14 benefits Manhattanites at the expense of the rest of the city.
Submitted 21 December, 2020; v1 submitted 31 July, 2020;
originally announced August 2020.
-
Intra- and Inter-Action Understanding via Temporal Action Parsing
Authors:
Dian Shao,
Yue Zhao,
Bo Dai,
Dahua Lin
Abstract:
Current methods for action recognition primarily rely on deep convolutional networks to derive feature embeddings of visual and motion features. While these methods have demonstrated remarkable performance on standard benchmarks, we are still in need of a better understanding of how the videos, in particular their internal structures, relate to high-level semantics, which may lead to benefits in multiple aspects, e.g. interpretable predictions and even new methods that can take recognition performance to the next level. Towards this goal, we construct TAPOS, a new dataset developed on sport videos with manual annotations of sub-actions, and conduct a study of temporal action parsing on top of it. Our study shows that a sport activity usually consists of multiple sub-actions and that awareness of such temporal structures is beneficial to action recognition. We also investigate a number of temporal parsing methods, and thereon devise an improved method that is capable of mining sub-actions from training data without knowing their labels. On the constructed TAPOS, the proposed method is shown to reveal intra-action information, i.e. how action instances are made of sub-actions, and inter-action information, i.e. one specific sub-action may commonly appear in various actions.
Submitted 20 May, 2020;
originally announced May 2020.
-
FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding
Authors:
Dian Shao,
Yue Zhao,
Bo Dai,
Dahua Lin
Abstract:
On public benchmarks, current action recognition techniques have achieved great success. However, when used in real-world applications, e.g. sport analysis, which requires the capability of parsing an activity into phases and differentiating between subtly different actions, their performances remain far from being satisfactory. To take action recognition to a new level, we develop FineGym, a new dataset built on top of gymnastic videos. Compared to existing action recognition datasets, FineGym is distinguished in richness, quality, and diversity. In particular, it provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy. For example, a "balance beam" event will be annotated as a sequence of elementary sub-actions derived from five sets: "leap-jump-hop", "beam-turns", "flight-salto", "flight-handspring", and "dismount", where the sub-action in each set will be further annotated with finely defined class labels. This new level of granularity presents significant challenges for action recognition, e.g. how to parse the temporal structures from a coherent action, and how to distinguish between subtly different action classes. We systematically investigate representative methods on this dataset and obtain a number of interesting findings. We hope this dataset could advance research towards action understanding.
Submitted 14 April, 2020;
originally announced April 2020.
-
The Impact of Countdown Clocks on Subway Ridership in New York City
Authors:
Zhengbo Zou,
Di Sha
Abstract:
Protecting passengers' safety and increasing ridership are two never-ending pursuits of public transit agencies. One of the proposed methods to achieve both goals for subway service is to implement real-time train arrival countdown clocks in subway stations. The Metropolitan Transportation Authority (MTA) of New York City (NYC) chose to install such countdown clocks in its stations starting from 2007 on a selection of subway lines. Due to the recent development of Bluetooth Beacon technology, the MTA can now install countdown clocks and train trackers in a non-intrusive manner and much faster. As a result, the MTA is aiming to install countdown clocks in every subway station on every line. However, despite such an aggressive plan, the impact of countdown clocks on subway ridership has not been fully studied. This paper proposes using Panel Regression methods, specifically the Random Effect (RE) model and the Fixed Effect (FE) model, to quantify the impact of countdown clocks on subway ridership. Machine Learning methods, namely Random Forest (RF) with AdaBoost and Decision Tree (DT) Regression, are also used as alternative data-driven approaches to the FE and RE models. The results show that for the G line service, which runs between Brooklyn and Queens, the introduction of countdown clocks could increase weekly ridership by about 1783 per station. The study also found that the machine learning methods provide better accuracy in predicting ridership than the RE and FE models.
Submitted 26 December, 2018;
originally announced January 2019.
-
Three-dimensional Torques and Power of Horse Forelimb Joints at Trot
Authors:
H. M. Clayton,
D. H. Sha,
D. R. Mullineaux
Abstract:
Reasons for Performing Study: Equine gait analysis has focused on 2D analysis in the sagittal plane, while descriptions of 3D kinetics and ground reaction force could provide more information for equine gait analysis.
Hypothesis or Objectives: The aim of this study was to characterize the 3D torques and powers of the forelimb joints at the trot.
Methods: Eight sound horses were used in the study. Full 3D torques and powers for the elbow, carpus, fetlock, pastern and coffin joints of the right forelimb in horses at the trot were obtained by calculating the inverse kinetics of a simplified link segmental model.
Results: Over two thirds of the energy (70%) generated by all joints came from the stance phase, and most of the energy was generated by the elbow joint in both the stance (77%) and sway (88%) phases. The energy absorbed by all joints during the stance (40%) and sway (60%) phases did not differ greatly. During the stance phase, almost two thirds of the energy absorbed (65%) was absorbed by the fetlock joint, while over two thirds of the energy absorbed (74%) was absorbed by the carpus joint during the sway phase.
Conclusions & Clinical Relevance: This study presents a full 3D kinetic analysis of the relative motion of the humerus, radius, cannon, pastern and coffin segments of the forelimb at the trot. The results could provide a more sensitive measure for kinetic analysis.
Submitted 19 August, 2011;
originally announced August 2011.
-
An optimized recursive learning algorithm for three-layer feedforward neural networks for mimo nonlinear system identifications
Authors:
Daohang Sha,
Vladimir B. Bajic
Abstract:
Back-propagation with the gradient method is the most popular learning algorithm for feed-forward neural networks. However, it is critical to determine a proper fixed learning rate for the algorithm. In this paper, an optimized recursive algorithm for online learning is derived analytically from matrix operations and optimization methods, which avoids the need to select a proper learning rate for the gradient method. A proof of weak convergence of the proposed algorithm is also given. Although this approach is proposed for three-layer feed-forward neural networks, it could be extended to multi-layer feed-forward neural networks. The effectiveness of the proposed algorithm, applied to identifying the behavior of a two-input, two-output non-linear dynamic system, is demonstrated by simulation experiments.
Submitted 12 April, 2010;
originally announced April 2010.