Search | arXiv e-print repository

M-SET: Multi-Drone Swarm Intelligence Experimentation with Collision Avoidance Realism

Authors: Chuhao Qin, Alexander Robins, Callum Lillywhite-Roake, Adam Pearce, Hritik Mehta, Scott James, Tsz Ho Wong, Evangelos Pournaras

Abstract: Distributed sensing by cooperative drone swarms is crucial for several Smart City applications, such as traffic monitoring and disaster response. Using an indoor lab with inexpensive drones, a testbed supports complex and ambitious studies on these systems while maintaining low cost, rigor, and external validity. This paper introduces the Multi-drone Sensing Experimentation Testbed (M-SET), a nove… ▽ More Distributed sensing by cooperative drone swarms is crucial for several Smart City applications, such as traffic monitoring and disaster response. Using an indoor lab with inexpensive drones, a testbed supports complex and ambitious studies on these systems while maintaining low cost, rigor, and external validity. This paper introduces the Multi-drone Sensing Experimentation Testbed (M-SET), a novel platform designed to prototype, develop, test, and evaluate distributed sensing with swarm intelligence. M-SET addresses the limitations of existing testbeds that fail to emulate collisions, thus lacking realism in outdoor environments. By integrating a collision avoidance method based on a potential field algorithm, M-SET ensures collision-free navigation and sensing, further optimized via a multi-agent collective learning algorithm. Extensive evaluation demonstrates accurate energy consumption estimation and a low risk of collisions, providing a robust proof-of-concept. New insights show that M-SET has significant potential to support ambitious research with minimal cost, simplicity, and high sensing quality. △ Less

Submitted 16 June, 2024; originally announced June 2024.

Comments: 7 pages, 7 figures. This work has been submitted to the IEEE conferenece

arXiv:2405.17933 [pdf, other]

ToonCrafter: Generative Cartoon Interpolation

Authors: **bo Xing, Hanyuan Liu, Menghan Xia, Yong Zhang, Xintao Wang, Ying Shan, Tien-Tsin Wong

Abstract: We introduce ToonCrafter, a novel approach that transcends traditional correspondence-based cartoon video interpolation, paving the way for generative interpolation. Traditional methods, that implicitly assume linear motion and the absence of complicated phenomena like dis-occlusion, often struggle with the exaggerated non-linear and large motions with occlusion commonly found in cartoons, resulti… ▽ More We introduce ToonCrafter, a novel approach that transcends traditional correspondence-based cartoon video interpolation, paving the way for generative interpolation. Traditional methods, that implicitly assume linear motion and the absence of complicated phenomena like dis-occlusion, often struggle with the exaggerated non-linear and large motions with occlusion commonly found in cartoons, resulting in implausible or even failed interpolation results. To overcome these limitations, we explore the potential of adapting live-action video priors to better suit cartoon interpolation within a generative framework. ToonCrafter effectively addresses the challenges faced when applying live-action video motion priors to generative cartoon interpolation. First, we design a toon rectification learning strategy that seamlessly adapts live-action video priors to the cartoon domain, resolving the domain gap and content leakage issues. Next, we introduce a dual-reference-based 3D decoder to compensate for lost details due to the highly compressed latent prior spaces, ensuring the preservation of fine details in interpolation results. Finally, we design a flexible sketch encoder that empowers users with interactive control over the interpolation results. Experimental results demonstrate that our proposed method not only produces visually convincing and more natural dynamics, but also effectively handles dis-occlusion. The comparative evaluation demonstrates the notable superiority of our approach over existing competitors. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Comments: Project page: https://doubiiu.github.io/projects/ToonCrafter/

arXiv:2405.12460 [pdf, other]

doi 10.1145/3641519.3657517

Physics-based Scene Layout Generation from Human Motion

Authors: Jianan Li, Tao Huang, Qingxu Zhu, Tien-Tsin Wong

Abstract: Creating scenes for captured motions that achieve realistic human-scene interaction is crucial for 3D animation in movies or video games. As character motion is often captured in a blue-screened studio without real furniture or objects in place, there may be a discrepancy between the planned motion and the captured one. This gives rise to the need for automatic scene layout generation to relieve t… ▽ More Creating scenes for captured motions that achieve realistic human-scene interaction is crucial for 3D animation in movies or video games. As character motion is often captured in a blue-screened studio without real furniture or objects in place, there may be a discrepancy between the planned motion and the captured one. This gives rise to the need for automatic scene layout generation to relieve the burdens of selecting and positioning furniture and objects. Previous approaches cannot avoid artifacts like penetration and floating due to the lack of physical constraints. Furthermore, some heavily rely on specific data to learn the contact affordances, restricting the generalization ability to different motions. In this work, we present a physics-based approach that simultaneously optimizes a scene layout generator and simulates a moving human in a physics simulator. To attain plausible and realistic interaction motions, our method explicitly introduces physical constraints. To automatically recover and generate the scene layout, we minimize the motion tracking errors to identify the objects that can afford interaction. We use reinforcement learning to perform a dual-optimization of both the character motion imitation controller and the scene layout generator. To facilitate the optimization, we reshape the tracking rewards and devise pose prior guidance obtained from our estimated pseudo-contact labels. We evaluate our method using motions from SAMP and PROX, and demonstrate physically plausible scene layout reconstruction compared with the previous kinematics-based method. △ Less

Submitted 20 May, 2024; originally announced May 2024.

Comments: SIGGRAPH conference

arXiv:2403.08266 [pdf, other]

Sketch2Manga: Shaded Manga Screening from Sketch with Diffusion Models

Authors: Jian Lin, Xueting Liu, Chengze Li, Minshan Xie, Tien-Tsin Wong

Abstract: While manga is a popular entertainment form, creating manga is tedious, especially adding screentones to the created sketch, namely manga screening. Unfortunately, there is no existing method that tailors for automatic manga screening, probably due to the difficulty of generating high-quality shaded high-frequency screentones. The classic manga screening approaches generally require user input to… ▽ More While manga is a popular entertainment form, creating manga is tedious, especially adding screentones to the created sketch, namely manga screening. Unfortunately, there is no existing method that tailors for automatic manga screening, probably due to the difficulty of generating high-quality shaded high-frequency screentones. The classic manga screening approaches generally require user input to provide screentone exemplars or a reference manga image. The recent deep learning models enables the automatic generation by learning from a large-scale dataset. However, the state-of-the-art models still fail to generate high-quality shaded screentones due to the lack of a tailored model and high-quality manga training data. In this paper, we propose a novel sketch-to-manga framework that first generates a color illustration from the sketch and then generates a screentoned manga based on the intensity guidance. Our method significantly outperforms existing methods in generating high-quality manga with shaded high-frequency screentones. △ Less

Submitted 13 March, 2024; originally announced March 2024.

Comments: 7 pages, 6 figures

ACM Class: I.4.6; I.3.3; I.3.8

arXiv:2402.15903 [pdf, other]

ESFL: Efficient Split Federated Learning over Resource-Constrained Heterogeneous Wireless Devices

Authors: Guangyu Zhu, Yiqin Deng, Xianhao Chen, Haixia Zhang, Yuguang Fang, Tan F. Wong

Abstract: Federated learning (FL) allows multiple parties (distributed devices) to train a machine learning model without sharing raw data. How to effectively and efficiently utilize the resources on devices and the central server is a highly interesting yet challenging problem. In this paper, we propose an efficient split federated learning algorithm (ESFL) to take full advantage of the powerful computing… ▽ More Federated learning (FL) allows multiple parties (distributed devices) to train a machine learning model without sharing raw data. How to effectively and efficiently utilize the resources on devices and the central server is a highly interesting yet challenging problem. In this paper, we propose an efficient split federated learning algorithm (ESFL) to take full advantage of the powerful computing capabilities at a central server under a split federated learning framework with heterogeneous end devices (EDs). By splitting the model into different submodels between the server and EDs, our approach jointly optimizes user-side workload and server-side computing resource allocation by considering users' heterogeneity. We formulate the whole optimization problem as a mixed-integer non-linear program, which is an NP-hard problem, and develop an iterative approach to obtain an approximate solution efficiently. Extensive simulations have been conducted to validate the significantly increased efficiency of our ESFL approach compared with standard federated learning, split learning, and splitfed learning. △ Less

Submitted 16 April, 2024; v1 submitted 24 February, 2024; originally announced February 2024.

arXiv:2402.08788 [pdf]

Syllable based DNN-HMM Cantonese Speech to Text System

Authors: Timothy Wong, Claire Li, Sam Lam, Billy Chiu, Qin Lu, Minglei Li, Dan Xiong, Roy Shing Yu, Vincent T. Y. Ng

Abstract: This paper reports our work on building up a Cantonese Speech-to-Text (STT) system with a syllable based acoustic model. This is a part of an effort in building a STT system to aid dyslexic students who have cognitive deficiency in writing skills but have no problem expressing their ideas through speech. For Cantonese speech recognition, the basic unit of acoustic models can either be the conventi… ▽ More This paper reports our work on building up a Cantonese Speech-to-Text (STT) system with a syllable based acoustic model. This is a part of an effort in building a STT system to aid dyslexic students who have cognitive deficiency in writing skills but have no problem expressing their ideas through speech. For Cantonese speech recognition, the basic unit of acoustic models can either be the conventional Initial-Final (IF) syllables, or the Onset-Nucleus-Coda (ONC) syllables where finals are further split into nucleus and coda to reflect the intra-syllable variations in Cantonese. By using the Kaldi toolkit, our system is trained using the stochastic gradient descent optimization model with the aid of GPUs for the hybrid Deep Neural Network and Hidden Markov Model (DNN-HMM) with and without I-vector based speaker adaptive training technique. The input features of the same Gaussian Mixture Model with speaker adaptive training (GMM-SAT) to DNN are used in all cases. Experiments show that the ONC-based syllable acoustic modeling with I-vector based DNN-HMM achieves the best performance with the word error rate (WER) of 9.66% and the real time factor (RTF) of 1.38812. △ Less

Submitted 13 February, 2024; originally announced February 2024.

Comments: 7 pages, 3 figures, LREC 2016

MSC Class: 94-06 ACM Class: I.2.7

arXiv:2402.07916 [pdf, other]

Perceptual Thresholds for Radial Optic Flow Distortion in Near-Eye Stereoscopic Displays

Authors: Mohammad R. Saeedpour-Parizi, Niall L. Williams, Tim Wong, Phillip Guan, Dinesh Manocha, Ian M. Erkelens

Abstract: We provide the first perceptual quantification of user's sensitivity to radial optic flow artifacts and demonstrate a promising approach for masking this optic flow artifact via blink suppression. Near-eye HMDs allow users to feel immersed in virtual environments by providing visual cues, like motion parallax and stereoscopy, that mimic how we view the physical world. However, these systems exhibi… ▽ More We provide the first perceptual quantification of user's sensitivity to radial optic flow artifacts and demonstrate a promising approach for masking this optic flow artifact via blink suppression. Near-eye HMDs allow users to feel immersed in virtual environments by providing visual cues, like motion parallax and stereoscopy, that mimic how we view the physical world. However, these systems exhibit a variety of perceptual artifacts that can limit their usability and the user's sense of presence in VR. One well-known artifact is the vergence-accommodation conflict (VAC). Varifocal displays can mitigate VAC, but bring with them other artifacts such as a change in virtual image size (radial optic flow) when the focal plane changes. We conducted a set of psychophysical studies to measure users' ability to perceive this radial flow artifact before, during, and after self-initiated blinks. Our results showed that visual sensitivity was reduced by a factor of 10 at the start and for ~70 ms after a blink was detected. Pre- and post-blink sensitivity was, on average, ~0.15% image size change during normal viewing and increased to ~1.5-2.0% during blinks. Our results imply that a rapid (under 70 ms) radial optic flow distortion can go unnoticed during a blink. Furthermore, our results provide empirical data that can be used to inform engineering requirements for both hardware design and software-based graphical correction algorithms for future varifocal near-eye displays. Our project website is available at https://gamma.umd.edu/RoF/. △ Less

Submitted 1 February, 2024; originally announced February 2024.

arXiv:2402.02463 [pdf, other]

A Fast Method for Lasso and Logistic Lasso

Authors: Siu-Wing Cheng, Man Ting Wong

Abstract: We propose a fast method for solving compressed sensing, Lasso regression, and Logistic Lasso regression problems that iteratively runs an appropriate solver using an active set approach. We design a strategy to update the active set that achieves a large speedup over a single call of several solvers, including gradient projection for sparse reconstruction (GPSR), lassoglm of Matlab, and glmnet. F… ▽ More We propose a fast method for solving compressed sensing, Lasso regression, and Logistic Lasso regression problems that iteratively runs an appropriate solver using an active set approach. We design a strategy to update the active set that achieves a large speedup over a single call of several solvers, including gradient projection for sparse reconstruction (GPSR), lassoglm of Matlab, and glmnet. For compressed sensing, the hybrid of our method and GPSR is 31.41 times faster than GPSR on average for Gaussian ensembles and 25.64 faster on average for binary ensembles. For Lasso regression, the hybrid of our method and GPSR achieves a 30.67-fold average speedup in our experiments. In our experiments on Logistic Lasso regression, the hybrid of our method and lassoglm gives an 11.95-fold average speedup, and the hybrid of our method and glmnet gives a 1.40-fold average speedup. △ Less

Submitted 4 February, 2024; originally announced February 2024.

arXiv:2312.00933 [pdf, other]

Privacy Preserving Event Detection

Authors: Xiaoshan Wang, Tan F. Wong

Abstract: This paper presents a privacy-preserving event detection scheme based on measurements made by a network of sensors. A diameter-like decision statistic made up of the marginal types of the measurements observed by the sensors is employed. The proposed detection scheme can achieve the best type-I error exponent as the type-II error rate is required to be negligible. Detection performance with finite… ▽ More This paper presents a privacy-preserving event detection scheme based on measurements made by a network of sensors. A diameter-like decision statistic made up of the marginal types of the measurements observed by the sensors is employed. The proposed detection scheme can achieve the best type-I error exponent as the type-II error rate is required to be negligible. Detection performance with finite-length observations is also demonstrated through a simulation example of spectrum sensing. Privacy protection is achieved by obfuscating the type data with random zero-modulo-sum numbers that are generated and distributed via the exchange of encrypted messages among the sensors. The privacy-preserving performance against ``honest but curious'' adversaries, including colluding sensors, the fusion center, and external eavesdroppers, is analyzed through a series of cryptographic games. It is shown that the probability that any probabilistic polynomial time adversary successfully estimates the sensors' measured types can not be much better than independent guessing, when there are at least two non-colluding sensors. △ Less

Submitted 1 December, 2023; originally announced December 2023.

Comments: 26 pages, 9 figures, submitted to IEEE Transactions on Information Theory

arXiv:2311.14343 [pdf, other]

Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion

Authors: Minshan Xie, Hanyuan Liu, Chengze Li, Tien-Tsin Wong

Abstract: Text-guided video-to-video stylization transforms the visual appearance of a source video to a different appearance guided on textual prompts. Existing text-guided image diffusion models can be extended for stylized video synthesis. However, they struggle to generate videos with both highly detailed appearance and temporal consistency. In this paper, we propose a synchronized multi-frame diffusion… ▽ More Text-guided video-to-video stylization transforms the visual appearance of a source video to a different appearance guided on textual prompts. Existing text-guided image diffusion models can be extended for stylized video synthesis. However, they struggle to generate videos with both highly detailed appearance and temporal consistency. In this paper, we propose a synchronized multi-frame diffusion framework to maintain both the visual details and the temporal consistency. Frames are denoised in a synchronous fashion, and more importantly, information of different frames is shared since the beginning of the denoising process. Such information sharing ensures that a consensus, in terms of the overall structure and color distribution, among frames can be reached in the early stage of the denoising process before it is too late. The optical flow from the original video serves as the connection, and hence the venue for information sharing, among frames. We demonstrate the effectiveness of our method in generating high-quality and diverse results in extensive experiments. Our method shows superior qualitative and quantitative results compared to state-of-the-art video editing methods. △ Less

Submitted 24 November, 2023; originally announced November 2023.

Comments: 11 pages, 11 figures

arXiv:2311.12891 [pdf, other]

Text-Guided Texturing by Synchronized Multi-View Diffusion

Authors: Yuxin Liu, Minshan Xie, Hanyuan Liu, Tien-Tsin Wong

Abstract: This paper introduces a novel approach to synthesize texture to dress up a given 3D object, given a text prompt. Based on the pretrained text-to-image (T2I) diffusion model, existing methods usually employ a project-and-inpaint approach, in which a view of the given object is first generated and warped to another view for inpainting. But it tends to generate inconsistent texture due to the asynchr… ▽ More This paper introduces a novel approach to synthesize texture to dress up a given 3D object, given a text prompt. Based on the pretrained text-to-image (T2I) diffusion model, existing methods usually employ a project-and-inpaint approach, in which a view of the given object is first generated and warped to another view for inpainting. But it tends to generate inconsistent texture due to the asynchronous diffusion of multiple views. We believe such asynchronous diffusion and insufficient information sharing among views are the root causes of the inconsistent artifact. In this paper, we propose a synchronized multi-view diffusion approach that allows the diffusion processes from different views to reach a consensus of the generated content early in the process, and hence ensures the texture consistency. To synchronize the diffusion, we share the denoised content among different views in each denoising step, specifically blending the latent content in the texture domain from views with overlap. Our method demonstrates superior performance in generating consistent, seamless, highly detailed textures, comparing to state-of-the-art methods. △ Less

Submitted 21 November, 2023; originally announced November 2023.

arXiv:2310.12190 [pdf, other]

DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors

Authors: **bo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Xintao Wang, Tien-Tsin Wong, Ying Shan

Abstract: Animating a still image offers an engaging visual experience. Traditional image animation techniques mainly focus on animating natural scenes with stochastic dynamics (e.g. clouds and fluid) or domain-specific motions (e.g. human hair or body motions), and thus limits their applicability to more general visual content. To overcome this limitation, we explore the synthesis of dynamic content for op… ▽ More Animating a still image offers an engaging visual experience. Traditional image animation techniques mainly focus on animating natural scenes with stochastic dynamics (e.g. clouds and fluid) or domain-specific motions (e.g. human hair or body motions), and thus limits their applicability to more general visual content. To overcome this limitation, we explore the synthesis of dynamic content for open-domain images, converting them into animated videos. The key idea is to utilize the motion prior of text-to-video diffusion models by incorporating the image into the generative process as guidance. Given an image, we first project it into a text-aligned rich context representation space using a query transformer, which facilitates the video model to digest the image content in a compatible fashion. However, some visual details still struggle to be preserved in the resultant videos. To supplement with more precise image information, we further feed the full image to the diffusion model by concatenating it with the initial noises. Experimental results show that our proposed method can produce visually convincing and more logical & natural motions, as well as higher conformity to the input image. Comparative evaluation demonstrates the notable superiority of our approach over existing competitors. △ Less

Submitted 27 November, 2023; v1 submitted 18 October, 2023; originally announced October 2023.

Comments: Project page: https://doubiiu.github.io/projects/DynamiCrafter

arXiv:2310.04992 [pdf, other]

VisionFM: a Multi-Modal Multi-Task Vision Foundation Model for Generalist Ophthalmic Artificial Intelligence

Authors: Jianing Qiu, Jian Wu, Hao Wei, Peilun Shi, Minqing Zhang, Yunyun Sun, Lin Li, Hanruo Liu, Hongyi Liu, Simeng Hou, Yuyang Zhao, Xuehui Shi, Junfang Xian, Xiaoxia Qu, Sirui Zhu, Lijie Pan, Xiaoniao Chen, Xiaojia Zhang, Shuai Jiang, Kebing Wang, Chenlong Yang, Mingqiang Chen, Sujie Fan, Jianhua Hu, Aiguo Lv , et al. (17 additional authors not shown)

Abstract: We present VisionFM, a foundation model pre-trained with 3.4 million ophthalmic images from 560,457 individuals, covering a broad range of ophthalmic diseases, modalities, imaging devices, and demography. After pre-training, VisionFM provides a foundation to foster multiple ophthalmic artificial intelligence (AI) applications, such as disease screening and diagnosis, disease prognosis, subclassifi… ▽ More We present VisionFM, a foundation model pre-trained with 3.4 million ophthalmic images from 560,457 individuals, covering a broad range of ophthalmic diseases, modalities, imaging devices, and demography. After pre-training, VisionFM provides a foundation to foster multiple ophthalmic artificial intelligence (AI) applications, such as disease screening and diagnosis, disease prognosis, subclassification of disease phenotype, and systemic biomarker and disease prediction, with each application enhanced with expert-level intelligence and accuracy. The generalist intelligence of VisionFM outperformed ophthalmologists with basic and intermediate levels in jointly diagnosing 12 common ophthalmic diseases. Evaluated on a new large-scale ophthalmic disease diagnosis benchmark database, as well as a new large-scale segmentation and detection benchmark database, VisionFM outperformed strong baseline deep neural networks. The ophthalmic image representations learned by VisionFM exhibited noteworthy explainability, and demonstrated strong generalizability to new ophthalmic modalities, disease spectrum, and imaging devices. As a foundation model, VisionFM has a large capacity to learn from diverse ophthalmic imaging data and disparate datasets. To be commensurate with this capacity, in addition to the real data used for pre-training, we also generated and leveraged synthetic ophthalmic imaging data. Experimental results revealed that synthetic data that passed visual Turing tests, can also enhance the representation learning capability of VisionFM, leading to substantial performance gains on downstream ophthalmic AI tasks. Beyond the ophthalmic AI applications developed, validated, and demonstrated in this work, substantial further applications can be achieved in an efficient and cost-effective manner using VisionFM as the foundation. △ Less

Submitted 7 October, 2023; originally announced October 2023.

arXiv:2310.03884 [pdf, other]

Information Geometry for the Working Information Theorist

Authors: Kumar Vijay Mishra, M. Ashok Kumar, Ting-Kam Leonard Wong

Abstract: Information geometry is a study of statistical manifolds, that is, spaces of probability distributions from a geometric perspective. Its classical information-theoretic applications relate to statistical concepts such as Fisher information, sufficient statistics, and efficient estimators. Today, information geometry has emerged as an interdisciplinary field that finds applications in diverse areas… ▽ More Information geometry is a study of statistical manifolds, that is, spaces of probability distributions from a geometric perspective. Its classical information-theoretic applications relate to statistical concepts such as Fisher information, sufficient statistics, and efficient estimators. Today, information geometry has emerged as an interdisciplinary field that finds applications in diverse areas such as radar sensing, array signal processing, quantum physics, deep learning, and optimal transport. This article presents an overview of essential information geometry to initiate an information theorist, who may be unfamiliar with this exciting area of research. We explain the concepts of divergences on statistical manifolds, generalized notions of distances, orthogonality, and geodesics, thereby paving the way for concrete applications and novel theoretical investigations. We also highlight some recent information-geometric developments, which are of interest to the broader information theory community. △ Less

Submitted 5 October, 2023; originally announced October 2023.

Comments: 12 pages, 3 figures, 1 table

arXiv:2310.01081 [pdf, other]

Unmasking Role-Play Attack Strategies in Exploiting Decentralized Finance (DeFi) Systems

Authors: Weilin Li, Zhun Wang, Chenyu Li, Heying Chen, Taiyu Wong, Pengyu Sun, Yufei Yu, Chao Zhang

Abstract: The rapid growth and adoption of decentralized finance (DeFi) systems have been accompanied by various threats, notably those emerging from vulnerabilities in their intricate design. In our work, we introduce and define an attack strategy termed as Role-Play Attack, in which the attacker acts as multiple roles concurrently to exploit the DeFi system and cause substantial financial losses. We provi… ▽ More The rapid growth and adoption of decentralized finance (DeFi) systems have been accompanied by various threats, notably those emerging from vulnerabilities in their intricate design. In our work, we introduce and define an attack strategy termed as Role-Play Attack, in which the attacker acts as multiple roles concurrently to exploit the DeFi system and cause substantial financial losses. We provide a formal definition of this strategy and demonstrate its potential impacts by revealing the total loss of \$435.1M caused by 14 historical attacks with applying this pattern. Besides, we mathematically analyzed the attacks with top 2 losses and retrofitted the corresponding attack pattern by concrete execution, indicating that this strategy could increase the potential profit for original attacks by \$3.34M (51.4%) and \$3.76M (12.0%), respectively. △ Less

Submitted 2 October, 2023; originally announced October 2023.

arXiv:2309.03509 [pdf, other]

BroadCAM: Outcome-agnostic Class Activation Map** for Small-scale Weakly Supervised Applications

Authors: Jiatai Lin, Guoqiang Han, Xuemiao Xu, Changhong Liang, Tien-Tsin Wong, C. L. Philip Chen, Zaiyi Liu, Chu Han

Abstract: Class activation map**~(CAM), a visualization technique for interpreting deep learning models, is now commonly used for weakly supervised semantic segmentation~(WSSS) and object localization~(WSOL). It is the weighted aggregation of the feature maps by activating the high class-relevance ones. Current CAM methods achieve it relying on the training outcomes, such as predicted scores~(forward info… ▽ More Class activation map**~(CAM), a visualization technique for interpreting deep learning models, is now commonly used for weakly supervised semantic segmentation~(WSSS) and object localization~(WSOL). It is the weighted aggregation of the feature maps by activating the high class-relevance ones. Current CAM methods achieve it relying on the training outcomes, such as predicted scores~(forward information), gradients~(backward information), etc. However, when with small-scale data, unstable training may lead to less effective model outcomes and generate unreliable weights, finally resulting in incorrect activation and noisy CAM seeds. In this paper, we propose an outcome-agnostic CAM approach, called BroadCAM, for small-scale weakly supervised applications. Since broad learning system (BLS) is independent to the model learning, BroadCAM can avoid the weights being affected by the unreliable model outcomes when with small-scale data. By evaluating BroadCAM on VOC2012 (natural images) and BCSS-WSSS (medical images) for WSSS and OpenImages30k for WSOL, BroadCAM demonstrates superior performance than existing CAM methods with small-scale data (less than 5\%) in different CNN architectures. It also achieves SOTA performance with large-scale training data. Extensive qualitative comparisons are conducted to demonstrate how BroadCAM activates the high class-relevance feature maps and generates reliable CAMs when with small-scale training data. △ Less

Submitted 7 September, 2023; originally announced September 2023.

arXiv:2308.12642 [pdf, other]

Tag-Based Annotation for Avatar Face Creation

Authors: An Ngo, Daniel Phelps, Derrick Lai, Thanyared Wong, Lucas Mathias, Anish Shivamurthy, Mustafa Ajmal, Minghao Liu, James Davis

Abstract: Currently, digital avatars can be created manually using human images as reference. Systems such as Bitmoji are excellent producers of detailed avatar designs, with hundreds of choices for customization. A supervised learning model could be trained to generate avatars automatically, but the hundreds of possible options create difficulty in securing non-noisy data to train a model. As a solution, w… ▽ More Currently, digital avatars can be created manually using human images as reference. Systems such as Bitmoji are excellent producers of detailed avatar designs, with hundreds of choices for customization. A supervised learning model could be trained to generate avatars automatically, but the hundreds of possible options create difficulty in securing non-noisy data to train a model. As a solution, we train a model to produce avatars from human images using tag-based annotations. This method provides better annotator agreement, leading to less noisy data and higher quality model predictions. Our contribution is an application of tag-based annotation to train a model for avatar face creation. We design tags for 3 different facial facial features offered by Bitmoji, and train a model using tag-based annotation to predict the nose. △ Less

Submitted 24 August, 2023; originally announced August 2023.

Comments: 9 pages, 5 figures, 18 tables

arXiv:2308.07767 [pdf, other]

Preliminary investigation of the short-term in situ performance of an automatic masker selection system

Authors: Bhan Lam, Zhen-Ting Ong, Kenneth Ooi, Wen-Hui Ong, Trevor Wong, Karn N. Watcharasupat, Woon-Seng Gan

Abstract: Soundscape augmentation or "masking" introduces wanted sounds into the acoustic environment to improve acoustic comfort. Usually, the masker selection and playback strategies are either arbitrary or based on simple rules (e.g. -3 dBA), which may lead to sub-optimal increment or even reduction in acoustic comfort for dynamic acoustic environments. To reduce ambiguity in the selection of maskers, an… ▽ More Soundscape augmentation or "masking" introduces wanted sounds into the acoustic environment to improve acoustic comfort. Usually, the masker selection and playback strategies are either arbitrary or based on simple rules (e.g. -3 dBA), which may lead to sub-optimal increment or even reduction in acoustic comfort for dynamic acoustic environments. To reduce ambiguity in the selection of maskers, an automatic masker selection system (AMSS) was recently developed. The AMSS uses a deep-learning model trained on a large-scale dataset of subjective responses to maximize the derived ISO pleasantness (ISO 12913-2). Hence, this study investigates the short-term in situ performance of the AMSS implemented in a gazebo in an urban park. Firstly, the predicted ISO pleasantness from the AMSS is evaluated in comparison to the in situ subjective evaluation scores. Secondly, the effect of various masker selection schemes on the perceived affective quality and appropriateness would be evaluated. In total, each participant evaluated 6 conditions: (1) ambient environment with no maskers; (2) AMSS; (3) bird and (4) water masker from prior art; (5) random selection from same pool of maskers used to train the AMSS; and (6) selection of best-performing maskers based on the analysis of the dataset used to train the AMSS. △ Less

Submitted 15 August, 2023; originally announced August 2023.

Comments: paper submitted to the 52nd International Congress and Exposition on Noise Control Engineering held in Chiba, Greater Tokyo, Japan, on 20-23 August 2023 (Inter-Noise 2023)

ACM Class: J.2; J.4

arXiv:2306.08309 [pdf, other]

doi 10.1109/TVCG.2023.3278691

Taming Reversible Halftoning via Predictive Luminance

Authors: Cheuk-Kit Lau, Menghan Xia, Tien-Tsin Wong

Abstract: Traditional halftoning usually drops colors when dithering images with binary dots, which makes it difficult to recover the original color information. We proposed a novel halftoning technique that converts a color image into a binary halftone with full restorability to its original version. Our novel base halftoning technique consists of two convolutional neural networks (CNNs) to produce the rev… ▽ More Traditional halftoning usually drops colors when dithering images with binary dots, which makes it difficult to recover the original color information. We proposed a novel halftoning technique that converts a color image into a binary halftone with full restorability to its original version. Our novel base halftoning technique consists of two convolutional neural networks (CNNs) to produce the reversible halftone patterns, and a noise incentive block (NIB) to mitigate the flatness degradation issue of CNNs. Furthermore, to tackle the conflicts between the blue-noise quality and restoration accuracy in our novel base method, we proposed a predictor-embedded approach to offload predictable information from the network, which in our case is the luminance information resembling from the halftone pattern. Such an approach allows the network to gain more flexibility to produce halftones with better blue-noise quality without compromising the restoration quality. Detailed studies on the multiple-stage training method and loss weightings have been conducted. We have compared our predictor-embedded method and our novel method regarding spectrum analysis on halftone, halftone accuracy, restoration accuracy, and the data embedding studies. Our entropy evaluation evidences our halftone contains less encoding information than our novel base method. The experiments show our predictor-embedded method gains more flexibility to improve the blue-noise quality of halftones and maintains a comparable restoration quality with a higher tolerance for disturbances. △ Less

Submitted 7 February, 2024; v1 submitted 14 June, 2023; originally announced June 2023.

Comments: published in IEEE Transactions on Visualization and Computer Graphics

arXiv:2306.04114 [pdf, other]

Manga Rescreening with Interpretable Screentone Representation

Authors: Minshan Xie, Chengze Li, Tien-Tsin Wong

Abstract: The process of adapting or repurposing manga pages is a time-consuming task that requires manga artists to manually work on every single screentone region and apply new patterns to create novel screentones across multiple panels. To address this issue, we propose an automatic manga rescreening pipeline that aims to minimize the human effort involved in manga adaptation. Our pipeline automatically… ▽ More The process of adapting or repurposing manga pages is a time-consuming task that requires manga artists to manually work on every single screentone region and apply new patterns to create novel screentones across multiple panels. To address this issue, we propose an automatic manga rescreening pipeline that aims to minimize the human effort involved in manga adaptation. Our pipeline automatically recognizes screentone regions and generates novel screentones with newly specified characteristics (e.g., intensity or type). Existing manga generation methods have limitations in understanding and synthesizing complex tone- or intensity-varying regions. To overcome these limitations, we propose a novel interpretable representation of screentones that disentangles their intensity and type features, enabling better recognition and synthesis of screentones. This interpretable screentone representation reduces ambiguity in recognizing intensity-varying regions and provides fine-grained controls during screentone synthesis by decoupling and anchoring the type or the intensity feature. Our proposed method is demonstrated to be effective and convenient through various experiments, showcasing the superiority of the newly proposed pipeline with the interpretable screentone representations. △ Less

Submitted 6 June, 2023; originally announced June 2023.

Comments: 10 pages, 11 figures

arXiv:2306.01732 [pdf, other]

Video Colorization with Pre-trained Text-to-Image Diffusion Models

Authors: Hanyuan Liu, Minshan Xie, **bo Xing, Chengze Li, Tien-Tsin Wong

Abstract: Video colorization is a challenging task that involves inferring plausible and temporally consistent colors for grayscale frames. In this paper, we present ColorDiffuser, an adaptation of a pre-trained text-to-image latent diffusion model for video colorization. With the proposed adapter-based approach, we repropose the pre-trained text-to-image model to accept input grayscale video frames, with t… ▽ More Video colorization is a challenging task that involves inferring plausible and temporally consistent colors for grayscale frames. In this paper, we present ColorDiffuser, an adaptation of a pre-trained text-to-image latent diffusion model for video colorization. With the proposed adapter-based approach, we repropose the pre-trained text-to-image model to accept input grayscale video frames, with the optional text description, for video colorization. To enhance the temporal coherence and maintain the vividness of colorization across frames, we propose two novel techniques: the Color Propagation Attention and Alternated Sampling Strategy. Color Propagation Attention enables the model to refine its colorization decision based on a reference latent frame, while Alternated Sampling Strategy captures spatiotemporal dependencies by using the next and previous adjacent latent frames alternatively as reference during the generative diffusion sampling steps. This encourages bidirectional color information propagation between adjacent video frames, leading to improved color consistency across frames. We conduct extensive experiments on benchmark datasets, and the results demonstrate the effectiveness of our proposed framework. The evaluations show that ColorDiffuser achieves state-of-the-art performance in video colorization, surpassing existing methods in terms of color fidelity, temporal consistency, and visual quality. △ Less

Submitted 2 June, 2023; originally announced June 2023.

Comments: project page: https://colordiffuser.github.io/

arXiv:2306.00943 [pdf, other]

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Authors: **bo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, Ying Shan, Tien-Tsin Wong

Abstract: Creating a vivid video from the event or scenario in our imagination is a truly fascinating experience. Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. While text is convenient in conveying the overall scene context, it may be insufficient to control precisely. In this paper, we explore customized video generation by utilizing text as c… ▽ More Creating a vivid video from the event or scenario in our imagination is a truly fascinating experience. Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. While text is convenient in conveying the overall scene context, it may be insufficient to control precisely. In this paper, we explore customized video generation by utilizing text as context description and motion structure (e.g. frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model that is pre-trained for still image synthesis and then promoted for video generation with the introduction of temporal modules. This two-stage learning scheme not only reduces the computing resources required, but also improves the performance by transferring the rich concepts available in image datasets solely into video generation. Moreover, we use a simple yet effective causal attention mask strategy to enable longer video synthesis, which mitigates the potential quality degradation effectively. Experimental results show the superiority of our method over existing baselines, particularly in terms of temporal coherence and fidelity to users' guidance. In addition, our model enables several intriguing applications that demonstrate potential for practical usage. △ Less

Submitted 1 June, 2023; originally announced June 2023.

Comments: 13 pages, 8 figures. Project page: https://doubiiu.github.io/projects/Make-Your-Video/

arXiv:2305.17193 [pdf]

AI-based analysis of super-resolution microscopy: Biological discovery in the absence of ground truth

Authors: Ivan R. Nabi, Ben Cardoen, Ismail M. Khater, Guang Gao, Timothy H. Wong, Ghassan Hamarneh

Abstract: Super-resolution microscopy, or nanoscopy, enables the use of fluorescent-based molecular localization tools to study molecular structure at the nanoscale level in the intact cell, bridging the mesoscale gap to classical structural biology methodologies. Analysis of super-resolution data by artificial intelligence (AI), such as machine learning, offers tremendous potential for discovery of new bio… ▽ More Super-resolution microscopy, or nanoscopy, enables the use of fluorescent-based molecular localization tools to study molecular structure at the nanoscale level in the intact cell, bridging the mesoscale gap to classical structural biology methodologies. Analysis of super-resolution data by artificial intelligence (AI), such as machine learning, offers tremendous potential for discovery of new biology, that, by definition, is not known and lacks ground truth. Herein, we describe the application of weakly supervised paradigms to super-resolution microscopy and its potential to enable the accelerated exploration of the nanoscale architecture of subcellular macromolecules and organelles. △ Less

Submitted 27 May, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

Comments: 26 pages, 4 figures

arXiv:2304.11105 [pdf, other]

Improved Diffusion-based Image Colorization via Piggybacked Models

Authors: Hanyuan Liu, **bo Xing, Minshan Xie, Chengze Li, Tien-Tsin Wong

Abstract: Image colorization has been attracting the research interests of the community for decades. However, existing methods still struggle to provide satisfactory colorized results given grayscale images due to a lack of human-like global understanding of colors. Recently, large-scale Text-to-Image (T2I) models have been exploited to transfer the semantic information from the text prompts to the image d… ▽ More Image colorization has been attracting the research interests of the community for decades. However, existing methods still struggle to provide satisfactory colorized results given grayscale images due to a lack of human-like global understanding of colors. Recently, large-scale Text-to-Image (T2I) models have been exploited to transfer the semantic information from the text prompts to the image domain, where text provides a global control for semantic objects in the image. In this work, we introduce a colorization model piggybacking on the existing powerful T2I diffusion model. Our key idea is to exploit the color prior knowledge in the pre-trained T2I diffusion model for realistic and diverse colorization. A diffusion guider is designed to incorporate the pre-trained weights of the latent diffusion model to output a latent color prior that conforms to the visual semantics of the grayscale input. A lightness-aware VQVAE will then generate the colorized result with pixel-perfect alignment to the given grayscale image. Our model can also achieve conditional colorization with additional inputs (e.g. user hints and texts). Extensive experiments show that our method achieves state-of-the-art performance in terms of perceptual quality. △ Less

Submitted 21 April, 2023; originally announced April 2023.

Comments: project page: https://piggyback-color.github.io/

arXiv:2303.16117 [pdf, ps, other]

Feature Engineering Methods on Multivariate Time-Series Data for Financial Data Science Competitions

Authors: Thomas Wong, Mauricio Barahona

Abstract: This paper is a work in progress. We are looking for collaborators to provide us financial datasets in Equity/Futures market to conduct more bench-marking studies. The authors have papers employing similar methods applied on the Numerai dataset, which is freely available but obfuscated. We apply different feature engineering methods for time-series to US market price data. The predictive power o… ▽ More This paper is a work in progress. We are looking for collaborators to provide us financial datasets in Equity/Futures market to conduct more bench-marking studies. The authors have papers employing similar methods applied on the Numerai dataset, which is freely available but obfuscated. We apply different feature engineering methods for time-series to US market price data. The predictive power of models are tested against Numerai-Signals targets. △ Less

Submitted 18 April, 2023; v1 submitted 25 March, 2023; originally announced March 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2303.07925

arXiv:2303.07925 [pdf, other]

Deep incremental learning models for financial temporal tabular datasets with distribution shifts

Authors: Thomas Wong, Mauricio Barahona

Abstract: We present a robust deep incremental learning framework for regression tasks on financial temporal tabular datasets which is built upon the incremental use of commonly available tabular and time series prediction models to adapt to distributional shifts typical of financial datasets. The framework uses a simple basic building block (decision trees) to build self-similar models of any required comp… ▽ More We present a robust deep incremental learning framework for regression tasks on financial temporal tabular datasets which is built upon the incremental use of commonly available tabular and time series prediction models to adapt to distributional shifts typical of financial datasets. The framework uses a simple basic building block (decision trees) to build self-similar models of any required complexity to deliver robust performance under adverse situations such as regime changes, fat-tailed distributions, and low signal-to-noise ratios. As a detailed study, we demonstrate our scheme using XGBoost models trained on the Numerai dataset and show that a two layer deep ensemble of XGBoost models over different model snapshots delivers high quality predictions under different market regimes. We also show that the performance of XGBoost models with different number of boosting rounds in three scenarios (small, standard and large) is monotonically increasing with respect to model size and converges towards the generalisation upper bound. We also evaluate the robustness of the model under variability of different hyperparameters, such as model complexity and data sampling settings. Our model has low hardware requirements as no specialised neural architectures are used and each base model can be independently trained in parallel. △ Less

Submitted 10 October, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

arXiv:2301.02379 [pdf, other]

CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior

Authors: **bo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, Tien-Tsin Wong

Abstract: Speech-driven 3D facial animation has been widely studied, yet there is still a gap to achieving realism and vividness due to the highly ill-posed nature and scarcity of audio-visual data. Existing works typically formulate the cross-modal map** into a regression task, which suffers from the regression-to-mean problem leading to over-smoothed facial motions. In this paper, we propose to cast spe… ▽ More Speech-driven 3D facial animation has been widely studied, yet there is still a gap to achieving realism and vividness due to the highly ill-posed nature and scarcity of audio-visual data. Existing works typically formulate the cross-modal map** into a regression task, which suffers from the regression-to-mean problem leading to over-smoothed facial motions. In this paper, we propose to cast speech-driven facial animation as a code query task in a finite proxy space of the learned codebook, which effectively promotes the vividness of the generated motions by reducing the cross-modal map** uncertainty. The codebook is learned by self-reconstruction over real facial motions and thus embedded with realistic facial motion priors. Over the discrete motion space, a temporal autoregressive model is employed to sequentially synthesize facial motions from the input speech signal, which guarantees lip-sync as well as plausible facial expressions. We demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. Also, a user study further justifies our superiority in perceptual quality. △ Less

Submitted 3 April, 2023; v1 submitted 6 January, 2023; originally announced January 2023.

Comments: CVPR2023 Camera-Ready. Project Page: https://doubiiu.github.io/projects/codetalker/, Code: https://github.com/Doubiiu/CodeTalker

arXiv:2301.01841 [pdf]

Classification of Single Tree Decay Stages from Combined Airborne LiDAR Data and CIR Imagery

Authors: Tsz Chung Wong, Abubakar Sani-Mohammed, **hong Wang, Puzuo Wang, Wei Yao, Marco Heurich

Abstract: Understanding forest health is of great importance for the conservation of the integrity of forest ecosystems. In this regard, evaluating the amount and quality of dead wood is of utmost interest as they are favorable indicators of biodiversity. Apparently, remote sensing-based machine learning techniques have proven to be more efficient and sustainable with unprecedented accuracy in forest invent… ▽ More Understanding forest health is of great importance for the conservation of the integrity of forest ecosystems. In this regard, evaluating the amount and quality of dead wood is of utmost interest as they are favorable indicators of biodiversity. Apparently, remote sensing-based machine learning techniques have proven to be more efficient and sustainable with unprecedented accuracy in forest inventory. This study, for the first time, automatically categorizing individual coniferous trees (Norway spruce) into five decay stages (live, declining, dead, loose bark, and clean) from combined airborne laser scanning (ALS) point clouds and color infrared (CIR) images using three different Machine Learning methods - 3D point cloud-based deep learning (KPConv), Convolutional Neural Network (CNN), and Random Forest (RF). First, CIR colorized point clouds are created by fusing the ALS point clouds and color infrared images. Then, individual tree segmentation is conducted, after which the results are further projected onto four orthogonal planes. Finally, the classification is conducted on the two datasets (3D multispectral point clouds and 2D projected images) based on the three Machine Learning algorithms. All models achieved promising results, reaching overall accuracy (OA) of up to 88.8%, 88.4% and 85.9% for KPConv, CNN and RF, respectively. The experimental results reveal that color information, 3D coordinates, and intensity of point clouds have significant impact on the promising classification performance. The performance of our models, therefore, shows the significance of machine/deep learning for individual tree decay stages classification and landscape-wide assessment of the dead wood amount and quality by using modern airborne remote sensing techniques. The proposed method can contribute as an important and reliable tool for monitoring biodiversity in forest ecosystems. △ Less

Submitted 21 December, 2023; v1 submitted 4 January, 2023; originally announced January 2023.

arXiv:2301.00790 [pdf, other]

Online learning techniques for prediction of temporal tabular datasets with regime changes

Authors: Thomas Wong, Mauricio Barahona

Abstract: The application of deep learning to non-stationary temporal datasets can lead to overfitted models that underperform under regime changes. In this work, we propose a modular machine learning pipeline for ranking predictions on temporal panel datasets which is robust under regime changes. The modularity of the pipeline allows the use of different models, including Gradient Boosting Decision Trees (… ▽ More The application of deep learning to non-stationary temporal datasets can lead to overfitted models that underperform under regime changes. In this work, we propose a modular machine learning pipeline for ranking predictions on temporal panel datasets which is robust under regime changes. The modularity of the pipeline allows the use of different models, including Gradient Boosting Decision Trees (GBDTs) and Neural Networks, with and without feature engineering. We evaluate our framework on financial data for stock portfolio prediction, and find that GBDT models with dropout display high performance, robustness and generalisability with reduced complexity and computational cost. We then demonstrate how online learning techniques, which require no retraining of models, can be used post-prediction to enhance the results. First, we show that dynamic feature projection improves robustness by reducing drawdown in regime changes. Second, we demonstrate that dynamical model ensembling based on selection of models with good recent performance leads to improved Sharpe and Calmar ratios of out-of-sample predictions. We also evaluate the robustness of our pipeline across different data splits and random seeds with good reproducibility. △ Less

Submitted 10 August, 2023; v1 submitted 30 December, 2022; originally announced January 2023.

arXiv:2207.12899 [pdf, other]

Assessment of a cost-effective headphone calibration procedure for soundscape evaluations

Authors: Bhan Lam, Kenneth Ooi, Zhen-Ting Ong, Karn N. Watcharasupat, Trevor Wong, Woon-Seng Gan

Abstract: To increase the availability and adoption of the soundscape standard, a low-cost calibration procedure for reproduction of audio stimuli over headphones was proposed as part of the global ``Soundscape Attributes Translation Project'' (SATP) for validating ISO/TS~12913-2:2018 perceived affective quality (PAQ) attribute translations. A previous preliminary study revealed significant deviations from… ▽ More To increase the availability and adoption of the soundscape standard, a low-cost calibration procedure for reproduction of audio stimuli over headphones was proposed as part of the global ``Soundscape Attributes Translation Project'' (SATP) for validating ISO/TS~12913-2:2018 perceived affective quality (PAQ) attribute translations. A previous preliminary study revealed significant deviations from the intended equivalent continuous A-weighted sound pressure levels ($L_{\text{A,eq}}$) using the open-circuit voltage (OCV) calibration procedure. For a more holistic human-centric perspective, the OCV method is further investigated here in terms of psychoacoustic parameters, including relevant exceedance levels to account for temporal effects on the same 27 stimuli from the SATP. Moreover, a within-subjects experiment with 36 participants was conducted to examine the effects of OCV calibration on the PAQ attributes in ISO/TS~12913-2:2018. Bland-Altman analysis of the objective indicators revealed large biases in the OCV method across all weighted sound level and loudness indicators; and roughness indicators at \SI{5}{\%} and \SI{10}{\%} exceedance levels. Significant perceptual differences due to the OCV method were observed in about \SI{20}{\%} of the stimuli, which did not correspond clearly with the biased acoustic indicators. A cautioned interpretation of the objective and perceptual differences due to small and unpaired samples nevertheless provide grounds for further investigation. △ Less

Submitted 24 July, 2022; originally announced July 2022.

Comments: For 24th International Congress on Acoustics

Journal ref: in Proc. 24th Int. Congr. Acoust., 2022, pp. 1-8

arXiv:2207.07839 [pdf, other]

On Non-Negative Quadratic Programming in Geometric Optimization

Authors: Siu-Wing Cheng, Man Ting Wong

Abstract: We present experimental and theoretical results on a method that applies a numerical solver iteratively to solve several non-negative quadratic programming problems in geometric optimization. The method gains efficiency by exploiting the potential sparsity of the intermediate solutions. We implemented the method to call quadprog of MATLAB iteratively. In comparison with a single call of quadprog,… ▽ More We present experimental and theoretical results on a method that applies a numerical solver iteratively to solve several non-negative quadratic programming problems in geometric optimization. The method gains efficiency by exploiting the potential sparsity of the intermediate solutions. We implemented the method to call quadprog of MATLAB iteratively. In comparison with a single call of quadprog, we obtain a 10-fold speedup on two proximity graph problems in $\mathbb{R}^d$ on some public data sets, a 10-fold speedup on the minimum enclosing ball problem on random points in a unit cube in $\mathbb{R}^d$, and a 5-fold speedup on the polytope distance problem on random points from a cube in $\mathbb{R}^d$ when the input size is significantly larger than the dimension; we also obtain a 2-fold or more speedup on deblurring some gray-scale space and thermal images via non-negative least square. We compare with two minimum enclosing ball software by Gärtner and Fischer et al.; for 1000 nearly cospherical points or random points in a unit cube, the iterative method overtakes the software by Gärtner at 20 dimensions and the software by Fischer et al. at 170 dimensions. In the image deblurring experiments, the iterative method compares favorably with other software that can solve non-negative least square, including FISTA with backtracking, SBB, FNNLS, and lsqnonneg of MATLAB. We analyze theoretically the number of iterations taken by the iterative scheme to reduce the gap between the current solution value and the optimum by a factor $e$. Under certain assumptions, we prove a bound proportional to the square root of the number of variables. △ Less

Submitted 19 November, 2023; v1 submitted 16 July, 2022; originally announced July 2022.

arXiv:2206.15445 [pdf, other]

Asymmetry Disentanglement Network for Interpretable Acute Ischemic Stroke Infarct Segmentation in Non-Contrast CT Scans

Authors: Haomiao Ni, Yuan Xue, Kelvin Wong, John Volpi, Stephen T. C. Wong, James Z. Wang, Xiaolei Huang

Abstract: Accurate infarct segmentation in non-contrast CT (NCCT) images is a crucial step toward computer-aided acute ischemic stroke (AIS) assessment. In clinical practice, bilateral symmetric comparison of brain hemispheres is usually used to locate pathological abnormalities. Recent research has explored asymmetries to assist with AIS segmentation. However, most previous symmetry-based work mixed differ… ▽ More Accurate infarct segmentation in non-contrast CT (NCCT) images is a crucial step toward computer-aided acute ischemic stroke (AIS) assessment. In clinical practice, bilateral symmetric comparison of brain hemispheres is usually used to locate pathological abnormalities. Recent research has explored asymmetries to assist with AIS segmentation. However, most previous symmetry-based work mixed different types of asymmetries when evaluating their contribution to AIS. In this paper, we propose a novel Asymmetry Disentanglement Network (ADN) to automatically separate pathological asymmetries and intrinsic anatomical asymmetries in NCCTs for more effective and interpretable AIS segmentation. ADN first performs asymmetry disentanglement based on input NCCTs, which produces different types of 3D asymmetry maps. Then a synthetic, intrinsic-asymmetry-compensated and pathology-asymmetry-salient NCCT volume is generated and later used as input to a segmentation network. The training of ADN incorporates domain knowledge and adopts a tissue-type aware regularization loss function to encourage clinically-meaningful pathological asymmetry extraction. Coupled with an unsupervised 3D transformation network, ADN achieves state-of-the-art AIS segmentation performance on a public NCCT dataset. In addition to the superior performance, we believe the learned clinically-interpretable asymmetry maps can also provide insights towards a better understanding of AIS assessment. Our code is available at https://github.com/nihaomiao/MICCAI22_ADN. △ Less

Submitted 30 June, 2022; originally announced June 2022.

Comments: MICCAI 2022

arXiv:2206.01741 [pdf, other]

Patcher: Patch Transformers with Mixture of Experts for Precise Medical Image Segmentation

Authors: Yanglan Ou, Ye Yuan, Xiaolei Huang, Stephen T. C. Wong, John Volpi, James Z. Wang, Kelvin Wong

Abstract: We present a new encoder-decoder Vision Transformer architecture, Patcher, for medical image segmentation. Unlike standard Vision Transformers, it employs Patcher blocks that segment an image into large patches, each of which is further divided into small patches. Transformers are applied to the small patches within a large patch, which constrains the receptive field of each pixel. We intentionall… ▽ More We present a new encoder-decoder Vision Transformer architecture, Patcher, for medical image segmentation. Unlike standard Vision Transformers, it employs Patcher blocks that segment an image into large patches, each of which is further divided into small patches. Transformers are applied to the small patches within a large patch, which constrains the receptive field of each pixel. We intentionally make the large patches overlap to enhance intra-patch communication. The encoder employs a cascade of Patcher blocks with increasing receptive fields to extract features from local to global levels. This design allows Patcher to benefit from both the coarse-to-fine feature extraction common in CNNs and the superior spatial relationship modeling of Transformers. We also propose a new mixture-of-experts (MoE) based decoder, which treats the feature maps from the encoder as experts and selects a suitable set of expert features to predict the label for each pixel. The use of MoE enables better specializations of the expert features and reduces interference between them during inference. Extensive experiments demonstrate that Patcher outperforms state-of-the-art Transformer- and CNN-based approaches significantly on stroke lesion segmentation and polyp segmentation. Code for Patcher is released with publication to facilitate future research. △ Less

Submitted 29 May, 2023; v1 submitted 3 June, 2022; originally announced June 2022.

Comments: MICCAI 2022

arXiv:2205.04728 [pdf, other]

Preliminary assessment of a cost-effective headphone calibration procedure for soundscape evaluations

Authors: Bhan Lam, Kenneth Ooi, Karn N. Watcharasupat, Zhen-Ting Ong, Yun-Ting Lau, Trevor Wong, Woon-Seng Gan

Abstract: The introduction of ISO 12913-2:2018 has provided a framework for standardized data collection and reporting procedures for soundscape practitioners. A strong emphasis was placed on the use of calibrated head and torso simulators (HATS) for binaural audio capture to obtain an accurate subjective impression and acoustic measure of the soundscape under evaluation. To auralise the binaural recordings… ▽ More The introduction of ISO 12913-2:2018 has provided a framework for standardized data collection and reporting procedures for soundscape practitioners. A strong emphasis was placed on the use of calibrated head and torso simulators (HATS) for binaural audio capture to obtain an accurate subjective impression and acoustic measure of the soundscape under evaluation. To auralise the binaural recordings as recorded or at set levels, the audio stimuli and the headphone setup are usually calibrated with a HATS. However, calibrated HATS are too financially prohibitive for most research teams, inevitably diminishing the availability of the soundscape standard. With the increasing availability of soundscape binaural recording datasets, and the importance of cross-cultural validation of the soundscape ISO standards, e.g.\ via the Soundscape Attributes Translation Project (SATP), it is imperative to assess the suitability of cost-effective headphone calibration methods to maximise availability without severely compromising on accuracy. Hence, this study objectively examines an open-circuit voltage (OCV) calibration method in comparison to a calibrated HATS on various soundcard and headphone combinations. Preliminary experiments found that calibration with the OCV method differed significantly from the reference binaural recordings in sound pressure levels, whereas negligible differences in levels were observed with the HATS calibration. △ Less

Submitted 10 May, 2022; originally announced May 2022.

Comments: Submitted to the 28th International Congress on Sound and Vibration

arXiv:2204.13890 [pdf, other]

doi 10.3397/IN_2022_0290

Deployment of an IoT System for Adaptive In-Situ Soundscape Augmentation

Authors: Trevor Wong, Karn N. Watcharasupat, Bhan Lam, Kenneth Ooi, Zhen-Ting Ong, Furi Andi Karnapi, Woon-Seng Gan

Abstract: Soundscape augmentation is an emerging approach for noise mitigation by introducing additional sounds known as "maskers" to increase acoustic comfort. Traditionally, the choice of maskers is often predicated on expert guidance or post-hoc analysis which can be time-consuming and sometimes arbitrary. Moreover, this often results in a static set of maskers that are inflexible to the dynamic nature o… ▽ More Soundscape augmentation is an emerging approach for noise mitigation by introducing additional sounds known as "maskers" to increase acoustic comfort. Traditionally, the choice of maskers is often predicated on expert guidance or post-hoc analysis which can be time-consuming and sometimes arbitrary. Moreover, this often results in a static set of maskers that are inflexible to the dynamic nature of real-world acoustic environments. Overcoming the inflexibility of traditional soundscape augmentation is twofold. First, given a snapshot of a soundscape, the system must be able to select an optimal masker without human supervision. Second, the system must also be able to react to changes in the acoustic environment with near real-time latency. In this work, we harness the combined prowess of cloud computing and the Internet of Things (IoT) to allow in-situ listening and playback using microcontrollers while delegating computationally expensive inference tasks to the cloud. In particular, a serverless cloud architecture was used for inference, ensuring near real-time latency and scalability without the need to provision computing resources. A working prototype of the system is currently being deployed in a public area experiencing high traffic noise, as well as undergoing public evaluation for future improvements. △ Less

Submitted 29 April, 2022; originally announced April 2022.

Comments: To be presented at the 51st International Congress and Exposition on Noise Control Engineering

Journal ref: INTER-NOISE and NOISE-CON Congress and Conference Proceedings, Feb. 2022, vol. 265, no. 5, pp. 2013-2021

arXiv:2204.13883 [pdf, other]

doi 10.1109/LSP.2022.3194419

Autonomous In-Situ Soundscape Augmentation via Joint Selection of Masker and Gain

Authors: Karn N. Watcharasupat, Kenneth Ooi, Bhan Lam, Trevor Wong, Zhen-Ting Ong, Woon-Seng Gan

Abstract: The selection of maskers and playback gain levels in a soundscape augmentation system is crucial to its effectiveness in improving the overall acoustic comfort of a given environment. Traditionally, the selection of appropriate maskers and gain levels has been informed by expert opinion, which may not representative of the target population, or by listening tests, which can be time-consuming and l… ▽ More The selection of maskers and playback gain levels in a soundscape augmentation system is crucial to its effectiveness in improving the overall acoustic comfort of a given environment. Traditionally, the selection of appropriate maskers and gain levels has been informed by expert opinion, which may not representative of the target population, or by listening tests, which can be time-consuming and labour-intensive. Furthermore, the resulting static choices of masker and gain are often inflexible to the dynamic nature of real-world soundscapes. In this work, we utilized a deep learning model to perform joint selection of the optimal masker and its gain level for a given soundscape. The proposed model was designed with highly modular building blocks, allowing for an optimized inference process that can quickly search through a large number of masker and gain combinations. In addition, we introduced the use of feature-domain soundscape augmentation conditioned on the digital gain level, eliminating the computationally expensive waveform-domain mixing process during inference time, as well as the tedious pre-calibration process required for new maskers. The proposed system was validated on a large-scale dataset of subjective responses to augmented soundscapes with more than 440 participants, ensuring the ability of the model to predict combined effect of the masker and its gain level on the perceptual pleasantness level. △ Less

Submitted 23 July, 2022; v1 submitted 29 April, 2022; originally announced April 2022.

Journal ref: IEEE Signal Processing Letters, Vol. 29, pp. 1749 - 1753, 2022

arXiv:2203.09860 [pdf, other]

Pseudo Bias-Balanced Learning for Debiased Chest X-ray Classification

Authors: Luyang Luo, Dunyuan Xu, Hao Chen, Tien-Tsin Wong, Pheng-Ann Heng

Abstract: Deep learning models were frequently reported to learn from shortcuts like dataset biases. As deep learning is playing an increasingly important role in the modern healthcare system, it is of great need to combat shortcut learning in medical data as well as develop unbiased and trustworthy models. In this paper, we study the problem of develo** debiased chest X-ray diagnosis models from the bias… ▽ More Deep learning models were frequently reported to learn from shortcuts like dataset biases. As deep learning is playing an increasingly important role in the modern healthcare system, it is of great need to combat shortcut learning in medical data as well as develop unbiased and trustworthy models. In this paper, we study the problem of develo** debiased chest X-ray diagnosis models from the biased training data without knowing exactly the bias labels. We start with the observations that the imbalance of bias distribution is one of the key reasons causing shortcut learning, and the dataset biases are preferred by the model if they were easier to be learned than the intended features. Based on these observations, we proposed a novel algorithm, pseudo bias-balanced learning, which first captures and predicts per-sample bias labels via generalized cross entropy loss and then trains a debiased model using pseudo bias labels and bias-balanced softmax function. We constructed several chest X-ray datasets with various dataset bias situations and demonstrated with extensive experiments that our proposed method achieved consistent improvements over other state-of-the-art approaches. △ Less

Submitted 4 August, 2022; v1 submitted 18 March, 2022; originally announced March 2022.

Comments: To appear in MICCAI 2022. Code available at https://github.com/LLYXC/PBBL

arXiv:2203.03396 [pdf, other]

Screentone-Preserved Manga Retargeting

Authors: Minshan Xie, Menghan Xia, Xueting Liu, Tien-Tsin Wong

Abstract: As a popular comic style, manga offers a unique impression by utilizing a rich set of bitonal patterns, or screentones, for illustration. However, screentones can easily be contaminated with visual-unpleasant aliasing and/or blurriness after resampling, which harms its visualization on displays of diverse resolutions. To address this problem, we propose the first manga retargeting method that synt… ▽ More As a popular comic style, manga offers a unique impression by utilizing a rich set of bitonal patterns, or screentones, for illustration. However, screentones can easily be contaminated with visual-unpleasant aliasing and/or blurriness after resampling, which harms its visualization on displays of diverse resolutions. To address this problem, we propose the first manga retargeting method that synthesizes a rescaled manga image while retaining the screentone in each screened region. This is a non-trivial task as accurate region-wise segmentation remains challenging. Fortunately, the rescaled manga shares the same region-wise screentone correspondences with the original manga, which enables us to simplify the screentone synthesis problem as an anchor-based proposals selection and rearrangement problem. Specifically, we design a novel manga sampling strategy to generate aliasing-free screentone proposals, based on hierarchical grid-based anchors that connect the correspondences between the original and the target rescaled manga. Furthermore, a Recurrent Proposal Selection Module (RPSM) is proposed to adaptively integrate these proposals for target screentone synthesis. Besides, to deal with the translation insensitivity nature of screentones, we propose a translation-invariant screentone loss to facilitate the training convergence. Extensive qualitative and quantitative experiments are conducted to verify the effectiveness of our method, and notably compelling results are achieved compared to existing alternative techniques. △ Less

Submitted 7 March, 2022; originally announced March 2022.

Comments: 10 pages, 13 figures

arXiv:2202.13577 [pdf, other]

Point Set Self-Embedding

Authors: Ruihui Li, Xianzhi Li, Tien-Tsin Wong, Chi-Wing Fu

Abstract: This work presents an innovative method for point set self-embedding, that encodes the structural information of a dense point set into its sparser version in a visual but imperceptible form. The self-embedded point set can function as the ordinary downsampled one and be visualized efficiently on mobile devices. Particularly, we can leverage the self-embedded information to fully restore the origi… ▽ More This work presents an innovative method for point set self-embedding, that encodes the structural information of a dense point set into its sparser version in a visual but imperceptible form. The self-embedded point set can function as the ordinary downsampled one and be visualized efficiently on mobile devices. Particularly, we can leverage the self-embedded information to fully restore the original point set for detailed analysis on remote servers. This task is challenging since both the self-embedded point set and the restored point set should resemble the original one. To achieve a learnable self-embedding scheme, we design a novel framework with two jointly-trained networks: one to encode the input point set into its self-embedded sparse point set and the other to leverage the embedded information for inverting the original point set back. Further, we develop a pair of up-shuffle and down-shuffle units in the two networks, and formulate loss terms to encourage the shape similarity and point distribution in the results. Extensive qualitative and quantitative results demonstrate the effectiveness of our method on both synthetic and real-scanned datasets. △ Less

Submitted 28 February, 2022; originally announced February 2022.

Comments: Accepted by IEEE Transactions on Visualization and Computer Graphics (IEEE TVCG), 2022. All resources can be found at https://liruihui.github.io/

arXiv:2202.06242 [pdf, other]

Beyond NaN: Resiliency of Optimization Layers in The Face of Infeasibility

Authors: Wai Tuck Wong, Sarah Kinsey, Ramesha Karunasena, Thanh Nguyen, Arunesh Sinha

Abstract: Prior work has successfully incorporated optimization layers as the last layer in neural networks for various problems, thereby allowing joint learning and planning in one neural network forward pass. In this work, we identify a weakness in such a set-up where inputs to the optimization layer lead to undefined output of the neural network. Such undefined decision outputs can lead to possible catas… ▽ More Prior work has successfully incorporated optimization layers as the last layer in neural networks for various problems, thereby allowing joint learning and planning in one neural network forward pass. In this work, we identify a weakness in such a set-up where inputs to the optimization layer lead to undefined output of the neural network. Such undefined decision outputs can lead to possible catastrophic outcomes in critical real time applications. We show that an adversary can cause such failures by forcing rank deficiency on the matrix fed to the optimization layer which results in the optimization failing to produce a solution. We provide a defense for the failure cases by controlling the condition number of the input matrix. We study the problem in the settings of synthetic data, Jigsaw Sudoku, and in speed planning for autonomous driving, building on top of prior frameworks in end-to-end learning and optimization. We show that our proposed defense effectively prevents the framework from failing with undefined output. Finally, we surface a number of edge cases which lead to serious bugs in popular equation and optimization solvers which can be abused as well. △ Less

Submitted 4 February, 2023; v1 submitted 13 February, 2022; originally announced February 2022.

Comments: To appear in AAAI-23 Special Track on Safe and Robust AI

arXiv:2201.12576 [pdf, other]

Scale-arbitrary Invertible Image Downscaling

Authors: **bo Xing, Wenbo Hu, Tien-Tsin Wong

Abstract: Conventional social media platforms usually downscale the HR images to restrict their resolution to a specific size for saving transmission/storage cost, which leads to the super-resolution (SR) being highly ill-posed. Recent invertible image downscaling methods jointly model the downscaling/upscaling problems and achieve significant improvements. However, they only consider fixed integer scale fa… ▽ More Conventional social media platforms usually downscale the HR images to restrict their resolution to a specific size for saving transmission/storage cost, which leads to the super-resolution (SR) being highly ill-posed. Recent invertible image downscaling methods jointly model the downscaling/upscaling problems and achieve significant improvements. However, they only consider fixed integer scale factors that cannot downscale HR images with various resolutions to meet the resolution restriction of social media platforms. In this paper, we propose a scale-Arbitrary Invertible image Downscaling Network (AIDN), to natively downscale HR images with arbitrary scale factors. Meanwhile, the HR information is embedded in the downscaled low-resolution (LR) counterparts in a nearly imperceptible form such that our AIDN can also restore the original HR images solely from the LR images. The key to supporting arbitrary scale factors is our proposed Conditional Resampling Module (CRM) that conditions the downscaling/upscaling kernels and sampling locations on both scale factors and image content. Extensive experimental results demonstrate that our AIDN achieves top performance for invertible downscaling with both arbitrary integer and non-integer scale factors. Code will be released upon publication. △ Less

Submitted 9 March, 2022; v1 submitted 29 January, 2022; originally announced January 2022.

arXiv:2110.12596 [pdf, other]

GeoSneakPique: Visual Autocompletion for Geospatial Queries

Authors: Vidya Setlur, Sarah Battersby, Tracy Wong

Abstract: How many crimes occurred in the city center? And exactly which part of town is the 'city center'? While location is at the heart of many data questions, geographic location can be difficult to specify in natural language (NL) queries. This is especially true when working with fuzzy cognitive regions or regions that may be defined based on data distributions instead of absolute administrative locat… ▽ More How many crimes occurred in the city center? And exactly which part of town is the 'city center'? While location is at the heart of many data questions, geographic location can be difficult to specify in natural language (NL) queries. This is especially true when working with fuzzy cognitive regions or regions that may be defined based on data distributions instead of absolute administrative location (e.g., state, country). GeoSneakPique presents a novel method for using a map** widget to support the NL query process, allowing users to specify location via direct manipulation with data-driven guidance on spatial distributions to help select the area of interest. Users receive feedback to help them evaluate and refine their spatial selection interactively and can save spatial definitions for re-use in subsequent queries. We conduct a qualitative evaluation of the GeoSneakPique that indicates the usefulness of the interface as well as opportunities for better supporting geospatial workflows in visual analysis tasks employing cognitive regions. △ Less

Submitted 24 October, 2021; originally announced October 2021.

Comments: 5 pages (4 + 1 page references), two figures

Journal ref: IEEE VIS conference 2021 (TVCG journal)

arXiv:2110.04491 [pdf, other]

Invertible Tone Map** with Selectable Styles

Authors: Zhuming Zhang, Menghan Xia, Xueting Liu, Chengze Li, Tien-Tsin Wong

Abstract: Although digital cameras can acquire high-dynamic range (HDR) images, the captured HDR information are mostly quantized to low-dynamic range (LDR) images for display compatibility and compact storage. In this paper, we propose an invertible tone map** method that converts the multi-exposure HDR to a true LDR (8-bit per color channel) and reserves the capability to accurately restore the original… ▽ More Although digital cameras can acquire high-dynamic range (HDR) images, the captured HDR information are mostly quantized to low-dynamic range (LDR) images for display compatibility and compact storage. In this paper, we propose an invertible tone map** method that converts the multi-exposure HDR to a true LDR (8-bit per color channel) and reserves the capability to accurately restore the original HDR from this {\em invertible LDR}. Our invertible LDR can mimic the appearance of a user-selected tone map** style. It can be shared over any existing social network platforms that may re-encode or format-convert the uploaded images, without much hurting the accuracy of the restored HDR counterpart. To achieve this, we regard the tone map** and the restoration as coupled processes, and formulate them as an encoding-and-decoding problem through convolutional neural networks. Particularly, our model supports pluggable style modulators, each of which bakes a specific tone map** style, and thus favors the application flexibility. Our method is evaluated with a rich variety of HDR images and multiple tone map** operators, which shows the superiority over relevant state-of-the-art methods. Also, we conduct ablation study to justify our design and discuss the robustness and generality toward real applications. △ Less

Submitted 9 October, 2021; originally announced October 2021.

arXiv:2109.13460 [pdf, other]

Self-Improving Voronoi Construction for a Hidden Mixture of Product Distributions

Authors: Siu-Wing Cheng, Man Ting Wong

Abstract: We propose a self-improving algorithm for computing Voronoi diagrams under a given convex distance function with constant description complexity. The $n$ input points are drawn from a hidden mixture of product distributions; we are only given an upper bound $m = o(\sqrt{n})$ on the number of distributions in the mixture, and the property that for each distribution, an input instance is drawn from… ▽ More We propose a self-improving algorithm for computing Voronoi diagrams under a given convex distance function with constant description complexity. The $n$ input points are drawn from a hidden mixture of product distributions; we are only given an upper bound $m = o(\sqrt{n})$ on the number of distributions in the mixture, and the property that for each distribution, an input instance is drawn from it with a probability of $Ω(1/n)$. For any $\varepsilon \in (0,1)$, after spending $O\bigl(mn\log^{O(1)} (mn) + m^{\varepsilon} n^{1+\varepsilon}\log(mn)\bigr)$ time in a training phase, our algorithm achieves an $O\bigl(\frac{1}{\varepsilon}n\log m + \frac{1}{\varepsilon}n2^{O(\log^* n)} + \frac{1}{\varepsilon}H\bigr)$ expected running time with probability at least $1 - O(1/n)$, where $H$ is the entropy of the distribution of the Voronoi diagram output. The expectation is taken over the input distribution and the randomized decisions of the algorithm. For the Euclidean metric, the expected running time improves to $O\bigl(\frac{1}{\varepsilon}n\log m + \frac{1}{\varepsilon}H\bigr)$. △ Less

Submitted 24 October, 2021; v1 submitted 27 September, 2021; originally announced September 2021.

arXiv:2109.12065 [pdf, other]

DeepStroke: An Efficient Stroke Screening Framework for Emergency Rooms with Multimodal Adversarial Deep Learning

Authors: Tongan Cai, Haomiao Ni, Mingli Yu, Xiaolei Huang, Kelvin Wong, John Volpi, James Z. Wang, Stephen T. C. Wong

Abstract: In an emergency room (ER) setting, stroke triage or screening is a common challenge. A quick CT is usually done instead of MRI due to MRI's slow throughput and high cost. Clinical tests are commonly referred to during the process, but the misdiagnosis rate remains high. We propose a novel multimodal deep learning framework, DeepStroke, to achieve computer-aided stroke presence assessment by recogn… ▽ More In an emergency room (ER) setting, stroke triage or screening is a common challenge. A quick CT is usually done instead of MRI due to MRI's slow throughput and high cost. Clinical tests are commonly referred to during the process, but the misdiagnosis rate remains high. We propose a novel multimodal deep learning framework, DeepStroke, to achieve computer-aided stroke presence assessment by recognizing patterns of minor facial muscles incoordination and speech inability for patients with suspicion of stroke in an acute setting. Our proposed DeepStroke takes one-minute facial video data and audio data readily available during stroke triage for local facial paralysis detection and global speech disorder analysis. Transfer learning was adopted to reduce face-attribute biases and improve generalizability. We leverage a multi-modal lateral fusion to combine the low- and high-level features and provide mutual regularization for joint training. Novel adversarial training is introduced to obtain identity-free and stroke-discriminative features. Experiments on our video-audio dataset with actual ER patients show that DeepStroke outperforms state-of-the-art models and achieves better performance than both a triage team and ER doctors, attaining a 10.94% higher sensitivity and maintaining 7.37% higher accuracy than traditional stroke triage when specificity is aligned. Meanwhile, each assessment can be completed in less than six minutes, demonstrating the framework's great potential for clinical translation. △ Less

Submitted 27 June, 2022; v1 submitted 24 September, 2021; originally announced September 2021.

arXiv:2109.08311 [pdf, other]

Adaptive Hierarchical Dual Consistency for Semi-Supervised Left Atrium Segmentation on Cross-Domain Data

Authors: Jun Chen, Heye Zhang, Raad Mohiaddin, Tom Wong, David Firmin, Jennifer Keegan, Guang Yang

Abstract: Semi-supervised learning provides great significance in left atrium (LA) segmentation model learning with insufficient labelled data. Generalising semi-supervised learning to cross-domain data is of high importance to further improve model robustness. However, the widely existing distribution difference and sample mismatch between different data domains hinder the generalisation of semi-supervised… ▽ More Semi-supervised learning provides great significance in left atrium (LA) segmentation model learning with insufficient labelled data. Generalising semi-supervised learning to cross-domain data is of high importance to further improve model robustness. However, the widely existing distribution difference and sample mismatch between different data domains hinder the generalisation of semi-supervised learning. In this study, we alleviate these problems by proposing an Adaptive Hierarchical Dual Consistency (AHDC) for the semi-supervised LA segmentation on cross-domain data. The AHDC mainly consists of a Bidirectional Adversarial Inference module (BAI) and a Hierarchical Dual Consistency learning module (HDC). The BAI overcomes the difference of distributions and the sample mismatch between two different domains. It mainly learns two map** networks adversarially to obtain two matched domains through mutual adaptation. The HDC investigates a hierarchical dual learning paradigm for cross-domain semi-supervised segmentation based on the obtained matched domains. It mainly builds two dual-modelling networks for mining the complementary information in both intra-domain and inter-domain. For the intra-domain learning, a consistency constraint is applied to the dual-modelling targets to exploit the complementary modelling information. For the inter-domain learning, a consistency constraint is applied to the LAs modelled by two dual-modelling networks to exploit the complementary knowledge among different data domains. We demonstrated the performance of our proposed AHDC on four 3D late gadolinium enhancement cardiac MR (LGE-CMR) datasets from different centres and a 3D CT dataset. Compared to other state-of-the-art methods, our proposed AHDC achieved higher segmentation accuracy, which indicated its capability in the cross-domain semi-supervised LA segmentation. △ Less

Submitted 20 September, 2021; v1 submitted 16 September, 2021; originally announced September 2021.

arXiv:2108.09591 [pdf, other]

doi 10.1109/BHI50953.2021.9508604

Multimodal Breast Lesion Classification Using Cross-Attention Deep Networks

Authors: Hung Q. Vo, Pengyu Yuan, Tiancheng He, Stephen T. C. Wong, Hien V. Nguyen

Abstract: Accurate breast lesion risk estimation can significantly reduce unnecessary biopsies and help doctors decide optimal treatment plans. Most existing computer-aided systems rely solely on mammogram features to classify breast lesions. While this approach is convenient, it does not fully exploit useful information in clinical reports to achieve the optimal performance. Would clinical features signifi… ▽ More Accurate breast lesion risk estimation can significantly reduce unnecessary biopsies and help doctors decide optimal treatment plans. Most existing computer-aided systems rely solely on mammogram features to classify breast lesions. While this approach is convenient, it does not fully exploit useful information in clinical reports to achieve the optimal performance. Would clinical features significantly improve breast lesion classification compared to using mammograms alone? How to handle missing clinical information caused by variation in medical practice? What is the best way to combine mammograms and clinical features? There is a compelling need for a systematic study to address these fundamental questions. This paper investigates several multimodal deep networks based on feature concatenation, cross-attention, and co-attention to combine mammograms and categorical clinical variables. We show that the proposed architectures significantly increase the lesion classification performance (average area under ROC curves from 0.89 to 0.94). We also evaluate the model when clinical variables are missing. △ Less

Submitted 21 August, 2021; originally announced August 2021.

arXiv:2107.11925 [pdf, other]

Tsallis and Rényi deformations linked via a new $λ$-duality

Authors: Ting-Kam Leonard Wong, Jun Zhang

Abstract: Tsallis and Rényi entropies, which are monotone transformations of each other, are deformations of the celebrated Shannon entropy. Maximization of these deformed entropies, under suitable constraints, leads to the $q$-exponential family which has applications in non-extensive statistical physics, information theory and statistics. In previous information-geometric studies, the $q$-exponential fami… ▽ More Tsallis and Rényi entropies, which are monotone transformations of each other, are deformations of the celebrated Shannon entropy. Maximization of these deformed entropies, under suitable constraints, leads to the $q$-exponential family which has applications in non-extensive statistical physics, information theory and statistics. In previous information-geometric studies, the $q$-exponential family was analyzed using classical convex duality and Bregman divergence. In this paper, we show that a generalized $λ$-duality, where $λ= 1 - q$ is the constant information-geometric curvature, leads to a generalized exponential family which is essentially equivalent to the $q$-exponential family and has deep connections with Rényi entropy and optimal transport. Using this generalized convex duality and its associated logarithmic divergence, we show that our $λ$-exponential family satisfies properties that parallel and generalize those of the exponential family. Under our framework, the Rényi entropy and divergence arise naturally, and we give a new proof of the Tsallis/Rényi entropy maximizing property of the $q$-exponential family. We also introduce a $λ$-mixture family which may be regarded as the dual of the $λ$-exponential family, and connect it with other mixture-type families. Finally, we discuss a duality between the $λ$-exponential family and the $λ$-logarithmic divergence, and study its statistical consequences. △ Less

Submitted 12 January, 2022; v1 submitted 25 July, 2021; originally announced July 2021.

Comments: 41 pages, 7 figures. Revised

arXiv:2107.07797 [pdf, other]

Conditional Directed Graph Convolution for 3D Human Pose Estimation

Authors: Wenbo Hu, Changgong Zhang, Fangneng Zhan, Lei Zhang, Tien-Tsin Wong

Abstract: Graph convolutional networks have significantly improved 3D human pose estimation by representing the human skeleton as an undirected graph. However, this representation fails to reflect the articulated characteristic of human skeletons as the hierarchical orders among the joints are not explicitly presented. In this paper, we propose to represent the human skeleton as a directed graph with the jo… ▽ More Graph convolutional networks have significantly improved 3D human pose estimation by representing the human skeleton as an undirected graph. However, this representation fails to reflect the articulated characteristic of human skeletons as the hierarchical orders among the joints are not explicitly presented. In this paper, we propose to represent the human skeleton as a directed graph with the joints as nodes and bones as edges that are directed from parent joints to child joints. By so doing, the directions of edges can explicitly reflect the hierarchical relationships among the nodes. Based on this representation, we further propose a spatial-temporal conditional directed graph convolution to leverage varying non-local dependence for different poses by conditioning the graph topology on input poses. Altogether, we form a U-shaped network, named U-shaped Conditional Directed Graph Convolutional Network, for 3D human pose estimation from monocular videos. To evaluate the effectiveness of our method, we conducted extensive experiments on two challenging large-scale benchmarks: Human3.6M and MPI-INF-3DHP. Both quantitative and qualitative results show that our method achieves top performance. Also, ablation studies show that directed graphs can better exploit the hierarchy of articulated human skeletons than undirected graphs, and the conditional connections can yield adaptive graph topologies for different poses. △ Less

Submitted 4 August, 2021; v1 submitted 16 July, 2021; originally announced July 2021.

Journal ref: ACM International Conference on Multimedia (ACM MM), 2021

arXiv:2106.12041 [pdf, other]

doi 10.5194/ascmo-8-117-2022

Analysis of the Evolution of Parametric Drivers of High-End Sea-Level Hazards

Authors: Alana Hough, Tony E. Wong

Abstract: Climate models are critical tools for develo** strategies to manage the risks posed by sea-level rise to coastal communities. While these models are necessary for understanding climate risks, there is a level of uncertainty inherent in each parameter in the models. This model parametric uncertainty leads to uncertainty in future climate risks. Consequently, there is a need to understand how thos… ▽ More Climate models are critical tools for develo** strategies to manage the risks posed by sea-level rise to coastal communities. While these models are necessary for understanding climate risks, there is a level of uncertainty inherent in each parameter in the models. This model parametric uncertainty leads to uncertainty in future climate risks. Consequently, there is a need to understand how those parameter uncertainties impact our assessment of future climate risks and the efficacy of strategies to manage them. Here, we use random forests to examine the parametric drivers of future climate risk and how the relative importances of those drivers change over time. We find that the equilibrium climate sensitivity and a factor that scales the effect of aerosols on radiative forcing are consistently the most important climate model parametric uncertainties throughout the 2020 to 2150 interval for both low and high radiative forcing scenarios. The near-term hazards of high-end sea-level rise are driven primarily by thermal expansion, while the longer-term hazards are associated with mass loss from the Antarctic and Greenland ice sheets. Our results highlight the practical importance of considering time-evolving parametric uncertainties when develo** strategies to manage future climate risks. △ Less

Submitted 10 June, 2021; originally announced June 2021.

Showing 1–50 of 118 results for author: Wong, T