-
Trajectory optimization of tail-sitter considering speed constraints
Authors:
Mingyue Fan,
Fangfang Xie,
Tingwei Ji,
Yao Zheng
Abstract:
Tail-sitters, with the advantages of both the fixed-wing unmanned aerial vehicles (UAVs) and vertical take-off and landing UAVs, have been widely designed and researched in recent years. With the change in modern UAV application scenarios, it is required that UAVs have fast maneuverable three-dimensional flight capabilities. Due to the highly nonlinear aerodynamics produced by the fuselage and wings of the tail-sitter, how to quickly generate a smooth and executable trajectory is a problem that needs to be solved urgently. We constrain the speed of the tail-sitter, eliminate the differential dynamics constraints in the trajectory generation process of the tail-sitter through differential flatness, and allocate the time variable of the trajectory through the state-of-the-art trajectory generation method named MINCO. By discretizing the trajectory in time, we convert the speed constraint on the vehicle into a soft constraint, thereby achieving the time-optimal trajectory for the tail-sitter to fly through any given waypoints.
Submitted 23 June, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
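The abstract above describes turning the speed limit into a soft constraint by discretizing the trajectory in time. A minimal sketch of that idea follows, assuming a per-axis polynomial trajectory and a cubic penalty on squared-speed excess. The function names, the penalty form, and the sampling scheme are illustrative assumptions, not the authors' MINCO-based implementation.

```python
def sample_velocities(coeffs, duration, n_samples):
    """Sample velocity of a per-axis polynomial p(t) = sum(c[k] * t**k).

    coeffs: three coefficient lists (x, y, z), lowest order first (hypothetical layout).
    """
    samples = []
    for i in range(n_samples):
        t = duration * i / (n_samples - 1)
        # Derivative of each axis polynomial evaluated at time t.
        v = tuple(
            sum(k * c[k] * t ** (k - 1) for k in range(1, len(c)))
            for c in coeffs
        )
        samples.append(v)
    return samples


def speed_penalty(velocities, v_max, dt):
    """Soft speed-limit cost: zero inside the limit, growing smoothly outside it."""
    total = 0.0
    for vx, vy, vz in velocities:
        excess = vx * vx + vy * vy + vz * vz - v_max * v_max
        if excess > 0.0:
            total += excess ** 3 * dt  # assumed cubic penalty on the violation
    return total
```

In an optimizer, a term like `speed_penalty` would be added to the time-plus-smoothness objective with a large weight, so violations are discouraged rather than strictly forbidden.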
-
Dimba: Transformer-Mamba Diffusion Models
Authors:
Zhengcong Fei,
Mingyuan Fan,
Changqian Yu,
Debang Li,
Youqiang Zhang,
Junshi Huang
Abstract:
This paper unveils Dimba, a new text-to-image diffusion model that employs a distinctive hybrid architecture combining Transformer and Mamba elements. Specifically, Dimba's sequentially stacked blocks alternate between Transformer and Mamba layers and integrate conditional information through a cross-attention layer, thus capitalizing on the advantages of both architectural paradigms. We investigate several optimization strategies, including quality tuning and resolution adaptation, and identify critical configurations necessary for large-scale image generation. The model's flexible design supports scenarios that cater to specific resource constraints and objectives. When scaled appropriately, Dimba offers substantial throughput and a reduced memory footprint relative to conventional pure Transformer-based benchmarks. Extensive experiments indicate that Dimba achieves performance comparable to the benchmarks in terms of image quality, artistic rendering, and semantic control. We also report several intriguing properties of the architecture discovered during evaluation and release checkpoints from our experiments. Our findings emphasize the promise of large-scale hybrid Transformer-Mamba architectures in the foundational stage of diffusion models, suggesting a bright future for text-to-image generation.
Submitted 3 June, 2024;
originally announced June 2024.
-
Sequence Length Scaling in Vision Transformers for Scientific Images on Frontier
Authors:
Aristeidis Tsaris,
Chengming Zhang,
Xiao Wang,
Junqi Yin,
Siyan Liu,
Moetasim Ashfaq,
Ming Fan,
Jong Youl Choi,
Mohamed Wahib,
Dan Lu,
Prasanna Balaprakash,
Feiyi Wang
Abstract:
Vision Transformers (ViTs) are pivotal for foundational models in scientific imagery, including Earth science applications, due to their capability to process large sequence lengths. While transformers for text have inspired scaling sequence lengths in ViTs, adapting these approaches for ViTs introduces unique challenges. We develop distributed sequence parallelism for ViTs, enabling them to handle up to 1M tokens. Our approach, leveraging DeepSpeed-Ulysses and Long-Sequence-Segmentation with model sharding, is the first to apply sequence parallelism in ViT training, achieving a 94% batch scaling efficiency on 2,048 AMD-MI250X GPUs. Evaluating sequence parallelism in ViTs, particularly in models up to 10B parameters, highlighted substantial bottlenecks. We countered these with hybrid sequence, pipeline, and tensor parallelism, together with flash attention strategies, to scale beyond single GPU memory limits. Our method significantly enhances climate modeling accuracy by 20% in temperature predictions, marking the first training of a transformer model on a full-attention matrix over 188K sequence length.
Submitted 17 April, 2024;
originally announced May 2024.
-
Interpretable Data Fusion for Distributed Learning: A Representative Approach via Gradient Matching
Authors:
Mengchen Fan,
Baocheng Geng,
Keren Li,
Xueqian Wang,
Pramod K. Varshney
Abstract:
This paper introduces a representative-based approach for distributed learning that transforms multiple raw data points into a virtual representation. Unlike traditional distributed learning methods such as Federated Learning, which do not offer human interpretability, our method makes complex machine learning processes accessible and comprehensible. It achieves this by condensing extensive datasets into digestible formats, thus fostering intuitive human-machine interactions. Additionally, this approach maintains privacy and communication efficiency, and it matches the training performance of models using raw data. Simulation results show that our approach is competitive with or outperforms traditional Federated Learning in accuracy and convergence, especially in scenarios with complex models and a higher number of clients. This framework marks a step forward in integrating human intuition with machine intelligence, which potentially enhances human-machine learning interfaces and collaborative efforts.
Submitted 6 May, 2024;
originally announced May 2024.
-
ORBIT: Oak Ridge Base Foundation Model for Earth System Predictability
Authors:
Xiao Wang,
Aristeidis Tsaris,
Siyan Liu,
Jong-Youl Choi,
Ming Fan,
Wei Zhang,
Junqi Yin,
Moetasim Ashfaq,
Dan Lu,
Prasanna Balaprakash
Abstract:
Earth system predictability is challenged by the complexity of environmental dynamics and the multitude of variables involved. Current AI foundation models, although advanced by leveraging large and heterogeneous data, are often constrained by their size and data integration, limiting their effectiveness in addressing the full range of Earth system prediction challenges. To overcome these limitations, we introduce the Oak Ridge Base Foundation Model for Earth System Predictability (ORBIT), an advanced vision-transformer model that scales up to 113 billion parameters using a novel hybrid tensor-data orthogonal parallelism technique. As the largest model of its kind, ORBIT surpasses the current climate AI foundation model size by a thousandfold. Performance scaling tests conducted on the Frontier supercomputer have demonstrated that ORBIT achieves 230 to 707 PFLOPS, with scaling efficiency maintained at 78% to 96% across 24,576 AMD GPUs. These breakthroughs establish new advances in AI-driven climate modeling and demonstrate promise to significantly improve the Earth system predictability.
Submitted 22 April, 2024;
originally announced April 2024.
-
Music Consistency Models
Authors:
Zhengcong Fei,
Mingyuan Fan,
Junshi Huang
Abstract:
Consistency models have exhibited remarkable capabilities in facilitating efficient image/video generation, enabling synthesis with minimal sampling steps. They have proven advantageous in mitigating the computational burdens associated with diffusion models. Nevertheless, the application of consistency models in music generation remains largely unexplored. To address this gap, we present Music Consistency Models (MusicCM), which leverage the concept of consistency models to efficiently synthesize mel-spectrograms for music clips, maintaining high quality while minimizing the number of sampling steps. Building upon existing text-to-music diffusion models, the MusicCM model incorporates consistency distillation and adversarial discriminator training. Moreover, we find it beneficial to generate extended coherent music by incorporating multiple diffusion processes with shared constraints. Experimental results reveal the effectiveness of our model in terms of computational efficiency, fidelity, and naturalness. Notably, MusicCM achieves seamless music synthesis with a mere four sampling steps, e.g., only one second per minute of the music clip, showcasing the potential for real-time application.
Submitted 20 April, 2024;
originally announced April 2024.
-
Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models
Authors:
Zhengcong Fei,
Mingyuan Fan,
Changqian Yu,
Debang Li,
Junshi Huang
Abstract:
Transformers have catalyzed advancements in computer vision and natural language processing (NLP) fields. However, substantial computational complexity poses limitations for their application in long-context tasks, such as high-resolution image generation. This paper introduces a series of architectures adapted from the RWKV model used in NLP, with requisite modifications tailored for diffusion models applied to image generation tasks, referred to as Diffusion-RWKV. Similar to diffusion with Transformers, our model is designed to efficiently handle patchified inputs in a sequence with extra conditions, while also scaling up effectively, accommodating both large-scale parameters and extensive datasets. Its distinctive advantage manifests in its reduced spatial aggregation complexity, rendering it exceptionally adept at processing high-resolution images, thereby eliminating the necessity for windowing or group cached operations. Experimental results on both conditional and unconditional image generation tasks demonstrate that Diffusion-RWKV performs on par with or surpasses existing CNN or Transformer-based diffusion models in FID and IS metrics while significantly reducing total FLOP usage.
Submitted 5 April, 2024;
originally announced April 2024.
-
Conditional Pseudo-Reversible Normalizing Flow for Surrogate Modeling in Quantifying Uncertainty Propagation
Authors:
Minglei Yang,
Pengjun Wang,
Ming Fan,
Dan Lu,
Yanzhao Cao,
Guannan Zhang
Abstract:
We introduce a conditional pseudo-reversible normalizing flow for constructing surrogate models of a physical model polluted by additive noise to efficiently quantify forward and inverse uncertainty propagation. Existing surrogate modeling approaches usually focus on approximating the deterministic component of the physical model. However, this strategy necessitates knowledge of the noise and resorts to auxiliary sampling methods for quantifying inverse uncertainty propagation. In this work, we develop the conditional pseudo-reversible normalizing flow model to directly learn and efficiently generate samples from the conditional probability density functions. The training process uses a dataset consisting of input-output pairs without requiring prior knowledge about the noise and the function. Our model, once trained, can generate samples from any conditional probability density function whose high probability regions are covered by the training set. Moreover, the pseudo-reversibility feature allows for the use of fully-connected neural network architectures, which simplifies the implementation and enables theoretical analysis. We provide a rigorous convergence analysis of the conditional pseudo-reversible normalizing flow model, showing its ability to converge to the target conditional probability density function using the Kullback-Leibler divergence. To demonstrate the effectiveness of our method, we apply it to several benchmark tests and a real-world geologic carbon storage problem.
Submitted 30 March, 2024;
originally announced April 2024.
-
Designing Upper-Body Gesture Interaction with and for People with Spinal Muscular Atrophy in VR
Authors:
Jingze Tian,
Yingna Wang,
Keye Yu,
Liyi Xu,
Junan Xie,
Franklin Mingzhe Li,
Yafeng Niu,
Mingming Fan
Abstract:
Recent research proposed gaze-assisted gestures to enhance interaction within virtual reality (VR), providing opportunities for people with motor impairments to experience VR. Compared to people with other motor impairments, those with Spinal Muscular Atrophy (SMA) exhibit enhanced distal limb mobility, providing them with more design space. However, it remains unknown what gaze-assisted upper-body gestures people with SMA would want and be able to perform. We conducted an elicitation study in which 12 VR-experienced people with SMA designed upper-body gestures for 26 VR commands, and collected 312 user-defined gestures. Participants predominantly favored creating gestures with their hands. The type of tasks and participants' abilities influence their choice of body parts for gesture design. Participants tended to enhance their body involvement and preferred gestures that required minimal physical effort, and were aesthetically pleasing. Our research will contribute to creating better gesture-based input methods for people with motor impairments to interact with VR.
Submitted 24 March, 2024;
originally announced March 2024.
-
HeadEvolver: Text to Head Avatars via Expressive and Attribute-Preserving Mesh Deformation
Authors:
Duotun Wang,
Hengyu Meng,
Zeyu Cai,
Zhijing Shao,
Qianxi Liu,
Lin Wang,
Mingming Fan,
Xiaohang Zhan,
Zeyu Wang
Abstract:
We present HeadEvolver, a novel framework to generate stylized head avatars from text guidance. HeadEvolver uses locally learnable mesh deformation from a template head mesh, producing high-quality digital assets for detail-preserving editing and animation. To tackle the challenges of lacking fine-grained and semantic-aware local shape control in global deformation through Jacobians, we introduce a trainable parameter as a weighting factor for the Jacobian at each triangle to adaptively change local shapes while maintaining global correspondences and facial features. Moreover, to ensure the coherence of the resulting shape and appearance from different viewpoints, we use pretrained image diffusion models for differentiable rendering with regularization terms to refine the deformation under text guidance. Extensive experiments demonstrate that our method can generate diverse head avatars with an articulated mesh that can be edited seamlessly in 3D graphics software, facilitating downstream applications such as more efficient animation with inherited blend shapes and semantic consistency.
Submitted 10 June, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
-
Typist Experiment: an Investigation of Human-to-Human Dictation via Role-play to Inform Voice-based Text Authoring
Authors:
Can Liu,
Siying Hu,
Li Feng,
Mingming Fan
Abstract:
Voice dictation is increasingly used for text entry, especially in mobile scenarios. However, the speech-based experience gets disrupted when users must go back to a screen and keyboard to review and edit the text. While existing dictation systems focus on improving transcription and error correction, little is known about how to support speech input for the entire text creation process, including composition, reviewing and editing. We conducted an experiment in which ten pairs of participants took on the roles of authors and typists to work on a text authoring task. By analysing the natural language patterns of both authors and typists, we identified new challenges and opportunities for the design of future dictation interfaces, including the ambiguity of human dictation, the differences between audio-only and with screen, and various passive and active assistance that can potentially be provided by future systems.
Submitted 8 March, 2024;
originally announced March 2024.
-
To Reach the Unreachable: Exploring the Potential of VR Hand Redirection for Upper Limb Rehabilitation
Authors:
Peixuan Xiong,
Yukai Zhang,
Nandi Zhang,
Shihan Fu,
Xin Li,
Yadan Zheng,
Jinni Zhou,
Xiquan Hu,
Mingming Fan
Abstract:
Rehabilitation therapies are widely employed to assist people with motor impairments in regaining control over their affected body parts. Nevertheless, factors such as fatigue and low self-efficacy can hinder patient compliance during extensive rehabilitation processes. Utilizing hand redirection in virtual reality (VR) enables patients to accomplish seemingly more challenging tasks, thereby bolstering their motivation and confidence. While previous research has investigated user experience and hand redirection among able-bodied people, its effects on motor-impaired people remain unexplored. In this paper, we present a VR rehabilitation application that harnesses hand redirection. Through a user study and semi-structured interviews, we examine the impact of hand redirection on the rehabilitation experiences of people with motor impairments and its potential to enhance their motivation for upper limb rehabilitation. Our findings suggest that patients are not sensitive to hand movement inconsistency, and the majority express interest in incorporating hand redirection into future long-term VR rehabilitation programs.
Submitted 8 March, 2024;
originally announced March 2024.
-
SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting
Authors:
Zhijing Shao,
Zhaolong Wang,
Zhuang Li,
Duotun Wang,
Xiangru Lin,
Yu Zhang,
Mingming Fan,
Zeyu Wang
Abstract:
We present SplattingAvatar, a hybrid 3D representation of photorealistic human avatars with Gaussian Splatting embedded on a triangle mesh, which renders over 300 FPS on a modern GPU and 30 FPS on a mobile device. We disentangle the motion and appearance of a virtual human with explicit mesh geometry and implicit appearance modeling with Gaussian Splatting. The Gaussians are defined by barycentric coordinates and displacement on a triangle mesh as Phong surfaces. We extend lifted optimization to simultaneously optimize the parameters of the Gaussians while walking on the triangle mesh. SplattingAvatar is a hybrid representation of virtual humans where the mesh represents low-frequency motion and surface deformation, while the Gaussians take over the high-frequency geometry and detailed appearance. Unlike existing deformation methods that rely on an MLP-based linear blend skinning (LBS) field for motion, we control the rotation and translation of the Gaussians directly by mesh, which empowers its compatibility with various animation techniques, e.g., skeletal animation, blend shapes, and mesh editing. Trainable from monocular videos for both full-body and head avatars, SplattingAvatar shows state-of-the-art rendering quality across multiple datasets.
Submitted 8 March, 2024;
originally announced March 2024.
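The abstract above says each Gaussian is defined by barycentric coordinates and a displacement on a triangle mesh. A minimal sketch of that embedding follows, assuming a flat face normal as the displacement direction. The function names and the flat-normal choice are illustrative assumptions, and the Phong-surface formulation mentioned in the abstract is not reproduced here.

```python
import math


def sub(a, b):
    """Component-wise difference of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    """Cross product of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def gaussian_center(triangle, bary, displacement):
    """World-space center of a Gaussian embedded on one triangle.

    triangle: three vertices (a, b, c); bary: weights (u, v, w) summing to 1;
    displacement: signed offset along the unit face normal (assumed flat).
    """
    a, b, c = triangle
    u, v, w = bary
    n = cross(sub(b, a), sub(c, a))
    length = math.sqrt(sum(x * x for x in n))
    n = tuple(x / length for x in n)
    # Barycentric point on the triangle, pushed off the surface along the normal.
    return tuple(
        u * pa + v * pb + w * pc + displacement * pn
        for pa, pb, pc, pn in zip(a, b, c, n)
    )
```

Because the center depends only on the triangle's vertices, moving the mesh (for example with skeletal animation or blend shapes) moves the embedded Gaussian with it, which is the property the abstract relies on.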
-
LightSword: A Customized Virtual Reality Exergame for Long-Term Cognitive Inhibition Training in Older Adults
Authors:
Qiuxin Du,
Zhen Song,
Haiyan Jiang,
Xiaoying Wei,
Dongdong Weng,
Mingming Fan
Abstract:
The decline of cognitive inhibition significantly impacts older adults' quality of life and well-being, making it a vital public health problem in today's aging society. Previous research has demonstrated that Virtual reality (VR) exergames have great potential to enhance cognitive inhibition among older adults. However, existing commercial VR exergames were unsuitable for older adults' long-term cognitive training due to the inappropriate cognitive activation paradigm, unnecessary complexity, and unbefitting difficulty levels. To bridge these gaps, we developed a customized VR cognitive training exergame (LightSword) based on Dual-task and Stroop paradigms for long-term cognitive inhibition training among healthy older adults. Subsequently, we conducted an eight-month longitudinal user study with 12 older adults aged 60 years and above to demonstrate the effectiveness of LightSword in improving cognitive inhibition. After the training, the cognitive inhibition abilities of older adults were significantly enhanced, with benefits persisting for 6 months. This result indicated that LightSword has both short-term and long-term effects in enhancing cognitive inhibition. Furthermore, qualitative feedback revealed that older adults exhibited a positive attitude toward long-term training with LightSword, which enhanced their motivation and compliance.
Submitted 7 March, 2024;
originally announced March 2024.
-
FetchAid: Making Parcel Lockers More Accessible to Blind and Low Vision People With Deep-learning Enhanced Touchscreen Guidance, Error-Recovery Mechanism, and AR-based Search Support
Authors:
Zhitong Guan,
Zeyu Xiong,
Mingming Fan
Abstract:
Parcel lockers have become an increasingly prevalent last-mile delivery method. Yet, a recent study revealed its accessibility challenges to blind and low-vision people (BLV). Informed by the study, we designed FetchAid, a standalone intelligent mobile app assisting BLV in using a parcel locker in real-time by integrating computer vision and augmented reality (AR) technologies. FetchAid first uses a deep network to detect the user's fingertip and relevant buttons on the touch screen of the parcel locker to guide the user to reveal and scan the QR code to open the target compartment door and then guide the user to reach the door safely with AR-based context-aware audio feedback. Moreover, FetchAid provides an error-recovery mechanism and real-time feedback to keep the user on track. We show that FetchAid substantially improved task accomplishment and efficiency, and reduced frustration and overall effort in a study with 12 BLV participants, regardless of their vision conditions and previous experience.
Submitted 24 February, 2024;
originally announced February 2024.
-
"It Is Hard to Remove from My Eye": Design Makeup Residue Visualization System for Chinese Traditional Opera (Xiqu) Performers
Authors:
Zeyu Xiong,
Shihan Fu,
Yanying Zhu,
Chenqing Zhu,
Xiaojuan Ma,
Mingming Fan
Abstract:
Chinese traditional opera (Xiqu) performers often experience skin problems due to the long-term use of heavy-metal-laden face paints. To explore the current skincare challenges encountered by Xiqu performers, we conducted an online survey (N=136) and semi-structured interviews (N=15) as a formative study. We found that incomplete makeup removal is the leading cause of human-induced skin problems, especially the difficulty in removing eye makeup. Therefore, we proposed EyeVis, a prototype that can visualize the residual eye makeup and record the time make-up was worn by Xiqu performers. We conducted a 7-day deployment study (N=12) to evaluate EyeVis. Results indicate that EyeVis helps to increase Xiqu performers' awareness about removing makeup, as well as boosting their confidence and security in skincare. Overall, this work also provides implications for studying the work of people who wear makeup on a daily basis, and helps to promote and preserve the intangible cultural heritage of practitioners.
Submitted 24 February, 2024;
originally announced February 2024.
-
Scalable Diffusion Models with State Space Backbone
Authors:
Zhengcong Fei,
Mingyuan Fan,
Changqian Yu,
Junshi Huang
Abstract:
This paper presents a new exploration into a category of diffusion models built upon state space architecture. We endeavor to train diffusion models for image data, wherein the traditional U-Net backbone is supplanted by a state space backbone, functioning on raw patches or latent space. Given its notable efficacy in accommodating long-range dependencies, Diffusion State Space Models (DiS) are distinguished by treating all inputs including time, condition, and noisy image patches as tokens. Our assessment of DiS encompasses both unconditional and class-conditional image generation scenarios, revealing that DiS exhibits comparable, if not superior, performance to CNN-based or Transformer-based U-Net architectures of commensurate size. Furthermore, we analyze the scalability of DiS, gauged by the forward pass complexity quantified in Gflops. DiS models with higher Gflops, achieved through augmentation of depth/width or augmentation of input tokens, consistently demonstrate lower FID. In addition to demonstrating commendable scalability characteristics, DiS-H/2 models in latent space achieve performance levels akin to prior diffusion models on class-conditional ImageNet benchmarks at the resolution of 256×256 and 512×512, while significantly reducing the computational burden. The code and models are available at: https://github.com/feizc/DiS.
Submitted 28 March, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
-
Exploring the Opportunity of Augmented Reality (AR) in Supporting Older Adults Explore and Learn Smartphone Applications
Authors:
Xiaofu Jin,
Wai Tong,
Xiaoying Wei,
Xian Wang,
Emily Kuang,
Xiaoyu Mo,
Huamin Qu,
Mingming Fan
Abstract:
The global aging trend compels older adults to navigate the evolving digital landscape, presenting a substantial challenge in mastering smartphone applications. While Augmented Reality (AR) holds promise for enhancing learning and user experience, its role in aiding older adults' smartphone app exploration remains insufficiently explored. Therefore, we conducted a two-phase study: (1) a workshop with 18 older adults to identify app exploration challenges and potential AR interventions, and (2) tech-probe participatory design sessions with 15 participants to co-create AR support tools. Our research highlights AR's effectiveness in reducing physical and cognitive strain among older adults during app exploration, especially during multi-app usage and the trial-and-error learning process. We also examined their interactional experiences with AR, yielding design considerations on tailoring AR tools for smartphone app exploration. Ultimately, our study unveils the prospective landscape of AR in supporting the older demographic, both presently and in future scenarios.
Submitted 7 February, 2024;
originally announced February 2024.
-
Incorporating Exemplar Optimization into Training with Dual Networks for Human Mesh Recovery
Authors:
Yongwei Nie,
Mingxian Fan,
Chengjiang Long,
Qing Zhang,
Jian Zhu,
Xuemiao Xu
Abstract:
We propose a novel optimization-based human mesh recovery method from a single image. Given a test exemplar, previous approaches optimize the pre-trained regression network to minimize the 2D re-projection loss, but they suffer from over-/under-fitting problems. This is because the ``exemplar optimization'' at test time is only weakly related to the pre-training process, and the exemplar optimization loss function differs from the training loss function. (1) We incorporate exemplar optimization into the training stage. During training, our method first executes exemplar optimization and subsequently proceeds with training-time optimization. The exemplar optimization may head in a wrong direction, while the subsequent training optimization serves to correct the deviation. Because it is involved in training, the exemplar optimization learns to adapt its behavior to the training data, thereby acquiring generalizability to test exemplars. (2) We devise a dual-network architecture to realize the novel training paradigm. It is composed of a main regression network and an auxiliary network, in which we can formulate the exemplar optimization loss function in the same form as the training loss function. This further enhances the compatibility between the exemplar and training optimizations. Experiments demonstrate that our exemplar optimization after the novel training scheme significantly outperforms state-of-the-art approaches.
Submitted 25 January, 2024;
originally announced January 2024.
-
Cross-Inlining Binary Function Similarity Detection
Authors:
Ang Jia,
Ming Fan,
Xi Xu,
Wuxia Jin,
Haijun Wang,
Ting Liu
Abstract:
Binary function similarity detection plays an important role in a wide range of security applications. Existing works usually assume that the query function and target function share equal semantics and compare their full semantics to obtain the similarity. However, we find that the function mapping is more complex, especially when function inlining happens.
In this paper, we will systematically investigate cross-inlining binary function similarity detection. We first construct a cross-inlining dataset by compiling 51 projects using 9 compilers, with 4 optimizations, to 6 architectures, with 2 inlining flags, which results in two datasets both with 216 combinations. Then we construct the cross-inlining function mappings by linking the common source functions in these two datasets. Through analysis of this dataset, we find that three cross-inlining patterns widely exist while existing work suffers when detecting cross-inlining binary function similarity. Next, we propose a pattern-based model named CI-Detector for cross-inlining matching. CI-Detector uses the attributed CFG to represent the semantics of binary functions and GNN to embed binary functions into vectors. CI-Detector respectively trains a model for these three cross-inlining patterns. Finally, the testing pairs are input to these three models and all the produced similarities are aggregated to produce the final similarity. We conduct several experiments to evaluate CI-Detector. Results show that CI-Detector can detect cross-inlining pairs with a precision of 81% and a recall of 97%, which exceeds all state-of-the-art works.
Submitted 11 January, 2024;
originally announced January 2024.
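The dataset arithmetic in the abstract above can be checked with a short sketch. Only the counts (9 compilers, 4 optimizations, 6 architectures) and the reported precision and recall come from the abstract; the labels below are placeholders, and the F1 value is derived here rather than reported by the authors.

```python
from itertools import product

# Counts taken from the abstract; the concrete labels are placeholders.
compilers = [f"compiler_{i}" for i in range(9)]
optimizations = [f"opt_{i}" for i in range(4)]
architectures = [f"arch_{i}" for i in range(6)]

# One build configuration per (compiler, optimization, architecture) triple.
# Each of the two inlining flags yields one dataset with this many combinations.
configurations = list(product(compilers, optimizations, architectures))
print(len(configurations))  # 216


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Derived from the reported 81% precision and 97% recall (not stated in the abstract).
print(round(f1_score(0.81, 0.97), 3))  # 0.883
```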
-
Simplified Information Geometry Approach for Massive MIMO-OFDM Channel Estimation -- Part II: Convergence Analysis
Authors:
Jiyuan Yang,
Yan Chen,
Mingrui Fan,
Xiqi Gao,
Xiang-Gen Xia,
Dirk Slock
Abstract:
In Part II of this two-part paper, we prove the convergence of the simplified information geometry approach (SIGA) proposed in Part I. For a general Bayesian inference problem, we first show that the iteration of the common second-order natural parameter (SONP) is separated from that of the common first-order natural parameter (FONP). Hence, the convergence of the common SONP can be checked independently. We show that with the initialization satisfying a specific but large range, the common SONP is convergent regardless of the value of the damping factor. For the common FONP, we establish a sufficient condition of its convergence and prove that the convergence of the common FONP relies on the spectral radius of a particular matrix related to the damping factor. We give the range of the damping factor that guarantees the convergence in the worst case. Further, we determine the range of the damping factor for massive MIMO-OFDM channel estimation by using the specific properties of the measurement matrices. Simulation results are provided to confirm the theoretical results.
Submitted 3 June, 2024; v1 submitted 3 January, 2024;
originally announced January 2024.
-
Efficient Information Geometry Approach for Massive MIMO-OFDM Channel Estimation
Authors:
Jiyuan Yang,
Yan Chen,
Mingrui Fan,
An-An Lu,
Wen Zhong,
Xiqi Gao,
Xiaohu You,
Xiang-Gen Xia,
Dirk Slock
Abstract:
We investigate the channel estimation for massive multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) systems. We revisit the information geometry approach (IGA) for massive MIMO-OFDM channel estimation. By using the constant magnitude property of the entries of the measurement matrix, we find that the second-order natural parameters of the distributions on all the auxiliary manifolds are equivalent to each other, and the first-order natural parameters are asymptotically equivalent to each other at the fixed point. Motivated by these results, we simplify the process of IGA and propose an efficient IGA (EIGA) for massive MIMO-OFDM channel estimation, which allows efficient implementation with fast Fourier transformation (FFT). We then establish a sufficient condition of its convergence and accordingly find a range of the damping factor for the convergence. We show that this range of damping factor is sufficiently wide by using the specific properties of the measurement matrices. Further, we prove that at the fixed point, the a posteriori mean obtained by EIGA is asymptotically optimal. Simulations confirm that EIGA can achieve the optimal performance with low complexity in a limited number of iterations.
Submitted 3 June, 2024; v1 submitted 3 January, 2024;
originally announced January 2024.
-
Tuning-Free Inversion-Enhanced Control for Consistent Image Editing
Authors:
Xiaoyue Duan,
Shuhao Cui,
Guoliang Kang,
Baochang Zhang,
Zhengcong Fei,
Mingyuan Fan,
Junshi Huang
Abstract:
Consistent editing of real images is a challenging task, as it requires performing non-rigid edits (e.g., changing postures) to the main objects in the input image without changing their identity or attributes. To guarantee consistent attributes, some existing methods fine-tune the entire model or the textual embedding for structural consistency, but they are time-consuming and fail to perform non-rigid edits. Other works are tuning-free, but their performances are weakened by the quality of Denoising Diffusion Implicit Model (DDIM) reconstruction, which often fails in real-world scenarios. In this paper, we present a novel approach called Tuning-free Inversion-enhanced Control (TIC), which directly correlates features from the inversion process with those from the sampling process to mitigate the inconsistency in DDIM reconstruction. Specifically, our method effectively obtains inversion features from the key and value features in the self-attention layers, and enhances the sampling process by these inversion features, thus achieving accurate reconstruction and content-consistent editing. To extend the applicability of our method to general editing scenarios, we also propose a mask-guided attention concatenation strategy that combines contents from both the inversion and the naive DDIM editing processes. Experiments show that the proposed method outperforms previous works in reconstruction and consistent editing, and produces impressive results in various settings.
Submitted 22 December, 2023;
originally announced December 2023.
-
Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration
Authors:
Meihao Fan,
Xiaoyue Han,
Ju Fan,
Chengliang Chai,
Nan Tang,
Guoliang Li,
Xiaoyong Du
Abstract:
Entity resolution (ER) is an important data integration task with a wide spectrum of applications. The state-of-the-art solutions on ER rely on pre-trained language models (PLMs), which require fine-tuning on a lot of labeled matching/non-matching entity pairs. Recently, large language models (LLMs), such as GPT-4, have shown the ability to perform many tasks without tuning model parameters. This ability is known as in-context learning (ICL), which facilitates effective learning from a few labeled input context demonstrations. However, existing ICL approaches to ER typically necessitate providing a task description and a set of demonstrations for each entity pair and thus have limitations on the monetary cost of interfacing LLMs. To address the problem, in this paper, we provide a comprehensive study to investigate how to develop a cost-effective batch prompting approach to ER. We introduce a framework BATCHER consisting of demonstration selection and question batching and explore different design choices that support batch prompting for ER. We also devise a covering-based demonstration selection strategy that achieves an effective balance between matching accuracy and monetary cost. We conduct a thorough evaluation to explore the design space and evaluate our proposed strategies. Through extensive experiments, we find that batch prompting is very cost-effective for ER, compared with not only PLM-based methods fine-tuned with extensive labeled data but also LLM-based methods with manually designed prompting. We also provide guidance for selecting appropriate design choices for batch prompting.
Submitted 6 December, 2023;
originally announced December 2023.
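The cost argument behind batch prompting can be sketched as follows: the task description and demonstrations are written once per batch instead of once per entity pair. This is a minimal sketch and not the BATCHER implementation; the function name, prompt layout, and sample records are all hypothetical.

```python
def build_batch_prompts(task, demonstrations, pairs, batch_size):
    """Group entity pairs into prompts that share one task description
    and one set of labeled demonstrations per batch."""
    prompts = []
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        lines = [task, "Examples:"]
        lines += [f"- {a} | {b} -> {label}" for a, b, label in demonstrations]
        lines.append("Questions:")
        lines += [f"Q{i + 1}: {a} | {b}" for i, (a, b) in enumerate(batch)]
        prompts.append("\n".join(lines))
    return prompts


# Hypothetical records, for illustration only.
task = "Decide whether each pair of records refers to the same entity."
demos = [("iPhone 13 128GB", "Apple iPhone 13 (128 GB)", "match")]
pairs = [(f"record A{i}", f"record B{i}") for i in range(5)]

prompts = build_batch_prompts(task, demos, pairs, batch_size=2)
print(len(prompts))  # 3 prompts instead of 5 single-pair prompts
```

With five pairs and a batch size of two, the shared header is sent three times rather than five, which is where the saving in prompt tokens would come from under these assumptions.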
-
A-JEPA: Joint-Embedding Predictive Architecture Can Listen
Authors:
Zhengcong Fei,
Mingyuan Fan,
Junshi Huang
Abstract:
This paper shows that the masked-modeling principle driving the success of large foundational vision models can be effectively applied to audio by making predictions in a latent space. We introduce Audio-based Joint-Embedding Predictive Architecture (A-JEPA), a simple extension method for self-supervised learning from the audio spectrum. Following the design of I-JEPA, our A-JEPA encodes visible audio spectrogram patches with a curriculum masking strategy via a context encoder, and predicts the representations of regions sampled at well-designed locations. The target representations of those regions are extracted by the exponential moving average of the context encoder, \emph{i.e.}, the target encoder, on the whole spectrogram. We find it beneficial to transfer random block masking into time-frequency aware masking in a curriculum manner, considering that audio spectrograms are highly correlated in local time and frequency. To enhance contextual semantic understanding and robustness, we fine-tune the encoder with a regularized masking on target datasets, instead of input dropping or zero. Empirically, when built with the Vision Transformer structure, we find A-JEPA to be highly scalable and to set new state-of-the-art performance on multiple audio and speech classification tasks, outperforming other recent models that use externally supervised pre-training.
Submitted 11 January, 2024; v1 submitted 27 November, 2023;
originally announced November 2023.
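The A-JEPA abstract above describes the target encoder as an exponential moving average (EMA) of the context encoder. A minimal sketch of such an update on plain lists of weights follows; the function name, the momentum values, and the toy weights are assumptions, not details taken from the paper.

```python
def ema_update(target, context, momentum=0.99):
    """Return target weights moved toward the context weights by one EMA step."""
    return [momentum * t + (1.0 - momentum) * c for t, c in zip(target, context)]


# Toy weights; a real encoder would hold tensors rather than two floats.
target_weights = [0.0, 0.0]
context_weights = [1.0, -2.0]

# A momentum of 0.5 is used here only to make the drift visible in three steps.
for _ in range(3):
    target_weights = ema_update(target_weights, context_weights, momentum=0.5)

print(target_weights)  # [0.875, -1.75]
```

In practice the momentum is usually close to 1, so the target weights change slowly relative to the context weights.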
-
OperARtistry: An AR-based Interactive Application to Assist the Learning of Chinese Traditional Opera (Xiqu) Makeup
Authors:
Zeyu Xiong,
Shihan Fu,
Mingming Fan
Abstract:
Chinese Traditional Opera (Xiqu) is an important type of intangible cultural heritage, and one key characteristic of Xiqu is its visual effects on the face achieved via makeup. However, the Xiqu makeup process, especially the eye-area makeup process, is complex and time-consuming, which poses a learning challenge for potential younger inheritors. We introduce OperARtistry, an interactive application based on Augmented Reality (AR) that offers in-situ Xiqu makeup guidance for beginners. Our application provides a step-by-step guide for Xiqu eye-area makeup, incorporating AR effects at each stage. Furthermore, we conducted an initial user study (n=6) to compare our approach with existing video-based tutorials to assess the effectiveness and usefulness of our approach. Our findings show that OperARtistry helped participants achieve high-quality eye-area makeup effects with less learning time.
Submitted 19 November, 2023;
originally announced November 2023.
-
Flatness-aware Adversarial Attack
Authors:
Mingyuan Fan,
Xiaodan Li,
Cen Chen,
Yinggui Wang
Abstract:
The transferability of adversarial examples can be exploited to launch black-box attacks. However, adversarial examples often present poor transferability. To alleviate this issue, by observing that the diversity of inputs can boost transferability, input regularization based methods are proposed, which craft adversarial examples by combining several transformed inputs. We reveal that input regularization based methods make resultant adversarial examples biased towards flat extreme regions. Inspired by this, we propose an attack called flatness-aware adversarial attack (FAA) which explicitly adds a flatness-aware regularization term in the optimization target to promote the resultant adversarial examples towards flat extreme regions. The flatness-aware regularization term involves gradients of samples around the resultant adversarial examples but optimizing gradients requires the evaluation of Hessian matrix in high-dimension spaces which generally is intractable. To address the problem, we derive an approximate solution to circumvent the construction of Hessian matrix, thereby making FAA practical and cheap. Extensive experiments show the transferability of adversarial examples crafted by FAA can be considerably boosted compared with state-of-the-art baselines.
Submitted 10 November, 2023;
originally announced November 2023.
-
CoPrompt: Supporting Prompt Sharing and Referring in Collaborative Natural Language Programming
Authors:
Li Feng,
Ryan Yen,
Yuzhe You,
Mingming Fan,
Jian Zhao,
Zhicong Lu
Abstract:
Natural language (NL) programming has become more approachable due to the powerful code-generation capability of large language models (LLMs). This shift to using NL to program enhances collaborative programming by reducing communication barriers and context-switching among programmers from varying backgrounds. However, programmers may face challenges during prompt engineering in a collaborative setting as they need to actively keep aware of their collaborators' progress and intents. In this paper, we aim to investigate ways to assist programmers' prompt engineering in a collaborative context. We first conducted a formative study to understand the workflows and challenges of programmers when using NL for collaborative programming. Based on our findings, we implemented a prototype, CoPrompt, to support collaborative prompt engineering by providing referring, requesting, sharing, and linking mechanisms. Our user study indicates that CoPrompt assists programmers in comprehending collaborators' prompts and building on their collaborators' work, reducing repetitive updates and communication costs.
Submitted 1 March, 2024; v1 submitted 13 October, 2023;
originally announced October 2023.
-
uxSense: Supporting User Experience Analysis with Visualization and Computer Vision
Authors:
Andrea Batch,
Yipeng Ji,
Mingming Fan,
Jian Zhao,
Niklas Elmqvist
Abstract:
Analyzing user behavior from usability evaluation can be a challenging and time-consuming task, especially as the number of participants and the scale and complexity of the evaluation grow. We propose uxSense, a visual analytics system using machine learning methods to extract user behavior from audio and video recordings as parallel time-stamped data streams. Our implementation draws on pattern recognition, computer vision, natural language processing, and machine learning to extract user sentiment, actions, posture, spoken words, and other features from such recordings. These streams are visualized as parallel timelines in a web-based front-end, enabling the researcher to search, filter, and annotate data across time and space. We present the results of a user study involving professional UX researchers evaluating user data using uxSense. In fact, we used uxSense itself to evaluate their sessions.
Submitted 11 October, 2023;
originally announced October 2023.
-
Proof Repair across Quotient Type Equivalences
Authors:
Cosmo Viola,
Max Fan,
Talia Ringer
Abstract:
Proofs in proof assistants like Coq can be brittle, breaking easily in response to changes in the terms and types those proofs depend on. To address this, recent work introduced an algorithm and tool in Coq to automatically repair broken proofs in response to changes that correspond to type equivalences. However, many changes remained out of the scope of this algorithm and tool -- especially changes in underlying behavior. We extend this proof repair algorithm so that it can express certain changes in behavior that were previously out of scope. We focus in particular on equivalences between quotient types -- types equipped with a relation that describes what it means for any two elements of that type to be equal. Quotient type equivalences can be used to express interesting changes in representations of mathematical structures, as well as changes in the underlying implementations of data structures -- two use cases highlighted by our case studies.
We extend this algorithm to support quotient type equivalences in two different ways: (1) internally to cubical type theory (applied to Cubical Agda), and (2) externally to CIC$_ω$ (applied to Coq). While our approach in Coq comes equipped with prototype automation, it suffers notably from Coq's lack of quotient types -- something we circumvent using Coq's setoid machinery and an extension to the proof repair algorithm to support the corresponding new proof obligations. In contrast, while our approach in Cubical Agda is completely manual, it takes advantage of cubical type theory's internal quotient types, which makes the algorithm straightforward. Furthermore, it includes the first internal proofs of correctness of repaired proofs, something not possible in general in Coq. We report on the tradeoffs between these two approaches, and demonstrate these tradeoffs on proof repair case studies for previously unsupported changes.
Submitted 18 March, 2024; v1 submitted 10 October, 2023;
originally announced October 2023.
-
DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies
Authors:
Shuaiwen Leon Song,
Bonnie Kruft,
Minjia Zhang,
Conglong Li,
Shiyang Chen,
Chengming Zhang,
Masahiro Tanaka,
Xiaoxia Wu,
Jeff Rasley,
Ammar Ahmad Awan,
Connor Holmes,
Martin Cai,
Adam Ghanem,
Zhongzhu Zhou,
Yuxiong He,
Pete Luferenko,
Divya Kumar,
Jonathan Weyn,
Ruixiong Zhang,
Sylwester Klocek,
Volodymyr Vragov,
Mohammed AlQuraishi,
Gustaf Ahdritz,
Christina Floristean,
Cristina Negri, et al. (67 additional authors not shown)
Abstract:
In the upcoming decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across sectors from drug development to renewable energy. To answer this call, we present the DeepSpeed4Science initiative (deepspeed4science.ai), which aims to build unique capabilities through AI system technology innovations to help domain experts unlock today's biggest science mysteries. By leveraging DeepSpeed's current technology pillars (training, inference and compression) as base technology enablers, DeepSpeed4Science will create a new set of AI system technologies tailored for accelerating scientific discoveries by addressing their unique complexity beyond the common technical approaches used for accelerating generic large language models (LLMs). In this paper, we showcase the early progress we made with DeepSpeed4Science in addressing two of the critical system challenges in structural biology research.
Submitted 11 October, 2023; v1 submitted 6 October, 2023;
originally announced October 2023.
-
Designing Loving-Kindness Meditation in Virtual Reality for Long-Distance Romantic Relationships
Authors:
Xian Wang,
Xiaoyu Mo,
Lik-Hang Lee,
Xiaoying Wei,
Xiaofu Jin,
Mingming Fan,
Pan Hui
Abstract:
Loving-kindness meditation (LKM) is used in clinical psychology for couples' relationship therapy, but physical isolation can strain the relationship and make LKM inaccessible. Virtual reality (VR) can provide immersive LKM activities for long-distance couples. However, no suitable commercial VR application exists for long-distance couples to engage in LKM activities. We organized a series of workshops with couples to build a prototype of a couple-preferred LKM app. Through analysis of participants' design works and semi-structured interviews, we derived design considerations for such VR apps and created a prototype for couples to experience. We conducted a study with couples to understand their experiences of performing LKM using the VR prototype and a traditional video conferencing tool. Results show that LKM sessions with either tool have a positive effect on the intimate relationship and that the VR prototype is the preferred tool for long-term use. We believe our experience can inform future researchers.
Submitted 21 September, 2023;
originally announced September 2023.
-
An Explicit Method for Fast Monocular Depth Recovery in Corridor Environments
Authors:
Yehao Liu,
Ruoyan Xia,
Xiaosu Xu,
Zijian Wang,
Yiqing Ya,
Mingze Fan
Abstract:
Monocular cameras are extensively employed in indoor robotics, but their performance is limited in visual odometry, depth estimation, and related applications due to the absence of scale information. Depth estimation refers to the process of estimating a dense depth map from the corresponding input image. Existing work mostly addresses this issue through deep learning-based approaches, yet their inference speed is slow, leading to poor real-time capabilities. To tackle this challenge, we propose an explicit method for rapid monocular depth recovery specifically designed for corridor environments, leveraging the principles of nonlinear optimization. We adopt the virtual camera assumption to make full use of the prior geometric features of the scene. The depth estimation problem is transformed into an optimization problem by minimizing the geometric residual. Furthermore, a novel depth plane construction technique is introduced to categorize spatial points based on their possible depths, facilitating swift depth estimation in enclosed structural scenarios, such as corridors. We also propose a new corridor dataset, named Corr\_EH\_z, which contains images of a variety of corridors captured by the UGV camera. An exhaustive set of experiments in different corridors reveals the efficacy of the proposed algorithm.
Submitted 13 September, 2023;
originally announced September 2023.
-
Exploring the Opportunities of AR for Enriching Storytelling with Family Photos between Grandparents and Grandchildren
Authors:
Zisu Li,
Li Feng,
Chen Liang,
Yuru Huang,
Mingming Fan
Abstract:
Storytelling with family photos, as an important mode of reminiscence-based activities, can be instrumental in promoting intergenerational communication between grandparents and grandchildren by strengthening generation bonds and shared family values. Motivated by challenges that existing technology approaches encountered for improving intergenerational storytelling (e.g., the need to hold the tablet, the potential view detachment from the physical world in Virtual Reality (VR)), we sought to find new ways of using Augmented Reality (AR) to support intergenerational storytelling, which offers new capabilities (e.g., 3D models, new interactivity) to enhance the expression for the storyteller. We conducted a two-part exploratory study, where pairs of grandparents and grandchildren 1) participated in an in-person storytelling activity with a semi-structured interview 2) and then a participatory design session with AR technology probes that we designed to inspire their exploration. Our findings revealed insights into the possible ways of intergenerational storytelling, the feasibility and usages of AR in facilitating it, and the key design implications for leveraging AR in intergenerational storytelling.
Submitted 7 September, 2023;
originally announced September 2023.
-
Occ$^2$Net: Robust Image Matching Based on 3D Occupancy Estimation for Occluded Regions
Authors:
Miao Fan,
Mingrui Chen,
Chen Hu,
Shuchang Zhou
Abstract:
Image matching is a fundamental and critical task in various visual applications, such as Simultaneous Localization and Map** (SLAM) and image retrieval, which require accurate pose estimation. However, most existing methods ignore the occlusion relations between objects caused by camera motion and scene structure. In this paper, we propose Occ$^2$Net, a novel image matching method that models occlusion relations using 3D occupancy and infers matching points in occluded regions. Thanks to the inductive bias encoded in the Occupancy Estimation (OE) module, it greatly simplifies bootstrap** of a multi-view consistent 3D representation that can then integrate information from multiple views. Together with an Occlusion-Aware (OA) module, it incorporates attention layers and rotation alignment to enable matching between occluded and visible points. We evaluate our method on both real-world and simulated datasets and demonstrate its superior performance over state-of-the-art methods on several metrics, especially in occlusion scenarios.
Submitted 14 August, 2023;
originally announced August 2023.
-
An Efficient Early-breaking Estimation and Tree-splitting Missing RFID Tag Identification Protocol
Authors:
Lijuan Zhang,
Mingqiu Fan,
Chunni Yu,
Lei Lei
Abstract:
Recent statistics have demonstrated that missing items have become the main cause of loss for retailers in inventory management. To quickly identify missing tags, traditional protocols adopt Aloha-based strategies, which take a long time, especially when the number of tags is large. Moreover, few works have considered the effect of unexpected unknown tags on the missing tag identification process. In the presence of unknown tags, some missing tags may be falsely identified as present, so the system's reliability can hardly be guaranteed. In this work, we propose an efficient early-breaking estimation and tree-splitting-based missing tag identification (ETMTI) protocol for large-scale RFID systems. In ETMTI, a new early-breaking estimation and deactivation method is developed to effectively estimate the number of unknown tags and deactivate them within a short time. Next, a new tree-splitting-based missing tag identification method is proposed to quickly identify missing tags with a B-ary splitting tree. In addition, a bit-tracking response strategy is designed to further reduce the time cost. The optimal parameters, time cost, and false negative rate of ETMTI are analyzed theoretically. Simulation results demonstrate that the proposed ETMTI protocol takes less time and has a lower false negative rate than the best-performing benchmarks.
Submitted 16 August, 2023;
originally announced August 2023.
-
Proximal Policy Optimization Actual Combat: Manipulating Output Tokenizer Length
Authors:
Miao Fan,
Chen Hu,
Shuchang Zhou
Abstract:
The Reinforcement Learning from Human Feedback (RLHF) plays a pivotal role in shaping the impact of large language models (LLMs), contributing significantly to controlling output toxicity and selecting output styles, particularly as LLMs often harbor misleading content, highlighting the urgency to align them with human values for secure AI systems. The RLHF, characterized by complexity, instability, and sensitivity to hyperparameters, makes the evaluation of the reward model for complex tasks challenging, thereby further complicating the use of Proximal Policy Optimization (PPO). In this paper, we introduce a simple task designed to employ Gloden as a reward model that validates the effectiveness of PPO and inspires it, primarily explaining the task of utilizing PPO to manipulate the tokenizer length of the output generated by the model. Experiments confirm that PPO is not only effective in manipulating the output tokenizer length to a certain extent in this type of task but also exhibits facilitated training once the influence of the reward model effect is excluded, making it an exciting development.
Submitted 10 August, 2023;
originally announced August 2023.
-
On the Trustworthiness Landscape of State-of-the-art Generative Models: A Survey and Outlook
Authors:
Mingyuan Fan,
Chengyu Wang,
Cen Chen,
Yang Liu,
Jun Huang
Abstract:
Diffusion models and large language models have emerged as leading-edge generative models, revolutionizing various aspects of human life. However, the practical implementations of these models have also exposed inherent risks, bringing to the forefront their evil sides and sparking concerns regarding their trustworthiness. Despite the wealth of literature on this subject, a comprehensive survey specifically delving into the intersection of large-scale generative models and their trustworthiness remains largely absent. To bridge this gap, this paper investigates both the long-standing and emerging threats associated with these models across four fundamental dimensions: 1) privacy, 2) security, 3) fairness, and 4) responsibility. Based on the investigation results, we develop an extensive map outlining the trustworthiness of large generative models. After that, we provide practical recommendations and potential research directions for future secure applications equipped with large generative models, ultimately promoting the trustworthiness of the models and benefiting the society as a whole.
Submitted 7 December, 2023; v1 submitted 31 July, 2023;
originally announced July 2023.
-
On the Robustness of Split Learning against Adversarial Attacks
Authors:
Mingyuan Fan,
Cen Chen,
Chengyu Wang,
Wenmeng Zhou,
Jun Huang
Abstract:
Split learning enables collaborative deep learning model training while preserving data privacy and model security by avoiding direct sharing of raw data and model details (i.e., server and clients only hold partial sub-networks and exchange intermediate computations). However, existing research has mainly focused on examining its reliability for privacy protection, with little investigation into model security. Specifically, by exploring full models, attackers can launch adversarial attacks, and split learning can mitigate this severe threat by only disclosing part of models to untrusted servers. This paper aims to evaluate the robustness of split learning against adversarial attacks, particularly in the most challenging setting where untrusted servers only have access to the intermediate layers of the model. Existing adversarial attacks mostly focus on the centralized setting instead of the collaborative setting; thus, to better evaluate the robustness of split learning, we develop a tailored attack called SPADV, which comprises two stages: 1) shadow model training that addresses the issue of lacking part of the model and 2) local adversarial attack that produces adversarial examples to evaluate. The first stage only requires a few unlabeled non-IID data, and, in the second stage, SPADV perturbs the intermediate output of natural samples to craft the adversarial ones. The overall cost of the proposed attack process is relatively low, yet the empirical attack effectiveness is significantly high, demonstrating the surprising vulnerability of split learning to adversarial attacks.
Submitted 17 July, 2023; v1 submitted 15 July, 2023;
originally announced July 2023.
-
VERTICES: Efficient Two-Party Vertical Federated Linear Model with TTP-aided Secret Sharing
Authors:
Mingxuan Fan,
Yilun Jin,
Liu Yang,
Zhenghang Ren,
Kai Chen
Abstract:
Vertical Federated Learning (VFL) has emerged as one of the most predominant approaches for secure collaborative machine learning where the training data is partitioned by features among multiple parties. Most VFL algorithms primarily rely on two fundamental privacy-preserving techniques: Homomorphic Encryption (HE) and secure Multi-Party Computation (MPC). Though generally considered with stronger privacy guarantees, existing general-purpose MPC frameworks suffer from expensive computation and communication overhead and are inefficient especially under VFL settings. This study centers around MPC-based VFL algorithms and presents a novel approach for two-party vertical federated linear models via an efficient secret sharing (SS) scheme with a trusted coordinator. Our approach can achieve significant acceleration of the training procedure in vertical federated linear models of between 2.5x and 6.6x than other existing MPC frameworks under the same security setting.
Submitted 28 June, 2023;
originally announced June 2023.
-
Gradient-Free Textual Inversion
Authors:
Zhengcong Fei,
Mingyuan Fan,
Junshi Huang
Abstract:
Recent works on personalized text-to-image generation usually learn to bind a special token with specific subjects or styles of a few given images by tuning its embedding through gradient descent. It is natural to question whether we can optimize the textual inversions by only accessing the process of model inference. Requiring only forward computation to determine the textual inversion retains the benefits of less GPU memory, simple deployment, and secure access for scalable models. In this paper, we introduce a gradient-free framework to optimize the continuous textual inversion in an iterative evolutionary strategy. Specifically, we first initialize an appropriate token embedding for textual inversion with the consideration of visual and text vocabulary information. Then, we decompose the optimization of evolutionary strategy into dimension reduction of searching space and non-convex gradient-free optimization in subspace, which significantly accelerates the optimization process with negligible performance loss. Experiments in several applications demonstrate that the performance of text-to-image model equipped with our proposed gradient-free method is comparable to that of gradient-based counterparts with variant GPU/CPU platforms, flexible employment, as well as computational efficiency.
Submitted 12 April, 2023;
originally announced April 2023.
-
Enabling Voice-Accompanying Hand-to-Face Gesture Recognition with Cross-Device Sensing
Authors:
Zisu Li,
Cheng Liang,
Yuntao Wang,
Yue Qin,
Chun Yu,
Yukang Yan,
Mingming Fan,
Yuanchun Shi
Abstract:
Gestures performed accompanying the voice are essential for voice interaction to convey complementary semantics for interaction purposes such as wake-up state and input modality. In this paper, we investigated voice-accompanying hand-to-face (VAHF) gestures for voice interaction. We targeted hand-to-face gestures because such gestures relate closely to speech and yield significant acoustic features (e.g., impeding voice propagation). We conducted a user study to explore the design space of VAHF gestures, where we first gathered candidate gestures and then applied a structural analysis to them in different dimensions (e.g., contact position and type), outputting a total of 8 VAHF gestures with good usability and least confusion. To facilitate VAHF gesture recognition, we proposed a novel cross-device sensing method that leverages heterogeneous channels (vocal, ultrasound, and IMU) of data from commodity devices (earbuds, watches, and rings). Our recognition model achieved an accuracy of 97.3% for recognizing 3 gestures and 91.5% for recognizing 8 gestures, excluding the "empty" gesture, proving the high applicability. Quantitative analysis also sheds light on the recognition capability of each sensor channel and their different combinations. In the end, we illustrated the feasible use cases and their design principles to demonstrate the applicability of our system in various scenarios.
Submitted 18 March, 2023;
originally announced March 2023.
-
Collaboration with Conversational AI Assistants for UX Evaluation: Questions and How to Ask them (Voice vs. Text)
Authors:
Emily Kuang,
Ehsan Jahangirzadeh Soure,
Mingming Fan,
Jian Zhao,
Kristen Shinohara
Abstract:
AI is promising in assisting UX evaluators with analyzing usability tests, but its judgments are typically presented as non-interactive visualizations. Evaluators may have questions about test recordings, but have no way of asking them. Interactive conversational assistants provide a Q&A dynamic that may improve analysis efficiency and evaluator autonomy. To understand the full range of analysis-related questions, we conducted a Wizard-of-Oz design probe study with 20 participants who interacted with simulated AI assistants via text or voice. We found that participants asked for five categories of information: user actions, user mental model, help from the AI assistant, product and task information, and user demographics. Those who used the text assistant asked more questions, but the question lengths were similar. The text assistant was perceived as significantly more efficient, but both were rated equally in satisfaction and trust. We also provide design considerations for future conversational AI assistants for UX evaluation.
Submitted 6 March, 2023;
originally announced March 2023.
-
Enhancing Older Adults' Gesture Typing Experience Using the T9 Keyboard on Small Touchscreen Devices
Authors:
Emily Kuang,
Ruihuan Chen,
Mingming Fan
Abstract:
Older adults increasingly adopt small-screen devices, but limited motor dexterity hinders their ability to type effectively. While a 9-key (T9) keyboard allocates larger space to each key, it is shared by multiple consecutive letters. Consequently, users must interrupt their gestures when typing consecutive letters, leading to inefficiencies and poor user experience. Thus, we proposed a novel keyboard that leverages the currently unused key 1 to duplicate letters from the previous key, allowing the entry of consecutive letters without interruptions. A user study with 12 older adults showed that it significantly outperformed the T9 with wiggle gesture in typing speed, KSPC, insertion errors, and deletes per word while achieving comparable performance as the conventional T9. Repeating the typing tasks with 12 young adults found that the advantages of the novel T9 were consistent or enhanced. We also provide error analysis and design considerations for improving gesture typing on T9 for older adults.
Submitted 6 March, 2023;
originally announced March 2023.
-
Bridging the Generational Gap: Exploring How Virtual Reality Supports Remote Communication Between Grandparents and Grandchildren
Authors:
Xiaoying Wei,
Yizheng Gu,
Emily Kuang,
Xian Wang,
Beiyan Cao,
Xiaofu Jin,
Mingming Fan
Abstract:
When living apart, grandparents and grandchildren often use audio-visual communication approaches to stay connected. However, these approaches seldom provide sufficient companionship and intimacy due to a lack of co-presence and spatial interaction, which can be fulfilled by immersive virtual reality (VR). To understand how grandparents and grandchildren might leverage VR to facilitate their remote communication and better inform future design, we conducted a user-centered participatory design study with twelve pairs of grandparents and grandchildren. Results show that VR affords casual and equal communication by reducing the generational gap, and promotes conversation by offering shared activities as bridges for connection. Participants preferred resemblant appearances on avatars for conveying well-being but created ideal selves for gaining playfulness. Based on the results, we contribute eight design implications that inform future VR-based grandparent-grandchild communications.
Submitted 28 February, 2023;
originally announced February 2023.
-
Sparkling Silence: Practices and Challenges of Livestreaming Among Deaf or Hard of Hearing Streamers
Authors:
Beiyan Cao,
Changyang He,
Muzhi Zhou,
Mingming Fan
Abstract:
Understanding livestream platforms' accessibility challenges for minority groups, such as people with disabilities, is critical to increasing the diversity and inclusion of those platforms. While prior work investigated the experiences of streamers with vision or motor loss, little is known about the experiences of deaf or hard of hearing (DHH) streamers who must work with livestreaming platforms that heavily depend on audio. We conducted semi-structured interviews with DHH streamers to learn why they livestream, how they navigate livestream platforms and related challenges. Our findings revealed their desire to break the stereotypes towards the DHH groups via livestream and the intense interplay between interaction methods, such as sign language, texts, lip language, background music, and viewer characteristics. Major accessibility challenges include the lack of real-time captioning, the small sign language reading window, and misinterpretation of sign language. We present design considerations for improving the accessibility of the livestream platforms.
Submitted 28 February, 2023;
originally announced February 2023.
-
Do as You Say: Consistency Detection of Data Practice in Program Code and Privacy Policy in Mini-App
Authors:
Yin Wang,
Ming Fan,
Junfeng Liu,
Junjie Tao,
Wuxia Jin,
Qi Xiong,
Yuhao Liu,
Qinghua Zheng,
Ting Liu
Abstract:
Mini-app is an emerging form of mobile application that combines web technology with native capabilities. Its features, e.g., no need to download and no installation, have made it popular rapidly. However, privacy issues that violate the laws or regulations are breeding in the swiftly expanding mini-app ecosystem. The consistency between what the mini-app does about the data in the program code and what it declares in its privacy policy description is important. But no work has systematically investigated the privacy problem of the mini-app before. In this paper, to our best knowledge, we are the first to conduct the compliance detection of data practice and policy description in mini-apps. We first customize a taint analysis method based on data entity dependency network to adapt to the characteristics of the JavaScript language in the mini-apps. Then, we transform data types and data operations to data practices in program codes and privacy policies, so as to finish a fine-grained consistency matching model. We crawl 100,000 mini-apps on WeChat client in the wild and extract 2,998 with a privacy policy. Among them, only 318 meet the consistency requirements, 2,680 are inconsistent, and the proportion of inconsistencies is as high as 89.4%. The inconsistency in the mini-app is very serious. Based on 6 real-world cases analyzed, in order to reduce this potential data leakage risk, we suggest that the developer should reduce the collection of irrelevant information and the straightforward use of templates, and the platform should provide data flow detection tools and privacy policy writing support.
Submitted 27 February, 2023;
originally announced February 2023.
-
CoPracTter: Toward Integrating Personalized Practice Scenarios, Timely Feedback and Social Support into An Online Support Tool for Coping with Stuttering in China
Authors:
Feng Li,
Zeyu Xiong,
Xinyi Li,
Mingming Fan
Abstract:
Stuttering is a speech disorder influencing over 70 million people worldwide, including 13 million in China. It causes low self-esteem among other detrimental effects on people who stutter (PwS). Although prior work has explored approaches to assist PwS, they primarily focused on western contexts. In our formative study, we found unique practices and challenges among Chinese PwS. We then iteratively designed an online tool, CoPracTter, to support Chinese PwS practicing speaking fluency with 1) targeted stress-inducing practice scenarios, 2) real-time speech indicators, and 3) personalized timely feedback from the community. We further conducted a seven-day deployment study (N=11) to understand how participants utilized these key features. To our knowledge, it is the first time such a prototype was designed and tested for a long time with multiple PwS participants online simultaneously. Results indicate that personalized practice with targeted scenarios and timely feedback from a supportive community assisted PwS in speaking fluently, staying positive, and facing similar real-life circumstances.
Submitted 21 February, 2023;
originally announced February 2023.
-
"I am the follower, also the boss": Exploring Different Levels of Autonomy and Machine Forms of Guiding Robots for the Visually Impaired
Authors:
Yan Zhang,
Ziang Li,
Haole Guo,
Luyao Wang,
Qihe Chen,
Wenjie Jiang,
Mingming Fan,
Guyue Zhou,
Jiangtao Gong
Abstract:
Guiding robots, in the form of canes or cars, have recently been explored to assist blind and low vision (BLV) people. Such robots can provide full or partial autonomy when guiding. However, the pros and cons of different forms and autonomy for guiding robots remain unknown. We sought to fill this gap. We designed autonomy-switchable guiding robotic cane and car. We conducted a controlled lab-study (N=12) and a field study (N=9) on BLV. Results showed that full autonomy received better walking performance and subjective ratings in the controlled study, whereas participants used more partial autonomy in the natural environment as demanding more control. Besides, the car robot has demonstrated abilities to provide a higher sense of safety and navigation efficiency compared with the cane robot. Our findings offered empirical evidence about how the BLV community perceived different machine forms and autonomy, which can inform the design of assistive robots.
Submitted 7 February, 2023;
originally announced February 2023.
-
BDMMT: Backdoor Sample Detection for Language Models through Model Mutation Testing
Authors:
Jiali Wei,
Ming Fan,
Wenjing Jiao,
Wuxia Jin,
Ting Liu
Abstract:
Deep neural networks (DNNs) and natural language processing (NLP) systems have developed rapidly and have been widely used in various real-world fields. However, they have been shown to be vulnerable to backdoor attacks. Specifically, the adversary injects a backdoor into the model during the training phase, so that input samples with backdoor triggers are classified as the target class. Some attacks have achieved high attack success rates on the pre-trained language models (LMs), but there have yet to be effective defense methods. In this work, we propose a defense method based on deep model mutation testing. Our main justification is that backdoor samples are much more robust than clean samples if we impose random mutations on the LMs and that backdoors are generalizable. We first confirm the effectiveness of model mutation testing in detecting backdoor samples and select the most appropriate mutation operators. We then systematically defend against three extensively studied backdoor attack levels (i.e., char-level, word-level, and sentence-level) by detecting backdoor samples. We also make the first attempt to defend against the latest style-level backdoor attacks. We evaluate our approach on three benchmark datasets (i.e., IMDB, Yelp, and AG news) and three style transfer datasets (i.e., SST-2, Hate-speech, and AG news). The extensive experimental results demonstrate that our approach can detect backdoor samples more efficiently and accurately than the three state-of-the-art defense approaches.
Submitted 25 January, 2023;
originally announced January 2023.