Search | arXiv e-print repository

arXiv:2406.19481 [pdf, other]

The Galois-equivariant $K$-theory of finite fields

Abstract: We compute the $RO(G)$-graded equivariant algebraic $K$-groups of a finite field with an action by its Galois group $G$. Specifically, we show these $K$-groups split as the sum of an explicitly computable term and the well-studied $RO(G)$-graded coefficient groups of the equivariant Eilenberg--MacLane spectrum $H\underline{\mathbb Z}$. Our comparison between the equivariant $K$-theory spectrum and $H\underline{\mathbb Z}$ further shows they share the same Tate spectra and geometric fixed point spectra. In the case where $G$ has prime order, we provide an explicit presentation of the equivariant $K$-groups.

Submitted 27 June, 2024; originally announced June 2024.

Comments: Comments welcome!

MSC Class: 19D50 (Primary); 55P91 (Secondary)

arXiv:2405.16695 [pdf, other]

Oscillations in neuronal activity: a neuron-centered spatiotemporal model of the Unfolded Protein Response in prion diseases

Authors: Elliot M. Miller, Tat Chung D. Chan, Carlos Montes-Matamoros, Omar Sharif, Laurent Pujo-Menjouet, Michael R. Lindstrom

Abstract: Many neurodegenerative diseases (NDs) are characterized by the slow spatial spread of toxic protein species in the brain. The toxic proteins can induce neuronal stress, triggering the Unfolded Protein Response (UPR), which slows or stops protein translation and can indirectly reduce the toxic load. However, the UPR may also trigger processes leading to apoptotic cell death and the UPR is implicated in the progression of several NDs. In this paper, we develop a novel mathematical model to describe the spatiotemporal dynamics of the UPR mechanism for prion diseases. Our model is centered around a single neuron, with representative proteins P (healthy) and S (toxic) interacting with heterodimer dynamics (S interacts with P to form two S's). The model takes the form of a coupled system of nonlinear reaction-diffusion equations with a delayed, nonlinear flux for P (delay from the UPR). Through the delay, we find parameter regimes that exhibit oscillations in the P- and S-protein levels. We find that oscillations are more pronounced when the S-clearance rate and S-diffusivity are small in comparison to the P-clearance rate and P-diffusivity, respectively. The oscillations become more pronounced as delays in initiating the UPR increase. We also consider quasi-realistic clinical parameters to understand how possible drug therapies can alter the course of a prion disease. We find that decreasing the production of P, decreasing the recruitment rate, increasing the diffusivity of S, increasing the UPR S-threshold, and increasing the S clearance rate appear to be the most powerful modifications to reduce the mean UPR intensity and potentially moderate the disease progression.

Submitted 26 May, 2024; originally announced May 2024.

Comments: 35 pages, 11 tables, 13 figures

arXiv:2405.15113 [pdf, other]

A Wearable Resistance Devices Motor Learning Effects in Exercise

Authors: Eugenio Frias-Miranda, Hong-Anh Nguyen, Jeremy Hampton, Trenner Jones, Benjamin Spotts, Matthew Cochran, Deva Chan, Laura H Blumenschein

Abstract: The integration of technology into exercise regimens has emerged as a strategy to enhance normal human capabilities and return human motor function after injury or illness by enhancing motor learning and retention. Much research has focused on how active devices, whether confined to a lab or made into a wearable format, can apply forces at set times and conditions to optimize the process of learning. However, the focus on active force production often forces devices to either be confined to simple movements or interventions. As such, in this paper, we investigate how passive device behaviors can contribute to the process of motor learning by themselves. Our approach involves using a wearable resistance (WR) device, which is outfitted with elastic bands, to apply a force field that changes in response to a person's movements while performing exercises. We develop a method to measure the produced forces from the device without impeding the function and we characterize the device's force generation abilities. We then present a study assessing the impact of the WR device on motor learning of proper squat form compared to visual or no feedback. Biometrics such as knee and hip angles were used to monitor and assess subject performance. Our findings indicate that the force fields produced while training with the WR device can improve performance in full-body exercises similarly to a more direct visual feedback mechanism, though the improvement is not consistent across all performance metrics. Through our research, we contribute important insights into the application of passive wearable resistance technology in practical exercise settings.

Submitted 23 May, 2024; originally announced May 2024.

Comments: 8 pages, 9 figures, To be published in IEEE International Conference on Biomedical Robotics and Biomechatronics (BioRob) 2024

arXiv:2405.08272 [pdf, other]

VS-Assistant: Versatile Surgery Assistant on the Demand of Surgeons

Authors: Zhen Chen, Xingjian Luo, **lin Wu, Danny T. M. Chan, Zhen Lei, **qiao Wang, Sebastien Ourselin, Hongbin Liu

Abstract: The surgical intervention is crucial to patient healthcare, and many studies have developed advanced algorithms to provide understanding and decision-making assistance for surgeons. Despite great progress, these algorithms are developed for a single specific task and scenario, and in practice require the manual combination of different functions, thus limiting the applicability. Thus, an intelligent and versatile surgical assistant is expected to accurately understand the surgeon's intentions and accordingly conduct the specific tasks to support the surgical process. In this work, by leveraging advanced multimodal large language models (MLLMs), we propose a Versatile Surgery Assistant (VS-Assistant) that can accurately understand the surgeon's intention and complete a series of surgical understanding tasks, e.g., surgical scene analysis, surgical instrument detection, and segmentation on demand. Specifically, to achieve superior surgical multimodal understanding, we devise a mixture of projectors (MOP) module to align the surgical MLLM in VS-Assistant to balance the natural and surgical knowledge. Moreover, we devise a surgical Function-Calling Tuning strategy to enable the VS-Assistant to understand surgical intentions, and thus make a series of surgical function calls on demand to meet the needs of the surgeons. Extensive experiments on neurosurgery data confirm that our VS-Assistant can understand the surgeon's intention more accurately than the existing MLLM, resulting in overwhelming performance in textual analysis and visual tasks. Source code and models will be made public.

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2404.05696 [pdf]

BOLD v4: A Centralized Bioinformatics Platform for DNA-based Biodiversity Data

Authors: Sujeevan Ratnasingham, Catherine Wei, Dean Chan, Jireh Agda, Josh Agda, Liliana Ballesteros-Mejia, Hamza Ait Boutou, Zak Mohammad El Bastami, Eddie Ma, Ramya Manjunath, Dana Rea, Chris Ho, Angela Telfer, Jaclyn McKeowan, Miduna Rahulan, Claudia Steinke, Justin Dorsheimer, Megan Milton, Paul D. N. Hebert

Abstract: BOLD, the Barcode of Life Data System, supports the acquisition, storage, validation, analysis, and publication of DNA barcodes, activities requiring the integration of molecular, morphological, and distributional data. Its pivotal role in curating the reference library of DNA barcodes, coupled with its data management and analysis capabilities, make it a central resource for biodiversity science. It enables rapid, accurate identification of specimens and also reveals patterns of genetic diversity and evolutionary relationships among taxa. Launched in 2005, BOLD has become an increasingly powerful tool for advancing understanding of planetary biodiversity. It currently hosts 17 million specimen records and 14 million barcodes that provide coverage for more than a million species from every continent and ocean. The platform has the long-term goal of providing a consistent, accurate system for identifying all species of eukaryotes. BOLD's integrated analytical tools, full data lifecycle support, and secure collaboration framework distinguish it from other biodiversity platforms. BOLD v4 brought enhanced data management and analysis capabilities as well as novel functionality for data dissemination and publication. Its next version will include features to strengthen its utility to the research community, governments, industry, and society-at-large.

Submitted 5 May, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

arXiv:2404.02904 [pdf, other]

ALOHa: A New Measure for Hallucination in Captioning Models

Authors: Suzanne Petryk, David M. Chan, Anish Kachinthaya, Haodi Zou, John Canny, Joseph E. Gonzalez, Trevor Darrell

Abstract: Despite recent advances in multimodal pre-training for visual description, state-of-the-art models still produce captions containing errors, such as hallucinating objects not present in a scene. The existing prominent metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms. In this work, we propose a modernized open-vocabulary metric, ALOHa, which leverages large language models (LLMs) to measure object hallucinations. Specifically, we use an LLM to extract groundable objects from a candidate caption, measure their semantic similarity to reference objects from captions and object detections, and use Hungarian matching to produce a final hallucination score. We show that ALOHa correctly identifies 13.6% more hallucinated objects than CHAIR on HAT, a new gold-standard subset of MS COCO Captions annotated for hallucinations, and 30.8% more on nocaps, where objects extend beyond MS COCO categories. Our code is available at https://davidmchan.github.io/aloha/.

Submitted 3 April, 2024; originally announced April 2024.

Comments: To appear at NAACL 2024

arXiv:2403.19822 [pdf, other]

Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition

Authors: Yash Jain, David Chan, Pranav Dheram, Aparna Khare, Olabanji Shonibare, Venkatesh Ravichandran, Shalini Ghosh

Abstract: Recent advances in machine learning have demonstrated that multi-modal pre-training can improve automatic speech recognition (ASR) performance compared to randomly initialized models, even when models are fine-tuned on uni-modal tasks. Existing multi-modal pre-training methods for the ASR task have primarily focused on single-stage pre-training where a single unsupervised task is used for pre-training followed by fine-tuning on the downstream task. In this work, we introduce a novel method combining multi-modal and multi-task unsupervised pre-training with a translation-based supervised mid-training approach. We empirically demonstrate that such a multi-stage approach leads to relative word error rate (WER) improvements of up to 38.45% over baselines on both Librispeech and SUPERB. Additionally, we share several important findings for choosing pre-training methods and datasets.

Submitted 28 March, 2024; originally announced March 2024.

Comments: Accepted in LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation

arXiv:2402.11590 [pdf, other]

Designing interactive data visualizations representing recovery progress for patients after stroke

Authors: Alicia Ouskine, Adrian D. C. Chan, Fateme Rajabiyazdi

Abstract: Stroke is one of the leading causes of disability worldwide. The efficacy of recovery is determined by a variety of factors, including patient adherence to rehabilitation programs. One way to increase patient adherence to their rehabilitation program is to show patients their progress that is visualized in a simple and intuitive way. We begin to gather preliminary information on Functional Capacity, Motor Function, and Mood/cognition from occupational Therapists at the Bruyere Hospital to gain a better understanding of how stroke recovery data is collected within in-patient stroke rehabilitation centers. The future aim is to design, develop, and evaluate a data visualization tool representing progress made by patients recovering from stroke.

Submitted 18 February, 2024; originally announced February 2024.

Comments: 2 pages

arXiv:2402.09679 [pdf, other]

Design and Visual Servoing Control of a Hybrid Dual-Segment Flexible Neurosurgical Robot for Intraventricular Biopsy

Authors: Jian Chen, Mingcong Chen, Qingxiang Zhao, Shuai Wang, Yihe Wang, Ying Xiao, Jian Hu, Danny Tat Ming Chan, Kam Tong Leo Yeung, David Yuen Chung Chan, Hongbin Liu

Abstract: Traditional rigid endoscopes have challenges in flexibly treating tumors located deep in the brain, and low operability and fixed viewing angles limit its development. This study introduces a novel dual-segment flexible robotic endoscope MicroNeuro, designed to perform biopsies with dexterous surgical manipulation deep in the brain. Taking into account the uncertainty of the control model, an image-based visual servoing with online robot Jacobian estimation has been implemented to enhance motion accuracy. Furthermore, the application of model predictive control with constraints significantly bolsters the flexible robot's ability to adaptively track mobile objects and resist external interference. Experimental results underscore that the proposed control system enhances motion stability and precision. Phantom testing substantiates its considerable potential for deployment in neurosurgery.

Submitted 23 February, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

Comments: Accepted by IEEE International Conference on Robotics and Automation (ICRA) 2024, 7 pages, 9 figures

arXiv:2402.08205 [pdf, other]

TurtleRabbit 2024 SSL Team Description Paper

Authors: Linh Trinh, Alif Anzuman, Eric Batkhuu, Dychen Chan, Lisa Graf, Darpan Gurung, Tharunimm Jamal, Jigme Namgyal, Jason Ng, Wing Lam Tsang, X. Rosalind Wang, Eren Yilmaz, Oliver Obst

Abstract: TurtleRabbit is a new RoboCup SSL team from Western Sydney University. This team description paper presents our approach in navigating some of the challenges in developing a new SSL team from scratch. SSL is dominated by teams with extensive experience and customised equipment that has been developed over many years. Here, we outline our approach in overcoming some of the complexities associated with replicating advanced open-sourced designs and managing the high costs of custom components. Opting for simplicity and cost-effectiveness, our strategy primarily employs off-the-shelf electronics components and "hobby" brushless direct current (BLDC) motors, complemented by 3D printing and CNC milling. This approach helped us to streamline the development process and, with our open-sourced hardware design, hopefully will also lower the bar for other teams to enter RoboCup SSL in the future. The paper details the specific hardware choices, their approximate costs, the integration of electronics and mechanics, and the initial steps taken in software development, for our entry into SSL that aims to be simple yet competitive.

Submitted 12 February, 2024; originally announced February 2024.

Comments: Submitted paper as part of the qualification for RoboCup 2024

arXiv:2401.14798 [pdf, ps, other]

Degenerations of orbifold curves as noncommutative varieties

Authors: Tarig Abdelgadir, Daniel Chan, Shinnosuke Okawa, Kazushi Ueda

Abstract: Boundary points on the moduli space of pointed curves corresponding to collisions of marked points have modular interpretations as degenerate curves. In this paper, we study degenerations of orbifold projective curves corresponding to collisions of stacky points from the point of view of noncommutative algebraic geometry.

Submitted 26 January, 2024; originally announced January 2024.

Comments: 18 pages

arXiv:2401.14797 [pdf, ps, other]

A compact moduli of orbifold projective curves

Authors: Tarig Abdelgadir, Daniel Chan, Shinnosuke Okawa, Kazushi Ueda

Abstract: We introduce the notion of stable orbifold projective curves, and show that the moduli stack of stable orbifold projective curves is isomorphic to the moduli stack of weighted pointed stable curves in the sense of Hassett with respect to the weights determined by the automorphism groups of the stacky points.

Submitted 26 January, 2024; originally announced January 2024.

Comments: 11 pages

arXiv:2401.05314 [pdf, other]

ANIM-400K: A Large-Scale Dataset for Automated End-To-End Dubbing of Video

Authors: Kevin Cai, Chonghua Liu, David M. Chan

Abstract: The Internet's wealth of content, with up to 60% published in English, starkly contrasts the global population, where only 18.8% are English speakers, and just 5.1% consider it their native language, leading to disparities in online information access. Unfortunately, automated processes for dubbing of video - replacing the audio track of a video with a translated alternative - remains a complex and challenging task due to pipelines, necessitating precise timing, facial movement synchronization, and prosody matching. While end-to-end dubbing offers a solution, data scarcity continues to impede the progress of both end-to-end and pipeline-based methods. In this work, we introduce Anim-400K, a comprehensive dataset of over 425K aligned animated video segments in Japanese and English supporting various video-related tasks, including automated dubbing, simultaneous translation, guided video summarization, and genre/theme/style classification. Our dataset is made publicly available for research purposes at https://github.com/davidmchan/Anim400K.

Submitted 10 January, 2024; originally announced January 2024.

Comments: To appear in ICASSP 2024

arXiv:2401.03384 [pdf, other]

conv_einsum: A Framework for Representation and Fast Evaluation of Multilinear Operations in Convolutional Tensorial Neural Networks

Authors: Tahseen Rabbani, Jiahao Su, Xiaoyu Liu, David Chan, Geoffrey Sangston, Furong Huang

Abstract: Modern ConvNets continue to achieve state-of-the-art results over a vast array of vision and image classification tasks, but at the cost of increasing parameters. One strategy for compactifying a network without sacrificing much expressive power is to reshape it into a tensorial neural network (TNN), which is a higher-order tensorization of its layers, followed by a factorization, such as a CP-decomposition, which strips a weight down to its critical basis components. Passes through TNNs can be represented as sequences of multilinear operations (MLOs), where the evaluation path can greatly affect the number of floating point operations (FLOPs) incurred. While functions such as the popular einsum can evaluate simple MLOs such as contractions, existing implementations cannot process multi-way convolutions, resulting in scant assessments of how optimal evaluation paths through tensorized convolutional layers can improve training speed. In this paper, we develop a unifying framework for representing tensorial convolution layers as einsum-like strings and a meta-algorithm conv_einsum which is able to evaluate these strings in a FLOPs-minimizing manner. Comprehensive experiments, using our open-source implementation, over a wide range of models, tensor decompositions, and diverse tasks, demonstrate that conv_einsum significantly increases both computational and memory-efficiency of convolutional TNNs.

Submitted 6 January, 2024; originally announced January 2024.

arXiv:2401.02417 [pdf, other]

Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition

Authors: David M. Chan, Shalini Ghosh, Hitesh Tulsiani, Ariya Rastrow, Björn Hoffmeister

Abstract: While word error rates of automatic speech recognition (ASR) systems have consistently fallen, natural language understanding (NLU) applications built on top of ASR systems still attribute significant numbers of failures to low-quality speech recognition results. Existing assistant systems collect large numbers of these unsuccessful interactions, but these systems usually fail to learn from these interactions, even in an offline fashion. In this work, we introduce CLC: Contrastive Learning for Conversations, a family of methods for contrastive fine-tuning of models in a self-supervised fashion, making use of easily detectable artifacts in unsuccessful conversations with assistants. We demonstrate that our CLC family of approaches can improve the performance of ASR models on OD3, a new public large-scale semi-synthetic meta-dataset of audio task-oriented dialogues, by up to 19.2%. These gains transfer to real-world systems as well, where we show that CLC can help to improve performance by up to 6.7% over baselines. We make OD3 publicly available at https://github.com/amazon-science/amazon-od3.

Submitted 4 January, 2024; originally announced January 2024.

Comments: To appear in ICASSP 2024

arXiv:2312.14378 [pdf, other]

Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification

Authors: Anirudh S. Sundar, Chao-Han Huck Yang, David M. Chan, Shalini Ghosh, Venkatesh Ravichandran, Phani Sankar Nidadavolu

Abstract: Training large foundation models using self-supervised objectives on unlabeled data, followed by fine-tuning on downstream tasks, has emerged as a standard procedure. Unfortunately, the efficacy of this approach is often constrained by both limited fine-tuning compute and scarcity in labeled downstream data. We introduce Multimodal Attention Merging (MAM), an attempt that facilitates direct knowledge transfer from attention matrices of models rooted in high resource modalities, text and images, to those in resource-constrained domains, speech and audio, employing a zero-shot paradigm. MAM reduces the relative Word Error Rate (WER) of an Automatic Speech Recognition (ASR) model by up to 6.70%, and relative classification error of an Audio Event Classification (AEC) model by 10.63%. In cases where some data/compute is available, we present Learnable-MAM, a data-driven approach to merging attention matrices, resulting in a further 2.90% relative reduction in WER for ASR and 18.42% relative reduction in AEC compared to fine-tuning.

Submitted 9 February, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

Comments: 5 pages, 1 figure, ICASSP 2024 Workshop on Self-supervision in Audio, Speech and Beyond

arXiv:2312.08366 [pdf, other]

See, Say, and Segment: Teaching LMMs to Overcome False Premises

Authors: Tsung-Han Wu, Giscard Biamby, David Chan, Lisa Dunlap, Ritwik Gupta, Xudong Wang, Joseph E. Gonzalez, Trevor Darrell

Abstract: Current open-source Large Multimodal Models (LMMs) excel at tasks such as open-vocabulary language grounding and segmentation but can suffer under false premises when queries imply the existence of something that is not actually present in the image. We observe that existing methods that fine-tune an LMM to segment images significantly degrade their ability to reliably determine ("see") if an object is present and to interact naturally with humans ("say"), a form of catastrophic forgetting. In this work, we propose a cascading and joint training approach for LMMs to solve this task, avoiding catastrophic forgetting of previous skills. Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not, proposing alternative queries or correcting semantic errors in the query, and finally "segment" by outputting the mask of the desired objects if they exist. Additionally, we introduce a novel False Premise Correction benchmark dataset, an extension of existing RefCOCO(+/g) referring segmentation datasets (which we call FP-RefCOCO(+/g)). The results show that our method not only detects false premises up to 55% better than existing approaches, but under false premise conditions produces relative cIOU improvements of more than 31% over baselines, and produces natural language feedback judged helpful up to 67% of the time.

Submitted 13 December, 2023; originally announced December 2023.

Comments: Project Page: https://see-say-segment.github.io

arXiv:2312.04705 [pdf, ps, other]

Equivariant algebraic $K$-theory of symmetric monoidal Mackey functors

Authors: Maxine Calle, David Chan, Maximilien Péroux

Abstract: We provide a unifying approach to different constructions of the algebraic $K$-theory of equivariant symmetric monoidal categories. A consequence of our work is that every connective genuine $G$-spectrum is equivalent to the equivariant algebraic $K$-theory of categorical Mackey functors of Bohmann-Osorno.

Submitted 7 December, 2023; originally announced December 2023.

Comments: 20 pages. Comments welcome!

MSC Class: 55P91 (Primary) 19D23; 18N60; 18N10; 55P42; 18F25 (Secondary)

arXiv:2310.12971 [pdf, other]

CLAIR: Evaluating Image Captions with Large Language Models

Authors: David Chan, Suzanne Petryk, Joseph E. Gonzalez, Trevor Darrell, John Canny

Abstract: The evaluation of machine-generated image captions poses an interesting yet persistent challenge. Effective evaluation measures must consider numerous dimensions of similarity, including semantic relevance, visual structure, object interactions, caption diversity, and specificity. Existing highly-engineered measures attempt to capture specific aspects, but fall short in providing a holistic score that aligns closely with human judgments. Here, we propose CLAIR, a novel method that leverages the zero-shot language modeling capabilities of large language models (LLMs) to evaluate candidate captions. In our evaluations, CLAIR demonstrates a stronger correlation with human judgments of caption quality compared to existing measures. Notably, on Flickr8K-Expert, CLAIR achieves relative correlation improvements over SPICE of 39.6% and over image-augmented methods such as RefCLIP-S of 18.3%. Moreover, CLAIR provides noisily interpretable results by allowing the language model to identify the underlying reasoning behind its assigned score. Code is available at https://davidmchan.github.io/clair/

Submitted 19 October, 2023; originally announced October 2023.

Comments: To Appear at EMNLP 2023

arXiv:2309.08025 [pdf, ps, other]

A linearization map for genuine equivariant algebraic $K$-theory

Authors: Maxine Calle, David Chan, Andres Mejia

Abstract: We introduce a version of algebraic $K$-theory for coefficient systems of rings which is valued in genuine $G$-spectra for a finite group $G$. We use this construction to build a genuine $G$-spectrum $K_G(\underline{\mathbb{Z}[π_1(X)]})$ associated to a $G$-space $X$, which provides a home for equivariant versions of classical invariants like the Wall finiteness obstruction. We provide a comparison between our $K$-theory spectrum and the equivariant $A$-theory of Malkiewich--Merling via a genuine equivariant linearization map.

Submitted 14 September, 2023; originally announced September 2023.

Comments: Comments welcome!

MSC Class: 55P91 (Primary) 19D10; 19L47 (Secondary)

arXiv:2308.06593 [pdf, other]

The effect of host population heterogeneity on epidemic outbreaks

Authors: Martin Bootsma, Danny Chan, Odo Diekmann, Hisashi Inaba

Abstract: In the first part of this paper, we review old and new results about the influence of host population heterogeneity on (various characteristics of) epidemic outbreaks. In the second part we highlight a modelling issue that so far has received little attention: how do contact patterns, and hence transmission opportunities, depend on the size and the composition of the host population? Without any claim on completeness, we offer a range of potential (quasi-mechanistic) submodels. The overall aim of the paper is to describe the state-of-the-art and to catalyse new work.

Submitted 16 January, 2024; v1 submitted 12 August, 2023; originally announced August 2023.

Comments: 36 pages, 5 figures

arXiv:2307.16749 [pdf, other]

Separable mixing: the general formulation and a particular example focusing on mask efficiency

Authors: M. C. J. Bootsma, K. M. D. Chan, O. Diekmann, H. Inaba

Abstract: The aim of this short note is twofold. We formulate the general Kermack-McKendrick epidemic model incorporating static heterogeneity and show how it simplifies to a scalar Renewal Equation (RE) when separable mixing is assumed. A key feature is that all information about the heterogeneity is encoded in one nonlinear real valued function of a real variable. Inspired by work of R. Pastor-Satorras and C. Castellano, we next investigate mask efficiency and demonstrate that it is straightforward to rederive from the RE their main conclusion, that the best way to protect the population as a whole is to protect yourself. Thus we establish that this conclusion is robust, in the sense that it also holds outside the world of network models.

Submitted 31 July, 2023; originally announced July 2023.

arXiv:2306.06776 [pdf, other]

Comment on "Matter-wave interferometry with helium atoms in low-$l$ Rydberg states"

Authors: D. Z. Chan, J. D. D. Martin

Abstract: Tommey and Hogan [Phys. Rev. A, 104, 033305 (2021)] have reported a matter-wave interference experiment using Rydberg atoms traveling through inhomogeneous electric fields at approximately 2000 m/s. Using a simplified model containing the essential physics of their experiment, we show that the phase difference measured by their observed interference fringes does not depend -- in any significant way -- on the acceleration of the Rydberg atoms, but instead simply on the uniform motion of the atoms through the inhomogeneous electric field.

Submitted 7 August, 2023; v1 submitted 11 June, 2023; originally announced June 2023.

Comments: 4 pages, 1 figure

arXiv:2304.02080 [pdf, other]

doi 10.1109/WACVW58289.2023.00043

Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data

Authors: Vladislav Lialin, Stephen Rawls, David Chan, Shalini Ghosh, Anna Rumshisky, Wael Hamza

Abstract: Scaling up weakly-supervised datasets has been shown to be highly effective in the image-text domain and has contributed to most of the recent state-of-the-art computer vision and multimodal neural networks. However, existing large-scale video-text datasets and mining techniques suffer from several limitations, such as the scarcity of aligned data, the lack of diversity in the data, and the difficulty of collecting aligned data. The currently popular video-text data mining approach via automatic speech recognition (ASR), used in HowTo100M, provides low-quality captions that often do not refer to the video content. Other mining approaches do not provide proper language descriptions (video tags) and are biased toward short clips (alt text). In this work, we show how recent advances in image captioning allow us to pre-train high-quality video models without any parallel video-text data. We pre-train several video captioning models that are based on an OPT language model and a TimeSformer visual backbone. We fine-tune these networks on several video captioning datasets. First, we demonstrate that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions. Second, we show that pre-training on both images and videos produces a significantly better network (+4 CIDER on MSR-VTT) than pre-training on a single modality. Our methods are complementary to the existing pre-training or data mining approaches and can be used in a variety of settings. Given the efficacy of the pseudolabeling method, we are planning to publicly release the generated captions.

Submitted 4 April, 2023; originally announced April 2023.

Journal ref: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)

arXiv:2302.08949 [pdf, ps, other]

Equivariant Trees and Partition Complexes

Authors: Julia E. Bergner, Peter Bonventre, Maxine E. Calle, David Chan, Maru Sarazola

Abstract: We introduce two definitions of $G$-equivariant partitions of a finite $G$-set, both of which yield $G$-equivariant partition complexes. By considering suitable notions of equivariant trees, we show that $G$-equivariant partitions and $G$-trees are $G$-homotopy equivalent, generalizing existing results for the non-equivariant setting. Along the way, we develop equivariant versions of Quillen's Theorems A and B, which are of independent interest.

Submitted 17 February, 2023; originally announced February 2023.

Comments: 36 pages. Comments welcome!

MSC Class: 55P91; 05A18; 20E08; 05E18

arXiv:2302.01328 [pdf, other]

IC3: Image Captioning by Committee Consensus

Authors: David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, John Canny

Abstract: If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to generate a single "best" (most like a reference) image caption. Unfortunately, doing so encourages captions that are "informationally impoverished," and focus on only a subset of the possible details, while ignoring other potentially useful information in the scene. In this work, we introduce a simple, yet novel, method: "Image Captioning by Committee Consensus" (IC3), designed to generate a single caption that captures high-level details from several annotator viewpoints. Humans rate captions produced by IC3 at least as helpful as baseline SOTA models more than two thirds of the time, and IC3 can improve the performance of SOTA automated recall systems by up to 84%, outperforming single human-generated reference captions, and indicating significant improvements over SOTA approaches for visual description. Code is available at https://davidmchan.github.io/caption-by-committee/

Submitted 19 October, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

Comments: To Appear at EMNLP 2023

arXiv:2301.02736 [pdf, other]

Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition

Authors: David M. Chan, Shalini Ghosh, Ariya Rastrow, Björn Hoffmeister

Abstract: Despite improvements to the generalization performance of automated speech recognition (ASR) models, specializing ASR models for downstream tasks remains a challenging task, primarily due to reduced data availability (necessitating increased data collection), and rapidly shifting data distributions (requiring more frequent model fine-tuning). In this work, we investigate the potential of leveraging external knowledge, particularly through off-policy key-value stores generated with text-to-speech methods, to allow for flexible post-training adaptation to new data distributions. In our approach, audio embeddings captured from text-to-speech, along with semantic text embeddings, are used to bias ASR via an approximate k-nearest-neighbor (KNN) based attentive fusion step. Our experiments on LibriSpeech and in-house voice assistant/search datasets show that the proposed approach can reduce domain adaptation time by up to 1K GPU-hours while providing up to 3% WER improvement compared to a fine-tuning baseline, suggesting a promising approach for adapting production ASR systems in challenging zero and few-shot scenarios.

Submitted 6 January, 2023; originally announced January 2023.

arXiv:2211.12465 [pdf, ps, other]

Noncommutative linear systems and noncommutative elliptic curves

Authors: Daniel Chan, Adam Nyman

Abstract: In this paper we introduce a noncommutative analogue of the notion of linear system, which we call a helix $\underline{\mathcal{L}} := (\mathcal{L}_{i})_{i \in \mathbb{Z}}$ in an abelian category ${\sf C}$ over a quadratic $\mathbb{Z}$-indexed algebra $A$. We show that, under natural hypotheses, a helix induces a morphism of noncommutative spaces from ${\sf Proj }\operatorname{End}(\underline{\mathcal{L}})$ to ${\sf Proj }A$. We construct examples of helices of vector bundles on elliptic curves generalizing the elliptic helices of line bundles constructed by Bondal-Polishchuk, where $A$ is the quadratic part of $B:= \operatorname{End}(\underline{\mathcal{L}})$. In this case, we identify $B$ as the quotient of the Koszul algebra $A$ by a normal family of regular elements of degree 3, and show that ${\sf Proj }B$ is a noncommutative elliptic curve in the sense of Polishchuk. One interprets this as embedding the noncommutative elliptic curve as a cubic divisor in some noncommutative projective plane, hence generalizing some well-known results of Artin-Tate-Van den Bergh.

Submitted 22 November, 2022; originally announced November 2022.

Comments: 37 pages

MSC Class: 14A22

arXiv:2211.05948 [pdf, ps, other]

Terminal orders on arithmetic surfaces

Authors: Daniel Chan, Colin Ingalls

Abstract: The local structure of terminal Brauer classes on arithmetic surfaces was classified in [CI21], generalising the classification on geometric surfaces carried out in [CI05]. Part of the interest in these classifications is that they enable the minimal model program to be applied to the noncommutative setting of orders on surfaces. In this paper, we give etale local structure theorems for terminal orders on arithmetic surfaces, at least when the degree is a prime p > 5. This generalises the structure theorem given in the geometric case. They can all be explicitly constructed as algebras of matrices over symbols. From this description one sees that such terminal orders all have global dimension two, thus generalising the fact that terminal (commutative) surfaces are smooth and hence homologically regular.

Submitted 10 November, 2022; originally announced November 2022.

MSC Class: 16H10; 16S38

arXiv:2209.07518 [pdf, other]

Distribution Aware Metrics for Conditional Natural Language Generation

Authors: David M Chan, Yiming Ni, David A Ross, Sudheendra Vijayanarasimhan, Austin Myers, John Canny

Abstract: Traditional automated metrics for evaluating conditional natural language generation use pairwise comparisons between a single generated text and the best-matching gold-standard ground truth text. When multiple ground truths are available, scores are aggregated using an average or max operation across references. While this approach works well when diversity in the ground truth data (i.e. dispersion of the distribution of conditional texts) can be ascribed to noise, such as in automated speech recognition, it does not allow for robust evaluation in the case where diversity in the ground truths represents signal for the model. In this work we argue that existing metrics are not appropriate for domains such as visual description or summarization where ground truths are semantically diverse, and where the diversity in those captions captures useful additional information about the context. We propose a novel paradigm for multi-candidate evaluation of conditional language generation models, and a new family of metrics that compare the distributions of reference and model-generated caption sets using small sample sets of each. We demonstrate the utility of our approach with a case study in visual description: where we show that existing models optimize for single-description quality over diversity, and gain some insights into how sampling methods and temperature impact description quality and diversity.

Submitted 29 September, 2022; v1 submitted 15 September, 2022; originally announced September 2022.

arXiv:2208.05555 [pdf, ps, other]

doi 10.2140/tunis.2024.6.1

Bi-incomplete Tambara functors as $\mathcal{O}$-commutative monoids

Authors: David Chan

Abstract: Tambara functors are an equivariant generalization of rings that appear as the homotopy groups of genuine equivariant commutative ring spectra. In recent work, Blumberg and Hill have studied the corresponding algebraic structures, called bi-incomplete Tambara functors, that arise from ring spectra indexed on incomplete $G$-universes. In this paper, we answer a conjecture of Blumberg and Hill by proving a generalization of the Hoyer--Mazur theorem in the bi-incomplete setting. Bi-incomplete Tambara functors are characterized by indexing categories which parameterize incomplete systems of norms and transfers. In the course of our work, we develop several new tools for studying these indexing categories. In particular, we provide an easily checked, combinatorial characterization of when two indexing categories are compatible in the sense of Blumberg and Hill.

Submitted 10 August, 2022; originally announced August 2022.

Comments: 36 pages

MSC Class: 55P91 (Primary) 55N91; 18M05 (Secondary)

Journal ref: Tunisian J. Math. 6 (2024) 1-47

arXiv:2207.08024 [pdf, other]

LAVA: Language Audio Vision Alignment for Contrastive Video Pre-Training

Authors: Sumanth Gurram, Andy Fang, David Chan, John Canny

Abstract: Generating representations of video data is of key importance in advancing the field of machine perception. Most current techniques rely on hand-annotated data, which can be difficult to work with, expensive to generate, and hard to scale. In this work, we propose a novel learning approach based on contrastive learning, LAVA, which is capable of learning joint language, audio, and video representations in a self-supervised manner. We pre-train LAVA on the Kinetics 700 dataset using transformer encoders to learn representations for each modality. We then demonstrate that LAVA performs competitively with the current state-of-the-art self-supervised and weakly-supervised pretraining techniques on UCF-101 and HMDB-51 video action recognition while using a fraction of the unlabeled data.

Submitted 16 July, 2022; originally announced July 2022.

Comments: Workshop Paper at ICML 2022

arXiv:2207.04737 [pdf, other]

A versatile stochastic dissemination model

Authors: K. M. D. Chan, M. R. H. Mandjes

Abstract: This paper considers a highly general dissemination model that keeps track of the stochastic evolution of the distribution of wealth over a set of agents. There are two types of events: (i) units of wealth externally arrive, and (ii) units of wealth are redistributed among the agents, while throughout Markov modulation is allowed. We derive a system of coupled differential equations describing the joint transient distribution of the agents' wealth values, which translate into linear differential equations when considering the corresponding means and (co-)variances. While our model uses the (economic) terminology of wealth being distributed over agents, we illustrate through a series of examples that it can be used considerably more broadly. Indeed, it also facilitates the analysis of the spread of opinions over a population (thus generalizing existing opinion dynamics models), and the analysis of the dynamics of a file storage system (thus allowing the assessment of the efficacy of storage policies).

Submitted 11 July, 2022; originally announced July 2022.

Comments: 26 pages, 7 figures

MSC Class: 60Gxx; 92D25; 68M20

arXiv:2206.08353 [pdf, other]

Towards Understanding How Machines Can Learn Causal Overhypotheses

Authors: Eliza Kosoy, David M. Chan, Adrian Liu, Jasmine Collins, Bryanna Kaufmann, Sandy Han Huang, Jessica B. Hamrick, John Canny, Nan Rosemary Ke, Alison Gopnik

Abstract: Recent work in machine learning and cognitive science has suggested that understanding causal information is essential to the development of intelligence. The extensive literature in cognitive science using the "blicket detector" environment shows that children are adept at many kinds of causal inference and learning. We propose to adapt that environment for machine learning agents. One of the key challenges for current machine learning algorithms is modeling and understanding causal overhypotheses: transferable abstract hypotheses about sets of causal relationships. In contrast, even young children spontaneously learn and use causal overhypotheses. In this work, we present a new benchmark -- a flexible environment which allows for the evaluation of existing techniques under variable causal overhypotheses -- and demonstrate that many existing state-of-the-art methods have trouble generalizing in this environment. The code and resources for this benchmark are available at https://github.com/CannyLab/casual_overhypotheses.

Submitted 16 June, 2022; originally announced June 2022.

arXiv:2205.09872 [pdf, other]

Content-Context Factorized Representations for Automated Speech Recognition

Authors: David M. Chan, Shalini Ghosh

Abstract: Deep neural networks have largely demonstrated their ability to perform automated speech recognition (ASR) by extracting meaningful features from input audio frames. Such features, however, may consist not only of information about the spoken language content, but also may contain information about unnecessary contexts such as background noise and sounds or speaker identity, accent, or protected attributes. Such information can directly harm generalization performance, by introducing spurious correlations between the spoken words and the context in which such words were spoken. In this work, we introduce an unsupervised, encoder-agnostic method for factoring speech-encoder representations into explicit content-encoding representations and spurious context-encoding representations. By doing so, we demonstrate improved performance on standard ASR benchmarks, as well as improved performance in both real-world and artificially noisy ASR scenarios.

Submitted 15 September, 2022; v1 submitted 19 May, 2022; originally announced May 2022.

Comments: Presented at Interspeech 2022 (On-Site Oral Presentation)

arXiv:2205.06253 [pdf, other]

What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics

Authors: David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, Bryan Seybold, John F. Canny

Abstract: While there have been significant gains in the field of automated video description, the generalization performance of automated description models to novel domains remains a major barrier to using these systems in the real world. Most visual description methods are known to capture and exploit patterns in the training data leading to evaluation metric increases, but what are those patterns? In this work, we examine several popular visual description datasets, and capture, analyze, and understand the dataset-specific linguistic patterns that models exploit but do not generalize to new domains. At the token level, sample level, and dataset level, we find that caption diversity is a major driving factor behind the generation of generic and uninformative captions. We further show that state-of-the-art models even outperform held-out ground truth captions on modern metrics, and that this effect is an artifact of linguistic diversity in datasets. Understanding this linguistic diversity is key to building strong captioning models; we recommend several methods and approaches for maintaining diversity in the collection of new data, and dealing with the consequences of limited diversity when using current models and metrics.

Submitted 12 January, 2023; v1 submitted 12 May, 2022; originally announced May 2022.

Comments: The 1st Workshop on Vision Datasets Understanding, IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2022

arXiv:2202.10430 [pdf, other]

Learning Causal Overhypotheses through Exploration in Children and Computational Models

Authors: Eliza Kosoy, Adrian Liu, Jasmine Collins, David M Chan, Jessica B Hamrick, Nan Rosemary Ke, Sandy H Huang, Bryanna Kaufmann, John Canny, Alison Gopnik

Abstract: Despite recent progress in reinforcement learning (RL), RL algorithms for exploration still remain an active area of research. Existing methods often focus on state-based metrics, which do not consider the underlying causal structures of the environment, and while recent research has begun to explore RL environments for causal learning, these environments primarily leverage causal information through causal inference or induction rather than exploration. In contrast, human children - some of the most proficient explorers - have been shown to use causal information to great benefit. In this work, we introduce a novel RL environment designed with a controllable causal structure, which allows us to evaluate exploration strategies used by both agents and children in a unified environment. In addition, through experimentation on both computation models and children, we demonstrate that there are significant differences between information-gain optimal RL exploration in causal environments and the exploration of children in the same environments. We conclude with a discussion of how these findings may inspire new directions of research into efficient exploration and disambiguation of causal structures for RL algorithms.

Submitted 21 February, 2022; originally announced February 2022.

arXiv:2202.07706

Misinformation Detection in Social Media Video Posts

Authors: Kehan Wang, David Chan, Seth Z. Zhao, John Canny, Avideh Zakhor

Abstract: With the growing adoption of short-form video by social media platforms, reducing the spread of misinformation through video posts has become a critical challenge for social media providers. In this paper, we develop methods to detect misinformation in social media posts, exploiting modalities such as video and text. Due to the lack of large-scale public data for misinformation detection in multi-modal datasets, we collect 160,000 video posts from Twitter, and leverage self-supervised learning to learn expressive representations of joint visual and textual data. In this work, we propose two new methods for detecting semantic inconsistencies within short-form social media video posts, based on contrastive learning and masked language modeling. We demonstrate that our new approaches outperform current state-of-the-art methods on both artificial data generated by random-swapping of positive samples and in the wild on a new manually-labeled test set for semantic misinformation.

Submitted 30 July, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

Comments: We discovered an error in our dataset construction where retweets were not properly filtered. This resulted in test data leakage in training data, and the results reported are affected

arXiv:2110.09890 [pdf, other]

Multi-Modal Pre-Training for Automated Speech Recognition

Authors: David M. Chan, Shalini Ghosh, Debmalya Chakrabarty, Björn Hoffmeister

Abstract: Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance. Unfortunately, approaches relying on such hyper-local information tend to be vulnerable to both local-level corruption (such as audio-frame drops, or loud noises) and global-level noise (such as environmental noise, or background noise) that has not been seen during training. In this work, we introduce a novel approach which leverages a self-supervised learning technique based on masked language modeling to compute a global, multi-modal encoding of the environment in which the utterance occurs. We then use a new deep-fusion framework to integrate this global context into a traditional ASR method, and demonstrate that the resulting method can outperform baseline methods by up to 7% on Librispeech; gains on internal datasets range from 6% (on larger models) to 45% (on smaller models).

Submitted 15 September, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

Comments: Presented at ICASSP 2022

arXiv:2110.03588 [pdf]

A transformer-based deep learning approach for classifying brain metastases into primary organ sites using clinical whole brain MRI

Authors: Qing Lyu, Sanjeev V. Namjoshi, Emory McTyre, Umit Topaloglu, Richard Barcus, Michael D. Chan, Christina K. Cramer, Waldemar Debinski, Metin N. Gurcan, Glenn J. Lesser, Hui-Kuan Lin, Reginald F. Munden, Boris C. Pasche, Kiran Kumar Solingapuram Sai, Roy E. Strowd, Stephen B. Tatter, Kounosuke Watabe, Wei Zhang, Ge Wang, Christopher T. Whitlow

Abstract: Treatment decisions for brain metastatic disease rely on knowledge of the primary organ site, and are currently made with biopsy and histology. Here we develop a novel deep learning approach for accurate non-invasive digital histology with whole-brain MRI data. Our IRB-approved single-site retrospective study was comprised of patients (n=1,399) referred for MRI treatment-planning and gamma knife radiosurgery over 21 years. Contrast-enhanced T1-weighted and T2-weighted Fluid-Attenuated Inversion Recovery brain MRI exams (n=1,582) were preprocessed and input to the proposed deep learning workflow for tumor segmentation, modality transfer, and primary site classification into one of five classes. Ten-fold cross-validation generated overall AUC of 0.878 (95%CI:0.873,0.883), lung class AUC of 0.889 (95%CI:0.883,0.895), breast class AUC of 0.873 (95%CI:0.860,0.886), melanoma class AUC of 0.852 (95%CI:0.842,0.862), renal class AUC of 0.830 (95%CI:0.809,0.851), and other class AUC of 0.822 (95%CI:0.805,0.839). These data establish that whole-brain imaging features are discriminative to allow accurate diagnosis of the primary organ site of malignancy. Our end-to-end deep radiomic approach has great potential for classifying metastatic tumor types from whole-brain MRI images. Further refinement may offer an invaluable clinical tool to expedite primary cancer site identification for precision treatment and improved outcomes.

Submitted 20 April, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

arXiv:2108.13947 [pdf, other]

doi 10.6339/21-JDS1033

Decision Tree-Based Predictive Models for Academic Achievement Using College Students' Support Networks

Authors: Anthony Frazier, Joethi Silva, Rachel Meilak, Indranil Sahoo, David Chan, Michael Broda

Abstract: In this study, we examine a set of primary data collected from 484 students enrolled in a large public university in the Mid-Atlantic United States region during the early stages of the COVID-19 pandemic. The data, called Ties data, included students' demographic and support network information. The support network data comprised information that highlighted the type of support (i.e., emotional or educational; routine or intense). Using this data set, models for predicting students' academic achievement, quantified by their self-reported GPA, were created using Chi-Square Automatic Interaction Detection (CHAID), a decision tree algorithm, and cforest, a random forest algorithm that uses conditional inference trees. We compare the methods' accuracy and variation in the set of important variables suggested by each algorithm. Each algorithm found different variables important for different student demographics with some overlap. For White students, different types of educational support were important in predicting academic achievement, while for non-White students, different types of emotional support were important in predicting academic achievement. The presence of differing types of routine support was important in predicting academic achievement for cisgender women, while differing types of intense support were important in predicting academic achievement for cisgender men.

Submitted 12 September, 2022; v1 submitted 31 August, 2021; originally announced August 2021.

arXiv:2108.03105 [pdf, ps, other]

The minimal model program for arithmetic surfaces enriched by a Brauer class

Authors: Daniel Chan, Colin Ingalls

Abstract: We examine the noncommutative minimal model program for orders on arithmetic surfaces, or equivalently, arithmetic surfaces enriched by a Brauer class $β$. When $β$ has prime index $p>5$, we show the classical theory extends with analogues of existence of terminal resolutions, Castelnuovo contraction and Zariski factorisation. We also classify $β$-terminal surfaces and Castelnuovo contractions, and discover new unexpected behaviour.

Submitted 6 August, 2021; originally announced August 2021.

MSC Class: 14E30; 14G40

arXiv:2108.01651 [pdf, ps, other]

An Impossibility Result on Strong Linearizability in Message-Passing Systems

Authors: David Yu Cheng Chan, Vassos Hadzilacos, Xing Hu, Sam Toueg

Abstract: We prove that in asynchronous message-passing systems where at most one process may crash, there is no lock-free strongly linearizable implementation of a weak object that we call Test-or-Set (ToS). This object allows a single distinguished process to apply the set operation once, and a different distinguished process to apply the test operation also once. Since this weak object can be directly implemented by a single-writer single-reader (SWSR) register (and other common objects such as max-register, snapshot and counter), this result implies that there is no $1$-resilient lock-free strongly linearizable implementation of a SWSR register (and of these other objects) in message-passing systems. We also prove that there is no $1$-resilient lock-free \emph{write} strongly-linearizable implementation of a 2-writer 1-reader (2W1R) register in asynchronous message-passing systems.

Submitted 9 August, 2021; v1 submitted 3 August, 2021; originally announced August 2021.

Comments: 12 pages

arXiv:2106.03185 [pdf, ps, other]

Tight Lower Bounds for the RMR Complexity of Recoverable Mutual Exclusion

Authors: David Yu Cheng Chan, Philipp Woelfel

Abstract: We present a tight RMR complexity lower bound for the recoverable mutual exclusion (RME) problem, defined by Golab and Ramaraju \cite{GR2019a}. In particular, we show that any $n$-process RME algorithm using only atomic read, write, fetch-and-store, fetch-and-increment, and compare-and-swap operations, has an RMR complexity of $Ω(\log n/\log\log n)$ on the CC and DSM model. This lower bound covers all realistic synchronization primitives that have been used in RME algorithms and matches the best upper bounds of algorithms employing swap objects (e.g., [5,6,10]). Algorithms with better RMR complexity than that have only been obtained by either (i) assuming that all failures are system-wide [7], (ii) employing fetch-and-add objects of size $(\log n)^{ω(1)}$ [12], or (iii) using artificially defined synchronization primitives that are not available in actual systems [6,9].

Submitted 6 June, 2021; originally announced June 2021.

Comments: 36 pages, 0 figures

arXiv:2105.10880 [pdf, other]

RtFPS: An Interactive Map that Visualizes and Predicts Wildfires in the US

Authors: Yang Li, Hermawan Mulyono, Ying Chen, Zhiyin Lu, Desmond Chan

Abstract: Climate change has largely impacted our daily lives. As one of its consequences, we are experiencing more wildfires. In the year 2020, wildfires burned a record number of 8,888,297 acres in the US. To awaken people's attention to climate change, and to visualize the current risk of wildfires, we developed RtFPS, "Real-Time Fire Prediction System". It provides a real-time prediction visualization of wildfire risk at specific locations based on a Machine Learning model. It also provides interactive map features that show the historical wildfire events with environmental info.

Submitted 21 June, 2021; v1 submitted 23 May, 2021; originally announced May 2021.

Comments: Source code: https://github.com/yangland/rtfps

MSC Class: 68U05; 68T30 ACM Class: J.2.5; H.4.0; I.5.1

arXiv:2104.04139 [pdf, other]

A relative approach to opinion formation

Authors: Kit Ming Danny Chan, Robert Duivenvoorden, Andreas Flache, Michel Mandjes

Abstract: Formal models of opinion formation commonly represent an individual's opinion by a value on a fixed opinion interval. We propose an alternative modeling method wherein interpretation is only provided to the relative positions of opinions vis-à-vis each other. This method is then considered in a similar setting as the discrete-time Altafini model (an extension of the well-known DeGroot model), but with more general influence weights. Even in a linear framework, the model can describe, in the long run, polarization, dynamics with a periodic pattern, and (modulus) consensus formation. In addition, in our alternative approach key characteristics of the opinion dynamic can be derived from real-valued square matrices of influence weights, which immediately allows one to transfer matrix theory insights to the field of opinion formation dynamics under more relaxed conditions than in the DeGroot or discrete-time Altafini models. A few specific themes are covered: (i) We demonstrate how stable patterns in relative opinion dynamics are identified which are hidden when opinions are considered in an absolute opinion framework. (ii) For the two-agent case, we provide an exhaustive closed-form description of the relative opinion model's dynamic in the long run. (iii) We explore group dynamics analytically, in particular providing a non-trivial condition under which a subgroup's asymptotic behavior carries over to the entire population.

Submitted 2 February, 2022; v1 submitted 30 March, 2021; originally announced April 2021.

Comments: 40 pages

arXiv:2104.01263 [pdf, other]

A Semantic Segmentation Network for Urban-Scale Building Footprint Extraction Using RGB Satellite Imagery

Authors: Aatif Jiwani, Shubhrakanti Ganguly, Chao Ding, Nan Zhou, David M. Chan

Abstract: Urban areas consume over two-thirds of the world's energy and account for more than 70 percent of global CO2 emissions. As stated in IPCC's Global Warming of 1.5°C report, achieving carbon neutrality by 2050 requires a clear understanding of urban geometry. High-quality building footprint generation from satellite images can accelerate this predictive process and empower municipal decision-making at scale. However, previous Deep Learning-based approaches face consequential issues such as scale invariance and defective footprints, partly due to ever-present class-wise imbalance. Additionally, most approaches require supplemental data such as point cloud data, building height information, and multi-band imagery - which have limited availability and are tedious to produce. In this paper, we propose a modified DeeplabV3+ module with a Dilated Res-Net backbone to generate masks of building footprints from three-channel RGB satellite imagery only. Furthermore, we introduce an F-Beta measure in our objective function to help the model account for skewed class distributions and prevent false-positive footprints. In addition to F-Beta, we incorporate an exponentially weighted boundary loss and use a cross-dataset training strategy to further increase the quality of predictions. As a result, we achieve state-of-the-art performances across three public benchmarks and demonstrate that our RGB-only method produces higher quality visual results and is agnostic to the scale, resolution, and urban density of satellite imagery.

Submitted 18 November, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

Comments: 11 pages, 5 figures. Code available at https://github.com/aatifjiwani/rgb-footprint-extract/

arXiv:2103.11926 [pdf, other]

Differentiated nonblocking: a new progress condition and a matching queue algorithm

Authors: David Y. C. Chan, Shucheng Chi, Vassos Hadzilacos, Sam Toueg

Abstract: In this paper, we first propose a new liveness requirement for shared objects and data structures, we then give a shared queue algorithm that satisfies this requirement and we prove its correctness. We also implement this algorithm and compare it to a well-known shared queue algorithm that is used in practice. In addition to having a stronger worst-case progress guarantee, our experimental results suggest that, at the cost of a marginal decrease in throughput, our algorithm is significantly fairer, by a natural definition of fairness that we introduce here.

Submitted 22 March, 2021; originally announced March 2021.

arXiv:2011.12513 [pdf, ps, other]

doi 10.1063/5.0033933

Analytical solution for an acoustic boundary layer around an oscillating rigid sphere

Authors: Evert Klaseboer, Qiang Sun, Derek Y. C. Chan

Abstract: Analytical solutions in fluid dynamics can be used to elucidate the physics of complex flows and to serve as test cases for numerical models. In this work, we present the analytical solution for the acoustic boundary layer that develops around a rigid sphere executing small amplitude harmonic rectilinear motion in a compressible fluid. The mathematical framework that describes the primary flow is identical to that of wave propagation in linearly elastic solids, the difference being the appearance of complex instead of real valued wave numbers. The solution reverts to well-known classical solutions in special limits: the potential flow solution in the thin boundary layer limit, the oscillatory flat plate solution in the limit of large sphere radius and the Stokes flow solutions in the incompressible limit of infinite sound speed. As a companion analytical result, the steady second order acoustic streaming flow is obtained. This streaming flow is driven by the Reynolds stress tensor that arises from the axisymmetric first order primary flow around such a rigid sphere. These results are obtained with a linearization of the non-linear Navier-Stokes equations valid for small amplitude oscillations of the sphere. The streaming flow obeys a time-averaged Stokes equation with a body force given by the Nyborg model in which the above mentioned primary flow in a compressible Newtonian fluid is used to estimate the time-averaged body force. Numerical results are presented to explore different regimes of the complex transverse and longitudinal wave numbers that characterize the primary flow.

Submitted 6 December, 2020; v1 submitted 24 November, 2020; originally announced November 2020.

Journal ref: Phys. Fluids 32, 126105 (2020)

arXiv:2008.02787 [pdf, other]

Efficient Non-Line-of-Sight Imaging from Transient Sinograms

Authors: Mariko Isogawa, Dorian Chan, Ye Yuan, Kris Kitani, Matthew O'Toole

Abstract: Non-line-of-sight (NLOS) imaging techniques use light that diffusely reflects off of visible surfaces (e.g., walls) to see around corners. One approach involves using pulsed lasers and ultrafast sensors to measure the travel time of multiply scattered light. Unlike existing NLOS techniques that generally require densely raster scanning points across the entirety of a relay wall, we explore a more efficient form of NLOS scanning that reduces both acquisition times and computational requirements. We propose a circular and confocal non-line-of-sight (C2NLOS) scan that involves illuminating and imaging a common point, and scanning this point in a circular path along a wall. We observe that (1) these C2NLOS measurements consist of a superposition of sinusoids, which we refer to as a transient sinogram, (2) there exist computationally efficient reconstruction procedures that transform these sinusoidal measurements into 3D positions of hidden scatterers or NLOS images of hidden objects, and (3) despite operating on an order of magnitude fewer measurements than previous approaches, these C2NLOS scans provide sufficient information about the hidden scene to solve these different NLOS imaging tasks. We show results from both simulated and real C2NLOS scans.

Submitted 6 August, 2020; originally announced August 2020.

Comments: ECCV 2020. Project page: https://marikoisogawa.github.io/project/c2nlos

Showing 1–50 of 126 results for author: chan, D