
Retrieval-Augmented Generation for Generative Artificial Intelligence in Medicine

Authors: Rui Yang, Yilin Ning, Emilia Keppo, Mingxuan Liu, Chuan Hong, Danielle S Bitterman, Jasmine Chiat Ling Ong, Daniel Shu Wei Ting, Nan Liu

Abstract: Generative artificial intelligence (AI) has brought revolutionary innovations in various fields, including medicine. However, it also exhibits limitations. In response, retrieval-augmented generation (RAG) provides a potential solution, enabling models to generate more accurate content by leveraging the retrieval of external knowledge. With the rapid advancement of generative AI, RAG can pave the way for connecting this transformative technology with medical applications and is expected to bring innovations in equity, reliability, and personalization to health care.

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2405.17921 [pdf]

Towards Clinical AI Fairness: Filling Gaps in the Puzzle

Authors: Mingxuan Liu, Yilin Ning, Salinelat Teixayavong, Xiaoxuan Liu, Mayli Mertens, Yuqing Shang, Xin Li, Di Miao, Jie Xu, Daniel Shu Wei Ting, Lionel Tim-Ee Cheng, Jasmine Chiat Ling Ong, Zhen Ling Teo, Ting Fang Tan, Narrendar RaviChandran, Fei Wang, Leo Anthony Celi, Marcus Eng Hock Ong, Nan Liu

Abstract: The ethical integration of Artificial Intelligence (AI) in healthcare necessitates addressing fairness, a concept that is highly context-specific across medical fields. Extensive studies have been conducted to expand the technical components of AI fairness, while tremendous calls for AI fairness have been raised from healthcare. Despite this, a significant disconnect persists between technical advancements and their practical clinical applications, resulting in a lack of contextualized discussion of AI fairness in clinical settings. Through a detailed evidence gap analysis, our review systematically pinpoints several deficiencies concerning both healthcare data and the provided AI fairness solutions. We highlight the scarcity of research on AI fairness in many medical domains where AI technology is increasingly utilized. Additionally, our analysis highlights a substantial reliance on group fairness, aiming to ensure equality among demographic groups from a macro healthcare system perspective; in contrast, individual fairness, focusing on equity at a more granular level, is frequently overlooked. To bridge these gaps, our review advances actionable strategies for both the healthcare and AI research communities. Beyond applying existing AI fairness methods in healthcare, we further emphasize the importance of involving healthcare professionals to refine AI fairness concepts and methods to ensure contextually relevant and ethically sound AI applications in healthcare.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2404.04752 [pdf, other]

Challenges Faced by Large Language Models in Solving Multi-Agent Flocking

Authors: Peihan Li, Vishnu Menon, Bhavanaraj Gudiguntla, Daniel Ting, Lifeng Zhou

Abstract: Flocking is a behavior where multiple agents in a system attempt to stay close to each other while avoiding collision and maintaining a desired formation. This is observed in the natural world and has applications in robotics, including natural disaster search and rescue, wild animal tracking, and perimeter surveillance and patrol. Recently, large language models (LLMs) have displayed an impressive ability to solve various collaboration tasks as individual decision-makers. Solving multi-agent flocking with LLMs would demonstrate their usefulness in situations requiring spatial and decentralized decision-making. Yet, when LLM-powered agents are tasked with implementing multi-agent flocking, they fall short of the desired behavior. After extensive testing, we find that agents with LLMs as individual decision-makers typically opt to converge on the average of their initial positions or diverge from each other. After breaking the problem down, we discover that LLMs cannot understand maintaining a shape or keeping a distance in a meaningful way. Solving multi-agent flocking with LLMs would enhance their ability to understand collaborative spatial reasoning and lay a foundation for addressing more complex multi-agent tasks. This paper discusses the challenges LLMs face in multi-agent flocking and suggests areas for future improvement and research.

Submitted 6 April, 2024; originally announced April 2024.

arXiv:2402.10083 [pdf]

Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots in Ophthalmology and LLM-based evaluation using GPT-4

Authors: Ting Fang Tan, Kabilan Elangovan, Liyuan Jin, Yao Jie, Li Yong, Joshua Lim, Stanley Poh, Wei Yan Ng, Daniel Lim, Yuhe Ke, Nan Liu, Daniel Shu Wei Ting

Abstract: Purpose: To assess the alignment of GPT-4-based evaluation to human clinician experts, for the evaluation of responses to ophthalmology-related patient queries generated by fine-tuned LLM chatbots. Methods: 400 ophthalmology questions and paired answers were created by ophthalmologists to represent commonly asked patient questions, divided into fine-tuning (368; 92%), and testing (40; 8%). We fine-tuned 5 different LLMs, including LLAMA2-7b, LLAMA2-7b-Chat, LLAMA2-13b, and LLAMA2-13b-Chat. For the testing dataset, an additional 8 glaucoma QnA pairs were included. 200 responses to the testing dataset were generated by 5 fine-tuned LLMs for evaluation. A customized clinical evaluation rubric was used to guide GPT-4 evaluation, grounded on clinical accuracy, relevance, patient safety, and ease of understanding. GPT-4 evaluation was then compared against ranking by 5 clinicians for clinical alignment. Results: Among all fine-tuned LLMs, GPT-3.5 scored the highest (87.1%), followed by LLAMA2-13b (80.9%), LLAMA2-13b-chat (75.5%), LLAMA2-7b-Chat (70%) and LLAMA2-7b (68.8%) based on the GPT-4 evaluation. GPT-4 evaluation demonstrated significant agreement with human clinician rankings, with Spearman and Kendall Tau correlation coefficients of 0.90 and 0.80 respectively, while correlation based on Cohen Kappa was more modest at 0.50. Notably, qualitative analysis and the glaucoma sub-analysis revealed clinical inaccuracies in the LLM-generated responses, which were appropriately identified by the GPT-4 evaluation. Conclusion: The notable clinical alignment of GPT-4 evaluation highlighted its potential to streamline the clinical evaluation of LLM chatbot responses to healthcare-related queries. By complementing the existing clinician-dependent manual grading, this efficient and automated evaluation could assist the validation of future developments in LLM applications for healthcare.

Submitted 15 February, 2024; originally announced February 2024.

Comments: 13 Pages, 1 Figure, 8 Tables
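As an illustration of the two rank-correlation coefficients named in the abstract above, the following is a minimal sketch for tie-free rankings. The two rankings are hypothetical and are not the study's data; they only show that a single adjacent swap among five ranked items happens to give a Spearman coefficient of 0.9 and a Kendall Tau of 0.8. The function names are assumptions made for this example.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau(rank_a, rank_b):
    """Kendall Tau (tau-a) for two tie-free rankings of the same items."""
    n = len(rank_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical rankings of five models (1 = best); NOT taken from the paper.
automated_rank = [1, 2, 3, 4, 5]
clinician_rank = [1, 2, 4, 3, 5]  # one adjacent pair swapped

print(spearman_rho(automated_rank, clinician_rank))
print(kendall_tau(automated_rank, clinician_rank))
```

With only five items the coefficients can take few distinct values, which is one reason a small number of ranked models yields coarse correlation estimates.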

arXiv:2402.01741 [pdf]

Development and Testing of a Novel Large Language Model-Based Clinical Decision Support Systems for Medication Safety in 12 Clinical Specialties

Authors: Jasmine Chiat Ling Ong, Liyuan Jin, Kabilan Elangovan, Gilbert Yong San Lim, Daniel Yan Zheng Lim, Gerald Gui Ren Sng, Yuhe Ke, Joshua Yi Min Tung, Ryan Jian Zhong, Christopher Ming Yao Koh, Keane Zhi Hao Lee, Xiang Chen, Jack Kian Chng, Aung Than, Ken Junyang Goh, Daniel Shu Wei Ting

Abstract: Importance: We introduce a novel Retrieval Augmented Generation (RAG)-Large Language Model (LLM) framework as a Clinical Decision Support System (CDSS) to support safe medication prescription. Objective: To evaluate the efficacy of LLM-based CDSS in correctly identifying medication errors in different patient case vignettes from diverse medical and surgical sub-disciplines, against a human expert panel derived ground truth. We compared performance under 2 different CDSS practical healthcare integration modalities: LLM-based CDSS alone (fully autonomous mode) vs junior pharmacist + LLM-based CDSS (co-pilot, assistive mode). Design, Setting, and Participants: Utilizing a RAG model with state-of-the-art medically-related LLMs (GPT-4, Gemini Pro 1.0 and Med-PaLM 2), this study used 61 prescribing error scenarios embedded into 23 complex clinical vignettes across 12 different medical and surgical specialties. A multidisciplinary expert panel assessed these cases for Drug-Related Problems (DRPs) using the PCNE classification and graded severity / potential for harm using revised NCC MERP medication error index. We compared. Results: RAG-LLM performed better compared to LLM alone. When employed in a co-pilot mode, accuracy, recall, and F1 scores were optimized, indicating effectiveness in identifying moderate to severe DRPs. The accuracy of DRP detection with RAG-LLM improved in several categories but at the expense of lower precision. Conclusions: This study established that a RAG-LLM based CDSS significantly boosts the accuracy of medication error identification when used alongside junior pharmacists (co-pilot), with notable improvements in detecting severe DRPs. This study also illuminates the comparative performance of current state-of-the-art LLMs in RAG-based CDSS systems.

Submitted 17 February, 2024; v1 submitted 29 January, 2024; originally announced February 2024.

arXiv:2402.01733 [pdf]

Development and Testing of Retrieval Augmented Generation in Large Language Models -- A Case Study Report

Authors: YuHe Ke, Liyuan Jin, Kabilan Elangovan, Hairil Rizal Abdullah, Nan Liu, Alex Tiong Heng Sia, Chai Rick Soh, Joshua Yi Min Tung, Jasmine Chiat Ling Ong, Daniel Shu Wei Ting

Abstract: Purpose: Large Language Models (LLMs) hold significant promise for medical applications. Retrieval Augmented Generation (RAG) emerges as a promising approach for customizing domain knowledge in LLMs. This case study presents the development and evaluation of an LLM-RAG pipeline tailored for healthcare, focusing specifically on preoperative medicine. Methods: We developed an LLM-RAG model using 35 preoperative guidelines and tested it against human-generated responses, with a total of 1260 responses evaluated. The RAG process involved converting clinical documents into text using Python-based frameworks like LangChain and Llamaindex, and processing these texts into chunks for embedding and retrieval. Vector storage techniques and selected embedding models were used to optimize data retrieval, using Pinecone for vector storage with a dimensionality of 1536 and cosine similarity for loss metrics. Human-generated answers, provided by junior doctors, were used as a comparison. Results: The LLM-RAG model generated answers within an average of 15-20 seconds, significantly faster than the 10 minutes typically required by humans. Among the basic LLMs, GPT4.0 exhibited the best accuracy of 80.1%. This accuracy was further increased to 91.4% when the model was enhanced with RAG. Compared to the human-generated instructions, which had an accuracy of 86.3%, the performance of the GPT4.0 RAG model demonstrated non-inferiority (p=0.610). Conclusions: In this case study, we demonstrated an LLM-RAG model for healthcare implementation. The pipeline shows the advantages of grounded knowledge, upgradability, and scalability as important aspects of healthcare LLM deployment.

Submitted 29 January, 2024; originally announced February 2024.

Comments: NA
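The retrieval step mentioned in the abstract above (embedding text chunks and ranking them by cosine similarity) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' pipeline: it uses plain Python instead of LangChain, Llamaindex, or Pinecone, three-dimensional made-up vectors instead of 1536-dimensional embeddings, and hypothetical chunk names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_chunks(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks whose vectors are most similar to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical chunk embeddings; a real system would obtain these from an embedding model.
chunks = {
    "fasting_guideline": [0.9, 0.1, 0.0],
    "anticoagulant_guideline": [0.1, 0.9, 0.1],
    "airway_assessment": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # hypothetical embedding of a question about fasting

print(top_k_chunks(query, chunks, k=2))
# -> ['fasting_guideline', 'anticoagulant_guideline']
```

In a RAG setup, the text of the top-ranked chunks would then be placed in the LLM prompt alongside the question.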

arXiv:2401.14589 [pdf]

Enhancing Diagnostic Accuracy through Multi-Agent Conversations: Using Large Language Models to Mitigate Cognitive Bias

Authors: Yu He Ke, Rui Yang, Sui An Lie, Taylor Xin Yi Lim, Hairil Rizal Abdullah, Daniel Shu Wei Ting, Nan Liu

Abstract: Background: Cognitive biases in clinical decision-making significantly contribute to errors in diagnosis and suboptimal patient outcomes. Addressing these biases presents a formidable challenge in the medical field. Objective: This study explores the role of large language models (LLMs) in mitigating these biases through the utilization of a multi-agent framework. We simulate the clinical decision-making processes through multi-agent conversation and evaluate its efficacy in improving diagnostic accuracy. Methods: A total of 16 published and unpublished case reports where cognitive biases have resulted in misdiagnoses were identified from the literature. In the multi-agent framework, we leveraged GPT-4 to facilitate interactions among four simulated agents to replicate clinical team dynamics. Each agent has a distinct role: 1) To make the final diagnosis after considering the discussions, 2) The devil's advocate and correct confirmation and anchoring bias, 3) The tutor and facilitator of the discussion to reduce premature closure bias, and 4) To record and summarize the findings. A total of 80 simulations were evaluated for the accuracy of initial diagnosis, top differential diagnosis and final two differential diagnoses. Results: In a total of 80 responses evaluating both initial and final diagnoses, the initial diagnosis had an accuracy of 0% (0/80), but following multi-agent discussions, the accuracy for the top differential diagnosis increased to 71.3% (57/80), and for the final two differential diagnoses, to 80.0% (64/80). Conclusions: The framework demonstrated an ability to re-evaluate and correct misconceptions, even in scenarios with misleading initial investigations. The LLM-driven multi-agent conversation framework shows promise in enhancing diagnostic accuracy in diagnostically challenging medical scenarios.

Submitted 12 May, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

Comments: 21 pages, 3 figures

arXiv:2311.17858 [pdf, ps, other]

On the Limits of Regression Adjustment

Authors: Daniel Ting, Kenneth Hung

Abstract: Regression adjustment, sometimes known as Controlled-experiment Using Pre-Experiment Data (CUPED), is an important technique in internet experimentation. It decreases the variance of effect size estimates, often cutting confidence interval widths in half or more while never making them worse. It does so by carefully regressing the goal metric against pre-experiment features to reduce the variance. The tremendous gains of regression adjustment beg the question: How much better can we do by engineering better features from pre-experiment data, for example by using machine learning techniques or synthetic controls? Could we even reduce the variance in our effect sizes arbitrarily close to zero with the right predictors? Unfortunately, our answer is negative. A simple form of regression adjustment, which uses just the pre-experiment values of the goal metric, captures most of the benefit. Specifically, under a mild assumption that observations closer in time are easier to predict than ones further away in time, we upper bound the potential gains of more sophisticated feature engineering, with respect to the gains of this simple form of regression adjustment. The maximum reduction in variance is $50\%$ in Theorem 1, or equivalently, the confidence interval width can be reduced by at most an additional $29\%$.

Submitted 29 November, 2023; originally announced November 2023.

MSC Class: 62P30 ACM Class: G.3
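The equivalence between a 50% variance reduction and a roughly 29% narrower confidence interval in the abstract above follows because interval width scales with the standard deviation, so the width shrinks by 1 - sqrt(1 - 0.5). The sketch below checks that arithmetic and shows the simple form of adjustment on simulated data. It is an illustration under stated assumptions, not the paper's analysis: the data-generating process, the coefficients, and the function names are all made up for this example.

```python
import math
import random
import statistics


def cuped_adjust(y, x):
    """Subtract the part of y that is linearly predicted by the pre-experiment covariate x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / statistics.variance(x)  # ordinary least squares slope
    return [b - theta * (a - mean_x) for a, b in zip(x, y)]


def ci_width_reduction(variance_reduction):
    """Fractional shrinkage of a confidence interval for a given fractional variance reduction."""
    return 1.0 - math.sqrt(1.0 - variance_reduction)


# Simulated metric: x is the pre-experiment value, y is the in-experiment value.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(5000)]
y = [0.8 * a + rng.gauss(0, 0.6) for a in x]

adjusted = cuped_adjust(y, x)
print(statistics.variance(adjusted) / statistics.variance(y))  # well below 1
print(round(ci_width_reduction(0.5), 3))  # 0.293
```

The adjustment leaves the sample mean of the metric unchanged, which is why it reduces variance without shifting the estimate in this sketch.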

arXiv:2311.02107 [pdf]

Generative Artificial Intelligence in Healthcare: Ethical Considerations and Assessment Checklist

Authors: Yilin Ning, Salinelat Teixayavong, Yuqing Shang, Julian Savulescu, Vaishaanth Nagaraj, Di Miao, Mayli Mertens, Daniel Shu Wei Ting, Jasmine Chiat Ling Ong, Mingxuan Liu, Jiuwen Cao, Michael Dunn, Roger Vaughan, Marcus Eng Hock Ong, Joseph Jao-Yiu Sung, Eric J Topol, Nan Liu

Abstract: The widespread use of ChatGPT and other emerging technology powered by generative artificial intelligence (GenAI) has drawn much attention to potential ethical issues, especially in high-stakes applications such as healthcare, but ethical discussions are yet to translate into operationalisable solutions. Furthermore, ongoing ethical discussions often neglect other types of GenAI that have been used to synthesise data (e.g., images) for research and practical purposes, which resolved some ethical issues and exposed others. We conduct a scoping review of ethical discussions on GenAI in healthcare to comprehensively analyse gaps in the current research, and further propose to reduce the gaps by developing a checklist for comprehensive assessment and transparent documentation of ethical discussions in GenAI research. The checklist can be readily integrated into the current peer review and publication system to enhance GenAI research, and may be used for ethics-related disclosures for GenAI-powered products, healthcare applications of such products and beyond.

Submitted 23 February, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

arXiv:2304.13493 [pdf]

Towards clinical AI fairness: A translational perspective

Authors: Mingxuan Liu, Yilin Ning, Salinelat Teixayavong, Mayli Mertens, Jie Xu, Daniel Shu Wei Ting, Lionel Tim-Ee Cheng, Jasmine Chiat Ling Ong, Zhen Ling Teo, Ting Fang Tan, Ravi Chandran Narrendar, Fei Wang, Leo Anthony Celi, Marcus Eng Hock Ong, Nan Liu

Abstract: Artificial intelligence (AI) has demonstrated the ability to extract insights from data, but the issue of fairness remains a concern in high-stakes fields such as healthcare. Despite extensive discussion and efforts in algorithm development, AI fairness and clinical concerns have not been adequately addressed. In this paper, we discuss the misalignment between technical and clinical perspectives of AI fairness, highlight the barriers to AI fairness' translation to healthcare, advocate multidisciplinary collaboration to bridge the knowledge gap, and provide possible solutions to address the clinical concerns pertaining to AI fairness.

Submitted 26 April, 2023; originally announced April 2023.

arXiv:2304.07310 [pdf]

doi 10.1093/jamia/ocad170

Federated and distributed learning applications for electronic health records and structured medical data: A scoping review

Authors: Siqi Li, Pinyan Liu, Gustavo G. Nascimento, Xinru Wang, Fabio Renato Manzolli Leite, Bibhas Chakraborty, Chuan Hong, Yilin Ning, Feng Xie, Zhen Ling Teo, Daniel Shu Wei Ting, Hamed Haddadi, Marcus Eng Hock Ong, Marco Aurélio Peres, Nan Liu

Abstract: Federated learning (FL) has gained popularity in clinical research in recent years to facilitate privacy-preserving collaboration. Structured data, one of the most prevalent forms of clinical data, has experienced significant growth in volume concurrently, notably with the widespread adoption of electronic health records in clinical practice. This review examines FL applications on structured medical data, identifies contemporary limitations and discusses potential innovations. We searched five databases, SCOPUS, MEDLINE, Web of Science, Embase, and CINAHL, to identify articles that applied FL to structured medical data and reported results following the PRISMA guidelines. Each selected publication was evaluated from three primary perspectives, including data quality, modeling strategies, and FL frameworks. Out of the 1160 papers screened, 34 met the inclusion criteria, with each article consisting of one or more studies that used FL to handle structured clinical/medical data. Of these, 24 utilized data acquired from electronic health records, with clinical predictions and association studies being the most common clinical research tasks that FL was applied to. Only one article exclusively explored the vertical FL setting, while the remaining 33 explored the horizontal FL setting, with only 14 discussing comparisons between single-site (local) and FL (global) analysis. The existing FL applications on structured medical data lack sufficient evaluations of clinically meaningful benefits, particularly when compared to single-site analyses. Therefore, it is crucial for future FL applications to prioritize clinical motivations and develop designs and methodologies that can effectively support and aid clinical practice and research.

Submitted 14 April, 2023; originally announced April 2023.

arXiv:2302.02056 [pdf, other]

Sketch-Flip-Merge: Mergeable Sketches for Private Distinct Counting

Authors: Jonathan Hehir, Daniel Ting, Graham Cormode

Abstract: Data sketching is a critical tool for distinct counting, enabling multisets to be represented by compact summaries that admit fast cardinality estimates. Because sketches may be merged to summarize multiset unions, they are a basic building block in data warehouses. Although many practical sketches for cardinality estimation exist, none provide privacy when merging. We propose the first practical cardinality sketches that are simultaneously mergeable, differentially private (DP), and have low empirical errors. These introduce a novel randomized algorithm for performing logical operations on noisy bits, a tight privacy analysis, and provably optimal estimation. Our sketches dramatically outperform existing theoretical solutions in simulations and on real-world data.

Submitted 3 February, 2023; originally announced February 2023.

Comments: 28 pages, 5 figures

arXiv:2203.15400 [pdf, ps, other]

Order-Invariant Cardinality Estimators Are Differentially Private

Authors: Charlie Dickens, Justin Thaler, Daniel Ting

Abstract: We consider privacy in the context of streaming algorithms for cardinality estimation. We show that a large class of algorithms all satisfy $ε$-differential privacy, so long as (a) the algorithm is combined with a simple down-sampling procedure, and (b) the cardinality of the input stream is $Ω(k/ε)$. Here, $k$ is a certain parameter of the sketch that is always at most the sketch size in bits, but is typically much smaller. We also show that, even with no modification, algorithms in our class satisfy $(ε, δ)$-differential privacy, where $δ$ falls exponentially with the stream cardinality. Our analysis applies to essentially all popular cardinality estimation algorithms, and substantially generalizes and tightens privacy bounds from earlier works.

Submitted 3 February, 2023; v1 submitted 29 March, 2022; originally announced March 2022.

Comments: Changed title and updated with camera ready version from conference

arXiv:2201.03291 [pdf]

A novel interpretable machine learning system to generate clinical risk scores: An application for predicting early mortality or unplanned readmission in a retrospective cohort study

Authors: Yilin Ning, Siqi Li, Marcus Eng Hock Ong, Feng Xie, Bibhas Chakraborty, Daniel Shu Wei Ting, Nan Liu

Abstract: Risk scores are widely used for clinical decision making and commonly generated from logistic regression models. Machine-learning-based methods may work well for identifying important predictors, but such 'black box' variable selection limits interpretability, and variable importance evaluated from a single model can be biased. We propose a robust and interpretable variable selection approach using the recently developed Shapley variable importance cloud (ShapleyVIC) that accounts for variability across models. Our approach evaluates and visualizes overall variable contributions for in-depth inference and transparent variable selection, and filters out non-significant contributors to simplify model building steps. We derive an ensemble variable ranking from variable contributions, which is easily integrated with an automated and modularized risk score generator, AutoScore, for convenient implementation. In a study of early death or unplanned readmission, ShapleyVIC selected 6 of 41 candidate variables to create a well-performing model, which had similar performance to a 16-variable model from machine-learning-based ranking.

Submitted 10 January, 2022; originally announced January 2022.

arXiv:2110.02484 [pdf]

Shapley variable importance clouds for interpretable machine learning

Authors: Yilin Ning, Marcus Eng Hock Ong, Bibhas Chakraborty, Benjamin Alan Goldstein, Daniel Shu Wei Ting, Roger Vaughan, Nan Liu

Abstract: Interpretable machine learning has been focusing on explaining final models that optimize performance. The current state-of-the-art is the Shapley additive explanations (SHAP) that locally explains variable impact on individual predictions, and it was recently extended for a global assessment across the dataset. Recently, Dong and Rudin proposed to extend the investigation to models from the same class as the final model that are "good enough", and identified a previous overclaim of variable importance based on a single model. However, this method does not directly integrate with existing Shapley-based interpretations. We close this gap by proposing a Shapley variable importance cloud that pools information across good models to avoid biased assessments in SHAP analyses of final models, and communicate the findings via novel visualizations. We demonstrate the additional insights gained compared to conventional explanations and Dong and Rudin's method using criminal justice and electronic medical records data.

Submitted 5 October, 2021; originally announced October 2021.

arXiv:2107.04805 [pdf, other]

Few-Shot Domain Adaptation with Polymorphic Transformers

Authors: Shaohua Li, Xiuchao Sui, Jie Fu, Huazhu Fu, Xiangde Luo, Yangqin Feng, Xinxing Xu, Yong Liu, Daniel Ting, Rick Siow Mong Goh

Abstract: Deep neural networks (DNNs) trained on one set of medical images often experience severe performance drop on unseen test images, due to various domain discrepancy between the training images (source domain) and the test images (target domain), which raises a domain adaptation issue. In clinical settings, it is difficult to collect enough annotated target domain data in a short period. Few-shot domain adaptation, i.e., adapting a trained model with a handful of annotations, is highly practical and useful in this case. In this paper, we propose a Polymorphic Transformer (Polyformer), which can be incorporated into any DNN backbones for few-shot domain adaptation. Specifically, after the polyformer layer is inserted into a model trained on the source domain, it extracts a set of prototype embeddings, which can be viewed as a "basis" of the source-domain features. On the target domain, the polyformer layer adapts by only updating a projection layer which controls the interactions between image features and the prototype embeddings. All other model weights (except BatchNorm parameters) are frozen during adaptation. Thus, the chance of overfitting the annotations is greatly reduced, and the model can perform robustly on the target domain after being trained on a few annotated images. We demonstrate the effectiveness of Polyformer on two medical segmentation tasks (i.e., optic disc/cup segmentation, and polyp segmentation). The source code of Polyformer is released at https://github.com/askerlee/segtran.

Submitted 10 July, 2021; originally announced July 2021.

Comments: MICCAI'2021 camera ready

arXiv:2104.05091 [pdf, ps, other]

Simple, Optimal Algorithms for Random Sampling Without Replacement

Authors: Daniel Ting

Abstract: Consider the fundamental problem of drawing a simple random sample of size k without replacement from [n] := {1, ..., n}. Although a number of classical algorithms exist for this problem, we construct algorithms that are even simpler, easier to implement, and have optimal space and time complexity.

Submitted 11 April, 2021; originally announced April 2021.
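
For readers unfamiliar with the sampling problem stated in the abstract above, the following is a minimal sketch of one of the classical algorithms it alludes to, commonly attributed to Robert Floyd. It is not the algorithm proposed in the paper; the function name `floyd_sample` and its parameters are chosen here purely for illustration.

```python
import random


def floyd_sample(n, k, rng=random):
    """Draw k distinct integers uniformly at random from {1, ..., n}.

    Classical Floyd sampling: k random draws and O(k) extra space.
    Illustrative only; not the method from the paper listed above.
    """
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    chosen = set()
    for j in range(n - k + 1, n + 1):
        t = rng.randint(1, j)  # uniform on {1, ..., j}, both ends inclusive
        # If t was already picked, take j instead; j cannot be in the set yet
        # because all earlier picks are at most j - 1.
        chosen.add(j if t in chosen else t)
    return chosen


if __name__ == "__main__":
    print(sorted(floyd_sample(100, 5)))
```

The returned set is unordered; if a uniformly random ordering of the sample is also needed, the result would have to be shuffled separately.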

arXiv:2011.10355 [pdf, other]

HyperLogLog (HLL) Security: Inflating Cardinality Estimates

Authors: Pedro Reviriego, Pablo Adell, Daniel Ting

Abstract: Counting the number of distinct elements on a set is needed in many applications, for example to track the number of unique users in Internet services or the number of distinct flows on a network. In many cases, an estimate rather than the exact value is sufficient and thus many algorithms for cardinality estimation that significantly reduce the memory and computation requirements have been proposed. Among them, Hyperloglog has been widely adopted in both software and hardware implementations. The security of Hyperloglog has been recently studied showing that an attacker can create a set of elements that produces a cardinality estimate that is much smaller than the real cardinality of the set. This set can be used for example to evade detection systems that use Hyperloglog. In this paper, the security of Hyperloglog is considered from the opposite angle: the attacker wants to create a small set that when inserted on the Hyperloglog produces a large cardinality estimate. This set can be used to trigger false alarms in detection systems that use Hyperloglog but more interestingly, it can be potentially used to inflate the visits to websites or the number of hits of online advertisements. Our analysis shows that an attacker can create a set with a number of elements equal to the number of registers used in the Hyperloglog implementation that produces any arbitrary cardinality estimate. This has been validated in two commercial implementations of Hyperloglog: Presto and Redis. Based on those results, we also consider the protection of Hyperloglog against such an attack.

Submitted 20 November, 2020; originally announced November 2020.

arXiv:2010.08296 [pdf, other]

Minimizing Labeling Effort for Tree Skeleton Segmentation using an Automated Iterative Training Methodology

Authors: Keenan Granland, Rhys Newbury, David Ting, Chao Chen

Abstract: Training of convolutional neural networks for semantic segmentation requires accurate pixel-wise labeling which requires large amounts of human effort. The human-in-the-loop method reduces labeling effort; however, it requires human intervention for each image. This paper describes a general iterative training methodology for semantic segmentation, Automating-the-Loop. This aims to replicate the manual adjustments of the human-in-the-loop method with an automated process, hence, drastically reducing labeling effort. Using the application of detecting partially occluded apple tree segmentation, we compare manually labeled annotations, self-training, human-in-the-loop, and Automating-the-Loop methods in both the quality of the trained convolutional neural networks, and the effort needed to create them. The convolutional neural network (U-Net) performance is analyzed using traditional metrics and a new metric, Complete Grid Scan, which promotes connectivity and low noise. It is shown that in our application, the new Automating-the-Loop method greatly reduces the labeling effort while producing comparable performance to both human-in-the-loop and complete manual labeling methods.

Submitted 9 August, 2021; v1 submitted 16 October, 2020; originally announced October 2020.

arXiv:2010.06879 [pdf, other]

Semantic Segmentation for Partially Occluded Apple Trees Based on Deep Learning

Authors: Zijue Chen, David Ting, Rhys Newbury, Chao Chen

Abstract: Fruit tree pruning and fruit thinning require a powerful vision system that can provide high resolution segmentation of the fruit trees and their branches. However, recent works only consider the dormant season, where there are minimal occlusions on the branches or fit a polynomial curve to reconstruct branch shape and hence, losing information about branch thickness. In this work, we apply two state-of-the-art supervised learning models U-Net and DeepLabv3, and a conditional Generative Adversarial Network Pix2Pix (with and without the discriminator) to segment partially occluded 2D-open-V apple trees. Binary accuracy, Mean IoU, Boundary F1 score and Occluded branch recall were used to evaluate the performances of the models. DeepLabv3 outperforms the other models at Binary accuracy, Mean IoU and Boundary F1 score, but is surpassed by Pix2Pix (without discriminator) and U-Net in Occluded branch recall. We define two difficulty indices to quantify the difficulty of the task: (1) Occlusion Difficulty Index and (2) Depth Difficulty Index. We analyze the worst 10 images in both difficulty indices by means of Branch Recall and Occluded Branch Recall. U-Net outperforms the other two models in the current metrics. On the other hand, Pix2Pix (without discriminator) provides more information on branch paths, which are not reflected by the metrics. This highlights the need for more specific metrics on recovering occluded information. Furthermore, this shows the usefulness of image-transfer networks for hallucination behind occlusions. Future work is required to further enhance the models to recover more information from occlusions such that this technology can be applied to automating agricultural tasks in a commercial environment.

Submitted 14 October, 2020; originally announced October 2020.

arXiv:2007.03315 [pdf, other]

Manifold Learning via Manifold Deflation

Authors: Daniel Ting, Michael I. Jordan

Abstract: Nonlinear dimensionality reduction methods provide a valuable means to visualize and interpret high-dimensional data. However, many popular methods can fail dramatically, even on simple two-dimensional manifolds, due to problems such as vulnerability to noise, repeated eigendirections, holes in convex bodies, and boundary bias. We derive an embedding method for Riemannian manifolds that iteratively uses single-coordinate estimates to eliminate dimensions from an underlying differential operator, thus "deflating" it. These differential operators have been shown to characterize any local, spectral dimensionality reduction method. The key to our method is a novel, incremental tangent space estimator that incorporates global structure as coordinates are added. We prove its consistency when the coordinates converge to true coordinates. Empirically, we show our algorithm recovers novel and interesting embeddings on real-world and synthetic datasets.

Submitted 7 July, 2020; originally announced July 2020.

arXiv:2005.02537 [pdf, other]

Conditional Cuckoo Filters

Authors: Daniel Ting, Rick Cole

Abstract: Bloom filters, cuckoo filters, and other approximate set membership sketches have a wide range of applications. Oftentimes, expensive operations can be skipped if an item is not in a data set. These filters provide an inexpensive, memory efficient way to test if an item is in a set and avoid unnecessary operations. Existing sketches only allow membership testing for a single set. However, in some applications such as join processing, the relevant set is not fixed and is determined by a set of predicates. We propose the Conditional Cuckoo Filter, a simple modification of the cuckoo filter that allows for set membership testing given predicates on a pre-computed sketch. This filter also introduces a novel chaining technique that enables cuckoo filters to handle insertion of duplicate keys. We evaluate our methods on a join processing application and show that they significantly reduce the number of tuples that a join must process.

Submitted 5 May, 2020; originally announced May 2020.

arXiv:2002.06463 [pdf, other]

Security of HyperLogLog (HLL) Cardinality Estimation: Vulnerabilities and Protection

Authors: Pedro Reviriego, Daniel Ting

Abstract: Count distinct or cardinality estimates are widely used in network monitoring for security. They can be used, for example, to detect the malware spread, network scans, or a denial of service attack. There are many algorithms to estimate cardinality. Among those, HyperLogLog (HLL) has been one of the most widely adopted. HLL is simple, provides good cardinality estimates over a wide range of values, requires a small amount of memory, and allows merging of estimates from different sources. However, as HLL is increasingly used to detect attacks, it can itself become the target of attackers that want to avoid being detected. To the best of our knowledge, the security of HLL has not been studied before. In this letter, we take an initial step in its study by first exposing a vulnerability of HLL that allows an attacker to manipulate its estimate. This shows the importance of designing secure HLL implementations. In the second part of the letter, we propose an efficient protection technique to detect and avoid the HLL manipulation. The results presented strongly suggest that the security of HLL should be further studied given that it is widely adopted in many networking and computing applications.

Submitted 15 February, 2020; originally announced February 2020.

arXiv:1911.05904 [pdf, other]

There is Limited Correlation between Coverage and Robustness for Deep Neural Networks

Authors: Yizhen Dong, Peixin Zhang, Jingyi Wang, Shuang Liu, Jun Sun, Jianye Hao, Xinyu Wang, Li Wang, Jin Song Dong, Dai Ting

Abstract: Deep neural networks (DNN) are increasingly applied in safety-critical systems, e.g., for face recognition, autonomous car control and malware detection. It is also shown that DNNs are subject to attacks such as adversarial perturbation and thus must be properly tested. Many coverage criteria for DNN have since been proposed, inspired by the success of code coverage criteria for software programs. The expectation is that if a DNN is well tested (and retrained) according to such coverage criteria, it is more likely to be robust. In this work, we conduct an empirical study to evaluate the relationship between coverage, robustness and attack/defense metrics for DNN. Our study is the largest to date and systematically done based on 100 DNN models and 25 metrics. One of our findings is that there is limited correlation between coverage and robustness, i.e., improving coverage does not help improve the robustness. Our dataset and implementation have been made available to serve as a benchmark for future studies on testing DNN.

Submitted 13 November, 2019; originally announced November 2019.

arXiv:1907.02413 [pdf, ps, other]

Multi-Instance Multi-Scale CNN for Medical Image Classification

Authors: Shaohua Li, Yong Liu, Xiuchao Sui, Cheng Chen, Gabriel Tjio, Daniel Shu Wei Ting, Rick Siow Mong Goh

Abstract: Deep learning for medical image classification faces three major challenges: 1) the number of annotated medical images for training is usually small; 2) regions of interest (ROIs) are relatively small with unclear boundaries in the whole medical images, and may appear in arbitrary positions across the x,y (and also z in 3D images) dimensions. However often only labels of the whole images are annotated, and localized ROIs are unavailable; and 3) ROIs in medical images often appear in varying sizes (scales). We approach these three challenges with a Multi-Instance Multi-Scale (MIMS) CNN: 1) We propose a multi-scale convolutional layer, which extracts patterns of different receptive fields with a shared set of convolutional kernels, so that scale-invariant patterns are captured by this compact set of kernels. As this layer contains only a small number of parameters, training on small datasets becomes feasible; 2) We propose a "top-k pooling" to aggregate the feature maps in varying scales from multiple spatial dimensions, allowing the model to be trained using weak annotations within the multiple instance learning (MIL) framework. Our method is shown to perform well on three classification tasks involving two 3D and two 2D medical image datasets.

Submitted 22 October, 2019; v1 submitted 4 July, 2019; originally announced July 2019.

Comments: Accepted by MICCAI 2019

arXiv:1811.04150 [pdf, other]

Count-Min: Optimal Estimation and Tight Error Bounds using Empirical Error Distributions

Authors: Daniel Ting

Abstract: The Count-Min sketch is an important and well-studied data summarization method. It allows one to estimate the count of any item in a stream using a small, fixed size data sketch. However, the accuracy of the sketch depends on characteristics of the underlying data. This has led to a number of count estimation procedures which work well in one scenario but perform poorly in others. A practitioner is faced with two basic, unanswered questions. Which variant should be chosen when the data is unknown? Given an estimate, is its error sufficiently small to be trustworthy? We provide answers to these questions. We derive new count estimators, including a provably optimal estimator, which best or match previous estimators in all scenarios. We also provide practical, tight error bounds at query time for both new and existing estimators. These error estimates also yield procedures to choose the sketch tuning parameters optimally, as they can extrapolate the error to different choices of sketch width and depth. The key observation is that the distribution of errors in each counter can be empirically estimated from the sketch itself. By first estimating this distribution, count estimation becomes a statistical estimation and inference problem with a known error distribution. This provides both a principled way to derive new and optimal estimators as well as a way to study the error and properties of existing estimators.

Submitted 9 November, 2018; originally announced November 2018.

Comments: Long version of a KDD 2018 paper of the same name
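
As background for the abstract above, here is a minimal sketch of a standard Count-Min sketch with the classical minimum-over-rows estimator. It does not implement the new estimators or error bounds from the paper. The class name `CountMinSketch`, the `width` and `depth` parameters, and the use of salted BLAKE2b as the per-row hash are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Standard Count-Min sketch: depth rows of width counters each."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, obtained by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        # Add the count to one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Classical estimator: the smallest counter the item maps to.
        # With non-negative updates it never underestimates the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


if __name__ == "__main__":
    cms = CountMinSketch(width=64, depth=4)
    for token in ["a", "b", "a", "c", "a"]:
        cms.update(token)
    print(cms.estimate("a"))
```

Collisions can only inflate a counter, which is why the minimum over rows is an upper bound on the true count; how far above depends on the data, which is the dependence the abstract refers to.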

arXiv:1805.06608 [pdf]

Solid-Immersion Metalenses for Infrared Focal Plane Arrays

Authors: Shuyan Zhang, Alexander Soibel, Sam A. Keo, Daniel Wilson, Sir. B. Rafol, David Z. Ting, Alan She, Sarath D. Gunapala, Federico Capasso

Abstract: Novel optical components based on metasurfaces (metalenses) offer a new methodology for microlens arrays. In particular, metalens arrays have the potential of being monolithically integrated with infrared focal plane arrays (IR FPAs) to increase the operating temperature and sensitivity of the latter. In this work, we demonstrate a new type of transmissive metalens that focuses the incident light (λ = 3-5 μm) on the detector plane after propagating through the substrate, i.e. solid-immersion type of focusing. The metalens is fabricated by etching the backside of the detector substrate material (GaSb here) making this approach compatible with the architecture of back-illuminated FPAs. In addition, our designs work for all incident polarizations. We fabricate a 10x10 metalens array that proves the scalability of this approach for FPAs. In the future, these solid-immersion metalenses arrays will be monolithically integrated with IR FPAs.

Submitted 17 May, 2018; originally announced May 2018.

arXiv:1803.02432 [pdf, other]

On Nonlinear Dimensionality Reduction, Linear Smoothing and Autoencoding

Authors: Daniel Ting, Michael I. Jordan

Abstract: We develop theory for nonlinear dimensionality reduction (NLDR). A number of NLDR methods have been developed, but there is limited understanding of how these methods work and the relationships between them. There is limited basis for using existing NLDR theory for deriving new algorithms. We provide a novel framework for analysis of NLDR via a connection to the statistical theory of linear smoothers. This allows us to both understand existing methods and derive new ones. We use this connection to smoothing to show that asymptotically, existing NLDR methods correspond to discrete approximations of the solutions of sets of differential equations given a boundary condition. In particular, we can characterize many existing methods in terms of just three limiting differential operators and boundary conditions. Our theory also provides a way to assert that one method is preferable to another; indeed, we show Local Tangent Space Alignment is superior within a class of methods that assume a global coordinate chart defines an isometric embedding of the manifold.

Submitted 6 March, 2018; originally announced March 2018.

arXiv:1711.10028 [pdf, other]

Family learning: nonparametric statistical inference with parametric efficiency

Authors: William Fithian, Daniel Ting

Abstract: Hypothesis testing and other statistical inference procedures are most efficient when a reliable low-dimensional parametric family can be specified. We propose a method that learns such a family when one exists but its form is not known a priori, by examining samples from related populations and fitting a low-dimensional exponential family that approximates all the samples as well as possible. We propose a computationally efficient spectral method that allows us to carry out hypothesis tests that are valid whether or not the fit is good, and recover asymptotically optimal power if it is. Our method is computationally efficient and can produce substantial power gains in simulation and real-world A/B testing data.

Submitted 27 November, 2017; originally announced November 2017.

arXiv:1709.04048 [pdf, other]

Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation

Authors: Daniel Ting

Abstract: We introduce and study a new data sketch for processing massive datasets. It addresses two common problems: 1) computing a sum given arbitrary filter conditions and 2) identifying the frequent items or heavy hitters in a data set. For the former, the sketch provides unbiased estimates with state of the art accuracy. It handles the challenging scenario when the data is disaggregated so that computing the per unit metric of interest requires an expensive aggregation. For example, the metric of interest may be total clicks per user while the raw data is a click stream with multiple rows per user. Thus the sketch is suitable for use in a wide range of applications including computing historical click through rates for ad prediction, reporting user metrics from event streams, and measuring network traffic for IP flows. We prove and empirically show the sketch has good properties for both the disaggregated subset sum estimation and frequent item problems. On i.i.d. data, it not only picks out the frequent items but gives strongly consistent estimates for the proportion of each frequent item. The resulting sketch asymptotically draws a probability proportional to size sample that is optimal for estimating sums over the data. For non i.i.d. data, we show that it typically does much better than random sampling for the frequent item problem and never does worse. For subset sum estimation, we show that even for pathological sequences, the variance is close to that of an optimal sampling design. Empirically, despite the disadvantage of operating on disaggregated data, our method matches or bests priority sampling, a state of the art method for pre-aggregated data and performs orders of magnitude better on skewed data compared to uniform sampling. We propose extensions to the sketch that allow it to be used in combining multiple data sets, in distributed systems, and for time decayed aggregation.

Submitted 12 September, 2017; originally announced September 2017.

arXiv:1709.01716 [pdf, other]

Optimal Sub-sampling with Influence Functions

Authors: Daniel Ting, Eric Brochu

Abstract: Sub-sampling is a common and often effective method to deal with the computational challenges of large datasets. However, for most statistical models, there is no well-motivated approach for drawing a non-uniform subsample. We show that the concept of an asymptotically linear estimator and the associated influence function leads to optimal sampling procedures for a wide class of popular models. Furthermore, for linear regression models which have well-studied procedures for non-uniform sub-sampling, we show our optimal influence function based method outperforms previous approaches. We empirically show the improved performance of our method on real datasets.

Submitted 6 September, 2017; originally announced September 2017.

arXiv:1708.04970 [pdf, other]

Adaptive Threshold Sampling

Authors: Daniel Ting

Abstract: Sampling is a fundamental problem in computer science and statistics. However, for a given task and stream, it is often not possible to choose good sampling probabilities in advance. We derive a general framework for adaptively changing the sampling probabilities via a collection of thresholds. In general, adaptive sampling procedures introduce dependence amongst the sampled points, making it difficult to compute expectations and ensure estimators are unbiased or consistent. Our framework addresses this issue and further shows when adaptive thresholds can be treated as if they were fixed thresholds which sample items independently. This makes our adaptive sampling schemes simple to apply as there is no need to create custom estimators for the sampling method. Using our framework, we derive new samplers that can address a broad range of new and existing problems including sampling with memory rather than sample size budgets, stratified samples, multiple objectives, distinct counting, and sliding windows. In particular, we design a sampling procedure for the top-K problem where, unlike in the heavy-hitter problem, the sketch size and sampling probabilities are adaptively chosen.

Submitted 15 June, 2022; v1 submitted 16 August, 2017; originally announced August 2017.

arXiv:1212.2744 [pdf, other]

Mixture Models of Endhost Network Traffic

Authors: John Mark Agosta, Jaideep Chandrashekar, Mark Crovella, Nina Taft, Daniel Ting

Abstract: In this work we focus on modeling a little studied type of traffic, namely the network traffic generated from endhosts. We introduce a parsimonious parametric model of the marginal distribution for connection arrivals. We employ mixture models based on a convex combination of component distributions with both heavy and light-tails. These models can be fitted with high accuracy using maximum likelihood techniques. Our methodology assumes that the underlying user data can be fitted to one of many modeling options, and we apply Bayesian model selection criteria as a rigorous way to choose the preferred combination of components. Our experiments show that a simple Pareto-exponential mixture model is preferred for a wide range of users, over both simpler and more complex alternatives. This model has the desirable property of modeling the entire distribution, effectively segmenting the traffic into the heavy-tailed as well as the non-heavy-tailed components. We illustrate that this technique has the flexibility to capture the wide diversity of user behaviors.

Submitted 12 December, 2012; originally announced December 2012.

arXiv:1203.3522 [pdf]

Online Semi-Supervised Learning on Quantized Graphs

Authors: Michal Valko, Branislav Kveton, Ling Huang, Daniel Ting

Abstract: In this paper, we tackle the problem of online semi-supervised learning (SSL). When data arrive in a stream, the dual problems of computation and data storage arise for any SSL method. We propose a fast approximate online SSL algorithm that solves for the harmonic solution on an approximate graph. We show, both empirically and theoretically, that good behavior can be achieved by collapsing nearby points into a set of local "representative points" that minimize distortion. Moreover, we regularize the harmonic solution to achieve better stability properties. We apply our algorithm to face recognition and optical character recognition applications to show that we can take advantage of the manifold structure to outperform the previous methods. Unlike previous heuristic approaches, we show that our method yields provable performance bounds.

Submitted 15 March, 2012; originally announced March 2012.

Comments: Appears in Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI2010)

Report number: UAI-P-2010-PG-606-614

arXiv:1101.5435 [pdf, ps, other]

An Analysis of the Convergence of Graph Laplacians

Authors: Daniel Ting, Ling Huang, Michael Jordan

Abstract: Existing approaches to analyzing the asymptotics of graph Laplacians typically assume a well-behaved kernel function with smoothness assumptions. We remove the smoothness assumption and generalize the analysis of graph Laplacians to include previously unstudied graphs including kNN graphs. We also introduce a kernel-free framework to analyze graph constructions with shrinking neighborhoods in general and apply it to analyze locally linear embedding (LLE). We also describe how, for a given limiting Laplacian operator, desirable properties such as a convergent spectrum and sparseness can be achieved by choosing the appropriate graph construction.

Submitted 27 January, 2011; originally announced January 2011.

arXiv:cond-mat/0511538 [pdf, ps, other]

doi 10.1103/PhysRevB.73.205341

Higher order contributions to Rashba and Dresselhaus effects

Authors: X. Cartoixa, L. -W. Wang, D. Z. -Y. Ting, Y. -C. Chang

Abstract: We have developed a method to systematically compute the form of Rashba- and Dresselhaus-like contributions to the spin Hamiltonian of heterostructures to an arbitrary order in the wavevector k. This is achieved by using the double group representations to construct general symmetry-allowed Hamiltonians with full spin-orbit effects within the tight-binding formalism. We have computed full-zone spin Hamiltonians for [001]-, [110]- and [111]-grown zinc blende heterostructures (D_{2d},C_{4v},C_{2v},C_{3v} point group symmetries), which are commonly used in spintronics. After an expansion of the Hamiltonian up to third order in k, we are able to obtain additional terms not found previously. The present method also provides the matrix elements for bulk zinc blendes (T_d) in the anion/cation and effective bond orbital model (EBOM) basis sets with full spin-orbit effects.

Submitted 22 November, 2005; originally announced November 2005.

Comments: v1: 11 pages, 3 figures, 8 tables

arXiv:cond-mat/0402237 [pdf, ps, other]

doi 10.1103/PhysRevB.71.045313

Suppression of the D'yakonov-Perel' spin relaxation mechanism for all spin components in [111] zincblende quantum wells

Authors: X. Cartoixa, D. Z. -Y. Ting, Y. -C. Chang

Abstract: We apply the D'yakonov-Perel' (DP) formalism to [111]-grown zincblende quantum wells (QWs) to compute the spin lifetimes of electrons in the two-dimensional electron gas. We account for both bulk and structural inversion asymmetry (Rashba) effects. We see that, under certain conditions, the spin splitting vanishes to first order in k, which effectively suppresses the DP spin relaxation mechanism for all spin components. We predict extended spin lifetimes as a result, giving rise to the possibility of enhanced spin storage. We also study [110]-grown QWs, where the effect of structural inversion asymmetry is to augment the spin relaxation rate of the component perpendicular to the well. We derive analytical expressions for the spin lifetime tensor and its proper axes, and see that they are dependent on the relative magnitude of the BIA- and SIA-induced splittings.

Submitted 15 July, 2004; v1 submitted 9 February, 2004; originally announced February 2004.

Comments: v1: 5 pages, 2 figures, submitted to PRL v2: added 1 figure and supporting content, PRB format

Journal ref: Phys. Rev. B, v. 71 (4), 045313 (2005)

arXiv:cond-mat/0212394 [pdf, ps, other]

Bulk Inversion Asymmetry effects on the band structure of zincblende heterostructures in an 8-band Effective Mass Approximation model

Authors: X. Cartoixa, D. Z. -Y. Ting, T. C. McGill

Abstract: We have developed an 8-band Effective Mass Approximation model that describes the zero field spin splitting in the band structure of zincblende heterostructures due to bulk inversion asymmetry (BIA). We have verified that our finite difference Hamiltonian transforms in almost all situations according to the true $D_{2d}$ or $C_{2v}$ symmetry of [001] heterostructures. This makes it a computationally efficient tool for the accurate description of the band structure of heterostructures for spintronics. We first compute the band structure for an AlSb/GaSb/AlSb quantum well (QW), which presents only BIA, and delineate its effects. We then use our model to find the band structure of an AlSb/InAs/GaSb/AlSb QW and the relative contribution of structural and bulk inversion asymmetry to the spin splitting. We clarify statements about the importance of these contributions and conclude that, even for our small gap QW, BIA needs to be taken into account for the proper description of the band structure.

Submitted 17 December, 2002; originally announced December 2002.

Comments: 14 pages, 15 figures, 6 tables

Showing 1–38 of 38 results for author: Ting, D