Search | arXiv e-print repository

Lomas: A Platform for Confidential Analysis of Private Data

Authors: Damien Aymon, Dan-Thuy Lam, Lancelot Marti, Pauline Maury-Laribière, Christine Choirat, Raphaël de Fondeville

Abstract: Public services collect massive volumes of data to fulfill their missions. These data fuel the generation of regional, national, and international statistics across various sectors. However, their immense potential remains largely untapped due to strict and legitimate privacy regulations. In this context, Lomas is a novel open-source platform designed to realize the full potential of the data held… ▽ More Public services collect massive volumes of data to fulfill their missions. These data fuel the generation of regional, national, and international statistics across various sectors. However, their immense potential remains largely untapped due to strict and legitimate privacy regulations. In this context, Lomas is a novel open-source platform designed to realize the full potential of the data held by public administrations. It enables authorized users, such as approved researchers and government analysts, to execute algorithms on confidential datasets without directly accessing the data. The Lomas platform is designed to operate within a trusted computing environment, such as governmental IT infrastructure. Authorized users access the platform remotely to submit their algorithms for execution on private datasets. Lomas executes these algorithms without revealing the data to the user and returns the results protected by Differential Privacy, a framework that introduces controlled noise to the results, rendering any attempt to extract identifiable information unreliable. Differential Privacy allows for the mathematical quantification and control of the risk of disclosure while allowing for a complete transparency regarding how data is protected and utilized. The contributions of this project will significantly transform how data held by public services are used, unlocking valuable insights from previously inaccessible data. Lomas empowers research, informing policy development, e.g., public health interventions, and driving innovation across sectors, all while upholding the highest data confidentiality standards. △ Less

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.13486 [pdf, ps, other]

Mean-Variance Portfolio Selection in Long-Term Investments with Unknown Distribution: Online Estimation, Risk Aversion under Ambiguity, and Universality of Algorithms

Authors: Duy Khanh Lam

Abstract: The standard approach for constructing a Mean-Variance portfolio involves estimating parameters for the model using collected samples. However, since the distribution of future data may not resemble that of the training set, the out-of-sample performance of the estimated portfolio is worse than one derived with true parameters, which has prompted several innovations for better estimation. Instead… ▽ More The standard approach for constructing a Mean-Variance portfolio involves estimating parameters for the model using collected samples. However, since the distribution of future data may not resemble that of the training set, the out-of-sample performance of the estimated portfolio is worse than one derived with true parameters, which has prompted several innovations for better estimation. Instead of treating the data without a timing aspect as in the common training-backtest approach, this paper adopts a perspective where data gradually and continuously reveal over time. The original model is recast into an online learning framework, which is free from any statistical assumptions, to propose a dynamic strategy of sequential portfolios such that its empirical utility, Sharpe ratio, and growth rate asymptotically achieve those of the true portfolio, derived with perfect knowledge of the future data. When the distribution of future data has a normal shape, the growth rate of wealth is shown to increase by lifting the portfolio along the efficient frontier through the calibration of risk aversion. Since risk aversion cannot be appropriately predetermined, another proposed algorithm updating this coefficient over time forms a dynamic strategy approaching the optimal empirical Sharpe ratio or growth rate associated with the true coefficient. The performance of these proposed strategies is universally guaranteed under specific stochastic markets. Furthermore, in stationary and ergodic markets, the so-called Bayesian strategy utilizing true conditional distributions, based on observed past market information during investment, almost surely does not perform better than the proposed strategies in terms of empirical utility, Sharpe ratio, or growth rate, which, in contrast, do not rely on conditional distributions. △ Less

Submitted 19 June, 2024; originally announced June 2024.

Comments: 21 pages, working paper, first draft version (may contain errors)

arXiv:2406.03652 [pdf, other]

Ensembling Portfolio Strategies for Long-Term Investments: A Distribution-Free Preference Framework for Decision-Making and Algorithms

Authors: Duy Khanh Lam

Abstract: This paper investigates the problem of ensembling multiple strategies for sequential portfolios to outperform individual strategies in terms of long-term wealth. Due to the uncertainty of strategies' performances in the future market, which are often based on specific models and statistical assumptions, investors often mitigate risk and enhance robustness by combining multiple strategies, akin to… ▽ More This paper investigates the problem of ensembling multiple strategies for sequential portfolios to outperform individual strategies in terms of long-term wealth. Due to the uncertainty of strategies' performances in the future market, which are often based on specific models and statistical assumptions, investors often mitigate risk and enhance robustness by combining multiple strategies, akin to common approaches in collective learning prediction. However, the absence of a distribution-free and consistent preference framework complicates decisions of combination due to the ambiguous objective. To address this gap, we introduce a novel framework for decision-making in combining strategies, irrespective of market conditions, by establishing the investor's preference between decisions and then forming a clear objective. Through this framework, we propose a combinatorial strategy construction, free from statistical assumptions, for any scale of component strategies, even infinite, such that it meets the determined criterion. Finally, we test the proposed strategy along with its accelerated variant and some other multi-strategies. The numerical experiments show results in favor of the proposed strategies, albeit with small tradeoffs in their Sharpe ratios, in which their cumulative wealths eventually exceed those of the best component strategies while the accelerated strategy significantly improves performance. △ Less

Submitted 5 June, 2024; originally announced June 2024.

Comments: 25 pages, 12 figures, 3 tables, working paper

arXiv:2405.16021 [pdf, other]

VADER: Visual Affordance Detection and Error Recovery for Multi Robot Human Collaboration

Authors: Michael Ahn, Montserrat Gonzalez Arenas, Matthew Bennice, Noah Brown, Christine Chan, Byron David, Anthony Francis, Gavin Gonzalez, Rainer Hessmer, Tomas Jackson, Nikhil J Joshi, Daniel Lam, Tsang-Wei Edward Lee, Alex Luong, Sharath Maddineni, Harsh Patel, Jodilyn Peralta, Jornell Quiambao, Diego Reyes, Rosario M Jauregui Ruano, Dorsa Sadigh, Pannag Sanketi, Leila Takayama, Pavel Vodenski, Fei Xia

Abstract: Robots today can exploit the rich world knowledge of large language models to chain simple behavioral skills into long-horizon tasks. However, robots often get interrupted during long-horizon tasks due to primitive skill failures and dynamic environments. We propose VADER, a plan, execute, detect framework with seeking help as a new skill that enables robots to recover and complete long-horizon ta… ▽ More Robots today can exploit the rich world knowledge of large language models to chain simple behavioral skills into long-horizon tasks. However, robots often get interrupted during long-horizon tasks due to primitive skill failures and dynamic environments. We propose VADER, a plan, execute, detect framework with seeking help as a new skill that enables robots to recover and complete long-horizon tasks with the help of humans or other robots. VADER leverages visual question answering (VQA) modules to detect visual affordances and recognize execution errors. It then generates prompts for a language model planner (LMP) which decides when to seek help from another robot or human to recover from errors in long-horizon task execution. We show the effectiveness of VADER with two long-horizon robotic tasks. Our pilot study showed that VADER is capable of performing complex long-horizon tasks by asking for help from another robot to clear a table. Our user study showed that VADER is capable of performing complex long-horizon tasks by asking for help from a human to clear a path. We gathered feedback from people (N=19) about the performance of the VADER performance vs. a robot that did not ask for help. https://google-vader.github.io/ △ Less

Submitted 30 May, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

Comments: 9 pages, 4 figures

arXiv:2404.15187 [pdf]

Evaluating Physician-AI Interaction for Cancer Management: Paving the Path towards Precision Oncology

Authors: Zeshan Hussain, Barbara D. Lam, Fernando A. Acosta-Perez, Irbaz Bin Riaz, Maia Jacobs, Andrew J. Yee, David Sontag

Abstract: We evaluated how clinicians approach clinical decision-making when given findings from both randomized controlled trials (RCTs) and machine learning (ML) models. To do so, we designed a clinical decision support system (CDSS) that displays survival curves and adverse event information from a synthetic RCT and ML model for 12 patients with multiple myeloma. We conducted an interventional study in a… ▽ More We evaluated how clinicians approach clinical decision-making when given findings from both randomized controlled trials (RCTs) and machine learning (ML) models. To do so, we designed a clinical decision support system (CDSS) that displays survival curves and adverse event information from a synthetic RCT and ML model for 12 patients with multiple myeloma. We conducted an interventional study in a simulated setting to evaluate how clinicians synthesized the available data to make treatment decisions. Participants were invited to participate in a follow-up interview to discuss their choices in an open-ended format. When ML model results were concordant with RCT results, physicians had increased confidence in treatment choice compared to when they were given RCT results alone. When ML model results were discordant with RCT results, the majority of physicians followed the ML model recommendation in their treatment selection. Perceived reliability of the ML model was consistently higher after physicians were provided with data on how it was trained and validated. Follow-up interviews revealed four major themes: (1) variability in what variables participants used for decision-making, (2) perceived advantages to an ML model over RCT data, (3) uncertainty around decision-making when the ML model quality was poor, and (4) perception that this type of study is an important thought exercise for clinicians. Overall, ML-based CDSSs have the potential to change treatment decisions in cancer management. However, meticulous development and validation of these systems as well as clinician training are required before deployment. △ Less

Submitted 23 April, 2024; originally announced April 2024.

Comments: First two listed authors are co-first authors

arXiv:2404.09463 [pdf]

PRIME: A CyberGIS Platform for Resilience Inference Measurement and Enhancement

Authors: Debayan Mandal, Dr. Lei Zou, Rohan Singh Wilkho, Joynal Abedin, Bing Zhou, Dr. Heng Cai, Dr. Furqan Baig, Dr. Nasir Gharaibeh, Dr. Nina Lam

Abstract: In an era of increased climatic disasters, there is an urgent need to develop reliable frameworks and tools for evaluating and improving community resilience to climatic hazards at multiple geographical and temporal scales. Defining and quantifying resilience in the social domain is relatively subjective due to the intricate interplay of socioeconomic factors with disaster resilience. Meanwhile, t… ▽ More In an era of increased climatic disasters, there is an urgent need to develop reliable frameworks and tools for evaluating and improving community resilience to climatic hazards at multiple geographical and temporal scales. Defining and quantifying resilience in the social domain is relatively subjective due to the intricate interplay of socioeconomic factors with disaster resilience. Meanwhile, there is a lack of computationally rigorous, user-friendly tools that can support customized resilience assessment considering local conditions. This study aims to address these gaps through the power of CyberGIS with three objectives: 1) To develop an empirically validated disaster resilience model - Customized Resilience Inference Measurement designed for multi-scale community resilience assessment and influential socioeconomic factors identification, 2) To implement a Platform for Resilience Inference Measurement and Enhancement module in the CyberGISX platform backed by high-performance computing, 3) To demonstrate the utility of PRIME through a representative study. CRIM generates vulnerability, adaptability, and overall resilience scores derived from empirical hazard parameters. Computationally intensive Machine Learning methods are employed to explain the intricate relationships between these scores and socioeconomic driving factors. PRIME provides a web-based notebook interface guiding users to select study areas, configure parameters, calculate and geo-visualize resilience scores, and interpret socioeconomic factors sha** resilience capacities. A representative study showcases the efficiency of the platform while explaining how the visual results obtained may be interpreted. The essence of this work lies in its comprehensive architecture that encapsulates the requisite data, analytical and geo-visualization functions, and ML models for resilience assessment. △ Less

Submitted 15 April, 2024; originally announced April 2024.

Comments: 28 pages, 6 figures

arXiv:2403.16958 [pdf, other]

TwinLiteNetPlus: A Stronger Model for Real-time Drivable Area and Lane Segmentation

Authors: Quang-Huy Che, Duc-Tri Le, Minh-Quan Pham, Vinh-Tiep Nguyen, Duc-Khai Lam

Abstract: Semantic segmentation is crucial for autonomous driving, particularly for Drivable Area and Lane Segmentation, ensuring safety and navigation. To address the high computational costs of current state-of-the-art (SOTA) models, this paper introduces TwinLiteNetPlus (TwinLiteNet$^+$), a model adept at balancing efficiency and accuracy. TwinLiteNet$^+$ incorporates standard and depth-wise separable di… ▽ More Semantic segmentation is crucial for autonomous driving, particularly for Drivable Area and Lane Segmentation, ensuring safety and navigation. To address the high computational costs of current state-of-the-art (SOTA) models, this paper introduces TwinLiteNetPlus (TwinLiteNet$^+$), a model adept at balancing efficiency and accuracy. TwinLiteNet$^+$ incorporates standard and depth-wise separable dilated convolutions, reducing complexity while maintaining high accuracy. It is available in four configurations, from the robust 1.94 million-parameter TwinLiteNet$^+_{\text{Large}}$ to the ultra-compact 34K-parameter TwinLiteNet$^+_{\text{Nano}}$. Notably, TwinLiteNet$^+_{\text{Large}}$ attains a 92.9\% mIoU for Drivable Area Segmentation and a 34.2\% IoU for Lane Segmentation. These results notably outperform those of current SOTA models while requiring a computational cost that is approximately 11 times lower in terms of Floating Point Operations (FLOPs) compared to the existing SOTA model. Extensively tested on various embedded devices, TwinLiteNet$^+$ demonstrates promising latency and power efficiency, underscoring its suitability for real-world autonomous vehicle applications. △ Less

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2307.10705 [pdf, other]

doi 10.1109/MAPR59823.2023.10288710

TwinLiteNet: An Efficient and Lightweight Model for Driveable Area and Lane Segmentation in Self-Driving Cars

Authors: Quang Huy Che, Dinh Phuc Nguyen, Minh Quan Pham, Duc Khai Lam

Abstract: Semantic segmentation is a common task in autonomous driving to understand the surrounding environment. Driveable Area Segmentation and Lane Detection are particularly important for safe and efficient navigation on the road. However, original semantic segmentation models are computationally expensive and require high-end hardware, which is not feasible for embedded systems in autonomous vehicles.… ▽ More Semantic segmentation is a common task in autonomous driving to understand the surrounding environment. Driveable Area Segmentation and Lane Detection are particularly important for safe and efficient navigation on the road. However, original semantic segmentation models are computationally expensive and require high-end hardware, which is not feasible for embedded systems in autonomous vehicles. This paper proposes a lightweight model for the driveable area and lane line segmentation. TwinLiteNet is designed cheaply but achieves accurate and efficient segmentation results. We evaluate TwinLiteNet on the BDD100K dataset and compare it with modern models. Experimental results show that our TwinLiteNet performs similarly to existing approaches, requiring significantly fewer computational resources. Specifically, TwinLiteNet achieves a mIoU score of 91.3% for the Drivable Area task and 31.08% IoU for the Lane Detection task with only 0.4 million parameters and achieves 415 FPS on GPU RTX A5000. Furthermore, TwinLiteNet can run in real-time on embedded devices with limited computing power, especially since it achieves 60FPS on Jetson Xavier NX, making it an ideal solution for self-driving vehicles. Code is available: url{https://github.com/chequanghuy/TwinLiteNet}. △ Less

Submitted 13 December, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: Accepted by MAPR 2023

arXiv:2304.09082 [pdf, other]

Unsupervised clustering of file dialects according to monotonic decompositions of mixtures

Authors: Michael Robinson, Tate Altman, Denley Lam, Letitia W. Li

Abstract: This paper proposes an unsupervised classification method that partitions a set of files into non-overlap** dialects based upon their behaviors, determined by messages produced by a collection of programs that consume them. The pattern of messages can be used as the signature of a particular kind of behavior, with the understanding that some messages are likely to co-occur, while others are not.… ▽ More This paper proposes an unsupervised classification method that partitions a set of files into non-overlap** dialects based upon their behaviors, determined by messages produced by a collection of programs that consume them. The pattern of messages can be used as the signature of a particular kind of behavior, with the understanding that some messages are likely to co-occur, while others are not. Patterns of messages can be used to classify files into dialects. A dialect is defined by a subset of messages, called the required messages. Once files are conditioned upon dialect and its required messages, the remaining messages are statistically independent. With this definition of dialect in hand, we present a greedy algorithm that deduces candidate dialects from a dataset consisting of a matrix of file-message data, demonstrate its performance on several file formats, and prove conditions under which it is optimal. We show that an analyst needs to consider fewer dialects than distinct message patterns, which reduces their cognitive load when studying a complex format. △ Less

Submitted 9 February, 2023; originally announced April 2023.

MSC Class: 62P30 (Primary); 06A07 (Secondary) ACM Class: D.3.4

arXiv:2304.02887 [pdf, other]

Design and Control of a Ballbot Drivetrain with High Agility, Minimal Footprint, and High Payload

Authors: Chenzhang Xiao, Mahshid Mansouri, David Lam, Joao Ramos, Elizabeth T. Hsiao-Wecksler

Abstract: This paper presents the design and control of a ballbot drivetrain that aims to achieve high agility, minimal footprint, and high payload capacity while maintaining dynamic stability. Two hardware platforms and analytical models were developed to test design and control methodologies. The full-scale ballbot prototype (MiaPURE) was constructed using off-the-shelf components and designed to have agi… ▽ More This paper presents the design and control of a ballbot drivetrain that aims to achieve high agility, minimal footprint, and high payload capacity while maintaining dynamic stability. Two hardware platforms and analytical models were developed to test design and control methodologies. The full-scale ballbot prototype (MiaPURE) was constructed using off-the-shelf components and designed to have agility, footprint, and balance similar to that of a walking human. The planar inverted pendulum testbed (PIPTB) was developed as a reduced-order testbed for quick validation of system performance. We then proposed a simple yet robust LQR-PI controller to balance and maneuver the ballbot drivetrain with a heavy payload. This is crucial because the drivetrain is often subject to high stiction due to elastomeric components in the torque transmission system. This controller was first tested in the PIPTB to compare with traditional LQR and cascaded PI-PD controllers, and then implemented in the ballbot drivetrain. The MiaPURE drivetrain was able to carry a payload of 60 kg, achieve a maximum speed of 2.3 m/s, and come to a stop from a speed of 1.4 m/s in 2 seconds in a selected translation direction. Finally, we demonstrated the omnidirectional movement of the ballbot drivetrain in an indoor environment as a payload-carrying robot and a human-riding mobility device. Our experiments demonstrated the feasibility of using the ballbot drivetrain as a universal mobility platform with agile movements, minimal footprint, and high payload capacity using our proposed design and control methodologies. △ Less

Submitted 6 April, 2023; originally announced April 2023.

arXiv:2112.04114 [pdf, other]

ESAFE: Enterprise Security and Forensics at Scale

Authors: Bernard McShea, Kevin Wright, Denley Lam, Steve Schmidt, Anna Choromanska, Devansh Bisla, Shihong Fang, Alireza Sarmadi, Prashanth Krishnamurthy, Farshad Khorrami

Abstract: Securing enterprise networks presents challenges in terms of both their size and distributed structure. Data required to detect and characterize malicious activities may be diffused and may be located across network and endpoint devices. Further, cyber-relevant data routinely exceeds total available storage, bandwidth, and analysis capability, often by several orders of magnitude. Real-time detect… ▽ More Securing enterprise networks presents challenges in terms of both their size and distributed structure. Data required to detect and characterize malicious activities may be diffused and may be located across network and endpoint devices. Further, cyber-relevant data routinely exceeds total available storage, bandwidth, and analysis capability, often by several orders of magnitude. Real-time detection of threats within or across very large enterprise networks is not simply an issue of scale, but also a challenge due to the variable nature of malicious activities and their presentations. The system seeks to develop a hierarchy of cyber reasoning layers to detect malicious behavior, characterize novel attack vectors and present an analyst with a contextualized human-readable output from a series of machine learning models. We developed machine learning algorithms for scalable throughput and improved recall for our Multi-Resolution Joint Optimization for Enterprise Security and Forensics (ESAFE) solution. This Paper will provide an overview of ESAFE's Machine Learning Modules, Attack Ontologies, and Automated Smart Alert generation which provide multi-layer reasoning over cross-correlated sensors for analyst consumption. △ Less

Submitted 7 December, 2021; originally announced December 2021.

Comments: 15 pages, 7 figures

arXiv:2109.08438 [pdf, other]

TS-MULE: Local Interpretable Model-Agnostic Explanations for Time Series Forecast Models

Authors: Udo Schlegel, Duy Vo Lam, Daniel A. Keim, Daniel Seebacher

Abstract: Time series forecasting is a demanding task ranging from weather to failure forecasting with black-box models achieving state-of-the-art performances. However, understanding and debugging are not guaranteed. We propose TS-MULE, a local surrogate model explanation method specialized for time series extending the LIME approach. Our extended LIME works with various ways to segment and perturb the tim… ▽ More Time series forecasting is a demanding task ranging from weather to failure forecasting with black-box models achieving state-of-the-art performances. However, understanding and debugging are not guaranteed. We propose TS-MULE, a local surrogate model explanation method specialized for time series extending the LIME approach. Our extended LIME works with various ways to segment and perturb the time series data. In our extension, we present six sampling segmentation approaches for time series to improve the quality of surrogate attributions and demonstrate their performances on three deep learning model architectures and three common multivariate time series datasets. △ Less

Submitted 17 September, 2021; originally announced September 2021.

Comments: 8 pages, 2 pages references, Advances in Interpretable Machine Learning and Artificial Intelligence (AIMLAI) at ECML/PKDD 2021

arXiv:2104.04415 [pdf]

Automatic Knowledge Extraction with Human Interface

Authors: Steve Schmidt, Denley Lam, Patrick Hayden

Abstract: OrbWeaver, an automatic knowledge extraction system paired with a human interface, streamlines the use of unintuitive natural language processing software for modeling systems from their documentation. OrbWeaver enables the indirect transfer of knowledge about legacy systems by leveraging open source tools in document understanding and processing as well as using web based user interface construct… ▽ More OrbWeaver, an automatic knowledge extraction system paired with a human interface, streamlines the use of unintuitive natural language processing software for modeling systems from their documentation. OrbWeaver enables the indirect transfer of knowledge about legacy systems by leveraging open source tools in document understanding and processing as well as using web based user interface constructs. By design, OrbWeaver is scalable, extensible, and usable; we demonstrate its utility by evaluating its performance in processing a corpus of documents related to advanced persistent threats in the cyber domain. The results indicate better knowledge extraction by revealing hidden relationships, linking co-related entities, and gathering evidence. △ Less

Submitted 9 April, 2021; originally announced April 2021.

arXiv:1802.07856 [pdf, other]

xView: Objects in Context in Overhead Imagery

Authors: Darius Lam, Richard Kuzma, Kevin McGee, Samuel Dooley, Michael Laielli, Matthew Klaric, Yaroslav Bulatov, Brendan McCord

Abstract: We introduce a new large-scale dataset for the advancement of object detection techniques and overhead object detection research. This satellite imagery dataset enables research progress pertaining to four key computer vision frontiers. We utilize a novel process for geospatial category detection and bounding box annotation with three stages of quality control. Our data is collected from WorldView… ▽ More We introduce a new large-scale dataset for the advancement of object detection techniques and overhead object detection research. This satellite imagery dataset enables research progress pertaining to four key computer vision frontiers. We utilize a novel process for geospatial category detection and bounding box annotation with three stages of quality control. Our data is collected from WorldView-3 satellites at 0.3m ground sample distance, providing higher resolution imagery than most public satellite imagery datasets. We compare xView to other object detection datasets in both natural and overhead imagery domains and then provide a baseline analysis using the Single Shot MultiBox Detector. xView is one of the largest and most diverse publicly available object-detection datasets to date, with over 1 million objects across 60 classes in over 1,400 km^2 of imagery. △ Less

Submitted 21 February, 2018; originally announced February 2018.

Comments: Initial submission

Showing 1–14 of 14 results for author: Lam, D