-
The Impacts of Data, Ordering, and Intrinsic Dimensionality on Recall in Hierarchical Navigable Small Worlds
Authors:
Owen Pendrigh Elliott,
Jesse Clark
Abstract:
Vector search systems, pivotal in AI applications, often rely on the Hierarchical Navigable Small Worlds (HNSW) algorithm. However, the behaviour of HNSW under real-world scenarios using vectors generated with deep learning models remains under-explored. Existing Approximate Nearest Neighbours (ANN) benchmarks and research typically over-rely on simplistic datasets like MNIST or SIFT1M and fail to reflect the complexity of current use-cases. Our investigation focuses on HNSW's efficacy across a spectrum of datasets, including synthetic vectors tailored to mimic specific intrinsic dimensionalities, widely-used retrieval benchmarks with popular embedding models, and proprietary e-commerce image data with CLIP models. We survey the most popular HNSW vector databases and collate their default parameters to provide a realistic fixed parameterisation for the duration of the paper.
We discover that the recall of approximate HNSW search, in comparison to exact K Nearest Neighbours (KNN) search, is linked to the vector space's intrinsic dimensionality and significantly influenced by the data insertion sequence. Our methodology highlights how insertion order, informed by measurable properties such as the pointwise Local Intrinsic Dimensionality (LID) or known categories, can shift recall by up to 12 percentage points. We also observe that running popular benchmark datasets with HNSW instead of KNN can shift rankings by up to three positions for some models. This work underscores the need for more nuanced benchmarks and design considerations in developing robust vector search systems using approximate vector search algorithms. This study presents a number of scenarios with varying real world applicability which aim to better increase understanding and future development of ANN algorithms and embedding
Submitted 28 May, 2024;
originally announced May 2024.
-
Infectious disease surveillance needs for the United States: lessons from COVID-19
Authors:
Marc Lipsitch,
Mary T. Bassett,
John S. Brownstein,
Paul Elliott,
David Eyre,
M. Kate Grabowski,
James A. Hay,
Michael Johansson,
Stephen M. Kissler,
Daniel B. Larremore,
Jennifer Layden,
Justin Lessler,
Ruth Lynfield,
Duncan MacCannell,
Lawrence C. Madoff,
C. Jessica E. Metcalf,
Lauren A. Meyers,
Sylvia K. Ofori,
Celia Quinn,
Ana I. Ramos Bento,
Nick Reich,
Steven Riley,
Roni Rosenfeld,
Matthew H. Samore,
Rangarajan Sampath, et al. (5 additional authors not shown)
Abstract:
The COVID-19 pandemic has highlighted the need to upgrade systems for infectious disease surveillance and forecasting and modeling of the spread of infection, both of which inform evidence-based public health guidance and policies. Here, we discuss requirements for an effective surveillance system to support decision making during a pandemic, drawing on the lessons of COVID-19 in the U.S., while looking to jurisdictions in the U.S. and beyond to learn lessons about the value of specific data types. In this report, we define the range of decisions for which surveillance data are required, the data elements needed to inform these decisions and to calibrate inputs and outputs of transmission-dynamic models, and the types of data needed to inform decisions by state, territorial, local, and tribal health authorities. We define actions needed to ensure that such data will be available and consider the contribution of such efforts to improving health equity.
Submitted 22 November, 2023;
originally announced November 2023.
-
Intensionalizing Abstract Meaning Representations: Non-Veridicality and Scope
Authors:
Gregor Williamson,
Patrick Elliott,
Yuxin Ji,
Jinho D. Choi
Abstract:
Abstract Meaning Representation (AMR) is a graphical meaning representation language designed to represent propositional information about argument structure. However, at present it is unable to satisfyingly represent non-veridical intensional contexts, often licensing inappropriate inferences. In this paper, we show how to resolve the problem of non-veridicality without appealing to layered graphs through a mapping from AMRs into Simply-Typed Lambda Calculus (STLC). At least for some cases, this requires the introduction of a new role :content which functions as an intensional operator. The translation proposed is inspired by the formal linguistics literature on the event semantics of attitude reports. Next, we address the interaction of quantifier scope and intensional operators in so-called de re/de dicto ambiguities. We adopt a scope node from the literature and provide an explicit multidimensional semantics utilizing Cooper storage which allows us to derive the de re and de dicto scope readings as well as intermediate scope readings which prove difficult for accounts without a scope node.
Submitted 20 September, 2021;
originally announced September 2021.
-
Metabolomics in the Cloud: Scaling Computational Tools to Big Data
Authors:
Jianliang Gao,
Noureddin Sadawi,
Ibrahim Karaman,
Jake T M Pearce,
Pablo Moreno,
Anders Larsson,
Marco Capuccini,
Paul Elliott,
Jeremy K Nicholson,
Timothy M D Ebbels,
Robert Glen
Abstract:
Background: Metabolomics datasets are becoming increasingly large and complex, with multiple types of algorithms and workflows needed to process and analyse the data. A cloud infrastructure with portable software tools can provide much needed resources enabling faster processing of much larger datasets than would be possible at any individual lab. The PhenoMeNal project has developed such an infrastructure, allowing users to run analyses on local or commercial cloud platforms. We have examined the computational scaling behaviour of the PhenoMeNal platform using four different implementations across 1-1000 virtual CPUs using two common metabolomics tools.
Results: Our results show that data which takes up to 4 days to process on a standard desktop computer can be processed in just 10 min on the largest cluster. Improved runtimes come at the cost of decreased efficiency, with all platforms falling below 80% efficiency above approximately 1/3 of the maximum number of vCPUs. An economic analysis revealed that running on large scale cloud platforms is cost effective compared to traditional desktop systems.
Conclusions: Overall, cloud implementations of PhenoMeNal show excellent scalability for standard metabolomics computing tasks on a range of platforms, making them a compelling choice for research computing in metabolomics.
Submitted 9 April, 2019; v1 submitted 3 April, 2019;
originally announced April 2019.
-
Community detection using spectral clustering on sparse geosocial data
Authors:
Yves van Gennip,
Blake Hunter,
Raymond Ahn,
Peter Elliott,
Kyle Luh,
Megan Halvorson,
Shannon Reid,
Matt Valasik,
James Wo,
George E. Tita,
Andrea L. Bertozzi,
P. Jeffrey Brantingham
Abstract:
In this article we identify social communities among gang members in the Hollenbeck policing district in Los Angeles, based on sparse observations of a combination of social interactions and geographic locations of the individuals. This information, coming from LAPD Field Interview cards, is used to construct a similarity graph for the individuals. We use spectral clustering to identify clusters in the graph, corresponding to communities in Hollenbeck, and compare these with the LAPD's knowledge of the individuals' gang membership. We discuss different ways of encoding the geosocial information using a graph structure and the influence on the resulting clusterings. Finally we analyze the robustness of this technique with respect to noisy and incomplete data, thereby providing suggestions about the relative importance of quantity versus quality of collected data.
Submitted 8 November, 2012; v1 submitted 21 June, 2012;
originally announced June 2012.