-
Fast Feature Selection with Fairness Constraints
Authors:
Francesco Quinzan,
Rajiv Khanna,
Moshik Hershcovitch,
Sarel Cohen,
Daniel G. Waddington,
Tobias Friedrich,
Michael W. Mahoney
Abstract:
We study the fundamental problem of selecting optimal features for model construction. This problem is computationally challenging on large datasets, even with the use of greedy algorithm variants. To address this challenge, we extend the adaptive query model, recently proposed for the greedy forward selection for submodular functions, to the faster paradigm of Orthogonal Matching Pursuit for non-submodular functions. The proposed algorithm achieves exponentially fast parallel run time in the adaptive query model, scaling much better than prior work. Furthermore, our extension allows the use of downward-closed constraints, which can be used to encode certain fairness criteria into the feature selection process. We prove strong approximation guarantees for the algorithm based on standard assumptions. These guarantees are applicable to many parametric models, including Generalized Linear Models. Finally, we demonstrate empirically that the proposed algorithm competes favorably with state-of-the-art techniques for feature selection, on real-world and synthetic datasets.
Submitted 3 February, 2023; v1 submitted 28 February, 2022;
originally announced February 2022.
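The abstract above builds on Orthogonal Matching Pursuit (OMP) and on downward-closed constraints. As a rough illustration only, the sketch below shows classical, sequential OMP for least-squares feature selection with an optional feasibility check. It is not the parallel adaptive-query algorithm of the paper. The names `omp_select`, `feasible`, and the list-of-columns data layout are assumptions made for this example.

```python
from math import sqrt
from typing import Callable, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def omp_select(
    columns: List[Sequence[float]],
    y: Sequence[float],
    k: int,
    feasible: Callable[[List[int]], bool] = lambda chosen: True,
) -> Tuple[List[int], List[float]]:
    """Classical sequential OMP (illustrative; not the paper's algorithm).

    columns  -- feature vectors, one per candidate feature
    y        -- target vector
    k        -- maximum number of features to select
    feasible -- hypothetical constraint check; assumed downward-closed,
                i.e. every subset of a feasible set is also feasible
    """
    residual = list(y)
    selected: List[int] = []
    basis: List[List[float]] = []  # orthonormal basis of the selected columns
    for _ in range(k):
        best_j, best_score = None, 1e-12
        for j, col in enumerate(columns):
            if j in selected or not feasible(selected + [j]):
                continue
            norm = sqrt(dot(col, col))
            if norm == 0.0:
                continue
            # Normalized correlation of the feature with the current residual.
            score = abs(dot(col, residual)) / norm
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            break  # no feasible feature explains the residual any further
        # Gram-Schmidt step: orthogonalize the new column against the basis.
        q = list(columns[best_j])
        for b in basis:
            c = dot(q, b)
            q = [qi - c * bi for qi, bi in zip(q, b)]
        q_norm = sqrt(dot(q, q))
        if q_norm < 1e-12:
            break
        q = [qi / q_norm for qi in q]
        basis.append(q)
        selected.append(best_j)
        # Remove the newly explained component from the residual.
        c = dot(residual, q)
        residual = [ri - c * qi for ri, qi in zip(residual, q)]
    return selected, residual
```

A constraint such as "at most one feature per group" is downward-closed and can be passed as `feasible`, for example a function that rejects any index list containing two features with the same group label.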
-
Fast & Flexible IO : A Compositional Approach to Storage Construction for High-Performance Devices
Authors:
Daniel G. Waddington
Abstract:
Building storage systems has remained the domain of systems experts for many years. They are complex and difficult to implement. Extreme care is needed to ensure necessary guarantees of performance and operational correctness. Furthermore, because of restrictions imposed by kernel-based designs, many legacy implementations have traded software flexibility for performance. Their implementation is restricted to compiled languages such as C and assembler, and reuse tends to be difficult or constrained. Nevertheless, storage systems are implicitly well-suited to software reuse and compositional software construction. There are many logical functions, such as block allocation, caching, partitioning, metadata management and so forth, that are common across most variants of storage. In this paper, we present Comanche, an open-source project that considers, as first-class concerns, both compositional design and reuse, and the need for high performance.
Submitted 25 July, 2018;
originally announced July 2018.
-
Ingestion, Indexing and Retrieval of High-Velocity Multidimensional Sensor Data on a Single Node
Authors:
Juan A. Colmenares,
Reza Dorrigiv,
Daniel G. Waddington
Abstract:
Multidimensional data are becoming more prevalent, partly due to the rise of the Internet of Things (IoT), and with that the need to ingest and analyze data streams at rates higher than before. Some industrial IoT applications require ingesting millions of records per second, while processing queries on recently ingested and historical data. Unfortunately, existing database systems suited to multidimensional data exhibit low per-node ingestion performance, and even if they can scale horizontally in distributed settings, they require a large number of nodes to meet such ingest demands. For this reason, in this paper we evaluate a single-node multidimensional data store for high-velocity sensor data. Its design centers around a two-level indexing structure, wherein the global index is an in-memory R*-tree and the local indices are serialized kd-trees. This study is confined to records with numerical indexing fields and range queries, and covers ingest throughput, query response time, and storage footprint. We show that the adopted design streamlines data ingestion and offers ingress rates two orders of magnitude higher than those of Percona Server, SQLite, and Druid. Our prototype also reports query response times comparable to or better than those of Percona Server and Druid, and compares favorably in terms of storage footprint. In addition, we evaluate a kd-tree partitioning based scheme for grouping incoming streamed data records. Compared to a random scheme, this scheme produces less overlap between groups of streamed records, but contrary to what we expected, such reduced overlap does not translate into better query performance. By contrast, the local indices prove much more beneficial to query performance. We believe the experience reported in this paper is valuable to practitioners and researchers alike interested in building database systems for high-velocity multidimensional data.
Submitted 4 July, 2017;
originally announced July 2017.
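The two-level indexing idea in the abstract above (a global index over groups of records, plus a local kd-tree per group) can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the global level here is a plain list of bounding boxes scanned linearly, standing in for the R*-tree, and the local kd-trees are kept in memory rather than serialized. The names `TwoLevelIndex`, `add_group`, and `range_query` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class KDNode:
    point: Point
    axis: int
    left: Optional["KDNode"]
    right: Optional["KDNode"]


def build_kdtree(points: List[Point], depth: int = 0) -> Optional[KDNode]:
    """Build a kd-tree by splitting on the median along a cycling axis."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KDNode(
        points[mid],
        axis,
        build_kdtree(points[:mid], depth + 1),
        build_kdtree(points[mid + 1:], depth + 1),
    )


def kd_range(node: Optional[KDNode], lo: Point, hi: Point, out: List[Point]) -> None:
    """Collect all points p with lo <= p <= hi in every dimension."""
    if node is None:
        return
    p = node.point
    if all(l <= x <= h for l, x, h in zip(lo, p, hi)):
        out.append(p)
    # Left subtree holds values <= p[axis]; right subtree holds values >= p[axis].
    if lo[node.axis] <= p[node.axis]:
        kd_range(node.left, lo, hi, out)
    if p[node.axis] <= hi[node.axis]:
        kd_range(node.right, lo, hi, out)


class TwoLevelIndex:
    """Global level: bounding boxes (a stand-in for an R*-tree).
    Local level: one kd-tree per group of records."""

    def __init__(self) -> None:
        self.groups: List[Tuple[Point, Point, Optional[KDNode]]] = []

    def add_group(self, points: List[Point]) -> None:
        # Assumes a non-empty group of points with equal dimensionality.
        dims = range(len(points[0]))
        lo = tuple(min(p[d] for p in points) for d in dims)
        hi = tuple(max(p[d] for p in points) for d in dims)
        self.groups.append((lo, hi, build_kdtree(points)))

    def range_query(self, lo: Point, hi: Point) -> List[Point]:
        out: List[Point] = []
        for glo, ghi, tree in self.groups:
            # Visit a local index only if its bounding box overlaps the query.
            if all(gl <= h and l <= gh for gl, gh, l, h in zip(glo, ghi, lo, hi)):
                kd_range(tree, lo, hi, out)
        return out
```

In this sketch the global level only prunes whole groups, and the per-group kd-tree does the fine-grained filtering.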
-
A Fast Lightweight Time-Series Store for IoT Data
Authors:
Daniel G. Waddington,
Changhui Lin
Abstract:
With the advent of the Internet-of-Things (IoT), handling large volumes of time-series data has become a growing concern. Data, generated from millions of Internet-connected sensors, will drive new IoT applications and services. A key requirement is the ability to aggregate, preprocess, index, store and analyze data with minimal latency so that time-to-insight can be reduced. In the future, we expect real-time data collection and analysis to be performed both on small devices (e.g., in hubs and appliances) as well as in server-based infrastructure. The ability to localize sensitive data to the home, and thus preserve privacy, is a key driver for small-device deployment.
In this paper, we present an efficient architecture for time-series data management that provides a high data ingestion rate, while still being sufficiently lightweight that it can be deployed in embedded environments or small virtual machines. Our solution strives to minimize overhead and explores what can be done without complex indexing schemes that typically, for performance reasons, must be held in main memory. We combine a simple in-memory hierarchical index, log-structured store and in-flight sort, with a high-performance data pipeline architecture that is optimized for multicore platforms. We show that our solution is able to handle streaming insertions at over 4 million records per second (on a single x86 server) while still retaining SQL query performance better than or comparable to existing RDBMS.
Submitted 9 May, 2016; v1 submitted 4 May, 2016;
originally announced May 2016.
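The combination described in the abstract above (a simple in-memory index, a log-structured store, and an in-flight sort) can be illustrated with a toy sketch. It assumes a much simpler design than the paper's: one flat index of per-segment time ranges instead of a hierarchical index, Python lists instead of on-disk logs, and no multicore pipeline. The class name `TinyTimeSeriesStore` and its methods are hypothetical.

```python
import bisect
from typing import Any, List, Tuple

Record = Tuple[int, Any]  # (timestamp, value)


class TinyTimeSeriesStore:
    def __init__(self, buffer_size: int = 4) -> None:
        self.buffer_size = buffer_size
        self.buffer: List[Record] = []          # in-flight records, any order
        self.log: List[List[Record]] = []       # append-only sorted segments
        self.seg_ts: List[List[int]] = []       # timestamps per segment
        self.index: List[Tuple[int, int]] = []  # (min_ts, max_ts) per segment

    def insert(self, ts: int, value: Any) -> None:
        self.buffer.append((ts, value))
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self) -> None:
        """Sort the in-flight buffer and append it to the log as one segment."""
        if not self.buffer:
            return
        segment = sorted(self.buffer, key=lambda r: r[0])
        self.log.append(segment)
        self.seg_ts.append([r[0] for r in segment])
        self.index.append((segment[0][0], segment[-1][0]))
        self.buffer = []

    def query(self, start: int, end: int) -> List[Record]:
        """Return records with start <= ts <= end, ordered by timestamp."""
        out: List[Record] = []
        for (lo, hi), ts_list, segment in zip(self.index, self.seg_ts, self.log):
            if hi < start or lo > end:
                continue  # index lets us skip segments outside the range
            i = bisect.bisect_left(ts_list, start)
            j = bisect.bisect_right(ts_list, end)
            out.extend(segment[i:j])
        # Records still in flight are scanned directly.
        out.extend(r for r in self.buffer if start <= r[0] <= end)
        return sorted(out, key=lambda r: r[0])
```

Because each segment is sorted once at flush time, inserts stay append-only, and a range query needs only a binary search inside the segments that the index does not rule out.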