-
Real-Time LSM-Trees for HTAP Workloads
Authors:
Hemant Saxena,
Lukasz Golab,
Stratos Idreos,
Ihab F. Ilyas
Abstract:
Real-time analytics systems employ hybrid data layouts in which data are stored in different formats throughout their lifecycle. Recent data are stored in a row-oriented format to serve OLTP workloads and support high insert rates, while older data are transformed to a column-oriented format for OLAP access patterns. We observe that a Log-Structured Merge (LSM) Tree is a natural fit for a lifecycl…
▽ More
Real-time analytics systems employ hybrid data layouts in which data are stored in different formats throughout their lifecycle. Recent data are stored in a row-oriented format to serve OLTP workloads and support high insert rates, while older data are transformed to a column-oriented format for OLAP access patterns. We observe that a Log-Structured Merge (LSM) Tree is a natural fit for a lifecycle-aware storage engine due to its high write throughput and level-oriented structure, in which records propagate from one level to the next over time. To build a lifecycle-aware storage engine using an LSM-Tree, we make a crucial modification to allow different data layouts in different levels, ranging from purely row-oriented to purely column-oriented, leading to a Real-Time LSM-Tree. We give a cost model and an algorithm to design a Real-Time LSM-Tree that is suitable for a given workload, followed by an experimental evaluation of LASER - a prototype implementation of our idea built on top of the RocksDB key-value store.
△ Less
Submitted 14 July, 2022; v1 submitted 17 January, 2021;
originally announced January 2021.
-
Competitive Influence Propagation and Fake News Mitigation in the Presence of Strong User Bias
Authors:
Akrati Saxena,
Harsh Saxena,
Ralucca Gera
Abstract:
Due to the extensive role of social networks in social media, it is easy for people to share the news, and it spreads faster than ever before. These platforms also have been exploited to share the rumor or fake information, which is a threat to society. One method to reduce the impact of fake information is making people aware of the correct information based on hard proof. In this work, first, we…
▽ More
Due to the extensive role of social networks in social media, it is easy for people to share the news, and it spreads faster than ever before. These platforms also have been exploited to share the rumor or fake information, which is a threat to society. One method to reduce the impact of fake information is making people aware of the correct information based on hard proof. In this work, first, we propose a propagation model called Competitive Independent Cascade Model with users' Bias (CICMB) that considers the presence of strong user bias towards different opinions, believes, or political parties. We further propose a method, called $k-TruthScore$, to identify an optimal set of truth campaigners from a given set of prospective truth campaigners to minimize the influence of rumor spreaders on the network. We compare $k-TruthScore$ with state of the art methods, and we measure their performances as the percentage of the saved nodes (nodes that would have believed in the fake news in the absence of the truth campaigners). We present these results on a few real-world networks, and the results show that $k-TruthScore$ method outperforms baseline methods.
△ Less
Submitted 9 November, 2020;
originally announced November 2020.
-
Distributed Dependency Discovery
Authors:
Hemant Saxena,
Lukasz Golab,
Ihab F. Ilyas
Abstract:
We analyze the problem of discovering dependencies from distributed big data. Existing (non-distributed) algorithms focus on minimizing computation by pruning the search space of possible dependencies. However, distributed algorithms must also optimize communication costs, especially in shared-nothing settings, leading to a more complex optimization space. To understand this space, we introduce si…
▽ More
We analyze the problem of discovering dependencies from distributed big data. Existing (non-distributed) algorithms focus on minimizing computation by pruning the search space of possible dependencies. However, distributed algorithms must also optimize communication costs, especially in shared-nothing settings, leading to a more complex optimization space. To understand this space, we introduce six primitives shared by existing dependency discovery algorithms, corresponding to data processing steps separated by communication barriers. Through case studies, we show how the primitives allow us to analyze the design space and develop communication-optimized implementations. Finally, we support our analysis with an experimental evaluation on real datasets.
△ Less
Submitted 12 March, 2019;
originally announced March 2019.