Real-Time LSM-Trees for HTAP Workloads

Authors: Hemant Saxena, Lukasz Golab, Stratos Idreos, Ihab F. Ilyas

Abstract: Real-time analytics systems employ hybrid data layouts in which data are stored in different formats throughout their lifecycle. Recent data are stored in a row-oriented format to serve OLTP workloads and support high insert rates, while older data are transformed to a column-oriented format for OLAP access patterns. We observe that a Log-Structured Merge (LSM) Tree is a natural fit for a lifecycle-aware storage engine due to its high write throughput and level-oriented structure, in which records propagate from one level to the next over time. To build a lifecycle-aware storage engine using an LSM-Tree, we make a crucial modification to allow different data layouts in different levels, ranging from purely row-oriented to purely column-oriented, leading to a Real-Time LSM-Tree. We give a cost model and an algorithm to design a Real-Time LSM-Tree that is suitable for a given workload, followed by an experimental evaluation of LASER - a prototype implementation of our idea built on top of the RocksDB key-value store.

Submitted 14 July, 2022; v1 submitted 17 January, 2021; originally announced January 2021.

arXiv:2011.04857 [pdf, other]

Competitive Influence Propagation and Fake News Mitigation in the Presence of Strong User Bias

Authors: Akrati Saxena, Harsh Saxena, Ralucca Gera

Abstract: Due to the extensive role of social networks in social media, it is easy for people to share the news, and it spreads faster than ever before. These platforms also have been exploited to share rumors or fake information, which is a threat to society. One method to reduce the impact of fake information is making people aware of the correct information based on hard proof. In this work, first, we propose a propagation model called Competitive Independent Cascade Model with users' Bias (CICMB) that considers the presence of strong user bias towards different opinions, beliefs, or political parties. We further propose a method, called $k-TruthScore$, to identify an optimal set of truth campaigners from a given set of prospective truth campaigners to minimize the influence of rumor spreaders on the network. We compare $k-TruthScore$ with state-of-the-art methods, and we measure their performance as the percentage of the saved nodes (nodes that would have believed in the fake news in the absence of the truth campaigners). We present these results on a few real-world networks, and the results show that the $k-TruthScore$ method outperforms baseline methods.

Submitted 9 November, 2020; originally announced November 2020.

Comments: Accepted in CSoNet 2020

arXiv:1903.05228 [pdf, other]

Distributed Dependency Discovery

Authors: Hemant Saxena, Lukasz Golab, Ihab F. Ilyas

Abstract: We analyze the problem of discovering dependencies from distributed big data. Existing (non-distributed) algorithms focus on minimizing computation by pruning the search space of possible dependencies. However, distributed algorithms must also optimize communication costs, especially in shared-nothing settings, leading to a more complex optimization space. To understand this space, we introduce six primitives shared by existing dependency discovery algorithms, corresponding to data processing steps separated by communication barriers. Through case studies, we show how the primitives allow us to analyze the design space and develop communication-optimized implementations. Finally, we support our analysis with an experimental evaluation on real datasets.

Submitted 12 March, 2019; originally announced March 2019.

Showing 1–3 of 3 results for author: Saxena, H