-
Scaling Package Queries to a Billion Tuples via Hierarchical Partitioning and Customized Optimization
Authors:
Anh L. Mai,
Pengyu Wang,
Azza Abouzied,
Matteo Brucato,
Peter J. Haas,
Alexandra Meliou
Abstract:
A package query returns a package - a multiset of tuples - that maximizes or minimizes a linear objective function subject to linear constraints, thereby enabling in-database decision support. Prior work has established the equivalence of package queries to Integer Linear Programs (ILPs) and developed the SketchRefine algorithm for package query processing. While this algorithm was an important first step toward supporting prescriptive analytics scalably inside a relational database, it struggles when the data size grows beyond a few hundred million tuples or when the constraints become very tight. In this paper, we present Progressive Shading, a novel algorithm for processing package queries that can scale efficiently to billions of tuples and gracefully handle tight constraints. Progressive Shading solves a sequence of optimization problems over a hierarchy of relations, each resulting from an ever-finer partitioning of the original tuples into homogeneous groups until the original relation is obtained. This strategy avoids the premature discarding of high-quality tuples that can occur with SketchRefine. Our novel partitioning scheme, Dynamic Low Variance, can handle very large relations with multiple attributes and can dynamically adapt to both concentrated and spread-out sets of attribute values, provably outperforming traditional partitioning schemes such as KD-tree. We further optimize our system by replacing our off-the-shelf optimization software with customized ILP and LP solvers, called Dual Reducer and Parallel Dual Simplex respectively, that are highly accurate and orders of magnitude faster.
Submitted 14 November, 2023; v1 submitted 6 July, 2023;
originally announced July 2023.
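The equivalence between package queries and ILPs mentioned in the abstract can be sketched in generic form. The notation below is an illustrative assumption, not taken from the paper: $x_i$ is the multiplicity of tuple $i$ in the package, $c_i$ is its objective coefficient, $a_{ji}$ is its coefficient in linear constraint $j$, and $b_j$ is that constraint's bound.

```latex
\begin{aligned}
\max_{x} \quad & \sum_{i=1}^{N} c_i x_i \\
\text{s.t.} \quad & \sum_{i=1}^{N} a_{ji} x_i \le b_j, \qquad j = 1, \dots, m, \\
& x_i \in \{0, 1, 2, \dots\}, \qquad i = 1, \dots, N.
\end{aligned}
```

A minimization objective or a "$\ge$" constraint fits the same template after negating the corresponding coefficients. With $N$ in the billions, the number of integer variables is what makes direct use of an off-the-shelf solver impractical.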
-
Understanding Business Users' Data-Driven Decision-Making: Practices, Challenges, and Opportunities
Authors:
Sneha Gathani,
Zhicheng Liu,
Peter J. Haas,
Çağatay Demiralp
Abstract:
Business users perform data analysis to inform decisions for improving business processes and outcomes despite having limited formal technical training. While earlier work has focused on data analysts' and data scientists' practices and challenges, little is known about business users' decision-making practices and how they incorporate data and visual analytics into their workflows. To address this gap, we first conduct an interview study with 22 business users to understand the general practices and challenges in their data-driven decision-making processes. We contribute an end-to-end model of business users' data-driven decision-making processes elaborating the tasks, tools, and challenges at each stage. We find that business users analyze data without relying on data analysts due to various practical constraints and considerations. However, their existing tools are inadequate, particularly in helping them understand the relationship between data variables and business goals and in facilitating the exploration of what-if scenarios. These findings suggest a need for advanced predictive and prescriptive analytics (PPA) tools to support what-if analysis. Motivated by this need, we perform a follow-up, task-based study to understand PPA's role and potential in business users' decision-making processes. We find that PPA helps improve efficiency and confidence in decision-making. However, business users also believe that PPA-powered what-if analysis tools are currently in their nascent stages and report that improvements are needed before fully integrating them into their decision-making processes. Building upon these findings, we discuss the opportunities and challenges in incorporating PPA into data-driven decision-making and its implications for future data and visual analytics systems.
Submitted 17 October, 2023; v1 submitted 27 December, 2022;
originally announced December 2022.
-
Augmenting Decision Making via Interactive What-If Analysis
Authors:
Sneha Gathani,
Madelon Hulsebos,
James Gale,
Peter J. Haas,
Çağatay Demiralp
Abstract:
The fundamental goal of business data analysis is to improve business decisions using data. Business users often make decisions to achieve key performance indicators (KPIs) such as increasing customer retention or sales, or decreasing costs. To discover the relationship between data attributes hypothesized to be drivers and those corresponding to KPIs of interest, business users currently need to perform lengthy exploratory analyses. This involves considering multitudes of combinations and scenarios and performing slicing, dicing, and transformations on the data accordingly, e.g., analyzing customer retention across quarters of the year or suggesting optimal media channels across strata of customers. However, the increasing complexity of datasets combined with the cognitive limitations of humans makes it challenging to carry over multiple hypotheses, even for simple datasets. Therefore mentally performing such analyses is hard. Existing commercial tools either provide partial solutions or fail to cater to business users altogether. Here we argue for four functionalities to enable business users to interactively learn and reason about the relationships between sets of data attributes thereby facilitating data-driven decision making. We implement these functionalities in SystemD, an interactive visual data analysis system enabling business users to experiment with the data by asking what-if questions. We evaluate the system through three business use cases: marketing mix modeling, customer retention analysis, and deal closing analysis, and report on feedback from multiple business users. Users find the SystemD functionalities highly useful for quick testing and validation of their hypotheses around their KPIs of interest, addressing their unmet analysis needs. The feedback also suggests that the UX design can be enhanced to further improve the understandability of these functionalities.
Submitted 8 February, 2022; v1 submitted 13 September, 2021;
originally announced September 2021.
-
Exact PPS Sampling with Bounded Sample Size
Authors:
Brian Hentschel,
Peter J. Haas,
Yuanyuan Tian
Abstract:
Probability proportional to size (PPS) sampling schemes with a target sample size aim to produce a sample comprising a specified number $n$ of items while ensuring that each item in the population appears in the sample with a probability proportional to its specified "weight" (also called its "size"). These two objectives, however, cannot always be achieved simultaneously. Existing PPS schemes prioritize control of the sample size, violating the PPS property if necessary. We provide a new PPS scheme that allows a different trade-off: our method enforces the PPS property at all times while ensuring that the sample size never exceeds the target value $n$. The sample size is exactly equal to $n$ if possible, and otherwise has maximal expected value and minimal variance. Thus we bound the sample size, thereby avoiding storage overflows and helping to control the time required for analytics over the sample, while allowing the user complete control over the sample contents. The method is both simple to implement and efficient, being a one-pass streaming algorithm with an amortized processing time of $O(1)$ per item.
Submitted 22 May, 2021;
originally announced May 2021.
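The conflict between the two objectives in the abstract above can be illustrated with a small calculation. The sketch below is not the paper's streaming algorithm; it only computes, for a fixed population, the largest inclusion probabilities that stay exactly proportional to the weights while keeping the expected sample size at most $n$. The function name and the batch (non-streaming) setting are assumptions made for illustration.

```python
from typing import List


def pps_inclusion_probabilities(weights: List[float], n: int) -> List[float]:
    """Largest probabilities proportional to `weights` with expected size <= n.

    Illustrative batch computation only (hypothetical helper, not from the paper).
    """
    if n <= 0:
        raise ValueError("target sample size n must be positive")
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty list of positive numbers")

    total = sum(weights)
    heaviest = max(weights)
    # Two caps on the proportionality constant rho:
    #   rho <= n / total     keeps the expected sample size at most n
    #   rho <= 1 / heaviest  keeps every probability at most 1
    rho = min(n / total, 1.0 / heaviest)
    return [rho * w for w in weights]


# Exact PPS with expected size n is achievable here.
balanced = pps_inclusion_probabilities([1.0, 1.0, 2.0], n=2)
# One dominant weight forces the expected size below n.
skewed = pps_inclusion_probabilities([1.0, 1.0, 8.0], n=2)
```

In the second call the heaviest item already has probability 1, so strict proportionality caps the expected sample size at 1.25 rather than 2, which is the kind of situation where a scheme must give up either the PPS property or the exact sample size.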
-
Stochastic Package Queries in Probabilistic Databases
Authors:
Matteo Brucato,
Nishant Yadav,
Azza Abouzied,
Peter J. Haas,
Alexandra Meliou
Abstract:
We provide methods for in-database support of decision making under uncertainty. Many important decision problems correspond to selecting a package (bag of tuples in a relational database) that jointly satisfy a set of constraints while minimizing some overall cost function; in most real-world problems, the data is uncertain. We provide methods for specifying -- via a SQL extension -- and processing stochastic package queries (SPQs), in order to solve optimization problems over uncertain data, right where the data resides. Prior work in stochastic programming uses Monte Carlo methods where the original stochastic optimization problem is approximated by a large deterministic optimization problem that incorporates many scenarios, i.e., sample realizations of the uncertain data values. For large database tables, however, a huge number of scenarios is required, leading to poor performance and, often, failure of the solver software. We therefore provide a novel SummarySearch algorithm that, instead of trying to solve a large deterministic problem, seamlessly approximates it via a sequence of smaller problems defined over carefully crafted summaries of the scenarios that accelerate convergence to a feasible and near-optimal solution. Experimental results on our prototype system show that SummarySearch can be orders of magnitude faster than prior methods at finding feasible and high-quality packages.
Submitted 11 March, 2021;
originally announced March 2021.
-
Temporally-Biased Sampling Schemes for Online Model Management
Authors:
Brian Hentschel,
Peter J. Haas,
Yuanyuan Tian
Abstract:
To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally-biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying over time according to a specified "decay function". We then periodically retrain the models on the current sample. This approach speeds up the training process relative to training on all of the data. Moreover, time-biasing lets the models adapt to recent changes in the data while---unlike in a sliding-window approach---still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. We provide and analyze both a simple sampling scheme (T-TBS) that probabilistically maintains a target sample size and a novel reservoir-based scheme (R-TBS) that is the first to provide both control over the decay rate and a guaranteed upper bound on the sample size. If the decay function is exponential, then control over the decay rate is complete, and R-TBS maximizes both expected sample size and sample-size stability. For general decay functions, the actual item inclusion probabilities can be made arbitrarily close to the nominal probabilities, and we provide a scheme that allows a trade-off between sample footprint and sample-size stability. The R-TBS and T-TBS schemes are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data.
Submitted 11 June, 2019;
originally announced June 2019.
-
Unknown Examples & Machine Learning Model Generalization
Authors:
Yeounoh Chung,
Peter J. Haas,
Eli Upfal,
Tim Kraska
Abstract:
Over the past decades, researchers and ML practitioners have come up with better and better ways to build, understand and improve the quality of ML models, but mostly under the key assumption that the training data is distributed identically to the testing data. In many real-world applications, however, some potential training examples are unknown to the modeler, due to sample selection bias or, more generally, covariate shift, i.e., a distribution shift between the training and deployment stage. The resulting discrepancy between training and testing distributions leads to poor generalization performance of the ML model and hence biased predictions. We provide novel algorithms that estimate the number and properties of these unknown training examples---unknown unknowns. This information can then be used to correct the training set, prior to seeing any test data. The key idea is to combine species-estimation techniques with data-driven methods for estimating the feature values for the unknown unknowns. Experiments on a variety of ML models and datasets indicate that taking the unknown examples into account can yield a more robust ML model that generalizes better.
Submitted 11 October, 2019; v1 submitted 24 August, 2018;
originally announced August 2018.
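The abstract above does not say which species-estimation technique is used, so the sketch below substitutes the classical Chao1 estimator purely as an example of the idea: it estimates how many distinct classes exist in total from how many were observed exactly once and exactly twice. The function name and the toy data are assumptions.

```python
from collections import Counter
from typing import Hashable, Iterable


def chao1_estimate(observations: Iterable[Hashable]) -> float:
    """Chao1 estimate of the total number of distinct classes.

    Classical estimator used as an illustration; not necessarily the
    estimator used in the paper.
    """
    counts = Counter(observations)
    observed = len(counts)                                     # distinct classes seen
    singletons = sum(1 for c in counts.values() if c == 1)     # seen exactly once
    doubletons = sum(1 for c in counts.values() if c == 2)     # seen exactly twice
    if doubletons > 0:
        return observed + singletons * singletons / (2.0 * doubletons)
    # Bias-corrected form for the case with no doubletons.
    return observed + singletons * (singletons - 1) / 2.0


sample = ["a", "b", "c", "d", "d", "e", "e", "f", "f", "f", "f", "f"]
estimated_total = chao1_estimate(sample)
estimated_unseen = estimated_total - len(set(sample))
```

Many singletons relative to doubletons signal that the sample has only scratched the surface, so the estimated number of unseen classes grows accordingly.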
-
Temporally-Biased Sampling for Online Model Management
Authors:
Brian Hentschel,
Peter J. Haas,
Yuanyuan Tian
Abstract:
To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally-biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying exponentially over time. We then periodically retrain the models on the current sample. This approach speeds up the training process relative to training on all of the data. Moreover, time-biasing lets the models adapt to recent changes in the data while -- unlike in a sliding-window approach -- still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. We provide and analyze both a simple sampling scheme (T-TBS) that probabilistically maintains a target sample size and a novel reservoir-based scheme (R-TBS) that is the first to provide both complete control over the decay rate and a guaranteed upper bound on the sample size, while maximizing both expected sample size and sample-size stability. The latter scheme rests on the notion of a "fractional sample" and, unlike T-TBS, allows for data arrival rates that are unknown and time varying. R-TBS and T-TBS are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data.
Submitted 29 January, 2018;
originally announced January 2018.
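The idea of exponentially decaying inclusion probabilities can be sketched with a simple Bernoulli-style update, in the spirit of a scheme that only probabilistically targets a sample size. This is a simplified illustration, not the authors' T-TBS or R-TBS code: the function name, the parameters `lam` (decay rate per batch), `target_size`, and `mean_batch_size`, and the acceptance rule are all assumptions chosen so that the expected sample size settles at the target when batches have the assumed mean size.

```python
import math
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def decay_step(
    sample: Sequence[T],
    batch: Sequence[T],
    lam: float,
    target_size: int,
    mean_batch_size: float,
    rng: random.Random,
) -> List[T]:
    """Process one arriving batch with exponential time-biasing (hypothetical helper)."""
    if lam < 0 or target_size <= 0 or mean_batch_size <= 0:
        raise ValueError("lam must be >= 0; sizes must be positive")

    retain = math.exp(-lam)  # each existing item survives one step w.p. e^{-lam}
    # Accept new items so that expected inflow balances expected decay
    # at the target size: target * (1 - retain) = accept * mean_batch_size.
    accept = min(1.0, target_size * (1.0 - retain) / mean_batch_size)

    kept = [item for item in sample if rng.random() < retain]
    added = [item for item in batch if rng.random() < accept]
    return kept + added


rng = random.Random(0)
no_decay = decay_step([1, 2, 3], [4, 5], lam=0.0, target_size=2,
                      mean_batch_size=2.0, rng=rng)
full_decay = decay_step([1, 2, 3], [4, 5], lam=1e9, target_size=2,
                        mean_batch_size=2.0, rng=rng)
```

An item accepted at some step remains in the sample $k$ steps later with probability proportional to $e^{-\lambda k}$, which gives the exponential bias toward recent data. The sample size here is only controlled in expectation, so it can overshoot when batches are larger than assumed; a guaranteed upper bound is what a reservoir-based scheme adds.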
-
Foresight: Rapid Data Exploration Through Guideposts
Authors:
Çağatay Demiralp,
Peter J. Haas,
Srinivasan Parthasarathy,
Tejaswini Pedapati
Abstract:
Current tools for exploratory data analysis (EDA) require users to manually select data attributes, statistical computations and visual encodings. This can be daunting for large-scale, complex data. We introduce Foresight, a visualization recommender system that helps the user rapidly explore large high-dimensional datasets through "guideposts." A guidepost is a visualization corresponding to a pronounced instance of a statistical descriptor of the underlying data, such as a strong linear correlation between two attributes, high skewness or concentration about the mean of a single attribute, or a strong clustering of values. For each descriptor, Foresight initially presents visualizations of the "strongest" instances, based on an appropriate ranking metric. Given these initial guideposts, the user can then look at "nearby" guideposts by issuing "guidepost queries" containing constraints on metric type, metric strength, data attributes, and data values. Thus, the user can directly explore the network of guideposts, rather than the overwhelming space of data attributes and visual encodings. Foresight also provides for each descriptor a global visualization of ranking-metric values to both help orient the user and ensure a thorough exploration process. Foresight facilitates interactive exploration of large datasets using fast, approximate sketching to compute ranking metrics. We also contribute insights on EDA practices of data scientists, summarizing results from an interview study we conducted to inform the design of Foresight.
Submitted 29 September, 2017;
originally announced September 2017.
-
Foresight: Recommending Visual Insights
Authors:
Çağatay Demiralp,
Peter J. Haas,
Srinivasan Parthasarathy,
Tejaswini Pedapati
Abstract:
Current tools for exploratory data analysis (EDA) require users to manually select data attributes, statistical computations and visual encodings. This can be daunting for large-scale, complex data. We introduce Foresight, a system that helps the user rapidly discover visual insights from large high-dimensional datasets. Formally, an "insight" is a strong manifestation of a statistical property of the data, e.g., high correlation between two attributes, high skewness or concentration about the mean of a single attribute, a strong clustering of values, and so on. For each insight type, Foresight initially presents visualizations of the top k instances in the data, based on an appropriate ranking metric. The user can then look at "nearby" insights by issuing "insight queries" containing constraints on insight strengths and data attributes. Thus the user can directly explore the space of insights, rather than the space of data dimensions and visual encodings as in other visual recommender systems. Foresight also provides "global" views of insight space to help orient the user and ensure a thorough exploration process. Furthermore, Foresight facilitates interactive exploration of large datasets through fast, approximate sketching.
Submitted 12 July, 2017;
originally announced July 2017.
-
Biquaternion formulation of relativistic tensor dynamics
Authors:
E. P. J. de Haas
Abstract:
In this paper we show how relativistic tensor dynamics and relativistic electrodynamics can be formulated in a biquaternion tensor language. The treatment is restricted to mathematical physics: known facts such as the Lorentz Force Law and the Lagrange Equation are presented in a relatively new formalism. The goal is to fuse anti-symmetric tensor dynamics, as used for example in relativistic electrodynamics, and symmetric tensor dynamics, as used for example in introductions to general relativity, into one single formalism: a specific kind of biquaternion tensor calculus.
Submitted 15 October, 2013;
originally announced January 2014.
-
The combination of de Broglie's Harmony of the Phases and Mie's theory of gravity results in a Principle of Equivalence for Quantum Gravity
Authors:
E. P. J. de Haas
Abstract:
Under a Lorentz transformation, Mie's 1912 gravitational mass behaves identically to de Broglie's 1923 clock-like frequency. The same goes for Mie's inertial mass and de Broglie's wave-like frequency. This allows the interpretation of de Broglie's "Harmony of the Phases" as a "Principle of Equivalence" for Quantum Gravity. Thus, the particle-wave duality can be given a realist interpretation. The "Mie-de Broglie" interpretation suggests a correction of Hamilton's variational principle in the quantum domain. The equivalence of the masses can be seen as the classical "limit" of the quantum equivalence of the phases.
Submitted 21 July, 2005;
originally announced July 2005.