-
Scaling ML Products At Startups: A Practitioner's Guide
Authors:
Atul Dhingra,
Gaurav Sood
Abstract:
How do you scale a machine learning product at a startup? In particular, how do you serve a greater volume, velocity, and variety of queries cost-effectively? We break down costs into variable costs-the cost of serving the model and performant-and fixed costs-the cost of develo** and training new models. We propose a framework for conceptualizing these costs, breaking them into finer categories,…
▽ More
How do you scale a machine learning product at a startup? In particular, how do you serve a greater volume, velocity, and variety of queries cost-effectively? We break down costs into variable costs-the cost of serving the model and performant-and fixed costs-the cost of develo** and training new models. We propose a framework for conceptualizing these costs, breaking them into finer categories, and limn ways to reduce costs. Lastly, since in our experience, the most expensive fixed cost of a machine learning system is the cost of identifying the root causes of failures and driving continuous improvement, we present a way to conceptualize the issues and share our methodology for the same.
△ Less
Submitted 20 April, 2023;
originally announced April 2023.
-
Instate: Predicting the State of Residence From Last Name
Authors:
Atul Dhingra,
Gaurav Sood
Abstract:
India has twenty-two official languages. Serving such a diverse language base is a challenge for survey statisticians, call center operators, software developers, and other such service providers. To help provide better services to different language communities via better localization, we introduce a new machine learning model that predicts the language(s) that the user can speak from their name.…
▽ More
India has twenty-two official languages. Serving such a diverse language base is a challenge for survey statisticians, call center operators, software developers, and other such service providers. To help provide better services to different language communities via better localization, we introduce a new machine learning model that predicts the language(s) that the user can speak from their name. Using nearly 438M records spanning 33 Indian states and 1.13M unique last names from the Indian Electoral Rolls Corpus (?), we build a character-level transformer-based machine-learning model that predicts the state of residence based on the last name. The model has a top-3 accuracy of 85.3% on unseen names. We map the states to languages using the Indian census to infer languages understood by the respondent. We provide open-source software that implements the method discussed in the paper.
△ Less
Submitted 12 March, 2023;
originally announced March 2023.
-
QBF Merge Resolution is powerful but unnatural
Authors:
Meena Mahajan,
Gaurav Sood
Abstract:
The Merge Resolution proof system (M-Res) for QBFs, proposed by Beyersdorff et al. in 2019, explicitly builds partial strategies inside refutations. The original motivation for this approach was to overcome the limitations encountered in long-distance Q-Resolution proof system (LD-Q-Res), where the syntactic side-conditions, while prohibiting all unsound resolutions, also end up prohibiting some s…
▽ More
The Merge Resolution proof system (M-Res) for QBFs, proposed by Beyersdorff et al. in 2019, explicitly builds partial strategies inside refutations. The original motivation for this approach was to overcome the limitations encountered in long-distance Q-Resolution proof system (LD-Q-Res), where the syntactic side-conditions, while prohibiting all unsound resolutions, also end up prohibiting some sound resolutions. However, while the advantage of M-Res over many other resolution-based QBF proof systems was already demonstrated, a comparison with LD-Q-Res itself had remained open. In this paper, we settle this question. We show that M-Res has an exponential advantage over not only LD-Q-Res, but even over LQU$^+$-Res and IRM, the most powerful among currently known resolution-based QBF proof systems. Combining this with results from Beyersdorff et al. 2020, we conclude that M-Res is incomparable with LQU-Res and LQU$^+$-Res.
Our proof method reveals two additional and curious features about M-Res: (i) M-Res is not closed under restrictions, and is hence not a natural proof system, and (ii) weakening axiom clauses with existential variables provably yields an exponential advantage over M-Res without weakening. We further show that in the context of regular derivations, weakening axiom clauses with universal variables provably yields an exponential advantage over M-Res without weakening. These results suggest that M-Res is better used with weakening, though whether M-Res with weakening is closed under restrictions remains open. We note that even with weakening, M-Res continues to be simulated by eFrege $+$ $\forall$red (the simulation of ordinary M-Res was shown recently by Chew and Slivovsky).
△ Less
Submitted 17 December, 2023; v1 submitted 26 May, 2022;
originally announced May 2022.
-
Hard QBFs for Merge Resolution
Authors:
Olaf Beyersdorff,
Joshua Blinkhorn,
Meena Mahajan,
Tomáš Peitl,
Gaurav Sood
Abstract:
We prove the first genuine QBF proof size lower bounds for the proof system Merge Resolution (MRes [Olaf Beyersdorff et al., 2020]), a refutational proof system for prenex quantified Boolean formulas (QBF) with a CNF matrix. Unlike most QBF resolution systems in the literature, proofs in MRes consist of resolution steps together with information on countermodels, which are syntactically stored in…
▽ More
We prove the first genuine QBF proof size lower bounds for the proof system Merge Resolution (MRes [Olaf Beyersdorff et al., 2020]), a refutational proof system for prenex quantified Boolean formulas (QBF) with a CNF matrix. Unlike most QBF resolution systems in the literature, proofs in MRes consist of resolution steps together with information on countermodels, which are syntactically stored in the proofs as merge maps. As demonstrated in [Olaf Beyersdorff et al., 2020], this makes MRes quite powerful: it has strategy extraction by design and allows short proofs for formulas which are hard for classical QBF resolution systems.
Here we show the first genuine QBF exponential lower bounds for MRes, thereby uncovering limitations of MRes. Technically, the results are either transferred from bounds from circuit complexity (for restricted versions of MRes) or directly obtained by combinatorial arguments (for full MRes). Our results imply that the MRes approach is largely orthogonal to other QBF resolution models such as the QCDCL resolution systems QRes and QURes and the expansion systems $\forall$Exp+Res and IR.
△ Less
Submitted 26 January, 2024; v1 submitted 12 December, 2020;
originally announced December 2020.
-
MaxSAT Resolution and Subcube Sums
Authors:
Yuval Filmus,
Meena Mahajan,
Gaurav Sood,
Marc Vinyals
Abstract:
We study the MaxRes rule in the context of certifying unsatisfiability. We show that it can be exponentially more powerful than tree-like resolution, and when augmented with weakening (the system MaxResW), p-simulates tree-like resolution. In devising a lower bound technique specific to MaxRes (and not merely inheriting lower bounds from Res), we define a new proof system called the SubCubeSums pr…
▽ More
We study the MaxRes rule in the context of certifying unsatisfiability. We show that it can be exponentially more powerful than tree-like resolution, and when augmented with weakening (the system MaxResW), p-simulates tree-like resolution. In devising a lower bound technique specific to MaxRes (and not merely inheriting lower bounds from Res), we define a new proof system called the SubCubeSums proof system. This system, which p-simulates MaxResW, can be viewed as a special case of the semialgebraic Sherali-Adams proof system. In expressivity, it is the integral restriction of conical juntas studied in the contexts of communication complexity and extension complexity. We show that it is not simulated by Res. Using a proof technique qualitatively different from the lower bounds that MaxResW inherits from Res, we show that Tseitin contradictions on expander graphs are hard to refute in SubCubeSums. We also establish a lower bound technique via lifting: for formulas requiring large degree in SubCubeSums, their XOR-ification requires large size in SubCubeSums.
△ Less
Submitted 22 October, 2022; v1 submitted 23 May, 2020;
originally announced May 2020.
-
Pwned: How Often Are Americans' Online Accounts Breached?
Authors:
Ken Cor,
Gaurav Sood
Abstract:
News about massive online breaches is increasingly common. But there has been little good data on how exposed people are because of these breaches. We combine data from a large, representative sample of adult Americans (n = 5,000) with data from \textit{Have I Been Pwned} to estimate the lower bound of the average number of breached online accounts per person. We find that at least 82.84% of Ameri…
▽ More
News about massive online breaches is increasingly common. But there has been little good data on how exposed people are because of these breaches. We combine data from a large, representative sample of adult Americans (n = 5,000) with data from \textit{Have I Been Pwned} to estimate the lower bound of the average number of breached online accounts per person. We find that at least 82.84% of Americans have had their accounts breached. And that on average Americans' accounts have been breached at least three times. Better educated, the middle-aged, women, and Whites are more likely to have had their accounts breached than the complementary groups.
△ Less
Submitted 18 February, 2019; v1 submitted 29 July, 2018;
originally announced August 2018.
-
Street Sense: Learning from Google Street View
Authors:
Suriyan Laohaprapanon,
Kimberly Ortleb,
Gaurav Sood
Abstract:
How good are the public services and the public infrastructure? Does their quality vary by income? These are vital questions---they shed light on how well the government is doing its job, the consequences of disparities in local funding, etc. But there is little good data on many of these questions. We fill this gap by describing a scalable method of getting data on one crucial piece of public inf…
▽ More
How good are the public services and the public infrastructure? Does their quality vary by income? These are vital questions---they shed light on how well the government is doing its job, the consequences of disparities in local funding, etc. But there is little good data on many of these questions. We fill this gap by describing a scalable method of getting data on one crucial piece of public infrastructure: roads. We assess the quality of roads and sidewalks by exploiting data from Google Street View. We randomly sample locations on major roads, query Google Street View images for those locations and code the images using Amazon's Mechanical Turk. We apply this method to assess the quality of roads in Bangkok, Jakarta, Lagos, and Wayne County, Michigan. Jakarta's roads have nearly four times the potholes than roads of any other city. Surprisingly, the proportion of road segments with potholes in Bangkok, Lagos, and Wayne is about the same, between .06 and .07. Using the data, we also estimate the relation between the condition of the roads and local income in Wayne, MI. We find that roads in more affluent census tracts have somewhat fewer potholes.
△ Less
Submitted 10 July, 2018;
originally announced July 2018.
-
On the computational complexity of Data Flow Analysis
Authors:
Gaurav Sood,
K. Murali Krishnan
Abstract:
We consider the problem of Data Flow Analysis over monotone data flow frameworks with a finite lattice. The problem of computing the Maximum Fixed Point (MFP) solution is shown to be P-complete even when the lattice has just four elements. This shows that the problem is unlikely to be efficiently parallelizable. It is also shown that the problem of computing the Meet Over all Paths (MOP) solution…
▽ More
We consider the problem of Data Flow Analysis over monotone data flow frameworks with a finite lattice. The problem of computing the Maximum Fixed Point (MFP) solution is shown to be P-complete even when the lattice has just four elements. This shows that the problem is unlikely to be efficiently parallelizable. It is also shown that the problem of computing the Meet Over all Paths (MOP) solution is NL-complete (and hence efficiently parallelizable) when the lattice is finite even for non-monotone data flow frameworks. These results appear in contrast with the fact that when the lattice is not finite, solving the MOP problem is undecidable and hence significantly harder than the MFP problem which is polynomial time computable for lattices of finite height.
△ Less
Submitted 18 March, 2013;
originally announced March 2013.