-
Speakerly: A Voice-based Writing Assistant for Text Composition
Authors:
Dhruv Kumar,
Vipul Raheja,
Alice Kaiser-Schatzlein,
Robyn Perry,
Apurva Joshi,
Justin Hugues-Nuger,
Samuel Lou,
Navid Chowdhury
Abstract:
We present Speakerly, a new real-time voice-based writing assistance system that helps users with text composition across various use cases such as emails, instant messages, and notes. The user can interact with the system through instructions or dictation, and the system generates a well-formatted and coherent document. We describe the system architecture and detail how we address the various challenges while building and deploying such a system at scale. More specifically, our system uses a combination of small, task-specific models as well as pre-trained language models for fast and effective text composition while supporting a variety of input modes for better usability.
Submitted 24 October, 2023;
originally announced October 2023.
-
Causal Discovery in Heterogeneous Environments Under the Sparse Mechanism Shift Hypothesis
Authors:
Ronan Perry,
Julius von Kügelgen,
Bernhard Schölkopf
Abstract:
Machine learning approaches commonly rely on the assumption of independent and identically distributed (i.i.d.) data. In reality, however, this assumption is almost always violated due to distribution shifts between environments. Although valuable learning signals can be provided by heterogeneous data from changing distributions, it is also known that learning under arbitrary (adversarial) changes is impossible. Causality provides a useful framework for modeling distribution shifts, since causal models encode both observational and interventional distributions. In this work, we explore the sparse mechanism shift hypothesis, which posits that distribution shifts occur due to a small number of changing causal conditionals. Motivated by this idea, we apply it to learning causal structure from heterogeneous environments, where i.i.d. data only allows for learning an equivalence class of graphs without restrictive assumptions. We propose the Mechanism Shift Score (MSS), a score-based approach amenable to various empirical estimators, which provably identifies the entire causal structure with high probability if the sparse mechanism shift hypothesis holds. Empirically, we verify behavior predicted by the theory and compare multiple estimators and score functions to identify the best approaches in practice. Compared to other methods, we show how MSS bridges a gap by both being nonparametric as well as explicitly leveraging sparse changes.
Submitted 15 October, 2022; v1 submitted 4 June, 2022;
originally announced June 2022.
-
mvlearn: Multiview Machine Learning in Python
Authors:
Ronan Perry,
Gavin Mischler,
Richard Guo,
Theodore Lee,
Alexander Chang,
Arman Koul,
Cameron Franz,
Hugo Richard,
Iain Carmichael,
Pierre Ablin,
Alexandre Gramfort,
Joshua T. Vogelstein
Abstract:
As data are generated more and more from multiple disparate sources, multiview data sets, where each sample has features in distinct views, have ballooned in recent years. However, no comprehensive package exists that enables non-specialists to use these methods easily. mvlearn is a Python library which implements the leading multiview machine learning methods. Its simple API closely follows that of scikit-learn for increased ease-of-use. The package can be installed from Python Package Index (PyPI) and the conda package manager and is released under the MIT open-source license. The documentation, detailed examples, and all releases are available at https://mvlearn.github.io/.
Submitted 25 May, 2021; v1 submitted 24 May, 2020;
originally announced May 2020.
-
High-dimensional and universally consistent k-sample tests
Authors:
Sambit Panda,
Cencheng Shen,
Ronan Perry,
Jelle Zorn,
Antoine Lutz,
Carey E. Priebe,
Joshua T. Vogelstein
Abstract:
The k-sample testing problem involves determining whether $k$ groups of data points are each drawn from the same distribution. The standard method for k-sample testing in biomedicine is Multivariate analysis of variance (MANOVA), even though it depends on strong, and often unsuitable, parametric assumptions. Moreover, independence testing and k-sample testing are closely related, and several universally consistent high-dimensional independence tests such as distance correlation (Dcorr) and Hilbert-Schmidt-Independence-Criterion (Hsic) enjoy solid theoretical and empirical properties. In this paper, we prove that independence tests achieve universally consistent k-sample testing and that k-sample statistics such as Energy and Maximum Mean Discrepancy (MMD) are precisely equivalent to Dcorr. An empirical evaluation of nonparametric independence tests showed that they generally perform better than the popular MANOVA test, even in Gaussian distributed scenarios. The evaluation included several popular independence statistics and covered a comprehensive set of simulations. Additionally, the testing approach was extended to perform multiway and multilevel tests, which were demonstrated in a simulated study as well as on real-world fMRI brain scans with a set of attributes.
Submitted 11 October, 2023; v1 submitted 19 October, 2019;
originally announced October 2019.
-
Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks
Authors:
Adam Li,
Ronan Perry,
Chester Huynh,
Tyler M. Tomita,
Ronak Mehta,
Jesus Arroyo,
Jesse Patsolic,
Benjamin Falk,
Joshua T. Vogelstein
Abstract:
Decision forests (Forests), in particular random forests and gradient boosting trees, have demonstrated state-of-the-art accuracy compared to other methods in many supervised learning scenarios. In particular, Forests dominate other methods in tabular data, that is, when the feature space is unstructured, so that the signal is invariant to a permutation of the feature indices. However, in structured data lying on a manifold (such as images, text, and speech) deep networks (Networks), specifically convolutional deep networks (ConvNets), tend to outperform Forests. We conjecture that at least part of the reason for this is that the input to Networks is not simply the feature magnitudes, but also their indices. In contrast, naive Forest implementations fail to explicitly consider feature indices. A recently proposed Forest approach demonstrates that Forests, for each node, implicitly sample a random matrix from some specific distribution. These Forests, like some classes of Networks, learn by partitioning the feature space into convex polytopes corresponding to linear functions. We build on that approach and show that one can choose distributions in a manifold-aware fashion to incorporate feature locality. We demonstrate the empirical performance on data whose features live on three different manifolds: a torus, images, and time-series. Moreover, we demonstrate its strength in multivariate simulated settings and also show superiority in predicting surgical outcome in epilepsy patients and predicting movement direction from raw stereotactic EEG data from non-motor brain regions. In all simulations and real data, the Manifold Oblique Random Forest (MORF) algorithm outperforms approaches that ignore feature space structure and challenges the performance of ConvNets. Moreover, MORF runs fast and maintains interpretability and theoretical justification.
Submitted 5 September, 2022; v1 submitted 25 September, 2019;
originally announced September 2019.
-
Random Forests for Adaptive Nearest Neighbor Estimation of Information-Theoretic Quantities
Authors:
Ronan Perry,
Ronak Mehta,
Richard Guo,
Eva Yezerets,
Jesús Arroyo,
Mike Powell,
Hayden Helm,
Cencheng Shen,
Joshua T. Vogelstein
Abstract:
Information-theoretic quantities, such as conditional entropy and mutual information, are critical data summaries for quantifying uncertainty. Current widely used approaches for computing such quantities rely on nearest neighbor methods and exhibit both strong performance and theoretical guarantees in certain simple scenarios. However, existing approaches fail in high-dimensional settings and when different features are measured on different scales. We propose decision forest-based adaptive nearest neighbor estimators and show that they are able to effectively estimate posterior probabilities, conditional entropies, and mutual information even in the aforementioned settings. We provide an extensive study of efficacy for classification and posterior probability estimation, and prove certain forest-based approaches to be consistent estimators of the true posteriors and derived information-theoretic quantities under certain assumptions. In a real-world connectome application, we quantify the uncertainty about neuron type given various cellular features in the Drosophila larva mushroom body, a key challenge for modern neuroscience.
Submitted 5 October, 2021; v1 submitted 30 June, 2019;
originally announced July 2019.
-
An Internet Approach for Engineering Student Exercises
Authors:
Richard Perry
Abstract:
An approach for engineering student exercises using the Internet is described. In this approach, for a given exercise, each student receives the same problem, but with different data. The exercise content can be static or dynamic, and the dynamic form can be timeless or real-time. The implementation provides immediate feedback to the students, letting them know if their submitted answers are correct. Student results for each exercise are recorded in log files which are available to the instructor. Example exercises from engineering computer security and cryptography courses are presented.
Submitted 9 August, 2012;
originally announced August 2012.
-
Batch Spreadsheet for C Programmers
Authors:
Richard Perry
Abstract:
A computing environment is proposed, based on batch spreadsheet processing, which produces a spreadsheet display from plain text input files of commands, similar to the way documents are created using LaTeX. In this environment, besides the usual spreadsheet rows and columns of cells, variables can be defined and are stored in a separate symbol table. Cell and symbol formulas may contain cycles, and cycles which converge can be used to implement iterative algorithms. Formulas are specified using the syntax of the C programming language, and all of C's numeric operators are supported, with operators such as ++, +=, etc. being implicitly cyclic. User-defined functions can be written in C and are accessed using a dynamic link library. The environment can be combined with a GUI front-end processor to enable easier interaction and graphics including plotting.
Submitted 9 August, 2012;
originally announced August 2012.