-
Bracing for sustainable agriculture: the development and function of brace roots in members of Poaceae
Authors:
Ashley N. Hostetler,
Rajdeep S. Khangura,
Brian P. Dilkes,
Erin E. Sparks
Abstract:
Optimization of crop production requires root systems to function in water uptake, nutrient use, and anchorage. In maize, two types of nodal roots-subterranean crown and aerial brace roots function in anchorage and water uptake and preferentially express multiple water and nutrient transporters. Brace root development shares genetic control with juvenile-to-adult phase change and flowering time. We present a comprehensive list of the genes known to alter brace roots and explore these as candidates for QTL studies in maize and sorghum. Brace root development and function may be conserved in other members of Poaceae, however research is limited. This work highlights the critical knowledge gap of aerial nodal root development and function and suggests new focus areas for breeding resilient crops.
Submitted 1 December, 2020; v1 submitted 8 August, 2020;
originally announced August 2020.
-
Field-based mechanical phenotyping of cereal crops to assess lodging resistance
Authors:
Lindsay Erndwein,
Douglas D. Cook,
Daniel J. Robertson,
Erin E. Sparks
Abstract:
Plant mechanical failure, also known as lodging, is the cause of significant and unpredictable yield losses in cereal crops. Lodging occurs in two distinct failure modes - stalk lodging and root lodging. Despite the prevalence and detrimental impact of lodging on crop yields, there is little consensus on how to phenotype plants in the field for lodging resistance and thus breed for mechanically resilient plants. This review provides an overview of field-based mechanical testing approaches to assess stalk and root lodging resistance. These approaches are placed in the context of future perspectives. Best practices and recommendations for acquiring field-based mechanical phenotypes of plants are also presented.
Submitted 20 February, 2020; v1 submitted 18 September, 2019;
originally announced September 2019.
-
MLSys: The New Frontier of Machine Learning Systems
Authors:
Alexander Ratner,
Dan Alistarh,
Gustavo Alonso,
David G. Andersen,
Peter Bailis,
Sarah Bird,
Nicholas Carlini,
Bryan Catanzaro,
Jennifer Chayes,
Eric Chung,
Bill Dally,
Jeff Dean,
Inderjit S. Dhillon,
Alexandros Dimakis,
Pradeep Dubey,
Charles Elkan,
Grigori Fursin,
Gregory R. Ganger,
Lise Getoor,
Phillip B. Gibbons,
Garth A. Gibson,
Joseph E. Gonzalez,
Justin Gottschlich,
Song Han,
Kim Hazelwood
, et al. (44 additional authors not shown)
Abstract:
Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, MLSys, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.
Submitted 1 December, 2019; v1 submitted 29 March, 2019;
originally announced April 2019.
-
Design and Construction of Unmanned Ground Vehicles for Sub-Canopy Plant Phenotyping
Authors:
Adam Stager,
Herbert G. Tanner,
Erin E. Sparks
Abstract:
Unmanned ground vehicles can capture a sub-canopy perspective for plant phenotyping, but their design and construction can be a challenge for scientists unfamiliar with robotics. Here we describe the necessary components and provide guidelines for designing and constructing an autonomous ground robot that can be used for plant phenotyping.
Submitted 25 March, 2019;
originally announced March 2019.
-
Exploiting Reuse in Pipeline-Aware Hyperparameter Tuning
Authors:
Liam Li,
Evan Sparks,
Kevin Jamieson,
Ameet Talwalkar
Abstract:
Hyperparameter tuning of multi-stage pipelines introduces a significant computational burden. Motivated by the observation that work can be reused across pipelines if the intermediate computations are the same, we propose a pipeline-aware approach to hyperparameter tuning. Our approach optimizes both the design and execution of pipelines to maximize reuse. We design pipelines amenable for reuse by (i) introducing a novel hybrid hyperparameter tuning method called gridded random search, and (ii) reducing the average training time in pipelines by adapting early-stopping hyperparameter tuning approaches. We then realize the potential for reuse during execution by introducing a novel caching problem for ML workloads which we pose as a mixed integer linear program (ILP), and subsequently evaluating various caching heuristics relative to the optimal solution of the ILP. We conduct experiments on simulated and real-world machine learning pipelines to show that a pipeline-aware approach to hyperparameter tuning can offer over an order-of-magnitude speedup over independently evaluating pipeline configurations.
Submitted 12 March, 2019;
originally announced March 2019.
-
KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics
Authors:
Evan R. Sparks,
Shivaram Venkataraman,
Tomer Kaftan,
Michael J. Franklin,
Benjamin Recht
Abstract:
Modern advanced analytics applications make use of machine learning techniques and contain multiple steps of domain-specific and general-purpose processing with high resource requirements. We present KeystoneML, a system that captures and optimizes the end-to-end large-scale machine learning applications for high-throughput training in a distributed environment with a high-level API. This approach offers increased ease of use and higher performance over existing systems for large scale learning. We demonstrate the effectiveness of KeystoneML in achieving high quality statistical accuracy and scalable training using real world datasets in several domains. By optimizing execution KeystoneML achieves up to 15x training throughput over unoptimized execution on a real image classification application.
Submitted 29 October, 2016;
originally announced October 2016.
-
Scalable Linear Causal Inference for Irregularly Sampled Time Series with Long Range Dependencies
Authors:
Francois W. Belletti,
Evan R. Sparks,
Michael J. Franklin,
Alexandre M. Bayen,
Joseph E. Gonzalez
Abstract:
Linear causal analysis is central to a wide range of important applications spanning finance, the physical sciences, and engineering. Much of the existing literature in linear causal analysis operates in the time domain. Unfortunately, the direct application of time domain linear causal analysis to many real-world time series presents three critical challenges: irregular temporal sampling, long range dependencies, and scale. Moreover, real-world data is often collected at irregular time intervals across vast arrays of decentralized sensors and with long range dependencies which make naive time domain correlation estimators spurious. In this paper we present a frequency domain based estimation framework which naturally handles irregularly sampled data and long range dependencies while enabling memory and communication efficient distributed processing of time series data. By operating in the frequency domain we eliminate the need to interpolate and help mitigate the effects of long range dependencies. We implement and evaluate our new work-flow in the distributed setting using Apache Spark and demonstrate on both Monte Carlo simulations and high-frequency financial trading that we can accurately recover causal structure at scale.
Submitted 10 March, 2016;
originally announced March 2016.
-
Embarrassingly Parallel Time Series Analysis for Large Scale Weak Memory Systems
Authors:
Francois Belletti,
Evan Sparks,
Michael Franklin,
Alexandre M. Bayen
Abstract:
Second order stationary models in time series analysis are based on the analysis of essential statistics whose computations follow a common pattern. In particular, with a map-reduce nomenclature, most of these operations can be modeled as mapping a kernel that only depends on short windows of consecutive data and reducing the results produced by each computation. This computational pattern stems from the ergodicity of the model under consideration and is often referred to as weak or short memory when it comes to data indexed with respect to time. In the following we will show how studying weak memory systems can be done in a scalable manner thanks to a framework relying on specifically designed overlapping distributed data structures that enable fragmentation and replication of the data across many machines as well as parallelism in computations. This scheme has been implemented for Apache Spark but is certainly not system specific. Indeed we prove it is also adapted to leveraging high bandwidth fragmented memory blocks on GPUs.
Submitted 20 November, 2015;
originally announced November 2015.
-
Matrix Computations and Optimization in Apache Spark
Authors:
Reza Bosagh Zadeh,
Xiangrui Meng,
Aaron Staple,
Burak Yavuz,
Li Pu,
Shivaram Venkataraman,
Evan Sparks,
Alexander Ulanov,
Matei Zaharia
Abstract:
We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark provides abstractions and implementations for distributed matrices and optimization routines using these matrices. When translating single-node algorithms to run on a distributed cluster, we observe that often a simple idea is enough: separating matrix operations from vector operations and shipping the matrix operations to be ran on the cluster, while keeping vector operations local to the driver. In the case of the Singular Value Decomposition, by taking this idea to an extreme, we are able to exploit the computational power of a cluster, while running code written decades ago for a single core. Another example is our Spark port of the popular TFOCS optimization package, originally built for MATLAB, which allows for solving Linear programs as well as a variety of other convex programs. We conclude with a comprehensive set of benchmarks for hardware accelerated matrix computations from the JVM, which is interesting in its own right, as many cluster programming frameworks use the JVM. The contributions described in this paper are already merged into Apache Spark and available on Spark installations by default, and commercially supported by a slew of companies which provide further services.
Submitted 12 July, 2016; v1 submitted 8 September, 2015;
originally announced September 2015.
-
Scientific Computing Meets Big Data Technology: An Astronomy Use Case
Authors:
Zhao Zhang,
Kyle Barbary,
Frank Austin Nothaft,
Evan Sparks,
Oliver Zahn,
Michael J. Franklin,
David A. Patterson,
Saul Perlmutter
Abstract:
Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark -- a modern big data platform -- to parallelize many-task applications. We present Kira, a flexible and distributed astronomy image processing toolkit using Apache Spark. We then use the Kira toolkit to implement a Source Extractor application for astronomy images, called Kira SE. With Kira SE as the use case, we study the programming flexibility, dataflow richness, scheduling capacity and performance of Apache Spark running on the EC2 cloud. By exploiting data locality, Kira SE achieves a 2.5x speedup over an equivalent C program when analyzing a 1TB dataset using 512 cores on the Amazon EC2 cloud. Furthermore, we show that by leveraging software originally designed for big data infrastructure, Kira SE achieves competitive performance to the C implementation running on the NERSC Edison supercomputer. Our experience with Kira indicates that emerging Big Data platforms such as Apache Spark are a performant alternative for many-task scientific applications.
Submitted 14 March, 2016; v1 submitted 13 July, 2015;
originally announced July 2015.
-
MLlib: Machine Learning in Apache Spark
Authors:
Xiangrui Meng,
Joseph Bradley,
Burak Yavuz,
Evan Sparks,
Shivaram Venkataraman,
Davies Liu,
Jeremy Freeman,
DB Tsai,
Manish Amde,
Sean Owen,
Doris Xin,
Reynold Xin,
Michael J. Franklin,
Reza Zadeh,
Matei Zaharia,
Ameet Talwalkar
Abstract:
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced a rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.
Submitted 26 May, 2015;
originally announced May 2015.
-
TuPAQ: An Efficient Planner for Large-scale Predictive Analytic Queries
Authors:
Evan R. Sparks,
Ameet Talwalkar,
Michael J. Franklin,
Michael I. Jordan,
Tim Kraska
Abstract:
The proliferation of massive datasets combined with the development of sophisticated analytical techniques have enabled a wide variety of novel applications such as improved product recommendations, automatic image tagging, and improved speech-driven interfaces. These and many other applications can be supported by Predictive Analytic Queries (PAQs). A major obstacle to supporting PAQs is the challenging and expensive process of identifying and training an appropriate predictive model. Recent efforts aiming to automate this process have focused on single node implementations and have assumed that model training itself is a black box, thus limiting the effectiveness of such approaches on large-scale problems. In this work, we build upon these recent efforts and propose an integrated PAQ planning architecture that combines advanced model search techniques, bandit resource allocation via runtime algorithm introspection, and physical optimization via batching. The result is TuPAQ, a component of the MLbase system, which solves the PAQ planning problem with comparable quality to exhaustive strategies but an order of magnitude more efficiently than the standard baseline approach, and can scale to models trained on terabytes of data across hundreds of machines.
Submitted 8 March, 2015; v1 submitted 30 January, 2015;
originally announced February 2015.
-
MLI: An API for Distributed Machine Learning
Authors:
Evan R. Sparks,
Ameet Talwalkar,
Virginia Smith,
Jey Kottalam,
Xinghao Pan,
Joseph Gonzalez,
Michael J. Franklin,
Michael I. Jordan,
Tim Kraska
Abstract:
MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.
Submitted 25 October, 2013; v1 submitted 21 October, 2013;
originally announced October 2013.
-
Quantum Interference in Single Molecule Electronic Systems
Authors:
R. E. Sparks,
V. M. García-Suárez,
D. Zs. Manrique,
C. J. Lambert
Abstract:
We present a general analytical formula and an ab initio study of quantum interference in multi-branch molecules. Ab initio calculations are used to investigate quantum interference in a benzene-1,2-dithiolate (BDT) molecule sandwiched between gold electrodes and through oligoynes of various lengths. We show that when a point charge is located in the plane of a BDT molecule and its position varied, the electrical conductance exhibits a clear interference effect, whereas when the charge approaches a BDT molecule along a line normal to the plane of the molecule and passing through the centre of the phenyl ring, interference effects are negligible. In the case of oligoynes, quantum interference leads to the appearance of a critical energy $E_c$, at which the electron transmission coefficient $T(E)$ of chains with even or odd numbers of atoms is independent of length. To illustrate the underlying physics, we derive a general analytical formula for electron transport through multi-branch structures and demonstrate the versatility of the formula by comparing it with the above ab-initio simulations. We also employ the analytical formula to investigate the current inside the molecule and demonstrate that large counter currents can occur within a ring-like molecule such as BDT, when the point charge is located in the plane of the molecule. The formula can be used to describe quantum interference and Fano resonances in structures with branches containing arbitrary elastic scattering regions connected to nodal sites.
Submitted 7 March, 2011;
originally announced March 2011.