Building a machine learning (ML) model for data analytics is seldom a “slam dunk”; it is usually an iterative and exploratory process with a data scientist in the loop. Typically, the data scientist alters one of three configurations per iteration: what features to use (feature engineering, FE), what ML model to use (algorithm selection, AS), and what hyper-parameter values to use (hyper-parameter tuning, HT). Model selection is the process of picking a value of this triple (FE, AS, HT) that yields satisfactory accuracy. Alas, most existing ML tools force data scientists to explore just one triple per iteration. Apart from being painful, such an approach does not fully exploit the massive parallelism inherent in this search. At the other extreme, given a feature set, tools such as AutoWeka and MLbase automate the whole process. While this works for some applications, in many others, ignoring the data scientist’s domain expertise can lead to unappealing model results due to hardwired decisions, as well as wasted compute resources. Ideally, we need to capture the whole spectrum of automation so that data scientists can control how much and what kind of automation they want. Thus, in this proposed project, we plan to build new abstractions and systems on top of existing ML tools to achieve this goal; we call such systems model selection management systems (MSMSs). Drawing inspiration from the database literature, we propose three key components for an MSMS. (1) A “declarative interface” that makes it easier for data scientists to specify multiple logically related triples (e.g., multiple feature sets and automated HT for a fixed AS value) without enumerating them. (2) An “optimizer” to speed up the search by exploiting massive parallelism, data and compute redundancy, and materialization of partial results. (3) A “provenance manager” to keep track of this process and enable data scientists to debug the results and steer the next iteration better.
We plan to build two complementary MSMSs: Triptych, which will focus on simple linear classifiers, probabilistic classifiers, and trees in the Spark ecosystem, and Cerebro, which will focus on neural networks (for which AS becomes network structure selection) on top of TensorFlow+Keras.
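
As a rough illustration of the declarative idea described above, the following Python sketch shows how a compact specification might expand into explicit (FE, AS, HT) triples. The function name `expand_triples`, the layout of `spec`, and the feature and hyper-parameter names are all hypothetical; none of them come from Triptych or Cerebro. An actual MSMS would presumably hand such a specification to its optimizer instead of enumerating it naively.

```python
from itertools import product


def expand_triples(feature_sets, algorithms, hyperparameter_grid):
    """Expand a compact spec into explicit (FE, AS, HT) triples.

    Hypothetical helper for illustration only; not an API of any MSMS.
    """
    # Fix an order for the hyper-parameter names so the output is deterministic.
    names = sorted(hyperparameter_grid)
    # Every combination of hyper-parameter values becomes one HT setting.
    ht_settings = [
        dict(zip(names, values))
        for values in product(*(hyperparameter_grid[n] for n in names))
    ]
    # Cross feature sets, algorithms, and HT settings into triples.
    return [
        (tuple(fe), algo, ht)
        for fe, algo, ht in product(feature_sets, algorithms, ht_settings)
    ]


# Two feature sets and a small HT grid for one fixed AS value (all names assumed).
spec = {
    "feature_sets": [["age", "income"], ["age", "income", "zipcode"]],
    "algorithms": ["logistic_regression"],
    "hyperparameter_grid": {
        "l2_penalty": [0.01, 0.1, 1.0],
        "learning_rate": [0.1, 0.5],
    },
}

triples = expand_triples(**spec)
print(len(triples))  # 2 feature sets * 1 algorithm * 6 HT settings = 12
```

Even this toy specification stands for 12 triples, which hints at why exploring one triple per iteration is slow and why the expanded triples are natural units for parallel execution.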
- Factorized Linear Algebra for Distributed Data Analytics / Arun Kumar
- Model-based Pricing of Relational Data in the Cloud / Arun Kumar