Abstract: Large-scale data analytics using statistical machine learning (ML), popularly called advanced analytics or data science, is increasingly critical for many data-driven applications in the enterprise, Web, science, and other domains. The data management, ML, and systems communities are working on scalable and fast implementations of ML algorithms. However, several orthogonal bottlenecks in the end-to-end process of building and deploying ML models for data analytics have largely been ignored, leading to wasted resources and poor productivity of data scientists. Thus, a pressing challenge in democratizing advanced analytics is devising new abstractions and systems that integrate data processing with the building and deployment of ML models at scale. In this context, I will talk about three new projects in my research group:
(1) Morpheus: A new framework that aims to make it easier to integrate key feature engineering operations over structured data with ML tasks by unifying linear algebra with relational algebra, thus opening up new optimization opportunities.
(2) MSMS: Model selection management systems (MSMS) are a new class of systems that elevate the process of model selection in ML training to a declarative level. This enables system-level optimizations at scale and lets the system manage the provenance of the process. I will describe two instantiations of the MSMS idea: Triptych for “classical” ML models and Cerebro for deep learning.
(3) Genisys: A new system that realizes our vision of “database perception”: integrating dark data types (specifically images, audio, and video) with structured data management, so that querying, reporting, and analytics applications can include such data, while managing the materialization and access trade-offs of deep learning-based feature extraction.
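To make the idea in item (1) concrete, here is a minimal sketch of one kind of optimization that becomes possible when linear algebra is aware of relational structure. It is an illustration only, not code from Morpheus: the function names, the two-table schema, and the toy data are all assumptions. A fact table S (features xs, foreign key fk) is joined with a dimension table R (features xr), and the matrix-vector product over the joined feature matrix is computed without materializing the join.

```python
# Illustrative sketch only; all names and data here are hypothetical.

def matvec(rows, w):
    """Dense matrix-vector product over a list of row lists."""
    return [sum(x * wi for x, wi in zip(row, w)) for row in rows]


def materialized_matvec(xs, fk, xr, w):
    """Baseline: build the joined feature matrix, then multiply."""
    joined = [s_row + xr[k] for s_row, k in zip(xs, fk)]
    return matvec(joined, w)


def factorized_matvec(xs, fk, xr, w):
    """Push the product through the join: multiply each base table
    by its slice of w, then combine partial results via the key."""
    d_s = len(xs[0])
    partial_s = matvec(xs, w[:d_s])   # one value per row of S
    partial_r = matvec(xr, w[d_s:])   # one value per row of R
    return [ps + partial_r[k] for ps, k in zip(partial_s, fk)]


# Toy data: 4 fact rows referencing 2 dimension rows.
xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
fk = [0, 1, 0, 1]
xr = [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]]
w = [1.0, 0.5, 2.0, 1.0, 0.0]

baseline = materialized_matvec(xs, fk, xr, w)
factorized = factorized_matvec(xs, fk, xr, w)
```

Both functions return the same vector, but the factorized version touches each dimension-table row once rather than once per matching fact row, which is where the savings would come from when the join introduces redundancy.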
We plan to prototype our tools on top of popular distributed ML and data analytics systems, including SystemML, TensorFlow, Spark, and Greenplum, while offering interfaces in popular data science environments such as R and Python.
The primary goal of this talk is to introduce these projects to the CNS audience and solicit critical feedback.
Bio: Arun Kumar is an Assistant Professor in the Department of Computer Science and Engineering at the University of California, San Diego. He is a member of the Database Lab and an affiliate member of the AI Group and CNS. He obtained his PhD from the University of Wisconsin-Madison in 2016. His primary research interests are in data management, especially the intersection of data management and machine learning, an area that is increasingly called advanced analytics or data science. Systems and ideas based on his research have been released as part of the MADlib open-source library, shipped as part of products from EMC, Oracle, Cloudera, and IBM, and used internally by Facebook, LogicBlox, and Microsoft. He is a recipient of the Best Paper Award at ACM SIGMOD 2014, the 2016 Graduate Student Research Award for the best dissertation research in UW-Madison CS, and a 2016 Google Faculty Research Award.