Factorized Linear Algebra for Distributed Data Analytics / Arun Kumar

May 3, 2017May 12, 2017 cnsmember

While almost all machine learning (ML) toolkits expect the training data to be in a single table, many real datasets are multi-table, connected by key-foreign key and other database dependencies. In such cases, data scientists join all tables to obtain a single table. But this intermediate table could blow up in size due to the redundancy introduced by joins, which wastes storage space and runtime. Our paradigm of “factorized ML” solves this issue by pushing ML computations down through joins for a class of ML algorithms. But “factorizing” new ML algorithms requires tedious manual rewrites, which is a daunting development overhead. To solve this issue, our new framework named Morpheus “automates” factorized ML by exploiting linear algebra (LA) as a formal representation language for ML. Given an ML algorithm in LA using operators such as matrix-vector multiplication, Morpheus factorizes it by using a set of algebraic rewrite optimization rules we created for factorized LA. We have demonstrated the initial feasibility and efficiency of Morpheus for single-node computations. But many ML applications use distributed data analytics frameworks such as SystemML+Spark and TensorFlow, both of which support LA. Thus, in this proposed project, we plan to extend Morpheus and integrate it deeply with these systems. We face two related research challenges: (1) How to integrate and exploit Morpheus’s rewrite rules without disrupting the existing optimizations in these systems? We plan to study both the computational costs and the communication costs of our rewrite rules, as well as how they interplay with data partitioning, shuffling, and scheduling. (2) How to compare the runtime performance of such systems systematically in order to evaluate the gains? We plan to create a new “benchmark” for such LA-based ML systems by systematically evaluating the effects of various properties of the data (dimensions and sparsity), system (memory, number of cores, and number of nodes), and computations (LA operators, LA expressions, and full ML programs).

Email for more info