Model-based Pricing of Relational Data in the Cloud / Arun Kumar
Rich, structured datasets are increasingly being bought and sold in cloud-based data marketplaces such as Microsoft DataMarket for various tasks, including machine learning (ML). But current marketplaces force users to buy such data in whole or as fixed subsets, without any awareness of the ML tasks the data is used for. This leads to sub-optimal choices and missed opportunities for both data sellers and buyers. In this proposed project, we plan to build a formal and practical pricing framework, which we call model-based pricing, that aims to resolve such issues. While the raw data could be expensive, our framework enables sellers to specify much lower price points for different ML models trained on (subsets of) the data and to sell the models directly. This could dramatically expand the market reach of the sellers and also make such datasets accessible to more buyers. From a buyer’s perspective, our key observation is that ML users typically need only as much data as is necessary to meet their accuracy, price, and/or runtime goals, which leads to novel trade-offs between accuracy, price, and runtime. We plan to study such trade-offs systematically by focusing on three increasingly sophisticated settings. (1) In horizontal pricing, we will devise a practical framework that combines ML, microeconomics, and data management to let buyers optimize the accuracy-price trade-off via subsampling of data examples. (2) In vertical pricing, we plan to integrate feature selection into the accuracy-price trade-off and let buyers also drop features they may not be interested in. (3) In resource-integrated pricing, we plan to integrate cloud resource costs as well and let buyers optimize the accuracy-price-runtime trade-off by jointly considering the purchase of both data and compute resources in the cloud. We also plan to work with cloud providers such as Amazon, Google, or Microsoft to validate our framework on production workloads.
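As a rough illustration of the buyer's side of horizontal pricing, the sketch below picks the cheapest model that meets an accuracy goal within a budget. It is not the proposed framework itself: the `Offer` type, the `cheapest_offer` function, and all prices and accuracies are hypothetical, and we assume the seller has already published a price and an estimated accuracy for models trained on a few subsample fractions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Offer:
    """One hypothetical price point: a model trained on a fraction of the rows."""
    fraction: float   # share of data examples used for training, in (0, 1]
    price: float      # price the seller asks for this model
    accuracy: float   # estimated held-out accuracy of this model


def cheapest_offer(offers: Sequence[Offer], min_accuracy: float,
                   budget: float) -> Optional[Offer]:
    """Return the lowest-priced offer that meets the accuracy goal within budget."""
    feasible = [o for o in offers
                if o.accuracy >= min_accuracy and o.price <= budget]
    return min(feasible, key=lambda o: o.price) if feasible else None


# Made-up numbers for illustration only; they are not results from the project.
offers = [
    Offer(fraction=0.10, price=20.0, accuracy=0.81),
    Offer(fraction=0.25, price=45.0, accuracy=0.86),
    Offer(fraction=0.50, price=80.0, accuracy=0.89),
    Offer(fraction=1.00, price=150.0, accuracy=0.91),
]

choice = cheapest_offer(offers, min_accuracy=0.85, budget=100.0)
print(choice)
```

With these made-up numbers, a buyer who needs at least 0.85 accuracy and has a budget of 100 would be matched to the model trained on a quarter of the examples rather than the full dataset. Vertical and resource-integrated pricing would add further dimensions (features kept, compute cost and runtime) to each offer, which this sketch does not model.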