This repository accompanies the short paper "PERSEUS: Characterizing Performance and Cost of Multi-Tenant Serving for CNN Models", in the proceedings of IC2E 2020.
In this paper, we studied efficient and cost-effective deep learning inference in the cloud. Concretely, we tackled the problem with multi-tenant model serving: instead of dedicating each GPU server to a single model, we serve multiple models on each GPU server, subject to the GPU memory capacity. In doing so, we improved the utilization of hardware resources, especially the GPU. To support this study, we built a measurement framework, PERSEUS, to characterize the performance and cost trade-offs of multi-tenant model serving.
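As a rough illustration of the idea (not the repo's actual code), the following Python sketch packs several torchvision CNN models onto one GPU until memory headroom runs out, then serves a batch against each co-located model. It assumes PyTorch with a CUDA device and torchvision >= 0.13; the model mix and the 90% memory threshold are arbitrary choices for the example.

```python
# Illustrative sketch only: pack multiple CNN models onto a single GPU,
# subject to its memory capacity, then serve requests for each model.
import torch
import torchvision.models as models

device = torch.device("cuda")
total_mem = torch.cuda.get_device_properties(device).total_memory
candidates = [models.resnet50, models.vgg16, models.densenet121]

loaded = []
for build in candidates:
    model = build(weights=None).eval().to(device)
    # Stop packing once allocated memory nears capacity (90% is arbitrary).
    if torch.cuda.memory_allocated(device) > 0.9 * total_mem:
        del model
        torch.cuda.empty_cache()
        break
    loaded.append(model)

# A request for any co-located model runs on the same, shared GPU.
batch = torch.randn(8, 3, 224, 224, device=device)
with torch.no_grad():
    outputs = [m(batch) for m in loaded]
```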
- We evaluated multi-tenant model serving using PERSEUS on several metrics, including inference throughput, monetary cost, and GPU utilization. We showed that multi-tenant serving can yield up to 12% cost reduction while maintaining the SLA requirements of model serving (a minimal sketch of this kind of measurement follows this list).
- We identified several potential improvements from the deep learning framework's perspective to better support model serving, especially on CPUs.
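For context, a throughput and cost estimate in the spirit of these metrics can be computed as below. This is a minimal sketch, not the PERSEUS framework itself; the ResNet-50 model, the batch size, and the $2.48/hour instance price are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the kind of measurement involved: time a fixed number
# of batched inferences, then derive throughput and a per-request cost.
import time
import torch
import torchvision.models as models

device = torch.device("cuda")
model = models.resnet50(weights=None).eval().to(device)  # illustrative model
batch = torch.randn(8, 3, 224, 224, device=device)

def measure_throughput(model, batch, n_iters=100):
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(n_iters):
            model(batch)
    torch.cuda.synchronize()  # wait for queued GPU work before stopping timer
    elapsed = time.perf_counter() - start
    return n_iters * batch.shape[0] / elapsed  # images per second

throughput = measure_throughput(model, batch)
PRICE_PER_HOUR = 2.48  # assumed hourly GPU instance price, not from the paper
cost_per_million = PRICE_PER_HOUR / 3600.0 / throughput * 1e6
print(f"{throughput:.1f} img/s, ${cost_per_million:.2f} per million inferences")
```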
Fig 1. Throughput comparison of dedicated serving vs. multi-tenant serving.
Fig 2. Monetary savings with multi-tenant serving.
Please see the instructions in the individual modules in the code folder.
If you would like to cite the paper, please cite it as:
@article{lemay2019perseus,
  title={Perseus: Characterizing Performance and Cost of Multi-Tenant Serving for CNN Models},
  author={Matthew LeMay and Shijian Li and Tian Guo},
  year={2019},
  eprint={1912.02322},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}
This work was supported in part by National Science Foundation grants #1755659 and #1815619, and by Google Cloud Platform research credits.
More project information can be found on our lab's project site.
- Matthew LeMay [email protected]
- Shijian Li [email protected]
- Tian Guo [email protected]