diff --git a/README.md b/README.md
index f49b51c..e179ae2 100644
--- a/README.md
+++ b/README.md
@@ -79,6 +79,7 @@ by active learning (by developers of Spacy), text and image
 * **Object store**: Store binary data (images, sound files, compressed texts)
 * [Amazon S3](https://aws.amazon.com/s3/)
 * [Ceph](https://ceph.io/) Object Store
+ * [Google Cloud Storage](https://cloud.google.com/storage/)
 * **Database**: Store metadata (file paths, labels, user activity, etc).
 * [Postgres](https://www.postgresql.org/) is the right choice for most of applications, with the best-in-class SQL and great support for unstructured JSON.
 * **Data Lake**: to aggregate features which are not obtainable from database (e.g. logs)
@@ -96,6 +97,7 @@ by active learning (by developers of Spacy), text and image
 * [DVC](https://dvc.org/): Open source version control system for ML projects
 * [Pachyderm](https://www.pachyderm.com/): version control for data
 * [Dolt](https://www.liquidata.co/): versioning for SQL database
+ * [FloydHub Datasets](https://www.floydhub.com/floydhub/datasets)
 ### 1.5. Data Processing
 * Training data for production models may come from different sources, including *Stored data in db and object stores*, *log processing*, and *outputs of other classifiers*.
@@ -108,6 +110,7 @@ by active learning (by developers of Spacy), text and image
 * Robust conditional execution: retry in case of failure
 * Pusher supports docker images with tensorflow serving
 * Whole workflow in a single .py file
+ * [Dataflow](https://cloud.google.com/dataflow/) by Google Cloud Platform

@@ -135,7 +138,10 @@ by active learning (by developers of Spacy), text and image
 * Training/Evaluation: Use cloud instances with proper provisioning and handling of failures
 * Cloud Providers:
 * GCP: option to connect GPUs to any instance + has TPUs
+ * [Compute Engine](https://cloud.google.com/compute/) - allows for configuring your VM with GPUs
+ * [AI Platform Notebooks](https://cloud.google.com/ai-platform-notebooks/) - provides you with Jupyter Lab instances preconfigured with all the necessary libraries and CUDA drivers (has the option for customization as well)
 * AWS:
+ * [EC2](https://aws.amazon.com/ec2/) - Similar to Compute Engine
 ### 2.2. Resource Management
 * Allocating free resources to programs
 * Resource management options:
@@ -190,10 +196,12 @@ by active learning (by developers of Spacy), text and image
 * Data parallelism: Use it when iteration time is too long (both tensorflow and PyTorch support)
 * [Ray Distributed Training](https://ray.readthedocs.io/en/latest/distributed_training.html)
 * Model parallelism: when model does not fit on a single GPU
+ * [ML Engine](https://cloud.google.com/ml-engine)
 * Other solutions:
 * Horovod
 ## 3. Troubleshooting [TBD]
+ * [This Twitter thread](https://twitter.com/chipro/status/1189564204312711170?s=20) is a little list of all the good resources for this section
 ## 4. Testing and Deployment
 ### 4.1. Testing and CI/CD
@@ -234,6 +242,7 @@ Machine Learning production software requires a more diverse set of test suites
 * Marathon
 * 3. Deploy code as a "serverless function"
 * 4. Deploy via a **model serving** solution
+ * 5. [BentoML](https://github.com/bentoml/BentoML) - it can ease the process of exposing your ML as a REST API
 * Model serving:
 * Specialized web deployment for ML models
 * Batches request for GPU inference
@@ -263,6 +272,7 @@ Machine Learning production software requires a more diverse set of test suites
 * Alerts for downtime, errors, and distribution shifts
 * Catching service and data regressions
 * Cloud providers solutions are decent
+ * [Stackdriver](https://cloud.google.com/stackdriver/)
 * [Kiali](https://kiali.io/):an observability console for Istio with service mesh configuration capabilities. It answers these questions: How are the microservices connected? How are they performing?
 #### Are we done?
@@ -327,6 +337,3 @@ Machine Learning production software requires a more diverse set of test suites
 [2]: [Advanced KubeFlow Workshop](https://www.meetup.com/Advanced-KubeFlow/) by [Pipeline.ai](https://pipeline.ai/), 2019.
 [3]: [TFX: Real World Machine Learning in Production](https://cdn.oreillystatic.com/en/assets/1/event/298/TFX_%20Production%20ML%20pipelines%20with%20TensorFlow%20Presentation.pdf)
-
-
-