- Terraform for deployment of cluster nodes on AWS (GCP, Azure, and on-site pending)
- HashiCorp Nomad scheduler ensures fast and secure execution of TensorFlow code.
- HashiCorp Consul maintains cluster state and a key/value store for the cluster.
- Nomad is lighter, faster, and easier to operate than Kubernetes for distributed model training.
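As a loose illustration of the Consul KV role, a minimal sketch using the third-party `python-consul` client; the key names and addresses below are hypothetical, not anything defined by this repo:

```python
import json

import consul  # assumption: the third-party python-consul client package

# Connect to the local Consul agent (default 127.0.0.1:8500).
c = consul.Consul()

# Hypothetical example: publish this node's training address into the KV store...
c.kv.put("tf-cluster/worker/0", json.dumps({"host": "worker-node-0", "port": 2222}))

# ...and read it back from anywhere else in the cluster.
index, entry = c.kv.get("tf-cluster/worker/0")
print(json.loads(entry["Value"]))
```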
- TensorFlow Chief Node ("Worker 0")
- Worker Nodes
- Parameter Server Node
- Evaluator Node
- Consul as KV store
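The node roles above map onto TensorFlow's `TF_CONFIG` cluster specification. A minimal sketch of that layout is below; the host names, ports, and node counts are placeholders, and in this project the real addresses would come from Consul/Nomad rather than being hard-coded:

```python
import json
import os

# Hypothetical TF_CONFIG matching the node roles above; addresses are placeholders.
tf_config = {
    "cluster": {
        "chief": ["chief-node:2222"],       # TensorFlow Chief ("Worker 0")
        "worker": ["worker-node-0:2222", "worker-node-1:2222"],
        "ps": ["ps-node:2222"],             # Parameter Server
        "evaluator": ["evaluator-node:2222"],
    },
    # Each node fills in its own role and index before launching training.
    "task": {"type": "worker", "index": 0},
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)
```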
- Set your cloud provider authentication environment variables
- Set variables in `variables.tf`, and execute `terraform init` & `terraform apply`
- Cluster will immediately start training the model, as per the Git-committed TensorFlow code.
- Trained model is saved upon completion to a parameterized location (default is S3).
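As a rough sketch of the save step, assuming the Keras API and that an S3 filesystem plugin (e.g. `tensorflow-io`) is available to TensorFlow; the bucket path is a placeholder for the parameterized location:

```python
import tensorflow as tf
import tensorflow_io  # noqa: F401  # assumption: registers the s3:// filesystem for TF >= 2.6

# Toy model purely for illustration.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

# Placeholder output path; the real location is supplied via Terraform variables.
export_path = "s3://example-training-bucket/models/run-001"
model.save(export_path)
```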
- Nomad cluster node instance type is set in `variables.tf`, and determines the hardware available for training.
- `train` directory contains the Python code each cluster node type will execute (see the dispatch sketch after this list).
- Nomad workers run the TensorFlow code either directly on the machine via Exec, or in Docker containers.
- Both Nomad Exec and Nomad Docker workers are supported.
- Nomad Exec eliminates the Docker/container layer, and thus may offer better performance.
- Nomad Docker is available for existing containerized developer/pipeline workflows.
- TensorFlow cluster nodes (Python via Exec or Docker) run 1:1 on Nomad cluster nodes (cloud VM instances).
- Each TensorFlow cluster node therefore has uncontended access to its Nomad cluster node's hardware.
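Since the same training code runs on every node type, a per-role dispatch along these lines is one way it could be organized; this is a sketch only, and the function bodies are placeholders rather than the actual contents of the `train` directory:

```python
import json
import os


def train(task_index: int) -> None:
    # Placeholder for the real training loop run by the chief and workers.
    print(f"training as worker {task_index}")


def serve_parameters() -> None:
    # Placeholder for the parameter-server loop.
    print("serving parameters")


def evaluate_continuously() -> None:
    # Placeholder for the side-car evaluator loop.
    print("evaluating checkpoints")


def run_for_this_node() -> None:
    """Dispatch on the task type this node was given in TF_CONFIG."""
    task = json.loads(os.environ["TF_CONFIG"])["task"]
    if task["type"] in ("chief", "worker"):
        train(task["index"])
    elif task["type"] == "ps":
        serve_parameters()
    elif task["type"] == "evaluator":
        evaluate_continuously()
    else:
        raise ValueError(f"unknown task type: {task['type']}")


if __name__ == "__main__":
    run_for_this_node()
```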
- https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy
- https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras
- https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy
- https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy
- https://towardsdatascience.com/distributed-training-in-tf-keras-with-w-b-ccf021f9322e
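For reference, a minimal multi-worker Keras example in the style of the tutorials linked above; the model, data, and hyperparameters are synthetic placeholders, and each node would run this with its own `TF_CONFIG` already set:

```python
import tensorflow as tf

# Reads the cluster layout and this node's role from TF_CONFIG.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Model and optimizer must be created inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Synthetic data purely for illustration.
x = tf.random.normal((1024, 10))
y = tf.random.normal((1024, 1))

model.fit(x, y, epochs=2, batch_size=64)
```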
- GitHub Actions pipelines to initiate model training upon TensorFlow code commits