Nomad Cluster for TensorFlow Distributed Model Training IN-A-BOX

A simple framework for deploying clusters that train TensorFlow deep learning models.

  • Terraform deploys cluster nodes on AWS (GCP, Azure, and on-site support pending).

  • HashiCorp Nomad scheduler ensures fast and secure execution of TensorFlow code.

  • HashiCorp Consul maintains cluster state and provides the cluster's key/value store.

  • Nomad is lighter, faster, and easier to operate than Kubernetes for distributed model training.

Architecture

  • TensorFlow Chief Node ("Worker 0")
  • Worker Nodes
  • Parameter Server Node
  • Evaluator Node
  • Consul as KV store
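The roles above map onto the task types in a TF_CONFIG cluster spec, which TensorFlow distributed training reads from the environment on each node. A minimal sketch of that spec (the hostnames and ports are illustrative, not the addresses Terraform actually provisions):

```python
import json
import os

# Illustrative cluster spec; in this framework the real addresses would
# come from the Terraform-provisioned instances registered in Consul.
cluster = {
    "chief": ["chief-0:2222"],            # TensorFlow Chief Node ("Worker 0")
    "worker": ["worker-0:2222", "worker-1:2222"],
    "ps": ["ps-0:2222"],                  # Parameter Server Node
    "evaluator": ["evaluator-0:2222"],    # Evaluator Node
}

# Each node exports the shared cluster spec plus its own task assignment.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": "worker", "index": 0},
})

print(os.environ["TF_CONFIG"])
```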

How to Use

  • Set your cloud provider authentication environment variables

  • Set variables in variables.tf, then run terraform init and terraform apply

  • The cluster immediately starts training the model defined by the Git-committed TensorFlow code.

  • The trained model is saved on completion to a parameterized location (default: S3).

  • The Nomad cluster node instance type, set in variables.tf, determines the hardware available for training.

  • The train directory contains the Python code each cluster node type will execute.
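The per-role scripts in the train directory presumably branch on the node's TF_CONFIG task assignment. A minimal, TensorFlow-free sketch of that dispatch (the function name and example values are assumptions, not the actual code in train):

```python
import json
import os

def node_role(tf_config_json: str) -> str:
    """Return this node's role ("chief", "worker", "ps", or "evaluator")
    from the TF_CONFIG JSON that TensorFlow distributed training uses."""
    return json.loads(tf_config_json)["task"]["type"]

# Example: a worker node's TF_CONFIG (addresses illustrative).
example = json.dumps({
    "cluster": {"chief": ["chief-0:2222"], "worker": ["worker-0:2222"]},
    "task": {"type": "worker", "index": 0},
})
print(node_role(example))  # worker
```

In practice each node would read the variable from its environment, e.g. node_role(os.environ["TF_CONFIG"]), and start the matching training loop.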

TensorFlow Cluster Nodes / Docker Containers / Nomad Exec

  • Nomad workers run the TensorFlow code either directly on the machine via Exec or in Docker containers.
    • Both Nomad Exec and Nomad Docker workers are supported.
    • Nomad Exec eliminates the Docker/container layer, which may improve performance.
    • Nomad Docker suits existing containerized developer/pipeline workflows.
    • TensorFlow cluster nodes (Python via Exec or Docker) run 1:1 on Nomad cluster nodes (cloud VM instances).
      • Each TensorFlow cluster node therefore has noncompeting access to its Nomad cluster node's hardware.

TODO

  • GitHub Actions pipelines to initiate model training upon TensorFlow code commits