- Terraform for deployment of cluster nodes on AWS (GCP, Azure, and on-site pending)
- HashiCorp Nomad scheduler ensures fast and secure execution of TensorFlow code.
- HashiCorp Consul maintains cluster state and a key/value store for the cluster.
- Nomad is lighter, faster, and easier to operate than Kubernetes for distributed model training.
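As a loose illustration of the Consul KV role, a minimal sketch using the third-party `python-consul` client; the key names and addresses below are hypothetical, not anything defined by this repo:

```python
import json

import consul  # assumption: the third-party python-consul client package

# Connect to the local Consul agent (default 127.0.0.1:8500).
c = consul.Consul()

# Hypothetical example: publish this node's training address into the KV store...
c.kv.put("tf-cluster/worker/0", json.dumps({"host": "worker-node-0", "port": 2222}))

# ...and read it back from anywhere else in the cluster.
index, entry = c.kv.get("tf-cluster/worker/0")
print(json.loads(entry["Value"]))
```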
- TensorFlow Chief Node ("Worker 0")
- Worker Nodes
- Parameter Server Node
- Evaluator Node
- Consul as KV store
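The node roles above map onto TensorFlow's `TF_CONFIG` cluster specification. A minimal sketch of that layout is below; the host names, ports, and node counts are placeholders, and in this project the real addresses would come from Consul/Nomad rather than being hard-coded:

```python
import json
import os

# Hypothetical TF_CONFIG matching the node roles above; addresses are placeholders.
tf_config = {
    "cluster": {
        "chief": ["chief-node:2222"],       # TensorFlow Chief ("Worker 0")
        "worker": ["worker-node-0:2222", "worker-node-1:2222"],
        "ps": ["ps-node:2222"],             # Parameter Server
        "evaluator": ["evaluator-node:2222"],
    },
    # Each node fills in its own role and index before launching training.
    "task": {"type": "worker", "index": 0},
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)
```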
- Set your cloud provider authentication environment variables
- Set variables in `variables.tf`, and execute `terraform init` & `terraform apply`
- Cluster will immediately start training the model, as per the Git-committed TensorFlow code.
- Trained model is saved upon completion to a parameterized location (default is S3).
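As a rough sketch of the save step, assuming the Keras API and that an S3 filesystem plugin (e.g. `tensorflow-io`) is available to TensorFlow; the bucket path is a placeholder for the parameterized location:

```python
import tensorflow as tf
import tensorflow_io  # noqa: F401  # assumption: registers the s3:// filesystem for TF >= 2.6

# Toy model purely for illustration.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

# Placeholder output path; the real location is supplied via Terraform variables.
export_path = "s3://example-training-bucket/models/run-001"
model.save(export_path)
```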
- Nomad cluster node instance type is set in `variables.tf`, and determines the hardware available for training.
- `train` directory contains the Python code each cluster node type will execute (see the dispatch sketch after this list).
- Nomad workers run the TensorFlow code either directly on the machine via Exec, or in Docker containers.
- Both Nomad Exec and Nomad Docker workers are supported.
- Nomad Exec eliminates the Docker/container layer, and thus may offer better performance.
- Nomad Docker is available for existing containerized developer/pipeline workflows.
- TensorFlow cluster nodes (Python via Exec or Docker) run 1:1 on Nomad cluster nodes (cloud VM instances).
- Each TensorFlow cluster node therefore has uncontended access to its Nomad cluster node's hardware.
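Since the same training code runs on every node type, a per-role dispatch along these lines is one way it could be organized; this is a sketch only, and the function bodies are placeholders rather than the actual contents of the `train` directory:

```python
import json
import os


def train(task_index: int) -> None:
    # Placeholder for the real training loop run by the chief and workers.
    print(f"training as worker {task_index}")


def serve_parameters() -> None:
    # Placeholder for the parameter-server loop.
    print("serving parameters")


def evaluate_continuously() -> None:
    # Placeholder for the side-car evaluator loop.
    print("evaluating checkpoints")


def run_for_this_node() -> None:
    """Dispatch on the task type this node was given in TF_CONFIG."""
    task = json.loads(os.environ["TF_CONFIG"])["task"]
    if task["type"] in ("chief", "worker"):
        train(task["index"])
    elif task["type"] == "ps":
        serve_parameters()
    elif task["type"] == "evaluator":
        evaluate_continuously()
    else:
        raise ValueError(f"unknown task type: {task['type']}")


if __name__ == "__main__":
    run_for_this_node()
```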
- https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy
- https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras
- https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy
- https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy
- https://towardsdatascience.com/distributed-training-in-tf-keras-with-w-b-ccf021f9322e
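For reference, a minimal multi-worker Keras example in the style of the tutorials linked above; the model, data, and hyperparameters are synthetic placeholders, and each node would run this with its own `TF_CONFIG` already set:

```python
import tensorflow as tf

# Reads the cluster layout and this node's role from TF_CONFIG.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Model and optimizer must be created inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Synthetic data purely for illustration.
x = tf.random.normal((1024, 10))
y = tf.random.normal((1024, 1))

model.fit(x, y, epochs=2, batch_size=64)
```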
- GitHub Actions pipelines to initiate model training upon TensorFlow code commits