Ansible resnet #233

Open: wants to merge 27 commits into base: master

Conversation

ShawnXuan
Collaborator

Use Ansible to run distributed training across a cluster.

@ShawnXuan ShawnXuan marked this pull request as ready for review August 2, 2024 04:02

## 4. Running the Playbook

Run the playbook with the following command, which also decrypts the variables file:
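
The command itself is not included in this excerpt of the diff. As a rough sketch of what such an invocation usually looks like, assuming the variables file is encrypted with Ansible Vault and using the hypothetical file names `inventory.ini` and `playbook.yml` (neither is taken from the PR):

```shell
#!/usr/bin/env bash
set -eu

# Placeholder file names, NOT taken from the PR; override via the environment.
INVENTORY="${INVENTORY:-inventory.ini}"
PLAYBOOK="${PLAYBOOK:-playbook.yml}"

# Build the ansible-playbook command. --ask-vault-pass prompts for the
# vault password so that vault-encrypted variable files can be decrypted.
# The command is only printed unless DRY_RUN=0 is set.
run_playbook() {
  set -- ansible-playbook -i "$INVENTORY" "$PLAYBOOK" --ask-vault-pass
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

run_playbook
```

By default the script only prints the command; running it with `DRY_RUN=0` executes it on the control node.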

Doesn't this require the other servers to already have the main server's public key? Otherwise it fails with a connection error.

Collaborator Author

The section "Using Ansible to distribute SSH public keys to multiple target hosts" is what sets up the public keys.
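
The Ansible-based key distribution from that section is not shown in this thread. For readers who only need the prerequisite described above, a minimal sketch using plain `ssh-copy-id` (a substitute technique, not the PR's playbook) might look like this. The host names `of25` and `of27` come from the log later in this conversation; the remote user and key path are assumptions.

```shell
#!/usr/bin/env bash
set -eu

# Host names appear in the log in this thread; user and key path are placeholders.
HOSTS="${HOSTS:-of25 of27}"
REMOTE_USER="${REMOTE_USER:-root}"
PUBKEY="${PUBKEY:-${HOME:-/root}/.ssh/id_rsa.pub}"

# Copy the control node's public key to one target host.
# The command is only printed unless DRY_RUN=0 is set.
copy_key() {
  set -- ssh-copy-id -i "$PUBKEY" "$REMOTE_USER@$1"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

for host in $HOSTS; do
  copy_key "$host"
done
```

Once every target host has the key, the control node can reach the hosts without a password prompt, which is what the playbook run relies on.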

- Default usage:

```bash
./pull.sh
```

Consider adding a note that Docker permissions are required.

- Specify an image tag:

```bash
./pull.sh alpine:latest
```

When I specify a tag, the pull appears to time out:

TASK [Pull Docker image if not present] ************************************************************************************************************************
fatal: [of25]: FAILED! => {"changed": false, "msg": "Error pulling alpine - code: None message: error pulling image configuration: download failed after attempts=6: dial tcp 108.160.169.181:443: i/o timeout"}
fatal: [of27]: FAILED! => {"changed": false, "msg": "Error pulling alpine - code: None message: error pulling image configuration: download failed after attempts=6: dial tcp 31.13.112.9:443: i/o timeout"}

Collaborator Author

I hit the same timeout, which is why I developed the load + commit approach. We will build a custom image later, so this pull step may not be needed.

@xiezipeng-ML

docker image: /share_nfs/k85/oneflow.0.9.1.dev20240203-cuda11.8.tar
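
As a sketch of the load step only, assuming the archive above was produced with `docker save` and is readable from each node over the shared NFS mount (the commit step mentioned earlier is not described in this thread, so it is left out):

```shell
#!/usr/bin/env bash
set -eu

# Archive path quoted in this thread; override IMAGE_TAR to use another file.
IMAGE_TAR="${IMAGE_TAR:-/share_nfs/k85/oneflow.0.9.1.dev20240203-cuda11.8.tar}"

# Load the image from a local archive instead of pulling from a registry,
# which sidesteps the registry timeouts reported above.
# The command is only printed unless DRY_RUN=0 is set.
load_image() {
  set -- docker load -i "$IMAGE_TAR"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

load_image
```

Running with `DRY_RUN=0` requires permission to talk to the Docker daemon on the node.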

3 participants