Ansible resnet #233

Open: wants to merge 27 commits into base: master
Changes from 22 commits
83 changes: 83 additions & 0 deletions Classification/resnet50/0_dist_ssh_key/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,83 @@
# Using Ansible to distribute an SSH public key to multiple target hosts

## 1. Create and encrypt the variables file

Create a variables file, `vars.yml`, containing the passwords:

```yaml
all:
  hosts:
    192.168.1.27:
      ansible_user: myuser
      ansible_password: mypassword
    192.168.1.28:
      ansible_user: myuser
      ansible_password: mypassword
```

Then encrypt the file with Ansible Vault:

```bash
ansible-vault encrypt vars.yml
```

Notes:

1. `ansible-vault` prompts you to set a password while it runs; remember it or store it somewhere safe.
2. `vars.yml` is replaced in place by its encrypted version.

## 2. Create the inventory file

Create an inventory file, `inventory.ini`:

```ini
[all]
node1 ansible_host=192.168.1.27 ansible_user=myuser
node2 ansible_host=192.168.1.28 ansible_user=myuser
```

Note: change the value of `ansible_user` to match your environment.
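
For clusters with more than a couple of nodes, the inventory lines can be generated instead of typed by hand. The following is a minimal sketch, assuming every node shares the same login user; the helper name `make_inventory` and the `nodeN` alias scheme are illustrative choices, not part of this PR.

```bash
# make_inventory USER IP...
# Prints an INI inventory with one "nodeN" alias per IP address.
# The nodeN naming is an assumption borrowed from the README example.
make_inventory() {
    user=$1
    shift
    printf '[all]\n'
    i=1
    for ip in "$@"; do
        printf 'node%d ansible_host=%s ansible_user=%s\n' "$i" "$ip" "$user"
        i=$((i + 1))
    done
}

# Example: regenerate the two-node inventory from this section.
make_inventory myuser 192.168.1.27 192.168.1.28
```

Redirecting the output to `inventory.ini` gives the same file as the hand-written example.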

## 3. Create the playbook

If the file already exists, skip this step.

Create a playbook, `distribute_ssh_key.yml`:

```yaml
---
- name: Distribute SSH key
  hosts: all
  vars_files:
    - vars.yml
  tasks:
    - name: Create .ssh directory if it doesn't exist
      file:
        path: /home/{{ ansible_user }}/.ssh
        state: directory
        mode: '0700'
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"

    - name: Copy the SSH key to the authorized_keys file
      authorized_key:
        user: "{{ ansible_user }}"
        state: present
        key: "{{ lookup('file', '/path/to/id_rsa.pub') }}"
```

Note: `vars_files` is set to `vars.yml`.

## 4. Run the playbook

Run the playbook with the following command, which also decrypts the variables file:

Review comment: Doesn't this require the other servers to already have the main server's public key before it can run? Otherwise it fails with a connection error.

Collaborator Author: "Using Ansible to distribute an SSH public key to multiple target hosts" is exactly the step that sets up the public key.


```bash
ansible-playbook -i inventory.ini distribute_ssh_key.yml --ask-vault-pass
```

Alternatively, run:

```bash
./dist_ssh_key.sh
```
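
Before running either command, it can help to confirm that the three files from the earlier steps are present in the working directory. This is a minimal sketch; `missing_files` is a hypothetical helper, not something this PR provides.

```bash
# missing_files FILE...
# Prints the name of every argument that is not a regular file and
# returns non-zero if at least one of them is missing.
missing_files() {
    rc=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            printf '%s\n' "$f"
            rc=1
        fi
    done
    return "$rc"
}

# Check the files the playbook run depends on.
if missing_files inventory.ini distribute_ssh_key.yml vars.yml >/dev/null; then
    echo "all required files are present"
else
    echo "some required files are missing" >&2
fi
```

Dropping the `>/dev/null` redirection lists the missing file names, one per line.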

1 change: 1 addition & 0 deletions Classification/resnet50/0_dist_ssh_key/dist_ssh_key.sh
@@ -0,0 +1 @@
ansible-playbook -i inventory.ini distribute_ssh_key.yml --ask-vault-pass
19 changes: 19 additions & 0 deletions Classification/resnet50/0_dist_ssh_key/distribute_ssh_key.yml
@@ -0,0 +1,19 @@
---
- name: Distribute SSH key
  hosts: all
  vars_files:
    - vars.yml
  tasks:
    - name: Create .ssh directory if it doesn't exist
      file:
        path: /home/{{ ansible_user }}/.ssh
        state: directory
        mode: '0700'
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"

    - name: Copy the SSH key to the authorized_keys file
      authorized_key:
        user: "{{ ansible_user }}"
        state: present
        key: "{{ lookup('file', '/home/xiexuan/.ssh/id_rsa.pub') }}"
3 changes: 3 additions & 0 deletions Classification/resnet50/0_dist_ssh_key/inventory.ini
@@ -0,0 +1,3 @@
[all]
of27 ansible_host=192.168.1.27 ansible_user=myuser
of28 ansible_host=192.168.1.28 ansible_user=myuser
8 changes: 8 additions & 0 deletions Classification/resnet50/0_dist_ssh_key/vars.yml
@@ -0,0 +1,8 @@
all:
  hosts:
    192.168.1.27:
      ansible_user: myuser
      ansible_password: mypassword
    192.168.1.28:
      ansible_user: myuser
      ansible_password: mypassword
61 changes: 61 additions & 0 deletions Classification/resnet50/1_get_docker_image/README.md
@@ -0,0 +1,61 @@
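# Pull or load an image

## Pull an image

Use this when the image can be pulled directly from Docker Hub.

Usage: `./pull.sh [image_tag]`

Parameters:

- image_tag (optional): the Docker image tag to pull, e.g. `alpine:latest`. If omitted, the default value in the playbook is used.

Examples: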

- Default usage:

```bash
./pull.sh
```

Review comment: Consider adding a note that Docker permissions are required.

- Specify an image tag:

```bash
./pull.sh alpine:latest
```

Review comment: On my side, pulling with a specified tag appears to time out:

TASK [Pull Docker image if not present] ************************************************************************************************************************
fatal: [of25]: FAILED! => {"changed": false, "msg": "Error pulling alpine - code: None message: error pulling image configuration: download failed after attempts=6: dial tcp 108.160.169.181:443: i/o timeout"}
fatal: [of27]: FAILED! => {"changed": false, "msg": "Error pulling alpine - code: None message: error pulling image configuration: download failed after attempts=6: dial tcp 31.13.112.9:443: i/o timeout"}

Collaborator Author: I hit this timeout too, which is why I developed the load + commit approach. We will build a custom image later, so pull may not be needed.

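## Load an image

Use this when a tar file of a saved image already exists in a local shared directory; the image is imported with `docker load`.

Usage: `./load.sh [image_file_path] [image_tag] [force_load]`

Parameters:

- image_file_path (optional): path to the Docker image tar file to load. Defaults to `/share_nfs/k85/oneflow.0.9.1.dev20240203-cuda11.8.tar`.
- image_tag (optional): the Docker image tag to apply after loading. Defaults to `oneflowinc/oneflow:0.9.1.dev20240203-cuda11.8`.
- force_load (optional): whether to force loading the image (`true` or `false`). Defaults to `false`.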
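
Optional positional parameters like these can be resolved with the shell's `${N:-default}` expansion. The sketch below applies that pattern to the three defaults listed above; `resolve_load_args` is a hypothetical helper, and the `load.sh` in this PR reaches the same result with explicit `if [ -n ... ]` checks.

```bash
# resolve_load_args [IMAGE_PATH] [IMAGE_TAG] [FORCE_LOAD]
# Prints the three effective values, one per line, substituting the
# documented default for any argument that is missing or empty.
resolve_load_args() {
    printf '%s\n' \
        "${1:-/share_nfs/k85/oneflow.0.9.1.dev20240203-cuda11.8.tar}" \
        "${2:-oneflowinc/oneflow:0.9.1.dev20240203-cuda11.8}" \
        "${3:-false}"
}

# Only the path is overridden; the tag and force flag fall back to defaults.
resolve_load_args /path/to/shared/abc.tar
```

Because `:-` also treats an empty string as missing, passing `""` for the first argument keeps the default path while still allowing a custom tag.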

Examples:

- Default usage:

```bash
./load.sh
```

- Specify the image file path and tag:

```bash
./load.sh /path/to/shared/abc.tar myrepo/myimage:latest
```

- Force loading the image:

```bash
./load.sh /path/to/shared/abc.tar myrepo/myimage:latest true
```
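
The load playbook in this PR takes the image ID from the last line of the `docker load` output and then tags it. The sketch below mirrors that extraction in shell. It assumes `docker load` prints `Loaded image ID: sha256:...` for an archive saved without a tag and `Loaded image: name:tag` for a tagged one, in which case there is no ID to extract. `extract_image_id` and the sample lines are illustrative only.

```bash
# extract_image_id LINE
# Prints the sha256 image ID found in LINE, or nothing when LINE
# carries no ID (e.g. "Loaded image: name:tag").
extract_image_id() {
    printf '%s\n' "$1" | sed -n 's/.*\(sha256:[0-9a-f][0-9a-f]*\).*/\1/p'
}

# Sample lines, made up for illustration (not captured from a real run).
extract_image_id "Loaded image ID: sha256:0123abcd"
extract_image_id "Loaded image: myrepo/myimage:latest"
```

An empty result for the second form suggests that a tagged archive may need different handling before the `docker tag` step, for example using the reported name directly.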




26 changes: 26 additions & 0 deletions Classification/resnet50/1_get_docker_image/load.sh
@@ -0,0 +1,26 @@
#!/bin/bash

if [ -n "$1" ]; then
    docker_image_path=$1
else
    docker_image_path="/share_nfs/k85/oneflow.0.9.1.dev20240203-cuda11.8.tar"
fi

if [ -n "$2" ]; then
    docker_image_tag=$2
else
    docker_image_tag="oneflowinc/oneflow:0.9.1.dev20240203-cuda11.8"
fi

if [ -n "$3" ]; then
    force_load=$3
else
    force_load=false
fi

ansible-playbook \
    -i ../inventory.ini \
    load_and_tag_docker_image.yml \
    -e "docker_image_path=$docker_image_path" \
    -e "docker_image_tag=$docker_image_tag" \
    -e "force_load=$force_load"
28 changes: 28 additions & 0 deletions Classification/resnet50/1_get_docker_image/load_and_tag_docker_image.yml
@@ -0,0 +1,28 @@
---
- name: Load and tag Docker image
  hosts: all
  vars:
    docker_image_path: "/share_nfs/k85/oneflow.0.9.1.dev20240203-cuda11.8.tar"
    docker_image_tag: "oneflowinc/oneflow:0.9.1.dev20240203-cuda11.8"
    force_load: false

  tasks:
    - name: Check if Docker image with the specified tag already exists
      command: "docker images -q {{ docker_image_tag }}"
      register: image_id
      changed_when: false
      when: not (force_load | bool)

    - name: Load Docker image from tar file
      command: "docker load -i {{ docker_image_path }}"
      when: (force_load | bool) or image_id.stdout == ""
      register: load_output

    - name: Get image ID from load output
      set_fact:
        loaded_image_id: "{{ load_output.stdout_lines[-1] | regex_search('sha256:[0-9a-f]+') }}"
      when: (force_load | bool) or image_id.stdout == ""

    - name: Tag the loaded Docker image
      command: "docker tag {{ loaded_image_id }} {{ docker_image_tag }}"
      when: (force_load | bool) or image_id.stdout == ""
7 changes: 7 additions & 0 deletions Classification/resnet50/1_get_docker_image/pull.sh
@@ -0,0 +1,7 @@
#!/bin/bash

if [ -n "$1" ]; then
    ansible-playbook -i ../inventory.ini pull_docker_image.yml -e "docker_image=$1"
else
    ansible-playbook -i ../inventory.ini pull_docker_image.yml
fi
17 changes: 17 additions & 0 deletions Classification/resnet50/1_get_docker_image/pull_docker_image.yml
@@ -0,0 +1,17 @@
---
- name: Pull specified Docker image
  hosts: all
  vars:
    docker_image: "oneflowinc/oneflow:0.9.1.dev20240203-cuda11.8"

  tasks:
    - name: Check if the Docker image is already present
      command: "docker images -q {{ docker_image }}"
      register: docker_image_id
      changed_when: false

    - name: Pull Docker image if not present
      docker_image:
        name: "{{ docker_image }}"
        source: pull
      when: docker_image_id.stdout == ""
39 changes: 39 additions & 0 deletions Classification/resnet50/2_distributed_training/README.md
@@ -0,0 +1,39 @@
# run_dist_training.sh usage

`run_dist_training.sh` is a Bash script that runs `ansible-playbook` to launch distributed training. It accepts arguments that specify the Docker image and the source directory.

## Usage

```bash
./run_dist_training.sh [docker_image] [src]
```

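## Parameters

- `docker_image` (optional): name of the Docker image to use. Defaults to `oneflowinc/oneflow:0.9.1.dev20240203-cuda11.8`.
- `src` (optional): source directory to mount into the Docker container. Defaults to `/share_nfs/k85/models/Vision/classification/image/resnet50`.

## Examples

1. Run with the default values: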

```bash
./run_dist_training.sh
```

2. Run with a specific Docker image:

```bash
./run_dist_training.sh "my_custom_image:latest"
```

3. Run with a specific Docker image and source directory:

```bash
./run_dist_training.sh "my_custom_image:latest" "/my/custom/src"
```

## Notes

If no arguments are given, the script uses the default Docker image and source directory.
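
For reference, the playbook `dist_training.yml` in this PR derives each node's launch parameters from the inventory. The node rank is the host's position in the `all` group, the node count is the size of that group, and the master address is the `ansible_host` of the first host. The following is a minimal shell sketch of the rank lookup only; `node_rank` is a hypothetical helper, and the host names are taken from the `inventory.ini` in this PR.

```bash
# node_rank HOST HOSTS...
# Prints the zero-based position of HOST in the ordered list HOSTS,
# mirroring groups['all'].index(inventory_hostname) in the playbook.
# Returns non-zero if HOST is not in the list.
node_rank() {
    target=$1
    shift
    rank=0
    for host in "$@"; do
        if [ "$host" = "$target" ]; then
            printf '%d\n' "$rank"
            return 0
        fi
        rank=$((rank + 1))
    done
    return 1
}

# With the inventory order "of27 of28", of27 is the master (rank 0).
node_rank of28 of27 of28
```

Because the rank follows inventory order, reordering the hosts in `inventory.ini` also changes which node acts as the master.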

51 changes: 51 additions & 0 deletions Classification/resnet50/2_distributed_training/dist_training.yml
@@ -0,0 +1,51 @@
---
- name: Distributed Training Setup
  hosts: all
  vars:
    device_num_per_node: 8
    num_nodes: "{{ groups['all'] | length }}"
    master_addr: "{{ hostvars[groups['all'][0]].ansible_host }}"
    docker_image: "oneflowinc/oneflow:0.9.1.dev20240203-cuda11.8"
    src: "/share_nfs/k85/models/Vision/classification/image/resnet50"

  tasks:
    - name: Set node rank
      set_fact:
        node_rank: "{{ groups['all'].index(inventory_hostname) }}"

    - name: distributed training in Docker container
      command: >
        docker run --rm --gpus all
        --runtime=nvidia --privileged
        --network host --ipc=host
        -v {{ src }}:/workspace
        -w /workspace
        {{ docker_image }} /bin/bash -c "
        python3 -m oneflow.distributed.launch \
        --nproc_per_node {{ device_num_per_node }} \
        --nnodes {{ num_nodes }} \
        --node_rank {{ node_rank }} \
        --master_addr {{ master_addr }} \
        /workspace/train.py \
        --synthetic-data \
        --batches-per-epoch 1000 \
        --num-devices-per-node {{ device_num_per_node }} \
        --lr 1.536 \
        --num-epochs 1 \
        --train-batch-size 32 \
        --graph \
        --use-fp16 \
        --metric-local False \
        --metric-train-acc True \
        --fuse-bn-relu \
        --fuse-bn-add-relu \
        --use-gpu-decode \
        --channel-last \
        --skip-eval
        "
      register: output

    - name: Display output
      debug:
        var: output.stdout

15 changes: 15 additions & 0 deletions Classification/resnet50/2_distributed_training/run_dist_training.sh
@@ -0,0 +1,15 @@
#!/bin/bash

DOCKER_IMAGE="oneflowinc/oneflow:0.9.1.dev20240203-cuda11.8"
SRC="/share_nfs/k85/models/Vision/classification/image/resnet50"

if [ -n "$1" ]; then
    DOCKER_IMAGE="$1"
fi

if [ -n "$2" ]; then
    SRC="$2"
fi

# Run the ansible-playbook command
ansible-playbook -i ../inventory.ini dist_training.yml -e "docker_image=${DOCKER_IMAGE}" -e "src=${SRC}"