diff --git a/docs/docs/basics/projects.md b/docs/docs/basics/projects.md
index 7b5e104..549c62d 100644
--- a/docs/docs/basics/projects.md
+++ b/docs/docs/basics/projects.md
@@ -15,8 +15,7 @@
 - `train.py`: The unified entry point for the project's training code, called from `train.sh`. `train.py` must implement support for distributed training.
 - `README.md`: Documentation for the project code, guiding users through preparing datasets and models and running training and inference.
 
-To make it easier to create a project that follows the conventions above, we provide the Colossal-AI cloud platform CLI, which initializes a project and its required files in one step. See the [CLI usage guide](../cli/cli.md) for details.
-
+To make it easier to create a project that follows the conventions above, we provide the Colossal-AI cloud platform CLI, which initializes a project and its required files in one step. See the [CLI usage guide](../cli/installation.md) for details.
 
 ## Create a Project
 
diff --git a/docs/docs/cli/cli.md b/docs/docs/cli/cli.md
deleted file mode 100644
index 58f288f..0000000
--- a/docs/docs/cli/cli.md
+++ /dev/null
@@ -1,71 +0,0 @@
-# Command-Line Tool
-
-## Introduction
-
-To let users create a project that meets the cloud platform's requirements in one step, we provide a command-line tool called cap, named after the initials of Colossal-AI Platform.
-
-## Installation
-
-1. Install from source
-
-```bash
-pip install git+https://github.com/hpcaitech/ColossalAI-Platform-CLI@main
-```
-
-1. Install from PyPI
-
-```bash
-pip install colossalai-platform
-```
-
-## Usage
-
-### Create a standard project
-
-The following command creates a standard project structure. Remember to replace `` with your own project name.
-
-```bash
-cap project init
-```
-
-This project will contain the following files
-
-```
-- 
-  - Dockerfile
-  - train.sh
-  - train.py
-  - HyperParameters.json
-  - README.md
-  - requirements.txt
-```
-
-`train.sh`, `train.py`, and `HyperParameters.json` are the files the cloud platform requires to launch a job.
-
-**1. HyperParameters.json**
-
-This file defines the hyperparameters a user needs to enter when launching a training job. Users can add their own hyperparameter definitions to the JSON.
-
-```
-{
-    "HyperParameters": [
-        {
-            "name": "max_epoch",
-            "types": "int",
-            "defaultValue": "10",
-            "description" : ""
-        }
-    ]
-}
-```
-The hyperparameter is then shown when launching a job.
-
-![Hyper Parameters](./images/hyperparams.png)
-
-**2. train.py**
-
-`train.py` contains the main training code.
-
-**3. train.sh**
-
-`train.sh` is the main entry point of the whole project; the cloud platform executes this file to start training.
diff --git a/docs/docs/cli/create-project.md b/docs/docs/cli/create-project.md
new file mode 100644
index 0000000..d61f799
--- /dev/null
+++ b/docs/docs/cli/create-project.md
@@ -0,0 +1,173 @@
+# Create a Project and Upload It to the Platform
+
+`cap project init` creates a standard project skeleton. In the command below, replace `my-project` with your own project name.
+
+```bash
+cap project init my-project
+# Project skeleton `my-project` has been initialized in `/tmp/cap-demonstration/my-project`
+#
+# - Edit `train.sh`, `train.py` and `HyperParameters.json` to create your own training project.
+# - To upload it to the platform, run `cap project create` and `cap project upload-dir`.
+```
+
+This command creates the following project skeleton in the current directory:
+
+```
+- my-project
+  - Dockerfile
+  - train.sh
+  - train.py
+  - HyperParameters.json
+  - README.md
+  - requirements.txt
+```
+
+## Configure the Job Files
+
+`train.sh`, `train.py`, and `HyperParameters.json` are the files the platform requires to launch a job. Their roles are described in turn below.
+
+### HyperParameters.json
+
+This file defines the hyperparameters that users can enter in the UI when launching a training job. The final hyperparameter values are passed as environment variables.
+
+For example, add a definition for the hyperparameter `max_epoch` as follows:
+
+```json
+// HyperParameters.json
+{
+    "HyperParameters": [
+        {
+            "name": "max_epoch",
+            "types": "int",
+            "defaultValue": "10",
+            "description" : "the max epoch of training"
+        }
+    ]
+}
+```
+
+When launching a job, an input field appears where the value of `max_epoch` can be set.
+
+![Hyper Parameters](./images/hyperparams.png)
+
+The corresponding environment variable is `MAX_EPOCH`.
+
+### train.sh
+
+When a training job starts, the platform runs the `train.sh` script inside every training container.
+
+Training jobs use torchrun; mounted data, hyperparameters, RANK, and similar information are passed in through environment variables.
+
+The `train.sh` in the project skeleton is as follows:
+
+```bash
+#!/usr/bin/env bash
+SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
+
+# ===================================================================
+# Welcome to ColossalAI Platform!
+# ===================================================================
+# Those environment variables would be injected by the runner:
+#
+# 1. ColossalAI Platform defined ones:
+#    PROJECT_DIR, DATASET_DIR, MODEL_DIR, OUTPUT_DIR, SCRIPT_DIR
+#
+# 2. Required by torchrun:
+#    NNODES, NPROC_PER_NODE, NODE_RANK, MASTER_ADDR, MASTER_PORT
+#
+# 3. Hyperparameters from configuration UI:
+#    (check HyperParameters.json for more details)
+#
+# After that, the runner would execute `train.sh`, this script.
+# ===================================================================
+
+torchrun --nnodes ${NNODES} \
+    --nproc_per_node ${NPROC_PER_NODE} \
+    --node_rank ${NODE_RANK} \
+    --master_addr ${MASTER_ADDR} \
+    --master_port ${MASTER_PORT} \
+    ${SCRIPT_DIR}/train.py \
+    --project_dir ${PROJECT_DIR} \
+    --dataset_dir ${DATASET_DIR} \
+    --model_dir ${MODEL_DIR} \
+    --output_dir ${OUTPUT_DIR}
+
+# TODO: add more argument passing here
+```
+
+### train.py
+
+`train.py` is the main training code. In a training job it is called by `train.sh`; for local testing, you can pass arguments to it separately.
+
+## Upload the Project to the Platform
+
+### Create an Empty Project
+
+First create an empty project on the platform, using either the browser UI or the command-line tool. Example commands:
+
+```bash
+cap project create
+# Create an empty project, user: myusername
+# Project name: My Project
+# Project description: This project is for demonstration purpose.
+# Do you want to continue [y/N]: y
+# Project created successfully, id: 65558667d419d3db7d3ddbb6
+
+cap project list
+# Name: My Project
+# ID: 65558667d419d3db7d3ddbb6
+# Description: This project is for demonstration purpose.
+# Created At: 2023-11-16 03:03:03
+#
+# ...
+```
+
+Log in to the platform's browser UI; the created project appears under **Console - Assets - Projects**.
+
+![Create project](./images/create-project-empty.zh.png)
+
+Click **Details** to see that the project is empty.
+
+![Create project details](./images/create-project-empty-2.zh.png)
+
+Project code can be uploaded with the **Upload Folder** button in the browser UI or with the command-line tool `cap`. The following shows how to upload code with the command-line tool.
+
+### Upload Local Project Code to the Platform
+
+First get the project ID, either from `cap project list` or from the ID in the URL in the browser UI.
+
+In the example above, the project ID is `65558667d419d3db7d3ddbb6`.
+
+Next, upload the project code with the `cap project upload-dir` command. Example output:
+
+```bash
+cd my-project/
+
+cap project upload-dir 65558667d419d3db7d3ddbb6 .
+# Upload overview:
+# Local directory: /tmp/cap-demonstration/my-project
+# Dataset:
+# ID: 65558667d419d3db7d3ddbb6
+# Name: My Project
+# Description: This project is for demonstration purpose.
+# CreatedAt: 2023-11-16 03:03:03
+#
+# The project content would be overwritten.
+#
+# Do you want to continue [y/N]: y
+# Clearing project 65558667d419d3db7d3ddbb6...
+# Uploading directory . as project 65558667d419d3db7d3ddbb6...
+# HyperParameters.json => HyperParameters.json
+# README.md => README.md
+# requirements.txt => requirements.txt
+# Dockerfile => Dockerfile
+# patch.py => patch.py
+# train.py => train.py
+# train.sh => train.sh
+# Done.
+# Directory . uploaded as project 65558667d419d3db7d3ddbb6.
+```
+
+After the upload completes, the project code is visible in the browser UI.
+
+![Project code after upload](./images/create-project-empty-3.zh.png)
diff --git a/docs/docs/cli/images/create-project-empty-2.en.png b/docs/docs/cli/images/create-project-empty-2.en.png
new file mode 100644
index 0000000..2695844
Binary files /dev/null and b/docs/docs/cli/images/create-project-empty-2.en.png differ
diff --git a/docs/docs/cli/images/create-project-empty-2.zh.png b/docs/docs/cli/images/create-project-empty-2.zh.png
new file mode 100644
index 0000000..9b45022
Binary files /dev/null and b/docs/docs/cli/images/create-project-empty-2.zh.png differ
diff --git a/docs/docs/cli/images/create-project-empty-3.en.png b/docs/docs/cli/images/create-project-empty-3.en.png
new file mode 100644
index 0000000..4ed59ab
Binary files /dev/null and b/docs/docs/cli/images/create-project-empty-3.en.png differ
diff --git a/docs/docs/cli/images/create-project-empty-3.zh.png b/docs/docs/cli/images/create-project-empty-3.zh.png
new file mode 100644
index 0000000..4039d12
Binary files /dev/null and b/docs/docs/cli/images/create-project-empty-3.zh.png differ
diff --git a/docs/docs/cli/images/create-project-empty.en.png b/docs/docs/cli/images/create-project-empty.en.png
new file mode 100644
index 0000000..f5b9f4a
Binary files /dev/null and b/docs/docs/cli/images/create-project-empty.en.png differ
diff --git a/docs/docs/cli/images/create-project-empty.zh.png b/docs/docs/cli/images/create-project-empty.zh.png
new file mode 100644
index 0000000..6f09411
Binary files /dev/null and b/docs/docs/cli/images/create-project-empty.zh.png differ
diff --git a/docs/docs/cli/installation.md b/docs/docs/cli/installation.md
new file mode 100644
index 0000000..080dce3
--- /dev/null
+++ b/docs/docs/cli/installation.md
@@ -0,0 +1,49 @@
+# Install the cap Command-Line Tool
+
+We provide a command-line tool, `cap`, that helps users manage datasets and code on the platform, especially for uploading and downloading on servers without a graphical interface. Its name comes from the initials of Colossal-AI Platform.
+
+1. Install from source (recommended, because the API may change frequently)
+
+```bash
+pip install git+https://github.com/hpcaitech/ColossalAI-Platform-CLI@main
+```
+
+2. Install from PyPI
+
+```bash
+pip install colossalai-platform
+```
+
+## Configuration
+
+After installation, configure the command-line tool with the `cap configure` command.
+
+It asks for a username and password and tries to log in to the platform's API Server to verify that the configuration is valid. The output looks like this:
+
+```bash
+cap configure
+# Config doesn't exist on /home/myuser/.colossalai-platform/config.yaml, writing default to it
+# Username: myusername
+# Password (Hide input):
+# Login successfully!
+#
+# Thank you for choosing the ColossalAI Platform!
+# During our public beta phase, we're actively developing and improving the platform. We appreciate your patience with any user experience issues.
+#
+# For assistance, visit [doc link](TODO) or reach out anytime.
+# Your feedback is valuable as we strive to enhance your experience.
+```
+
+## Configuration File
+
+The username and password are saved in a configuration file at `$HOME/.colossalai-platform/config.yaml`. Example contents:
+
+```bash
+cat ~/.colossalai-platform/config.yaml
+# api_server: https://180.184.83.159
+# username: myusername
+# password: **********
+# max_upload_chunk_bytes: 104857600
+```
+
+In addition, when connecting to a privately deployed instance, change the API Server address in the configuration file.
diff --git a/docs/sidebars.js b/docs/sidebars.js
index 0f4c033..5e32550 100644
--- a/docs/sidebars.js
+++ b/docs/sidebars.js
@@ -39,7 +39,10 @@ const sidebars = {
         {
           type: "category",
           label: "命令行工具",
-          items: ["cli/cli"],
+          items: [
+            "cli/installation",
+            "cli/create-project",
+          ],
         },
       ],
       contactSidebar: [
diff --git a/scripts/build.sh b/scripts/build.sh
old mode 100644
new mode 100755
diff --git a/scripts/install_node.sh b/scripts/install_node.sh
old mode 100644
new mode 100755
diff --git a/scripts/preview.sh b/scripts/preview.sh
old mode 100644
new mode 100755
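
Note on the new `create-project.md`: it says that `train.sh` forwards `--project_dir`, `--dataset_dir`, `--model_dir` and `--output_dir` to `train.py`, and that a hyperparameter such as `max_epoch` reaches the job as the environment variable `MAX_EPOCH`. The diff does not include the body of `train.py`, so the following is only a minimal sketch of how such a script might consume both. The function names `parse_args` and `read_max_epoch` are hypothetical, and the fallback of 10 simply mirrors the `defaultValue` in the example `HyperParameters.json`; it also assumes the value arrives as a decimal string.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the four directory flags that the skeleton train.sh passes along."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py entry point")
    for name in ("project_dir", "dataset_dir", "model_dir", "output_dir"):
        parser.add_argument(f"--{name}", type=str, required=True)
    return parser.parse_args(argv)


def read_max_epoch(environ=None, default=10):
    """Read MAX_EPOCH from the environment (assumed to be a decimal string)."""
    environ = os.environ if environ is None else environ
    return int(environ.get("MAX_EPOCH", default))
```

For a local test outside the platform, something like `MAX_EPOCH=3 python train.py --project_dir . --dataset_dir ./data --model_dir ./model --output_dir ./out` would exercise the same code path, provided the script calls `parse_args()` and `read_max_epoch()` at startup.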