[docs] CLI: installation.md and create-project.md #11

Merged
merged 1 commit into from
Nov 16, 2023
3 changes: 1 addition & 2 deletions docs/docs/basics/projects.md
@@ -15,8 +15,7 @@
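- `train.py`: the unified entry point for the project's training code, invoked from `train.sh`. Support for distributed training must be implemented in `train.py`.
- `README.md`: the project's documentation, explaining how to prepare datasets and models and how to run training and inference.

To make it easier to create a project that follows the conventions above, we provide the Colossal-AI cloud platform CLI, which initializes a project and its required files with a single command. See the [CLI usage guide](../cli/cli.md) for details.

To make it easier to create a project that follows the conventions above, we provide the Colossal-AI cloud platform CLI, which initializes a project and its required files with a single command. See the [CLI usage guide](../cli/installation.md) for details.

## Create a project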

71 changes: 0 additions & 71 deletions docs/docs/cli/cli.md

This file was deleted.

173 changes: 173 additions & 0 deletions docs/docs/cli/create-project.md
@@ -0,0 +1,173 @@
# Create a project and upload it to the platform

`cap project init` creates a standard project skeleton. In the command below, replace `my-project` with your own project name.

```bash
cap project init my-project
# Project skeleton `my-project` has been initialized in `/tmp/cap-demonstration/my-project`
#
# - Edit `train.sh`, `train.py` and `HyperParameters.json` to create your own training project.
# - To upload it to the platform, run `cap project create` and `cap project upload-dir`.
```

This command creates the following project skeleton in the current directory:

```
- my-project
  - Dockerfile
  - train.sh
  - train.py
  - HyperParameters.json
  - README.md
  - requirements.txt
```

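## Configure the job files

`train.sh`, `train.py`, and `HyperParameters.json` are the files the platform requires to launch a job. Their roles are described below.

### HyperParameters.json

This file defines the hyperparameters that users can enter in the UI when launching a training job. The final values are passed to the job as environment variables.

For example, a hyperparameter named `max_epoch` is defined as follows: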

```json
// HyperParameters.json
{
  "HyperParameters": [
    {
      "name": "max_epoch",
      "types": "int",
      "defaultValue": "10",
      "description": "the max epoch of training"
    }
  ]
}
```

When launching a job, a field appears where the value of `max_epoch` can be set.

![Hyper Parameters](./images/hyperparams.png)

The corresponding environment variable is `MAX_EPOCH`.
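
As a rough illustration of how a training script might consume these values, the sketch below resolves each defined hyperparameter from the environment and falls back to `defaultValue` when the variable is unset. It assumes the environment variable name is simply the upper-cased hyperparameter name (the documentation only confirms `max_epoch` → `MAX_EPOCH`). The helper `resolve_hyperparameters` and the `CASTS` table are hypothetical and not part of the platform.

```python
import json
import os

# Hypothetical mapping from the "types" field to Python casts. Only "int"
# appears in the documentation; the other entries are assumptions.
CASTS = {"int": int, "float": float, "string": str}


def resolve_hyperparameters(definition_text, environ=os.environ):
    """Return {name: value}, reading each value from the environment.

    Assumes the variable name is the upper-cased hyperparameter name.
    """
    definitions = json.loads(definition_text)["HyperParameters"]
    resolved = {}
    for item in definitions:
        raw = environ.get(item["name"].upper(), item["defaultValue"])
        cast = CASTS.get(item["types"], str)  # unknown types stay strings
        resolved[item["name"]] = cast(raw)
    return resolved


# Same definition as above, without the leading comment line,
# because strict JSON does not allow comments.
EXAMPLE = """
{
  "HyperParameters": [
    {"name": "max_epoch", "types": "int", "defaultValue": "10",
     "description": "the max epoch of training"}
  ]
}
"""

if __name__ == "__main__":
    print(resolve_hyperparameters(EXAMPLE, {"MAX_EPOCH": "3"}))
```

Passing the environment as a parameter keeps the helper easy to test locally without exporting variables.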

### train.sh

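When a training job starts, the platform runs the `train.sh` script inside every training container.

Training jobs use torchrun; mounted data, hyperparameters, and information such as RANK are passed in through environment variables.

The `train.sh` in the project skeleton is as follows: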

```bash
#!/usr/bin/env bash
SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"

# ===================================================================
# Welcome to ColossalAI Platform!
# ===================================================================
# Those environment variables would be injected by the runner:
#
# 1. ColossalAI Platform defined ones:
# PROJECT_DIR, DATASET_DIR, MODEL_DIR, OUTPUT_DIR, SCRIPT_DIR
#
# 2. Required by torchrun:
# NNODES, NPROC_PER_NODE, NODE_RANK, MASTER_ADDR, MASTER_PORT
#
# 3. Hyperparameters from configuration UI:
# (check HyperParameters.json for more details)
#
# After that, the runner would execute `train.sh`, this script.
# ===================================================================

torchrun --nnodes ${NNODES} \
    --nproc_per_node ${NPROC_PER_NODE} \
    --node_rank ${NODE_RANK} \
    --master_addr ${MASTER_ADDR} \
    --master_port ${MASTER_PORT} \
    ${SCRIPT_DIR}/train.py \
    --project_dir ${PROJECT_DIR} \
    --dataset_dir ${DATASET_DIR} \
    --model_dir ${MODEL_DIR} \
    --output_dir ${OUTPUT_DIR}

# TODO: add more argument passing here
```
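
When running `train.sh` outside the platform, the injected variables are absent, so a quick pre-flight check can help. The sketch below is only an illustration: the variable names are copied from the comment block in `train.sh`, `SCRIPT_DIR` is left out because the script computes it itself, and the helper `missing_variables` is a hypothetical name.

```python
import os

# Names taken from the comment block in train.sh.
PLATFORM_VARS = ["PROJECT_DIR", "DATASET_DIR", "MODEL_DIR", "OUTPUT_DIR"]
TORCHRUN_VARS = ["NNODES", "NPROC_PER_NODE", "NODE_RANK", "MASTER_ADDR", "MASTER_PORT"]


def missing_variables(environ=os.environ):
    """Return the expected variable names that are unset or empty."""
    return [name for name in PLATFORM_VARS + TORCHRUN_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All expected environment variables are set.")
```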

### train.py

`train.py` contains the main training code. In a training job it is invoked by `train.sh`; for local testing, you can pass the arguments yourself.
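
A minimal sketch of the argument handling such a `train.py` might start from is shown below. It is not the skeleton's actual file. The four flags match what `train.sh` forwards; falling back to the same-named environment variables and ignoring unknown flags are assumptions made here for convenience. `RANK` and `WORLD_SIZE` are set by torchrun for each worker process.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the four directory arguments that train.sh forwards."""
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    for name in ("project_dir", "dataset_dir", "model_dir", "output_dir"):
        # Assumption: fall back to the same-named environment variable, then ".".
        parser.add_argument(f"--{name}", type=str,
                            default=os.environ.get(name.upper(), "."))
    # Unknown flags are ignored so extra arguments do not abort the run.
    args, _unknown = parser.parse_known_args(argv)
    return args


def main(argv=None):
    args = parse_args(argv)
    # The defaults only matter when running without torchrun.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    print(f"rank {rank} of {world_size}: "
          f"dataset={args.dataset_dir} output={args.output_dir}")
    # The actual training loop would go here.
    return args


if __name__ == "__main__":
    main()
```

For a local test, the script could be run directly, for example `python train.py --dataset_dir ./data --output_dir ./out`, where both paths are placeholders.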

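## Upload the project to the platform

### Create an empty project

First, create an empty project on the platform, using either the browser UI or the command-line tool. Example commands: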

```bash
cap project create
# Create an empty project, user: myusername
# Project name: My Project
# Project description: This project is for demonstration purpose.
# Do you want to continue [y/N]: y
# Project created successfully, id: 65558667d419d3db7d3ddbb6

cap project list
# Name: My Project
# ID: 65558667d419d3db7d3ddbb6
# Description: This project is for demonstration purpose.
# Created At: 2023-11-16 03:03:03
#
# ...
```

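Log in to the platform's browser UI; the newly created project is listed under **Console - Assets - Projects**.

![Create project](./images/create-project-empty.zh.png)

Click **Details** to see that the project is empty.

![Project details](./images/create-project-empty-2.zh.png)

Project code can be uploaded with the **Upload Folder** button in the browser UI or with the `cap` command-line tool. The following shows how to upload code with the command-line tool.

### Upload local project code to the platform

First, get the project ID, either by running `cap project list` or from the URL in the browser UI.

In the example above, the project ID is `65558667d419d3db7d3ddbb6`.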
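
If the ID needs to be picked up in a script, one option is to parse the text printed by `cap project list`. The sketch below assumes the output consists of `Key: value` lines exactly as in the sample above, which is inferred from that sample only and may not hold for other CLI versions. The helpers `parse_project_list` and `find_project_id` are hypothetical names.

```python
def parse_project_list(text):
    """Parse `Key: value` lines into a list of project dicts.

    A new project starts at each `Name:` line; other lines are ignored.
    """
    projects = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a `Key: value` line
        key, value = key.strip(), value.strip()
        if key == "Name":
            projects.append({})
        if projects and key in ("Name", "ID", "Description", "Created At"):
            projects[-1][key] = value
    return projects


def find_project_id(text, name):
    """Return the ID of the first project called `name`, or None."""
    for project in parse_project_list(text):
        if project.get("Name") == name:
            return project.get("ID")
    return None


# Sample copied from the example output above.
SAMPLE = """\
Name: My Project
ID: 65558667d419d3db7d3ddbb6
Description: This project is for demonstration purpose.
Created At: 2023-11-16 03:03:03
"""

if __name__ == "__main__":
    print(find_project_id(SAMPLE, "My Project"))
```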

Next, upload the project code with the `cap project upload-dir` command. Example output:

```bash
cd my-project/

cap project upload-dir 65558667d419d3db7d3ddbb6 .
# Upload overview:
# Local directory: /tmp/cap-demonstration/my-project
# Dataset:
# ID: 65558667d419d3db7d3ddbb6
# Name: My Project
# Description: This project is for demonstration purpose.
# CreatedAt: 2023-11-16 03:03:03
#
# The project content would be overwritten.
#
# Do you want to continue [y/N]: y
# Clearing project 65558667d419d3db7d3ddbb6...
# Uploading directory . as project 65558667d419d3db7d3ddbb6...
# HyperParameters.json => HyperParameters.json
# README.md => README.md
# requirements.txt => requirements.txt
# Dockerfile => Dockerfile
# patch.py => patch.py
# train.py => train.py
# train.sh => train.sh
# Done.
# Directory . uploaded as project 65558667d419d3db7d3ddbb6.
```

After the upload completes, the project code is visible in the browser UI.

![Project code after upload](./images/create-project-empty-3.zh.png)
Binary file added docs/docs/cli/images/create-project-empty.en.png
Binary file added docs/docs/cli/images/create-project-empty.zh.png
49 changes: 49 additions & 0 deletions docs/docs/cli/installation.md
@@ -0,0 +1,49 @@
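# Install the cap command-line tool

We provide a command-line tool, `cap`, for managing datasets and code on the platform. It is especially useful for uploading and downloading on servers without a graphical interface. Its name comes from the initials of Colossal-AI Platform.

1. Install from source (recommended, because the API may change frequently)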

```bash
pip install git+https://github.com/hpcaitech/ColossalAI-Platform-CLI@main
```

2. Install from PyPI

```bash
pip install colossalai-platform
```

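## Configuration

After installation, run `cap configure` to configure the command-line tool.

It prompts for a username and password and tries to log in to the platform's API server to verify that the configuration is valid. The output looks like this: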

```bash
cap configure
# Config doesn't exist on /home/myuser/.colossalai-platform/config.yaml, writing default to it
# Username: myusername
# Password (Hide input):
# Login successfully!
#
# Thank you for choosing the ColossalAI Platform!
# During our public beta phase, we're actively developing and improving the platform. We appreciate your patience with any user experience issues.
#
# For assistance, visit [doc link](TODO) or reach out anytime.
# Your feedback is valuable as we strive to enhance your experience.
```

## Configuration file

The username and password are saved in the configuration file at `$HOME/.colossalai-platform/config.yaml`. Example contents:

```bash
cat ~/.colossalai-platform/config.yaml
# api_server: https://180.184.83.159
# username: myusername
# password: **********
# max_upload_chunk_bytes: 104857600
```

In addition, when connecting to a privately deployed instance, change the API server address in the configuration file.
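
Editing the file by hand is enough, but the change can also be scripted. The sketch below rewrites the `api_server` line using plain text handling and assumes the flat `key: value` layout shown in the example. The helper `set_api_server` and the URL `https://cap.example.internal` are hypothetical placeholders, and the demo works on a temporary file rather than the real configuration.

```python
import tempfile
from pathlib import Path


def set_api_server(config_path, new_url):
    """Rewrite the `api_server:` line of a flat `key: value` config file.

    Plain text handling only; appends the key if it is missing.
    """
    path = Path(config_path)
    lines = path.read_text(encoding="utf-8").splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.startswith("api_server:"):
            lines[i] = f"api_server: {new_url}"
            replaced = True
    if not replaced:
        lines.append(f"api_server: {new_url}")
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Demonstrate on a temporary copy, not on the real config file.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "config.yaml"
        demo.write_text("api_server: https://180.184.83.159\nusername: myusername\n",
                        encoding="utf-8")
        set_api_server(demo, "https://cap.example.internal")
        print(demo.read_text(encoding="utf-8"))
```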
5 changes: 4 additions & 1 deletion docs/sidebars.js
@@ -39,7 +39,10 @@ const sidebars = {
{
type: "category",
label: "命令行工具",
items: ["cli/cli"],
items: [
"cli/installation",
"cli/create-project",
],
},
],
contactSidebar: [
Empty file modified scripts/build.sh
100644 → 100755
Empty file.
Empty file modified scripts/install_node.sh
100644 → 100755
Empty file.
Empty file modified scripts/preview.sh
100644 → 100755
Empty file.