add code example
lanran2001 committed Oct 8, 2023
1 parent 4c67b77 commit 967717f
Showing 1 changed file with 119 additions and 0 deletions.
119 changes: 119 additions & 0 deletions docs/blog/2023-10-08-CreateTrainingJob/index.md
@@ -23,6 +23,125 @@ tags: [ColossalChat]
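#### Manual creation
To create a project manually, you need to add certain files to the project folder. See the [project specification](https://docs.platform.luchentech.com/docs/basics/projects#%E9%A1%B9%E7%9B%AE%E8%A7%84%E8%8C%83) for reference.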

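#### Code examples
##### HyperParameters.json
This file defines the hyperparameters used when launching training. Their values can be set in the web UI when you start a job.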
```json
{
  "HyperParameters": [
    {
      "name": "epoch",
      "types": "string",
      "defaultValue": "10",
      "description": "Number of epochs for training"
    }
  ]
}
```
On the platform, each hyperparameter is injected into `train.sh` as an environment variable named after it, here `${epoch}`.
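The file above declares `epoch` with a string default, and the platform exposes it as an environment variable. A minimal sketch of reading such values from Python follows. It assumes the variable is visible to the Python process (inherited from the environment of `train.sh`); the helper name `resolve_hyperparameters` is hypothetical and not part of the platform.

```python
import json
import os
from typing import Mapping


def resolve_hyperparameters(config: dict, env: Mapping[str, str]) -> dict:
    """Return {name: value}, preferring `env` over each entry's `defaultValue`.

    Hypothetical helper for illustration; not a platform API.
    """
    resolved = {}
    for item in config["HyperParameters"]:
        name = item["name"]
        resolved[name] = env.get(name, item["defaultValue"])
    return resolved


# Same structure as the HyperParameters.json example in this post.
EXAMPLE_CONFIG = json.loads("""
{
  "HyperParameters": [
    {
      "name": "epoch",
      "types": "string",
      "defaultValue": "10",
      "description": "Number of epochs for training"
    }
  ]
}
""")

if __name__ == "__main__":
    params = resolve_hyperparameters(EXAMPLE_CONFIG, os.environ)
    # Values arrive as strings, so convert before use.
    num_epochs = int(params["epoch"])
    print(f"training for {num_epochs} epochs")
```

Falling back to `defaultValue` keeps the script usable in local runs where the platform has not injected anything.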
##### train.py
This file starts the training job. It is called by `train.sh`.
```python
import os
import argparse
from patch import patch_platform_specific_dependencies

# Please do not remove this call,
# the platform's runtime environment needs it.
patch_platform_specific_dependencies()


def add_platform_args(parser: argparse.ArgumentParser):
    # required arguments
    parser.add_argument(
        "--project_dir",
        type=str,
        required=True,
        help="The directory that contains the project code.",
    )
    parser.add_argument(
        "--dataset_dir",
        type=str,
        required=True,
        help="The directory that contains the training dataset.",
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        required=True,
        help="The directory the project writes its output into.",
    )

    # optional arguments, add more if you need
    parser.add_argument(
        "--model_dir",
        type=str,
        default=None,
        help="The directory that contains the model to finetune.",
    )


def main():
    parser = argparse.ArgumentParser(description="training script")
    add_platform_args(parser)
    args = parser.parse_args()

    # There are some path conventions:
    #
    # $OUTPUT_DIR/tensorboard:
    #     The platform-builtin tensorboard expects events to be here.
    # $OUTPUT_DIR/checkpoint:
    #     The platform-builtin checkpoint recovery feature
    #     expects the checkpoint to be here.
    #
    # os.makedirs is used because os.mkdir does not accept exist_ok.
    tensorboard_dir = os.path.join(args.output_dir, "tensorboard")
    os.makedirs(tensorboard_dir, exist_ok=True)
    checkpoint_dir = os.path.join(args.output_dir, "checkpoint")
    os.makedirs(checkpoint_dir, exist_ok=True)

    # TODO: your training code here


if __name__ == "__main__":
    main()
```
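The path conventions in `train.py` can be wrapped in small helpers. The sketch below creates the two conventional subdirectories and picks the most recent checkpoint when resuming. The `epoch_<N>.pt` file naming and both helper names are assumptions made for illustration; the post does not specify what checkpoint layout the platform's recovery feature expects.

```python
import os
import re
from typing import Optional, Tuple


def prepare_output_dirs(output_dir: str) -> Tuple[str, str]:
    """Create and return the conventional tensorboard and checkpoint dirs."""
    tensorboard_dir = os.path.join(output_dir, "tensorboard")
    checkpoint_dir = os.path.join(output_dir, "checkpoint")
    # exist_ok=True makes the call safe when a job is restarted.
    os.makedirs(tensorboard_dir, exist_ok=True)
    os.makedirs(checkpoint_dir, exist_ok=True)
    return tensorboard_dir, checkpoint_dir


def latest_checkpoint(checkpoint_dir: str) -> Optional[str]:
    """Return the path of the highest-numbered `epoch_<N>.pt` file, if any.

    The naming scheme is an assumption of this sketch, not a platform rule.
    """
    pattern = re.compile(r"^epoch_(\d+)\.pt$")
    best_epoch, best_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = pattern.match(name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = os.path.join(checkpoint_dir, name)
    return best_path
```

Comparing the epoch numbers as integers avoids the lexicographic trap where `epoch_2.pt` would sort after `epoch_10.pt`.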

##### train.sh
This file is the entry point for the training code. The cloud platform executes this bash script.
```bash
#!/usr/bin/env bash
SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"

# ===================================================================
# Welcome to ColossalAI Platform!
# ===================================================================
# The runner injects these environment variables:
#
# 1. Defined by ColossalAI Platform:
#    PROJECT_DIR, DATASET_DIR, MODEL_DIR, OUTPUT_DIR, SCRIPT_DIR
#
# 2. Required by torchrun:
#    NNODES, NPROC_PER_NODE, NODE_RANK, MASTER_ADDR, MASTER_PORT
#
# 3. Hyperparameters from the configuration UI
#    (see HyperParameters.json for details)
#
# After that, the runner executes this script, `train.sh`.
# ===================================================================

# Variables are quoted so that paths containing spaces stay intact.
torchrun --nnodes "${NNODES}" \
    --nproc_per_node "${NPROC_PER_NODE}" \
    --node_rank "${NODE_RANK}" \
    --master_addr "${MASTER_ADDR}" \
    --master_port "${MASTER_PORT}" \
    "${SCRIPT_DIR}/train.py" \
    --project_dir "${PROJECT_DIR}" \
    --dataset_dir "${DATASET_DIR}" \
    --model_dir "${MODEL_DIR}" \
    --output_dir "${OUTPUT_DIR}"

# TODO: add more argument passing here
```
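One way to act on the TODO at the end of `train.sh` is to append a flag such as `--epoch "${epoch}"` to the `torchrun` command and accept it in `train.py`. The sketch below shows the Python side only. The flag name `--epoch`, its default, and the helper `add_hyperparameter_args` are illustrative assumptions, not something the platform prescribes.

```python
import argparse


def add_hyperparameter_args(parser: argparse.ArgumentParser) -> None:
    """Register hyperparameters forwarded from train.sh (hypothetical helper)."""
    parser.add_argument(
        "--epoch",
        type=int,  # the UI value is a string; argparse converts it
        default=10,  # mirrors defaultValue in HyperParameters.json
        help="Number of epochs for training.",
    )


parser = argparse.ArgumentParser(description="training script")
add_hyperparameter_args(parser)

# Simulates `train.py --epoch 3` as train.sh would invoke it.
args = parser.parse_args(["--epoch", "3"])
print(args.epoch)
```

Passing hyperparameters as explicit flags keeps `train.py` runnable outside the platform, where the environment variables are not set.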
### 2. Upload to the cloud platform

[Cloud platform project page](https://platform.luchentech.com/console/project)