add code example

hpcaitech · Oct 8, 2023
commit 967717f · 1 parent 4c67b77
Showing 1 changed file with 119 additions and 0 deletions.
diff --git a/docs/blog/2023-10-08-CreateTrainingJob/index.md b/docs/blog/2023-10-08-CreateTrainingJob/index.md
@@ -23,6 +23,125 @@ tags: [ColossalChat]
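 #### Manual creation
 If you need to create a project manually, you must create certain files in the project folder; see the [project specification](https://docs.platform.luchentech.com/docs/basics/projects#%E9%A1%B9%E7%9B%AE%E8%A7%84%E8%8C%83).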
 
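+#### Code examples
+##### HyperParameters.json
+This file defines the hyperparameters used when launching training; their values can be set in the web UI when starting a job.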
+```json
+{
+  "HyperParameters": [
+    {
+      "name": "epoch",
+      "types": "string",
+      "defaultValue": "10",
+      "description": "Number of epochs for training"
+    }
+  ]
+}
+```
+On the platform, each hyperparameter is injected into `train.sh`
+as an environment variable, for example `${epoch}`.
+##### train.py
+This file starts the training job and is invoked by `train.sh`.
+```python
+import argparse
+import os
+
+from patch import patch_platform_specific_dependencies
+
+# Please do not remove this call,
+# the platform's runtime environment needs it.
+patch_platform_specific_dependencies()
+
+def add_platform_args(parser: argparse.ArgumentParser):
+    # required arguments
+    parser.add_argument(
+        "--project_dir",
+        type=str,
+        required=True,
+        help="The directory that contains the project code.",
+    )
+    parser.add_argument(
+        "--dataset_dir",
+        type=str,
+        required=True,
+        help="The directory that contains the training dataset.",
+    )
+    parser.add_argument(
+        "--output_dir",
+        type=str,
+        required=True,
+        help="The directory the project writes its output into.",
+    )
+
+    # optional arguments, add more if you need
+    parser.add_argument(
+        "--model_dir",
+        type=str,
+        default=None,
+        help="The directory that contains the model to fine-tune.",
+    )
+
+
+def main():
+    parser = argparse.ArgumentParser(description="training script")
+    add_platform_args(parser)
+    args = parser.parse_args()
+
+    # There are some path conventions:
+    #
+    # $OUTPUT_DIR/tensorboard:
+    #     The platform-builtin tensorboard expects events to be here.
+    # $OUTPUT_DIR/checkpoint:
+    #     The platform-builtin checkpoint recovery feature
+    #     expects the checkpoint to be here.
+    tensorboard_dir = os.path.join(args.output_dir, "tensorboard")
+    os.makedirs(tensorboard_dir, exist_ok=True)
+    checkpoint_dir = os.path.join(args.output_dir, "checkpoint")
+    os.makedirs(checkpoint_dir, exist_ok=True)
+
+    # TODO: your training code here
+
+
+if __name__ == "__main__":
+    main()
+```
+
+##### train.sh
+This file is the entry point for the training code; the cloud platform executes this bash script.
+```bash
+#!/usr/bin/env bash
+SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
+
+# ===================================================================
+#                Welcome to ColossalAI Platform!
+# ===================================================================
+# These environment variables are injected by the runner:
+#
+# 1. ColossalAI Platform defined ones:
+#    PROJECT_DIR, DATASET_DIR, MODEL_DIR, OUTPUT_DIR, SCRIPT_DIR
+#
+# 2. Required by torchrun:
+#    NNODES, NPROC_PER_NODE, NODE_RANK, MASTER_ADDR, MASTER_PORT
+#
+# 3. Hyperparameters from configuration UI:
+#    (check HyperParameters.json for more details)
+#
+# After that, the runner executes this script, `train.sh`.
+# ===================================================================
+
+torchrun --nnodes "${NNODES}" \
+    --nproc_per_node "${NPROC_PER_NODE}" \
+    --node_rank "${NODE_RANK}" \
+    --master_addr "${MASTER_ADDR}" \
+    --master_port "${MASTER_PORT}" \
+    "${SCRIPT_DIR}/train.py" \
+    --project_dir "${PROJECT_DIR}" \
+    --dataset_dir "${DATASET_DIR}" \
+    --model_dir "${MODEL_DIR}" \
+    --output_dir "${OUTPUT_DIR}"
+
+# TODO: add more argument passing here
+```
 ### 2. Upload to the cloud platform
 
 [Cloud platform project page](https://platform.luchentech.com/console/project)
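
Note on the injected hyperparameters: the diff says each entry in `HyperParameters.json` reaches `train.sh` as an environment variable such as `${epoch}`, but it does not show how `train.py` would consume it. The following is a minimal sketch, assuming the runner exports the variable so that the Python process started by `torchrun` can see it. `read_hyperparameter` is a hypothetical helper for illustration, not part of the platform or of this commit.

```python
import os


def read_hyperparameter(name, default, cast=str, environ=None):
    """Return hyperparameter `name` from the environment, converted with `cast`.

    Hypothetical helper: assumes the platform exposes each hyperparameter
    as an environment variable named after its "name" field.
    Falls back to `default` when the variable is unset or empty.
    """
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or raw == "":
        return cast(default)
    return cast(raw)


if __name__ == "__main__":
    # "epoch" is declared with type "string" in HyperParameters.json,
    # so it is converted to int explicitly here.
    num_epochs = read_hyperparameter("epoch", "10", cast=int)
    print(f"training for {num_epochs} epochs")
```

Another option would be to pass the value explicitly from `train.sh` as a command-line argument (the script's TODO hints at this), which keeps `train.py` independent of how the platform injects values.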