1 parent 01ab964, commit be0d5d8
Showing 48 changed files with 451 additions and 0 deletions.
Binary file added (BIN, +204 KB): docs/blog/2023-07-28-finetune-llama2/images/dataset-upload-success.png
@@ -0,0 +1,113 @@
---
slug: 微调llama2模型
title: Fine-tuning the llama2 Model
authors:
  name: Ziyuan Cui
  title: ColossalAI Platform Team
tags: [llama2, ColossalChat]
---
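
# Fine-tuning the llama2 Model

## Introduction

The ColossalAI Platform is a fully managed machine learning platform that seamlessly combines powerful compute with ColossalAI, a state-of-the-art framework for accelerating and optimizing large models. The platform provides several large-model training templates that let users fine-tune or pre-train a large model without writing code, simply by uploading a dataset. In this post, we explain how to use the platform's ColossalChat template to fine-tune a llama2 model on user-provided data.

## Walkthrough
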
### 1. Prepare and create a conversation dataset

First, create a folder on your local machine, open it, and create a file named `data.json` inside it that contains your conversation dataset.

Build your conversation dataset in the following JSON format.

```json
[
  {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
  },
  {
    "instruction": "What are the three primary colors?",
    "input": "",
    "output": "The three primary colors are red, blue, and yellow."
  }
]
```
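
A small script can catch formatting mistakes before the upload step. The following is a minimal sketch, not a platform tool: the helper names `validate_records` and `write_dataset` are hypothetical, and the checks (three string fields per record, non-empty `instruction`) are assumptions inferred from the example above rather than a documented schema.

```python
import json
from pathlib import Path

# Field names taken from the example records above.
REQUIRED_KEYS = ("instruction", "input", "output")


def validate_records(records):
    """Return a list of problems; an empty list means the data looks fine."""
    if not isinstance(records, list):
        return ["top-level JSON value must be a list of records"]
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: expected an object")
            continue
        for key in REQUIRED_KEYS:
            if not isinstance(rec.get(key), str):
                problems.append(f"record {i}: '{key}' must be a string")
        # Assumption: an empty instruction is never intended.
        if isinstance(rec.get("instruction"), str) and not rec["instruction"].strip():
            problems.append(f"record {i}: 'instruction' is empty")
    return problems


def write_dataset(records, folder):
    """Validate the records and write them to <folder>/data.json."""
    problems = validate_records(records)
    if problems:
        raise ValueError("; ".join(problems))
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "data.json"
    # ensure_ascii=False keeps non-English text readable in the file.
    path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

Calling `write_dataset(records, "my_dataset")` produces a folder that can be selected in the upload step that follows.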
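
### 2. Upload the dataset

Once your local dataset `data.json` is ready, click the **Data** tab on the left and select **Create a new dataset**.
![](./images/data-panel.png)

On the **Upload dataset** page, fill in the dataset name and description. In the **Upload your dataset** area, select the local folder that contains `data.json`, then click the upload button.
![](./images/create-dataset.png)

After the upload succeeds, the dataset details page shows the data files you uploaded. You can manage the files in this dataset, for example by adding, downloading, or deleting files.
![](./images/dataset-upload-success.png)

### 3. Create a training job

Click the **Jobs** tab in the left sidebar, then click **New job** in the upper-right corner.

On the **Jobs** page, choose the option to create a job **from a template**.

On the **Training settings** page, fill in the job name and description, and select `colossalai/ColossalChat` in the template options.

> All models prefixed with `colossalai/` are officially provided by ColossalAI.

![](./images/llama2-create-job-1.png)

On the **Hyperparameters** page, choose the hyperparameter configuration you need, such as the model type, training strategy, and number of epochs. If you have no special requirements, you can use the defaults.
![](./images/hyperparameters.png)

- pretrain_model_path: (optional) If you did not select a **Model** in **Training settings**, you can ignore this option. In this tutorial, you can keep the default.
  - To start training from a custom pretrained model, first upload it on the **Models** page, or register it after a training job, and then enter its path here. The path format is `$(MODEL_DIR)/[relative model path]`, where the relative path is the path of the uploaded model on the **Models** page after the `/root/` prefix. For example, if your model files are shown under `root/pretrain/` on the **Model details** page, enter `$(MODEL_DIR)/pretrain` here.
  - The images provided by the platform ship with pretrained models downloaded from HuggingFace. To use a built-in model, keep the default configuration.
- dataset_path: the path to your dataset. The path format is `$(DATASET_DIR)/[relative dataset path]`, where the relative path is the path of the uploaded data on the **Datasets** page after the `/root/` prefix. For example, if your dataset file is shown as `root/data.json` on the **Dataset details** page, enter `$(DATASET_DIR)/data.json` here.
- model_type: the type of large language model. The platform offers four: bloom-350m, opt, gpt2, and llama2. In this tutorial, select llama.
- strategy: the ColossalAI optimization strategy
- log_interval: how often logs are printed
- batch_size: the batch size
- max_len: the maximum sequence length
- lora_rank: the LoRA rank
- accumulation_steps: the number of gradient accumulation steps
- lr: the learning rate
- max_dataset_size: the maximum dataset size
- max_epochs: the number of training epochs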
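
The path rule for `pretrain_model_path` and `dataset_path` can be expressed as a short helper. This is only an illustrative sketch of the rule described in the list above: the function names are hypothetical, and the platform itself expands `$(DATASET_DIR)` and `$(MODEL_DIR)`, so the helper simply builds the string to type into the form.

```python
from pathlib import PurePosixPath


def to_platform_path(details_path, placeholder):
    """Turn a path shown on a details page (e.g. 'root/data.json') into
    '<placeholder>/<path after root/>'."""
    parts = [p for p in PurePosixPath(details_path.strip()).parts if p != "/"]
    # The rule above only covers paths that live under root/.
    if not parts or parts[0] != "root":
        raise ValueError(f"expected a path under root/, got {details_path!r}")
    relative = "/".join(parts[1:])
    if not relative:
        raise ValueError("path must point to something below root/")
    return f"{placeholder}/{relative}"


def dataset_path(details_path):
    return to_platform_path(details_path, "$(DATASET_DIR)")


def model_path(details_path):
    return to_platform_path(details_path, "$(MODEL_DIR)")


print(dataset_path("root/data.json"))  # $(DATASET_DIR)/data.json
print(model_path("root/pretrain/"))    # $(MODEL_DIR)/pretrain
```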
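
In the **Dataset** option, select the dataset you created earlier.
**Model** is optional. If you have already registered or uploaded a model on the platform, you can select your model files here. In this tutorial, you do not need to select a model.
![](./images/llama2-create-job-2.png)
In **Resource settings**, choose the machine type and the number of machines you need.

- GPUs within a machine are connected over NVLink for acceleration.
- If you select multiple machines, communication between machines uses RDMA to accelerate training.

When the configuration is complete, click the **Launch job** button.

### 4. Monitor the training job

During training, you can click the **Logs** option to view the current log output.
![](images/log-stream.png)

During training, you can click the **Metrics** option to view the training metrics in TensorBoard.
![](./images/tensorboard.png)

### 5. Work with the training output files

After training finishes, the training output files appear on the **Output files** tab of the **Job details** page. You can download these files, or choose **Register model** to register the files under your `checkpoint/` directory. Once registered, the model's file directory is visible on the **Models** page.

The output includes the following files:

- `checkpoint/pretrain`: the model files produced by training
- `checkpoint/optm_ckpt`: the optimizer checkpoint files
- `tensorboard/`: the TensorBoard files produced by training
- `master-0.txt`: the log file produced by the master node
- `worker-1.txt`: the log file produced by the first worker node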
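
If you download the output files, a quick local check can confirm that nothing is missing. The sketch below assumes the download preserves the layout listed above; the function name `summarize_outputs` is hypothetical, and worker logs are matched by pattern because the number of workers depends on the job.

```python
from pathlib import Path

# Entries named in the list above.
EXPECTED_ENTRIES = (
    "checkpoint/pretrain",
    "checkpoint/optm_ckpt",
    "tensorboard",
    "master-0.txt",
)


def summarize_outputs(output_dir):
    """Report which expected entries are missing and which worker logs exist."""
    root = Path(output_dir)
    missing = [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
    worker_logs = sorted(p.name for p in root.glob("worker-*.txt"))
    return {"missing": missing, "worker_logs": worker_logs}
```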

![](images/register-model.png)

After the model is registered, you can manage the files in its directory on the **Models** page.
![](images/model-info.png)
@@ -0,0 +1,43 @@
# Dataset Management

## Introduction

On the Colossal-AI cloud platform, users can manage all the datasets they need for training in one place and load them into training jobs to train models. This page walks you through creating a dataset. To see how to load a dataset into a training job, go to the [Job Management](../training/jobs.md) documentation.

## Create a dataset

Follow these steps to create a new dataset.

1. Click the "New dataset" button on the right side of the page.

![new dataset](images/datasets/new_dataset.png)

2. Enter the dataset information and click the create button. This gives you an empty dataset.

![create dataset](images/datasets/create_dataset.png)

3. Click the button in the red box to upload your dataset files.

![upload dataset](images/datasets/upload_dataset.png)

A progress bar shows the upload progress. When the upload finishes, you can open the file browser to view the uploaded files.

![dataset uploaded](images/datasets/uploaded.png)

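## Manage files

After uploading a dataset, you can select a file in the file browser and delete or download it from the menu on the right.

![manage files](images/datasets/manage.png)

## Public datasets

The dataset page also gives users access to public datasets. In this section, users can browse datasets shared by other users and use them in their own training jobs.

![dataset uploaded](images/datasets/public_dataset.png)

Users can also make their own datasets public: click your dataset, then select edit dataset in the upper-right corner. On the edit page, set the visibility to public so that other users can use the dataset directly for training. The Colossal-AI platform team will gradually add commonly used datasets to the platform.

![dataset uploaded](images/datasets/public_dataset.png)
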
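@@ -0,0 +1,12 @@
# Model Management

## Introduction

The Models page is where you manage your models. You can upload models you have already trained, or register a model from a successfully completed training job into the model library so that it can be used later on the job launch page.

![Models page](images/models/model_list.png)

## Create a model

Creating, deleting, and editing models work much like the same operations on datasets; see the [Dataset Management](./datasets.md) documentation to get familiar with the flow. Note that a model can also be saved directly from a completed training job; the steps are covered in [Training Jobs](../training/jobs.md).
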
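@@ -0,0 +1,26 @@
# Project Management

## Introduction

A project contains the code needed for training; you can think of a project as a code repository. On the Colossal-AI cloud platform, project code is loaded directly for training.

![project list](./images/projects/project_list.png)

## Project requirements

To plug into the cloud platform's training job workflow, a code project must contain the following files:

- `HyperParameters.json`: defines the hyperparameters that the user sets when launching training. It is loaded into the UI at launch, so users can set hyperparameters there and start training quickly.
- `train.sh`: the entry point for launching a job. The cloud platform runs this bash file on K8S to start training.
- `train.py`: the unified entry point for the project's training code, called from `train.sh`. Support for distributed training must be implemented in `train.py`.
- `README.md`: the project documentation, which explains how to prepare datasets and models and how to run training and inference.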
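
Before uploading a project, you may want to confirm locally that the required files are present. The following is a minimal sketch rather than an official check: the function name `missing_project_files` is hypothetical, and it only tests for the four files listed above at the project root.

```python
from pathlib import Path

# The four files the project requirements above call for.
REQUIRED_FILES = ("HyperParameters.json", "train.sh", "train.py", "README.md")


def missing_project_files(project_dir):
    """Return the required files that are absent from the project root."""
    root = Path(project_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty return value means the project root contains every required file; it says nothing about whether their contents are valid.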
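
To make it easier to create a project that meets these requirements, we provide the Colossal-AI cloud platform CLI, which initializes a project and its required files with a single command. See the [CLI guide](../cli/cli.md) for details.

## Create a project

After initializing a project locally with the CLI and adding your own logic, you can upload the project to the Colossal-AI cloud platform. The steps are similar to those for datasets; see the [Dataset Management](./datasets.md) documentation to get familiar with the flow.
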
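@@ -0,0 +1,19 @@
# Template Management

## Introduction

A template is an immutable code project, converted from a project. Once you are satisfied with how a code project performs, you can package the project and its image into a template for repeated use. You can also publish the template to the public templates so that other users can use it.

![Template List](./images/templates/template_list.png)

## Create a template

To create a new template, click the button in the upper-right corner of the template list page. The creation page looks like this:

![Create a new template](./images/templates/new_template.png)

Fill in the description fields and choose whether to make the template public. In the last two rows, select the project code the template uses and the image it runs on. This packages a reusable AI template.

## Official templates

We will provide out-of-the-box AI application templates in the public template marketplace, such as the Colossal Chat chatbot based on models like Llama, Llama2, and ChatGLM. The team will gradually add more AI applications for users to fine-tune and deploy.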
@@ -0,0 +1,71 @@
# Command-Line Tool

## Introduction

To help users create a project that meets the cloud platform's requirements with a single command, we provide a command-line tool called cap. The name comes from the initials of Colossal-AI Platform.

## Installation

1. Install from source

```bash
pip install git+https://github.com/hpcaitech/ColossalAI-Platform-CLI@main
```

2. Install from PyPI

```bash
pip install colossalai-platform
```

## Usage

### Create a standard project

The following command creates a standard project structure. Remember to replace `<project-name>` with your own project name.

```bash
cap project init <project-name>
```

The project contains the following files:

```
- <project name>
    - Dockerfile
    - train.sh
    - train.py
    - HyperParameters.json
    - README.md
    - requirements.txt
```

`train.sh`, `train.py`, and `HyperParameters.json` are required by the cloud platform to launch a job.

**1. HyperParameters.json**

This file defines the hyperparameters that users enter when launching a training job. Users can add their own hyperparameter definitions to the JSON.

```
{
  "HyperParameters": [
    {
      "name": "max_epoch",
      "types": "int",
      "defaultValue": "10",
      "description": ""
    }
  ]
}
```

When launching a job, the hyperparameter appears in the UI.

![Hyper Parameters](./images/hyperparams.png)
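
Because `defaultValue` is stored as a string next to a `types` field, a small loader can be handy for checking a `HyperParameters.json` file locally. This is a sketch under assumptions: only `"int"` appears in the example above, so the `"float"` and `"str"` entries in the cast table are guesses, and the function name `load_defaults` is hypothetical rather than part of the cap CLI.

```python
import json

# Assumed mapping from the "types" field to Python casts; only "int" is
# confirmed by the example above.
CASTS = {"int": int, "float": float, "str": str}


def load_defaults(text):
    """Parse HyperParameters.json content and return {name: typed default}."""
    spec = json.loads(text)
    defaults = {}
    for param in spec["HyperParameters"]:
        type_name = param["types"]
        if type_name not in CASTS:
            raise ValueError(f"unsupported type {type_name!r} for {param['name']!r}")
        defaults[param["name"]] = CASTS[type_name](param["defaultValue"])
    return defaults


example = """
{
  "HyperParameters": [
    {"name": "max_epoch", "types": "int", "defaultValue": "10", "description": ""}
  ]
}
"""
print(load_defaults(example))  # {'max_epoch': 10}
```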

**2. train.py**

`train.py` contains the main training code.
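
The project requirements ask `train.py` to support distributed training. As a rough illustration of the first step, the sketch below reads the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables that launchers such as `torchrun` set, and splits sample indices across processes. Whether the platform launches jobs this way is an assumption, and the function names are hypothetical; a real `train.py` would go on to build the model and run the training loop.

```python
import os


def distributed_context(env=None):
    """Read rank information from the environment, defaulting to one process."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


def shard_indices(num_samples, rank, world_size):
    """Give each process every world_size-th sample, starting at its rank."""
    return list(range(rank, num_samples, world_size))


def main():
    ctx = distributed_context()
    my_samples = shard_indices(10, ctx["rank"], ctx["world_size"])
    # A real train.py would build the model and run the training loop here.
    print(f"rank {ctx['rank']} of {ctx['world_size']} handles samples {my_samples}")


if __name__ == "__main__":
    main()
```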

**3. train.sh**

`train.sh` is the main entry point of the project. The cloud platform runs this file to start training.
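@@ -0,0 +1,23 @@
# Inference API

## Introduction

After a fine-tuning job completes, you can create an inference API to deploy the model as a RESTful API and integrate it into upstream applications.

![api list](./images/api_list.png)

## Create an inference API

Click the button in the upper-right corner of the API list to create a new inference API. As with creating a training job, you select the model and the inference code. Inference models are deployed with `kserve`, so the selected project or template must have an `inference` directory at its root that holds the handler file `kserve` needs. After making your selections, click confirm to start the inference API.

```text
- <project/template name>
    - inference
        - handler.py
```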
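
From the application side, calling a deployed model might look like the sketch below. It builds a request in the KServe V1 inference protocol (`POST /v1/models/<name>:predict` with an `instances` list). Whether the platform exposes this exact protocol is an assumption, and the host, model name, and `prompt` field are hypothetical placeholders; the request is constructed but not sent.

```python
import json
import urllib.request


def build_predict_request(base_url, model_name, instances):
    """Build (but do not send) a KServe V1 predict request."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint, model name, and payload field.
request = build_predict_request(
    "https://inference.example.com",
    "my-llama2",
    [{"prompt": "What are the three primary colors?"}],
)
# To send it, pass `request` to urllib.request.urlopen and decode the JSON body.
```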

![api list](./images/api_creation.png)

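@@ -0,0 +1,5 @@
# Introduction

The Colossal-AI cloud platform is an MLOps platform dedicated to giving users a standardized, efficient experience for training and deploying models.

The following documentation walks you through the cloud platform's features. The platform is still under development and its features are still being refined. If you find a problem or have a suggestion, please open an issue or feature request in our [CLI repository](https://github.com/hpcaitech/ColossalAI-Platform-CLI), and we will handle it as soon as possible. Thank you!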
@@ -0,0 +1,41 @@
# Viewing Training Job Details

## Introduction

After a training job starts, the page redirects to the job details page, where you can view information about the current job.

## Job details

The job details page has four sub-panels:

**Job information**

This panel shows the basic information about the training job.

![job info](./images/job_detail/job_info.png)

Click the details link under "Hyperparameters" to view the current job's hyperparameters.

![job hyperparams](./images/job_detail/job_hyperpparams.png)

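**TensorBoard training metrics**

If the training code logs with TensorBoard, this page shows training metrics such as loss and accuracy. If the embedded page does not respond on the first connection, wait a few seconds and refresh. While the job is training, click the refresh button in the upper-right corner of TensorBoard to fetch the latest metrics and monitor training.

![job metrics](./images/job_detail/job_metrics.png)

**Logs**

This panel shows the current job's output logs, which help you understand how training is going. Click the button in the upper-right corner to download the logs.

![job log](./images/job_detail/job_log.png)

**Job output**

This panel shows the output files of the current training job, including model checkpoints, TensorBoard log files, and the nodes' standard output logs.

![job output](./images/job_detail/job_output.png)

For a job that has finished training, you can also click the "Register model" button on the right to save the trained model files to the model management database for later use.

![model registration](./images/job_detail/job_register_model.png)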