Add chinese blog for deepspeed windows, and fix format (#7035)

Fix #7029 - Add Chinese blog for deepspeed windows - Fix format in README.md
deepspeedai · Feb 13, 2025 · 30aed3a · 30aed3a
1 parent 83f5dee
commit 30aed3a
Show file tree

Hide file tree

Showing 3 changed files with 108 additions and 5 deletions.
diff --git a/README.md b/README.md
@@ -18,7 +18,7 @@
 
 * [2024/12] [Ulysses-Offload: Democratizing Long Context LLM Training ](https://github.com/deepspeedai/DeepSpeed/blob/master/blogs/ulysses-offload/README.md)
 * [2024/12] [DeepSpeed-Domino: Communication-Free LLM Training Engine](https://github.com/deepspeedai/DeepSpeed/blob/master/blogs/deepspeed-domino/README.md)
-* [2024/08] [DeepSpeed on Windows](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/windows/08-2024/README.md) [[日本語](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/windows/08-2024/japanese/README.md)]
+* [2024/08] [DeepSpeed on Windows](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/windows/08-2024/README.md) [[日本語](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/windows/08-2024/japanese/README.md)]  [[中文](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/windows/08-2024/chinese/README.md)]
 * [2024/08] [DeepNVMe: Improving DL Applications through I/O Optimizations](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-gds/README.md) [[日本語](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-gds/japanese/README.md)]
 * [2024/07] [DeepSpeed Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-ucp/README.md) [[中文](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-ucp/chinese/README.md)] [[日本語](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-ucp/japanese/README.md)]
 * [2024/03] [DeepSpeed-FP6:The power of FP6-Centric Serving for Large Language Models](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-fp6/03-05-2024) [[English](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-fp6/03-05-2024/README.md)] [[中文](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-fp6/03-05-2024/README-Chinese.md)]

diff --git a/blogs/windows/08-2024/README.md b/blogs/windows/08-2024/README.md
@@ -69,8 +69,8 @@ The scripts and codes for the BERT pretraining example are available in the foll
 </div>
 
 # Fine Tuning Example
-We demonstrate fine tuning capability by using the supervised fine tuning (SFT) step of DeepSpeed-Chat application. We conduct SFT of the HuggingFace facebook/opt-125m model while enabling LoRA and CPU offloading memory optimizations. The command line for running this example is as follows:
-deepspeed training\step1_supervised_finetuning\main.py --model_name_or_path facebook/opt-125m --gradient_accumulation_steps 8 --lora_dim 128 --only_optimize_lora --print_loss --zero_stage 2 --deepspeed --dtype bf16 --offload --output_dir output
+We demonstrate fine tuning capability by using the supervised fine tuning (SFT) step of DeepSpeed-Chat application. We conduct SFT of the HuggingFace facebook/opt-125m model while enabling LoRA and CPU offloading memory optimizations. The command line for running this example is as follows:\
+`deepspeed training\step1_supervised_finetuning\main.py --model_name_or_path facebook/opt-125m --gradient_accumulation_steps 8 --lora_dim 128 --only_optimize_lora --print_loss --zero_stage 2 --deepspeed --dtype bf16 --offload --output_dir output`.\
 The output should look like this:
 
 <div align="center">
@@ -82,8 +82,8 @@ The output should look like this:
 </div>
 
 # Inference Example
-We demonstrate inference capability by using ZeRO-Inference for token generation. ZeRO-Inference reduces hardware cost of inferencing by offloading to CPU or NVMe memories. We use the example scripts here to run token generation using Llama-2-7B model from HuggingFace. We offload the model weights to CPU memory since the 4GB VRAM is insufficient to host both the model and the generation working set. We use the following command line to generate 32 tokens from a prompt of 8 tokens:
-deepspeed run_model.py --model meta-llama/Llama-2-7b-hf --batch-size 64 --prompt-len 8 --gen-len 32 --cpu-offload
+We demonstrate inference capability by using ZeRO-Inference for token generation. ZeRO-Inference reduces hardware cost of inferencing by offloading to CPU or NVMe memories. We use the example scripts here to run token generation using Llama-2-7B model from HuggingFace. We offload the model weights to CPU memory since the 4GB VRAM is insufficient to host both the model and the generation working set. We use the following command line to generate 32 tokens from a prompt of 8 tokens:\
+`deepspeed run_model.py --model meta-llama/Llama-2-7b-hf --batch-size 64 --prompt-len 8 --gen-len 32 --cpu-offload`.\
 The output will look something like this:
 
 <div align="center">

diff --git a/blogs/windows/08-2024/chinese/README.md b/blogs/windows/08-2024/chinese/README.md
@@ -0,0 +1,103 @@
+<div align="center">
+
+# 在Windows系统上使用DeepSpeed
+
+</div>
+
+# 简介
+
+DeepSpeed是一个广受欢迎的开源深度学习优化库，它使得分布式训练和推理变得简单、高效且有效。凭借其众多复杂的优化技术（如ZeRO、3D并行、MoE等），DeepSpeed已被成功应用于包括Phi-3、Megatron-Turing-530B、BLOOM-176B和Arctic在内的多种前沿模型的训练。然而，由于缺乏对主流操作系统微软 Windows的原生支持，许多AI开发者与用户无法充分利用DeepSpeed的创新。为此，我们致力于让DeepSpeed在Windows上实现原生全功能运行，并保持与Linux相同的易用性。
+
+在这篇博客中，我们很高兴地宣布我们开发工作中的一些早期成果：DeepSpeed 现在可以在 Windows 上安装并原生支持单 GPU 的训练、微调和推理。重要的是，安装和使用体验与 Linux 上完全相同。此外，微调和推理工作展示了 DeepSpeed 的三个关键特性：HuggingFace Transformers 的集成、LoRA 的支持和 CPU offload。DeepSpeed 在 Windows 上的支持从 DeepSpeed 0.14.5 开始。接下来，我们将通过一些例子展示这些成就。
+
+# 测试环境
+
+我们在一台运行 Windows 11 23H2 版本号 22631.3880 的 Surface Laptop Studio 2 上进行了测试。该笔记本配备了一块 4GB 显存的 NVIDIA RTX A2000 GPU。我们使用了 Pytorch 2.3.0 和 HuggingFace Transformers 4.41.2。测试所用的脚本来自 [DeepSpeedExamples](https://github.com/deepspeedai/DeepSpeedExamples) 代码仓库，因此在运行以下任何示例前，你需要克隆该仓库。
+
+# 安装指南
+DeepSpeed可以通过两种方式在Windows系统上安装。较为简单的方式是使用pip包管理器安装，另一种方法是从源代码安装。两种安装方式的前提条件都是系统已经安装了Python 3.x 和支持CUDA的Pytorch.
+
+## 通过pip安装
+要安装 DeepSpeed，只需运行：`pip install deepspeed`。它将安装最新版本的 DeepSpeed（目前为 0.14.5）。与 Linux 版本不同的是，Windows 版本已经预先编译了内部的自定义算子，因此不需要安装 CUDA 或 C++ 编译器。
+
+<div align="center">
+    <img src="../media/win_pip_install_deepspeed.png" style="width:6.5in;height:3.42153in" />
+</div>
+
+<div align="center">
+    通过pip在Windows上安装Deepspeed.
+</div>
+
+
+##  通过源代码安装
+克隆DeepSpeed代码仓库后，运行build_win.bat脚本进行编译安装。
+
+
+## 验证安装
+无论选择哪种安装方式，你都可以通过运行 ds_report 来检查安装是否成功。输出应该如下所示：
+
+
+<div align="center">
+    <img src="../media/ds_report.png" style="width:6.5in;height:3.42153in" />
+</div>
+
+<div align="center">
+    ds_report的输出结果，用于验证安装是否成功.
+</div>
+
+# 预训练(Pretraining)
+我们使用图像分类模型 CIFAR10 和语言模型 BERT 来演示在 Windows 上使用 DeepSpeed 进行预训练。
+
+## CIFAR10模型预训练
+用于 CIFAR10 预训练的脚本和代码可以在以下路径找到：`DeepSpeedExamples\training\cifar`。你可以运行以下命令启动 CIFAR10 预训练实验：`deepspeed cifar10_deepspeed.py --deepspeed`。最终输出应类似于：
+<div align="center">
+    <img src="../media/cifar10_training.png" style="width:6.5in;height:3.42153in" />
+</div>
+
+<div align="center">
+    在 Windows 上使用 Deepspeed 进行 CIFAR10 模型预训练
+</div>
+
+## BERT模型预训练
+用于 BERT 预训练的脚本和代码可以在以下路径找到：`DeepSpeedExamples\training\HelloDeepSpeed`。你可以使用以下命令启动 BERT 预训练实验：`deepspeed train_bert_ds.py --checkpoint_dir experiment_deepspeed`。最终输出应如下所示：
+
+<div align="center">
+    <img src="../media/bert_training.png" style="width:6.5in;height:3.42153in" />
+</div>
+
+<div align="center">
+    在 Windows 上使用 Deepspeed 进行 BERT 模型预训练
+</div>
+
+# 微调(Fine Tuning)
+我们使用 DeepSpeed-Chat 应用的监督微调（SFT）步骤来演示微调能力。我们对 HuggingFace 的 facebook/opt-125m 模型进行监督微调，同时启用 LoRA 和 CPU offload进行内存优化。运行命令行如下：\
+`deepspeed training\step1_supervised_finetuning\main.py --model_name_or_path facebook/opt-125m --gradient_accumulation_steps 8 --lora_dim 128 --only_optimize_lora --print_loss --zero_stage 2 --deepspeed --dtype bf16 --offload --output_dir output`\
+输出应如下所示：
+
+<div align="center">
+    <img src="../media/opt125m_finetuning.png" style="width:6.5in;height:3.42153in" />
+</div>
+
+<div align="center">
+    在 Windows 上使用 DeepSpeed 对 facebook/opt-125m 监督微调
+</div>
+
+# 推理
+我们使用 ZeRO-Inference 的token生成来演示推理能力。ZeRO-Inference 通过转移存储到 CPU 内存或 NVMe 硬盘内存来减少推理的硬件成本。我们使用以下脚本运行 HuggingFace 的 Llama-2-7B 模型来进行 token 生成。由于 4GB 显存无法容纳模型和生成所需的内存，我们将模型权重转移到 CPU 内存。我们使用以下命令行从 8个token的提示词中生成 32 个token：\
+`deepspeed run_model.py --model meta-llama/Llama-2-7b-hf --batch-size 64 --prompt-len 8 --gen-len 32 --cpu-offload`\
+输出应类似于：
+
+<div align="center">
+    <img src="../media/llama2-7b_inference.png" style="width:6.5in;height:3.42153in" />
+</div>
+
+<div align="center">
+    在 Windows 上使用 ZeRO-Inference 进行 LLAMA2-7B 模型的token生成
+</div>
+
+# 总结
+
+使得DeepSpeed，一个流行的深度学习框架，能够原生运行在最流行的操作系统 Windows 上，是让每个人和组织从当前的人工智能革命中受益的重要一步。在这篇博客中，我们分享了我们为实现这一目标所取得的早期成果。尽管 DeepSpeed 对 Windows 的支持仍在继续开发中，我们希望上述结果已经能够对我们的用户有实用价值，并且鼓舞他们。我们接下来的工作计划涵盖多GPU支持、权重量化以及性能优化。
+
+# 致谢
+这给项目的完成得益于现任和前任 DeepSpeed 成员的大力合作，包括 Costin Eseanu、Logan Adams、Elton Zheng、Reza Yazdani Aminabadi、Martin Cai 和 Olatunji Ruwase。我们还要感谢那些及时提出此项需求、提供关键的临时解决方法、部分解决方案和建设性反馈的 DeepSpeed 用户，最重要的是，他们始终与我们同行.