
[Roadmap] veRL Development Roadmap #22

PeterSH6 opened this issue Nov 22, 2024

Themes

Our roadmap is organized into eight themes: Broad Model Support, Regular Update, More RL Algorithms Support, Dataset Coverage, Plugin Support, Scaling Up RL, More LLM Infrastructure Support, and Wide Hardware Coverage.

Broad Model Support

To add a new model to veRL, the model should satisfy the following requirements:

  1. The model is supported in both vLLM and Hugging Face transformers. You can then use the dummy_hf load format to run the new model directly.
  2. (Optional, DTensor) For the FSDP backend, implement the dtensor_weight_loader for the model to transfer actor weights from the FSDP checkpoint to the vLLM model (a hedged sketch follows this list). See the FSDP document for more information.
  3. For the Megatron backend, implement a ParallelModel similar to modeling_llama_megatron.py, implement the corresponding checkpoint_utils to load checkpoints from Hugging Face, and implement the megatron_weight_loader to transfer actor weights from the ParallelModel directly to the vLLM model. See the Megatron-LM document for more information.
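For illustration, here is a minimal sketch of what such a dtensor_weight_loader could look like, assuming a recent PyTorch where DTensor is importable from torch.distributed.tensor and a vLLM model that accepts (name, tensor) pairs via load_weights; this is a sketch of the idea, not veRL's actual implementation:

```python
# Hedged sketch: gather full tensors from DTensor-sharded FSDP actor weights
# and hand them to vLLM's per-model weight loader. Assumes DTensor.full_tensor()
# (recent PyTorch) and vllm_model.load_weights(...); not veRL's real code.
from torch.distributed.tensor import DTensor

def dtensor_weight_loader(actor_state_dict, vllm_model):
    def gathered():
        for name, param in actor_state_dict.items():
            if isinstance(param, DTensor):
                # Materialize the full (unsharded) tensor on this rank.
                param = param.full_tensor()
            yield name, param

    # vLLM models accept an iterable of (name, tensor) pairs here.
    vllm_model.load_weights(gathered())
```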

Regular Update

  • Use position_ids to support padding removal in transformers models (transformers >= v4.45); see the sketch after this list
  • Upgrade vLLM to the latest version (v0.6.3)
  • Upgrade Ray to the latest version (test colocation of multiple resource pools)
  • Megatron-LM/MCore upgrade and GPTModel support ([RFC] Megatron-LM and MCore maintenance issues for veRL #15)
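To make the remove-padding item concrete, here is a minimal sketch of padding-free (packed) batching via position_ids, assuming a FlashAttention-2 model where transformers (>= v4.45) infers sequence boundaries from resets in position_ids; the token IDs below are arbitrary placeholders:

```python
import torch

# Two sequences of lengths 3 and 2, packed into one row with no pad tokens.
input_ids = torch.tensor([[101, 2054, 2003, 101, 7592]])
# A reset to 0 marks the start of the second sequence.
position_ids = torch.tensor([[0, 1, 2, 0, 1]])

# With a FlashAttention-2 model loaded elsewhere:
# outputs = model(input_ids=input_ids, position_ids=position_ids)
```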

More RL Algorithms Support

Ensure each algorithm converges on common math datasets (e.g., GSM8K).

  • GRPO (see the advantage sketch after this list)
  • Online DPO
  • Safe-RLHF (multiple reward models)
  • ReMax
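As a reference point for the GRPO item, here is a minimal sketch of its group-relative advantage estimator, which normalizes each response's reward against the group of responses sampled from the same prompt; this illustrates the estimator itself, not veRL's implementation:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) scalar rewards for one prompt's G sampled responses."""
    # Advantage = (r_i - mean(group)) / (std(group) + eps)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0])))
```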

Dataset Coverage

  • APPS (Code Generation; a loading sketch follows this list)
  • codecontests (Code Generation)
  • TACO (Code Generation)
  • Math-Shepherd (Math)
  • competition_math (Math)
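For illustration, one of the listed datasets could be pulled with the Hugging Face datasets library as sketched below; the hub ID codeparrot/apps is an assumption for illustration and should be verified on the Hub before use:

```python
from datasets import load_dataset

# Hub ID is an assumption; verify it on the Hugging Face Hub before use.
apps = load_dataset("codeparrot/apps", split="train")
print(apps[0].keys())
```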

Plugin Support

  • Integrate SandBox and its corresponding datasets for Code Generation tasks (a reward-function sketch follows this item)
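As a sketch of what a sandboxed reward for code generation could look like, here is a subprocess-based stand-in that scores a program against (stdin, expected stdout) test pairs; a real sandbox would add proper isolation (containers, seccomp, resource limits), and none of these names come from veRL:

```python
import subprocess

def code_reward(program: str, tests: list[tuple[str, str]]) -> float:
    """Fraction of test cases the generated program passes."""
    passed = 0
    for stdin, expected in tests:
        try:
            result = subprocess.run(
                ["python", "-c", program],
                input=stdin, capture_output=True, text=True, timeout=5,
            )
            passed += result.stdout.strip() == expected.strip()
        except subprocess.TimeoutExpired:
            pass  # treat hangs as failures
    return passed / max(len(tests), 1)
```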

Scaling Up RL

  • Integrate Ray Compiled Graphs (aDAGs) to speed up data transfer
  • Support FSDP HybridShard (see the sketch after this list)
  • Context parallelism
    • Ring Attention
    • DeepSpeed Ulysses
  • Aggressive offload techniques for all models
  • Support a larger TP size for vLLM rollout than for the actor model
  • Support pipeline parallelism in rollout generation (in vLLM or other LLM serving infra)
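For the HybridShard item, here is a minimal sketch using PyTorch's public FSDP API, which shards within a node and replicates across nodes; it assumes a torch.distributed process group has already been initialized elsewhere:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def wrap_hybrid_shard(model: torch.nn.Module) -> FSDP:
    # HYBRID_SHARD: full sharding within a node, replication across nodes,
    # trading some memory for cheaper cross-node gradient communication.
    # Requires torch.distributed to be initialized before calling.
    return FSDP(model, sharding_strategy=ShardingStrategy.HYBRID_SHARD)
```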

More LLM Infrastructure Support

LLM Training Infrastructure

  • Support TorchTitan for TP + PP parallelism
  • Support VeScale for Auto-Parallelism training

LLM Serving Infrastructure

At present, our project supports vLLM using the SPMD execution paradigm: we eliminate the standalone single-controller process (the LLMEngine) by integrating its functionality directly into the worker processes, making the system SPMD (sketched below).
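Conceptually, the SPMD setup looks like the sketch below: every rank owns an engine shard and all ranks call generate() collectively, so no central controller is needed. The class and method names are illustrative, not veRL's actual WorkerGroup API:

```python
class SPMDRolloutWorker:
    """Illustrative only: one worker per rank, no standalone LLMEngine."""

    def __init__(self, engine_shard):
        # Each rank holds its own tensor-parallel slice of the model.
        self.engine = engine_shard

    def generate(self, prompts):
        # Invoked on all ranks with identical prompts; collectives inside
        # the engine keep ranks in lockstep, replacing the controller.
        return self.engine.generate(prompts)
```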

  • Basic tutorial: Adding a New LLM Inference/Serving Backend #21
  • Investigate how a single-controller process + SPMD architecture can be seamlessly integrated into veRL's existing WorkerGroup design
  • Support TensorRT-LLM for rollout generation
  • Support SGLang (offline + SPMD) for rollout generation

Wide Hardware Coverage

Supporting a new hardware type in our project involves the following requirements:

  1. Ray compatibility: the hardware type must be supported by the Ray framework, allowing it to be recognized and managed through ray.util.placement_group (see the sketch after this list).
  2. LLM infra and transformers support: to leverage the new hardware effectively, both the LLM infrastructure (e.g., vLLM, torch, Megatron-LM) and the transformers library must provide native support for the hardware type.
  3. CUDA kernel replacement: we need to replace the CUDA kernels currently used in FSDP and Megatron-LM with the corresponding kernels for the new hardware.
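To make requirement 1 concrete, here is a minimal sketch of reserving accelerator bundles through Ray's placement-group API; "NPU" is an assumed custom resource name that the target cluster would have to advertise for the new hardware:

```python
import ray
from ray.util.placement_group import placement_group

ray.init()
# Two bundles, each asking for one CPU plus one unit of the assumed
# custom "NPU" resource; PACK keeps bundles close together.
pg = placement_group([{"CPU": 1, "NPU": 1}] * 2, strategy="PACK")
ray.get(pg.ready())  # blocks until both bundles are scheduled
```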