PhoenixOS


PhoenixOS (PhOS) is an OS-level GPU checkpoint/restore (C/R) system. It transparently checkpoints and restores processes that use the GPU without requiring any cooperation from the application, a key requirement for modern systems such as the cloud. Most importantly, PhOS is the first OS-level C/R system that can execute C/R concurrently with the application, without stopping its execution.

On the CUDA platform, we compared the C/R performance of PhOS with nvidia/cuda-checkpoint:

[Figure: Checkpointing Llama2-13b-chat]
[Figure: Restoring Llama2-13b-chat]

Note that PhOS aims to be a generic design that targets hardware platforms from different vendors by providing a set of interfaces that each hardware platform implements. We currently provide the C/R implementation on the CUDA platform; support for ROCm and Ascend is under development.

📑 Latest News

  • [Nov. 6, 2024] PhOS is open sourced 🎉 [Repo] [Documentation]

    👉 PhOS currently fully supports single-GPU checkpoint and restore

    👉 We will soon release code for cross-node live migration and multi-GPU support :)

  • [May 20, 2024] PhOS paper is now released on arXiv [Paper]

PhOS is currently under heavy development. If you're interested in contributing to this project, please join our Slack workspace to hear about upcoming cool features of PhOS.

I. Build and Install PhOS

💡 Option 1: Build and Install From Source

  1. [Clone Repository] First of all, clone this repository recursively:

    git clone --recursive https://github.com/SJTU-IPADS/PhoenixOS.git
  2. [Start Container] PhOS can be built and installed on official vendor images.

    NOTE: PhOS requires libc6 >= 2.29 for compiling CRIU from source.

    For example, to run PhOS with CUDA 11.3, one can build on the official CUDA images (e.g., nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04):

    # enter repository
    cd PhoenixOS
    
    # start container
    sudo docker run -dit --gpus all                                         \
                -v.:/root                                                   \
                --privileged --network=host --ipc=host                      \
                --name phos nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
    
    # enter container
    sudo docker exec -it phos /bin/bash

    Note that it's important to run the Docker container with root privileges, as CRIU needs permission to C/R kernel-space memory pages.

  3. [Download Necessary Assets] PhOS relies on some assets to build and test. Download them by running the following commands:

    # inside container
    
    # install basic dependencies from OS pkg manager
    apt-get update
    apt-get install git wget
    
    # download assets
    cd /root/scripts/build_scripts
    bash download_assets.sh
  4. [Build] Building PhOS is simple!

    PhOS provides a convenient build system, which covers compiling, linking and installing all PhOS components:

    Component     | Description
    --------------|------------
    phos-autogen  | Autogen Engine, which generates most of the Parser and Worker code for a specific hardware platform, based on lightweight notation.
    phosd         | PhOS Daemon, which runs continuously in the background and takes over control of all GPU devices on the node.
    libphos.so    | PhOS Hijacker, which hijacks all GPU API calls on the client side and forwards them to the PhOS Daemon.
    libpccl.so    | PhOS Checkpoint Communication Library (PCCL), which provides highly optimized device-to-device state migration. Note that this library is not included in the current release.
    unit-testing  | Unit Tests for PhOS, based on GoogleTest.
    phos-cli      | Command Line Interface (CLI) for interacting with PhOS.
    phos-remoting | Remoting Framework, which provides highly optimized GPU API remoting performance. See more details at SJTU-IPADS/PhoenixOS-Remoting.

    To build and install all of the above components and their dependencies, simply run the build script in the container:

    # inside container
    cd /root/scripts/build_scripts
    
    # clear old build cache
    #   -c: clear previous build
    #   -3: the clean process involves all third-parties
    bash build.sh -c -3
    
    # start building
    #   -3: the build process involves all third-parties
    #   -i: install after successful building
    bash build.sh -3 -i

    To customize the build, refer to and modify the available options in scripts/build_scripts/build_config.yaml.

    If you encounter any build issues, you can find the build logs under build_log. Please open a new issue if things are stuck :-|
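The build steps above can be wrapped in a small script. The following is a minimal sketch and not part of PhOS: the function names (version_ge, newest_log, phos_build) are hypothetical, the glibc check reflects the libc6 >= 2.29 note from step 2, and the build_log location is taken as a parameter because its exact path is not spelled out here.

```shell
#!/usr/bin/env bash
# Hypothetical build wrapper; run it inside the container.

# Succeed if version $1 >= version $2 (relies on GNU `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Print the name of the most recently modified entry in directory $1, if any.
newest_log() {
    ls -t "$1" 2>/dev/null | head -n1
}

# Check glibc, then run the clean and build commands from the steps above.
#   $1: directory containing build.sh (default taken from the steps above)
#   $2: directory holding build logs (assumption: caller knows where it is)
phos_build() {
    local script_dir="${1:-/root/scripts/build_scripts}"
    local log_dir="${2:-build_log}"
    local libc_version

    # `getconf GNU_LIBC_VERSION` prints e.g. "glibc 2.31"
    libc_version="$(getconf GNU_LIBC_VERSION | awk '{print $2}')"
    if ! version_ge "$libc_version" "2.29"; then
        echo "glibc $libc_version is older than 2.29, which PhOS needs to compile CRIU" >&2
        return 1
    fi

    ( cd "$script_dir" && bash build.sh -c -3 && bash build.sh -3 -i ) && return 0

    echo "build failed; newest entry in $log_dir: $(newest_log "$log_dir")" >&2
    return 1
}
```

Sourcing the file defines the functions without running anything; calling `phos_build` with no arguments uses the paths from the steps above.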

💡 Option 2: Install From Pre-built Binaries

Will soon be updated :)

II. Usage

Once PhOS is successfully installed, you can try running your program with PhOS support!

For more details, you can refer to examples for step-by-step tutorials to run PhOS.

(1) Start phosd and your program

  1. Start the PhOS daemon (phosd), which takes over all GPU resources on the node:

    pos_cli --start --target daemon
  2. To run your program with PhOS support, put a YAML configuration file in the directory that your program treats as $PWD. This file contains all the information PhOS needs to hijack your program. An example file looks like:

    # [Field]   name of the job
    # [Note]    jobs with the same name share some resources in posd, e.g., CUModule, etc.
    job_name: "llama2-13b-chat-hf"
    
    # [Field]   remote address of posd, default is local
    daemon_addr: "127.0.0.1"
  3. You are ready to launch! Run your program with the env $phos prefix, for example:

    env $phos python3 train.py
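For scripted launches, the configuration file from step 2 can be generated instead of written by hand. This is a minimal sketch under stated assumptions: write_phos_config is a hypothetical helper, it emits only the two fields shown in the example above, and the config file name is deliberately left to the caller because it is not given here.

```shell
#!/usr/bin/env bash

# Write a PhOS job config containing the two fields from the example above.
#   $1: full path of the config file (caller-supplied; name not assumed here)
#   $2: job name
#   $3: daemon address (optional, defaults to the local daemon)
write_phos_config() {
    local path="$1" job_name="$2" daemon_addr="${3:-127.0.0.1}"

    if [ -z "$path" ] || [ -z "$job_name" ]; then
        echo "usage: write_phos_config <config-path> <job-name> [daemon-addr]" >&2
        return 2
    fi

    cat > "$path" <<EOF
job_name: "$job_name"
daemon_addr: "$daemon_addr"
EOF
}
```

After calling the helper with a path inside the program's working directory, the program is launched exactly as in step 3.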

(2) Pre-dump your program

To pre-dump your program, which saves the CPU & GPU state without stopping your execution, simply run:

# create directory to store checkpoint files
mkdir -p /root/ckpt

# pre-dump command
pos_cli --pre-dump --dir /root/ckpt --pid [your program's pid]

(3) Dump your program

To dump your program, which saves the CPU & GPU state and stops your execution, simply run:

# create directory to store checkpoint files
mkdir -p /root/ckpt

# dump command
pos_cli --dump --dir /root/ckpt --pid [your program's pid]
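The pre-dump and dump commands differ only in their mode flag, so both can share one guarded helper. The sketch below is an illustration, not part of PhOS: phos_checkpoint and the POS_CLI override variable are hypothetical, and only the flags shown above (--pre-dump, --dump, --dir, --pid) are used.

```shell
#!/usr/bin/env bash

# Run a pre-dump or dump after basic sanity checks.
#   $1: mode, either "pre-dump" or "dump"
#   $2: checkpoint directory (created if missing)
#   $3: pid of the target program
# POS_CLI is an assumption of this sketch: it lets the command be swapped,
# e.g. POS_CLI=echo for a dry run that only prints the arguments.
phos_checkpoint() {
    local mode="$1" dir="$2" pid="$3"

    case "$mode" in
        pre-dump|dump) ;;
        *) echo "mode must be pre-dump or dump" >&2; return 2 ;;
    esac

    # `kill -0` sends no signal; it only checks that the pid can be signalled
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "no such process (or no permission to signal it): $pid" >&2
        return 1
    fi

    mkdir -p "$dir" || return 1
    ${POS_CLI:-pos_cli} "--$mode" --dir "$dir" --pid "$pid"
}
```

For example, `phos_checkpoint pre-dump /root/ckpt <pid>` followed later by `phos_checkpoint dump /root/ckpt <pid>` issues the two commands shown in sections (2) and (3).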

(4) Restore your program

To restore your program, simply run:

# restore command
pos_cli --restore --dir /root/ckpt
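Before restoring, it can help to confirm that the checkpoint directory actually holds files from an earlier dump. The following is a minimal sketch: phos_restore and the POS_CLI override variable are hypothetical names, and the check only tests that the directory is non-empty; it does not validate the checkpoint contents.

```shell
#!/usr/bin/env bash

# Restore from a checkpoint directory after checking it exists and is non-empty.
#   $1: checkpoint directory written by an earlier dump
# POS_CLI is an assumption of this sketch, used to swap the command for a dry run.
phos_restore() {
    local dir="$1"

    if [ ! -d "$dir" ] || [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo "checkpoint directory is missing or empty: $dir" >&2
        return 1
    fi

    ${POS_CLI:-pos_cli} --restore --dir "$dir"
}
```

Calling `phos_restore /root/ckpt` runs the restore command shown above once the directory check passes.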

III. How PhOS Works

For more details, please check our paper.


IV. Paper

If you use PhOS in your research, please cite our paper:

@article{huang2024parallelgpuos,
  title={PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation},
  author={Huang, Zhuobin and Wei, Xingda and Hao, Yingyi and Chen, Rong and Han, Mingcong and Gu, Jinyu and Chen, Haibo},
  journal={arXiv preprint arXiv:2405.12079},
  year={2024}
}

V. Contributors

Please check mailmap for all contributors.