# Resource control based on cgroup (#7)

## Summary
### General Motivation

Currently, we don't control or limit the resource usage of worker processes, except by running the worker in a container (see the `container` part of the [doc](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#api-reference)). In most scenarios a container is unnecessary, but resource control is still needed for isolation.

[Control groups](https://man7.org/linux/man-pages/man7/cgroups.7.html), usually referred to as cgroups, are a Linux kernel feature that allows processes to be organized into hierarchical groups whose usage of various types of resources can then be limited and monitored.

The goal of this proposal is therefore to provide resource control for worker processes via cgroups on Linux.

### Should this change be within `ray` or outside?

These changes would be within Ray Core.

## Stewardship
### Required Reviewers
@ericl, @edoakes, @simon-mo, @chenk008, @raulchen

### Shepherd of the Proposal (should be a senior committer)
@ericl

## Design and Architecture

### Cluster level API
We should add some new system configs for resource control:
- `worker_resource_control_method`: Set to `"cgroup"` by default.
- `cgroup_manager`: Set to `"cgroupfs"` by default. We should also support `systemd`.
- `cgroup_mount_path`: Set to `/sys/fs/cgroup/` by default.
- `cgroup_unified_hierarchy`: Whether to use cgroup v2 or not. Set to `False` by default.
- `cgroup_use_cpuset`: Whether to use cpuset. Set to `False` by default.

> **Reviewer** (on `cgroup_unified_hierarchy`): Can we only support cgroup v2 to reduce the scope?
>
> **Author:** I think we should also support cgroup v1 because it is widely used and written into the standard. I have described this at https://github.com/ray-project/enhancements/pull/7/files#diff-98ccfc9582e95581aae234797bc273b2fb68cb9e4dcc3030c8e94ba447daef7dR112-R113. At Ant, we only support v1 in production. Can you check your production and CI environments?
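To make these cluster-level settings concrete, here is a minimal sketch of how they might be supplied when starting Ray. The config names come from this proposal, while passing them through `_system_config` in `ray.init` is an assumption about the mechanism, not part of the design:

```python
import ray

# Hypothetical: the proposed resource-control settings passed as system
# configs when starting the head node. The key names follow this proposal;
# whether they are exposed via `_system_config` is an assumption.
ray.init(
    _system_config={
        "worker_resource_control_method": "cgroup",
        "cgroup_manager": "cgroupfs",
        "cgroup_mount_path": "/sys/fs/cgroup/",
        "cgroup_unified_hierarchy": False,
        "cgroup_use_cpuset": False,
    }
)
```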

### User level API
#### Simple API using a flag
```python
runtime_env = {
    "enable_resource_control": True,
}
```

```python
from ray.runtime_env import RuntimeEnv

runtime_env = RuntimeEnv(
    enable_resource_control=True
)
```

#### Entire APIs
```python
runtime_env = {
    "enable_resource_control": True,
    "resource_control_config": {
        "cpu_enabled": True,
        "memory_enabled": True,
        "cpu_strict_usage": False,
        "memory_strict_usage": True,
    }
}
```

> **Reviewer:** Is the cgroup version automatically detected?
>
> **Author:** I found a util function in runc: https://github.com/opencontainers/runc/blob/main/libcontainer/cgroups/utils.go#L34-L50. But cgroup v1 and v2 can both be enabled on some systems, and it also depends on which subsystems (e.g. cpu, memory) have been enabled in the cgroup. So I'd like to add a config as a first step. We can change this in the future.

```python
from ray.runtime_env import RuntimeEnv, ResourceControlConfig

runtime_env = RuntimeEnv(
    resource_control_config=ResourceControlConfig(
        cpu_enabled=True,
        memory_enabled=True,
        cpu_strict_usage=False,
        memory_strict_usage=True,
    )
)
```

### Implementation
#### Work steps

When we run `ray.init` with a `runtime_env` and `eager_install` is enabled, the main steps are:
- (**Step 1**) The Raylet (Worker Pool) receives the published message that the job has started.
- (**Step 2**) The Raylet sends the RPC `GetOrCreateRuntimeEnv` to the Agent.
- (**Step 3**) The Agent sets up the `runtime_env`. For `resource_control`, the agent generates a `command_prefix` in the `runtime_env_context` that describes how to enable resource control, e.g. using cgroupfs or systemd. A cgroupfs sample looks like:
`mkdir /sys/fs/cgroup/{worker_id} && echo "200000 1000000" > /sys/fs/cgroup/{worker_id}/cpu.max && echo {pid} > /sys/fs/cgroup/{worker_id}/cgroup.procs`

When we create a `Task` or an `Actor` with a `runtime_env` (or an inherited `runtime_env`), the main steps are:
- (**Step 4**) The worker submits the task.
- (**Step 5**) The `task_spec` is received by the Raylet after scheduling.
- (**Step 6**) The Raylet sends the RPC `GetOrCreateRuntimeEnv` to the Agent.
- (**Step 7**) The Agent generates the `command_prefix` in the `runtime_env_context` and replies to the RPC.
- (**Step 8**) The Raylet starts the new worker process with the `runtime_env_context`.
- (**Step 9**) `setup_worker.py` sets up `resource_control` using the `command_prefix` for the new worker process.
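To make Step 9 concrete, here is a minimal sketch of how a setup script could apply the agent-generated `command_prefix` before starting the worker; the function and variable names are illustrative and do not describe the actual `setup_worker.py` internals:

```python
import subprocess

def start_worker_with_resource_control(command_prefix: str, worker_command: list) -> subprocess.Popen:
    """Illustrative sketch: run the cgroup setup snippet generated by the
    agent, then exec the worker inside the prepared cgroup.

    `command_prefix` is assumed to be a shell snippet such as:
      mkdir /sys/fs/cgroup/{worker_id} && echo "200000 1000000" > ... && echo $$ > .../cgroup.procs
    Writing the shell's own pid ($$) before `exec` keeps the worker in the cgroup.
    """
    full_command = command_prefix + " && exec " + " ".join(worker_command)
    return subprocess.Popen(["/bin/sh", "-c", full_command])
```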

#### Cgroup Manager
The Cgroup Manager is used to create or delete cgroups and to bind worker processes to cgroups. We plan to integrate the Cgroup Manager into the Agent.

We should abstract the Cgroup Manager because there is more than one way to manage cgroups in Linux. The two main ways are cgroupfs and systemd, which are also used in [container technology](https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/).

##### cgroupfs
Using cgroupfs to manage cgroups looks like:
```
mkdir /sys/fs/cgroup/{worker_id}
echo "200000 1000000" > /sys/fs/cgroup/{worker_id}/cpu.max
echo {pid} > /sys/fs/cgroup/{worker_id}/cgroup.procs
```
NOTE: This example is based on cgroup v2. The corresponding cgroup v1 commands are different and incompatible.
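For comparison, a hedged sketch of roughly equivalent cgroup v1 commands; the v1 `cpu` controller splits the limit into separate quota and period files, and the paths assume the default v1 mount layout:

```
mkdir /sys/fs/cgroup/cpu/{worker_id}
echo 200000 > /sys/fs/cgroup/cpu/{worker_id}/cpu.cfs_quota_us
echo 1000000 > /sys/fs/cgroup/cpu/{worker_id}/cpu.cfs_period_us
echo {pid} > /sys/fs/cgroup/cpu/{worker_id}/cgroup.procs
```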

We could also use [libcgroup](https://github.com/libcgroup/libcgroup/blob/main/README) to simplify the implementation. This library supports both cgroup v1 and cgroup v2.
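As a sketch, the libcgroup-tools CLI could hide most of the per-version details; the parameter shown below assumes cgroup v1, and the controller names and values are illustrative only:

```
cgcreate -g cpu,memory:/{worker_id}
cgset -r cpu.cfs_quota_us=200000 {worker_id}
cgexec -g cpu,memory:{worker_id} {worker_command}
cgdelete -g cpu,memory:/{worker_id}
```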

##### systemd
Using systemd to manage cgroups looks like:
```
systemctl set-property {worker_id}.service CPUQuota=20%
systemctl start {worker_id}.service
```

> **Reviewer:** What's the advantage of using systemd here?
>
> **Author:** If we use cgroupfs on a systemd-based system, there will be more than one component managing the cgroup tree. And from https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/, we know that:
>
> > In the short-term future writing directly to the control group tree from applications should still be OK, as long as the Pax Control Groups document is followed. In the medium-term future it will still be supported to alter/read individual attributes of cgroups directly, but no longer to create/delete cgroups without using the systemd API. In the longer-term future altering/reading attributes will also be unavailable to userspace applications, unless done via systemd's APIs (either D-Bus based IPC APIs or shared library APIs for passive operations).
>
> In addition, systemd is highly recommended by runc for cgroup v2: https://github.com/opencontainers/runc/blob/main/docs/cgroup-v2.md.

NOTE: The full set of config options is listed [here](https://man7.org/linux/man-pages/man5/systemd.resource-control.5.html). We can also use `StartTransientUnit` to create the cgroup together with the worker process in one step. This is a [dbus](https://www.freedesktop.org/wiki/Software/systemd/dbus/) API, and there is a [dbus-python](https://dbus.freedesktop.org/doc/dbus-python/) module we can use.
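As an illustration, a transient unit could create the cgroup and attach the worker in one step; the exact `systemd-run` flags below are a hedged sketch, not a settled design:

```
systemd-run --unit=ray-worker-{worker_id} --scope -p CPUQuota=20% -p MemoryMax=1G {worker_command}
```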

##### More references
- [Yarn NodeManagerCgroups](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html)

#### About Cgroup v1 and v2
[Cgroup v2](https://www.kernel.org/doc/Documentation/cgroup-v2.txt) is more reasonable, but we should also support cgroup v1 because v1 is widely used and has been hard-coded into the [OCI](https://opencontainers.org/) standards. See this [blog](https://www.redhat.com/sysadmin/fedora-31-control-group-v2) for more details.

Check if cgroup v2 has been enabled in your Linux system:
```
mount | grep '^cgroup' | awk '{print $1}' | uniq
```
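Programmatically, a minimal detection sketch (an assumption in the spirit of the runc utility mentioned in the review discussion above): on a unified (cgroup v2) hierarchy the top-level mount exposes a `cgroup.controllers` file:

```python
import os

def cgroup_v2_enabled(mount_path: str = "/sys/fs/cgroup") -> bool:
    # On a pure cgroup v2 host the root mount exposes `cgroup.controllers`;
    # on a v1 (or hybrid) layout this file is absent at the top level.
    return os.path.exists(os.path.join(mount_path, "cgroup.controllers"))
```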

You can try to enable cgroup v2 (the `grubby` command below applies to Fedora/RHEL-like systems):
```
grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
reboot
```

#### How to delete cgroups
When worker processes die, we should delete the cgroups that were created for them. This is only needed for cgroupfs, because systemd handles it automatically.

So, we should add a new RPC named `CleanForDeadWorker` to `RuntimeEnvService`. The Raylet sends this RPC to the Agent, and the Agent deletes the cgroup.

```
message CleanForDeadWorkerRequest {
  string worker_id = 1;
}

message CleanForDeadWorkerReply {
  AgentRpcStatus status = 1;
  string error_message = 2;
}

service RuntimeEnvService {
  ...
  rpc CleanForDeadWorker(CleanForDeadWorkerRequest)
      returns (CleanForDeadWorkerReply);
}
```
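For illustration, a hedged sketch of an Agent-side handler for this RPC when the cgroupfs manager is used. The reply class refers to the message defined above, the status enum values are assumed to exist alongside `AgentRpcStatus`, and the paths and names are illustrative:

```python
import os

async def CleanForDeadWorker(self, request, context=None):
    # Illustrative only: remove the per-worker cgroup directory. A cgroup
    # directory can be removed with rmdir once no processes remain in it;
    # the files inside are virtual and are not deleted individually.
    cgroup_path = os.path.join("/sys/fs/cgroup", request.worker_id)
    try:
        os.rmdir(cgroup_path)
        return CleanForDeadWorkerReply(status=AGENT_RPC_STATUS_OK)
    except OSError as e:
        return CleanForDeadWorkerReply(
            status=AGENT_RPC_STATUS_FAILED, error_message=str(e))
```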

## Compatibility, Deprecation, and Migration Plan

This proposal will not change any existing APIs or default behaviors of Ray Core.

## Test Plan and Acceptance Criteria

We plan to benchmark the resource usage of worker processes:
- CPU soft control: The worker can consume idle CPU time beyond its CPU quota.
- CPU hard control: With `cpu_strict_usage=True`, the worker cannot exceed its CPU quota.
- Memory soft control: The worker can use idle memory beyond its memory quota.
- Memory hard control: With `memory_strict_usage=True`, the worker cannot exceed its memory quota.
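As a hedged illustration (this mapping is an assumption of the sketch, not part of the proposal), the four modes could map onto cgroup v2 interfaces roughly as follows: soft CPU via `cpu.weight`, hard CPU via `cpu.max`, soft memory via `memory.high`, hard memory via `memory.max`:

```
echo 200 > /sys/fs/cgroup/{worker_id}/cpu.weight              # CPU soft: proportional share, idle CPU may be used
echo "200000 1000000" > /sys/fs/cgroup/{worker_id}/cpu.max    # CPU hard: 0.2-core ceiling
echo 1G > /sys/fs/cgroup/{worker_id}/memory.high              # memory soft: reclaim pressure above 1G
echo 1G > /sys/fs/cgroup/{worker_id}/memory.max               # memory hard: hard limit, OOM above 1G
```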

Acceptance criteria:
- A set of reasonable APIs.
- A set of reasonable benchmark results.

## (Optional) Follow-on Work

In the first version, we will only support a cgroup manager based on cgroupfs. We can add a systemd-based cgroup manager in the future.
We can also extend control to more resources, such as `blkio`, `devices`, and `net_cls`.