Resource control based on cgroup #7

Open
wants to merge 11 commits into base: main
Conversation

SongGuyang

See the doc for more details.

- worker_resource_control_method: Set to "cgroup" by default.
- cgroup_manager: Set to "cgroupfs" by default. We should also support `systemd`.
- cgroup_mount_path: Set to `/sys/fs/cgroup/` by default.
- cgroup_unified_hierarchy: Whether to use cgroup v2 or not. Set to `False` by default.
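With the `cgroupfs` manager, these options reduce to plain filesystem operations under `cgroup_mount_path`. A minimal sketch of what per-worker setup could look like for cgroup v1 (the helper functions and naming scheme are hypothetical, not part of the proposal; on a real system this requires appropriate permissions, and the control files already exist in the kernel-backed filesystem):

```python
import os

def create_worker_cgroup(mount_path, worker_id, memory_limit_bytes, cpu_shares):
    """Create per-controller cgroup dirs for a worker and write its limits.

    With cgroupfs, creating a cgroup is just creating a directory; the
    kernel materializes the control files inside it.
    """
    paths = []
    for controller, filename, value in [
        ("memory", "memory.limit_in_bytes", memory_limit_bytes),
        ("cpu", "cpu.shares", cpu_shares),
    ]:
        cgroup_dir = os.path.join(mount_path, controller, f"worker_{worker_id}")
        os.makedirs(cgroup_dir, exist_ok=True)
        with open(os.path.join(cgroup_dir, filename), "w") as f:
            f.write(str(value))
        paths.append(cgroup_dir)
    return paths

def attach_pid(cgroup_dir, pid):
    # Writing a PID to cgroup.procs moves that process into the cgroup.
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))
```

The one-off lifecycle discussed later in this thread (create cgroup, start worker, worker dies, delete cgroup) would then be a matter of removing these directories when the worker exits.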
Contributor


can we only support cgroup v2 to reduce the scope?

Author


I think we should also support cgroup v1, because it is widely used and has been written into standards. I have described this here: https://github.com/ray-project/enhancements/pull/7/files#diff-98ccfc9582e95581aae234797bc273b2fb68cb9e4dcc3030c8e94ba447daef7dR112-R113.
Also, at Ant, we only support v1 in production. Can you check your production and CI environments?

reps/2022-04-01-resource-control.md (resolved review thread)
We could also use [libcgroup](https://github.com/libcgroup/libcgroup/blob/main/README) to simplify the implementation. This library supports both cgroup v1 and cgroup v2.

##### systemd
Using systemd to manage cgroups is like:
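As a hedged illustration of this approach: instead of writing to the cgroup tree directly, the raylet would ask systemd (e.g., via `systemd-run` or its D-Bus API) to start the worker in a transient scope unit carrying resource properties. `MemoryMax` and `CPUQuota` are standard systemd resource-control properties; the helper below is purely hypothetical:

```python
def systemd_run_argv(unit_name, worker_cmd, memory_max="512M", cpu_quota="50%"):
    """Build an argv that asks systemd to run the worker in a transient
    scope unit with resource limits, delegating cgroup management to
    systemd rather than touching the cgroup tree directly."""
    return [
        "systemd-run", "--scope",
        f"--unit={unit_name}",
        "-p", f"MemoryMax={memory_max}",
        "-p", f"CPUQuota={cpu_quota}",
    ] + list(worker_cmd)
```

For example, `systemd_run_argv("ray-worker-w1.scope", ["python", "worker.py"])` yields the command line that would be executed for that worker.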
Contributor


what's the advantage of using systemd here?

Author


If we use cgroupfs on a systemd-based system, there will be more than one component managing the cgroup tree. And from https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/, we can see that:

In the short-term future writing directly to the control group tree from applications should still be OK, as long as the Pax Control Groups document is followed. In the medium-term future it will still be supported to alter/read individual attributes of cgroups directly, but no longer to create/delete cgroups without using the systemd API. In the longer-term future altering/reading attributes will also be unavailable to userspace applications, unless done via systemd's APIs (either D-Bus based IPC APIs or shared library APIs for passive operations).

In addition, systemd is highly recommended by runc for cgroup v2 https://github.com/opencontainers/runc/blob/main/docs/cgroup-v2.md.

@ericl
Contributor

ericl commented Apr 9, 2022

@scv119 want to shepherd this one?

@SongGuyang
Author

@ericl @simon-mo We've refined the structure of the whole doc.

@ericl
Contributor

ericl commented Apr 12, 2022

In the current runtime env plugin mechanism design, a plugin must implement a whole runtime env type (e.g., pip, conda, etc.). Cgroup is different here, because it is more like a config of a runtime env than a new runtime env type: it can be used together with other types. So in order to pluginize cgroup, we may need a new plugin mechanism that extends the current one. This seems weird, and I'm not sure if such a new plugin mechanism can be reused for other features. So I think maybe we can consider pluginizing it later if we have a second similar requirement.

I'm not sure I see this as weird. This was one of the original use cases we envisioned for runtime_env plugins when we designed it (arbitrary modifications to the environment, such as attaching debuggers, resource limits, etc.).

With the current plugin mechanism, it's hard to implement cgroup deletion. As I said here #7 (comment), the cgroups are one-off; they can't be shared or reused. So the logic should be: create cgroup -> start worker -> worker dies -> delete cgroup. For example, if the cgroup dir is /sys/fs/cgroup/{worker_id}, we can't reuse the reference-counting logic to delete this dir, because we don't provide the worker_id when we decrease the reference via the RPC from raylet to agent.

This doesn't sound like a fundamental issue, but something we can add, right? Here's my concern: we're adding a ton of stuff into core that will not be used by most users. If we can do this with the existing plugin infra, with some minor improvements, then longer term that leads to a better architecture than handling it directly in core. I understand it makes this project a little more involved, but from the overall Ray project perspective this is cleaner long term, as we might have other plugins of this sort.

@SongGuyang
Author

@ericl I agree that the pluggable architecture could make the core cleaner in the long term, and I could try to make this a resource_control plugin. But I think the current abstraction of the runtime env plugin is not clear and doesn't meet my requirements; we should redefine it first. I'm not sure if we need a separate REP for this. And we need to discuss this more with @edoakes and @architkulkarni.

Btw, I don't think resource control is a minor improvement. It may become a basic capability of Ray Core, widely used in large-scale clusters in the future. We already have a lot of requirements at Ant, and we plan to put it into production in a few months.

@ericl
Contributor

ericl commented Apr 13, 2022

I'd like to see the discussion outcome then. If @edoakes and @architkulkarni think it's not a good fit for a runtime env plugin, then I'm fine with that decision.

@architkulkarni

@ericl I agree that the pluggable architecture could make the core cleaner in the long term, and I could try to make this a resource_control plugin. But I think the current abstraction of the runtime env plugin is not clear and doesn't meet my requirements; we should redefine it first. I'm not sure if we need a separate REP for this. And we need to discuss this more with @edoakes and @architkulkarni.

The issue with deletion makes sense, I agree it seems tricky to support it with the current abstraction. What would the ideal abstraction look like? One naive idea is to have each runtime env plugin have a method OnWorkerDead(worker_id: int) which is called whenever the worker dies, but that might be too specific to this use case.

@SongGuyang
Author

The issue with deletion makes sense, I agree it seems tricky to support it with the current abstraction. What would the ideal abstraction look like? One naive idea is to have each runtime env plugin have a method OnWorkerDead(worker_id: int) which is called whenever the worker dies, but that might be too specific to this use case.

Yep, we should add a method like the one you described. And we should do more to make the abstraction ideal. For example, we need to support per-plugin configuration, because I will add a lot of cluster-level config here: https://github.com/ray-project/enhancements/pull/7/files#diff-98ccfc9582e95581aae234797bc273b2fb68cb9e4dcc3030c8e94ba447daef7dR24.

@architkulkarni btw, do we plan to turn the current runtime env modules, like pip and conda, into plugins? Or will we keep them as they are long term?

Another thing is what @raulchen said here: #7 (comment). What do you think about that? In addition, do we need to support applying multiple plugins? If so, how can we guarantee the ordering and avoid conflicts between plugins? We should consider these issues in the plugin abstraction too, right?

@ericl
Contributor

ericl commented Apr 14, 2022

I made some edits to https://docs.google.com/document/d/1x1JAHg7c0ewcOYwhhclbuW0B0UC7l92WFkF4Su0T-dk/edit# that I think will enable cgroups to be supported (namely, adding priority, create_for_worker, delete_for_worker). Cgroups might be implementable out of the box as a plugin after these hooks are added, wdyt?
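The hooks named above could shape the plugin interface roughly as follows. This is only a sketch: the three hook names come from the comment, but the class names, signatures, and the toy `CgroupPlugin` are hypothetical.

```python
from abc import ABC, abstractmethod

class RuntimeEnvPlugin(ABC):
    """Hypothetical sketch of a runtime env plugin interface with the
    proposed hooks (priority, create_for_worker, delete_for_worker)."""

    # Plugins with lower priority values run earlier, so ordering between
    # plugins (e.g. cgroup vs. container) is well-defined.
    priority: int = 100

    @abstractmethod
    def create_for_worker(self, worker_id: str, plugin_config: dict) -> None:
        """Called before the worker starts, e.g. create the worker's cgroup."""

    @abstractmethod
    def delete_for_worker(self, worker_id: str) -> None:
        """Called after the worker dies; addresses the one-off cgroup
        deletion problem discussed above, since the worker_id is passed in."""

class CgroupPlugin(RuntimeEnvPlugin):
    priority = 10  # run before heavier env setup

    def __init__(self):
        self.created = {}

    def create_for_worker(self, worker_id, plugin_config):
        # A real implementation would create the cgroup dir here.
        self.created[worker_id] = plugin_config

    def delete_for_worker(self, worker_id):
        # A real implementation would remove the cgroup dir here.
        self.created.pop(worker_id, None)
```

Because `delete_for_worker` receives the `worker_id`, this sidesteps the reference-counting limitation raised earlier in the thread.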

@architkulkarni btw, do we plan to make current runtime env modules, like pip and conda, to be plugins? Or we will keep it in long term?

I believe this is the plan, yes.

@fishbone fishbone mentioned this pull request Apr 14, 2022
@SongGuyang
Author

I made some edits to https://docs.google.com/document/d/1x1JAHg7c0ewcOYwhhclbuW0B0UC7l92WFkF4Su0T-dk/edit# that I think will enable cgroups to be supported (namely, adding priority, create_for_worker, delete_for_worker). Cgroups might be implementable out of the box as a plugin after these hooks are added, wdyt?

I see your point, but I think the doc is still not clear:

  • How do these hooks cooperate with the Raylet and the Agent? How should the RPC be refined?
  • How does the plugin affect the runtime env cache? Currently, if the runtime env JSON is the same, we return the cached runtime env context. But in the cgroup case, the context may differ.
  • How do we implement per-plugin configuration?

btw, @ericl @architkulkarni I think we should also turn the Google doc into a REP, wdyt? Who can take this work? I'm available if you don't have the bandwidth.

@raulchen

I think we are on the same page that resource control is better implemented as a plugin. Also, it seems that the plugin system still needs an overhaul to meet different kinds of runtime env requirements.
So my suggestion is that we can do 2 things simultaneously:

  1. Summarize the minimal changes to the plugin system needed to support cgroup, and make these changes part of this REP.
  2. Discuss what other changes are needed to support other plugins, and potentially file another REP for the plugin system refinements.

What do you think? If we agree, @SongGuyang can add the first part to this REP first.

@ericl
Contributor

ericl commented Apr 15, 2022 via email

@edoakes
Contributor

edoakes commented Apr 18, 2022

@raulchen +1. Please keep @architkulkarni in the loop on the plugin changes, he can also help scope out the design offline first.

```python
runtime_env = {
    "enable_resource_control": True,
    "resource_control_config": {
        # ... (remaining fields truncated in this excerpt)
    },
}
```
Contributor


Is cgroup version automatically detected?

Author


I found a util function in runc: https://github.com/opencontainers/runc/blob/main/libcontainer/cgroups/utils.go#L34-L50. But cgroup v1 and v2 can both be enabled on some systems, and it also depends on which subsystems (e.g., cpu, memory) have been enabled in the cgroup. So I'd like to add a config option as a first step. We can change this in the future.
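For reference, a common detection heuristic, similar in spirit to the runc util linked above, is to check for the unified hierarchy's marker file at the mount point. The helper itself is hypothetical, and as noted, hybrid systems can mount both hierarchies, which is why an explicit config option is still a reasonable first step:

```python
import os

def detect_cgroup_version(mount_path="/sys/fs/cgroup"):
    """Heuristic: on a pure cgroup v2 (unified) mount, the marker file
    cgroup.controllers exists directly at the mount point; on v1 it does
    not. Hybrid setups may expose both, so this is only a best guess."""
    if os.path.exists(os.path.join(mount_path, "cgroup.controllers")):
        return 2
    return 1
```

Such a helper could serve as the default when the user doesn't set `cgroup_unified_hierarchy` explicitly.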

@SongGuyang
Author

I will continue to work on the current REP this week and get back to the remaining comments.

@SongGuyang
Author

@ericl @architkulkarni @simon-mo @rkooo567 Take a look again?

Comment on lines +152 to +155
Another change is that we should support loading runtime env plugins when we start the Ray nodes. For example:
```
ray start --head --runtime-env-plugins="ray.experimental.cgroup_env_plugin:ray.experimental.test_env_plugin"
```


Is this different from eager_install, which already exists in Ray?

Author


They're two different things. This option is used to register plugins with the dashboard agent. For example, you can register a plugin named "conda" when starting the Ray node. Then, when a job uses the "conda" field in runtime_env after the cluster starts, the agent can call the "conda" plugin directly to set up the environment.


Makes sense! In the pluggability design there's a related idea, which is to use environment variables, e.g. RAY_RUNTIME_ENV_PLUGIN_CGROUP=ray.experimental.test_cgroup_plugin ray start --head. These both seem straightforward to implement, but I'm not sure which approach is best. Do you have any thoughts?

Contributor

@edoakes edoakes left a comment


Hey @SongGuyang we did a small group review of this and this was the conclusion/feedback:

  1. The requisite changes to runtime_env plugin architecture make sense and would be an overall improvement.
  2. We would prefer that the functionality does not live in the main Ray repo to begin with, but we can reevaluate this later as it matures and we get more user feedback.
  3. We aren't quite sure what the gap between this and the container field in runtime_env is, or how we should be thinking about them together. Should we support both? Can we support the cgroup functionality by extending container support? If we are providing cgroup, should we remove container support? Additionally, the existing container support has not been publicly documented/adopted nor maintained. Given these concerns, we think it would make sense to evaluate both the container and cgroup options holistically as part of this REP -- a strawman proposal would be to move the existing container support out of the codebase and into a plugin as well if we think both are needed.
  4. Could you please add more details about how this will be tested to the REP? Specifically, where will the tests live, what types of unit & integration tests we will need, etc.? Especially if this is not living in the main Ray repo we will need to make sure that the testing infrastructure is set up to avoid breakages.

Also, @ericl asked me to shepherd this change given that I have more context on runtime_env, so could you please update the text accordingly?

Please let me know if you have any questions!

@edoakes edoakes self-assigned this May 3, 2022
@ericl ericl removed their assignment May 5, 2022
@SongGuyang
Author

  1. We would prefer that the functionality does not live in the main Ray repo to begin with, but we can reevaluate this later as it matures and we get more user feedback.

Why couldn't it be an experimental module in the main Ray repo? I didn't see a strong reason for this in your description.

@SongGuyang
Author

  1. We aren't quite sure what the gap between this and the container field in runtime_env is, or how we should be thinking about them together. Should we support both? Can we support the cgroup functionality by extending container support? If we are providing cgroup, should we remove container support? Additionally, the existing container support has not been publicly documented/adopted nor maintained. Given these concerns, we think it would make sense to evaluate both the container and cgroup options holistically as part of this REP -- a strawman proposal would be to move the existing container support out of the codebase and into a plugin as well if we think both are needed.

I think we need both.

Cgroup is only used to control physical resources, e.g., CPU and memory, but containers serve many purposes: custom Linux kernels, custom system packages (yum install ...), custom file systems, and of course resource control. We cannot replace cgroup with containers in some scenarios, because cgroup is more lightweight and doesn't rely on images or an image registry. When you create a directory in the cgroup file system, you get a cgroup.

Can we just add some description like the above? Containers are another big topic. I'm sure they could also be a plugin, but please don't let that block the pluggability enhancement and cgroup support. We can discuss containers in a separate issue or REP.

@raulchen

raulchen commented May 5, 2022

@edoakes here are my thoughts to your questions.

  1. Why do we need both cgroup and container?

As @SongGuyang mentioned above, sometimes users only want resource control + pip requirements. In this case, containers are too heavyweight, as users would have to manage their Dockerfiles and images.

  1. Where should the code live?

I'd prefer putting it in the main repo (at least to begin with). For one, I think the resource control requirement is as common as other existing ones, so putting it in the main repo would benefit more users. For another, we need to keep refining the plugin system along with this feature, and putting everything in the same repo would facilitate testing.

I think we can put it in a new "runtime_env_plugins" directory, and later migrate existing runtime envs to this directory as well. The core shouldn't depend on anything in this directory. Runtime envs could also be shipped separately, e.g., "pip install ray[runtime_env_pip]" or "pip install ray[runtime_env_container]".

@edoakes
Contributor

edoakes commented May 5, 2022

@SongGuyang @raulchen

Re: cgroups + containers: I understand that you cannot use cgroup as a replacement for containers, but containers are essentially a superset of cgroup. Could we leverage the existing container support for this purpose without requiring users to specify an image or worry about other container details? For example, make both the image and cgroup options configurable in the container command, with the image being optional.

Re: putting it in the repo vs. outside: the reasoning is that we do not yet have evidence of user demand for this feature, and adding it to the core repo increases the maintenance burden. A relevant example is the container support in runtime_env: it is not used by the community because it's not polished or well-documented, but it has increased the maintenance burden for runtime_env and caused test failures without anyone taking clear ownership (the relevant test is now disabled). We want to ensure the same thing doesn't happen with cgroup support. Once the feature is fully fledged and we see user demand, we can reconsider promoting it into the repo.

8 participants