Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Support enabling PTDS via CUDA_PTDS environment variable #633

Open
wants to merge 7 commits into
base: branch-21.06
Choose a base branch
from

Conversation

pentschev
Copy link
Member

Following up on a conversation with @harrism @jakirkham @rongou yesterday, I did a small change to the Cython bindings where we can enable PTDS via a CUDA_PTDS runtime environment variable, which is different from #480 where rebuilding the Python package is necessary. This allows us to test RMM together with other Python libraries in a non-intrusive fashion, requiring users to explicitly enable PTDS.

It's important to notice that this only works for the RMM API, for example rmm.DeviceBuffer, but using Numba to do the copying will still result on using the default stream only. To add Numba support, we may do a similar change there. Tagging @gmarkall for awareness too.

I currently used stream 2 because I didn't find an existing definition in RMM, I'm happy to change that appropriately to a definition, but I'm not sure where that should live yet.

@pentschev pentschev requested a review from a team as a code owner November 20, 2020 11:58
@GPUtester
Copy link
Contributor

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, are you sure this is the right place to do this? Second, is thus too heavy-handed? Third, I think the env variable should not start with CUDA. It should be very explicit, clear and dangerous sounding: RMM_FORCE_PER_THREAD_DEFAULT_STREAM.

python/rmm/_lib/cuda_stream_view.pyx Outdated Show resolved Hide resolved
@pentschev
Copy link
Member Author

First, are you sure this is the right place to do this?

When you say "this is the right place", are you referring to RMM itself, to the Cython binding or to the CudaStreamView?

Second, is thus too heavy-handed?

Why is this heavy-handed if we only allow the user to explicitly enable it?

Third, I think the env variable should not start with CUDA. It should be very explicit, clear and dangerous sounding: RMM_FORCE_PER_THREAD_DEFAULT_STREAM.

I'm fine with a different variable name, I chose a very general-purpose one only to see how people feel about it. Should we have a very specific name like what you suggested or have a variable name that all projects could share and not only RMM, but also Numba, CuPy, etc.?

@harrism
Copy link
Member

harrism commented Nov 22, 2020

When you say "this is the right place", are you referring to RMM itself, to the Cython binding or to the CudaStreamView?

I mean the constructor of CudaStreamView. I suppose as long as there is still a way to explicitly request cudaStreamLegacy in this mode, it's OK.

I'm fine with a different variable name, I chose a very general-purpose one only to see how people feel about it. Should we have a very specific name like what you suggested or have a variable name that all projects could share and not only RMM, but also Numba, CuPy, etc.?

There were two things I didn't like about the name. One, it starts with CUDA, which should probably be reserved for official CUDA variables. Two, it has an acronym that many people won't know (CUDA docs never use the term "PTDS" as far as I know).

@pentschev
Copy link
Member Author

I mean the constructor of CudaStreamView. I suppose as long as there is still a way to explicitly request cudaStreamLegacy in this mode, it's OK.

Sorry, this is still not 100% clear to me. Are you saying that you want a way to explicitly request cudaStreamLegacy even when we have the environment variable explicitly passed by the user? If so, then maybe we could consider changing the default to stream=None, and that would just choose between cudaStreamLegacy or cudaStreamPerThread depending on the environment variable, otherwise the user can pass the stream itself and that would completely bypass the environment variable mechanism.

Calling it CUDA_* implies that it's an official CUDA feature, which it is not.
...
There were two things I didn't like about the name. One, it starts with CUDA, which should probably be reserved for official CUDA variables. Two, it has an acronym that many people won't know (CUDA docs never use the term "PTDS" as far as I know).

I understand all that, my intent here is asking for additional input on whether we want an RMM-specific variable (and consequently specific variables for other libraries, e.g., CUPY_CUDA_PTDS) or something more general, perhaps PYTHON_CUDA_PTDS? We could still expand that to RMM_PER_THREAD_DEFAULT_STREAM and do the same for other libraries. IMHO, I would prefer the short version and document it well, it gets very annoying to type all that and people who will use it will do so because they know what they're doing.

@gmarkall
Copy link
Contributor

I understand all that, my intent here is asking for additional input on whether we want an RMM-specific variable (and consequently specific variables for other libraries, e.g., CUPY_CUDA_PTDS) or something more general, perhaps PYTHON_CUDA_PTDS? We could still expand that to RMM_PER_THREAD_DEFAULT_STREAM and do the same for other libraries. IMHO, I would prefer the short version and document it well, it gets very annoying to type all that and people who will use it will do so because they know what they're doing.

In Numba we'd probably want to have an env var called NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM because environment variables starting with NUMBA_ are mapped into the config module such that the config variable is numba.config.CUDA_PER_THREAD_DEFAULT_STREAM, and CUDA_ follows NUMBA_ for variables related to the CUDA target. So a more general CUDA_PTDS or PYTHON_CUDA_PTDS could be taken into account, but would likely duplicate / need resolving with the environment variable / config variable configuration in Numba.

@pentschev
Copy link
Member Author

In Numba we'd probably want to have an env var called NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM because environment variables starting with NUMBA_ are mapped into the config module such that the config variable is numba.config.CUDA_PER_THREAD_DEFAULT_STREAM, and CUDA_ follows NUMBA_ for variables related to the CUDA target. So a more general CUDA_PTDS or PYTHON_CUDA_PTDS could be taken into account, but would likely duplicate / need resolving with the environment variable / config variable configuration in Numba.

This is what I thought too, although CuPy doesn't have a configuration module it follows a similar pattern for environment variables. My suggestion with PYTHON_CUDA_PTDS is only to have a simple way to enable things but I'm ok if that's undesirable, the alternative to that will probably be something like RMM_PER_THREAD_DEFAULT_STREAM=1 NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM=1 CUPY_CUDA_PER_THREAD_DEFAULT_STREAM=1, which is very convoluted and discouraging.

@harrism
Copy link
Member

harrism commented Nov 23, 2020

Are you saying that you want a way to explicitly request cudaStreamLegacy even when we have the environment variable explicitly passed by the user?

I mean that it already exists: pass 1 to the constructor of CudaStreamView. Although I don't like the magic numbers. I worked very hard in the C++ code to eliminate them, and I couldn't figure out how to do it correctly in Cython, so help appreciated. In the C++ we have rmm::cuda_stream_per_thread, rmm::cuda_stream_default and rmm::cuda_stream_legacy. See

/**
* @brief Static cuda_stream_view of the default stream (stream 0), for convenience
*/
static constexpr cuda_stream_view cuda_stream_default{};
/**
* @brief Static cuda_stream_view of cudaStreamLegacy, for convenience
*/
static cuda_stream_view cuda_stream_legacy{cudaStreamLegacy};
/**
* @brief Static cuda_stream_view of cudaStreamPerThread, for convenience
*/
static cuda_stream_view cuda_stream_per_thread{cudaStreamPerThread};

@harrism
Copy link
Member

harrism commented Nov 23, 2020

it gets very annoying to type all that

I generally type env variables into scripts, not at the command line, so the length doesn't matter much. PTDS is not a well-known acronym, and it's easily confused with PTSD.

@pentschev
Copy link
Member Author

I mean that it already exists: pass 1 to the constructor of CudaStreamView. Although I don't like the magic numbers. I worked very hard in the C++ code to eliminate them, and I couldn't figure out how to do it correctly in Cython, so help appreciated. In the C++ we have rmm::cuda_stream_per_thread, rmm::cuda_stream_default and rmm::cuda_stream_legacy. See

I spent some time now trying to do that and it seems indeed there's no simple, clean way to do so. In d789117 I exposed the definitions for cudaStreamLegacy/cudaStreamPerThread from the C includes, but we can't use custom types in def __cinit__ apparently, so we still need to pass something like a uintptr_t stream=0 to the constructor, as it's currently done.

The solution I see are to define custom Python types, perhaps just based on unitptr_t and rely on that, while writing our own PyCudaStreamLegacy or a similar name, which we can use to pass to CudaStreamView. Perhaps @jakirkham and @gmarkall who have more experience than I do have a better suggestion to handle this issue though.

@pentschev
Copy link
Member Author

I generally type env variables into scripts, not at the command line, so the length doesn't matter much.

I do type variables, explicit is much better than implicit for me, throughout my life I wasted countless hours on implicit definitions from configuration files, system defaults, etc., thus I hate those things that are easy to forget, especially as time passes.

PTDS is not a well-known acronym, and it's easily confused with PTSD.

I agree it isn't a well-known acronym, but no acronyms or names are well-known from day zero. And if it's about confusing with similar acronyms from completely different areas of knowledge, we might as well just tell everyone to never use acronyms again. :)

@gmarkall
Copy link
Contributor

Perhaps @jakirkham and @gmarkall who have more experience than I do have a better suggestion to handle this issue though.

I only have a basic working experience of Cython so I'm not sure what a better suggestion would look like :-(.

@leofang
Copy link
Member

leofang commented Nov 23, 2020

I spent some time now trying to do that and it seems indeed there's no simple, clean way to do so. In d789117 I exposed the definitions for cudaStreamLegacy/cudaStreamPerThread from the C includes, but we can't use custom types in def __cinit__ apparently, so we still need to pass something like a uintptr_t stream=0 to the constructor, as it's currently done.

The solution I see are to define custom Python types, perhaps just based on unitptr_t and rely on that, while writing our own PyCudaStreamLegacy or a similar name, which we can use to pass to CudaStreamView. Perhaps @jakirkham and @gmarkall who have more experience than I do have a better suggestion to handle this issue though.

Just a drive-by comment...It is not possible to set the default value of Python def functions (such as __cinit__ here) to values defined in C. A workaround could be to define a cpdef enum that contains these values. A cpdef enum has a Python representation according to PEP-435, which can then be used as the default value:

cpdef enum cudaStream:
    cudaStreamLegacy = 0
    cudaStreamPerThread

cdef f(cudaStream s=cudaStreamLegacy):
    pass

cdef class XYZ:
    def __cinit__(self, cudaStream s=cudaStreamLegacy):
        pass

@pentschev
Copy link
Member Author

Thanks @leofang , that's indeed very informative. I'm feeling less confused now, that was pretty much the only solution I could arrive at and your confirmation is quite helpful!

@pentschev
Copy link
Member Author

I changed CudaStreamView with the suggestions from #633 (comment), let me know if that's an acceptable solution.

@harrism
Copy link
Member

harrism commented Nov 23, 2020

I'm firmly against PTDS in the name of the environment variable. It's inconsistent with the rest of RMM and with CUDA.

@pentschev
Copy link
Member Author

I'm firmly against PTDS in the name of the environment variable. It's inconsistent with the rest of RMM and with CUDA.

Is this 1ee52af ok for you?

@harrism
Copy link
Member

harrism commented Dec 2, 2020

Please retarget to branch-0.18 when ready. Adding to 0.18 project for tracking.

@harrism harrism added the Python Related to RMM Python API label Dec 2, 2020
@pentschev
Copy link
Member Author

I updated this PR to work with the latest branch-0.18. Would appreciate some new feedback here.

@jrhemstad
Copy link
Contributor

Trying to control per-thread default stream with environment variables does not feel like a good idea to me. It would be one thing if there was a CUDA provided environment variable that controlled this behavior, but trying to do it ad hoc per library just feels like a bad idea.

Furthermore, the existing environment variable deviates from the standard meaning of "per-thread default stream". Normally enabling PTDS changes the meaning of the value 0 from the legacy default stream to a stream per thread. What's being done here is it's using cuda_stream_per_thread (the value 2) by default, i.e., the meaning of 0 has not changed. So if RMM_PER_THREAD_DEFAULT_STREAM were enabled, and stream 0 was used anywhere, it would still mean the legacy default stream.

@pentschev
Copy link
Member Author

Trying to control per-thread default stream with environment variables does not feel like a good idea to me. It would be one thing if there was a CUDA provided environment variable that controlled this behavior, but trying to do it ad hoc per library just feels like a bad idea.

I mostly agree with that. However, we need anyway to control the PTDS behavior per library anyway, i.e., we'll need to add support to all of them somehow, but we need to start somehow/somewhere. Furthermore, using environment variables has been the most reasonable approach I've seen to enable a certain behavior globally in Python without requiring changes to user code as well as 3rd-party dependencies to such libraries, which is good at an initial stage without breaking existing code.

Furthermore, the existing environment variable deviates from the standard meaning of "per-thread default stream". Normally enabling PTDS changes the meaning of the value 0 from the legacy default stream to a stream per thread. What's being done here is it's using cuda_stream_per_thread (the value 2) by default, i.e., the meaning of 0 has not changed. So if RMM_PER_THREAD_DEFAULT_STREAM were enabled, and stream 0 was used anywhere, it would still mean the legacy default stream.

This is also true, and by no means this is intended as a flawless or future proof approach, but it allows our team and other advanced users to begin testing and profiling with PTDS, particularly with Dask. For now, my current assumption is that the libraries we'll use with PTDS will rely on RMM's default, thus using 0 isn't expected.

With the above said, I'm open to suggestions on improving this that will allow us to at least begin some testing, without each new user being required modify and install RMM from source. For the future, I think we may prefer to address each piece of code individually to treat PTDS at a local level, but that certainly involves touching pretty much all Python libraries that is used by RAPIDS today, which is going to be inefficient and even more intrusive at this stage.

@jrhemstad
Copy link
Contributor

we need anyway to control the PTDS behavior per library anyway

I don't think this is true. So long as every library exposes a stream argument to indicate on which stream an operation should occur, then one need only specify the cuda_stream_per_thread at the top level of any operation and it should then be passed down through any other invoked library.

def cudf_thing(stream)
   // pass the provided stream to downstream libraries
   do_cupy_thing(stream)
   do_numba_thing(stream)

@pentschev
Copy link
Member Author

I don't think this is true. So long as every library exposes a stream argument to indicate on which stream an operation should occur, then one need only specify the cuda_stream_per_thread at the top level of any operation and it should then be passed down through any other invoked library.

Yes, this is what I meant, sorry for being unclear. That's true, and probably the right move long-term, but right now this is still a bit distant, as we really need to make sure we can pass the stream to all those libraries.

@quasiben
Copy link
Member

quasiben commented Mar 2, 2021

I think this should be re-targeted to 0.19

@jakirkham jakirkham changed the base branch from branch-0.18 to branch-0.19 March 2, 2021 01:37
@jakirkham
Copy link
Member

Have updated the upstream branch targeted by the PR. Looks like that updated GitHub Projects automatically. Not sure if there's anything else needed in terms of retargeting

Comment on lines +122 to +125
if int(os.environ.get("RMM_PER_THREAD_DEFAULT_STREAM", "0")) != 0:
DEFAULT_STREAM = Stream._from_cudaStream_t(cuda_stream_per_thread.value())
else:
DEFAULT_STREAM = Stream._from_cudaStream_t(cuda_stream_default.value())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to allow this to be changed during runtime of an application or only at import time?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would assume at import time as this is what I think cupy does now

Copy link
Member

@leofang leofang Mar 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well CuPy reads the env var (same is done here) so technically there it could be changed at runtime by overwriting os.environ['CUPY_CUDA_PER_THREAD_DEFAULT_STREAM']. However such operation is really a bad practice that is hard to be fool-proofed.

@github-actions
Copy link

github-actions bot commented Apr 3, 2021

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@pentschev pentschev changed the base branch from branch-0.19 to branch-0.20 April 6, 2021 08:51
@pentschev
Copy link
Member Author

I haven't been able to continue working on PTDS, but hopefully can come back to it in the next month or so.

@gmarkall
Copy link
Contributor

For reference, I implemented something similar in Numba ( see numba/numba#6936) and went with NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM, for the reasons outlined previously:

In Numba we'd probably want to have an env var called NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM because environment variables starting with NUMBA_ are mapped into the config module such that the config variable is numba.config.CUDA_PER_THREAD_DEFAULT_STREAM, and CUDA_ follows NUMBA_ for variables related to the CUDA target. So a more general CUDA_PTDS or PYTHON_CUDA_PTDS could be taken into account, but would likely duplicate / need resolving with the environment variable / config variable configuration in Numba.

@pentschev
Copy link
Member Author

Thanks @gmarkall , I still think having environment variables is a great step forward at least for the short-/mid-term, as this allows us to test things before committing to big changes in the ecosystem.

@github-actions
Copy link

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions
Copy link

This PR has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
inactive-30d inactive-90d Python Related to RMM Python API
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

10 participants