[WIP] Support enabling PTDS via CUDA_PTDS environment variable #633

pentschev · 2020-11-20T11:58:13Z

Following up on a conversation with @harrism @jakirkham @rongou yesterday, I did a small change to the Cython bindings where we can enable PTDS via a CUDA_PTDS runtime environment variable, which is different from #480 where rebuilding the Python package is necessary. This allows us to test RMM together with other Python libraries in a non-intrusive fashion, requiring users to explicitly enable PTDS.

It's important to notice that this only works for the RMM API, for example rmm.DeviceBuffer, but using Numba to do the copying will still result on using the default stream only. To add Numba support, we may do a similar change there. Tagging @gmarkall for awareness too.

I currently used stream 2 because I didn't find an existing definition in RMM, I'm happy to change that appropriately to a definition, but I'm not sure where that should live yet.

GPUtester · 2020-11-20T11:58:36Z

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

harrism

First, are you sure this is the right place to do this? Second, is thus too heavy-handed? Third, I think the env variable should not start with CUDA. It should be very explicit, clear and dangerous sounding: RMM_FORCE_PER_THREAD_DEFAULT_STREAM.

python/rmm/_lib/cuda_stream_view.pyx

pentschev · 2020-11-20T21:06:15Z

First, are you sure this is the right place to do this?

When you say "this is the right place", are you referring to RMM itself, to the Cython binding or to the CudaStreamView?

Second, is thus too heavy-handed?

Why is this heavy-handed if we only allow the user to explicitly enable it?

Third, I think the env variable should not start with CUDA. It should be very explicit, clear and dangerous sounding: RMM_FORCE_PER_THREAD_DEFAULT_STREAM.

I'm fine with a different variable name, I chose a very general-purpose one only to see how people feel about it. Should we have a very specific name like what you suggested or have a variable name that all projects could share and not only RMM, but also Numba, CuPy, etc.?

harrism · 2020-11-22T21:01:58Z

When you say "this is the right place", are you referring to RMM itself, to the Cython binding or to the CudaStreamView?

I mean the constructor of CudaStreamView. I suppose as long as there is still a way to explicitly request cudaStreamLegacy in this mode, it's OK.

I'm fine with a different variable name, I chose a very general-purpose one only to see how people feel about it. Should we have a very specific name like what you suggested or have a variable name that all projects could share and not only RMM, but also Numba, CuPy, etc.?

There were two things I didn't like about the name. One, it starts with CUDA, which should probably be reserved for official CUDA variables. Two, it has an acronym that many people won't know (CUDA docs never use the term "PTDS" as far as I know).

pentschev · 2020-11-23T10:19:08Z

I mean the constructor of CudaStreamView. I suppose as long as there is still a way to explicitly request cudaStreamLegacy in this mode, it's OK.

Sorry, this is still not 100% clear to me. Are you saying that you want a way to explicitly request cudaStreamLegacy even when we have the environment variable explicitly passed by the user? If so, then maybe we could consider changing the default to stream=None, and that would just choose between cudaStreamLegacy or cudaStreamPerThread depending on the environment variable, otherwise the user can pass the stream itself and that would completely bypass the environment variable mechanism.

Calling it CUDA_* implies that it's an official CUDA feature, which it is not.
...
There were two things I didn't like about the name. One, it starts with CUDA, which should probably be reserved for official CUDA variables. Two, it has an acronym that many people won't know (CUDA docs never use the term "PTDS" as far as I know).

I understand all that, my intent here is asking for additional input on whether we want an RMM-specific variable (and consequently specific variables for other libraries, e.g., CUPY_CUDA_PTDS) or something more general, perhaps PYTHON_CUDA_PTDS? We could still expand that to RMM_PER_THREAD_DEFAULT_STREAM and do the same for other libraries. IMHO, I would prefer the short version and document it well, it gets very annoying to type all that and people who will use it will do so because they know what they're doing.

gmarkall · 2020-11-23T10:39:48Z

I understand all that, my intent here is asking for additional input on whether we want an RMM-specific variable (and consequently specific variables for other libraries, e.g., CUPY_CUDA_PTDS) or something more general, perhaps PYTHON_CUDA_PTDS? We could still expand that to RMM_PER_THREAD_DEFAULT_STREAM and do the same for other libraries. IMHO, I would prefer the short version and document it well, it gets very annoying to type all that and people who will use it will do so because they know what they're doing.

In Numba we'd probably want to have an env var called NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM because environment variables starting with NUMBA_ are mapped into the config module such that the config variable is numba.config.CUDA_PER_THREAD_DEFAULT_STREAM, and CUDA_ follows NUMBA_ for variables related to the CUDA target. So a more general CUDA_PTDS or PYTHON_CUDA_PTDS could be taken into account, but would likely duplicate / need resolving with the environment variable / config variable configuration in Numba.

pentschev · 2020-11-23T11:00:23Z

In Numba we'd probably want to have an env var called NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM because environment variables starting with NUMBA_ are mapped into the config module such that the config variable is numba.config.CUDA_PER_THREAD_DEFAULT_STREAM, and CUDA_ follows NUMBA_ for variables related to the CUDA target. So a more general CUDA_PTDS or PYTHON_CUDA_PTDS could be taken into account, but would likely duplicate / need resolving with the environment variable / config variable configuration in Numba.

This is what I thought too, although CuPy doesn't have a configuration module it follows a similar pattern for environment variables. My suggestion with PYTHON_CUDA_PTDS is only to have a simple way to enable things but I'm ok if that's undesirable, the alternative to that will probably be something like RMM_PER_THREAD_DEFAULT_STREAM=1 NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM=1 CUPY_CUDA_PER_THREAD_DEFAULT_STREAM=1, which is very convoluted and discouraging.

harrism · 2020-11-23T11:03:33Z

Are you saying that you want a way to explicitly request cudaStreamLegacy even when we have the environment variable explicitly passed by the user?

I mean that it already exists: pass 1 to the constructor of CudaStreamView. Although I don't like the magic numbers. I worked very hard in the C++ code to eliminate them, and I couldn't figure out how to do it correctly in Cython, so help appreciated. In the C++ we have rmm::cuda_stream_per_thread, rmm::cuda_stream_default and rmm::cuda_stream_legacy. See

rmm/include/rmm/cuda_stream_view.hpp

Lines 112 to 125 in e7f0726

    
           /** 
        
            * @brief Static cuda_stream_view of the default stream (stream 0), for convenience 
        
            */ 
        
           static constexpr cuda_stream_view cuda_stream_default{}; 
        
           /** 
        
            * @brief Static cuda_stream_view of cudaStreamLegacy, for convenience 
        
            */ 
        
           static cuda_stream_view cuda_stream_legacy{cudaStreamLegacy}; 
        
           /** 
        
            * @brief Static cuda_stream_view of cudaStreamPerThread, for convenience 
        
            */ 
        
           static cuda_stream_view cuda_stream_per_thread{cudaStreamPerThread};

harrism · 2020-11-23T11:04:44Z

it gets very annoying to type all that

I generally type env variables into scripts, not at the command line, so the length doesn't matter much. PTDS is not a well-known acronym, and it's easily confused with PTSD.

pentschev · 2020-11-23T15:23:25Z

I mean that it already exists: pass 1 to the constructor of CudaStreamView. Although I don't like the magic numbers. I worked very hard in the C++ code to eliminate them, and I couldn't figure out how to do it correctly in Cython, so help appreciated. In the C++ we have rmm::cuda_stream_per_thread, rmm::cuda_stream_default and rmm::cuda_stream_legacy. See

I spent some time now trying to do that and it seems indeed there's no simple, clean way to do so. In d789117 I exposed the definitions for cudaStreamLegacy/cudaStreamPerThread from the C includes, but we can't use custom types in def __cinit__ apparently, so we still need to pass something like a uintptr_t stream=0 to the constructor, as it's currently done.

The solution I see are to define custom Python types, perhaps just based on unitptr_t and rely on that, while writing our own PyCudaStreamLegacy or a similar name, which we can use to pass to CudaStreamView. Perhaps @jakirkham and @gmarkall who have more experience than I do have a better suggestion to handle this issue though.

pentschev · 2020-11-23T15:26:36Z

I generally type env variables into scripts, not at the command line, so the length doesn't matter much.

I do type variables, explicit is much better than implicit for me, throughout my life I wasted countless hours on implicit definitions from configuration files, system defaults, etc., thus I hate those things that are easy to forget, especially as time passes.

PTDS is not a well-known acronym, and it's easily confused with PTSD.

I agree it isn't a well-known acronym, but no acronyms or names are well-known from day zero. And if it's about confusing with similar acronyms from completely different areas of knowledge, we might as well just tell everyone to never use acronyms again. :)

gmarkall · 2020-11-23T16:15:35Z

Perhaps @jakirkham and @gmarkall who have more experience than I do have a better suggestion to handle this issue though.

I only have a basic working experience of Cython so I'm not sure what a better suggestion would look like :-(.

leofang · 2020-11-23T17:52:36Z

I spent some time now trying to do that and it seems indeed there's no simple, clean way to do so. In d789117 I exposed the definitions for cudaStreamLegacy/cudaStreamPerThread from the C includes, but we can't use custom types in def __cinit__ apparently, so we still need to pass something like a uintptr_t stream=0 to the constructor, as it's currently done.

The solution I see are to define custom Python types, perhaps just based on unitptr_t and rely on that, while writing our own PyCudaStreamLegacy or a similar name, which we can use to pass to CudaStreamView. Perhaps @jakirkham and @gmarkall who have more experience than I do have a better suggestion to handle this issue though.

Just a drive-by comment...It is not possible to set the default value of Python def functions (such as __cinit__ here) to values defined in C. A workaround could be to define a cpdef enum that contains these values. A cpdef enum has a Python representation according to PEP-435, which can then be used as the default value:

cpdef enum cudaStream:
    cudaStreamLegacy = 0
    cudaStreamPerThread

cdef f(cudaStream s=cudaStreamLegacy):
    pass

cdef class XYZ:
    def __cinit__(self, cudaStream s=cudaStreamLegacy):
        pass

pentschev · 2020-11-23T18:44:26Z

Thanks @leofang , that's indeed very informative. I'm feeling less confused now, that was pretty much the only solution I could arrive at and your confirmation is quite helpful!

pentschev · 2020-11-23T19:35:03Z

I changed CudaStreamView with the suggestions from #633 (comment), let me know if that's an acceptable solution.

python/rmm/_lib/cuda_stream_view.pyx

harrism · 2020-11-23T22:04:24Z

I'm firmly against PTDS in the name of the environment variable. It's inconsistent with the rest of RMM and with CUDA.

pentschev · 2020-11-23T22:08:13Z

I'm firmly against PTDS in the name of the environment variable. It's inconsistent with the rest of RMM and with CUDA.

Is this 1ee52af ok for you?

python/rmm/_lib/cuda_stream_view.pyx

harrism · 2020-12-02T20:53:23Z

Please retarget to branch-0.18 when ready. Adding to 0.18 project for tracking.

pentschev · 2021-01-06T17:35:37Z

I updated this PR to work with the latest branch-0.18. Would appreciate some new feedback here.

jrhemstad · 2021-01-08T21:18:45Z

Trying to control per-thread default stream with environment variables does not feel like a good idea to me. It would be one thing if there was a CUDA provided environment variable that controlled this behavior, but trying to do it ad hoc per library just feels like a bad idea.

Furthermore, the existing environment variable deviates from the standard meaning of "per-thread default stream". Normally enabling PTDS changes the meaning of the value 0 from the legacy default stream to a stream per thread. What's being done here is it's using cuda_stream_per_thread (the value 2) by default, i.e., the meaning of 0 has not changed. So if RMM_PER_THREAD_DEFAULT_STREAM were enabled, and stream 0 was used anywhere, it would still mean the legacy default stream.

pentschev · 2021-01-08T22:22:06Z

Trying to control per-thread default stream with environment variables does not feel like a good idea to me. It would be one thing if there was a CUDA provided environment variable that controlled this behavior, but trying to do it ad hoc per library just feels like a bad idea.

I mostly agree with that. However, we need anyway to control the PTDS behavior per library anyway, i.e., we'll need to add support to all of them somehow, but we need to start somehow/somewhere. Furthermore, using environment variables has been the most reasonable approach I've seen to enable a certain behavior globally in Python without requiring changes to user code as well as 3rd-party dependencies to such libraries, which is good at an initial stage without breaking existing code.

Furthermore, the existing environment variable deviates from the standard meaning of "per-thread default stream". Normally enabling PTDS changes the meaning of the value 0 from the legacy default stream to a stream per thread. What's being done here is it's using cuda_stream_per_thread (the value 2) by default, i.e., the meaning of 0 has not changed. So if RMM_PER_THREAD_DEFAULT_STREAM were enabled, and stream 0 was used anywhere, it would still mean the legacy default stream.

This is also true, and by no means this is intended as a flawless or future proof approach, but it allows our team and other advanced users to begin testing and profiling with PTDS, particularly with Dask. For now, my current assumption is that the libraries we'll use with PTDS will rely on RMM's default, thus using 0 isn't expected.

With the above said, I'm open to suggestions on improving this that will allow us to at least begin some testing, without each new user being required modify and install RMM from source. For the future, I think we may prefer to address each piece of code individually to treat PTDS at a local level, but that certainly involves touching pretty much all Python libraries that is used by RAPIDS today, which is going to be inefficient and even more intrusive at this stage.

jrhemstad · 2021-01-08T22:29:07Z

we need anyway to control the PTDS behavior per library anyway

I don't think this is true. So long as every library exposes a stream argument to indicate on which stream an operation should occur, then one need only specify the cuda_stream_per_thread at the top level of any operation and it should then be passed down through any other invoked library.

def cudf_thing(stream)
   // pass the provided stream to downstream libraries
   do_cupy_thing(stream)
   do_numba_thing(stream)

pentschev · 2021-01-08T22:44:44Z

I don't think this is true. So long as every library exposes a stream argument to indicate on which stream an operation should occur, then one need only specify the cuda_stream_per_thread at the top level of any operation and it should then be passed down through any other invoked library.

Yes, this is what I meant, sorry for being unclear. That's true, and probably the right move long-term, but right now this is still a bit distant, as we really need to make sure we can pass the stream to all those libraries.

quasiben · 2021-03-02T01:15:13Z

I think this should be re-targeted to 0.19

jakirkham · 2021-03-02T01:38:53Z

Have updated the upstream branch targeted by the PR. Looks like that updated GitHub Projects automatically. Not sure if there's anything else needed in terms of retargeting

kkraus14 · 2021-03-02T01:49:01Z

python/rmm/_cuda/stream.pyx

+if int(os.environ.get("RMM_PER_THREAD_DEFAULT_STREAM", "0")) != 0:
+    DEFAULT_STREAM = Stream._from_cudaStream_t(cuda_stream_per_thread.value())
+else:
+    DEFAULT_STREAM = Stream._from_cudaStream_t(cuda_stream_default.value())


Do we want to allow this to be changed during runtime of an application or only at import time?

I would assume at import time as this is what I think cupy does now

Well CuPy reads the env var (same is done here) so technically there it could be changed at runtime by overwriting os.environ['CUPY_CUDA_PER_THREAD_DEFAULT_STREAM']. However such operation is really a bad practice that is hard to be fool-proofed.

github-actions · 2021-04-03T05:10:18Z

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

pentschev · 2021-04-06T08:52:22Z

I haven't been able to continue working on PTDS, but hopefully can come back to it in the next month or so.

gmarkall · 2021-04-15T22:07:55Z

For reference, I implemented something similar in Numba ( see numba/numba#6936) and went with NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM, for the reasons outlined previously:

In Numba we'd probably want to have an env var called NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM because environment variables starting with NUMBA_ are mapped into the config module such that the config variable is numba.config.CUDA_PER_THREAD_DEFAULT_STREAM, and CUDA_ follows NUMBA_ for variables related to the CUDA target. So a more general CUDA_PTDS or PYTHON_CUDA_PTDS could be taken into account, but would likely duplicate / need resolving with the environment variable / config variable configuration in Numba.

pentschev · 2021-04-15T22:11:28Z

Thanks @gmarkall , I still think having environment variables is a great step forward at least for the short-/mid-term, as this allows us to test things before committing to big changes in the ecosystem.

github-actions · 2021-05-15T23:05:55Z

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions · 2021-11-18T18:01:18Z

This PR has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates.

Support enabling PTDS via CUDA_PTDS environment variable

fdbd8fc

pentschev requested a review from a team as a code owner November 20, 2020 11:58

harrism reviewed Nov 20, 2020

View reviewed changes

python/rmm/_lib/cuda_stream_view.pyx Outdated Show resolved Hide resolved

pentschev added 2 commits November 23, 2020 04:31

Expose cudaStreamLegacy/cudaStreamPerThread in Python

d789117

Rename PTDS environment variable to RMM_PER_THREAD_DEFAULT_STREAM

1ee52af

Switch Python stream types to cudaStream enum

d6c6353

shwina reviewed Nov 23, 2020

View reviewed changes

python/rmm/_lib/cuda_stream_view.pyx Outdated Show resolved Hide resolved

This was referenced Nov 24, 2020

Support for Per Thread Default Stream (PTDS) cupy/cupy#4322

Merged

Support for CUDA streams rapidsai/dask-cuda#96

Open

jakirkham reviewed Nov 30, 2020

View reviewed changes

python/rmm/_lib/cuda_stream_view.pyx Outdated Show resolved Hide resolved

harrism added the Python Related to RMM Python API label Dec 2, 2020

harrism assigned pentschev Dec 2, 2020

pentschev added 2 commits January 6, 2021 09:31

Update RMM_PER_THREAD_DEFAULT_STREAM usage

b1d6aad

Revert unnecessary change to setup.py

7ede244

pentschev mentioned this pull request Feb 12, 2021

PTDS Benchmarks rapidsai/dask-cuda#517

Open

jakirkham mentioned this pull request Feb 17, 2021

[WIP] Build RMM's Python bindings with PTDS #480

Closed

jakirkham changed the base branch from branch-0.18 to branch-0.19 March 2, 2021 01:37

kkraus14 reviewed Mar 2, 2021

View reviewed changes

github-actions bot added the inactive-30d label Apr 3, 2021

pentschev changed the base branch from branch-0.19 to branch-0.20 April 6, 2021 08:51

github-actions bot removed the inactive-30d label Apr 6, 2021

github-actions bot added the inactive-30d label May 15, 2021

github-actions bot added the inactive-90d label Nov 18, 2021

vyasr mentioned this pull request Oct 21, 2022

Always enable per-thread default stream rapidsai/cudf#11281

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Support enabling PTDS via CUDA_PTDS environment variable #633

[WIP] Support enabling PTDS via CUDA_PTDS environment variable #633

pentschev commented Nov 20, 2020

GPUtester commented Nov 20, 2020

harrism left a comment

pentschev commented Nov 20, 2020

harrism commented Nov 22, 2020

pentschev commented Nov 23, 2020

gmarkall commented Nov 23, 2020

pentschev commented Nov 23, 2020

harrism commented Nov 23, 2020

harrism commented Nov 23, 2020 •

edited

Loading

pentschev commented Nov 23, 2020

pentschev commented Nov 23, 2020

gmarkall commented Nov 23, 2020

leofang commented Nov 23, 2020

pentschev commented Nov 23, 2020

pentschev commented Nov 23, 2020

harrism commented Nov 23, 2020

pentschev commented Nov 23, 2020

harrism commented Dec 2, 2020

pentschev commented Jan 6, 2021

jrhemstad commented Jan 8, 2021

pentschev commented Jan 8, 2021

jrhemstad commented Jan 8, 2021

pentschev commented Jan 8, 2021

quasiben commented Mar 2, 2021

jakirkham commented Mar 2, 2021

kkraus14 Mar 2, 2021

quasiben Mar 2, 2021

leofang Mar 4, 2021 •

edited

Loading

github-actions bot commented Apr 3, 2021

pentschev commented Apr 6, 2021

gmarkall commented Apr 15, 2021

pentschev commented Apr 15, 2021

github-actions bot commented May 15, 2021

github-actions bot commented Nov 18, 2021

[WIP] Support enabling PTDS via CUDA_PTDS environment variable #633

Are you sure you want to change the base?

[WIP] Support enabling PTDS via CUDA_PTDS environment variable #633

Conversation

pentschev commented Nov 20, 2020

GPUtester commented Nov 20, 2020

harrism left a comment

Choose a reason for hiding this comment

pentschev commented Nov 20, 2020

harrism commented Nov 22, 2020

pentschev commented Nov 23, 2020

gmarkall commented Nov 23, 2020

pentschev commented Nov 23, 2020

harrism commented Nov 23, 2020

harrism commented Nov 23, 2020 • edited Loading

pentschev commented Nov 23, 2020

pentschev commented Nov 23, 2020

gmarkall commented Nov 23, 2020

leofang commented Nov 23, 2020

pentschev commented Nov 23, 2020

pentschev commented Nov 23, 2020

harrism commented Nov 23, 2020

pentschev commented Nov 23, 2020

harrism commented Dec 2, 2020

pentschev commented Jan 6, 2021

jrhemstad commented Jan 8, 2021

pentschev commented Jan 8, 2021

jrhemstad commented Jan 8, 2021

pentschev commented Jan 8, 2021

quasiben commented Mar 2, 2021

jakirkham commented Mar 2, 2021

kkraus14 Mar 2, 2021

Choose a reason for hiding this comment

quasiben Mar 2, 2021

Choose a reason for hiding this comment

leofang Mar 4, 2021 • edited Loading

Choose a reason for hiding this comment

github-actions bot commented Apr 3, 2021

pentschev commented Apr 6, 2021

gmarkall commented Apr 15, 2021

pentschev commented Apr 15, 2021

github-actions bot commented May 15, 2021

github-actions bot commented Nov 18, 2021

harrism commented Nov 23, 2020 •

edited

Loading

leofang Mar 4, 2021 •

edited

Loading