Add API flag for whether an AQL queue should be allocated in device memory. #269

benvanik · 2024-12-16T16:11:49Z

In 3baaa6e, @saleelk added the ability to allocate an AQL queue in device memory via the HSA_ALLOCATE_QUEUE_DEV_MEM environment variable. I'd like to be able to control this programmatically with a queue creation flag, as some of my queues are better in system memory and others in device memory (particularly those where I'm producing for the queue on the same device that consumes it). In addition, as a middleware layer I don't control the environment, so options set through environment variables are generally not accessible to me.

hsa_queue_create doesn't seem to have a flags argument, and I'm not sure of the best route for adding one so that placement could be controlled at creation time. Maybe an hsa_amd_queue_create?


saleelk · 2024-12-16T19:08:37Z

Yeah, I got dragged elsewhere and sort of paused on this end. Yes, we would need a new API that takes flags. I had a conversation with the ROCr owner about this, but need to start that over again.

benvanik · 2024-12-16T23:51:48Z

That'd be wonderful! I haven't yet gotten my code working with it as I'm getting GPU faults when HSA_ALLOCATE_QUEUE_DEV_MEM=1, but that's likely my issue (or an issue with some of my queues needing to be in host memory and some in device memory). Thanks for pushing on this: I'm round-tripping through system memory for queue operations and it's taking an eternity :)

saleelk · 2024-12-17T20:03:16Z

I've looped you into an internal chat.

atgutier · 2025-01-06T17:12:48Z

@saleelk or @dayatsin-amd are either of you taking a look at this? If not, I can create a PR for this.

benvanik · 2025-01-06T18:05:44Z

Related to this: currently, with the flag enabled, only the ring buffer goes into device memory, while the amd_queue_t still ends up in host memory. The queue struct is hot (write_dispatch_id/read_dispatch_id, etc.) and we want it colocated with the ring buffer in device memory as well.
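To make the "hot" point concrete, here is a minimal, self-contained C sketch of an AQL-style ring. It is a simplified mock, not ROCr's code: `mock_aql_packet_t`, `mock_queue_t`, and the `mock_queue_*` functions are hypothetical stand-ins for the 64-byte AQL packet and amd_queue_t; only the `write_dispatch_id`/`read_dispatch_id` names come from the comment above. Roughly, every submission bumps the write index and writes a ring slot, and every retirement bumps the read index and touches the slot again, so if the indices stay in host memory while the ring moves to device memory, each packet still involves both locations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* HSA_PACKET_TYPE_INVALID is 1 in hsa.h; redefined here so the sketch builds standalone. */
enum { MOCK_PACKET_TYPE_INVALID = 1 };

/* Hypothetical stand-in for a 64-byte AQL packet slot (NOT the real hsa.h layout). */
typedef struct {
  uint16_t header;   /* packet type lives in the header */
  uint8_t body[62];  /* rest of the 64-byte slot */
} mock_aql_packet_t;
_Static_assert(sizeof(mock_aql_packet_t) == 64, "AQL packets are 64 bytes");

/* Hypothetical stand-in for the hot part of amd_queue_t (NOT the real layout). */
typedef struct {
  uint64_t write_dispatch_id;  /* advanced by the producer on every submit */
  uint64_t read_dispatch_id;   /* advanced by the consumer on every retire */
  uint32_t size;               /* slot count; HSA requires a power of two */
  mock_aql_packet_t *ring;     /* ring buffer base */
} mock_queue_t;

static void mock_queue_init(mock_queue_t *q, mock_aql_packet_t *ring, uint32_t size) {
  assert(size != 0 && (size & (size - 1)) == 0);
  q->write_dispatch_id = 0;
  q->read_dispatch_id = 0;
  q->size = size;
  q->ring = ring;
  for (uint32_t i = 0; i < size; ++i) {
    memset(&ring[i], 0, sizeof ring[i]);
    ring[i].header = MOCK_PACKET_TYPE_INVALID;  /* empty slots start INVALID */
  }
}

/* Producer: reserve the next slot, or return NULL if the ring is full.
   The slot index is the dispatch id masked by (size - 1). */
static mock_aql_packet_t *mock_queue_reserve(mock_queue_t *q, uint64_t *out_id) {
  if (q->write_dispatch_id - q->read_dispatch_id >= q->size) return NULL;
  uint64_t id = q->write_dispatch_id++;
  *out_id = id;
  return &q->ring[id & (q->size - 1)];
}

/* Consumer: retire the oldest packet; in this sketch it resets the slot to INVALID. */
static void mock_queue_retire(mock_queue_t *q) {
  assert(q->read_dispatch_id < q->write_dispatch_id);
  uint64_t id = q->read_dispatch_id++;
  q->ring[id & (q->size - 1)].header = MOCK_PACKET_TYPE_INVALID;
}
```

With a 4-slot ring, four reservations fill slots 0 to 3, a fifth returns NULL, and after one retirement dispatch id 4 wraps back to slot 0. Both index updates sit on the per-packet path, which is why placing them next to the ring matters.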

atgutier · 2025-01-15T20:06:51Z

@saleelk I'm taking a look at this, and it seems fairly straightforward to add an API-based queue allocation mechanism for placing the queue buffer in device memory. Additionally, using the same fine-grained allocation method seems to work for the queue struct itself.

If I create a PR to support this that removes the ENV variable, would that be OK? Or are you already relying on the ENV variable? If so, I can just mark it deprecated so future users rely on the API rather than the ENV var.

misos1 · 2025-02-17T20:37:15Z

@atgutier @saleelk Does this require specific hardware? I noticed in both 3baaa6e and #284 that after the AQL queue or amd_queue is allocated on the GPU, it is then accessed by the host. That does not generally work on my system: when I use HSA_ALLOCATE_QUEUE_DEV_MEM, I get a SIGSEGV at

ROCR-Runtime/runtime/hsa-runtime/core/runtime/amd_aql_queue.cpp, line 134 in 26f001d:

    (((core::AqlPacket*)ring_buf_)[pkt_id]).dispatch.header = HSA_PACKET_TYPE_INVALID;

atgutier · 2025-02-17T22:47:18Z

Can you check whether Resizable BAR support is enabled on your system?

misos1 · 2025-02-17T23:16:41Z

I enabled "Above 4G Decoding", and now it seems possible to access GPU memory from the host. Thanks, and looking forward to your PR.

dayatsin-amd self-assigned this Dec 16, 2024

benvanik mentioned this issue Jan 8, 2025 in "Burndown list of HSA/ROCR issues for initial milestone" iree-org/iree#19636 (open, 11 tasks)

atgutier linked a pull request Jan 22, 2025 that will close this issue: "Add flags to the core queue interface for device-side ring buf/queue descriptor allocation" #284 (open)

atgutier added the Feature Request label Jan 22, 2025

atgutier removed a link to a pull request Feb 18, 2025: "Add flags to the core queue interface for device-side ring buf/queue descriptor allocation" #284 (open)
