Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add API flag for whether an AQL queue should be allocated in device memory. #269

Open
benvanik opened this issue Dec 16, 2024 · 9 comments
Assignees

Comments

@benvanik
Copy link
Contributor

In 3baaa6e @saleelk added the ability to allocate an AQL queue in device memory via the HSA_ALLOCATE_QUEUE_DEV_MEM environment variable. I'd like to be able to use this programmatically with a queue creation flag as some of my queues are better in system memory and others in device memory (particularly those where I'm producing for the queue on the same device consuming it). In addition, as a middleware layer I don't control the environment and options that are set via them are generally not accessible.

hsa_queue_create doesn't seem to have a flags arg and I'm not sure of the best route for adding one such that creation could be controlled. Maybe an hsa_amd_queue_create?

@saleelk
Copy link
Contributor

saleelk commented Dec 16, 2024

yeah, I got dragged elsewhere and sort of paused on this end. Yes, we would need a new API that takes flags. I had a conversation with ROCr owner about this, but need to start that over again

@dayatsin-amd dayatsin-amd self-assigned this Dec 16, 2024
@benvanik
Copy link
Contributor Author

benvanik commented Dec 16, 2024

That'd be wonderful! I haven't yet gotten my code working with it as I'm getting GPU faults when HSA_ALLOCATE_QUEUE_DEV_MEM=1, but that's likely my issue (or an issue with some of my queues needing to be in host memory and some in device memory). Thanks for pushing on this: I'm round-tripping through system memory for queue operations and it's taking an eternity :)

@saleelk
Copy link
Contributor

saleelk commented Dec 17, 2024

i've looped you in on an internal chat

@atgutier
Copy link
Contributor

atgutier commented Jan 6, 2025

@saleelk or @dayatsin-amd are either of you taking a look at this? If not, I can create a PR for this.

@benvanik
Copy link
Contributor Author

benvanik commented Jan 6, 2025

Related to this, currently the amd_queue_t ends up in host memory and only the ringbuffer is going into device memory with the flag enabled. The queue struct is hot (write_dispatch_id/read_dispatch_id, etc) and we want that colocated with the ringbuffer in device memory as well.

@atgutier
Copy link
Contributor

@saleelk I'm taking a look at this and it seems fairly straightforward to add an API-based queue allocation mechanism for getting the queue buf in dev mem, additionally, using the same finegrain allocation method seems to work for the queue struct itself.

If I create a PR to support this that removes the ENV variable would that be ok? Or are you already relying on the ENV variable? If so I can just mark it deprecated so future users rely on the API and not the ENV var.

@misos1
Copy link

misos1 commented Feb 17, 2025

@atgutier @saleelk Does this require some specific hardware? I noticed in both 3baaa6e and #284 that after is aql or amd_queue allocated on gpu it is then accessed by the host but that does not generally work on my system and when I use HSA_ALLOCATE_QUEUE_DEV_MEM I get SIGSEGV at

(((core::AqlPacket*)ring_buf_)[pkt_id]).dispatch.header = HSA_PACKET_TYPE_INVALID;

@atgutier
Copy link
Contributor

@atgutier @saleelk Does this require some specific hardware? I noticed in both 3baaa6e and #284 that after is aql or amd_queue allocated on gpu it is then accessed by the host but that does not generally work on my system and when I use HSA_ALLOCATE_QUEUE_DEV_MEM I get SIGSEGV at

ROCR-Runtime/runtime/hsa-runtime/core/runtime/amd_aql_queue.cpp

Line 134 in 26f001d
(((core::AqlPacket*)ring_buf_)[pkt_id]).dispatch.header = HSA_PACKET_TYPE_INVALID;

Can you check whether or not you have Resizeable BAR support enabled for your system?

@misos1
Copy link

misos1 commented Feb 17, 2025

I enabled "Above 4G Decoding" and it seems that now it is possible to access gpu memory from the host, thanks and looking forward to your PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants