Adds distributed row gatherer #1589

Open
wants to merge 18 commits into
base: index-map-pgm
from


MarcelKoch
Member

@MarcelKoch MarcelKoch commented Apr 4, 2024

This PR adds a distributed row gatherer. This operator essentially provides the communication required in our matrix apply.

Besides the normal apply (which is blocking), it also provides two asynchronous calls. One version takes an additional workspace parameter, which is used as the send buffer. This version can be called multiple times without restrictions, as long as a different workspace is used for each call. The other version has no workspace parameter and uses an internal buffer instead. As a consequence, it can only be called again once the request returned by the previous call has been waited on; otherwise, it throws.
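
To make the difference between the two asynchronous variants concrete, here is a minimal, MPI-free sketch. Everything in it is hypothetical and only models the described behavior: `toy_request`, `toy_row_gatherer`, and the signatures of `apply_async` are illustrative names, not the interface introduced by this PR, and the "communication" is just a local copy that completes when `wait()` is called.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Toy request (assumption): the "communication" completes in wait().
class toy_request {
public:
    explicit toy_request(std::function<void()> on_wait)
        : on_wait_(std::move(on_wait))
    {}

    void wait()
    {
        if (on_wait_) {
            on_wait_();
            on_wait_ = nullptr;  // waiting twice is a no-op
        }
    }

private:
    std::function<void()> on_wait_;
};

// Hypothetical stand-in for the row gatherer, not Ginkgo's interface.
class toy_row_gatherer {
public:
    explicit toy_row_gatherer(std::vector<std::size_t> send_idxs)
        : send_idxs_(std::move(send_idxs))
    {}

    // Variant 1: the caller owns the send buffer (workspace), so several
    // calls may be in flight as long as each uses its own workspace.
    toy_request apply_async(const std::vector<double>& in,
                            std::vector<double>& out,
                            std::vector<double>& workspace) const
    {
        pack(in, workspace);
        return toy_request([&out, &workspace] { out = workspace; });
    }

    // Variant 2: uses the internal send buffer, so a second call throws
    // until the previous request has been waited on.
    toy_request apply_async(const std::vector<double>& in,
                            std::vector<double>& out)
    {
        if (in_flight_) {
            throw std::runtime_error("previous request not waited on");
        }
        in_flight_ = true;
        pack(in, send_buffer_);
        return toy_request([this, &out] {
            out = send_buffer_;
            in_flight_ = false;
        });
    }

private:
    // Gathers the selected rows of `in` into the send buffer.
    void pack(const std::vector<double>& in, std::vector<double>& buf) const
    {
        buf.clear();
        for (auto idx : send_idxs_) {
            buf.push_back(in.at(idx));
        }
    }

    std::vector<std::size_t> send_idxs_;
    std::vector<double> send_buffer_;
    bool in_flight_ = false;
};

// Usage sketch: a second internal-buffer call throws before wait().
inline std::vector<double> demo()
{
    toy_row_gatherer rg({2, 0});
    const std::vector<double> in{1.0, 2.0, 3.0};
    std::vector<double> out;
    auto req = rg.apply_async(in, out);
    bool threw = false;
    try {
        rg.apply_async(in, out);
    } catch (const std::runtime_error&) {
        threw = true;
    }
    assert(threw);
    req.wait();
    return out;  // holds in[2] and in[0]
}
```

In this model the internal-buffer variant trades flexibility for convenience: the caller does not manage a buffer, but must wait on each request before issuing the next call.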

This is the second part of splitting up #1546.

It also introduces some intermediate changes, which could be extracted out beforehand:

PR Stack:

@MarcelKoch MarcelKoch self-assigned this Apr 4, 2024
@ginkgo-bot ginkgo-bot added reg:build This is related to the build system. reg:testing This is related to testing. mod:core This is related to the core module. type:matrix-format This is related to the Matrix formats labels Apr 4, 2024
@MarcelKoch MarcelKoch requested a review from pratikvn April 4, 2024 10:49
@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch from 6b4521b to ae60198 Compare April 4, 2024 11:00
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 6acf7c4 to 8aa6ab9 Compare April 4, 2024 11:00
@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch 2 times, most recently from 49557f1 to 4a79442 Compare April 5, 2024 08:18
@MarcelKoch MarcelKoch modified the milestone: Ginkgo 1.8.0 Apr 5, 2024
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 8aa6ab9 to 77398bd Compare April 17, 2024 16:28
@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch from 4a79442 to 172eb7d Compare April 17, 2024 16:28
@MarcelKoch MarcelKoch requested a review from upsj April 19, 2024 09:20
@MarcelKoch MarcelKoch mentioned this pull request Apr 19, 2024
7 tasks
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 77398bd to d278cad Compare April 19, 2024 14:39
@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch 2 times, most recently from 98fa10a to 79de4c3 Compare April 19, 2024 16:19
@MarcelKoch
Member Author

One issue I have is with the constructor. It takes a collective_communicator and an index_map. The index_map already defines the communication pattern, so the collective_communicator has to match it.
One option might be to have a virtual function like

std::unique_ptr<collective_communicator> create_with_same_type(communicator, index_map);

If I can't come up with anything better, I guess I will use that.
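
A rough sketch of how such a virtual function could be used follows. All types here are placeholders made up for illustration (`communicator`, `index_map`, `all_to_all_comm`, `neighborhood_comm`, and `row_gatherer_sketch` are not the Ginkgo classes), and the `const` qualifier and reference parameters are assumptions beyond the signature quoted above. The idea is that the constructor only uses the passed-in collective communicator as a prototype for its dynamic type and rebuilds it from the index_map, so the two cannot disagree.

```cpp
#include <cassert>
#include <memory>
#include <string>

struct communicator {};  // placeholder for the MPI communicator wrapper
struct index_map {};     // placeholder for the distributed index map

class collective_communicator {
public:
    virtual ~collective_communicator() = default;

    // Creates a new object of the same dynamic type as *this, with the
    // communication pattern taken from imap (assumed semantics).
    virtual std::unique_ptr<collective_communicator> create_with_same_type(
        const communicator& comm, const index_map& imap) const = 0;

    virtual std::string name() const = 0;
};

class all_to_all_comm : public collective_communicator {
public:
    all_to_all_comm(const communicator&, const index_map&) {}

    std::unique_ptr<collective_communicator> create_with_same_type(
        const communicator& comm, const index_map& imap) const override
    {
        return std::make_unique<all_to_all_comm>(comm, imap);
    }

    std::string name() const override { return "all_to_all"; }
};

class neighborhood_comm : public collective_communicator {
public:
    neighborhood_comm(const communicator&, const index_map&) {}

    std::unique_ptr<collective_communicator> create_with_same_type(
        const communicator& comm, const index_map& imap) const override
    {
        return std::make_unique<neighborhood_comm>(comm, imap);
    }

    std::string name() const override { return "neighborhood"; }
};

// The gatherer keeps only the prototype's type; the pattern comes from imap.
class row_gatherer_sketch {
public:
    row_gatherer_sketch(const communicator& comm,
                        const collective_communicator& prototype,
                        const index_map& imap)
        : coll_comm_(prototype.create_with_same_type(comm, imap))
    {}

    const collective_communicator& get_collective_communicator() const
    {
        return *coll_comm_;
    }

private:
    std::unique_ptr<collective_communicator> coll_comm_;
};
```

With this shape, a caller would pass any instance of the desired communicator type, and the gatherer would hold a fresh instance of that type built from its own index_map.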

@pratikvn
Member

Do we need to have the std::future setup for the release? Can we remove that for now and just use a normal synchronous approach? I think that is a significant change that needs more thought and probably a separate PR.

@MarcelKoch MarcelKoch force-pushed the index-map-pgm branch 2 times, most recently from c432ffc to f8cb0e8 Compare February 18, 2025 15:46
@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch 2 times, most recently from 3ca34b0 to 9ac78ed Compare February 20, 2025 09:46
@MarcelKoch
Member Author

@pratikvn, @yhmtsai I've removed the row gatherer from the LinOp hierarchy for now. It now derives directly from PolymorphicObject instead. You might want to give it a second look because of that.
My reasoning was that apply always creates temporary clones of the vectors, using the executor of the operator. However, with non-GPU-aware MPI, this always leads to an exception being thrown, because the output vector will then always be on the device.
I think it should be possible to handle this correctly with generalized MPI requests, but that might take me quite a while to implement, so I would rather proceed as it is.

@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch from 977738d to 5580ebf Compare March 3, 2025 10:30
MarcelKoch and others added 18 commits March 3, 2025 12:22
- only allocate if necessary
- synchronize correct executor

Co-authored-by: Pratik Nayak <[email protected]>
- split tests into core and backend part
- fix formatting
- fix openmpi pre 4.1.x macro

Co-authored-by: Pratik Nayak <[email protected]>
Co-authored-by: Yu-Hsiang M. Tsai <[email protected]>
Signed-off-by: Marcel Koch <[email protected]>
- add copy/move tests
- undo using MPI_Init_thread
- add extra host_recv_buffer_
- create row-gatherer as unique_ptr

Co-authored-by: Yu-Hsiang M. Tsai <[email protected]>
The `LinOp::apply` function creates temporary clones to match the operator's executor, but this leads to wrong behavior if MPI doesn't support GPU buffers.
Right now the row gatherer doesn't support (blocking) apply, so it doesn't make much sense to keep it as a LinOp.
@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch from 5580ebf to 5d201aa Compare March 3, 2025 11:26
Labels
1:ST:ready-for-review This PR is ready for review mod:core This is related to the core module. reg:build This is related to the build system. reg:testing This is related to testing. type:matrix-format This is related to the Matrix formats
4 participants