
Reservoir coupling: Implement time stepping #5620

Merged
merged 33 commits into from
Jan 21, 2025

Conversation

hakonhagland
Contributor

Builds on #5521 which should be merged first.

Also refactors the original adaptive time stepping method.

@bska
Member

bska commented Sep 19, 2024

Builds on #5521 which should be merged first.

Okay, in that case I'm going to mark this as "work in progress/draft" to avoid inadvertent merging.

@bska bska marked this pull request as draft September 19, 2024 17:52
@hakonhagland
Contributor Author

I'm going to mark this as "work in progress/draft" to avoid inadvertent merging.

@bska Thanks. I was actually planning to get this merged, since it is becoming difficult to handle all the dependent PRs and keep them up to date. Is this a good idea, or should I wait until the complete reservoir coupling feature is ready?

@atgeirr
Member

atgeirr commented Sep 23, 2024

Is this a good idea, or should I wait until the complete reservoir coupling feature is ready?

It is a very good idea. Do not wait for the complete feature, unless you expect significant changes to the parts you already did. This patch has grown to become quite large, so finishing and polishing the first PRs in the series so they can be merged should be a high priority.

@hakonhagland
Contributor Author

hakonhagland commented Sep 24, 2024

It is a very good idea.

@atgeirr Ok I agree. I have removed the draft state on #5436 as it is the first in the dependency chain and we should try to merge that one first.

@hakonhagland hakonhagland force-pushed the timestepping2 branch 3 times, most recently from ac5ff0f to 27a09b4 Compare October 8, 2024 19:05
@hakonhagland hakonhagland force-pushed the timestepping2 branch 2 times, most recently from 892574a to 21ded73 Compare November 24, 2024 14:02
@hakonhagland hakonhagland marked this pull request as ready for review November 25, 2024 12:19
@blattms
Member

blattms commented Nov 27, 2024

jenkins build this serial please

@blattms
Member

blattms commented Nov 27, 2024

I think the jenkins failure is due to unconditional usage of MPI and unconditionally including mpi.h. Unfortunately, we require compiling without MPI to work. There is a subtle reason for this, but I fail to remember it.

@blattms blattms left a comment

Review of commit 3564e1d, files .../flow/ReservoirCoupling*

Do we really need the TimePoint class?


// Equality operator
bool operator==(const TimePoint& other) const {
    return std::abs(time - other.time) < tol;
}
Member

This will be very surprising to everybody else. I would really prefer to make this fuzzy comparison explicit where it is used, e.g. using a special free function or creating a wrapper for the comparison at the call site.

I also think that this might break quite a few assumptions for the other comparison operators: e.g. if a==b is true then a<b can be true too, and even when a==b is true, a<=b might be false.

Comment on lines 115 to 126
int result = MPI_Recv(
&slave_next_report_time_offset,
/*count=*/1,
/*datatype=*/MPI_DOUBLE,
/*source_rank=*/0,
/*tag=*/static_cast<int>(MessageTag::SlaveNextReportDate),
*this->master_slave_comm_[i].get(),
MPI_STATUS_IGNORE
);
if (result != MPI_SUCCESS) {
OPM_THROW(std::runtime_error, "Failed to receive next report date from slave process");
}
Member

If we can wrap the communicator in a Communicator<MPI_Comm>, then you can simplify this to

this->master_slave_comm_[i].recv(slave_next_report_time_offset, 0, static_cast<int>(MessageTag::SlaveNextReportDate));

Please don't put extra throw clauses here. The default MPI error handler is MPI_ERRORS_ARE_FATAL, which will call MPI_Abort. To get return codes one would have to reset it to something else. Hence the if branch will never be taken unless we reset it.

It is also not clear to me how our simulator will handle exceptions here. It might just cut the time step...

If we ever want exceptions for failing MPI calls we can use a custom throwing MPI_Errhandler. I do that in some of my tests.

Contributor Author

Please don't put extra throw clauses here.

I have added a custom error handler for each slave-master communicator that will eventually call MPI_Abort(). This way we do not need to check the return value for each MPI_Recv() and MPI_Send() call. See the latest commit.
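The pattern described here could look roughly like the following sketch (assumptions: a plain MPI_Comm; the names abort_error_handler and install_abort_handler are hypothetical, not the actual OPM code). It requires an MPI environment to build and run.

```cpp
#include <mpi.h>
#include <cstdio>

// Hypothetical custom error handler: log the MPI error string, then abort.
// Attaching this per communicator means individual MPI_Send()/MPI_Recv()
// calls no longer need to check their return codes.
static void abort_error_handler(MPI_Comm* comm, int* err_code, ...)
{
    char err_string[MPI_MAX_ERROR_STRING];
    int len = 0;
    MPI_Error_string(*err_code, err_string, &len);
    std::fprintf(stderr, "MPI error on master-slave communicator: %s\n",
                 err_string);
    MPI_Abort(*comm, *err_code);
}

void install_abort_handler(MPI_Comm comm)
{
    MPI_Errhandler handler;
    MPI_Comm_create_errhandler(abort_error_handler, &handler);
    MPI_Comm_set_errhandler(comm, handler);
    // Safe to free here: the communicator keeps its own reference
    // to the handler after MPI_Comm_set_errhandler().
    MPI_Errhandler_free(&handler);
}
```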

@@ -53,10 +56,12 @@ class ReservoirCouplingMaster {

const Parallel::Communication &comm_;
const Schedule& schedule_;
std::time_t start_date_; // Master process' simulation start date
Member

Is there a specific reason for this type, or should we maybe use the same type for the start date everywhere? It seems boost::gregorian::date is used in SimulationTimer.

@hakonhagland
Contributor Author

hakonhagland commented Nov 28, 2024

Unfortunately, we require compiling without MPI to work.

@blattms How do I build locally without MPI? I tried cmake option -DUSE_MPI=NO but opm-grid fails to build with error /usr/bin/ld: cannot find -lMPI::MPI_C: No such file or directory

@hakonhagland
Contributor Author

jenkins build this serial please

1 similar comment
@hakonhagland
Contributor Author

jenkins build this serial please

@hakonhagland
Contributor Author

jenkins build this serial please

@hakonhagland
Contributor Author

jenkins build this serial please

@hakonhagland
Contributor Author

jenkins build this serial please

@hakonhagland
Contributor Author

jenkins build this serial please

Don't remove these entries from the map before reservoir coupling is
completely supported.
Determine size of std::size_t correctly for all platforms using
Dune::MPITraits<std::size_t>::getType()
It is safe to free the error handler after MPI_Comm_set_errhandler()
has been called
Convert comment blocks into doxygen type comments
After rebasing on master some changes to AdaptiveTimeStepping.hpp and
AdaptiveTimeStepping_impl.hpp were missed
@hakonhagland
Contributor Author

jenkins build this please

@hakonhagland
Contributor Author

jenkins build this please

@hakonhagland
Contributor Author

jenkins build this please

@hakonhagland
Contributor Author

hakonhagland commented Jan 19, 2025

What happens if one of slaves would crash or abort?

@blattms I think that would be detected by the master. I tested it now by purposely creating a segmentation fault in a slave. It was immediately noticed by the master, which called MPI_Abort() with: mpirun noticed that process rank 1 with PID 0 on node hakon-Precision-3570 exited on signal 11 (Segmentation fault). Similarly, I think the master will abort the slaves and itself if any slave exits without calling MPI_Finalize() first (for example, due to an uncaught exception). However, if a slave exits normally using e.g. MPI_Finalize(); exit(0); the master will not detect it unless it later tries to communicate with the terminated slave, in which case the error handler in the master should be called. I guess it can check the error code there to determine if the slave has terminated normally?

@hakonhagland
Contributor Author

What happens if someone starts a simulation using the --slave parameter

@blattms The first thing to note is that the parameter is hidden, so the user will not know that it exists. Second, if the user gives the --slave parameter by accident, it will be detected here that the simulation is not part of a reservoir coupling run:

OPM_THROW(std::runtime_error, "Slave process is not spawned by a master process");

and the simulation will be aborted.

@blattms
Member

blattms commented Jan 20, 2025

jenkins build this serial please

@blattms
Member

blattms commented Jan 20, 2025

Sorry, but building without MPI does not work. This prevents merging.

@hakonhagland
Contributor Author

jenkins build this serial please

@hakonhagland
Contributor Author

Is there some documentation about what we are achieving in the time stepper? Where are the synchronization points exactly, and how do we accomplish it?

@blattms Good idea. I have added a description in the latest commit.

Adds developer documentation about the timestepping procedure.
@hakonhagland
Contributor Author

jenkins build this please

@hakonhagland
Contributor Author

Is there some documentation about what we are achieving in the time stepper?

@blattms See also #5893.

@blattms
Member

blattms commented Jan 21, 2025

jenkins build this serial please

@blattms blattms merged commit 990c3f0 into OPM:master Jan 21, 2025
1 check passed