Commit

104 MPI crash on Debian/Ubuntu during classifier training (#155)
* Use the right communicator
The "max error flow" messages were not received and remained in the task communicator.

* Code simplification
We use the working slave list instead of the contents of the task communicator.
This is the right approach because we send the messages in the world communicator, not in the task communicator (see the illustrative sketch below).

* Add BugMPIWithErrors in LearningTest/Standard
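
For illustration, here is a minimal standalone MPI sketch (not Khiops code; names are arbitrary, run with e.g. mpirun -n 2) of the rule behind the fix: a message can only be matched by a receive posted on the same communicator it was sent on, so a send on the task communicator that is only answered on MPI_COMM_WORLD stays pending and is later reported as an unmatched message when the communicator is torn down.

    #include <cstdio>
    #include <mpi.h>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int nRank;
        MPI_Comm_rank(MPI_COMM_WORLD, &nRank);

        // Stand-in for the task communicator
        MPI_Comm taskComm;
        MPI_Comm_dup(MPI_COMM_WORLD, &taskComm);

        const int nTag = 7;
        if (nRank == 0)
        {
            // Send on the duplicated communicator only
            int nPayload = 42;
            MPI_Send(&nPayload, 1, MPI_INT, 1, nTag, taskComm);
        }
        else if (nRank == 1)
        {
            int nFlag;
            MPI_Status status;

            // The message never becomes visible on MPI_COMM_WORLD...
            MPI_Iprobe(0, nTag, MPI_COMM_WORLD, &nFlag, &status);
            printf("pending on MPI_COMM_WORLD: %d\n", nFlag); // prints 0

            // ...it can only be matched on the communicator used for the send
            int nPayload;
            MPI_Recv(&nPayload, 1, MPI_INT, 0, nTag, taskComm, MPI_STATUS_IGNORE);
            printf("received %d on taskComm\n", nPayload);
        }

        MPI_Comm_free(&taskComm);
        MPI_Finalize();
        return 0;
    }

If the final receive on taskComm were omitted, the message would remain pending in that communicator, which is the kind of "unmatched message(s)" reported by MPICH in the trace included in the new test below.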
bruno-at-orange authored Feb 19, 2024
1 parent 66df136 commit 89ac61d
Showing 8 changed files with 1,223 additions and 16 deletions.
30 changes: 14 additions & 16 deletions src/Parallel/PLMPI/PLMPIMaster.cpp
@@ -88,10 +88,9 @@ void PLMPIMaster::UpdateMaxErrorFlow()
 		Global::IsMaxErrorFlowReachedPerGravity(Error::GravityMessage));

 	// Send the array to all slaves
-	MPI_Comm_size(*PLMPITaskDriver::GetTaskComm(), &nTaskCommSize);
-	for (i = 1; i < nTaskCommSize; i++)
+	for (i = 0; i < GetTask()->ivGrantedSlaveIds.GetSize(); i++)
 	{
-		context.Send(*PLMPITaskDriver::GetTaskComm(), i, MAX_ERROR_FLOW);
+		context.Send(MPI_COMM_WORLD, GetTask()->ivGrantedSlaveIds.GetAt(i), MAX_ERROR_FLOW);
 		serializer.OpenForWrite(&context);
 		serializer.PutIntVector(&ivGravityReached);
 		serializer.Close();
@@ -958,8 +957,8 @@ int PLMPIMaster::ComputeGlobalProgression(boolean bSlaveProcess)

 void PLMPIMaster::NotifyInterruptionRequested()
 {
-	int nTaskCommSize;
 	int nSlaveRank;
+	int i;
 	PLMPIMsgContext context;
 	PLSerializer serializer;

@@ -968,20 +967,17 @@ void PLMPIMaster::NotifyInterruptionRequested()
 	if (GetTracerProtocol()->GetActiveMode())
 		GetTracerProtocol()->AddTrace("Send Interruption requested");

-	MPI_Comm_size(*PLMPITaskDriver::GetTaskComm(), &nTaskCommSize);
-	for (nSlaveRank = 1; nSlaveRank < nTaskCommSize; nSlaveRank++)
+	for (i = 0; i < GetTask()->ivGrantedSlaveIds.GetSize(); i++)
 	{
-		if (not PLMPITaskDriver::GetDriver()->IsFileServer(nSlaveRank))
-		{
-			if (GetTracerMPI()->GetActiveMode())
-				GetTracerMPI()->AddSend(nSlaveRank, INTERRUPTION_REQUESTED);
+		nSlaveRank = GetTask()->ivGrantedSlaveIds.GetAt(i);
+		if (GetTracerMPI()->GetActiveMode())
+			GetTracerMPI()->AddSend(nSlaveRank, INTERRUPTION_REQUESTED);

-			// Send using a serializer, because the message may be received by
-			// PLMPISlave, which expects a serializer
-			context.Isend(MPI_COMM_WORLD, nSlaveRank, INTERRUPTION_REQUESTED);
-			serializer.OpenForWrite(&context);
-			serializer.Close();
-		}
+		// We send using a serializer, because the message may be received by
+		// PLMPISlave, which expects a serializer
+		context.Isend(MPI_COMM_WORLD, nSlaveRank, INTERRUPTION_REQUESTED);
+		serializer.OpenForWrite(&context);
+		serializer.Close();
 	}
 	bInterruptionRequested = true;
 	bStopOrderDone = true;
@@ -1036,6 +1032,8 @@ void PLMPIMaster::DischargePendingCommunication(int nRank, int nTag)
 	if (PLParallelTask::GetVerbose())
 		TraceWithRank(sTmp + "discharge pending comm from " + IntToString(status.MPI_SOURCE) +
 			      " with tag " + GetTagAsString(status.MPI_TAG));
+	cout << GetProcessId() << " "
+	     << "Discharge pending com" << endl;
 	// Receive the message
 	ReceivePendingMessage(status);
 }
@@ -0,0 +1,21 @@


Dictionary Adult
{
Numerical Label ;
Numerical age ;
Categorical workclass ;
Numerical fnlwgt ;
Categorical education ;
Numerical education_num ;
Categorical marital_status ;
Categorical occupation ;
Categorical relationship ;
Categorical race ;
Numerical sex ;
Numerical capital_gain ;
Numerical capital_loss ;
Numerical hours_per_week ;
Categorical native_country ;
Categorical class ;
};
70 changes: 70 additions & 0 deletions test/LearningTest/TestKhiops/Standard/BugMPIWithErrors/readme.txt
@@ -0,0 +1,70 @@
Reference:
- issue on the Khiops GitHub: MPI crash on Debian/Ubuntu during classifier training #104
- bug detected by Nicolas in a multi-table context, with many orphan lines in the secondary tables

The bug occurs under the following conditions:
- in parallel (for example, three cores)
- on Debian 10 or 11 (and Ubuntu 22?)
- with a recent MPI version (4.1)
- when many errors are transmitted between the slaves and the master

Tests were carried out on the branch of the repo cloned by Stephane: Bug-MPI
- cf. Stephane's cloned repo: Bug MPI traces 1
- traces added in the StartFileServers and StopFileServers methods of PLMPITaskDriver to pinpoint the problem
- crashes in StopFileServers, at the call to MPI_Barrier

Nicolas' minimal test
- Adult database, with the type of one variable changed from Categorical to Numerical (see the excerpt below)
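
For reference, the relevant change in the test dictionary shown above: the sex variable, normally Categorical, is declared Numerical, so every record produces a conversion warning and floods the master with error messages.

    original : Categorical sex ;
    modified : Numerical sex ;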

Crashes with the following trace:
warning : Data table Adult.txt : Record 3 : Numerical variable sex: value <Male> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 4 : Numerical variable sex: value <Male> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 5 : Numerical variable sex: value <Male> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 7 : Numerical variable sex: value <Female> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 8 : Numerical variable sex: value <Female> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 9 : Numerical variable sex: value <Male> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 10 : Numerical variable sex: value <Female> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 11 : Numerical variable sex: value <Male> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 14 : Numerical variable sex: value <Female> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 15 : Numerical variable sex: value <Male> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 16 : Numerical variable sex: value <Male> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 17 : Numerical variable sex: value <Male> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 18 : Numerical variable sex: value <Male> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 19 : Numerical variable sex: value <Male> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 20 : Numerical variable sex: value <Male> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 22 : Numerical variable sex: value <Male> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 24 : Numerical variable sex: value <Male> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 25 : Numerical variable sex: value <Male> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 26 : Numerical variable sex: value <Female> converted to <> (Unconverted string)
warning : Data table Adult.txt : Record 27 : Numerical variable sex: value <Male> converted to <> (Unconverted string)
warning : Data table : ...
error : MPI driver : Other MPI error, error stack:
internal_Comm_disconnect(81)...: MPI_Comm_disconnect(comm=0x55fe5eead028) failed
MPID_Comm_disconnect(493)......:
MPIR_Comm_free_impl(809).......:
MPIR_Comm_delete_internal(1224): Communicator (handle=84000003) being freed has 1 unmatched message(s)
Abort(274287887) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 274287887) - process 1
error : MPI driver : Other MPI error, error stack:
internal_Comm_disconnect(81)...: MPI_Comm_disconnect(comm=0x56352c201028) failed
MPID_Comm_disconnect(493)......:
MPIR_Comm_free_impl(809).......:
MPIR_Comm_delete_internal(1224): Communicator (handle=84000003) being freed has 1 unmatched message(s)
Abort(408505615) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 408505615) - process 2
error : MPI driver : Other MPI error, error stack:
internal_Comm_disconnect(81)...: MPI_Comm_disconnect(comm=0x55e844d37028) failed
MPID_Comm_disconnect(493)......:
MPIR_Comm_free_impl(809).......:
MPIR_Comm_delete_internal(1224): Communicator (handle=84000003) being freed has 1 unmatched message(s)
Abort(542723343) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 542723343) - process 3
error : MPI driver : Other MPI error, error stack:
internal_Comm_disconnect(81)...: MPI_Comm_disconnect(comm=0x55935d30f028) failed
MPID_Comm_disconnect(493)......:
MPIR_Comm_free_impl(809).......:
MPIR_Comm_delete_internal(1224): Communicator (handle=84000003) being freed has 1 unmatched message(s)
Abort(207179023) on node 4 (rank 4 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 207179023) - process 4
error : MPI driver : Other MPI error, error stack:
internal_Comm_disconnect(81)...: MPI_Comm_disconnect(comm=0x5585f7750028) failed
MPID_Comm_disconnect(493)......:
MPIR_Comm_free_impl(809).......:
MPIR_Comm_delete_internal(1224): Communicator (handle=84000003) being freed has 1 unmatched message(s)
Abort(744049935) on node 5 (rank 5 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 744049935) - process 5