
malloc(): unsorted double linked list corrupted errors from DelphesPythia8_EDM4HEP #136

Open
bistapf opened this issue Nov 12, 2024 · 34 comments


@bistapf
Contributor

bistapf commented Nov 12, 2024

For about two weeks now, a large fraction (more than 2/3) of DelphesPythia8_EDM4HEP batch jobs submitted with EventProducer have been failing with the following error:

malloc(): unsorted double linked list corrupted

This happens during the initialization of the Delphes modules. However, some of the jobs still run fine with exactly the same configuration.

The jobs are submitted from lxplus, on AlmaLinux 9. The error happens with both of the latest key4hep releases, i.e. -r 2024-10-03 and -r 2024-10-28.

I've attached a zip file with all the log files for a job that failed (condor_job.000000125.7228699.61.x), and for comparison a .log for one that worked (log_successful_job.log - perhaps it depends on the condor node whether the error occurs?). The job config and the script that failed are also included (job_desc_lhep8.cfg and job000000083.sh).

malloc_error_logs.zip

I will still test whether the error also occurs when running locally, or only on condor, and report back. Edit: I can confirm that the script also fails locally.

@bistapf
Contributor Author

bistapf commented Nov 12, 2024

Tagging @juliagonski who reported the same issue.

@andresailer

Hi @bistapf, could you attach, link, or paste:

/afs/cern.ch/user/b/bistapf/Dev_EvtProducer/EventProducer/edm4hep_output_config.tcl

Thanks!

@andresailer

I can reproduce locally with https://raw.githubusercontent.com/key4hep/k4SimDelphes/refs/heads/main/examples/edm4hep_output_config.tcl, but not with the debug build.

@bistapf
Contributor Author

bistapf commented Nov 12, 2024

Hi @andresailer , here's the output config: edm4hep_output_config.tcl.

So you're saying that when setting up with, e.g., source /cvmfs/sw.hsf.org/key4hep/setup.sh -d -r 2024-10-28, you didn't see the error?

@andresailer

Thanks!

And yes, when adding -d I don't see the error.

@andresailer

andresailer commented Nov 12, 2024

Valgrind, plus some hot patching via LD_PRELOAD

void ExRootConfReader::ReadFile(const char*, bool)
==2484287== Invalid read of size 1
==2484287==    at 0x5915323: TclCompileString (in /cvmfs/sw.hsf.org/key4hep/releases/2024-10-03/x86_64-almalinux9-gcc14.2.0-opt/delphes/master-ic3lyz/lib/libDelphes.so)
==2484287==    by 0x591A271: SetByteCodeFromAny (in /cvmfs/sw.hsf.org/key4hep/releases/2024-10-03/x86_64-almalinux9-gcc14.2.0-opt/delphes/master-ic3lyz/lib/libDelphes.so)
==2484287==    by 0x5909369: Tcl_EvalObj (in /cvmfs/sw.hsf.org/key4hep/releases/2024-10-03/x86_64-almalinux9-gcc14.2.0-opt/delphes/master-ic3lyz/lib/libDelphes.so)
==2484287==    by 0x486B9AA: ExRootConfReader::ReadFile(char const*, bool) (in /home/sailer/temp/debugDelphes/job000000083_mgp8_pp_h012j_5f_hmumu/preload/build/libpreload.so)
==2484287==    by 0x440C0A: int doit<podio::ROOTWriter>(int, char**, DelphesInputReader&) (in /cvmfs/sw.hsf.org/key4hep/releases/2024-10-03/x86_64-almalinux9-gcc14.2.0-opt/k4simdelphes/00-07-qqaftw/bin/Delphes>
==2484287==    by 0x414343: main (in /cvmfs/sw.hsf.org/key4hep/releases/2024-10-03/x86_64-almalinux9-gcc14.2.0-opt/k4simdelphes/00-07-qqaftw/bin/DelphesPythia8_EDM4HEP)
==2484287==  Address 0x22cc7347 is 0 bytes after a block of size 40,679 alloc'd
==2484287==    at 0x485465E: operator new[](unsigned long) (vg_replace_malloc.c:729)
==2484287==    by 0x486B968: ExRootConfReader::ReadFile(char const*, bool) (in /home/sailer/temp/debugDelphes/job000000083_mgp8_pp_h012j_5f_hmumu/preload/build/libpreload.so)
==2484287==    by 0x440C0A: int doit<podio::ROOTWriter>(int, char**, DelphesInputReader&) (in /cvmfs/sw.hsf.org/key4hep/releases/2024-10-03/x86_64-almalinux9-gcc14.2.0-opt/k4simdelphes/00-07-qqaftw/bin/Delphes>
==2484287==    by 0x414343: main (in /cvmfs/sw.hsf.org/key4hep/releases/2024-10-03/x86_64-almalinux9-gcc14.2.0-opt/k4simdelphes/00-07-qqaftw/bin/DelphesPythia8_EDM4HEP)
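
(For context on the "hot patching via LD_PRELOAD" above: the idea is to build a small shared library that defines the same symbol as the one you want to replace and preload it, so the dynamic linker resolves the call to the patched copy instead of the one in libDelphes.so. Below is a minimal, generic sketch of that mechanism, interposing fopen() to log which files a program opens; it is not the actual patch used in this thread, just an illustration of the technique.)

// preload_demo.cc -- generic LD_PRELOAD interposition sketch (not the actual patch).
// Build: g++ -shared -fPIC preload_demo.cc -o libpreload_demo.so -ldl
// Run:   LD_PRELOAD=./libpreload_demo.so DelphesPythia8_EDM4HEP card.tcl ...
#include <dlfcn.h>
#include <cstdio>

extern "C" FILE *fopen(const char *path, const char *mode)
{
  // Look up the "real" fopen that our definition shadows.
  using fopen_fn = FILE *(*)(const char *, const char *);
  static fopen_fn real_fopen =
      reinterpret_cast<fopen_fn>(dlsym(RTLD_NEXT, "fopen"));
  std::fprintf(stderr, "[preload] fopen(%s, %s)\n", path, mode);
  return real_fopen(path, mode);
}
// The same mechanism works for a C++ method such as ExRootConfReader::ReadFile:
// compile a translation unit that defines that (mangled) symbol and preload it.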

This buffer
https://github.com/delphes/delphes/blob/0a4a3993d08465d7599174baeb364703b8a73fa4/external/ExRootAnalysis/ExRootConfReader.cc#L64
should probably be one byte bigger? At least that makes valgrind happy, and it also avoids the malloc crash.

PS: Does not seem to be the thing causing this, at least not by itself...
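
To illustrate the suspected off-by-one, here is a minimal, self-contained sketch of the pattern (hypothetical names; not the actual ExRootConfReader code): a file is read into a buffer sized exactly to the file length, but a Tcl-style string routine then scans for a terminating '\0' and reads one byte past the allocation. Reserving one extra byte and null-terminating avoids the out-of-bounds read that Valgrind reports.

// buffer_sketch.cc -- minimal illustration of the suspected off-by-one pattern
// (hypothetical stand-in, not the Delphes ExRootConfReader implementation).
#include <cstring>
#include <fstream>
#include <iostream>

// Stand-in for a Tcl-style routine that expects a NUL-terminated string.
static std::size_t scan_script(const char *script)
{
  return std::strlen(script);  // walks the buffer until it finds '\0'
}

int main(int argc, char **argv)
{
  if(argc < 2) { std::cerr << "usage: " << argv[0] << " file.tcl\n"; return 1; }

  std::ifstream in(argv[1], std::ios::binary | std::ios::ate);
  const std::streamsize length = in.tellg();
  in.seekg(0, std::ios::beg);

  // Buggy version (what the Valgrind report points at):
  //   char *buffer = new char[length];   // no room for the terminator
  //   in.read(buffer, length);
  //   scan_script(buffer);               // reads 1 byte past the block

  // Fixed version: one extra byte, explicitly NUL-terminated.
  char *buffer = new char[length + 1];
  in.read(buffer, length);
  buffer[length] = '\0';

  std::cout << "script length: " << scan_script(buffer) << " bytes\n";
  delete[] buffer;
  return 0;
}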

@bistapf
Contributor Author

bistapf commented Nov 12, 2024

Running with a local standalone installation of Delphes and the same inputs/config worked fine. So I don't think it's a Delphes issue?

Tagging @selvaggi and @pavel-demin anyway for the buffer comment above.

@andresailer

@bistapf How did you compile delphes?

@bistapf
Contributor Author

bistapf commented Nov 12, 2024

I followed the instructions from the workbook here, but I used pythia8310 just because I already had it installed; I'm not sure that exact version is needed. That build provides DelphesPythia8, which I ran with the same Delphes and Pythia cards as in the script, and that worked fine. I could, however, not get Delphes to compile without a local installation of Pythia, i.e. with just the LCG setup.

Btw, I cannot confirm that adding the debug flag fixes it; for me it still throws the malloc error (output attached).
log_local_test_w_debug.txt

@andresailer

Maybe the whole issue is so flaky that one really has to run many times to draw conclusions.

@bistapf
Contributor Author

bistapf commented Nov 12, 2024

Yes, I'm afraid that might be the case. Judging by the fraction of condor jobs that hit the error, the chance of failure is quite high though (unacceptably high, unfortunately). Of the last set of 100 jobs I tested, only 17 worked and all the others hit the malloc error.

Also, do you have any idea why this suddenly appeared out of (seemingly) nowhere?

I still ran ~500 jobs without any issues on 28.10 with the same setup; then, from 29.10 on, this started happening. So at first I thought it could be a problem with the new release @jmcarcell kindly made available for me, but reverting to the one from 03.10 also did not fix it. Maybe it could still be related somehow, though?

@tmadlener
Contributor

What is the difference in stacks between 03.10. and 28.10.? Are they completely separate? Or are they sharing some packages?

Have you tried using the debug stack for running on the batch system? Does that also have the same failure rate?

@bistapf
Contributor Author

bistapf commented Nov 12, 2024

I think the difference should only be the k4SimDelphes version, to use the changes we made for the PID collection in #131. @jmcarcell could you confirm? Thanks!
The jobs without issues from 28.10 did run with this release though, IIRC. Definitely with the split PID collections.

The only thing I remember changing after that is fixing the edm4hep output config to use the correct new name for the MCReco collection, i.e. replacing MCRecoAssociationCollectionName with RecoMCParticleLinkCollectionName. But I wouldn't expect that to be related?

I haven't tried the debug stack on batch. Locally it failed 100% of the time for me, but I can submit 100 jobs with it and see what we get.

@pavel-demin

Between October 19 and 23, I made some changes around the memory allocation code in some parts of the Tcl code in Delphes:

https://github.com/delphes/delphes/commits/master/external/tcl

If I am not mistaken, the 2024-10-03 key4hep release contains an older version of the Delphes code without these changes, and the 2024-10-28 key4hep release contains a newer version of the Delphes code with these changes. So I would say that these changes neither solve this problem nor cause it.

I will try to reproduce this problem and see what I can do about it.

@bistapf
Contributor Author

bistapf commented Nov 12, 2024

Thanks @pavel-demin ! The standalone Delphes test I ran was using the latest master, so after this commit.

Out of 100 test jobs with the debug stack and the 03.10 release, 76 crashed immediately, so a similar rate.

I noticed that most of them (63) have the malloc error, but a few instead have corrupted double-linked list (not small) or just a good old Segmentation fault; in all cases it happens in the initialization of the MuonMomentumSmearing module.

Indeed, @juliagonski had reported to me by email that removing this module solved the issue for them, but I find this strange because some of the jobs still work even with the module.

@tmadlener
Contributor

Quickly checking which Delphes is used in the two cases, it looks like 2024-10-03 and 2024-10-28 use the same version:

$ source /cvmfs/sw.hsf.org/key4hep/setup.sh
AlmaLinux/RockyLinux/RHEL 9 detected
Setting up the Key4hep software stack release latest-opt from CVMFS
Use the following command to reproduce the current environment: 

        source /cvmfs/sw.hsf.org/key4hep/setup.sh -r 2024-10-28

If you have any issues, comments or requests, open an issue at https://github.com/key4hep/key4hep-spack/issues
Tip: A new -d flag can be used to access debug builds, otherwise the default is the optimized build
$ which DelphesPythia8
/cvmfs/sw.hsf.org/key4hep/releases/2024-10-03/x86_64-almalinux9-gcc14.2.0-opt/delphes/master-ic3lyz/bin/DelphesPythia8

It also looks like we are not using a tagged release for Delphes in our current releases. Is this on purpose?

@pavel-demin

I did some testing on my side and managed to reproduce the crashes with the following commands:

source /cvmfs/sw.hsf.org/key4hep/setup.sh -r 2024-10-03
DelphesPythia8_EDM4HEP card.tcl edm4hep_output_config.tcl card.cmd events_000000083.root

and

source /cvmfs/sw.hsf.org/key4hep/setup.sh -r 2024-10-28
DelphesPythia8_EDM4HEP card.tcl edm4hep_output_config.tcl card.cmd events_000000083.root

These two releases use the same Delphes version from 2024-10-03.

I also tried the latest nightly build that uses a newer Delphes version:

source /cvmfs/sw-nightlies.hsf.org/key4hep/setup.sh -r 2024-11-12
which DelphesPythia8

/cvmfs/sw-nightlies.hsf.org/key4hep/releases/2024-11-12/x86_64-almalinux9-gcc14.2.0-opt/delphes/0a4a3993d08465d7599174baeb364703b8a73fa4_develop-76ldcg/bin/DelphesPythia8

With this nightly release, DelphesPythia8_EDM4HEP does not crash. So I hope that the changes that I made between October 19 and October 23 actually fixed this problem.

@bistapf
Contributor Author

bistapf commented Nov 12, 2024

Thanks a lot, @pavel-demin ! Indeed I can confirm that with the nightly build I was able to run 100 jobs successfully.

So do we need a new release with the latest Delphes version, @jmcarcell ? Thanks!

@tmadlener
Contributor

Could we have a proper delphes tag for that, please (@pavel-demin @selvaggi)? Or are there some developments that need to be done first?

@pavel-demin

I have just tagged the current version of the code as a pre-release 3.5.1pre11. Is it OK for you?

@tmadlener
Contributor

I think that should work. Thanks a lot.

@bistapf
Contributor Author

bistapf commented Nov 13, 2024

Would the release then also include the latest k4SimDelphes build, i.e. the fix from #137?

@tmadlener
Contributor

I created a new tag (v00-07-03) and this should be picked up for the next release, via key4hep/key4hep-spack#669

@Kenny-Jia

I have just tagged the current version of the code as a pre-release 3.5.1pre11. Is it OK for you?

Hi, I also get the same error even with 3.5.1pre11, but I am using just DelphesHepMC2 rather than DelphesPythia8. I am wondering if you could share more details on how the problem could be fixed? Thank you!

@tmadlener
Contributor

Hi, I also get the same error even with 3.5.1pre11, but I am using just DelphesHepMC2

If you get the problem with the DelphesHepMC2 reader from Delphes (i.e. no EDM4hep output involved), I think it's easiest to open a new issue directly in the delphes repository, referencing this issue for some more background information.

If instead you still get the error for a reader that this repository provides, can you share a few more details on how to reproduce the issue? (software environment, inputs, commands, ...)

@bistapf
Contributor Author

bistapf commented Nov 15, 2024

Adding to @Kenny-Jia 's last comment:

I tried using locally built versions of Delphes tag 3.5.1pre11 and k4SimDelphes tag v00-07-03 on top of the latest key4hep release and still got the malloc error. Could someone else (with more experience) please try this too?

With the nightlies I haven't seen the issue again, so maybe the problem is elsewhere after all?

Edit: Using the local tagged versions on top of the nightly stack seems to work. But I'm not sure what this tells us; perhaps the way I'm trying to use the local versions is not working correctly. I have checked that which DelphesPythia8_EDM4HEP points to my locally built key4hep installation (which uses the v00-07-03 git tag) and that the build/CMakeCache.txt there picks up the locally built Delphes (which uses the 3.5.1pre11 git tag). Is there any other way I can check which versions I'm using?

In any case, here are the commands I followed:
test_local_builds.txt

@bistapf
Contributor Author

bistapf commented Nov 27, 2024

Just an update: by now I have run some thousands of jobs with the nightlies, and the malloc issue only occurred once (and I couldn't figure out whether there was something specific about that job that made it crash, e.g. the batch machine it ran on). So it is (essentially) fixed in the nightlies, but I really can't tell whether that is due to the Delphes update or something else (see my comment about tests with local builds above). Any suggestions on how to proceed to get a stable key4hep release without this issue as soon as possible, @andresailer @jmcarcell @tmadlener?

@tmadlener
Contributor

Hi @bistapf, I think it should be possible to make a release based on the current "series" that just picks up a newer version of Delphes and the latest tag of k4SimDelphes, in order to check that the necessary changes have landed where they need to land.

@jmcarcell
Member

jmcarcell commented Nov 28, 2024

@bistapf I'll make a new build soon. Having a look at this, the Delphes version used in the releases is the master branch as of 2024-10-03 (with no changes for 2024-10-28).

@pavel-demin Could we get a released version of delphes? The last one that is not a pre-release (3.5.0) is more than 3 years old, and that is what spack is using: https://github.com/spack/spack/blob/develop/var/spack/repos/builtin/packages/delphes/package.py#L25

@pavel-demin

I am planning to prepare a new Delphes release soon but I am afraid that I will not have time to do it in the coming weeks.

I would also like to understand this problem better and make sure that it is fixed before making the new release. At the moment I am a bit confused by all the different comments, and I do not understand whether the problem is completely fixed or whether it still reappears from time to time.

@jmcarcell
Member

No problem, this was more of a general comment; for the release builds I can use the latest pre-release version.

@jmcarcell
Member

@bistapf, please try out the new release by sourcing the setup script.
It has a new version of Delphes and k4SimDelphes:

$ which DelphesPythia8_EDM4HEP
/cvmfs/sw.hsf.org/key4hep/releases/2024-11-28/x86_64-almalinux9-gcc14.2.0-opt/k4simdelphes/00-07-03-u22la6/bin/DelphesPythia8_EDM4HEP
$ which DelphesHepMC3
/cvmfs/sw.hsf.org/key4hep/releases/2024-11-28/x86_64-almalinux9-gcc14.2.0-opt/delphes/3.5.1pre11-2zkuun/bin/DelphesHepMC3

@jmcarcell
Member

Any updates on this @bistapf?

@bistapf
Contributor Author

bistapf commented Dec 11, 2024

Hi @jmcarcell, all, sorry for the delay. Unfortunately I have to report that the issue does not seem to be fully solved yet.

Using source /cvmfs/fcc.cern.ch/sw/latest/setup.sh to set up the new release -r 2024-11-28, my batch jobs failed with the malloc error at varying rates. Sometimes almost 2/3 of the jobs ran fine, while in one production submission of 500 jobs only 18 succeeded without the error, i.e. under 4%. I cannot figure out what, if anything, affects this; in principle all the jobs are the same, just with different LHE input files (but those always contain 10k events, so I think it shouldn't really matter?).

When running the tester locally, the chance that it runs through appears to be about 50/50; this is when running the script multiple times in a row on the same interactive node. Could you also give it a try, to make sure the problem isn't somehow on my end? Thanks!

I again attach the collection of logs for a job that failed, as well as for one that worked. The only thing I have noticed is that the .log files for jobs that failed contain the following line in the host information: GPUs = 0 (and the value after Disk= seems a lot lower than when this line is absent, but I'm not sure whether that just reflects how much disk space the job ended up using). I'm not sure how to test whether this is a consistent pattern or unrelated (and it is unclear to me why I did not have the issue with the nightlies if it is somehow related to the batch host, and also why the local tester on the same machine sometimes fails). Do you think it's worth following up? If so, I would try to analyse my logs more systematically.

job_w_malloc_logs.zip
job_success_logs.zip
