Failures in Run 3 data reprocessing #40437
Comments
A new Issue was created by @kskovpen. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Assign reconstruction |
New categories assigned: reconstruction @mandrenguyen, @clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks |
A bit more context from the link
The links to logs do not seem to work. |
Just to add that this might be related to this issue: #28358 |
Can this issue be closed? Or is it related to something else? |
Would it be possible to obtain details on the machine where the crash occurred? |
It looks like this issue is a bottleneck in finishing the 2023 data reprocessing. The fraction of failures is not negligible, as can be seen here (look for the 8001 error codes). Is there still a way to implement protection against these failures? @VinInn we are trying to get this info; we will let you know if we manage to dig it out. |
Now, also looking at the discussion, which happened in #40733 - is our understanding correct that this issue is potentially fixed in 12_6_0_pre5? |
No, it is still an exception: cmssw/RecoTauTag/RecoTau/plugins/DeepTauId.cc, lines 1296 to 1297 at 8c3dad4 |
@kskovpen Do you have any pointers to the logs of the 8001 failures? |
This issue appears to be the main bottleneck in the current Run 3 data reprocessing. The initial cause could be excessive memory usage by the deepTau-related modules. The issue happens all over the place and there are many examples, e.g. https://cms-unified.web.cern.ch/cms-unified/showlog/?search=ReReco-Run2022E-ZeroBias-27Jun2023-00001#DataProcessing:50660. Has memory profiling been done for the latest deepTau implementation in CMSSW? |
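A simple first check, short of a dedicated profiling campaign, is to enable the framework's SimpleMemoryCheck service in a test job. A minimal sketch of such a configuration fragment is shown below; the `moduleMemorySummary` flag is an assumption about the options available in this release and should be checked against the service documentation.

```python
# Minimal cmsRun configuration fragment enabling memory reporting.
# SimpleMemoryCheck is the standard framework service; moduleMemorySummary
# (per-module summary at end of job) is assumed to exist in this release.
import FWCore.ParameterSet.Config as cms

process = cms.Process("MEMTEST")

process.SimpleMemoryCheck = cms.Service(
    "SimpleMemoryCheck",
    ignoreTotal = cms.untracked.int32(1),            # skip the first event while memory ramps up
    moduleMemorySummary = cms.untracked.bool(True),  # assumed flag: per-module memory summary
)
```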
Trying to look at the logs for 50660, I see
With these two logs alone I don't see any evidence that deepTau would cause memory issues. The main weirdness to me seems to be PSS becoming larger than RSS, leading to WM asking CMSSW to stop processing (in addition to the exception from deepTau). |
The 8001 log (https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022E_ZeroBias_27Jun2023_230627_121019_4766/8001/DataProcessing/04ae17ee-b07e-4ca6-b32e-4e51ee944afe-36-0-logArchive/job/) shows that the job throwing the exception was also run on |
Do we have any such reported failures at CERN? (e.g., why does the T0 not see this same issue?) |
The failures are strongly correlated with the sites. The highest failure rates are observed at T2_BE_IIHE, T2_US_Nebraska, and T2_US_Caltech. |
Another example of a workflow with a high failure rate: https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_Run2022B_HLTPhysics_27Jun2023_230627_115530_2305 |
@makortel Hi Matti, for the test workflow that I was running and reported issues like:
it happened at 2 sites only (13 failures at MIT and 3 at Nebraska). A couple of logs for MIT are available in CERN EOS:
while for Nebraska they are:
|
Thanks @amaltaro for the logs.
This job failed because the input data from
I'm puzzled why the
This file doesn't seem to exist.
These jobs failed with the |
Here https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022B_HLTPhysics_27Jun2023_230627_115530_2305/50660/DataProcessing/1a49954c-1f29-41bd-aa2f-b19144eead34-0-0-logArchive/ https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022B_HLTPhysics_27Jun2023_230627_115530_2305/50660/DataProcessing/1a49954c-1f29-41bd-aa2f-b19144eead34-2-0-logArchive/ |
Poking into the WM code, I see the PSS is read from /proc/<PID>/smaps, and RSS from ps (https://github.com/dmwm/WMCore/blob/762bae943528241f67625016fd019ebcd0014af1/src/python/WMCore/WMRuntime/Monitors/PerformanceMonitor.py#L242). IIUC ps uses /proc/PID/stat (which is also what CMSSW's SimpleMemoryCheck printout uses), and apparently stat and smaps are known to report different numbers (e.g. https://unix.stackexchange.com/questions/56469/rssresident-set-size-is-differ-when-use-pmap-and-ps-command). But is this large (~3 GB, ~30 %) difference expected? (OK, we don't know what RSS as reported by smaps would be.) |
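For anyone who wants to compare the two numbers directly on a worker node, a small standalone sketch is shown below. It reads the same /proc sources discussed here (PSS summed from smaps, RSS from stat); it is not the WMCore code, just an illustration of the two accounting methods.

```python
#!/usr/bin/env python3
# Compare PSS (summed from /proc/<pid>/smaps, as WMCore does) with
# RSS (from /proc/<pid>/stat, which is what ps and SimpleMemoryCheck report).
import os
import sys

def pss_from_smaps(pid):
    """Sum all 'Pss:' lines of /proc/<pid>/smaps, in kB."""
    total_kb = 0
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            if line.startswith("Pss:"):
                total_kb += int(line.split()[1])
    return total_kb

def rss_from_stat(pid):
    """Read RSS (field 24 of /proc/<pid>/stat, in pages) and convert to kB."""
    with open(f"/proc/{pid}/stat") as f:
        # The process name (field 2) may contain spaces, so split after the closing ')'.
        fields = f.read().rsplit(")", 1)[1].split()
    rss_pages = int(fields[21])  # field 24 overall, i.e. index 21 after pid and comm
    return rss_pages * os.sysconf("SC_PAGE_SIZE") // 1024

if __name__ == "__main__":
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    pss_kb = pss_from_smaps(pid)
    rss_kb = rss_from_stat(pid)
    print(f"PSS {pss_kb} kB, RSS(stat) {rss_kb} kB, difference {pss_kb - rss_kb:+d} kB")
```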
In case it's useful to correlate sites and CPU types, this is what has been running recently: https://gist.github.com/davidlange6/74232d064422e036c176fb992d90357e
|
Newer releases also monitor changes in RSS in MemoryCheck. |
it happens (a second time) even if the job is pinned (I killed the previous job, and bang). |
The memory explosion can be reproduced! |
Let's cross-post
so multiple modules use the very same model. |
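To make this pattern concrete, the sketch below shows roughly what "multiple modules, one model" looks like at the configuration level. The module type, label, and `graph_file` parameter are illustrative only, not the actual DeepTauId interface; the point is simply that several producer instances can each be configured around the same model file, and whether they share one in-memory copy depends on how the plugin caches the graph (e.g. via a global cache).

```python
# Hypothetical fragment: two producer instances configured with one and the
# same model file. Whether the graph is loaded once or twice depends on how
# the plugin caches it; the names below are illustrative, not the real ones.
import FWCore.ParameterSet.Config as cms

sharedModel = "RecoTauTag/TrainingFiles/data/DeepTauId/model.pb"  # placeholder path

deepTauProducerA = cms.EDProducer(
    "SomeDeepTauLikeProducer",                 # illustrative type name
    graph_file = cms.string(sharedModel),
)

deepTauProducerB = deepTauProducerA.clone()    # second instance, same model file
```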
My summary:
|
Thanks Vincenzo for the investigation and the summary. |
Why does data reprocessing (the topic of this issue) have an issue with simhit replay? Or do you mean general MC relvals? For MC relvals, the default setup currently runs with fewer streams (4 threads, 2 streams). |
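For reference, the threads/streams setting mentioned here is the usual process-level option; a minimal sketch of the corresponding configuration fragment, with the values quoted above, would be:

```python
# Fragment pinning the framework to 4 threads and 2 streams, as in the
# default MC relval setup mentioned above (normally set via cmsDriver/runTheMatrix).
import FWCore.ParameterSet.Config as cms

process = cms.Process("RECO")

process.options = cms.untracked.PSet(
    numberOfThreads = cms.untracked.uint32(4),
    numberOfStreams = cms.untracked.uint32(2),
)
```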
@srimanob , |
Regarding this excellent summary of Vincenzo's (thank you), the XEB is asking for periodic updates and time estimates for the solutions. I have a link to the WMCore fix (2). Are there GitHub issues opened for 1 (TauID) and 3 (TCMalloc)? |
Profiling the job mentioned in #40437 (comment) again with the usual MEM_LIVE (not peak) showed that in this job the top-level The top-level |
I spun off the DQM memory usage discussion into #42504 |
This ~70-90 MB / job |
Is #41465 worth a backport? |
I'd say it depends on the @cms-sw/pdmv-l2's plan for this re-reco in 12_4_X (well, also for plans for a re-reco in 13_0_X). If there is a chance that the backport would really be used, then the backport could be worth it. |
Here is another almost 100 MB / stream (in |
Another spin-off, this time on the number of allocations (i.e. memory churn) #42672 |
We are seeing failures in the ongoing Run 3 data reprocessing, presumably related to the DeepTau implementation. Here is just one example of the failure: https://cms-unified.web.cern.ch/cms-unified/report/haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171338_6693
The crash message is: