Failures in Run 3 data reprocessing #40437
Comments
A new Issue was created by @kskovpen. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Assign reconstruction |
New categories assigned: reconstruction @mandrenguyen, @clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks |
A bit more context from the link
The links to logs do not seem to work. |
Just to add that this might be related to this issue: #28358 |
Can this issue be closed? Or is it related to something else? |
Would it be possible to obtain details on the machine where the crash occurred? |
It looks like this issue is a bottleneck in finishing the 2023 data reprocessing. The fraction of failures is not negligible, as can be seen here (look for the 8001 error codes). Is there still a way to implement protection against these failures? @VinInn we are trying to get this info; we will let you know if we manage to dig it out. |
Now, also looking at the discussion, which happened in #40733 - is our understanding correct that this issue is potentially fixed in 12_6_0_pre5? |
No, it is still an exception: cmssw/RecoTauTag/RecoTau/plugins/DeepTauId.cc, lines 1296 to 1297 at 8c3dad4 |
@kskovpen Do you have any pointers to the logs of the 8001 failures? |
This issue appears to be the main bottleneck in the current Run 3 data reprocessing. The initial cause could be excessive memory usage by the deepTau-related modules. The issue happens all over the place and there are many examples, e.g. https://cms-unified.web.cern.ch/cms-unified/showlog/?search=ReReco-Run2022E-ZeroBias-27Jun2023-00001#DataProcessing:50660. Has memory profiling been done for the latest deepTau implementation in CMSSW? |
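A simple first check, short of a dedicated profiling campaign, is to enable the framework's SimpleMemoryCheck service in a test job. A minimal sketch of such a configuration fragment is shown below; the `moduleMemorySummary` flag is an assumption about the options available in this release and should be checked against the service documentation.

```python
# Minimal cmsRun configuration fragment enabling memory reporting.
# SimpleMemoryCheck is the standard framework service; moduleMemorySummary
# (per-module summary at end of job) is assumed to exist in this release.
import FWCore.ParameterSet.Config as cms

process = cms.Process("MEMTEST")

process.SimpleMemoryCheck = cms.Service(
    "SimpleMemoryCheck",
    ignoreTotal = cms.untracked.int32(1),            # skip the first event while memory ramps up
    moduleMemorySummary = cms.untracked.bool(True),  # assumed flag: per-module memory summary
)
```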
Trying to look at the logs for 50660, I see
With these two logs alone I don't see any evidence that deepTau would cause memory issues. The main weirdness to me seems to be PSS becoming larger than RSS, leading to WM asking CMSSW to stop processing (in addition to the exception from deepTau). |
The 8001 log (https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022E_ZeroBias_27Jun2023_230627_121019_4766/8001/DataProcessing/04ae17ee-b07e-4ca6-b32e-4e51ee944afe-36-0-logArchive/job/) shows that the job throwing the exception was also run on |
Do we have any such reported failures at CERN? (e.g., why does the T0 not see this same issue?) |
The failures are strongly correlated with the sites. The highest failure rates are observed at T2_BE_IIHE, T2_US_Nebraska, and T2_US_Caltech. |
Another example of a workflow with a high failure rate: https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_Run2022B_HLTPhysics_27Jun2023_230627_115530_2305 |
@makortel Hi Matti, for the test workflow that I was running and reported issues like:
it happened at 2 sites only (13 failures at MIT and 3 at Nebraska). A couple of logs for MIT are available in CERN EOS:
while for Nebraska they are:
|
Thanks @amaltaro for the logs.
This job failed because the input data from
I'm puzzled why the
This file doesn't seem to exist.
These jobs failed with the |
Here https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022B_HLTPhysics_27Jun2023_230627_115530_2305/50660/DataProcessing/1a49954c-1f29-41bd-aa2f-b19144eead34-0-0-logArchive/ https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022B_HLTPhysics_27Jun2023_230627_115530_2305/50660/DataProcessing/1a49954c-1f29-41bd-aa2f-b19144eead34-2-0-logArchive/ |
Poking into the WM code, I see the PSS is read from /proc/<PID>/smaps, and RSS from ps (https://github.com/dmwm/WMCore/blob/762bae943528241f67625016fd019ebcd0014af1/src/python/WMCore/WMRuntime/Monitors/PerformanceMonitor.py#L242). IIUC ps uses /proc/PID/stat (which is also what CMSSW's SimpleMemoryCheck printout uses), and apparently stat and smaps are known to report different numbers (e.g. https://unix.stackexchange.com/questions/56469/rssresident-set-size-is-differ-when-use-pmap-and-ps-command). But is this large (~3 GB, ~30 %) difference expected? (OK, we don't know what RSS as reported by smaps would be.) |
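For anyone who wants to compare the two numbers directly on a worker node, a small standalone sketch is shown below. It reads the same /proc sources discussed here (PSS summed from smaps, RSS from stat); it is not the WMCore code, just an illustration of the two accounting methods.

```python
#!/usr/bin/env python3
# Compare PSS (summed from /proc/<pid>/smaps, as WMCore does) with
# RSS (from /proc/<pid>/stat, which is what ps and SimpleMemoryCheck report).
import os
import sys

def pss_from_smaps(pid):
    """Sum all 'Pss:' lines of /proc/<pid>/smaps, in kB."""
    total_kb = 0
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            if line.startswith("Pss:"):
                total_kb += int(line.split()[1])
    return total_kb

def rss_from_stat(pid):
    """Read RSS (field 24 of /proc/<pid>/stat, in pages) and convert to kB."""
    with open(f"/proc/{pid}/stat") as f:
        # The process name (field 2) may contain spaces, so split after the closing ')'.
        fields = f.read().rsplit(")", 1)[1].split()
    rss_pages = int(fields[21])  # field 24 overall, i.e. index 21 after pid and comm
    return rss_pages * os.sysconf("SC_PAGE_SIZE") // 1024

if __name__ == "__main__":
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    pss_kb = pss_from_smaps(pid)
    rss_kb = rss_from_stat(pid)
    print(f"PSS {pss_kb} kB, RSS(stat) {rss_kb} kB, difference {pss_kb - rss_kb:+d} kB")
```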
In case it's useful to correlate sites and CPU types, this is what has been running recently: https://gist.github.com/davidlange6/74232d064422e036c176fb992d90357e
|
Newer releases also monitor changes in RSS in MemoryCheck. |
it happens (a second time) even if the job is pinned (I killed the previous job, and bang). |
The memory explosion can be reproduced! |
Let's cross-post
so multiple modules use the very same model. |
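To make this pattern concrete, the sketch below shows roughly what "multiple modules, one model" looks like at the configuration level. The module type, label, and `graph_file` parameter are illustrative only, not the actual DeepTauId interface; the point is simply that several producer instances can each be configured around the same model file, and whether they share one in-memory copy depends on how the plugin caches the graph (e.g. via a global cache).

```python
# Hypothetical fragment: two producer instances configured with one and the
# same model file. Whether the graph is loaded once or twice depends on how
# the plugin caches it; the names below are illustrative, not the real ones.
import FWCore.ParameterSet.Config as cms

sharedModel = "RecoTauTag/TrainingFiles/data/DeepTauId/model.pb"  # placeholder path

deepTauProducerA = cms.EDProducer(
    "SomeDeepTauLikeProducer",                 # illustrative type name
    graph_file = cms.string(sharedModel),
)

deepTauProducerB = deepTauProducerA.clone()    # second instance, same model file
```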
My summary:
|
Thanks Vincenzo for the investigation and the summary. |
Why does data reprocessing (the topic of this issue) have an issue with simhit replay? Or do you mean general MC relvals? For MC relvals, the default setup currently runs with fewer streams (4 threads, 2 streams). |
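For reference, the threads/streams setting mentioned here is the usual process-level option; a minimal sketch of the corresponding configuration fragment, with the values quoted above, would be:

```python
# Fragment pinning the framework to 4 threads and 2 streams, as in the
# default MC relval setup mentioned above (normally set via cmsDriver/runTheMatrix).
import FWCore.ParameterSet.Config as cms

process = cms.Process("RECO")

process.options = cms.untracked.PSet(
    numberOfThreads = cms.untracked.uint32(4),
    numberOfStreams = cms.untracked.uint32(2),
)
```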
@srimanob , |
Regarding this excellent summary of Vincenzo's (thank you), the XEB is asking for periodic updates and time estimates for the solutions. I have a link to the WMCore fix (2). Are there GitHub issues opened for 1 (TauID) and 3 (TCMalloc)? |
Profiling the job mentioned in #40437 (comment) again with the usual MEM_LIVE (not peak) showed that in this job the top-level The top-level |
I spun off the DQM memory usage discussion into #42504 |
This ~70-90 MB / job |
Is #41465 worth a backport? |
I'd say it depends on the @cms-sw/pdmv-l2's plan for this re-reco in 12_4_X (well, also for plans for a re-reco in 13_0_X). If there is a chance that the backport would really be used, then the backport could be worth it. |
Here is another almost 100 MB / stream (in |
Another spin-off, this time on the number of allocations (i.e. memory churn) #42672 |
We are seeing failures in the ongoing Run 3 data reprocessing, presumably related to the DeepTau implementation. Here is just one example of the failure: https://cms-unified.web.cern.ch/cms-unified/report/haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171338_6693
The crash message is: