Run 3 ReReco workflows are failing at a high rate on some sites that run newer kernel versions. Details: Failures in Run 3 data reprocessing
The detailed cause is described in this ticket as well. In short, Linux v6.0+ adds a field, Pss_Dirty, to smaps. That field is added to the PSS calculation, so jobs are killed earlier because their PSS exceeds the threshold.
We would like to find a solution, such as using RSS as the memory metric for killing jobs. Otherwise, sites with new kernels will consistently overestimate memory usage and end jobs early.
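To illustrate the overcounting, here is a minimal sketch of one plausible mechanism: a monitor that sums every smaps line whose key starts with `Pss` will also pick up `Pss_Dirty` on Linux v6.0+. This is an assumption for illustration, not the actual WMCore code; the function names and the sample values are hypothetical.

```python
# Hypothetical smaps excerpt with two mappings, as a v6.0+ kernel might print it.
# The numbers are made up for illustration.
SMAPS_SAMPLE = """\
00400000-0048a000 r-xp 00000000 fd:03 960637 /bin/bash
Size:                200 kB
Rss:                 120 kB
Pss:                 100 kB
Pss_Dirty:            60 kB
7f0000000000-7f0000021000 rw-p 00000000 00:00 0
Size:                132 kB
Rss:                  80 kB
Pss:                  50 kB
Pss_Dirty:            50 kB
"""


def sum_field_prefix(smaps_text, prefix="Pss"):
    """Sum every line whose key starts with `prefix` (fragile: also matches Pss_Dirty)."""
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith(prefix):
            total += int(line.split()[1])
    return total


def sum_field_exact(smaps_text, field="Pss"):
    """Sum only the lines whose key is exactly `field`."""
    total = 0
    for line in smaps_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key == field:
            total += int(rest.split()[0])
    return total


if __name__ == "__main__":
    print("prefix match on 'Pss' (kB):", sum_field_prefix(SMAPS_SAMPLE))
    print("exact match on 'Pss' (kB): ", sum_field_exact(SMAPS_SAMPLE))
    print("exact match on 'Rss' (kB): ", sum_field_exact(SMAPS_SAMPLE, "Rss"))
```

In this sample the prefix match reports more memory than the mappings actually account for, while the exact match on `Pss` is unaffected by the extra field. Matching on `Rss` instead is the alternative metric proposed above.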
Thanks for creating this issue @z4027163
There is already another WMCore issue related to this problem, #11667, and we have already provided two fixes for it:
The long-term solution requires changing how we distribute runtime code. So even though that fix, which uses the more robust psutil module, is ready, it will not go into production until we converge on the best way to distribute the library at runtime.
I am closing this issue now, and we should follow up on the original one.