NanoAOD file access over http and performance #128

alexander-held · 2023-05-01T09:23:36Z

ServiceX transforms of NanoAOD files and direct uproot-based access via http seem to be slower than for ntuples: https://gist.github.com/alexander-held/4e58811522ed9990afb2d4b73ef9471e.

@masonproffitt pointed out an XRootD issue related to this: xrootd/xrootd#1976. Requesting too many bytes in a single range causes a 500 error, after which uproot falls back to individual requests, making everything slower. A similar issue is xrootd/xrootd#2003, which is about requesting too many ranges at once, whereas the former is about requesting too many bytes in a range.

Related uproot issue during these investigations: scikit-hep/uproot5#881.

Impact on ServiceX

More details about the behavior of ServiceX from @masonproffitt:

the uproot backend does not set anything related to chunking; it just uses the default settings for uproot. the problems are a bit different between uproot4 (used in the current version of the servicex uproot transformer) and uproot5. in uproot4, the main problem is that uproot.lazy has an explicit iterator over branches, so the execution time scales linearly with both the number of branches accessed and the round trip latency. in uproot5, this problem should disappear thanks to uproot.dask, but there the issue is that it hits these xrootd limits and falls back to individual requests (at least for each branch, maybe even for each basket)

for uproot5, we can set the step_size in the servicex transformer, but i don't think there's a consistent way to guarantee that we don't hit these limits because there are separate limits for (1) number of byte ranges, (2) total ascii length of the Range field, and (3) total number of actual bytes requested by Range. the problem is that there's no way to know the number and size of the baskets before the code executes. handling this would require either going deep into uproot itself or inspecting a lot of metadata at runtime and modifying the generated code in very non-trivial ways
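To make the three limits above concrete, here is a minimal sketch of how a list of byte ranges could be split into several multi-range requests that each stay below all three. It assumes the basket byte ranges are already known, which, as noted above, is exactly what is not available before the code executes. `batch_ranges`, `range_header`, and the default values for `max_header_chars` and `max_total_bytes` are hypothetical and for illustration only; this is not what uproot or ServiceX does.

```python
def batch_ranges(ranges, max_ranges=1024, max_header_chars=8192,
                 max_total_bytes=64 * 1024**2):
    """Greedily group (start, stop) byte ranges (stop exclusive) into batches.

    Each batch respects three hypothetical server-side limits:
    (1) number of byte ranges, (2) character length of the Range value,
    (3) total number of bytes requested.
    A single range that exceeds a limit on its own still gets its own
    batch; splitting individual ranges is out of scope for this sketch.
    """
    prefix_len = len("bytes=")
    batches = []
    current, header_len, total_bytes = [], prefix_len, 0

    for start, stop in ranges:
        spec = f"{start}-{stop - 1}"  # HTTP ranges are inclusive
        size = stop - start
        extra = len(spec) + (1 if current else 0)  # +1 for the comma
        over_limit = (
            len(current) + 1 > max_ranges
            or header_len + extra > max_header_chars
            or total_bytes + size > max_total_bytes
        )
        if current and over_limit:
            # Close the current batch and start a fresh one.
            batches.append(current)
            current, header_len, total_bytes = [], prefix_len, 0
            extra = len(spec)
        current.append((start, stop))
        header_len += extra
        total_bytes += size

    if current:
        batches.append(current)
    return batches


def range_header(batch):
    """Format one batch as the value of an HTTP Range header."""
    return "bytes=" + ",".join(f"{start}-{stop - 1}" for start, stop in batch)


# Ten 50-byte ranges, with at most 4 ranges allowed per request.
example = [(i * 100, i * 100 + 50) for i in range(10)]
example_batches = batch_ranges(example, max_ranges=4)
example_header = range_header(example_batches[0])
```

Even with such a helper, the ranges themselves only become known after reading the file metadata, so this illustrates the constraint rather than a way around it.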

Impact on coffea

It is currently unclear whether coffea would be affected differently when it ingests the input dataset directly. Are there any tricks that may matter here @nsmith- @lgray? Currently we are still using "old" coffea, though preparing to switch to coffea 2023.


masonproffitt · 2023-05-01T10:03:13Z

I just tested: the difference is simply the number of baskets in those files. The NanoAOD has 251 baskets per branch, and the ntuple has 10. Therefore you very quickly hit the XRootD limit of 1024 byte ranges per request for the NanoAOD but not for the ntuple.
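As a rough back-of-the-envelope check of the numbers above, the following sketch computes how many whole branches fit into one multi-range request. It assumes one byte range per basket and no merging of adjacent ranges, which may not match what uproot actually sends; `branches_before_limit` is a hypothetical helper.

```python
XROOTD_MAX_RANGES = 1024  # byte-range limit quoted above


def branches_before_limit(baskets_per_branch, max_ranges=XROOTD_MAX_RANGES):
    """Whole branches that fit in one request, assuming one range per basket."""
    return max_ranges // baskets_per_branch


nanoaod_branches = branches_before_limit(251)  # NanoAOD: 251 baskets per branch
ntuple_branches = branches_before_limit(10)    # ntuple: 10 baskets per branch
print(nanoaod_branches, ntuple_branches)
```

Under these assumptions only 4 NanoAOD branches fit into a single request, compared to 102 ntuple branches.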

alexander-held added implementation concerns analysis implementation analysis task concerns analysis task labels May 1, 2023

This was referenced Aug 4, 2023

feat: support input files on public EOS #185

Merged

feat: add support input for input files on EOS #187

Merged
