NanoAOD file access over http and performance #128

Open
alexander-held opened this issue May 1, 2023 · 1 comment
Labels: analysis task (concerns analysis task), implementation (concerns analysis implementation)

Comments

@alexander-held

alexander-held commented May 1, 2023

Both ServiceX transforms of NanoAOD files and direct uproot-based access to them over HTTP appear to be slower than the same operations on ntuples: https://gist.github.com/alexander-held/4e58811522ed9990afb2d4b73ef9471e.

@masonproffitt pointed out a related XRootD issue: xrootd/xrootd#1976. Requesting too much data triggers a 500 error, after which uproot falls back to individual requests, which slows everything down. A similar issue is xrootd/xrootd#2003: it concerns requesting too many ranges at once, whereas the former concerns requesting too many bytes in a range.
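To make the cost of that fallback concrete, here is a toy model that only counts round trips. It is a sketch, not uproot's or XRootD's actual logic: `fake_server`, `count_requests`, and `make_ranges` are hypothetical helpers, no network access happens, and the range limit is an assumed value used purely for illustration.

```python
# Toy model of the fallback behavior; no real HTTP or uproot calls are made.
# All names and the limit below are illustrative assumptions.

MAX_RANGES_PER_REQUEST = 1024  # assumed server-side limit on ranges per request


def fake_server(ranges):
    """Return an HTTP-like status: 500 if too many ranges are requested at once."""
    if len(ranges) > MAX_RANGES_PER_REQUEST:
        return 500
    return 206  # Partial Content


def count_requests(ranges):
    """Count round trips when a failed multi-range request falls back
    to one request per range."""
    requests_sent = 1  # the initial multi-range attempt
    if fake_server(ranges) == 500:
        # Fallback: every range is fetched with its own request.
        requests_sent += len(ranges)
    return requests_sent


def make_ranges(n, size=1000):
    """Build n contiguous, non-overlapping (start, stop) byte ranges."""
    return [(i * size, (i + 1) * size) for i in range(n)]


few = count_requests(make_ranges(10))     # stays under the assumed limit
many = count_requests(make_ranges(2000))  # exceeds it and triggers the fallback
print(few, many)
```

In this model the request count jumps from one to one per range as soon as the limit is crossed, so the total time becomes dominated by round-trip latency.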

Related uproot issue during these investigations: scikit-hep/uproot5#881.

Impact on ServiceX

More details about the behavior of ServiceX from @masonproffitt:

the uproot backend does not set anything related to chunking; it just uses the default settings for uproot. the problems are a bit different between uproot4 (used in the current version of the servicex uproot transformer) and uproot5. in uproot4, the main problem is that uproot.lazy has an explicit iterator over branches, so the execution time scales linearly with both the number of branches accessed and the round trip latency. in uproot5, this problem should disappear thanks to uproot.dask, but there the issue is that it hits these xrootd limits and falls back to individual requests (at least for each branch, maybe even for each basket)

for uproot5, we can set the step_size in the servicex transformer, but i don't think there's a consistent way to guarantee that we don't hit these limits because there are separate limits for (1) number of byte ranges, (2) total ascii length of the Range field, and (3) total number of actual bytes requested by Range. the problem is that there's no way to know the number and size of the baskets before the code executes. handling this would require either going deep into uproot itself or inspecting a lot of metadata at runtime and modifying the generated code in very non-trivial ways
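As a rough illustration of why the three limits interact, the sketch below greedily splits a list of byte ranges into batches that each satisfy all three. It assumes the ranges are already known, which is exactly what the quote says is not available before the generated code runs, so this is not a fix for the transformer. The numeric limits for header length and total bytes are made-up placeholders, and the function names are hypothetical.

```python
# Hypothetical limits for illustration only; the real server values may differ.
MAX_RANGES = 1024             # (1) number of byte ranges per request
MAX_HEADER_CHARS = 8000       # (2) ASCII length of the Range field (assumed)
MAX_TOTAL_BYTES = 50_000_000  # (3) total bytes requested by Range (assumed)


def range_header(ranges):
    """Format (start, stop) pairs, stop exclusive, as an HTTP Range value."""
    return "bytes=" + ", ".join(f"{start}-{stop - 1}" for start, stop in ranges)


def within_limits(ranges):
    """Check a candidate batch against all three limits."""
    return (
        len(ranges) <= MAX_RANGES
        and len(range_header(ranges)) <= MAX_HEADER_CHARS
        and sum(stop - start for start, stop in ranges) <= MAX_TOTAL_BYTES
    )


def split_into_batches(ranges):
    """Greedily group ranges so that each batch satisfies every limit.

    A single range that is too large on its own still gets its own batch;
    this sketch does not split individual ranges. The header is rebuilt on
    every check, which is simple but not efficient.
    """
    batches, current = [], []
    for r in ranges:
        if current and not within_limits(current + [r]):
            batches.append(current)
            current = []
        current.append(r)
    if current:
        batches.append(current)
    return batches


# 3000 small ranges: here the header-length limit is hit before the count limit.
example = [(i * 1000, i * 1000 + 500) for i in range(3000)]
batches = split_into_batches(example)
print(len(batches), max(len(b) for b in batches))
```

With these placeholder values the header length becomes the binding limit well before the range count does, which shows that limiting the number of ranges alone would not be enough.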

Impact on coffea

It is currently unclear whether coffea, which ingests the input dataset directly, would be affected differently. Are there any tricks that may matter here @nsmith- @lgray? We are currently still using "old" coffea, though we are preparing to switch to coffea 2023.

@masonproffitt

I just tested: the difference is simply the number of baskets in those files. The NanoAOD has 251 baskets per branch, and the ntuple has 10. Therefore you very quickly hit the XRootD limit of 1024 byte ranges for the NanoAOD but not for the ntuple.
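The arithmetic behind this can be sketched as follows, assuming one byte range per basket and no merging of adjacent baskets into a single range (both are simplifying assumptions, and `max_branches_per_request` is a hypothetical helper):

```python
RANGE_LIMIT = 1024  # byte ranges per request, as quoted in the comment above


def max_branches_per_request(baskets_per_branch, limit=RANGE_LIMIT):
    """Number of whole branches that fit in one multi-range request,
    assuming each basket needs its own byte range."""
    return limit // baskets_per_branch


nanoaod_branches = max_branches_per_request(251)  # NanoAOD: 251 baskets per branch
ntuple_branches = max_branches_per_request(10)    # ntuple: 10 baskets per branch
print(nanoaod_branches, ntuple_branches)  # 4 102
```

Under these assumptions, reading a fifth NanoAOD branch in the same request already exceeds the limit, while the ntuple allows about a hundred branches.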

@alexander-held added the implementation and analysis task labels on May 1, 2023