Targeting analysis at 200 Gbps with ATLAS PHYSLITE. This repository is very much a work in progress.
ATLAS has not released Open Data, so there is no Analysis Grand Challenge (AGC) implementation we can copy and run. As a result, this repository's main purpose is as a facilities test:
- Run from PHYSLITE
- Load 200 Gbps off of the PHYSLITE samples
- Push all that data downstream to Dask (or similar) workers.
We keep a loosely tracked set of lessons learned.
`materialize_branches.ipynb`: read a list of branches, distributable with Dask (use for benchmarking).
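For orientation, the core of such a benchmark might look like the following minimal sketch, assuming `uproot`'s Dask backend together with `dask-awkward`; the file path, xcache host, and branch names are placeholders, not this repo's actual configuration.

```python
import dask
import dask_awkward as dak
import uproot

# Lazily map the PHYSLITE CollectionTree to dask-awkward arrays.
# Path, host, and branch filter below are illustrative placeholders.
events = uproot.dask(
    {"root://xcache.example.org//store/file.DAOD_PHYSLITE.root": "CollectionTree"},
    filter_name=["AnalysisElectronsAuxDyn.pt", "AnalysisJetsAuxDyn.pt"],
)

# One tiny reduction per branch: every byte gets read and decompressed,
# but only a scalar per branch travels back to the client.
tasks = {name: dak.count(events[name], axis=None) for name in events.fields}

# Executing the graph on the Dask cluster materializes all requested branches.
print(dask.compute(tasks)[0])
```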
When run on the UChicago AF Jupyter Notebook, no package installs are required.
There is a `requirements.txt` which should allow this to be run on a bare-bones machine (modulo the location of files, etc.).
If you are going to use the `servicex` version, you have to pin `dask_awkward==2024.2.0`; later versions have a bug which hasn't been fixed yet.
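In a fresh environment, the pin amounts to:

```
pip install "dask_awkward==2024.2.0"
```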
The folder `input_files` contains the list of input containers / files and related metadata, plus the scripts to produce these.
In total:
- number of files: 219,029
- size: 191.073 TB
- number of events: 23,347,787,104
The files in `input_files`:
- `input_files/find_containers.py`: query Rucio for a list of containers, given a list of (hardcoded) DSIDs
- `input_files/container_list.txt`: list of containers to run over
- `input_files/produce_container_metadata.py`: query metadata for the containers: number of files / events, size
- `input_files/container_metadata.json`: output of `input_files/produce_container_metadata.py` with container metadata
- `input_files/get_file_list.py`: for a given dataset, creates a txt file listing file access paths that include the appropriate xcache prefix. The same kind of output can be obtained with:

  ```
  export SITE_NAME=AF_200
  rucio list-file-replicas mc20_13TeV:mc20_13TeV.364126.Sherpa_221_NNPDF30NNLO_Zee_MAXHTPTV500_1000.deriv.DAOD_PHYSLITE.e5299_s3681_r13145_p6026 --protocol root --pfns --rses MWT2_UC_LOCALGROUPDISK
  ```

- `input_files/containers_to_files.py`: processes the list of containers into a list of files per container, with hardcoded xcache instances, and writes the result to `input_files/file_lists/*` (see the sketch after this list).
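The xcache redirection in these scripts boils down to prefixing each replica's PFN with a cache endpoint. A minimal sketch of that rewriting, with made-up hostnames and paths (the real instances are hardcoded in `input_files/containers_to_files.py`):

```python
# Illustrative only: hostnames, ports, and the example PFN are made up.
XCACHE_INSTANCES = [
    "root://xcache-01.example.org:1094//",
    "root://xcache-02.example.org:1094//",
]

def with_xcache(pfn: str, index: int) -> str:
    """Prefix a PFN with an xcache endpoint, round-robining over instances."""
    cache = XCACHE_INSTANCES[index % len(XCACHE_INSTANCES)]
    return cache + pfn

pfn = "root://host.example.org:1094//pnfs/example.org/rucio/mc20_13TeV/DAOD_PHYSLITE.example.pool.root.1"
print(with_xcache(pfn, 0))
```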
Branches to be read are determined with a 2018 data file.
- `input_files/size_per_branch.ipynb`: produce a breakdown of branch sizes for a given file
- `input_files/branch_sizes.json`: output of the notebook above
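A minimal sketch of how such a breakdown can be produced with `uproot` (the input path here is a placeholder, and the notebook's actual logic may differ):

```python
import json
import uproot

# Placeholder path; the notebook reads a 2018 data PHYSLITE file.
with uproot.open("data18.DAOD_PHYSLITE.example.root") as f:
    tree = f["CollectionTree"]
    # Compressed on-disk bytes per top-level branch.
    sizes = {b.name: b.compressed_bytes for b in tree.branches}

# Write the per-branch sizes as JSON, largest first.
with open("branch_sizes.json", "w") as out:
    json.dump(dict(sorted(sizes.items(), key=lambda kv: -kv[1])), out, indent=2)
```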
This work was supported by the U.S. National Science Foundation (NSF) cooperative agreements OAC-1836650 and PHY-2323298 (IRIS-HEP).