Switch to awkward arrays and boost histograms #5

Open · wants to merge 18 commits into base: main
46 changes: 35 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,14 @@
# ntuple-tools

The python scripts in this repository should help you get started analysing the [HGCAL L1 TP ntuples](https://github.com/PFCal-dev/cmssw/tree/hgc-tpg-devel-CMSSW_10_3_0_pre4/L1Trigger/L1THGCal/plugins/ntuples)
Python framework for the analysis of [ROOT](https://root.cern/) `TTree` data, using [uproot](https://uproot.readthedocs.io/en/latest/) for the I/O and [awkward-array](https://awkward-array.org/doc/main/) for the columnar data analysis.

The tool was originally developed for the analysis of [L1T ntuples for Phase-2 e/g](https://github.com/cerminar/Phase2EGTriggerAnalysis) but should work with any kind of flat ntuple.

## Pre-requisites: first time setup

The tool can be run on any private machine using just `python`, `pip` and `virtualenvwrapper`.
If you plan to run it on lxplus, you might want to follow point `1` below.

### 1. lxplus setup

This step is `lxplus` specific, giving access to a more recent `python` and `root` version.
@@ -39,7 +44,7 @@ The **first time** you will have to create the actual instance of the `virtualenv`:

and

[requirements_py3.8.txt](requirements_py3.10.txt)
[requirements_py3.10.txt](requirements_py3.10.txt)

for python 3.8 and 3.10 respectively.

Expand All @@ -57,7 +62,6 @@ Edit/skip it accordingly for your specific system.

`source setup_lxplus.sh`


### 2. setup `virtualenvwrapper`

To start using `virtualenvwrapper`:
Expand All @@ -75,16 +79,22 @@ After this initial (once in a time) setup is done you can just activate the virt

## Running the analysis

The main script is `analyzeHgcalL1Tntuple.py`:
The main script is `analyzeNtuples.py`:

`python analyzeHgcalL1Tntuple.py --help`
`python analyzeNtuples.py --help`

An example of how to run it:

`python analyzeHgcalL1Tntuple.py -f cfg/hgctps.yaml -i cfg/datasets/ntp_v81.yaml -c tps -s doubleele_flat1to100_PU200 -n 1000 -d 0`
`python analyzeNtuples.py -f cfg/hgctps.yaml -i cfg/datasets/ntp_v81.yaml -c tps -s doubleele_flat1to100_PU200 -n 1000 -d 0`

## General idea

Data are read into `collections` of objects, each corresponding to an `array`, and are processed by `plotters`, which create sets of histograms for different `selections` of the data `collections`.


### Configuration file
The configuration is handled by two YAML files.

One specifying
- output directories
- versioning of the plots
Expand All @@ -94,12 +104,13 @@ The other prividing
- details of the input samples (location of the ntuple files)

Examples of configuration files can be found in:
- [cfg/default.yaml](cfg/default.yaml)
- [cfg/datasets/ntp_v66.yaml](cfg/datasets/ntp_v66.yaml)
- [cfg/egplots.yaml](cfg/egplots.yaml)
- [cfg/datasets/ntp_v92.yaml](cfg/datasets/ntp_v92.yaml)
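For the dataset side, a minimal sketch in the same spirit (fields abbreviated from [cfg/datasets/ntp_v92.yaml](cfg/datasets/ntp_v92.yaml)):

```yaml
samples:
  input_dir: /eos/cms/store/cmst3/group/l1tr/cerminar/l1teg/ntuples/
  tree_name: l1EGTriggerNtuplizer_l1tCorr/L1TEGTriggerNtuple

  doubleele_flat1to100_PU200:
    input_sample_dir: DoubleElectron_FlatPt-1To100-gun/DoubleElectron_FlatPt-1To100_PU200_v92G/
    events_per_job: 200
```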


### Reading ntuple branches or creating derived ones
The list of branches to be read and converted in pandas `DataFrame` format is specified in the module

The list of branches to be read and converted to `awkward array` format is specified in the module:

[collections](python/collections.py)

Expand All @@ -111,7 +122,7 @@ Selections are defined as strings in the module:
[selections](python/selections.py)

Different collections are defined for different objects and/or different purposes. The selections have a `name` which is used for the histogram naming (see below). Selections are used by the plotters.

Selections can be combined and retrieved via regular expressions in the configuration of the plotters.
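The retrieval-by-regex idea can be sketched as follows (a toy re-implementation with made-up selection names and cuts; the real `Selector` lives in [selections](python/selections.py)):

```python
import re

# A registry of named selections (names and cut strings are illustrative).
SELECTIONS = {
    "all": "pt >= 0",
    "Pt15": "pt > 15",
    "Pt30": "pt > 30",
    "EtaBC": "1.52 < abs(eta) < 2.4",
}

def select(pattern):
    """Return the (name, cut) pairs whose name fully matches the regex."""
    return [(name, cut) for name, cut in SELECTIONS.items()
            if re.fullmatch(pattern, name)]

print([name for name, _ in select("Pt[1-3][05]")])  # → ['Pt15', 'Pt30']
```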

### Adding a new plotter
The actual functionality of accessing the objects, filtering them according to the `selections` and filling `histograms` is provided by the plotter classes defined in the module:
Expand All @@ -137,9 +148,22 @@ The histogram naming follows the convention:
This is assumed in all the `plotters` and in the code to actually draw the histograms.


## Histogram drawing

Of course you can use your favorite set of tools. I use my own, [plot-drawing-tools](https://github.com/cerminar/plot-drawing-tools), which is based on `jupyter notebooks`.

`cd ntuple-tools`
`git clone [email protected]:cerminar/plot-drawing-tools.git`
`jupyter-notebook`

## HELP

Can't figure out how to do some manipulation using `awkward array` or `uproot`? You can take a look at the examples and play with the arrays in:
[plot-drawing-tools/blob/master/eventloop-uproot-ak.ipynb](https://github.com/cerminar/plot-drawing-tools/blob/master/eventloop-uproot-ak.ipynb)

## Submitting to the batch system

Note that the script `analyzeHgcalL1Tntuple.py` can be used to submit the jobs to the HTCondor batch system invoking the `-b` option. A dag configuration is created and you can actually submit it following the script output.
Note that the script `analyzeNtuples.py` can be used to submit jobs to the HTCondor batch system by invoking the `-b` option. A DAG configuration is created, and you can submit it following the script output.

### Note about the `hadd` job
For each sample injected into the batch system a DAG is created. The DAG will submit an `hadd` job once all the jobs have succeeded.
17 changes: 7 additions & 10 deletions analyzeHgcalL1Tntuple.py → analyzeNtuples.py
@@ -30,7 +30,7 @@

# import root_numpy as rnp
import pandas as pd
import uproot4 as up
import uproot as up

from python.main import main
import python.l1THistos as histos
@@ -122,10 +122,9 @@ def analyze(params, batch_idx=-1):
if params.rate_pt_wps:
calib_manager.set_pt_wps_version(params.rate_pt_wps)


output = ROOT.TFile(params.output_filename, "RECREATE")
output.cd()
output = up.recreate(params.output_filename)
hm = histos.HistoManager()
hm.file = output

# instantiate all the plotters
plotter_collection = []
@@ -156,7 +155,8 @@ def analyze(params, batch_idx=-1):
for tree_file_name in files_with_protocol:
if break_file_loop:
break
tree_file = up.open(tree_file_name, num_workers=2)
# tree_file = up.open(tree_file_name, num_workers=2)
tree_file = up.open(tree_file_name, num_workers=1)
print(f'opening file: {tree_file_name}')
ttree = tree_file[params.tree_name.split('/')[0]][params.tree_name.split('/')[1]]

Expand All @@ -180,7 +180,7 @@ def analyze(params, batch_idx=-1):
# if tree_reader.global_entry % 100 == 0:
# tr.collect_stats()

if tree_reader.global_entry != 0 and tree_reader.global_entry % 1000 == 0:
if tree_reader.global_entry != 0 and tree_reader.global_entry % 10000 == 0:
print("Writing histos to file")
hm.writeHistos()

Expand All @@ -205,11 +205,8 @@ def analyze(params, batch_idx=-1):
# print("Processed {} events/{} TOT events".format(nev, ntuple.nevents()))

print("Writing histos to file {}".format(params.output_filename))

output.cd()
hm.writeHistos()

output.Close()
output.close()
# ROOT.ROOT.DisableImplicitMT()

return tree_reader.n_tot_entries
18 changes: 18 additions & 0 deletions cfg/compIDtuples.py
@@ -0,0 +1,18 @@
from __future__ import absolute_import
import python.plotters as plotters
import python.collections as collections
import python.selections as selections


# simple_selections = (selections.Selector('^EGq[4-5]$')*('^Pt[1-3][0]$|all'))()

comp_selections = (selections.Selector('^Pt15|all')&('^EtaABC$|^EtaBC$|all'))()
sim_selections = (selections.Selector('^GEN$')&('^Ee$|all')&('^Pt15|all')&('^EtaABC$|^EtaBC$|all'))()

compid_plotters = [
plotters.CompTuplesPlotter(collections.TkEleEE, comp_selections),
plotters.CompCatTuplePlotter(collections.TkEleEE, collections.sim_parts, comp_selections, sim_selections)
]

for sel in sim_selections:
print(sel)
44 changes: 44 additions & 0 deletions cfg/compIDtuples.yaml
@@ -0,0 +1,44 @@

common:
output_dir:
default: /eos/user/c/cerminar/hgcal/CMSSW1015/plots/
matterhorn: /Users/cerminar/cernbox/hgcal/CMSSW1015/plots/
Matterhorn: /Users/cerminar/cernbox/hgcal/CMSSW1015/plots/
triolet: /Users/cerminar/cernbox/hgcal/CMSSW1015/plots/
output_dir_local: /Users/cerminar/cernbox/hgcal/CMSSW1015/plots/
output_dir_lx: /eos/user/c/cerminar/hgcal/CMSSW1015/plots/
plot_version: v160A
run_clustering: False
run_density_computation: False
# +AccountingGroup = "group_u_CMS.u_zh.users"
# +AccountingGroup = "group_u_CMST3.all"

collections:

compid:
file_label:
compid
samples:
# - ele_flat2to100_PU0
# - ele_flat2to100_PU200
# - doubleele_flat1to100_PU0
- doublephoton_flat1to100_PU200
- doubleele_flat1to100_PU200
- nugun_alleta_pu200
# - photon_flat8to150_PU0
# - photon_flat8to150_PU200
# - dyll_PU200
plotters:
- !!python/name:cfg.compIDtuples.compid_plotters
htc_jobflavor:
microcentury
priorities:
doubleele_flat1to100_PU0: 2
doubleele_flat1to100_PU200: 7
doublephoton_flat1to100_PU200: 6
nugun_alleta_pu200: 6
events_per_job:
doubleele_flat1to100_PU0: 10000
doubleele_flat1to100_PU200: 10000
doublephoton_flat1to100_PU200: 10000
nugun_alleta_pu200: 10000
10 changes: 5 additions & 5 deletions cfg/datasets/ntp_v91.yaml
@@ -8,7 +8,7 @@ samples:

# tree_name: hgcalTriggerNtuplizer/HGCalTriggerNtuple
tree_name: l1EGTriggerNtuplizer_l1tCorr/L1TEGTriggerNtuple
rate_pt_wps: data/rate_pt_wps_v152B.90A.json
rate_pt_wps: data/rate_pt_wps_v160A.91G.json
# tree_name: l1CaloTriggerNtuplizer/HGCalTriggerNtuple

# doubleele_flat1to100_PU0:
@@ -58,10 +58,10 @@ samples:
# input_sample_dir: NuGunAllEta_PU200/NTP/v80A/
# input_sample_dir: NeutrinoGun_E_10GeV/NuGunAllEta_PU200_v47/191105_135050/0000/
events_per_job: 300
#
# ttbar_PU200:
# input_sample_dir: TT_TuneCP5_14TeV-powheg-pythia8/TT_PU200_v82B/
# events_per_job: 200

ttbar_PU200:
input_sample_dir: TT_TuneCP5_14TeV-powheg-pythia8/TT_PU200_FWTest10k
events_per_job: 200

# zprime_ee_PU200:
# input_sample_dir: ZprimeToEE_M-6000_TuneCP5_14TeV-pythia8/ZPrimeEE_PU200_v82
73 changes: 73 additions & 0 deletions cfg/datasets/ntp_v92.yaml
@@ -0,0 +1,73 @@
# NOTE: fix of track extrapolation (digitized tracks with bitwise extrapolation)
# branch:

samples:
input_dir: /eos/cms/store/cmst3/group/l1tr/cerminar/l1teg/ntuples/
calib_version: calib-v134C
version: 92G

# tree_name: hgcalTriggerNtuplizer/HGCalTriggerNtuple
tree_name: l1EGTriggerNtuplizer_l1tCorr/L1TEGTriggerNtuple
rate_pt_wps: data/rate_pt_wps_v152B.90A.json
# tree_name: l1CaloTriggerNtuplizer/HGCalTriggerNtuple

# doubleele_flat1to100_PU0:
# input_sample_dir: DoubleElectron_FlatPt-1To100/DoubleElectron_FlatPt-1To100_PU0_v64E/
# events_per_job : 500
# # gen_selections: !!python/name:python.selections.genpart_photon_selections

doubleele_flat1to100_PU200:
input_sample_dir: DoubleElectron_FlatPt-1To100-gun/DoubleElectron_FlatPt-1To100_PU200_v92G/
events_per_job : 200

doublephoton_flat1to100_PU200:
input_sample_dir: DoublePhoton_FlatPt-1To100-gun/DoublePhoton_FlatPt-1To100_PU200_v92G/
events_per_job : 200

# ele_flat2to100_PU0:
# input_sample_dir: SingleElectron_PT2to200/SingleE_FlatPt-2to200_PU0_v60G2/
# events_per_job : 500
# # gen_selections: !!python/name:python.selections.genpart_photon_selections
#
# ele_flat2to100_PU200:
# input_sample_dir: SingleElectron_PT2to200/SingleE_FlatPt-2to200_PU200_v60G2/
# events_per_job : 200
#
# photon_flat8to150_PU0:
# input_sample_dir: SinglePhoton_PT2to200/SinglePhoton_FlatPt-2to200_PU0_v60D/
# events_per_job : 500
#
# photon_flat8to150_PU200:
# input_sample_dir: SinglePhoton_PT2to200/SinglePhoton_FlatPt-2to200_PU200_v60D/
# events_per_job : 200
#
# pion_flat2to100_PU0:
# input_sample_dir: SinglePion_FlatPt-2to100/SinglePion_FlatPt-2to100_PU0_v33/190911_081445/0000/
# events_per_job : 500
#
# pion_flat2to100_PU200:
# input_sample_dir: SinglePion_FlatPt-2to100/SinglePion_FlatPt-2to100_PU200_v33/190911_081546/0000/
# events_per_job : 200
# #
# nugun_alleta_pu0:
# input_sample_dir: SingleNeutrino/NuGunAllEta_PU0_v14/190123_172948/0000/
# events_per_job: 500

nugun_alleta_pu200:
input_sample_dir: MinBias_TuneCP5_14TeV-pythia8/NuGunAllEta_PU200_v92G/
# input_sample_dir: NuGunAllEta_PU200/NTP/v80A/
# input_sample_dir: NeutrinoGun_E_10GeV/NuGunAllEta_PU200_v47/191105_135050/0000/
events_per_job: 300
#
# ttbar_PU200:
# input_sample_dir: TT_TuneCP5_14TeV-powheg-pythia8/TT_PU200_v82B/
# events_per_job: 200


dyll_PU200:
input_sample_dir: DYToLL_M-50_TuneCP5_14TeV-pythia8/DYToLL_PU200_v92G
events_per_job: 200

dyll_M10to50_PU200:
input_sample_dir: DYToLL_M-10To50_TuneCP5_14TeV-pythia8/DYToLL_M10To50_PU200_v92G
events_per_job: 200