From de3173805cf3510d2708efe89d9493426ab5e8bd Mon Sep 17 00:00:00 2001
From: GintasS
Date: Tue, 9 Apr 2024 17:23:49 +0200
Subject: [PATCH] Improved readme with instructions.

---
 README.md | 319 +++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 229 insertions(+), 90 deletions(-)

diff --git a/README.md b/README.md
index 5e3f45b..b875917 100755
--- a/README.md
+++ b/README.md
@@ -1,180 +1,319 @@
-# ntuple-analysis
+# ntuple-analysis
 PYTHON framework for the analysis of [ROOT](https://root.cern/) `TTree` data using [uproot](https://uproot.readthedocs.io/en/latest/) for the IO and [awkward-array](https://awkward-array.org/doc/main/) for the columnar data analysis.
-The tool is developed for the analysis of [FastPUPPI](https://github.com/p2l1pfp/FastPUPPI) but should work with any kind of flat ntuples.
+The tool was developed for the analysis of [FastPUPPI](https://github.com/p2l1pfp/FastPUPPI) but should work with any kind of flat ntuple.
 
-## Pre-requisites: first time setup
+## In this README
 
-The tool can be run on any private machines using just `python`, `pip` and `virtualenvwrapper`.
-If you plan to run it on lxplus you might want to look at the point `1` below.
+- [Features](#features)
+- [Requirements](#requirements)
+- [Usage](#usage)
+  - [First time setup](#first-time-setup)
+  - [After the first setup](#after-the-first-setup)
+  - [Running the Jupyter Notebook on Windows](#running-the-jupyter-notebook-on-windows)
+- [Main script](#main-script)
+  - [How does analyzeNtuples.py work?](#how-does-analyzentuplespy-work)
+  - [Configuration files for the analyzeNtuples.py script](#configuration-files-for-the-analyzentuplespy-script)
+- [Submitting to the batch system](#submitting-to-the-batch-system)
+- [FAQ](#faq)
+- [Contributing](#contributing)
 
-### 1. lxplus setup
+## Features
 
-This step is `lxplus` specific, givin access to a more recent `python` and `root` version.
-Edit/skip it accordingly for your specific system.
+The features
 
-`source setup_lxplus.sh`
+ - To be added
 
-### 2. install `virtualenvwrapper`
+## Requirements
 
-This stetp needs to be done **only once** for your account and can be done with whatever `python` version is in use in the system.
+- A computing account on the LXPLUS CERN service.
 
-For some reason the current `CMSSW` scrips seems to deliver an inconsistent setup of `virtualenv` and `virtualenvwrapper`, for this reason we force a new installation in `~/.local` using:
+## Usage
 
-`pip install --ignore-installed --user virtualenv==15.1.0 virtualenvwrapper`
+### First time setup
 
-For a more complete overview of the procedure you can refer to
-`virtualenvwrapper` [installation instructions](https://virtualenvwrapper.readthedocs.io/en/latest/install.html)
+1. Log in to an LXPLUS machine.
 
-### 3. setup `virtualenvwrapper`
+    *A computing account at CERN is required.*
+
+    *This step is LXPLUS specific, giving access to more recent Python and ROOT versions. Edit/skip it accordingly for your specific system.*
 
-For starting using virtualenvwrapper
+    ```
+    ssh lxplus.cern.ch
+    ```
+
+    *After that, create a new directory to store the ntuple-analysis project:*
 
-`source setVirtualEnvWrapper.sh`
+    ```
+    mkdir "your directory"
+    ```
+2. Clone this repository into the new directory.
 
-### 4. create a virtualenv for the project
+    ```
+    git clone https://github.com/cerminar/ntuple-analysis.git "your directory"/ntuple-analysis
+    ```
+
+    *Or you can create a fork of this repository if you plan to contribute.*
 
-The **first time** you will have to create the actual instance of the `virtualenv`:
+3. Change into your newly created directory:
+    ```
+    cd "your directory"
+    ```
 
-``mkvirtualenv --system-site-packages
- -p `which python3.9` -r requirements.txt ``
+    *Then change into the ntuple-analysis directory:*
+    ```
+    cd ntuple-analysis
+    ```
+
+4. Clone the [Plot Drawing Tools repository](https://github.com/cerminar/plot-drawing-tools) for Jupyter Notebook support.
 
-[requirements.txt](requirements.txt)
+    ```
+    git clone https://github.com/cerminar/plot-drawing-tools.git
+    ```
 
-You can use the file directly using for example:
+5. Edit the ```setVirtualEnvWrapper.sh``` script to set the virtualenv home directory for your user.
 
-`pip install -r requirements.txt`
+    *You can use Nano, Vim, or your favorite editor.*
+    ```
+    vim setVirtualEnvWrapper.sh
+    ```
 
-*NOTE*: `python > 3.9` is a requirement.
+    *Edit the first line of the ```setVirtualEnvWrapper.sh``` script:*
+    ```
+    export WORKON_HOME=/data/YOUR_USERNAME/.virtualenvs
+    ```
+    *where YOUR_USERNAME is your username.*
 
-## Setup after first installation
+6. Run the shell script ```setup_lxplus.sh``` to set up the LXPLUS service.
 
-### 1. lxplus setup
+    ```
+    source setup_lxplus.sh
+    ```
 
-This step is `lxplus` specific, givin access to a more recent `python` and `root` version.
-Edit/skip it accordingly for your specific system.
+7. Run another shell script, ```setVirtualEnvWrapper.sh```, to initialize the virtual environment wrapper.
 
-`source setup_lxplus.sh`
+    ```
+    source setVirtualEnvWrapper.sh
+    ```
 
-### 2. setup `virtualenvwrapper`
+    *To learn more about the Virtual Environment Wrapper, see the [docs](https://virtualenvwrapper.readthedocs.io/en/latest/install.html).*
+
+8. Create a virtual environment for the project.
 
-For starting using virtualenvwrapper
+    ```
+    mkvirtualenv --system-site-packages -p `which python3.9` -r requirements.txt venvname
+    ```
+    *where venvname is the name of your new virtual environment*
 
-`source setVirtualEnvWrapper.sh`
+    *If you created a virtual environment in a different way, you can use:*
+    ```
+    pip install -r requirements.txt
+    ```
+    *NOTE: python > 3.9 is a requirement.*
+
+9. Activate the virtual environment (if it's not active already).
 
-### 3. activate the `virtualenv`
+    ```
+    workon venvname
+    ```
+    *where venvname is the name of your new virtual environment*
 
-After this initial (once in a time) setup is done you can just activate the virtualenv calling:
+10. In order to use Jupyter Notebooks, we need to reinstall the ```traitlets``` package.
+
+    ```
+    pip uninstall traitlets
+    ```
 
-`workon `
+    *and then*
+    ```
+    pip install traitlets==5.9.0
+    ```
 
-(`lsvirtualenv` is your friend in case you forgot the name).
+11. Install a custom kernel with all of the packages from your virtual environment.
+    ```
+    python3 -m ipykernel install --name venvname --user
+    ```
+    *where venvname is the name of your new virtual environment*
 
-### Conda environment
-You can use also conda to install all the dependencies and root
+    *Source: [here](https://stackoverflow.com/questions/28831854/how-do-i-add-python3-kernel-to-jupyter-ipython)*
 
-```bash
-conda create env_name python=3.11
-conda activate env_name
-conda install root #In the conda-forge channel
-pip install -r requirements.txt
-```
+12. Launch the Jupyter Notebook.
+
+    *You can launch it in the LXPLUS service:*
+    ```
+    jupyter notebook
+    ```
+    *or, if you are using Windows, to access it from Windows:*
+    ```
+    jupyter notebook --no-browser --port=8095
+    ```
+
+### After the first setup
+1. Log in to an LXPLUS machine.
 
-## Running the analysis
+    *A computing account at CERN is required.*
 
-The main script is `analyzeNtuples.py`:
+    ```
+    ssh lxplus.cern.ch
+    ```
+
+    *Change into the root directory of the ntuple-analysis project:*
+    ```
+    cd "your directory"/ntuple-analysis
+    ```
+2. Run the shell script ```setup_lxplus.sh``` to set up the LXPLUS service.
 
-`python analyzeNtuples.py --help`
+    ```
+    source setup_lxplus.sh
+    ```
 
-An example of how to run it:
+3. Run another shell script, ```setVirtualEnvWrapper.sh```, to initialize the virtual environment wrapper.
+
+    ```
+    source setVirtualEnvWrapper.sh
+    ```
+
+    *To learn more about the Virtual Environment Wrapper, see the [docs](https://virtualenvwrapper.readthedocs.io/en/latest/install.html).*
+
+4. Activate the virtual environment (if it's not active already).
+
+    ```
+    workon venvname
+    ```
+    *where venvname is the name of your existing virtual environment from the first setup.*
+
+    *Also, ```lsvirtualenv``` is your friend if you forget the name of the virtualenv.*
+
+### Running the Jupyter Notebook on Windows
+
+You need to do the following:
+
+1. Download and install [Ubuntu](https://apps.microsoft.com/detail/9pdxgncfsczv?hl=en-us&gl=US) for Windows.
+2. Launch the first instance of Ubuntu (referred to here as #1) and complete all the steps in [First time setup](#first-time-setup) on that machine.
+3. Launch another instance of Ubuntu (referred to here as #2) and create a tunnel between instance #1 and your Windows machine by running:
+    ```
+    ssh -L 8099:localhost:8095 YOUR_MACHINE_URL
+    ```
+    *YOUR_MACHINE_URL is the address used to connect to your machine.*
+
+    `NOTE:` If you use the LXPLUS service, the tunnel may not be created with the same machine that launched Jupyter Notebook with the following:
+
+    ```
+    jupyter notebook --no-browser --port=8095
+    ```
+
+    In that case, you need a dedicated CentOS machine.
+4. Open http://localhost:8099/ in the browser on Windows.
+
+    `NOTE:` When in Jupyter Notebook, it is important to select the kernel that you created with the `python3 -m ipykernel install` command in step 11 of [First time setup](#first-time-setup).
+
+
+## Main script
+
+The main script is ```analyzeNtuples.py```:
+
+```
+python analyzeNtuples.py --help
+```
 
-`python analyzeNtuples.py -f cfg/hgctps.yaml -i cfg/datasets/ntp_v81.yaml -c tps -s doubleele_flat1to100_PU200 -n 1000 -d 0`
+An example of how to run it:
+```
+python analyzeNtuples.py -f cfg/hgctps.yaml -i cfg/datasets/ntp_v81.yaml -c tps -s doubleele_flat1to100_PU200 -n 1000 -d 0
+```
 
-## General idea
+### How does analyzeNtuples.py work?
 
-Data are read in `collections` of objects corresponding to an `array` and are processed by `plotters` which creates set of histograms for different `selections` of the data `collections`.
+Data are read in `collections` of objects corresponding to an `array` and are
+processed by `plotters`, which create sets of histograms for different `selections` of the data `collections`.
+### Configuration files for the analyzeNtuples.py script
 
-### Configuration file
-The configuration is handled by 2 yaml files.
+The configuration is handled by two YAML files.
 
-One specifying
+The first YAML file (e.g., ```hgctps.yaml```) specifies:
  - output directories
  - versioning of the plots
- - collections of samples, i.e. group of samples to be processed homogeneously: for each collection the list of plotters (see below) to be run is provided.
+ - collections of samples, i.e., groups of samples to be processed homogeneously:
+   - for each collection, the list of plotters (see below) to be run is provided.
 
-The other prividing
+The second YAML file (e.g., ```ntp_v81.yaml```) provides:
  - details of the input samples (location of the ntuple files)
-Example of configuration file can be found in:
+An example of the YAML configuration files can be found here:
  - [cfg/egplots.yaml](cfg/egplots.yaml)
  - [cfg/datasets/ntp_v92.yaml](cfg/datasets/ntp_v92.yaml)
+## Submitting to the batch system
+
+Note that the script ```analyzeNtuples.py``` can be used to submit the jobs to the HTCondor batch system by invoking the `-b` option.
+A DAG configuration is created, and you can submit it by following the script output.
+
+### Note about the `hadd` job
+
+For each sample injected into the batch system, a DAG is created. The DAG will submit an `hadd` command once all the jobs succeed. However, if you don't want to wait (or don't care), you can also submit a condor job that will run `hadd` periodically, thus dramatically reducing the latency. For example:
+
+```condor_submit batch_single_empart_guns_tracks_v77/ele_flat2to100_PU0/batch_harvest.sub```
+
+
+## FAQ
 
-### Reading ntuple branches or creating derived ones
+#### - How can you read ntuple branches or create derived branches?
 
-The list of branches to be read and converted to `Awkward Arrays` format is specified in the module
+The list of branches to be read and converted to `Awkward Arrays` format is specified in the module:
 [collections](python/collections.py)
 Instantiating an object of class `DFCollection`. What is actually read event by event depends anyhow on which plotters are actually instantiated (collections are read on-demand).
-### Selecting subsets of object collections
+#### - How can you select a subset of an object collection?
 Selections are defined as strings in the module:
 [selections](python/selections.py)
-Different collections are defined for different objects and/or different purposes. The selections have a `name` whcih is used for the histogram naming (see below). Selections are used by the plotters.
-Selections can be combined and retrieved via regular expressions in the configuration of the plotters.
+Different collections are defined for different objects and/or different purposes. The selections have a `name`, which is used for the histogram naming (see below). Selections are used by the plotters. Selections can be combined and retrieved via regular expressions in the configuration of the plotters.
 
-### Adding a new plotter
-The actual functionality of accessing the objects, filtering them according to the `selections` and filling `histograms` is provided by the plotter classes defined in the module:
+#### - How can you add a new plotter?
+
+The actual functionality of accessing the objects, filtering them according to the `selections`, and filling histograms is provided by the plotter classes defined in the module:
 [plotters](python/plotters.py)
-Basic plotters are already available, most likely you just need to instantiate one of them (or a collection of them) using the `DFCollection` instance you are interested in.
-Which collection is run for which sample is steered by the configuration file.
+Basic plotters are already available; most likely, you just need to instantiate one of them (or a collection of them) using the `DFCollection` instance you are interested in. Which collection is run for which sample is steered by the configuration file.
 
-The plotters access one or more collections, select them in several different ways, book and fill the histograms (see below).
+The plotters access one or more collections, select them in several different ways, and book and fill the histograms (see below).
 
-### Adding a new histogram
+#### - How can you add a new histogram?
 Histograms are handled in the module:
 [l1THistos](python/l1THistos.py)
-There are different classes of histograms depending on the input object and on the purpose.
-To add a new histogram to an existing class it is enough to add it in the corresponding constructor and in the `fill` module. The writing of the histos to files is handled transparently.
+There are different classes of histograms depending on the input object and the purpose. To add a new histogram to an existing class, it is enough to add it in the corresponding constructor and in the `fill` module. The writing of the histos to files is handled transparently.
 The histogram naming follows the convention:
 `___`
 This is assumed in all the `plotters` and in the code to actually draw the histograms.
+#### Histogram drawing
 
-
-## Histogram drawing
-
-Of course you can use your favorite set of tools. I use mine [plot-drawing-tools](https://github.com/cerminar/plot-drawing-tools), which is based on `jupyter notebooks`.
-
-`cd ntuple-analysis`
-`git clone git@github.com:cerminar/plot-drawing-tools.git`
-`jupyter-notebook`
+Of course, you can use your favorite set of tools.
+I use my [plot-drawing-tools](https://github.com/cerminar/plot-drawing-tools),
+which is based on `Jupyter notebooks`.
 
-## HELP
-
-I can't figure out how to do some manipulation using the `awkward array` or `uproot`....you can take a look at examples and play witht the arrays in:
-[plot-drawing-tools/blob/master/eventloop-uproot-ak.ipynb](https://github.com/cerminar/plot-drawing-tools/blob/master/eventloop-uproot-ak.ipynb)
-## Submitting to the batch system
+```
+cd ntuple-analysis
+git clone git@github.com:cerminar/plot-drawing-tools.git
+jupyter-notebook
+```
 
-Note that the script `analyzeNtuples.py` can be used to submit the jobs to the HTCondor batch system invoking the `-b` option. A dag configuration is created and you can actually submit it following the script output.
+## Contributing
 
-### Note about hadd job.
-For each sample injected in the batch system a DAG is created. The DAG will submitt an `hadd` command once all the jobs will succeed.
-However, if you don't want to wait (or you don't care) you can submit also a condor job that will run hadd periodically thus reducing dramatically the latency.
-For example:
+If you want to contribute to this project, you are very welcome. Just fork the project, set it up on your own machine, and play with it. If you have any questions, post them on the issues/discussions tab.
 
-`condor_submit batch_single_empart_guns_tracks_v77/ele_flat2to100_PU0/batch_harvest.sub`
+If you can't figure out how to do some manipulation using `awkward array` or `uproot`, you can take a look at the examples and play with the arrays in:
+[plot-drawing-tools/blob/master/eventloop-uproot-ak.ipynb](https://github.com/cerminar/plot-drawing-tools/blob/master/eventloop-uproot-ak.ipynb)