Konstantin Androsov edited this page Apr 12, 2020 · 4 revisions

Production v6 instructions

Framework installation

  1. Install the framework on lxplus6 in a prod workspace directory, without creating a CMSSW release area beforehand

    curl -s https://raw.githubusercontent.com/hh-italian-group/h-tautau/master/install_framework.sh | bash -s prod 8
  2. Check framework production functionality interactively for a few samples

    cd CMSSW_10_2_20/src
    cmsenv
    # Run interactively on a few events
    cmsRun h-tautau/Production/python/Production.py inputFiles=/store/mc/RunIIAutumn18MiniAOD/ttHToNonbb_M125_TuneCP5_13TeV-powheg-pythia8/MINIAODSIM/102X_upgrade2018_realistic_v15-v2/20000/092E70B7-08D1-8B44-B300-A86F4A0856B9.root sampleType=MC_18 applyTriggerMatch=True saveGenTopInfo=True saveGenBosonInfo=True saveGenJetInfo=True tupleOutput=prod_try.root maxEvents=100
    # check the output
    root -l prod_try.root
  3. Install the framework in Pisa (using a fai machine with the slc6 image)

    curl -s https://raw.githubusercontent.com/hh-italian-group/h-tautau/master/install_framework.sh | bash -s prod 4
    cd CMSSW_10_2_20/src
    cmsenv
    ./run.sh --make TupleMerger

Setup CRAB working environment on lxplus6

Each time after login:

source /cvmfs/cms.cern.ch/crab3/crab.sh #or .csh 
voms-proxy-init --voms cms --valid 168:00
cd CMSSW_DIR/src/h-tautau/Production/crab
cmsenv
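
Since the proxy above is requested for 168 hours, it can expire in the middle of a long production. A minimal sketch of a check that could run before CRAB commands is given below. The function name proxy_needs_renewal and the 24-hour default threshold are assumptions made for illustration; they are not part of the framework.

```shell
# Hypothetical helper (not part of h-tautau): succeeds (exit 0) when the proxy should be renewed.
# $1 = seconds of validity left, $2 = minimum acceptable number of hours (assumed default: 24).
proxy_needs_renewal() {
    local seconds_left="$1"
    local min_hours="${2:-24}"
    [ "$seconds_left" -lt $(( min_hours * 3600 )) ]
}

# Usage sketch, assuming voms-proxy-info --timeleft prints the remaining validity in seconds:
# left=$(voms-proxy-info --timeleft 2>/dev/null)
# if proxy_needs_renewal "${left:-0}" ; then
#     voms-proxy-init --voms cms --valid 168:00
# fi
```

The comparison is kept separate from the voms-proxy-info call so that the threshold logic can be checked without a grid environment.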

Production spreadsheet legend

  • done: all crab jobs of the task have finished successfully.
  • tuple: the tuples of a task whose crab jobs have all finished successfully have been merged and copied to the central storage in Pisa.
  • 99p and 99t: same as done and tuple defined above, but for tasks in which at least 99% of the jobs have finished successfully and the few remaining jobs have failed. These tasks should be considered "finished", so follow the full procedure explained below. The only exceptions are DATA and embedded samples, for which 100% of the jobs must finish.
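
The legend above can be expressed as a small decision rule. The sketch below is only an illustration: the function name task_label, the sample kinds mc, data and embedded, and the "not finished" label are assumptions, and it looks only at job counts (it does not verify that the remaining jobs have really failed).

```shell
# Hypothetical helper: print the spreadsheet label for a task from its job counts.
# $1 = successfully finished jobs, $2 = total jobs, $3 = sample kind (mc, data or embedded).
task_label() {
    local finished="$1" total="$2" kind="${3:-mc}"
    if [ "$total" -gt 0 ] && [ "$finished" -eq "$total" ]; then
        echo "done"
    elif [ "$kind" = "mc" ] && [ "$total" -gt 0 ] \
         && [ $(( finished * 100 )) -ge $(( total * 99 )) ]; then
        # At least 99% finished; allowed for MC only, DATA and embedded need 100%.
        echo "99p"
    else
        echo "not finished"   # label invented for this sketch
    fi
}

# Usage sketch:
# task_label 198 200 mc      # at least 99% of an MC task
# task_label 198 200 data    # DATA must reach 100%
```

Integer arithmetic is used for the 99% threshold to avoid floating-point comparisons in the shell.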

Production workflow

Steps 0, 2-6 should be repeated periodically, at least once per day, until the end of the production.

  1. Define YEAR variable

     export YEAR=VALUE  # where VALUE is 2016, 2017 or 2018
  2. Submit jobs

    ./submit.py --work-area work-area_$YEAR --cfg ../python/Production.py --site T2_IT_Pisa --output hh_bbtautau_prod_v6_$YEAR config/$YEAR/config1 [config/$YEAR/config2] ...
    • Run ./submit.py --help to get more details about the available parameters.
    • Submit each year separately, taking care to use a different output folder for each.
    • To avoid saturating the CRAB scheduler, it is better to submit one year at a time and wait until about 50% of the jobs have finished before submitting the other years.
    • The embedded samples are published in the phys03 DBS, therefore --inputDBS phys03 should be specified during the submission.
  3. Check jobs status

    ./multicrab.py --workArea work-area_$YEAR --crabCmd status

    Analyze the output of the status command for each task:

    1. If a few jobs have failed without any persistent pattern, resubmit them:
      crab resubmit -d work-area_$YEAR/TASK_AREA
    2. If a significant number of jobs are failing, investigate the reason and act accordingly. For more details see the CRAB troubleshoot section below.
    3. If all jobs have finished successfully (or >=99% of them), move the task area from "work-area_$YEAR" into the "finished_$YEAR" directory (create it if needed).
      # mkdir -p finished_$YEAR
      mv work-area_$YEAR/TASK_AREA finished_$YEAR

    Before moving the directory, make sure that all jobs have status FAILED or FINISHED, otherwise wait (use kill command, if necessary).

    1. Create task lists for the tasks in the "finished_$YEAR" directory and transfer them to the Pisa server.

      if [ -f current-check_$YEAR.txt ] ; then rm -f prev-check_$YEAR.txt ; mv current-check_$YEAR.txt prev-check_$YEAR.txt ; fi
      ./create_job_list.sh finished_$YEAR | sort > current-check_$YEAR.txt
    2. For all tasks from "finished_$YEAR" (especially if they are 99%) create crab reports

      # mkdir -p crab_results_$YEAR
      for JOB in $(ls finished_$YEAR) ; do if [ ! -f "crab_results_$YEAR/$JOB.tar.gz" ] ; then echo "finished_$YEAR/$JOB" ; crab report  -d "finished_$YEAR/$JOB" ; tar -czvf crab_results_$YEAR/$JOB.tar.gz finished_$YEAR/$JOB/results/ ; fi ; done

      These reports make it possible to partially rerun a task, should that be needed in the future.

    3. Update the prod_v6/$YEAR spreadsheet accordingly, following the notation defined in the "Production spreadsheet legend" section above. You can use the following command line to get a list of the newly finished tasks:

      if [ -f prev-check_$YEAR.txt ] ; then diff current-check_$YEAR.txt prev-check_$YEAR.txt ; else cat current-check_$YEAR.txt ; fi
  4. Submit merge jobs for the output files on the Pisa server.

    1. The file current-check_$YEAR.txt has to be transferred from lxplus to the src directory on the Pisa server.
    2. Submit merge jobs in the interactive queue on the Pisa server from a fai machine (bsub -Is -n 1 -q fai -a "docker-sl6" /bin/bash):
      python -u h-tautau/Instruments/python/submit_tuple_merger.py --queue interactive --crab-outputs /gpfs/ddn/srm/cms/store/user/USER/hh_bbtautau_prod_v6_$YEAR --finished-tasks current-check_$YEAR.txt --output output_$YEAR/merge --central-output /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples${YEAR}_v6/Full 2>&1 | tee status_${YEAR}_$(date +"%Y-%m-%dT%H%M").log
      • If some jobs have failed or are not present in the batch system queue, remove the output folders of these jobs:
        LOG_FILE=$(ls -t status_${YEAR}_*.log | head -1); cat $LOG_FILE | grep -E ': failed' | sed -E 's/(.*):.*/\1/' | xargs -n 1 -I JOB rm -r output_$YEAR/merge/JOB
      • This script can be run as many times as you want.
      • If you use the interactive queue, you need to run submit_tuple_merger.py a second time to transfer the finished jobs to the central area.
      • You can also use the cms queue, but larger merge jobs usually require a substantial amount of RAM and are very likely to fail there.
  5. Transfer tuple files into the central tuple storage in Pisa: /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples${YEAR}_v6/Full.

    find output_$YEAR/tuples -name '*.root' ! -name '*sub[0-9].root' ! -name '*recovery[0-9].root' -exec echo {} \; -exec chmod g+rw {} \; -exec mv {} /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples${YEAR}_v6/Full/ \;
    • Update prod_v6_$YEAR spreadsheet accordingly.
  6. Transfer crab results to /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples${YEAR}_v6/crab_results:

     rsync -auv --chmod=g+rw "[email protected]:/PATH_FROM_LXPLUS/crab_results_$YEAR/*.tar.gz" /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples${YEAR}_v6/crab_results
  7. When the whole production is over, wait a few weeks as a safety delay, then delete the remaining crab output directories and root files in your area to reduce unnecessary storage usage.
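
The list of newly finished tasks used when updating the spreadsheet can also be obtained without diff markers. The sketch below assumes that both check files contain one task per line and are sorted, as produced by the sort in the create_job_list.sh step; the function name new_finished_tasks is hypothetical.

```shell
# Hypothetical helper: print tasks present in the current list but absent from the previous one.
# $1 = current task list, $2 = previous task list (may not exist on the first check).
new_finished_tasks() {
    local current="$1" previous="$2"
    if [ -f "$previous" ]; then
        # comm expects sorted input; -13 keeps only the lines unique to the second file.
        comm -13 "$previous" "$current"
    else
        # First check: every task in the current list is new.
        cat "$current"
    fi
}

# Usage sketch:
# new_finished_tasks current-check_$YEAR.txt prev-check_$YEAR.txt
```

Compared with diff, the output contains only task names, so it can be pasted or piped without further filtering.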

CRAB troubleshoot

  1. Common failure reasons

    • Jobs are failing because they exceed the memory limit.

      Solution: resubmit the jobs requesting more memory per job, e.g.:

      crab resubmit --maxmemory 4000 -d work-area_$YEAR/TASK_AREA
    • Jobs are failing on some servers

      Solution: resubmit the jobs using a site blacklist or whitelist, e.g.:

      crab resubmit --siteblacklist=T2_IT_Pisa -d work-area_$YEAR/TASK_AREA
      # OR
      crab resubmit --sitewhitelist=T2_IT_Pisa -d work-area_$YEAR/TASK_AREA
  2. How to create a recovery task. Do this only if the problem can't be solved using crab resubmit and the recipes suggested in the previous points. Possible reasons to create a recovery task are:

    • Some jobs persistently exceed the execution time limit, so smaller jobs should be created.
    • Bugs in the code that matter only under rare conditions, which were met for some jobs in the task.
      • If the bug can also affect the successfully finished jobs, the entire task should be re-run from scratch.
    • Other non-reproducible crab issues

    Here are the steps to create a recovery task:

    1. Fix all bugs in the code, if there are any.
    2. Wait until all jobs have either 'finished' or 'failed' status.
    3. Retrieve crab report:
      crab report -d finished-partial/TASK_AREA
    4. Use the file 'results/notFinishedLumis.json' in the task area as the lumi mask for the recovery task. Create the recovery task using submit.py:
      ./submit.py --work-area work-area_$YEAR --cfg ../python/Production.py --site T2_IT_Pisa --output hh_bbtautau_prod_v6_$YEAR --jobNames FAILED_TASK_NAME --lumiMask finished-partial/TASK_AREA/results/notFinishedLumis.json --jobNameSuffix _recovery1 FAILED_TASK_CFG
    5. Follow the production workflow procedure.
  3. Prepare local jobs (when only a few jobs have failed):

    1. Once you have set up the environment, you need to create the CRAB project directory for your task. If you have already submitted the task, you can simply cd to the project directory created at submission time or recreate it with the crab remake command.
    2. If you have not yet submitted the task, do so with the --dryrun option.

    Once the CRAB project directory is created, execute

    crab preparelocal --dir=<PROJECTDIR>

    and then run the generated script locally.
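
For the recovery task described in point 2, it can help to confirm that the lumi mask exists before passing it to submit.py. The sketch below assumes that crab report has already been run for the task area and has written results/notFinishedLumis.json there; the function name recovery_lumi_mask is hypothetical.

```shell
# Hypothetical helper: print the lumi mask path for a recovery task, or fail if it is unusable.
# $1 = CRAB task area (the directory given to "crab report -d").
recovery_lumi_mask() {
    local mask="$1/results/notFinishedLumis.json"
    if [ -s "$mask" ]; then
        # The file exists and is not empty.
        echo "$mask"
    else
        echo "no usable lumi mask at $mask" >&2
        return 1
    fi
}

# Usage sketch:
# MASK=$(recovery_lumi_mask finished-partial/TASK_AREA) && echo "use --lumiMask $MASK"
```

The check only tests that the file is present and non-empty; it does not validate the JSON content.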