Production v3 instructions

Framework installation

  1. Install the framework on lxplus in a prod workspace directory, without creating a CMSSW release area first

    curl -s https://raw.githubusercontent.com/hh-italian-group/hh-bbtautau/master/Run/install_framework.sh | bash -s prod
  2. Check the framework production functionality interactively for a few samples (a quick inspection of the output tuples is sketched after this list)

    cd CMSSW_8_0_28/src
    cmsenv
    # Radion 250 sample
    echo /store/mc/RunIISummer16MiniAODv2/GluGluToRadionToHHTo2B2Tau_M-250_narrow_13TeV-madgraph/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/110000/A06FE4CA-27DB-E611-9246-549F3525C380.root > Radion_250.txt
    cmsRun h-tautau/Production/python/Production.py fileList=Radion_250.txt applyTriggerMatch=True sampleType=Summer16MC ReRunJEC=True globalTag=80X_mcRun2_asymptotic_2016_TrancheIV_v6 tupleOutput=eventTuple_Radion_250.root maxEvents=1000
    # Single electron Run2016B
    echo /store/data/Run2016B/SingleElectron/MINIAOD/03Feb2017_ver2-v2/110000/003B2C1F-50EB-E611-A8F1-002590E2D9FE.root > SingleElectron_B.txt
    cmsRun h-tautau/Production/python/Production.py fileList=SingleElectron_B.txt anaChannels=eTau applyTriggerMatch=True sampleType=Run2016 ReRunJEC=True globalTag=80X_dataRun2_2016SeptRepro_v7 tupleOutput=eventTuple_SingleElectronB.root saveGenTopInfo=False saveGenBosonInfo=False saveGenJetInfo=False energyScales=Central lumiFile=h-tautau/Production/json/Cert_271036-284044_13TeV_PromptReco_Collisions16_JSON.txt maxEvents=1000
  3. Install the framework on the stage-out site (e.g. Pisa)

    curl -s https://raw.githubusercontent.com/hh-italian-group/hh-bbtautau/master/Run/install_framework.sh | bash -s prod
    cd CMSSW_8_0_28/src
    cmsenv
    ./run.sh MergeRootFiles --help
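
As a quick cross-check of step 2, you can inspect the content of one of the produced tuple files (a sketch; it assumes PyROOT is available from the CMSSW environment after cmsenv):

    python -c "import ROOT; f = ROOT.TFile.Open('eventTuple_Radion_250.root'); f.ls()"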

Setup CRAB working environment

Run the following commands after each login:

source /cvmfs/cms.cern.ch/crab3/crab.sh
voms-proxy-init --voms cms --valid 168:00
cd CMSSW_DIR/src
cmsenv
cd h-tautau/Production/crab
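
To confirm that the proxy was created correctly and to check its remaining lifetime:

voms-proxy-info --all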

Production workflow

Steps 2-6 should be repeated periodically, 1-2 times per day, until the end of the production.

  1. Submit jobs

    ./submit.py --work-area work-area --cfg ../python/Production.py --site T2_IT_Pisa --output hh_bbtautau_prod_v3 config1 [config2] ...
  2. Check the job status

    ./multicrab.py --workArea work-area --crabCmd status

    Analyze the output of the status command for each task:

    1. If a few jobs failed without any persistent pattern, resubmit them:

      crab resubmit -d work-area/TASK_AREA
    2. If a significant fraction of the jobs is failing, investigate the reason and act accordingly. For more details see the CRAB troubleshooting section below.

    3. If all jobs finished successfully, move the task area from 'work-area' into the 'finished' directory (create it if needed).

      # mkdir -p finished
      mv work-area/TASK_AREA finished/
    4. If there is no reasonable hope that the remaining jobs will finish successfully, move the task area from 'work-area' into the 'finished-partial' directory (create it if needed). Before moving the directory, make sure that all jobs are in either the 'failed' or the 'finished' state; otherwise wait (use the kill command, if necessary). Create a recovery task for the failed jobs (see the CRAB troubleshooting section).

      # mkdir -p finished-partial
      mv work-area/TASK_AREA finished-partial/
    5. Create job lists for the jobs in the 'finished' and 'finished-partial' directories and transfer them to the stage-out server (a transfer sketch is given at the end of this step).

      if [ -d current-check ] ; then rm -rf prev-check ; mv current-check prev-check ; fi
      mkdir current-check
      for NAME in finished* ; do ./create_job_list.sh $NAME | sort > current-check/$NAME.txt ; done
    6. Update the prod_v3 spreadsheet accordingly; the command below shows what changed since the previous check. Tasks in 'finished-partial' should be considered incomplete, so they don't require spreadsheet updates.

      for NAME in finished* ; do echo "$NAME:" ; if [ -f prev-check/$NAME.txt ] ; then diff current-check/$NAME.txt prev-check/$NAME.txt ; else cat current-check/$NAME.txt ; fi ; done
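
    • Transfer the job lists to the stage-out server, e.g. (a sketch; STAGEOUT_HOST and the destination path are placeholders to be adapted to your setup):

      scp -r current-check STAGEOUT_HOST:/path/to/CMSSW_8_0_28/src/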
  3. Submit merge jobs for the output files on the stage-out server (the framework has to be installed there beforehand, see step 3 of the framework installation).

    • For partially finished tasks, wait for the recovery task to finish before starting the merge.
    • Support for partially finished tasks is not yet implemented.
    1. If some merge jobs were already created during a previous iteration, use find_new_jobs.sh to create the list of new jobs to submit. N.B. To run find_new_jobs.sh, the file current-check/finished.txt has to be transferred from lxplus to the src directory on the stage-out server.

      ./h-tautau/Instruments/find_new_jobs.sh current-check/finished.txt output/merge > job_list.txt
    2. Submit the merge jobs to the local queue, where CRAB_OUTPUT_PATH is the CRAB output path specified in the submit.py command. For example, for the Pisa stage-out server it is /gpfs/ddn/srm/cms/store/user/#YOUR_USERNAME/hh_bbtautau_prod_v3/.

      ./h-tautau/Instruments/submit_tuple_hadd.sh cms job_list.txt output/merge CRAB_OUTPUT_PATH
    3. Collect the finished jobs (this script can be run as many times as needed).

      ./h-tautau/Instruments/collect_tuple_hadd.sh output/merge output/tuples
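
    • Before proceeding, it can be useful to check the total and per-file sizes of the collected tuples (relevant for the 8 GB limit in step 4 and for the transfer in step 5):

      du -sh output/tuples
      ls -lhS output/tuples/*.root | head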
  4. TODO: Split large merged files into several parts in order to satisfy the CERNBox requirement that a single file be smaller than 8 GB. Use submit_job.sh with split_tree.py.
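
    To identify which merged files exceed the limit and therefore need splitting:

    find output/tuples -name '*.root' -size +8G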

  5. Transfer the tuple files into the local tuple storage. Pisa: /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples2016_v3/Full.

    rsync -auv --chmod=g+w --dry-run output/tuples/*.root /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples2016_v3/Full
    # if the dry run output looks ok
    rsync -auv --chmod=g+w output/tuples/*.root /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples2016_v3/Full
    • Update the prod_v3 spreadsheet accordingly.
    • The tuples will then be transferred by the production coordinator into the central prod_v3 CERNBox directory: /eos/user/k/kandroso/cms-it-hh-bbtautau/Tuples2016_v3.
  6. When the production is over, delete the remaining CRAB output directories and ROOT files in your area to reduce unnecessary storage usage.
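
    A possible cleanup sketch (the storage path is the Pisa example from step 3.2; YOUR_USERNAME is a placeholder; double-check the paths before deleting anything):

    rm -rf /gpfs/ddn/srm/cms/store/user/YOUR_USERNAME/hh_bbtautau_prod_v3
    rm -rf output/merge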

CRAB troubleshooting

  1. Common failure reasons

    • Jobs are failing because they exceed the memory limit. Solution: resubmit the jobs requesting more memory per job, e.g.:
      crab resubmit --maxmemory 4000 -d work-area/TASK_AREA
    • Jobs are failing on some sites. Solution: resubmit the jobs using a site blacklist or whitelist, e.g.:
      crab resubmit --siteblacklist=T2_IT_Pisa -d work-area/TASK_AREA
      # OR
      crab resubmit --sitewhitelist=T2_IT_Pisa -d work-area/TASK_AREA
  2. How to create a recovery task. Do this only if the problem can't be solved using crab resubmit and the recipes suggested in the previous points. Possible reasons to create a recovery task are:

    • Some jobs persistently exceed the execution time limit, so smaller jobs should be created.
    • There is a bug in the code that matters only under rare conditions met by some jobs in the task.
      • If the bug can also affect the successfully finished jobs, the entire task should be re-run from scratch.

    Here are the steps to create a recovery task:

    1. Fix all bugs in the code, if there are any.
    2. Wait until all jobs are in either the 'finished' or the 'failed' state.
    3. Retrieve the CRAB report:
      crab report -d finished-partial/TASK_AREA
    4. Use the file 'results/notFinishedLumis.json' in the task area as the lumi mask for the recovery task (a quick size check of this mask is sketched after this list). Create the recovery task using submit.py:
      ./submit.py --work-area work-area --cfg ../python/Production.py --site T2_IT_Pisa --output hh_bbtautau_prod_v3 --jobNames FAILED_TASK_NAME --lumiMask finished-partial/TASK_AREA/results/notFinishedLumis.json --jobNameSuffix recovery1 FAILED_TASK_CFG
    5. Follow the production workflow procedure described above.
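
    To get a rough idea of how much work the recovery task covers, the runs and lumi ranges in the mask can be counted with plain Python (a sketch; the task-area path is the placeholder used above):

      python -c "import json; d = json.load(open('finished-partial/TASK_AREA/results/notFinishedLumis.json')); print('%d runs, %d lumi ranges' % (len(d), sum(len(v) for v in d.values())))"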