Add token authentication support #12196

Open
wants to merge 1 commit into
base: master

khurtado
Contributor

@khurtado khurtado commented Dec 5, 2024

Fixes #12199

Needed for #12144

Status

Tested, but the external dependency (the HTCondor setup) is not complete yet.

Description

This does not completely fix #12144.
It is needed to enable token authentication in WMAgent.
Stage-in should be functional with this fix.
Stage-out will require changes in the stageout commands.

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

HTCondor token setup
htgettoken (optional) in the CMS runtime image.
Note: The HTCondor token setup is not deployed in all schedds yet.

Additional notes

If the $BEARER_TOKEN_FILE variable is set, HTCondor writes the token to that path.
This would need to be set up on the host, not in the WMAgent container.
This step is not critical, however, because that file is only the reference token.
The actual token that HTCondor transfers comes from the directory reported by

condor_config_val SEC_CREDENTIAL_DIRECTORY_OAUTH

which is in a private system area.
In the future, the condor configuration can be changed to write directly to /data/certs instead.
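
As a rough illustration of the two locations described above, the following Python sketch reads BEARER_TOKEN_FILE from an environment mapping and asks condor_config_val for SEC_CREDENTIAL_DIRECTORY_OAUTH. The function names and the injectable `run` parameter are hypothetical and not part of this PR; both functions return None when nothing is found.

```python
import os
import subprocess


def reference_token_path(environ=None):
    """Return the path in BEARER_TOKEN_FILE, or None if it is unset or empty."""
    environ = os.environ if environ is None else environ
    return environ.get("BEARER_TOKEN_FILE") or None


def oauth_credential_dir(run=subprocess.run):
    """Ask condor_config_val for SEC_CREDENTIAL_DIRECTORY_OAUTH.

    `run` is injectable (a hypothetical convenience) so the lookup can be
    exercised on a machine without HTCondor installed.
    """
    cmd = ["condor_config_val", "SEC_CREDENTIAL_DIRECTORY_OAUTH"]
    try:
        result = run(cmd, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        # condor_config_val is missing or it exited with an error.
        return None
    return result.stdout.strip() or None
```

On the host, `reference_token_path()` and `oauth_credential_dir()` can be called without arguments; reading files under the returned directory may require elevated privileges, since it is a private system area.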

  1. Python bindings
    When jobs are submitted via the condor python bindings from inside the WMAgent container, the job is submitted, but /usr/bin/condor_vault_storer does not seem to be triggered.
    I am currently working around that by executing condor_submit with a test job once from the host. The token seems to stay and refresh afterwards.
    If the cms_readonly scope is used, this does not happen and a manual refresh is needed, but for production we don't need any scope.

@dmwm-bot

dmwm-bot commented Dec 5, 2024

Jenkins results:

  • Python3 Unit tests: failed
    • 51 new failures
    • 26 tests no longer failing
    • 6 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 4 warnings
    • 47 comments to review
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/150/artifact/artifacts/PullRequestReport.html

@khurtado
Contributor Author

khurtado commented Dec 5, 2024

@anpicci @amaltaro FYI

@dmwm-bot

dmwm-bot commented Dec 6, 2024

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 4 warnings
    • 47 comments to review
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/156/artifact/artifacts/PullRequestReport.html

@dmwm-bot

dmwm-bot commented Dec 6, 2024

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 4 warnings
    • 47 comments to review
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/158/artifact/artifacts/PullRequestReport.html

@d-ylee
Contributor

d-ylee commented Dec 6, 2024

retest this please

@dmwm-bot

dmwm-bot commented Dec 6, 2024

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 4 warnings
    • 47 comments to review
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/159/artifact/artifacts/PullRequestReport.html

@khurtado
Contributor Author

khurtado commented Dec 9, 2024

I tested the stage-in with the following workflow:

https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_ReReco_Run2022C_LumiMask_khurtado_rereco_lumi_v1_241206_231046_4695

The X509_USER_PROXY environment variable was unset in the job environment for this test.
The condor.out logfiles of the jobs show a BEARER_TOKEN_FILE but no X509_USER_PROXY in the environment.
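
A check of this kind can also be done from inside the job. The sketch below is only an illustration (the function name is hypothetical and not part of WMCore): it reports which of the two credential variables are set in an environment mapping.

```python
import os


def credentials_in_environment(environ=None):
    """Report which credential variables are set (non-empty) in the environment."""
    environ = os.environ if environ is None else environ
    names = ("BEARER_TOKEN_FILE", "X509_USER_PROXY")
    return {name: bool(environ.get(name)) for name in names}


if __name__ == "__main__":
    for name, present in credentials_in_environment().items():
        print(f"{name}: {'set' if present else 'not set'}")
```

For the test described here, the expected result would be BEARER_TOKEN_FILE set and X509_USER_PROXY not set.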

The wmAgent logfiles show that the CMSSW step ran successfully and that stageOut/dqmUpload failed afterwards.

E.g.:
job1:

2024-12-06 23:40:43,339:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'el8_amd64_gcc10', 'scramv1', 'CMSSW', 'CMSSW_12_4_14_patch1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']
2024-12-06 23:42:32,781:INFO:PerformanceMonitor:PSS: 2104608; RSS: 2262660; PCPU: 49.5; PMEM: 0.5
2024-12-06 23:47:32,892:INFO:PerformanceMonitor:PSS: 6654177; RSS: 6823764; PCPU: 148; PMEM: 1.7
2024-12-06 23:52:33,083:INFO:PerformanceMonitor:PSS: 6585639; RSS: 6931464; PCPU: 251; PMEM: 1.7
2024-12-06 23:57:33,226:INFO:PerformanceMonitor:PSS: 6569418; RSS: 6919364; PCPU: 292; PMEM: 1.7
2024-12-07 00:02:33,477:INFO:PerformanceMonitor:PSS: 6318371; RSS: 6372732; PCPU: 314; PMEM: 1.6
2024-12-07 00:07:33,577:INFO:PerformanceMonitor:PSS: 6613831; RSS: 6666636; PCPU: 327; PMEM: 1.6
2024-12-07 00:12:33,732:INFO:PerformanceMonitor:PSS: 6645300; RSS: 6699936; PCPU: 337; PMEM: 1.6
2024-12-07 00:17:33,999:INFO:PerformanceMonitor:PSS: 6627041; RSS: 6684564; PCPU: 343; PMEM: 1.6
2024-12-07 00:22:34,221:INFO:PerformanceMonitor:PSS: 6656654; RSS: 6713488; PCPU: 345; PMEM: 1.6
2024-12-07 00:27:34,336:INFO:PerformanceMonitor:PSS: 6628590; RSS: 6707472; PCPU: 349; PMEM: 1.6
2024-12-07 00:27:41,673:INFO:CMSSW:Step cmsRun1: Chirp_WMCore_cmsRun_ExitCode 0
2024-12-07 00:27:41,679:INFO:CMSSW:Step cmsRun1: Chirp_WMCore_cmsRun1_ExitCode 0
2024-12-07 00:27:41,696:INFO:Report:addOutputFile method called with outputModule: AODoutput, aFile: None
2024-12-07 00:27:41,696:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2024-12-07 00:27:41,696:INFO:Report:addOutputFile method called with outputModule: DQMoutput, aFile: None
2024-12-07 00:27:41,697:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2024-12-07 00:27:41,697:INFO:Report:addOutputFile method called with outputModule: MINIAODoutput, aFile: None
2024-12-07 00:27:41,697:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2024-12-07 00:27:44,307:INFO:CMSSW:Steps.Executors.CMSSW.post called
2024-12-07 00:27:44,309:INFO:ExecuteMaster:StepName: cmsRun1, StepType: CMSSW, with result: 0
2024-12-07 00:27:44,366:INFO:Executor:Steps.Executor logging started
2024-12-07 00:27:44,368:INFO:StageOut:Steps.Executors.StageOut.pre called
2024-12-07 00:27:44,368:INFO:StageOut:Steps.Executors.StageOut.execute called
2024-12-07 00:27:44,368:INFO:StageOut:StageOut override is: stageOut1.userDN = None
<snip>
2024-12-07 00:27:44,372:INFO:StageOutMgr:==== Stageout configuration finish ====
2024-12-07 00:27:44,372:INFO:StageOut:Beginning report processing for step cmsRun1
2024-12-07 00:27:44,374:INFO:StageOutMgr:==>Working on file: /store/unmerged/DMWM/MuonEG/AOD/ReReco_Run2022C_LumiMask_khurtado_rereco_lumiv1-v11/00000/1193f448-1f63-4dc6-8e03-f6296c80d06a.root
2024-12-07 00:27:44,374:INFO:StageOutMgr:===> Attempting 2 Stage Outs
2024-12-07 00:27:44,374:INFO:StageOutMgr:LFN to PFN match made:
LFN: /store/unmerged/DMWM/MuonEG/AOD/ReReco_Run2022C_LumiMask_khurtado_rereco_lumiv1-v11/00000/1193f448-1f63-4dc6-8e03-f6296c80d06a.root
PFN: root://cmsdcadisk.fnal.gov//dcache/uscmsdisk/store/unmerged/DMWM/MuonEG/AOD/ReReco_Run2022C_LumiMask_khurtado_rereco_lumiv1-v11/00000/1193f448-1f63-4dc6-8e03-f6296c80d06a.root

2024-12-07 00:27:44,374:INFO:StageOutImpl:Creating output directory...
2024-12-07 00:27:44,376:INFO:StageOutImpl:Running the stage out...
2024-12-07 00:27:44,656:INFO:StageOutImpl:Command exited with status: 151
Output message: stdout: Local File Size is: 550353683
Remote File Size is:
Local File Checksum is: 0e560234
Remote File Checksum is:
ERROR: Size or Checksum Mismatch between local and SE
rm /dcache/uscmsdisk/store/unmerged/DMWM/MuonEG/AOD/ReReco_Run2022C_LumiMask_khurtado_rereco_lumiv1-v11/00000/1193f448-1f63-4dc6-8e03-f6296c80d06a.root : [ERROR] Error response: permission denied

job2:

2024-12-07 03:20:21,870:INFO:CMSSW:RUNNING SCRAM SCRIPTS
2024-12-07 03:20:21,870:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'el8_amd64_gcc10', 'scramv1', 'CMSSW', 'CMSSW_12_4_14_patch1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']
2024-12-07 03:24:36,370:INFO:PerformanceMonitor:PSS: 936741; RSS: 1405664; PCPU: 4.3; PMEM: 0.3
2024-12-07 03:25:20,316:INFO:CMSSW:Step cmsRun1: Chirp_WMCore_cmsRun_ExitCode 0
2024-12-07 03:25:20,321:INFO:CMSSW:Step cmsRun1: Chirp_WMCore_cmsRun1_ExitCode 0
2024-12-07 03:25:20,335:INFO:CMSSW:Steps.Executors.CMSSW.post called
2024-12-07 03:25:20,336:INFO:ExecuteMaster:StepName: cmsRun1, StepType: CMSSW, with result: 0
2024-12-07 03:25:20,345:INFO:Executor:Steps.Executor logging started
2024-12-07 03:25:20,347:INFO:DQMUpload:Steps.Executors.DQMUpload.pre called
2024-12-07 03:25:20,348:INFO:DQMUpload:Steps.Executors.DQMUpload.execute called
2024-12-07 03:25:20,348:INFO:DQMUpload:Beginning report processing for step cmsRun1
2024-12-07 03:25:20,469:INFO:DQMUpload:HTTP Upload is about to start:
 => URL: https://cmsweb-testbed.cern.ch/dqm/dev
 => Filename: /srv/job/WMTaskSpace/cmsRun1/DQM_V0001_R000355870__MuonEG__DMWM-ReReco_Run2022C_LumiMask_khurtado_rereco_lumiv1-v11__DQMIO.root
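
To pull the relevant results out of logs like the ones above, a small parser can help. This is a minimal sketch that assumes the log line layout shown in these excerpts ("StepName: ..., StepType: ..., with result: N" and "Command exited with status: N"); the function name is hypothetical.

```python
import re

# Patterns based on the log lines quoted in this comment.
STEP_RESULT = re.compile(r"StepName: (\S+), StepType: (\S+), with result: (-?\d+)")
COMMAND_STATUS = re.compile(r":(\w+):Command exited with status: (-?\d+)")


def summarize_log(lines):
    """Return (step results, non-zero command exit statuses) found in the log lines."""
    steps, failures = [], []
    for line in lines:
        match = STEP_RESULT.search(line)
        if match:
            steps.append((match.group(1), match.group(2), int(match.group(3))))
            continue
        match = COMMAND_STATUS.search(line)
        if match and int(match.group(2)) != 0:
            failures.append((match.group(1), int(match.group(2))))
    return steps, failures
```

Applied to the job1 excerpt, this would report the cmsRun1 CMSSW step with result 0 and a StageOutImpl command exiting with status 151.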

Contributor

@amaltaro amaltaro left a comment


Kenyi, Andrea, it is not clear to me whether this is a test and/or development still in progress. I suspect you are still working on this, so I am labeling it accordingly to make that clear and avoid mistakes (please remove the label once it's ready for a final review/merge).

In addition, please let me know if you wanted me to look into anything specific. I do not have anything else to add here and changes are looking alright.

etc/submit_py3.sh (outdated, resolved)
@khurtado
Contributor Author

khurtado commented Dec 16, 2024

@amaltaro This has been tested for stage-in and it is working. However, it depends on condor being properly set up; otherwise jobs will still be submitted, but they will be held with:

 ID      OWNER          HELD_SINCE  HOLD_REASON
1751.0   cmst1          12/16 14:30 Job credentials are not available

Therefore, code review is fine, but we cannot merge until the condor setup is fully defined (as far as I know, some automation-related development is ongoing) and deployed on all condor schedds.
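
To spot this condition on a schedd, the held-job listing can be scanned for that hold reason. The sketch below assumes text in the four-column layout shown above (ID, OWNER, HELD_SINCE, HOLD_REASON); the function name is hypothetical.

```python
MISSING_CREDENTIALS = "Job credentials are not available"


def jobs_held_for_credentials(table_text):
    """Return the IDs of held jobs whose hold reason mentions missing credentials."""
    job_ids = []
    for line in table_text.splitlines():
        # Columns: ID, OWNER, HELD_SINCE (date and time), then the free-text reason.
        fields = line.split(None, 4)
        if len(fields) < 5 or fields[0] == "ID":
            continue  # header, blank, or malformed line
        job_id, _owner, _date, _time, reason = fields
        if MISSING_CREDENTIALS in reason:
            job_ids.append(job_id)
    return job_ids
```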

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 4 warnings
    • 47 comments to review
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/215/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 4 warnings
    • 47 comments to review
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/216/artifact/artifacts/PullRequestReport.html

Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adopt token for WMAgent, test stage-in
Adopt token for WMAgent stage-in/stage-out
4 participants