
Implementation WMAgent Refactoring

Eric Vaandering edited this page Aug 26, 2016 · 8 revisions

Initial plan for implementation

schema changes

  1. We need a new work unit table to keep track of which lumis/events have been successfully processed. (This aims at minimal changes, not optimization.)
   wmbs_workunit
   CREATE TABLE wmbs_workunit (
             id           INTEGER          PRIMARY KEY AUTO_INCREMENT,
             taskid       INTEGER          NOT NULL,
             fileid       INTEGER          NOT NULL, -- fake file for MC
             run          INTEGER          NOT NULL,
             lumi         INTEGER          NOT NULL,
             firstevent   INTEGER          NOT NULL,
             lastevent    INTEGER          NOT NULL,
             status       INT(1)           DEFAULT 0,
             FOREIGN KEY (taskid)
             REFERENCES wmbs_workflow(id) ON DELETE CASCADE);

fileid, run, and lumi can be replaced by a single id if we add a unique id to the wmbs_file_runlumi_map table.

EWV: I think we would need 2-3 other fields here too. Retry count for the work unit, how many work units ended up in the last job to try this work unit (remember we want to try a work unit by itself before giving up completely), and perhaps a timestamp. But that might not be necessary if we have a timeout on the jobs themselves.

If one lumi can be spread across multiple files, we need an association table between the work unit table and the wmbs_file_runlumi_map table.

  1. The above table needs to be populated when the fileset and subscription are created (before job splitting happens).

  2. The wmbs_job_mask table should be modified (or replaced) so that it contains the relationship between work unit and job id.

CREATE TABLE wmbs_job_workunit_assoc (
             jobid            INTEGER          NOT NULL,
             workunitid       INTEGER          NOT NULL,
             FOREIGN KEY (jobid)
             REFERENCES wmbs_job(id) ON DELETE CASCADE,
             FOREIGN KEY (workunitid)
             REFERENCES wmbs_workunit(id) ON DELETE CASCADE);

  1. The wmbs_job table contains 4 states (success, failure, partial_success, not_attempt). It is not clear that this is needed, but it may be needed for the retry logic (e.g., in case of total failure, don't reshuffle the work units).

  2. We might need an association between output files and wmbs_workunit.
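
The two tables above can be tried out without a MySQL server. The following is a minimal sketch, assuming SQLite (so AUTO_INCREMENT becomes AUTOINCREMENT and INT(1) becomes INTEGER), with wmbs_workflow and wmbs_job reduced to id-only stubs. The helper names `make_db` and `populate_work_units` and all inserted values are hypothetical, chosen only to illustrate populating work units before job splitting and linking them to a job.

```python
import sqlite3

# SQLite translation of the tables above; the parent tables are stubs.
SCHEMA = """
CREATE TABLE wmbs_workflow (id INTEGER PRIMARY KEY);
CREATE TABLE wmbs_job      (id INTEGER PRIMARY KEY);
CREATE TABLE wmbs_workunit (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    taskid      INTEGER NOT NULL,
    fileid      INTEGER NOT NULL,
    run         INTEGER NOT NULL,
    lumi        INTEGER NOT NULL,
    firstevent  INTEGER NOT NULL,
    lastevent   INTEGER NOT NULL,
    status      INTEGER DEFAULT 0,
    FOREIGN KEY (taskid) REFERENCES wmbs_workflow(id) ON DELETE CASCADE
);
CREATE TABLE wmbs_job_workunit_assoc (
    jobid       INTEGER NOT NULL,
    workunitid  INTEGER NOT NULL,
    FOREIGN KEY (jobid) REFERENCES wmbs_job(id) ON DELETE CASCADE,
    FOREIGN KEY (workunitid) REFERENCES wmbs_workunit(id) ON DELETE CASCADE
);
"""

def make_db():
    """Create an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

def populate_work_units(conn, taskid, lumis):
    """Insert one work unit per (fileid, run, lumi, firstevent, lastevent)."""
    conn.executemany(
        "INSERT INTO wmbs_workunit "
        "(taskid, fileid, run, lumi, firstevent, lastevent) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        [(taskid,) + tuple(lumi) for lumi in lumis],
    )
    conn.commit()

conn = make_db()
conn.execute("INSERT INTO wmbs_workflow (id) VALUES (1)")
# Made-up lumis: two in file 10, one in file 11.
populate_work_units(
    conn, 1, [(10, 1, 1, 1, 100), (10, 1, 2, 101, 200), (11, 1, 3, 201, 300)]
)
# Job 7 is assigned the first two work units.
conn.execute("INSERT INTO wmbs_job (id) VALUES (7)")
conn.executemany(
    "INSERT INTO wmbs_job_workunit_assoc VALUES (7, ?)", [(1,), (2,)]
)
conn.commit()
```

Because of ON DELETE CASCADE, deleting a job removes its rows in wmbs_job_workunit_assoc but leaves the work units themselves in place, so they remain available for a later splitting pass.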

job splitting changes.

  1. Job splitting needs to happen multiple times, not just once over the initial input. To make this simpler, splitting happens over wmbs_workunit, not over files.
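
Repeated splitting over work units could look roughly like the sketch below. It is not the WMAgent splitter: the `WorkUnit` class, the `split_jobs` function, and the status codes (the page only fixes the default value 0) are all assumptions made for illustration. The point is that the same function can be called again after a pass finishes, and it only picks up work units that are not yet successfully processed.

```python
from dataclasses import dataclass

# Hypothetical status codes; only the default 0 comes from the schema.
NOT_ATTEMPTED, SUCCESS, FAILED = 0, 1, 2

@dataclass
class WorkUnit:
    id: int
    run: int
    lumi: int
    status: int = NOT_ATTEMPTED

def split_jobs(work_units, units_per_job):
    """Group every work unit that still needs processing into job-sized chunks."""
    pending = [wu for wu in work_units if wu.status != SUCCESS]
    return [
        pending[i:i + units_per_job]
        for i in range(0, len(pending), units_per_job)
    ]

units = [WorkUnit(id=i, run=1, lumi=i) for i in range(1, 6)]
first_pass = split_jobs(units, units_per_job=2)   # 3 jobs: [1, 2], [3, 4], [5]

# Pretend the first pass finished with work units 2 and 5 failing.
for wu in units:
    wu.status = FAILED if wu.id in (2, 5) else SUCCESS

second_pass = split_jobs(units, units_per_job=2)  # 1 job: [2, 5]
```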

job accounting changes

  1. JobAccounter needs to update the wmbs_workunit status, as well as the association between wmbs_workunit and output files.
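
As a rough illustration of the accounting step, the sketch below records a per-work-unit result for one finished job and derives one of the four job states listed under schema changes. The function `account_job`, its arguments, and the numeric status codes are hypothetical, and the output file association is left out to keep the example short.

```python
# Hypothetical status codes; only the default 0 comes from the schema.
NOT_ATTEMPTED, SUCCESS, FAILED = 0, 1, 2

def account_job(unit_status, job_units, succeeded):
    """Record per-work-unit results for one job and return the job state.

    unit_status: dict of work unit id -> status, updated in place
    job_units:   ids of the work units assigned to the job
    succeeded:   ids the job reported as processed, or None if it never ran
    """
    if succeeded is None:
        return "not_attempt"
    done = set(succeeded) & set(job_units)
    for wu in job_units:
        unit_status[wu] = SUCCESS if wu in done else FAILED
    if len(done) == len(set(job_units)):
        return "success"
    return "partial_success" if done else "failure"

status = {1: NOT_ATTEMPTED, 2: NOT_ATTEMPTED, 3: NOT_ATTEMPTED, 4: NOT_ATTEMPTED}
state_a = account_job(status, [1, 2], [1, 2])  # every unit processed
state_b = account_job(status, [3, 4], [3])     # unit 4 was not processed
```

Keeping the state at the work unit level means a partially successful job only sends its failed work units back to splitting.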

Retry logic changes (requirement #9)

  1. How should retry rules be defined (by work unit or by job)?
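
One possible per-work-unit rule, following the EWV note above that a work unit should be tried by itself before giving up completely, is sketched here. It is only one option among several: the function `next_action`, its parameters, the `max_retries` threshold, and the returned action names are all assumptions for illustration. It presumes the two extra fields suggested in that note (a retry count and the size of the last job that tried the work unit).

```python
def next_action(retry_count, last_job_size, max_retries=3):
    """Decide what to do with a failed work unit.

    retry_count:   how many times the work unit has been retried so far
    last_job_size: number of work units in the last job that tried it
    """
    if retry_count < max_retries:
        return "retry"        # goes back into normal job splitting
    if last_job_size > 1:
        return "retry_alone"  # one last attempt in a single-unit job
    return "give_up"          # already failed on its own

fresh = next_action(retry_count=0, last_job_size=5)      # "retry"
exhausted = next_action(retry_count=3, last_job_size=5)  # "retry_alone"
final = next_action(retry_count=4, last_job_size=1)      # "give_up"
```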

monitoring changes??

  1. We might still be able to track jobs, which would contain work unit information.