Concept

Synergy Scheduler supervises execution of multiple processes and their jobs.

Process

A Process, or Worker, is any system process (for example, a Python process or a Hadoop map-reduce job) that converts or aggregates raw data into formatted data.

There are two types of processes:

cron-like jobs governed by a timer. They are known to the system as free-run
managed jobs governed by a state machine. Such jobs can have multiple dependencies on other jobs

Managed processes have the following features:

They choose a state machine to govern their execution
They can have dependencies or serve as a dependency provider for other processes
i.e. you can instruct the Scheduler: wait for Process A to succeed on timeperiod T before running Process B on T
Garbage Collector provides a fail-over mechanism for failed and abandoned unit_of_work
Each process has a comprehensive history of its runs and state transitions

Free-run processes have the following features:

Lightweight - no states, shallow history of runs; designed to function as a trigger only
Protection from overloading the target worker by tracking the status of the task execution: should the worker be busy with an ongoing task when the next trigger occurs, the Scheduler will send a reminder rather than a newly created task
No dependencies on other processes, and they cannot serve as a dependency themselves
No re-triggering of invalid/abandoned unit_of_work
i.e. Garbage Collector will skip all unit_of_work that belong to free-run processes
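
The overload protection described above can be pictured with a minimal sketch. The class, its field and the returned labels are hypothetical names chosen for illustration and are not taken from the Synergy Scheduler code base:

```python
from dataclasses import dataclass


@dataclass
class FreeRunTrigger:
    """Hypothetical stand-in for a free-run process entry."""
    process_name: str
    busy: bool = False  # True while the worker is still handling a task

    def fire(self) -> str:
        """Called on every timer tick; decides what to send to the worker."""
        if self.busy:
            # Worker has not finished the previous task: only remind it
            return "reminder"
        self.busy = True
        return "new_task"

    def on_task_finished(self) -> None:
        """Called when the worker reports that the task is done."""
        self.busy = False


trigger = FreeRunTrigger("log_rotation")  # hypothetical process name
print(trigger.fire())   # new_task
print(trigger.fire())   # reminder
trigger.on_task_finished()
print(trigger.fire())   # new_task
```

The sketch keeps no run history and no state machine, which mirrors the "trigger only" nature of free-run processes.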

Timeperiod

A timeperiod represents a time window (or slice). It is encoded in the format YYYYMMDDHH.
For instance, the timeperiod 2014011501 stands for the 1-hour slice from 01:00 (inclusive) to 02:00 (exclusive) on 15 January 2014.
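
As an illustration of the encoding, the following sketch decodes an hourly timeperiod into its half-open time window. The function name hourly_window is an assumption made for this example, not part of the Scheduler API:

```python
from datetime import datetime, timedelta
from typing import Tuple


def hourly_window(timeperiod: str) -> Tuple[datetime, datetime]:
    """Return (start, end) of an hourly YYYYMMDDHH timeperiod; end is exclusive."""
    start = datetime.strptime(timeperiod, "%Y%m%d%H")
    return start, start + timedelta(hours=1)


start, end = hourly_window("2014011501")
print(start, end)  # 2014-01-15 01:00:00 2014-01-15 02:00:00
```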

Job

A Job links a process to a timeperiod and tracks the state of that combination.
For instance, for a given process site_statistics and a timeperiod 2014011501, the job is responsible for tracking the state of data conversion from the 1-hour slice into site statistics.

Task

A Task (also called unit_of_work or UOW) is an attempt to perform a Job.
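
The relation between a Job and its Tasks can be sketched as a pair of small data classes. All class and field names below are hypothetical and only illustrate that one Job may accumulate several attempts:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UnitOfWork:
    """One attempt to perform a job (hypothetical fields)."""
    attempt: int
    succeeded: bool = False


@dataclass
class Job:
    """Links a process to a timeperiod and tracks its attempts."""
    process_name: str
    timeperiod: str
    attempts: List[UnitOfWork] = field(default_factory=list)

    def new_attempt(self) -> UnitOfWork:
        uow = UnitOfWork(attempt=len(self.attempts) + 1)
        self.attempts.append(uow)
        return uow


job = Job("site_statistics", "2014011501")
first = job.new_attempt()    # suppose this attempt fails
second = job.new_attempt()   # a retry of the same job
second.succeeded = True
print(len(job.attempts))     # 2
```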

Time qualifier

A time qualifier is the class of a timeperiod. Its possible values are:

hourly
daily
monthly
yearly

For illustration purposes, let's assume that the Synergy Scheduler supervises a system that gathers and processes user behaviour on a web site. In this context:

An hourly timeperiod represents data gathered within one hour: the period from 10:00:00 to 10:59:59 on 1 Jan 2011 is one hourly period.
Notation of this timeperiod is: 2011010110
Data gathered from 00:00:00 to 23:59:59 on 1 Jan 2011 represents a daily period.
Notation of this timeperiod is: 2011010100
Data gathered from 00:00:00 on 1 Jan 2011 to 23:59:59 on 31 Jan 2011 represents a monthly period.
Notation of this timeperiod is: 2011010000
Data gathered throughout the whole of 2011 represents a yearly period.
Notation of this timeperiod is: 2011000000
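
The four notations above follow one pattern: the components finer than the qualifier are zero-filled. A minimal sketch, assuming the qualifier names listed earlier are used as plain strings (the FORMATS mapping and the to_timeperiod function are hypothetical):

```python
from datetime import datetime

# strftime patterns; literal zeros pad the positions below the qualifier
FORMATS = {
    "hourly": "%Y%m%d%H",
    "daily": "%Y%m%d00",
    "monthly": "%Y%m0000",
    "yearly": "%Y000000",
}


def to_timeperiod(moment: datetime, qualifier: str) -> str:
    """Encode a moment as the timeperiod of the given time qualifier."""
    return moment.strftime(FORMATS[qualifier])


moment = datetime(2011, 1, 1, 10, 30)
for qualifier in ("hourly", "daily", "monthly", "yearly"):
    print(qualifier, to_timeperiod(moment, qualifier))
```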

Timetable and trees

Synergy Scheduler organizes timeperiods in tree-like structures.
root <- yearly periods <- monthly periods <- daily periods <- hourly periods

Each level of the tree can be considered complete only if all nested timeperiods are in the STATE_PROCESSED or STATE_SKIPPED state.

For example: since a daily period nests 24 hourly periods, all of them must complete before the daily period can be declared complete.
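
The completeness rule can be sketched as follows. The string values of the state constants and the STATE_IN_PROGRESS name are assumptions made for this example; only STATE_PROCESSED and STATE_SKIPPED are named in this document:

```python
STATE_PROCESSED = "state_processed"      # assumed value
STATE_SKIPPED = "state_skipped"          # assumed value
STATE_IN_PROGRESS = "state_in_progress"  # hypothetical non-final state

COMPLETE_STATES = {STATE_PROCESSED, STATE_SKIPPED}


def nested_hourly(daily: str) -> list:
    """List the 24 hourly timeperiods nested in a daily YYYYMMDD00 timeperiod."""
    return [daily[:8] + "%02d" % hour for hour in range(24)]


def is_complete(child_states) -> bool:
    """A period is complete only if every nested period is processed or skipped."""
    return all(state in COMPLETE_STATES for state in child_states)


children = nested_hourly("2011010100")
states = {tp: STATE_PROCESSED for tp in children}
states["2011010123"] = STATE_IN_PROGRESS
print(is_complete(states.values()))  # False: one hour is still running

states["2011010123"] = STATE_SKIPPED
print(is_complete(states.values()))  # True
```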

Trees can have the following number of levels:

4-level tree, hosts yearly, monthly, daily and hourly timeperiods
3-level tree, hosts yearly, monthly and daily timeperiods
2-level tree, hosts timeperiods (either hourly, daily or monthly) and a virtual "root" level to maintain the tree-like structure

The trees above imply a downward dependency: yearly periods depend on monthly ones, monthly periods on daily ones, and daily periods on hourly ones.
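
For a 4-level tree, the chain of enclosing periods can be derived from the timeperiod notation alone. A small sketch (the function name ancestors is hypothetical):

```python
def ancestors(hourly: str) -> list:
    """Return the daily, monthly and yearly timeperiods enclosing an hourly one."""
    return [
        hourly[:8] + "00",      # daily:   YYYYMMDD00
        hourly[:6] + "0000",    # monthly: YYYYMM0000
        hourly[:4] + "000000",  # yearly:  YYYY000000
    ]


print(ancestors("2011010110"))
# ['2011010100', '2011010000', '2011000000']
```

Reading the list from left to right gives the order in which completion propagates upwards: the hourly period must complete before its daily parent can, and so on up to the yearly period.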

Trees & Processes

Each level in the tree is managed by a designated process. For example, <site> hourly period statistics are managed by "site_hourly_aggregator", <site> daily period statistics by "site_daily_aggregator", etc.

Dependencies between trees

It is common for trees to have dependencies.
For example: to calculate Revenue Per Click, we need two numbers: number_of_clicks from <site tree> and revenue from <financial tree>.
Revenue Per Click = revenue / number_of_clicks. Thus, tree <financial post-processing> will depend on both <site tree> and <financial tree>.

Dependencies are registered in the context.py block that defines the tree. They are time qualifier-dependent: daily timeperiods from <site tree> can depend on daily timeperiods from <financial tree>. Consequently, hourly timeperiods of tree A cannot block daily timeperiods of a dependent tree B, as they belong to different time-aggregation classes.

Dependencies can be of the following types:

blocking_dependencies: any processing of dependent timeperiods is blocked until the blocking timeperiods are processed.
An interesting use-case is when one of the blocking timeperiods is in STATE_SKIPPED. In this case, the dependent timeperiod is also moved to STATE_SKIPPED
blocking_children: any processing of a higher-level timeperiod is blocked until all nested child timeperiods are processed.
blocking_normal: processing of the dependent timeperiod is allowed, but its finalization is not allowed until the blocking timeperiods are processed
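
The three dependency types can be compared side by side in a short sketch. The enum, the Decision record, the evaluate function and the state string values are all hypothetical; two further assumptions are labeled in the comments, since this document does not state how skipped blockers affect blocking_children and blocking_normal:

```python
from dataclasses import dataclass
from enum import Enum

STATE_PROCESSED = "state_processed"  # assumed value
STATE_SKIPPED = "state_skipped"      # assumed value


class DependencyType(Enum):
    BLOCKING_DEPENDENCIES = "blocking_dependencies"
    BLOCKING_CHILDREN = "blocking_children"
    BLOCKING_NORMAL = "blocking_normal"


@dataclass(frozen=True)
class Decision:
    can_process: bool
    can_finalize: bool
    skip: bool = False  # dependent timeperiod should move to STATE_SKIPPED


def evaluate(dep_type: DependencyType, blocker_states) -> Decision:
    states = list(blocker_states)
    all_processed = all(s == STATE_PROCESSED for s in states)

    if dep_type is DependencyType.BLOCKING_DEPENDENCIES:
        if any(s == STATE_SKIPPED for s in states):
            return Decision(False, False, skip=True)
        return Decision(all_processed, all_processed)

    if dep_type is DependencyType.BLOCKING_CHILDREN:
        # Assumption: skipped children count as done, as in the tree rule
        done = all(s in (STATE_PROCESSED, STATE_SKIPPED) for s in states)
        return Decision(done, done)

    # BLOCKING_NORMAL: processing may start, finalization must wait.
    # Assumption: only processed blockers unlock finalization.
    return Decision(True, all_processed)


pending = [STATE_PROCESSED, "state_in_progress"]  # hypothetical non-final state
print(evaluate(DependencyType.BLOCKING_DEPENDENCIES, pending))
print(evaluate(DependencyType.BLOCKING_NORMAL, pending))
print(evaluate(DependencyType.BLOCKING_DEPENDENCIES, [STATE_SKIPPED]))
```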
