diff --git a/.gitignore b/.gitignore
index 380be18..9f65799 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,5 +1,4 @@
 *.pyc
 *~
-.vagrant
 qless-lib.lua
 qless.lua
diff --git a/.npmignore b/.npmignore
deleted file mode 100644
index 6966ee1..0000000
--- a/.npmignore
+++ /dev/null
@@ -1,3 +0,0 @@
-*.pyc
-*~
-.vagrant
diff --git a/README.md b/README.md
index 98a51dc..f3d6bd3 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,4 @@
-Qless Core
-==========
-[![Build Status](https://travis-ci.org/seomoz/qless-core.png)](https://travis-ci.org/seomoz/qless-core)
+# Qless Core
 
 This is the set of all the lua scripts that comprise the qless library. We've
 begun migrating away from the system of having one lua script per command to
@@ -10,11 +8,11 @@ lua script.
 There are a few reasons for making this choice, but essentially it was getting
 too difficult to maintain and there was a lot of duplicated code in different
 sections. This also happens to have the added benefit of allowing you to build
-on top of the qless core library __within your own lua scripts__ through
+on top of the qless core library **within your own lua scripts** through
 composition.
 
-Building
---------
+## Building
+
 For ease of development, we've broken this unified file into several smaller
 submodules that are concatenated together into a `qless.lua` script.
@@ -25,7 +23,7 @@ submodules that are concatenated together into a `qless.lua` script.
 These are:
 - `recurring.lua` -- the recurring job class
 - `worker.lua` -- manage available workers
 - `api.lua` -- exposing the interfaces that the clients invoke, it's a very
-    thin wrapper around these classes
+  thin wrapper around these classes
 
 In order to build up the `qless.lua` script, we've included a simple
 `Makefile`, though all it does is cat these files out in a particular order:
 
@@ -42,17 +40,17 @@ that the qless clients use:
 
 ```
 make qless-lib.lua
 ```
 
-Testing
--------
+## Testing
+
 Historically, tests have appeared only in the language-specific bindings of
 qless, but that has become a tedious process. Not to mention the fact that
 it's a steep barrier to entry for writing new clients. In light of that, we
 now include tests directly in `qless-core`, written in python. To run these,
-you will need python and the `nose` and `redis` libraries. If you have `pip`
+you will need python and the `py.test` and `redis` libraries. If you have `pip`
 installed:
 
 ```python
-pip install redis nose
+pip install -r requirements.txt
 ```
 
 To run the tests, there is a directive included in the makefile:
@@ -68,11 +66,24 @@ the `REDIS_URL` environment variable:
 REDIS_URL='redis://host:port' make test
 ```
 
-Conventions
-===========
+There is also a docker compose file for running redis and testing the
+library. To get a shell with python installed and redis running:
+
+```bash
+docker compose run --rm shell
+```
+
+Then, to run the tests:
+
+```bash
+pip install -r requirements.txt && make test
+```
+
+# Conventions
+
+## No more `KEYS`
 
-No more `KEYS`
---------------
 When originally developing this, I wrote some functions using the `KEYS`
 portion of the lua scripts, but eventually realized that to do so didn't make
 any sense. For just about all operations there's no way to determine a priori
@@ -80,24 +91,23 @@ which Redis keys would be touched, and so I abandoned
 that idea. However, in many cases there were vestigial `KEYS` in use, but that
 has now changed. No more `KEYS`!
-Time, Time Everywhere
----------------------
+## Time, Time Everywhere
+
 To ease the client logic, every command now takes a timestamp with it. In
 many cases this argument is ignored, but it is still required in order to make
 a valid call. This requirement only comes through in the exposed script API,
 but not in the class interface. At the class function level, only the
 functions which require the `now` argument list it.
 
-Documentation
--------------
+## Documentation
+
 The documentation of the code is present in each of the modules, but it is
 excluded from the production code to reduce the weight of it.
 
-Features and Philosophy
-=======================
+# Features and Philosophy
+
+## Locking
 
-Locking
--------
 A worker is given an exclusive lock on a piece of work when it is given that
 piece of work. That lock may be renewed periodically so long as it's before
 the provided 'heartbeat' timestamp. Likewise, it may be completed.
@@ -111,8 +121,8 @@ node attempting to renew the lock is the same node that currently owns the
 lock. If so, then the lock's expiration should be pushed back accordingly,
 and the updated expiration returned. If not, an exception is raised.
 
-Stats
------
+## Stats
+
 Qless also collects statistics for job wait time (time popped - time put),
 and job completion time (time completed - time popped). By 'statistics', I
 mean average, variance, count and a histogram. Stats for the number of
@@ -122,15 +132,15 @@ Stats are grouped by day.
 In the case of job wait time, its stats are aggregated on the day when the
 job was popped. In the case of completion time, they are grouped by the day
 it was completed.
 
-Tracking
---------
+## Tracking
+
 Jobs can be tracked, which just means that they are accessible and
 displayable. This can be useful if you just want to keep tabs on the progress
 of jobs through the pipeline. All the currently-tracked jobs are stored in a
 sorted set, `ql:tracked`.
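The streaming mean/variance bookkeeping this README describes for its stats (a running count, mean, and a helper value for numerically stable variance) is Welford's method. A sketch in python (the language of the test suite); the class and method names are illustrative, not part of qless, though the field names mirror the `total`/`mean`/`vk` hash keys described under Internal Redis Structure:

```python
class RunningStats:
    """Streaming mean/variance via Welford's update (numerically stable)."""

    def __init__(self):
        self.total = 0    # number of samples seen
        self.mean = 0.0   # running mean
        self.vk = 0.0     # running sum of squared deviations from the mean

    def update(self, value):
        self.total += 1
        delta = value - self.mean
        self.mean += delta / self.total
        # vk accumulates (x - old_mean) * (x - new_mean)
        self.vk += delta * (value - self.mean)

    def variance(self):
        # population variance; divide by (total - 1) for sample variance
        return self.vk / self.total if self.total else 0.0
```

Each `update` is O(1) and needs no history of past samples, which is what makes it suitable for incrementing a Redis hash per job event.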
-Failures
---------
+## Failures
+
 Failures are stored in such a way that we can quickly summarize the number of
 failures of a given type, but also which items have succumbed to that type of
 failure. With that in mind, there is a Redis set, `ql:failures` whose members
@@ -150,8 +160,8 @@ deadbeef
 ...
 ```
 
-Worker Data
------------
+## Worker Data
+
 We'll keep a sorted set of workers sorted by the last time they had any
 activity. We'll store this set at `ql:workers`.
 
@@ -159,51 +169,49 @@ In addition to this list, we'll keep a set of the jids that a worker
 currently has locks for at `ql:w::jobs`. This should be sorted by the time
 when we last saw a heartbeat (or pop) for that worker from that job.
 
-__TBD__ We will likely store data about each worker. Perhaps this, too, can
+**TBD** We will likely store data about each worker. Perhaps this, too, can
 be kept by day.
 
-Job Data Deletion
------------------
+## Job Data Deletion
+
 We should delete data about completed jobs periodically. We should prune both
 by the policies for the maximum number of retained completed jobs, and by the
 maximum age for retained jobs. To accomplish this, we'll use a sorted list to
 keep track of which items should be expired. This list should be stored in
 the key `ql:completed`
 
+# Configuration Options
 
-Configuration Options
-=====================
 The configuration should go in the key `ql:config`, and here are some of the
 configuration options that `qless` is meant to support:
 
 1. `heartbeat` (60) --
-    The default heartbeat in seconds for queues
+   The default heartbeat in seconds for queues
 1. `stats-history` (30) --
-    The number of days to store summary stats
+   The number of days to store summary stats
 1. `histogram-history` (7) --
-    The number of days to store histogram data
+   The number of days to store histogram data
 1. `jobs-history-count` (50k) --
-    How many jobs to keep data for after they're completed
-1. `jobs-history` (7 * 24 * 60 * 60) --
-    How many seconds to keep jobs after they're completed
+   How many jobs to keep data for after they're completed
+1. `jobs-history` (`7 * 24 * 60 * 60`) --
+   How many seconds to keep jobs after they're completed
1. `heartbeat-` --
-    The heartbeat interval (in seconds) for a particular queue
+   The heartbeat interval (in seconds) for a particular queue
 1. `max-worker-age` --
-    How long before workers are considered disappeared
+   How long before workers are considered disappeared
 1. `-max-concurrency` --
-    The maximum number of jobs that can be running in a queue. If this number
-    is reduced, it does not impact any currently-running jobs
+   The maximum number of jobs that can be running in a queue. If this number
+   is reduced, it does not impact any currently-running jobs
 1. `max-job-history` --
-    The maximum number of items in a job's history. This can be used to help
-    control the size of long-running jobs' history
+   The maximum number of items in a job's history. This can be used to help
+   control the size of long-running jobs' history
 
+# Internal Redis Structure
 
-Internal Redis Structure
-========================
 This section stands to speak to the internal structure and naming
 conventions.
 
-Jobs
-----
+## Jobs
+
 Each job is stored primarily in a key `ql:j:`, a Redis hash, which contains
 most of the keys that describe the job. A set (possibly empty) of jids on
 which this job depends is stored in `ql:j:-dependencies`.
@@ -269,15 +277,15 @@ job is stored in `ql:j:-dependents`. For example, `ql:j:`:
 }
 ```
 
-Queues
-------
+## Queues
+
 A queue is a priority queue and consists of four parts:
 
 1. `ql:q:-scheduled` -- sorted set of all scheduled job ids
 1. `ql:q:-work` -- sorted set (by priority) of all jobs waiting
 1. `ql:q:-locks` -- sorted set of job locks and expirations
 1. `ql:q:-depends` -- sorted set of jobs in a queue, but waiting on
-    other jobs
+   other jobs
 
 When looking for a unit of work, the client should first choose from the
 next expired lock. If none are expired, then we should next make sure that
@@ -291,8 +299,8 @@ When a job is completed, it removes itself as a dependency of all the jobs
 that depend on it. If it was the last job that a job depended on, it is then
 inserted into the queue's work.
 
-Stats
------
+## Stats
+
 Stats are grouped by day and queue. The day portion of the stats key is an
 integer timestamp of midnight for that day:
 
@@ -306,8 +314,8 @@ Stats are stored under two hashes: `ql:s:wait::` and
 
 - `total` -- The total number of data points contained
 - `mean` -- The current mean value
 - `vk` -- Not the actual variance, but a number that can be used to both numerically
-    stable-ly find the variance, and compute it in a
-    [streaming fashion](http://www.johndcook.com/standard_deviation.html)
+  stably find the variance, and compute it in a
+  [streaming fashion](http://www.johndcook.com/standard_deviation.html)
 - `s1`, `s2`, ..., -- second-resolution histogram counts for the first minute
 - `m1`, `m2`, ..., -- minute-resolution for the first hour
 - `h1`, `h2`, ..., -- hour-resolution for the first day
@@ -316,40 +324,39 @@ This is also another hash, `ql:s:stats::` with keys:
 
 - `failures` -- This is how many failures there have been. If a job is run
-    twice and fails repeatedly, this is incremented twice.
-- `failed` -- This is how many are currently failed
-- `retries` -- This is how many jobs we've had to retry
+  twice and fails repeatedly, this is incremented twice.
+- `failed` -- This is how many are currently failed
+- `retries` -- This is how many jobs we've had to retry
+
+## Tags
 
-Tags
-----
 All jobs store a JSON array of the tags that are associated with it.
 In addition, the keys `ql:t:` store a sorted set of all the jobs associated
 with that particular tag. The score of each jid in that tag is the time when
 that tag was added to that job. When jobs are tagged a second time with an
 existing tag, then it's a no-op.
 
+# Implementing Clients
 
-Implementing Clients
-====================
 There are a few nuanced aspects of implementing bindings for your particular
 language that are worth bringing up. The canonical example for bindings
 should be the [python](https://github.com/seomoz/qless-py) and
 [ruby](https://github.com/seomoz/qless) bindings.
 
-Structure
---------- 
+## Structure
+
 We recommend using
 [git submodules](http://git-scm.com/book/en/Git-Tools-Submodules) to keep
 [qless-core](https://github.com/seomoz/qless-core) in your bindings.
 
-Testing
--------
+## Testing
+
 The majority of tests are implemented in `qless-core`, and so your bindings
 should merely test that they provide sensible access to the functionality.
 This should include a notion of `queues`, `workers`, `jobs`, etc.
 
-Running the Worker
-------------------
+## Running the Worker
+
 If your language supports dynamic importing of code, and in particular if a
 class can be imported deterministically from a string identifier, then you
 should include a worker executable with your release. For example, in Python,
@@ -358,8 +365,8 @@ As such, a worker binary can just be given a list of queues, a number (and
 perhaps type) of workers, wait intervals, etc., and then can import all the
 code required to perform work.
 
-Timestamps
----------- 
+## Timestamps
+
 Jobs with identical priorities are popped in the order they were inserted.
 The caveat is that it's only true to the precision of the timestamps your
 bindings provide. For example, if you provide timestamps to the second
 granularity, then
@@ -368,29 +375,29 @@ order. Timestamps at the thousandths of a second granularity will
 maintain this property better.
 While for most applications it's likely not important, it is something to be
 aware of when writing language bindings.
 
-Filesystem Access
------------------ 
+## Filesystem Access
+
 It's intended to be a common use case that bindings provide a worker script
 or binary that runs several worker subprocesses. These should run with their
 working directory as a sandbox.
 
-Forking Model
-------------- 
+## Forking Model
+
 There are a couple of philosophies regarding how to best fork processes to do
 work. Certainly, there should be a parent process that manages child
 processes, if for no other reason than to ensure that child workers are
 well-behaved. But how exactly the child processes work is less clear. We
 encourage you to make all models available in your client:
 
-- __Fork once for each job__ -- This has the added benefit of containing
-    potential issues like resource leaks, but it comes at the potentially high
-    cost of forking once for each job.
-- __Fork long-running processes__ -- Forking long-running processes means that
-    you will likely to be able to saturate the CPUs on a machine more easily,
-    and reduces the cost per job of forking.
-- __Coroutines in long-running processes__ -- Especially for I/O-bound processes
-    this is handy, since you can keep the number of processes relatively small
-    and still get good I/O parallelism.
+- **Fork once for each job** -- This has the added benefit of containing
+  potential issues like resource leaks, but it comes at the potentially high
+  cost of forking once for each job.
+- **Fork long-running processes** -- Forking long-running processes means that
+  you will likely be able to saturate the CPUs on a machine more easily,
+  and reduces the cost per job of forking.
+- **Coroutines in long-running processes** -- Especially for I/O-bound processes
+  this is handy, since you can keep the number of processes relatively small
+  and still get good I/O parallelism.
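The first model (fork once for each job) can be sketched with just the standard library. Here `run_job` and the jid list are illustrative stand-ins, not the qless API; the point is the containment property: a crash or leak in the child dies with the child, and the parent reaps it before popping the next job.

```python
import os

def run_job(jid):
    # Stand-in for popping and processing one qless job.
    return 0  # exit status: 0 on success

def fork_per_job(jids):
    """Fork one child per job; the parent reaps each child before the next."""
    statuses = []
    for jid in jids:
        pid = os.fork()
        if pid == 0:
            # Child: run exactly one job, then exit without cleanup handlers.
            os._exit(run_job(jid))
        # Parent: wait for the child and record its exit status.
        _, status = os.waitpid(pid, 0)
        statuses.append(os.WEXITSTATUS(status))
    return statuses
```

The long-running-process and coroutine models trade this isolation for lower per-job overhead; a real worker would also install signal handlers so the parent can stop children on `lock_lost`-style events.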
 Each style of worker should be able to listen for worker-specific `lock_lost`,
 `canceled` and `put` events, each of which can signal that a worker has lost
@@ -399,41 +406,40 @@ the opportunity to stop the child worker that's currently running that job
 (if it exists). While `qless` ensures correct behavior when taking action on
 jobs where a lock has been lost, this is an opportunity to gain efficiency.
 
-Queue Popping Order
-------------------- 
+## Queue Popping Order
+
 Workers are allowed (and encouraged) to pop off of more than one queue. But
 then we get into the problem of what order they should be polled. Workers
 should support two modes of popping: ordered and round-robin. Consider
 queues `A`, `B`, and `C` with job counts:
 
-    A: 5
-    B: 2
-    C: 3
+    A: 5
+    B: 2
+    C: 3
 
 In an ordered version, the order in which the queues are specified has
 significance in the order in which jobs are popped. For example, if our
 queues were ordered `C, B, A` in the worker, we'd pop jobs off:
 
-    C, C, C, B, B, A, A, A, A, A
+    C, C, C, B, B, A, A, A, A, A
 
 In the round-robin implementation, a worker pops off a job from each queue
 as it progresses through all queues:
 
-    C, B, A, C, B, A, C, A, A, A
+    C, B, A, C, B, A, C, A, A, A
 
+# Internal Style Guide
 
-Internal Style Guide
-====================
 These aren't meant to be stringent, but just to keep myself sane so that when
 moving between different chunks of code that it's all formatted similarly,
 and the same variable names have the same meaning.
 
 1. Parameter sanitization should be performed as early as possible. This
-    includes making use of `assert` and `error` based on the number and type
-    of arguments.
+   includes making use of `assert` and `error` based on the number and type
+   of arguments.
 1. Job ids should be referred to as `jid`, both internally and in the clients.
 1. Failure types should be described with `group`. I'm not terribly thrilled
-    with the term, but I thought it was better than 'kind.' After spending
-    some time with a Thesaurus, I didn't find anything that appealed to me more
+   with the term, but I thought it was better than 'kind.' After spending
+   some time with a Thesaurus, I didn't find anything that appealed to me more.
 1. Job types should be described as `klass` (nod to Resque), because both
-    'type' and 'class' are commonly used in languages.
+   'type' and 'class' are commonly used in languages.
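The ordered and round-robin popping modes described under Queue Popping Order can be sketched in python (the language of the test suite). Queues are modeled here as a plain dict of lists; this is a stand-in for illustration, not the actual qless pop:

```python
def pop_ordered(queues, order):
    """Always pop from the first non-empty queue in the configured order."""
    popped = []
    while any(queues[name] for name in order):
        for name in order:
            if queues[name]:
                popped.append(queues[name].pop(0))
                break  # restart from the highest-priority queue each time
    return popped

def pop_round_robin(queues, order):
    """Cycle through the queues, popping one job from each non-empty one."""
    popped = []
    while any(queues[name] for name in order):
        for name in order:
            if queues[name]:
                popped.append(queues[name].pop(0))
    return popped
```

With `A` holding 5 jobs, `B` 2, and `C` 3, and the order `C, B, A`, these two functions reproduce the `C, C, C, B, B, A, ...` and `C, B, A, C, B, A, ...` sequences shown in that section.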