Skip to content

Commit

Permalink
updating readme
Browse files Browse the repository at this point in the history
  • Loading branch information
dfm committed May 8, 2023
1 parent 9cf963a commit d630e1d
Show file tree
Hide file tree
Showing 2 changed files with 119 additions and 9 deletions.
122 changes: 115 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,23 +1,131 @@
Very much still a work in progress. Something like the following is the target:
# "Staging" for Snakemake

This package provides a mechanism for
[Snakemake](https://snakemake.readthedocs.io) workflows to explicitly "stage
out" the output files from certain rules to a public repository like
[Zenodo](https://zenodo.org) to allow faster re-execution of the workflow, using
these previously generated artifacts. This can be especially useful for
workflows with computationally expensive rules that don't need to be frequently
re-run.

`snakemake-staging` is a spin-off of the
[`showyourwork`](https://github.com/showyourwork/showyourwork) project, which
provides a "caching" framework for Snakemake workflows, to transparently avoid
re-execution of rules that have been cached to [Zenodo](https://zenodo.org). The
implementation of this logic in `showyourwork` is, however, somewhat fragile and
unpredictable. In `snakemake-staging`, we take a more explicit approach, where
"staged" rules are always either explicitly executed or restored.

## Installation

To use `snakemake-staging` in your workflow, you can install it using `pip`
(it's probably best to set up your Snakemake installation [following the
Snakemake
docs](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html)
first):

```bash
python -m pip install snakemake-staging
```

## Quickstart

### The `Snakefile`

While testing, it's probably best to use the [Zenodo
Sandbox](https://sandbox.zenodo.org), rather than the main site, since any
archive published to Zenodo is permanent. To use the sandbox, you'll need a
personal access token stored in the `SANDBOX_TOKEN` environment variable. You
can generate a new token
[here](https://sandbox.zenodo.org/account/settings/applications/).

Once you've added this token to your environment, you can edit the Snakefile for
your workflow to use `snakemake-staging` as follows. First, towards the top of
your Snakefile, add:

```python
# File: Snakefile
import snakemake_staging as staging

stage = staging.ZenodoStage(
"stage",
"zenodo-stage",
config.get("restore", False)
)
```

rule expensive_computation:
to create a new stage called `zenodo-stage`. Note that here we're extracting a
`restore` flag from the Snakemake config, which will be used to determine
whether to restore files for the stage. This means that you can control the
behavior of this stage from the command line. By passing `--config restore=True`
to the `snakemake` command line interface, all files staged out by the
`zenodo-stage` stage will be restored from the archive rather than generated.

Then, to stage out a rule, you can apply the stage as follows:

```python
rule expensive:
input:
...
output:
stage("path/to/output/file.txt")
stage(
"path/to/output1.txt",
"path/to/output2.txt",
)
shell:
...
```

Finally, _after defining all the rules that you want to stage out_, you must
add the following `include` which defines all the staging rules:

# Finally (this should be after all calls to `stage` have been made),
# include the provided Snakefile for handling staging and restoring files.
```python
include: staging.snakefile()
```

At this point, here's the full `Snakefile`:

<details>
<summary>Full Snakefile example</summary>

```python
import snakemake_staging as staging

stage = staging.ZenodoStage(
"zenodo-stage",
config.get("restore", False)
)

rule expensive:
input:
...
output:
stage(
"path/to/output1.txt",
"path/to/output2.txt",
)
shell:
...

include: staging.snakefile()
```

</details>

### Usage

With the `Snakefile` defined in the previous section, you can now run your
workflow in 3 ways:

1. **Normal execution**: If you run something like
`snakemake path/to/output1.txt` (where I have omitted the usual `--cores`
and `--conda` arguments) will execute the workflow as normal, without
staging out any files.

2. **Stage upload**: If you instead have Snakemake target the
`zenodo-stage.zenodo.json` file (this filename can be changed by passing the
`info_file` argument to the `ZenodoStage` constructor), the `expensive` rule
will be executed, and the outputs will be uploaded to Zenodo, saving the
record information to `zenodo-stage.zenodo.json`.

3. **Stage restore**: Finally, after these outputs have been uploaded to Zenodo,
you can call Snakemake `--config restore=True` to disable the `expensive`
rule, and force the outputs to be restored from Zenodo.
6 changes: 4 additions & 2 deletions src/snakemake_staging/stages.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,8 @@
from pathlib import Path
from typing import List, Optional

from snakemake.io import OutputFiles

from snakemake_staging.config import _CONFIG
from snakemake_staging.utils import PathLike, package_data, path_to_identifier

Expand Down Expand Up @@ -34,9 +36,9 @@ def upload_flag_file(self) -> Path:
def __call__(self, *files: PathLike) -> List[PathLike]:
return self.staged(*files)

def staged(self, *files: PathLike, **named: PathLike) -> List[PathLike]:
def staged(self, *files: PathLike) -> List[PathLike]:
# Add this list of files to the database of files for this stage
for file in files + tuple(named.values()):
for file in files:
identifier = path_to_identifier(file)
if identifier in self.files:
raise RuntimeError(
Expand Down

0 comments on commit d630e1d

Please sign in to comment.