Tutorial polishing
 - only attempt to execute tutorials with jupytext
 - introduction
 - cluster section for SARS-CoV-2
DrYak committed Nov 2, 2022
1 parent e5cfdf7 commit 650edce
Showing 4 changed files with 115 additions and 42 deletions.
26 changes: 26 additions & 0 deletions docs/README.md
@@ -0,0 +1,26 @@
# Tutorials

You can find two tutorials in this directory:

- [tutorial_hiv.md](tutorial_hiv.md): uses HIV test data
- [tutorial_sarscov2.md](tutorial_sarscov2.md): uses SARS-CoV-2 data from a publication

## Note about the tutorials

Due to automated testing, each copy-pastable block begins with a command entering the directory and ends with one leaving the directory:

```bash
cd tutorial/work/
# do something
cd ../..
```
Of course you don't necessarily need to do that. You can simply remain in the working directory.

When editing files like `config.yaml`, you can use your favorite editor (`vim`, `emacs`, `nano`, [butterflies](https://xkcd.com/378/), etc.). By default, our tutorials use heredocs to make it easier to copy-paste the blocks into bash:

```bash
cat > config.yaml <<EOF
general:
  virus_base_config: 'sars-cov-2'
EOF
```
2 changes: 1 addition & 1 deletion docs/convert.sh
@@ -18,4 +18,4 @@ elif [[ "${branch}" != "${default}" ]]; then
fi

# create Jupyter Notebooks for all Markdown files
-exec jupytext --set-formats ipynb,md --execute ./*.md
+exec jupytext --set-formats ipynb,md --execute ./tutorial*.md
23 changes: 14 additions & 9 deletions docs/tutorial.md → docs/tutorial_hiv.md
@@ -21,7 +21,7 @@ V-pipe is a workflow designed for the analysis of next generation sequencing (NG

## Requirements

-V-pipe is optimized for Linux or Mac OS systems. Therefore, we recommend users with a Windows system to install WSL2 - this is not a full virtual machine but rather a way to run Windows and Linux cooperatively at the same time.
+V-pipe is optimized for Linux or Mac OS systems. Therefore, we recommend that users with a Windows system [install WSL2](https://learn.microsoft.com/en-us/windows/wsl/install) - this is not a full virtual machine but rather a way to run Windows and Linux cooperatively at the same time.


## Organizing Data
@@ -37,7 +37,7 @@ Paired-ended reads need to be in split files with suffixes `_R1` and `_R2`.

```text
📁samples
-|───📁patient1
+├───📁patient1
│ └───📁date1
│ └───📁raw_data
│ ├───🧬reads_R1.fastq
@@ -48,7 +48,7 @@
│ ├───🧬reads_R1.fastq
│ └───🧬reads_R2.fastq
└───📁date2
-    └───raw_data
+    └───📁raw_data
├───🧬reads_R1.fastq
└───🧬reads_R2.fastq
```
@@ -60,11 +60,11 @@ The files will have the following structure:

```text
📁samples
-|└───📁CAP217
-└───📁4390
-└───📁raw_data
-├───🧬reads_R1.fastq
-└───🧬reads_R2.fastq
+├───📁CAP217
+│   └───📁4390
+│       └───📁raw_data
+│           ├───🧬reads_R1.fastq
+│           └───🧬reads_R2.fastq
└───📁CAP188
│───📁4
│ └───📁raw_data
```

@@ -115,13 +115,18 @@ cd working_2

## Preparation

-Copy the samples directory you created in the step "Preparing a small dataset" to this working directory. (You can display the directory structure with `tree samples` or `find samples`.)
+Copy the samples directory you created in the step "Preparing a small dataset" to this working directory. (You can display the directory structure with `tree testing/work/resources/samples` or `find testing/work/resources/samples`.)

```bash
mkdir -p testing/work/resources
mv testing/V-pipe/docs/example_HIV_data/samples testing/work/resources/samples
```

Note that:

- by default, V-pipe expects its samples in a directory `samples` contained directly in the working directory - i.e. `testing/work/samples`
- in this tutorial, we put them inside the `resources` subdirectory instead, and will set the config file accordingly.
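As a sketch of that last point, a minimal `config.yaml` pointing V-pipe at the relocated samples could look as follows. The `input:`/`datadir:` keys and the `'hiv'` base config name are assumptions for illustration - check the configuration manual of your local installation:

```shell
# hypothetical minimal config; key names are assumptions, adjust to your setup
mkdir -p testing/work
cat > testing/work/config.yaml <<EOF
general:
  virus_base_config: 'hiv'
input:
  datadir: resources/samples/
EOF
```

As in the other blocks of this tutorial, a heredoc is used so the block can be copy-pasted directly into bash.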


### Reference
If you have a reference sequence that you would like to use for read mapping and alignment, save it as `resources/reference/ref.fasta`. In our case, however, we will use the reference sequence HXB2 already provided by V-pipe in `V-pipe/resources/hiv/HXB2.fasta`.

106 changes: 74 additions & 32 deletions docs/tutorial_sarscov2.md
@@ -17,19 +17,19 @@ jupyter:

# SARS-CoV-2 Tutorial

-This tutorial shows the basics of how to interact with V-pipe. A recording of our webinar covering the subject is available at the bottom of the current page.
+This tutorial shows the basics of how to interact with V-pipe.

-For the purpose of this Tutorial, we will work with the sars-cov2 branch which is adapted for the SARS-CoV-2 virus.
+For the purpose of this tutorial, we will work with the master branch of V-pipe and use the _sars-cov-2_ virus base config, which is adapted for the SARS-CoV-2 virus.


## Organizing Data

V-pipe expects the input samples to be organized in a two-level hierarchy:

-At the first level, input files grouped by samples (e.g.: patients or biological replicates of an experiment).
-A second level for distinction of datasets belonging to the same sample (e.g.: sample dates).
-Inside that directory, the sub-directory raw_data holds the sequencing data in FASTQ format (optionally compressed with GZip).
-Paired-ended reads need to be in split files with _R1 and _R2 suffixes.
+- At the first level, input files are grouped by samples (e.g.: patients or biological replicates of an experiment).
+- A second level distinguishes the datasets belonging to the same sample (e.g.: sample dates).
+- Inside that directory, the sub-directory `raw_data` holds the sequencing data in FASTQ format (optionally compressed with GZip).
+- Paired-end reads need to be in split files with `_R1` and `_R2` suffixes.


## Preparing a small dataset
@@ -38,9 +38,7 @@ You can run the first test on your workstation or a good laptop.

First, you need to prepare the data:

-* For that test, you need to download the following runs from SRA: SRR10903401 and SRR10903402
-
-If you have difficulties, check this shared directory. You can obtain there a copy of the .fastq files. More information on the steps necessary to generate the .fastq files from SRA can be found in the README.md file.
+* For that test, you need to download the following runs from SRA: [SRR10903401](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR10903401) and [SRR10903402](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR10903402)

```bash
mkdir -p samples/SRR10903401/20200102/raw_data
@@ -66,7 +64,23 @@
mv samples/SRR10903402/20200102/raw_data/SRR10903402_2.fastq samples/SRR10903402/20200102/raw_data/SRR10903402_R2.fastq
```

-The downloaded files will have the following structure:
+The downloaded files should have the following structure:

```text
📁samples
├───📁SRR10903401
│ └───📁20200102
│ └───📁raw_data
│ ├───🧬SRR10903401_R1.fastq
│ └───🧬SRR10903401_R2.fastq
└───📁SRR10903402
└───📁20200102
└───📁raw_data
├───🧬SRR10903402_R1.fastq
└───🧬SRR10903402_R2.fastq
```

You can display the directory structure with the following command on Linux (on Mac OS, use `find samples`):

```bash
tree samples
```

@@ -77,14 +91,14 @@

V-pipe uses the [Bioconda](https://bioconda.github.io/) bioinformatics software repository for all its pipeline components. The pipeline itself is written using [Snakemake](https://snakemake.readthedocs.io/en/stable/).

-For advanced users: If your are fluent with these tools, you can:
-
-* directly download and install [bioconda](https://bioconda.github.io/user/install.html) and [snakemake](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda),
-* make sure to configure V-pipe to use the `sars-cov-2` virus-config
-* and start using V-pipe with them, using the --use-conda to [automatically download and install](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management) any further pipeline dependencies.
-* please refer to the documentation for additional instructions.
-
-The present tutorial will show simplified commands that automate much of this process.
+> **For advanced users:** If you are fluent with these tools, you can:
+>
+> * directly download and install [bioconda](https://bioconda.github.io/user/install.html) and [snakemake](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda),
+> * make sure to configure V-pipe to use the `sars-cov-2` virus config,
+> * and start using V-pipe with them, using the `--use-conda` flag to [automatically download and install](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management) any further pipeline dependencies.
+> * please refer to the [documentation](https://github.com/cbg-ethz/V-pipe/blob/master/README.md) for additional instructions.
+>
+> The present tutorial will show simplified commands that automate much of this process.
To deploy V-pipe, you can use the installation script with the following parameters:

Expand All @@ -96,24 +110,24 @@ bash quick_install.sh -p tutorial -w work
* using `-p` specifies the subdirectory where to download and install snakemake and V-pipe
* using `-w` will create a working directory and populate it. It will copy over the references and the default `config/config.yaml`, and create a handy `vpipe` short-cut script to invoke `snakemake`.

-Tip: To create and populate other new working directories, you can call init_project.sh from within the new directory:
-
-```console
-mkdir -p working_2
-cd working_2
-../V-pipe/init_project.sh
-```
+> **Tip:** To create and populate other new working directories, you can call init_project.sh from within the new directory:
+>
+> ```console
+> mkdir -p working_2
+> cd working_2
+> ../V-pipe/init_project.sh
+> ```
## Running V-pipe
-Copy the samples directory you created in the step Preparing a small dataset to this working directory. (You can display the directory structure with `tree samples` or `find samples`.)
+Copy the samples directory you created in the step [Preparing a small dataset](#preparing-a-small-dataset) to this working directory. (You can display the directory structure with `tree samples` or `find samples`.)
```bash
mv samples tutorial/work/
```
-Prepare V-pipe's configuration:
+Prepare V-pipe's configuration. You can find more information in [the documentation](https://github.com/cbg-ethz/V-pipe/blob/master/config/README.md). In your local V-pipe installation, you will also find an exhaustive manual about all the configuration options inside `config/config.html`.

```bash
cat <<EOT > tutorial/work/config.yaml
@@ -155,8 +169,7 @@ SRR10903402 20200102 150
EOT
```

-Tip: Always check the content of the `samples.tsv` file.
+**Tip:** Always check the content of the `samples.tsv` file.
If you didn’t use the correct structure, this file might end up empty or some entries might be missing.
You can safely delete it and re-run with `--dryrun` to regenerate it.
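As an illustration of that check, the following sketch rebuilds a `samples.tsv` with the same content as the heredoc above and verifies that every line has exactly three tab-separated fields (sample name, date, read length):

```shell
# illustrative sanity check; the file content mirrors the tutorial's heredoc
mkdir -p tutorial/work
printf 'SRR10903401\t20200102\t150\nSRR10903402\t20200102\t150\n' > tutorial/work/samples.tsv
# every line should have exactly three tab-separated fields
awk -F'\t' 'NF != 3 { bad = 1 } END { exit bad }' tutorial/work/samples.tsv \
  && echo "samples.tsv looks well-formed"
```

If some rows come out with fewer fields, the samples directory layout was probably not the expected two-level hierarchy.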

@@ -170,7 +183,9 @@ cd tutorial/work/

## Output

-The Wiki contains an overview of the output files. The output of the SNV calling is aggregated in a standard [VCF](https://en.wikipedia.org/wiki/Variant_Call_Format) file, located in `samples/{hierarchy}/variants/SNVs/snvs.vcf`, you can open it with your favorite VCF tools for visualisation or downstream processing. It is also available in a tabular format in `samples/{hierarchy}/variants/SNVs/snvs.csv`.
+The section _output_ of the exhaustive configuration manual contains an overview of the output files.
+The output of the SNV calling is aggregated in a standard [VCF](https://en.wikipedia.org/wiki/Variant_Call_Format) file located in `results/`_{hierarchy}_`/variants/SNVs/snvs.vcf`; you can open it with your favorite VCF tools for visualisation or downstream processing.
+It is also available in a tabular format in `results/`_{hierarchy}_`/variants/SNVs/snvs.csv`.
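To give an idea of what working with that VCF looks like, here is a sketch operating on a toy two-record VCF written inline. The real file only exists after a full run, under `results/`_{hierarchy}_`/variants/SNVs/`; the accession and positions below are purely illustrative:

```shell
# toy stand-in for results/<hierarchy>/variants/SNVs/snvs.vcf, for illustration only
cat > snvs.vcf <<'EOF'
##fileformat=VCFv4.0
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
NC_045512.2	241	.	C	T	.	PASS	.
NC_045512.2	3037	.	C	T	.	PASS	.
EOF
# count variant records (data lines are those not starting with '#')
grep -vc '^#' snvs.vcf
# -> 2
```

The same pattern (skipping `#`-prefixed header lines) is what most quick command-line inspections of a VCF boil down to.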

### Expected output

@@ -198,7 +213,34 @@
It is possible to ask snakemake to submit jobs on a cluster using the batch submission command-line interface of your cluster.
-The platform LSF by IBM is one of the popular systems you might find (Others include SLURM, Grid Engine).
+The open-source platform SLURM by SchedMD is one of the popular systems you might find on clusters (others include LSF and Grid Engine).
The most user friendly way to submit jobs to the cluster is using a special _snakemake profile_.
[smk-simple-slurm](https://github.com/jdblischak/smk-simple-slurm) is a profile that, in our experience, works well with SLURM (for other platforms, see suggestions in [the snakemake-profiles documentation](https://github.com/snakemake-profiles/doc)).

```console
cd tutorial/
git clone https://github.com/jdblischak/smk-simple-slurm.git
cd work/
./vpipe --dry-run --profile ../smk-simple-slurm --jobs 100
cd ../..
```

Snakemake's documentation [introduces the key concepts used in profiles](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles).
Check also [the other options for running snakemake on clusters](https://snakemake.readthedocs.io/en/stable/executing/cli.html#CLUSTER) if you need more advanced usage.
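For reference, a profile is just a directory containing a `config.yaml` of command-line options. The sketch below is a minimal, hand-rolled example; the `sbatch` options and resource placeholders are assumptions to adapt to your cluster, and smk-simple-slurm above ships a far more complete version:

```shell
# minimal illustrative profile; option values are assumptions, adjust to your cluster
mkdir -p tutorial/my-slurm-profile
cat > tutorial/my-slurm-profile/config.yaml <<'EOF'
cluster: sbatch --time={resources.runtime} --mem={resources.mem_mb} --cpus-per-task={threads}
jobs: 100
use-conda: true
EOF
```

It would then be passed to V-pipe the same way as above: `./vpipe --profile ../my-slurm-profile`.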

### Dependencies downloading on the cluster

In addition, Snakemake has [parameters for conda](https://snakemake.readthedocs.io/en/stable/executing/cli.html#CONDA) that can help manage dependencies:

- using `--conda-create-envs-only` downloads the dependencies without running the pipeline itself. This is very useful if the compute nodes of your cluster are not allowed internet access.
- using `--conda-prefix=`_{DIR}_ stores the conda environments of dependencies in a common directory (which can thus be shared and re-used between multiple instances of V-pipe).

```console
cd tutorial/work/
./vpipe --conda-prefix ../snake-envs --cores 1 --conda-create-envs-only
cd ../..
```

...TODO...
When using V-pipe in production environments, plan the installer's `-p` prefix and `-w` working directories, and snakemake's `--conda-prefix` environments directory, according to your cluster's quotas and time limits.
For example, consider using `${SCRATCH}` and moving only the content of the `results/` directory to long-term storage.
