Tutorial polishing
 - only attempt to execute tutorials with jupytext
 - introduction
 - cluster section for SARS-CoV-2
DrYak committed Nov 2, 2022
1 parent e5cfdf7 commit 650edce
Showing 4 changed files with 115 additions and 42 deletions.
26 changes: 26 additions & 0 deletions docs/README.md
@@ -0,0 +1,26 @@
# Tutorials

You can find two tutorials in this directory:

- [tutorial_hiv.md](tutorial_hiv.md): uses HIV test data
- [tutorial_sarscov2.md](tutorial_sarscov2.md): uses SARS-CoV-2 data from a publication

## Note about the tutorials

Due to automated testing, each copy-pastable block begins with a command entering the directory and ends with one leaving the directory:

```bash
cd tutorial/work/
# do something
cd ../..
```
Of course you don't necessarily need to do that. You can simply remain in the working directory.

When editing files like `config.yaml`, you can use your favorite editor (`vim`, `emacs`, `nano`, [butterflies](https://xkcd.com/378/), etc.). By default, our tutorials use heredocs to make it easier to copy-paste the blocks into bash:

```bash
cat > config.yaml <<EOF
general:
  virus_base_config: 'sars-cov-2'
EOF
```
2 changes: 1 addition & 1 deletion docs/convert.sh
@@ -18,4 +18,4 @@ elif [[ "${branch}" != "${default}" ]]; then
fi

# create Jupyter Notebooks for all Markdown files
-exec jupytext --set-formats ipynb,md --execute ./*.md
+exec jupytext --set-formats ipynb,md --execute ./tutorial*.md
23 changes: 14 additions & 9 deletions docs/tutorial.md → docs/tutorial_hiv.md
@@ -21,7 +21,7 @@ V-pipe is a workflow designed for the analysis of next generation sequencing (NG

## Requirements

-V-pipe is optimized for Linux or Mac OS systems. Therefore, we recommend users with a Windows system to install WSL2 - this is not a full virtual machine but rather a way to run Windows and Linux cooperatively at the same time.
+V-pipe is optimized for Linux or Mac OS systems. Therefore, we recommend that users with a Windows system [install WSL2](https://learn.microsoft.com/en-us/windows/wsl/install) - this is not a full virtual machine but rather a way to run Windows and Linux cooperatively at the same time.


## Organizing Data
@@ -37,7 +37,7 @@ Paired-ended reads need to be in split files with suffixes `_R1` and `_R2`.

```text
📁samples
-|───📁patient1
+├───📁patient1
│ └───📁date1
│ └───📁raw_data
│ ├───🧬reads_R1.fastq
@@ -48,7 +48,7 @@
│ ├───🧬reads_R1.fastq
│ └───🧬reads_R2.fastq
└───📁date2
-    └───raw_data
+    └───📁raw_data
├───🧬reads_R1.fastq
└───🧬reads_R2.fastq
```
@@ -60,11 +60,11 @@ The files will have the following structure:

```text
📁samples
-|└───📁CAP217
-└───📁4390
-└───📁raw_data
-├───🧬reads_R1.fastq
-└───🧬reads_R2.fastq
+├───📁CAP217
+│   └───📁4390
+│       └───📁raw_data
+│           ├───🧬reads_R1.fastq
+│           └───🧬reads_R2.fastq
└───📁CAP188
│───📁4
│ └───📁raw_data
```

@@ -115,13 +115,18 @@ cd working_2

## Preparation

-Copy the samples directory you created in the step "Preparing a small dataset" to this working directory. (You can display the directory structure with `tree samples` or `find samples`.)
+Copy the samples directory you created in the step "Preparing a small dataset" to this working directory. (You can display the directory structure with `tree testing/work/resources/samples` or `find testing/work/resources/samples`.)

```bash
mkdir -p testing/work/resources
mv testing/V-pipe/docs/example_HIV_data/samples testing/work/resources/samples
```

Note that:

- by default, V-pipe expects its samples in a directory `samples` contained directly in the working directory - i.e. `testing/work/samples`
- in this tutorial, we put them inside the `resources` subdirectory instead, and will set the config file accordingly.
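As a sketch of that last point, a minimal `config.yaml` pointing V-pipe at the relocated samples could look as follows. The `input:`/`datadir:` keys and the `'hiv'` base config name are assumptions for illustration - check the configuration manual of your local installation:

```shell
# hypothetical minimal config; key names are assumptions, adjust to your setup
mkdir -p testing/work
cat > testing/work/config.yaml <<EOF
general:
  virus_base_config: 'hiv'
input:
  datadir: resources/samples/
EOF
```

As in the other blocks of this tutorial, a heredoc is used so the block can be copy-pasted directly into bash.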


### Reference
If you have a reference sequence that you would like to use for read mapping and alignment, save it as `resources/reference/ref.fasta`. In our case, however, we will use the reference sequence HXB2 already provided by V-pipe in `V-pipe/resources/hiv/HXB2.fasta`.

106 changes: 74 additions & 32 deletions docs/tutorial_sarscov2.md
@@ -17,19 +17,19 @@ jupyter:

# SARS-CoV-2 Tutorial

-This tutorial shows the basics of how to interact with V-pipe. A recording of our webinar covering the subject is available at the bottom of the current page.
+This tutorial shows the basics of how to interact with V-pipe.

-For the purpose of this Tutorial, we will work with the sars-cov2 branch which is adapted for the SARS-CoV-2 virus.
+For the purpose of this tutorial, we will work with the master branch of V-pipe and use the _sars-cov-2_ virus base config, which is adapted for the SARS-CoV-2 virus.


## Organizing Data

V-pipe expects the input samples to be organized in a two-level hierarchy:

-At the first level, input files grouped by samples (e.g.: patients or biological replicates of an experiment).
-A second level for distinction of datasets belonging to the same sample (e.g.: sample dates).
-Inside that directory, the sub-directory raw_data holds the sequencing data in FASTQ format (optionally compressed with GZip).
-Paired-ended reads need to be in split files with _R1 and _R2 suffixes.
+- At the first level, input files are grouped by samples (e.g.: patients or biological replicates of an experiment).
+- A second level distinguishes the datasets belonging to the same sample (e.g.: sample dates).
+- Inside that directory, the sub-directory `raw_data` holds the sequencing data in FASTQ format (optionally compressed with GZip).
+- Paired-end reads need to be in split files with `_R1` and `_R2` suffixes.


## Preparing a small dataset
@@ -38,9 +38,7 @@ You can run the first test on your workstation or a good laptop.

First, you need to prepare the data:

-* For that test, you need to download the following runs from SRA: SRR10903401 and SRR10903402
-
-If you have difficulties, check this shared directory. You can obtain there a copy of the .fastq files. More information on the steps necessary to generate the .fastq files from SRA can be found in the README.md file.
+* For that test, you need to download the following runs from SRA: [SRR10903401](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR10903401) and [SRR10903402](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR10903402)

```bash
mkdir -p samples/SRR10903401/20200102/raw_data
@@ -66,7 +64,23 @@
mv samples/SRR10903402/20200102/raw_data/SRR10903402_2.fastq samples/SRR10903402/20200102/raw_data/SRR10903402_R2.fastq
```

-The downloaded files will have the following structure:
+The downloaded files should have the following structure:

```text
📁samples
├───📁SRR10903401
│ └───📁20200102
│ └───📁raw_data
│ ├───🧬SRR10903401_R1.fastq
│ └───🧬SRR10903401_R2.fastq
└───📁SRR10903402
└───📁20200102
└───📁raw_data
├───🧬SRR10903402_R1.fastq
└───🧬SRR10903402_R2.fastq
```

You can display the directory structure with the following command on Linux (on Mac OS, use `find samples`):

```bash
tree samples
```

@@ -77,14 +91,14 @@

V-pipe uses the [Bioconda](https://bioconda.github.io/) bioinformatics software repository for all its pipeline components. The pipeline itself is written using [Snakemake](https://snakemake.readthedocs.io/en/stable/).

-For advanced users: If your are fluent with these tools, you can:
-
-* directly download and install [bioconda](https://bioconda.github.io/user/install.html) and [snakemake](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda),
-* make sure to configure V-pipe to use the `sars-cov-2` virus-config
-* and start using V-pipe with them, using the --use-conda to [automatically download and install](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management) any further pipeline dependencies.
-* please refer to the documentation for additional instructions.
-
-The present tutorial will show simplified commands that automate much of this process.
+> **For advanced users:** If you are fluent with these tools, you can:
+>
+> * directly download and install [bioconda](https://bioconda.github.io/user/install.html) and [snakemake](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda),
+> * make sure to configure V-pipe to use the `sars-cov-2` virus config,
+> * and start using V-pipe with them, using the `--use-conda` flag to [automatically download and install](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management) any further pipeline dependencies.
+> * please refer to the [documentation](https://github.com/cbg-ethz/V-pipe/blob/master/README.md) for additional instructions.
+>
+> The present tutorial will show simplified commands that automate much of this process.
To deploy V-pipe, you can use the installation script with the following parameters:

Expand All @@ -96,24 +110,24 @@ bash quick_install.sh -p tutorial -w work
* using `-p` specifies the subdirectory where to download and install snakemake and V-pipe
* using `-w` will create a working directory and populate it. It will copy over the references and the default `config/config.yaml`, and create a handy `vpipe` short-cut script to invoke `snakemake`.

-Tip: To create and populate other new working directories, you can call init_project.sh from within the new directory:
-
-```console
-mkdir -p working_2
-cd working_2
-../V-pipe/init_project.sh
-```
+> **Tip:** To create and populate other new working directories, you can call init_project.sh from within the new directory:
+>
+> ```console
+> mkdir -p working_2
+> cd working_2
+> ../V-pipe/init_project.sh
+> ```
## Running V-pipe
-Copy the samples directory you created in the step Preparing a small dataset to this working directory. (You can display the directory structure with `tree samples` or `find samples`.)
+Copy the samples directory you created in the step [Preparing a small dataset](#preparing-a-small-dataset) to this working directory. (You can display the directory structure with `tree samples` or `find samples`.)
```bash
mv samples tutorial/work/
```
-Prepare V-pipe's configuration:
+Prepare V-pipe's configuration. You can find more information in [the documentation](https://github.com/cbg-ethz/V-pipe/blob/master/config/README.md). In your local V-pipe installation, you will also find an exhaustive manual about all the configuration options inside `config/config.html`.

```bash
cat <<EOT > tutorial/work/config.yaml
@@ -155,8 +169,7 @@ SRR10903402 20200102 150
EOT
```

-Tip: Always check the content of the `samples.tsv` file.
+**Tip:** Always check the content of the `samples.tsv` file.
If you didn’t use the correct structure, this file might end up empty or some entries might be missing.
You can safely delete it and re-run with `--dryrun` to regenerate it.
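As an illustration of that check, the following sketch rebuilds a `samples.tsv` with the same content as the heredoc above and verifies that every line has exactly three tab-separated fields (sample name, date, read length):

```shell
# illustrative sanity check; the file content mirrors the tutorial's heredoc
mkdir -p tutorial/work
printf 'SRR10903401\t20200102\t150\nSRR10903402\t20200102\t150\n' > tutorial/work/samples.tsv
# every line should have exactly three tab-separated fields
awk -F'\t' 'NF != 3 { bad = 1 } END { exit bad }' tutorial/work/samples.tsv \
  && echo "samples.tsv looks well-formed"
```

If some rows come out with fewer fields, the samples directory layout was probably not the expected two-level hierarchy.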

@@ -170,7 +183,9 @@ cd tutorial/work/

## Output

-The Wiki contains an overview of the output files. The output of the SNV calling is aggregated in a standard [VCF](https://en.wikipedia.org/wiki/Variant_Call_Format) file, located in `samples/{hierarchy}/variants/SNVs/snvs.vcf`, you can open it with your favorite VCF tools for visualisation or downstream processing. It is also available in a tabular format in `samples/{hierarchy}/variants/SNVs/snvs.csv`.
+The section _output_ of the exhaustive configuration manual contains an overview of the output files.
+The output of the SNV calling is aggregated in a standard [VCF](https://en.wikipedia.org/wiki/Variant_Call_Format) file located in `results/`_{hierarchy}_`/variants/SNVs/snvs.vcf`; you can open it with your favorite VCF tools for visualisation or downstream processing.
+It is also available in a tabular format in `results/`_{hierarchy}_`/variants/SNVs/snvs.csv`.
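To give an idea of what working with that VCF looks like, here is a sketch operating on a toy two-record VCF written inline. The real file only exists after a full run, under `results/`_{hierarchy}_`/variants/SNVs/`; the accession and positions below are purely illustrative:

```shell
# toy stand-in for results/<hierarchy>/variants/SNVs/snvs.vcf, for illustration only
cat > snvs.vcf <<'EOF'
##fileformat=VCFv4.0
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
NC_045512.2	241	.	C	T	.	PASS	.
NC_045512.2	3037	.	C	T	.	PASS	.
EOF
# count variant records (data lines are those not starting with '#')
grep -vc '^#' snvs.vcf
# -> 2
```

The same pattern (skipping `#`-prefixed header lines) is what most quick command-line inspections of a VCF boil down to.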

### Expected output

@@ -198,7 +213,34 @@
It is possible to ask snakemake to submit jobs on a cluster using the batch submission command-line interface of your cluster.
-The platform LSF by IBM is one of the popular systems you might find (Others include SLURM, Grid Engine).
+The open-source platform SLURM by SchedMD is one of the popular systems you might find on clusters (others include LSF and Grid Engine).
The most user friendly way to submit jobs to the cluster is using a special _snakemake profile_.
[smk-simple-slurm](https://github.com/jdblischak/smk-simple-slurm) is a profile that, in our experience, works well with SLURM (for other platforms, see suggestions in [the snakemake-profiles documentation](https://github.com/snakemake-profiles/doc)).

```console
cd tutorial/
git clone https://github.com/jdblischak/smk-simple-slurm.git
cd work/
./vpipe --dry-run --profile ../smk-simple-slurm --jobs 100
cd ../..
```

Snakemake's documentation [introduces the key concepts used in profiles](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles).
Check also [the other options for running snakemake on clusters](https://snakemake.readthedocs.io/en/stable/executing/cli.html#CLUSTER) if you need more advanced usage.
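For reference, a profile is just a directory containing a `config.yaml` of command-line options. The sketch below is a minimal, hand-rolled example; the `sbatch` options and resource placeholders are assumptions to adapt to your cluster, and smk-simple-slurm above ships a far more complete version:

```shell
# minimal illustrative profile; option values are assumptions, adjust to your cluster
mkdir -p tutorial/my-slurm-profile
cat > tutorial/my-slurm-profile/config.yaml <<'EOF'
cluster: sbatch --time={resources.runtime} --mem={resources.mem_mb} --cpus-per-task={threads}
jobs: 100
use-conda: true
EOF
```

It would then be passed to V-pipe the same way as above: `./vpipe --profile ../my-slurm-profile`.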

### Dependencies downloading on the cluster

In addition, Snakemake has [parameters for conda](https://snakemake.readthedocs.io/en/stable/executing/cli.html#CONDA) that can help manage dependencies:

- using `--conda-create-envs-only` downloads the dependencies without running the pipeline itself. This is very useful if the compute nodes of your cluster are not allowed internet access.
- using `--conda-prefix=`_{DIR}_ stores the conda environments of dependencies in a common directory (which can thus be shared and re-used between multiple instances of V-pipe).

```console
cd tutorial/work/
./vpipe --conda-prefix ../snake-envs --cores 1 --conda-create-envs-only
cd ../..
```

...TODO...
When using V-pipe in production environments, plan the installer's `-p` prefix and `-w` working directories, and snakemake's `--conda-prefix` environments directory, according to your cluster's quotas and time limits.
For example, consider using `${SCRATCH}` and moving only the content of the `results/` directory to long-term storage.
