diff --git a/docs/part2/00_intro.md b/docs/part2/00_intro.md
index 8d1af35..af3bde2 100644
--- a/docs/part2/00_intro.md
+++ b/docs/part2/00_intro.md
@@ -63,17 +63,11 @@ You decide to use Nextflow.

2. Inspect the scripts (open in a VSCode tab, or text editor in the terminal).

-Each script runs a single data processing step and are run in order of the prefixed number.
+Each script runs a single data processing step, and the scripts are run in order of their prefixed number.
+
+ !!! quote "Poll"
- What are some limitations of these scripts in terms of running them in a
- pipeline and monitoring it?
-
- ??? note "Solution"
-
- * **No parallelism**: processes run iteratively, increasing overall runtime and limiting scalability.
- * **No error handling**: if a step fails, may propagate errors or incomplete results into subsequent steps.
- * **No caching**: if a step fails, you either need to re-run from the beginning or edit the script to exclude the files that have already run.
- * **No resource management**: inefficient resource usage, no guarantee processes are able to access the CPU, RAM, disk space they need.
- * **No software management**: assumes same environment is available every time it is run.
+ What are some limitations of these scripts in terms of running them in a
+ pipeline and monitoring it?

## 2.0.3 Our workflow: RNAseq data processing

diff --git a/docs/part2/01_salmon_idx.md b/docs/part2/01_salmon_idx.md
index a8de86a..cb81432 100644
--- a/docs/part2/01_salmon_idx.md
+++ b/docs/part2/01_salmon_idx.md
@@ -148,11 +148,11 @@

Nextflow also handles whether the directory already exists or if it should be
created. In the `00_index.sh` script you had to manually make a results
-directory with `mkdir -p "results`.
+directory with `mkdir -p "results"`.

- More information and other modes can be found on [publishDir](https://www.nextflow.io/docs/latest/process.html#publishdir).
->
+You now have a complete process!
+
## 2.1.3 Using containers

Nextflow recommends using containers to ensure reproducibility and portability
@@ -233,14 +233,27 @@ pre-installed on your Virtual Machine.
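The way a process opts in to a container is worth seeing up front. Below is a stripped-down sketch (not the full process you will build in this lesson) showing the `container` directive pointing at the salmon image used in this part:

```groovy
// Minimal sketch only: a process that declares the image it runs in.
// The image below is the salmon biocontainer used later in this part.
process INDEX {
    container "quay.io/biocontainers/salmon:1.10.1--h7e5ed60_0"

    script:
    """
    salmon --version
    """
}
```

With Docker enabled in the configuration (covered next), every task of this process executes inside that image rather than in the host environment.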
We can tell Nextflow to run containers with Docker by using the `nextflow.config` file.

-Create a `nextflow.config` file in the same directory as `main.nf` and add the
-following line:
+Create a `nextflow.config` file in the same directory as `main.nf`.
+
+!!! note
+
+    You can create the file via the VSCode Explorer (left sidebar) or in the
+    terminal with a text editor.
+
+    If you are using the Explorer, right click on `part2` in the sidebar and
+    select **"New file"**.
+
+Add the following line to your config file:

```groovy linenums="1" title="nextflow.config"
docker.enabled = true
```

-You now have a complete process!
+You have now configured Nextflow to use Docker.
+
+!!! tip
+
+    Remember to save your files after editing them!

## 2.1.4 Adding `params` and the workflow scope

@@ -282,7 +295,7 @@

If we need to run our pipeline with a different transcriptome file, we can
overwrite this default in our execution command with the `--transcriptome`
double-hyphen flag.

-Next, add the workflow scope at the bottom of you `main.nf` after the process:
+Next, add the workflow scope at the bottom of your `main.nf` after the process:

```groovy title="main.nf"
// Define the workflow
@@ -296,8 +309,11 @@ workflow {

This will tell Nextflow to run the `INDEX` process with
`params.transcriptome_file` as input.

-> Note about adding comments, Part 1 suggests developers choice
-rather than fixed comments
+!!! tip "Tip: Your own comments"
+
+    As a developer you can choose how and where to comment your code!
+    Feel free to modify or add to the provided comments to record useful
+    information about the code you are writing.

We are now ready to run our workflow!

@@ -354,15 +370,16 @@ arguments inside a process.

Instead of trying to infer how the variable is being defined and applied to
the process, let’s use the hidden command files saved for this task in the
work/ directory.

-Open the `work/` directory:
+!!!
question "Exercise"

-```bash
-Image/instrctions for how to find command
-```
+    1. Navigate to the `work/` directory and open the `.command.sh` file.
+    2. Compare the `.command.sh` file with `00_index.sh`.

-!!! question "Exercise"
-    Inspect the `.command.sh` file and compare it with `00_index.sh`. A question for attendees to answer.
+    !!! quote "Poll"
+
+        Why do we no longer see hardcoded file paths like `results/salmon_index` and `data/ggal/transcriptome.fa` in `.command.sh`?
+

!!! abstract "Summary"

diff --git a/docs/part2/02_fastqc.md b/docs/part2/02_fastqc.md
index fd8a8cb..e092f83 100644
--- a/docs/part2/02_fastqc.md
+++ b/docs/part2/02_fastqc.md
@@ -89,9 +89,10 @@ It contains:

* The empty `output:` block for us to define the output data for the process.
* The `script:` block prefilled with the command that will be executed.

-> Note about ${}
-> Consider making consistent use of capital letters vs lowercase
+!!! info "Dynamic naming"

+    Recall that curly brackets are used to pass variables as part of a file name.
+

### 2. Define the process `output`

Unlike `salmon` from the previous process, `fastqc` requires that the output
@@ -190,13 +191,9 @@ or metadata that needs to be processed together.

    Working with samplesheets is particularly useful when you have a combination of files and metadata that need to be assigned to a sample in a flexible manner. Typically, samplesheets are written in comma-separated (`.csv`) or tab-separated (`.tsv`) formats.
-    We reccommend using comma-separated files as they are less error prone and easier to read and write.
+    We recommend using comma-separated files as they are less error-prone and easier to read and write.

-Inspect our samplesheet:
-
-```bash
-cat data/samplesheet.csv
-```
+Let's inspect `data/samplesheet.csv` with VSCode.

```console title="Output"
sample,fastq_1,fastq_2

@@ -237,24 +234,7 @@ for the process we just added.

!!!
info "Using samplesheets with Nextflow can be tricky business" There are currently no Nextflow operators specifically designed to handle samplesheets. As such, we Nextflow workflow developers have to write custom parsing logic to read and split the data. This adds complexity to our workflow development, especially when trying to handle tasks like parallel processing of samples or filtering data by sample type. -We won't explore the logic of constructing our samplesheet input channel in depth in this lesson. The key takeaway here is to understand that using samplesheets is best practice for reading grouped files and metadata into Nextflow, and that operators and groovy needs to be chained together to get these in the correct format. The best way to do this is using a combination of Groovy and Nextflow operators. - -Our samplesheet input channel has used common [Nextflow operators](https://www.nextflow.io/docs/latest/operator.html): - -```bash -// Define the fastqc input channel -Channel.fromPath(params.reads) - .splitCsv(header: true) - .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] } - .view() -``` - -* `.fromPath` creates a channel from one or more files matching a given path or pattern (to our `.csv` file, provided with the `--reads` parameter). -* `splitCsv` splits the input file into rows, treating it as a CSV (Comma-Separated Values) file. The `header: true` option means that the first row of the CSV contains column headers, which will be used to access the values by name. -* `map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }` uses some Groovy syntax to transform each row of the CSV file into a list, extracting the sample value, `fastq_1` and `fastq_2` file paths from the row. -* `.view()` is a debugging step that outputs the transformed data to the console so we can see how the channel is structured. Its a great tool to use when building your channels. 
-
-Add the following to your workflow scope above where `INDEX` is called:
+Add the following to your workflow scope below where `INDEX` is called:

```groovy title="main.nf" hl_lines="7-12"
// Define the workflow
workflow {

@@ -272,6 +252,19 @@ workflow {

}
```

+We won't explore the logic of constructing our samplesheet input channel in depth in this lesson. The key takeaway here is to understand that using samplesheets is best practice for reading grouped files and metadata into Nextflow, and that Nextflow operators and Groovy code need to be chained together to get these into the correct format.
+
+Our samplesheet input channel has used common [Nextflow operators](https://www.nextflow.io/docs/latest/operator.html):
+
+* `.fromPath` creates a channel from one or more files matching a given path or pattern (to our `.csv` file, provided with the `--reads` parameter).
+* `splitCsv` splits the input file into rows, treating it as a CSV (Comma-Separated Values) file. The `header: true` option means that the first row of the CSV contains column headers, which will be used to access the values by name.
+* `map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }` uses some Groovy syntax to transform each row of the CSV file into a list, extracting the sample value, `fastq_1` and `fastq_2` file paths from the row.
+* `.view()` is a debugging step that outputs the transformed data to the console so we can see how the channel is structured. It's a great tool to use when building your channels.
+
+??? tip "Tip: using the `view()` operator for testing"
+
+    The [`view()`](https://www.nextflow.io/docs/latest/operator.html#view) operator is a useful tool for debugging Nextflow workflows. It allows you to inspect the data structure of a channel at any point in the workflow, helping you to understand how the data is being processed and transformed.
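As an aside, `view()` can also take a closure to format what gets printed. A small illustrative sketch (standalone, not part of the workshop script):

```groovy
// Sketch: print only the sample ID from each emitted tuple
Channel
    .of(['gut', 'gut_1.fq', 'gut_2.fq'])
    .view { row -> "sample: ${row[0]}" }
```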
+
+Run the workflow with the `-resume` flag:

```bash
@@ -300,14 +293,14 @@ definition of `tuple val(sample_id), path(reads_1), path(reads_2)`:

[gut, /home/setup2/hello-nextflow/part2/data/ggal/gut_1.fq, /home/setup2/hello-nextflow/part2/data/ggal/gut_2.fq]
```

-Next, we need to assign this output to a variable so it can be passed to the `FASTQC`
+!!! quote "Checkpoint"
+
+    Zoom react Y/N
+
+Next, we need to assign the channel we create to a variable so it can be passed to the `FASTQC`
process. Assign to a variable called `reads_in`, and remove the `.view()`
operator as we now know what the output looks like.

-??? Tip "Tip: using the `view()` operator for testing"
-
-    The [`view()`](https://www.nextflow.io/docs/latest/operator.html#view) operator is a useful tool for debugging Nextflow workflows. It allows you to inspect the data structure of a channel at any point in the workflow, helping you to understand how the data is being processed and transformed.
-

```groovy title="main.nf" hl_lines="8-11"
// Define the workflow
workflow {

@@ -412,9 +405,10 @@ for each of the `.fastq` files.

!!! abstract "Summary"

-    In this step you have learned:
+    In this lesson you have learned:

    1. How to implement a process with a tuple input
    2. How to construct an input channel using operators and Groovy
-    3. How to use the `view()` operator to inspect the structure of a channel
-    4. How to use a samplesheet to scale your workflow across multiple samples
\ No newline at end of file
+    3. How to use the `.view()` operator to inspect the structure of a channel
+    4. How to use the `-resume` flag to skip successful tasks
+    5. How to use a samplesheet to read in grouped samples and metadata

diff --git a/docs/part2/03_quant.md b/docs/part2/03_quant.md
index 295ee26..579d329 100644
--- a/docs/part2/03_quant.md
+++ b/docs/part2/03_quant.md
@@ -3,9 +3,8 @@

!!! note "Learning objectives"

    1. Implement a process with multiple input channels.
+    2.
Understand the importance of creating channels from process outputs. 3. Implement chained Nextflow processes with channels. - 4. Understand the components of a process such as `input`, `output`, - `script`, `directive`, and the `workflow` scope. In this lesson we will transform the bash script `02_quant.sh` into a process called `QUANTIFICATION`. This step focuses on the next phase of RNAseq data processing: quantifying the expression of transcripts relative to the reference transcriptome. @@ -70,7 +69,7 @@ It contains: * Prefilled process directives `container` and `publishDir`. * The empty `input:` block for us to define the input data for the process. * The empty `output:` block for us to define the output data for the process. -* The `script:` block prefilled with the command that will be executed. +* The empty `script:` block for us to define the script for the process. ### 2. Define the process `script` @@ -106,10 +105,6 @@ The `--libType=U` is a required argument and can be left as is for the script de The `output` is a directory of `$sample_id`. In this case, it will be a directory called `gut/`. Replace `< process outputs >` with the following: -``` -path "$sample_id" -``` - ```groovy title="main.nf" hl_lines="9" process QUANTIFICATION { container "quay.io/biocontainers/salmon:1.10.1--h7e5ed60_0" @@ -186,6 +181,12 @@ process QUANTIFICATION { } ``` +!!! info "Matching process inputs" + + Recall that the number of inputs in the process input block and the workflow must match! + + If you have multiple inputs they need to be listed across multiple lines in the input block and listed inside the brackets in the workflow block. + You have just defined a process with multiple inputs! ### 5. Call the process in the `workflow` scope @@ -195,12 +196,9 @@ Recall that the inputs for the `QUANTIFICATION` process are emitted by the is ready to be called by the `QUANTIFICATION` process. 
Similarly, we need to prepare a channel for the index files output by the `INDEX` process. -Add the following channel to your `main.nf` file, after the `reads_in` channel: -``` -transcriptome_index_in = INDEX.out[0] -``` +Add the following channel to your `main.nf` file, after the `FASTQC` process: -```groovy title="main.nf" hl_lines="12-14" +```groovy title="main.nf" hl_lines="15-16" // Define the workflow workflow { @@ -212,6 +210,9 @@ workflow { .splitCsv(header: true) .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] } + // Run the fastqc step with the reads_in channel + FASTQC(reads_in) + // Define the quantification channel for the index files transcriptome_index_in = INDEX.out[0] @@ -226,10 +227,7 @@ workflow { Call the `QUANTIFICATION` process in the workflow scope and add the inputs by adding the following line to your `main.nf` file after your `transcriptome_index_in` channel definition: -```bash -QUANTIFICATION(transcriptome_index_in, reads_in) -``` -```groovy title="main.nf" hl_lines="15-17" +```groovy title="main.nf" hl_lines="18-19" // Define the workflow workflow { @@ -241,8 +239,11 @@ workflow { .splitCsv(header: true) .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] } + // Run the fastqc step with the reads_in channel + FASTQC(reads_in) + // Define the quantification channel for the index files - transcriptome_index_in = INDEX.out + transcriptome_index_in = INDEX.out[0] // Run the quantification step with the index and reads_in channels QUANTIFICATION(transcriptome_index_in, reads_in) @@ -252,14 +253,14 @@ workflow { By doing this, we have passed two arguments to the `QUANTIFICATION` process as there are two inputs in the `process` definition. -> Add note about tuples being a single input? 
-

Run the workflow:

```bash
nextflow run main.nf -resume
```

+Your output should look like:
+
```console title="Output"
Launching `main.nf` [shrivelled_cuvier] DSL2 - revision: 4781bf6c41

executor > local (1)

@@ -270,13 +271,16 @@

executor > local (1)
```

-You now have a `results/gut` folder, with an assortment of files and
-directories.
+A new `QUANTIFICATION` task has been successfully run and you now have a `results/gut`
+folder, with an assortment of files and directories.

!!! abstract "Summary"

-    In this step you have learned:
+    In this lesson you have learned:
+
+    1. How to define a process with multiple input channels
+    2. How to access a process output with `.out`
+    3. How to create a channel from a process output
+    4. How to chain Nextflow processes with channels
+

-    1. How to
-    1. How to
-    1. How to

diff --git a/docs/part2/04_multiqc.md b/docs/part2/04_multiqc.md
index 64a738d..3d94b52 100644
--- a/docs/part2/04_multiqc.md
+++ b/docs/part2/04_multiqc.md
@@ -4,7 +4,6 @@

    1. Implement a channel that combines the contents of two channels.
    2. Implement a process with multiple output files.
-    3. Improve execution logging with process directives and groovy.

In this step we will transform the `03_multiqc.sh` into a process called `MULTIQC`.
This step focuses on the final step of our RNAseq data processing workflow: generating

@@ -30,7 +29,7 @@ Open the bash script `03_multiqc.sh`.

multiqc --outdir results/ results/
```

-This script is a lot simpler than previous scripts, we've worked with. It searches searches for the output files generated by the `FASTQC` and `QUANTIFICATION` processes saved to the `results/` directory. As specified by `--outdir results/`, it will output two MultiQC files:
+This script is a lot simpler than previous scripts we've worked with. It searches for the output files generated by the `FASTQC` and `QUANTIFICATION` processes saved to the `results/` directory. As specified by `--outdir results/`, it will output two MultiQC files:

1.
A directory called `multiqc_data/`
2. A report file called `multiqc_report.html`

@@ -39,14 +38,13 @@

### 1. Process directives, `script`, and `input`

-Here is the empty `process` template with the `container` and `publishDir`
-directives we'll be using to get you started. Add this to your `main.nf` after the `QUANTIFICATION` process.
+Here is the `process` template with the `container` and `publishDir`
+directives provided. Add this to your `main.nf` after the `QUANTIFICATION` process:

```groovy title="main.nf"
process MULTIQC {

    container "quay.io/biocontainers/multiqc:1.19--pyhdfd78af_0"
-    publishDir params.outdir, mode: 'copy'
+    publishDir "results", mode: 'copy'

    input:
    path "*"

@@ -62,13 +60,9 @@

```

The `script` and `input` follow the MultiQC Nextflow
-[integration recommendations](https://multiqc.info/docs/usage/pipelines/#nextflow).
-
-> Probably need another diagram and explanation on why the script is thw way
it is
-
-> Refer back to staging from Part 1, and that the channel/`.collect` deals
with this. More stable using channels to deal with paths vs. directory input
+[integration recommendations](https://multiqc.info/docs/usage/pipelines/#nextflow).
+The key thing to note here is that MultiQC needs to be run once for all
+upstream outputs.

From the information above we know that the input for `multiqc` is the
`results/` directory, specifically, the files and directories within
@@ -76,7 +70,7 @@
(`fastqc_gut_logs/`) and `QUANTIFICATION` (`gut/`) processes into a single
channel as input to `MULTIQC`.

-??? warning "Why you should NOT use the `publishDir` folder as a process input"
+!!!
warning "Why you should NOT use the `publishDir` folder as a process input"

    It might make sense to have the `results/` folder (set by `publishDir`)
    as the input to the process here, but it may not exist until the workflow

@@ -92,22 +86,15 @@ channel as input to `MULTIQC`.

    More on this in the next section.

-!!! exercise "Exercise"
-    Think of something
-
### 2. Define the process `output`

-Next, add the `output` definition to the `MULTIQC` process.
-
-```bash
-path "multiqc_report.html"
-path "multiqc_data"
-```
-MultiQC output consists of the following:
+The MultiQC output consists of the following:

* An HTML report file called `multiqc_report.html`
* A directory called `multiqc_data/` containing the data used to generate the report.

+Add the following `output` definition to the `MULTIQC` process:
+
```groovy title="main.nf" hl_lines="10-11"
process MULTIQC {

@@ -130,34 +117,33 @@ process MULTIQC {

## 2.4.2 Combining channels with operators

-The goal of this step is to combine outputs from `FASTQC` and `QUANTIFICATION` processes into a single input channel for the `MULTIQC` process. These tools are both supported by MultiQC and their outputs can be detected automatically by MultiQC.
-
-!!! question "Exercise"
+!!! tip

-    Which channels output:
+    MultiQC needs to be run once on all of the upstream input files
+    so that a single report is generated containing all the results.

-    1. `fastqc_gut_logs/`
-    2. `gut/`
+In this case, the input files for the `MULTIQC` process are outputs from the
+`FASTQC` and `QUANTIFICATION` processes. Both FastQC and Salmon are supported
+by MultiQC and the required files are detected automatically by the program
+(when using it in a Nextflow pipeline, there is some pre-processing that needs
+to be done).

-    ??? note "Solution"
-
-        `fastqc_ch` and `quant_ch`.
+The goal of this step is to bring the outputs from the `FASTQC` and
+`QUANTIFICATION` processes into a single input channel for the `MULTIQC`
+process.
This ensures that MultiQC is run once. -The next few steps will involve chaining together Nextflow operators to correctly format inputs for the `MULTIQC` process. +The next few additions will involve chaining together Nextflow operators to +correctly format inputs for the `MULTIQC` process. -In the workflow scope, use the -[`mix`](https://www.nextflow.io/docs/latest/operator.html#mix) operator to -emit the contents of `fastqc_ch` and `quant_ch` in a single channel. +!!! quote "Poll" -Add the following to the workflow block in your `main.nf` file, under the quantification process. View it using the `view()` operator: + What Nextflow input type (qualifier) ensures that inputs are grouped and + processed together? -```bash -multiqc_in = FASTQC.out[0] - .mix(QUANTIFICATION[0]) - .view() -``` +Add the following to the workflow block in your `main.nf` file, under the +`QUANTIFICATION` process. -```groovy title="main.nf" hl_lines="18-21" +```groovy title="main.nf" hl_lines="21-24" // Define the workflow workflow { @@ -169,6 +155,9 @@ workflow { .splitCsv(header: true) .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] } + // Run the fastqc step with the reads_in channel + FASTQC(reads_in) + // Define the quantification channel for the index files transcriptome_index_in = INDEX.out[0] @@ -176,8 +165,8 @@ workflow { QUANTIFICATION(transcriptome_index_in, reads_in) // Define the multiqc input channel - multiqc_in = FASTQC.out[0] - .mix(QUANTIFICATION[0]) + FASTQC.out[0] + .mix(QUANTIFICATION.out[0]) .view() } @@ -186,10 +175,13 @@ workflow { This channel creates a tuple with the two inputs as elements: - Takes the output of `FASTQC`, using element `[0]` to refer to the first element of the output. -- Uses `mix(QUANTIFICATION)[0]` to combine `FASTQC` output with the first element of the `QUANTIFICATION` output. +- Uses `mix(QUANTIFICATION.out[0])` to combine `FASTQC.out[0]` output with the first element of the `QUANTIFICATION` output. 
-- Uses `view()` allows us to see the values emitted into the channel.
+- Uses `view()` to show us the values emitted into the channel.

-Run the workflow:
+For more information, see the documentation on
+[`mix`](https://www.nextflow.io/docs/latest/operator.html#mix).
+
+Run the workflow to see what it produces:

```bash
nextflow run main.nf -resume
```

@@ -211,11 +203,18 @@

-The outputs have been emitted one after the other, meaning that it will be
-processed separately. We need them to be processed together (generated in the
-same MultiQC report), so we need to add one more step.
+The outputs have been emitted one after the other, meaning that they will be
+processed separately. We need them to be processed together (generated in the
+same MultiQC report), so we need to add one more step.

+!!! note
+
+    Note that the outputs point to the files in the work directories, rather than
+    the `publishDir`. This is one of the ways that Nextflow ensures all input files
+    are ready and maintains proper workflow control.
+
+

Add the [`collect`](https://www.nextflow.io/docs/latest/operator.html#collect)
operator to ensure all samples are processed together in the same process and
view the output:

-```groovy title="main.nf" hl_lines="20"
+```groovy title="main.nf" hl_lines="24"
// Define the workflow
workflow {

@@ -227,6 +226,9 @@

        .splitCsv(header: true)
        .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }

+    // Run the fastqc step with the reads_in channel
+    FASTQC(reads_in)
+
    // Define the quantification channel for the index files
    transcriptome_index_in = INDEX.out[0]

@@ -234,7 +236,7 @@

    QUANTIFICATION(transcriptome_index_in, reads_in)

    // Define the multiqc input channel
-    multiqc_in = FASTQC.out[0]
+    FASTQC.out[0]
-        .mix(QUANTIFICATION[0])
+        .mix(QUANTIFICATION.out[0])
        .collect()
        .view()

@@ -248,7 +250,7 @@ Run the workflow:

nextflow run main.nf -resume
```

-The channel now outputs a single tuple with the two directories.
+The channel now outputs a single tuple with the two directories:

```console title="Output"
Launching `main.nf` [small_austin] DSL2 - revision: 6ab927f137

@@ -260,14 +262,17 @@

```

-!!!
question "Exercise" +Now that we have a channel that emits the correct data, add the finishing +touches to the workflow scope. + +!!! question "Exercise: Assign the input channel" - Now that we have a channel that emits the correct data, remove `.view()` - and assign the channel to a variable called `multiqc_in`. + 1. Assign the chain of operations to a channel called `multiqc_in` + 2. Remove the `.view()` operator ??? note "Solution" - ```groovy title="main.nf" hl_lines="7 8 11" + ```groovy title="main.nf" hl_lines="8 11" // Define the quantification channel for the index files transcriptome_index_in = INDEX.out[0] @@ -279,13 +284,13 @@ Launching `main.nf` [small_austin] DSL2 - revision: 6ab927f137 .mix(QUANTIFICATION[0]) .collect() + } ``` -We are now ready to call the `MULTIQC` process in the `workflow`. +!!! question "Exercise: call the `MULTIQC` process" -!!! question "Exercise" - - Add the `MULTIQC` process in the workflow scope with `multiqc_in` as input. + 1. Add the `MULTIQC` process in the workflow scope + 2. Pass the `multiqc_in` channel as input. ??? note "Solution" @@ -303,7 +308,8 @@ We are now ready to call the `MULTIQC` process in the `workflow`. // Run the multiqc step with the multiqc_in channel MULTIQC(multiqc_in) - + + } ``` Run the workflow: @@ -312,7 +318,8 @@ Run the workflow: nextflow run main.nf -resume ``` -The output should look something like: +Your output should look something like: + ```console title="Output" Launching `main.nf` [hopeful_swanson] DSL2 - revision: a4304bbe73 @@ -323,16 +330,39 @@ Launching `main.nf` [hopeful_swanson] DSL2 - revision: a4304bbe73 ``` -> Inspect `results/multiqc_report.html`, maybe Poll something in the file +## 2.4.3 Inspecting the MultiQC report + +Let's inspect the generated MultiQC report. You will need to download the file +to your local machine and open it in a web browser. + +!!! question "Exercise" + + 1. In the VSCode Explorer sidebar, locate the report `results/multiqc_report.html` + 2. 
Right click on the file, and select **"Download"**
+    3. Open the file in a web browser
+
+    !!! quote "Poll"
+
+        Under the **"General Statistics"** section, how many samples (i.e. rows) have been
+        included in the table?
+
+!!! tip
+
+    If you have to view many files (e.g. `.html`) on a remote server, we recommend using the
+    [Live Server](https://marketplace.visualstudio.com/items?itemName=ritwickdey.LiveServer)
+    VSCode extension.
+
+    The extension allows you to view `.html` files within a VSCode tab instead
+    of manually downloading files locally.

You have a working pipeline for a single paired-end sample!

+
!!! abstract "Summary"

-    In this step you have learned:
+    In this lesson you have learned:

-    1. How to
-    1. How to
-    1. How to
-    1. How to
-    1. How to
+    1. How to implement a process following integration recommendations
+    2. How to define a process with multiple output files
+    3. How to use the `mix` and `collect` operators to combine outputs into a single tuple
+    4. How to access and view `.html` files from a remote server

diff --git a/docs/part2/05_scale.md b/docs/part2/05_scale.md
index c4bea10..6fdc8f8 100644
--- a/docs/part2/05_scale.md
+++ b/docs/part2/05_scale.md
@@ -32,10 +32,6 @@ Add the following `tag` directives to your existing `FASTQC` and

For `FASTQC`:

-```bash
-tag "fastqc on ${sample_id}"
-```
-
```groovy title="main.nf" hl_lines="2"
process FASTQC {
    tag "fastqc on ${sample_id}"

@@ -46,9 +42,6 @@

And for `QUANTIFICATION`:

-```bash
-tag "salmon on ${sample_id}"
-```
```groovy title="main.nf" hl_lines="2"
process QUANTIFICATION {
    tag "salmon on ${sample_id}"

@@ -78,6 +71,9 @@ executor > local (5)

[a3/1f885c] MULTIQC | 1 of 1, cached: 1 ✔
```

+No new tasks were run, but the `FASTQC` and `QUANTIFICATION` processes now have
+labels appended in the execution output.
+ ## 2.5.2 Using a samplesheet with multiple samples Recall that the samplesheet is used to control which files/data are analysed by @@ -119,37 +115,33 @@ executor > local (5) There are two new tasks run for `FASTQC` and `QUANTIFICATION`. Our newly added tags indicate which samples they were run on - either `lung` or `liver` reads! -> Note: Decide exercises +!!! note -!!! question "Optional Exercise" + Updating the `params.reads` definition in your `main.nf` script can save + having to add the `--reads` flag every time you want to run it with a + different samplesheet. - Update your `params.reads` definition in `main.nf` so it takes - `samplesheet_full.csv` instead of `samplesheet.csv`. +!!! example "Advanced Exercise" - ??? note "Solution" + 1. Update the workflow scope to inspect the output of the `reads_in` channel (i.e. with `.view()`) + 2. Run the workflow with `samplesheet_full.csv` - ```groovy title="main.nf" hl_lines="3" - //pipeline input parameters - params.transcriptome_file = "$projectDir/data/ggal/transcriptome.fa" - params.reads = "$projectDir/data/samplesheet_full.csv" + What has changed with what the `reads_in` channel is emitting? - ``` - -!!! question "Optional Exercise" + ??? note "Solution" - In the workflow scope, add `.view()` to `reads_in` + Viewing `reads_in`: - > Update with new workflow definition if keeping this exercise + ```groovy title="main.nf" hl_lines="6" + // Define the fastqc input channel + reads_in = Channel.fromPath(params.reads) + .splitCsv(header: true) + .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] } - ??? 
note "Solution"

-        ```groovy title="main.nf"
-        Channel
-            .fromPath(params.reads)
-            .splitCsv(header: true)
-            .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }
-            .set { read_pairs_ch }
-        read_pairs_ch.view()
+        // Run the fastqc step with the reads_in channel
+        FASTQC(reads_in)
        ```

Run the workflow:

@@ -172,14 +164,11 @@ tags indicate which samples they were run on - either `lung` or `liver` reads!

    [lung, .../data/ggal/lung_1.fq, .../data/ggal/lung_2.fq]
    ```

-    Key differences to note:
-
-    - Total of three tuples, for each sample
-    - `QUANTIFICATION` and `FASTQC` have 3 processes and 1 cached
-    - Added `results/` outputs for each paired sample
-    - `multiqc_report.html` now has 9 samples
+    There are now three tuples emitted separately, one for each sample.
+    When passed into `FASTQC` and `QUANTIFICATION`, each tuple is processed
+    separately in independent tasks.

-    Remove `read_pairs_ch.view()` before proceeding.
+    Remove `reads_in.view()` before proceeding.

## 2.5.3 An introduction to configuration

@@ -203,7 +192,7 @@ we control this through the

Recall that our `FASTQC` takes as input the `reads_in` channel which emits two
`.fastq` files. We will configure the process to use 2 CPUs so each file gets
-run on 1 CPU each, simulataneously.
+run on 1 CPU each (the maximum CPUs fastqc will use per file), simultaneously.

In your `main.nf` script, update the `script` definition in the `FASTQC`
process to add the multithreading option:

@@ -229,8 +218,8 @@

process.cpus = 2
docker.enabled = true
```

-The `-t $task.cpus` argument will populate as `-t 2` when the process is run
-next.
+The `-t $task.cpus` argument will populate as `-t 2` when we run the workflow next.
+Before we do, we will explore Nextflow's built-in reporting system to assess resource usage.

## 2.5.4 Inspecting workflow performance

@@ -256,7 +245,8 @@

trace.enabled = true
```

Run the workflow.
To assess the resource usage all processes need to be run
-again so `-resume` should not be used.
+again so `-resume` should not be used. (If we resume now, it will still
+appear as a cached run, with limited information.)

```bash
nextflow run main.nf --reads "data/samplesheet_full.csv"
```

@@ -284,21 +274,24 @@ Complete the following steps in the exercise to view the report file `report-*.h

4. Navigate to **"Resource Usage" -> "CPU"**.
5. Hover over the `FASTQC` bar chart and note the `mean` CPU usage.

+    !!! quote "Poll"
+
+        What was the `mean` CPU usage for your `FASTQC` process?
+
    ??? note "Solution"

        In this report, a mean of 2.53 CPUs were utilised by the `FASTQC`
        process across the 3 samples. This value will slightly differ across
        runs.

-        > Note: explain why this is > 2 CPUs?
-
        ![](img/report_cpu.png)

-> Note: Any additional exercises, outro text, learning summary
+You have successfully run, configured, and profiled a multi-sample workflow!

!!! abstract "Summary"

-    In this step you have learned:
+    In this lesson you have learned:

-    1. How to
-    1. How to
-    1. How to
+    1. How to add custom labels with process tags
+    2. How to use `task.cpus` to enable multithreading within processes
+    3. How to configure process resources with `nextflow.config`
+    4. How to enable and view Nextflow workflow reports
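For reference, the configuration options introduced across this part sit together in a single `nextflow.config`, roughly as sketched below. Only `docker.enabled`, `process.cpus`, and `trace.enabled` appear verbatim in the lessons; the `report.enabled` line is an assumption, mirroring the `trace.enabled` setting for the `report-*.html` files:

```groovy
// Sketch of the combined nextflow.config built up in this part
docker.enabled = true   // run every process inside its container
process.cpus   = 2      // default CPUs allocated to each task
trace.enabled  = true   // write a trace-*.txt execution trace
report.enabled = true   // assumed: write a report-*.html after each run
```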