nf-mixcr: TCR repertoire building with MiXCR

nf-mixcr is a Nextflow pipeline that runs MiXCR to build T-cell repertoires from Illumina sequencing data. Nextflow makes your life easier by managing input files, output files and jobs for you, without requiring you to install anything apart from Nextflow itself and a container runner (Singularity or Docker).

The pipeline runs mixcr analyze on each read pair listed in a samplesheet file, then generates the QC reports and clone tables automatically.

flowchart TD
    A(Samplesheet) --> B[Samplesheet Check]
    B -->|on each sample| C[mixcr analyze]
    C -->|on each sample| D[mixcr exportClones]
    C -->|on all samples| E[mixcr exportQc align]
    C -->|on all samples| F[mixcr exportQc chainUsage]
    C -->|on each sample| G[mixcr exportQc coverage]
    C -->|on each sample| H[mixcr exportReports]

Full list of run programs:

Requirements

NB: I assume you have a minimal knowledge of the terminal and bash, and that you will be able to run the following lines.

nf-mixcr does not require many dependencies to run. If you plan to run it on a cluster (like Eddie), chances are you do not need to install anything. The only dependencies are:

My advice for installation is to use the conda (Miniforge) package manager.

conda create -n nf-mixcr_env
conda activate nf-mixcr_env
conda install -c milaboratories nextflow singularity mixcr
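
Once the environment is activated, it can be worth confirming that the tools are actually on your PATH before going further. The function below is a minimal sketch; `check_tools` is a hypothetical helper name, not something shipped with nf-mixcr.

```shell
# Report which of the given commands are available on PATH.
# Returns 0 when every command is found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: check_tools nextflow singularity mixcr
```

If you use Docker rather than Singularity, pass `docker` in place of `singularity`.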

MiXCR (once for licence activation)

Before going further, you will need a licence to use MiXCR. If you don't have one, please visit this page and fill in the form. If you are an academic, lucky you, it's free! If you're not, please check the commercial licensing page. Once you have received your licence, run the command mixcr activate-license and paste your licence key.

Pipeline Installation

NOPE! 🎉

But first, let's check that the pipeline runs correctly. The test profile can be used to run the pipeline with toy datasets that are automatically downloaded from the repository.

You can start the test by running:

nextflow run sguizard/nf-mixcr -profile singularity,test,<Institution>

or, if you use Docker instead of Singularity:

nextflow run sguizard/nf-mixcr -profile docker,test,<Institution>

The <Institution> placeholder must be replaced by your cluster profile. The list of available configs can be found on the nf-core website.

NB: The singularity or docker profile can be omitted if it is already defined in your institution profile.

Preparing files and data for analysis

To keep inputs, outputs and working directories separate, I start by creating a directory for the analysis (TCR_project) containing a data directory where I store the reads and other input files:

TCR_project/
└── data
    ├── imgt.202312-3.sv8.json.gz
    ├── mixcr_analyze.config
    ├── read_1.fastq.gz
    ├── read_2.fastq.gz
    └── samplesheet.csv

Samplesheet

A samplesheet must be provided. This file is a three-column comma-separated values (CSV) table with the columns id, read1 and read2. Each line gives the location of the FASTQ files associated with a unique ID.

id,read1,read2
SAMP1,./data/read_1.fastq.gz,./data/read_2.fastq.gz
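
A quick check of the samplesheet before launching can save a failed run. The function below is a minimal sketch that assumes exactly the format described above (the header `id,read1,read2`, three fields per row, unique IDs); `check_samplesheet` is a hypothetical helper and does not reproduce the pipeline's own samplesheet check.

```shell
# Validate a samplesheet: exact header, three fields per row, unique IDs,
# and FASTQ paths that exist (relative paths resolve from the current directory).
check_samplesheet() {
    sheet=$1
    [ -f "$sheet" ] || { echo "samplesheet not found: $sheet" >&2; return 1; }

    # Structural checks; prints the two FASTQ paths of every valid row.
    paths=$(awk -F',' '
        { sub(/\r$/, "") }                       # tolerate Windows line endings
        NR == 1 {
            if ($0 != "id,read1,read2") {
                print "unexpected header: " $0 > "/dev/stderr"; exit 1
            }
            next
        }
        $0 == "" { next }                        # skip empty lines
        NF != 3 || $1 == "" {
            print "line " NR ": expected id,read1,read2" > "/dev/stderr"; exit 1
        }
        seen[$1]++ {
            print "line " NR ": duplicated id " $1 > "/dev/stderr"; exit 1
        }
        { print $2; print $3 }
    ' "$sheet") || return 1

    [ -n "$paths" ] || { echo "no samples listed in $sheet" >&2; return 1; }

    # Check that every listed FASTQ file exists.
    status=0
    while IFS= read -r path; do
        [ -f "$path" ] || { echo "missing file: $path" >&2; status=1; }
    done <<EOF
$paths
EOF
    return "$status"
}

# Example (run from TCR_project/): check_samplesheet data/samplesheet.csv
```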

Library (Optional)

If the species studied is neither human (hsa) nor mouse (mmu), you'll need to provide a library of reference V, D, J and C genes. The IMGT provides libraries for a large panel of species, which can be used with MiXCR. The data can be downloaded here. Please don't decompress the file, and keep the '.json.gz' extension.
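
The two requirements on the library file (still compressed, `.json.gz` extension kept) are easy to verify from the shell. This is a minimal sketch; `check_library` is a hypothetical helper, and it only tests the file name and the gzip integrity, not the library content.

```shell
# Check that a library file keeps its .json.gz extension and is valid gzip.
check_library() {
    lib=$1
    case "$lib" in
        *.json.gz) ;;
        *) echo "library must keep the .json.gz extension: $lib" >&2; return 1 ;;
    esac
    [ -f "$lib" ] || { echo "library not found: $lib" >&2; return 1; }
    # gzip -t tests the integrity of the compressed file without extracting it.
    gzip -t "$lib" 2>/dev/null || { echo "not a valid gzip file: $lib" >&2; return 1; }
}

# Example: check_library data/imgt.202312-3.sv8.json.gz
```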

mixcr analyze configuration file

MiXCR gathers multiple tools, and each of them is highly configurable. Implementing all MiXCR options in the pipeline would be highly time-consuming. As a trade-off, I decided to use a configuration file to set the mixcr analyze parameters. You can find a template configuration file here; modify it to suit your needs. You can also run the pipeline with the option --get_ma_conf to get a copy.

Each line between the central square brackets is a mixcr analyze option. If needed, you can add options by inserting a new line at the end of the list, writing your option between single quotes and ending the line with a comma.

process {
    withName: MIXCR_ANALYZE {
        cpus = 8
        ext.args = {
            [
                '--species cat',
                '--rna',
                '--tag-pattern "^N{4:6}GCTCACCTTTTTCAGGTCCTC(R1:*)\\^N{4:6}GCAGTGGTATCAACGCAGAGT(UMI:TN{4}TN{4}TN{4}TCTTGGGG)(R2:*)"',
                '--rigid-left-alignment-boundary',
                '--floating-right-alignment-boundary J',
                '--ADDITIONAL-OPTION and_its_value',
            ].join(' ').trim()
        }
    }
}
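
A forgotten closing quote or trailing comma in this list is an easy mistake to make. The function below is a rough sketch of a check for it, assuming each option sits on its own line and starts with a single quote followed by a dash, as in the template above; `check_ma_conf` is a hypothetical helper and is no substitute for Nextflow's own config parsing.

```shell
# Print every option line that is not of the form  '--option value',
# Returns 1 if at least one such line is found, 0 otherwise.
check_ma_conf() {
    awk -v q="'" '
        BEGIN {
            opt    = "^[ \t]*" q "-"       # line starts with a quote and a dash
            closed = q ",[ \t]*$"          # line ends with a quote and a comma
        }
        $0 ~ opt && $0 !~ closed {
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example: check_ma_conf data/mixcr_analyze.config
```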

Running the pipeline

A typical command line to run the pipeline looks like this:

nextflow run sguizard/nf-mixcr \
    -profile <Institution> \
    -c data/mixcr_analyze.config \
    --samplesheet data/samplesheet.csv \
    --preset generic-amplicon-with-umi \
    --study My_project

Options description

You will set two kinds of options:

  • Nextflow options, with a single dash (e.g. -profile)
  • Pipeline options, with a double dash (e.g. --samplesheet)

The Nextflow options that need to be used are:

  • -profile: selects the container technology (docker or singularity) and the profile of your cluster (e.g. eddie). Profiles are separated by commas (e.g. docker,eddie).
  • -c: defines additional configuration. Please add the mandatory mixcr_analyze.config file here.

The pipeline options are:

  • --samplesheet: The path to the samplesheet listing the samples, as described above
  • --preset: The mixcr analyze preset to use (e.g. generic-amplicon-with-umi)
  • --library: The library of V, D, J and C reference genes
  • --study: An ID that will be used as a prefix for the global report files (Default: TCR)
  • --outdir: The name of the directory where the results will be published (Default: results)
  • --get_ma_conf: Download a copy of the template mixcr_analyze.config and stop

Some options must be defined for each run and can't be omitted. The compulsory options are:

  • -profile
  • -c (mixcr_analyze.config)
  • --samplesheet
  • --preset
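
Because four options are compulsory, a small wrapper can refuse to build the command when one is missing or when an input file does not exist. This is a minimal sketch; `nf_mixcr_cmd` is a hypothetical helper that only prints the command, so you can inspect it before running it.

```shell
# Print the nextflow command line for the four compulsory options,
# after checking that the config and samplesheet files exist.
nf_mixcr_cmd() {
    if [ "$#" -lt 4 ]; then
        echo "usage: nf_mixcr_cmd <profile> <config> <samplesheet> <preset>" >&2
        return 2
    fi
    profile=$1
    config=$2
    samplesheet=$3
    preset=$4
    for f in "$config" "$samplesheet"; do
        [ -f "$f" ] || { echo "file not found: $f" >&2; return 1; }
    done
    printf 'nextflow run sguizard/nf-mixcr -profile %s -c %s --samplesheet %s --preset %s\n' \
        "$profile" "$config" "$samplesheet" "$preset"
}

# Example:
#   nf_mixcr_cmd eddie data/mixcr_analyze.config data/samplesheet.csv generic-amplicon-with-umi
```

Optional settings such as --library, --study or --outdir can be appended to the printed command by hand.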

Output files

The results of the pipeline will be stored in the directory defined by the --outdir option. One directory is created per process/program to store its results. An additional directory, pipeline_info, gathers reports about the pipeline execution.

<outdir name>/
|-- 01_mixcr_analysis
|-- 02_mixcr_exportClones
|-- 03_mixcr_exportQc_align
|-- 03_mixcr_exportQc_chainusage
|-- 03_mixcr_exportQc_coverage
|-- 04_mixcr_exportReports
`-- pipeline_info
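
After a run, you can confirm that every expected directory was produced. The function below is a minimal sketch built from the directory names listed above; `check_outdir` is a hypothetical helper, and it defaults to `results`, the documented default of --outdir.

```shell
# Check that the output directory contains every expected sub-directory.
# Returns 0 when all are present, 1 otherwise.
check_outdir() {
    outdir=${1:-results}
    status=0
    for d in 01_mixcr_analysis 02_mixcr_exportClones 03_mixcr_exportQc_align \
             03_mixcr_exportQc_chainusage 03_mixcr_exportQc_coverage \
             04_mixcr_exportReports pipeline_info; do
        if [ -d "$outdir/$d" ]; then
            echo "ok       $d"
        else
            echo "missing  $d" >&2
            status=1
        fi
    done
    return "$status"
}

# Example: check_outdir results
```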

01_mixcr_analysis

01_mixcr_analysis
|-- SAMP1.align.report.json
|-- SAMP1.align.report.txt
|-- SAMP1.assemble.report.json
|-- SAMP1.assemble.report.txt
|-- SAMP1.clns
|-- SAMP1.clones_TRB.tsv
|-- SAMP1.log
|-- SAMP1_non_refined.vdjca
|-- SAMP1.qc.json
|-- SAMP1.qc.txt
|-- SAMP1.refined.vdjca
|-- SAMP1.refine.report.json
`-- SAMP1.refine.report.txt

This directory gathers the results of the programs launched by MiXCR. With the preset generic-amplicon-with-umi, mixcr analyze align, mixcr analyze refineTagsAndSort, mixcr analyze assemble and mixcr analyze export are run.

02_mixcr_exportClones

02_mixcr_exportClones
`-- SAMP1_exportClones_<TRB/IGL>.tsv

mixcr exportClones generates a tab-separated values (TSV) file listing the detected clones.
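
A quick sanity check on these tables is to count the clones per file. The sketch below assumes one header line followed by one clone per line and makes no assumption about column names; `count_clones` is a hypothetical helper.

```shell
# Count the clones in an exportClones table:
# every non-empty line after the header is one clone.
count_clones() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Example: one line per table, with the file name and its clone count.
#   for f in results/02_mixcr_exportClones/*.tsv; do
#       printf '%s\t%s\n' "$f" "$(count_clones "$f")"
#   done
```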

03_mixcr_exportQc_align

03_mixcr_exportQc_align
|-- TCR_exportQC_align.pdf
`-- TCR_exportQC_align.png

mixcr exportQc align uses the results of all analyzed samples to generate an alignment report. It describes the status of the reads (correctly/incorrectly aligned).

03_mixcr_exportQc_chainusage

03_mixcr_exportQc_chainusage
|-- TCR_exportQC_chainUsage.pdf
`-- TCR_exportQC_chainUsage.png

Exports a chain usage summary for each sample.

03_mixcr_exportQc_coverage

03_mixcr_exportQc_coverage
|-- SAMP1_exportQC_coverage.pdf
|-- SAMP1_exportQC_coverage_R0.png
|-- SAMP1_exportQC_coverage_R1.png
`-- SAMP1_exportQC_coverage_R2.png

Exports anchor points coverage by the library. It separately plots coverage for R1, R2 and overlapping reads.

04_mixcr_exportReports

04_mixcr_exportReports
|-- SAMP1.report.json
`-- SAMP1.report.txt

These files contain the report of each tool launched by mixcr analyze.

pipeline_info

pipeline_info
|-- <timestamp>_execution_report.html
|-- <timestamp>_execution_timeline.html
`-- <timestamp>_execution_trace.txt

These are the reports generated by Nextflow about the pipeline run.

The execution report contains information about the jobs, their running time, the resources used and the commands run, alongside the pipeline version used.

The execution timeline displays the running time of the jobs and the order in which they were launched.

The execution trace report gathers the raw data about job execution (including each job's running directory within the work directory).
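
When a run fails, the trace file is a quick way to find which tasks to inspect. The sketch below assumes a tab-separated trace with a header line containing the `name`, `hash` and `status` fields (they are part of Nextflow's default trace fields, but a pipeline can change the selection); `failed_tasks` is a hypothetical helper. The hash is the beginning of the task's path inside the work directory.

```shell
# List the tasks of a Nextflow trace file that did not complete.
# Output: name, status and hash, separated by tabs.
failed_tasks() {
    awk -F'\t' '
        NR == 1 {
            # Locate the columns by name rather than by position.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("name" in col) || !("hash" in col) || !("status" in col)) {
                print "trace file lacks name/hash/status columns" > "/dev/stderr"
                exit 2
            }
            n = col["name"]; h = col["hash"]; s = col["status"]
            next
        }
        NF > 0 && $s != "COMPLETED" && $s != "CACHED" {
            printf "%s\t%s\t%s\n", $n, $s, $h
        }
    ' "$1"
}

# Example: failed_tasks results/pipeline_info/<timestamp>_execution_trace.txt
```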

Tips and Troubleshooting known problems

Roslin Institute users: Fix eddie config

Dear Roslin Eddie users,

If you have already run a Nextflow pipeline on Eddie, chances are you have faced an error message about the Singularity image caching directory. This error is caused by the permissions of the /exports/igmm/eddie/BioinformaticsResources/nfcore/singularity-images directory, which is not accessible to all users.

To fix this, you can create an eddie_fix.config file and add the following lines to it:

singularity {
  envWhitelist = "SINGULARITY_TMPDIR,TMPDIR"
  runOptions = '-p -B "$TMPDIR"'
  enabled = true
  autoMounts = true
  cacheDir = "/exports/eddie/scratch/<username>/singularity-images"
}

Do not forget to replace the <username> placeholder.

This will store the Singularity images in a directory inside your scratch space. Do not forget to delete it once the pipeline has finished running! This is obviously a temporary fix. Discussions are ongoing at the Roslin Institute to find a solution to this problem, and pushing a Roslin-specific configuration is being considered.

NB: You will need to apply the next fix too.

Correct MiXCR license detection

To be sure that MiXCR can correctly access your licence, you should update the singularity -B option by adding the following lines to a custom configuration file (eddie_fix.config for example 😜).

singularity {
  runOptions = '-p -B "$TMPDIR",/home/<username>'
}

Do not forget to replace the <username> placeholder with your own username.
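
If you apply both fixes, the file can be generated with your username already filled in. The function below is a sketch under one assumption: that the two snippets above can be merged into a single singularity block, keeping the runOptions value of the second one. `write_eddie_fix` is a hypothetical helper.

```shell
# Write an eddie_fix.config combining both fixes for the given username.
# Usage: write_eddie_fix <username> <output file>
write_eddie_fix() {
    user=$1
    out=$2
    # \$TMPDIR is escaped so that it is written literally to the file.
    cat > "$out" <<EOF
singularity {
  envWhitelist = "SINGULARITY_TMPDIR,TMPDIR"
  runOptions = '-p -B "\$TMPDIR",/home/${user}'
  enabled = true
  autoMounts = true
  cacheDir = "/exports/eddie/scratch/${user}/singularity-images"
}
EOF
}

# Example: write_eddie_fix "$USER" data/eddie_fix.config
```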

Example command line

nextflow run sguizard/nf-mixcr \
    -profile eddie \
    -c data/mixcr_analyze.config \
    -c data/eddie_fix.config \
    --samplesheet data/samplesheet.csv \
    --preset generic-amplicon-with-umi \
    --library data/imgt.202312-3.sv8.json.gz \
    --study TCR_cat_project

Contributing

Contributions are welcome! Just try to follow the code formatting as best you can.

Citation

Please cite my work if you use it in your own research, thanks! 🙏

Sébastien Guizard. (2024). sguizard/nf-mixcr: nf-mixcr v1.0.1 (v1.0.1). Zenodo. https://doi.org/10.5281/zenodo.10678867

Footnotes

This pipeline is heavily inspired by the nf-core templates and even borrows a few parts of them, notably the institution configs.

Please also check the nf-core website! It gathers great, easy-to-use pipelines and is maintained by wonderful people!