The R package IDseq contains functions for processing sequencing reads generated with the IDseq technology. Immuno-Detection by sequencing (IDseq) measures proteins by sequencing the DNA tags attached to antibodies. The package allows you to:
- Import reads from FASTQ files and extract the following DNA sequences from the reads:
  - Unique Molecular Identifier (UMI)
  - Antibody barcode (BARCODE1)
  - Sample barcode (BARCODE2)
- Create a data table with three columns of DNA sequences (UMI, BARCODE1, BARCODE2) and one column with sample-index information.
- Report how often UMI DNA sequences occur twice or more (duplicate frequency).
- Create a count table of unique UMIs per antibody per sample.
- Log the run and experiment information.
The 'devtools' package allows installing the IDseq package from GitHub directly.
# First start R and install and load the devtools package
install.packages('devtools')
library(devtools)
# Then install the IDseq package from the repository jessievb/IDseq
devtools::install_github("jessievb/IDseq")
# Finally load the package
library(IDseq)
- Windows: download the package directly from GitHub.
- Linux: download via the command line with git. If git is not installed, install it first. Then run the following commands:
# move into the folder with R packages (any folder you like)
cd ~/my-R-packages/
# download the IDseq package using git
git clone git@github.com:jessievb/IDseq.git
The devtools package also lets you install a package from a local folder. See the devtools documentation for more information.
- Start R
- Set the working directory to the package directory
- Build the documentation and install the IDseq package (needed only once)
setwd("~/my-R-packages/IDseq/") # Set the working directory to the folder with the R package
devtools::document() # to create documentation
devtools::install() # install package
# Finally load the package
library(IDseq)
The IDseq_split_reads function uses the ShortRead package to extract all reads from (gzipped) FASTQ files.
- Load reads from .fastq.gz files
- Match reads (approximately) to the anchor sequence
- Extract the UMI sequence, Barcode 1 and Barcode 2 from the reads via a regular expression
- Export "splitreads.tsv": a table with the columns UMI, Barcode 1, Barcode 2 and Sequencing_ID
- Run information is logged to output/exp_log/Run_info.log
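The matching and extraction steps above can be illustrated with a small base R sketch. This is not the package's implementation (which relies on ShortRead); the anchor sequence, the read layout (15 nt UMI, 10 nt Barcode 1, 20 nt anchor, 10 nt Barcode 2) and all variable names are assumptions made up for the example.

```r
# Hypothetical read layout (assumption, for illustration only):
#   [UMI: 15 nt][Barcode 1: 10 nt][anchor: 20 nt][Barcode 2: 10 nt]
anchor     <- "ATCAGTCAACAGATAAGCGA"  # made-up anchor sequence
anchor_mut <- "ATCAGTCAACTGATAAGCGA"  # same anchor with one substitution

reads <- c(
  r1 = paste0("ACGTACGTACGTACG", "AAAAACCCCC", anchor,     "GGGGGTTTTT"),
  r2 = paste0("TTTTTTTTTTTTTTT", "CCCCCAAAAA", anchor_mut, "TTTTTGGGGG"),
  r3 = strrep("G", 55)                # read without an anchor
)

# Approximate matching: keep reads containing the anchor with at most 2 edits
has_anchor <- agrepl(anchor, reads, max.distance = 2, fixed = TRUE)
kept <- reads[has_anchor]

# Extract UMI, Barcode 1 and Barcode 2 by position with a regular expression
pattern <- "^([ACGTN]{15})([ACGTN]{10})[ACGTN]{20}([ACGTN]{10})"
m <- regmatches(kept, regexec(pattern, kept))

split_reads <- data.frame(
  UMI      = vapply(m, `[`, character(1), 2),
  BARCODE1 = vapply(m, `[`, character(1), 3),
  BARCODE2 = vapply(m, `[`, character(1), 4),
  stringsAsFactors = FALSE
)
print(split_reads)
```

Read r2 is kept despite the mismatch in its anchor, while r3 is discarded because it contains nothing close to the anchor.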
This function combines the following functions, in order:
- IDseq_import_split_data()
- IDseq_umi_count()
- IDseq_umi_count_frequency()
- IDseq_barcode_count()
It also adds run and experiment information to the .log files in the output/exp_log/ folder. Finally, it creates a bar plot and a dot plot of the count distribution and the UMI-duplicate rates in the output/figures/ folder.
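To make the counting steps concrete, here is a minimal base R sketch of the same idea on a toy table. It is only an illustration under assumed column and object names (`n`, `unique_UMI_count`, `umi_count_frequency`); the package functions may work differently internally.

```r
# Toy split-read table: one row per sequenced read (made-up data)
split_reads <- data.frame(
  UMI      = c("AAA", "AAA", "AAC", "AAG", "AAG", "AAG", "ACT"),
  BARCODE1 = c("ab1", "ab1", "ab1", "ab2", "ab2", "ab2", "ab1"),
  BARCODE2 = c("s1",  "s1",  "s1",  "s1",  "s1",  "s1",  "s2"),
  stringsAsFactors = FALSE
)

# 1. UMI count: how often was each UMI/BARCODE1/BARCODE2 combination sequenced?
umi_count <- aggregate(n ~ UMI + BARCODE1 + BARCODE2,
                       data = cbind(split_reads, n = 1L), FUN = sum)

# 2. UMI count frequency: how many UMIs were seen once, twice, three times, ...
umi_count_frequency <- table(times_sequenced = umi_count$n)

# 3. Barcode count: number of unique UMIs per antibody (BARCODE1) per sample (BARCODE2)
barcode_count <- aggregate(n ~ BARCODE1 + BARCODE2, data = umi_count, FUN = length)
names(barcode_count)[3] <- "unique_UMI_count"

print(umi_count_frequency)
print(barcode_count)
```

Counting unique UMIs rather than raw reads means that PCR duplicates of the same molecule contribute only once to the final count.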
mkdir -p ~/Experiment_ID/{data,config,output/{data,exp_log,figures}}
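On systems without a bash-like shell, the same folder layout could be created from within R. The sketch below uses a temporary folder as a hypothetical experiment location so that it can be run safely; replace `experiment_dir` with your own path.

```r
# Hypothetical experiment location; a temporary folder is used for the demo
experiment_dir <- file.path(tempdir(), "Experiment_ID")

# Same sub-folders as in the mkdir command
sub_dirs <- c("data", "config", "output/data", "output/exp_log", "output/figures")

for (d in file.path(experiment_dir, sub_dirs)) {
  # recursive = TRUE also creates missing parent folders
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

print(list.dirs(experiment_dir, full.names = FALSE))
```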
library(IDseq)
setwd("~/Experiment_ID/")
IDseq_split_reads(fastq_file="data/sample_1/sample_name.fastq.gz")
Start any number of processes, depending on the number of fastq.gz files to process:
# Check that the command works: it should print all FASTQ filenames found in the indicated folders. Specify after -P how many cores should be used.
find Path_to_folders_with_fastqfiles/*/sample_name*__R1_0*.fastq.gz -name "*.fastq.gz" | xargs -P 4 -i -- echo "'{}'"
# Copy the command up to -i, then start R by adding -- R -e
find Path_to_folders_with_fastqfiles/*/sample_name*__R1_0*.fastq.gz -name "*.fastq.gz" | xargs -P 4 -i -- R -e 'library(IDseq); setwd("~/Experiment_ID"); system.time(IDseq_split_reads(fastq_file="'{}'")); quit(save="no")'
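As an alternative to find and xargs, the files could be processed in parallel from a single R session. The following is a sketch using the base `parallel` package; the helper names `find_fastq_files` and `process_fastq_files` are hypothetical, and the demo uses empty placeholder files with `basename` as a stand-in worker so that it runs without real data.

```r
library(parallel)

# Collect all gzipped FASTQ files below a root folder
find_fastq_files <- function(root) {
  list.files(root, pattern = "\\.fastq\\.gz$", recursive = TRUE, full.names = TRUE)
}

# Apply `worker` to every file using `cores` parallel processes.
# mclapply() relies on forking, which is unavailable on Windows,
# so fall back to a single core there.
process_fastq_files <- function(files, worker, cores = 4L) {
  if (.Platform$OS.type == "windows") cores <- 1L
  mclapply(files, worker, mc.cores = cores)
}

# Demo with empty placeholder files in a temporary folder
demo_root <- file.path(tempdir(), "fastq_demo")
dir.create(file.path(demo_root, "sample_1"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_root, "sample_1", c("a.fastq.gz", "b.fastq.gz")))

found <- find_fastq_files(demo_root)
res   <- process_fastq_files(found, worker = basename, cores = 2L)
print(unlist(res))

# For a real run, the worker would call the package function, for example:
# process_fastq_files(found, function(f) IDseq_split_reads(fastq_file = f))
```

As with the `echo` check in the shell version, running the stand-in worker first is a cheap way to confirm which files will be processed.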
Save these two tables in the config folder
Briefly, the workflow is as follows:
setwd(Experiment_folder)
## Combine different split read files into one: "output/data/IDseq_split_reads.tsv"
IDseq_analysis_splitreads(experiment_dir=getwd())
## Import barcode count table into environment
barcode_count <- data.table::fread(input="output/data/barcode_count.tsv") # fread comes from the data.table package
## Match barcodes to antibodies and samples. If no 100% match is found, well_name and Ab_name (and the other columns) receive the value NA
barcode_count_matched <- IDseq_barcode_match()
## Filter all matched reads:
barcode_count_matched_filtered <- IDseq_barcode_match_filtered()
## Extra:
not_matched_counts <- IDseq_barcode_match_na()
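The matching and filtering logic in this workflow can be sketched in base R as two exact-match joins followed by a split on NA. This is an illustration only: the two lookup tables, their barcode sequences and the `UMI_count` column are made-up assumptions, and the package functions may be implemented differently. Only `well_name` and `Ab_name` are taken from the workflow comments above.

```r
# Toy barcode count table (made-up data)
barcode_count <- data.frame(
  BARCODE1  = c("AAAA", "CCCC", "GGGG"),
  BARCODE2  = c("TTTT", "TTTT", "ACAC"),
  UMI_count = c(120L, 45L, 3L),
  stringsAsFactors = FALSE
)

# Hypothetical lookup tables, as they might be stored in the config folder
antibody_barcodes <- data.frame(BARCODE1 = c("AAAA", "CCCC"),
                                Ab_name  = c("Ab_A", "Ab_B"),
                                stringsAsFactors = FALSE)
sample_barcodes   <- data.frame(BARCODE2  = "TTTT",
                                well_name = "A1",
                                stringsAsFactors = FALSE)

# Left joins on exact barcode sequence: rows without a match get NA
matched <- merge(barcode_count, antibody_barcodes, by = "BARCODE1", all.x = TRUE)
matched <- merge(matched, sample_barcodes, by = "BARCODE2", all.x = TRUE)

# Split into fully matched rows and rows with at least one missing annotation
is_matched       <- !is.na(matched$Ab_name) & !is.na(matched$well_name)
matched_filtered <- matched[is_matched, ]
not_matched      <- matched[!is_matched, ]

print(matched_filtered)
print(not_matched)
```

Inspecting the unmatched rows separately shows how many counts are lost to sequencing errors or unexpected barcodes.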