05. RADseq
Goal of the practical: Process sequence data from simulated RAD loci using a specialized pipeline and analyze the identified variants.
Highly recommended reading:
To simulate a set of RAD loci we will use the package RADinitio. It takes a reference sequence and simulates the different stages of the RADseq library preparation and sequencing process under a given set of population settings.
- 1.1. Install packages
We will run the practical within a Conda virtual environment. Check how to manage Conda environments and the conda-cheatsheet.
You can create the required Conda environment using the file RADenv.yml already saved in the folder RAD in your home directory.
cd RAD
source /opt/conda/etc/profile.d/conda.sh
conda create -n RADenv -c conda-forge -c bioconda ipyrad msprime
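Alternatively, if you prefer to build the environment from the provided file rather than listing the packages on the command line, a one-liner along these lines should work (assuming RADenv.yml specifies the environment name and the same channels and packages, as Conda environment files normally do):
conda env create -f RADenv.yml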
The installation process takes a while. When it finishes, you can activate the environment.
conda activate RADenv
Now install the radinitio package with pip:
pip install radinitio
Check if the packages are ready
radinitio -h
ipyrad -h
You should get the help message of each package (i.e., a list of available settings). If there are no error messages, you are ready to start the practical.
You can also see a list of all packages in an environment:
conda list
You should get quite a long output including all packages now installed in the RADenv conda environment. Most of them are default (base) packages, but many are requirements for radinitio and ipyrad. Without Conda you would need to install most of those requirements yourself (one by one).
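If you only want to confirm the two main packages, you can filter that list; this is plain grep, nothing package-specific (radinitio, installed via pip, appears in the list as a pypi package):
conda list | grep -E 'radinitio|ipyrad'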
- 1.2. Select the reference genome of a desired species
You can choose any reference you want. As an example, I will use a short fragment of the genome of the stress-tolerant wild tomato species Solanum pennellii. You can find the sequence in the file /home/curso/data/RAD/Spen_chr2_part.fasta.
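Before simulating, it can help to take a quick look at the reference. These are standard inspection commands; the header and sequence count you see will depend on the file you chose:
head -1 /home/curso/data/RAD/Spen_chr2_part.fasta   # first sequence header
grep -c ">" /home/curso/data/RAD/Spen_chr2_part.fasta   # number of sequences in the file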
- 1.3. Simulate the data
RADinitio requires 1) the reference genome, 2) a file with chromosome names, and 3) the output directory. The remaining settings have default values. You can check them by displaying the help message (radinitio -h).
We will simulate a double digest restriction-site associated DNA (ddRADseq) library with the following features:
- 5 populations (10 individuals each) with an effective population size of 5,000
- Restriction enzymes EcoRI and MseI
- Fragment size of 350 (+/- 35) bp
- 9 PCR cycles
- 30X coverage
- Read length of 150 bp
- Only chromosome 2 will be simulated
Get the chromosome names and save them to a file using AWK.
awk 'sub(/^>/, "")' /home/curso/data/RAD/Spen_chr2_part.fasta > chrom_list.txt
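You can check the result; with this single-fragment reference you should see one chromosome name:
cat chrom_list.txt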
Create the output directory
mkdir sim_reads
Run the simulation
radinitio --simulate-all --genome /home/curso/data/RAD/Spen_chr2_part.fasta \
--chromosomes chrom_list.txt --out-dir sim_reads --n-pops 5 --pop-eff-size 5000 \
--n-seq-indv 10 --library-type ddRAD --enz ecoRI --enz2 mseI --insert-mean 350 \
--insert-stdev 35 --pcr-cycles 9 --coverage 30 --read-length 150
The sequence files in fasta format will be saved in the folder sim_reads/rad_reads/.
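You can quickly verify that one pair of files per individual was produced; with 5 populations of 10 individuals each, this should report 50:
ls sim_reads/rad_reads/msp_*.1.fa.gz | wc -l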
Then, we need to convert them to fastq format to be able to process them:
- Create a directory to save the fastq files
- Go to the directory with the simulated sequences
- Use reformat.sh from BBTools in a for loop to get the fastq files
mkdir ddrad_fq
cd sim_reads/rad_reads/
# reformat.sh expands the '#' wildcard to 1 and 2, converting each pair of fasta files in one call
for i in msp_*.1.fa.gz; do
  reformat.sh in=${i//.1./.#.} out=../../ddrad_fq/${i//.1.fa.gz/_R#_.fastq.gz}
done
cd ../../   # return to the working directory
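To verify the conversion, count the resulting files; with an R1 and an R2 file for each of the 50 samples, you should get 100:
ls ddrad_fq/ | wc -l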
To process the reads we will use the package ipyrad. Check the documentation.
The first step is to create the ipyrad parameters file
ipyrad -n sim_data_proc
After that, you should get a new text file with the name params-sim_data_proc.txt. Check the content of the file.
------- ipyrad params file (v.0.9.56)-------------------------------------------
sim_data_proc ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
/home/USERNAME/RAD ## [1] [project_dir]: Project dir (made in curdir if not present)
## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
## [3] [barcodes_path]: Location of barcodes file
## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo ## [5] [assembly_method]: Assembly method (denovo, reference)
## [6] [reference_sequence]: Location of reference sequence file
rad ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TGCAG, ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5 ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33 ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6 ## [11] [mindepth_statistical]: Min depth for statistical base calling
6 ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000 ## [13] [maxdepth]: Max cluster depth within samples
0.85 ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0 ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
2 ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35 ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2 ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
0.05 ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus
0.05 ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus
4 ## [21] [min_samples_locus]: Min # samples per locus for output
0.2 ## [22] [max_SNPs_locus]: Max # SNPs per locus
8 ## [23] [max_Indels_locus]: Max # of indels per locus
0.5 ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus
0, 0, 0, 0 ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0 ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
p, s, l ## [27] [output_formats]: Output formats (see docs)
## [28] [pop_assign_file]: Path to population assignment file
## [29] [reference_as_filter]: Reads mapped to this reference are removed in step 3
Then, you need to include in the file the settings for our data set. You should edit parameters 1, 4, 7, 8, 27, and 28. You can find the explanation of each parameter in the file. Open the file with a text editor and include the information as in the example below:
WATCH THE PATHS! From now on, I am running all processes in a folder called "RAD" in the home directory of the user prk40.
Adjust all paths according to your user account and working directory.
------- ipyrad params file (v.0.9.56)-------------------------------------------
sim_data_proc ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
/home/USERNAME/RAD/proc_reads ## [1] [project_dir]: Project dir (made in curdir if not present)
## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
## [3] [barcodes_path]: Location of barcodes file
/home/USERNAME/RAD/ddrad_fq/*.fastq.gz ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo ## [5] [assembly_method]: Assembly method (denovo, reference)
## [6] [reference_sequence]: Location of reference sequence file
pairddrad ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
AATTC,TAA ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5 ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33 ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6 ## [11] [mindepth_statistical]: Min depth for statistical base calling
6 ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000 ## [13] [maxdepth]: Max cluster depth within samples
0.85 ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0 ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
2 ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35 ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2 ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
0.05 ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus
0.05 ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus
4 ## [21] [min_samples_locus]: Min # samples per locus for output
0.2 ## [22] [max_SNPs_locus]: Max # SNPs per locus
8 ## [23] [max_Indels_locus]: Max # of indels per locus
0.5 ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus
0, 0, 0, 0 ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0 ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
v ## [27] [output_formats]: Output formats (see docs)
/home/USERNAME/RAD/pop_assignment.txt ## [28] [pop_assign_file]: Path to population assignment file
## [29] [reference_as_filter]: Reads mapped to this reference are removed in step 3
Then, you need to prepare the population assignment file. RADinitio outputs a file with almost all the needed information (sim_reads/popmap.tsv). You just need to add a line at the end indicating the minimum number of samples per population.
cat sim_reads/popmap.tsv > pop_assignment.txt
echo "# pop0:10 pop1:10 pop2:10 pop3:10 pop4:10" >> pop_assignment.txt
Now, we can run the processing pipeline.
The ipyrad workflow consists of seven steps that you can run using the -s option. For a full description of each step, check the ipyrad manual.
You can run the steps one by one to check the effect of different settings (a sketch follows below); then, once you have decided on your final set of parameters, you can apply all the steps in a single run.
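For example, if you wanted to proceed stepwise, a minimal sketch would be to run the first two steps and inspect the intermediate stats before continuing (the -r option prints a summary of the results obtained so far):
ipyrad -p params-sim_data_proc.txt -s 12 -c 4
ipyrad -p params-sim_data_proc.txt -r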
Let us run the complete workflow with the settings you have in the params-sim_data_proc.txt file.
ipyrad -p params-sim_data_proc.txt -s 1234567 -c 4
If all data and parameters are well prepared, the processes will run (the -c 4 option distributes the work over four cores) and after a few minutes you should get the following messages on the screen:
-------------------------------------------------------------
ipyrad [v.0.9.56]
Interactive assembly and analysis of RAD-seq data
-------------------------------------------------------------
Parallel connection | node07.pgen.wzw.tum.de: 4 cores
Step 1: Loading sorted fastq data to Samples
[####################] 100% 0:00:02 | loading reads
100 fastq files loaded to 50 Samples.
Step 2: Filtering and trimming reads
[####################] 100% 0:00:22 | processing reads
Step 3: Clustering/Mapping reads within samples
[####################] 100% 0:00:08 | join merged pairs
[####################] 100% 0:00:07 | join unmerged pairs
[####################] 100% 0:00:04 | dereplicating
[####################] 100% 0:01:19 | clustering/mapping
[####################] 100% 0:00:00 | building clusters
[####################] 100% 0:00:00 | chunking clusters
[####################] 100% 0:06:49 | aligning clusters
[####################] 100% 0:00:01 | concat clusters
[####################] 100% 0:00:02 | calc cluster stats
Step 4: Joint estimation of error rate and heterozygosity
[####################] 100% 0:00:11 | inferring [H, E]
Step 5: Consensus base/allele calling
Mean error [0.00574 sd=0.00008]
Mean hetero [0.00781 sd=0.00051]
[####################] 100% 0:00:02 | calculating depths
[####################] 100% 0:00:03 | chunking clusters
[####################] 100% 0:02:21 | consens calling
[####################] 100% 0:00:08 | indexing alleles
Step 6: Clustering/Mapping across samples
[####################] 100% 0:00:03 | concatenating inputs
[####################] 100% 0:00:05 | clustering across
[####################] 100% 0:00:01 | building clusters
[####################] 100% 0:00:24 | aligning clusters
Step 7: Filtering and formatting output files
[####################] 100% 0:00:08 | applying filters
[####################] 100% 0:00:05 | building arrays
[####################] 100% 0:00:00 | writing conversions
[####################] 100% 0:00:02 | indexing vcf depths
[####################] 100% 0:00:06 | writing vcf output
Parallel connection closed.
In addition, you will find a new folder in the working directory with the name we set in the parameter file (proc_reads in our case).
Inside, you can find the output files for each step of the workflow and text files with summary stats.
Check the summary file in the outfiles folder.
head -12 proc_reads/sim_data_proc_outfiles/sim_data_proc_stats.txt
The first section summarizes the number of loci retained in each step.
## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters
total_filters applied_order retained_loci
total_prefiltered_loci 0 0 1209
filtered_by_rm_duplicates 95 95 1114
filtered_by_max_indels 67 67 1047
filtered_by_max_SNPs 0 0 1047
filtered_by_max_shared_het 40 36 1011
filtered_by_min_sample 556 556 455
total_filtered_loci 758 754 455
We can compare these numbers with the number of simulated loci. We can find this information in the RADinitio output files.
cat sim_reads/radinitio.log
Towards the end of the file, you can find the number of RAD loci identified in the reference fasta sequence:
Extracting RAD loci...
Extracted 930 loci.
You can see there is a big discrepancy between the number of loci simulated with RADinitio (930) and the final set of loci identified with ipyrad (455).
- What could be the reason for this discrepancy?
We can see that the number of loci drops from 1011 to 455 in the minimum-number-of-samples-per-locus filtering step. That means that not all loci are present in all simulated individuals.
- How can you explain that our simulated loci are not present in all individuals?
Now, let us check the result after relaxing the filter on the minimum number of samples per locus.
For that, we just need to create a branch of the workflow:
ipyrad -p params-sim_data_proc.txt -b sim_data_proc2
This command will create a new parameters file.
Then, create a new population assignment file allowing a minimum of one sample per population (instead of 10).
cat sim_reads/popmap.tsv > pop_assignment2.txt
echo "# pop0:1 pop1:1 pop2:1 pop3:1 pop4:1" >> pop_assignment2.txt
Then, edit parameter 28 of the new parameters file (params-sim_data_proc2.txt) to point to the new population assignment file.
sed -i 's/pop_assignment/pop_assignment2/' params-sim_data_proc2.txt
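You can confirm the substitution worked:
grep pop_assign params-sim_data_proc2.txt   # parameter 28 should now point to pop_assignment2.txt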
Now, run ipyrad again with the new parameters file. Since the branch inherits the results of steps 1-6, only step 7 needs to be rerun:
ipyrad -p params-sim_data_proc2.txt -s 7
Check the new summary output file:
head -12 proc_reads/sim_data_proc2_outfiles/sim_data_proc2_stats.txt
You can see that if you allow more missing data in the dataset, it is possible to recover more loci:
## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters
total_filters applied_order retained_loci
total_prefiltered_loci 0 0 1209
filtered_by_rm_duplicates 95 95 1114
filtered_by_max_indels 268 268 846
filtered_by_max_SNPs 0 0 846
filtered_by_max_shared_het 49 42 804
filtered_by_min_sample 27 27 777
total_filtered_loci 439 432 777
Following the same procedure, you can explore the impact of other parameters on the results. Explore, for example, the impact of the clustering threshold in step 3 (see the sketch below).
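As a sketch of how such an exploration could look (the branch name sim_data_clust90 is just an example, and the sed assumes the value 0.85 starts only the clust_threshold line, as in the file above): branch the assembly, change parameter 14, and rerun from step 3 onward with -f to force recomputation of already-finished steps.
ipyrad -p params-sim_data_proc.txt -b sim_data_clust90
sed -i 's/^0.85 /0.90 /' params-sim_data_clust90.txt   # set clust_threshold to 0.90
ipyrad -p params-sim_data_clust90.txt -s 34567 -c 4 -f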
Once you have finished the sequence data processing with optimized parameters, you will want to analyze the identified variants. ipyrad outputs the variants in different formats.
In our parameter file we set the output of variants in vcf format (v ## [27] [output_formats]). You can find the vcf file in the folder proc_reads/sim_data_proc2_outfiles.
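A quick way to see how many variants were called is to count the non-header lines of the vcf file:
grep -vc "^#" proc_reads/sim_data_proc2_outfiles/sim_data_proc2.vcf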
There are many software packages available to analyze population genetic data. For this practical, we will perform two simple exploratory analyses of population structure using the package adegenet in R.
First, transfer the vcf file (sim_data_proc2.vcf) to your local computer using the WinSCP software (on Windows) or the commands scp or rsync (on UNIX systems). Then, download a simple R markdown template that we prepared for this practical.
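For example, with scp (USERNAME and SERVER are placeholders here; replace them with your account and the address of the course server):
scp USERNAME@SERVER:/home/USERNAME/RAD/proc_reads/sim_data_proc2_outfiles/sim_data_proc2.vcf .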
Once you have both the VCF file and the R markdown template in the same folder, open the R markdown file using RStudio.
Explore the file. The basic structure of an R markdown file consists of two sections: a header with general information about the file (title, authors, ...) and instructions for output generation, and the body of the document, which alternates text (titles and paragraphs) with R code chunks.
Once you have completed the markdown file, you just need to click on the Knit button. If everything is properly set, after a few seconds you should get an HTML file that automatically opens in your browser and is also saved in the working directory. It is also easy to generate PDF files: edit the R markdown template header to produce PDF output. You can find the basic instructions in this rmarkdown-cheatsheet. If you are not familiar with R markdown, check some tutorials on the internet.