Skip to content

05. Radseq

Gustavo A. Silva-Arias edited this page Aug 26, 2021 · 4 revisions

Goal of the practical: Processes sequence data from simulated RAD loci using a specialized pipeline and analyze the identified variants.

Restriction site-associated DNA (RAD) markers

Highly recommended reading:

Common processing tools

1. Simulate RAD loci from a (FASTA) reference sequence

To simulate a set of RAD loci we will use the package RADinitio. It takes a reference sequence and simulates different stages of the RADseq library preparation and sequencing process in a given set of populational settings.

  • 1.1. Install packages

We will run the practice within a Conda virtual environment. Check how to manage Conda environments and a conda-cheatsheet.

You can create the required Conda environment using the file RADenv.yml already saved in the folder RAD your home directory.

cd RAD
source /opt/conda/etc/profile.d/conda.sh
conda create -n RADenv -c conda-forge -c bioconda ipyrad msprime

The installation processes take a while. When finished, you can activate the environment.

conda activate RADenv

Now install the radinitio package with pip

pip install radinitio

Check if the packages are ready

radinitio -h
ipyrad -h

You should get the help message of each package (i.e. list of available settings). If there are no message errors, you a ready to start the practice.

You can also see a list of all packages in an environment

conda list

You should get a quite long output including all packages now installed in the RADenv conda environment. Most of them are default (base) packages, but many are requirements for radinitio and ipyrad. Without conda you would need to install most of those requirements by yourself (one by one).

  • 1.2. Select the reference genome of a desired species

You can choose the reference you want. As an example, I will use a short fragment of the genome of the stress-tolerant wild tomato species Solanum pennellii. You can find the sequence in the file /home/curso/data/RAD/Spen_chr2_part.fasta

  • 1.3. Simulate the data RADinitio requires 1) the reference genome, 2) a file with chromosome names and 3) the output directory. The remaining settings have default values. You can check it by displaying the help message (radinitio -h).

We will simulate a double digest restriction-site associated DNA (ddRADseq) library with the following features: * 5 populations (10 individuals each) with an effective population size of 5 000 * The restriction enzymes will be ecoRI and mseI * Fragment size 350 (+/- 35) bp * 9 PCR cycles * 30X coverage * Read length 150 * We will simulate the data using only the chromosome 2.

Get chromosomes name and save to a file using AWK.

awk 'sub(/^>/, "")' /home/curso/data/RAD/Spen_chr2_part.fasta > chrom_list.txt

Create the output directory

mkdir sim_reads

Run the simulation

radinitio --simulate-all --genome /home/curso/data/RAD/Spen_chr2_part.fasta \
--chromosomes chrom_list.txt --out-dir sim_reads --n-pops 5 --pop-eff-size 5000 \
--n-seq-indv 10 --library-type ddRAD --enz ecoRI --enz2 mseI --insert-mean 350 \
--insert-stdev 35 --pcr-cycles 9 --coverage 30 --read-length 150

The sequence files in fasta format will be saved in the folder sim_reads/rad_reads/

  • Then, we need to convert to fastq format to be able to process them.
    • Create a directory to save the fastq files
    • Go to the directory with the simulated sequences
    • Use reformat.sh from BBtools in a for loop to get the fastq files
mkdir ddrad_fq
cd sim_reads/rad_reads/
for i in msp_*.1.fa.gz; do 
    reformat.sh in=${i//.1./.#.} out=../../ddrad_fq/${i//.1.fa.gz/_R#_.fastq.gz}
done

2. Process the simulated reads

To process the reads we will use the package ipyrad. Check the documentation.

The first step is to create the ipyrad parameters file

ipyrad -n sim_data_proc

After that, you should get a new text file with the name params-sim_data_proc.txt.

Check the content of the file.

------- ipyrad params file (v.0.9.56)-------------------------------------------
sim_data_proc                  ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
/home/USERNAME/RAD             ## [1] [project_dir]: Project dir (made in curdir if not present)
                               ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
                               ## [3] [barcodes_path]: Location of barcodes file
                               ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo                         ## [5] [assembly_method]: Assembly method (denovo, reference)
                               ## [6] [reference_sequence]: Location of reference sequence file
rad                            ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TGCAG,                         ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
6                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples
0.85                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
2                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
0.05                           ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus
0.05                           ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus
4                              ## [21] [min_samples_locus]: Min # samples per locus for output
0.2                            ## [22] [max_SNPs_locus]: Max # SNPs per locus
8                              ## [23] [max_Indels_locus]: Max # of indels per locus
0.5                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus
0, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
p, s, l                        ## [27] [output_formats]: Output formats (see docs)
                               ## [28] [pop_assign_file]: Path to population assignment file
                               ## [29] [reference_as_filter]: Reads mapped to this reference are removed in step 3

Then, you need to include in the file the settings for our data set. You should edit the parameters 1, 4, 7, 8, 27, and 28. You can find the explanation of each parameter in the file. Open the file with a text editor and include the information as the example below:
WATCH UP THE PATH! From now on, I am running all processes on a folder called "RAD" in the home directory of the prk40.
Adjust all the paths according to your prk and working directory.

------- ipyrad params file (v.0.9.56)-------------------------------------------
sim_data_proc                  ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
/home/USERNAME/RAD/proc_reads  ## [1] [project_dir]: Project dir (made in curdir if not present)
                               ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
                               ## [3] [barcodes_path]: Location of barcodes file
/home/USERNAME/RAD/ddrad_fq/*.fastq.gz   ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo                         ## [5] [assembly_method]: Assembly method (denovo, reference)
                               ## [6] [reference_sequence]: Location of reference sequence file
pairddrad                                           ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
AATTC,TAA                                           ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
6                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples
0.85                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
2                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
0.05                           ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus
0.05                           ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus
4                              ## [21] [min_samples_locus]: Min # samples per locus for output
0.2                            ## [22] [max_SNPs_locus]: Max # SNPs per locus
8                              ## [23] [max_Indels_locus]: Max # of indels per locus
0.5                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus
0, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
v                                                   ## [27] [output_formats]: Output formats (see docs)
/home/USERNAME/RAD/pop_assignment.txt    ## [28] [pop_assign_file]: Path to population assignment file
                               ## [29] [reference_as_filter]: Reads mapped to this reference are removed in step 3

Then, you need to prepare the population assignment file. RADinitio output a file with almost all the needed information (sim_reads/popmap.tsv). You just need to add a line at the end indicating the minimum number of samples per population.

cat sim_reads/popmap.tsv > pop_assignment.txt
echo "# pop0:10 pop1:10 pop2:10 pop3:10 pop4:10" >> pop_assignment.txt

Now, we can run the processing pipeline. ipyrad workflow consists of seven steps that you can run using the -s option. For a full description of each step check the ipyrad manual. You can run one by one to check the effect of different settings. Then, when you decided your final set of parameters you can apply all the steps in a single run.

Let us run the complete workflow with the settings you have in the params-sim_data_proc.txt file.

ipyrad -p params-sim_data_proc.txt -s 1234567

If all data and parameters are well prepared the processes will run and after a few minutes you should get the following messages on the screen:

ipyrad -p params-sim_data_proc.txt -s 1234567 -c 4

 -------------------------------------------------------------
  ipyrad [v.0.9.56]
  Interactive assembly and analysis of RAD-seq data
 ------------------------------------------------------------- 
  Parallel connection | node07.pgen.wzw.tum.de: 4 cores
  
  Step 1: Loading sorted fastq data to Samples
  [####################] 100% 0:00:02 | loading reads          
  100 fastq files loaded to 50 Samples.
  
  Step 2: Filtering and trimming reads
  [####################] 100% 0:00:22 | processing reads     
  
  Step 3: Clustering/Mapping reads within samples
  [####################] 100% 0:00:08 | join merged pairs      
  [####################] 100% 0:00:07 | join unmerged pairs    
  [####################] 100% 0:00:04 | dereplicating          
  [####################] 100% 0:01:19 | clustering/mapping     
  [####################] 100% 0:00:00 | building clusters      
  [####################] 100% 0:00:00 | chunking clusters      
  [####################] 100% 0:06:49 | aligning clusters      
  [####################] 100% 0:00:01 | concat clusters        
  [####################] 100% 0:00:02 | calc cluster stats     
  
  Step 4: Joint estimation of error rate and heterozygosity
  [####################] 100% 0:00:11 | inferring [H, E]       
  
  Step 5: Consensus base/allele calling 
  Mean error  [0.00574 sd=0.00008]
  Mean hetero [0.00781 sd=0.00051]
  [####################] 100% 0:00:02 | calculating depths     
  [####################] 100% 0:00:03 | chunking clusters      
  [####################] 100% 0:02:21 | consens calling        
  [####################] 100% 0:00:08 | indexing alleles       
  
  Step 6: Clustering/Mapping across samples 
  [####################] 100% 0:00:03 | concatenating inputs   
  [####################] 100% 0:00:05 | clustering across    
  [####################] 100% 0:00:01 | building clusters      
  [####################] 100% 0:00:24 | aligning clusters      
  
  Step 7: Filtering and formatting output files 
  [####################] 100% 0:00:08 | applying filters       
  [####################] 100% 0:00:05 | building arrays        
  [####################] 100% 0:00:00 | writing conversions    
  [####################] 100% 0:00:02 | indexing vcf depths    
  [####################] 100% 0:00:06 | writing vcf output     
  Parallel connection closed.

In addition, you will find a new folder in the working directory with the name we set in the parameter file (proc_reads in our case). Inside, you can find the output files for each step of the workflow and text files with summary stats.

Check the summary file in the outfiles folder.
head -12 proc_reads/sim_data_proc_outfiles/sim_data_proc_stats.txt
The first section summarizes the number of loci retained in each step.

## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters

                           total_filters applied_order retained_loci
total_prefiltered_loci                 0             0          1209
filtered_by_rm_duplicates             95            95          1114
filtered_by_max_indels                67            67          1047
filtered_by_max_SNPs                   0             0          1047
filtered_by_max_shared_het            40            36          1011
filtered_by_min_sample               556           556           455
total_filtered_loci                  758           754           455

We can compare these numbers with the number of simulated loci. We can find this information in the RADinitio output files.

cat sim_reads/radinitio.log

Towards the end of the file, you can find the number of RAD loci identified in the reference fasta sequence:

Extracting RAD loci...
    Extracted 930 loci.

You can see there is a big discrepancy between the number of simulated loci with RADinitio (930) and the final set of identified loci with ipyrad (455).

  • What could be the reason for this discrepancy?

We can see that the number of loci drops from 997 to 455 in the minimum number of samples per locus filtering step. That means that not all loci are present in all simulated individuals.

  • How can you explain that our simulated loci are not present in all individuals?

Now, let us check the result relaxing the filter of minimum number of samples per locus.
For that, we just need to create a branch of the workflow:

ipyrad -p params-sim_data_proc.txt -b sim_data_proc2

This command will create a new parameters file.
Then, create a new population assignment file allowing the minimum number of one sample per population (instead of 10).

cat sim_reads/popmap.tsv > pop_assignment2.txt
echo "# pop0:1 pop1:1 pop2:1 pop3:1 pop4:1" >> pop_assignment2.txt

Then, edit parameter 28 of the new parameter file (params-sim_data_proc2.txt) to indicate a new population assignment file.

sed -i 's/pop_assignment/pop_assignment2/' params-sim_data_proc2.txt

Now, run again the ipyrad with the new parameter file,

ipyrad -p params-sim_data_proc2.txt -s 7

Checking on the new summary output file

head -12 proc_reads/sim_data_proc2_outfiles/sim_data_proc2_stats.txt

You can see if you allow more missing data in the dataset is possible to recover more loci:

## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters

                           total_filters applied_order retained_loci
total_prefiltered_loci                 0             0          1209
filtered_by_rm_duplicates             95            95          1114
filtered_by_max_indels               268           268           846
filtered_by_max_SNPs                   0             0           846
filtered_by_max_shared_het            49            42           804
filtered_by_min_sample                27            27           777
total_filtered_loci                  439           432           777

Following the same procedure, you can explore the impact of other parameters on the results. Explore for example the impact of the clustering threshold in step 3.


3. Process the identified variants

Once you finished the sequence data processing with optimized parameters, you would like to analyze the identified variants.

ipyrad output the variants in different formats. In our parameter file we set to output the variants in vcf format (v ## [27] [output_formats]). You can find the vcf file in the folder proc_reads/sim_data_proc2_outfiles

There are many software and packages available to analyze population genetic data. For this practice, we will perform two simple exploratory analyses to analyze population structure using the package adegenet in R.

First, transfer the vcf file (sim_data_proc2.vcf) to your local computer using the WinSCP software (in Windows) or the commands scp of rsync (in UNIX system). Then, download a simple R markdown template that we prepared for this practice.

Once you have both the VCF file and the R markdown template in the same folder, open the R markdown file using RStudio

Explore the file. The basic structure of an R markdown file consists of two basic sections. A header with general information of the file (Title, Authors, ...) and instructions for output generation. The second section is the body of the document that alternates text (titles and paragraphs) with R code chunks.

Once you have completed the markdown file you just need to click on the knit button. If everything is properly set, after a few seconds you should get an HTML file that automatically opens in your browser and also is saved in the working directory. It is easy to generate PDF files, edit the R markdown template to generate a PDF file. You can find the basic instructions in this rmarkdown-cheatsheet. If you are not familiar with R markdown check some tutorials on the internet.