___ __ _ _ _
/ _ \___ _ __ ___ _ __ ___ ___ / _(_)_ __ ___ _ _| | __ _| |_ ___ _ __
/ /_\/ _ | '_ \ / _ \| '_ ` _ \ / _ \ \ \| | '_ ` _ \| | | | |/ _` | __/ _ \| '__|
/ /_\| __| | | | (_) | | | | | | __/ _\ | | | | | | | |_| | | (_| | || (_) | |
\____/\___|_| |_|\___/|_| |_| |_|\___| \__|_|_| |_| |_|\__,_|_|\__,_|\__\___/|_|
This repository contains scripts that run msprime genome simulations based on a input text pedigree. tskit
and msprime
provide extensive documentation about these simulations. New to this? A good starting point is the msprime introduction.
For this example we use a sample pedigree released with GenLib
ind,father,mother,sex
10086,0,0,1
10087,0,0,2
10102,0,0,1
10103,0,0,2
text pedigree (file)
In order to correctly build the msprime.Pedigree object, we must include generation numbers to ensure that parents are added before their children. We use the generate_ascending_sample_pedigree.R script to identifies probands as individuals who do not have any children, and then ascend the pedigree starting from the probads while keeping track of the maximum genealogical depth for internal nodes.
ind,mother,father,generation
409153,295170,295169,0
443151,442562,442561,0
408477,861890,861889,0
408790,863184,863183,0
text pedigree with generation time (file)
These parameters can be modified in run_genome_sim.py
, but in our case we used the following set of parameters:
demography beyond the input pedigree: Out of Africa model from Tennessen et al., 2012 (defined here and loaded from here)
model to recapitulate the tree: hudson
mutation rate: 3.62e-8 from Gravel et al., 2011 (stdpopsim) (Note: This unusually high mutation rate is chosen to match the mutation rate from Tennessen et al., 2012 while accounting for the difference in diversity between coding and non-coding sequences. See paper for details)
recombination map: HapmapII_GRCh37 (download)
This software can run on a local machine for smaller sized pedigrees, but for our paper, we used the Compute Canada cluster given the large computational resource requirements of population scale genome simulations.
To run the genomes simulation on this pedigree, we use the sample_pedigree_simulation_job.sh script which runs on compute clusters that use a slurm job scheduler. Note these commands could be run direclty with a command line interface with some user specified inputs.
python code/run_genome_sim.py \
-d /path/to/pedigree/ \
-o /path/to/tree_sequences/ \
-p ascending_sample_pedigree \
-chr 22
This will run the genome simulation pipeline using the input pedigree and following the recombination map of Chromosome 22.
The output of the simulation yields a tszipped formatted tree sequence as shown here :
import msprime
import tszip
ts = tszip.decompress("ascending_sample_pedigree_22_sim.tsz")
print(ts)
╔═══════════════════════════╗
║TreeSequence ║
╠═══════════════╤═══════════╣
║Trees │ 111516║
╟───────────────┼───────────╢
║Sequence Length│ 51304566║
╟───────────────┼───────────╢
║Time Units │generations║
╟───────────────┼───────────╢
║Sample Nodes │ 280║
╟───────────────┼───────────╢
║Total Size │ 33.5 MiB║
╚═══════════════╧═══════════╝
╔═══════════╤══════╤═════════╤════════════╗
║Table │Rows │Size │Has Metadata║
╠═══════════╪══════╪═════════╪════════════╣
║Edges │409211│ 12.5 MiB│ No║
╟───────────┼──────┼─────────┼────────────╢
║Individuals│ 369│ 23.0 KiB│ Yes║
╟───────────┼──────┼─────────┼────────────╢
║Migrations │ 0│ 8 Bytes│ No║
╟───────────┼──────┼─────────┼────────────╢
║Mutations │221856│ 7.8 MiB│ No║
╟───────────┼──────┼─────────┼────────────╢
║Nodes │ 65701│ 1.8 MiB│ No║
╟───────────┼──────┼─────────┼────────────╢
║Populations│ 5│507 Bytes│ Yes║
╟───────────┼──────┼─────────┼────────────╢
║Provenances│ 4│ 3.0 MiB│ No║
╟───────────┼──────┼─────────┼────────────╢
║Sites │221107│ 5.3 MiB│ No║
╚═══════════╧══════╧═════════╧════════════╝
tree sequence (file)
- Simulate multiple chromosomes? Easy! Just specify
#SBATCH --array=1-22
to run all autosomes as separate jobs. - installed the proper requirements.
- mirror the folder structure in the quick_start directory. In particular, make sure the 'maps' folder is in the same folder as the pedigree folder (e.g
path/to/pedigree/../maps/Tennessen_ooa_2T12.yaml
)
These python files are what interact directly with the msprime
and tskit
. They can be run on a local machine so long as all requirements are met.
run_genome_sim.py
is the driver script that defines input parameters and runs the genome simulation.
pedigree_tools.py
contains functions used to load_and_verify_pedigree
, add_individuals_to_pedigree
, and simulate_genomes_with_known_pedigree
.
convert_to_bcf.py
is a python wrapper for (write_vcf
to convert tree sequences to bcf files.
Note this pipeline has additional steps that may not be relevant for you! In particular, this pipeline converts tree sequences to inefficient file formats and prunes variants in order to run a PCA.
0_run_all_jobs.sh
job farming shell script that schedules all jobs
1_simulate_genomes_as_ts.sh
runs an array job of 22 autosomes
2_ts_to_bcf.sh
converts simulated tree sequence to a bcf file
3_bcf_to_plink.sh
converts bcf file to a plink file
4_ld_prune.sh
prunes with linkage disequilibrium and strict mask
5_concatenate_chromosomes.sh
concatenates all chromosomes
note: for principal component analysis we use flashpca2.
The genome simuations are based on the work described in the following papers, please cite them: