Skip to content

Code to simulate phylogenetic trees that can be used to train neural networks

License

Notifications You must be signed in to change notification settings

aezarebski/derp-simulation

Repository files navigation

DERP Simulation

This repository contains the code to simulate a database of phylogenetic trees that will be used in a machine learning project in which a neural network will be trained to solve phylodynamics problems.

Usage

Timelines

Figure fig:timelines shows a schematic of the timelines involved in a simulation. The origin is the start time of the simulation and the present is the time of the last sequenced sample. Note that the measurement times are randomly sampled between the origin and the present and are measured in forwards time.

./timelines.png

Getting started generating a database

To generate the database, run the main script:

python main.py <path/to/config.json>

If you want a small example to test this out, try using the config/debugging.json file. The whole simulation is configured by the JSON file provided at the command line.

Optional monitoring progress

Making a large database takes a while. There is a tool in src/monitor.py which you can run and it will report on the progress of a current database construction and give a (rough) estimate of the remaining time for each stage of the simulation.

python src/monitor.py <path/to/config.json>

Configuring a simulation

The way in which a dataset is simulated is configured with a JSON file. There is a schema for valid configurations described here. There are some example configurations provided:

Additional information about these datasets is given here.

Visualising the data

Two scripts, visualisation.py and visualisation_temporal.py can be used to visualise the output of a simulation.

python visualisation.py <path/to/config.json>
python visualisation_temporal.py <path/to/config.json>

Note that the latter only applies for simulations which are configured to report temporal data (that is, report_temporal_data is set to true in the config).

Database structure

The database is an HDF5 file. Each simulation is represented with a group with a name of the form record_xxxxxx, e.g. record_000123. The data from each simulation is split into two groups: input and output.

Input

The input group has the following datasets:

present
the time since the origin of the last sequenced sample
tree_height
the time between the $T\text{MRCA}$ and the present
tree
a binary blob which is the pickled reconstructed tree of the sequenced samples in the simulation.

Output

The output group contains a lot of measurements, but the most important is the temporal_measurements dataset. The temporal_measurements dataset has the following columns:

measurement_times (float)
the (forward) time since the origin of the measurements
prevalence (int)
the number of infected individuals
cumulative (int)
the cumulative number of infections
reproductive_number (float)
the reproduction number

Using the database

The following demonstrates how to use the database in Python. Don’t forget to close the database connection after using it! The following script reads in the tree and measurements from a simulation and produces this CSV file and the figure below.

from Bio import Phylo
import h5py
import pickle
import matplotlib.pyplot as plt
import numpy as np

hdf5_file = "../out/sim-charmander/dataset-charmander.hdf5"

db_conn = h5py.File(hdf5_file)

demo_tree = pickle.loads(db_conn['record_000001/input/tree'][...].tobytes())
fig, ax = plt.subplots()
Phylo.draw(demo_tree, do_show=False, axes=ax)
fig.savefig('../out/sim-charmander/plots/demo-tree.png')

measurements = db_conn['record_000001/output/parameters/temporal_measurements'][...]
column_names = measurements.dtype.names
np.savetxt('../out/sim-charmander/demo-measurements.csv',
           measurements, delimiter=',',
           header=','.join(column_names))

db_conn.close()

./out/sim-charmander/plots/demo-tree.png

If you want a GUI to inspect the output HDF5 file, the HDFCompass tool provides a simple way to inspect the data that has been generated. There is some basic information about the simulation stored as attributes in the HDF5 file. This includes the date of creation and the size of the dataset.

Conda environment

A conda environment to run this simulation can be created from the environment.yaml file by running the following command:

conda env create -f environment.yaml

This environment will have all the correct packages for running the simulations.

Installing BEAST2

BEAST2 is used to simulate the data. If you don’t have BEAST2 installed, there is a script scr/setuplib.sh which will download and install this for you. Once you have BEAST2 installed, you will need to install remaster through BEAUti.

About

Code to simulate phylogenetic trees that can be used to train neural networks

Resources

License

Stars

Watchers

Forks