-
Notifications
You must be signed in to change notification settings - Fork 13
Haplotype Generation
Header variants.h contains support structures for generating haplotypes from a reference sequence and phased variants. They assume a diploid genome that may contain some haploid regions. While the structures expect a graph built from a VCF file, they do not include tools for VCF parsing or graph construction.
The reference is a node sequence (a path). Each variant site corresponds to a semiopen interval [start, end) in the reference. Each site has a reference allele and a number of alternate alleles. Each allele is a path that replaces the interval [start, end) in the reference. The paths for alternate alleles may (but are not required to) start and end with a reference node.
The intervals for adjacent sites may overlap. If a haplotype has the reference allele at one of the overlapping sites, the alternate allele takes priority. If both sites have alternate alleles, it may still be possible to resolve the overlap by removing reference nodes from the end of the generated path and by skipping reference nodes at the start of the second alternate allele. If an overlap cannot be resolved, the second site can be ignored or it can cause a phase break. In the latter case, the current haplotype ends after the first site and a new haplotype starts from the second site.
Haplotype
represents a haplotype path.
struct Haplotype
{
vector_type path;
size_type offset;
bool active, diploid;
size_type sample, phase, count;
};
-
path
is the node sequence for the haplotype. -
offset
is the reference offset after the end ofpath
. -
active
anddiploid
are internal variables. -
sample
is the sample identifier. -
phase
is the phase identifier (currently either 0 or 1). -
count
is a running count of haplotypes for the combination ofsample
andphase
.
All member functions are for internal use. Because the haplotype is generated from a reference path, haplotypes for different chromosomes / VCF contigs may have the same combination of sample
, phase
, and count
.
Phasing
represents the genotype of a sample at a variant site.
struct Phasing
{
size_type first, second;
bool diploid, phased;
};
-
first
: The allele identifier for the first phase (the only phase in a haploid region). 0 represents the reference allele and other values represent alternate alleles. -
second
: The allele identifier for the second phase. -
diploid
: Is this a genotype for a diploid site? -
phased
: Is the genotype phased?
Phasing::Phasing(size_type allele)
Phasing::Phasing(size_type first_allele, size_type second_allele, bool is_phased = true)
Phasing(const std::string& genotype, bool was_diploid = true)
- The first constructor creates a phasing with allele identifier
allele
for a haploid site. - The second constructor creates a phasing for a diploid site with allele
first_allele
for the first phase andsecond_allele
for the second phase. The site is either phased or unphased, depending onis_phased
. - The third constructor parses the VCF genotype string
genotype
. Missing alleles are interpreted as reference alleles. If the genotype string is empty, the constructor will create a phased genotype diploid or haploid reference allels, depending onwas_diploid
.
void Phasing::forcePhased(std::function<bool()> rng)
This function replaces an unphased genotype with a randomly phased genotype. If the Phasing
object does not represent an unphased diploid genotype, nothing happens. Otherwise forcePhased()
makes the genotype phased and calls rng()
. If rng()
returns true
, the allele identifiers for the first and the second phase are swapped.
Phasing
objects can be compared with operators ==
and !=
and the genotype string can be written to any std::ostream
object with operator <<
.
struct VariantPaths
is a structure that contains the reference path, the variant sites, and the paths for alternate alleles. While the user is expected to build the structure, they should only use it through generateHaplotypes()
(see below).
There are three steps in building a VariantPaths
object:
- Create an empty object.
- Append the reference nodes.
- Add the variant sites and alternate alleles in sorted order.
explicit VariantPaths(size_type reference_size = 0)
The only constructor builds an empty object. Specifying the reference length (in nodes) via reference_size
is optional. Because the reference is stored as vector_type
(as std::vector
), knowing the length in advance will save memory.
void appendToReference(node_type node)
Append node identifier node
to the reference.
void addSite(size_type ref_start, size_type ref_end)
Adds a new variant site corresponding to the reference interval from ref_start
(inclusive) to ref_end
(exclusive). Added sites receive consecutive identifiers starting from 0. The site is initially empty. Sites should be added in sorted order to avoid unresolvable overlaps.
void addAllele(const vector_type& path)
Add a new alternate allele with path path
to the latest variant site. Added alleles receive consecutive identifiers starting from 1.
Example:
VariantPaths buildVariantPaths(const vector_type& reference,
const std::vector<range_type>& sites,
const std::vector<std::vector<vector_type>>& alleles)
{
VariantPaths variants(reference.size());
for(auto node : reference) { variants.appendToReference(node); }
for(size_type site = 0; site < sites.size(); site++)
{
variants.addSite(sites[site].first, sites[site].second);
for(const vector_type& allele : alleles[site]) { variants.addAllele(allele); }
}
return variants;
}
This function builds the VariantPaths
object from reference path reference
, variant sites sites
, and alternate alleles alleles
.
Most member functions should only be used through generateHaplotypes()
. The following functions return some statistics:
-
size_type size() const
: Reference length in nodes. -
size_type paths() const
: The total number of alternate alleles for all sites. -
size_type sites() const
: Number of variant sites. -
size_type alleles(size_type site) const
: Number of alternate alleles at sitesite
. -
size_type refStart(size_type site) const
: The start of sitesite
(inclusive). -
size_type refEnd(size_type site) const
: The end of sitesite
(exclusive).
The VariantPaths
object also contains a simple index for the first occurrences of each node identifier on the reference path. The index can be used for determining the reference intervals for variant sites, given the paths for the reference alleles.
void indexReference()
Builds the index for the current reference path.
size_type firstOccurrence(node_type node)
Returns the reference position of the first occurrence of node identifier node
or invalid_position()
if there are no occurrences.
Variant sites i and i+1 overlap if refStart(i+1) < refEnd(i)
. If a haplotype has alternate alleles at both overlapping sites, it may be unclear how to generate the path for the haplotype. If the path we have already generated ends with a reference nodet or the path for the next allele starts with a reference node (both with the correct offset), we can try to resolve the overlap by removing reference nodes.
The following function prints all combinations of adjacent alleles that may cause unresolvable overlaps:
void checkOverlaps(const VariantPaths& variants, std::ostream& out, bool print_ids = false)
-
const VariantPaths& variants
: The variant paths object. -
std::ostream& out
: Output stream. -
bool print_ids
: By default, the function prints the actual node identifiers. Whenprint_ids
is set to true, it transforms the identifiers by usingNode::id()
.