Skip to content

Haplotype Generation

Jouni Siren edited this page Jul 19, 2018 · 18 revisions

General

Header variants.h contains support structures for generating haplotypes from a reference sequence and phased variants. They assume a diploid genome that may contain some haploid regions. While the structures expect a graph built from a VCF file, they do not include tools for VCF parsing or graph construction.

The reference is a node sequence (a path). Each variant site corresponds to a semiopen interval [start, end) in the reference. Each site has a reference allele and a number of alternate alleles. Each allele is a path that replaces the interval [start, end) in the reference. The paths for alternate alleles may (but are not required to) start and end with a reference node.

The intervals for adjacent sites may overlap. If a haplotype has the reference allele at one of the overlapping sites, the alternate allele takes priority. If both sites have alternate alleles, it may still be possible to resolve the overlap by removing reference nodes from the end of the generated path and by skipping reference nodes at the start of the second alternate allele. If an overlap cannot be resolved, the second site can be ignored or it can cause a phase break. In the latter case, the current haplotype ends after the first site and a new haplotype starts from the second site.

Basic structures

Haplotype

Haplotype represents a haplotype path.

struct Haplotype
{
  vector_type path;
  size_type   offset;
  bool        active, diploid;
  size_type   sample, phase, count;
};
  • path is the node sequence for the haplotype.
  • offset is the reference offset after the end of path.
  • active and diploid are internal variables.
  • sample is the sample identifier.
  • phase is the phase identifier (currently either 0 or 1).
  • count is a running count of haplotypes for the combination of sample and phase.

All member functions are for internal use. Because the haplotype is generated from a reference path, haplotypes for different chromosomes / VCF contigs may have the same combination of sample, phase, and count.

Phasing

Phasing represents the genotype of a sample at a variant site.

struct Phasing
{
  size_type first, second;
  bool      diploid, phased;
};
  • first: The allele identifier for the first phase (the only phase in a haploid region). 0 represents the reference allele and other values represent alternate alleles.
  • second: The allele identifier for the second phase.
  • diploid: Is this a genotype for a diploid site?
  • phased: Is the genotype phased?
Phasing::Phasing(size_type allele)
Phasing::Phasing(size_type first_allele, size_type second_allele, bool is_phased = true)
Phasing(const std::string& genotype, bool was_diploid = true)
  • The first constructor creates a phasing with allele identifier allele for a haploid site.
  • The second constructor creates a phasing for a diploid site with allele first_allele for the first phase and second_allele for the second phase. The site is either phased or unphased, depending on is_phased.
  • The third constructor parses the VCF genotype string genotype. Missing alleles are interpreted as reference alleles. If the genotype string is empty, the constructor will create a phased genotype diploid or haploid reference allels, depending on was_diploid.
void Phasing::forcePhased(std::function<bool()> rng)

This function replaces an unphased genotype with a randomly phased genotype. If the Phasing object does not represent an unphased diploid genotype, nothing happens. Otherwise forcePhased() makes the genotype phased and calls rng(). If rng() returns true, the allele identifiers for the first and the second phase are swapped.

Phasing objects can be compared with operators == and != and the genotype string can be written to any std::ostream object with operator <<.

Variant paths

struct VariantPaths is a structure that contains the reference path, the variant sites, and the paths for alternate alleles. While the user is expected to build the structure, they should only use it through generateHaplotypes() (see below).

Construction

There are three steps in building a VariantPaths object:

  1. Create an empty object.
  2. Append the reference nodes.
  3. Add the variant sites and alternate alleles in sorted order.
explicit VariantPaths(size_type reference_size = 0)

The only constructor builds an empty object. Specifying the reference length (in nodes) via reference_size is optional. Because the reference is stored as vector_type (as std::vector), knowing the length in advance will save memory.

void appendToReference(node_type node)

Append node identifier node to the reference.

void addSite(size_type ref_start, size_type ref_end)

Adds a new variant site corresponding to the reference interval from ref_start (inclusive) to ref_end (exclusive). Added sites receive consecutive identifiers starting from 0. The site is initially empty. Sites should be added in sorted order to avoid unresolvable overlaps.

void addAllele(const vector_type& path)

Add a new alternate allele with path path to the latest variant site. Added alleles receive consecutive identifiers starting from 1.

Example:

VariantPaths buildVariantPaths(const vector_type& reference,
                               const std::vector<range_type>& sites,
                               const std::vector<std::vector<vector_type>>& alleles)
{
  VariantPaths variants(reference.size());
  for(auto node : reference) { variants.appendToReference(node); }

  for(size_type site = 0; site < sites.size(); site++)
  {
    variants.addSite(sites[site].first, sites[site].second);
    for(const vector_type& allele : alleles[site]) { variants.addAllele(allele); }
  }

  return variants;
}

This function builds the VariantPaths object from reference path reference, variant sites sites, and alternate alleles alleles.

Statistics

Most member functions should only be used through generateHaplotypes(). The following functions return some statistics:

  • size_type size() const: Reference length in nodes.
  • size_type paths() const: The total number of alternate alleles for all sites.
  • size_type sites() const: Number of variant sites.
  • size_type alleles(size_type site) const: Number of alternate alleles at site site.
  • size_type refStart(size_type site) const: The start of site site (inclusive).
  • size_type refEnd(size_type site) const: The end of site site (exclusive).

Reference index

The VariantPaths object also contains a simple index for the first occurrences of each node identifier on the reference path. The index can be used for determining the reference intervals for variant sites, given the paths for the reference alleles.

void indexReference()

Builds the index for the current reference path.

size_type firstOccurrence(node_type node)

Returns the reference position of the first occurrence of node identifier node or invalid_position() if there are no occurrences.

Overlapping variants

Variant sites i and i+1 overlap if refStart(i+1) < refEnd(i). If a haplotype has alternate alleles at both overlapping sites, it may be unclear how to generate the path for the haplotype. If the path we have already generated ends with a reference nodet or the path for the next allele starts with a reference node (both with the correct offset), we can try to resolve the overlap by removing reference nodes.

The following function prints all combinations of adjacent alleles that may cause unresolvable overlaps:

void checkOverlaps(const VariantPaths& variants, std::ostream& out, bool print_ids = false)
  • const VariantPaths& variants: The variant paths object.
  • std::ostream& out: Output stream.
  • bool print_ids: By default, the function prints the actual node identifiers. When print_ids is set to true, it transforms the identifiers by using Node::id().

Phasing information

Haplotype generation

Clone this wiki locally