Skip to content

Haplotype Generation

Jouni Siren edited this page Oct 27, 2020 · 18 revisions

General

Header variants.h contains support structures for generating haplotypes from a reference sequence and phased variants. They assume a diploid genome that may contain some haploid regions. While the structures expect a graph built from a VCF file, they do not include tools for VCF parsing or graph construction.

The reference is a node sequence (a path). Each variant site corresponds to a semiopen interval [start, end) in the reference. Each site has a reference allele and a number of alternate alleles. Each allele is a path that replaces the interval [start, end) in the reference. The paths for alternate alleles may (but are not required to) start and end with a reference node.

The intervals for adjacent sites may overlap. If a haplotype has the reference allele at one of the overlapping sites, the alternate allele takes priority. If both sites have alternate alleles, it may still be possible to resolve the overlap by removing reference nodes from the end of the generated path and by skipping reference nodes at the start of the second alternate allele. If an overlap cannot be resolved, the second site can be ignored or it can cause a phase break. In the latter case, the current haplotype ends after the first site and a new haplotype starts from the second site.

By default, phasing information is stored in temporary files that are deleted when the object is deleted. The files can also be created as permanent files, and they can be turned into persistent files that are deleted when the program exits.

Version history

Variant Paths version GBWT version Changes
1 v0.9 Proper header, sample names, contig name.
0 v0.5

Basic structures

Haplotype

Haplotype represents a haplotype path.

struct Haplotype
{
  vector_type path;
  size_type   offset;
  bool        active, diploid;
  size_type   sample, phase, count;
};
  • path is the node sequence for the haplotype.
  • offset is the reference offset after the end of path.
  • active and diploid are internal variables.
  • sample is the sample identifier.
  • phase is the phase identifier (currently either 0 or 1).
  • count is a running count of haplotypes for the combination of sample and phase.

All member functions are for internal use. Because the haplotype is generated from a reference path, haplotypes for different chromosomes / VCF contigs may have the same combination of sample, phase, and count.

Phasing

Phasing represents the genotype of a sample at a variant site.

struct Phasing
{
  size_type first, second;
  bool      diploid, phased;
};
  • first: The allele identifier for the first phase (the only phase in a haploid region). 0 represents the reference allele and other values represent alternate alleles.
  • second: The allele identifier for the second phase.
  • diploid: Is this a genotype for a diploid site?
  • phased: Is the genotype phased?
Phasing::Phasing(size_type allele)
Phasing::Phasing(size_type first_allele, size_type second_allele, bool is_phased = true)
Phasing(const std::string& genotype, bool was_diploid = true, bool phase_homozygous = true)
  • The first constructor creates a phasing with allele identifier allele for a haploid site.
  • The second constructor creates a phasing for a diploid site with allele first_allele for the first phase and second_allele for the second phase. The site is either phased or unphased, depending on is_phased.
  • The third constructor parses the VCF genotype string genotype. Missing alleles are interpreted as reference alleles. If the genotype string is empty, the constructor will create a phased genotype diploid or haploid reference allels, depending on was_diploid. If phase_homozygous is set, homozygous variants are always phased.
void Phasing::forcePhased(std::function<bool()> rng)

This function replaces an unphased genotype with a randomly phased genotype. If the Phasing object does not represent an unphased diploid genotype, nothing happens. Otherwise forcePhased() makes the genotype phased and calls rng(). If rng() returns true, the allele identifiers for the first and the second phase are swapped.

Phasing objects can be compared with operators == and != and the genotype string can be written to any std::ostream object with operator <<.

size_type Phasing::encode(size_type max_allele) const

This function encodes the Phasing object as an integer. It assumes that max_allele is at least as large as first and second. The encoding is:

type + 3 * first + (max_allele + 1) * 3 * second

Where type is 0 for a haploid sample, 1 for an unphased diploid sample, and 2 for a phased diploid sample.

Variant paths

class VariantPaths is a structure that contains the reference path, the variant sites, and the paths for alternate alleles for a single VCF contig. While the user is expected to build the structure, they should only use it through generateHaplotypes() (see below).

VariantPaths can be used as a temporary in-memory structure for generating haplotypes. It can also be serialized and loaded using the SDSL interface. A serialized structure contains the names of the associated phasing information files. The reference index is not serialized.

File format

class VariantPaths
{
  std::uint32_t tag;
  std::uint32_t version;
  std::uint64_t flags;

  vector_type              reference;
  std::vector<size_type>   ref_starts, ref_ends;
  vector_type              alt_paths;
  std::vector<size_type>   path_starts, site_starts;

  std::vector<std::string> phasing_files;
  std::vector<size_type>   file_offsets, file_counts;

  std::vector<std::string> sample_names;
  std::string              contig_name;
};
  • Currently tag must be 0x6B37AA81 and version must be 1.
  • The following flags are stored in the bitfield flags:
    • FLAG_SAMPLE_NAMES = 0x0001: The file contains sample names.
    • FLAG_CONTIG_NAME = 0x0002: The file contains contig name.
  • reference is the reference path. Its name can be stored in contig_name.
  • ref_starts and ref_ends contain the starting and ending points of variants sites in sorted order (see Construction below).
  • alt_paths contains the concatenated paths for all alternate alleles.
  • path_starts contains the starting position of each path in alt_paths and includes a sentinel at the end.
  • site_starts contains the starting position of each site in path_starts and includes a sentinel at the end.
  • phasing_files is the list of phasing file names for this contig.
  • file_offsets stores the identifier of the first sample in each phasing file.
  • file_counts stores the number of samples in each phasing file.
  • sample_names is an optional list of sample names. Sample identifier i corresponds to name sample_names[i].

Construction

There are four steps in building a VariantPaths object:

  1. Create an empty object.
  2. Append the reference nodes.
  3. Add the variant sites and alternate alleles in sorted order.
  4. (Optional) Add phasing information files in the order the haplotypes should appear in the index.
explicit VariantPaths()
explicit VariantPaths(size_type reference_size)

The constructor builds an empty object. Specifying the reference length (in nodes) via reference_size is optional. Because the reference is stored as vector_type (as std::vector), knowing the length in advance will save memory.

void appendToReference(node_type node)

Append node identifier node to the reference.

void addSite(size_type ref_start, size_type ref_end)

Adds a new variant site corresponding to the reference interval from ref_start (inclusive) to ref_end (exclusive). Added sites receive consecutive identifiers starting from 0. The site is initially empty. Sites should be added in sorted order to avoid unresolvable overlaps.

void addAllele(const vector_type& path)

Add a new alternate allele with path path to the latest variant site. Added alleles receive consecutive identifiers starting from 1.

Example:

VariantPaths buildVariantPaths(const vector_type& reference,
                               const std::vector<range_type>& sites,
                               const std::vector<std::vector<vector_type>>& alleles)
{
  VariantPaths variants(reference.size());
  for(auto node : reference) { variants.appendToReference(node); }

  for(size_type site = 0; site < sites.size(); site++)
  {
    variants.addSite(sites[site].first, sites[site].second);
    for(const vector_type& allele : alleles[site]) { variants.addAllele(allele); }
  }

  return variants;
}

This function builds the VariantPaths object from reference path reference, variant sites sites, and alternate alleles alleles.

Statistics

Most member functions should only be used through generateHaplotypes(). The following functions return some statistics:

  • size_type size() const: Reference length in nodes.
  • size_type paths() const: The total number of alternate alleles for all sites.
  • size_type sites() const: Number of variant sites.
  • size_type alleles(size_type site) const: Number of alternate alleles at site site.
  • size_type refStart(size_type site) const: The start of site site (inclusive).
  • size_type refEnd(size_type site) const: The end of site site (exclusive).

Phasing information files

If the structure is going to be serialized and used for GBWT construction with the build_gbwt tool, it should contain the names of the associated permanent phasing information files.

void addFile(const std::string& filename, size_type sample_offset, size_type sample_count)

Adds a new phasing information file with name filename. The samples in the file start from first_sample, and the file contains the phasings for sample_count samples.

The following functions return information about the files:

  • size_type files() const: Number of files.
  • const std::string& name(size_type file) const: Name of file file.
  • size_type offset(size_type file) const: The first sample in file file.
  • size_type count(size_type file) const: Number of samples in file file.

Reference index

The VariantPaths object also contains a simple index for the first occurrences of each node identifier on the reference path. The index can be used for determining the reference intervals for variant sites, given the paths for the reference alleles.

void indexReference()

Builds the index for the current reference path.

size_type firstOccurrence(node_type node)

Returns the reference position of the first occurrence of node identifier node or invalid_position() if there are no occurrences.

Overlapping variants

Variant sites i and i+1 overlap if refStart(i+1) < refEnd(i). If a haplotype has alternate alleles at both overlapping sites, it may be unclear how to generate the path for the haplotype. If the path we have already generated ends with a reference nodet or the path for the next allele starts with a reference node (both with the correct offset), we can try to resolve the overlap by removing reference nodes.

The following function prints all combinations of adjacent alleles that may cause unresolvable overlaps:

void checkOverlaps(const VariantPaths& variants, std::ostream& out, bool print_ids = false)
  • const VariantPaths& variants: The variant paths object.
  • std::ostream& out: Output stream.
  • bool print_ids: By default, the function prints the actual node identifiers. When print_ids is set to true, it transforms the identifiers by using Node::id().

Phasing information

struct PhasingInformation stores phasing information in a temporary of permanent file. The file is usually similar in size to a .vcf.gz file containing the same information, but it is much faster to parse. Because the file is run-length encoded, files containing diploid and haploid genotypes (e.g. human chromosome X) may require more space.

As with the VariantPaths structure, the user is expected to build the PhasingInformation structure. Afterwards it should only be used through generateHaplotypes() (see below).

A PhasingInformation object can be moved but not copied. When the object is deleted and the associated file is temporary, the file will be removed.

File format

Each phasing information file is sdsl::int_vector<8> containing the concatenated representations of each variant site. For each site the file contains the following information:

  • The largest allele identifier encoded using ByteCode.
  • Run-length encoded phasing information for all samples using Run.
    • Each sample is encoded as an integer using Phasing::encode().

See File Formats for further information on ByteCode and Run.

Construction

A PhasingInformation structure is built by first creating an empty object and then inserting the phasings for one site at a time.

PhasingInformation(size_type first_sample, size_type num_samples)

This constructor creates an empty object with phasings for num_samples samples in a temporary file. The samples receive identifiers from first_sample to first_sample + num_samples - 1 (inclusive).

PhasingInformation(const std::string& base_name, size_type first_sample, size_type num_samples)

As above, but the object is backed by a permanent file. The name of the file is [base_name]_[first_sample]_[num_samples] (e.g. parse_200_100).

void append(const std::vector<Phasing>& new_site)

This function adds a new variant site with phasings according to new_site. If new_site.size() != size() (see below), the function prints an error message and returns without adding a new site.

void close()

The PhasingInformation object contains a 1 MB read buffer for the temporary file. If the phasing information in a VCF file is written into a large number of PhasingInformation objects, each of them containing the phasings for a range of samples, the buffers may use a large amount of memory. This can be avoided by closing the PhasingInformation object.

Note: New sites cannot be appended after the file has been closed.

Example:

PhasingInformation buildPhasingInformation(const std::vector<std::vector<Phasing>>& phasing_information)
{
  PhasingInformation phasings(0, phasing_information.size());
  for(const std::vector<Phasing>& site : phasing_information)
  {
    phasings.append(site);
  }
  phasings.close();
  return phasings;
}

This function builds the PhasingInformation object from the vector of vectors phasing_information.

Statistics

Most member functions should only be used through generateHaplotypes(). The following functions return some statistics:

  • bool isOpen(): Is the object open? New sites cannot be appended after the object has been closed.
  • const std::string& name() const: The name of the file.
  • bool isTemporary() const: Is the file temporary.
  • size_type size() const: The number of samples in the object.
  • size_type sites() const: The number of variant sites in the object.
  • size_type bytes() const: The size of the temporary file in bytes.

Persistent files

If the object is backed by a temporary file, the file can be made persistent by calling makePersistent(). The call has no effect if the object is backed by a permanent file.

Haplotype generation

Given a VariantPaths object and a PhasingInformation object over the same variant sites, we can easily generate haplotype paths for the samples in the PhasingInformation object. There are two versions of the haplotype generation function.

void generateHaplotypes(const VariantPaths& variants,
                        PhasingInformation& phasings,
                        std::function<bool(size_type)> process_sample,
                        std::function<void(const Haplotype&)> output,
                        std::function<bool(size_type, size_type)> report_overlap)

void generateHaplotypes(const VariantPaths& variants,
                        const std::set<std::string>& phasing_files,
                        std::function<bool(size_type)> process_sample,
                        std::function<void(const Haplotype&)> output,
                        std::function<bool(size_type, size_type)> report_overlap)
  • const VariantPaths& variants: The VariantPaths object.
  • PhasingInformation& phasings: The PhasingInformation object.
  • const std::set<std::string>& phasing_files: A set of names of phasing information files. Haplotypes will be generated if the file is both in the set and in the VariantPaths structure. An empty set indicates that all files in the structure should be used.
  • std::function<bool(size_type)> process_sample: A function that tells whether the haplotype paths should be generated for the sample with the given identifier.
  • std::function<void(const Haplotype&)> output: A function that outputs the generated haplotype paths.
  • std::function<bool(size_type, size_type)> report_overlap: A function that reports an unresolvable overlap and tells whether the overlapping allele should be replaced with the reference allele.

Example

generateHaplotypes(variants, phasings,
                   [&](size_type sample) -> bool
                   {
                     return (excluded_samples.find(sample) == excluded_samples.end());
                   },
                   [&](const Haplotype& haplotype)
                   {
                     std::cout << "Sample " << haplotype.sample
                               << ", phase " << haplotype.phase
                               << ", path " << haplotype.count << std::endl;
                   },
                   [&](size_type site, size_type allele) -> bool
                   {
                     std::cout << "Unresolvable overlap at site " << site
                               << ", allele " << allele << std::endl;
                     return false;
                   });

This function call generates the haplotypes for VariantPaths variants and PhasingInformation phasings.

  • The first lambda tells to skip a sample, if its identifier is in set excluded_samples.
  • The second lambda prints some information about the generated haplotype but does not do anything with the path itself.
  • The third lambda prints some information about the unresolvable overlap. Then it tells to create a phase break instead of replacing the alternate allele with the reference allele.