-
Notifications
You must be signed in to change notification settings - Fork 13
Haplotype Generation
Header variants.h contains support structures for generating haplotypes from a reference sequence and phased variants. They assume a diploid genome that may contain some haploid regions. While the structures expect a graph built from a VCF file, they do not include tools for VCF parsing or graph construction.
The reference is a node sequence (a path). Each variant site corresponds to a semiopen interval [start, end) in the reference. Each site has a reference allele and a number of alternate alleles. Each allele is a path that replaces the interval [start, end) in the reference. The paths for alternate alleles may (but are not required to) start and end with a reference node.
The intervals for adjacent sites may overlap. If a haplotype has the reference allele at one of the overlapping sites, the alternate allele takes priority. If both sites have alternate alleles, it may still be possible to resolve the overlap by removing reference nodes from the end of the generated path and by skipping reference nodes at the start of the second alternate allele. If an overlap cannot be resolved, the second site can be ignored or it can cause a phase break. In the latter case, the current haplotype ends after the first site and a new haplotype starts from the second site.
By default, phasing information is stored in temporary files that are deleted when the object is deleted. The files can also be created as permanent files, and they can be turned into persistent files that are deleted when the program exits.
Variant Paths version | GBWT version | Changes |
---|---|---|
1 | v0.9 | Proper header, sample names, contig name. |
0 | v0.5 |
Haplotype
represents a haplotype path.
struct Haplotype
{
vector_type path;
size_type offset;
bool active, diploid;
size_type sample, phase, count;
};
-
path
is the node sequence for the haplotype. -
offset
is the reference offset after the end ofpath
. -
active
anddiploid
are internal variables. -
sample
is the sample identifier. -
phase
is the phase identifier (currently either 0 or 1). -
count
is a running count of haplotypes for the combination ofsample
andphase
.
All member functions are for internal use. Because the haplotype is generated from a reference path, haplotypes for different chromosomes / VCF contigs may have the same combination of sample
, phase
, and count
.
Phasing
represents the genotype of a sample at a variant site.
struct Phasing
{
size_type first, second;
bool diploid, phased;
};
-
first
: The allele identifier for the first phase (the only phase in a haploid region). 0 represents the reference allele and other values represent alternate alleles. -
second
: The allele identifier for the second phase. -
diploid
: Is this a genotype for a diploid site? -
phased
: Is the genotype phased?
Phasing::Phasing(size_type allele)
Phasing::Phasing(size_type first_allele, size_type second_allele, bool is_phased = true)
Phasing(const std::string& genotype, bool was_diploid = true, bool phase_homozygous = true)
- The first constructor creates a phasing with allele identifier
allele
for a haploid site. - The second constructor creates a phasing for a diploid site with allele
first_allele
for the first phase andsecond_allele
for the second phase. The site is either phased or unphased, depending onis_phased
. - The third constructor parses the VCF genotype string
genotype
. Missing alleles are interpreted as reference alleles. If the genotype string is empty, the constructor will create a phased genotype diploid or haploid reference allels, depending onwas_diploid
. Ifphase_homozygous
is set, homozygous variants are always phased.
void Phasing::forcePhased(std::function<bool()> rng)
This function replaces an unphased genotype with a randomly phased genotype. If the Phasing
object does not represent an unphased diploid genotype, nothing happens. Otherwise forcePhased()
makes the genotype phased and calls rng()
. If rng()
returns true
, the allele identifiers for the first and the second phase are swapped.
Phasing
objects can be compared with operators ==
and !=
and the genotype string can be written to any std::ostream
object with operator <<
.
size_type Phasing::encode(size_type max_allele) const
This function encodes the Phasing
object as an integer. It assumes that max_allele
is at least as large as first
and second
. The encoding is:
type + 3 * first + (max_allele + 1) * 3 * second
Where type
is 0
for a haploid sample, 1
for an unphased diploid sample, and 2
for a phased diploid sample.
class VariantPaths
is a structure that contains the reference path, the variant sites, and the paths for alternate alleles for a single VCF contig. While the user is expected to build the structure, they should only use it through generateHaplotypes()
(see below).
VariantPaths
can be used as a temporary in-memory structure for generating haplotypes. It can also be serialized and loaded using the SDSL interface. A serialized structure contains the names of the associated phasing information files. The reference index is not serialized.
class VariantPaths
{
std::uint32_t tag;
std::uint32_t version;
std::uint64_t flags;
vector_type reference;
std::vector<size_type> ref_starts, ref_ends;
vector_type alt_paths;
std::vector<size_type> path_starts, site_starts;
std::vector<std::string> phasing_files;
std::vector<size_type> file_offsets, file_counts;
std::vector<std::string> sample_names;
std::string contig_name;
};
- Currently
tag
must be0x6B37AA81
andversion
must be1
. - The following flags are stored in the bitfield
flags
:-
FLAG_SAMPLE_NAMES = 0x0001
: The file contains sample names. -
FLAG_CONTIG_NAME = 0x0002
: The file contains contig name.
-
-
reference
is the reference path. Its name can be stored incontig_name
. -
ref_starts
andref_ends
contain the starting and ending points of variants sites in sorted order (see Construction below). -
alt_paths
contains the concatenated paths for all alternate alleles. -
path_starts
contains the starting position of each path inalt_paths
and includes a sentinel at the end. -
site_starts
contains the starting position of each site inpath_starts
and includes a sentinel at the end. -
phasing_files
is the list of phasing file names for this contig. -
file_offsets
stores the identifier of the first sample in each phasing file. -
file_counts
stores the number of samples in each phasing file. -
sample_names
is an optional list of sample names. Sample identifieri
corresponds to namesample_names[i]
.
There are four steps in building a VariantPaths
object:
- Create an empty object.
- Append the reference nodes.
- Add the variant sites and alternate alleles in sorted order.
- (Optional) Add phasing information files in the order the haplotypes should appear in the index.
explicit VariantPaths()
explicit VariantPaths(size_type reference_size)
The constructor builds an empty object. Specifying the reference length (in nodes) via reference_size
is optional. Because the reference is stored as vector_type
(as std::vector
), knowing the length in advance will save memory.
void appendToReference(node_type node)
Append node identifier node
to the reference.
void addSite(size_type ref_start, size_type ref_end)
Adds a new variant site corresponding to the reference interval from ref_start
(inclusive) to ref_end
(exclusive). Added sites receive consecutive identifiers starting from 0. The site is initially empty. Sites should be added in sorted order to avoid unresolvable overlaps.
void addAllele(const vector_type& path)
Add a new alternate allele with path path
to the latest variant site. Added alleles receive consecutive identifiers starting from 1.
Example:
VariantPaths buildVariantPaths(const vector_type& reference,
const std::vector<range_type>& sites,
const std::vector<std::vector<vector_type>>& alleles)
{
VariantPaths variants(reference.size());
for(auto node : reference) { variants.appendToReference(node); }
for(size_type site = 0; site < sites.size(); site++)
{
variants.addSite(sites[site].first, sites[site].second);
for(const vector_type& allele : alleles[site]) { variants.addAllele(allele); }
}
return variants;
}
This function builds the VariantPaths
object from reference path reference
, variant sites sites
, and alternate alleles alleles
.
Most member functions should only be used through generateHaplotypes()
. The following functions return some statistics:
-
size_type size() const
: Reference length in nodes. -
size_type paths() const
: The total number of alternate alleles for all sites. -
size_type sites() const
: Number of variant sites. -
size_type alleles(size_type site) const
: Number of alternate alleles at sitesite
. -
size_type refStart(size_type site) const
: The start of sitesite
(inclusive). -
size_type refEnd(size_type site) const
: The end of sitesite
(exclusive).
If the structure is going to be serialized and used for GBWT construction with the build_gbwt
tool, it should contain the names of the associated permanent phasing information files.
void addFile(const std::string& filename, size_type sample_offset, size_type sample_count)
Adds a new phasing information file with name filename
. The samples in the file start from first_sample
, and the file contains the phasings for sample_count
samples.
The following functions return information about the files:
-
size_type files() const
: Number of files. -
const std::string& name(size_type file) const
: Name of filefile
. -
size_type offset(size_type file) const
: The first sample in filefile
. -
size_type count(size_type file) const
: Number of samples in filefile
.
The VariantPaths
object also contains a simple index for the first occurrences of each node identifier on the reference path. The index can be used for determining the reference intervals for variant sites, given the paths for the reference alleles.
void indexReference()
Builds the index for the current reference path.
size_type firstOccurrence(node_type node)
Returns the reference position of the first occurrence of node identifier node
or invalid_position()
if there are no occurrences.
Variant sites i and i+1 overlap if refStart(i+1) < refEnd(i)
. If a haplotype has alternate alleles at both overlapping sites, it may be unclear how to generate the path for the haplotype. If the path we have already generated ends with a reference nodet or the path for the next allele starts with a reference node (both with the correct offset), we can try to resolve the overlap by removing reference nodes.
The following function prints all combinations of adjacent alleles that may cause unresolvable overlaps:
void checkOverlaps(const VariantPaths& variants, std::ostream& out, bool print_ids = false)
-
const VariantPaths& variants
: The variant paths object. -
std::ostream& out
: Output stream. -
bool print_ids
: By default, the function prints the actual node identifiers. Whenprint_ids
is set to true, it transforms the identifiers by usingNode::id()
.
struct PhasingInformation
stores phasing information in a temporary of permanent file. The file is usually similar in size to a .vcf.gz
file containing the same information, but it is much faster to parse. Because the file is run-length encoded, files containing diploid and haploid genotypes (e.g. human chromosome X) may require more space.
As with the VariantPaths
structure, the user is expected to build the PhasingInformation
structure. Afterwards it should only be used through generateHaplotypes()
(see below).
A PhasingInformation
object can be moved but not copied. When the object is deleted and the associated file is temporary, the file will be removed.
Each phasing information file is sdsl::int_vector<8>
containing the concatenated representations of each variant site. For each site the file contains the following information:
- The largest allele identifier encoded using
ByteCode
. - Run-length encoded phasing information for all samples using
Run
.- Each sample is encoded as an integer using
Phasing::encode()
.
- Each sample is encoded as an integer using
See File Formats for further information on ByteCode
and Run
.
A PhasingInformation
structure is built by first creating an empty object and then inserting the phasings for one site at a time.
PhasingInformation(size_type first_sample, size_type num_samples)
This constructor creates an empty object with phasings for num_samples
samples in a temporary file. The samples receive identifiers from first_sample
to first_sample + num_samples - 1
(inclusive).
PhasingInformation(const std::string& base_name, size_type first_sample, size_type num_samples)
As above, but the object is backed by a permanent file. The name of the file is [base_name]_[first_sample]_[num_samples]
(e.g. parse_200_100
).
void append(const std::vector<Phasing>& new_site)
This function adds a new variant site with phasings according to new_site
. If new_site.size() != size()
(see below), the function prints an error message and returns without adding a new site.
void close()
The PhasingInformation
object contains a 1 MB read buffer for the temporary file. If the phasing information in a VCF file is written into a large number of PhasingInformation
objects, each of them containing the phasings for a range of samples, the buffers may use a large amount of memory. This can be avoided by closing the PhasingInformation
object.
Note: New sites cannot be appended after the file has been closed.
Example:
PhasingInformation buildPhasingInformation(const std::vector<std::vector<Phasing>>& phasing_information)
{
PhasingInformation phasings(0, phasing_information.size());
for(const std::vector<Phasing>& site : phasing_information)
{
phasings.append(site);
}
phasings.close();
return phasings;
}
This function builds the PhasingInformation
object from the vector of vectors phasing_information
.
Most member functions should only be used through generateHaplotypes()
. The following functions return some statistics:
-
bool isOpen()
: Is the object open? New sites cannot be appended after the object has been closed. -
const std::string& name() const
: The name of the file. -
bool isTemporary() const
: Is the file temporary. -
size_type size() const
: The number of samples in the object. -
size_type sites() const
: The number of variant sites in the object. -
size_type bytes() const
: The size of the temporary file in bytes.
If the object is backed by a temporary file, the file can be made persistent by calling makePersistent()
. The call has no effect if the object is backed by a permanent file.
Given a VariantPaths
object and a PhasingInformation
object over the same variant sites, we can easily generate haplotype paths for the samples in the PhasingInformation
object. There are two versions of the haplotype generation function.
void generateHaplotypes(const VariantPaths& variants,
PhasingInformation& phasings,
std::function<bool(size_type)> process_sample,
std::function<void(const Haplotype&)> output,
std::function<bool(size_type, size_type)> report_overlap)
void generateHaplotypes(const VariantPaths& variants,
const std::set<std::string>& phasing_files,
std::function<bool(size_type)> process_sample,
std::function<void(const Haplotype&)> output,
std::function<bool(size_type, size_type)> report_overlap)
-
const VariantPaths& variants
: TheVariantPaths
object. -
PhasingInformation& phasings
: ThePhasingInformation
object. -
const std::set<std::string>& phasing_files
: A set of names of phasing information files. Haplotypes will be generated if the file is both in the set and in theVariantPaths
structure. An empty set indicates that all files in the structure should be used. -
std::function<bool(size_type)> process_sample
: A function that tells whether the haplotype paths should be generated for the sample with the given identifier. -
std::function<void(const Haplotype&)> output
: A function that outputs the generated haplotype paths. -
std::function<bool(size_type, size_type)> report_overlap
: A function that reports an unresolvable overlap and tells whether the overlapping allele should be replaced with the reference allele.
generateHaplotypes(variants, phasings,
[&](size_type sample) -> bool
{
return (excluded_samples.find(sample) == excluded_samples.end());
},
[&](const Haplotype& haplotype)
{
std::cout << "Sample " << haplotype.sample
<< ", phase " << haplotype.phase
<< ", path " << haplotype.count << std::endl;
},
[&](size_type site, size_type allele) -> bool
{
std::cout << "Unresolvable overlap at site " << site
<< ", allele " << allele << std::endl;
return false;
});
This function call generates the haplotypes for VariantPaths variants
and PhasingInformation phasings
.
- The first lambda tells to skip a sample, if its identifier is in set
excluded_samples
. - The second lambda prints some information about the generated haplotype but does not do anything with the path itself.
- The third lambda prints some information about the unresolvable overlap. Then it tells to create a phase break instead of replacing the alternate allele with the reference allele.