Metadata

Jouni Siren edited this page Sep 18, 2024 · 26 revisions

General

Metadata is optional information associated with path identifiers. Each path can have a name that consists of four components: sample identifier, contig identifier, phase number, and a running count for the same sample/contig/phase combination. The count field makes each path name unique. Each sample and contig may also have a unique string as its name.

The internal representation of path names is an array of PathName objects that store each component as an integer. It is also possible to extract FullPathName objects that store sample/contig names as strings, converting the identifiers to strings if names are not present.
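The relationship between the two representations can be sketched as follows. This is a simplified illustration, not the library's definitions: the types SketchPathName and SketchFullPathName, their field names, and the helpers nameOrId and toFullName are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the integer representation of a path name.
struct SketchPathName
{
  std::size_t sample, contig, phase, count;
};

// Hypothetical stand-in for the standalone representation with string names.
struct SketchFullPathName
{
  std::string sample_name, contig_name;
  std::size_t haplotype, fragment;
};

// Returns the stored name for `id`, or the identifier converted to a string
// if the name list does not cover it (for example, when no names are stored).
inline std::string nameOrId(const std::vector<std::string>& names, std::size_t id)
{
  return (id < names.size() ? names[id] : std::to_string(id));
}

// Converts the integer representation into the standalone representation.
inline SketchFullPathName toFullName(const SketchPathName& path,
                                     const std::vector<std::string>& sample_names,
                                     const std::vector<std::string>& contig_names)
{
  return { nameOrId(sample_names, path.sample), nameOrId(contig_names, path.contig),
           path.phase, path.count };
}
```

With empty name lists, a path with sample 1 and contig 0 gets the sample name "1" and the contig name "0" in this sketch.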

Data model

  • Samples and contigs are semantically meaningful fields. Selecting paths by sample or contig should yield meaningful results.
  • Contigs should correspond to non-overlapping objects such as reference contigs/paths or graph components.
  • Phase number (or haplotype identifier) can be used for differentiating fully overlapping paths for the same sample and contig. The field is typically used for haplotypes in diploid/polyploid samples.
  • Haplotype count should be equal to the number of distinct (sample, phase) pairs.
  • The count (or fragment identifier) field can be used for non-overlapping or potentially overlapping paths with the same sample and contig.
  • If there are multiple phase numbers in use, selecting paths by sample, contig, and phase should yield meaningful results. For example, this could retrieve path fragments that can be ordered by the count field.
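As an illustration of the haplotype count rule above, the following sketch counts the distinct (sample, phase) pairs in a list of path names. The struct SketchPathName and the function countHaplotypes are hypothetical and only mirror the four components described on this page.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for the integer representation of a path name.
struct SketchPathName
{
  std::size_t sample, contig, phase, count;
};

// Counts the distinct (sample, phase) pairs among the path names.
// Contig and count do not affect the result.
inline std::size_t countHaplotypes(const std::vector<SketchPathName>& paths)
{
  std::set<std::pair<std::size_t, std::size_t>> seen;
  for(const SketchPathName& path : paths) { seen.emplace(path.sample, path.phase); }
  return seen.size();
}
```

A diploid sample with phases 1 and 2 over several contigs contributes two haplotypes under this rule, regardless of the number of contigs or fragments.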

Ideal data model

Path names are unique and hierarchical.

  • Sample name is the top-level name.
  • Contigs are non-overlapping parts of a sample. Their names should be shared between samples and should refer to the weakly connected components of the graph.
  • Haplotypes are overlapping parts of a (sample, contig). Their identifiers are arbitrary integers.
  • Fragments are non-overlapping parts of (sample, contig, haplotype). Their identifiers are arbitrary integers, but the order of the identifiers for the same (sample, contig, haplotype) should match the order of the underlying sequences.
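Under this model, the fragments of one haplotype can be put in sequence order by sorting on the fragment identifier. A minimal sketch, assuming a hypothetical SketchPathName struct and a linear scan over all path names:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the integer representation of a path name.
struct SketchPathName
{
  std::size_t sample, contig, phase, count;
};

// Returns the positions in `paths` of all fragments of the given
// (sample, contig, phase), ordered by the count (fragment) field.
inline std::vector<std::size_t> fragmentsInOrder(const std::vector<SketchPathName>& paths,
                                                 std::size_t sample, std::size_t contig, std::size_t phase)
{
  std::vector<std::size_t> result;
  for(std::size_t i = 0; i < paths.size(); i++)
  {
    const SketchPathName& p = paths[i];
    if(p.sample == sample && p.contig == contig && p.phase == phase) { result.push_back(i); }
  }
  // Path names are unique, so the counts within one haplotype are distinct.
  std::sort(result.begin(), result.end(),
            [&](std::size_t a, std::size_t b) { return paths[a].count < paths[b].count; });
  return result;
}
```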

Technical information

Samples, haplotypes, and contigs

The header contains summary statistics about the paths: the number of samples, haplotypes, and contigs. Valid sample identifiers are 0 to sample_count - 1, while the valid contig identifiers are 0 to contig_count - 1. The number of haplotypes represents the total number of start-to-end paths embedded in the graph.

Index construction

The build_gbwt tool automatically creates metadata when the index is built from a VCF parse (see Haplotype Generation). Each input file is assumed to represent a different contig. When loading an existing index, metadata is only written if the index already contains it. In this case, the tool assumes that we are inserting new contigs for the same samples.

Metadata merging

Merging GBWT indexes (see GBWT Merging) also merges the metadata, if both indexes contain it. Otherwise the merged index will not have any metadata. There are three merging approaches:

  • Merge by names: If both indexes contain sample/contig names, the names from the input index that do not exist in the current index are appended to names in the current index.
  • New samples/contigs: The samples/contigs from the input index are appended to the names in the current index.
  • Same samples/contigs: Both indexes are assumed to contain the same samples/contigs. If one of them contains names, they will be used in the merged index.
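The merge-by-names approach can be sketched with plain name lists. The function mergeByNames below is hypothetical and is not the library's merging code. The identifier mapping it returns is an assumption about what a caller would need in order to translate path names from the input index.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Appends the names in `input` that are missing from `current`. Returns, for
// each identifier in `input`, the corresponding identifier in the merged list.
inline std::vector<std::size_t> mergeByNames(std::vector<std::string>& current,
                                             const std::vector<std::string>& input)
{
  std::unordered_map<std::string, std::size_t> id_by_name;
  for(std::size_t i = 0; i < current.size(); i++) { id_by_name.emplace(current[i], i); }

  std::vector<std::size_t> new_ids;
  new_ids.reserve(input.size());
  for(const std::string& name : input)
  {
    auto iter = id_by_name.find(name);
    if(iter == id_by_name.end())
    {
      // A new name gets the next free identifier at the end of the list.
      iter = id_by_name.emplace(name, current.size()).first;
      current.push_back(name);
    }
    new_ids.push_back(iter->second);
  }
  return new_ids;
}
```

Existing names keep their identifiers in the current index, and only the names that are new to it extend the list.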

By default, samples and contigs will be merged by names. Otherwise the insertion algorithm and the parallel algorithm will assume new samples for the same contigs, while the fast algorithm will assume new contigs for the same samples. In any case, path names from the input index will be appended to the names in the current index.

Merging also updates the haplotype count. If the samples are merged by name and the index contains path names, the number of haplotypes will be counted from the path names. If there are no path names, the new number of haplotypes will be estimated from the old sample and haplotype counts. When merging indexes with the same samples, the haplotype count remains unchanged. When there are new samples, the new haplotype count is the sum of the old counts.

Interface

Both GBWT and DynamicGBWT implement the following metadata interface:

  • bool hasMetadata() const: Does the index contain metadata?
  • void addMetadata(): Adds a metadata record to the index.
  • void clearMetadata(): Removes all metadata from the index.
  • Metadata metadata: The metadata record.

The metadata record itself has an interface defined in metadata.h as class Metadata.

Basic statistics

  • size_type samples() const: Number of samples in the index.
  • size_type haplotypes() const: Number of haplotypes in the index.
  • size_type contigs() const: Number of contigs in the graph.
  • void setSamples(size_type n): Set the number of samples to n.
  • void setHaplotypes(size_type n): Set the number of haplotypes to n.
  • void setContigs(size_type n): Set the number of contigs to n.

Path names

  • bool hasPathNames() const: Does the metadata contain path names?
  • size_type paths() const: Number of paths in the metadata.
  • const PathName& path(size_type i) const: Name of the path with identifier i.
  • PathName path(const FullPathName& name) const: Internal representation of the given path name (which may not exist in the metadata).
  • FullPathName full_path(size_type i) const: Standalone representation of the name of the path with identifier i.
  • size_type findFragment(const PathName& name) const: Identifier of the path with the largest count <= name.count matching the other three fields, or paths() if there is no such path.
  • size_type findFragment(const FullPathName& name) const: Identifier of the path with the largest count <= name.fragment matching the other three fields, or paths() if there is no such path.
  • std::vector<size_type> findPaths(size_type sample_id, size_type contig_id) const: Sorted list of path identifiers for sample sample_id and contig contig_id.
  • std::vector<size_type> pathsForSample(size_type sample_id) const: Sorted list of path identifiers for sample sample_id.
  • std::vector<size_type> pathsForContig(size_type contig_id) const: Sorted list of path identifiers for contig contig_id.
  • void addPath(const PathName& path): Append a new path name.
  • void addPath(size_type sample, size_type contig, size_type phase, size_type count): Append a new path name.
  • void clearPathNames(): Remove path names from the metadata.
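The semantics of findFragment described above can be illustrated with a linear scan. This is a sketch of the documented behavior only. The names SketchPathName and findFragmentSketch are hypothetical, and the actual implementation may use a different search strategy.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the integer representation of a path name.
struct SketchPathName
{
  std::size_t sample, contig, phase, count;
};

// Returns the position of the path that matches the query in sample, contig,
// and phase and has the largest count <= query.count. Returns paths.size()
// if there is no such path, mirroring the paths() sentinel.
inline std::size_t findFragmentSketch(const std::vector<SketchPathName>& paths,
                                      const SketchPathName& query)
{
  std::size_t best = paths.size();
  for(std::size_t i = 0; i < paths.size(); i++)
  {
    const SketchPathName& p = paths[i];
    if(p.sample != query.sample || p.contig != query.contig || p.phase != query.phase) { continue; }
    if(p.count > query.count) { continue; }
    if(best == paths.size() || p.count > paths[best].count) { best = i; }
  }
  return best;
}
```

Callers should compare the result with the number of paths before using it as a path identifier.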

In a bidirectional index, metadata path identifier i corresponds to GBWT sequence identifiers Path::encode(i, false) and Path::encode(i, true).
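The mapping between path identifiers and sequence identifiers can be sketched as follows. The sketch assumes that the forward orientation of path i is stored as sequence 2i and the reverse orientation as sequence 2i + 1. That convention is an assumption here, and the helper names are hypothetical. In real code, use Path::encode rather than these functions.

```cpp
#include <cstddef>

// Assumed convention: the low bit stores the orientation and the remaining
// bits store the metadata path identifier.
inline std::size_t encodeSketch(std::size_t path_id, bool reversed)
{
  return (path_id << 1) | static_cast<std::size_t>(reversed);
}

// Recovers the metadata path identifier from a sequence identifier.
inline std::size_t pathIdSketch(std::size_t sequence_id) { return sequence_id >> 1; }

// Tells whether the sequence identifier refers to the reverse orientation.
inline bool isReverseSketch(std::size_t sequence_id) { return (sequence_id & 1) != 0; }
```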

Sample names

  • bool hasSampleNames() const: Does the metadata contain sample names?
  • std::string sample(size_type i): Name of the sample with identifier i, or an empty string if there is no such sample.
  • size_type sample(const std::string& name) const: Identifier of the sample with name name, or samples() if there is no such sample.
  • void setSamples(const std::vector<std::string>& names): Use the provided list of strings as sample names and update sample count accordingly.
  • void addSamples(const std::vector<std::string>& names): Append the provided list of strings to the current sample names and update sample count accordingly.
  • void clearSampleNames(): Remove sample names from the metadata.

Contig names

  • bool hasContigNames() const: Does the metadata contain contig names?
  • std::string contig(size_type i): Name of the contig with identifier i, or an empty string if there is no such contig.
  • size_type contig(const std::string& name) const: Identifier of the contig with name name, or contigs() if there is no such contig.
  • void setContigs(const std::vector<std::string>& names): Use the provided list of strings as contig names and update contig count accordingly.
  • void addContigs(const std::vector<std::string>& names): Append the provided list of strings to the current contig names and update contig count accordingly.
  • void clearContigNames(): Remove contig names from the metadata.

Remove metadata

  • std::vector<size_type> removeSample(size_type sample_id): Removes all metadata for sample identifier sample_id. Returns the list of removed path identifiers.
  • std::vector<size_type> removeContig(size_type contig_id): Removes all metadata for contig identifier contig_id. Returns the list of removed path identifiers.

When a sample or contig is removed, existing sample, contig, and path identifiers may be invalidated. Hence samples and contigs should be removed one at a time, looking up each identifier immediately before the removal:

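// Removes the named samples and their paths from the index, one sample at a time.
void removeSamples(DynamicGBWT& index, const std::vector<std::string>& sample_names)
{
  for(const std::string& name : sample_names)
  {
    // Look up the identifier on every iteration, as earlier removals may have changed it.
    size_type sample_id = index.metadata.sample(name);
    if(sample_id >= index.metadata.samples()) { continue; } // No such sample.
    std::vector<size_type> paths = index.metadata.removeSample(sample_id);
    index.remove(paths);
  }
}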
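The reason for removing one item at a time can be shown with a plain list of names. The sketch assumes that identifiers are positions in the list and that removal compacts the list. The wiki only states that old identifiers may be invalidated, so this is an illustration and not a description of the library's internals. The helpers findName and removeNames are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Returns the identifier of `name`, or names.size() if it is absent.
inline std::size_t findName(const std::vector<std::string>& names, const std::string& name)
{
  return static_cast<std::size_t>(std::find(names.begin(), names.end(), name) - names.begin());
}

// Removes each listed name, looking up its identifier again after every
// removal because the identifiers of later names shift down.
inline void removeNames(std::vector<std::string>& names, const std::vector<std::string>& to_remove)
{
  for(const std::string& name : to_remove)
  {
    std::size_t id = findName(names, name);
    if(id >= names.size()) { continue; } // Nothing to remove.
    names.erase(names.begin() + static_cast<std::ptrdiff_t>(id));
  }
}
```

An identifier cached before a removal may afterwards refer to a different name, which is why the loop never reuses identifiers across iterations.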

Metadata tool

The metadata_tool tool can be used to view and remove metadata. By default, the tool prints the basic metadata.

metadata_tool [options] basename
  • -s: Print all sample names.
  • -c: Print all contig names.
  • -p: Print all path names.
  • -t: Print all tags.
  • -r: Remove all metadata and overwrite the original index.
  • -O: Output SDSL format instead of simple-sds format.