BAM format additions for PacBio-subreads

PacBio-subread BAM flavors

Data generated by the PacBio basecaller is stored in subreads and scraps BAM files, which CCS consumes to generate HiFi reads. For PacBio in-house analysis, these files can also be used to measure and characterize basecalling performance or to develop new methods for HiFi generation. Those use cases require extra information to be carried in our BAM files.

The subreads and scraps files are fully compliant with the PacBio BAM spec (with the spec version noted in the @HD::pb tag) but include extra per-read tags that carry this additional information.

QNAME convention

By convention the QNAME ("query template name") for unrolled reads and subreads is in the following format:

{movieName}/{holeNumber}/{qStart}_{qEnd}

where [qStart, qEnd) is the 0-based, half-open coordinate interval representing the span of the query in the ZMW read.
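The convention above can be sketched as a small parser. This is a minimal illustration, not part of the spec: the names `QName` and `parse_qname` are hypothetical, and the movie name used in the usage note is made up.

```python
from typing import NamedTuple


class QName(NamedTuple):
    """Components of a {movieName}/{holeNumber}/{qStart}_{qEnd} QNAME."""
    movie_name: str
    hole_number: int
    q_start: int  # 0-based, inclusive
    q_end: int    # 0-based, exclusive


def parse_qname(qname: str) -> QName:
    """Split a subread or unrolled-read QNAME into its components.

    Raises ValueError if the name does not have exactly three
    slash-separated fields or the interval is not well formed.
    """
    movie_name, hole, span = qname.split("/")
    q_start, q_end = span.split("_")
    rec = QName(movie_name, int(hole), int(q_start), int(q_end))
    if not 0 <= rec.q_start <= rec.q_end:
        raise ValueError(f"invalid query interval in QNAME: {qname!r}")
    return rec
```

Because the interval is half-open, the query length is simply `q_end - q_start`; for example, `parse_qname("m64011_190830_220126/4194370/10_1533")` describes a 1523-base query.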

BAM filename conventions

Since we will be using BAM format for different kinds of data, we will use a suffix.bam filename convention:

Data type                                            Filename template
ZMW reads from movie                                 movieName.zmws.bam
Analysis-ready subreads [1] from movie               movieName.subreads.bam
Excised adapters, barcodes, and rejected subreads    movieName.scraps.bam
Aligned subreads in a job                            jobID.aligned_subreads.bam

[1] Data in a subreads.bam file should be analysis ready, meaning that all of the data present is expected to be useful for downstream analyses. Any subreads that we have strong evidence will not be useful (e.g. double-adapter inserts, single-molecule artifacts) should be excluded from this file and placed in scraps.bam as Filtered regions, with an sc tag of F.
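As an illustration of the suffix convention, a tool could recover the data type and the movie name or job ID from a filename roughly as follows. The function name `classify_bam` and the description strings are assumptions made for this sketch only.

```python
# Suffixes taken from the filename table above; descriptions are informal.
SUFFIX_TO_DATA_TYPE = {
    ".zmws.bam": "ZMW reads",
    ".subreads.bam": "analysis-ready subreads",
    ".scraps.bam": "excised adapters, barcodes, and rejected subreads",
    ".aligned_subreads.bam": "aligned subreads",
}


def classify_bam(filename: str):
    """Return (prefix, data_type), where prefix is the movie name or job ID.

    Raises ValueError if the filename matches none of the known suffixes.
    """
    for suffix, data_type in SUFFIX_TO_DATA_TYPE.items():
        if filename.endswith(suffix):
            return filename[: -len(suffix)], data_type
    raise ValueError(f"unrecognized BAM filename: {filename!r}")
```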

Use of headers for file-level information

Beyond the usual information encoded in headers as called for by the SAM/BAM spec, and what is added for customer-facing PacBio BAM files, we encode special information as follows.

@RG (read group) header entries:

DS tag ("description"):

contains some semantic information about the reads in the group, encoded as a semicolon-delimited list of "Key=Value" strings, as follows:

Base feature manifest (an absent item means the feature is absent from the reads):

Key              Value spec                            Value example
DeletionQV       Name of tag used for DeletionQV       dq
DeletionTag      Name of tag used for DeletionTag      dt
InsertionQV      Name of tag used for InsertionQV      iq
MergeQV          Name of tag used for MergeQV          mq
SubstitutionQV   Name of tag used for SubstitutionQV   sq
SubstitutionTag  Name of tag used for SubstitutionTag  st
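The DS encoding described above can be read with a few lines of code. The sketch below assumes the DS string has already been extracted from the @RG header line; `parse_ds` and `base_feature_tags` are hypothetical helper names, and the example string in the usage note is invented for illustration.

```python
# Keys of the base feature manifest listed in the table above.
BASE_FEATURES = (
    "DeletionQV", "DeletionTag", "InsertionQV",
    "MergeQV", "SubstitutionQV", "SubstitutionTag",
)


def parse_ds(ds: str) -> dict:
    """Parse a semicolon-delimited list of Key=Value strings into a dict."""
    fields = {}
    for item in ds.split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def base_feature_tags(ds: str) -> dict:
    """Map each base feature present in the read group to its tag name.

    A feature missing from the result is absent from the reads.
    """
    fields = parse_ds(ds)
    return {key: fields[key] for key in BASE_FEATURES if key in fields}
```

For instance, `base_feature_tags("READTYPE=SUBREAD;DeletionQV=dq;MergeQV=mq")` reports only DeletionQV and MergeQV, so a consumer should not expect `iq`, `sq`, `dt`, or `st` on those reads.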

Use of read tags for per-read information

Tag  Type  Description
ws   i     Start of first base of the query ('qs') in approximate raw frame count since start of movie.
we   i     Start of last base of the query ('qe - 1') in approximate raw frame count since start of movie.

Use of read tags for per-read-base information

The following read tags encode features measured/calculated per-basecall. Unlike SEQ and QUAL, aligners will not orient these tags. They will be maintained in native orientation (in the same order and sense as collected from the instrument) even if the read record has been aligned to the reverse strand.

Tag  Type        Description
dq   Z           DeletionQV
dt   Z           DeletionTag
ip   B,C or B,S  IPD (raw frames or codec V1)
iq   Z           InsertionQV
mq   Z           MergeQV
pw   B,C or B,S  PulseWidth (raw frames or codec V1)
sq   Z           SubstitutionQV
st   Z           SubstitutionTag

Notes:

  • QV metrics are ASCII+33 encoded as strings
  • DeletionTag and SubstitutionTag represent alternate basecalls, or "N" when there is no alternate basecall available. In other words, they are strings over the alphabet "ACGTN".
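The notes above, together with the native-orientation rule, can be sketched as follows. The helper names are hypothetical, and the orientation helper only covers numeric per-base values; base-valued tags such as dt and st would presumably also need complementing to match a reverse-strand SEQ, which this sketch does not attempt.

```python
def decode_qvs(encoded: str) -> list:
    """Decode an ASCII+33 encoded QV string into integer QVs."""
    return [ord(ch) - 33 for ch in encoded]


def is_valid_alt_basecalls(tag_value: str) -> bool:
    """Check that a DeletionTag/SubstitutionTag string uses only ACGTN."""
    return set(tag_value) <= set("ACGTN")


def to_seq_orientation(per_base_values: list, is_reverse: bool) -> list:
    """Reorder native-orientation values so they line up with SEQ.

    Per-base tags stay in instrument order even when the record is
    aligned to the reverse strand, so they must be reversed to match
    the reverse-complemented SEQ.
    """
    return per_base_values[::-1] if is_reverse else list(per_base_values)
```

For example, `decode_qvs("!+5I")` gives the QVs 0, 10, 20, and 40.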

How to annotate scrap reads

Reads that belong to a read group with READTYPE=SCRAP have to be annotated in a hierarchical fashion:

  1. Classification with tag sz occurs at the per-ZMW level, distinguishing between spike-in controls, sentinels of the basecaller, malformed ZMWs, and user-defined templates.
  2. A region-wise annotation with tag sc to label adapters, barcodes, low-quality regions, and filtered subreads.

Tag  Type  Description
sz   A     ZMW classification annotation, one of N:=Normal, C:=Control, M:=Malformed, or S:=Sentinel [1]
sc   A     Scrap region-type annotation, one of A:=Adapter, B:=Barcode, L:=LQRegion, or F:=Filtered [2]

[1] Reads in the subreads/hqregions/zmws.bam files are implicitly marked as Normal, as they stem from user-defined templates.
[2] sc tags 'A', 'B', and 'L' denote specific classes of non-subread data, whereas the 'F' tag is reserved for subreads that are undesirable for downstream analysis, e.g., being artifactual or too short.
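The two-level annotation can be interpreted with simple lookup tables. This is a sketch under the assumption that the sz and sc tag values have already been read from a scrap record as single characters; `describe_scrap` and `is_rejected_subread` are hypothetical names.

```python
# Code-to-name mappings taken from the sz and sc tag descriptions above.
ZMW_CLASSES = {"N": "Normal", "C": "Control", "M": "Malformed", "S": "Sentinel"}
SCRAP_REGIONS = {"A": "Adapter", "B": "Barcode", "L": "LQRegion", "F": "Filtered"}


def describe_scrap(sz: str, sc: str):
    """Return (ZMW class, region type) for a scrap record's sz and sc tags."""
    if sz not in ZMW_CLASSES:
        raise ValueError(f"unknown sz value: {sz!r}")
    if sc not in SCRAP_REGIONS:
        raise ValueError(f"unknown sc value: {sc!r}")
    return ZMW_CLASSES[sz], SCRAP_REGIONS[sc]


def is_rejected_subread(sc: str) -> bool:
    """True for 'F' (a filtered subread), False for non-subread regions."""
    return sc == "F"
```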

Subread local context

Some algorithms can make use of knowledge that a subread was flanked on both sides by adapter or barcode hits, or that the subread was in one orientation or the other (as can be deduced when asymmetric adapters or barcodes are used).

To facilitate such algorithms, we furnish the cx bitmask tag for subread records. The cx value is calculated by binary OR-ing together values from this flags enum:

enum LocalContextFlags
{
    ADAPTER_BEFORE     = 1,
    ADAPTER_AFTER      = 2,
    BARCODE_BEFORE     = 4,
    BARCODE_AFTER      = 8,
    FORWARD_PASS       = 16,
    REVERSE_PASS       = 32,
    ADAPTER_BEFORE_BAD = 64,
    ADAPTER_AFTER_BAD  = 128
};

The orientation of a subread (designated by one of the mutually exclusive FORWARD_PASS or REVERSE_PASS bits) can be reckoned only if either the adapter or the barcode design is asymmetric; otherwise these flags must be left unset. What counts as a "forward" or "reverse" pass is determined by a per-ZMW convention that defines one element of the asymmetric barcode/adapter pair as the "front" and the other as the "back". It is up to the tools producing the BAM to decide whether to use adapters or barcodes to reckon the orientation, but if pass directions cannot be confidently and consistently assessed for the subreads from a ZMW, neither orientation flag should be set. Tools consuming the BAM should be aware that orientation information may be unavailable for subreads in a ZMW, but if it is available for any subread in the ZMW, it will be available for all subreads in the ZMW.

The ADAPTER_* and BARCODE_* flags reflect whether the subread is flanked by adapters or barcodes at the ends.

The ADAPTER_BEFORE_BAD and ADAPTER_AFTER_BAD flags indicate that one or both adapters flanking this subread do not align to the adapter reference sequence(s). The adapter on this flank could be missing from the pbell molecule, or obscured by a local decrease in accuracy. Likewise, some nearby barcode or insert bases may be missing or obscured. An ADAPTER_*_BAD flag cannot be set unless the corresponding ADAPTER_* flag is set.

This tag is mandatory for subread records, but will be absent from non-subread records (scraps, ZMW read, CCS read, etc.).

Tag  Type  Description
cx   i     Subread local context flags
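The rules for the cx bitmask can be sketched in a few lines. The flag values mirror the LocalContextFlags enum in this section; the helpers `validate_cx` and `orientation` are hypothetical names, and rejecting bits outside the eight defined flags is an assumption of this sketch rather than a stated requirement.

```python
from enum import IntFlag


class LocalContextFlags(IntFlag):
    ADAPTER_BEFORE = 1
    ADAPTER_AFTER = 2
    BARCODE_BEFORE = 4
    BARCODE_AFTER = 8
    FORWARD_PASS = 16
    REVERSE_PASS = 32
    ADAPTER_BEFORE_BAD = 64
    ADAPTER_AFTER_BAD = 128


F = LocalContextFlags
KNOWN_BITS = 0xFF  # OR of all eight flags defined above


def validate_cx(cx: int) -> LocalContextFlags:
    """Check a cx value against the constraints described in the text."""
    if cx < 0 or cx & ~KNOWN_BITS:
        raise ValueError(f"cx has undefined bits set: {cx}")
    flags = F(cx)
    # FORWARD_PASS and REVERSE_PASS are mutually exclusive.
    if flags & F.FORWARD_PASS and flags & F.REVERSE_PASS:
        raise ValueError("FORWARD_PASS and REVERSE_PASS are both set")
    # An ADAPTER_*_BAD flag requires the corresponding ADAPTER_* flag.
    if flags & F.ADAPTER_BEFORE_BAD and not flags & F.ADAPTER_BEFORE:
        raise ValueError("ADAPTER_BEFORE_BAD set without ADAPTER_BEFORE")
    if flags & F.ADAPTER_AFTER_BAD and not flags & F.ADAPTER_AFTER:
        raise ValueError("ADAPTER_AFTER_BAD set without ADAPTER_AFTER")
    return flags


def orientation(flags: LocalContextFlags):
    """Return 'forward', 'reverse', or None when orientation is unavailable."""
    if flags & F.FORWARD_PASS:
        return "forward"
    if flags & F.REVERSE_PASS:
        return "reverse"
    return None
```

A subread flanked by adapters on both sides and read in the forward direction would carry `cx = 1 | 2 | 16 = 19`. Consumers should treat a `None` orientation as "not assessed" rather than as an error.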