Data generated by the PacBio basecaller is stored in subreads and scraps BAM files. Those files are consumed by CCS to generate HiFi reads. For PacBio in-house analysis, those files can be used to measure and characterize base calling performance or develop new methods for HiFi generation. Those use cases require extra information to be carried in our BAM files.
The subreads and scraps files are fully compliant with the PacBio BAM spec (with spec version noted in the @HD::pb tag) but will include additional per-read tags carrying this extra information.
By convention the QNAME ("query template name") for unrolled reads and subreads is in the following format:

    {movieName}/{holeNumber}/{qStart}_{qEnd}

where [qStart, qEnd) is the 0-based coordinate interval representing the span of the query in the ZMW read, as above.
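As an illustration of this naming convention, the following minimal Python sketch splits a QNAME into its fields. The names `QueryName` and `parse_qname`, and the example movie name, are hypothetical and chosen only for this example.

```python
from typing import NamedTuple


class QueryName(NamedTuple):
    movie_name: str
    hole_number: int
    q_start: int  # 0-based, inclusive
    q_end: int    # 0-based, exclusive


def parse_qname(qname: str) -> QueryName:
    """Split '{movieName}/{holeNumber}/{qStart}_{qEnd}' into typed fields."""
    # Split from the right so only the last two '/' separators are consumed.
    movie_name, hole_number, span = qname.rsplit("/", 2)
    q_start, q_end = span.split("_")
    parsed = QueryName(movie_name, int(hole_number), int(q_start), int(q_end))
    if not 0 <= parsed.q_start <= parsed.q_end:
        raise ValueError(f"invalid query interval in {qname!r}")
    return parsed


# Hypothetical movie name; the half-open interval [100, 350) spans 250 bases.
name = parse_qname("m00001_000000_000000/42/100_350")
print(name.hole_number, name.q_end - name.q_start)
```

Because the interval is half-open, the query length is simply `q_end - q_start`.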
Since we will be using BAM format for different kinds of data, we will use a suffix.bam filename convention:

| Data type | Filename template |
|---|---|
| ZMW reads from movie | movieName.zmws.bam |
| Analysis-ready subreads [1] from movie | movieName.subreads.bam |
| Excised adapters, barcodes, and rejected subreads | movieName.scraps.bam |
| Aligned subreads in a job | jobID.aligned_subreads.bam |

[1] Data in a subreads.bam file should be analysis ready, meaning that all of the data present is expected to be useful for downstream analyses. Any subreads for which we have strong evidence that they will not be useful (e.g. double-adapter inserts, single-molecule artifacts) should be excluded from this file and placed in scraps.bam as Filtered, with an sc tag of F.
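The suffix convention lends itself to a simple lookup. The sketch below is one possible way to recover the prefix and data type from a filename; `classify_bam` is a hypothetical helper, not part of any PacBio tool.

```python
from typing import Tuple

# Suffixes and descriptions follow the filename convention described above.
SUFFIX_TO_DATA_TYPE = {
    ".zmws.bam": "ZMW reads from movie",
    ".subreads.bam": "Analysis-ready subreads from movie",
    ".scraps.bam": "Excised adapters, barcodes, and rejected subreads",
    ".aligned_subreads.bam": "Aligned subreads in a job",
}


def classify_bam(filename: str) -> Tuple[str, str]:
    """Return (prefix, data type) for a name following the convention."""
    for suffix, data_type in SUFFIX_TO_DATA_TYPE.items():
        if filename.endswith(suffix):
            return filename[: -len(suffix)], data_type
    raise ValueError(f"{filename!r} does not follow the suffix.bam convention")


print(classify_bam("movieName.scraps.bam"))
```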
Beyond the usual header information called for by the SAM/BAM spec, and what is added for customer-facing PacBio BAM files, we encode special information as follows.
@RG (read group) header entries:
DS tag ("description"): contains some semantic information about the reads in the group, encoded as a semicolon-delimited list of "Key=Value" strings, as follows:
Base feature manifest (an absent item means the feature is absent from the reads):

| Key | Value spec | Value example |
|---|---|---|
| DeletionQV | Name of tag used for DeletionQV | dq |
| DeletionTag | Name of tag used for DeletionTag | dt |
| InsertionQV | Name of tag used for InsertionQV | iq |
| MergeQV | Name of tag used for MergeQV | mq |
| SubstitutionQV | Name of tag used for SubstitutionQV | sq |
| SubstitutionTag | Name of tag used for SubstitutionTag | st |
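To make the manifest encoding concrete, here is a minimal Python sketch that parses a DS value and reports which base features are present. The function names and the example DS string are assumptions for illustration; only the key names and the "Key=Value" layout come from the description above.

```python
# Feature keys listed in the base feature manifest.
BASE_FEATURES = ("DeletionQV", "DeletionTag", "InsertionQV",
                 "MergeQV", "SubstitutionQV", "SubstitutionTag")


def parse_ds(ds: str) -> dict:
    """Parse a semicolon-delimited list of 'Key=Value' strings into a dict."""
    fields = {}
    for item in ds.split(";"):
        if not item:
            continue  # tolerate an empty item, e.g. from a trailing semicolon
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def base_feature_tags(ds: str) -> dict:
    """Map each base feature present in the manifest to its BAM tag name."""
    fields = parse_ds(ds)
    return {key: fields[key] for key in BASE_FEATURES if key in fields}


# Hypothetical DS value: features without an entry are treated as absent.
ds = "READTYPE=SCRAP;DeletionQV=dq;DeletionTag=dt;InsertionQV=iq"
print(base_feature_tags(ds))
```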
| Tag | Type | Description |
|---|---|---|
| ws | i | Start of first base of the query ('qs') in approximate raw frame count since start of movie. |
| we | i | Start of last base of the query ('qe - 1') in approximate raw frame count since start of movie. |
The following read tags encode features measured/calculated per-basecall. Unlike SEQ and QUAL, aligners will not orient these tags. They will be maintained in native orientation (in the same order and sense as collected from the instrument) even if the read record has been aligned to the reverse strand.
| Tag | Type | Description |
|---|---|---|
| dq | Z | DeletionQV |
| dt | Z | DeletionTag |
| ip | B,C or B,S | IPD (raw frames or codec V1) |
| iq | Z | InsertionQV |
| mq | Z | MergeQV |
| pw | B,C or B,S | PulseWidth (raw frames or codec V1) |
| sq | Z | SubstitutionQV |
| st | Z | SubstitutionTag |
Notes:
- QV metrics are ASCII+33 encoded as strings
- DeletionTag and SubstitutionTag represent alternate basecalls, or "N" when there is no alternate basecall available. In other words, they are strings over the alphabet "ACGTN".
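The two points above (ASCII+33 encoding and native orientation) can be sketched in Python as follows. The helper names are hypothetical. The reorientation step rests on an assumption stated here: since SEQ is stored reverse-complemented for reverse-strand alignments, a consumer that wants per-base tags to line up with SEQ reverses them and, for the base-valued tags, also complements them.

```python
def decode_qvs(qv_string: str) -> list:
    """Decode an ASCII+33 string (e.g. a dq, iq, mq or sq value) to integers."""
    return [ord(ch) - 33 for ch in qv_string]


# Complement table over the tag alphabet "ACGTN".
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def tag_in_seq_orientation(values, is_reverse: bool, is_base_tag: bool = False):
    """Reorder a native-orientation per-base tag to match SEQ as stored.

    Assumption: for a reverse-strand alignment the values are reversed, and
    base-valued tags (dt, st) are additionally complemented. Forward-strand
    and unaligned records are returned unchanged.
    """
    if not is_reverse:
        return values
    flipped = values[::-1]
    if is_base_tag:
        flipped = flipped.translate(_COMPLEMENT)
    return flipped


print(decode_qvs("!+5"))  # [0, 10, 20]
print(tag_in_seq_orientation("ACN", is_reverse=True, is_base_tag=True))  # NGT
```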
Reads that belong to a read group with READTYPE=SCRAP have to be annotated in a hierarchical fashion:
- Classification with tag sz occurs at the per-ZMW level, distinguishing between spike-in controls, sentinels of the basecaller, malformed ZMWs, and user-defined templates.
- Region-wise annotation with tag sc labels adapters, barcodes, low-quality regions, and filtered subreads.
| Tag | Type | Description |
|---|---|---|
| sz | A | ZMW classification annotation, one of N:=Normal, C:=Control, M:=Malformed, or S:=Sentinel [1] |
| sc | A | Scrap region-type annotation, one of A:=Adapter, B:=Barcode, L:=LQRegion, or F:=Filtered [2] |

[1] Reads in the subreads/hqregions/zmws.bam files are implicitly marked as Normal, as they stem from user-defined templates.
[2] sc tags 'A', 'B', and 'L' denote specific classes of non-subread data, whereas the 'F' tag is reserved for subreads that are undesirable for downstream analysis, e.g., being artifactual or too short.
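The two-level annotation can be read back with a pair of lookup tables. This is a small illustrative sketch; `describe_scrap` and `is_rejected_subread` are hypothetical names, and only the code letters and their meanings are taken from the table above.

```python
ZMW_CLASSES = {"N": "Normal", "C": "Control", "M": "Malformed", "S": "Sentinel"}
REGION_TYPES = {"A": "Adapter", "B": "Barcode", "L": "LQRegion", "F": "Filtered"}


def describe_scrap(sz: str, sc: str) -> str:
    """Render the ZMW-level (sz) and region-level (sc) codes as a label."""
    if sz not in ZMW_CLASSES:
        raise ValueError(f"unknown sz value {sz!r}")
    if sc not in REGION_TYPES:
        raise ValueError(f"unknown sc value {sc!r}")
    return f"{ZMW_CLASSES[sz]} ZMW, {REGION_TYPES[sc]} region"


def is_rejected_subread(sc: str) -> bool:
    """Only 'F' marks a subread; 'A', 'B' and 'L' mark non-subread data."""
    return sc == "F"


print(describe_scrap("N", "F"))
```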
Some algorithms can make use of knowledge that a subread was flanked on both sides by adapter or barcode hits, or that the subread was in one orientation or the other (as can be deduced when asymmetric adapters or barcodes are used).
To facilitate such algorithms, we furnish the cx bitmask tag for subread records. The cx value is calculated by binary OR-ing together values from this flags enum:

    enum LocalContextFlags
    {
        ADAPTER_BEFORE     = 1,
        ADAPTER_AFTER      = 2,
        BARCODE_BEFORE     = 4,
        BARCODE_AFTER      = 8,
        FORWARD_PASS       = 16,
        REVERSE_PASS       = 32,
        ADAPTER_BEFORE_BAD = 64,
        ADAPTER_AFTER_BAD  = 128
    };
Orientation of a subread (designated by one of the mutually exclusive FORWARD_PASS or REVERSE_PASS bits) can be reckoned only if either the adapter or barcode design is asymmetric; otherwise these flags must be left unset. The convention for what is considered a "forward" or "reverse" pass is determined by a per-ZMW convention, defining one element of the asymmetric barcode/adapter pair as the "front" and the other as the "back". It is up to tools producing the BAM to determine whether to use adapters or barcodes to reckon the orientation, but if pass directions cannot be confidently and consistently assessed for the subreads from a ZMW, neither orientation flag should be set. Tools consuming the BAM should be aware that orientation information may be unavailable for subreads in a ZMW, but if it is available for any subread in the ZMW, it will be available for all subreads in the ZMW.
The ADAPTER_* and BARCODE_* flags reflect whether the subread is flanked by adapters or barcodes at the ends.
The ADAPTER_BEFORE_BAD and ADAPTER_AFTER_BAD flags indicate that one or both adapters flanking this subread do not align to the adapter reference sequence(s). The adapter on this flank could be missing from the pbell molecule, or obscured by a local decrease in accuracy. Likewise, some nearby barcode or insert bases may be missing or obscured. ADAPTER_*_BAD flags cannot be set unless the corresponding ADAPTER_* flag is set.
This tag is mandatory for subread records, but will be absent from non-subread records (scraps, ZMW reads, CCS reads, etc.).
| Tag | Type | Description |
|---|---|---|
| cx | i | Subread local context flags |
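The constraints on cx described above can be checked mechanically. The following is a minimal Python sketch that mirrors the enum values and tests the stated rules; `check_cx` and `orientation_consistent` are hypothetical helper names and are not part of any PacBio tool.

```python
from enum import IntFlag


class LocalContextFlags(IntFlag):
    ADAPTER_BEFORE = 1
    ADAPTER_AFTER = 2
    BARCODE_BEFORE = 4
    BARCODE_AFTER = 8
    FORWARD_PASS = 16
    REVERSE_PASS = 32
    ADAPTER_BEFORE_BAD = 64
    ADAPTER_AFTER_BAD = 128


F = LocalContextFlags
ORIENTATION = F.FORWARD_PASS | F.REVERSE_PASS


def check_cx(cx: int) -> list:
    """Return the rule violations for one cx value (empty list if none)."""
    flags = F(cx)
    problems = []
    if (flags & ORIENTATION) == ORIENTATION:
        problems.append("FORWARD_PASS and REVERSE_PASS are mutually exclusive")
    if F.ADAPTER_BEFORE_BAD in flags and F.ADAPTER_BEFORE not in flags:
        problems.append("ADAPTER_BEFORE_BAD set without ADAPTER_BEFORE")
    if F.ADAPTER_AFTER_BAD in flags and F.ADAPTER_AFTER not in flags:
        problems.append("ADAPTER_AFTER_BAD set without ADAPTER_AFTER")
    return problems


def orientation_consistent(cx_values) -> bool:
    """Per-ZMW rule: orientation is set for all subreads or for none."""
    oriented = [bool(F(cx) & ORIENTATION) for cx in cx_values]
    return all(oriented) or not any(oriented)


# A forward-pass subread flanked by adapters on both sides: 1 | 2 | 16 = 19.
cx = int(F.ADAPTER_BEFORE | F.ADAPTER_AFTER | F.FORWARD_PASS)
print(cx, check_cx(cx))
```

A consumer might run `orientation_consistent` over the cx values of all subreads sharing a hole number before relying on pass direction.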