diff --git a/website/docs/Pipelines/Optimus_Pipeline/Library-metrics.md b/website/docs/Pipelines/Optimus_Pipeline/Library-metrics.md index 143b8f0730..486eb57a55 100644 --- a/website/docs/Pipelines/Optimus_Pipeline/Library-metrics.md +++ b/website/docs/Pipelines/Optimus_Pipeline/Library-metrics.md @@ -6,7 +6,7 @@ sidebar_position: 5 The following table describes the library level metrics of the produced by the Optimus workflow. These are calcuated using custom python scripts available in the warp-tools repository. The Optimus workflow aligns files in shards to parallelize computationally intensive steps. This results in multiple matrix market files and shard-level library metrics. -To produce the library-level metrics here, the [combined_mtx.py script](https://github.com/broadinstitute/warp-tools/blob/develop/3rd-party-tools/star-merge-npz/scripts/combined_mtx.py) combines all the shard-level matrix market files into one raw mtx file. Then, STARsolo is run to filter this matrix to only those barcodes that meet STARsolo's criteria of cells (using the Emptydrops_CR parameter). Lastly, the [combine_shard_metrics.py script](https://github.com/broadinstitute/warp-tools/blob/develop/3rd-party-tools/star-merge-npz/scripts/combine_shard_metrics.py) uses the filtered matrix and the all of the shard-level metrics files produced by STARsolo to calculate the metrics below. Each of the scripts are called from [MergeStarOutput task](https://github.com/broadinstitute/warp/blob/develop/tasks/skylab/StarAlign.wdl) of the Optimus workflow. +To produce the library-level metrics here, the [combined_mtx.py script](https://github.com/broadinstitute/warp-tools/blob/develop/3rd-party-tools/star-merge-npz/scripts/combined_mtx.py) combines all the shard-level matrix market files into one raw mtx file. Then, STARsolo is run to filter this matrix to only those barcodes that meet STARsolo's criteria of cells (using the Emptydrops_CR parameter). 
This matrix is then used as input during h5ad generation, and metrics are calculated from the final h5ad using the custom [add_library_tso_doublets.py]() script. | Metric | Description | @@ -37,9 +37,10 @@ To produce the library-level metrics here, the [combined_mtx.py script](https:// | total_genes_unique_detected | Total number of unique genes detected. | | percent_target | Percentage of target cells. Calculated as: estimated_number_of_cells / barcoded_cell_sample_number_of_expected_cells | | percent_intronic_reads | Percentage of intronic reads. Calculated as: reads_mapped_confidently_to_intronic_regions / number_of_reads | -| keeper_mean_reads_per_cell | Mean reads per cell for cells with >1500 genes or nuclei with >1000 genes. | -| keeper_median_genes | Median genes per cell for cells with >1500 genes or nuclei with >1000 genes. | -| keeper_cells | Number of cells with >1500 genes or nuclei with >1000 genes.| +| percent_doublets | Percentage of cells flagged as doublets based on doublet scores calculated from a modified [DoubletFinder](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6853612/) algorithm. | +| keeper_mean_reads_per_cell | Mean reads per cell for cells with >1500 genes or nuclei with >1000 genes, and doublet_score < 0.3. | +| keeper_median_genes | Median genes per cell for cells with >1500 genes or nuclei with >1000 genes, and doublet_score < 0.3. | +| keeper_cells | Number of cells with >1500 genes or nuclei with >1000 genes, and doublet_score < 0.3. | | percent_keeper | Percentage of keeper cells. Calculated as: keeper_cells / estimated_cells | | percent_usable | Percentage of usable cells. Calculated as: keeper_cells / expected_cells | | frac_tso | Fraction of reads containing TSO sequence. Calculated as the number of reads that have 20 bp or more of TSO Sequence clipped from 5' end/ total number of reads. 
| \ No newline at end of file diff --git a/website/docs/Pipelines/Optimus_Pipeline/Loom_schema.md b/website/docs/Pipelines/Optimus_Pipeline/Loom_schema.md index 83e07ba73a..cf861eb91e 100644 --- a/website/docs/Pipelines/Optimus_Pipeline/Loom_schema.md +++ b/website/docs/Pipelines/Optimus_Pipeline/Loom_schema.md @@ -41,6 +41,7 @@ The global attributes (unstuctured metadata) in the h5ad apply to the whole file |`CellID` | [TagSort](https://github.com/broadinstitute/warp-tools/tree/develop/tools/TagSort) | The unique identifier for each cell based on cell barcodes (sequences used to identify unique cells); identical to `cell_names`. Learn more about cell barcodes in the [Definitions](#definitions) section below. | |`cell_names` | [TagSort](https://github.com/broadinstitute/warp-tools/tree/develop/tools/TagSort) | The unique identifier for each cell based on cell barcodes; identical to `CellID`. | | `input_id` | Provided as pipeline input | The sample or cell ID listed in the pipeline configuration file. This can be any string, but we recommend it be consistent with any sample metadata. | +| `star_IsCell` | STARsolo | A true/false flag demarcating if the STARsolo aligner called a cell barcode as a cell. | |`n_reads`|[TagSort](https://github.com/broadinstitute/warp-tools/tree/develop/tools/TagSort)| The number of reads associated with the cell. Like all metrics, `n_reads` is calculated from the Optimus output BAM file. Prior to alignment, reads are checked against the whitelist and any within one edit distance (Hamming distance) are corrected. These CB-corrected reads are aligned using STARsolo, where they get further CB correction. For this reason, most reads in the aligned BAM file have both `CB` and `UB` tags. Therefore, `n_reads` represents CB-corrected reads, rather than all reads in the input FASTQ files. 
| | `tso_reads` | [TagSort](https://github.com/broadinstitute/warp-tools/tree/develop/tools/TagSort) | The number of reads that have 20 or more bp of TSO sequence clipped from the 5' end. Calculated using the first number of cN tag in the BAM, which is specific to the number of TSO nucleotides clipped. | |`noise_reads`|[TagSort](https://github.com/broadinstitute/warp-tools/tree/develop/tools/TagSort)| Number of reads that are categorized by 10x Genomics Cell Ranger as "noise". Refers to long polymers, or reads with high numbers of N (ambiguous) nucleotides. | @@ -85,6 +86,7 @@ The global attributes (unstuctured metadata) in the h5ad apply to the whole file | `reads_mapped_intergenic` | STARsolo and [TagSort](https://github.com/broadinstitute/warp-tools/tree/develop/tools/TagSort) | The number of reads counted as intergenic; counted when the BAM file's `sF` tag is assigned to a `7` and the `NH:i` tag is `1`. | | `reads_unmapped` | [TagSort](https://github.com/broadinstitute/warp-tools/tree/develop/tools/TagSort) | The total number of reads that are unmapped; counted when the BAM file's `sF` tag is `0`. | |`reads_per_molecule`|[TagSort](https://github.com/broadinstitute/warp-tools/tree/develop/tools/TagSort)| The average number of reads associated with each molecule in the cell. | +| `doublet_score` | Modified version of [DoubletFinder](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6853612/) | A score produced by a modified version of the DoubletFinder software that normalizes data using scanpy and then uses the k-nearest neighbors algorithm to determine cells. This program is non-deterministic, so results will vary across runs of the workflow. The metrics are used to determine overall library quality. | ## Table 3. Gene metrics