To distribute Allen Institute Taxonomies (AIT), we define an anndata .h5ad file with a formalized schema that encapsulates the essential components of a taxonomy required for downstream analysis, such as cell type mapping.
One major challenge in creating a cell type taxonomy schema is defining terms such as "taxonomy", "dataset", "annotation", "metadata", and "data". It is becoming increasingly important to separate the data from the other components, and to compartmentalize all components, so that users do not need to download, open, or upload huge and unwieldy files.
That said, many use cases still call for the option of including all of the information listed above in a single .h5ad file, both for use with CELLxGENE, scrattch.mapping, and other analysis tools, and for ease of sharing.
Several competing schemas have been created for packaging taxonomies, data sets, and associated metadata and annotations. This document aims to align three such schemas and to propose a way of integrating them into the Allen Institute Taxonomies (AIT) .h5ad file format presented as part of this GitHub repository. The three standards are:
- AIT (described herein)
- Cell Annotation Schema (CAS): this schema is becoming more widely used in the cell typing field as a whole because it is largely compatible with the CZ CELLxGENE schema. It is also compatible with Cell Annotation Platform (CAP) and with Taxonomy Development Tools (TDT). CAS has both a general schema and a BICAN-associated schema, both of which are considered herein. CAS can be embedded in the header (`uns`) of an AIT/scrattch.taxonomy file, where it functions as a store of extended information about an annotation, including ontology term mappings and evidence for annotation (from annotation transfer and marker expression).
- Brain Knowledge Platform (BKP): this schema does not appear to be publicly documented, but it is the data model used for the Jupyter Notebooks associated with the Allen Brain Cell (ABC) Atlas. More generally, any data sets to be included in ABC Atlas, MapMyCells, or other related BKP resources will eventually need to conform to this format.
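
As a rough illustration of the CAS item above, the sketch below shows one way a CAS-style record could be stored in, and read back from, the `uns` header. It makes several assumptions: a plain Python dict stands in for `adata.uns` (which is dict-like in anndata), the `"cas"` key and the helper names `embed_cas` and `read_cas` are hypothetical, and the field names and values in the example record are illustrative placeholders that should be checked against the CAS schema itself. Serializing to a JSON string is a choice made for this sketch to keep nested content intact; it is not necessarily how AIT stores CAS.

```python
import json


def embed_cas(uns, cas, key="cas"):
    """Store a CAS-style record in a dict-like `uns` as a JSON string.

    `key` is a hypothetical slot name chosen for this sketch.
    """
    uns[key] = json.dumps(cas)
    return uns


def read_cas(uns, key="cas"):
    """Recover the CAS-style record previously stored by `embed_cas`."""
    return json.loads(uns[key])


# Illustrative record only: field names and values are placeholders,
# not a validated CAS document.
cas = {
    "author_name": "Example Author",
    "labelsets": [{"name": "subclass", "rank": 0}],
    "annotations": [
        {
            "labelset": "subclass",
            "cell_label": "example_type_1",
            "cell_ontology_term_id": "CL:0000000",  # placeholder term ID
            "marker_gene_evidence": ["GENE_A", "GENE_B"],  # placeholder genes
        }
    ],
}

uns = {}  # stand-in for `adata.uns` on an AnnData object
embed_cas(uns, cas)
restored = read_cas(uns)
```

With a real AnnData object, the same two helpers could be called on `adata.uns` before writing the .h5ad file, assuming that a JSON string is an acceptable value for that slot in the target schema.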