
Maximum Likelihood Clade Assignment (MLCA)

Robert J. Gifford edited this page Jan 5, 2025 · 2 revisions

AAV-Atlas uses the Maximum Likelihood Clade Assignment (MLCA) method to assign adeno-associated virus (AAV) sequences to specific species and serotypes. The method is a genotyping framework built on the Evolutionary Placement Algorithm (EPA) in the RAxML software suite.

MLCA enables efficient and accurate placement of new AAV sequences onto a fixed reference phylogeny without recalculating the entire tree, making it well-suited for large-scale genomic analyses. This genotyping capability forms a core part of AAV-Atlas, facilitating species and serotype identification for submitted sequences.

In AAV-Atlas, the MLCA process is executed by the maxLikelihoodGenotyper and maxLikelihoodPlacer modules; the genotyper is invoked in the example below under the name aavMaxLikelihoodGenotyper.

Example Usage in AAV-Atlas

The genotyping process in AAV-Atlas can be initiated through the command-line interface. Below is an example of using the MLCA genotyping module for AAV sequences:

GLUE> module aavMaxLikelihoodGenotyper genotype file -f example/test-seqs.fasta

This command processes the sequences in the specified FASTA file and outputs the assigned species and serotype clades for each sequence:

+===========+===================+====================+
| queryName | speciesFinalClade | serotypeFinalClade |
+===========+===================+====================+
| AX344105  | AL_Primate1       | AL_AAV2            |
| AX496953  | AL_Primate1       | AL_AAV2            |
| AX703462  | AL_Primate1       | AL_AAV2            |
| AX720902  | AL_Primate1       | AL_AAV2            |
| AX925291  | AL_Primate1       | AL_AAV2            |
| AX925550  | AL_Primate1       | AL_AAV2            |
| BD293519  | AL_Primate1       | AL_AAV2            |
| HV955994  | AL_Primate1       | AL_AAV2            |
| HZ796968  | AL_Primate1       | AL_AAV2            |
| LQ396120  | AL_Primate1       | AL_AAV2            |
| MP863866  | AL_Primate1       | AL_AAV2            |
| OF065946  | AL_Primate1       | AL_AAV2            |
| PC321937  | AL_Primate1       | AL_AAV2            |
| PE178192  | AL_Primate1       | AL_AAV2            |
| PF056792  | AL_Primate1       | AL_AAV2            |
| V01457    | AL_Primate1       | AL_AAV2            |
+===========+===================+====================+

In this example, every query sequence is assigned to the AL_Primate1 species clade and the AL_AAV2 serotype clade.
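For downstream analysis it can be convenient to turn a table like the one above into structured records. The following is a minimal sketch, assuming the console output has been captured as plain text. The function name parse_glue_table and the SAMPLE string are hypothetical and not part of AAV-Atlas; the sample rows are copied from the output above.

```python
from collections import Counter
from typing import Dict, List


def parse_glue_table(text: str) -> List[Dict[str, str]]:
    """Parse a '|'-delimited console table into a list of row dicts.

    Hypothetical helper for illustration; not an AAV-Atlas function.
    """
    header: List[str] = []
    rows: List[Dict[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue  # skip '+====+' border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if not header:
            header = cells  # the first '|' line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# Two rows copied from the example output, used as sample input.
SAMPLE = """
+===========+===================+====================+
| queryName | speciesFinalClade | serotypeFinalClade |
+===========+===================+====================+
| AX344105  | AL_Primate1       | AL_AAV2            |
| V01457    | AL_Primate1       | AL_AAV2            |
+===========+===================+====================+
"""

rows = parse_glue_table(SAMPLE)

# Tally how many queries were assigned to each serotype clade.
serotype_counts = Counter(row["serotypeFinalClade"] for row in rows)
print(serotype_counts)
```

Keying each row by column name keeps the sketch independent of column order, so it should also cope with tables that carry additional columns.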

The MLCA Algorithm

MLCA operates through three primary stages: alignment, placement, and neighbor-weighting. Each stage is critical for accurately assigning query sequences to predefined clades.

  1. Alignment Stage: Query sequences are aligned against a curated set of reference AAV sequences. The alignment uses MAFFT with the --add and --keeplength options, which insert the query sequences into the existing alignment without changing its length or column structure. The alignment is carried out on a separate copy, so the primary reference alignment is never modified.

  2. Placement Stage: The extended alignment is analyzed together with a fixed reference tree. RAxML's EPA subsystem places each query sequence onto the reference tree at the positions that maximize the likelihood of the resulting tree. A subset of high-likelihood placements is retained for each query and passed to the next stage.

  3. Neighbor-Weighting Stage: For each retained placement, this final stage computes the evolutionary distances between the query sequence and its closest reference sequences. Candidate species and serotype clades are then weighted according to the proximity of their reference sequences to the placement. If the weight calculated for a clade exceeds a threshold, the query is assigned to that clade.
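The alignment and neighbor-weighting stages can be sketched in a few lines of Python. This is an illustrative sketch only, not the AAV-Atlas implementation: the exact weighting formula and threshold are not given here, so inverse-distance weighting, the 0.5 threshold, the epsilon term, and all names (mafft_add_command, Neighbor, clade_weights, assign_clade, and the clade label OTHER) are assumptions made for the example. Only the MAFFT options --add and --keeplength are taken from the description above.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


def mafft_add_command(query_fasta: str, reference_alignment: str) -> List[str]:
    """Build (but do not run) a MAFFT call that adds queries to a fixed alignment.

    MAFFT writes the extended alignment to standard output, so a caller
    would need to capture or redirect it.
    """
    return ["mafft", "--add", query_fasta, "--keeplength", reference_alignment]


class Neighbor(NamedTuple):
    clade: str       # clade label of a nearby reference sequence
    distance: float  # evolutionary distance from the query placement


def clade_weights(neighbors: List[Neighbor], epsilon: float = 1e-6) -> Dict[str, float]:
    """Return each clade's share of the total inverse-distance weight.

    Assumption: closer references count more, using weight = 1 / (distance + epsilon).
    """
    totals: Dict[str, float] = defaultdict(float)
    for neighbor in neighbors:
        totals[neighbor.clade] += 1.0 / (neighbor.distance + epsilon)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {clade: weight / grand_total for clade, weight in totals.items()}


def assign_clade(neighbors: List[Neighbor], threshold: float = 0.5) -> Optional[str]:
    """Assign the best-supported clade only if its weight share exceeds the threshold."""
    weights = clade_weights(neighbors)
    if not weights:
        return None
    best_clade, share = max(weights.items(), key=lambda item: item[1])
    return best_clade if share > threshold else None


if __name__ == "__main__":
    # Hypothetical neighbors of one query placement; distances are made up.
    neighbors = [
        Neighbor("AL_AAV2", 0.01),
        Neighbor("AL_AAV2", 0.02),
        Neighbor("OTHER", 0.20),
    ]
    print(mafft_add_command("queries.fasta", "reference_alignment.fasta"))
    print(assign_clade(neighbors))
```

Returning None when no clade clears the threshold mirrors the idea that a query is left unassigned at that level rather than forced into its nearest clade.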

Advantages of MLCA in AAV-Atlas

The integration of MLCA into AAV-Atlas provides a scalable and efficient tool for AAV genotyping. By leveraging RAxML's EPA feature and the structured MLCA workflow, the process delivers:

  • High accuracy in species and serotype identification.
  • Computational efficiency, since new sequences are placed onto the fixed reference tree without rebuilding the full phylogeny.
  • Applicability to large-scale sequence datasets.