Skip to content

pseudo code supertree

Mark T. Holder edited this page Feb 25, 2015 · 4 revisions

proposed tree-rank based synthesis:

Note: this is being turned into code described here. As it is a work-in-progress, there will probably be some discrepancies between that pipeline and this one. Hopefully, we'll clean this up soon...

Input:

  • OTT pruned by treemachine and expressed in newick
  • list of ranked trees (most important tree first)
  • phylesystem-api for grabbing the nexson.

Conventions:

  • $SYNTHESIS_PARENT is some dir in your env.
  • outdir is $SYNTHESIS_PARENT/out - the directory that will be the parent of the output
  • provdir is ${outdir}/provenance - directory holding artifacts from the synthesis run

Phase 0: fetch data and basic preconditions enforced.

Phase 0 - Step 0. Fetch and snapshot nexsons:

And normalize IDs (which lack the "tree" in many treeIDs in treemachine syntax)

Inputs:

  1. tree ranking list. one identifier per line using (the loathed) studyid_treeid format
  2. a local copy of the phylesystem repo.

fetch-and-snapshot-nexsons.bash:

#!/bin/bash
cd phyleysystem/shards/phylesystem-1
git pull origin master
cd $SYNTHESIS_PARENT
for tag in $(cat in/ranked-tree-list.txt)
do
    snapshot-nexson-for-synth.py "${tag}" outdir/raw-nexsons
done

snapshot-nexsons.py:

#!/usr/bin/env python
demunger_pattern = re.compile('([a-z][a-z]_[0-9]+)_([0-9]+)')
study_tree_str = open(sys.argv[1], 'rU')
nexson_snapshot = sys.argv[2]
create_dir(nexson_snapshot)

def do_curation_check(study, tree, prov):
    e = {}
    if 'ingroup' not in tree:
        e.setdefault('tree_errors', []).append("ingroup missing")
    if 'doi' not in stdy:
        e.setdefault('tree_errors', []).append("doi missing")
    if 'root' not in tree:
        e.setdefault('tree_errors', []).append("root missing")
    if e:
      raise CurationError(e, prov)
try:
    for study_tree_str in open(sys.argv[1], 'rU'):
        match = demunger_pattern(study_tree_str)
        assert match
        study_id, tree_id = match.group(1), match.group(2)
        study = phylesystem.get(study)
        try:
            tree = get_tree(study, tree_id)
        except:
            tree_id = 'tree' + tree_id
            tree = get_tree(study, tree_id)
        study.cull_other_trees(tree_id)
        sha = get_sha(study)
        prov = {'study_id':study,
                'tree_id':tree_id,
                'sha':sha,
                'local_path': local_path}
        do_curation_check(study, tree, prov)
        local_path = nexson_snapshot + '/' + study_id + '_' + tree_id + '.json'
        save_json_to_path(study, local_path)

        sha = get_sha(study)
        prov = {'study_id':study,
                'tree_id':tree_id,
                'sha':sha,
                'local_path': local_path}
        append_el_to_provenance('study_list_with_shas.json', )
        study_tree_sha = []
        with open(prov_dir + '/' + 'study_list_with_shas.json') as j:
            json.dump(study_tree_sha, j)
except CurationError as x:
    LOG_ERROR(x)
    sys.exit(1)

Step 1. Prune duplicated or nested OTUs

Input:

  1. path to pruned taxonomic tree (read-only)
  2. filepath to NexSON with one tree
  3. output directory

Output:

  1. version of each tree with level 1 remapping - mandatory pruning
taxonomy = parse_newick(sys.argv[1])
inf = sys.argv[2]
input = parse_nexson(inf)
outpath = os.path.join(sys.argv[3], os.path.split(inf)[1])
assert outpath != inf
assert input.num_trees() == 1
tree = input.get_trees()[0]
ottId2leaf = {}
unmapped_dropped = []
pruned_leaves = []
suppressed_internals = []
sup = suppress_out_degree_one(tree)
suppressed_internals.extend(sup)
for leaf in iter_leaves(tree):
    ott_id = get_ott_id(leaf)
    if ott_id is None:
        unmapped_dropped.append(leaf)
    ottId2leaf.setdefault(ott_id, []).append(leaf)
pruned_higher_taxon = []
leaf_ott_ids = set()
for ott_id in ottId2leaf.keys():
    anc_ids = included_anc_ids(taxonomy, leaf_ott_ids)
    pruned_higher_taxon.exten(list(anc_ids))
    if is_anc_of_id(taxonomy, leaf_ott_ids):
        pruned_higher_taxon.append(ott_id)
    else:
        leaf_ott_ids.add(ott_id)
uniq_o2n = {}
for ott_id, nodes in ottId2leaf.items():
    if len(nodes) > 1:
        is_exemplar = [i for i in nodes if has_is_exemplar_tag(node)]
        if len(is_exemplar) > 1:
            raise CurationError({'tree_errors': 'multiple nodes flagged as exemplar for the same taxon: {}'.format(is_exemplar)})
        if len(is_exemplar) == 1:
            ex = is_exemplar[0]
        else:
            if form_a_clade(tree, nodes):
                ex = arbitrary_exemplar(nodes)
                WARN('arbitrarily chose ex to represent ott_id')
            else:
                ex = arbitrary_exemplar(nodes)
                WARN_OR_ERROR('arbitrarily chose ex to represent ott_id - and this changes decision might affect the input to the supertree!')
        pruned_leaves.extend([i for i in nodes if i != ex])
        uniq_o2n[ott_id] = ex
    else:
        uniq_o2n[ott_id] = nodes[0]
if len(uniq_o2n) is less than 3:
    raise CurationError({'tree_errors': 'fewer than 3 nodes mapped to unique, non-nested OTT taxa'}, prov)
suppressed_internals = tree.prune(pruned_leaves)
suppressed_internals.extend(tree.prune(pruned_higher_taxon))
unmapped_dropped.extend(tree.prune(unmapped_dropped))

Phase I: Using taxonomy to produce informative phylogenetic inputs.

Step 2. Remap to uncontested higher taxa

Inputs:

  1. taxonomy
  2. list of all nexson

Output:

  1. Each tree with level-2 remapping: no contested tips * pruning of an tip that was mapped to a higher taxon (in the curated nexson) which hss contested by another input tree,
  2. Each tree with level-3 remapping: * the OTT ID of each tip from the level-2 remapping will be replaced with the deepest uncontested ott ID that it examplifies
  3. A pruned newick of the taxonomy which is the induced tree for the remapped ottID set that was created by this operation.
taxonomy = parse_newick(sys.argv[1])
add_mrca_sets(taxonomy)
for input_tree in inputs:
    add_mrca_sets(input_tree)
    for clade in input_tree.iter_internal_node():
        taxonomy.flag_nodes_that_conflict(input_tree.leaf_taxon_set, clade.leaf_taxon_set)
prune = []
for input_tree in inputs:
    output_path_for_this_tree = ...
    taxonomy.clear_encountered_marks()
    for leaf in in input_tree.iter_leaves():
        taxonomy.mark_path_to_root_as_encountered(leaf.ott_id)
    for leaf in in input_tree.iter_leaves():
        taxon = taxonomy.nodebyid[leaf.ott_id]
        if is_contested(taxon):
            prune.extend(leaf)
        else:
            new_taxon = tree.find_deepest_uncontested_terminal_edge_anc(taxon)
            if new_taxon is not taxon:
                remap.append((leaf, taxon, new_taxon))
suppressed_internals = tree.prune(prune)
store_level_2_remapping(tree)
if tree.num_tips is less than 3:
    WARN('removing of leaves that were mapped to contested higher taxa resulted in an uninformative input...')
tree.remap_taxa(remap)
store_level_3_remapping(tree)
taxonomy.clear_encountered_marks()
for input_tree in inputs:
    for leaf in in input_tree.iter_leaves():
        taxonomy.mark_path_to_root_as_encountered(leaf.ott_id)
taxonomy.prune_unencountered_nodes()
taxonomy.write_newick(open(pruned_taxo_filepath, 'w'), internal_names=True)

Step 3. Prune singletons from input set

Input

  1. taxonomy (pruned from step 2 is what we'll use for speed)
  2. set of level 3 remapped input trees

Output:

  1. Artifact: each tree in level-4 remapping: Any tip that is mapped to an internal node in the taxo-2 tree is replaced with a polytomy of the leaf taxa below it, and a marker to indicate that the polytomy node is an assumption - it is not supported by the tree. If the pipeline were more modular, this would be an intermediate (as the input to the creation of level-5 remapping), but it would be more efficient to just write both from one execution).
  2. intermediate: each tree in level-5 remapping. pruned phylo singeltons. input for Step 4 any tip that is only seen in one phylogeny is culled,
  3. intermediate: phylo-informative taxonmic pruning: taxonomy pruned to the tree induced by the level-5 input remapping's leaf set. this is an input for step 4.

phase II: phylogenetic graph of the phylogenetic inputs

Step 4. Decomposition into contested regions

Input:

  1. taxonomy (pruned is what we'll use for speed)
  2. set of fully remapped inputs

Output:

  1. intermediate subdir for each problem to be solved, containing versions of each relevant tree pruned down to the relevant leaf set for this suproblem. These are the inputs of step 5
  2. intermediate some JSON format for the reattachement recipe describing grafting

Step 5. phylogenetic graph for each subproblem

Input:

  1. relevant trees from step 4

Output: 1.intermediate phylogenetic graph for the subproblem

Step 6. graph subproblems back into one phylogenetic graph

Input:

  1. intermediate phylogenetic graphs from step 5.
  2. intermediate reattachement strategy from step 4.

Output:

  1. intermediate phylogenetic graph for phylo-tips that are not singeltons

phase III: graft phylo singeltons back on with phyloreference placement

Step 7. graph subproblems back into one phylogenetic graph

Input:

  1. phylogenetic graph for phylo-tips that are not singeltons
  2. level-2 remapped input tree will be used as source of phylorefence definitions to be grafted back on.

Output:

  1. intermediate phylogenetic graph for taxa included in a phylogeny

phase IV: graft the taxonomy-only nodes

Step 8. graph subproblems back into one phylogenetic graph

Input:

  1. phylogenetic graph for all phylo tips from step 7
  2. original taxonomy will be used as source of phylorefence definitions to be grafted back on.

Output:

  1. primary result phylogenetic graph of life

phase V: return a tree

Step 9. extract a tree from the graph

Input:

  1. the primary result phylogenetic graph of life from step 8

Output:

  1. secondary result a phylogenetic tree of life

traverse the bubble of the phylogenetic graph to return an optimal tree.

Clone this wiki locally