-
Notifications
You must be signed in to change notification settings - Fork 26
pseudo code supertree
Mark T. Holder edited this page Feb 25, 2015
·
4 revisions
Note: this is being turned into code described here. As it is a work-in-progress, there will probably be some discrepancies between that pipeline and this one. Hopefully, we'll clean this up soon...
Input:
- OTT pruned by treemachine and expressed in newick
- list of ranked trees (most important tree first)
- phylesystem-api for grabbing the nexson.
Conventions:
- $SYNTHESIS_PARENT is some dir in your env.
-
outdir
is $SYNTHESIS_PARENT/out - the directory that will be the parent of the output -
provdir
is${outdir}/provenance
- directory holding artifacts from the synthesis run
And normalize IDs (which lack the "tree" in many treeIDs in treemachine syntax)
Inputs:
- tree ranking list. one identifier per line using (the loathed)
studyid_treeid
format - a local copy of the phylesystem repo.
fetch-and-snapshot-nexsons.bash
:
#!/bin/bash cd phyleysystem/shards/phylesystem-1 git pull origin master cd $SYNTHESIS_PARENT for tag in $(cat in/ranked-tree-list.txt) do snapshot-nexson-for-synth.py "${tag}" outdir/raw-nexsons done
snapshot-nexsons.py
:
#!/usr/bin/env python demunger_pattern = re.compile('([a-z][a-z]_[0-9]+)_([0-9]+)') study_tree_str = open(sys.argv[1], 'rU') nexson_snapshot = sys.argv[2] create_dir(nexson_snapshot) def do_curation_check(study, tree, prov): e = {} if 'ingroup' not in tree: e.setdefault('tree_errors', []).append("ingroup missing") if 'doi' not in stdy: e.setdefault('tree_errors', []).append("doi missing") if 'root' not in tree: e.setdefault('tree_errors', []).append("root missing") if e: raise CurationError(e, prov) try: for study_tree_str in open(sys.argv[1], 'rU'): match = demunger_pattern(study_tree_str) assert match study_id, tree_id = match.group(1), match.group(2) study = phylesystem.get(study) try: tree = get_tree(study, tree_id) except: tree_id = 'tree' + tree_id tree = get_tree(study, tree_id) study.cull_other_trees(tree_id) sha = get_sha(study) prov = {'study_id':study, 'tree_id':tree_id, 'sha':sha, 'local_path': local_path} do_curation_check(study, tree, prov) local_path = nexson_snapshot + '/' + study_id + '_' + tree_id + '.json' save_json_to_path(study, local_path) sha = get_sha(study) prov = {'study_id':study, 'tree_id':tree_id, 'sha':sha, 'local_path': local_path} append_el_to_provenance('study_list_with_shas.json', ) study_tree_sha = [] with open(prov_dir + '/' + 'study_list_with_shas.json') as j: json.dump(study_tree_sha, j) except CurationError as x: LOG_ERROR(x) sys.exit(1)
Input:
- path to pruned taxonomic tree (read-only)
- filepath to NexSON with one tree
- output directory
Output:
- version of each tree with level 1 remapping - mandatory pruning
taxonomy = parse_newick(sys.argv[1]) inf = sys.argv[2] input = parse_nexson(inf) outpath = os.path.join(sys.argv[3], os.path.split(inf)[1]) assert outpath != inf assert input.num_trees() == 1 tree = input.get_trees()[0] ottId2leaf = {} unmapped_dropped = [] pruned_leaves = [] suppressed_internals = [] sup = suppress_out_degree_one(tree) suppressed_internals.extend(sup) for leaf in iter_leaves(tree): ott_id = get_ott_id(leaf) if ott_id is None: unmapped_dropped.append(leaf) ottId2leaf.setdefault(ott_id, []).append(leaf) pruned_higher_taxon = [] leaf_ott_ids = set() for ott_id in ottId2leaf.keys(): anc_ids = included_anc_ids(taxonomy, leaf_ott_ids) pruned_higher_taxon.exten(list(anc_ids)) if is_anc_of_id(taxonomy, leaf_ott_ids): pruned_higher_taxon.append(ott_id) else: leaf_ott_ids.add(ott_id) uniq_o2n = {} for ott_id, nodes in ottId2leaf.items(): if len(nodes) > 1: is_exemplar = [i for i in nodes if has_is_exemplar_tag(node)] if len(is_exemplar) > 1: raise CurationError({'tree_errors': 'multiple nodes flagged as exemplar for the same taxon: {}'.format(is_exemplar)}) if len(is_exemplar) == 1: ex = is_exemplar[0] else: if form_a_clade(tree, nodes): ex = arbitrary_exemplar(nodes) WARN('arbitrarily chose ex to represent ott_id') else: ex = arbitrary_exemplar(nodes) WARN_OR_ERROR('arbitrarily chose ex to represent ott_id - and this changes decision might affect the input to the supertree!') pruned_leaves.extend([i for i in nodes if i != ex]) uniq_o2n[ott_id] = ex else: uniq_o2n[ott_id] = nodes[0] if len(uniq_o2n) is less than 3: raise CurationError({'tree_errors': 'fewer than 3 nodes mapped to unique, non-nested OTT taxa'}, prov) suppressed_internals = tree.prune(pruned_leaves) suppressed_internals.extend(tree.prune(pruned_higher_taxon)) unmapped_dropped.extend(tree.prune(unmapped_dropped))
Inputs:
- taxonomy
- list of all nexson
Output:
- Each tree with level-2 remapping: no contested tips * pruning of an tip that was mapped to a higher taxon (in the curated nexson) which hss contested by another input tree,
- Each tree with level-3 remapping: * the OTT ID of each tip from the level-2 remapping will be replaced with the deepest uncontested ott ID that it examplifies
- A pruned newick of the taxonomy which is the induced tree for the remapped ottID set that was created by this operation.
taxonomy = parse_newick(sys.argv[1]) add_mrca_sets(taxonomy) for input_tree in inputs: add_mrca_sets(input_tree) for clade in input_tree.iter_internal_node(): taxonomy.flag_nodes_that_conflict(input_tree.leaf_taxon_set, clade.leaf_taxon_set) prune = [] for input_tree in inputs: output_path_for_this_tree = ... taxonomy.clear_encountered_marks() for leaf in in input_tree.iter_leaves(): taxonomy.mark_path_to_root_as_encountered(leaf.ott_id) for leaf in in input_tree.iter_leaves(): taxon = taxonomy.nodebyid[leaf.ott_id] if is_contested(taxon): prune.extend(leaf) else: new_taxon = tree.find_deepest_uncontested_terminal_edge_anc(taxon) if new_taxon is not taxon: remap.append((leaf, taxon, new_taxon)) suppressed_internals = tree.prune(prune) store_level_2_remapping(tree) if tree.num_tips is less than 3: WARN('removing of leaves that were mapped to contested higher taxa resulted in an uninformative input...') tree.remap_taxa(remap) store_level_3_remapping(tree) taxonomy.clear_encountered_marks() for input_tree in inputs: for leaf in in input_tree.iter_leaves(): taxonomy.mark_path_to_root_as_encountered(leaf.ott_id) taxonomy.prune_unencountered_nodes() taxonomy.write_newick(open(pruned_taxo_filepath, 'w'), internal_names=True)
Input
- taxonomy (pruned from step 2 is what we'll use for speed)
- set of level 3 remapped input trees
Output:
- Artifact: each tree in level-4 remapping: Any tip that is mapped to an internal node in the taxo-2 tree is replaced with a polytomy of the leaf taxa below it, and a marker to indicate that the polytomy node is an assumption - it is not supported by the tree. If the pipeline were more modular, this would be an intermediate (as the input to the creation of level-5 remapping), but it would be more efficient to just write both from one execution).
- intermediate: each tree in level-5 remapping. pruned phylo singeltons. input for Step 4 any tip that is only seen in one phylogeny is culled,
- intermediate: phylo-informative taxonmic pruning: taxonomy pruned to the tree induced by the level-5 input remapping's leaf set. this is an input for step 4.
Input:
- taxonomy (pruned is what we'll use for speed)
- set of fully remapped inputs
Output:
- intermediate subdir for each problem to be solved, containing versions of each relevant tree pruned down to the relevant leaf set for this suproblem. These are the inputs of step 5
- intermediate some JSON format for the reattachement recipe describing grafting
Input:
- relevant trees from step 4
Output: 1.intermediate phylogenetic graph for the subproblem
Input:
- intermediate phylogenetic graphs from step 5.
- intermediate reattachement strategy from step 4.
Output:
- intermediate phylogenetic graph for phylo-tips that are not singeltons
Input:
- phylogenetic graph for phylo-tips that are not singeltons
- level-2 remapped input tree will be used as source of phylorefence definitions to be grafted back on.
Output:
- intermediate phylogenetic graph for taxa included in a phylogeny
Input:
- phylogenetic graph for all phylo tips from step 7
- original taxonomy will be used as source of phylorefence definitions to be grafted back on.
Output:
- primary result phylogenetic graph of life
Input:
- the primary result phylogenetic graph of life from step 8
Output:
- secondary result a phylogenetic tree of life
traverse the bubble of the phylogenetic graph to return an optimal tree.