Phybin prune option #10

Open
JosieReinhardt opened this issue Aug 4, 2015 · 2 comments

@JosieReinhardt

The --prune option does not appear to be working properly. On my 22-taxon dataset, binning works when I run the full dataset, but fails in one of two ways when I use --prune.

First, the program errors out and produces no output trees/clusters if I specify more than 5 taxa with --prune. This seems to happen with any combination of taxa.

Second, if I specify 5 or fewer taxa, the program completes, but the output doesn't make sense. A single cluster is produced regardless of the edit distance I specify. The consensus tree output file includes taxa that were not specified with --prune, whereas the alltrees file includes only the taxa specified with --prune.

The output from two examples demonstrating these is pasted below:

First issue (crash when > 5 taxa are specified with --prune)

$ phybin --complete ./ntXraxml_nttree/*.* -v --editdist=10 --prune="TdKND2 TdGND2 TdGD TdKD TdLD TDALM" -o ./phybin_comp/
Cleaning away previous phybin outputs...
Parsing 489 Newick tree files.
Total unique taxa (22):
  DA EA SBECC SE DM DS TQUIN TwGD TwKD TwKND TwGND TdKND2 TdGD TdKD TdLD TdKND1 TDALM TdL TP TdGND2 TdS TW
Note: defaulting to expecting ALL 6 to be present..
..................
 WARNINGs....
....
Number of input tree files: 489
PRUNING trees to just these taxa: ["TdKND2","TdGND2","TdGD","TdKD","TdLD","TDALM"]
Number of bad/unreadable input tree files: 58
Number of VALID trees (correct # of leaves/taxa): 431
Total tree nodes contained in valid trees: 2586
Average branch len over valid trees: 0.4148210083130796
Max/Min branch lengths: (133.7779417995337,0.0)
 Using HashRF-style algorithm...
 Built matrix for dim 431
Time to compute distance matrix: 0.09973s
Clustering using method CompleteLinkage
 [finished] Wrote full dendrogram to file dendrogram.txt
Sanity checked dendrogram of size: 431
Combining all clusters at distance less than or equal to 10
 [async] writing dendrogram as a graph to dendrogram.dot
After flattening, cluster sizes are: [431]
 Outcome: 1 clusters found, 1 non-singleton, top bin sizes: [431]
  Up to first 30 bin sizes, excluding singletons:
  * cluster#1, members 431, 
 [finished] Wrote contents of each cluster to cluster<N>_<size>.txt
 [finished] Wrote representative (consensus) trees to cluster<N>_<size>_consensus.tr
NOT creating processes to build per-cluster .pdf visualizations. (Not asked to.)
Waiting for 2 asynchronous tasks to finish...
phybin: bipsToTree: Internal error!  No match for bip: fromList [11,16] out is
 [(fromList [0],NTLeaf () 0),(fromList [1],NTLeaf () 1),(fromList [2],NTLeaf () 2),(fromList [3],NTLeaf () 3),(fromList [4],NTLeaf () 4),(fromList [5],NTLeaf () 5)]
 and remaining bips 2
 when processing orig bip set:
  fromList [fromList [0,1,2,3,4,5],fromList [11,16],fromList [12,14]]
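
One possible reading of the error above, offered as a guess rather than a diagnosis: the leaves are numbered 0..5, but the failing bipartitions mention 11, 16, 12 and 14. If phybin's internal label IDs follow the order of the "Total unique taxa" line (an assumption, nothing in the log confirms it), those numbers all point at taxa in the --prune list. The sketch below just does that lookup; `TAXA`, `PRUNED` and `names` are names made up for this illustration.

```python
# Taxa in the order printed under "Total unique taxa (22)" in the log above.
# ASSUMPTION: phybin's internal label IDs follow this printed order.
TAXA = (
    "DA EA SBECC SE DM DS TQUIN TwGD TwKD TwKND TwGND "
    "TdKND2 TdGD TdKD TdLD TdKND1 TDALM TdL TP TdGND2 TdS TW"
).split()

# Taxa passed to --prune in the crashing run.
PRUNED = {"TdKND2", "TdGND2", "TdGD", "TdKD", "TdLD", "TDALM"}


def names(ids):
    """Translate label IDs into taxon names under the ordering assumption."""
    return [TAXA[i] for i in ids]


# The two bipartitions that bipsToTree could not match.
for bip in ([11, 16], [12, 14]):
    labels = names(bip)
    print(bip, labels, "all in --prune list:", all(l in PRUNED for l in labels))
```

Under that assumption, [11,16] is {TdKND2, TDALM} and [12,14] is {TdGD, TdLD}, so the bipartitions look like they still use IDs from the full 22-taxon table while the leaves were renumbered after pruning.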

Second issue (weird output when <= 5 taxa are specified with --prune)

$ phybin --complete ./ntXraxml_nttree/*.phy -v --editdist=10 --prune="TdKND2 TdGND2 TdGD TdKD TdLD" -o ./phybin_comp/
Cleaning away previous phybin outputs...
Parsing 489 Newick tree files.
Total unique taxa (22):
  DA EA SBECC SE DM DS TQUIN TwGD TwKD TwKND TwGND TdKND2 TdGD TdKD TdLD TdKND1 TDALM TdL TP TdGND2 TdS TW
Note: defaulting to expecting ALL 5 to be present..
..................
 WARNINGs...
...
Number of input tree files: 489
PRUNING trees to just these taxa: ["TdKND2","TdGND2","TdGD","TdKD","TdLD"]
Number of bad/unreadable input tree files: 58
Number of VALID trees (correct # of leaves/taxa): 431
Total tree nodes contained in valid trees: 2155
Average branch len over valid trees: 0.46325352648325513
Max/Min branch lengths: (133.7779417995337,0.0)
 Using HashRF-style algorithm...
 Built matrix for dim 431
Time to compute distance matrix: 0.011019s
Clustering using method CompleteLinkage
 [finished] Wrote full dendrogram to file dendrogram.txt
Sanity checked dendrogram of size: 431
Combining all clusters at distance less than or equal to 10
 [async] writing dendrogram as a graph to dendrogram.dot
After flattening, cluster sizes are: [431]
 Outcome: 1 clusters found, 1 non-singleton, top bin sizes: [431]
  Up to first 30 bin sizes, excluding singletons:
  * cluster#1, members 431, 
Dendrogram graph size: 1
 [finished] Wrote contents of each cluster to cluster<N>_<size>.txt
 [finished] Wrote representative (consensus) trees to cluster<N>_<size>_consensus.tr
NOT creating processes to build per-cluster .pdf visualizations. (Not asked to.)
 [async] Next, plot dendrogram.pdf
Waiting for 2 asynchronous tasks to finish...
 [finished] Writing dendrogram diagram (0.108006s)
Phybin completed.
$ cat ./phybin_comp/cluster1_431_consensus.tr 
(DA, EA, SBECC, SE, DM);
$ head -4 ./phybin_comp/cluster1_431_alltrees.tr 
(TdKND2, (TdKD, ((TdLD, TdGD), TdGND2)));
((TdLD, (TdGD, TdKD)), (TdGND2, TdKND2));
(((TdLD, (TdKD, TdGD)), TdGND2), TdKND2);
(((TdLD, TdGD), TdKD), (TdGND2, TdKND2));
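
For anyone who wants to check their own output for this mismatch, here is a minimal sketch that compares the leaf labels of a Newick string with the --prune list. It assumes simple, unquoted labels like the ones above; `leaf_labels` is a helper written for this example and is not part of phybin.

```python
import re


def leaf_labels(newick):
    """Return the leaf labels of a simple, unquoted Newick string.

    A leaf label directly follows "(" or ","; internal-node labels follow ")"
    and are therefore skipped.
    """
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick))


# Taxa passed to --prune in the second run.
pruned = {"TdKND2", "TdGND2", "TdGD", "TdKD", "TdLD"}

# Lines copied from the output pasted above.
consensus = "(DA, EA, SBECC, SE, DM);"
first_alltrees = "(TdKND2, (TdKD, ((TdLD, TdGD), TdGND2)));"

# Labels that should not be present after pruning.
print("consensus, unexpected taxa:", sorted(leaf_labels(consensus) - pruned))
print("alltrees, unexpected taxa:", sorted(leaf_labels(first_alltrees) - pruned))
```

On the pasted output, every label in the consensus tree falls outside the --prune list, while the alltrees line contains exactly the five requested taxa. The five consensus labels are also the first five entries of the "Total unique taxa" line, which may hint that the consensus writer maps leaf numbers 0..4 back through the unpruned label table, though that is only a guess.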
@rrnewton
Owner

rrnewton commented Aug 6, 2015

Thanks for this report. We should be able to help fix this. We'll first try reproducing it with one of our data sets. Failing that, any public dataset that gives the same error would be a great starting point.

@JosieReinhardt
Author

Thanks,

I ended up pre-pruning my dataset with another tool (tree_doctor), and then everything worked fine, but I figured you'd want to know anyway. I'm not sure about a public dataset, but if you want mine to reproduce the error, I'd be happy to share an anonymized version.

Josie
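
For readers who hit the same problem and do not have a pruning tool at hand, the pre-pruning workaround can be sketched roughly as below. This is not how tree_doctor or phybin implement pruning; it is a minimal illustration that assumes unquoted Newick labels without embedded whitespace. All function names and the example tree are hypothetical.

```python
import re

SPECIAL = {"(", ")", ",", ";"}


def parse_newick(text):
    """Parse unquoted Newick into nested (label, length, children) tuples."""
    tokens = re.findall(r"[(),;]|[^(),;]+", re.sub(r"\s+", "", text))
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            children.append(node())
            while tokens[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
        label, length = "", None
        if pos < len(tokens) and tokens[pos] not in SPECIAL:
            label, _, raw = tokens[pos].partition(":")
            length = float(raw) if raw else None
            pos += 1
        return (label, length, children)

    return node()


def prune(tree, keep):
    """Keep only leaves in `keep`; collapse nodes left with a single child."""
    label, length, children = tree
    if not children:
        return tree if label in keep else None
    kept = [c for c in (prune(ch, keep) for ch in children) if c is not None]
    if not kept:
        return None
    if len(kept) == 1:
        # Merge this node into its only child, summing branch lengths.
        c_label, c_length, c_children = kept[0]
        if length is None and c_length is None:
            merged = None
        else:
            merged = (length or 0.0) + (c_length or 0.0)
        return (c_label, merged, c_children)
    return (label, length, kept)


def to_newick(tree):
    """Serialize a (label, length, children) tuple back to Newick text."""
    label, length, children = tree
    text = ""
    if children:
        text = "(" + ",".join(to_newick(c) for c in children) + ")"
    text += label
    if length is not None:
        text += ":" + repr(length)
    return text


def prune_newick(text, keep):
    """Prune one Newick tree to `keep`; None if no requested taxon is present."""
    pruned = prune(parse_newick(text), set(keep))
    return None if pruned is None else to_newick(pruned) + ";"


# Hypothetical four-taxon tree, pruned down to three taxa.
print(prune_newick("((A:1,B:2):3,(C:1,D:1):2);", {"A", "C", "D"}))
```

Running each input tree through something like `prune_newick` and then giving phybin the pruned files without --prune mirrors the workaround described above. Summing branch lengths when a node is collapsed is one common convention; other tools may handle it differently.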

