100 most wanted list #23

gregcaporaso · 2012-08-06T17:51:51Z

OTUs that are abundant across many environment types and distant from sequences in Greengenes/NCBI. We'll have to develop a sorting scheme for this, but it would be a way to provide a list of the "most wanted" OTUs: the high-abundance, cosmopolitan organisms that are not well characterized.

jairideout · 2012-08-06T21:55:37Z

Greg and I discussed this and decided on a sorting scheme. The most wanted list will only include "new" OTUs (i.e. ones that were created de novo, not from greengenes).

Sorting priorities:

1) Sort by the number of environments the OTU is found in.
2) Sort by the total count across all environments.
3) Sort by % dissimilarity to Greengenes.
4) Sort by % dissimilarity to the NCBI nt database.
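
For illustration only, the sorting priorities above could be expressed as a single multi-key sort. This is a minimal sketch, not the code used for this issue: the `Otu` record, its field names, and the assumption that larger values rank first for every key are all hypothetical.

```python
from typing import List, NamedTuple


class Otu(NamedTuple):
    """Hypothetical per-OTU summary; field names are illustrative only."""
    otu_id: str
    num_environments: int
    total_count: int
    gg_dissimilarity: float  # percent dissimilarity to Greengenes
    nt_dissimilarity: float  # percent dissimilarity to NCBI nt


def sort_most_wanted(otus: List[Otu]) -> List[Otu]:
    # Assumption: for every priority, a larger value is "more wanted".
    # Negating each key sorts it in descending order, and ties on an
    # earlier key fall through to the next key in the tuple.
    return sorted(
        otus,
        key=lambda o: (
            -o.num_environments,
            -o.total_count,
            -o.gg_dissimilarity,
            -o.nt_dissimilarity,
        ),
    )
```

Because the keys are compared in order, the dissimilarity values only break ties between OTUs that share the same environment count and total count.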

Output should include a tab-separated table containing the sorted most wanted OTU IDs, their sequences, the Greengenes-assigned taxonomy, and a link to the closest NCBI sequence.
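
A rough sketch of what writing such a table might look like with Python's standard `csv` module. The column headers, the dictionary keys, and the example link are placeholders chosen for illustration, not the actual output format.

```python
import csv
import io


def write_most_wanted_tsv(rows, out):
    """Write one tab-separated line per OTU to the file-like object `out`.

    `rows` is an iterable of dicts with the hypothetical keys
    'otu_id', 'sequence', 'gg_taxonomy' and 'ncbi_link'.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(
        ["OTU ID", "Sequence", "Greengenes taxonomy", "NCBI closest sequence link"]
    )
    for row in rows:
        writer.writerow(
            [row["otu_id"], row["sequence"], row["gg_taxonomy"], row["ncbi_link"]]
        )


# Usage with an in-memory buffer; the values are made up.
buffer = io.StringIO()
write_most_wanted_tsv(
    [
        {
            "otu_id": "OTU_1",
            "sequence": "ACGT",
            "gg_taxonomy": "k__Bacteria",
            "ncbi_link": "https://example.org/placeholder",
        }
    ],
    buffer,
)
```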

Additional output should be an HTML table (for easy integration into the EMP website) that contains the information above plus a piechart showing the abundance of the OTU in each environment.

jairideout · 2012-08-08T17:27:18Z

It looks like this approach probably won't work because the top N results will just be abundant OTUs found in many environments that will (most likely) be very similar to either gg or the nt database. The alternative is to first sort by % dissimilarity to the databases, but this could be really expensive (i.e. take a long time to complete, time we don't have).

A filter-based approach (as opposed to multiple levels of sorting) will probably work better. Greg previously did something similar and it seemed to work okay (though some of the parameters may need to be played with to get a good list).

1) Filter to only include novel OTUs.
2) Filter to only include high abundance OTUs, within a specified range (e.g. 100 < OTU count < 500).
3) Filter to only include OTUs that are in at least N environments/sample types.
4) Filter to only include OTUs that are at least some percent dissimilar from Greengenes (e.g. 20% dissimilar).
5) BLAST the rest against nt and sort by % dissimilarity.
6) Pick the top N from those.
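
The filter steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `OtuStats` record and its fields are hypothetical, and the BLAST step is not run here. Instead, `nt_dissimilarity` is assumed to be a precomputed mapping from OTU ID to percent dissimilarity against nt.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class OtuStats:
    """Hypothetical per-OTU summary; field names are illustrative only."""
    otu_id: str
    is_novel: bool           # True if the OTU was created de novo
    total_count: int         # total count across all environments
    num_environments: int    # number of environments/sample types
    gg_dissimilarity: float  # percent dissimilarity to Greengenes


def filter_candidates(
    otus: Iterable[OtuStats],
    min_count: int,
    max_count: int,
    min_environments: int,
    min_gg_dissimilarity: float,
) -> List[OtuStats]:
    kept = []
    for otu in otus:
        if not otu.is_novel:
            continue  # novelty filter
        if not (min_count < otu.total_count < max_count):
            continue  # abundance range filter (exclusive bounds)
        if otu.num_environments < min_environments:
            continue  # environment filter
        if otu.gg_dissimilarity < min_gg_dissimilarity:
            continue  # Greengenes dissimilarity filter
        kept.append(otu)
    return kept


def pick_most_wanted(
    candidates: List[OtuStats],
    nt_dissimilarity: Dict[str, float],
    top_n: int,
) -> List[OtuStats]:
    # Most dissimilar to nt first, then keep the top N.
    ranked = sorted(
        candidates, key=lambda o: nt_dissimilarity[o.otu_id], reverse=True
    )
    return ranked[:top_n]
```

Running the cheap filters first means only the surviving OTUs need to be compared against nt, which is the expensive step.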

We'll see how this works...

rob-knight · 2012-08-08T18:34:48Z

We could just look at the ones that were new clusters (i.e. don't have gg ids because they failed ref picking), right?


jairideout · 2012-08-08T19:37:33Z

Yes, that will be the first step in the process, but I think we'll need to do additional filtering (steps 2-5) to get a good list, because many of these novel OTUs might be very similar to either gg seqs or nt seqs.

gregcaporaso · 2012-08-13T23:57:21Z

@meganap, would you be able to help @jrrideout with some css magic to make the html table that he's putting together for this look a little nicer?

meganap · 2012-08-14T02:03:17Z

sure no prob

jairideout · 2012-08-14T05:23:24Z

@meganap awesome, thanks! I'm finishing up some changes tonight and will have the table in the repo sometime tomorrow. Will let you know when it is ready.

gregcaporaso · 2012-08-14T13:14:11Z

Once @meganap takes a crack at it, it'd be best to include her css in the
html generation code for future runs.

jairideout · 2012-08-14T22:25:33Z

@meganap, the table is in the repo now under isme14/most_wanted_otus/most_wanted_otus.html. To view it, open it up in a web browser (I've tried out Chrome and Firefox) and it should find all of the other files it needs (they are all under that same directory).

I tried to keep styling to a minimum. The table has the id 'most_wanted_otus_table' and each of the subtables for the piechart legends has the class 'most_wanted_otus_legend'. If there's anything else I can do from my end to make this HTML easier to style, please let me know.
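
As a rough illustration of that structure (not the actual table-generating code), a generator could emit the outer table with the id and each legend subtable with the class, leaving all styling to an external stylesheet. The function names and the shape of the input data below are assumptions.

```python
from html import escape


def render_legend(entries):
    """Render one piechart legend as a subtable.

    `entries` is a list of (environment_name, percent) pairs.
    """
    rows = "".join(
        "<tr><td>%s</td><td>%.1f%%</td></tr>" % (escape(env), pct)
        for env, pct in entries
    )
    return '<table class="most_wanted_otus_legend">%s</table>' % rows


def render_most_wanted_table(otus):
    """Render the outer table.

    `otus` is a list of (otu_id, legend_entries) pairs.
    """
    body = "".join(
        "<tr><td>%s</td><td>%s</td></tr>" % (escape(otu_id), render_legend(legend))
        for otu_id, legend in otus
    )
    return '<table id="most_wanted_otus_table">%s</table>' % body
```

A stylesheet can then target `#most_wanted_otus_table` and `.most_wanted_otus_legend` without any changes to the generated markup.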

I think the goal was to add this table to one of the EMP webpages. Thus, I'm not sure if we should directly add the CSS to the table-generating code as @gregcaporaso suggested, because it may be better to just use the EMP CSS stylesheets that are already in use on the website. You may need to get in touch with @douginator2000 to get access to those if you don't have them already. If we go this route, the table-generating code will be able to create generic tables which can then be styled according to whatever website scheme they might be dropped into (thinking of additional uses for this table besides the EMP website).

Thanks again for your help with this, and please let me know if you come across any issues.

@gregcaporaso, this most wanted table does not include OTU tables 1288, 933, and 550 because they were too big to filter on an m2.4xlarge EC2 instance. You mentioned offline that there might be a way to get access to a node with more memory (>69GB). Do you still want to go this route, or just use the table that we have?

jairideout · 2012-08-14T22:29:19Z

@meganap, I forgot to mention that the second column in the HTML table needs to keep its contents formatted as-is (I'm using pre tags currently, maybe there is a better way to do this though). We just need to keep it formatted with fixed-width font and have those linebreaks respected.
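
For reference, a minimal sketch of how such a cell might be produced, assuming the second column holds a sequence that is wrapped to a fixed line width. The function name and the default width are hypothetical. One CSS alternative to pre tags is to put `white-space: pre` and a monospace `font-family` on the cell.

```python
from html import escape


def format_sequence_cell(sequence, width=60):
    """Wrap `sequence` into fixed-width lines inside a <pre> block.

    The <pre> element keeps the line breaks and is typically rendered
    in a fixed-width font by browsers.
    """
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "<td><pre>%s</pre></td>" % escape("\n".join(chunks))
```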

meganap · 2012-08-14T22:30:47Z

@jrrideout cool, I'll take a crack at this tomorrow

gregcaporaso · 2012-08-14T22:39:00Z

Thanks guys!

@gregcaporaso, this most wanted table does not include OTU tables 1288, 933,
and 550 because they were too big to filter on an m2.4xlarge EC2 instance.

I think we just have to go with this for right now, but for the paper we'll get this running on a system with more memory.

gregcaporaso · 2012-08-16T13:41:09Z

@douginator2000, when this is ready could you add another collapsible section on the EMP login page (same place as the summary statistics, etc.)?

meganap · 2012-08-17T20:29:37Z

@gregcaporaso @jrrideout Sorry I didn't get a chance to work on this yet since I was working on figures for other isme stuff, but is there still time for this?

jairideout · 2012-08-17T20:35:19Z

I think there is (though I'm not 100% sure what the deadline was for this). I'm available to help however I can from my end of things.

rob-knight · 2012-08-17T20:49:01Z

Yes, still useful. Deadline is Sunday.


meganap · 2012-08-17T21:43:05Z

hey @jrrideout I noticed that there aren't any html headers for the file and that it just starts off with divs. Is there a reason for this? Adding css styling is only possible if we have html headers.

jairideout · 2012-08-17T22:09:11Z

@meganap, @gregcaporaso requested that I only output the HTML table so that it could be easily dropped into a webpage. Please feel free to modify/add to the HTML as needed to style it (this table will ultimately need to be added to the EMP login page).

meganap · 2012-08-17T23:26:06Z

@jrrideout I've edited the script that writes the html so it writes some stuff in a different way, can you send me the full command you used to run that script (like where the test files are?) so that I can rerun it?

jairideout · 2012-08-18T01:24:49Z

@meganap I'll have to rerun it because it requires the entire nt database, and everything is already set up for this in an EC2 instance. Can you please update the accompanying unit tests and check in your changes? Once they're in, I'll rerun it and commit the latest results to the repo. It won't take long to run.

jairideout · 2012-08-18T02:55:30Z

@meganap The changes are in; please let me know if you run into any issues.

jairideout · 2012-08-18T04:13:07Z

@douginator2000 this is all ready to go. All relevant files are under isme14/most_wanted_otus/. The only file that you can exclude from there is 'analysis_notes.txt'. Thanks!

@meganap thanks for your help in spicing up the table; it looks really good!

gregcaporaso · 2012-08-19T09:53:05Z

Hey guys,
This is awesome, thanks! Doug, could you get this accessible via the EMP
site?

In the meantime I posted here to make it easier for everyone else to see:
https://dl.dropbox.com/u/2868868/most_wanted_otus/most_wanted_otus.html

One thing we'll want to do is include the number of samples for each of the
metadata categories in addition to the percentage, but I think that can
wait. (Thanks for the suggestion Daniel!)
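
A small sketch of how the sample counts could be reported alongside the percentages, assuming the per-sample metadata values for one category are available as a plain list. The function name and input format are hypothetical.

```python
from collections import Counter


def summarize_category(sample_values):
    """Return (value, number_of_samples, percent) tuples, most common first.

    `sample_values` holds one metadata value per sample in which the
    OTU was observed, e.g. ["soil", "soil", "water"].
    """
    counts = Counter(sample_values)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (value, n, 100.0 * n / total) for value, n in counts.most_common()
    ]
```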

Greg

rob-knight · 2012-08-19T13:23:42Z

Yes, this is spectacular, thanks for putting this together! Could we get a tree showing where in the phylogeny the 100 most wanted are?


gilbertjack · 2012-08-19T13:33:25Z

Am I right to think that the criteria for this are those that @jrrideout came up with:

1) Filter to only include novel OTUs.
2) Filter to only include high abundance OTUs, within a specified range (i.e. 100 < OTU count < 500).
3) Filter to only include OTUs that are in at least N environments/sample types.
4) Filter to only include OTUs that are at least some percent dissimilar from Greengenes (e.g. 20% dissimilar).
5) BLAST the rest against nt and sort by % dissimilarity.

gregcaporaso · 2012-08-19T13:38:11Z

Yes, that's right. @jrrideout, correct us if we're wrong here.

gilbertjack · 2012-08-19T14:10:01Z

ok but what were the N's for these two filters:
3) Filter to only include OTUs that are in at least N environments/sample types. 
4) Filter to only include OTUs that are at least some percent dissimilar from Greengenes (e.g. 20% dissimilar).

jairideout · 2012-08-19T17:11:46Z

@gilbertjack The steps 1-5 listed above are what I used. Here are the parameters I ended up using:

1) novelty: filtered out OTUs that matched the gg 97% OTUs, keeping only the novel ones
2) abundance: 100 < OTU count < 500
3) environments: at least 4
4) Greengenes: included only OTUs that were at least 20% dissimilar (according to uclust) from the gg 97% OTUs
5) NCBI: included only OTUs that were 97% similar or less compared to the NCBI nt database (according to blastall)

So we only ended up with 45 OTUs that were left over after all of that filtering. Please let me know if you have any additional questions regarding how this list was generated.

@gregcaporaso @rob-knight I think these feature requests sound great, though I will not have time to work on them to meet the deadline today.

gregcaporaso · 2012-08-19T17:14:10Z

Thanks a lot!

gilbertjack · 2012-08-19T17:15:12Z

AWESOME, thanks

cuttlefishh · 2015-12-11T00:36:38Z

@rob-knight said: EMP most wanted and picrust definitely valuable this time around (i.e. are there “most wanted” that are in environments with “interesting” parameters?).

ghost assigned jairideout Aug 6, 2012

cuttlefishh added the analysis label May 11, 2016
