100 most wanted list #23
Comments
Greg and I discussed this and decided on a sorting scheme. The most wanted list will only include "new" OTUs (i.e. ones that were created de novo, not from greengenes). Sorting priorities:
Output should include a tab-separated table containing the sorted most wanted OTU IDs, sequence, Greengenes-assigned taxonomy, and a link to the closest NCBI sequence. Additional output should be an HTML table (for easy integration into the EMP website) that contains the information above plus a pie chart showing the abundance of the OTU in each environment.
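A minimal sketch of what the tab-separated output could look like, using only Python's standard library. The column names, the function name, and the example record are assumptions made for illustration; the thread lists what the table should contain but not the exact headers.

```python
import csv
import io

# Hypothetical column names; the thread only describes the contents.
COLUMNS = ["otu_id", "sequence", "greengenes_taxonomy", "ncbi_closest_sequence_link"]


def write_most_wanted_tsv(records, handle):
    """Write already-sorted most wanted OTU records as tab-separated text."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# Placeholder record, not real data.
example_records = [{
    "otu_id": "example_otu_1",
    "sequence": "ACGT",
    "greengenes_taxonomy": "k__Bacteria",
    "ncbi_closest_sequence_link": "https://example.org/placeholder",
}]

buffer = io.StringIO()
write_most_wanted_tsv(example_records, buffer)
print(buffer.getvalue())
```

Writing to a file handle rather than a path keeps the function easy to test and lets the same records feed both the TSV and the HTML output.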
It looks like this approach probably won't work because the top N results will just be abundant OTUs found in many environments that will (most likely) be very similar to sequences in either gg or the nt database. The alternative is to first sort by % dissimilarity to the databases, but this could be really expensive (i.e. take a long time to complete, time we don't have). A filter-based approach (as opposed to multiple levels of sorting) will probably work better. Greg previously did something similar and it seemed to work okay (though some of the parameters may need to be tuned to get a good list).
We'll see how this works...
We could just look at the ones that were new clusters (i.e. don't have gg ids because they failed ref picking), right? (Emailed reply to jrrideout's comment above, Aug 8, 2012, 11:27 AM.)
Yes, that will be the first step in the process, but I think we'll need to do additional filtering (steps 2-5) to get a good list, because many of these novel OTUs might be very similar to either gg seqs or nt seqs.
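To make the filter-based idea concrete, here is a rough sketch. Steps 2-5 are not spelled out in this thread, so the two similarity filters, the field names, and the thresholds below are illustrative stand-ins, not the filters that were actually used.

```python
from dataclasses import dataclass


@dataclass
class Otu:
    otu_id: str
    is_de_novo: bool          # True if the OTU failed reference picking (no gg ID)
    max_gg_similarity: float  # assumed: best % identity to a Greengenes sequence
    max_nt_similarity: float  # assumed: best % identity to an nt sequence


def filter_most_wanted(otus, max_gg_pct, max_nt_pct):
    """Apply filters one after another; each pass shrinks the candidate list."""
    # Step 1 from the thread: keep only new (de novo) OTUs.
    candidates = [o for o in otus if o.is_de_novo]
    # Hypothetical filter: drop OTUs too similar to Greengenes.
    candidates = [o for o in candidates if o.max_gg_similarity <= max_gg_pct]
    # Hypothetical filter: drop OTUs too similar to nt.
    candidates = [o for o in candidates if o.max_nt_similarity <= max_nt_pct]
    return candidates


# Placeholder OTUs and thresholds, not real data or real parameters.
otus = [
    Otu("ref_otu", False, 99.0, 99.0),
    Otu("novel_but_close", True, 98.0, 99.0),
    Otu("novel_and_distant", True, 80.0, 85.0),
]
print([o.otu_id for o in filter_most_wanted(otus, max_gg_pct=90.0, max_nt_pct=90.0)])
```

In a real run the similarity values would come from database searches, so ordering the filters from cheapest to most expensive means the costly nt comparison only has to run on OTUs that survived the earlier steps.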
@meganap, would you be able to help @jrrideout with some CSS magic to make the HTML table that he's putting together for this look a little nicer?
sure no prob
@meganap awesome, thanks! I'm finishing up some changes tonight and will have the table in the repo sometime tomorrow. Will let you know when it is ready.
Once @meganap takes a crack at it, it'd be best to include her css in the
@meganap, the table is in the repo now under isme14/most_wanted_otus/most_wanted_otus.html. To view it, open it in a web browser (I've tried Chrome and Firefox) and it should find all of the other files it needs (they are all under that same directory). I tried to keep styling to a minimum. The table has the id 'most_wanted_otus_table' and each of the subtables for the pie chart legends has the class 'most_wanted_otus_legend'. If there's anything else I can do from my end to make this HTML easier to style, please let me know.
I think the goal was to add this table to one of the EMP webpages. Thus, I'm not sure if we should directly add the CSS to the table-generating code as @gregcaporaso suggested, because it may be better to just use the EMP CSS stylesheets that are already in use on the website. You may need to get in touch with @douginator2000 to get access to those if you don't have them already. If we go this route, the table-generating code will be able to create generic tables which can then be styled according to whatever website scheme they might be dropped into (thinking of additional uses for this table besides the EMP website). Thanks again for your help with this, and please let me know if you come across any issues.
@gregcaporaso, this most wanted table does not include OTU tables 1288, 933, and 550 because they were too big to filter on an m2.4xlarge EC2 instance. You mentioned offline that there might be a way to get access to a node with more memory (>69GB). Do you still want to go this route, or just use the table that we have?
@meganap, I forgot to mention that the second column in the HTML table needs to keep its contents formatted as-is (I'm using pre tags currently, maybe there is a better way to do this though). We just need to keep it formatted with fixed-width font and have those linebreaks respected.
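As a rough illustration of the markup being described, the sketch below builds a bare table fragment with the id 'most_wanted_otus_table' and wraps the second column in pre tags. It assumes the second column holds the sequence, which the thread does not state; the function names and the 60-character line width are also assumptions, and this is not the actual table-generating code.

```python
from html import escape


def format_sequence(seq, width=60):
    """Break a sequence into fixed-width lines (width is an assumed value)."""
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


def render_table(rows):
    """Return an HTML table fragment (no html/head/body) for (otu_id, sequence) rows."""
    lines = ['<table id="most_wanted_otus_table">']
    for otu_id, seq in rows:
        # <pre> keeps the fixed-width font and respects the line breaks.
        lines.append("<tr><td>%s</td><td><pre>%s</pre></td></tr>"
                     % (escape(otu_id), escape(format_sequence(seq))))
    lines.append("</table>")
    return "\n".join(lines)


# Placeholder row, not real data.
print(render_table([("example_otu_1", "ACGT" * 20)]))
```

If pre tags turn out to be awkward to style, a CSS rule on that cell combining `white-space: pre` with a monospace `font-family` should give the same visual result while leaving the markup generic.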
@jrrideout cool, I'll take a crack at this tomorrow
Thanks guys!
I think we just have to go with this for right now, but for the paper we'll get this running on a system with more memory.
@douginator2000, when this is ready could you add another collapsible section on the EMP login page (same place as the summary statistics, etc)?
@gregcaporaso @jrrideout Sorry I didn't get a chance to work on this yet since I was working on figures for other isme stuff, but is there still time for this?
I think there is (though I'm not 100% sure what the deadline was for this). I'm available to help however I can from my end of things.
Yes still useful, deadline Sunday. (Emailed reply to jrrideout's comment above, Aug 17, 2012, 4:35 PM.)
hey @jrrideout I noticed that there aren't any html headers for the file and that it just starts off with divs. Is there a reason for this? Adding css styling is only possible if we have html headers.
@meganap, @gregcaporaso requested that I only output the HTML table so that it could be easily dropped into a webpage. Please feel free to modify/add to the HTML as needed to style it (this table will ultimately need to be added to the EMP login page).
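One possible way to serve both needs is to keep emitting the bare fragment and wrap it in a minimal page only when previewing styles locally. The sketch below shows that idea; the function name and the stylesheet filename are hypothetical, and the host page on the EMP site would supply its own head and stylesheets instead.

```python
def wrap_fragment(fragment, stylesheet="most_wanted_otus.css"):
    """Wrap a bare HTML fragment in a minimal page that links one stylesheet.

    The stylesheet name is a placeholder; nothing in the thread names the file.
    """
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        '<link rel="stylesheet" href="%s">\n'
        "</head>\n<body>\n%s\n</body>\n</html>\n" % (stylesheet, fragment)
    )


# Preview page for a placeholder fragment.
print(wrap_fragment('<div><table id="most_wanted_otus_table"></table></div>'))
```

Because the fragment itself is unchanged, the same output can still be dropped into another page as originally requested.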
@jrrideout I've edited the script that writes the html so it writes some stuff in a different way, can you send me the full command you used to run that script (like where the test files are?) so that I can rerun it?
@meganap I'll have to rerun it because it requires the entire nt database, and everything is already set up for this in an EC2 instance. Can you please update the accompanying unit tests and check in your changes? Once they're in, I'll rerun it and commit the latest results to the repo. It won't take long to run.
@meganap The changes are in; please let me know if you run into any issues.
@douginator2000 this is all ready to go. All relevant files are under isme14/most_wanted_otus/. The only file that you can exclude from there is 'analysis_notes.txt'. Thanks! @meganap thanks for your help in spicing up the table; it looks really good!
Hey guys, In the meantime I posted here to make it easier for everyone else to see: One thing we'll want to do is include the number of samples for each of the
Greg
Yes this is spectacular, thanks for putting it together! Could we get a tree showing where in phylogeny the 100 most wanted are? (Emailed reply to Greg Caporaso's message above, Aug 19, 2012, 11:53 AM.)
Am I right to think that the criteria for this are those that @jrrideout came up with:
Yes, that's right. @jrrideout, correct us if we're wrong here.
ok but what were the N's for these two filters:
@gilbertjack The steps 1-5 listed above are what I used. Here are the parameters I ended up using:
So we only ended up with 45 OTUs that were left over after all of that filtering. Please let me know if you have any additional questions regarding how this list was generated. @gregcaporaso @rob-knight I think these feature requests sound great, though I will not have time to work on them to meet the deadline today.
Thanks a lot!
AWESOME, thanks
@rob-knight said: EMP most wanted and picrust definitely valuable this time around (i.e. are there “most wanted” that are in environments with “interesting” parameters?).
The OTUs that are abundant across many environment types and distant from sequences in Greengenes/NCBI. We'll have to develop a sorting scheme for this, but it would be a way to provide a list of the "most wanted" OTUs, i.e. the high-abundance cosmopolitan organisms that are not well characterized.
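As a small illustration of "abundant across many environment types", the sketch below sums one OTU's counts per environment and converts them to fractions. The function names, the sample-to-environment mapping, and the counts are all made up for the example; how abundance was actually summarized is not described in this issue.

```python
from collections import defaultdict


def abundance_by_environment(otu_counts, sample_env):
    """Sum one OTU's per-sample counts within each environment type."""
    totals = defaultdict(int)
    for sample_id, count in otu_counts.items():
        if count > 0:
            totals[sample_env[sample_id]] += count
    return dict(totals)


def environment_fractions(totals):
    """Convert per-environment totals to fractions of the OTU's total count."""
    grand_total = sum(totals.values())
    return {env: n / grand_total for env, n in totals.items()}


# Placeholder counts and environment labels, not real data.
otu_counts = {"s1": 3, "s2": 2, "s3": 5, "s4": 0}
sample_env = {"s1": "soil", "s2": "soil", "s3": "marine", "s4": "gut"}

totals = abundance_by_environment(otu_counts, sample_env)
print(len(totals), "environments:", environment_fractions(totals))
```

The number of keys in `totals` gives a simple "how cosmopolitan is this OTU" measure, and the fractions are the kind of values a per-environment pie chart would need.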