Performance idea: generator for object-style representation #432
Comments
Here's some proof of concept code on how to do this.
Use like:
Here's an alternative approach that uses a
Could also think about reading in the file line-by-line as well:
But I think we really can't do this, because duplicate sample names in a table require processing across rows. I'm also not sure it would really be a benefit, given the added overhead of file reading and the loss of vectorized processing, since each sample would need to be processed individually.
After looking into this more, I think the most important performance problem is not that the samples are stored in two ways, but that duplicate sample names are identified and merged in an extremely inefficient way. This looks like an N^2 approach, since Python's list.count() has to go through all the items in the list every time it is called: Lines 637 to 649 in cac87fb
and then the samples are looped through again here: Line 590 in cac87fb
and I think at other times as well. So there are some algorithmic issues. This should be achievable in one linear pass through the sample objects. It can probably be done very quickly using pandas, but even with sample objects a single loop should work. Basically, just fixing the counting so that it goes through the samples and counts once would probably be a huge speed benefit, and should be really simple to implement.
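To illustrate the difference, here is a minimal sketch, not peppy's actual code. It assumes samples are plain dicts with a `sample_name` key (a stand-in for `Sample` objects), and the helper names `count_quadratic`, `count_linear`, and `group_duplicates` are hypothetical.

```python
from collections import Counter, defaultdict


def count_quadratic(names):
    # list.count() rescans the full list for every name: O(N^2) overall.
    return {name: names.count(name) for name in names}


def count_linear(names):
    # Counter tallies every name in a single pass: O(N).
    return dict(Counter(names))


def group_duplicates(samples, key="sample_name"):
    """Group samples by name in one pass; keep only names seen more than once."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)
    return {name: rows for name, rows in groups.items() if len(rows) > 1}


# Hypothetical sample records; real peppy samples are Sample objects.
samples = [
    {"sample_name": "a", "file": "a1.fq"},
    {"sample_name": "b", "file": "b1.fq"},
    {"sample_name": "a", "file": "a2.fq"},
]
names = [s["sample_name"] for s in samples]
duplicates = group_duplicates(samples)
```

Both counting functions give the same result, but only the second scales linearly. The grouped dict then gives the merge step direct access to every set of duplicates without another scan.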
We should address #436 first, since that's both easier to do and probably more impactful.
In the past we've raised some issues about peppy performance (see #388, #387). Peppy is fine for small projects (hundreds or even thousands of sample rows), but it gets slow when we are dealing with huge projects, like tens to hundreds of thousands of samples. It would be nice if peppy could handle these very large projects.
One of the problems is that peppy stores sample information in two forms: a table (as a pandas DataFrame object), and a list of Sample objects. This duplicates the information in memory.
An idea for improving the performance could be to switch to a single-memory model. But we really want to be able to access the metadata in both ways for different use cases... so what about using the pandas DataFrame as the main data structure, and then providing some kind of generator that could go through it and create objects on the fly, in case someone wants the list-based approach?
This could be one way to increase performance.
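A rough sketch of the generator idea, assuming the sample table is a pandas DataFrame with one row per sample. The `Sample` class here is a hypothetical stand-in rather than peppy's real class, and `iter_samples` is an illustrative name.

```python
import pandas as pd


class Sample:
    """Hypothetical stand-in for peppy's Sample class."""

    def __init__(self, attributes):
        # Expose each column value as an attribute of the object.
        self.__dict__.update(attributes)


def iter_samples(sample_table):
    """Yield one Sample per row, built on demand from the DataFrame."""
    columns = list(sample_table.columns)
    # name=None makes itertuples return plain tuples, one per row.
    for values in sample_table.itertuples(index=False, name=None):
        yield Sample(dict(zip(columns, values)))


# Illustrative table; the column names are assumptions.
sample_table = pd.DataFrame(
    {"sample_name": ["a", "b", "c"], "protocol": ["ATAC", "RNA", "ATAC"]}
)

atac_names = [s.sample_name for s in iter_samples(sample_table) if s.protocol == "ATAC"]
```

Only the DataFrame stays resident in memory. Each object exists just while the caller holds it, and anyone who does want the full list can still call `list(iter_samples(sample_table))`.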