
Performance idea: generator for object-style representation #432

Open

nsheff opened this issue Feb 7, 2023 · 3 comments

@nsheff commented Feb 7, 2023

In the past we've raised some issues about peppy performance (see #388, #387). Peppy is fine for small projects (hundreds or even thousands of sample rows), but it gets slow when we are dealing with huge projects, like tens to hundreds of thousands of samples.

It would be nice if peppy could handle these very large projects.

One of the problems is that peppy stores sample information in two forms: as a table (a pandas DataFrame) and as a list of Sample objects. This duplicates the information in memory.

One idea for improving performance would be to switch to a single-memory model. But we really want to be able to access the metadata in both ways for different use cases. So what about using the pandas DataFrame as the main data structure, and then providing some kind of generator that walks through it and creates objects on the fly, in case someone wants the list-based approach?

This could be one way to increase performance.

@nsheff commented Mar 22, 2023

Here's some proof-of-concept code for how to do this.

```python
import pandas

# Read in sample table
st = pandas.read_csv("sample_table.csv")

# Generator that loops through a pandas DataFrame, yielding
# each row as a dict (standing in for a Sample object).
def gen_row():
    for i in range(len(st.index)):
        yield st.iloc[i].to_dict()
```

Use like:

```python
[row for row in gen_row()]
```

Here's an alternative approach that uses a namedtuple instead of a dict:

```python
from collections import namedtuple

# Build a Sample type whose fields are the table's column names
Sample = namedtuple('Sample', st.columns.tolist())

def gen_row_obj():
    for i in range(len(st.index)):
        yield Sample(*st.iloc[i])
```
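For completeness, here is a self-contained version of the namedtuple generator that runs against a tiny in-memory table (the column names and values here are made up for illustration; they are not a real peppy sample table):

```python
import io
from collections import namedtuple

import pandas

# Hypothetical two-row table, standing in for sample_table.csv
st = pandas.read_csv(
    io.StringIO("sample_name,protocol\nfrog_1,anySeq\nfrog_2,anySeq\n")
)

# Build a Sample type whose fields are the table's column names
Sample = namedtuple("Sample", st.columns.tolist())

def gen_row_obj():
    # Yield one namedtuple per DataFrame row, created on the fly
    for i in range(len(st.index)):
        yield Sample(*st.iloc[i])

samples = list(gen_row_obj())
```

Fields are then accessed as attributes (`samples[0].sample_name`) rather than dict lookups. Note that namedtuple field names must be valid Python identifiers, so column names with spaces or special characters would need sanitizing.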

Could also think about reading the file in line by line:

```python
def csv_reader(file_name):
    # Stream rows one at a time instead of loading the whole file
    for row in open(file_name, "r"):
        yield row
```

But I think we really can't do this, because duplicate sample names in a table require processing across rows. I'm also not sure it would really be a benefit, given the added overhead of file reading and the loss of vectorized processing, since each sample would need to be processed individually.

@nsheff commented Mar 23, 2023

After looking into this more, I actually think the most important performance problem is not storing the samples in two ways, but the way duplicate sample names are identified and merged, which is extremely inefficient.

This looks like an N^2 approach: I'm not sure, but I bet Python's `list.count()` has to go through all the items in the list:

peppy/peppy/project.py, lines 637 to 649 in cac87fb:

```python
def _get_duplicated_sample_ids(self, sample_names_list: List) -> set:
    return set(
        [
            sample_id
            for sample_id in track(
                sample_names_list,
                description="Detecting duplicate sample names",
                disable=not (self.is_sample_table_large and self.progressbar),
                console=Console(file=sys.stderr),
            )
            if sample_names_list.count(sample_id) > 1
        ]
    )
```

and then the samples are looped through again here:

```python
) = self._get_duplicated_and_not_duplicated_samples(
```

and I think at other times as well. So there are some algorithmic issues. This should be achievable in one linear pass through the samples. It can probably be done very quickly using pandas, but even with Sample objects a single loop should work.

Basically, just fixing the counting so it goes through the list once would probably be a huge speed benefit, and should be really simple to implement.
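As a sketch of the one-pass fix (a hypothetical standalone function, not peppy's actual implementation): `collections.Counter` tallies every name in a single linear scan, so the whole check becomes O(n) instead of O(n^2):

```python
from collections import Counter
from typing import List

def get_duplicated_sample_ids(sample_names_list: List[str]) -> set:
    # One linear pass: tally every name once, then keep the
    # names that occur more than once.
    counts = Counter(sample_names_list)
    return {name for name, n in counts.items() if n > 1}
```

With pandas the same result is vectorized, e.g. `set(names[names.duplicated(keep=False)])` on the sample-name column, since `duplicated(keep=False)` flags every occurrence of a repeated value.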

@nsheff commented Mar 23, 2023

We should address #436 first, since that's both easier to do and probably more impactful.
