
Performance idea: generator for object-style representation #432

Open

nsheff opened this issue Feb 7, 2023 · 3 comments

@nsheff commented Feb 7, 2023

In the past we've raised some issues about peppy performance (see #388, #387). Peppy is fine for small projects (hundreds or even thousands of sample rows), but it gets slow when we are dealing with huge projects, like tens to hundreds of thousands of samples.

It would be nice if peppy could handle these very large projects.

One of the problems is that peppy stores sample information in two forms: as a table (a pandas DataFrame) and as a list of Sample objects. This duplicates the information in memory.

One idea for improving performance would be to switch to a single-memory model. But we really want to be able to access the metadata in both ways for different use cases. So what about using the pandas DataFrame as the main data structure, and then providing some kind of generator that walks through it and creates objects on the fly, in case someone wants the list-based approach?

This could be one way to increase performance.

@nsheff commented Mar 22, 2023

Here's some proof-of-concept code for how to do this.

```python
import pandas

# Read in sample table
st = pandas.read_csv("sample_table.csv")

# Generator that loops through a pandas DataFrame, yielding
# each row as a dict (standing in for a Sample object).
def gen_row():
    for i in range(len(st.index)):
        yield st.iloc[i].to_dict()
```

Use like:

```python
[row for row in gen_row()]
```

Here's an alternative approach that uses a namedtuple instead of a dict:

```python
from collections import namedtuple

# Build a Sample type whose fields are the table's column names
Sample = namedtuple('Sample', st.columns.tolist())

def gen_row_obj():
    for i in range(len(st.index)):
        yield Sample(*st.iloc[i])
```
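For completeness, here is a self-contained version of the namedtuple generator that runs against a tiny in-memory table (the column names and values here are made up for illustration; they are not a real peppy sample table):

```python
import io
from collections import namedtuple

import pandas

# Hypothetical two-row table, standing in for sample_table.csv
st = pandas.read_csv(
    io.StringIO("sample_name,protocol\nfrog_1,anySeq\nfrog_2,anySeq\n")
)

# Build a Sample type whose fields are the table's column names
Sample = namedtuple("Sample", st.columns.tolist())

def gen_row_obj():
    # Yield one namedtuple per DataFrame row, created on the fly
    for i in range(len(st.index)):
        yield Sample(*st.iloc[i])

samples = list(gen_row_obj())
```

Fields are then accessed as attributes (`samples[0].sample_name`) rather than dict lookups. Note that namedtuple field names must be valid Python identifiers, so column names with spaces or special characters would need sanitizing.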

Could also think about reading the file in line by line:

```python
def csv_reader(file_name):
    # Stream rows one at a time instead of loading the whole file
    for row in open(file_name, "r"):
        yield row
```

But I think we really can't do this, because duplicate sample names in a table require processing across rows. I'm also not sure it would really be a benefit, given the added overhead of file reading and the loss of vectorized processing, since each sample would need to be processed individually.

@nsheff commented Mar 23, 2023

After looking into this more, I actually think the most important performance problem is not storing the samples in two ways, but the way duplicate sample names are identified and merged, which is extremely inefficient.

This looks like an N^2 approach: I'm not sure, but I bet Python's `list.count()` has to go through all the items in the list:

peppy/peppy/project.py, lines 637 to 649 in cac87fb:

```python
def _get_duplicated_sample_ids(self, sample_names_list: List) -> set:
    return set(
        [
            sample_id
            for sample_id in track(
                sample_names_list,
                description="Detecting duplicate sample names",
                disable=not (self.is_sample_table_large and self.progressbar),
                console=Console(file=sys.stderr),
            )
            if sample_names_list.count(sample_id) > 1
        ]
    )
```

and then the samples are looped through again here:

```python
) = self._get_duplicated_and_not_duplicated_samples(
```

and I think at other times as well. So there are some algorithmic issues. This should be achievable in one linear pass through the samples. It can probably be done very quickly using pandas, but even with Sample objects a single loop should work.

Basically, just fixing the counting so it goes through the list once would probably be a huge speed benefit, and should be really simple to implement.
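As a sketch of the one-pass fix (a hypothetical standalone function, not peppy's actual implementation): `collections.Counter` tallies every name in a single linear scan, so the whole check becomes O(n) instead of O(n^2):

```python
from collections import Counter
from typing import List

def get_duplicated_sample_ids(sample_names_list: List[str]) -> set:
    # One linear pass: tally every name once, then keep the
    # names that occur more than once.
    counts = Counter(sample_names_list)
    return {name for name, n in counts.items() if n > 1}
```

With pandas the same result is vectorized, e.g. `set(names[names.duplicated(keep=False)])` on the sample-name column, since `duplicated(keep=False)` flags every occurrence of a repeated value.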

@nsheff commented Mar 23, 2023

We should address #436 first, since that's both easier to do and probably more impactful.
