# Address SQLite Read Performance Issues #198
Hi @gwaygenomics - I'm noticing that, per the query below, SQLite may include column values that don't match the column specification:

```sql
select ObjectNumber,
  Cytoplasm_Correlation_Costes_AGP_DNA,
  typeof(Cytoplasm_Correlation_Costes_AGP_DNA)
from Cytoplasm
where typeof(Cytoplasm_Correlation_Costes_AGP_DNA) != 'real';
```

This could be one cause of the performance issues using Pandas and SQLAlchemy, but it can also impact alternatives. Because these columns are sometimes indicated as not-null, it means we may not simply use NULL in their place. Do you know of a way we could interpret these 'nan' values correctly for the purposes of pycytominer (for example, replace with 0, filter from results in the database, etc.)?
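The same kind of check can be run from Python. The following is a minimal, self-contained sketch: the table and column names mirror the issue, but the rows are fabricated purely for illustration.

```python
import sqlite3

# Build a tiny in-memory table mimicking the issue: a FLOAT-affinity
# column that nonetheless receives the TEXT value 'nan'.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Cytoplasm ("
    "ObjectNumber INTEGER, "
    "Cytoplasm_Correlation_Costes_AGP_DNA FLOAT NOT NULL)"
)
con.executemany(
    "INSERT INTO Cytoplasm VALUES (?, ?)",
    [(1, 0.1), (2, "nan")],  # 'nan' is accepted as TEXT despite FLOAT affinity
)

# Group values by their actual storage class to surface mismatches.
storage_classes = dict(
    con.execute(
        "SELECT typeof(Cytoplasm_Correlation_Costes_AGP_DNA), COUNT(*) "
        "FROM Cytoplasm GROUP BY 1"
    ).fetchall()
)
print(storage_classes)
```

Running a sweep like this per column is a quick way to find which tables contain values whose storage class diverges from the declared affinity.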
No worries and thank you @gwaygenomics - I'll take a look. Enjoy the conference!
Hi @gwaygenomics - I just wanted to follow up with some findings; I'm open to any guidance or input you have along the way. I've spent some time investigating ways to speed up SQLite reads, taking note of the background you mentioned a little earlier. I don't yet have thoughts on solutions for the cytominer-database issue. The file I've focused on for this has so far been only SQ00014613.sqlite.

### SQLite Data Type Flexibility

SQLite is a database with sometimes uncommon flexibility when compared to other RDBMSs. One way this is true is through affinity types and storage classes (see docs here). Affinity typing enables table columns to have a "preference" for certain storage classes. Storage classes are ultimately the source of truth concerning a particular value in a column. This means, for example, that a column of affinity type FLOAT can store a value of storage class TEXT. When Pandas reads such a column, the result is a mixed-type "object" column:

| SQLite Column Affinity Type | SQLite Value | SQLite Storage Class | Pandas DF Column Type | Pandas Value Type |
|---|---|---|---|---|
| FLOAT | 0.1 | REAL | Object | float |
| FLOAT | 'nan' | TEXT | Object | str |
In and of itself, this may be a bottleneck; Pandas may be performing sub-optimally due to mixed typing of columns. We could modify our SQL queries to sidestep this, but we'd likely be stepping into more performance concerns (in-flight SQL data transforms). Recasting types within the Pandas dataframe may also have costly impacts on performance.
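For illustration, here's a small self-contained sketch of the mixed-typing behavior in the table above (rows fabricated, not taken from the real dataset):

```python
import sqlite3
import pandas as pd

# A FLOAT-affinity column holding one REAL and one TEXT value arrives
# in Pandas as an "object" column with per-row types float and str.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Cytoplasm (Cytoplasm_Correlation_Costes_AGP_DNA FLOAT)")
con.executemany("INSERT INTO Cytoplasm VALUES (?)", [(0.1,), ("nan",)])

df = pd.read_sql("SELECT * FROM Cytoplasm", con)
col = df["Cytoplasm_Correlation_Costes_AGP_DNA"]
print(col.dtype)                        # object
print([type(v).__name__ for v in col])  # ['float', 'str']
```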
### What About SQLite NULL?

SQLite provides a special storage class called `NULL` which may be used for non-values. Let's say we replace the 'nan' value(s) with NULL within the table column. When we do this and read it using Pandas, the resulting column type is `float64` and the two values are of type `numpy.float64` (SQLite `NULL` is interpreted as `numpy.nan`).
| SQLite Column Affinity Type | SQLite Value | SQLite Storage Class | Pandas DF Column Type | Pandas Value Type |
|---|---|---|---|---|
| FLOAT | 0.1 | REAL | float64 | numpy.float64 |
| FLOAT | NULL | NULL | float64 | numpy.float64 |
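The second table can be reproduced with a small sketch (fabricated rows): with NULL in place of the TEXT 'nan', Pandas infers `float64` and maps the NULL to `numpy.nan`.

```python
import sqlite3
import numpy as np
import pandas as pd

# Same illustrative setup as before, but with NULL instead of 'nan':
# Pandas now infers a uniform float64 column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Cytoplasm (Cytoplasm_Correlation_Costes_AGP_DNA FLOAT)")
con.executemany("INSERT INTO Cytoplasm VALUES (?)", [(0.1,), (None,)])

df = pd.read_sql("SELECT * FROM Cytoplasm", con)
col = df["Cytoplasm_Correlation_Costes_AGP_DNA"]
print(col.dtype)              # float64
print(np.isnan(col.iloc[1]))  # True
```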
As mentioned above, this may lend itself to healthier performance. The catch: NULL may only be used in columns which do not have a `NOT NULL` constraint. The column `cytoplasm.Cytoplasm_Correlation_Costes_AGP_DNA`, where I'm witnessing the mixed values, does indeed carry a `NOT NULL` constraint.
Despite SQLite's flexibility with affinity and storage class, it's relatively inflexible when it comes to column constraint modifications. Changing `cytoplasm.Cytoplasm_Correlation_Costes_AGP_DNA` to be nullable within SQLite requires creating a brand new copy of the table which does not include the constraint, removing the original after completion, and ideally performing a `VACUUM;` at some point (quite a lift, oof!). Despite the challenges, there's some promise in doing this work - see below.
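A hedged sketch of that rebuild: SQLite can't drop a NOT NULL constraint in place, so we copy the data into a constraint-free table, swap names, convert the 'nan' values to NULL, and reclaim space with VACUUM. The table/column names and data here are illustrative, not taken from the real schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Cytoplasm (col FLOAT NOT NULL)")
con.executemany("INSERT INTO Cytoplasm VALUES (?)", [(0.1,), ("nan",)])

# 1. Create a copy of the table without the NOT NULL constraint.
con.execute("CREATE TABLE Cytoplasm_new (col FLOAT)")
con.execute("INSERT INTO Cytoplasm_new SELECT col FROM Cytoplasm")

# 2. Drop the original and take over its name.
con.execute("DROP TABLE Cytoplasm")
con.execute("ALTER TABLE Cytoplasm_new RENAME TO Cytoplasm")

# 3. Now that the column is nullable, replace TEXT values with NULL.
con.execute("UPDATE Cytoplasm SET col = NULL WHERE typeof(col) = 'text'")
con.commit()

# 4. Reclaim the space freed by the dropped table.
con.execute("VACUUM")

types = [row[0] for row in con.execute("SELECT typeof(col) FROM Cytoplasm")]
print(types)  # ['real', 'null']
```

On a real multi-gigabyte file each step would need care (indexes, triggers, and foreign keys also have to be recreated), which is part of why this is "quite a lift".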
### Opportunities from Switching 'nan' for NULL
To summarize, I wanted to restate some of the above and include some additional items as opportunities. I ran tests for these cases to find out more precisely whether and how they might benefit you and other users of this library (see the test results section below).
- Pandas memory consumption may improve (numeric types may consume less memory than objects).
- Pandas performance may improve (merges, joins, and data changes may have greater performance due to relative data uniformity and perhaps fewer in-flight conversions).
- We'd gain compatibility for other tooling such as ConnectorX, which is purpose-built for greater bulk read performance (see here for example and reported performance improvements).
- Due to how it's optimized, ConnectorX is not compatible with SQLite databases that include storage class values which differ from their column affinity types (see here).
### Some Test Results

Running tests using the same constraints as reported and tested in #195, on a GCP e2-highmem-8 machine (8 vCPU, 64 GB memory).
- As-is performance with SQLite `'nan'`s:
  - Duration: ~28 minutes
  - Peak Memory Consumption: 64.7 GB
  - Link to flamegraph
- As-is performance with SQLite `'nan'`s replaced by `NULL`:
  - Duration: ~18 minutes (down ~10 minutes from as-is)
  - Peak Memory Consumption: 46.4 GB (down 18.3 GB from as-is)
  - Link to flamegraph
- ConnectorX reads for `load_compartment` and with SQLite `'nan'`s replaced by `NULL`:
  - Duration: ~6 minutes (down ~22 minutes from as-is)
  - Peak Memory Consumption: 42.4 GB (down 22.3 GB from as-is)
  - Link to flamegraph
Amazing, thanks for this detail @d33bs. A couple of follow-up questions:

Thanks again @d33bs - and I hope you have a great Memorial Day weekend!
Thank you for the kind words and follow-ups @gwaygenomics! And wishing you a fantastic Memorial Day weekend as well! 🙂 Addressing your questions below:
I converted the 'nan's to NULL using the following (non-optimal, solely for testing) Python file. I'm still learning the full lifecycle of the data involved, so apologies if I misunderstand things with what follows:
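The Python file referenced above isn't included in this thread; the following is only a hypothetical reconstruction of the idea: discover a table's columns, then NULL out any values stored as TEXT. It assumes the columns are already nullable (see the NOT NULL discussion earlier), and the table/data are fabricated.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Cytoplasm (a FLOAT, b FLOAT)")
con.execute("INSERT INTO Cytoplasm VALUES (0.1, 'nan')")

# Discover the table's columns (name is field 1 of PRAGMA table_info),
# then sweep each one, replacing TEXT-stored values with NULL.
columns = [row[1] for row in con.execute("PRAGMA table_info(Cytoplasm)")]
for name in columns:
    con.execute(
        f'UPDATE Cytoplasm SET "{name}" = NULL WHERE typeof("{name}") = \'text\''
    )
con.commit()

print(con.execute("SELECT * FROM Cytoplasm").fetchone())  # (0.1, None)
```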
Got it! Here are some responses:

### To Converting nans to Nulls

I think this is the way to go. When you get a chance, can you add some version of your python file as a new PR in this repo? Does this sound like a good strategy? Here are the first two concrete use-cases :)
### SQLite support timeline

My estimation is indefinitely, unfortunately. We might consider moving SQLite support elsewhere. We may also decide to create this conversion functionality in an entirely new repo (upstream of pycytominer in the workflow). Scheduling a full SQLite deprecation will require other stakeholders to chime in.

### Memory leak solution

Sounds great!! I'm looking forward to hearing about what you discover!
Thank you @gwaygenomics!

I like this idea and might expand it a bit to include some functionality for detecting challenges with these datasets, based on what I found here. More to follow in a PR for these ideas.
@d33bs - does SQLiteClean or CytoTable mean we can close this issue?

Hi @gwaybio - thanks, I think we could close this issue based on the work in SQLite-clean and CytoTable, as you mentioned. Maybe we could document expectations somewhere reasonable, with user guidance for large SQLite data handling within Pycytominer? This could be a new issue or it could be a "closing task" for this issue.

This sounds good. Something along the lines of: Note: Single-cell processing speed of SQLite files depends on SQLite file size and the amount of memory (RAM) on the computational environment machine. I can add this somewhere and will tag you to review. Also feel free to modify this text here (or in the PR, whichever is useful). I think this can be the closing task for this issue.

Sounds great!
SQLite files used by this repo create performance challenges during runtime that may hinder or completely stall progress. This issue is dedicated to targeting SQLite read performance issues, including solutions which convert away from the format or may not use SQLite at all.
Issues which may be related or tied to this:
- `.merge_single_cells()` method #195