# Address SQLite Read Performance Issues #198
Hi @gwaygenomics - I'm noticing that, per the query below, SQLite may include column values that don't match the column specification:

```sql
select ObjectNumber,
  Cytoplasm_Correlation_Costes_AGP_DNA,
  typeof(Cytoplasm_Correlation_Costes_AGP_DNA)
from Cytoplasm
where typeof(Cytoplasm_Correlation_Costes_AGP_DNA) != 'real';
```

This could be one cause of the performance issues using Pandas and SQLAlchemy, but it can also impact alternatives. Because these columns are sometimes indicated as not-null, it means we may not simply use NULL in their place. Do you know of a way we could interpret these 'nan' values correctly for the purposes of pycytominer (for example, replace with 0, filter from results in the database, etc.)?
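The same kind of check can be run from Python. The following is a minimal, self-contained sketch: the table and column names mirror the issue, but the rows are fabricated purely for illustration.

```python
import sqlite3

# Build a tiny in-memory table mimicking the issue: a FLOAT-affinity
# column that nonetheless receives the TEXT value 'nan'.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Cytoplasm ("
    "ObjectNumber INTEGER, "
    "Cytoplasm_Correlation_Costes_AGP_DNA FLOAT NOT NULL)"
)
con.executemany(
    "INSERT INTO Cytoplasm VALUES (?, ?)",
    [(1, 0.1), (2, "nan")],  # 'nan' is accepted as TEXT despite FLOAT affinity
)

# Group values by their actual storage class to surface mismatches.
storage_classes = dict(
    con.execute(
        "SELECT typeof(Cytoplasm_Correlation_Costes_AGP_DNA), COUNT(*) "
        "FROM Cytoplasm GROUP BY 1"
    ).fetchall()
)
print(storage_classes)
```

Running a sweep like this per column is a quick way to find which tables contain values whose storage class diverges from the declared affinity.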
No worries and thank you @gwaygenomics - I'll take a look. Enjoy the conference!
Hi @gwaygenomics - I just wanted to follow up with some findings; I'm open to any guidance or input you have along the way. I've spent some time investigating ways to speed up SQLite reads, taking note of the background you mentioned a little earlier. I don't yet have thoughts on solutions for the cytominer-database issue. The file I've focused on for this has so far been only SQ00014613.sqlite.

### SQLite Data Type Flexibility

SQLite is a database with sometimes uncommon flexibility when compared to other RDBMSs. One way this is true is through affinity types and storage classes (see docs here). Affinity typing enables table columns to have a "preference" for certain storage classes. Storage classes are ultimately the source of truth concerning a particular value in a column. This means, for example, that a column of affinity type FLOAT can store a value of storage class TEXT. When Pandas reads such a column, the result is a mixed-type "object" column:

| SQLite Column Affinity Type | SQLite Value | SQLite Storage Class | Pandas DF Column Type | Pandas Value Type |
|---|---|---|---|---|
| FLOAT | 0.1 | REAL | Object | float |
| FLOAT | 'nan' | TEXT | Object | str |
In and of itself, this may be a bottleneck; Pandas may be performing sub-optimally due to mixed typing of columns. We could modify our SQL queries to sidestep this, but we'd likely be stepping into more performance concerns (in-flight SQL data transforms). Recasting types within the Pandas dataframe may also have costly impacts on performance.
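For illustration, here's a small self-contained sketch of the mixed-typing behavior in the table above (rows fabricated, not taken from the real dataset):

```python
import sqlite3
import pandas as pd

# A FLOAT-affinity column holding one REAL and one TEXT value arrives
# in Pandas as an "object" column with per-row types float and str.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Cytoplasm (Cytoplasm_Correlation_Costes_AGP_DNA FLOAT)")
con.executemany("INSERT INTO Cytoplasm VALUES (?)", [(0.1,), ("nan",)])

df = pd.read_sql("SELECT * FROM Cytoplasm", con)
col = df["Cytoplasm_Correlation_Costes_AGP_DNA"]
print(col.dtype)                        # object
print([type(v).__name__ for v in col])  # ['float', 'str']
```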
### What About SQLite NULL?

SQLite provides a special storage class called `NULL` which may be used for non-values. Let's say we replace the 'nan' value(s) with NULL within the table column. When we do this and read it using Pandas, the resulting column type is `float64` and the two values are of type `numpy.float64` (SQLite `NULL` is interpreted as `numpy.nan`).
| SQLite Column Affinity Type | SQLite Value | SQLite Storage Class | Pandas DF Column Type | Pandas Value Type |
|---|---|---|---|---|
| FLOAT | 0.1 | REAL | float64 | numpy.float64 |
| FLOAT | NULL | NULL | float64 | numpy.float64 |
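The second table can be reproduced with a small sketch (fabricated rows): with NULL in place of the TEXT 'nan', Pandas infers `float64` and maps the NULL to `numpy.nan`.

```python
import sqlite3
import numpy as np
import pandas as pd

# Same illustrative setup as before, but with NULL instead of 'nan':
# Pandas now infers a uniform float64 column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Cytoplasm (Cytoplasm_Correlation_Costes_AGP_DNA FLOAT)")
con.executemany("INSERT INTO Cytoplasm VALUES (?)", [(0.1,), (None,)])

df = pd.read_sql("SELECT * FROM Cytoplasm", con)
col = df["Cytoplasm_Correlation_Costes_AGP_DNA"]
print(col.dtype)              # float64
print(np.isnan(col.iloc[1]))  # True
```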
As mentioned above, this may lend itself to healthier performance. The catch: NULL may only be used in columns which do not have a `NOT NULL` constraint. The column `cytoplasm.Cytoplasm_Correlation_Costes_AGP_DNA`, where I'm witnessing the mixed values, does indeed carry a `NOT NULL` constraint.
Despite SQLite's flexibility with affinity and storage class, it's relatively inflexible when it comes to column constraint modifications. Changing `cytoplasm.Cytoplasm_Correlation_Costes_AGP_DNA` to be nullable within SQLite requires creating a brand new copy of the table which does not include the constraint, removing the original after completion, and ideally performing a `VACUUM;` at some point (quite a lift, oof!). Despite the challenges, there's some promise in doing this work - see below.
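A hedged sketch of that rebuild: SQLite can't drop a NOT NULL constraint in place, so we copy the data into a constraint-free table, swap names, convert the 'nan' values to NULL, and reclaim space with VACUUM. The table/column names and data here are illustrative, not taken from the real schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Cytoplasm (col FLOAT NOT NULL)")
con.executemany("INSERT INTO Cytoplasm VALUES (?)", [(0.1,), ("nan",)])

# 1. Create a copy of the table without the NOT NULL constraint.
con.execute("CREATE TABLE Cytoplasm_new (col FLOAT)")
con.execute("INSERT INTO Cytoplasm_new SELECT col FROM Cytoplasm")

# 2. Drop the original and take over its name.
con.execute("DROP TABLE Cytoplasm")
con.execute("ALTER TABLE Cytoplasm_new RENAME TO Cytoplasm")

# 3. Now that the column is nullable, replace TEXT values with NULL.
con.execute("UPDATE Cytoplasm SET col = NULL WHERE typeof(col) = 'text'")
con.commit()

# 4. Reclaim the space freed by the dropped table.
con.execute("VACUUM")

types = [row[0] for row in con.execute("SELECT typeof(col) FROM Cytoplasm")]
print(types)  # ['real', 'null']
```

On a real multi-gigabyte file each step would need care (indexes, triggers, and foreign keys also have to be recreated), which is part of why this is "quite a lift".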
### Opportunities from Switching 'nan' for NULL
To summarize, I wanted to restate some of the above and include some additional items as opportunities. I ran tests for these cases to find out more precisely whether and how they might benefit you and other users of this library (see the test results section below).
- Pandas memory consumption may improve (numeric types may consume less memory than objects).
- Pandas performance may improve (merges, joins, and data changes may have greater performance due to relative data uniformity and perhaps fewer in-flight conversions).
- We'd gain compatibility for other tooling such as ConnectorX, which is purpose-built for greater bulk read performance (see here for example and reported performance improvements).
- Due to how it's optimized, ConnectorX is not compatible with SQLite databases that include storage class values which differ from their column affinity types (see here).
### Some Test Results

Running tests using the same constraints as reported and tested in #195, on a GCP e2-highmem-8 machine (8 vCPU, 64 GB memory).
- As-is performance with SQLite `'nan'`s:
  - Duration: ~28 minutes
  - Peak Memory Consumption: 64.7 GB
  - Link to flamegraph
- As-is performance with SQLite `'nan'`s replaced by `NULL`:
  - Duration: ~18 minutes (down ~10 minutes from as-is)
  - Peak Memory Consumption: 46.4 GB (down 18.3 GB from as-is)
  - Link to flamegraph
- ConnectorX reads for `load_compartment` and with SQLite `'nan'`s replaced by `NULL`:
  - Duration: ~6 minutes (down ~22 minutes from as-is)
  - Peak Memory Consumption: 42.4 GB (down 22.3 GB from as-is)
  - Link to flamegraph
Amazing, thanks for this detail @d33bs. A couple of follow-up questions:

Thanks again @d33bs - and I hope you have a great Memorial Day weekend!
Thank you for the kind words and follow-ups @gwaygenomics! And wishing you a fantastic Memorial Day weekend as well! 🙂 Addressing your questions below:
I converted the 'nan's to NULL using the following (non-optimal, solely for testing) Python file. I'm still learning the full lifecycle of the data involved, so apologies if I misunderstand things with what follows:
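The Python file referenced above isn't included in this thread; the following is only a hypothetical reconstruction of the idea: discover a table's columns, then NULL out any values stored as TEXT. It assumes the columns are already nullable (see the NOT NULL discussion earlier), and the table/data are fabricated.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Cytoplasm (a FLOAT, b FLOAT)")
con.execute("INSERT INTO Cytoplasm VALUES (0.1, 'nan')")

# Discover the table's columns (name is field 1 of PRAGMA table_info),
# then sweep each one, replacing TEXT-stored values with NULL.
columns = [row[1] for row in con.execute("PRAGMA table_info(Cytoplasm)")]
for name in columns:
    con.execute(
        f'UPDATE Cytoplasm SET "{name}" = NULL WHERE typeof("{name}") = \'text\''
    )
con.commit()

print(con.execute("SELECT * FROM Cytoplasm").fetchone())  # (0.1, None)
```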
Got it! Here are some responses:

### To Converting nans to Nulls

I think this is the way to go. When you get a chance, can you add some version of your python file as a new PR in this repo? Does this sound like a good strategy? Here are the first two concrete use-cases :)
### SQLite support timeline

My estimation is indefinitely, unfortunately. We might consider moving SQLite support elsewhere. We may also decide to create this conversion functionality in an entirely new repo (upstream of pycytominer in the workflow). Scheduling a full SQLite deprecation will require other stakeholders to chime in.

### Memory leak solution

Sounds great!! I'm looking forward to hearing about what you discover!
Thank you @gwaygenomics!

I like this idea and might expand it a bit to include some functionality for detecting challenges with these datasets, based on what I found here. More to follow in a PR for these ideas.
@d33bs - does SQLiteClean or CytoTable mean we can close this issue?

Hi @gwaybio - thanks, I think we could close this issue based on the work in SQLite-clean and CytoTable, as you mentioned. Maybe we could document expectations somewhere reasonable, with user guidance for large SQLite data handling within Pycytominer? This could be a new issue or it could be a "closing task" for this issue.

This sounds good. Something along the lines of: Note: Single-cell processing speed of SQLite files depends on SQLite file size and the amount of memory (RAM) on the computational environment machine. I can add this somewhere and will tag you to review. Also feel free to modify this text here (or in the PR, whichever is useful). I think this can be the closing task for this issue.

Sounds great!
SQLite files used by this repo create performance challenges during runtime that may hinder or completely stall progress. This issue is dedicated to targeting SQLite read performance issues, including solutions which convert away from the format or may not use SQLite at all.
Issues which may be related or tied to this:
- `.merge_single_cells()` method #195