Build SQLite conversion tool #205
If it doesn't already somehow exist, it might be beneficial to make data type mapping decisions for the conversion tool. A good example of why this matters is datetimes:
A type mapping table could look something like the following. We might also want to call out how non-values (None, NULL, np.NaN, pd.NA, etc.) are handled.
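A mapping table like the one mentioned above could be sketched in code as a simple lookup. This is purely illustrative: the names `SQLITE_TYPE_MAP` and `map_column_type` are hypothetical, not part of any existing pycytominer API, and the target dtypes are one reasonable choice among several.

```python
# Hypothetical SQLite -> pandas dtype mapping (illustrative names and choices,
# not an existing pycytominer API).
SQLITE_TYPE_MAP = {
    "INTEGER": "int64",
    "REAL": "float64",
    "TEXT": "string",
    "BLOB": "object",
    # SQLite has no native datetime storage class; datetimes arrive as
    # TEXT/REAL/INTEGER, so the tool must decide how to parse them.
    "DATETIME": "datetime64[ns]",
}

def map_column_type(declared_type: str) -> str:
    """Return a pandas dtype for a declared SQLite column type.

    Unknown declared types fall back to "object" so no column is dropped.
    """
    return SQLITE_TYPE_MAP.get(declared_type.upper(), "object")
```

A tool built around such a table would also need an explicit rule for non-values (None, NULL, np.NaN, pd.NA), since each target dtype handles missingness differently.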
Hi @gwaybio - I'm excited to dive into this! In thinking about this a bit more I'm wondering if there's any schema or standardized structure for the source data. A few questions to this effect:
Sorry for the delayed reply, my answers as follows:
Prior to SQLite, the data are a collection of CSV files.
I'm not exactly sure how to answer this question - what do you mean more specifically by "formal"? It might be helpful to point you to the test files we use as input for creating the SQLite files: https://github.com/cytomining/cytominer-database/tree/master/tests/data_b
The data are fairly stable, range roughly between 500MB and 20GB, and are not likely to change much.
Thank you @gwaybio! This is super helpful. Looking through the sample datasets under the test data directory linked above:
These are two different columns, and together they form a unique key. Some more pointers:
With the additional context:
Thanks @gwaybio! A few follow-up questions based on what you mention:
Perhaps an answer to both questions, IIUC:
(see pycytominer/pycytominer/cyto_utils/single_cell_ingest_utils.py, lines 5 to 23 at 02d0522)
They don't! If I'm remembering correctly, the compartment tables will also have a TableNumber and ImageNumber that we use to link which image the individual cells were measured from.
Yes, exactly. Unless the image contained only one cell :)
Thanks @gwaybio for the data details - this is very helpful in understanding what we might do towards conversion. Based on this information I've created a sketch of how we could approach merging the tables as one single dataset (for export to parquet as a single file, discussed in #202). Within this sketch:
Rough table format:
Example with data:
I'd welcome any feedback or input you may have on this! How does the above look? Could this work as a SQLite-to-parquet conversion format for this project? |
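The merge sketch described above could look roughly like the following in pandas, joining a compartment table onto the image table using `TableNumber` + `ImageNumber` as the shared keys. The column names (`Image_Width`, `Cells_Area`) and values are made up for illustration; only the key columns come from the discussion.

```python
import pandas as pd

# Illustrative image and compartment tables sharing TableNumber + ImageNumber
# (column names other than the keys are hypothetical).
image = pd.DataFrame(
    {"TableNumber": ["t1", "t1"], "ImageNumber": [1, 2], "Image_Width": [540, 540]}
)
cells = pd.DataFrame(
    {
        "TableNumber": ["t1", "t1"],
        "ImageNumber": [1, 1],
        "ObjectNumber": [1, 2],
        "Cells_Area": [100.0, 120.0],
    }
)

# Left-join so every measured cell keeps its image metadata; an outer join
# would instead also keep images with no objects, with NaN in the
# compartment columns (the none-like elements mentioned above).
merged = cells.merge(image, how="left", on=["TableNumber", "ImageNumber"])
```

Repeating the merge for each compartment table would produce the single wide frame discussed here, ready for a one-file parquet export.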
This is great @d33bs - yes, I think this is on the right track. Two other nuances that will be important to keep in mind, but shouldn't stall progress:
Hi @gwaybio - something I'm noticing in working with related SQLite and CSV files is that TableNumber is sometimes not included within sample CSVs. Could I ask where TableNumber is generated, and whether it's an optional field which may not always be present within SQLite files for conversion?
TableNumber is generated in CellProfiler, I believe. I am not familiar enough with the code base to point to it. @bethac07 - can you provide a quick pointer? If you have to dig, no worries - I thought that you might just know.
TableNumber is actually not made by CellProfiler, at least in our typical use case (using ExportToSpreadsheet) - it's added in cytominer-database. The reason for this column to exist is that depending on how the data is passed into CellProfiler, some data input methods essentially "reset" ImageNumber - so instead of A01-Site1 being ImageNumber=1 and A01-Site2 being ImageNumber=2, both might be ImageNumber=1. This causes problems because now ImageNumber isn't a unique key that we can, for example, map across to the object tables. So an extra "indexing" column is always added.
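The "reset" problem described above can be shown with a tiny example. The `TableNumber` values below are placeholders standing in for whatever identifier cytominer-database assigns per source; everything else is made-up data.

```python
import pandas as pd

# Two ingestion sources whose ImageNumber both restart at 1 -- the "reset"
# described above (data are hypothetical).
batch_a = pd.DataFrame({"ImageNumber": [1], "Site": ["A01-Site1"]})
batch_b = pd.DataFrame({"ImageNumber": [1], "Site": ["A01-Site2"]})

# Adding a per-source TableNumber restores a unique key.
batch_a["TableNumber"] = "hash_a"
batch_b["TableNumber"] = "hash_b"
combined = pd.concat([batch_a, batch_b], ignore_index=True)

# ImageNumber alone collides, but (TableNumber, ImageNumber) is unique.
assert combined["ImageNumber"].duplicated().any()
assert not combined.duplicated(["TableNumber", "ImageNumber"]).any()
```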
Thank you @bethac07 (and @gwaybio)! Based on what you mention, I'll continue to use TableNumber as a field referenced for conversion work. Generally, my plan is to use TableNumber and ImageNumber as the "joinable keys" for all tables within the existing SQLite databases (presuming ObjectNumber may not always be related across compartments).
Hi @gwaybio - I wanted to follow up with some findings and ask a few questions to navigate what conversion will look like. After testing a few different libraries and methods for this work, I'm finding there are significant resource implications, which break down into the two categories below. Both approaches have the goal of a single large and sparse dataset (4 tables into a single table or "frame") as previously discussed in this issue. They also presume no external resources, i.e. the process may only consume local CPU and memory. I attempted alternatives to pandas for the benefits they sometimes offer in overcoming constraints like ours.
Multiple Tables to One Frame and One File

I attempted to merge all four tables into one single frame (with none or none-like elements where appropriate), then export to a single parquet file. This turned out to require a large amount of memory to run successfully. In many cases the procedure failed when using alternatives to the approach below, I believe due to excessive memory usage. Some numbers here (note: I believe the excess above system memory was made available through disk-based caching):
Multiple Tables to Many Frames and Many Files (as One Dataset)

Using single dataframes and files may not scale well because we'll eventually reach physical limits of either memory or filesize constraints. To address this, we can use many dataframes and many files via chunking. When working with parquet, many libraries enable passing a directory or glob/wildcard conventions to read from multiple files as one cohesive dataset (see polars IO dealing with multiple files or pandas.read_parquet path spec references for more detail here). For some chunking tests, I chose to use identical dataframe and file chunking based on sets of 50 distinct/unique joinable keys (TableNumber and ImageNumber by default).
Follow-up Questions

Based on the above findings, I have some questions:
Hi @d33bs - thanks for this deep dive. The memory requirements are a bit concerning. Is it possible to stream elements of the SQLite into a continuously written parquet file? I think (1) we'd be trading time for memory, and (2) we'll only need this SQLite converter tool for legacy data (after we update cytominer-database/transport to output parquet directly). I'll also answer your questions below:
Average computer: 64GB, 12-16 core CPU
No hard duration, but hours per file seems ok
Multi-file single dataset is not ideal for distribution, so we'd prefer single-file. This conversion tool likely won't be used in a new pipeline (only for legacy datasets). The pipeline from CellProfiler/DeepProfiler -> Cytominer-database/transport goes from multi-files to a single-file. Are there benefits to a multi-file that I'm not thinking of?
This file is on the smaller side. They're usually 10-20GB (I've seen some go as high as 50GB!) 10GB post single-cell merge is perfect
Yes! We are definitely interested in these performance upgrades. While it is worth keeping in mind, lazy computation post data conversion is beyond the scope of this issue (which is focused on the converter tool).
One other minor note, in case it makes a difference in your thinking:
Our data are not sparse; they contain mostly non-zero elements.
Now that @bunnech has a great solution for #195 (h/t @johnarevalo), would it be useful to create a tool right here in pycytominer? Ideally, we'd create a separate tool (as @gwaybio suggested in #205 (comment)) in https://github.com/cytomining/ after fully thinking through the design. But I wonder if it's wiser to start with what's easiest right now: take @bunnech's script, clean it up a bit, and create a prototype that will help us think through how to perfect it over time.
That’s fantastic - let us know when you get the chance @d33bs.
Perfect! Ok, here's how I propose we move forward: From @shntnu:
Sounds great. @bunnech is this something you can take on? Can you file a pull request adding your prototype? From @d33bs:
Sounds good! I bet that once @bunnech posts the prototype, you can take charge and apply your learnings from #213 (especially with regard to testing) to take this through to merge.
Yes, sounds good to me. I will add @d33bs as a co-author on the PR. Will prepare something later today!
@d33bs - should we close this issue? |
@gwaybio - yes, I feel this can be closed. I think the work which @bunnech added is great and covers things well. Additionally, the work happening in cytomining/CytoTable#5 (among others) will hopefully add additional capabilities in the near future. |
In #202 we discuss how building a SQLite2XXX (where XXX represents our preferred pycytominer data type) would enable us to sunset SQLite dependencies in this repo.
Any tool that we build will likely contain some (or all) of the SQLite hygiene functions @d33bs wrote in #203
I think this tool would be a new package in cytomining and exist outside of pycytominer.