Scheduling a SQLite deprecation in pycytominer #202
I think it would be good to loop in @shntnu as well here. I can understand wanting to also support other data sources (i.e., Parquet), but is there a reason at this time to HAVE to sunset SQLite support? For better or for worse, for at least the foreseeable future it's what a lot of the "older" datasets in the field are going to have available in terms of single-cell data, so you'll either lose the ability to work with those using new and awesome pycytominer features past 0.1, or lose the ability to compare against those sets until you do conversions. It doesn't look like SQLite is terribly woven throughout the codebase - from what I can tell, it's literally only in cells.py (and some tests, and in collate, though we aren't sure whether that will live in this repo forever). My proposed solution, whether or not you decide to sunset SQLite anytime soon, would be to extract a "data reader" module or class that just handles data reading and formatting instead of having that be part of the SingleCells class; that way everything else in pycytominer remains source-agnostic, and you can easily add or remove source formats as you wish. |
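The reader extraction proposed above might look something like this minimal sketch (the class names and the extension-based dispatch are hypothetical illustrations, not pycytominer API):

```python
import sqlite3

import pandas as pd


class SQLiteReader:
    """Loads one compartment table from a CellProfiler SQLite file."""

    def __init__(self, path):
        self.path = path

    def load(self, table):
        with sqlite3.connect(self.path) as conn:
            return pd.read_sql(f"SELECT * FROM {table}", conn)


class ParquetReader:
    """Loads a per-compartment Parquet file."""

    def __init__(self, path):
        self.path = path

    def load(self, table=None):
        return pd.read_parquet(self.path)


def get_reader(path):
    # Dispatch on file extension so SingleCells itself stays source-agnostic
    if str(path).endswith(".sqlite"):
        return SQLiteReader(path)
    return ParquetReader(path)
```

With a seam like this, adding or removing a source format touches only the reader module, leaving the rest of the class untouched.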
Disclaimer: I've not been in the trenches for a while on this front :) @bethac07 will have a much better sense of ground realities than me. I skimmed #198, and it sounds like the future alternative format is TBD. I'll assume this format is Parquet for this discussion.
If we have a SQLite2Parquet converter, I think it's ok to sunset SQLite entirely (and it would be worth the whatever rewrite pycytominer needs to make it happen)
IMO as soon as a SQLite2Parquet converter has been written and tested, and https://github.com/cytomining/profiling-recipe/blob/d7012751cf24e9aa1325e099b03aaf74fe2578d6/profiles/profile.py#L54 has been updated to optionally convert the SQLite to Parquet
I didn't understand this question, but I think you're asking: what are the requirements for the SQLite replacement? Beth would have a much better handle on that |
Thanks for this discussion! I'll respond to several comments below, but my overall takeaway on scheduling a SQLite deprecation is: immediately, but with direction on how to do so.

Continued discussion responses
We'd like to make progress with pycytominer (and the full pipeline), and in order to do this, we need to move away from this file format. An alternative approach is to build a completely separate tool (using lessons learned in pycytominer), but we'd prefer to continue long-term support for CellProfiler-based Cell Painting analyses within cytomining.
Yes! I think we should pursue both suggestions. This is the "how" feedback I was looking for! 👀 Let's develop a SQLite2Parquet converter. Let's also refactor to extract a data reader module from the SingleCells class. |
I just want to make sure to put this problem in a larger context - I think a SQLiteToX converter (where X is Parquet or whatever else) is a good idea no matter what, if only for legacy data. But as we build whatever we're going to build, it makes sense to also think about what we still expect to be the most common data source for a while, which is "per-site CSVs coming from CellProfiler". Doing CSV -> SQLite -> X is fine, but if it helps in the selection of X, or if now is the right time to talk about it as we're thinking about new refactored data readers, CSV -> X is obviously better. I'm also hoping that this will finally be the year we do the big CellProfiler Export module overhaul, so hopefully we can write to X directly too in some circumstances, but at least for the foreseeable future I think CSVs are a safe "everything should be able to read and write them" option. |
Ah, in my note above I had kinda assumed that creating a CSV -> X tool is part of the plan here and that we'd never ever need to do CSV -> SQLite -> X. @gwaybio is creating a CSV -> X tool (i.e. a cytominer-database replacement) a part of your plan here? |
My straw proposal for the right thing to do here is to try to finish cytominer-transport. When we last left it off 14 months ago, it was working but slow; I think I have some ideas on how to make it less slow. |
It's definitely part of the conversation, as it pertains to resource needs for future datasets. I think finishing cytominer-transport is the way to go, and what we learn as we go should inform some pycytominer decisions. The development of cytominer-transport, however, should not impact pycytominer progress. The plan to sunset SQLite in the manner described here #202 (comment) removes the development tangle that might prevent progress in one package vs. another. |
I couldn't figure out if the sunset plan requires you to first decide what X should be. Either way, very glad you're on top of this @gwaybio! |
Just wanted to chime in here to offer up some ideas. There's an opportunity to do some great things in this project with scalable dataframes from an in-memory perspective, and many of these include or are compatible with Parquet files as exports. Parquet looks to be used in at least some of this project, which is great news for compatibility with these sorts of things 🙂 . Parquet is pretty great when it comes to performance, file size, and compatibility with other projects. Its compatibility makes it reasonable to think we could convert from it once a better alternative is known. If this seems like a good idea, what type of data-to-file structure should we expect? Some options I can think of:
Multidimensional Thoughts

Working within the contexts of multidimensional table(s) could be helpful for calculations and also for encapsulating the file-based version of the data. This wouldn't be without challenges, as it may completely change how operations are performed. Ideally this would open up roads that otherwise might not exist, and where it's limiting, we could abstract or transform to a more familiar 2-dimensional representation of the data (maybe as a query, a class method, etc). A sketch of what this might look like:
Some ideas in this space:
|
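One way to picture the multidimensional idea above without leaving pandas is a MultiIndex over (image, object, compartment); a dedicated container such as xarray would be richer, but this editorial sketch (with hypothetical columns) shows the 2-D fallback the comment mentions:

```python
import pandas as pd

# Flat per-object measurements (hypothetical columns)
flat = pd.DataFrame({
    "ImageNumber": [1, 1, 2],
    "ObjectNumber": [1, 2, 1],
    "Compartment": ["Cells", "Cells", "Nuclei"],
    "AreaShape_Area": [50.0, 42.0, 30.0],
})

# Index by (image, object, compartment) for a multidimensional view...
cube = flat.set_index(["ImageNumber", "ObjectNumber", "Compartment"])

# ...and drop back to the familiar 2-D representation when needed
flat_again = cube.reset_index()
```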
Thanks @d33bs! Some additional discussion:
Yes, I think this makes sense to do. We've deliberated on parquet in the past, we've been limited on bandwidth/knowledge to implement the switch.
Yes, I think this makes sense to do as well. We can create this repo, and start with a multi-table SQLite -> parquet implementation. IIRC, this repo would also contain some of the SQLite hygiene functions introduced in #203. Once we have this functionality, we can effectively deprecate SQLite in future pycytominer releases by:
Is there a reason to not perform merges across tables prior to parquet writing? This to me seems like the simplest solution that wouldn't require much operational change. I'm definitely open to more thoughts on this topic! |
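The "effectively deprecate" step described above could start with a soft warning on SQLite input; a minimal sketch (the loader and its helpers are hypothetical placeholders, not pycytominer functions):

```python
import warnings


def _load_from_sqlite(path):
    return f"sqlite:{path}"  # placeholder for the existing SQLite code path


def _load_from_parquet(path):
    return f"parquet:{path}"  # placeholder for the new Parquet code path


def load_single_cells(path):
    """Soft deprecation: SQLite input still works, but users are pointed
    at Parquet before support is removed in a later release."""
    if str(path).endswith(".sqlite"):
        warnings.warn(
            "SQLite input is deprecated and will be removed in a future "
            "pycytominer release; convert your file to Parquet first.",
            FutureWarning,
            stacklevel=2,
        )
        return _load_from_sqlite(path)
    return _load_from_parquet(path)
```

`FutureWarning` (rather than `DeprecationWarning`) is shown by default to end users, which matters when the audience is scientists running scripts rather than library developers.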
Thank you @gwaybio !
Whenever ready, I'd be excited to assist with this effort and include the hygiene functions as well. Are there any other considerations before creating the repo we should think through? Please don't hesitate to let me know if there's anything I may help with in getting started here.
I can see some reasons to not merge prior to parquet writes (there may be others based on data implementation, etc.):
All this said, I still feel there might be benefits to this approach. Maybe this becomes a flag in the |
We are ready now! No time like the present :)
Hmm, I can only think of two:
Sorry, I don't think I understand your replies. I am talking about merging the individual per-cell measurements that exist in three separate tables (compartments) in the existing SQLite files prior to parquet writing. Why does performing this merge make the data multidimensional? Isn't it just a simple file that could just as well be a flat text file? What am I not understanding here? 😂 |
Thank you @gwaybio - after thinking this over a bit and discussing with @staylorx, I feel it'd be best to store the conversion code under this repo for the time being until there's more clarity around the utility of code reuse. Getting the merged data schema in a desired state is likely to result in code that's specific to this project - which makes it a good candidate to be included here (for example, under cyto_utils). Multidimensional - my apologies, I think I was getting carried away with another option. A merged dataset will be 2-dimensional and should enable us to keep things in a single file. Thanks as well for the great comments within #205 towards understanding the existing datasets. I'll plan to follow up within that issue for specifics when it comes to the fully merged data schema. |
Sounds good @d33bs - I saw https://github.com/CUHealthAI/sqlite-clean in the CUHealthAI org. Is the plan to eventually migrate the conversion code here? (I'm just noting this here for decision tracking) |
Hi @gwaybio, great question, I'll try to clarify below:
Some other notes towards progress:
|
In #198 @d33bs states:
It has been well documented in many previous issues (in this repo and others) that we need to move away from SQLite towards a more performant data type. This requires some level of deprecation (either complete or partial).
Let's use this issue to describe:
I believe that we need more info from key stakeholders to help us answer. @bethac07 can provide much of this insight. @bethac07 - what are your resource needs for future data types? What SQLite support would you like to see in pycytominer?
Note @d33bs - I think that we can safely proceed with some elements of our plan in #198 (e.g. the function for cleaning existing SQLite, upgrading single cell merges, etc.). I don't want this issue to block any progress - let's use the detail outlined here as a guide on how to maintain critical SQLite support and when to release new pycytominer versions. Old pycytominer versions (e.g. v0.1) will retain full SQLite support.