Add SQLite to Parquet Conversion Functionality #213
Conversation
preparatory for conversion work and shared functionality
This looks like an amazing contribution, thanks @d33bs! I've made several in-line comments. Let's discuss these and then likely merge soon.
Two general comments:
- Please make sure to keep consistent with our documentation style (I've made in-line comments where appropriate).
- Is it possible to add an example usage somewhere? Maybe directly in the docstring? Sphinx can read and render an example IIRC
docs/architecture.data.rst
Outdated
@@ -0,0 +1,135 @@
Data Architecture
Suggested change:
- Data Architecture
+ Data architecture

Unless this is a convention I'm not aware of, let's capitalize only the first word of titles (when not proper nouns, of course). This will keep docs consistent throughout the project, IIRC.
Understood, I'll make the related changes to this effect, thank you.
docs/architecture.data.rst
Outdated
Pycytominer data architecture documentation.

Distinct Upstream Data Sources
Suggested change:
- Distinct Upstream Data Sources
+ Distinct upstream data sources

I'm not going to update all of these; please adjust all titles if I'm not violating some convention I'm not aware of.
docs/architecture.data.rst
Outdated
Pycytominer has distinct data flow contingent on upstream data source. Various projects are used to generate
different kinds of data which are handled differently within Pycytominer.
Suggested change:
- Pycytominer has distinct data flow contingent on upstream data source. Various projects are used to generate
- different kinds of data which are handled differently within Pycytominer.
+ Pycytominer has distinct data flow contingent on the upstream data source.
+ Various projects are used to generate different kinds of data which are handled differently within Pycytominer.
We use a 1-sentence-per-line convention for docs. I am not sure if .rst is different, but let's stick with this convention for consistency.
Understood, will do, thank you.
docs/architecture.data.rst
Outdated
* `CellProfiler <https://github.com/CellProfiler/CellProfiler>`_ Generates `CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`_
  data used by Pycytominer.
Suggested change:
- * `CellProfiler <https://github.com/CellProfiler/CellProfiler>`_ Generates `CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`_
-   data used by Pycytominer.
+ * `CellProfiler <https://github.com/CellProfiler/CellProfiler>`_ Generates `CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`_ data used by Pycytominer.

Does it make sense to do this? To stick with the 1-sentence-per-line convention, I mean. There might be a way to auto-convert.
docs/architecture.data.rst
Outdated
`NPZ <https://numpy.org/doc/stable/reference/routines.io.html?highlight=npz%20format#numpy-binary-files-npy-npz>`_
data used by Pycytominer.

SQLite Data
Suggested change:
- SQLite Data
+ SQLite data
filename: str:
    Filename to use for the parquet file.
you're adding a .parquet suffix, so maybe filename isn't the best variable name? (just trying to avoid files like TEST.parquet.parquet)
Good point here, I'll change this to be clearer and hopefully avoid the double-extension challenge.
Just a follow-up to mention: I've made a commit which removes the ".parquet" extension addition from the functions, relying on the developer/user to provide this via filename or pq_filename (or to use an extension-less filename). The aim here was to align this closer to how Pandas I/O parameters work. Please don't hesitate to let me know if we should still make a change to the parameter name.
I trust your judgement!
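The double-extension pitfall discussed above can be sketched with the stdlib pathlib module (the helper name here is hypothetical, not from the PR; the actual fix taken was to drop the suffix handling entirely):

```python
from pathlib import Path


def ensure_parquet_suffix(filename: str) -> Path:
    """Append a .parquet suffix only when one is not already present."""
    path = Path(filename)
    if path.suffix == ".parquet":
        # already has the suffix; avoid producing TEST.parquet.parquet
        return path
    # note: with_suffix replaces any existing final suffix,
    # so "data.csv" would become "data.parquet" here
    return path.with_suffix(".parquet")
```

For example, `ensure_parquet_suffix("TEST")` and `ensure_parquet_suffix("TEST.parquet")` both yield `TEST.parquet`.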
if os.path.isfile(concatted_parquet_filename):
    os.remove(concatted_parquet_filename)
We've been using pathlib as a replacement for os. Have you played with it? It will involve setting concatted_parquet_filename on line 327 to a pathlib.Path object, and then checking if it exists.
Sounds good, I'll make the changes here to this effect, thank you.
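A minimal sketch of the pathlib equivalent suggested above (the variable name is taken from the diff; the file path here is a hypothetical stand-in for the real concatenated output):

```python
from pathlib import Path

# hypothetical stand-in for the concatenated parquet output path
concatted_parquet_filename = Path("example_concatted.parquet")
concatted_parquet_filename.touch()  # simulate a leftover file from a prior run

# pathlib equivalent of the os.path.isfile(...) / os.remove(...) pair
if concatted_parquet_filename.is_file():
    concatted_parquet_filename.unlink()
```

On Python >= 3.8 the check-then-remove pair collapses to `concatted_parquet_filename.unlink(missing_ok=True)`, which matches the version constraint noted in the commit messages below.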
)

# return the result of reduced_pq_result, a path to the new file
return flow_state.result[reduced_pq_result].result
This looks complicated. How complex is the flow_state object? Will we ever need to access other elements of it?
Thanks for calling this forward. The flow_state object is somewhat complex - it contains state and result data for all tasks within the Prefect flow. The return here was an effort to make the primary goal of the Prefect flow of practical use for a Pycytominer developer (instead of sharing all tasks / metadata along the way). Because it's a new convention for the repo, I wasn't certain how much to lean towards flexibility vs. convenience.
Potential changes: we could return flow_state on its own here to rely on the developer to navigate the results (and allow them greater flexibility with what happens to the flow). A further step in this direction might be to deliver a flow without running it, enriching options in this territory as well. Thoughts?
Additional information which may help: Prefect 1.x, used for changes related to this PR, is stable and works well. There are also ongoing efforts for Prefect 2.x which change how flows are handled (flows are formed through decorators for Python functions, similar to how @task is used for these changes). This may mean that we eventually end up closer to the "potential changes" structure anyway, when Prefect 2.x is out of beta, and if this is something you'd like to head towards.
@@ -0,0 +1,150 @@
""" |
I think I've reviewed this code already? Oh, is it being moved over here from a different file?
Apologies - yes, we've reviewed this code; it was moved from cyto_utils/sqlite.py, and I mislabeled the commit change for this file (add vs. rename/move).
# insert statement with some simple values
# note: we use SQLAlchemy's parameters to insert data properly, esp. BLOB
connection.execute(
is this the dreaded SQL injection vulnerability? Do we want to test this way?
Thank you for this comment. I leaned towards a simple approach with these test inserts over SQLAlchemy insert compilation to help increase understandability and presuming a low risk vector for the non-public-facing SQLite databases.
I can understand wanting to keep things secure, protecting against future scenarios when the databases may be public-facing, or taking a more SQLAlchemy-ORM approach here. The cost might be greater complexity for the codebase for SQLite work. Do you think we should change these inserts (and related) towards this effect?
I'm in favor of simplicity. Perhaps the current branch and testing protections are sufficient. Maybe best to proceed as-is?
Sounds good, thank you. I'm okay proceeding as-is for now; I feel we'd be abiding by PEP 20 here (simple over complex when the scenario isn't too complicated, plus practicality over purity).
There's a chance we can move away from some of this with real data samples (SQLite or parquet files) which could be used as part of the tests instead of generating SQLite data. Another thought here: we could illustrate reasoning here with comments based on this discussion.
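The parameterized-insert approach discussed in this thread can be sketched with the stdlib sqlite3 module (the PR itself uses SQLAlchemy, whose `connection.execute` takes bound parameters similarly; the table and column names below are illustrative, not from the PR):

```python
import sqlite3

# in-memory database standing in for a test SQLite file
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE Image (TableNumber TEXT, ImageNumber INTEGER, ImageData BLOB)"
)

# bound parameters (?) let the driver handle quoting and binary encoding,
# which is what makes BLOB inserts work and avoids SQL injection from
# string-formatted values
connection.execute(
    "INSERT INTO Image (TableNumber, ImageNumber, ImageData) VALUES (?, ?, ?)",
    ("123abc", 1, b"\x00\x01\x02"),
)

row = connection.execute("SELECT * FROM Image").fetchone()
```

The BLOB column round-trips as Python bytes, which is the "esp. BLOB" concern the code comment above refers to.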
- Lowercase titling after first letter
- 1-sentence-per-line for general content
- Parquet format as conversion-only
- Minor space/line fixes
Rely on developer/user to provide .parquet extension as desired.
missing_ok parameter is only in Python >= 3.8
relevant for non-local execution, such as dask.
…nk files accounting for variability in column ordering
little discernible difference in performance for using compression=None after testing
Providing a comment here for context regarding some upcoming code changes for this PR:
Closing this PR to account for a change in approach and due to other challenges as noted by earlier comments.
Description
Add functionality for converting SQLite-based Pycytominer data to parquet format to address #205.
Other notable changes:
- prefect dependency to enable scalable and resource-sensitive dataflow for conversion work.
- pyarrow dependency to enable efficient parquet file operations.
- sphinxcontrib-mermaid dependency to generate diagrams for related documentation.

What is the nature of your change?
Checklist