
Add SQLite to Parquet Conversion Functionality #213

Closed
wants to merge 14 commits

Conversation

@d33bs d33bs (Member) commented Jul 14, 2022

Description

Add functionality for converting SQLite-based Pycytominer data to parquet format to address #205 (a rough sketch of this kind of conversion follows the list below).

Other notable changes:

  • Adds prefect dependency to enable scalable and resource-sensitive dataflow for conversion work.
  • Adds pyarrow dependency to enable efficient parquet file operations.
  • Adds data architecture documentation covering topics related to, and informed by, the issue above.
  • Adds sphinxcontrib-mermaid dependency to generate diagrams for related documentation.
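
As a rough sketch only (not this PR's implementation), a minimal SQLite-to-parquet conversion with pandas and pyarrow could look like the following; the database path and table name are placeholders:

import sqlite3

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# "example.sqlite" and the "Image" table are placeholders for illustration
with sqlite3.connect("example.sqlite") as connection:
    dataframe = pd.read_sql("SELECT * FROM Image", connection)

# convert the DataFrame to an Arrow table and write it as a parquet file
pq.write_table(pa.Table.from_pandas(dataframe), "example.parquet")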

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

@d33bs d33bs requested a review from gwaybio July 14, 2022 22:51

@gwaybio gwaybio (Member) left a comment

This looks like an amazing contribution, thanks @d33bs! I've made several in-line comments. Let's discuss these and then likely merge soon.

Two general comments:

  • Please make sure to keep consistent with our documentation style (I've made in-line comments where appropriate).
  • Is it possible to add an example usage somewhere? Maybe directly in the docstring? Sphinx can read and render an example, IIRC.
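
For instance, a docstring with an example section that Sphinx can render might look like the sketch below; the function name and signature here are hypothetical:

def convert_to_parquet(sql_path, parquet_path):
    """Convert a SQLite database to a parquet file.

    Examples
    --------
    >>> convert_to_parquet("example.sqlite", "example.parquet")
    'example.parquet'
    """
    # ... conversion work would happen here ...
    return parquet_path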

@@ -0,0 +1,135 @@
Data Architecture

gwaybio (Member):

Suggested change
Data Architecture
Data architecture

Unless this is a convention I'm not aware of, let's capitalize only the first word of titles (when they're not proper nouns, of course).

This will keep docs consistent throughout the project, IIRC.

d33bs (Member Author):

Understood, I'll make the related changes to this effect, thank you.


Pycytominer data architecture documentation.

Distinct Upstream Data Sources

gwaybio (Member):

Suggested change
Distinct Upstream Data Sources
Distinct upstream data sources

I'm not going to update all of these; please adjust all titles if I'm not violating some convention I'm not aware of.

Comment on lines 9 to 10
Pycytominer has distinct data flow contingent on upstream data source. Various projects are used to generate
different kinds of data which are handled differently within Pycytominer.

gwaybio (Member):

Suggested change
Pycytominer has distinct data flow contingent on upstream data source. Various projects are used to generate
different kinds of data which are handled differently within Pycytominer.
Pycytominer has distinct data flow contingent on the upstream data source.
Various projects are used to generate different kinds of data which are handled differently within Pycytominer.

We use a 1-sentence-per-line convention for docs. I am not sure if .rst is different, but let's stick with this convention for consistency.

d33bs (Member Author):

Understood, will do, thank you.

Comment on lines 12 to 13
* `CellProfiler <https://github.com/CellProfiler/CellProfiler>`_ Generates `CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`_
data used by Pycytominer.

gwaybio (Member):

Suggested change
* `CellProfiler <https://github.com/CellProfiler/CellProfiler>`_ Generates `CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`_
data used by Pycytominer.
* `CellProfiler <https://github.com/CellProfiler/CellProfiler>`_ Generates `CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`_data used by Pycytominer.

Does it make sense to do this? To stick with the 1-sentence-per-line convention, I mean. There might be a way to auto-convert.

`NPZ <https://numpy.org/doc/stable/reference/routines.io.html?highlight=npz%20format#numpy-binary-files-npy-npz>`_
data used by Pycytominer.

SQLite Data

gwaybio (Member):

Suggested change
SQLite Data
SQLite data

Comment on lines +317 to +318
filename: str:
Filename to use for the parquet file.

gwaybio (Member):

You're adding a .parquet suffix, so maybe filename isn't the best variable name? (Just trying to avoid files like TEST.parquet.parquet.)

d33bs (Member Author):

Good point here, I'll change this to be clearer and hopefully avoid the double-extension challenge.

d33bs (Member Author):

Just a follow up to mention: I've made a commit which removes the "parquet" extension addition from the functions, relying on the developer/user to provide this via filename or pq_filename (or use an extension-less filename). The aim here was to align this closer to how Pandas I/O parameters work. Please don't hesitate to let me know if we should still make a change to the parameter name.
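
For reference, a pathlib-based way a caller could guard against the double extension; the helper name here is hypothetical:

from pathlib import Path

def ensure_parquet_suffix(filename):
    # with_suffix replaces any existing extension, so both "TEST" and
    # "TEST.parquet" normalize to "TEST.parquet"
    return Path(filename).with_suffix(".parquet")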

gwaybio (Member):

I trust your judgement!

Comment on lines 330 to 331
if os.path.isfile(concatted_parquet_filename):
    os.remove(concatted_parquet_filename)

gwaybio (Member):

We've been using pathlib as a replacement for os. Have you played with it?

It will involve setting concatted_parquet_filename on line 327 to a pathlib.Path object, and then checking if it exists.
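
A minimal sketch of that pathlib form (the path value here is a placeholder):

from pathlib import Path

concatted_parquet_filename = Path("concatted.parquet")  # placeholder path

if concatted_parquet_filename.is_file():
    concatted_parquet_filename.unlink()

# on Python >= 3.8, the existence check can be folded into the call:
# concatted_parquet_filename.unlink(missing_ok=True)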

d33bs (Member Author):

Sounds good, I'll make the changes here to this effect, thank you.

)

# return the result of reduced_pq_result, a path to the new file
return flow_state.result[reduced_pq_result].result

gwaybio (Member):

This looks complicated. How complex is the flow_state object? Will we ever need to access other elements of it?

d33bs (Member Author):

Thanks for calling this forward. The flow_state object is somewhat complex - it contains state and result data for all tasks within the Prefect flow. The return here was an effort to make the primary result of the Prefect flow practically useful for a Pycytominer developer (instead of exposing all task states and metadata along the way). Because it's a new convention for the repo, I wasn't certain how much to lean towards flexibility vs. convenience.

Potential changes: we could return flow_state on its own here to rely on the developer to navigate the results (and allow them greater flexibility with what happens to the flow). A further step in this direction might be to deliver a flow without running it, enriching options in this territory as well. Thoughts?

Additional information which may help: Prefect 1.x, used for the changes in this PR, is stable and works well. There are also ongoing efforts for Prefect 2.x which change how flows are handled (flows are formed through decorators for Python functions, similar to how @task is used for these changes). This may mean that we eventually end up closer to the "potential changes" structure anyway, once Prefect 2.x is out of beta, and if this is something you'd like to head towards.
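
For reference, a minimal sketch of the Prefect 1.x pattern being discussed; the task and flow names are placeholders:

from prefect import Flow, task

@task
def write_parquet():
    return "result.parquet"  # placeholder for a real conversion task

with Flow("example-conversion") as flow:
    reduced_pq_result = write_parquet()

# flow.run() returns a State whose .result maps each task to its own State;
# that task State's .result holds the task's return value
flow_state = flow.run()
parquet_path = flow_state.result[reduced_pq_result].result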

@@ -0,0 +1,150 @@
"""

gwaybio (Member):

I think I've reviewed this code already? Oh, is it being moved over here from a different file?

d33bs (Member Author):

Apologies - yes, we've reviewed this code; it was moved from cyto_utils/sqlite.py, and I mislabeled the commit change for this file (add vs. rename/move).


# insert statement with some simple values
# note: we use SQLAlchemy's parameters to insert data properly, esp. BLOB
connection.execute(

gwaybio (Member):

Is this the dreaded SQL injection vulnerability? Do we want to test this way?

d33bs (Member Author):

Thank you for this comment. I leaned towards a simple approach with these test inserts over SQLAlchemy insert compilation to help increase understandability, presuming a low-risk vector for the non-public-facing SQLite databases.

I can understand wanting to keep things secure, protecting against future scenarios when the databases may be public-facing, or taking a more SQLAlchemy-ORM approach here. The cost might be greater complexity for the codebase for SQLite work. Do you think we should change these inserts (and related) towards this effect?

gwaybio (Member):

I'm in favor of simplicity. Perhaps the current branch and testing protections are sufficient. Maybe best to proceed as-is?

d33bs (Member Author):

Sounds good, thank you. I'm okay proceeding as-is for now; I feel we'd be abiding by PEP 20 here (simple over complex when the scenario isn't too complicated, plus practicality over purity).

There's a chance we can move away from some of this with real data samples (SQLite or parquet files) which could be used as part of the tests instead of generating SQLite data. Another thought: we could illustrate the reasoning with code comments based on this discussion.
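
For instance, a small sketch of the parameterized-insert style referenced in the inline note; the table and column names are placeholders rather than the actual test schema:

from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///example.sqlite")

with engine.begin() as connection:
    connection.execute(
        text("CREATE TABLE IF NOT EXISTS tbl_a (col_text TEXT, col_blob BLOB)")
    )
    # bound parameters let SQLAlchemy handle quoting and BLOB encoding
    connection.execute(
        text("INSERT INTO tbl_a (col_text, col_blob) VALUES (:col_text, :col_blob)"),
        {"col_text": "example", "col_blob": b"\x00\x01"},
    )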

d33bs added 9 commits July 18, 2022 14:34
- Lowercase titling after first letter
- 1-sentence-per-line for general content
- Parquet format as conversion-only
- Minor space/line fixes
Rely on developer/user to provide .parquet extension as desired.
missing_ok parameter is only in python >= 3.8
relevant for non-local execution, such as dask.
…nk files

accounting for variability in column ordering
little discernible difference in performance for using compression=None after testing

@d33bs d33bs (Member Author) commented Aug 19, 2022

Providing a comment here for context regarding some upcoming code changes for this PR:

  • Resulting parquet dataset needs to be changed to a new format (component table rows aligned as one parquet file, image data in another parquet file) (many thanks go to @gwaybio for guidance here).
  • clean_like_nulls to be included as part of conversion, preferably as a parameterized option to help reduce the complexity developers may face when performing conversion work (many thanks to discussions with @axiomcura).
  • Change implementation from Prefect 1.x to 2.x. Prefect 2.x was officially released between the time that this PR was opened and now. Moving to Prefect 2.x will be preferable for future-facing work to help prevent dependency challenges (a brief sketch of the decorator style follows this list).
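
For reference, a brief, hypothetical sketch of the Prefect 2.x decorator style (function names are placeholders):

from prefect import flow, task

@task
def convert_table(table_name):
    ...  # placeholder for per-table conversion work

@flow
def convert_sqlite_to_parquet():
    # in Prefect 2.x the flow itself is a decorated Python function,
    # called directly rather than built inside a Flow context manager
    convert_table("Image")

convert_sqlite_to_parquet()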

@d33bs d33bs (Member Author) commented Aug 22, 2022

Closing this PR to account for a change in approach and due to other challenges, as noted in earlier comments.
