
Add SQLite to Parquet Conversion Functionality #213

Closed
wants to merge 14 commits

Conversation

@d33bs d33bs (Member) commented Jul 14, 2022

Description

Add functionality for converting SQLite-based Pycytominer data to parquet format to address #205 (a rough sketch of this kind of conversion follows the list below).

Other notable changes:

  • Adds prefect dependency to enable scalable and resource-sensitive dataflow for conversion work.
  • Adds pyarrow dependency to enable efficient parquet file operations.
  • Adds data architecture documentation covering topics related to, and informed by, the issue above.
  • Adds sphinxcontrib-mermaid dependency to generate diagrams for related documentation.
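
As a rough sketch only (not this PR's implementation), a minimal SQLite-to-parquet conversion with pandas and pyarrow could look like the following; the database path and table name are placeholders:

import sqlite3

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# "example.sqlite" and the "Image" table are placeholders for illustration
with sqlite3.connect("example.sqlite") as connection:
    dataframe = pd.read_sql("SELECT * FROM Image", connection)

# convert the DataFrame to an Arrow table and write it as a parquet file
pq.write_table(pa.Table.from_pandas(dataframe), "example.parquet")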

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

@d33bs d33bs requested a review from gwaybio July 14, 2022 22:51

@gwaybio gwaybio (Member) left a comment

This looks like an amazing contribution, thanks @d33bs! I've made several in-line comments. Let's discuss these and then likely merge soon.

Two general comments:

  • Please make sure to keep consistent with our documentation style (I've made in-line comments where appropriate).
  • Is it possible to add an example usage somewhere? Maybe directly in the docstring? Sphinx can read and render an example, IIRC.
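
For instance, a docstring with an example section that Sphinx can render might look like the sketch below; the function name and signature here are hypothetical:

def convert_to_parquet(sql_path, parquet_path):
    """Convert a SQLite database to a parquet file.

    Examples
    --------
    >>> convert_to_parquet("example.sqlite", "example.parquet")
    'example.parquet'
    """
    # ... conversion work would happen here ...
    return parquet_path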

@@ -0,0 +1,135 @@
Data Architecture

gwaybio (Member):

Suggested change
Data Architecture
Data architecture

Unless this is a convention I'm not aware of, let's capitalize only the first word of titles (when they're not proper nouns, of course).

This will keep docs consistent throughout the project, IIRC.

d33bs (Member Author):

Understood, I'll make the related changes to this effect, thank you.


Pycytominer data architecture documentation.

Distinct Upstream Data Sources

gwaybio (Member):

Suggested change
Distinct Upstream Data Sources
Distinct upstream data sources

I'm not going to update all of these; please adjust all titles if I'm not violating some convention I'm not aware of.

Comment on lines 9 to 10
Pycytominer has distinct data flow contingent on upstream data source. Various projects are used to generate
different kinds of data which are handled differently within Pycytominer.

gwaybio (Member):

Suggested change
Pycytominer has distinct data flow contingent on upstream data source. Various projects are used to generate
different kinds of data which are handled differently within Pycytominer.
Pycytominer has distinct data flow contingent on the upstream data source.
Various projects are used to generate different kinds of data which are handled differently within Pycytominer.

We use a 1-sentence-per-line convention for docs. I am not sure if .rst is different, but let's stick with this convention for consistency.

d33bs (Member Author):

Understood, will do, thank you.

Comment on lines 12 to 13
* `CellProfiler <https://github.com/CellProfiler/CellProfiler>`_ Generates `CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`_
data used by Pycytominer.

gwaybio (Member):

Suggested change
* `CellProfiler <https://github.com/CellProfiler/CellProfiler>`_ Generates `CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`_
data used by Pycytominer.
* `CellProfiler <https://github.com/CellProfiler/CellProfiler>`_ Generates `CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`_data used by Pycytominer.

Does it make sense to do this? To stick with the 1-sentence-per-line convention, I mean. There might be a way to auto-convert.

`NPZ <https://numpy.org/doc/stable/reference/routines.io.html?highlight=npz%20format#numpy-binary-files-npy-npz>`_
data used by Pycytominer.

SQLite Data

gwaybio (Member):

Suggested change
SQLite Data
SQLite data

Comment on lines +317 to +318
filename: str:
Filename to use for the parquet file.

gwaybio (Member):

You're adding a .parquet suffix, so maybe filename isn't the best variable name? (Just trying to avoid files like TEST.parquet.parquet.)

d33bs (Member Author):

Good point here, I'll change this to be clearer and hopefully avoid the double-extension challenge.

d33bs (Member Author):

Just a follow up to mention: I've made a commit which removes the "parquet" extension addition from the functions, relying on the developer/user to provide this via filename or pq_filename (or use an extension-less filename). The aim here was to align this closer to how Pandas I/O parameters work. Please don't hesitate to let me know if we should still make a change to the parameter name.
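
For reference, a pathlib-based way a caller could guard against the double extension; the helper name here is hypothetical:

from pathlib import Path

def ensure_parquet_suffix(filename):
    # with_suffix replaces any existing extension, so both "TEST" and
    # "TEST.parquet" normalize to "TEST.parquet"
    return Path(filename).with_suffix(".parquet")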

gwaybio (Member):

I trust your judgement!

Comment on lines 330 to 331
if os.path.isfile(concatted_parquet_filename):
    os.remove(concatted_parquet_filename)

gwaybio (Member):

We've been using pathlib as a replacement for os. Have you played with it?

It will involve setting concatted_parquet_filename on line 327 to a pathlib.Path object, and then checking if it exists.
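
A minimal sketch of that pathlib form (the path value here is a placeholder):

from pathlib import Path

concatted_parquet_filename = Path("concatted.parquet")  # placeholder path

if concatted_parquet_filename.is_file():
    concatted_parquet_filename.unlink()

# on Python >= 3.8, the existence check can be folded into the call:
# concatted_parquet_filename.unlink(missing_ok=True)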

d33bs (Member Author):

Sounds good, I'll make the changes here to this effect, thank you.

)

# return the result of reduced_pq_result, a path to the new file
return flow_state.result[reduced_pq_result].result

gwaybio (Member):

This looks complicated. How complex is the flow_state object? Will we ever need to access other elements of it?

d33bs (Member Author):

Thanks for calling this forward. The flow_state object is somewhat complex - it contains state and result data for all tasks within the Prefect flow. The return here was an effort to make the primary result of the Prefect flow practically useful for a Pycytominer developer (instead of exposing all task states and metadata along the way). Because it's a new convention for the repo, I wasn't certain how much to lean towards flexibility vs. convenience.

Potential changes: we could return flow_state on its own here to rely on the developer to navigate the results (and allow them greater flexibility with what happens to the flow). A further step in this direction might be to deliver a flow without running it, enriching options in this territory as well. Thoughts?

Additional information which may help: Prefect 1.x, used for the changes in this PR, is stable and works well. There are also ongoing efforts for Prefect 2.x which change how flows are handled (flows are formed through decorators for Python functions, similar to how @task is used for these changes). This may mean that we eventually end up closer to the "potential changes" structure anyway, once Prefect 2.x is out of beta, and if this is something you'd like to head towards.
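
For reference, a minimal sketch of the Prefect 1.x pattern being discussed; the task and flow names are placeholders:

from prefect import Flow, task

@task
def write_parquet():
    return "result.parquet"  # placeholder for a real conversion task

with Flow("example-conversion") as flow:
    reduced_pq_result = write_parquet()

# flow.run() returns a State whose .result maps each task to its own State;
# that task State's .result holds the task's return value
flow_state = flow.run()
parquet_path = flow_state.result[reduced_pq_result].result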

@@ -0,0 +1,150 @@
"""

gwaybio (Member):

I think I've reviewed this code already? Oh, is it being moved over here from a different file?

d33bs (Member Author):

Apologies - yes, we've reviewed this code; it was moved from cyto_utils/sqlite.py, and I mislabeled the commit change for this file (add vs. rename/move).


# insert statement with some simple values
# note: we use SQLAlchemy's parameters to insert data properly, esp. BLOB
connection.execute(

gwaybio (Member):

Is this the dreaded SQL injection vulnerability? Do we want to test this way?

d33bs (Member Author):

Thank you for this comment. I leaned towards a simple approach with these test inserts over SQLAlchemy insert compilation to help increase understandability, presuming a low-risk vector for the non-public-facing SQLite databases.

I can understand wanting to keep things secure, protecting against future scenarios when the databases may be public-facing, or taking a more SQLAlchemy-ORM approach here. The cost might be greater complexity for the codebase for SQLite work. Do you think we should change these inserts (and related) towards this effect?

gwaybio (Member):

I'm in favor of simplicity. Perhaps the current branch and testing protections are sufficient. Maybe best to proceed as-is?

d33bs (Member Author):

Sounds good, thank you. I'm okay proceeding as-is for now; I feel we'd be abiding by PEP 20 here (simple over complex when the scenario isn't too complicated, plus practicality over purity).

There's a chance we can move away from some of this with real data samples (SQLite or parquet files) which could be used as part of the tests instead of generating SQLite data. Another thought: we could illustrate the reasoning with code comments based on this discussion.
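
For instance, a small sketch of the parameterized-insert style referenced in the inline note; the table and column names are placeholders rather than the actual test schema:

from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///example.sqlite")

with engine.begin() as connection:
    connection.execute(
        text("CREATE TABLE IF NOT EXISTS tbl_a (col_text TEXT, col_blob BLOB)")
    )
    # bound parameters let SQLAlchemy handle quoting and BLOB encoding
    connection.execute(
        text("INSERT INTO tbl_a (col_text, col_blob) VALUES (:col_text, :col_blob)"),
        {"col_text": "example", "col_blob": b"\x00\x01"},
    )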

d33bs added 9 commits July 18, 2022 14:34
- Lowercase titling after first letter
- 1-sentence-per-line for general content
- Parquet format as conversion-only
- Minor space/line fixes
Rely on developer/user to provide .parquet extension as desired.
missing_ok parameter is only in python >= 3.8
relevant for non-local execution, such as dask.
…nk files

accounting for variability in column ordering
little discernible difference in performance for using compression=None after testing

@d33bs d33bs (Member Author) commented Aug 19, 2022

Providing a comment here for context regarding some upcoming code changes for this PR:

  • Resulting parquet dataset needs to be changed to a new format (component table rows aligned as one parquet file, image data in another parquet file) (many thanks go to @gwaybio for guidance here).
  • clean_like_nulls to be included as part of conversion, preferably as a parameterized option to help reduce the complexity developers may face when performing conversion work (many thanks to discussions with @axiomcura).
  • Change implementation from Prefect 1.x to 2.x. Prefect 2.x was officially released between the time that this PR was opened and now. Moving to Prefect 2.x will be preferable for future-facing work to help prevent dependency challenges (a brief sketch of the decorator style follows this list).
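
For reference, a brief, hypothetical sketch of the Prefect 2.x decorator style (function names are placeholders):

from prefect import flow, task

@task
def convert_table(table_name):
    ...  # placeholder for per-table conversion work

@flow
def convert_sqlite_to_parquet():
    # in Prefect 2.x the flow itself is a decorated Python function,
    # called directly rather than built inside a Flow context manager
    convert_table("Image")

convert_sqlite_to_parquet()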

@d33bs d33bs (Member Author) commented Aug 22, 2022

Closing this PR to account for a change in approach and due to other challenges, as noted in earlier comments.
