Duckdb's TypeMismatchException raised in CytoTable's convert() workflow due to nan values being stored as strings instead of expected types #38

Closed
axiomcura opened this issue Mar 23, 2023 · 8 comments

@axiomcura
Member

CytoTable's convert() function seems to capture NaN values as string types in the cell-health dataset, causing DuckDB to raise a duckdb.TypeMismatchException.

Below is the code to replicate the problem:

import pathlib
import cytotable

# sqlite file path
sqlite_file = str(pathlib.Path("./SQ00014613.sqlite").resolve(strict=True))

# execute convert workflow
data = cytotable.convert(sqlite_file, dest_path="./parquet_data/test.parquet", dest_datatype="parquet", preset="cellprofiler_sqlite", source_datatype="sqlite")

link to download data

Traceback

From Prefect:

13:38:45.442 | ERROR   | Task run '_read_data-1' - Encountered exception during execution:
Traceback (most recent call last):
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/prefect/engine.py", line 1533, in orchestrate_task_run
    result = await run_sync(task.fn, *args, **kwargs)
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/prefect/utilities/asyncutils.py", line 156, in run_sync_in_interruptible_worker_thread
    tg.start_soon(
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
    raise exceptions[0]
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/prefect/utilities/asyncutils.py", line 135, in capture_worker_thread_and_result
    result = __fn(*args, **kwargs)
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/cytotable/convert.py", line 59, in _read_data
    _duckdb_with_sqlite()
duckdb.TypeMismatchException: Mismatch Type Error: Invalid type in column "Nuclei_Correlation_Costes_AGP_DNA": expected float or integer, found "nan" of type "text" instead.
13:38:45.680 | ERROR   | Flow run 'wonderful-cuckoo' - Encountered exception during execution:
Traceback (most recent call last):
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/prefect/engine.py", line 665, in orchestrate_flow_run
    result = await run_sync(flow_call)
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/prefect/utilities/asyncutils.py", line 156, in run_sync_in_interruptible_worker_thread
    tg.start_soon(
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
    raise exceptions[0]
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/prefect/utilities/asyncutils.py", line 135, in capture_worker_thread_and_result
    result = __fn(*args, **kwargs)
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/cytotable/convert.py", line 730, in _to_parquet
    common_schema = _infer_source_group_common_schema(
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/prefect/tasks.py", line 469, in __call__
    return enter_task_run_engine(
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/prefect/engine.py", line 965, in enter_task_run_engine
    return run_async_from_worker_thread(begin_run)
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/prefect/utilities/asyncutils.py", line 177, in run_async_from_worker_thread
    return anyio.from_thread.run(call)
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/anyio/from_thread.py", line 49, in run
    return asynclib.run_async_from_thread(func, *args)
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 970, in run_async_from_thread
    return f.result()
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/prefect/engine.py", line 1114, in get_task_call_return_value
    return await future._result()
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/prefect/futures.py", line 237, in _result
    return await final_state.result(raise_on_failure=raise_on_failure, fetch=True)
  File "/home/erikserrano/Programs/miniconda3/envs/pycytominer/lib/python3.9/site-packages/prefect/states.py", line 103, in _get_state_result
    raise MissingResult(
prefect.exceptions.MissingResult: State data is missing. Typically, this occurs when result persistence is disabled and the state has been retrieved from the API.

It seems that the second exception raised by Prefect is caused by the earlier exception thrown by DuckDB, which prevents Prefect from updating the state of the data.

@gwaybio
Member

gwaybio commented Mar 24, 2023

related to cytomining/pycytominer#79

@d33bs
Member

d33bs commented Mar 24, 2023

Thank you for raising this @axiomcura ! This is worth considering for CytoTable. Contextually it's related to conversations in cytomining/pycytominer#198 (comment) which resulted in work found within sqlite-clean, specifically clean_like_nulls.

"Like nulls" is a reference to data values which look like null-types but are actually strings which have found their way into numeric type columns (SQLite allows this flexibility). The sqlite-clean package was created in part to address some of the data typing challenges within SQLite and also to help with performance (among other things).
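
To make the "like nulls" behavior concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name is made up for illustration and the column name is borrowed from the traceback above; the real dataset's schema may differ. It shows that SQLite accepts the string "nan" in a column declared as FLOAT, and that typeof() exposes the mismatch per value.

```python
import sqlite3

# In-memory database; the table name "Nuclei" is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Nuclei (Nuclei_Correlation_Costes_AGP_DNA FLOAT)")

# SQLite uses type affinity rather than strict typing: text that cannot be
# converted to a number is stored as text, even in a FLOAT column.
conn.executemany(
    "INSERT INTO Nuclei VALUES (?)",
    [(0.25,), ("nan",), (None,)],
)

# typeof() reports the storage class of each individual value.
storage_classes = [
    row[0]
    for row in conn.execute(
        "SELECT typeof(Nuclei_Correlation_Costes_AGP_DNA) FROM Nuclei ORDER BY rowid"
    )
]
print(storage_classes)  # ['real', 'text', 'null']
conn.close()
```

A strictly typed reader such as DuckDB expects every value in that column to be numeric, which is consistent with the "expected float or integer, found "nan" of type "text"" message reported above.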

CytoTable might need to use sqlite-clean as a dependency or as a code reference to detect or fix datasets where this occurs. Fixes for larger datasets (in contrast to detection only) require rewriting modified SQLite data and are not performance optimized. This may be more difficult to work around as a result. One minimal way we could start with work here would be to raise a specific exception on detection (for example using contains_str_like_null) and warn the user about the potential for data type mismatches.
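
As a rough sketch of the detection-only approach described above, the following uses plain sqlite3 rather than sqlite-clean. The exception class, function names, and the list of "like null" strings are all hypothetical and chosen for illustration; this is not how sqlite-clean or CytoTable implement the check.

```python
import sqlite3


class StrLikeNullError(ValueError):
    """Hypothetical exception for string 'like nulls' found in numeric columns."""


def find_str_like_nulls(conn, table, like_nulls=("nan", "null", "none")):
    """Return {column: count} of text values resembling nulls in numeric columns."""
    found = {}
    placeholders = ", ".join("?" for _ in like_nulls)
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    for _, name, decl_type, *_ in conn.execute(f'PRAGMA table_info("{table}")'):
        # Only inspect columns whose declared type looks numeric.
        if not any(key in decl_type.upper() for key in ("INT", "REAL", "FLOA", "DOUB")):
            continue
        col = f'"{name}"'
        (count,) = conn.execute(
            f'SELECT COUNT(*) FROM "{table}" '
            f"WHERE typeof({col}) = 'text' AND lower({col}) IN ({placeholders})",
            like_nulls,
        ).fetchone()
        if count:
            found[name] = count
    return found


def check_table(conn, table):
    """Raise a specific exception when string-like nulls are detected."""
    found = find_str_like_nulls(conn, table)
    if found:
        raise StrLikeNullError(
            f"String-like nulls in numeric columns of {table}: {found}"
        )


# Small demonstration with a made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Cells (Cells_Correlation_K_DNA_Mito FLOAT, Label TEXT)")
conn.executemany("INSERT INTO Cells VALUES (?, ?)", [(1.5, "a"), ("nan", "nan")])
result = find_str_like_nulls(conn, "Cells")
print(result)  # {'Cells_Correlation_K_DNA_Mito': 1}
```

The check only reads data, so it avoids the rewrite cost mentioned above, although it still scans each numeric column once.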

@d33bs d33bs moved this to In Progress in SET Projects May 7, 2023
@d33bs d33bs self-assigned this May 7, 2023
@d33bs d33bs added the bug Something isn't working label May 7, 2023
@d33bs d33bs moved this from In Progress to Paused in SET Projects May 8, 2023
@d33bs d33bs moved this from Paused to In Progress in SET Projects May 9, 2023
@d33bs d33bs moved this from In Progress to Paused in SET Projects May 9, 2023
@jenna-tomkinson
Member

Hi @d33bs,

@MattsonCam and I are getting this same error when downloading SQLite files from AWS and converting to parquet files.

Code

We are using the notebook linked here; the exact convert parameters we are using are shown below: https://github.com/WayScience/JUMP-single-cell/blob/main/0.convert/1.convert.ipynb

%%time
what = cytotable.convert(
    source_path="/".join(manifest_df.sqlite_file[2].split("/")[0:-1]),
    dest_path="test2.parquet",
    dest_datatype="parquet",
    chunk_size=150000,
    parsl_config=parsl_config,
# changed preset to this since compartments don't use prefix, but the CP version is not the same
    preset="cellprofiler_sqlite"
)

Output

The error we are receiving is as follows:

TypeMismatchException: Mismatch Type Error: Invalid type in column "Cells_Correlation_K_DNA_Mito": expected float or integer, found "nan" of type "text" instead.

Solution

There is no solution we can come up with at this time, since even if we were to download all SQLite files from AWS onto our local machine, we would still have this error. We will likely have to use the pycytominer SingleCells class instead, which I have noticed takes longer to merge single cells than CytoTable.

We hope to see a solution to this and are happy to explain more of the issue!

@d33bs
Member

d33bs commented Jul 7, 2023

Hi @jenna-tomkinson, thank you for adding to this issue, and sorry to hear this is giving you and @MattsonCam trouble. This hasn't yet been resolved with code additions. There are some open code changes in #50 which seek to resolve the issue. This work hasn't yet been merged into main and is blocked by review needs at the moment, but please feel free to reference it in case it helps in the meantime. In addition to Pycytominer, other alternatives might include using sqlite-clean, referred to earlier in this issue.
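
As a rough illustration of the kind of cleanup mentioned above, the sketch below rewrites text "nan" values to NULL in a single column using plain sqlite3. It is not sqlite-clean's API and not the change proposed in #50; the function name and the demo table are hypothetical, and it assumes NULL is an acceptable replacement for these values. Because it modifies data in place, it would be safer to run against a copy of the SQLite file.

```python
import sqlite3


def nullify_nan_strings(conn, table, column):
    """Set text 'nan' values in one column to NULL; return the number of rows changed."""
    col = f'"{column}"'
    cursor = conn.execute(
        f'UPDATE "{table}" SET {col} = NULL '
        f"WHERE typeof({col}) = 'text' AND lower({col}) = 'nan'"
    )
    conn.commit()
    return cursor.rowcount


# Small demonstration with a made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Cells (Cells_Correlation_K_DNA_Mito FLOAT)")
conn.executemany("INSERT INTO Cells VALUES (?)", [(0.5,), ("nan",), ("NaN",)])

changed = nullify_nan_strings(conn, "Cells", "Cells_Correlation_K_DNA_Mito")
remaining = conn.execute(
    "SELECT typeof(Cells_Correlation_K_DNA_Mito) FROM Cells ORDER BY rowid"
).fetchall()
print(changed, remaining)  # 2 [('real',), ('null',), ('null',)]
```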

@gwaybio
Member

gwaybio commented Jul 7, 2023

We just merged #50! 🎉

@MattsonCam - if you're able, please test the newest version and report back if it solves this issue. We can then close it :)

@gwaybio
Member

gwaybio commented Jul 26, 2023

@MattsonCam - were you able to test the newest version? Can we close this issue?

@d33bs
Member

d33bs commented Mar 14, 2024

Hi @MattsonCam, @jenna-tomkinson, and @axiomcura - I just wanted to double check on this. Do you know if this issue may be closed (or does the challenge still occur)? I'm also working on validating this, but it's taking some time to process due to the large data size (will follow up).

@d33bs
Member

d33bs commented Mar 15, 2024

I was able to confirm this is now addressed with a completed CytoTable run on SQ00014613.sqlite using modified code based on @axiomcura's original description. See this Google Colab notebook or the related backup Gist. In order to achieve this I had to move to the ThreadPoolExecutor instead of the HighThroughputExecutor (default). I've created #169 to address the HTE issues separately.

Please note: the confirmation relies on an incoming change found within #168 (which addresses a separate issue related to completion of data processing and not data type processing errors).

Thanks again @axiomcura, @jenna-tomkinson, and @MattsonCam for your help with addressing this issue! Closing it for now. Please don't hesitate to reopen or reach out if you have any questions.

@d33bs d33bs closed this as completed Mar 15, 2024
@github-project-automation github-project-automation bot moved this from Paused to Done in SET Projects Mar 15, 2024