Find a better approach to signal error values #119

pmayd · 2023-12-21T20:20:30Z

Currently, when our second script cannot convert a value, it writes an error value in its place, such as 999999 for numerical values and "9999-09-09" for date values.

We need this information because it shows where the conversion of the original data failed, and therefore where we might have a data quality issue that needs further investigation.

However, when working with the data in Looker Studio or when exporting it, we don't want these values, because they have to be filtered out before any further analysis or aggregation.

My current SQL code looks something like this, and it has to be repeated for nearly every column:

CASE hba1c_updated_date 
    WHEN "9999-09-09" THEN NULL
    ELSE hba1c_updated_date
    END
    AS hba1c_updated_date,
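
For comparison, the same mapping could probably be written more compactly with NULLIF, which returns NULL when its two arguments are equal. This is only a sketch: it assumes BigQuery-style SQL (as the double-quoted literals above suggest), the sample rows are made up, and row_id and hba1c_value are hypothetical columns added for illustration. It still has to be repeated per column, so it shortens the code without solving the underlying problem.

```sql
-- Hypothetical sample rows; the real table and its column types will differ.
WITH patient_data AS (
  SELECT 1 AS row_id, "2023-05-01" AS hba1c_updated_date, 7.2 AS hba1c_value
  UNION ALL
  SELECT 2, "9999-09-09", 999999
)
SELECT
  row_id,
  -- NULLIF(x, y) yields NULL when x = y, otherwise x (same result as the CASE above).
  NULLIF(hba1c_updated_date, "9999-09-09") AS hba1c_updated_date,
  NULLIF(hba1c_value, 999999) AS hba1c_value
FROM patient_data;
```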

We need a better approach for this. I think we have several options here, so we first need a sound concept.

Do it in R after the final pipeline step, maybe splitting the tables into something like _raw tables with error values and final tables without them
Do it in R and create a relational representation: the data without error values, plus one big error table that records which rows and which columns had error values
Do it in SQL/Google?

Just some ideas
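
To make the error-table idea a bit more concrete, here is a rough sketch of what such a table could look like if it were built in SQL: one row per (row, column) pair that holds an error value. It assumes BigQuery-style SQL; the sample rows, the row_id key, and the hba1c_value column are hypothetical, and the same shape could equally be produced in R.

```sql
-- Hypothetical sample rows standing in for the raw pipeline output.
WITH patient_data_raw AS (
  SELECT 1 AS row_id, "2023-05-01" AS hba1c_updated_date, 7.2 AS hba1c_value
  UNION ALL
  SELECT 2, "9999-09-09", 8.1
  UNION ALL
  SELECT 3, "2023-06-15", 999999
)
-- Long-format error table: which row and which column had an error value.
SELECT row_id, "hba1c_updated_date" AS column_name
FROM patient_data_raw
WHERE hba1c_updated_date = "9999-09-09"
UNION ALL
SELECT row_id, "hba1c_value" AS column_name
FROM patient_data_raw
WHERE hba1c_value = 999999
ORDER BY row_id;
```

The cleaned table would then hold NULL in those cells, and the error table could be joined back on row_id whenever the data quality issues need to be investigated. With many columns, the per-column branches would have to be generated rather than written by hand.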
