Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Find a better approach to signal error values #119

Open
pmayd opened this issue Dec 21, 2023 · 0 comments
Open

Find a better approach to signal error values #119

pmayd opened this issue Dec 21, 2023 · 0 comments

Comments

@pmayd
Copy link
Collaborator

pmayd commented Dec 21, 2023

Currently we have certain error values like 999999 for numerical values and "9999-09-09" for date values that cannot be converted by our second script.

We need this information to signal where the conversion of the original data failed and thus we might have a data quality issue that needs further investigation.

However, when working with the data in Looker studio or when exporting the data, we don't want these values because they need to be filtered out for further analysis or aggregations.

My current SQL code looks something like this which needs to be done for nearly every column:

CASE hba1c_updated_date 
    WHEN "9999-09-09" THEN NULL
    ELSE hba1c_updated_date
    END
    AS hba1c_updated_date,

We need a better approach for this. I think we have several options here, so we first need a sound concept.

  1. Do it in R after the final pipeline step, maybe split the tables in something like _raw with error values and the final tables without
  2. Do it in R and create a relational representation where we have the data without errors and a big error table that contains the information about which row and which columns had error values
  3. Do it in SQL/Google?

Just some ideas

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant