Refactor visit and cost models, plus other small changes #111

Merged: katy-sadowski merged 6 commits into main from katy__minor_fixes on Feb 1, 2025

Conversation

katy-sadowski
Collaborator

@katy-sadowski katy-sadowski commented Jan 25, 2025

My initial intention was to knock out some of the minor issues in our backlog, but one thing led to another and I ended up refactoring our visit and cost models. They're now simpler and more performant, and we have a proper visit_detail table rather than a placeholder.

This PR includes the following changes. See my inline comments for more detail. Sorry the PR is so huge; I know it makes it harder to review!

  • Refactored visit models (resolves Refactor visit ID assignment #108 and Properly populate visit_detail #110)
    • Simplified the assignment of visit_occurrence_id: it now happens in a single model rather than being spread across 3
    • Changed some assumptions in the rollup/collapse of encounters into visit occurrences
    • Preserved the connection of individual encounters to the visits they were rolled up into, and used these encounters to populate the visit_detail table
  • Refactored cost model
    • Instead of creating separate int models per cost type, we now just add cost columns to the int tables for drug, procedure, visit. This is a simpler and more efficient approach
    • Separated out encounter costs from drug/procedure costs. The previous models were adding in the cost of the associated encounter to the cost of each drug and procedure. It seems this could lead to duplication of costs for encounters with multiple drugs/procedures, so now the encounter costs are put in their own cost rows
    • Removed assumptions around total_paid and paid_by_patient columns - it wasn't clear to me from Synthea docs that we could make this leap. So I nulled these columns out instead
    • Nulled out non-required DRG and revenue code columns rather than putting placeholder values
    • Removed condition costs - previous logic was assigning the cost from a claim to a condition based on that condition showing up on a claim and in the diagnosis table on the same date. This feels like a stretch (and the query was very slow) - instead, I believe that we should more holistically model all the data contained in claims and claims_transactions. I will file an issue for this work
  • The visit and cost refactors resolve Investigate use of dates vs timestamps for intermediate entity derivation / de-duping logic #74
  • Added datatypes to the vocabulary staging model config (final change needed to resolve Ensure data types are specified correctly in all model configs #36)
  • Added back and resolved SQLFluff checks on column prefixes (resolves Add back ignored SQLFluff checks #46 and reference issue in int__assign_all_visit_ids for visit_occurrence_id #99)
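
To illustrate the cost duplication described in the cost model bullets above, here is a minimal sketch using SQLite from Python. The table and column names (`encounters`, `procedures`, `total_encounter_cost`, `base_cost`) and the values are hypothetical toy data, not the project's actual models; the sketch only shows why folding an encounter's cost into each procedure row inflates totals, and why separate encounter cost rows do not.

```python
import sqlite3

# Hypothetical toy data: one encounter costing 100 with two procedures (10 and 20).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE encounters (encounter_id TEXT, total_encounter_cost REAL);
CREATE TABLE procedures (procedure_id TEXT, encounter_id TEXT, base_cost REAL);
INSERT INTO encounters VALUES ('e1', 100.0);
INSERT INTO procedures VALUES ('p1', 'e1', 10.0), ('p2', 'e1', 20.0);
""")

# Folding the encounter cost into every procedure row counts it once per procedure.
folded_total = conn.execute("""
    SELECT SUM(p.base_cost + e.total_encounter_cost)
    FROM procedures AS p
    INNER JOIN encounters AS e ON p.encounter_id = e.encounter_id
""").fetchone()[0]

# Giving the encounter its own cost row counts it exactly once.
separate_total = conn.execute("""
    SELECT SUM(cost) FROM (
        SELECT base_cost AS cost FROM procedures
        UNION ALL
        SELECT total_encounter_cost AS cost FROM encounters
    )
""").fetchone()[0]

print(folded_total, separate_total)
```

With two procedures on the encounter, the folded total includes the encounter cost twice, while the separate-rows total includes it once.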

@@ -1,33 +1,15 @@
/* emergency visits */
/* collapse er claim lines with no days between them into one visit */
Collaborator Author

i chose to remove this bit of logic as it complicates the modeling, and i'm not sure we should implement this as a blanket assumption (i.e., it's plausible that someone has separate visits to the ER or urgent care 2 days in a row)

Collaborator

Agree, in my experience this always overcomplicates stuff and just comes back to haunt you

FROM {{ ref( 'stg_synthea__encounters') }}
WHERE
encounter_class = 'inpatient'
OR (encounter_class IN ('ambulatory', 'wellness', 'outpatient', 'emergency', 'urgentcare') AND encounter_start_date != encounter_stop_date)
Collaborator Author

@katy-sadowski katy-sadowski Jan 25, 2025

any visit not labeled inpatient but that lasts more than 1 day is considered an inpatient visit. i've seen this done in other ETLs.

i could see the argument that multiday OP visits are actually bad data, though, so i was on the fence about making this assumption. we could also just keep them as OP visits and let the analyst figure it out. it probably depends on the data source. (here, since it's fake data, there is prob no right answer).
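
As a rough illustration of the rule described here, the sketch below runs the same predicate against a hypothetical three-row encounters table in SQLite. The table name and rows are made up for the example; only the `WHERE` clause mirrors the snippet under review.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE encounters (
    encounter_id TEXT,
    encounter_class TEXT,
    encounter_start_date TEXT,
    encounter_stop_date TEXT
);
-- Hypothetical rows: a true inpatient stay, a same-day outpatient visit,
-- and an emergency encounter that spans two dates.
INSERT INTO encounters VALUES
    ('e1', 'inpatient',  '2020-01-01', '2020-01-03'),
    ('e2', 'outpatient', '2020-02-01', '2020-02-01'),
    ('e3', 'emergency',  '2020-03-01', '2020-03-02');
""")

# Same predicate as the reviewed snippet: inpatient-labeled encounters, plus
# any other listed class whose start and stop dates differ.
ip_ids = [row[0] for row in conn.execute("""
    SELECT encounter_id
    FROM encounters
    WHERE
        encounter_class = 'inpatient'
        OR (
            encounter_class IN ('ambulatory', 'wellness', 'outpatient', 'emergency', 'urgentcare')
            AND encounter_start_date != encounter_stop_date
        )
    ORDER BY encounter_id
""")]

print(ip_ids)
```

The multi-day emergency encounter is picked up alongside the labeled inpatient stay, while the same-day outpatient visit is not.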

Collaborator

I think this is a solid approach. Realistically, if you were interested in those fringe cases of people coming in and out repeatedly in quick succession, you'd do a more specific extract.

, min(e.end_datetime) AS visit_end_datetime
FROM {{ ref( 'stg_synthea__encounters') }} AS v
min(a.encounter_id) OVER (PARTITION BY a.patient_id, a.encounter_start_date) AS encounter_id
, a.encounter_id AS original_encounter_id
Collaborator Author

most of the code in this model is unchanged (rows just got shifted around). this line is an exception. i'm retaining the original encounter ID so we can map all the rolled up encounters during an IP stay back to the visit occurrence for that stay.
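
A small sketch of what this pair of columns produces, using SQLite from Python. The `ip_encounters` table and its rows are hypothetical; the two select expressions are the ones from the diff. It assumes an SQLite build with window function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ip_encounters (patient_id TEXT, encounter_id TEXT, encounter_start_date TEXT);
-- Hypothetical rows: two encounters starting the same day, one starting later.
INSERT INTO ip_encounters VALUES
    ('p1', 'a2', '2020-01-01'),
    ('p1', 'a1', '2020-01-01'),
    ('p1', 'a3', '2020-05-01');
""")

# encounter_id becomes the rolled-up ID shared by encounters with the same
# patient and start date; original_encounter_id keeps each source encounter.
rows = conn.execute("""
    SELECT
        min(a.encounter_id) OVER (PARTITION BY a.patient_id, a.encounter_start_date) AS encounter_id
        , a.encounter_id AS original_encounter_id
    FROM ip_encounters AS a
    ORDER BY original_encounter_id
""").fetchall()

for rolled_up_id, original_id in rows:
    print(rolled_up_id, original_id)
```

Encounters `a1` and `a2` share the rolled-up ID `a1`, and each row still carries its own original ID, which is what allows the individual encounters to be mapped back to a visit later.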

, {{ dbt.cast("null", api.Column.translate_type("integer")) }} AS country_concept_id
, {{ dbt.cast("null", api.Column.translate_type("varchar")) }} AS country_source_value
, p.patient_latitude AS latitude
, p.patient_longitude AS longitude
Collaborator Author

I ran into fanning issues related to the fact that we were including lat/long in the location model, but that location_source_value (our join key from person to location) was only based on the address. I felt it best just to exclude these columns as most sources will not include this data.
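
The fan-out can be reproduced with a toy example. In this sketch the `patients` table, its columns, and the values are hypothetical; it assumes two patients share an address string but have different coordinates, which is one way the mismatch between the location grain and the join key could arise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (patient_id TEXT, address TEXT, lat REAL, lon REAL);
-- Hypothetical rows: same address string, different coordinates.
INSERT INTO patients VALUES
    ('p1', '1 Main St', 41.0, -71.0),
    ('p2', '1 Main St', 41.5, -71.5);
""")

# Location is distinct on (address, lat, lon) but joined on address only,
# so each patient matches both location rows.
fanned_rows = conn.execute("""
    WITH location AS (
        SELECT DISTINCT address AS location_source_value, lat, lon
        FROM patients
    )
    SELECT COUNT(*)
    FROM patients AS p
    INNER JOIN location AS l ON p.address = l.location_source_value
""").fetchone()[0]

# Dropping lat/lon makes the location grain match the join key again.
fixed_rows = conn.execute("""
    WITH location AS (
        SELECT DISTINCT address AS location_source_value
        FROM patients
    )
    SELECT COUNT(*)
    FROM patients AS p
    INNER JOIN location AS l ON p.address = l.location_source_value
""").fetchone()[0]

print(fanned_rows, fixed_rows)
```

Two patients become four joined rows in the first query and stay at two in the second.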

@katy-sadowski
Collaborator Author

@burrowse in case it's of interest, this PR includes changes which allow us to properly populate visit_detail (rather than duplicating VO with different IDs like is currently done in ETL-Synthea).

@lawrenceadams
Collaborator

Damn @katy-sadowski !! Amazing work!! I am going through this slowly - want to see how it all comes together but I think it should be great!! Will accept then if ok?

Collaborator

@lawrenceadams lawrenceadams left a comment

Row diffs (main vs current branch):

| Model | main | Current branch |
|---|---|---|
| int__er_visits | 17 | 19 |
| cost | 805 | 1424 |
| int__ip_visits | 11 | 27 |
| visit_occurrence | 601 | 604 |
| visit_detail | 601 | 617 |
| int__op_visits | 573 | 571 |

I believe the visit details are now rolled up in a simpler way, which causes the mild change in numbers. The cost model is a lot different as we now have Visit cost types, which makes sense: obviously people have to pay for things beyond drugs and procedures!

Better modelling! 🔝 🧠 @katy-sadowski

Collaborator

@lawrenceadams lawrenceadams left a comment

Solid stuff!! I think this is easier to follow now, much less fiddly.

I have tried to follow the output and inspect how they've changed and all seems pretty good. Differences are due to remodelling.

Only query is the common join with the predicate ...encounter_id = vd.visit_detail_source_value where vd is the int__visit_detail.

Looking at the distinct values of visit_detail_source_value, they are descriptive values like so:

[image: distinct visit_detail_source_value values]

And so nothing will ever join - unsure how it should correctly bind - I always find the visit_detail bit fiddly... Am I missing something?

@lawrenceadams
Collaborator

Something I noticed while reviewing, which you've also fixed: https://github.com/OHDSI/dbt-synthea/blob/23e60a735ecbda2216771291da9f32baed0d77d2/seeds/synthea/_sources.yml ingests a lot of cost columns as float types. These are then cast to numeric types, which sorts out the chaotic float issues that arise.

Is it worth importing them as a numeric type rather than a float at the seed stage though? Probs worth spinning out into a different issue and seeing the effect downstream! Minor point!
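
For context on the "chaotic float issues", here is a generic Python sketch of binary floating point versus exact decimal arithmetic. It is not taken from the project's seeds or models; the values are arbitrary and only illustrate why monetary amounts are usually held in a numeric/decimal type.

```python
from decimal import Decimal

# Summing binary floats accumulates representation error.
float_total = sum([0.1, 0.1, 0.1])

# Summing exact decimals (the analogue of a SQL numeric type) does not.
decimal_total = sum([Decimal("0.1")] * 3, Decimal("0"))

print(float_total == 0.3)               # False
print(decimal_total == Decimal("0.3"))  # True
```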

@katy-sadowski
Collaborator Author

Thanks so much @lawrenceadams for the review and for checking the rowcounts, super handy!

> Only query is the common join with the predicate ...encounter_id = vd.visit_detail_source_value where vd is the int__visit_detail.

🤦 LOL, this was a mistake on my part. It's meant to join on encounter_id, which used to be stored in visit_detail_source_value but I changed it since source values are supposed to store the source code values, not IDs. I'll add encounter_id to that model and fix this join.
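
A minimal sketch of the mismatch and the proposed fix, using SQLite from Python. The table shapes and rows are hypothetical, and `'outpatient'` is only a stand-in for whatever descriptive source values the model actually holds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE int__visit_detail (
    visit_detail_id INTEGER,
    encounter_id TEXT,
    visit_detail_source_value TEXT
);
CREATE TABLE procedures (procedure_id TEXT, encounter_id TEXT);
-- Hypothetical rows; 'outpatient' stands in for a descriptive source value.
INSERT INTO int__visit_detail VALUES (1, 'e1', 'outpatient');
INSERT INTO procedures VALUES ('p1', 'e1');
""")

# Joining an ID to a descriptive source value never matches.
broken_matches = conn.execute("""
    SELECT COUNT(*)
    FROM procedures AS p
    INNER JOIN int__visit_detail AS vd
        ON p.encounter_id = vd.visit_detail_source_value
""").fetchone()[0]

# Joining on encounter_id, once the model exposes it, matches as intended.
fixed_matches = conn.execute("""
    SELECT COUNT(*)
    FROM procedures AS p
    INNER JOIN int__visit_detail AS vd
        ON p.encounter_id = vd.encounter_id
""").fetchone()[0]

print(broken_matches, fixed_matches)
```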

> Is it worth importing them as a numeric type rather than a float at the seed stage though? Probs worth spinning out into a different issue and seeing the effect downstream! Minor point!

I recall when I was first working on this that I ran into issues with these datatypes which I fixed by using float in the seed stage BUT I was just trying to get things to match ETL-Synthea and not thinking deeply about it. Agree re: filing an issue; I'll do that. Good catch!

Collaborator

@lawrenceadams lawrenceadams left a comment

Looks great!! seems to work for me!

@katy-sadowski
Collaborator Author

Woooo! Thanks! Merging 🚢

@katy-sadowski katy-sadowski merged commit f036514 into main Feb 1, 2025
@katy-sadowski katy-sadowski deleted the katy__minor_fixes branch February 1, 2025 02:22