-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor visit and cost models, plus other small changes #111
Conversation
@@ -1,33 +1,15 @@ | |||
/* emergency visits */ | |||
/* collapse er claim lines with no days between them into one visit */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i chose to remove this bit of logic as it complicates the modeling, and i'm not sure we should implement this as a blanket assumption (i.e., it's plausible that someone has separate visits to the ER or urgent care 2 days in a row)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, in my experience this always over complicates stuff and just comes back to haunt you
FROM {{ ref( 'stg_synthea__encounters') }} | ||
WHERE | ||
encounter_class = 'inpatient' | ||
OR (encounter_class IN ('ambulatory', 'wellness', 'outpatient', 'emergency', 'urgentcare') AND encounter_start_date != encounter_stop_date) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any visit not labeled inpatient but that lasts more than 1 day is considered an inpatient visit. i've seen this done in other ETLs.
i could see the argument that multiday OP visits are actually bad data, though, so i was on the fence about making this assumption. we could also just keep them as OP visits and let the analyst figure it out. it probably depends on the data source. (here, since it's fake data, there is prob no right answer).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a solid approach. Realisitcally if you were interested in those fringes of people coming in and out repeatedly in quick succession you'd do a more specific extract.
, min(e.end_datetime) AS visit_end_datetime | ||
FROM {{ ref( 'stg_synthea__encounters') }} AS v | ||
min(a.encounter_id) OVER (PARTITION BY a.patient_id, a.encounter_start_date) AS encounter_id | ||
, a.encounter_id AS original_encounter_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most of the code in this model is unchanged (rows just got shifted around). this line is an exception. i'm retaining the original encounter ID so we can map all the rolled up encounters during an IP stay back to the visit occurrence for that stay.
, {{ dbt.cast("null", api.Column.translate_type("integer")) }} AS country_concept_id | ||
, {{ dbt.cast("null", api.Column.translate_type("varchar")) }} AS country_source_value | ||
, p.patient_latitude AS latitude | ||
, p.patient_longitude AS longitude |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran into fanning issues related to the fact that we were including lat/long in the location model, but that location_source_value (our join key from person to location) was only based on the address. I felt it best just to exclude these columns as most sources will not include this data.
@burrowse in case it's of interest, this PR includes changes which allow us to properly populate visit_detail (rather than duplicating VO with different IDs like is currently done in ETL-Synthea). |
Damn @katy-sadowski !! Amazing work!! I am going through this slowly - want to see how it all comes together but I think it should be great!! Will accept then if ok? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Row diffs (main vs current branch):
int__er_visits
[17]
[19]
cost
[805]
[1424]
int__ip_visits
[11]
[27]
visit_occurrence
[601]
[604]
visit_detail
[601]
[617]
int__op_visits
[573]
[571]
I believe the visit details are now simplified in how they rolled up causing the mild change in numbers. The cost
model is a lot different as we now have Visit cost types - which makes sense, obviously people have to pay for things beyond drugs and procedures!
Better modelling! 🔝 🧠 @katy-sadowski
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Solid stuff!! I think this is easier to follow now, much less fiddly.
I have tried to follow the output and inspect how they've changed and all seems pretty good. Differences are due to remodelling.
Only query is the common join with the predicate ...encounter_id = vd.visit_detail_source_value
where vd is the int__visit_detail
.
Looking at the visit_detail_source value distinct values, they have the descriptive values like so:
And so nothing will ever join - unsure how it should correctly bind - I always find the visit_detail bit fiddly... Am I missing something?
Something I noticed during reviewing which you've also fixed - https://github.com/OHDSI/dbt-synthea/blob/23e60a735ecbda2216771291da9f32baed0d77d2/seeds/synthea/_sources.yml ingests a lot of cost data types as Is it worth importing them as a numeric type rather than a float at the seed stage though? Probs worth spinning out into a different issue and seeing the effect downstream! Minor point! |
Thanks so much @lawrenceadams for the review and for checking the rowcounts, super handy!
🤦 LOL, this was a mistake on my part. It's meant to join on encounter_id, which used to be stored in visit_detail_source_value but I changed it since source values are supposed to store the source code values, not IDs. I'll add encounter_id to that model and fix this join.
I recall when I was first working on this that I ran into issues with these datatypes which I fixed by using float in the seed stage BUT I was just trying to get things to match ETL-Synthea and not thinking deeply about it. Agree re: filing an issue; I'll do that. Good catch! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!! seems to work for me!
Woooo! Thanks! Merging 🚢 |
My initial intention was to knock out some of the minor issues in our backlog, but one thing led to another and I ended up refactoring our visit and cost models. They're simpler and more performant, now, and we have a proper visit_detail table rather than a placeholder.
This PR includes the following changes. See my inline comments for more detail. Sorry the PR is so huge; I know it makes it harder to review!
visit_occurrence_id
- now happens in a single model rather than being spread across 3visit_detail
tabletotal_paid
andpaid_by_patient
columns - it wasn't clear to me from Synthea docs that we could make this leap. So I nulled these columns out insteadclaims
andclaims_transactions
. I will file an issue for this work