Refactor visit and cost models, plus other small changes #111

katy-sadowski · 2025-01-25T21:20:41Z

My initial intention was to knock out some of the minor issues in our backlog, but one thing led to another and I ended up refactoring our visit and cost models. They're simpler and more performant now, and we have a proper visit_detail table rather than a placeholder.

This PR includes the following changes. See my inline comments for more detail. Sorry the PR is so huge; I know it makes it harder to review!

Refactored visit models (resolves "Refactor visit ID assignment" #108 and "Properly populate visit_detail" #110)
- Simplified the assignment of visit_occurrence_id - it now happens in a single model rather than being spread across 3
- Changed some assumptions in the rollup/collapse of encounters into visit occurrences
- Preserved the connection of individual encounters to the visits they were rolled up into - these encounters are now used to populate the visit_detail table
Refactored cost model
- Instead of creating separate int models per cost type, we now just add cost columns to the int tables for drug, procedure, visit. This is a simpler and more efficient approach
- Separated out encounter costs from drug/procedure costs. The previous models were adding in the cost of the associated encounter to the cost of each drug and procedure. It seems this could lead to duplication of costs for encounters with multiple drugs/procedures, so now the encounter costs are put in their own cost rows
- Removed assumptions around total_paid and paid_by_patient columns - it wasn't clear to me from Synthea docs that we could make this leap. So I nulled these columns out instead
- Nulled out non-required DRG and revenue code columns rather than putting placeholder values
- Removed condition costs - previous logic was assigning the cost from a claim to a condition based on that condition showing up on a claim and in the diagnosis table on the same date. This feels like a stretch (and the query was very slow) - instead, I believe that we should more holistically model all the data contained in claims and claims_transactions. I will file an issue for this work
The visit and cost refactors resolve "Investigate use of dates vs timestamps for intermediate entity derivation / de-duping logic" #74
Added datatypes to the vocabulary staging model config (final change needed to resolve "Ensure data types are specified correctly in all model configs" #36)
Added back and resolved SQLFluff checks on column prefixes (resolves "Add back ignored SQLFluff checks" #46 and "reference issue in int__assign_all_visit_ids for visit_occurrence_id" #99)
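
To make the encounter-cost duplication concern from the cost-model bullets concrete, here is a minimal Python sketch. All numbers and item names are made up for illustration and are not taken from Synthea output. It compares folding the encounter cost into every drug/procedure row with putting the encounter cost in its own row.

```python
# Hypothetical costs for one encounter with two drugs and one procedure.
encounter_cost = 100.0
line_item_costs = {"drug_a": 10.0, "drug_b": 20.0, "procedure_x": 50.0}

# Previous approach: the encounter cost is added to every drug/procedure row,
# so it is counted once per line item.
old_rows = {name: cost + encounter_cost for name, cost in line_item_costs.items()}
old_total = sum(old_rows.values())

# Refactored approach: the encounter cost gets its own cost row.
new_rows = dict(line_item_costs, visit=encounter_cost)
new_total = sum(new_rows.values())

print(old_total)  # 380.0 (encounter cost counted three times)
print(new_total)  # 180.0 (encounter cost counted once)
```

The refactored layout has one more row per encounter, which is consistent with the cost table growing in the row counts reported later in this thread.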

katy-sadowski · 2025-01-25T21:25:38Z

models/intermediate/int__er_visits.sql

@@ -1,33 +1,15 @@
-/* emergency visits */
-/* collapse er claim lines with no days between them into one visit */


i chose to remove this bit of logic as it complicates the modeling, and i'm not sure we should implement this as a blanket assumption (i.e., it's plausible that someone has separate visits to the ER or urgent care 2 days in a row)

Agree, in my experience this always overcomplicates stuff and just comes back to haunt you.

katy-sadowski · 2025-01-25T21:28:38Z

models/intermediate/int__ip_visits.sql

+    FROM {{ ref( 'stg_synthea__encounters') }}
+    WHERE
+        encounter_class = 'inpatient'
+        OR (encounter_class IN ('ambulatory', 'wellness', 'outpatient', 'emergency', 'urgentcare') AND encounter_start_date != encounter_stop_date)


any visit not labeled inpatient but that lasts more than 1 day is considered an inpatient visit. i've seen this done in other ETLs.

i could see the argument that multiday OP visits are actually bad data, though, so i was on the fence about making this assumption. we could also just keep them as OP visits and let the analyst figure it out. it probably depends on the data source. (here, since it's fake data, there is prob no right answer).

I think this is a solid approach. Realistically, if you were interested in those fringe cases of people coming in and out repeatedly in quick succession, you'd do a more specific extract.
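
For readers who prefer to see the rule outside of SQL, the WHERE clause above can be sketched as a small Python predicate. The function name and the sample dates are hypothetical, and the sketch assumes both dates are always present (in SQL, a NULL date would make the `!=` comparison evaluate to NULL rather than true).

```python
from datetime import date

# Encounter classes listed in the WHERE clause of int__ip_visits.sql.
NON_INPATIENT_CLASSES = {"ambulatory", "wellness", "outpatient", "emergency", "urgentcare"}


def is_inpatient_visit(encounter_class: str, start: date, stop: date) -> bool:
    """Return True for inpatient encounters, or listed classes spanning different dates."""
    if encounter_class == "inpatient":
        return True
    return encounter_class in NON_INPATIENT_CLASSES and start != stop


# Same-day outpatient encounter: stays out of the inpatient model.
print(is_inpatient_visit("outpatient", date(2025, 1, 1), date(2025, 1, 1)))  # False
# Outpatient encounter ending on a later date: treated as inpatient.
print(is_inpatient_visit("outpatient", date(2025, 1, 1), date(2025, 1, 2)))  # True
# Inpatient encounters always qualify, even when they start and stop on the same date.
print(is_inpatient_visit("inpatient", date(2025, 1, 1), date(2025, 1, 1)))  # True
```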

katy-sadowski · 2025-01-25T21:30:44Z

models/intermediate/int__ip_visits.sql

-        , min(e.end_datetime) AS visit_end_datetime
-    FROM {{ ref( 'stg_synthea__encounters') }} AS v
+        min(a.encounter_id) OVER (PARTITION BY a.patient_id, a.encounter_start_date) AS encounter_id
+        , a.encounter_id AS original_encounter_id


most of the code in this model is unchanged (rows just got shifted around). this line is an exception. i'm retaining the original encounter ID so we can map all the rolled up encounters during an IP stay back to the visit occurrence for that stay.
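
As a rough illustration of what the two added columns produce, the sketch below runs the same window expression against an in-memory SQLite table from Python. The table name, IDs, and dates are made up, and SQLite stands in for whatever warehouse the project targets (window functions need SQLite 3.25 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE encounters (encounter_id TEXT, patient_id TEXT, encounter_start_date TEXT)"
)
# Hypothetical data: two encounters start on the same day, one starts later.
conn.executemany(
    "INSERT INTO encounters VALUES (?, ?, ?)",
    [
        ("e1", "p1", "2025-01-01"),
        ("e2", "p1", "2025-01-01"),
        ("e3", "p1", "2025-01-05"),
    ],
)

rows = conn.execute(
    """
    SELECT
        min(a.encounter_id) OVER (PARTITION BY a.patient_id, a.encounter_start_date) AS encounter_id
        , a.encounter_id AS original_encounter_id
    FROM encounters AS a
    ORDER BY original_encounter_id
    """
).fetchall()

# Each row pairs the rolled-up ID with the original encounter ID.
print(rows)  # [('e1', 'e1'), ('e1', 'e2'), ('e3', 'e3')]
```

Encounters e1 and e2 share a rolled-up ID while keeping their own original IDs, which is the mapping that lets individual encounters be traced back to the visit they were rolled up into.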

katy-sadowski · 2025-01-25T21:33:50Z

models/intermediate/int__location.sql

-        , {{ dbt.cast("null", api.Column.translate_type("integer")) }} AS country_concept_id
-        , {{ dbt.cast("null", api.Column.translate_type("varchar")) }} AS country_source_value
-        , p.patient_latitude AS latitude
-        , p.patient_longitude AS longitude


I ran into fanning issues related to the fact that we were including lat/long in the location model, but that location_source_value (our join key from person to location) was only based on the address. I felt it best just to exclude these columns as most sources will not include this data.
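
A small sketch of the kind of fan-out described here, again using an in-memory SQLite database from Python. The tables and values are hypothetical, and the sketch assumes the duplication came from one address appearing with more than one latitude/longitude pair, which the comment does not state explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (person_id INTEGER, location_source_value TEXT);
    CREATE TABLE location (location_source_value TEXT, latitude REAL, longitude REAL);
    INSERT INTO person VALUES (1, '1 Main St');
    -- Same address with two slightly different coordinates (made-up values).
    INSERT INTO location VALUES ('1 Main St', 42.10, -71.20);
    INSERT INTO location VALUES ('1 Main St', 42.11, -71.21);
    """
)

join_sql = """
    SELECT count(*)
    FROM person AS p
    INNER JOIN ({location_query}) AS l
        ON p.location_source_value = l.location_source_value
"""
# Distinct locations including coordinates: two rows for the same address.
with_coords = "SELECT DISTINCT location_source_value, latitude, longitude FROM location"
# Distinct locations keyed on the address only: one row per address.
without_coords = "SELECT DISTINCT location_source_value FROM location"

fanned = conn.execute(join_sql.format(location_query=with_coords)).fetchone()[0]
clean = conn.execute(join_sql.format(location_query=without_coords)).fetchone()[0]

print(fanned, clean)  # 2 1
```

With coordinates included, the single person row joins to two location rows; dropping the coordinates restores one row per join key.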

katy-sadowski · 2025-01-25T21:40:00Z

@burrowse in case it's of interest, this PR includes changes which allow us to properly populate visit_detail (rather than duplicating VO with different IDs like is currently done in ETL-Synthea).

lawrenceadams · 2025-01-28T08:56:44Z

Damn @katy-sadowski !! Amazing work!! I am going through this slowly - want to see how it all comes together but I think it should be great!! Will accept then if ok?

lawrenceadams

Row diffs (main vs current branch):

| Model | main | current branch |
| --- | --- | --- |
| int__er_visits | 17 | 19 |
| cost | 805 | 1424 |
| int__ip_visits | 11 | 27 |
| visit_occurrence | 601 | 604 |
| visit_detail | 601 | 617 |
| int__op_visits | 573 | 571 |

I believe the way visits are rolled up is now simpler, which causes the mild change in numbers. The cost model is a lot different as we now have Visit cost types - which makes sense, obviously people have to pay for things beyond drugs and procedures!

Better modelling! 🔝 🧠 @katy-sadowski

lawrenceadams

Solid stuff!! I think this is easier to follow now, much less fiddly.

I have tried to follow the output and inspect how they've changed and all seems pretty good. Differences are due to remodelling.

Only query is the common join with the predicate ...encounter_id = vd.visit_detail_source_value where vd is the int__visit_detail.

Looking at the distinct values of visit_detail_source_value, they are descriptive values like so:

And so nothing will ever join - unsure how it should correctly bind - I always find the visit_detail bit fiddly... Am I missing something?

lawrenceadams · 2025-01-28T23:04:09Z

Something I noticed while reviewing, which you've also fixed - https://github.com/OHDSI/dbt-synthea/blob/23e60a735ecbda2216771291da9f32baed0d77d2/seeds/synthea/_sources.yml ingests a lot of cost columns as float types, which are then cast to numeric types, which sorts out the chaotic float issues that arise.

Is it worth importing them as a numeric type rather than a float at the seed stage though? Probs worth spinning out into a different issue and seeing the effect downstream! Minor point!
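
For anyone unfamiliar with the "chaotic float issues" mentioned above, a minimal Python sketch shows the general effect using the standard-library `Decimal` type as a stand-in for a database numeric type. The amounts are made up and this says nothing about how any particular warehouse handles the seed files.

```python
from decimal import Decimal

# Hypothetical cost values as they might appear as text in a CSV seed file.
amounts = ["0.10", "0.20"]

# Parsed as binary floats, the sum picks up a representation error.
float_total = sum(float(a) for a in amounts)
# Parsed as exact decimals, the sum is exact.
decimal_total = sum(Decimal(a) for a in amounts)

print(float_total == 0.3)                # False
print(decimal_total == Decimal("0.30"))  # True
```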

katy-sadowski · 2025-01-31T02:06:24Z

Thanks so much @lawrenceadams for the review and for checking the rowcounts, super handy!

> Only query is the common join with the predicate ...encounter_id = vd.visit_detail_source_value where vd is the int__visit_detail.

🤦 LOL, this was a mistake on my part. It's meant to join on encounter_id, which used to be stored in visit_detail_source_value but I changed it since source values are supposed to store the source code values, not IDs. I'll add encounter_id to that model and fix this join.
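
A tiny sketch of why the original predicate returned nothing and what the fix changes. The rows, IDs, and descriptive values below are invented for illustration, and `inner_join` is a hypothetical helper rather than anything in the project.

```python
# Hypothetical rows: IDs and descriptive source values are made up.
encounters = [{"encounter_id": "enc-001"}, {"encounter_id": "enc-002"}]
visit_detail = [
    {"encounter_id": "enc-001", "visit_detail_source_value": "inpatient"},
    {"encounter_id": "enc-002", "visit_detail_source_value": "emergency"},
]


def inner_join(left, right, left_key, right_key):
    """Return (left_row, right_row) pairs whose key values are equal."""
    return [(l, r) for l in left for r in right if l[left_key] == r[right_key]]


# Joining an ID to a descriptive source value never matches.
broken = inner_join(encounters, visit_detail, "encounter_id", "visit_detail_source_value")
# Joining ID to ID, once encounter_id is carried on the visit detail rows, matches every row.
fixed = inner_join(encounters, visit_detail, "encounter_id", "encounter_id")

print(len(broken), len(fixed))  # 0 2
```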

> Is it worth importing them as a numeric type rather than a float at the seed stage though? Probs worth spinning out into a different issue and seeing the effect downstream! Minor point!

I recall when I was first working on this that I ran into issues with these datatypes which I fixed by using float in the seed stage BUT I was just trying to get things to match ETL-Synthea and not thinking deeply about it. Agree re: filing an issue; I'll do that. Good catch!

lawrenceadams

Looks great!! seems to work for me!

katy-sadowski · 2025-02-01T02:21:55Z

Woooo! Thanks! Merging 🚢

Katy Sadowski added 4 commits January 19, 2025 20:36

vocab datatypes, sqlfluff

ef1441d

add table prefix

866a208

fix refs

a33bedf

refactor visits, other fixes

9b27fa1

katy-sadowski commented Jan 25, 2025

fix comment

23e60a7

katy-sadowski requested a review from lawrenceadams January 25, 2025 21:38

katy-sadowski mentioned this pull request Jan 25, 2025

[NOTES] Experiences with dbt-synthea for Big Patient Datasets #91

Open

lawrenceadams mentioned this pull request Jan 28, 2025

refactor: Optimise ID Generation #114

Open

lawrenceadams reviewed Jan 28, 2025

katy-sadowski mentioned this pull request Jan 31, 2025

Investigate use of float vs numeric types #116

Open

fix encounter ID ref

12766d7

lawrenceadams approved these changes Jan 31, 2025

katy-sadowski merged commit f036514 into main Feb 1, 2025

katy-sadowski deleted the katy__minor_fixes branch February 1, 2025 02:22
