How to represent mart data? Kimball dimensional modeling or other? #178
Comments
I agree with this. In fact, this is what I did with the fork that I have deployed on my client with high-volume sites. Separating sessions into I also think that we should include I also think we should discuss the Having
Yeah. The other aspect is that BigQuery doesn't index like a Kimball-era database would, so having normalized models doesn't pay off in the same way.
Here's something like what I envision the sessions and users models looking like. Green represents partitioned tables while yellow represents views. The idea here is to minimize lookups that hit the I'm not sure if I'd also welcome a discussion on whether we need both the first_last_events and first_last_pageviews models. I've noticed some problems with session attribution in sessions with just For last_non_direct attribution, I'm planning on introducing a If we were to introduce a The lookback window would also be good for user stitching between I personally like the idea of having a lookback window setting as it makes me a lot more comfortable with looking up past data which gets expensive quickly on large sites and most of the relevant data is usually in the last few days. We also might want to leave the lifetime user tables in the package as well and let the people deploying the package decide whether they care more about the efficiency of a lookback window versus having perfect user data.
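To make the lookback-window idea concrete, here is a minimal Python sketch of last-non-direct attribution for a single user. It is only an illustration and not the package's implementation: the function name `last_non_direct`, the `LOOKBACK_DAYS` setting, and the `"(direct)"` source label are all assumptions made for this example.

```python
from datetime import date, timedelta

LOOKBACK_DAYS = 30  # hypothetical setting, not a real package variable


def last_non_direct(sessions, lookback_days=LOOKBACK_DAYS):
    """Attribute each session to the most recent non-direct source in the window.

    sessions: list of (session_date, source) tuples for one user, sorted by date.
    Returns one attributed source per session, in the same order.
    """
    window = timedelta(days=lookback_days)
    attributed = []
    last_seen = None  # (date, source) of the most recent non-direct session
    for session_date, source in sessions:
        if source != "(direct)":
            last_seen = (session_date, source)
            attributed.append(source)
        elif last_seen is not None and session_date - last_seen[0] <= window:
            # Direct session inside the window: inherit the earlier source.
            attributed.append(last_seen[1])
        else:
            # Nothing non-direct inside the window: stays direct.
            attributed.append("(direct)")
    return attributed


example = [
    (date(2023, 1, 1), "google"),
    (date(2023, 1, 10), "(direct)"),  # 9 days after google: inherits it
    (date(2023, 3, 1), "(direct)"),   # 59 days after google: outside the window
]
result = last_non_direct(example)
print(result)
```

The trade-off is the one described above: a shorter window means less historical data to scan, at the cost of leaving older direct sessions unattributed.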
Just discovered this package. It does a lot of things very well, but the separation of `dim_ga4__sessions` and `fct_ga4__sessions` makes no sense to me. It's also worth pointing out that most companies have been around much longer than they've been using GA4. So the company's `fact_sessions` table is probably fed by a table full of GA4 sessions AND a table full of Universal Analytics sessions, and maybe they have Mixpanel or Snowplow or Amplitude sessions as well. Only the final sessions table is a real fact table. So I don't think that a sessions table that only has GA4 sessions needs a `fct` prefix at all: your GA4 sessions table is a building block towards a fact table.
Fivetran's packages are probably the most used outside of dbt_utils, and they don't call their Shopify orders table `fct_orders`, but just `shopify__orders`. A company might combine the `shopify__orders` table with a `stripe_orders` table to create their `fct_orders`. It makes sense to me to use the same approach here.
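As a rough sketch of the "building block" idea, the snippet below uses Python's built-in `sqlite3` to stack a GA4 sessions table and a Universal Analytics sessions table into a downstream `fct_sessions` view. The `ua__sessions` table, the column names, and the `source_system` label are assumptions made up for this illustration; a real project would do this in a dbt model on its own warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Source-specific building blocks (table and column names are hypothetical).
    CREATE TABLE ga4__sessions (session_key TEXT, session_date TEXT);
    CREATE TABLE ua__sessions (session_key TEXT, session_date TEXT);
    INSERT INTO ga4__sessions VALUES ('ga4-1', '2023-07-02');
    INSERT INTO ua__sessions VALUES ('ua-1', '2023-06-30');

    -- The company-level fact table is built downstream of the packages.
    CREATE VIEW fct_sessions AS
        SELECT 'ga4' AS source_system, session_key, session_date FROM ga4__sessions
        UNION ALL
        SELECT 'ua' AS source_system, session_key, session_date FROM ua__sessions;
""")

fct_sessions = conn.execute(
    "SELECT source_system, session_key FROM fct_sessions ORDER BY session_date"
).fetchall()
print(fct_sessions)
```

Under this layout only `fct_sessions` carries the `fct` prefix, while the per-source tables keep package-style names.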
There's definitely an argument for avoiding
As pointed out by @willbryant, the current `dim` and `fct` mart models don't provide much value in being separated as they each are unique on the same key.

Ex: `dim_ga4__sessions` and `fct_ga4__sessions` are unique on `session_key` and could be joined together without issue.

It may make sense to have a single `ga4__sessions` model, but I know this is a debate with a lot of facets so it seems worth researching first.

My opinion:

I like the idea of:

- `dim` models where there is 1 record for each entity and
- `fct` models that contain low-grain events that can be aggregated against the entity.

I think our mistake was in pre-aggregating the `fct` models so now they are entity-grain rather than the grain of the facts themselves.
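To illustrate why two models that are unique on the same key can be merged safely, here is a small sketch using Python's built-in `sqlite3`. The model names and `session_key` come from the issue; the `landing_page` and `page_views` columns and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Both models have exactly one row per session_key.
    CREATE TABLE dim_ga4__sessions (session_key TEXT PRIMARY KEY, landing_page TEXT);
    CREATE TABLE fct_ga4__sessions (session_key TEXT PRIMARY KEY, page_views INTEGER);
    INSERT INTO dim_ga4__sessions VALUES ('s1', '/home'), ('s2', '/pricing');
    INSERT INTO fct_ga4__sessions VALUES ('s1', 3), ('s2', 1);
""")

# Because both sides are unique on session_key, the join is 1:1 and cannot fan out.
ga4__sessions = conn.execute("""
    SELECT d.session_key, d.landing_page, f.page_views
    FROM dim_ga4__sessions AS d
    JOIN fct_ga4__sessions AS f ON d.session_key = f.session_key
    ORDER BY d.session_key
""").fetchall()

dim_count = conn.execute("SELECT COUNT(*) FROM dim_ga4__sessions").fetchone()[0]
print(ga4__sessions)
print(len(ga4__sessions) == dim_count)
```

The row count of the joined result matches the row count of either input, which is what you would expect when the `fct` model has already been aggregated to entity grain.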