Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Support) Why is dim_ga4__sessions_daily built by ignoring first_visit and session_start? #358

Open
jice-lavocat opened this issue Feb 7, 2025 · 3 comments

Comments

@jice-lavocat
Copy link

I'm sorry to ask this here since this is not an issue per se, but a request to better understand the current logic. Feel free to close this if you feel it's not appropriate.

My question is about the creation of dim_ga4__sessions_daily. I couldn't find my answer in the code nor the documentation. Basically, what is the rationale behing excluding specific events when building this table?

For dim_ga4__sessions_daily, we are using the first non-null value for a given session (excluding event_name = 'first_visit' and 'session_start')
#279 (comment)

Right now, with the current setup on some of our games (mobile apps using Firebase, not websites using GA4), we are only 'enriching' 50% of the sessions found in fct_ga4__sessions_daily. [1]
If I do a quick update of the code and remove the first_visit and session_start filter to include all events, I get much better enrichment (almost 99%).

Note that I'm joining the fct and dim tables by joining on (session_key, stream_id, session_partition_key). I'm not sure this is the recommended way.

[1] To save from high volume, we filter events in the GA4 integration from Firebase to BigQuery. We specifically filter out screen_view event (that would otherwise sometimes 3 or 4x our volumes).

@adamribaudo-velir
Copy link
Collaborator

adamribaudo-velir commented Feb 7, 2025

@jice-lavocat no worries, it's a good question. Those events were omitted originally because, at the time, they were never enriched with event parameters other than the user's pseudo ID. To be transparent, most of the work on this package comes from people working on websites which means decisions like this may have been made without regard to how the events behave in-app.

My understanding (and I just checked my own website) is that at some point Google started hydrating those 2 events like normal events on the web. So we can actually remove that limitation in the core package if you like. Want to submit a PR?

@jice-lavocat
Copy link
Author

It's good to know. Thank you.

I can of course suggest a PR, but I'm not sure how 'bad' this could affect some other users. As you are suggesting that performance was not a reason initially, maybe there is no blocker to proceed.

@dgitis
Copy link
Collaborator

dgitis commented Feb 7, 2025

The potential issue with removing the restriction on first_visit and session_start is that people who have source data from before Google started populating attribution parameters to those events may have problems with those events wrongly setting attribution when they full-refresh their warehouse.

If I remember correctly, we removed those events after having problems with them showing up. However, I think ignore nulls in the window functions should ignore missing source, medium, etc... parameters.

        ,COALESCE(FIRST_VALUE((CASE WHEN event_source <> '(direct)' THEN COALESCE(event_medium, '(none)') END) IGNORE NULLS) OVER (session_window), '(none)') AS session_medium

I don't think it will be a problem unless first_visit and session_start were getting empty, not-null values, or wrong values, like Direct, but we should definitely check this before deploying any changes.

Pretty much any time Google changes something, we're going to have to consider how to support properties that predate the change. It's going to be a regular problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants