jonespm changed the title from "Add a maximum limit for how far BQ data is" to "Add a maximum limit for how far BQ data is retrieved (event_time)" on Apr 23, 2024
Thank you for contributing to this project!
Describe your problem or feature you'd like added
When the cron runs for the first time, it goes all the way back to the earliest start_date in the database. This could result in a huge query if old courses have been added. We should probably have a configurable limit that can be overridden for the cron, since most of the time we don't need this old data. I'm not sure how else to avoid it without BigQuery changes such as partitioning, which we are unable to make.
In addition, it will process older courses that don't have any activity and don't need to be processed. We should add a "last_accessed_date" field or similar to the course table to track whether a course actually needs to be updated.
Describe the solution you'd like
Basically we want to limit update_resource_access to the latest possible date to prevent a full BigQuery database SELECT.
This calls the method get_data_earliest_date which has 3 conditions.
I think the best way to prevent this is to add a hardcoded setting MAX_UPDATE_RESOURCE_DAYS and always limit the query to that. I'd say we set it to 180 days by default. The problem is that if an old course is added, there's no way to get its data without changing this value. So maybe this value should be stored in the Constance config in the admin, so that someone can change it before running the cron. We probably want to put warning messages around this.
So something like in settings
MAX_ALLOWED_UPDATE_DATE = date.today() - timedelta(days=config.MAX_ALLOWED_UPDATE_DATE)
And in the cron
final_date = max(data_last_updated, settings.MAX_ALLOWED_UPDATE_DATE)
This needs to be tested to avoid a TypeError from max() if either of these values is None for some reason, and we should guard against them being None in the first place.
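A minimal sketch of a None-safe version of that clamp, assuming both values are plain date objects. The function name clamp_start_date, its parameters, and DEFAULT_MAX_DAYS are hypothetical and not part of the existing code; in practice max_days would come from the Constance config.

```python
import logging
from datetime import date, timedelta
from typing import Optional

logger = logging.getLogger(__name__)

# Hypothetical fallback, matching the 180-day default suggested above.
DEFAULT_MAX_DAYS = 180


def clamp_start_date(
    data_last_updated: Optional[date],
    max_days: Optional[int],
    today: Optional[date] = None,
) -> date:
    """Return the earliest date the cron should query, never older than the cutoff."""
    today = today or date.today()

    # Guard against a missing or invalid configured limit.
    if max_days is None or max_days < 0:
        logger.warning("Invalid max_days %r; falling back to %d", max_days, DEFAULT_MAX_DAYS)
        max_days = DEFAULT_MAX_DAYS

    cutoff = today - timedelta(days=max_days)

    # First run (or missing value): max() would raise a TypeError on None.
    if data_last_updated is None:
        logger.warning("No previous update recorded; limiting query to %s", cutoff)
        return cutoff

    if data_last_updated < cutoff:
        logger.warning(
            "Last update %s is older than the cutoff %s; earlier events are skipped",
            data_last_updated,
            cutoff,
        )

    return max(data_last_updated, cutoff)
```

The warnings are there so that anyone adding an old course can see from the logs that the limit was applied and needs raising before the cron runs.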
Describe any possible alternatives you've considered
We have considered only running on active courses, but in testing and on the first run it might not be known which courses are active. Once this has run once, we shouldn't have this problem again.