feat: add an aggregated velocity statistics table #24

Open
wants to merge 3 commits into base: main
Conversation

NoamGaash
Member

No description provided.

@NoamGaash NoamGaash marked this pull request as ready for review December 23, 2024 18:24
@NoamGaash NoamGaash requested a review from OriHoch December 23, 2024 18:24
@OriHoch
Contributor

OriHoch commented Dec 24, 2024

I tried to run this query manually for a single day

by adding WHERE date >= '2024-12-01' and date < '2024-12-02'

  • to the HourlyAverages query - it takes 10 seconds and returns ~20,000 rows
  • to the external query - it was running for more than 5 minutes so I stopped it
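
For reference, the single-day restriction described above would look roughly like this; the actual HourlyAverages query is in the PR diff, so the table and column names below are placeholders, not the real schema:

```sql
-- Sketch of the single-day benchmark; table/column names are placeholders.
SELECT
    date,
    date_trunc('hour', recorded_at_time) AS hour,
    AVG(velocity) AS avg_velocity,
    COUNT(*) AS sample_count
FROM siri_vehicle_locations
WHERE date >= '2024-12-01' AND date < '2024-12-02'   -- the added restriction
GROUP BY 1, 2;
```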

extrapolating to all the data we have (currently ~2 years, about 730 days), the total number of rows will be 20,000 * 730 = 14,600,000

because it's a materialized view, which stores all its rows, and taking the indices into account, this will add significant size to the DB. We also need to account for the load on the DB from refreshing the materialized view, which we would have to run periodically.

I think, because the basic query takes 10 seconds per day and assuming it won't run very often, it's worth just running the query itself, or maybe adding a non-materialized view.
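
To illustrate the difference: a plain (non-materialized) view stores only the query definition and recomputes the result on every read, so it adds no storage and needs no periodic refresh. A minimal sketch, again with placeholder names rather than the PR's actual query:

```sql
-- A plain view: nothing is stored, the SELECT runs on each read.
-- Table and column names are placeholders for the query in the PR diff.
CREATE VIEW hourly_velocity_averages AS
SELECT
    date,
    date_trunc('hour', recorded_at_time) AS hour,
    AVG(velocity) AS avg_velocity,
    COUNT(*) AS sample_count
FROM siri_vehicle_locations
GROUP BY 1, 2;

-- Each caller pays the ~10 seconds per day at query time:
-- SELECT * FROM hourly_velocity_averages
-- WHERE date >= '2024-12-01' AND date < '2024-12-02';
```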

Contributor

@OriHoch OriHoch left a comment


^

@NoamGaash
Member Author

Thank you!
Ideally, I'd like to have some cache mechanism, because I want to expose an API endpoint that returns a heatmap of velocity statistics, which I will call from the frontend.
Considering 14 million rows, where each row contains 7 values (8 bytes each, I guess), that's about 56 bytes per row; 56 bytes * 14,600,000 rows ≈ 800MB. I can see how that's a lot.
The front-end will cache the responses, but it's still a computationally heavy operation.
I'll read about non materialized views.
Thank you very much for the fast and educative feedback 🙏

@NoamGaash NoamGaash changed the title feat: add an aggregated velocity statistics view feat: add an aggregated velocity statistics table Jan 17, 2025
@NoamGaash
Member Author

Hi @OriHoch! I came up with this proposal to implement some kind of cache mechanism.

The last_used column will store the last time a specific date was calculated; it will be used to remove old entries from the DB.
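
Roughly, the idea looks like this (names and columns are illustrative only; the actual definition is in the PR diff):

```sql
-- Illustrative sketch of the cached statistics table; actual columns are in the PR.
CREATE TABLE velocity_daily_stats_cache (
    date          DATE NOT NULL,
    hour          SMALLINT NOT NULL,
    avg_velocity  FLOAT,
    sample_count  INTEGER,
    last_used     TIMESTAMP NOT NULL DEFAULT now(),  -- touched whenever this date is requested
    PRIMARY KEY (date, hour)
);

-- Periodic cleanup of dates that have not been requested recently:
DELETE FROM velocity_daily_stats_cache
WHERE last_used < now() - INTERVAL '30 days';
```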

I made an implementation for the API and tested it locally
hasadna/open-bus-stride-api#44

@OriHoch
Copy link
Contributor

OriHoch commented Jan 18, 2025

  1. Currently the API is strictly for SELECT, and I want to keep it that way; introducing updates/inserts would add a lot of complexity and risk that I want to keep out.
  2. I would prefer not to implement caching mechanisms in the DB, as that also introduces risks and scale problems. If you want caching we can add a Redis server, but I'd prefer to avoid it if we can.
  3. The idea of adding a new table and populating it is good, but the way to do it is via the ETL system: add a task that runs daily, iterates over all the dates for which all the source data exists and for which no data is yet in the new table, and inserts their data into this table. Most of the tasks in open_bus_stride_etl do something like this (see the sketch below). I would give this table a more generic name, maybe something like siri_vehicle_locations_daily_stats, so in the future we can add other statistics to it.
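
A rough sketch of the statement such a daily task might run; the column names and the source-data completeness check are assumptions, and the real task would live in open_bus_stride_etl alongside the existing ones:

```sql
-- Sketch of the daily backfill; names other than the suggested table name are placeholders.
-- A real task would also verify that the source data for each date is complete before inserting.
INSERT INTO siri_vehicle_locations_daily_stats (date, hour, avg_velocity, sample_count)
SELECT
    recorded_at_time::date,
    date_trunc('hour', recorded_at_time),
    AVG(velocity),
    COUNT(*)
FROM siri_vehicle_locations
WHERE recorded_at_time::date NOT IN (
    SELECT DISTINCT date FROM siri_vehicle_locations_daily_stats
)
GROUP BY 1, 2;
```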
