-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
for yyyy_mm_dd metrics, query by fmt_time instead of local_dt #970
for yyyy_mm_dd metrics, query by fmt_time instead of local_dt #970
Conversation
I recently rewrote the local dt queries used by TimeComponentQuery (e-mission#968) to make them behave as datetime range queries. But this results in some pretty complex queries which could have nested $and and $or conditions. It felt overengineered but that was the logic required if we were to query date ranges by looking at local dt objects where year, month, day, etc are all recorded separately. Unfortunately, those queries run super slowly on production. I think this is because the $and / $or conditions prevented MongoDB from optimizing efficiently via indexing Looked for a different solution, I found something better and simpler. At first I didn't think using fmt_time fields with $lt and $gt comparisons would work becuase they're ISO strings, not numbers. But luckily MongoDB can compare strings this way. it performs a lexicographical comparison where 0-9 < A-Z < a-z (like ASCII). ISO has the standard format 0000-00-00T00:00:00 The dashes and the T will always be in the same places, so effectively, only the numbers will be compared. And the timezone info is at the end, so it doesn't get considered as long as we don't include it in the start and end inputs. The only thing that specifically needs to be handled is making the end range inclusive. If I query from "2024-06-01" to "2024-06-03", an entry like "2024-06-03T23:59:59-04:00" should match. But lexicographically, that comes after "2024-06-03". Appending a "Z" to the end range solves this. The result of all this is that the TimeQuery class (timequery.py) is now able to handle either timestamp or ISO dates, and this is what metrics.py will use for summarize_by_yyyy_mm_dd.
@shankari |
I vote for "filter query" instead of "range query" |
I realized that the TimeQuery on For that reason, I don't want to rename anything in I briefly read up on SQL queries to make sure they work the same way when comparing strings. They do – so later on, we could implement |
…into faster_localdate_queries
e-mission#970 (comment) > I realized that the TimeQuery on `e-mission-server` (which works w/ MongoDB) mirrors TimeQuery in the native code plugins (which work with SQL dbs). > > For that reason, I don't want to rename anything in `TimeQuery` right now. So instead of extending the functionality of `TimeQuery` to work with either ts or fmt_time, I'm going to make a new class called `FmtTimeQuery` (or similar) > > I briefly read up on SQL queries to make sure they work the same way when comparing strings. They do – so later on, we could implement `FmtTimeQuery` on the native code UserCache plugin too
e-mission#970 (comment) reverts e-mission#968 back to the original implementation since we now have a faster way to do "range" queries via FmtTimeQuery. To make it clearer that FmtTimeQuery is for ranges and TimeComponentQuery is for filters, I renamed some functions and added more descriptive comments/docstrings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Glad to see that this worked out; the ISO folks designed their format correctly!
- I am merging this for now, but we should add tests for the new
FmtTimeQuery
, similar to the current filter query - I assume that the changes to the local date queries are just reverting the previous changes. I spot checked and that does seem to be true, but please flag locations that were not a simple revert so that I can take a closer look later.
|
||
def get_query(self) -> dict: | ||
time_key = self.timeType | ||
ret_query = {time_key: {"$lte": self.endIso}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we expect that the end will always be present?
An alternate approach would be to assume that the start will always be present, and set the end to today
I recently rewrote the local dt queries used by TimeComponentQuery (#968) to make them behave as datetime range queries. But this results in some pretty complex queries which could have nested $and and $or conditions. It felt overengineered but that was the logic required if we were to query date ranges by looking at local dt objects where year, month, day, etc are all recorded separately. Unfortunately, those queries run super slowly on production. I think this is because the $and / $or conditions prevented MongoDB from optimizing efficiently via indexing
Looked for a different solution, I found something better and simpler. At first I didn't think using fmt_time fields with $lt and $gt comparisons would work becuase they're ISO strings, not numbers. But luckily MongoDB can compare strings this way. it performs a lexicographical comparison where 0-9 < A-Z < a-z (like ASCII). ISO has the standard format 0000-00-00T00:00:00
The dashes and the T will always be in the same places, so effectively, only the numbers will be compared. And the timezone info is at the end, so it doesn't get considered as long as we don't include it in the start and end inputs. The only thing that specifically needs to be handled is making the end range inclusive. If I query from "2024-06-01" to "2024-06-03", an entry like "2024-06-03T23:59:59-04:00" should match. But lexicographically, that comes after "2024-06-03". Appending a "Z" to the end range solves this.
The result of all this is that the TimeQuery class (timequery.py) is now able to handle either timestamp or ISO dates, and this is what metrics.py will use for summarize_by_yyyy_mm_dd.