🚀 Section Segmentation Optimization #1036
Conversation
LGTM barring some high-level cleanup comments.
I am now going to run this against our 3 canonical users from ccebike before merging.
trip_start_loc = ecwl.Location(trip_filtered_loc_df.iloc[0])
trip_end_loc = ecwl.Location(trip_filtered_loc_df.iloc[-1])
For the record, I considered doing this for the trip segmentation as well, as part of #1035. I think this will largely work except for the LocalDate part, which is expanded in the dataframe and would have to be collapsed back into a structure in the Location object.
I even looked into a "reverse json_normalize" 😄 but there was no one-line solution, so I gave up given the tight timeframe.
https://stackoverflow.com/questions/54776916/inverse-of-pandas-json-normalize
It is a good point that we don't really need the LocalDate fields here. However, to avoid confusion down the road (here's a Location object, I expect to be able to access the start_local_dt field but I can't), I would suggest filing a cleanup issue that recreates those structured fields correctly and uses them both here and in trip_segmentation.
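The "reverse json_normalize" idea could be sketched roughly like this: re-nest the flattened `local_dt_*` columns of a dataframe row back into a nested structure. The helper name and the one-level prefix handling are illustrative, not the actual e-mission API:

```python
import pandas as pd

def collapse_prefixed(row_dict, prefix="local_dt_"):
    # Re-nest columns like "local_dt_hour" under a "local_dt" dict --
    # a rough one-level inverse of pandas.json_normalize(..., sep="_").
    nested = {k: v for k, v in row_dict.items() if not k.startswith(prefix)}
    inner = {k[len(prefix):]: v for k, v in row_dict.items() if k.startswith(prefix)}
    if inner:
        nested[prefix.rstrip("_")] = inner
    return nested

# json_normalize flattens the nested local_dt into local_dt_* columns...
df = pd.json_normalize(
    [{"ts": 100, "local_dt": {"hour": 9, "timezone": "America/Los_Angeles"}}],
    sep="_")
# ...and collapse_prefixed rebuilds the nested document from a single row
doc = collapse_prefixed(df.iloc[0].to_dict())
```

This only handles one nesting level, which is why there is no clean one-liner for the general case.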
location in trip_segmentation:
start_loc = get_loc_for_row(start_loc_doc)
Fortunately, the testing shows that looking up by _id is fairly fast (#1035 (comment)), so this is not super high priority.
section.trip_id = trip_entry['_id']
if len(section_entries) == 0:
note to self: we are now building the section list in memory before saving it, so if the list is empty, this is the first point.
start_loc = ecwl.Location(start_loc_doc)
end_loc = ecwl.Location(end_loc_doc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do this for trip_segmentation as well!
def fill_section(section, start_loc, end_loc, sensed_mode, ble_sensed_mode=None):
    section.start_ts = start_loc.ts
-   section.start_local_dt = start_loc.local_dt
+   section.start_local_dt = ecwld.LocalDate.get_local_date(start_loc.ts, start_loc['local_dt_timezone'])
ah, we do use the local date (I had forgotten, although I saw it just last weekend). This is fine, but I do think we should refactor as part of cleanup.
@@ -77,7 +77,7 @@ def mark_sectioning_done(user_id, last_trip_done):
        mark_stage_done(user_id, ps.PipelineStages.SECTION_SEGMENTATION, None)
    else:
        mark_stage_done(user_id, ps.PipelineStages.SECTION_SEGMENTATION,
-                       last_trip_done.data.end_ts + END_FUZZ_AVOID_LTE)
+                       last_trip_done['data']['end_ts'] + END_FUZZ_AVOID_LTE)
Note to self: we need this because the last_trip_done is no longer an entry.
Note to @JGreenlee: we can also wrap the trip in a Trip object like you did with location, in case we need to keep it consistent with other stages.
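The suggested wrapping could look something like the minimal sketch below. `DocWrapper` and the sample values are hypothetical stand-ins (the real Trip/Location wrapper classes are richer); the point is just that attribute access over a raw dict can be preserved, so `last_trip_done.data.end_ts` and `last_trip_done['data']['end_ts']` both keep working:

```python
class DocWrapper(dict):
    """Hypothetical stand-in for a Trip-style wrapper around a raw
    document dict, preserving attribute-style access for stage code."""
    def __getattr__(self, name):
        # Fall back to dict lookup when normal attribute lookup fails,
        # wrapping nested dicts so chained access (x.data.end_ts) works.
        try:
            val = self[name]
        except KeyError:
            raise AttributeError(name)
        return DocWrapper(val) if isinstance(val, dict) else val

END_FUZZ_AVOID_LTE = 0.5  # illustrative value
last_trip_done = DocWrapper({"data": {"end_ts": 1700000000.0}})

# both access styles now work on the same object:
via_attr = last_trip_done.data.end_ts + END_FUZZ_AVOID_LTE
via_keys = last_trip_done["data"]["end_ts"] + END_FUZZ_AVOID_LTE
```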
After #1035, trip segmentation took minutes, but section segmentation took all day. I don't see any errors in the output.
This run finished in 10 mins without any errors; all stages completed successfully.
(Continuation of #1031)
read all entries upfront during section segmentation
Since DB calls are a bottleneck, we can reduce the number of DB calls by making upfront queries for all the required types of entries we'll need across the entire time range we are sectioning.
Then we can filter those down accordingly.
These include: segmentation/raw_place, background/bluetooth_ble, background/motion_activity, background/location, and background/filtered_location
(For some edge cases, statemachine/transition is used – but seeing as it will only be used in those specific cases, it doesn't make sense to load upfront)
The motion activity and locations are used as dataframes later on, so I used get_data_df for those.
trips, places, and ble are queried with find_entries.
Then these are plumbed through to segment_trip_into_sections, filtered to the time range of the current trip, and ultimately passed into each SectionSegmentationMethod's segment_into_sections.
With the locations already pre-loaded as dataframes, I was able to remove get_location_streams_for_trip altogether
For _get_distance_from_start_place_to_end, we preload all places in the segmentation range, so we can usually find the trip's start place in memory by its _id. However, unit tests revealed edge cases where the first start place is before the current segmentation time range, so we must fall back to querying the DB via esda.get_object.
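The prefetch-then-filter pattern described above can be sketched as follows. The function and variable names here are illustrative; the real code uses get_data_df / find_entries for the upfront queries and esda.get_object for the fallback:

```python
def filter_to_range(entries, start_ts, end_ts):
    # Narrow the upfront-loaded entries to one trip's time range in
    # memory, instead of issuing a per-trip DB query.
    return [e for e in entries if start_ts <= e["data"]["ts"] <= end_ts]

def lookup_place(places_by_id, place_id, db_fallback):
    # Usually the start place is in the preloaded map; fall back to a
    # DB lookup when it precedes the segmentation time range.
    place = places_by_id.get(place_id)
    return place if place is not None else db_fallback(place_id)

# toy data standing in for preloaded motion/location entries
all_entries = [{"data": {"ts": t}} for t in (10, 20, 30, 40)]
trip_entries = filter_to_range(all_entries, 15, 35)

places_by_id = {"p1": {"_id": "p1", "enter_ts": 5}}
# a place outside the preload range triggers the fallback lookup
fetched = lookup_place(places_by_id, "p0", lambda pid: {"_id": pid, "enter_ts": 1})
```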
insert section segmentation entries in bulk
We create segmentation/raw_section and segmentation/raw_stop entries in segment_trip_into_sections. We have been inserting them as we create them (which sometimes required updating the previous entry if there is a stop), resulting in numerous insert and update DB calls.
Instead, we can keep a list of entries to be created and then only bulk_insert them at the end of the stage, once we are done updating them.
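The buffered-write idea could be sketched like this. The class and callback names are illustrative (the real stage hands the accumulated entries to the timeseries bulk_insert); the point is that entries stay mutable in memory until one write at the end:

```python
class EntryBuffer:
    """Accumulate raw_section / raw_stop entries in memory; entries can
    still be updated (e.g. linking a stop to the previous section)
    before a single bulk write at the end of the stage."""
    def __init__(self, bulk_insert_fn):
        self._bulk_insert = bulk_insert_fn
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)

    def last(self):
        # allows updating the previous entry without a DB round trip
        return self.entries[-1] if self.entries else None

    def flush(self):
        if self.entries:
            self._bulk_insert(list(self.entries))
        self.entries.clear()

calls = []
buf = EntryBuffer(calls.append)
buf.add({"type": "raw_section"})
buf.add({"type": "raw_stop"})
buf.last()["linked"] = True  # in-memory update, no DB call
buf.flush()                  # exactly one insert call for the stage
```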