-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improving Trip Segmentation by reducing DB calls #958
base: master
Are you sure you want to change the base?
Improving Trip Segmentation by reducing DB calls #958
Conversation
The changes below that led to these performance upgrades are investigated in e-mission/e-mission-docs#1041 . They are : 1. db calls for transition and motion dataframes are moved upstream from `is_tracking_restarted_in_range` function and `get_ongoing_motion_in_range` in `restart_checking.py` to `trip_segmentaiton.py`. The old setting which had multiple db calls ( for each iteration ) now happen once in the improved setting. 2. All the other changes in `trip_segmentation.py`,`dwell_segmentation_dist_filter.py` and `dwell_segmentation_time_filter.py` are just to support the change in point 1 ( above).
Replying to the comment #956 (comment) here :
The current PR just has reduced DB calls. In this case, the runtime for androidWrapper is at ~2.1s ( as mentioned here e-mission/e-mission-docs#1041 (comment) in the issue) , IOS Wrapper at 0.4s and CombinedWrapper at ~2.6s.
I am using the
When the data in
The for loop then iterates on these points
|
This is in regards to the fact that the db call for "background/filtered_location":
Is present in these 3 files This causes an overhead of ~0.2s for androidWrapper and IosWrapper in the current test setup. However, there is nothing else failing here. |
And what was it before? Just reporting the new number does not capture the difference. I don't see a comment in the issue for the time taken in
This is not a sufficient test. |
On second look, for the iOSWrapper, the test reads data from
I believe the IosWrapper is at 0.4s due to lesser trips, which we come to know after running the pipeline. ( 2 in case of iOS vs 8 for android ). I tried other real example files with platform as iOS. The other files with iOS data are : iphone_2016-02-22 , once the example is set (using all of them one by one), there are are either 0 or 2 trips for iOS, giving the similar runtime of ~0.4s - 0.5s. |
The runtimes are :
|
@JGreenlee scalability improvement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that < 10 trips is going to give us a real sense of the scalability improvement. Please run more complex tests:
TestPipelineRealData
for a combination of correctness and performance- against a real large dataset, after resetting the pipeline, for a rough estimate of performance.
transition_df_start_idx=transition_df.ts.searchsorted(start_ts,side='left') | ||
transition_df_end_idx=transition_df.ts.searchsorted(end_ts,side='right') | ||
transition_df_for_current=transition_df.iloc[transition_df_start_idx:transition_df_end_idx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason you are using this instead of something like
transition_df_for_current=transition_df[transition_df.ts >= start_ts && transition_df.ts <= end_ts]
or
transition_df_for_current=transition_df.query('ts >= start_ts & .ts <= end_ts')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First was the performance reason, O(log (n)) here vs O (n) in others.
Second, in case of query(), it creates a boolean series that marks rows to keep or discard, which happens internally, which increases temporary memory usage. For very large DataFrames, this can be an issue.
can use this one ,if log(n) vs O(n) is not an issue :
transition_df_for_current=transition_df[transition_df.ts >= start_ts && transition_df.ts <= end_ts]
motion_df_start_idx=motion_df.ts.searchsorted(start_ts,side='left') | ||
motion_df_end_idx=motion_df.ts.searchsorted(end_ts,side='right') | ||
filtered_motion_df=motion_df.iloc[motion_df_start_idx:motion_df_end_idx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
@@ -7,6 +7,7 @@ | |||
from builtins import * | |||
from builtins import object | |||
import logging | |||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to add this import?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused currently. Removed.
def has_trip_ended(self, lastPoint, currPoint, timeseries): | ||
def has_trip_ended(self, lastPoint, currPoint, timeseries, motion_df): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you are already passing in the motion_df
here, why do you need to also pass in the timeseries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's being used here :
timeseries.invalidate_raw_entry(currPoint["_id"])
if eaisr.is_tracking_restarted_in_range(lastPoint.ts, currPoint.ts, timeseries): | ||
if eaisr.is_tracking_restarted_in_range(lastPoint.ts, currPoint.ts, self.transition_df): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is transition_df
a module field?
ongoing_motion_in_range = eaisr.get_ongoing_motion_in_range(lastPoint.ts, currPoint.ts, timeseries) | ||
ongoing_motion_in_range = eaisr.get_ongoing_motion_in_range(lastPoint.ts, currPoint.ts, motion_df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while motion_df is not. They are both read upfront and should be treated the same way for clarity
non_still_motions = [ma for ma in motionInRange if ma["data"]["type"] not in ignore_modes_list and ma["data"]["confidence"] == 100] | ||
logging.debug("non_still_motions = %s" % [(ecwm.MotionTypes(ma["data"]["type"]), ma["data"]["confidence"], ma["data"]["fmt_time"]) for ma in non_still_motions]) | ||
non_still_motions=motionInRange[~motionInRange['type'].isin(ignore_modes_list) & (motionInRange['confidence'] ==100)] | ||
#logging.debug("non_still_motions = %s" % [(ecwm.MotionTypes(ma["data"]["type"]), ma["data"]["confidence"], ma["data"]["fmt_time"]) for ma in non_still_motions]) logging.debug("non_still_motions = %s" % [(ecwm.MotionTypes(ma["data"]["type"]), ma["data"]["confidence"], ma["data"]["fmt_time"]) for ma in non_still_motions]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change doesn't appear to be related to this fix.
def has_trip_ended(self, prev_point, curr_point, timeseries, last10PointsDistances, last5MinsDistances, last5MinTimes): | ||
def has_trip_ended(self, prev_point, curr_point, timeseries, last10PointsDistances, last5MinsDistances, last5MinTimes, transition_df, motion_df): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto. if we are passing in the transition_df and motion_df, why do we also need the timeseries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, it's unused. Cleared.
Code cleanup as per the comments.
The changes below that led to these performance upgrades are investigated in e-mission/e-mission-docs#1041 . They are :
db calls for transition and motion dataframes are moved upstream from
is_tracking_restarted_in_range
function andget_ongoing_motion_in_range
inrestart_checking.py
totrip_segmentaiton.py
. The old setting which had multiple db calls ( for each iteration ) now happen once in the improved setting.All the other changes in
trip_segmentation.py
,dwell_segmentation_dist_filter.py
anddwell_segmentation_time_filter.py
are just to support the change in point 1 ( above).