⚡️ Use previously-read transition entries instead of re-reading them #1035

shankari · 2025-03-03T08:19:17Z

While testing the improvements in
#1017

I noticed that there were multiple calls to get_data_df during TRIP_SEGMENTATION, even though we should be reading all the values upfront.

#1017 (comment) #1017 (comment)

On further investigation, this turned out to be from the is_tracking_restarted_in_range function, which is used to determine whether we need to add untracked time entries or not.
#1017 (comment)

This is O(p), so it is not very intensive, but it is also redundant since we start off by reading all transitions in the active time range into memory.

We tried to use the newly added tracking_restarted_in_loc_df but it only has comparisons with the previous point. We tried to see if we could determine the range to consider and then check for all() but the tracking_restarted flag is not available outside of the dist/time subclasses.
#1017 (comment) #1017 (comment)

So we plumbed the transition_df through the codebase and used it directly.

is_tracking_restarted used to take a timeseries or a transition_df, we have replaced it with the transition_df since we don't want to read from the database any more.

Testing done:

all "real data" unit tests pass locally

$ ./e-mission-py.bash emission/tests/analysisTests/intakeTests/TestPipelineRealData.py
----------------------------------------------------------------------
Ran 36 tests in 159.777s

OK

While testing the improvements in e-mission#1017 I noticed that there were multiple calls to `get_data_df` during TRIP_SEGMENTATION, even though we should be reading all the values upfront. e-mission#1017 (comment) e-mission#1017 (comment) On further investigation, this turned out to be from the `is_tracking_restarted_in_range` function, which is used to determine whether we need to add untracked time entries or not. e-mission#1017 (comment) This is O(p), so it is not very intensive, but it is also redundant since we start off by reading all transitions in the active time range into memory. We tried to use the newly added `tracking_restarted_in_loc_df` but it only has comparisons with the previous point. We tried to see if we could determine the range to consider and then check for `all()` but the `tracking_restarted` flag is not available outside of the dist/time subclasses. e-mission#1017 (comment) e-mission#1017 (comment) So we plumbed the transition_df through the codebase and used it directly. `is_tracking_restarted` used to take a timeseries or a transition_df, we have replaced it with the transition_df since we don't want to read from the database any more. Testing done: - all "real data" unit tests pass locally ``` $ ./e-mission-py.bash emission/tests/analysisTests/intakeTests/TestPipelineRealData.py ---------------------------------------------------------------------- Ran 36 tests in 159.777s OK ```

shankari · 2025-03-03T09:17:29Z

Running this on the ccebikes data for our three canonical users
e-mission/e-mission-docs#1109

with USE_HINTS not set (aka false)

we get

Before

2025-03-02 15:53:33,605:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, start_ts is None
2025-03-02 17:22:13,370:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, last_ts_processed = 2024-11-26T23:47:34.319918

2025-03-02 15:53:26,515:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, start_ts is None
2025-03-02 18:30:41,251:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, last_ts_processed = 2025-02-12T00:10:28.813000

2025-03-02 15:53:25,494:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, start_ts is None
2025-03-02 16:01:41,723:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, last_ts_processed = 2025-02-12T00:06:25.235000

After

2025-03-03 01:10:01,132:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, start_ts is None
2025-03-03 01:10:17,616:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, last_ts_processed = 2024-11-26T23:47:34.319918

2025-03-03 01:09:58,731:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, start_ts is None
2025-03-03 01:12:22,018:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, last_ts_processed = 2025-02-12T00:10:28.813000

2025-03-03 01:09:53,067:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, start_ts is None
2025-03-03 01:14:31,541:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, last_ts_processed = 2025-02-12T00:06:25.235000

The execution time has gone down from hours to minutes.

shankari · 2025-03-03T09:18:15Z

@JGreenlee @TeachMeTW for visibility. It may be possible to clean up this code further - e.g. put the dataframes into a thread context instead of passing them around as arguments. But not sure that is the highest priority now.

shankari merged commit 869796b into e-mission:master Mar 3, 2025
5 checks passed

shankari mentioned this pull request Mar 7, 2025

🚀 Section Segmentation Optimization #1036

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

⚡️ Use previously-read transition entries instead of re-reading them #1035

⚡️ Use previously-read transition entries instead of re-reading them #1035

shankari commented Mar 3, 2025

shankari commented Mar 3, 2025

shankari commented Mar 3, 2025 •

edited

Loading

⚡️ Use previously-read transition entries instead of re-reading them #1035

⚡️ Use previously-read transition entries instead of re-reading them #1035

Conversation

shankari commented Mar 3, 2025

shankari commented Mar 3, 2025

Before

After

shankari commented Mar 3, 2025 • edited Loading

shankari commented Mar 3, 2025 •

edited

Loading