Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

⚡️ Use previously-read transition entries instead of re-reading them #1035

Merged

Conversation

shankari
Copy link
Contributor

@shankari shankari commented Mar 3, 2025

While testing the improvements in
#1017

I noticed that there were multiple calls to get_data_df during TRIP_SEGMENTATION, even though we should be reading all the values upfront.

#1017 (comment) #1017 (comment)

On further investigation, this turned out to be from the is_tracking_restarted_in_range function, which is used to determine whether we need to add untracked time entries or not.
#1017 (comment)

This is O(p), so it is not very intensive, but it is also redundant since we start off by reading all transitions in the active time range into memory.

We tried to use the newly added tracking_restarted_in_loc_df but it only has comparisons with the previous point. We tried to see if we could determine the range to consider and then check for all() but the tracking_restarted flag is not available outside of the dist/time subclasses.
#1017 (comment) #1017 (comment)

So we plumbed the transition_df through the codebase and used it directly.

is_tracking_restarted used to take a timeseries or a transition_df, we have replaced it with the transition_df since we don't want to read from the database any more.

Testing done:

  • all "real data" unit tests pass locally
$ ./e-mission-py.bash emission/tests/analysisTests/intakeTests/TestPipelineRealData.py
----------------------------------------------------------------------
Ran 36 tests in 159.777s

OK

While testing the improvements in
e-mission#1017

I noticed that there were multiple calls to `get_data_df` during
TRIP_SEGMENTATION, even though we should be reading all the values upfront.

e-mission#1017 (comment)
e-mission#1017 (comment)

On further investigation, this turned out to be from the
`is_tracking_restarted_in_range` function, which is used to determine whether
we need to add untracked time entries or not.
e-mission#1017 (comment)

This is O(p), so it is not very intensive, but it is also redundant since we
start off by reading all transitions in the active time range into memory.

We tried to use the newly added `tracking_restarted_in_loc_df` but it only has
comparisons with the previous point. We tried to see if we could determine the
range to consider and then check for `all()` but the `tracking_restarted` flag is not
available outside of the dist/time subclasses.
e-mission#1017 (comment)
e-mission#1017 (comment)

So we plumbed the transition_df through the codebase and used it directly.

`is_tracking_restarted` used to take a timeseries or a transition_df, we have
replaced it with the transition_df since we don't want to read from the
database any more.

Testing done:
 - all "real data" unit tests pass locally

```
$ ./e-mission-py.bash emission/tests/analysisTests/intakeTests/TestPipelineRealData.py
----------------------------------------------------------------------
Ran 36 tests in 159.777s

OK
```
@shankari
Copy link
Contributor Author

shankari commented Mar 3, 2025

Running this on the ccebikes data for our three canonical users
e-mission/e-mission-docs#1109

with USE_HINTS not set (aka false)

we get

Before

2025-03-02 15:53:33,605:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, start_ts is None
2025-03-02 17:22:13,370:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, last_ts_processed = 2024-11-26T23:47:34.319918

2025-03-02 15:53:26,515:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, start_ts is None
2025-03-02 18:30:41,251:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, last_ts_processed = 2025-02-12T00:10:28.813000

2025-03-02 15:53:25,494:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, start_ts is None
2025-03-02 16:01:41,723:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, last_ts_processed = 2025-02-12T00:06:25.235000

After

2025-03-03 01:10:01,132:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, start_ts is None
2025-03-03 01:10:17,616:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, last_ts_processed = 2024-11-26T23:47:34.319918

2025-03-03 01:09:58,731:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, start_ts is None
2025-03-03 01:12:22,018:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, last_ts_processed = 2025-02-12T00:10:28.813000

2025-03-03 01:09:53,067:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, start_ts is None
2025-03-03 01:14:31,541:INFO:8474691648:For stage PipelineStages.TRIP_SEGMENTATION, last_ts_processed = 2025-02-12T00:06:25.235000

The execution time has gone down from hours to minutes.

@shankari shankari merged commit 869796b into e-mission:master Mar 3, 2025
5 checks passed
@shankari
Copy link
Contributor Author

shankari commented Mar 3, 2025

@JGreenlee @TeachMeTW for visibility. It may be possible to clean up this code further - e.g. put the dataframes into a thread context instead of passing them around as arguments. But not sure that is the highest priority now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant