Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Journey on how to fix the pipeline ⛏️ #689

Closed
larbizard opened this issue Dec 7, 2021 · 18 comments
Closed

Journey on how to fix the pipeline ⛏️ #689

larbizard opened this issue Dec 7, 2021 · 18 comments

Comments

@larbizard
Copy link

Steps to reproduce:
Execute analysis pipeline using /e-mission-py.bash bin/debug/intake_single_user.py -e "[email protected]"

Expected result:
The pipeline executres corretrly and the data is analyzed and stored in the Staging_analysis_timeseries collection.

Actuel result:
(emission) root@e-mission-server-fabmob-qc:~/e-mission-server# ./e-mission-py.bash bin/debug/intake_single_user.py -e "[email protected]"
Connecting to database URL db
google maps key not configured, falling back to nominatim
nominatim not configured either, place decoding must happen on the client
transit stops query not configured, falling back to default
expectations.conf.json not configured, falling back to sample, default configuration
ERROR:root:habitica not configured, game functions not supported
Traceback (most recent call last):
File "/root/e-mission-server/emission/net/ext_service/habitica/proxy.py", line 22, in
key_file = open('conf/net/ext_service/habitica.json')
FileNotFoundError: [Errno 2] No such file or directory: 'conf/net/ext_service/habitica.json'
2021-12-07T11:19:40.559267+00:00UUID f5e92e19-39fa-4598-ad66-ebfc47cb9e34: moving to long term
2021-12-07T11:19:40.600720+00:00UUID f5e92e19-39fa-4598-ad66-ebfc47cb9e34: updating incoming user inputs
2021-12-07T11:19:40.886939+00:00UUID f5e92e19-39fa-4598-ad66-ebfc47cb9e34: filter accuracy if needed
2021-12-07T11:19:41.452110+00:00UUID f5e92e19-39fa-4598-ad66-ebfc47cb9e34: segmenting into trips
Found error curr_state.curr_run_ts = 1635335308.6691191 while processing pipeline for user f5e92e19-39fa-4598-ad66-ebfc47cb9e34, skipping
Traceback (most recent call last):
File "/root/e-mission-server/emission/pipeline/intake_stage.py", line 73, in run_intake_pipeline
run_intake_pipeline_for_user(uuid)
File "/root/e-mission-server/emission/pipeline/intake_stage.py", line 122, in run_intake_pipeline_for_user
eaist.segment_current_trips(uuid)
File "/root/e-mission-server/emission/analysis/intake/segmentation/trip_segmentation.py", line 52, in segment_current_trips
time_query = epq.get_time_range_for_segmentation(user_id)
File "/root/e-mission-server/emission/storage/pipeline_queries.py", line 55, in get_time_range_for_segmentation
return get_time_range_for_stage(user_id, ps.PipelineStages.TRIP_SEGMENTATION)
File "/root/e-mission-server/emission/storage/pipeline_queries.py", line 308, in get_time_range_for_stage
assert curr_state.curr_run_ts is None, "curr_state.curr_run_ts = %s" % curr_state.curr_run_ts
AssertionError: curr_state.curr_run_ts = 1635335308.6691191

@shankari
Copy link
Contributor

shankari commented Dec 7, 2021

@larbizard this is a dup of #473

As you can see from https://github.com/e-mission/e-mission-docs/blob/6a0c1ec9ddc31036d64dfc087458bfa46b94cecf/docs/dev/archi/pipeline_details.md#curr_run_ts---while-processing-pipeline the curr_run_ts essentially acts as a lock for a particular (user, stage) combination. If two processes try to operate on the same stage for the same user, the process that enters the stage first grabs the lock. The process that tries to operate second cannot grab the lock and skips the stage.

This means that you either:

  • have two runs of the pipeline for the same user running in parallel, OR
  • had a previous run of the pipeline for this user abort. This could have happened:
    • by your killing the run using Ctrl+C OR
    • by the server/process crashing

You can fix it by resetting the pipeline (bin/reset_pipeline.py) for the user and then relaunching it exactly once.

@shankari shankari closed this as completed Dec 7, 2021
@larbizard
Copy link
Author

Since last may 15 the pipeline started crashing again. Was the pipeline analysing data for your account ?

Neither me nor Raouf have analysed data. We only have Draft icon on our phone since then. Same case for many users.

Also we noticed in the logs that there are pipeline errors for some users. Theses users have probably not used the app since our last update, they maybe even deleted the app. Shall we purge them ?

We are thinking of executing the pipeline only for specific users (Known active users who have installed the new version after april 29) using bin/debug/intake_single_user.py

We tryed to execute the pipeline for Raouf and it passed but nothing was updated.
Then we reset the pipeline for Raouf after 2022-05-16 and executed it again then it failed with the following logs:

(emission) root@e-mission-server-fabmob-qc:~/e-mission-server# ./e-mission-py.bash bin/debug/intake_single_user.py -u cbeb55e5969943a6911bb44b283b45d1
Connecting to database URL db
google maps key not configured, falling back to nominatim
nominatim not configured either, place decoding must happen on the client
transit stops query not configured, falling back to default
expectations.conf.json not configured, falling back to sample, default configuration
ERROR:root:habitica not configured, game functions not supported
Traceback (most recent call last):
File "/root/e-mission-server/emission/net/ext_service/habitica/proxy.py", line 22, in
key_file = open('conf/net/ext_service/habitica.json')
FileNotFoundError: [Errno 2] No such file or directory: 'conf/net/ext_service/habitica.json'
2022-05-19T14:48:17.500481+00:00UUID cbeb55e5-9699-43a6-911b-b44b283b45d1: moving to long term
2022-05-19T14:48:17.569037+00:00UUID cbeb55e5-9699-43a6-911b-b44b283b45d1: updating incoming user inputs
2022-05-19T14:48:17.846348+00:00UUID cbeb55e5-9699-43a6-911b-b44b283b45d1: filter accuracy if needed
2022-05-19T14:48:19.379204+00:00UUID cbeb55e5-9699-43a6-911b-b44b283b45d1: segmenting into trips
2022-05-19T14:48:19.860724+00:00UUID cbeb55e5-9699-43a6-911b-b44b283b45d1: segmenting into sections
2022-05-19T14:48:21.303548+00:00UUID cbeb55e5-9699-43a6-911b-b44b283b45d1: smoothing sections
2022-05-19T14:48:22.476714+00:00UUID cbeb55e5-9699-43a6-911b-b44b283b45d1: cleaning and resampling timeline
2022-05-19T14:48:22.557741+00:00UUID cbeb55e5-9699-43a6-911b-b44b283b45d1: inferring transportation mode
2022-05-19T14:48:34.239456+00:00UUID cbeb55e5-9699-43a6-911b-b44b283b45d1: inferring labels
2022-05-19T14:48:34.300314+00:00UUID cbeb55e5-9699-43a6-911b-b44b283b45d1: populating expectations
2022-05-19T14:48:34.357588+00:00UUID cbeb55e5-9699-43a6-911b-b44b283b45d1: creating confirmed objects **********
2022-05-19T14:48:34.500369+00:00
UUID cbeb55e5-9699-43a6-911b-b44b283b45d1: storing views to cache**********
Found error Found len(ret_list) = 2, expected <=1 while processing trip Entry({'_id': ObjectId('628195a21f61a1a85cc6bb13'), 'user_id': UUID('cbeb55e5-9699-43a6-911b-b44b283b45d1'), 'metadata': {'key': 'analysis/cleaned_trip', 'platform': 'server', 'write_ts': 1652659618.866277, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2022, 'month': 5, 'day': 15, 'hour': 17, 'minute': 6, 'second': 58, 'weekday': 6, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2022-05-15T17:06:58.866277-07:00'}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1652655756.719, 'end_local_dt': {'year': 2022, 'month': 5, 'day': 15, 'hour': 19, 'minute': 2, 'second': 36, 'weekday': 6, 'timezone': 'America/Toronto'}, 'end_fmt_time': '2022-05-15T19:02:36.719000-04:00', 'end_loc': {'type': 'Point', 'coordinates': [-73.5520783, 45.5460571]}, 'raw_trip': ObjectId('628195a01f61a1a85cc6bb0c'), 'start_ts': 1652655482.3305671, 'start_local_dt': {'year': 2022, 'month': 5, 'day': 15, 'hour': 18, 'minute': 58, 'second': 2, 'weekday': 6, 'timezone': 'America/Toronto'}, 'start_fmt_time': '2022-05-15T18:58:02.330567-04:00', 'start_loc': {'type': 'Point', 'coordinates': [-73.543472, 45.5519273]}, 'duration': 274.38843297958374, 'distance': 1592.1785739074278, 'start_place': ObjectId('6281883e6a1b862dcba6934d'), 'end_place': ObjectId('628195a31f61a1a85cc6bb21')}})
Traceback (most recent call last):
File "/root/e-mission-server/emission/analysis/plotting/geojson/geojson_feature_converter.py", line 288, in get_geojson_for_timeline
trip_geojson = trip_to_geojson(trip, tl)
File "/root/e-mission-server/emission/analysis/plotting/geojson/geojson_feature_converter.py", line 247, in trip_to_geojson
section_gj = section_to_geojson(section, tl)
File "/root/e-mission-server/emission/analysis/plotting/geojson/geojson_feature_converter.py", line 156, in section_to_geojson
ise = esds.cleaned2inferred_section(section.user_id, section.get_id())
File "/root/e-mission-server/emission/storage/decorations/section_queries.py", line 45, in cleaned2inferred_section
curr_predicted_entry = _get_inference_entry_for_section(user_id, section_id, "analysis/inferred_section", "data.cleaned_section")
File "/root/e-mission-server/emission/storage/decorations/section_queries.py", line 56, in _get_inference_entry_for_section
assert len(ret_list) <= 1, "Found len(ret_list) = %d, expected <=1" % len(ret_list)
AssertionError: Found len(ret_list) = 2, expected <=1
Storing views to cache failed for user cbeb55e5-9699-43a6-911b-b44b283b45d1
Traceback (most recent call last):
File "/root/e-mission-server/emission/net/usercache/builtin_usercache_handler.py", line 109, in storeViewsToCache
self.storeTimelineToCache(time_query)
File "/root/e-mission-server/emission/net/usercache/builtin_usercache_handler.py", line 180, in storeTimelineToCache
trip_gj_list = self.get_trip_list_for_seven_days(start_ts)
File "/root/e-mission-server/emission/net/usercache/builtin_usercache_handler.py", line 269, in get_trip_list_for_seven_days
trip_gj_list = gfc.get_geojson_for_ts(self.user_id, seventy_two_hours_ago_ts, start_ts)
File "/root/e-mission-server/emission/analysis/plotting/geojson/geojson_feature_converter.py", line 272, in get_geojson_for_ts
return get_geojson_for_timeline(user_id, tl)
File "/root/e-mission-server/emission/analysis/plotting/geojson/geojson_feature_converter.py", line 293, in get_geojson_for_timeline
raise e
File "/root/e-mission-server/emission/analysis/plotting/geojson/geojson_feature_converter.py", line 288, in get_geojson_for_timeline
trip_geojson = trip_to_geojson(trip, tl)
File "/root/e-mission-server/emission/analysis/plotting/geojson/geojson_feature_converter.py", line 247, in trip_to_geojson
section_gj = section_to_geojson(section, tl)
File "/root/e-mission-server/emission/analysis/plotting/geojson/geojson_feature_converter.py", line 156, in section_to_geojson
ise = esds.cleaned2inferred_section(section.user_id, section.get_id())
File "/root/e-mission-server/emission/storage/decorations/section_queries.py", line 45, in cleaned2inferred_section
curr_predicted_entry = _get_inference_entry_for_section(user_id, section_id, "analysis/inferred_section", "data.cleaned_section")
File "/root/e-mission-server/emission/storage/decorations/section_queries.py", line 56, in _get_inference_entry_for_section
assert len(ret_list) <= 1, "Found len(ret_list) = %d, expected <=1" % len(ret_list)
AssertionError: Found len(ret_list) = 2, expected <=1

We reste again and we had the following error:

(emission) root@e-mission-server-fabmob-qc:~/e-mission-server# ./e-mission-py.bash bin/debug/intake_single_user.py -u cbeb55e5969943a6911bb44b283b45d1
Connecting to database URL db
google maps key not configured, falling back to nominatim
nominatim not configured either, place decoding must happen on the client
transit stops query not configured, falling back to default
expectations.conf.json not configured, falling back to sample, default configuration
ERROR:root:habitica not configured, game functions not supported
Traceback (most recent call last):
File "/root/e-mission-server/emission/net/ext_service/habitica/proxy.py", line 22, in
key_file = open('conf/net/ext_service/habitica.json')
FileNotFoundError: [Errno 2] No such file or directory: 'conf/net/ext_service/habitica.json'
2022-05-19T15:33:32.390398+00:00UUID cbeb55e5-9699-43a6-911b-b44b283b45d1: moving to long term
2022-05-19T15:33:32.440895+00:00UUID cbeb55e5-9699-43a6-911b-b44b283b45d1: updating incoming user inputs
2022-05-19T15:33:32.688896+00:00UUID cbeb55e5-9699-43a6-911b-b44b283b45d1: filter accuracy if needed
2022-05-19T15:33:33.336430+00:00UUID cbeb55e5-9699-43a6-911b-b44b283b45d1: segmenting into trips
Found error curr_state.curr_run_ts = 1652973727.5321794 while processing pipeline for user cbeb55e5-9699-43a6-911b-b44b283b45d1, skipping
Traceback (most recent call last):
File "/root/e-mission-server/emission/pipeline/intake_stage.py", line 73, in run_intake_pipeline
run_intake_pipeline_for_user(uuid)
File "/root/e-mission-server/emission/pipeline/intake_stage.py", line 122, in run_intake_pipeline_for_user
eaist.segment_current_trips(uuid)
File "/root/e-mission-server/emission/analysis/intake/segmentation/trip_segmentation.py", line 52, in segment_current_trips
time_query = epq.get_time_range_for_segmentation(user_id)
File "/root/e-mission-server/emission/storage/pipeline_queries.py", line 55, in get_time_range_for_segmentation
return get_time_range_for_stage(user_id, ps.PipelineStages.TRIP_SEGMENTATION)
File "/root/e-mission-server/emission/storage/pipeline_queries.py", line 308, in get_time_range_for_stage
assert curr_state.curr_run_ts is None, "curr_state.curr_run_ts = %s" % curr_state.curr_run_ts
AssertionError: curr_state.curr_run_ts = 1652973727.5321794

We tryed with my account also reset and execute the pipeline there where no errors but nothing changed I still have Draft trips.

@shankari shankari reopened this May 20, 2022
@shankari
Copy link
Contributor

@larbizard

first,

 assert curr_state.curr_run_ts is None, "curr_state.curr_run_ts = %s" % curr_state.curr_run_ts
AssertionError: curr_state.curr_run_ts = 1652973727.5321794

means that either your reset failed, or you are running two copies of the pipeline at the same time.

second,

 AssertionError: Found len(ret_list) = 2, expected <=1

is also similar, we expect there to be one inference per section and we found two.

We tryed with my account also reset and execute the pipeline there where no errors but nothing changed I still have Draft trips.

wrt this can you send me the logs for executing your pipeline? If you executed the pipeline using bin/debug/intake_single_user.py, the logs should be in /var/tmp/intake_single.log

@lgharib
Copy link

lgharib commented Jul 12, 2022

Hi @shankari,

I did a reset pipeline using:

./e-mission-py.bash bin/reset_pipeline.py -u aee1cab98e294ddda67726dea5abd8a3 -d 2022-07-10

And it seemed to work correctly no errors.

Then I am using the following script to start the pipeline for my user:

./e-mission-py.bash bin/debug/intake_single_user.py

Unfortunatly at the last step, :UUID aee1cab9-8e29-4ddd-a677-26dea5abd8a3: storing views to cache we have an error.

Since the log file is too big I past it in this Google Doc file 👍
https://docs.google.com/document/d/18T3tOGTZjysD6fzIVL1p4v1eS8eSunVCDpz59CEC_WQ/edit?usp=sharing

Thank you for your help,

Larbi

@shankari
Copy link
Contributor

shankari commented Jul 12, 2022

@lgharib

tl;dr: You are running reset_pipeline.py incorrectly. -d 2022-07-10 indicates that the pipeline should be reset to 2022-07-10 aka day before yesterday, which won't actually delete any of the prior analysis results (aka duplicate entries from 2022-05-15)

$ ./e-mission-py.bash bin/reset_pipeline.py --help
storage not configured, falling back to sample, default configuration
URL not formatted, defaulting to "Stage_database"
Connecting to database URL localhost
usage: reset_pipeline.py [-h]
                         (-a | -p {android,ios} | -u USER_LIST [USER_LIST ...] | -e EMAIL_LIST [EMAIL_LIST ...])
                         [-d DATE] [-n]

optional arguments:
  -h, --help            show this help message and exit
  -a, --all             reset the pipeline for all users
  -p {android,ios}, --platform {android,ios}
                        reset the pipeline for all on the specified platform
  -u USER_LIST [USER_LIST ...], --user_list USER_LIST [USER_LIST ...]
                        user ids to reset the pipeline for
  -e EMAIL_LIST [EMAIL_LIST ...], --email_list EMAIL_LIST [EMAIL_LIST ...]
                        email addresses to reset the pipeline for
  -d DATE, --date DATE  date to reset the pipeline to. Format 'YYYY-mm-dd'
                        e.g. 2016-02-17. Interpreted in UTC, so 2016-02-17
                        will reset the pipeline to 2016-02-16T16:00:00-08:00
                        in the pacific time zone
  -n, --dry_run         do everything except actually perform the operations

details

The backtrace is

2022-07-12 08:45:49,959:ERROR:139621497538368:Storing views to cache failed for user aee1cab9-8e29-4ddd-a677-26dea5abd8a3
Traceback (most recent call last):
  File "/root/e-mission-server/emission/net/usercache/builtin_usercache_handler.py", line 109, in storeViewsToCache
    self.storeTimelineToCache(time_query)
...
  File "/root/e-mission-server/emission/storage/decorations/section_queries.py", line 45, in cleaned2inferred_section
    curr_predicted_entry = _get_inference_entry_for_section(user_id, section_id, "analysis/inferred_section", "data.cleaned_section")
  File "/root/e-mission-server/emission/storage/decorations/section_queries.py", line 56, in _get_inference_entry_for_section
    assert len(ret_list) <= 1, "Found len(ret_list) = %d, expected <=1" % len(ret_list)
AssertionError: Found len(ret_list) = 2, expected <=1

So it looks like there are two inferred sections for the cleaned section.

The erroneous trip appears to be

2022-07-12 08:45:49,955:ERROR:139621497538368:Found error Found len(ret_list) = 2, expected <=1 while processing trip Entry({'_id': ObjectId('62ccafb3e86c73263a959dde'), 'user_id': UUID('...), 'metadata': {'key': 'analysis/cleaned_trip', 'platform': 'server', 'write_ts': 1657581491.3980167, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {...}, 'data': {'source': 'DwellSegmentationTimeFilter', 'end_ts': 1652621530.061, 'end_local_dt': {...}, 'end_fmt_time': '2022-05-15...', 'end_loc': {'type': 'Point', 'coordinates': [...]}, 'raw_trip': ObjectId('62ccaf6be86c73263a958825'), 'start_ts': ..., 'start_local_dt': {...'timezone': 'Africa/Casablanca'}, 'start_fmt_time': '2022-05-15T...', 'start_loc': {'type': 'Point', 'coordinates': [...]}, 'duration': 791.6627106666565, 'distance': 154.4805688772839, 'start_place': ObjectId('62ccafc8e86c73263a959e53'), 'end_place': ObjectId('62ccafc8e86c73263a959e54')}})

And the related section seems to be:

2022-07-12 08:45:48,069:DEBUG:139621497538368:About to execute query {'user_id': UUID('aee1cab9-8e29-4ddd-a677-26dea5abd8a3'), 'data.trip_id': ObjectId('62ccafb3e86c73263a959dde'), 'metadata.key': 'analysis/cleaned_section'} with sort_key data.start_ts
2022-07-12 08:45:49,830:DEBUG:139621497538368:keys = (analysis/cleaned_stop, analysis/cleaned_section), len(places) = 0, len(trips) = 1
2022-07-12 08:45:49,830:DEBUG:139621497538368:places_or_stops = []
2022-07-12 08:45:49,830:DEBUG:139621497538368:trips_or_sections = [ObjectId('62ccafb3e86c73263a959de0')]

Looking at the MODE_INFERENCE, the step has already been run so we don't re-run it.

2022-07-12 08:43:51,551:INFO:139621497538368:**********UUID aee1cab9-8e29-4ddd-a677-26dea5abd8a3: inferring transportation mode**********
2022-07-12 08:43:51,554:INFO:139621497538368:For stage PipelineStages.MODE_INFERENCE, start_ts = 2022-05-15T13:32:15.061000

It doesn't look like reset_pipeline worked properly because almost all the pipeline states had a timestamp until which they were complete.

@lgharib
Copy link

lgharib commented Jul 13, 2022

@shankari How shall I use the rest_pipeline to fix the issue ?

Without the date ?

./e-mission-py.bash bin/reset_pipeline.py -u aee1cab98e294ddda67726dea5abd8a3

@lgharib
Copy link

lgharib commented Jul 13, 2022

Also I saw that you noticed duplicates "aka duplicate entries from 2022-05-15".

I have backed up twice the machine I am working using a snapshot from a week before.

Is it possible that I created duplicates ?

Does the app sync all the missing data or just partial data ?

@shankari
Copy link
Contributor

Also I saw that you noticed duplicates "aka duplicate entries from 2022-05-15".
I have backed up twice the machine I am working using a snapshot from a week before.
Is it possible that I created duplicates ?

Seems unlikely given that we don't see duplicates anywhere else.

Does the app sync all the missing data or just partial data ?

The duplicates are not in app sensed data, but in analysis results.
The app syncs all new data since the last sync.

@shankari
Copy link
Contributor

As an aside, I note that all the pipeline timestamps are for May 14th. Did you stop collecting data then, or restore a backup from May or ??? Just trying to see if there is a reason why you don't have more recent data.

/tmp/pipeline_issue.log:2022-07-12 08:43:48,024:INFO:139621497538368:For stage PipelineStages.USERCACHE, start_ts = 2022-05-14T20:49:14.735000
/tmp/pipeline_issue.log:2022-07-12 08:43:48,057:INFO:139621497538368:For stage PipelineStages.USER_INPUT_MATCH_INCOMING, start_ts = 2022-05-11T13:23:56.202000
/tmp/pipeline_issue.log:2022-07-12 08:43:50,030:INFO:139621497538368:For stage PipelineStages.ACCURACY_FILTERING, start_ts = 2022-04-30T23:42:50.784000
/tmp/pipeline_issue.log:2022-07-12 08:43:50,050:INFO:139621497538368:For stage PipelineStages.TRIP_SEGMENTATION, start_ts = 2022-05-15T13:36:55.645000
/tmp/pipeline_issue.log:2022-07-12 08:43:50,481:INFO:139621497538368:For stage PipelineStages.SECTION_SEGMENTATION, start_ts = 2022-05-15T13:32:15.061000
/tmp/pipeline_issue.log:2022-07-12 08:43:50,646:INFO:139621497538368:For stage PipelineStages.JUMP_SMOOTHING, start_ts = 2022-05-15T13:32:15.061000
/tmp/pipeline_issue.log:2022-07-12 08:43:50,734:INFO:139621497538368:For stage PipelineStages.CLEAN_RESAMPLING, start_ts = 2022-05-15T13:32:10.061000
/tmp/pipeline_issue.log:2022-07-12 08:43:51,554:INFO:139621497538368:For stage PipelineStages.MODE_INFERENCE, start_ts = 2022-05-15T13:32:15.061000
/tmp/pipeline_issue.log:2022-07-12 08:43:51,651:INFO:139621497538368:For stage PipelineStages.LABEL_INFERENCE, start_ts = 2022-05-15T13:32:15.061000
/tmp/pipeline_issue.log:2022-07-12 08:43:51,792:INFO:139621497538368:For stage PipelineStages.EXPECTATION_POPULATION, start_ts = 2022-05-15T13:32:15.061000
/tmp/pipeline_issue.log:2022-07-12 08:43:51,884:INFO:139621497538368:For stage PipelineStages.CREATE_CONFIRMED_OBJECTS, start_ts = 2022-05-15T13:32:15.061000
/tmp/pipeline_issue.log:2022-07-12 08:43:51,970:INFO:139621497538368:For stage PipelineStages.OUTPUT_GEN, start_ts is None

@larbizard
Copy link
Author

We did not stop collecting data.

However after May 14th we notice that the data stopped beeing analysed for Raouf and me.

At that time we tried to reset restpectivly my pipeline and his an also restart the intake single user analysis unsuccessfully.

I restored a backup a month ago on June the 15th (Backups are done on mondays at 10 pm so probably a backup from the monday before june the 13th)

Yesterday
After doing updates to mongodv and pulled the last version of https://github.com/e-mission/e-mission-server/tree/gis-based-mode-detection the server stoped working so I did another restore from a backup from a week ago.

Is there a way to reset all analysis and restart them for all users.

Also when I start a pipeline usin the intake single user pipeline I takes avout 30 minutes to 40 minutes to executes and meanwhile a scheduled bin/intake_multiprocess.py 3 is executed. That maybe what causes the first pipeline to crash.

@shankari
Copy link
Contributor

shankari commented Jul 14, 2022

At that time we tried to reset restpectivly my pipeline and his an also restart the intake single user analysis unsuccessfully.

I wonder if this is related to the incorrect call for reset_pipeline.

Also when I start a pipeline usin the intake single user pipeline I takes avout 30 minutes to 40 minutes to executes and meanwhile a scheduled bin/intake_multiprocess.py 3 is executed. That maybe what causes the first pipeline to crash.

Ah yes, you do not want to execute two pipelines in parallel. I would suggest pulling your data to your laptop and running the pipeline there to ensure that it works and debug the issues if any.

Then, on production, just reset the pipeline and don't run the single_intake.py. When intake_multiprocess runs, it will run the pipeline on your data as well.

Is there a way to reset all analysis and restart them for all users.

As you can see, reset all analysis is the -a option instead of -u for a single user. The intake_multiprocess will re-run the pipeline for all users and catch up after the reset. Note that this may take a few hours, so you want to disable the cronjob that runs the intake pipeline until the "catch up after reset" manually scheduled job is done.

@larbizard
Copy link
Author

I turned off the crontab that executes the multiprocess.
I also executed the purge analyse script to remove any duplicates.
I checked the data on my phone before May the 14 th, no data was analysed all draft.
I executed the intake single process ( 3 h of waiting for my data )
The data analysed before the 14 th was analysed correctly but after that date still draft.
Wither the data is not synchronized or the data is corrupted

@shankari
Copy link
Contributor

shankari commented Jul 14, 2022

  • how did you purge analyse? Please specify exact command line so that I can verify that it was correct
  • what do the logs for the intake_single_ process show?
  • wrt sync, what is the last data point that you see on the server
import edb.core.get_database as edb
from uuid import UUID
# find most 3 recent locations in the usercache 
list(edb.get_usercache_db().find({"user_id": UUID("<your uuid>"), "metadata.key": "background/location"}).sort("metadata.write_ts":, -1).limit(3))
# find most 3 recent locations in the timeseries 
list(edb.get_timeseries_db().find({"user_id": UUID("<your uuid>"), "metadata.key": "background/location"}).sort("metadata.write_ts":, -1).limit(3))

@lgharib
Copy link

lgharib commented Jul 15, 2022

I used:

e-mission-py.bash bin/purge_analysis_database.py

According to the code anaylisis, common places common trips and pipeline states are removed :
def purgeAnalysisData():
print(edb.get_analysis timeseries_db().remove()),
print(edb.get_common_place_db().remove())
print(edb.get_common_trip_db().remove())
print(edb.get_pipeline_state_db().remove())

Then I executed :

e-mission-py.bash bin/debug/intake_single_user.py -u aeelcab98e294ddda67726dea5abd8a3

I didn't output the logs in a file. I will execute the script again and send them.

I searched for the last synched data on the server for my user and the last data is from may 15 th however other users had their data synched correctly yeasterday.

My data:

db.Stage_timeseries.find({"user_id" : BinData(3,"ruHKuY4pTd2mdybepavYow=="), "metadata.key": "background/location"}).sort({"metadata.write_ts": -1}).limit(3)
{ "_id" : ObjectId("628101f399243fc036d9096f"), "user_id" : BinData(3,"ruHKuY4pTd2mdybepavYow=="), "metadata" : { "key" : "background/location", "platform" : "android", "read_ts" : 0, "time_zone" : "Africa/Casablanca", "type" : "sensor-data", "write_ts" : 1652621810.623, "write_local_dt" : { "year" : 2022, "month" : 5, "day" : 15, "hour" : 14, "minute" : 36, "second" : 50, "weekday" : 6, "timezone" : "Africa/Casablanca" }, "write_fmt_time" : "2022-05-15T14:36:50.623000+01:00" }, "data" : { "accuracy" : 16.566, "altitude" : 459.29998779296875, "elapsedRealtimeNanos" : NumberLong("168191300295515"), "filter" : "time", "fmt_time" : "2022-05-15T14:36:50+01:00", "latitude" : 34.0326923, "longitude" : -5.0061007, "sensed_speed" : 0, "ts" : 1652621810, "local_dt" : { "year" : 2022, "month" : 5, "day" : 15, "hour" : 14, "minute" : 36, "second" : 50, "weekday" : 6, "timezone" : "Africa/Casablanca" }, "loc" : { "type" : "Point", "coordinates" : [ -5.0061007, 34.0326923 ] }, "heading" : 0 } }
{ "_id" : ObjectId("628101f399243fc036d90969"), "user_id" : BinData(3,"ruHKuY4pTd2mdybepavYow=="), "metadata" : { "key" : "background/location", "platform" : "android", "read_ts" : 0, "time_zone" : "Africa/Casablanca", "type" : "sensor-data", "write_ts" : 1652621784.297, "write_local_dt" : { "year" : 2022, "month" : 5, "day" : 15, "hour" : 14, "minute" : 36, "second" : 24, "weekday" : 6, "timezone" : "Africa/Casablanca" }, "write_fmt_time" : "2022-05-15T14:36:24.297000+01:00" }, "data" : { "accuracy" : 8.273, "altitude" : 459.29998779296875, "elapsedRealtimeNanos" : NumberLong("168164949000000"), "filter" : "time", "fmt_time" : "2022-05-15T14:36:24.190000+01:00", "latitude" : 34.0326917, "longitude" : -5.0061008, "sensed_speed" : 0, "ts" : 1652621784.19, "local_dt" : { "year" : 2022, "month" : 5, "day" : 15, "hour" : 14, "minute" : 36, "second" : 24, "weekday" : 6, "timezone" : "Africa/Casablanca" }, "loc" : { "type" : "Point", "coordinates" : [ -5.0061008, 34.0326917 ] }, "heading" : 90 } }
{ "_id" : ObjectId("628101f399243fc036d90965"), "user_id" : BinData(3,"ruHKuY4pTd2mdybepavYow=="), "metadata" : { "key" : "background/location", "platform" : "android", "read_ts" : 0, "time_zone" : "Africa/Casablanca", "type" : "sensor-data", "write_ts" : 1652621780.586, "write_local_dt" : { "year" : 2022, "month" : 5, "day" : 15, "hour" : 14, "minute" : 36, "second" : 20, "weekday" : 6, "timezone" : "Africa/Casablanca" }, "write_fmt_time" : "2022-05-15T14:36:20.586000+01:00" }, "data" : { "accuracy" : 8.273, "altitude" : 459, "elapsedRealtimeNanos" : NumberLong("168161295804589"), "filter" : "time", "fmt_time" : "2022-05-15T14:36:20+01:00", "latitude" : 34.0326917, "longitude" : -5.0061008, "sensed_speed" : 0, "ts" : 1652621780, "local_dt" : { "year" : 2022, "month" : 5, "day" : 15, "hour" : 14, "minute" : 36, "second" : 20, "weekday" : 6, "timezone" : "Africa/Casablanca" }, "loc" : { "type" : "Point", "coordinates" : [ -5.0061008, 34.0326917 ] }, "heading" : 0 } }

All other users:

db.Stage_timeseries.find({"metadata.key": "background/location"}).sort({"metadata.write_ts": -1}).limit(3)
{ "_id" : ObjectId("62cfa5734ae51d1abf3cdcd9"), "user_id" : BinData(3,"+U0w/rUMTLmEZv0OCwPIGw=="), "metadata" : { "key" : "background/location", "platform" : "android", "read_ts" : 0, "time_zone" : "America/Toronto", "type" : "sensor-data", "write_ts" : 1657775475.165, "write_local_dt" : { "year" : 2022, "month" : 7, "day" : 14, "hour" : 1, "minute" : 11, "second" : 15, "weekday" : 3, "timezone" : "America/Toronto" }, "write_fmt_time" : "2022-07-14T01:11:15.165000-04:00" }, "data" : { "accuracy" : 43.952, "altitude" : 130.17681884765625, "elapsedRealtimeNanos" : NumberLong("367635481943485"), "filter" : "time", "fmt_time" : "2022-07-14T01:11:15-04:00", "latitude" : 45.3939464, "longitude" : -71.8895922, "sensed_speed" : 0.025381435, "ts" : 1657775475, "local_dt" : { "year" : 2022, "month" : 7, "day" : 14, "hour" : 1, "minute" : 11, "second" : 15, "weekday" : 3, "timezone" : "America/Toronto" }, "loc" : { "type" : "Point", "coordinates" : [ -71.8895922, 45.3939464 ] }, "heading" : 338.4458 } }
{ "_id" : ObjectId("62cfa5734ae51d1abf3cdcd7"), "user_id" : BinData(3,"+U0w/rUMTLmEZv0OCwPIGw=="), "metadata" : { "key" : "background/location", "platform" : "android", "read_ts" : 0, "time_zone" : "America/Toronto", "type" : "sensor-data", "write_ts" : 1657775466.762, "write_local_dt" : { "year" : 2022, "month" : 7, "day" : 14, "hour" : 1, "minute" : 11, "second" : 6, "weekday" : 3, "timezone" : "America/Toronto" }, "write_fmt_time" : "2022-07-14T01:11:06.762000-04:00" }, "data" : { "accuracy" : 362.166, "altitude" : 131.7021484375, "elapsedRealtimeNanos" : NumberLong("367626960000000"), "filter" : "time", "fmt_time" : "2022-07-14T01:11:06.478000-04:00", "latitude" : 45.3939408, "longitude" : -71.8895891, "sensed_speed" : 0, "ts" : 1657775466.478, "local_dt" : { "year" : 2022, "month" : 7, "day" : 14, "hour" : 1, "minute" : 11, "second" : 6, "weekday" : 3, "timezone" : "America/Toronto" }, "loc" : { "type" : "Point", "coordinates" : [ -71.8895891, 45.3939408 ] }, "heading" : 0 } }
{ "_id" : ObjectId("62cfa5734ae51d1abf3cdcd3"), "user_id" : BinData(3,"+U0w/rUMTLmEZv0OCwPIGw=="), "metadata" : { "key" : "background/location", "platform" : "android", "read_ts" : 0, "time_zone" : "America/Toronto", "type" : "sensor-data", "write_ts" : 1657775442.088, "write_local_dt" : { "year" : 2022, "month" : 7, "day" : 14, "hour" : 1, "minute" : 10, "second" : 42, "weekday" : 3, "timezone" : "America/Toronto" }, "write_fmt_time" : "2022-07-14T01:10:42.088000-04:00" }, "data" : { "accuracy" : 43.952, "altitude" : 131.7021484375, "elapsedRealtimeNanos" : NumberLong("367602481943433"), "filter" : "time", "fmt_time" : "2022-07-14T01:10:42-04:00", "latitude" : 45.3939414, "longitude" : -71.8895871, "sensed_speed" : 0.027612034, "ts" : 1657775442, "local_dt" : { "year" : 2022, "month" : 7, "day" : 14, "hour" : 1, "minute" : 10, "second" : 42, "weekday" : 3, "timezone" : "America/Toronto" }, "loc" : { "type" : "Point", "coordinates" : [ -71.8895871, 45.3939414 ] }, "heading" : 246.21338 } }

I had a look at my loggerDB file from my phone and filtered by "Sync" and I found the following error:

ServerSyncAdapter: 10 Error java.io.IOException while posting converted trips to JSON

I sent you by email a link to my loggerDB 1.4 gb.

@shankari
Copy link
Contributor

I didn't output the logs in a file. I will execute the script again and send them.

You don't need to, the output is automatically logged to /var/tmp/intake_*

ServerSyncAdapter: 10 Error java.io.IOException while posting converted trips to JSON

That seems to explain why all your recent trips are "draft". They were never sent to the server.

@shankari
Copy link
Contributor

shankari commented Jul 15, 2022

@lgharib

This is pretty straightforward. You have not bumped up the message size on ngnix.
Note that OpenPATH caches the data locally and pushes it up to the server periodically.
If you have taken long trips, and/or are off the internet for a while, this data can pile up.
We limit the amount of data sent to 10k entries, but even with that, we want to need to bump up the default ngnix (or other reverse proxy) message size.
#558
1b01245

From your logs:

22,1655286269.312,2022-06-15T02:44:29.312000-07:00,ServerSyncAdapter : Starting sync with push
28,1655286269.3960001,2022-06-15T02:44:29.396000-07:00,BuiltinUserCache : Result count = 10000
29,1655286269.3979998,2022-06-15T02:44:29.398000-07:00,BuiltinUserCache : Reading entry = 0 with key config/consent and write_ts 1.651136789082E9
30,1655286269.43,2022-06-15T02:44:29.430000-07:00,BuiltinUserCache : Reading entry = 500 with key background/filtered_location and write_ts 1.652622500639E9
...
47,1655286270.072,2022-06-15T02:44:30.072000-07:00,BuiltinUserCache : Reading entry = 9000 with key background/filtered_location and write_ts 1.652642667573E9
48,1655286270.1039999,2022-06-15T02:44:30.104000-07:00,BuiltinUserCache : Reading entry = 9500 with key background/location and write_ts 1.652643284619E9
49,1655286270.139,2022-06-15T02:44:30.139000-07:00,BuiltinUserCache : Returning array of length 10000
50,1655286270.1420002,2022-06-15T02:44:30.142000-07:00,"ConnectionSettings : in getConnectURL, connectionSettings = {""connectUrl"":""https:\/\/e-mission.fabmobqc.ca"",""aggregate_call_auth"":""no_auth"",""android"":{""auth"":{""method"":""prompted-auth"",""clientID"":""ignored""}},""ios"":{""auth"":{""method"":""prompted-auth"",""clientID"":""ignored""}}}"
51,1655286270.144,2022-06-15T02:44:30.144000-07:00,"ConnectionSettings : in getConnectURL, returning https://e-mission.fabmobqc.ca"
52,1655286271.3279998,2022-06-15T02:44:31.328000-07:00,CommunicationHelper : Got response org.apache.http.message.BasicHttpResponse@fc5a10c with statusHTTP/1.1 413 Request Entity Too Large
53,1655286271.336,2022-06-15T02:44:31.336000-07:00,ServerSyncAdapter : IO Error java.io.IOException while posting converted trips to JSON

@lgharib
Copy link

lgharib commented Jul 18, 2022

Hi @shankari,

Thank you for your help I was able to execute the pipeline after increasing the memory to 4GB 👍 .

Thank you,

Larbi


This issue can be closed ✔️


Journey on how to fix the pipeline ⛏️

I was able to fix my issue of the pipeline not excuting. ✔️

The discussion has diverged a little bit from the title of the issue.

How to fix Assertion error in pipeline_queries.py AssertionError: curr_state.curr_run_ts = 1635335308.6691191

I fixed the assertion error by using reset pipeline script:

./e-mission-py.bash bin/reset_pipeline.py -u <uuid>

How to fix missing data on the server

The data was not synched between the user's phones and the server.

The main cause was the server Nginx configuration not allowing Large files.

After 2 months of not synchronizing due to another problem explained in the next section (Last trip end was at -1.0) the accumulated data to be synched was growing and reached arround 1 GB.

To fix that issue the solution was to increase the message size in the Nginx configuration file:

vim /etc/nginx/nginx.conf

http {
       ....
       client_max_body_size 10G;
       ....
}

Data not synched: root issue. The last trip didn't end corretly

BuiltinUserCache : Last trip end was at -1.0
BuiltinUserCache : We don't have a completed trip, so we don't want to push anything yet"

If the data is not synched with the server check the loggerDB file in mobile app.

How to check the loggerDB file 📝

You can send the loggerDB by email, otherwise if the file is too large, mine was 1,4 GB, you can try to send it and then right after the popup displays you cancel the operation then you will be able to navigate to the cached forlder in the Mobile app directory on your phone by plugging it to your computer using a cable and copy it to you computer.

You can then use this scrip to add the timestamp and convert the file to CSV and open it with Excel 👍

bin/csv_export_add_date.py

If you find the Last trip end was at -1.0 log message that means that there was an issue in ending a trip.

Solution:

  1. You can either use the End trip and Force Sync under the Developer Zone.
  2. Or wait for the next trip to start and end. You can then Force the synchronization to push the data.

Failing pipeline due to lack of memory Out Of Memory (OOM) 🔢

./e-mission-py.bash: line 8: 3578 Killed PYTHONPATH=. python "$@"

If you execute the pipeline and receive this issue, that's propbably because you have an Out Of Memory (OOM) issue and the process was killed. You will need to increase you memory. I used 4GB to analyse 2 months worth of data.

@shankari shankari changed the title Assertion error in pipeline_queries.py AssertionError: curr_state.curr_run_ts = 1635335308.6691191 Journey on how to fix the pipeline ⛏️ Jul 19, 2022
@shankari
Copy link
Contributor

Thanks @lgharib for the detailed explanation. Changed the subject and closing this issue now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants