Rjf/incremental inference #852

Merged (53 commits) on Aug 11, 2022

Conversation

@robfitzgerald (Contributor) commented May 23, 2022:

this draft PR is an update to make sure i'm going in the right direction. i'm about 8 hours in and expect 4-6 hours remain. the tour_model_first_only module is refactored into two modules with two core abstractions:

emission/
  analysis/
    modelling/
      user_label_model/user_label_prediction_model.py
      similarity/similarity_metric.py

the existing binning-based heuristic has been refactored into user_label_model.GreedySimilarityBinning. user_label_model.run_model contains methods exposed to the inference and build_model phases. a future dev would 1) create a new derived instance of UserLabelPredictionModel and 2) add their model's default initialization to run_model._model_factory. a rough sketch of that layout is shown below.
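
for illustration, a minimal sketch of that structure follows; the method names, constructor arguments, and factory key in this sketch are assumptions, not necessarily the exact interface in this PR:

# illustrative sketch only -- names and signatures here are assumptions
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class UserLabelPredictionModel(ABC):
    """base class a future dev would subclass to add a new prediction model"""

    @abstractmethod
    def fit(self, trips: List[Dict]):
        """train the model on a set of confirmed trips"""

    @abstractmethod
    def predict(self, trip: Dict) -> Tuple[List[Dict], int]:
        """return (label predictions, count of supporting trips) for one trip"""


class GreedySimilarityBinning(UserLabelPredictionModel):
    """the refactored binning-based heuristic (bodies elided)"""

    def __init__(self, sim_thresh: float):
        self.sim_thresh = sim_thresh
        self.bins: Dict[str, Dict] = {}

    def fit(self, trips):
        ...  # assign each trip to the nearest bin, or open a new one

    def predict(self, trip):
        ...  # vote over the labels stored in the nearest bin


def _model_factory(model_type: str, config: Dict) -> UserLabelPredictionModel:
    """new model types register their default initialization here"""
    if model_type == "greedy_similarity_binning":
        return GreedySimilarityBinning(sim_thresh=config["sim_thresh"])
    raise KeyError(f"unknown model type {model_type}")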

remaining work:

  • write tests with Confirmedtrip data against the binning function, debug, confirm
  • read/write to the pipeline to store the incremental timestamp, use as TimeQuery parameter, test + confirm
  • wire run_model.predict_labels_with_n into the inference code
  • wire run_model.update_user_label_model into the bin/build_label_model.py script
  • see about setting various arguments (storage type, model type, distance radius, trip count threshold, model params) via e-mission config
  • fill out stubs for load_db + save_db in user_label_model/util

the objectives here were to

  1. make it easy for future devs to add new prediction models
  2. make it simple for those models to also use incremental data reading
  3. support both file system and database storage (database implementation left to future work)
  4. de-duplicate and disambiguate

i tried to balance what seemed important to keep with the desire to clean up the module. it took me a while to understand what was going on, so i hope i've improved readability. some notable removals and caveats:

  • i left in the "elbow heuristic" used for pruning bins, but can delete that if it's no longer used
  • some code in tour_model.get_request_percentage.requested_trips_bl_cutoff expects us to have held on to the bins that were removed by the elbow heuristic, but i now delete them. if we need to support this, i can fix that

return model


if __name__ == '__main__':
Contributor Author:

oh, these were the test cases at the bottom of the tour_model_first_only/load_predict file, though i'm not sure what dataset they were based on; at least not the dataset documented here

Contributor:

They are based on a private dataset from the CEO e-bike program. However, the dataset doesn't matter as much because we don't actually check the results - only that a result exists. These should be moved out into a separate test suite under emission/tests

Comment on lines 83 to 98
def save_db(user_id, table: str, model_data: Dict):
    """
    saves a user label prediction model to the database.

    data is assumed stored in a document database, with the structure:

        { "user_id": user_id, "data": model_data }

    :param user_id: the user to store data for
    :type user_id: object
    :param table: the table name
    :type table: str
    :param model_data: the data row to store tagged by this user id
    :type model_data: Dict
    """
    pass
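
a minimal sketch of what this write could look like with plain pymongo, assuming one mutable document per user and that the table name maps directly to a collection (the database name and the upsert behavior are assumptions for illustration):

from typing import Dict
import pymongo

def save_db(user_id, table: str, model_data: Dict,
            client: pymongo.MongoClient) -> None:
    """upsert { "user_id": ..., "data": ... } into the named collection"""
    collection = client["Stage_database"][table]  # database name assumed here
    doc = {"user_id": user_id, "data": model_data}
    collection.replace_one({"user_id": user_id}, doc, upsert=True)
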
Contributor:

@robfitzgerald alas, storing to the DB is also a super high priority task at this point. As we work on deploying this in the NREL cloud environment (e-mission/e-mission-docs#721), cloud services would prefer to run the modeling/inference tasks as stateless scheduled tasks. So we cannot store the model to disk, since there is no persistent disk.

We did consider storing the models to S3, but that would then require an S3 bucket approved for moderate use, which would, in turn require additional cyber approval.

Since the database is already approved for moderate use, and mongodb, as a NoSQL database, supports document storage, it seemed easiest to store the models in the database.

Contributor Author:

this is fine, i can work with Andrew on supporting db storage then. but i also assume we still want to support file system storage for non-NRELian users in less-restrictive deployment environments or for testing. i mean, why not, it's already in there 🤷

@shankari (Contributor), May 24, 2022:

@robfitzgerald you will actually want to work with me; Andrew will be focusing on other projects starting next week.

I actually vote for removing the file system storage, but I am open to being persuaded otherwise.

  • Storing it in the database ensures that there is one source of data that needs to be managed wrt backup/encryption etc
  • All OpenPATH users must have access to a database anyway since we store the sensor data there.
  • We only stored in the file system to begin with since @corinne-hcr was not familiar with database operations.

Concretely, the current non-NRELian way to run OpenPATH is as a set of containers deployed using docker-compose, with only the database installed on a persistent volume.

The model files are created in the filesystem of the analysis container. This means that if the containers are removed, the stored models will be lost. This was not as much of an issue when we did not have incremental updates, since we would re-create the model every time, but will be an issue in the future. We can work around this by creating persistent volumes for the analysis container as well, but that is additional work, and means that we can't just remove and re-deploy the containers to get a clean setup.

Contributor Author:

ok, sounds good, i will remove the FS-related utils and we can get the database read/write done within this PR, as long as you're ok with letting go of backwards compatibility here. i'll reach out to you if i have questions about writing to tables, naming conventions, and schemas, but feel free to propose those here if you have ideas beyond my initial assumptions.

Contributor:

@robfitzgerald Backwards compatibility is not a huge concern right now.

At this point, the system regenerates the model from scratch every time. And when it is deployed on the NREL OpenPATH servers, it will not have any file based models to read from anyway.

If we want to be extra nice, we could fall back to reading from file if there is no entry in the database, and then remove the fallback code after a year. That would ensure that if somebody else had been running this modeling pipeline (not aware of anybody other than CanBikeCO at this point), they can continue to use the old model to label incoming trips while the new model is being built. But there is no reason at this point to ever write to the file system.

Again, not aware of anybody other than CanBikeCO who has enabled the new pipeline, so we could also just punt on the backwards compatibility if it is faster.

wrt naming conventions and schema, I have created two main collections and am storing all data in them: Stage_timeseries, accessed via emission.core.get_database.get_stage_timeseries(), and Stage_analysis_timeseries, accessed via emission.core.get_database.get_analysis_timeseries().

Thinking of creating similar Stage_models_timeseries and Stage_results_timeseries.

In the raw and analysis timeseries, the entries follow the same basic format, with data and metadata entries, but with the fields in data being different based on metadata.key. This allows us to run generic metadata-based queries (e.g. on metadata.write_ts) across multiple types of data.
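
for illustration, an entry in these collections looks roughly like the following; the field values are invented, and only the metadata/data split and the metadata.key / metadata.write_ts fields come from the description above:

# illustrative only -- field values are invented
entry = {
    "user_id": "<uuid>",
    "metadata": {
        "key": "analysis/confirmed_trip",  # determines the schema of "data"
        "write_ts": 1660000000.0,          # generic query field across data types
    },
    "data": {
        # fields here vary based on metadata.key
    },
}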

Not sure if the models need to be as time based, or whether they should just be UUID based. Thoughts?

Contributor Author:

so, given your convention of maintaining immutable database entries, we could do that with the label models, and it would just mean that we take the entry with the most recent write_ts when we need to run inference. but i'm not sure how large different models can get. i'm guessing not huge, as the datasets for each user are not huge. if that sounds sensible, we could adopt the convention of immutable records and grab the latest each time.

there's probably not a lot of value in storing all of those models (365 models per year per user). so, the other thing we could do is treat these rows as mutable and only store 1 row per UUID. in that case, metadata.write_ts becomes less a query key and more a housekeeping field. but that would assume that UUID-based indexing is not slow in Stage_analysis_timeseries (or maybe it's motivation to create another collection configured for UUID indexing and designed/labeled for mutable DB operations).
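
a minimal sketch of the "immutable records, grab the latest" option, written against a plain pymongo collection handle (the collection handle and model key below are placeholders, not the actual e-mission helpers):

import pymongo

def load_latest_model(analysis_coll, user_id, model_key="inference/trip_model"):
    """return the most recently written model entry for this user, or None"""
    cursor = (
        analysis_coll
        .find({"user_id": user_id, "metadata.key": model_key})
        .sort("metadata.write_ts", pymongo.DESCENDING)
        .limit(1)
    )
    entries = list(cursor)
    return entries[0] if entries else None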

as for the fs discussion, i'm not trying to advocate for file saving, i'm just being careful, that's all. if it's not helping anyone we can remove that stuff. but it's already implemented, so it's not "faster" or "slower" to keep that functionality.

@shankari (Contributor), Jun 11, 2022:

I think we should start with an immutable object, but just keep the last n entries, with n=3 or n=5 or something. This should strike a good balance between robustness and disk space. We will then need a cleanup cronjob/scheduled script that will go through and delete older entries.
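
a minimal sketch of such a cleanup pass, again assuming a plain pymongo collection handle and placeholder key names:

import pymongo

def prune_old_models(analysis_coll, user_id,
                     model_key="inference/trip_model", keep_n=3):
    """delete all but the newest keep_n model entries for this user"""
    stale = (
        analysis_coll
        .find({"user_id": user_id, "metadata.key": model_key}, {"_id": 1})
        .sort("metadata.write_ts", pymongo.DESCENDING)
        .skip(keep_n)
    )
    stale_ids = [doc["_id"] for doc in stale]
    if stale_ids:
        analysis_coll.delete_many({"_id": {"$in": stale_ids}})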

in the long run, we should remove it because it is obsolete and I don't want to have a bunch of vestigial code sitting around either being maintained or waiting to bitrot.

in the short term, reading from DB with fallback to file seems like a bit more work than reading only from the DB.

@shankari (Contributor):

@robfitzgerald is there an ETA for this change, at least to switch to storing in the database? Once the initial data collection is running in the NREL hosted environment (e-mission/e-mission-docs#732), I would like to enable label assist as well.

LMK if I should take over that part instead.

@robfitzgerald (Contributor Author):

i was dreading this message, knowing full well i promised results by now. i would like to try and wrap this up next week, if that's not too late for you. i'm dealing with a few tasks at work that have been taking much more of my time than anticipated (and that i've had trouble estimating my time for because they are outside of my comfort zone, honestly).

@robfitzgerald (Contributor Author):

@shankari i've spent a little time today reviewing the database queries. i'm looking for how best to record models and retrieve them. two ideas:

  • when i need the model, i retrieve the one with the latest timestamp from the analysis timeseries with matching user_id/key
  • when i need the model, i retrieve the one with the timestamp that matches last_ts_run in the pipeline state with matching user_id/key

looking for your opinion there.

also, if i did want to implement the former (retrieving the latest entry with a matching user_id/key), would we need to add that capability to BuiltinTimeSeries via some new method?

@shankari (Contributor):

when i need the model, i retrieve the one with the latest timestamp from the analysis timeseries with matching user_id/key

I prefer this option, to keep it simple, and not introduce any additional dependencies on the pipeline state.

also, if i did want to implement the former (retrieving the latest entry with a matching user_id/key), would we need to add that capability to BuiltinTimeSeries via some new method?

Already here.
https://github.com/e-mission/e-mission-server/blob/master/emission/storage/timeseries/builtin_timeseries.py#L301

@robfitzgerald (Contributor Author):

update here. database read/write is wired into the save functions along with pipeline updates. what trip data is loaded depends on whether the Model type reports itself as "incremental" or not.

wanted to confirm how this should be wired in. at this point i have simply swapped in the relevant methods in inferrers.py and build_save_model.py. i haven't tested anything; wanted to confirm this design looks correct, and hoping to make time to test the internals, though i'm a little fuzzy on running e2e tests here.
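
a rough sketch of how that incremental/full split could look; the is_incremental flag, the confirmed-trip key, and the collection handle below are placeholders for illustration, not the PR's actual wiring:

from typing import Dict, List, Optional

def load_training_trips(trips_coll, user_id, incremental: bool,
                        last_run_ts: Optional[float]) -> List[Dict]:
    """load the confirmed trips used to (re)build a user's model.

    incremental models only read trips written since the last pipeline run;
    non-incremental models always read the full history.
    """
    query = {"user_id": user_id, "metadata.key": "analysis/confirmed_trip"}
    if incremental and last_run_ts is not None:
        # TimeQuery-style restriction: only trips written since the last run
        query["metadata.write_ts"] = {"$gte": last_run_ts}
    return list(trips_coll.find(query))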

@shankari (Contributor) left a comment:

First round of comments below.

Two high level questions:

  • it looks like you copied over the original code into new locations and changed the command line signatures. Did you change the implementation as well? If not, please indicate so, and add a permalink to the original code. That helps with review (especially since we don't have dedicated unit tests for this code section), and also helps us find the original commit history if we want to trace back the reason for a particular section of code.
  • I am not sure we should call this a user model. As you may recall from the project with Venu last year, a user model is typically a model of user behavior based on their preferences - e.g. what weight do they place on cost vs. time vs. carbon? This kind of model of their common trips is called a tour model.
    It's a small thing, but just fixing the naming may help others navigate the code in the future.

@robfitzgerald (Contributor Author):

thanks for the thorough review. i'll reply to all of your comments inline

@robfitzgerald (Contributor Author):

I am not sure we should call this a user mode. As you may recall from the project with Venu last year, a user model is typically a model of user behavior based on their preferences - e.g. what weight do they place on cost vs. time vs. carbon? This kind of model of their common trips is called a tour model. It's a small thing, but just fixing the naming may help others navigate the code in the future.

it's funny you say that, as i found the names (and layout) in this directory confusing, but i certainly don't want to create more confusion. i also don't know what's deprecated and what's important to keep in the "tour_model" directory, so i'm staying out of it, and i don't think "tour_model_first_only" is a good name for the proposed refactor here, since that name only disambiguates it from a clustering approach that appears deprecated as well. but maybe i can get around solving this if we could call it a "trip_model", since it deals in trips, not tours?

@robfitzgerald (Contributor Author):

it looks like you copied over the original code into new locations and changed the command line signatures.

i did, if you mean function signatures here.

Did you change the implementation as well?

yes, in some parts. in order to create an abstraction for future devs to work with, i needed to fit the existing solution to that abstraction, so a few changes were made for that. also, while porting it, i found duplicate entries for its similarity metric and decided it should be extracted into its own module and made available to future model implementations. i also found that the existing solution worked with lists and then converted to dicts for saving, and changed it so that it simply creates the model as a dict in the first place. so, a slightly different data structure is stored in GreedySimilarityBinning than in tour_model/similarity.

If not, please indicate so, and add a permalink to the original code. That helps with review (especially since we don't have dedicated unit tests for this code section), and also helps us find the original commit history if we want to trace back the reason for a particular section of code.

i'll tag those copied code segments with URLs to the latest commit (not master), which i think is what you mean by permalinking them.

@robfitzgerald (Contributor Author):

@shankari

fixed the typo breaking the feature extraction for od features (see comment).

@robfitzgerald (Contributor Author):

still fails though because it's moved onto a new problem. looking into it.

@shankari (Contributor):

still fails though because it's moved onto a new problem. looking into it.

That test passes for me now.

wrt the other failure, I wonder if it is related to

--- a/emission/analysis/modelling/trip_model/greedy_similarity_binning.py
+++ b/emission/analysis/modelling/trip_model/greedy_similarity_binning.py
@@ -121,7 +121,7 @@ class GreedySimilarityBinning(eamuu.TripModel):
         predicted_bin, bin_record = self._nearest_bin(trip)
         if predicted_bin is None:
             logging.debug(f"unable to predict bin for trip {trip}")
-            return [], -1
+            return [], 0
         else:

I had to make that change locally when I was testing earlier
#872 (comment)

@robfitzgerald (Contributor Author):

wonder if it is related to

that was it. i wrongly set this as a flag value for when no prediction occurred. i've updated it to return 0 now.

@shankari (Contributor):

@robfitzgerald you now need to change testNoPrediction to expect 0 and not -1. I think that is it wrt unit tests.

@robfitzgerald (Contributor Author) commented Aug 10, 2022:

found another test that failed too; it was expecting 2 bins with one singleton outlier bin, but it really should have been 3 bins with two singleton bins. added a comment to explain, in TestRunGreedyIncrementalModel.py, showing the similarity matrix:

        #        0      1      2      3      4      5  labels?
        # 0   True   True   True   True  False  False     True
        # 1   True   True   True   True  False  False    False
        # 2   True   True   True   True  False  False    False
        # 3   True   True   True   True  False  False     True
        # 4  False  False  False  False   True  False     True
        # 5  False  False  False  False  False   True     True
        
        # trip 0 and 3 are similar and will form bin 0
        # trip 1 and 2 have no labels and will be ignored
        # trips 4 and 5 are both dis-similar from the rest and will form singleton bins

also, realized i needed to "delete_many" on the pipeline state as well for test cleanup.

confirmed all tests in the modellingTests directory are OK this time:

Ran 20 tests in 2.341s

OK

@shankari (Contributor) left a comment:

Couple of code changes

@shankari (Contributor) left a comment:

One more code change

@shankari (Contributor) left a comment:

More minor code changes

@shankari (Contributor) commented Aug 11, 2022:

@robfitzgerald I assume that you are not planning on making the changes to move the model to a separate model DB and I should not wait for them.

Can you address the two or three remaining comments that are not related to DB storage, mainly around naming and comments?

Comment on lines 49 to 60
bin_id: {
    "features": [
        [f1, f2, .., fn],
        ...
    ],
    "labels": [
        { label1: value1, ... }
    ],
    "predictions": [
        { "labels": { label1: value1, ... }, "p": p_val }
    ]
}
Contributor Author:

why do we have "features" instead of "trips"

  1. since this object is also the one we will store in the database, it is smaller if it only includes the features we require for classification
  2. trips are domain-specific; i thought most folks just working on classification would prefer working in the domain of real-number vectors
  3. this way, we extract trip features at most once

why do we save the labels and predictions separately? I double-checked the current file-based storage, and

labels are the input and are extracted when we extract the features. predictions are tacked on once we finish fitting the model. if optimizing memory/storage size is important, this could be re-written so that we remove "labels" once we finish creating "predictions". but:

  1. when re-running an incremental model, we can apply the same code that trains from "labels" if we still have them (as opposed to reverse-engineering the labels from the previous predictions)
  2. to a smaller degree, leaving the labels does help with debugging and explaining the model
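
for example, a populated bin with two similar labeled trips might look like this; the feature values and the mode_confirm/purpose_confirm label names below are invented for illustration:

# illustrative contents of a single bin; all values are made up
example_bins = {
    "0": {
        "features": [
            [39.74, -104.99, 39.75, -105.00],  # o/d feature vector of trip 1
            [39.74, -104.99, 39.75, -105.00],  # o/d feature vector of trip 2
        ],
        "labels": [
            {"mode_confirm": "bike", "purpose_confirm": "work"},
            {"mode_confirm": "bike", "purpose_confirm": "work"},
        ],
        "predictions": [
            {"labels": {"mode_confirm": "bike", "purpose_confirm": "work"}, "p": 1.0},
        ],
    }
}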

Contributor Author:

it should definitely be explained better in the comments

sure. but, i mean, the same could be said about the previous implementation.

@shankari (Contributor) commented Aug 11, 2022:

@robfitzgerald

since this object is also the one we will store in the database, it is smaller if it only includes the features we require for classification

I understand, but calling it only features makes it confusing when we are iterating. It looks like we are iterating over individual features of a trip but we are actually iterating over rows of features/trips.

sure. but, i mean, the same could be said about the previous implementation.

Sure, but the previous implementation was by an undergrad intern who went on to work on the "foundations of cryptography". Now that we have this shiny new refactored implementation, I want to make sure that it is readable and maintainable for the long term.

So would really appreciate renaming/comments that improve clarity!

@shankari (Contributor) left a comment:

Changes look good conceptually.
I will go ahead and merge once tests pass.
