Rjf/incremental inference #852

Merged (53 commits) on Aug 11, 2022

Conversation

@robfitzgerald (Contributor) commented May 23, 2022:

this draft PR is an update to make sure i'm going in the right direction. i'm about 8 hours in and expect 4-6 hours remain. the tour_model_first_only module is refactored into two modules with two core abstractions:

emission/
  analysis/
    modelling/
      user_label_model/user_label_prediction_model.py
      similarity/similarity_metric.py

the existing binning-based heuristic has been refactored into user_label_model.GreedySimilarityBinning. user_label_model.run_model contains methods exposed to the inference and build_model phases. a future dev would 1) create a new derived instance of UserLabelPredictionModel and 2) add their model's default initialization to run_model._model_factory. a rough sketch of that layout is shown below.
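
for illustration, a minimal sketch of that structure follows; the method names, constructor arguments, and factory key in this sketch are assumptions, not necessarily the exact interface in this PR:

# illustrative sketch only -- names and signatures here are assumptions
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class UserLabelPredictionModel(ABC):
    """base class a future dev would subclass to add a new prediction model"""

    @abstractmethod
    def fit(self, trips: List[Dict]):
        """train the model on a set of confirmed trips"""

    @abstractmethod
    def predict(self, trip: Dict) -> Tuple[List[Dict], int]:
        """return (label predictions, count of supporting trips) for one trip"""


class GreedySimilarityBinning(UserLabelPredictionModel):
    """the refactored binning-based heuristic (bodies elided)"""

    def __init__(self, sim_thresh: float):
        self.sim_thresh = sim_thresh
        self.bins: Dict[str, Dict] = {}

    def fit(self, trips):
        ...  # assign each trip to the nearest bin, or open a new one

    def predict(self, trip):
        ...  # vote over the labels stored in the nearest bin


def _model_factory(model_type: str, config: Dict) -> UserLabelPredictionModel:
    """new model types register their default initialization here"""
    if model_type == "greedy_similarity_binning":
        return GreedySimilarityBinning(sim_thresh=config["sim_thresh"])
    raise KeyError(f"unknown model type {model_type}")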

remaining work:

  • write tests with Confirmedtrip data against the binning function, debug, confirm
  • read/write to the pipeline to store the incremental timestamp, use as TimeQuery parameter, test + confirm
  • wire run_model.predict_labels_with_n into the inference code
  • wire run_model.update_user_label_model into the bin/build_label_model.py script
  • see about setting various arguments (storage type, model type, distance radius, trip count threshold, model params) via e-mission config
  • fill out stubs for load_db + save_db in user_label_model/util

the objectives here were to

  1. make it easy for future devs to add new prediction models
  2. make it simple for those models to also use incremental data reading
  3. support both file system and database storage (database implementation left to future work)
  4. de-duplicate and disambiguate

i tried to balance what seemed important to keep with the desire to clean up the module. it took me a while to understand what was going on, so i hope i've improved readability. some notable removals and caveats:

  • i left in the "elbow heuristic" used for pruning bins, but can delete that if it's no longer used
  • some code in tour_model.get_request_percentage.requested_trips_bl_cutoff expects us to have held on to the bins that were removed by the elbow heuristic, but i now delete them. if we need to support this, i can fix that

return model


if __name__ == '__main__':
Contributor Author:

oh, these were the test cases at the bottom of the tour_model_first_only/load_predict file, though i'm not sure what dataset they were based on; at least not the dataset documented here

Contributor:

They are based on a private dataset from the CEO e-bike program. However, the dataset doesn't matter as much because we don't actually check the results - only that a result exists. These should be moved out into a separate test suite under emission/tests

Comment on lines 83 to 98
def save_db(user_id, table: str, model_data: Dict):
    """
    saves a user label prediction model to the database.

    data is assumed stored in a document database, with the structure:

        { "user_id": user_id, "data": model_data }

    :param user_id: the user to store data for
    :type user_id: object
    :param table: the table name
    :type table: str
    :param model_data: the data row to store tagged by this user id
    :type model_data: Dict
    """
    pass
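
a minimal sketch of what this write could look like with plain pymongo, assuming one mutable document per user and that the table name maps directly to a collection (the database name and the upsert behavior are assumptions for illustration):

from typing import Dict
import pymongo

def save_db(user_id, table: str, model_data: Dict,
            client: pymongo.MongoClient) -> None:
    """upsert { "user_id": ..., "data": ... } into the named collection"""
    collection = client["Stage_database"][table]  # database name assumed here
    doc = {"user_id": user_id, "data": model_data}
    collection.replace_one({"user_id": user_id}, doc, upsert=True)
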
Contributor:

@robfitzgerald alas, storing to the DB is also a super high priority task at this point. As we work on deploying this in the NREL cloud environment (e-mission/e-mission-docs#721), cloud services would prefer to run the modeling/inference tasks as stateless scheduled tasks. So we cannot store the model to disk, since there is no persistent disk.

We did consider storing the models to S3, but that would then require an S3 bucket approved for moderate use, which would, in turn require additional cyber approval.

Since the database is already approved for moderate use, and mongodb, as a NoSQL database, supports document storage, it seemed easiest to store the models in the database.

Contributor Author:

this is fine, i can work with Andrew on supporting db storage then. but i also assume we still want to support file system storage for non-NRELian users in less-restrictive deployment environments or for testing. i mean, why not, it's already in there 🤷

@shankari (Contributor), May 24, 2022:

@robfitzgerald you will actually want to work with me; Andrew will be focusing on other projects starting next week.

I actually vote for removing the file system storage, but I am open to being persuaded otherwise.

  • Storing it in the database ensures that there is one source of data that needs to be managed wrt backup/encryption etc
  • All OpenPATH users must have access to a database anyway since we store the sensor data there.
  • We only stored in the file system to begin with since @corinne-hcr was not familiar with database operations.

Concretely, the current non-NRELian way to run OpenPATH is as a set of containers deployed using docker-compose, with only the database installed on a persistent volume.

The model files are created in the filesystem of the analysis container. This means that if the containers are removed, the stored models will be lost. This was not as much of an issue when we did not have incremental updates, since we would re-create the model every time, but will be an issue in the future. We can work around this by creating persistent volumes for the analysis container as well, but that is additional work, and means that we can't just remove and re-deploy the containers to get a clean setup.

Contributor Author:

ok, sounds good, i will remove the FS-related utils and we can get the database read/write done within this PR, as long as you're ok with letting go of backwards compatibility here. i'll reach out to you if i have questions about writing to tables, naming conventions, and schemas, but feel free to propose those here if you have ideas beyond my initial assumptions.

Contributor:

@robfitzgerald Backwards compatibility is not a huge concern right now.

At this point, the system regenerates the model from scratch every time. And when it is deployed on the NREL OpenPATH servers, it will not have any file based models to read from anyway.

If we want to be extra nice, we could fall back to reading from file if there is no entry in the database, and then remove the fallback code after a year. That would ensure that if somebody else had been running this modeling pipeline (not aware of anybody other than CanBikeCO at this point), they can continue to use the old model to label incoming trips while the new model is being built. But there is no reason at this point to ever write to the file system.

Again, not aware of anybody other than CanBikeCO who has enabled the new pipeline, so we could also just punt on the backwards compatibility if it is faster.

wrt naming conventions and schema, I have created two main collections and am storing all data in them: Stage_timeseries, accessed via emission.core.get_database.get_stage_timeseries(), and Stage_analysis_timeseries, accessed via emission.core.get_database.get_analysis_timeseries().

Thinking of creating similar Stage_models_timeseries and Stage_results_timeseries.

In the raw and analysis timeseries, the entries follow the same basic format, with data and metadata entries, but with the fields in data being different based on metadata.key. This allows us to run generic metadata-based queries (e.g. on metadata.write_ts) across multiple types of data.
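
for illustration, an entry in these collections looks roughly like the following; the field values are invented, and only the metadata/data split and the metadata.key / metadata.write_ts fields come from the description above:

# illustrative only -- field values are invented
entry = {
    "user_id": "<uuid>",
    "metadata": {
        "key": "analysis/confirmed_trip",  # determines the schema of "data"
        "write_ts": 1660000000.0,          # generic query field across data types
    },
    "data": {
        # fields here vary based on metadata.key
    },
}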

Not sure if the models need to be as time based, or whether they should just be UUID based. Thoughts?

Contributor Author:

so, given your convention of maintaining immutable database entries, we could do that with the label models, and it would just mean that we take the entry with the most recent write_ts when we need to run inference. but i'm not sure how large different models can get. i'm guessing not huge, as the datasets for each user are not huge. if that sounds sensible, we could adopt the convention of immutable records and grab the latest each time.

there's probably not a lot of value in storing all of those models (365 models per year per user). so, the other thing we could do is treat these rows as mutable and only store 1 row per UUID. in that case, metadata.write_ts becomes less a query key and more a housekeeping field. but that would assume that UUID-based indexing is not slow in Stage_analysis_timeseries (or maybe it's motivation to create another collection configured for UUID indexing and designed/labeled for mutable DB operations).
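
a minimal sketch of the "immutable records, grab the latest" option, written against a plain pymongo collection handle (the collection handle and model key below are placeholders, not the actual e-mission helpers):

import pymongo

def load_latest_model(analysis_coll, user_id, model_key="inference/trip_model"):
    """return the most recently written model entry for this user, or None"""
    cursor = (
        analysis_coll
        .find({"user_id": user_id, "metadata.key": model_key})
        .sort("metadata.write_ts", pymongo.DESCENDING)
        .limit(1)
    )
    entries = list(cursor)
    return entries[0] if entries else None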

as for the fs discussion, i'm not trying to advocate for file saving, i'm just being careful, that's all. if it's not helping anyone we can remove that stuff. but it's already implemented, so it's not "faster" or "slower" to keep that functionality.

@shankari (Contributor), Jun 11, 2022:

I think we should start with an immutable object, but just keep the last n entries, with n=3 or n=5 or something. This should strike a good balance between robustness and disk space. We will then need a cleanup cronjob/scheduled script that will go through and delete older entries.
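
a minimal sketch of such a cleanup pass, again assuming a plain pymongo collection handle and placeholder key names:

import pymongo

def prune_old_models(analysis_coll, user_id,
                     model_key="inference/trip_model", keep_n=3):
    """delete all but the newest keep_n model entries for this user"""
    stale = (
        analysis_coll
        .find({"user_id": user_id, "metadata.key": model_key}, {"_id": 1})
        .sort("metadata.write_ts", pymongo.DESCENDING)
        .skip(keep_n)
    )
    stale_ids = [doc["_id"] for doc in stale]
    if stale_ids:
        analysis_coll.delete_many({"_id": {"$in": stale_ids}})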

in the long run, we should remove it because it is obsolete and I don't want to have a bunch of vestigial code sitting around either being maintained or waiting to bitrot.

in the short term, reading from DB with fallback to file seems like a bit more work than reading only from the DB.

@shankari (Contributor):

@robfitzgerald is there an ETA for this change, at least to switch to storing in the database? Once the initial data collection is running in the NREL hosted environment (e-mission/e-mission-docs#732), I would like to enable label assist as well.

LMK if I should take over that part instead.

@robfitzgerald (Contributor Author):

i was dreading this message, knowing full well i promised results by now. i would like to try and wrap this up next week, if that's not too late for you. i'm dealing with a few tasks at work that have been taking much more of my time than anticipated (and that i've had trouble estimating my time for because they are outside of my comfort zone, honestly).

@robfitzgerald (Contributor Author):

@shankari i've spent a little time today reviewing the database queries. i'm looking for how best to record models and retrieve them. two ideas:

  • when i need the model, i retrieve the one with the latest timestamp from the analysis timeseries with matching user_id/key
  • when i need the model, i retrieve the one with the timestamp that matches last_ts_run in the pipeline state with matching user_id/key

looking for your opinion there.

also, if i did want to implement the former (retrieving the latest entry with a matching user_id/key), would we need to add that capability to BuiltinTimeSeries via some new method?

@shankari (Contributor):

when i need the model, i retrieve the one with the latest timestamp from the analysis timeseries with matching user_id/key

I prefer this option, to keep it simple, and not introduce any additional dependencies on the pipeline state.

also, if i did want to implement the former (retrieving the latest entry with a matching user_id/key), would we need to add that capability to BuiltinTimeSeries via some new method?

Already here.
https://github.com/e-mission/e-mission-server/blob/master/emission/storage/timeseries/builtin_timeseries.py#L301

@robfitzgerald (Contributor Author):

update here. database read/write is wired into the save functions along with pipeline updates. what trip data is loaded depends on whether the Model type reports itself as "incremental" or not.

wanted to confirm how this should be wired in. at this point i have simply swapped in the relevant methods in inferrers.py and build_save_model.py. i haven't tested anything; wanted to confirm this design looks correct, and hoping to make time to test the internals, though i'm a little fuzzy on running e2e tests here.
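
a rough sketch of how that incremental/full split could look; the is_incremental flag, the confirmed-trip key, and the collection handle below are placeholders for illustration, not the PR's actual wiring:

from typing import Dict, List, Optional

def load_training_trips(trips_coll, user_id, incremental: bool,
                        last_run_ts: Optional[float]) -> List[Dict]:
    """load the confirmed trips used to (re)build a user's model.

    incremental models only read trips written since the last pipeline run;
    non-incremental models always read the full history.
    """
    query = {"user_id": user_id, "metadata.key": "analysis/confirmed_trip"}
    if incremental and last_run_ts is not None:
        # TimeQuery-style restriction: only trips written since the last run
        query["metadata.write_ts"] = {"$gte": last_run_ts}
    return list(trips_coll.find(query))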

@shankari (Contributor) left a comment:

First round of comments below.

Two high level questions:

  • it looks like you copied over the original code into new locations and changed the command line signatures. Did you change the implementation as well? If not, please indicate so, and add a permalink to the original code. That helps with review (especially since we don't have dedicated unit tests for this code section), and also helps us find the original commit history if we want to trace back the reason for a particular section of code.
  • I am not sure we should call this a user model. As you may recall from the project with Venu last year, a user model is typically a model of user behavior based on their preferences - e.g. what weight do they place on cost vs. time vs. carbon? This kind of model of their common trips is called a tour model.
    It's a small thing, but just fixing the naming may help others navigate the code in the future.

@robfitzgerald (Contributor Author):

thanks for the thorough review. i'll reply to all of your comments inline

@robfitzgerald (Contributor Author):

I am not sure we should call this a user mode. As you may recall from the project with Venu last year, a user model is typically a model of user behavior based on their preferences - e.g. what weight do they place on cost vs. time vs. carbon? This kind of model of their common trips is called a tour model. It's a small thing, but just fixing the naming may help others navigate the code in the future.

it's funny you say that, as i found the names (and layout) in this directory confusing, but i certainly don't want to create more confusion. i also don't know what's deprecated and what's important to keep in the "tour_model" directory, so i'm staying out of it, and i don't think "tour_model_first_only" is a good name for the proposed refactor here, since that name only disambiguates it from a clustering approach that appears deprecated as well. but maybe i can get around solving this if we could call it a "trip_model", since it deals in trips, not tours?

@robfitzgerald (Contributor Author):

it looks like you copied over the original code into new locations and changed the command line signatures.

i did, if you mean function signatures here.

Did you change the implementation as well?

yes, in some parts. in order to create an abstraction for future devs to work with, i needed to fit the existing solution to that abstraction, so a few changes were made for that. also, while porting it, i found duplicate entries for its similarity metric and decided it should be extracted into its own module and made available to future model implementations. i also found that the existing solution worked with lists and then converted to dicts for saving, and changed it so that it simply creates the model as a dict in the first place. so, a slightly different data structure is stored in GreedySimilarityBinning than in tour_model/similarity.

If not, please indicate so, and add a permalink to the original code. That helps with review (especially since we don't have dedicated unit tests for this code section), and also helps us find the original commit history if we want to trace back the reason for a particular section of code.

i'll tag those copied code segments with URLs to the latest commit (not master), which i think is what you mean by permalinking them.

@robfitzgerald (Contributor Author):

@shankari

fixed the typo breaking the feature extraction for od features (see comment).

@robfitzgerald (Contributor Author):

still fails though because it's moved onto a new problem. looking into it.

@shankari (Contributor):

still fails though because it's moved onto a new problem. looking into it.

That test passes for me now.

wrt the other failure, I wonder if it is related to

--- a/emission/analysis/modelling/trip_model/greedy_similarity_binning.py
+++ b/emission/analysis/modelling/trip_model/greedy_similarity_binning.py
@@ -121,7 +121,7 @@ class GreedySimilarityBinning(eamuu.TripModel):
         predicted_bin, bin_record = self._nearest_bin(trip)
         if predicted_bin is None:
             logging.debug(f"unable to predict bin for trip {trip}")
-            return [], -1
+            return [], 0
         else:

I had to make that change locally when I was testing earlier
#872 (comment)

@robfitzgerald (Contributor Author):

wonder if it is related to

that was it. i wrongly set this as a flag value for when no prediction occurred. i've updated it to return 0 now.

@shankari (Contributor):

@robfitzgerald you now need to change testNoPrediction to expect 0 and not -1. I think that is it wrt unit tests.

@robfitzgerald (Contributor Author) commented Aug 10, 2022:

found another test that failed too; it was expecting 2 bins with one singleton outlier bin, but it really should have been 3 bins with two singleton bins. added a comment to explain, in TestRunGreedyIncrementalModel.py, showing the similarity matrix:

        #        0      1      2      3      4      5  labels?
        # 0   True   True   True   True  False  False     True
        # 1   True   True   True   True  False  False    False
        # 2   True   True   True   True  False  False    False
        # 3   True   True   True   True  False  False     True
        # 4  False  False  False  False   True  False     True
        # 5  False  False  False  False  False   True     True
        
        # trip 0 and 3 are similar and will form bin 0
        # trip 1 and 2 have no labels and will be ignored
        # trips 4 and 5 are both dis-similar from the rest and will form singleton bins

also, realized i needed to "delete_many" on the pipeline state as well for test cleanup.

confirmed all tests in the modellingTests directory are OK this time:

Ran 20 tests in 2.341s

OK

@shankari (Contributor) left a comment:

Couple of code changes

@shankari (Contributor) left a comment:

One more code change

@shankari (Contributor) left a comment:

More minor code changes

@shankari (Contributor) commented Aug 11, 2022:

@robfitzgerald I assume that you are not planning on making the changes to move the model to a separate model DB and I should not wait for them.

Can you address the two or three remaining comments that are not related to DB storage, mainly around naming and comments?

Comment on lines 49 to 60
bin_id: {
    "features": [
        [f1, f2, .., fn],
        ...
    ],
    "labels": [
        { label1: value1, ... }
    ],
    "predictions": [
        { "labels": { label1: value1, ... }, "p": p_val }
    ]
}
Contributor Author:

why do we have "features" instead of "trips"

  1. since this object is also the one we will store in the database, it is smaller if it only includes the features we require for classification
  2. trips are domain-specific; i thought most folks just working on classification would prefer working in the domain of real-number vectors
  3. this way, we extract trip features at most once

why do we save the labels and predictions separately? I double-checked the current file-based storage, and

labels are the input and are extracted when we extract the features. predictions are tacked on once we finish fitting the model. if optimizing memory/storage size is important, this could be re-written so that we remove "labels" once we finish creating "predictions". but:

  1. when re-running an incremental model, we can apply the same code that trains from "labels" if we still have them (as opposed to reverse-engineering the labels from the previous predictions)
  2. to a smaller degree, leaving the labels does help with debugging and explaining the model
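
for example, a populated bin with two similar labeled trips might look like this; the feature values and the mode_confirm/purpose_confirm label names below are invented for illustration:

# illustrative contents of a single bin; all values are made up
example_bins = {
    "0": {
        "features": [
            [39.74, -104.99, 39.75, -105.00],  # o/d feature vector of trip 1
            [39.74, -104.99, 39.75, -105.00],  # o/d feature vector of trip 2
        ],
        "labels": [
            {"mode_confirm": "bike", "purpose_confirm": "work"},
            {"mode_confirm": "bike", "purpose_confirm": "work"},
        ],
        "predictions": [
            {"labels": {"mode_confirm": "bike", "purpose_confirm": "work"}, "p": 1.0},
        ],
    }
}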

Contributor Author:

it should definitely be explained better in the comments

sure. but, i mean, the same could be said about the previous implementation.

@shankari (Contributor) commented Aug 11, 2022:

@robfitzgerald

since this object is also the one we will store in the database, it is smaller if it only includes the features we require for classification

I understand, but calling it only features makes it confusing when we are iterating. It looks like we are iterating over individual features of a trip but we are actually iterating over rows of features/trips.

sure. but, i mean, the same could be said about the previous implementation.

Sure, but the previous implementation was by an undergrad intern who went on to work on the "foundations of cryptography". Now that we have this shiny new refactored implementation, I want to make sure that it is readable and maintainable for the long term.

So would really appreciate renaming/comments that improve clarity!

@shankari (Contributor) left a comment:

Changes look good conceptually.
I will go ahead and merge once tests pass.
