Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rjf/incremental inference #852

Merged
merged 53 commits into from
Aug 11, 2022
Merged
Show file tree
Hide file tree
Changes from 33 commits
Commits
Show all changes
53 commits
Select commit Hold shift + click to select a range
3e87e50
checkpoint: initial work on clustering abstraction
robfitzgerald May 12, 2022
a1e75a6
user label prediction refactor
robfitzgerald May 19, 2022
f4b1260
fetch of upstream fork
robfitzgerald May 23, 2022
75ca7e7
Merge branch 'master' into rjf/incremental-inference
robfitzgerald Jun 15, 2022
091bf66
integrate user label model into pipeline
robfitzgerald Jun 21, 2022
3ee7b07
Merge branch 'e-mission:master' into rjf/incremental-inference
robfitzgerald Jun 21, 2022
c14c619
Merge branch 'rjf/incremental-inference' of https://github.com/robfit…
robfitzgerald Jun 21, 2022
5fcd508
cleanup
robfitzgerald Jun 21, 2022
c6306b4
cleanup
robfitzgerald Jun 21, 2022
714c017
integration
robfitzgerald Jun 21, 2022
c1540f3
comments
robfitzgerald Jun 21, 2022
91bdfe4
comments
robfitzgerald Jun 21, 2022
567c4d8
add user label model stage
robfitzgerald Jun 21, 2022
b6fb2a7
simplification, cleanup, comments
robfitzgerald Jun 22, 2022
dd90f1a
cleanup, naming
robfitzgerald Jun 22, 2022
580498f
logging
robfitzgerald Jun 22, 2022
22bb419
remove 'data' field
robfitzgerald Jun 22, 2022
92aa9c9
invalid file checked in
robfitzgerald Jun 22, 2022
2f3be9a
code review
robfitzgerald Jun 22, 2022
de12282
relocate tour model test code
robfitzgerald Jun 22, 2022
c795a55
cleanup
robfitzgerald Jun 23, 2022
b98d5dc
cleanup and documentation
robfitzgerald Jun 23, 2022
a183b22
module rename trip_model, begin unit tests
robfitzgerald Jun 23, 2022
f318bea
fixes from testing
robfitzgerald Jun 23, 2022
5e37bc7
comments, modify default sampling rate
robfitzgerald Jun 23, 2022
6637b70
adding missing python dependencies
robfitzgerald Jun 27, 2022
fbae182
fixes related to e2e test
robfitzgerald Jun 28, 2022
1320e96
integrate config file
robfitzgerald Jun 29, 2022
c8d733d
fix edge cases for empty training sets
robfitzgerald Jun 29, 2022
4483a14
add test of incremental training
robfitzgerald Jul 7, 2022
6aae056
incremental model testing
robfitzgerald Jul 7, 2022
1cd57d8
update/finish e2e testing of greedy trip model
robfitzgerald Jul 11, 2022
3f0a551
comments
robfitzgerald Jul 11, 2022
baf66f6
Merge branch 'rjf/incremental-inference' of https://github.com/robfit…
shankari Aug 7, 2022
0b7649f
Revert "adding missing python dependencies"
shankari Aug 7, 2022
75c1a02
Added backwards compat test to showcase the change in behavior from "…
shankari Aug 7, 2022
2ce8e06
Add a monkeytest to check backwards compat
shankari Aug 7, 2022
d1e9af6
Update bin/build_label_model.py
robfitzgerald Aug 8, 2022
29cdc15
checkpoint: addressing review
robfitzgerald Aug 9, 2022
f8db4f2
Merge branch 'rjf/incremental-inference' of https://github.com/robfit…
robfitzgerald Aug 9, 2022
6c73249
Merge pull request #1 from shankari/restructure_clustering_codebase
robfitzgerald Aug 9, 2022
a5cb123
apply revisions from review, comments
robfitzgerald Aug 9, 2022
e00559e
greedy binning should fit trip to all binned trips
robfitzgerald Aug 9, 2022
b3be3d6
typo
robfitzgerald Aug 10, 2022
05278f2
fix log typo
robfitzgerald Aug 10, 2022
351a75a
failed "predict" should return num=0
robfitzgerald Aug 10, 2022
696b999
update tests after sim metric fix
robfitzgerald Aug 10, 2022
76b4f2e
Apply suggestions from code review
shankari Aug 11, 2022
c383740
Add more links to the old code
shankari Aug 11, 2022
d032601
More documentation and clarification
shankari Aug 11, 2022
6901b99
Revert duplicate commit
shankari Aug 11, 2022
b252622
comments and naming changes
robfitzgerald Aug 11, 2022
dab7295
Update emission/tests/modellingTests/TestRunGreedyIncrementalModel.py
robfitzgerald Aug 11, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
*.swp
*debug.log
.DS_Store
.vscode

CFC_WebApp/config.json
CFC_WebApp/keys.json
Expand Down
11 changes: 8 additions & 3 deletions bin/build_label_model.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,13 +5,13 @@

import argparse
import uuid
import copy

import emission.pipeline.reset as epr
import emission.core.get_database as edb
import emission.core.wrapper.user as ecwu
import emission.storage.timeseries.abstract_timeseries as esta
import emission.analysis.modelling.tour_model_first_only.build_save_model as eamtb
import emission.analysis.modelling.trip_model.run_model as eamur
import emission.analysis.modelling.trip_model.config as eamtc

def _get_user_list(args):
if args.all:
Expand Down Expand Up @@ -64,4 +64,9 @@ def _email_2_user_list(email_list):
logging.info("received list with %s users" % user_list)
for user_id in user_list:
logging.info("building model for user %s" % user_id)
eamtb.build_user_model(user_id)
# these can come from the application config as default values

model_type = eamtc.get_model_type()
model_storage = eamtc.get_model_storage()
min_trips = eamtc.get_minimum_trips()
eamur.update_trip_model(user_id, model_type, model_storage, min_trips)
robfitzgerald marked this conversation as resolved.
Show resolved Hide resolved
13 changes: 13 additions & 0 deletions conf/analysis/trip_model.conf.json.sample
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
{
"model_type": "greedy",
"model_storage": "document_database",
"minimum_trips": 14,
"model_parameters": {
"greedy": {
"metric": "od_similarity",
"similarity_threshold_meters": 500,
"apply_cutoff": false,
"incremental_evaluation": false
}
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,8 @@
import copy

import emission.analysis.modelling.tour_model_first_only.load_predict as lp
import emission.analysis.modelling.trip_model.run_model as eamur
import emission.analysis.modelling.trip_model.config as eamtc

# A set of placeholder predictors to allow pipeline development without a real inference algorithm.
# For the moment, the system is configured to work with two labels, "mode_confirm" and
Expand Down Expand Up @@ -140,7 +142,10 @@ def n_to_confidence_coeff(n, max_confidence=None, first_confidence=None, confide

# predict_two_stage_bin_cluster but with the above reduction in confidence
def predict_cluster_confidence_discounting(trip, max_confidence=None, first_confidence=None, confidence_multiplier=None):
labels, n = lp.predict_labels_with_n(trip)
# load application config
model_type = eamtc.get_model_type()
model_storage = eamtc.get_model_storage()
labels, n = eamur.predict_labels_with_n(trip, model_type, model_storage)
shankari marked this conversation as resolved.
Show resolved Hide resolved
if n <= 0: # No model data or trip didn't match a cluster
logging.debug(f"In predict_cluster_confidence_discounting: n={n}; returning as-is")
return labels
Expand Down
Empty file.
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
from typing import List
import emission.core.wrapper.confirmedtrip as ecwc


shankari marked this conversation as resolved.
Show resolved Hide resolved
def origin_features(trip: ecwc.Confirmedtrip) -> List[float]:
"""extract the trip origin coordinates.

:param trip: trip to extract features from
:return: origin coordinates
"""
try:
origin = trip['data']['start_loc']["coordinates"]
return origin
except KeyError as e:
msg = 'Confirmedtrip expected to have path data.start_loc.coordinates'
raise KeyError(msg) from e

def destination_features(trip: ecwc.Confirmedtrip) -> List[float]:
"""extract the trip destination coordinates.

:param trip: trip to extract features from
:return: destination coordinates
"""
try:
destination = trip['data']['start_loc']["coordinates"]
return destination
except KeyError as e:
msg = 'Confirmedtrip expected to have path data.start_loc.coordinates'
raise KeyError(msg) from e


def od_features(trip: ecwc.Confirmedtrip) -> List[float]:
"""extract both origin and destination coordinates.

:param trip: trip to extract features from
:return: od coordinates
"""
o_lat, o_lon = origin_features(trip)
d_lat, d_lon = destination_features(trip)
return [o_lat, o_lon, d_lat, d_lon]
shankari marked this conversation as resolved.
Show resolved Hide resolved

def distance_feature(trip: ecwc.Confirmedtrip) -> List[float]:
"""provided for forward compatibility.

:param trip: trip to extract features from
:return: distance feature
"""
try:
return [trip['data']['distance']]
except KeyError as e:
msg = 'Confirmedtrip expected to have path data.distance'
raise KeyError(msg) from e

def duration_feature(trip: ecwc.Confirmedtrip) -> List[float]:
"""provided for forward compatibility.

:param trip: trip to extract features from
:return: duration feature
"""
try:
return [trip['data']['duration']]
except KeyError as e:
msg = 'Confirmedtrip expected to have path data.duration'
raise KeyError(msg) from e
21 changes: 21 additions & 0 deletions emission/analysis/modelling/similarity/od_similarity.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
from typing import List
import emission.analysis.modelling.similarity.similarity_metric as eamss
import emission.analysis.modelling.similarity.confirmed_trip_feature_extraction as ctfe
import emission.core.wrapper.confirmedtrip as ecwc
import emission.core.common as ecc


class OriginDestinationSimilarity(eamss.SimilarityMetric):
"""
similarity metric which compares, for two trips,
the distance for origin to origin, and destination to destination,
in meters.
"""

def extract_features(self, trip: ecwc.Confirmedtrip) -> List[float]:
return ctfe.od_features(trip)

def similarity(self, a: List[float], b: List[float]) -> List[float]:
o_dist = ecc.calDistance([a[0], a[1]], [b[0], b[1]])
d_dist = ecc.calDistance([a[2], a[3]], [b[2], b[3]])
return [o_dist, d_dist]
shankari marked this conversation as resolved.
Show resolved Hide resolved
41 changes: 41 additions & 0 deletions emission/analysis/modelling/similarity/similarity_metric.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
from abc import ABCMeta, abstractmethod
from typing import List
import logging

import emission.core.wrapper.confirmedtrip as ecwc

shankari marked this conversation as resolved.
Show resolved Hide resolved

class SimilarityMetric(metaclass=ABCMeta):

@abstractmethod
def extract_features(self, trip: ecwc.Confirmedtrip) -> List[float]:
"""extracts the features we want to compare for similarity

:param trip: a confirmed trip
:return: the features to compare
"""
pass

@abstractmethod
def similarity(self, a: List[float], b: List[float]) -> List[float]:
"""compares the features, producing their similarity
as computed by this similarity metric

:param a: features for a trip
:param b: features for another trip
:return: for each feature, the similarity of these features
"""
pass

def similar(self, a: List[float], b: List[float], thresh: float) -> bool:
"""compares the features, returning true if they are similar
within some threshold

:param a: features for a trip
:param b: features for another trip
:param thresh: threshold for similarity
:return: true if the feature similarity is within some threshold
"""
similarity_values = self.similarity(a, b)
is_similar = all(map(lambda sim: sim <= thresh, similarity_values))
return is_similar
49 changes: 49 additions & 0 deletions emission/analysis/modelling/similarity/similarity_metric_type.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
from __future__ import annotations
import enum


import emission.analysis.modelling.similarity.od_similarity as eamso
import emission.analysis.modelling.similarity.similarity_metric as eamss

class SimilarityMetricType(enum.Enum):
OD_SIMILARITY = 0

def build(self) -> eamss.SimilarityMetric:
"""

hey YOU! add future similarity metric types here please!

:raises KeyError: if the SimilarityMetricType isn't found in the below dictionary
:return: the associated similarity metric
"""
metrics = {
SimilarityMetricType.OD_SIMILARITY: eamso.OriginDestinationSimilarity()
}

metric = metrics.get(self)
if metric is None:
names = "{" + ",".join(SimilarityMetricType.names) + "}"
msg = f"unknown metric type {metric}, must be one of {names}"
raise KeyError(msg)
else:
return metric


@classmethod
def names(cls):
return list(map(lambda e: e.name, list(cls)))

@classmethod
def from_str(cls, str):
"""attempts to match the provided string to a known SimilarityMetricType.
not case sensitive.

:param str: a string name of a SimilarityMetricType
"""
try:
str_caps = str.upper()
return cls[str_caps]
except KeyError:
names = "{" + ",".join(cls.names) + "}"
msg = f"{str} is not a known SimilarityMetricType, must be one of {names}"
raise KeyError(msg)
16 changes: 8 additions & 8 deletions emission/analysis/modelling/tour_model/label_processing.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,14 +34,14 @@ def map_labels_mode(user_input_df):
# convert mode
if "replaced_mode" in user_input_df.columns:
same_mode_df = user_input_df[user_input_df.replaced_mode == "same_mode"]
logging.debug("The following rows will be changed %s" %
same_mode_df.index)
for a in range(len(user_input_df)):
if user_input_df.iloc[a]["replaced_mode"] == "same_mode":
# to see which row will be converted
# logging.debug("The following rows will be changed: %s", user_input_df.iloc[a])
user_input_df.iloc[a]["replaced_mode"] = user_input_df.iloc[a]['mode_confirm']
logging.debug("Finished changing all rows")
if len(same_mode_df) > 0:
logging.debug("The following rows will be changed %s" % same_mode_df.index)
for a in range(len(user_input_df)):
if user_input_df.iloc[a]["replaced_mode"] == "same_mode":
# to see which row will be converted
# logging.debug("The following rows will be changed: %s", user_input_df.iloc[a])
user_input_df.iloc[a]["replaced_mode"] = user_input_df.iloc[a]['mode_confirm']
logging.debug("Finished changing all rows")
shankari marked this conversation as resolved.
Show resolved Hide resolved
else:
logging.info("map_labels_mode: no replaced mode column found, early return")
return user_input_df
Expand Down
Empty file.
79 changes: 79 additions & 0 deletions emission/analysis/modelling/trip_model/config.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
import json
import re
from this import d
from typing import Optional
import logging
from numpy import isin

import emission.analysis.modelling.trip_model.model_storage as eamums
import emission.analysis.modelling.trip_model.model_type as eamumt

config_filename = ""

def load_config():
global config_filename
try:
config_filename = 'conf/analysis/trip_model.conf.json'
config_file = open(config_filename)
except:
print("analysis.trip_model.conf.json not configured, falling back to sample, default configuration")
config_filename = 'conf/analysis/trip_model.conf.json.sample'
config_file = open('conf/analysis/trip_model.conf.json.sample')
ret_val = json.load(config_file)
config_file.close()
return ret_val

config_data = load_config()

def reload_config():
global config_data
config_data = load_config()

def get_config():
return config_data

def get_optional_config_value(key) -> Optional[str]:
"""
get a config value at the provided path/key

:param key: a key name or a dot-delimited path to some key within the config object
:return: the value at the key, or, None if not found
"""
cursor = config_data
Comment on lines +36 to +42
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you currently use the dot-delimited path anywhere? Doesn't look like it...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think i was at first.. 🤔 i think i copied that logic from another config.py though? sorry, don't remember.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is pkg_resource.resource_filename which takes dot-delimited paths too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean by this

$  grep -r resource_filename emission/analysis | wc -l
       0

Copy link
Contributor Author

@robfitzgerald robfitzgerald Aug 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooh no, pkg_resource is a nifty built-in python library that lets one reference files that are in a python package via dot notation. it gets around os-specific pathing.

for example, if you have a file

emission/
  __init__.py
  resources/
    __init__.py
    config/
      __init__.py
      my_conf.json

then you can reference this in code via

from pkg_resource import resource_filename

file = resource_filename("emission.resources.config", "my_conf.json")

note, we need all directories in the path to be "python modules" aka have an init.py file in them for this to work.

path = key.split(".")
for k in path:
cursor = cursor.get(k)
if cursor is None:
return None
return cursor

def get_config_value_or_raise(key):
logging.debug(f'getting key {key} in config')
value = get_optional_config_value(key)
if value is None:
logging.debug('config object:')
logging.debug(json.dumps(config_data, indent=2))
msg = f"expected config key {key} not found in config file {config_filename}"
raise KeyError(msg)
else:
return value

def get_model_type():
model_type_str = get_config_value_or_raise('model_type')
model_type = eamumt.ModelType.from_str(model_type_str)
return model_type

def get_model_storage():
model_storage_str = get_config_value_or_raise('model_storage')
model_storage = eamums.ModelStorage.from_str(model_storage_str)
return model_storage

def get_minimum_trips():
minimum_trips = get_config_value_or_raise('minimum_trips')
if not isinstance(minimum_trips, int):
msg = f"config key 'minimum_trips' not an integer in config file {config_filename}"
raise TypeError(msg)
return minimum_trips



Loading