BenMacKenzie/feature_store_utils

feature store utils

A lightweight package that lets you express ML features in simple YAML, build a training dataset, and write the features to a feature store.

some general thoughts on building a training dataset

https://docs.google.com/presentation/d/1tVkrwCLVwFp8cZC7CmAHSNFhsJrcTdC20MlZfptkSBE/edit?usp=sharing

options for use

  1. Clone this repo, create a features.yaml, and follow the demo notebook. Do not check your changes back in.
  2. Install it as a Python package. See https://github.com/BenMacKenzie/churn_model_demo for an example.

Bugs

  1. Need to be able to create tables associated with lookups, e.g., to handle transformation functions.
  2. The average_growth(col, length) macro has a divide-by-zero bug. It is also not clear when it would be used instead of geometric_growth.
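To make the divide-by-zero concrete, here is a hedged Python sketch of the two growth measures (the actual macros are Jinja/SQL; the function names and exact semantics here are assumptions). Both divide by earlier values, so a zero anywhere in the series breaks average growth, while geometric growth only breaks on a zero starting value:

```python
def average_growth(values):
    """Mean of period-over-period growth rates (assumed semantics)."""
    rates = []
    for prev, curr in zip(values, values[1:]):
        if prev == 0:  # this guard is what the macro appears to be missing
            continue   # or raise / return None, depending on desired semantics
        rates.append((curr - prev) / prev)
    return sum(rates) / len(rates) if rates else None

def geometric_growth(values):
    """Constant per-period rate carrying the first value to the last (assumed semantics)."""
    first, last = values[0], values[-1]
    if first == 0:
        return None
    n = len(values) - 1
    return (last / first) ** (1 / n) - 1
```

For a series with no zeros the two measures are close; they diverge on volatile series, which may be why geometric_growth is usually the safer default.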

Notes

  1. The current version is experimental. It is not clear that Jinja is the right way to write parameterized SQL; it might be better to do it in Python.
  2. The current version is not optimized: each feature is calculated individually, whereas multiple aggregation features could be calculated simultaneously when their table, filters, and time window are identical.
  3. I believe there are around a dozen standard feature types. The most common have been implemented, and views can fill many of the remaining gaps. Still missing:
  • second-order aggregations over a time series, e.g., max monthly job DBU over a 6-month window
  • time in state, e.g., how long a ticket was open, based on a type 2 table
  • time to event in a fact table, e.g., time since the last call to customer support
  • scalar functions of two or more features, e.g., time in days between two dates
  • number of state changes over an interval (rare)
  • functions of features, e.g., the ratio of growth in job DBU to interactive DBU. Arguably this is not needed for boosted trees. It might be useful for neural nets, but why use a neural net on heterogeneous data? (This kind of feature can, however, be good for model explainability.)
  4. Need to illustrate adding features from a related dimension table using a foreign key (the machinery is in place to do so).
  5. The current version illustrates creating a pipeline that uses the API, but it would be nicer to generate the code and write it to a notebook so that the package is invisible in production (like bamboolib).
  6. The demo repo (https://github.com/BenMacKenzie/churn_model_demo) illustrates 'hyper-features', which are features with variable parameters.
  7. Connecting 'hyper-features' to the feature store needs to be worked out. Currently the options are to add all of them or to specify individual versions by their (generated) names.
  8. Fix feature store feature generation observation dates: align them with the grain of the feature, e.g., if the grain is monthly, make sure the feature store contains an observation on the first of each month.
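As an illustration of one of the missing feature types above, here is a hedged Python sketch of "time in state" over a type 2 (SCD2) table. In the package this would presumably be generated SQL; the row shape, column names, and state values are assumptions. Each row carries a state plus valid_from/valid_to dates, with the current row having an open valid_to:

```python
from datetime import date

def days_in_state(rows, state, as_of):
    """Total days spent in `state` up to `as_of`.

    rows: dicts with 'state', 'valid_from', 'valid_to' (None = still current).
    """
    total = 0
    for row in rows:
        if row["state"] != state:
            continue
        start = row["valid_from"]
        end = row["valid_to"] or as_of      # clip the still-current row at as_of
        total += (min(end, as_of) - start).days
    return total
```

The same shape (iterate versioned rows, clip intervals at the observation date, sum durations) would cover the ticket-open example: days_in_state(ticket_rows, "open", observation_date).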

notes on testing:

I think https://docs.databricks.com/en/dev-tools/databricks-connect/python/testing.html needs to be updated to indicate that Databricks Connect only works with pytest if you are using the default profile; it does not work with other profiles.

Also, there is no connection between the configuration (profile, auth, cluster) in the Databricks panel and pytest, so you need to explicitly specify the cluster you want to use as well.

Tests run correctly individually but fail when run all at once, possibly due to reuse of the renewal_eol table.
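One way to rule out the shared-table collision would be to give each test its own uniquely named copy of the table. A hedged pytest sketch (the base table name is from the note above; how the tests actually create and drop the table is an assumption):

```python
import uuid

import pytest

def scratch_table_name(base="renewal_eol"):
    """Unique table name per test, so parallel or sequential tests never collide."""
    return f"{base}_{uuid.uuid4().hex[:8]}"

@pytest.fixture
def renewal_eol_table():
    # each test gets its own scratch copy of the shared table
    name = scratch_table_name()
    yield name
    # in real use, drop it here, e.g. spark.sql(f"DROP TABLE IF EXISTS {name}")
```

If the full-suite failures disappear with isolated table names, that would confirm the reuse hypothesis.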
