prepare fix and readme updates
lizzypy committed Oct 29, 2023
1 parent 9cc086f commit 9c5e17b
Showing 2 changed files with 43 additions and 3 deletions.
39 changes: 39 additions & 0 deletions README.md
@@ -0,0 +1,39 @@
## Simple Recommendation Engine

This repository contains a simple movie recommendation engine.
The development of this engine is used in articles and talks to demonstrate how
test-driven development and CI/CD can be applied to data analytics and ML models.

To use the notebooks and scripts in this repository, create a virtual environment with Python 3.9
(this is the latest Python version that AWS Glue supports).

To manage Python environments I recommend pyenv (it's forked from rbenv):
- You can get started with pyenv here: https://github.com/pyenv/pyenv#getting-pyenv

Once you have pyenv installed, follow these steps from the root of the project:

1. `pyenv install 3.9.15`
2. `pyenv local 3.9.15`
3. `pip3 install -r analysis/requirements.txt`
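
After the steps above, it can be useful to confirm that the active interpreter is the one pyenv selected. The following is a minimal sketch, not part of the repository; the helper name `interpreter_matches` is hypothetical, and it assumes any 3.9.x patch release is acceptable.

```python
import sys

# Assumption: the README pins 3.9.15, so any 3.9.x interpreter is accepted here.
EXPECTED = (3, 9)


def interpreter_matches(version_info=None, expected=EXPECTED):
    """Return True when the interpreter's major.minor version equals `expected`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == expected


if __name__ == "__main__":
    found = sys.version.split()[0]
    if interpreter_matches():
        print("Python version OK:", found)
    else:
        print("Expected Python 3.9.x, found", found)
```

If the check fails, the shell is probably not picking up the pyenv shim for this directory.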

### Analysis

You should now be able to run the following command:

`cd analysis/notebooks && jupyter lab`

This should open JupyterLab in the notebooks directory. You should be able to run
`Movie Data Analysis.ipynb` from start to finish without errors.

### Tests

To run all tests from the root of the project:

`cd analysis && pytest`

To run a single test from root:

`cd analysis && pytest utils/tests/test_cleaning.py`
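
For readers new to pytest, the sketch below shows the general shape of a test that the commands above would collect. It is illustrative only: `normalise_title` is a hypothetical helper defined inline, not a function from this repository's `cleaning` module, and the real tests in `utils/tests/test_cleaning.py` may look different.

```python
# Hypothetical cleaning helper; the real functions in utils/ may differ.
def normalise_title(title):
    """Trim surrounding whitespace and collapse internal runs of spaces."""
    return " ".join(title.split())


# pytest collects functions whose names start with `test_`.
def test_normalise_title_trims_and_collapses_spaces():
    assert normalise_title("  Toy   Story ") == "Toy Story"


def test_normalise_title_leaves_clean_titles_unchanged():
    assert normalise_title("Heat") == "Heat"
```

In a test-driven workflow the two test functions would be written first and the helper filled in until they pass.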


### Terraform
7 changes: 4 additions & 3 deletions analysis/prepare.py
@@ -2,7 +2,8 @@
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
-from pydata_engine_utils import cleaning,extract
+from pydata_engine_utils import cleaning, extract


def prepare():
    s3 = boto3.client('s3')
@@ -22,10 +23,10 @@ def prepare():
    cosine_df = pd.DataFrame(cosine_similarity(tfidf_matrix, tfidf_matrix))
    cosine_df.to_csv('the_matrix.csv')

-    s3.upload_file('the_matrix.csv', 'pydatapipelinebucket', 'the_s3_matrix.csv')
+    s3.upload_file('the_matrix.csv', 'pydatapipelinebucket-final', 'the_s3_matrix.csv')

    print("Finished!")


if __name__ == "__main__":
    prepare()

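The hunks above do not show how `tfidf_matrix` is built, so the following is only a sketch of the similarity step under stated assumptions: the toy `descriptions` list stands in for the cleaned movie data, and `most_similar` is a hypothetical helper that is not part of `prepare.py`. It uses the same `TfidfVectorizer` and `cosine_similarity` calls that the script imports.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy descriptions standing in for the cleaned movie data (assumption).
descriptions = [
    "space adventure with robots",
    "robots explore deep space",
    "romantic comedy in paris",
]

# Turn each description into a TF-IDF weighted term vector.
tfidf_matrix = TfidfVectorizer().fit_transform(descriptions)

# Pairwise cosine similarity gives an n x n matrix:
# row i holds the score of item i against every item, including itself.
cosine_df = pd.DataFrame(cosine_similarity(tfidf_matrix, tfidf_matrix))


def most_similar(index, scores=cosine_df):
    """Return the position of the most similar item, excluding the item itself."""
    return int(scores.iloc[index].drop(index).idxmax())


if __name__ == "__main__":
    print(cosine_df.round(2))
    print("Closest to item 0:", most_similar(0))
```

A recommendation lookup then reduces to reading one row of the stored matrix, which may be why the script writes it to CSV and uploads it to S3.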