Over time, I found that I needed very similar code across multiple data science projects, so I decided to collect that code in one place to save myself time. I hope it makes your work a little easier too!
My understanding of data science and coding keeps evolving, so the code in this repo will be updated continuously. Contributions, suggestions, and feedback are all very much appreciated!
├── README.md <- You are here
├── src/ <- Source modules for the project
├── test/ <- Unit tests
├── nice_things/ <- Code snippets that can be copied and used directly
├── *.ipynb <- Notebooks demonstrating the functions in src/
├── pyproject.toml <- Pins the Python package versions used when this code was developed
This section gives an overview of the purpose of each file.
This folder contains modularized functions that can be easily reused.
eda.py
: helpers for exploratory data analysis.

model_supervised.py
: code for developing supervised learning models, including hyper-parameter tuning.

evaluate.py
: functions that evaluate the performance of a supervised learning model.

explain.py
: code that explains why a model makes a certain prediction.

model_cluster.py
: functions that simplify developing and analyzing clustering models, especially KMeans.
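To give a feel for what model_cluster.py aims to simplify, here is a minimal pure-Python sketch of the KMeans loop, alternating cluster assignment and centroid updates. This is only an illustration of the algorithm, not the repo's implementation, and kmeans_1d is a name invented for this example:

```python
import random

def kmeans_1d(points, k, n_iter=100, seed=0):
    """Minimal 1-D KMeans: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial centroids
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for x in points:
            idx = min(range(k), key=lambda i: (x - centroids[i]) ** 2)
            clusters[idx].append(x)
        # Recompute each centroid as the mean of its cluster
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return sorted(centroids)

print(kmeans_1d([1.0, 1.2, 0.8, 10.0, 10.5, 9.5], k=2))
```

On this toy data the two centroids converge near 1.0 and 10.0 regardless of the random initialization.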
This folder contains code that may not be used directly for developing statistical models, but is handy to copy from during development.

a_b_test.py
: code for A/B test analysis.

tune_grid.py
: predefined parameter grids that are a good starting point for hyper-parameter tuning.
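For context, a common building block of A/B test analysis is the two-proportion z-test. Below is a self-contained, stdlib-only sketch of that test; it is not necessarily the exact approach used in a_b_test.py, and two_proportion_ztest is a name invented for this example:

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for the difference of two proportions (pooled variance)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF (via the error function)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Example: treatment converts 240/1000, control converts 200/1000
z, p = two_proportion_ztest(240, 1000, 200, 1000)
print(f"z = {z:.3f}, p = {p:.3f}")
```

Here z comes out around 2.16 with a p-value just above 0.03, so the lift would be significant at the usual 5% level.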
supervised_clf_demo.ipynb
: demonstrates how the functions in this repo can help at various stages of developing a classification model. Runs in Colab.

supervised_reg_demo.ipynb
: demonstrates how the functions in this repo can help at various stages of developing a regression model.

cluster_demo.ipynb
: demonstrates how the functions in this repo can help develop a clustering model.
This section is mainly for development purposes. You can skip it if you only want to copy and use functions from this repo.
We use poetry to manage Python dependencies in this project. You can install poetry with the following command, copied from the official poetry guide.
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/install-poetry.py | python -
The installer installs the poetry tool to Poetry's bin directory. This location depends on your system:

- $HOME/.local/bin on Unix
- %APPDATA%\Python\Scripts on Windows

If this directory is not on your PATH, you will need to add it manually if you want to invoke Poetry with simply poetry. On macOS, I added the following line to my ~/.zshrc:
export PATH="$HOME/.local/bin:$PATH"
You can install all dependencies with the following.
poetry install
Now you can activate virtual environment with the following.
poetry shell
You can exit the poetry shell by typing exit in the command line.
We use the pre-commit library to automatically check .py scripts with flake8 and black before each commit. After setting up the virtual environment, you can install the hook with the following.
pre-commit install
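pre-commit reads its hooks from a .pre-commit-config.yaml file at the repo root. A typical configuration for black and flake8 might look like the following (the repo URLs and rev pins are illustrative assumptions, not copied from this project):

```yaml
# .pre-commit-config.yaml — illustrative example, not this repo's actual config
repos:
  - repo: https://github.com/psf/black
    rev: 22.3.0        # pin to the version your project uses
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/flake8
    rev: 4.0.1         # pin to the version your project uses
    hooks:
      - id: flake8
```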
We implement unit tests for the functions in src/ (work in progress). You can run the unit tests with the following command.
pytest
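pytest discovers functions named test_* inside the test/ folder. A minimal example of such a test looks like the one below; clip_to_range is a hypothetical helper for illustration, not a function from src/:

```python
# test/test_example.py — a minimal sketch of a pytest-style unit test
def clip_to_range(x, low, high):
    """Hypothetical helper: clamp x into the closed interval [low, high]."""
    return max(low, min(high, x))

def test_clip_to_range():
    assert clip_to_range(5, 0, 10) == 5    # inside the range: unchanged
    assert clip_to_range(-3, 0, 10) == 0   # below the range: clamped to low
    assert clip_to_range(42, 0, 10) == 10  # above the range: clamped to high
```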