Over time, I found that I needed very similar code across multiple data science projects, so I decided to collect that code in one place to save myself time. I hope it makes your work a little easier too!
My understanding of data science and coding keeps evolving, so the code in this repo will be updated continuously. Contributions, suggestions, and feedback are all very much appreciated!
├── README.md <- You are here
├── src/ <- Source modules for the project
├── test/ <- Unit tests
├── nice_things/ <- Code snippets that can be copied and used directly
├── *.ipynb <- Notebooks demonstrating the functions in src/
├── pyproject.toml <- Pins the Python package versions used when this code was developed
This section gives an overview of the purpose of each file.
This folder contains modularized functions that can be easily reused.
eda.py
: helpers for exploratory data analysis.

model_supervised.py
: code for developing supervised learning models, including hyper-parameter tuning.

evaluate.py
: functions that evaluate the performance of a supervised learning model.

explain.py
: code that explains why a model makes a certain prediction.

model_cluster.py
: functions that simplify developing and analyzing clustering models, especially KMeans.
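To give a feel for what model_cluster.py aims to simplify, here is a minimal pure-Python sketch of the KMeans loop, alternating cluster assignment and centroid updates. This is only an illustration of the algorithm, not the repo's implementation, and kmeans_1d is a name invented for this example:

```python
import random

def kmeans_1d(points, k, n_iter=100, seed=0):
    """Minimal 1-D KMeans: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial centroids
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for x in points:
            idx = min(range(k), key=lambda i: (x - centroids[i]) ** 2)
            clusters[idx].append(x)
        # Recompute each centroid as the mean of its cluster
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return sorted(centroids)

print(kmeans_1d([1.0, 1.2, 0.8, 10.0, 10.5, 9.5], k=2))
```

On this toy data the two centroids converge near 1.0 and 10.0 regardless of the random initialization.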
This folder contains code that may not be used directly for developing statistical models, but is handy to copy from during development.

a_b_test.py
: code for A/B test analysis.

tune_grid.py
: predefined parameter grids that are a good starting point for hyper-parameter tuning.
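For context, a common building block of A/B test analysis is the two-proportion z-test. Below is a self-contained, stdlib-only sketch of that test; it is not necessarily the exact approach used in a_b_test.py, and two_proportion_ztest is a name invented for this example:

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for the difference of two proportions (pooled variance)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF (via the error function)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Example: treatment converts 240/1000, control converts 200/1000
z, p = two_proportion_ztest(240, 1000, 200, 1000)
print(f"z = {z:.3f}, p = {p:.3f}")
```

Here z comes out around 2.16 with a p-value just above 0.03, so the lift would be significant at the usual 5% level.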
supervised_clf_demo.ipynb
: demonstrates how the functions in this repo can help at various stages of developing a classification model. Runs in Colab.

supervised_reg_demo.ipynb
: demonstrates how the functions in this repo can help at various stages of developing a regression model.

cluster_demo.ipynb
: demonstrates how the functions in this repo can help develop a clustering model.
This section is mainly for development purposes. You can skip it if you only want to copy and use functions from this repo.
We use poetry to manage Python dependencies in this project. You can install poetry with the following command, copied from the official poetry guide.
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/install-poetry.py | python -
The installer installs the poetry tool to Poetry's bin directory. This location depends on your system:

- $HOME/.local/bin on Unix
- %APPDATA%\Python\Scripts on Windows

If this directory is not on your PATH, you will need to add it manually if you want to invoke Poetry with simply poetry. On macOS, I added the following line to my ~/.zshrc:
export PATH="$HOME/.local/bin:$PATH"
You can install all dependencies with the following.
poetry install
Now you can activate virtual environment with the following.
poetry shell
You can exit the poetry shell by typing exit in the command line.
We use the pre-commit library to automatically check .py scripts with flake8 and black before each commit. After setting up the virtual environment, you can install the hook with the following.
pre-commit install
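pre-commit reads its hooks from a .pre-commit-config.yaml file at the repo root. A typical configuration for black and flake8 might look like the following (the repo URLs and rev pins are illustrative assumptions, not copied from this project):

```yaml
# .pre-commit-config.yaml — illustrative example, not this repo's actual config
repos:
  - repo: https://github.com/psf/black
    rev: 22.3.0        # pin to the version your project uses
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/flake8
    rev: 4.0.1         # pin to the version your project uses
    hooks:
      - id: flake8
```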
We implement unit tests for the functions in src/ (work in progress). You can run the unit tests with the following command.
pytest
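pytest discovers functions named test_* inside the test/ folder. A minimal example of such a test looks like the one below; clip_to_range is a hypothetical helper for illustration, not a function from src/:

```python
# test/test_example.py — a minimal sketch of a pytest-style unit test
def clip_to_range(x, low, high):
    """Hypothetical helper: clamp x into the closed interval [low, high]."""
    return max(low, min(high, x))

def test_clip_to_range():
    assert clip_to_range(5, 0, 10) == 5    # inside the range: unchanged
    assert clip_to_range(-3, 0, 10) == 0   # below the range: clamped to low
    assert clip_to_range(42, 0, 10) == 10  # above the range: clamped to high
```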