pandabear


A runtime schema validator for Pandas DataFrames.

When you have code that passes pandas.DataFrames around, it can become difficult to keep track of the state of the data at any given point in the pipeline.

In a nutshell

When you have a function like

def foo(df: pd.DataFrame) -> pd.DataFrame:
    # change `df` somehow
    return df

somewhere deep in your code, you can only know the state of df by running a debugger or scrutinizing the code. This is especially true in a large codebase with many developers. pandabear solves this problem by allowing you to define schemas for your pandas.DataFrames and validate them at runtime. This way, you can be sure that the pandas.DataFrame you're passing around is in the state you expect it to be.

Example

import pandas as pd
import pandabear as pb

# define your input and output schemas
class InputDFSchema(pb.DataFrameModel):
    col1: int
    col2: str
    col3: float = pb.Field(gt=0)

class OutputDFSchema(pb.DataFrameModel):
    col1: int
    col3: float = pb.Field(lt=0)

# decorate your function with `check_schemas` and pass the schemas to your function as type hints.
@pb.check_schemas
def foo(df: pb.DataFrame[InputDFSchema]) -> pb.DataFrame[OutputDFSchema]:
    df = df.drop('col2', axis=1)
    df.col3 *= -1
    return df

Now, whenever foo is called, validation triggers and you can be sure that the data follows your predefined schemas at input and return. If it does not, an exception will be raised.
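Conceptually, a decorator like check_schemas reads the schema classes off the function's type hints and checks the DataFrame against them each time the function is called. The following is a minimal, illustrative sketch of that idea in plain pandas; it is not pandabear's actual implementation, and the names Schema, validate and check_input are hypothetical:

```python
import functools
import typing

import pandas as pd

# Minimal sketch of decorator-based runtime validation. This is NOT
# pandabear's implementation -- `Schema`, `validate` and `check_input`
# are hypothetical names used only to illustrate the idea.

class Schema:
    """Base class whose annotations name the required columns."""

class InputSchema(Schema):
    col1: int
    col2: str

def validate(df: pd.DataFrame, schema: type) -> None:
    """Raise if `df` is missing any column the schema declares."""
    for name in typing.get_type_hints(schema):
        if name not in df.columns:
            raise ValueError(f"missing column: {name!r}")

def check_input(schema: type):
    """Validate the first argument against `schema` before calling."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(df, *args, **kwargs):
            validate(df, schema)
            return func(df, *args, **kwargs)
        return wrapper
    return decorator

@check_input(InputSchema)
def foo(df: pd.DataFrame) -> pd.DataFrame:
    return df

foo(pd.DataFrame({"col1": [1], "col2": ["a"]}))  # validates and returns df
```

pandabear adds much more on top of this (dtype checks, Field constraints, output validation), but the call-time hook is the core mechanism.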

This package is heavily inspired by the pandera Python package. Pandera is a fantastic library for statistical data testing that offers far more functionality than pandabear. Consider pandabear a lighter, pandas-only version of pandera. If you're looking for a more comprehensive solution that supports backends beyond pandas (Spark, Polars, etc.), we highly recommend you check it out.

See the package-level README.md for documentation and usage examples.

Usage

  • See the examples directory for detailed demos

Installation

  • Install globally or to a given environment:
    • Activate virtual environment (optional)
    • pip install pandabear

Prerequisites:

  • Python and a virtual environment manager of your choice
    • pip version > 21.0.0
  • docker

Setup:

  • Create/activate a virtual environment
  • Run make help to see various helper commands to interact with the project
  • Run make setup to install dependencies and setup the local package
  • To make commits: run make commit or make commit-all (adds all changed files to git staged)

Commitizen and Automated Versioning and Changelog

  • A Python package must have a version; this project uses the semantic versioning format (e.g. 1.1.1)
  • Instead of manually bumping a version, this template uses commitizen to auto-update the package version and auto-generate a CHANGELOG.md
  • This automation is done by parsing commit history as you apply changes, using the Conventional Commits format of commit messages
  • To auto-bump the version and generate a change log, GitHub Actions is used on pushes to main/master branch, defined in .github/workflows/bumpversion.yaml. See CI/CD section for more details
  • To enforce this commit message format, commitizen is used as a pre-commit hook. We highly recommend making commits with the commitizen CLI: run the make commit or make commit-all helper commands, or run the steps manually as shown below
# Can run `make commit` or `make commit-all` helper command, or each command manually, listed below: 
# Optionally run pre-commit checks to ensure code formatting/linting is good
pre-commit run --all-files -v

# First add your specific files to git staged changes, or add all via '.'
git add .

# Then run cz commit and follow prompt to generate a good commit message
cz commit
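The version bump commitizen derives from commit history can be sketched roughly as follows. This is a toy illustration of the Conventional Commits convention (fix implies a patch bump, feat a minor bump, a "!" after the type a major bump), not commitizen's actual code:

```python
import re

# Toy sketch of how Conventional Commit subjects map to semantic-version
# bumps. NOT commitizen's implementation -- `bump_for` is a hypothetical
# name used only to illustrate the convention.

def bump_for(subject: str) -> str:
    """Return the semver bump a Conventional Commit subject implies."""
    m = re.match(r"^(\w+)(\([^)]*\))?(!)?:", subject)
    if not m:
        raise ValueError(f"not a conventional commit subject: {subject!r}")
    ctype, _scope, bang = m.groups()
    if bang:
        return "major"   # breaking change
    if ctype == "feat":
        return "minor"   # new feature
    if ctype == "fix":
        return "patch"   # bug fix
    return "none"        # chore, docs, refactor, ... do not bump

print(bump_for("fix: handle empty DataFrame"))        # patch
print(bump_for("feat(field): add coercion support"))  # minor
print(bump_for("feat!: drop Python 3.7 support"))     # major
```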

Commitizen Demo

Misc.

  • Author names and emails are specified in setup.cfg. The package template initially fills in these values from the git user who created the package; if that user has no git name configured, a placeholder value is used and should be updated.
    • Multiple author names and emails can be specified, as a comma-separated list (ex. author = John Doe,Jane Doe)
  • Specifying dependencies:
    • You must specify the dependencies your project needs in setup.cfg (install_requires), preferably with wider-scope version constraints (e.g. requests>=2.0.0 instead of requests==2.1.3)
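For illustration, such constraints live in setup.cfg roughly like this (the listed packages and versions are placeholders, not this project's actual dependencies):

```ini
[options]
install_requires =
    # prefer lower bounds over exact pins so downstream users can resolve
    pandas>=1.3.0
    requests>=2.0.0
```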

CI/CD:

GitHub Actions are used to automatically bump the version and update the CHANGELOG.md based on the commit messages since the last version (no action needed to enable or configure; settings are in .github/workflows/bumpversion.yaml). Cloud Build is used to automatically package each version and publish it to PyPI.

Notes / Docs:

  • Uses:
    • pre-commit for Git pre-commit hooks
    • Black for python code formatting
    • isort to sort imports
    • gitleaks for secrets scanning in pre-commit hooks and CI/CD
    • pytest as test runner (can run both pytests and unittests)
    • Coverage for assessing test coverage
    • Commitizen for enforcing correct git commit message format, auto-bumping versions, and auto-generating the change log
