A runtime schema validator for Pandas DataFrames.
When you have code that passes `pandas.DataFrame`s around, it can become difficult to keep track of the state of the data at any given point in the pipeline.
When you have a function like

```python
def foo(df: pd.DataFrame) -> pd.DataFrame:
    # change `df` somehow
    return df
```
somewhere deep in your code, you can only know the state of `df` by running a debugger or scrutinizing the code. This is especially true when you have a large codebase with many developers. `pandabear` solves this problem by allowing you to define schemas for your `pandas.DataFrame`s and validate them at runtime. This way, you can be sure that the `pandas.DataFrame` you're passing around is in the state you expect it to be.
```python
import pandas as pd
import pandabear as pb

# define your input and output schemas
class InputDFSchema(pb.DataFrameModel):
    col1: int
    col2: str
    col3: float = pb.Field(gt=0)

class OutputDFSchema(pb.DataFrameModel):
    col1: int
    col3: float = pb.Field(lt=0)

# decorate your function with `check_schemas` and pass the schemas to your function as type hints
@pb.check_schemas
def foo(df: pb.DataFrame[InputDFSchema]) -> pb.DataFrame[OutputDFSchema]:
    df = df.drop('col2', axis=1)
    df.col3 *= -1
    return df
```
Now, whenever `foo` is called, validation triggers and you can be sure that the data follows your predefined schemas at input and return. If it does not, an exception will be raised.
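For example (a hypothetical call sequence; the DataFrame contents are made up, and the exact exception type raised on failure is whatever `pandabear` defines):

```python
import pandas as pd

# Matches InputDFSchema (col2 is str, col3 > 0): validation passes
good_df = pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"], "col3": [0.5, 1.5]})
result = foo(good_df)  # returns a DataFrame validated against OutputDFSchema

# Violates InputDFSchema (col3 must be > 0): raises a schema validation error
bad_df = pd.DataFrame({"col1": [1], "col2": ["a"], "col3": [-0.5]})
foo(bad_df)
```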
This package is heavily inspired by the `pandera` Python package. Pandera is a fantastic Python library for statistical data testing that offers a lot more functionality than `pandabear`. Consider this a lighter, `pandas`-only version of `pandera`. If you're looking for a more comprehensive solution that supports backends other than just `pandas` (like `spark`, `polars`, etc.), we highly recommend you check it out.
- See the package-level README.md for documentation and usage examples
- See the examples directory for a detailed demo
- Install globally or to a given environment:
  - Activate a virtual environment (optional)
  - Run `pip install pandabear`
- Create/activate a virtual environment
- Run `make help` to see the various helper commands for interacting with the project
- Run `make setup` to install dependencies and set up the local package
- To make commits: run `make commit` or `make commit-all` (the latter adds all changed files to the git staging area first)
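Put together, a typical first-time setup might look like this (a sketch; the virtual environment name `.venv` is an assumption, not mandated by the template):

```bash
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate

# List the available helper commands
make help

# Install dependencies and set up the local package
make setup
```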
- A Python package must have a version; here we use the semantic versioning format (ex. 1.1.1)
- Instead of manually bumping the version, this template uses commitizen to auto-update the package version and auto-generate a CHANGELOG.md
- This automation works by parsing the commit history as you apply changes, using the Conventional Commits format for commit messages
- To auto-bump the version and generate a change log, GitHub Actions runs on pushes to the main/master branch, as defined in .github/workflows/bumpversion.yaml. See the CI/CD section for more details
- To enforce this commit message format, commitizen is used as a pre-commit hook, and we highly recommend you use the commitizen CLI to make commits. You can run the `make commit` or `make commit-all` helper commands, or run each step manually as in the example below:
```bash
# Can run the `make commit` or `make commit-all` helper commands, or each command manually, listed below:

# Optionally run pre-commit checks to ensure code formatting/linting is good
pre-commit run --all-files -v

# First add your specific files to git staged changes, or add all via '.'
git add .

# Then run cz commit and follow the prompt to generate a good commit message
cz commit
```
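For reference, here are a few illustrative Conventional Commits messages and the version bump each would trigger under semantic versioning (the messages themselves are made up):

```text
fix: handle empty DataFrames in check_schemas    -> patch bump (1.1.1 -> 1.1.2)
feat: support regex checks on string columns     -> minor bump (1.1.1 -> 1.2.0)
feat!: rework the DataFrameModel validation API  -> major bump (1.1.1 -> 2.0.0)
```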
- Author names and emails are specified in setup.cfg. The package template initially fills in these values from the git user who created the package; if a user doesn't have a git name specified, a placeholder value is used and should be updated.
- Multiple author names and emails can be specified as a comma-separated list (ex. `author = John Doe,Jane Doe`)
- Specifying dependencies:
  - You must specify the dependencies your project needs to work in setup.cfg (install_requires), preferably with wider-scope version constraints (eg. requests>=2.0.0 instead of requests==2.1.3); see the illustrative excerpt below
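An illustrative setup.cfg excerpt covering the author and dependency points above (the names, emails, and version pins are made up; the field names follow setuptools' declarative config):

```ini
[metadata]
name = pandabear
version = 1.1.1
author = John Doe,Jane Doe
author_email = john.doe@example.com,jane.doe@example.com

[options]
install_requires =
    pandas>=1.0.0
```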
GitHub Actions is used to automatically bump the version and update the CHANGELOG.md based on the commit messages since the last version (no action is needed to enable or configure this; settings live in .github/workflows/bumpversion.yaml). Cloud Build is used to automatically package each version and publish it to PyPI.
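A minimal sketch of what such a bump workflow can look like, assuming the commitizen-tools/commitizen-action (the template's actual bumpversion.yaml may differ):

```yaml
name: Bump version
on:
  push:
    branches: [main, master]
jobs:
  bump-version:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0  # full history, so commitizen can inspect every commit
      - uses: commitizen-tools/commitizen-action@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
```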
- Uses:
  - pre-commit for Git pre-commit hooks
  - Black for Python code formatting
  - isort to sort imports
  - gitleaks for secrets scanning in pre-commit hooks and CI/CD
  - pytest as the test runner (can run both pytest tests and unittests)
  - Coverage for assessing test coverage
  - Commitizen for enforcing correct git commit message format, auto-bumping versions, and auto-generating the change log
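Several of these tools are typically wired together through pre-commit. An illustrative .pre-commit-config.yaml sketch (the hook revisions are assumptions, and the template's actual config may differ):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 23.3.0
    hooks:
      - id: black
  - repo: https://github.com/pycqa/isort
    rev: 5.12.0
    hooks:
      - id: isort
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.16.3
    hooks:
      - id: gitleaks
  - repo: https://github.com/commitizen-tools/commitizen
    rev: v3.2.2
    hooks:
      - id: commitizen
        stages: [commit-msg]
```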