[ENH] Introduce a Dataframe validation method? #152
Labels
enhancement
New feature or request
good intermediate issue
Issues that are good for seasoned programmers to make a contribution
question
Further information is requested
I'm working on a `.validate()` method that would validate a DataFrame for certain characteristics. Most of the examples are taken from this stackoverflow answer, but there are a few others that I think are important to add (such as regex matching and uniqueness over multiple columns).
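For illustration, the two extra checks mentioned above (regex matching and uniqueness over multiple columns) can be sketched with plain pandas. This is only a rough sketch, not the proposed `.validate()` implementation; the column names, the sample data, and the SKU pattern are made up for the example.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for this sketch.
df = pd.DataFrame({
    "order_id": [1, 1, 2, 2],
    "line_no": [1, 2, 1, 1],
    "sku": ["AB-001", "AB-002", "ab-3", "AB-004"],
})

# Regex matching: select rows whose value does not fully match the pattern.
bad_sku = df[~df["sku"].astype(str).str.fullmatch(r"[A-Z]{2}-\d{3}")]

# Uniqueness over multiple columns: keep=False flags every row of a
# duplicated (order_id, line_no) combination, not just the later repeats.
dup_rows = df[df.duplicated(subset=["order_id", "line_no"], keep=False)]
```

Both checks return the offending rows, so the result could either be reported to the user or used to decide whether to raise.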
We could just add PandasSchema to the project, but it's a bit wordy when creating the schema and it doesn't validate uniqueness across multiple columns (which is a pretty big deal in my specific use cases).
`PandasSchema` mentions this and alludes to doing more than just column-level validation, such as in this comment.
My schema looks like this:
and then somewhere in the chained methods one would call
What's nice about `PandasSchema` is that it outputs the rows that don't match the validation; mine doesn't at the moment. I'm thinking about throwing exceptions, though, so that program execution stops if the `DataFrame` doesn't pass validation. One could also have different schemas and pepper them throughout the method chain to ensure the `DataFrame` is transforming correctly. My main concern is ensuring that a DataFrame is 'correct' enough to save to a database table with a set schema.
The main thing I'm currently struggling with is whether I should throw `Exceptions`, and if not, what should the validation do? I think outputting the values that don't match the schema is reasonable, but I guess I don't see how that helps so much in the method chain.
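One way to reconcile the two options is to raise an exception that carries the offending rows, so execution stops but the failures are still available for inspection. The sketch below is one possible shape, not the method proposed in this issue: `validate`, `ValidationError`, and the `unique` / `patterns` parameters are hypothetical names, and it assumes the DataFrame has a unique index.

```python
import pandas as pd


class ValidationError(Exception):
    """Hypothetical exception that keeps the rows that failed validation."""

    def __init__(self, message, failures):
        super().__init__(message)
        self.failures = failures  # DataFrame of offending rows


def validate(df, unique=None, patterns=None):
    """Return df unchanged if every check passes, else raise ValidationError.

    unique:   list of columns whose combined values must be unique
    patterns: mapping of column name -> regex the whole value must match
    """
    bad = pd.Series(False, index=df.index)
    if unique:
        # keep=False flags every row of a duplicated combination
        bad |= df.duplicated(subset=list(unique), keep=False)
    for column, pattern in (patterns or {}).items():
        bad |= ~df[column].astype(str).str.fullmatch(pattern)
    if bad.any():
        raise ValidationError(f"{int(bad.sum())} row(s) failed validation", df[bad])
    return df
```

Because the function returns the DataFrame on success, it can sit anywhere in a method chain via `DataFrame.pipe`, for example `df.pipe(validate, unique=["order_id", "line_no"])`, and a caller who wants the non-matching rows can catch `ValidationError` and read its `failures` attribute.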