[ENH] Introduce a Dataframe validation method? #152

Open
szuckerman opened this issue Apr 17, 2019 · 1 comment
Labels
enhancement: New feature or request
good intermediate issue: Issues that are good for seasoned programmers to make a contribution
question: Further information is requested

Comments

@szuckerman
Collaborator

I'm working on a .validate() method that would validate a DataFrame for certain characteristics.

Most of the examples are taken from this Stack Overflow answer, but there are a few others that I think are important to add (such as regex matching and uniqueness over multiple columns).

We could just add PandasSchema to the project, but it's a bit wordy when creating the schema and it doesn't validate uniqueness across multiple columns (which is a pretty big deal in my specific use cases). PandasSchema mentions this and alludes to doing more than just column-level validation, for example in this comment.

My schema looks like this:

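schema = {
    # a tuple key means the combination of columns must be unique
    ('firstname', 'lastname'): 'unique',
    # at most 8 characters
    'login': 'len<=8',
    # raw string, so the regex backslashes are not treated as string escapes
    'phonenumber': r'regex:((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}',
}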

and then somewhere in the chained methods one would call

df.validate(schema)
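
As a rough illustration of how rule strings like the ones above could be interpreted, here is a minimal sketch. It is not the implementation being proposed in this issue: the function name `find_invalid_rows`, the sample data, and the choice to return failing row labels are all assumptions made for the example. It only understands the three rule forms shown in the schema (`unique`, `len<=N`, `regex:...`).

```python
import re

import pandas as pd


def find_invalid_rows(df, schema):
    """Hypothetical helper: map each schema key to the index labels that break its rule."""
    failures = {}
    for key, rule in schema.items():
        if rule == 'unique':
            # A tuple key means uniqueness over the combination of columns.
            cols = list(key) if isinstance(key, tuple) else [key]
            bad = df.duplicated(subset=cols, keep=False)
        elif rule.startswith('len<='):
            limit = int(rule[len('len<='):])
            bad = df[key].astype(str).str.len() > limit
        elif rule.startswith('regex:'):
            pattern = rule[len('regex:'):]
            # The whole value must match the pattern, not just a prefix.
            matches = df[key].astype(str).map(
                lambda value: re.fullmatch(pattern, value) is not None
            )
            bad = ~matches.astype(bool)
        else:
            raise ValueError(f'unknown rule: {rule!r}')
        if bad.any():
            failures[key] = df.index[bad.to_numpy()]
    return failures


# Made-up sample data for the example.
df = pd.DataFrame({
    'firstname': ['Ann', 'Bob', 'Ann'],
    'lastname': ['Lee', 'Ray', 'Lee'],
    'login': ['alee', 'bobray123456', 'alee2'],
    'phonenumber': ['555-1234', '(555) 555-1234', '5551234'],
})

schema = {
    ('firstname', 'lastname'): 'unique',
    'login': 'len<=8',
    'phonenumber': r'regex:((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}',
}

failures = find_invalid_rows(df, schema)
```

With this sample data, rows 0 and 2 share the same first and last name, row 1 has a login longer than 8 characters, and row 2 has a phone number without a hyphen, so each rule reports at least one failing row.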

What's nice about PandasSchema is that it outputs the rows that don't match the validation; mine doesn't at the moment. I'm thinking about throwing exceptions, though, so that program execution stops if the DataFrame doesn't pass validation. One could also have different schemas and pepper them throughout the method chain to ensure the DataFrame is transforming correctly. My main concern is ensuring that a DataFrame is 'correct' enough to save to a database table with a set schema.

The main thing I'm currently struggling with is whether I should throw exceptions and, if not, what the validation should do instead. I think outputting the values that don't match the schema is reasonable, but I don't see how that helps much in the middle of a method chain.
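
One possible way to reconcile the two options, sketched under assumptions: raise by default, but attach the offending rows to the exception so they can still be inspected, and return the DataFrame unchanged on success so the chain continues. The names `DataFrameValidationError`, `validate`, `checks`, and `on_fail` are hypothetical, and the checks are plain functions here rather than the schema strings discussed above.

```python
import warnings

import pandas as pd


class DataFrameValidationError(ValueError):
    """Hypothetical exception that keeps the offending rows for inspection."""

    def __init__(self, message, invalid_rows):
        super().__init__(message)
        self.invalid_rows = invalid_rows


def validate(df, checks, on_fail='raise'):
    """Run each check, then return df unchanged so a method chain can continue.

    checks maps a label to a function that takes the DataFrame and returns
    a boolean Series that is True for rows that break the rule.
    """
    bad = pd.Series(False, index=df.index)
    broken = []
    for label, check in checks.items():
        mask = check(df)
        if mask.any():
            broken.append(label)
            bad |= mask
    if broken:
        message = f'{int(bad.sum())} row(s) failed: {", ".join(broken)}'
        if on_fail == 'raise':
            raise DataFrameValidationError(message, df[bad])
        # Any other mode only warns and lets the chain carry on.
        warnings.warn(message)
    return df


df = pd.DataFrame({'login': ['alee', 'bobray123456']})
checks = {'login len<=8': lambda d: d['login'].str.len() > 8}

# Default mode: execution stops, but the failing rows travel with the error.
try:
    df.pipe(validate, checks)
except DataFrameValidationError as err:
    failed = err.invalid_rows

# Warning mode: the chain keeps going and receives the same DataFrame back.
same = df.pipe(validate, checks, on_fail='warn')
```

Because `validate` returns its input, it can sit between other steps via `df.pipe(...)`, and a caller who wants the PandasSchema-style report can catch the exception and read `invalid_rows`.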

@ericmjl
Member

ericmjl commented May 8, 2019

@szuckerman I just realized I let this ball drop.

We now have a data dictionary thingy, and I think you could add another custom accessor that allows for validation of a dataframe. Or it could just be a function, just as you provided. What do you think?

Btw, I'm so happy you brought up PandasSchema; the code shown in their examples is really awesome!

@ericmjl added the enhancement, good intermediate issue, and question labels on May 8, 2019
@ericmjl changed the title from "Validation method" to "[ENH] Introduce a Dataframe validation method?" on May 8, 2019