Added validation method IsTypeValidation #56

Open
wants to merge 10 commits into
base: master
Changes from 6 commits
75 changes: 75 additions & 0 deletions pandas_schema/validation.py
@@ -10,6 +10,7 @@
from .validation_warning import ValidationWarning
from .errors import PanSchArgumentError
from pandas.api.types import is_categorical_dtype, is_numeric_dtype
from typing import List


class _BaseValidation:
@@ -214,6 +215,80 @@ def validate(self, series: pd.Series) -> pd.Series:
        return (series >= self.min) & (series < self.max)


def convert_type_to_dtype(type_to_convert: type) -> np.dtype:
Owner

I'm fairly sure that np.dtype(int) returns np.int64, making this function redundant.

Author

@chrispijo chrispijo Mar 9, 2021

I checked a few Python versions. On Ubuntu Linux with Python 3.8.5 it does indeed return np.int64. In my IDE on Windows it returns np.int32, for both Python 3.8 and 3.9. These Stack Overflow answers explain that this comes from C on Windows, where long int is 32-bit even though the system is 64-bit.

So on Windows pd.Series([1,2,3]).dtype results in np.int64, while pd.Series(np.array([1,2,3])).dtype results in np.int32.

This makes it tricky to anticipate which dtype you will get in which situation.
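
One way to sidestep the width question is to test for "any integer dtype" instead of comparing against np.dtype(int). The following is a minimal sketch, assuming it is enough to know that a series holds integers of some width. The helper name is_integer_series is hypothetical and not part of pandas_schema.

```python
import numpy as np
import pandas as pd


def is_integer_series(series: pd.Series) -> bool:
    """Return True for any numpy integer dtype, regardless of its width."""
    # Hypothetical helper: compares against the abstract np.integer category
    # rather than a fixed-width dtype such as int32 or int64.
    return bool(np.issubdtype(series.dtype, np.integer))


# Both widths are built explicitly, so the result does not depend on the
# platform's default integer size.
narrow = pd.Series(np.array([1, 2, 3], dtype=np.int32))
wide = pd.Series(np.array([1, 2, 3], dtype=np.int64))
floats = pd.Series([1.0, 2.0, 3.0])
```

Here is_integer_series accepts both narrow and wide and rejects floats. The sketch only covers plain numpy dtypes. pandas extension dtypes such as categorical would need separate handling.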

EDIT:
Converting the series instead might be a solution. The code below behaves fairly consistently, although I only tried the data types int, float and bool, leaving out (at least) datetime. Using np.zeros feels a bit hacky, though, and a conversion is still needed.

import numpy as np
import pandas as pd

# The dtypes in the comments are the ones observed on Windows.
np.dtype(int)  # int32
series = pd.Series([1,2,3])  # int64
python_type = type(np.zeros(1, series.dtype).tolist()[0])  # int
series_converted_type = series.astype(python_type)  # int32

np.dtype(float)  # float64
series = pd.Series([1.0,2,3])  # float64
python_type = type(np.zeros(1, series.dtype).tolist()[0])  # float
series_converted_type = series.astype(python_type)  # float64

np.dtype(bool)  # bool (dtype)
series = pd.Series([True,False,True])  # bool (dtype)
python_type = type(np.zeros(1, series.dtype).tolist()[0])  # bool (normal Python class)
series_converted_type = series.astype(python_type)  # bool (dtype)
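
The np.zeros step from the snippet above could be wrapped in a small helper. This is only a sketch: python_type_of is a hypothetical name, and like the snippet it is exercised only for int, float and bool. Other dtypes, datetime in particular, are not covered.

```python
import numpy as np
import pandas as pd


def python_type_of(series: pd.Series) -> type:
    """Return the built-in Python type matching the series' numpy dtype."""
    # np.zeros builds a one-element array of the series' dtype; tolist()
    # converts that element to a native Python scalar, whose type is returned.
    return type(np.zeros(1, dtype=series.dtype).tolist()[0])


# The integer width is set explicitly so the example is platform-independent.
int_type = python_type_of(pd.Series(np.array([1, 2, 3], dtype=np.int32)))
float_type = python_type_of(pd.Series([1.0, 2.0, 3.0]))
bool_type = python_type_of(pd.Series([True, False, True]))
```

With this helper a 32-bit and a 64-bit integer series both map to the Python class int, which avoids the width ambiguity at the cost of an extra conversion.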

    """
    Converts type to the numpy variant dtype.
    :param type_to_convert: The type to convert to np.dtype.
    :return: Numpy dtype
    """
    # DISLIKE 02: It is doubtful if this function converts all types correctly to numpy in accordance to a Pandas
    # Series.
    if type_to_convert == int:
        return np.dtype(np.int64)  # np.dtype(int) results in np.int32.
    elif type_to_convert == str:
        return np.dtype(object)
    else:
        return np.dtype(type_to_convert)


class IsTypeValidation(_SeriesValidation):
    """
    Checks that each element in the series equals one of the allowed types. This validation only makes sense for an
    object series.

    Examples
    --------
    >>> v = IsTypeValidation(allowed_types=[str, int])
    >>> s = pd.Series(data=["alpha", 1.4, True, "beta", 5])
    >>> v.validate(series=s)
    0     True
    1    False
    2    False
    3     True
    4     True
    dtype: bool
    """

    def __init__(self, allowed_types: List, **kwargs):
        """
        :param allowed_types: List containing the allowed data types.
        """
        self.allowed_types = allowed_types
        super().__init__(**kwargs)

    @property
    def default_message(self):
        return "was not of listed type {}".format(self.allowed_types.__str__())

    def get_errors(self, series: pd.Series, column: 'column.Column' = None):

        # Numpy dtypes other than 'object' can be validated with IsDtypeValidation instead, but only if the
        # allowed_types is singular. Otherwise continue.
        # DISLIKE 01: IsDtypeValidation only allows a single dtype. So this if-statement redirects only if one type is
Owner

I would rather you implement multiple-dtype support in IsDtypeValidation than here.

Author

Agreed. Looking forward to your answer about converting types.
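
As a rough illustration of what multiple-dtype support could look like, here is a minimal sketch of the comparison itself. The function has_allowed_dtype is a hypothetical name and not the existing IsDtypeValidation API. It assumes that an exact dtype match against any entry in the list is what is wanted.

```python
from typing import Iterable

import numpy as np
import pandas as pd


def has_allowed_dtype(series: pd.Series, allowed_dtypes: Iterable) -> bool:
    """Return True if the series' dtype equals any of the allowed dtypes."""
    # np.dtype() normalises inputs such as np.int64, float or object so that
    # they can be compared directly with series.dtype.
    return any(series.dtype == np.dtype(allowed) for allowed in allowed_dtypes)


# dtypes are set explicitly so the example does not rely on platform defaults.
ints = pd.Series(np.array([1, 2, 3], dtype=np.int64))
texts = pd.Series(["a", "b"], dtype=object)
```

A validation class could call such a check once per series and report a single error when it fails. Because the match is exact, the platform-dependent integer width discussed earlier still matters unless both np.int32 and np.int64 are listed.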

        # specified in the list self.allowed_types.
        if not series.dtype == np.dtype(object) and len(self.allowed_types) == 1:
            allowed_type = convert_type_to_dtype(type_to_convert=self.allowed_types[0])
            new_validation_method = IsDtypeValidation(dtype=np.dtype(allowed_type))
            return new_validation_method.get_errors(series=series)

        # Else, validate each element along the allowed types.
Owner

Isn't this already in the default method implementation? If so, just call super().get_errors().

Author

Correct me if I misunderstood you, but the code below line 274 can then be rewritten to

return super().get_errors(series=series, column=column)

with the default value None removed from the column parameter.
I will commit this together with your other feedback later on.

By the way, why did you use column as a parameter name in get_errors()? It shadows the name column from the outer scope.
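
The delegation pattern under discussion can be sketched as follows. Both classes are hypothetical stand-ins written for this illustration. They do not reproduce the actual pandas_schema base class, and errors are plain tuples here rather than ValidationWarning objects.

```python
import pandas as pd


class SketchSeriesValidation:
    """Hypothetical stand-in for a base class with a default get_errors()."""

    message = "failed validation"

    def validate(self, series: pd.Series) -> pd.Series:
        raise NotImplementedError

    def get_errors(self, series: pd.Series) -> list:
        # Default behaviour: report every element for which validate() is False.
        failed = series[~self.validate(series)]
        return [(row, value, self.message) for row, value in failed.items()]


class SketchIsTypeValidation(SketchSeriesValidation):
    """Only validate() carries type logic; error collection is inherited."""

    message = "was not of a listed type"

    def __init__(self, allowed_types: list):
        self.allowed_types = allowed_types

    def validate(self, series: pd.Series) -> pd.Series:
        return series.apply(type).isin(self.allowed_types)

    def get_errors(self, series: pd.Series) -> list:
        # Special cases could be handled here first; everything else falls
        # back to the inherited loop instead of duplicating it.
        return super().get_errors(series)


errors = SketchIsTypeValidation([str, int]).get_errors(
    pd.Series(["alpha", 1.4, True, "beta", 5])
)
```

In this sketch the rows holding 1.4 and True are reported, which matches the docstring example in the diff: bool is not accepted when only str and int are listed, because the check compares exact types.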

        errors = []
        valid_indices = series.index[~self.validate(series)]
        for i in valid_indices:
            element = series[i]
            errors.append(ValidationWarning(
                message=self.message,
                value=element,
                row=i,
                column=series.name
            ))

        return errors

    def validate(self, series: pd.Series) -> pd.Series:
        return series.apply(type).isin(self.allowed_types)


class IsDtypeValidation(_BaseValidation):
    """
    Checks that a series has a certain numpy dtype
2 changes: 1 addition & 1 deletion pandas_schema/version.py
@@ -1 +1 @@
-__version__ = '0.3.5'
+__version__ = '0.3.5.7'
chrispijo marked this conversation as resolved.