Added validation method IsTypeValidation #56
base: master
Changes from 6 commits
8b8c865
b56483a
e382ce5
b1835a3
ca8e1e5
2253125
3598ce9
59713cb
f8e593e
d05f365
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,6 +10,7 @@ | |
from .validation_warning import ValidationWarning | ||
from .errors import PanSchArgumentError | ||
from pandas.api.types import is_categorical_dtype, is_numeric_dtype | ||
from typing import List | ||
|
||
|
||
class _BaseValidation: | ||
|
@@ -214,6 +215,80 @@ def validate(self, series: pd.Series) -> pd.Series: | |
return (series >= self.min) & (series < self.max) | ||
|
||
|
||
def convert_type_to_dtype(type_to_convert: type) -> np.dtype: | ||
""" | ||
Converts type to the numpy variant dtype. | ||
:param type_to_convert: The type to convert to np.dtype. | ||
:return: Numpy dtype | ||
""" | ||
# DISLIKE 02: It is doubtful if this function converts all types correctly to numpy in accordance to a Pandas | ||
# Series. | ||
if type_to_convert == int: | ||
return np.dtype(np.int64) # np.dtype(int) results in np.int32 on Windows. | ||
elif type_to_convert == str: | ||
return np.dtype(object) | ||
else: | ||
return np.dtype(type_to_convert) | ||
|
||
|
||
class IsTypeValidation(_SeriesValidation): | ||
""" | ||
Checks that each element in the series equals one of the allowed types. This validation only makes sense for an | ||
object series. | ||
|
||
Examples | ||
-------- | ||
>>> v = IsTypeValidation(allowed_types=[str, int]) | ||
>>> s = pd.Series(data=["alpha", 1.4, True, "beta", 5]) | ||
>>> v.validate(series=s) | ||
0 True | ||
1 False | ||
2 False | ||
3 True | ||
4 True | ||
dtype: bool | ||
""" | ||
|
||
def __init__(self, allowed_types: List, **kwargs): | ||
""" | ||
:param allowed_types: List containing the allowed data types. | ||
""" | ||
self.allowed_types = allowed_types | ||
super().__init__(**kwargs) | ||
|
||
@property | ||
def default_message(self): | ||
return "was not of listed type {}".format(self.allowed_types.__str__()) | ||
|
||
def get_errors(self, series: pd.Series, column: 'column.Column' = None): | ||
|
||
# Numpy dtypes other than 'object' can be validated with IsDtypeValidation instead, but only if the | ||
# allowed_types is singular. Otherwise continue. | ||
# DISLIKE 01: IsDtypeValidation only allows a single dtype. So this if-statement redirects only if one type is | ||
Comment: I would rather that you implement multiple dtype support in the
Reply: Agree. Looking forward to your answer about converting types. |
||
# specified in the list self.allowed_types. | ||
if not series.dtype == np.dtype(object) and len(self.allowed_types) == 1: | ||
allowed_type = convert_type_to_dtype(type_to_convert=self.allowed_types[0]) | ||
new_validation_method = IsDtypeValidation(dtype=np.dtype(allowed_type)) | ||
return new_validation_method.get_errors(series=series) | ||
|
||
# Else, validate each element along the allowed types. | ||
Comment: Isn't this in the default method implementation? If so just call
Reply: Correct me if I misunderstood you. But the code below line 274 can then be rewritten to `return super().get_errors(series=series, column=column)` where the default value
Btw. Why did you use |
||
errors = [] | ||
valid_indices = series.index[~self.validate(series)] | ||
for i in valid_indices: | ||
element = series[i] | ||
errors.append(ValidationWarning( | ||
message=self.message, | ||
value=element, | ||
row=i, | ||
column=series.name | ||
)) | ||
|
||
return errors | ||
|
||
def validate(self, series: pd.Series) -> pd.Series: | ||
return series.apply(type).isin(self.allowed_types) | ||
|
||
|
||
class IsDtypeValidation(_BaseValidation): | ||
""" | ||
Checks that a series has a certain numpy dtype | ||
|
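
The element-wise check in `validate` above can be tried outside the library. The following is a minimal sketch that assumes only pandas; the helper name `is_allowed_type` is hypothetical and not part of the PR. It reuses the series from the docstring example and contrasts the exact `type(...)` match with `isinstance`, which also accepts subclasses (`bool` is a subclass of `int`).

```python
import pandas as pd


def is_allowed_type(series: pd.Series, allowed_types: list) -> pd.Series:
    """Hypothetical standalone version of the element-wise check in the diff."""
    # Map every element to its exact Python type, then test membership.
    return series.apply(type).isin(allowed_types)


# Mixed values produce an object series, so each element keeps its Python type.
s = pd.Series(data=["alpha", 1.4, True, "beta", 5])

exact = is_allowed_type(s, [str, int])
print(exact.tolist())

# For comparison: isinstance also accepts subclasses, so True passes as an int.
loose = [isinstance(x, (str, int)) for x in s]
print(loose)
```

With the exact match, `True` is rejected because its type is `bool`, which agrees with the docstring output. Whether that strictness is desirable is a design choice for the validation.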
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1 @@ | ||
__version__ = '0.3.5' | ||
__version__ = '0.3.5.7' | ||
chrispijo marked this conversation as resolved.
|
Comment: I'm fairly sure that `np.dtype(int)` returns `np.int64`, making this function redundant.
Reply: I checked some different Python versions. On Linux Ubuntu with Python 3.8.5, it does indeed return `np.int64`. In my IDE on Windows it returns `np.int32`, for both Python 3.8 and 3.9. These stackoverflow answers explain that this results from C on Windows, where long int is 32-bit even though the system is 64-bit.
So `pd.Series([1,2,3]).dtype` results in `np.int64` and `pd.Series(np.array([1,2,3])).dtype` results in `np.int32`.
This makes it tricky to anticipate which will happen when.
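
The platform dependence described in this reply can be inspected with a short script. This is only a sketch, assuming NumPy and pandas are installed; the variable names are illustrative, and the printed widths depend on the operating system and NumPy version, so no fixed output is claimed for the default integer.

```python
import numpy as np
import pandas as pd

# The default integer dtype depends on the platform and NumPy version.
default_int = np.dtype(int)
print("np.dtype(int):", default_int, "-", default_int.itemsize * 8, "bits")

# Naming the width explicitly removes the platform dependence.
fixed_int = np.dtype(np.int64)

from_list = pd.Series([1, 2, 3])             # built from Python ints
from_array = pd.Series(np.array([1, 2, 3]))  # inherits NumPy's default integer
from_fixed = pd.Series(np.array([1, 2, 3], dtype=np.int64))
print(from_list.dtype, from_array.dtype, from_fixed.dtype)
```

Passing an explicit `dtype=np.int64` when building the array is one way to get the same result on every platform, at the cost of hard-coding the width.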
EDIT:
Converting the series instead might be a solution. The code below is fairly consistent, although I only covered the data types int, float and bool, leaving out (at least?) datetime. np.zeros feels a bit hacky though, and a conversion still remains.