Wl clean validate #5
base: main
Conversation
src/clean_data.py
Outdated
# Test 1: Ensure the raw name file exists, if not raise error
These are not "tests" per se... I would refer to them as "input validation", "input validation checks", or "defensive programming" in general. Here, probably "input validation checks". This comment applies to all the comments you have named this way.
scripts/clean_validate.py
Outdated
from src.validate_data import *

@click.command()
@click.option('--raw-data', type=str, help="Path to directory where raw data resides")
I think this should be the path to the file, not the directory. What if the filename changes? Then this script would break.
Suggested change:
@click.option('--raw-data-file', type=str, help="Path to raw data file")
scripts/clean_validate.py
Outdated
# Validates that the numeric values of specified columns are within specified range.
col_range = {
    'mean_radius': [6,30,False,False],
    'mean_texture': [9,40,False,False],
    'mean_perimeter': [40,200,False,False],
    'mean_area': [140,2510,False,False],
    'mean_smoothness': [0,1,False,False],
    'mean_compactness': [0,1,False,False],
    'mean_concavity': [0,1,False,False],
    'mean_concave': [0,1,False,False],
    'mean_symmetry': [0,1,False,False],
    'mean_fractal': [0,1,False,False],
    'se_radius': [0,3,False,False],
    'se_texture': [0,5,False,False],
    'se_perimeter': [0,22,False,False],
    'se_area': [6,550,False,False],
    'se_smoothness': [0,1,False,False],
    'se_compactness': [0,1,False,False],
    'se_concavity': [0,1,False,False],
    'se_concave': [0,1,False,False],
    'se_symmetry': [0,1,False,False],
    'se_fractal': [0,1,False,False],
    'max_radius': [7,40,False,False],
    'max_texture': [12,50,False,False],
    'max_perimeter': [50,260,False,False],
    'max_area': [180,4300,False,False],
    'max_smoothness': [0,1,False,False],
    'max_compactness': [0,2,False,False],
    'max_concavity': [0,2,False,False],
    'max_concave': [0,1,False,False],
    'max_symmetry': [0,1,False,False],
    'max_fractal': [0,1,False,False]
}
When I see something this long, with values like this hard-coded into a script, I start to think about how we can do this better... One thing is that the range of possible values is similar across most measures, whether it's the mean or the max (which makes sense), and even for some of the SEs. Another is the repetition of all the Falses. Finally, if we wanted to update these values, editing the script feels wrong... I feel like a config file or a data file might be best?
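A config-driven version could look like the following sketch. The file name, layout, and inline CSV are assumptions for illustration, not part of this PR:

```python
# Hypothetical sketch: load the column ranges from a CSV config file instead
# of hard-coding them in the script. File name and layout are assumptions.
import io
import pandas as pd

# Stand-in for pd.read_csv("data/col_ranges.csv")
config_csv = io.StringIO(
    "column,min,max\n"
    "mean_radius,6,30\n"
    "mean_texture,9,40\n"
)
ranges = pd.read_csv(config_csv)

# Rebuild the {column: [min, max, False, False]} structure the script expects,
# so the repeated Falses live in one place
col_range = {
    row["column"]: [int(row["min"]), int(row["max"]), False, False]
    for _, row in ranges.iterrows()
}
```

Updating a range would then mean editing a data file rather than the script.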
Do the other expectations have similarly large amounts of data, or is it just this one?
src/clean_data.py
Outdated
# Test 2: Ensure the raw name file is a .names file, if not raise error
if not raw_name_file.endswith('.names'):
    raise ValueError("The raw_name file must be a .names file.")
I don't think we need to be so strict here. Any file name could be fine; it's more important that the file exists and that the content is what matters.
data/processed/data_config.csv
Outdated
@@ -0,0 +1,32 @@
column,type,max,min
diagnosis,str,,
Should we include column labels in this config, or in another one? It seems strange to have only numerical data in the config file.
scripts/clean_validate.py
Outdated
@click.command()
@click.option('--raw-data-file', type=str, help="Path to raw data file")
@click.option('--name-file', type=str, help="Path to dirctory where names file resides")
Point to the file for the names file, not just the directory.
Oh, it's just the help text that is wrong, not the code.
scripts/clean_validate.py
Outdated
# Validate cleaned data
# Load the CSV config file
data_config_file = '/data/processed/data_config.csv'
data_config_file should be a command line argument; its value should be the path to this file.
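A sketch of what that could look like with click. The option and function names are illustrative, not taken from the PR:

```python
# Hypothetical sketch: pass the config path on the command line instead of
# hard-coding it in the script.
import click

@click.command()
@click.option('--data-config-file', type=str,
              help="Path to the data configuration CSV file")
def main(data_config_file):
    # ...load the config and build the schema from data_config_file...
    click.echo(f"Using config: {data_config_file}")

# Invoked as, e.g.:
# python scripts/clean_validate.py --data-config-file data/processed/data_config.csv
```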
scripts/clean_validate.py
Outdated
# Load the CSV config file
data_config_file = '/data/processed/data_config.csv'
# define schema
schema = build_schema_from_csv(data_config=data_config_file,expected_columns=colnames)
Suggested change:
- schema = build_schema_from_csv(data_config=data_config_file,expected_columns=colnames)
+ schema = build_schema_from_csv(data_config=data_config_file, expected_columns=colnames)
# Create schema
config_df = pd.read_csv(data_config_file)

schema=build_schema_from_csv(data_config=config_df, expected_columns=colnames[1:]) #removing id colnames list
Let's use the column name to drop the ID column name from the list. Numerical indexing makes code brittle and is less readable.
Suggested change:
- schema=build_schema_from_csv(data_config=config_df, expected_columns=colnames[1:])  # removing id from colnames list
+ clean_colnames = [name for name in colnames if name != "id"]  # list.remove() returns None, so filter instead
+ schema = build_schema_from_csv(data_config=config_df, expected_columns=clean_colnames)
Also, I'm wondering if we should rename build_schema_from_csv to build_schema_from_DataFrame, since that is what the function does (it would work if we created a data frame with pd.DataFrame in code and then used the function).
@click.option('--name-file', type=str, help="Path to names file")
@click.option('--data-config-file', type=str, help="Path to data configuration file")
@click.option('--write-to', type=str, help="Path to directory where cleaned data will be written to")
@click.option('--written-file-name', type=str, help="The name of the file will be written")
written_file_name is a bit awkward. How about file-name? Or we could just hard-code this part in the script; the path is where flexibility matters most. I think I would go with the latter option.
if not os.path.exists(raw_name_file):
    raise FileNotFoundError(f"The raw_name file does not exist.")
I am fairly certain that open raises FileNotFoundError: [Errno 2] No such file or directory: 'filename.txt' when the file doesn't exist, so we don't need this, nor the test for it. I think we have to be careful not to over-test in this "demonstration" of what good code should look like.
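A quick way to confirm this, assuming no file by that name exists in the working directory:

```python
# open() already raises FileNotFoundError for a missing file, so an explicit
# os.path.exists guard before it is redundant.
try:
    open("definitely_missing.names")
except FileNotFoundError as err:
    print(err.errno)  # 2 -> "No such file or directory"
```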
import os


def extract_column_name(raw_name_file):
I think this function does too much. A function should do just one thing. I would move the open call to the script (so the whole file is read in there) and have just the regular expressions be what is modularized into the function.
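A sketch of that split. The regex pattern is an assumption about the .names file format, not the actual pattern in this PR:

```python
import re

def extract_column_names(names_text):
    """Extract attribute names from raw .names text (illustrative pattern)."""
    # Matches numbered entries like "1) radius"
    return re.findall(r"^\s*\d+\)\s*(\w+)", names_text, flags=re.MULTILINE)

# In the script, the file I/O stays outside the function:
# with open(raw_name_file) as f:
#     colnames = extract_column_names(f.read())
```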
if len(col_name) != 32:
    raise ValueError("col_name must contain exactly 32 items.")
The magic numbers here are brittle and confusing to others who won't know where they come from. This is the number of columns in raw_data, right? So you can get this number from raw_data.shape[1].
# Ensure the list only contains strings, if not raise error
if not all(isinstance(item, str) for item in col_name):
    raise ValueError("col_name must only contain strings.")
Hmmm... this is generally a good thing, but if it is not true, pandas can handle it, so I am not sure we want to fully throw an error here. Maybe remove this or just throw a warning.
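A warning-based version could look like this sketch (function name is illustrative):

```python
import warnings

def check_column_names(col_name):
    # Warn instead of raising: pandas will still accept non-string
    # values as column labels.
    if not all(isinstance(item, str) for item in col_name):
        warnings.warn("col_name contains non-string items; pandas will "
                      "still accept them as column labels.")
```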
return colnames


def read_data(raw_data, col_name):
Isn't this read_raw_data in the script?
if not os.path.exists(data_to):
    raise FileNotFoundError('The directory provided does not exist.')
If you don't have this in your function, will pandas' .to_csv give you a FileNotFoundError or some other path error anyway? If so, we should remove this os.path.exists check, as the error is already handled.
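A quick experiment suggests it does, assuming the target directory does not exist:

```python
# pandas raises an OSError ("Cannot save file into a non-existent directory")
# when the target directory is missing, so the explicit check may be redundant.
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
try:
    df.to_csv("no_such_directory/out.csv")
except OSError as err:
    print(err)
```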
if not os.path.isdir(data_to):
    raise NotADirectoryError('The directory path provided is not a directory, it is an existing file path. Please provide a path to a new, or existing directory.')

# Ensure the name of file is string, if not raise an error
if not isinstance(name_of_file, str):
    raise TypeError("name_of_file must be string.")
Similar to the comment above, let's first check whether .to_csv already handles these kinds of errors.
raise TypeError("data_config must be a pandas dataframe.")

# Ensure the data_config has following columns: column,type,max,min,category
required_columns = ['column', 'type', 'min', 'max','category', 'max_nullable']
Could you use Pandera here to check the config? 🫠
You can even use base Python like you do, but just check that all six of those columns exist with those names, and that there is at least one row of data (it doesn't make sense to run the function otherwise). For the latter, you can use data_config.empty; as long as that is False, you are good. For the former, you can check that there is nothing different between the sets (column names and expected column names), and check that data_config.shape[1] == 6.
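Those two checks could be sketched in base Python like this (the column list is taken from the code above; the function name is illustrative):

```python
import pandas as pd

REQUIRED_COLUMNS = ['column', 'type', 'min', 'max', 'category', 'max_nullable']

def check_data_config(data_config):
    # At least one row: running the function on an empty config makes no sense.
    if data_config.empty:
        raise ValueError("data_config must contain at least one row.")
    # Exactly the six expected columns, compared as sets so order doesn't matter.
    if set(data_config.columns) != set(REQUIRED_COLUMNS) or data_config.shape[1] != 6:
        raise ValueError(f"data_config must have exactly these columns: {REQUIRED_COLUMNS}")
```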
# Define the correct Pandera data type
if column_type == 'int':
    dtype = pa.Int
elif column_type == 'float':
    dtype = pa.Float
elif column_type == 'str':
    dtype = pa.String
else:
    raise ValueError(f"Unsupported column type: {column_type}")
None of this is needed. Pandera's schemas work with int, float, or str directly. See an example here: https://github.com/ttimbers/breast-cancer-predictor/blob/206d1c2ba56583e87dbc359538c007698df4772c/src/validate_data.py#L45
if dataframe.empty:
    raise ValueError("dataframe must contain observations.")
This could be caught by a Pandera schema check, like this one: https://github.com/ttimbers/breast-cancer-predictor/blob/206d1c2ba56583e87dbc359538c007698df4772c/src/validate_data.py#L79. We want it to fail if there is even one missing row.
raise ValueError("dataframe must contain observations.")

schema.validate(dataframe, lazy=True)
# return print(f"Expected Columns: {expected_columns}, Actual Columns: {actual_columns}")
Remove the commented-out code.
Most functions have minimal docstrings. We want them to be more robust and numpy-style, like this example (which you are welcome to copy): https://github.com/ttimbers/breast-cancer-predictor/blob/206d1c2ba56583e87dbc359538c007698df4772c/src/validate_data.py#L7
Adding Dockerfile, clean_data.py, and tests for it.