
Wl clean validate #5

Open · wants to merge 21 commits into main
Conversation

WeilinHan8 (Collaborator):

Adding a Dockerfile, clean_data.py, and tests for it.

scripts/clean_validate.py (outdated, resolved)
Comment on lines 12 to 13

# Test 1: Ensure the raw name file exists, if not raise error
@ttimbers (Member), Oct 22, 2024:

These are not "tests" per se... I would refer to them as "input validation", "input validation checks", or "defensive programming" in general. Here, probably "input validation checks". This comment applies to all the comments you have named this way.

from src.validate_data import *

@click.command()
@click.option('--raw-data', type=str, help="Path to directory where raw data resides")
ttimbers (Member):

I think this should be the filename, not the path. What if the filename changes? Then this script would break.

ttimbers (Member):

Try:

@click.option('--raw-data-file', type=str, help="Path to raw data file")

Comment on lines 57 to 89
# Validates that the numeric values of specified columns are within specified range.
col_range = {
'mean_radius': [6,30,False,False],
'mean_texture': [9,40,False,False],
'mean_perimeter': [40,200,False,False],
'mean_area': [140,2510,False,False],
'mean_smoothness': [0,1,False,False],
'mean_compactness': [0,1,False,False],
'mean_concavity': [0,1,False,False],
'mean_concave': [0,1,False,False],
'mean_symmetry': [0,1,False,False],
'mean_fractal': [0,1,False,False],
'se_radius': [0,3,False,False],
'se_texture': [0,5,False,False],
'se_perimeter': [0,22,False,False],
'se_area': [6,550,False,False],
'se_smoothness': [0,1,False,False],
'se_compactness': [0,1,False,False],
'se_concavity': [0,1,False,False],
'se_concave': [0,1,False,False],
'se_symmetry': [0,1,False,False],
'se_fractal': [0,1,False,False],
'max_radius': [7,40,False,False],
'max_texture': [12,50,False,False],
'max_perimeter': [50,260,False,False],
'max_area': [180,4300,False,False],
'max_smoothness': [0,1,False,False],
'max_compactness': [0,2,False,False],
'max_concavity': [0,2,False,False],
'max_concave': [0,1,False,False],
'max_symmetry': [0,1,False,False],
'max_fractal': [0,1,False,False]
}
ttimbers (Member):

When I see something this long, with values like this hard-coded into a script, I start to think about how we can do this better... One thing is that the range of possible values is similar across most measures, whether it's the mean or the max (which makes sense), and even for some of the se's. Another is the repetition of all the Falses. Finally, if we wanted to update these values, editing the script feels wrong... I feel like a config file or a data file might be best?
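One way that could look (a minimal sketch, assuming a hypothetical ranges.csv config with columns column,min,max,min_inclusive,max_inclusive):

import pandas as pd

# Read the per-column ranges from a config file instead of hard-coding them
ranges_df = pd.read_csv("data/processed/ranges.csv")
col_range = {
    row["column"]: [row["min"], row["max"],
                    bool(row["min_inclusive"]), bool(row["max_inclusive"])]
    for _, row in ranges_df.iterrows()
}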

ttimbers (Member):

Do the other expectations have similar large amounts of data or is it just this one?

Comment on lines 17 to 19
# Test 2: Ensure the raw name file is a .names file, if not raise error
if not raw_name_file.endswith('.names'):
raise ValueError("The raw_name file must be a .names file.")
ttimbers (Member):

I don't think we need to be so strict here. Any file name could be fine; it's more important that the file exists and that its content is right.

@@ -0,0 +1,32 @@
column,type,max,min
diagnosis,str,,
ttimbers (Member):

Should we include column labels in this config, or in another? It seems strange to only have numerical data in the config file.


@click.command()
@click.option('--raw-data-file', type=str, help="Path to raw data file")
@click.option('--name-file', type=str, help="Path to directory where names file resides")
ttimbers (Member):

Point to the file for the names file, not just the directory.

ttimbers (Member):

Oh, it's just the help text that is wrong, not the code.


# Validate cleaned data
# Load the CSV config file
data_config_file = '/data/processed/data_config.csv'
ttimbers (Member):

data_config_file should be a command-line argument; its value should be the path to this file.

# Load the CSV config file
data_config_file = '/data/processed/data_config.csv'
# define schema
schema = build_schema_from_csv(data_config=data_config_file,expected_columns=colnames)
ttimbers (Member):

Suggested change:
- schema = build_schema_from_csv(data_config=data_config_file,expected_columns=colnames)
+ schema = build_schema_from_csv(data_config=data_config_file, expected_columns=colnames)

# Create schema
config_df = pd.read_csv(data_config_file)

schema=build_schema_from_csv(data_config=config_df, expected_columns=colnames[1:]) #removing id colnames list
ttimbers (Member):

Let's use the column name to drop the ID column name from the list. Numerical indexing makes code brittle and is less readable.

Suggested change:
- schema=build_schema_from_csv(data_config=config_df, expected_columns=colnames[1:])  # removing id from colnames list
+ colnames.remove("id")  # list.remove mutates in place and returns None
+ schema = build_schema_from_csv(data_config=config_df, expected_columns=colnames)

ttimbers (Member):

Also, wondering if we should rename build_schema_from_csv to build_schema_from_DataFrame, since that is what the function does (it would work if we created a data frame with pd.DataFrame in code and then used the function).
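For example (a minimal sketch; the single config row here is illustrative):

import pandas as pd

# The function never reads a CSV itself, so an in-memory DataFrame works too
config_df = pd.DataFrame({
    "column": ["mean_radius"],
    "type": ["float"],
    "min": [6],
    "max": [30],
    "category": [None],
    "max_nullable": [0],
})
schema = build_schema_from_csv(data_config=config_df, expected_columns=["mean_radius"])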

@click.option('--name-file', type=str, help="Path to names file")
@click.option('--data-config-file', type=str, help="Path to data configuration file")
@click.option('--write-to', type=str, help="Path to directory where cleaned data will be written to")
@click.option('--written-file-name', type=str, help="The name of the file will be written")
ttimbers (Member):

written_file_name is a bit awkward. How about file-name? Or we could just hard-code this part in the script; the path is what matters most for flexibility. I think I would go with the latter option.
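A sketch of the latter option (assuming the cleaned frame is named cleaned_data; the hard-coded file name is illustrative):

import os

# Only the destination directory stays configurable; the file name is fixed
output_path = os.path.join(write_to, "cleaned_data.csv")
cleaned_data.to_csv(output_path, index=False)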

Comment on lines +14 to +15
if not os.path.exists(raw_name_file):
raise FileNotFoundError(f"The raw_name file does not exist.")
ttimbers (Member):

I am fairly certain that open raises FileNotFoundError: [Errno 2] No such file or directory: 'filename.txt' when the file doesn't exist, so we don't need this, nor the test for it? I think we have to be careful not to over-test in this "demonstration" of what good code should look like.
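A quick way to confirm that behavior (the file name here is just an example):

try:
    open("no_such_file.names")
except FileNotFoundError as e:
    print(e)  # [Errno 2] No such file or directory: 'no_such_file.names'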

import os


def extract_column_name(raw_name_file):
ttimbers (Member):

I think this function does too much. A function should do just one thing. I would move the open command to the script (so read in the whole file there) and have just the regular expressions as what is modularized into the function.
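A sketch of that split (assuming extract_column_name is refactored to take the file's text rather than a path):

# In the script: do the I/O
with open(raw_name_file) as f:
    names_text = f.read()

# In src/: do just the parsing
colnames = extract_column_name(names_text)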

Comment on lines +58 to +59
if len(col_name) != 32:
raise ValueError("col_name must contain exactly 32 items.")
ttimbers (Member):

The magic number here is brittle and confusing to others who won't know where it comes from. This is the number of columns in raw_data, right? So you can get this number from raw_data.shape[1].
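For example (a sketch, assuming raw_data is in scope at this point):

# Derive the expected length from the data itself instead of hard-coding 32
if len(col_name) != raw_data.shape[1]:
    raise ValueError(
        f"col_name must contain exactly {raw_data.shape[1]} items, one per column."
    )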

Comment on lines +61 to +63
# Ensure the list only contains strings, if not raise error
if not all(isinstance(item, str) for item in col_name):
raise ValueError("col_name must only contain strings.")
ttimbers (Member):

Hmmm... this is generally a good thing, but if it is not true, pandas can handle it, so I am not sure we want to fully throw an error here. Maybe remove this, or just throw a warning.
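The warning version might look like this (a minimal sketch):

import warnings

# Warn instead of failing hard; pandas will still accept non-string names
if not all(isinstance(item, str) for item in col_name):
    warnings.warn("col_name contains non-string items.", UserWarning)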


return colnames

def read_data(raw_data, col_name):
ttimbers (Member):

Isn't this read_raw_data in the script?

Comment on lines +93 to +94
if not os.path.exists(data_to):
raise FileNotFoundError('The directory provided does not exist.')
@ttimbers (Member), Dec 12, 2024:

If you don't have this in your function, will pandas' .to_csv give you a FileNotFoundError or some other path error anyway? If so, we should remove this, as the os.path.exists check would be redundant.
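A quick check of that behavior (version dependent; recent pandas raises OSError, of which FileNotFoundError is a subclass):

import pandas as pd

df = pd.DataFrame({"a": [1]})
try:
    df.to_csv("no_such_dir/out.csv")
except OSError as e:
    print(e)  # e.g. Cannot save file into a non-existent directory: 'no_such_dir'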

Comment on lines +97 to +102
if not os.path.isdir(data_to):
raise NotADirectoryError('The directory path provided is not a directory, it is an existing file path. Please provide a path to a new, or existing directory.')

# Ensure the name of file is string, if not raise an error
if not isinstance(name_of_file, str):
raise TypeError("name_of_file must be string.")
ttimbers (Member):

Similar to the comment above, let's first check that .to_csv doesn't already handle these kinds of errors.

raise TypeError("data_config must be a pandas dataframe.")

# Ensure the data_config has following columns: column,type,max,min,category
required_columns = ['column', 'type', 'min', 'max','category', 'max_nullable']
ttimbers (Member):

Could you use Pandera here to check the config? 🫠

ttimbers (Member):

You can even use base Python like you do, but just check that all those 6 columns exist and have those names, and that there is at least one row of data (it doesn't make sense to run the function otherwise). To do the latter, you can use data_config.empty, and as long as that is False you are good for that check.

To do the former, you can check that there is nothing different between the sets (column names and expected column names) and that data_config.shape[1] == 6.
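Putting that together (a minimal sketch):

# Base-Python/pandas validation of the config itself
required_columns = {'column', 'type', 'min', 'max', 'category', 'max_nullable'}
if set(data_config.columns) != required_columns or data_config.shape[1] != 6:
    raise ValueError(f"data_config must have exactly these columns: {sorted(required_columns)}")
if data_config.empty:
    raise ValueError("data_config must contain at least one row of data.")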

Comment on lines +40 to +48
# Define the correct Pandera data type
if column_type == 'int':
dtype = pa.Int
elif column_type == 'float':
dtype = pa.Float
elif column_type == 'str':
dtype = pa.String
else:
raise ValueError(f"Unsupported column type: {column_type}")
ttimbers (Member):

None of this is needed. Pandera's schemas work with int, or float, or str. See an example here: https://github.com/ttimbers/breast-cancer-predictor/blob/206d1c2ba56583e87dbc359538c007698df4772c/src/validate_data.py#L45
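For instance (a sketch of the simpler form; the ranges here are illustrative):

import pandera as pa

# Pandera accepts Python built-in types directly as column dtypes
schema = pa.DataFrameSchema({
    "mean_radius": pa.Column(float, pa.Check.between(6, 30)),
    "diagnosis": pa.Column(str),
})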

Comment on lines +90 to +91
if dataframe.empty:
raise ValueError("dataframe must contain observations.")
ttimbers (Member):

This could be caught by a Pandera schema check, like this one: https://github.com/ttimbers/breast-cancer-predictor/blob/206d1c2ba56583e87dbc359538c007698df4772c/src/validate_data.py#L79

That is, we want it to fail if there is even one missing row.
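A DataFrame-level check along those lines might look like this (a sketch; the column is illustrative):

import pandera as pa

# Fail validation if there are no observations at all
schema = pa.DataFrameSchema(
    {"mean_radius": pa.Column(float)},
    checks=[pa.Check(lambda df: not df.empty, error="dataframe must contain observations")],
)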

raise ValueError("dataframe must contain observations.")

schema.validate(dataframe, lazy=True)
# return print(f"Expected Columns: {expected_columns}, Actual Columns: {actual_columns}")
ttimbers (Member):

Remove the commented-out code.

@ttimbers (Member):

Most functions have minimal docstrings. We want them to be more robust and numpy-style, like this example (which you are welcome to copy): https://github.com/ttimbers/breast-cancer-predictor/blob/206d1c2ba56583e87dbc359538c007698df4772c/src/validate_data.py#L7
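For read_data, a numpy-style docstring might look like this (a sketch; the described behavior is assumed from the surrounding review):

def read_data(raw_data, col_name):
    """Read the raw data file into a pandas DataFrame with the given column names.

    Parameters
    ----------
    raw_data : str
        Path to the raw data file.
    col_name : list of str
        Column names to assign to the DataFrame, one per column.

    Returns
    -------
    pandas.DataFrame
        The raw data with columns named according to col_name.

    Raises
    ------
    ValueError
        If col_name does not match the number of columns in the data.
    """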
