
Wl clean validate #5

Open · wants to merge 21 commits into main
Conversation

WeilinHan8 (Collaborator):

Adding a Dockerfile, clean_data.py, and tests for it.

scripts/clean_validate.py (outdated, resolved)
Comment on lines 12 to 13

# Test 1: Ensure the raw name file exists, if not raise error
@ttimbers (Member), Oct 22, 2024:

These are not "tests" per se... I would refer to them as "input validation", "input validation checks", or "defensive programming" in general. Here, probably "input validation checks". This comment applies to all the comments you have named this way.

from src.validate_data import *

@click.command()
@click.option('--raw-data', type=str, help="Path to directory where raw data resides")
ttimbers (Member):

I think this should be the filename, not the path. What if the filename changes? Then this script would break.

ttimbers (Member):

Try:

@click.option('--raw-data-file', type=str, help="Path to raw data file")

Comment on lines 57 to 89
# Validates that the numeric values of specified columns are within specified range.
col_range = {
'mean_radius': [6,30,False,False],
'mean_texture': [9,40,False,False],
'mean_perimeter': [40,200,False,False],
'mean_area': [140,2510,False,False],
'mean_smoothness': [0,1,False,False],
'mean_compactness': [0,1,False,False],
'mean_concavity': [0,1,False,False],
'mean_concave': [0,1,False,False],
'mean_symmetry': [0,1,False,False],
'mean_fractal': [0,1,False,False],
'se_radius': [0,3,False,False],
'se_texture': [0,5,False,False],
'se_perimeter': [0,22,False,False],
'se_area': [6,550,False,False],
'se_smoothness': [0,1,False,False],
'se_compactness': [0,1,False,False],
'se_concavity': [0,1,False,False],
'se_concave': [0,1,False,False],
'se_symmetry': [0,1,False,False],
'se_fractal': [0,1,False,False],
'max_radius': [7,40,False,False],
'max_texture': [12,50,False,False],
'max_perimeter': [50,260,False,False],
'max_area': [180,4300,False,False],
'max_smoothness': [0,1,False,False],
'max_compactness': [0,2,False,False],
'max_concavity': [0,2,False,False],
'max_concave': [0,1,False,False],
'max_symmetry': [0,1,False,False],
'max_fractal': [0,1,False,False]
}
ttimbers (Member):

When I see something this long, with values like this hard-coded into a script, I start to think about how we can do this better... One thing is that the range of possible values is similar across most measures, whether it's the mean or the max (which makes sense), and even for some of the se's. Another is the repetition of all the Falses. Finally, if we wanted to update these values, editing the script feels wrong... I feel like a config file or a data file might be best?
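One way that could look (a minimal sketch, assuming a hypothetical ranges.csv config with columns column,min,max,min_inclusive,max_inclusive):

import pandas as pd

# Read the per-column ranges from a config file instead of hard-coding them
ranges_df = pd.read_csv("data/processed/ranges.csv")
col_range = {
    row["column"]: [row["min"], row["max"],
                    bool(row["min_inclusive"]), bool(row["max_inclusive"])]
    for _, row in ranges_df.iterrows()
}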

ttimbers (Member):

Do the other expectations have similar large amounts of data or is it just this one?

Comment on lines 17 to 19
# Test 2: Ensure the raw name file is a .names file, if not raise error
if not raw_name_file.endswith('.names'):
raise ValueError("The raw_name file must be a .names file.")
ttimbers (Member):

I don't think we need to be so strict here. Any file name could be fine; it's more important that the file exists and that its content is right.

@@ -0,0 +1,32 @@
column,type,max,min
diagnosis,str,,
ttimbers (Member):

Should we include column labels in this config, or in another? It seems strange to only have numerical data in the config file.


@click.command()
@click.option('--raw-data-file', type=str, help="Path to raw data file")
@click.option('--name-file', type=str, help="Path to directory where names file resides")
ttimbers (Member):

Point to the file for the names file, not just the directory.

ttimbers (Member):

Oh, it's just the help text that is wrong, not the code.


# Validate cleaned data
# Load the CSV config file
data_config_file = '/data/processed/data_config.csv'
ttimbers (Member):

data_config_file should be a command-line argument; its value should be the path to this file.

# Load the CSV config file
data_config_file = '/data/processed/data_config.csv'
# define schema
schema = build_schema_from_csv(data_config=data_config_file,expected_columns=colnames)
ttimbers (Member):

Suggested change:
- schema = build_schema_from_csv(data_config=data_config_file,expected_columns=colnames)
+ schema = build_schema_from_csv(data_config=data_config_file, expected_columns=colnames)

# Create schema
config_df = pd.read_csv(data_config_file)

schema=build_schema_from_csv(data_config=config_df, expected_columns=colnames[1:]) #removing id colnames list
ttimbers (Member):

Let's use the column name to drop the ID column name from the list. Numerical indexing makes code brittle and is less readable.

Suggested change:
- schema=build_schema_from_csv(data_config=config_df, expected_columns=colnames[1:])  # removing id from colnames list
+ colnames.remove("id")  # list.remove mutates in place and returns None
+ schema = build_schema_from_csv(data_config=config_df, expected_columns=colnames)

ttimbers (Member):

Also, wondering if we should rename build_schema_from_csv to build_schema_from_DataFrame, since that is what the function does (it would work if we created a data frame with pd.DataFrame in code and then used the function).
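For example (a minimal sketch; the single config row here is illustrative):

import pandas as pd

# The function never reads a CSV itself, so an in-memory DataFrame works too
config_df = pd.DataFrame({
    "column": ["mean_radius"],
    "type": ["float"],
    "min": [6],
    "max": [30],
    "category": [None],
    "max_nullable": [0],
})
schema = build_schema_from_csv(data_config=config_df, expected_columns=["mean_radius"])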

@click.option('--name-file', type=str, help="Path to names file")
@click.option('--data-config-file', type=str, help="Path to data configuration file")
@click.option('--write-to', type=str, help="Path to directory where cleaned data will be written to")
@click.option('--written-file-name', type=str, help="The name of the file will be written")
ttimbers (Member):

written_file_name is a bit awkward. How about file-name? Or we could just hard-code this part in the script; the path is what matters most for flexibility. I think I would go with the latter option.
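A sketch of the latter option (assuming the cleaned frame is named cleaned_data; the hard-coded file name is illustrative):

import os

# Only the destination directory stays configurable; the file name is fixed
output_path = os.path.join(write_to, "cleaned_data.csv")
cleaned_data.to_csv(output_path, index=False)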

Comment on lines +14 to +15
if not os.path.exists(raw_name_file):
raise FileNotFoundError(f"The raw_name file does not exist.")
ttimbers (Member):

I am fairly certain that open raises FileNotFoundError: [Errno 2] No such file or directory: 'filename.txt' when the file doesn't exist, so we don't need this, nor the test for it? I think we have to be careful not to over-test in this "demonstration" of what good code should look like.
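A quick way to confirm that behavior (the file name here is just an example):

try:
    open("no_such_file.names")
except FileNotFoundError as e:
    print(e)  # [Errno 2] No such file or directory: 'no_such_file.names'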

import os


def extract_column_name(raw_name_file):
ttimbers (Member):

I think this function does too much. A function should do just one thing. I would move the open command to the script (so read in the whole file there) and have just the regular expressions as what is modularized into the function.
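A sketch of that split (assuming extract_column_name is refactored to take the file's text rather than a path):

# In the script: do the I/O
with open(raw_name_file) as f:
    names_text = f.read()

# In src/: do just the parsing
colnames = extract_column_name(names_text)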

Comment on lines +58 to +59
if len(col_name) != 32:
raise ValueError("col_name must contain exactly 32 items.")
ttimbers (Member):

The magic number here is brittle and confusing to others who won't know where it comes from. This is the number of columns in raw_data, right? So you can get this number from raw_data.shape[1].
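For example (a sketch, assuming raw_data is in scope at this point):

# Derive the expected length from the data itself instead of hard-coding 32
if len(col_name) != raw_data.shape[1]:
    raise ValueError(
        f"col_name must contain exactly {raw_data.shape[1]} items, one per column."
    )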

Comment on lines +61 to +63
# Ensure the list only contains strings, if not raise error
if not all(isinstance(item, str) for item in col_name):
raise ValueError("col_name must only contain strings.")
ttimbers (Member):

Hmmm... this is generally a good thing, but if it is not true, pandas can handle it, so I am not sure we want to fully throw an error here. Maybe remove this, or just throw a warning.
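The warning version might look like this (a minimal sketch):

import warnings

# Warn instead of failing hard; pandas will still accept non-string names
if not all(isinstance(item, str) for item in col_name):
    warnings.warn("col_name contains non-string items.", UserWarning)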


return colnames

def read_data(raw_data, col_name):
ttimbers (Member):

Isn't this read_raw_data in the script?

Comment on lines +93 to +94
if not os.path.exists(data_to):
raise FileNotFoundError('The directory provided does not exist.')
@ttimbers (Member), Dec 12, 2024:

If you don't have this in your function, will pandas' .to_csv give you a FileNotFoundError or some other path error anyway? If so, we should remove this, as the os.path.exists check would be redundant.
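A quick check of that behavior (version dependent; recent pandas raises OSError, of which FileNotFoundError is a subclass):

import pandas as pd

df = pd.DataFrame({"a": [1]})
try:
    df.to_csv("no_such_dir/out.csv")
except OSError as e:
    print(e)  # e.g. Cannot save file into a non-existent directory: 'no_such_dir'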

Comment on lines +97 to +102
if not os.path.isdir(data_to):
raise NotADirectoryError('The directory path provided is not a directory, it is an existing file path. Please provide a path to a new, or existing directory.')

# Ensure the name of file is string, if not raise an error
if not isinstance(name_of_file, str):
raise TypeError("name_of_file must be string.")
ttimbers (Member):

Similar to the comment above, let's first check that .to_csv doesn't already handle these kinds of errors.

raise TypeError("data_config must be a pandas dataframe.")

# Ensure the data_config has following columns: column,type,max,min,category
required_columns = ['column', 'type', 'min', 'max','category', 'max_nullable']
ttimbers (Member):

Could you use Pandera here to check the config? 🫠

ttimbers (Member):

You can even use base Python like you do, but just check that all those 6 columns exist and have those names, and that there is at least one row of data (it doesn't make sense to run the function otherwise). To do the latter, you can use data_config.empty, and as long as that is False you are good for that check.

To do the former, you can check that there is nothing different between the sets (column names and expected column names) and that data_config.shape[1] == 6.
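Putting that together (a minimal sketch):

# Base-Python/pandas validation of the config itself
required_columns = {'column', 'type', 'min', 'max', 'category', 'max_nullable'}
if set(data_config.columns) != required_columns or data_config.shape[1] != 6:
    raise ValueError(f"data_config must have exactly these columns: {sorted(required_columns)}")
if data_config.empty:
    raise ValueError("data_config must contain at least one row of data.")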

Comment on lines +40 to +48
# Define the correct Pandera data type
if column_type == 'int':
dtype = pa.Int
elif column_type == 'float':
dtype = pa.Float
elif column_type == 'str':
dtype = pa.String
else:
raise ValueError(f"Unsupported column type: {column_type}")
ttimbers (Member):

None of this is needed. Pandera's schemas work with int, or float, or str. See an example here: https://github.com/ttimbers/breast-cancer-predictor/blob/206d1c2ba56583e87dbc359538c007698df4772c/src/validate_data.py#L45
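For instance (a sketch of the simpler form; the ranges here are illustrative):

import pandera as pa

# Pandera accepts Python built-in types directly as column dtypes
schema = pa.DataFrameSchema({
    "mean_radius": pa.Column(float, pa.Check.between(6, 30)),
    "diagnosis": pa.Column(str),
})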

Comment on lines +90 to +91
if dataframe.empty:
raise ValueError("dataframe must contain observations.")
ttimbers (Member):

This could be caught by a Pandera schema check, like this one: https://github.com/ttimbers/breast-cancer-predictor/blob/206d1c2ba56583e87dbc359538c007698df4772c/src/validate_data.py#L79

That is, we want it to fail if there is even one missing row.
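A DataFrame-level check along those lines might look like this (a sketch; the column is illustrative):

import pandera as pa

# Fail validation if there are no observations at all
schema = pa.DataFrameSchema(
    {"mean_radius": pa.Column(float)},
    checks=[pa.Check(lambda df: not df.empty, error="dataframe must contain observations")],
)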

raise ValueError("dataframe must contain observations.")

schema.validate(dataframe, lazy=True)
# return print(f"Expected Columns: {expected_columns}, Actual Columns: {actual_columns}")
ttimbers (Member):

Remove the commented-out code.

@ttimbers (Member):

Most functions have minimal docstrings. We want them to be more robust and numpy-style, like this example (which you are welcome to copy): https://github.com/ttimbers/breast-cancer-predictor/blob/206d1c2ba56583e87dbc359538c007698df4772c/src/validate_data.py#L7
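For read_data, a numpy-style docstring might look like this (a sketch; the described behavior is assumed from the surrounding review):

def read_data(raw_data, col_name):
    """Read the raw data file into a pandas DataFrame with the given column names.

    Parameters
    ----------
    raw_data : str
        Path to the raw data file.
    col_name : list of str
        Column names to assign to the DataFrame, one per column.

    Returns
    -------
    pandas.DataFrame
        The raw data with columns named according to col_name.

    Raises
    ------
    ValueError
        If col_name does not match the number of columns in the data.
    """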
