Featurizers in Maudlin are preprocessing functions written in Python, applied sequentially to transform raw data before input and target extraction. They handle tasks like dropping columns, merging data, and encoding categories. They are defined in the YAML configuration, where their order determines execution, enabling chained transformations for refined feature engineering. Flexible and customizable, featurizers streamline data preparation while ensuring consistency and reproducibility.
Featurizers can do anything you need done to the set of features coming from your input CSV, in the order you specify. This includes:
- Data Transformation: Modify raw data into features suitable for modeling.
- Column Management: Merge, split, rename, or drop columns as needed.
- Feature Engineering: Create new features, such as moving averages or categorical encodings.
- Data Cleaning: Handle missing values, outliers, or inconsistent formats.
- Aggregation: Compute statistics like sums, means, or counts over grouped data.
- Sequencing Operations: Apply transformations in a defined order for complex workflows.
- Consistency and Reproducibility: Ensure preprocessing steps are standardized across runs.
- Customization: Support domain-specific logic through custom Python functions.
Featurizers are defined in the unit configuration file (e.g., `UNIT-NAME.config.yaml`) under the `data.features` section.
```yaml
data:
  columns:
    csv:
      - 'date'
      - 'product_id'
      - 'quantity_sold'
      - 'price_per_unit'
      - 'total_revenue'
  features:
    - name: day_of_the_week
    - name: holiday_indicator
    - name: moving_average
      params:
        periods: [3, 7]
        data_field_name: 'quantity_sold'
```
- `name`: Identifier for the featurizer.
- `params`: Parameters that customize its behavior, e.g., window sizes, field names, etc.
Featurizers in Maudlin are Python functions used to preprocess and transform datasets before input and target extraction. They are currently defined in the `savvato-maudlin-lib` library, which serves as a temporary location during development and iteration. Featurizers run sequentially in the order specified in the YAML configuration, enabling complex transformations through chaining.
- Function Signature: Each featurizer must define an `apply()` function that accepts a dataset and any required parameters.
- Input and Output:
  - Input: A Pandas DataFrame containing the dataset.
  - Output: The transformed DataFrame, returned for further processing.
- Serialization Support: Use `@register_keras_serializable` to ensure compatibility with Keras models and persistence across training and prediction.
- Parameterization: All configurable options should be passed through the YAML file; avoid hardcoded values.
```python
import pandas as pd
from keras.saving import register_keras_serializable

@register_keras_serializable(package="CustomPackage")
def apply(data, columns, target_column='y', smoothing=10.0):
    """
    Target Encoding Featurizer.

    Args:
        data (pd.DataFrame): Input dataset.
        columns (list): Categorical columns to encode.
        target_column (str): Name of the target column.
        smoothing (float): Smoothing factor to handle low-frequency categories.

    Returns:
        pd.DataFrame: Transformed dataset with encoded features.
    """
    # Compute the global mean of the target
    global_mean = data[target_column].mean()

    # Apply target encoding to each specified column
    for column in columns:
        stats = data.groupby(column)[target_column].agg(['mean', 'count'])
        stats['smooth'] = (
            (stats['mean'] * stats['count'] + global_mean * smoothing) /
            (stats['count'] + smoothing)
        )
        data[column] = data[column].map(stats['smooth']).fillna(global_mean)

    return data
```
```yaml
features:
  - name: target_encode
    params:
      columns: [job, marital, education]
      target_column: 'y'
      smoothing: 10.0
```
- Computes smoothed mean target values for specified categorical columns.
- Reduces dimensionality while preserving relationships with the target variable.
- Handles low-frequency categories using smoothing to prevent overfitting.
- Define the Function: Write the preprocessing logic using Pandas or NumPy.
- Decorate for Serialization: Use `@register_keras_serializable` to ensure compatibility with the Keras pipeline.
- Parameterize Inputs: Accept dynamic parameters, like columns and smoothing factors, via YAML.
- Return Transformed Data: Ensure the function outputs a modified DataFrame with the desired features.
- Test Independently: Validate the featurizer on sample datasets before integrating it into the Maudlin pipeline.
Python Implementation
```python
import numpy as np

def apply(data, columns, lower_percentile=0.01, upper_percentile=0.99):
    # Cap each column at the given percentiles to limit the influence of outliers
    for col in columns:
        lower = data[col].quantile(lower_percentile)
        upper = data[col].quantile(upper_percentile)
        data[col] = np.clip(data[col], lower, upper)
    return data
```
YAML Configuration
```yaml
- name: winsorize
  params:
    columns: [balance]
    lower_percentile: 0.01
    upper_percentile: 0.99
```
Python Implementation
```python
def apply(data, columns, method='multiply'):
    # Combine a pair of columns into a single interaction feature
    col1, col2 = columns
    if method == 'multiply':
        data[f"{col1}_{col2}_interaction"] = data[col1] * data[col2]
    return data
```
YAML Configuration
```yaml
- name: interaction_term
  params:
    columns: [age, balance]
    method: multiply
```
- Modular Design: Focus each featurizer on a single, well-defined transformation.
- Chaining Support: Ensure compatibility with prior and subsequent transformations by assuming data may already be altered.
- Parameter Flexibility: Avoid hardcoding parameters; rely on YAML for configuration.
- Error Handling: Include checks for missing or invalid columns to prevent runtime errors.
- Testing and Debugging: Test on sample data and log intermediate outputs for debugging complex chains.
- Reproducibility: Use consistent names and column outputs to simplify debugging and reuse.
Binning
- Purpose: Group continuous numeric values into discrete bins or intervals.
- Example: Categorize ages into bins like 0–18, 19–35, 36–50, and 51+.
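A binning featurizer could be sketched with `pd.cut`; the bin edges, labels, and output column name below are illustrative, not part of Maudlin itself:

```python
import pandas as pd

def apply(data, column, bins, labels):
    # pd.cut assigns each value to a labeled interval (right-inclusive by default)
    data[f"{column}_binned"] = pd.cut(data[column], bins=bins, labels=labels)
    return data

# Example: the age bins mentioned above
df = pd.DataFrame({"age": [10, 25, 40, 70]})
df = apply(df, "age", bins=[0, 18, 35, 50, 120], labels=["0-18", "19-35", "36-50", "51+"])
```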
Column Dropping
- Purpose: Remove unnecessary or redundant columns from the dataset.
- Example: Drop columns like `customer_id` or `timestamp`.
Frequency Encoding
- Purpose: Replace categorical values with their frequency counts.
- Example: Replace city names with the count of occurrences for each city.
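A minimal frequency-encoding sketch using `value_counts`; the `city` column is just sample data:

```python
import pandas as pd

def apply(data, columns):
    # Replace each category with how often it appears in the column
    for col in columns:
        data[col] = data[col].map(data[col].value_counts())
    return data

df = pd.DataFrame({"city": ["Oslo", "Bergen", "Oslo", "Oslo"]})
df = apply(df, ["city"])
```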
Target Encoding
- Purpose: Replace categorical values with the mean of the target variable for each category.
- Example: Encode product categories based on average sales.
Interaction Terms
- Purpose: Create features that combine two or more variables to capture relationships.
- Example: Multiply `price` and `quantity_sold` to generate `revenue`.
Min-Max Normalization
- Purpose: Scale numeric features to a specified range, usually [0, 1].
- Example: Normalize income values between 0 and 1.
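Min-max scaling can be written directly from the formula `(x - min) / (max - min)`; this sketch guards against a constant column:

```python
import pandas as pd

def apply(data, columns):
    # Scale each column to [0, 1]; a constant column maps to 0.0
    for col in columns:
        lo, hi = data[col].min(), data[col].max()
        data[col] = (data[col] - lo) / (hi - lo) if hi > lo else 0.0
    return data

df = pd.DataFrame({"income": [20_000, 60_000, 100_000]})
df = apply(df, ["income"])
```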
Moving Average
- Purpose: Smooth time-series data by averaging values over a sliding window.
- Example: Calculate 7-day and 30-day moving averages for sales data.
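One plausible implementation of a moving-average featurizer, using the same `periods`/`data_field_name` parameters and `moving_avg_N` column names shown elsewhere in this document (Maudlin's built-in version may differ):

```python
import pandas as pd

def apply(data, periods, data_field_name):
    # Add one rolling-mean column per window size; min_periods=1 avoids leading NaNs
    for p in periods:
        data[f"moving_avg_{p}"] = (
            data[data_field_name].rolling(window=p, min_periods=1).mean()
        )
    return data

df = pd.DataFrame({"quantity_sold": [2.0, 4.0, 6.0, 8.0]})
df = apply(df, periods=[3], data_field_name="quantity_sold")
```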
One-Hot Encoding
- Purpose: Convert categorical variables into binary indicator columns.
- Example: Encode gender into two columns: `male` and `female`.
Winsorizing
- Purpose: Limit extreme values (outliers) by capping them at specified percentiles.
- Example: Cap incomes above the 95th percentile and below the 5th percentile.
Featurizers are Python functions dynamically loaded by their names specified in the YAML file. The Maudlin framework applies them sequentially during preprocessing.
```yaml
features:
  - name: moving_average
    params:
      periods: [3, 5]
      data_field_name: 'sales'
  - name: lag_features
    params:
      lags: [1, 3]
      data_field_name: 'sales'
```
- Ensure functions are stateless and operate only on input data.
- Log intermediate outputs to verify feature creation.
- Validate outputs by visualizing features using histograms, line plots, or scatter plots.
```python
print(data[['date', 'quantity_sold', 'moving_avg_3', 'moving_avg_7']].head(10))
```
- Use descriptive names for features.
- Document assumptions and parameters in the code.
- Test featurizers independently before integrating them into workflows.
- Apply feature selection to reduce dimensionality.
- Consider computational efficiency for large datasets.
Featurizers are a powerful tool in Maudlin to transform raw data into features that enhance model performance. By leveraging YAML configurations and reusable Python functions, Maudlin ensures flexibility and scalability in feature engineering. Proper validation and testing of featurizers can significantly impact model accuracy and robustness.