This repository contains supplementary code for the paper "Proposal of an Automated Feature Engineering Pipeline for High-Dimensional Tabular Regression Data Using Reinforcement Learning". Author: Julian Müller [email protected], on behalf of MBition GmbH.
The source code has been tested solely for our own use cases, which might differ from yours. This project is actively maintained and contributions are welcome.
`automotive_feature_engineering` is a Python package that automates feature engineering for large in-car communication datasets in the automotive industry. It transforms raw data into meaningful input features for machine learning models, improving efficiency and reducing computational overhead. It supports both static feature engineering and dynamic feature engineering through reinforcement learning.
To clone the source code of this repository to your local machine, follow these steps:
- Install Git: Make sure you have Git installed on your computer. If not, you can download it from git-scm.com.
- Open a Terminal/Command Prompt: Navigate to the directory where you want to clone the repository.
- Clone the Repository: Use the `git clone` command followed by the repository URL. Run the following command for HTTPS:
  `git clone https://github.com/mercedes-benz/automotive_feature_engineering.git`
  Or this one for SSH:
  `git clone [email protected]:mercedes-benz/automotive_feature_engineering.git`
pip install dist/automotive_feature_engineering-0.1.0-py3-none-any.whl
Index | Method | Parameters | Description |
---|---|---|---|
0 | `` | - | Do nothing with features |
1 | drop_correlated_features_09 | - | Drop highly correlated features with a correlation threshold of 0.9. |
2 | drop_correlated_features_095 | - | Drop highly correlated features with a correlation threshold of 0.95. |
3 | sns_handling_median_8 | - | Fill NaN values with the median for columns with more than 8 unique values. |
4 | sns_handling_median_32 | - | Fill NaN values with the median for columns with more than 32 unique values. |
5 | sns_handling_mean_8 | - | Fill NaN values with the mean for columns with more than 8 unique values. |
6 | sns_handling_mean_32 | - | Fill NaN values with the mean for columns with more than 32 unique values. |
7 | sns_handling_zero_8 | - | Fill NaN values with 0 for columns with more than 8 unique values. |
8 | sns_handling_zero_32 | - | Fill NaN values with 0 for columns with more than 32 unique values. |
9 | filter_by_variance | - | Removes columns with variance below 0.1 across datasets. |
10 | ohe | - | Applies one-hot encoding to categorical variables in datasets. |
11 | feature_importance_filter_00009999 | - | Filters out features from datasets that have an importance less than 0.00009999. |
12 | feature_importance_filter_00049999 | - | Filters out features from datasets that have an importance less than 0.00049999. |
13 | pca | - | Applies Principal Component Analysis transformation to reduce dimensionality. |
14 | polynominal_features | - | Enhances feature set by creating polynomial terms. |
99 | filter_by_variance_0 | - | Removes columns with only one unique value across datasets. |
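To make two of the table entries more concrete, the following is a minimal sketch of what a correlation filter (as in `drop_correlated_features_09`) and a median fill (as in `sns_handling_median_8`) could look like in pandas. It is not the package's implementation: the function names `drop_correlated` and `fill_nan_median`, the use of absolute Pearson correlation, and the column names are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one column of every pair whose absolute correlation exceeds `threshold`.

    Hypothetical helper, not part of automotive_feature_engineering.
    """
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


def fill_nan_median(df: pd.DataFrame, min_unique: int = 8) -> pd.DataFrame:
    """Fill NaN with the column median for numeric columns with more than
    `min_unique` distinct values. Hypothetical helper for illustration.
    """
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        if out[col].nunique() > min_unique:
            out[col] = out[col].fillna(out[col].median())
    return out


# Toy data with made-up signal names.
df = pd.DataFrame({
    "speed": [float(i) for i in range(10)],
    "speed_copy": [2.0 * i + 1.0 for i in range(10)],  # linear in "speed"
    "gear": [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
})
df.loc[3, "speed"] = np.nan

cleaned = fill_nan_median(drop_correlated(df, threshold=0.9), min_unique=8)
print(cleaned.columns.tolist())
```

Here `speed_copy` is removed because it is linearly dependent on `speed`, and the missing `speed` value is filled because that column has more than 8 distinct values, while `gear` is left untouched.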
The static and manual methods in the `automotive_feature_engineering` package perform feature engineering on automotive datasets. The static method uses a predefined sequence of feature engineering steps, while the manual method allows users to specify their own sequence.
Parameter | Type | Description | Default Value |
---|---|---|---|
df_train | pd.DataFrame | Training data. | Required |
df_test | pd.DataFrame | Test data. | Required |
model | str | Model to be used for feature selection. Options: etree, randomforest. | Required |
target_names_list | List[str] | List of target names. | Required |
import_joblib_path | str, optional | Path to import joblib file of previously exported feature engineering methods. | None |
alt_docu_path | str, optional | Alternative documentation path. | None |
alt_config | Dict, optional | Alternative configuration dictionary. | None |
unrelated_cols | List[str], optional | List of columns that are not considered in feature engineering. | None |
model_export | bool | Whether to export the model. | False |
fe_export_joblib | bool | Whether to export the feature engineering methods used. | False |
explainable | bool | If set to True, a pipeline without polynomial features is used. | False |
Prepare your training and testing datasets as pd.DataFrame.
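As a rough illustration of this preparation step, the sketch below builds two data frames from a single recording with a chronological 80/20 split. The signal names, the split ratio, and the assumption that the target columns are present in both frames are illustrative choices, not requirements stated by the package.

```python
import pandas as pd

# Hypothetical signal names; replace them with the columns of your own recording.
raw = pd.DataFrame({
    "timestamp": list(range(10)),
    "wheel_speed": [0.0, 1.5, 3.1, 4.4, 6.0, 7.6, 9.1, 10.4, 12.0, 13.5],
    "brake_pressure": [5.0, 4.0, 4.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0, 0.0],
    "vehicle_speed": [0.0, 1.4, 3.0, 4.5, 6.1, 7.5, 9.0, 10.5, 12.1, 13.4],
})

# Names of the columns the model should predict.
target_names_list = ["vehicle_speed"]

# Chronological split (no shuffling): first 80 % for training, rest for testing.
split = int(len(raw) * 0.8)
df_train = raw.iloc[:split].reset_index(drop=True)
df_test = raw.iloc[split:].reset_index(drop=True)

print(df_train.shape, df_test.shape)
```

A chronological split is used here only because in-car signals are typically recorded as time series; a random split may be more appropriate for other data.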
With your data frames ready, you can now call the static method. You need to specify additional parameters such as the model type and target features list according to your specific needs. The static method does not require a method list as it uses a predefined sequence of methods.
# Import function
from automotive_feature_engineering import static
# Execute the static method
results = static(df_train, df_test, model, target_names_list)
If no method list is provided, the default pipeline will be used.
If you want to specify your own sequence of feature engineering steps, use the manual method. You need to provide a method list along with other parameters.
# Import function
from automotive_feature_engineering import manual
# Execute the manual method
results = manual(method_list, df_train, df_test, model, target_names_list)
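One way to assemble a method list is to look up entries from the methods table by index. The helper below is a small sketch under stated assumptions: the mapping is copied from the table, but `names_for` is a hypothetical function, and whether `manual` expects method names or indices is not stated here, so check the documentation before passing the result on.

```python
# Index-to-name mapping copied from the methods table in this README.
METHODS = {
    0: "",
    1: "drop_correlated_features_09",
    2: "drop_correlated_features_095",
    3: "sns_handling_median_8",
    4: "sns_handling_median_32",
    5: "sns_handling_mean_8",
    6: "sns_handling_mean_32",
    7: "sns_handling_zero_8",
    8: "sns_handling_zero_32",
    9: "filter_by_variance",
    10: "ohe",
    11: "feature_importance_filter_00009999",
    12: "feature_importance_filter_00049999",
    13: "pca",
    14: "polynominal_features",
    99: "filter_by_variance_0",
}


def names_for(indices):
    """Translate table indices into method names, rejecting unknown indices.

    Hypothetical helper, not part of automotive_feature_engineering.
    """
    unknown = [i for i in indices if i not in METHODS]
    if unknown:
        raise KeyError(f"Unknown method indices: {unknown}")
    return [METHODS[i] for i in indices]


# Example sequence: remove constant columns, fill NaN, drop correlated, encode.
method_list = names_for([99, 3, 1, 10])
print(method_list)
```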
The RL method in the `automotive_feature_engineering` package performs dynamic feature engineering on automotive datasets using reinforcement learning. It processes the input data frames to adaptively select and engineer the features that matter for predictive modeling and further analysis.
Parameter | Type | Description | Default Value |
---|---|---|---|
df_train | pd.DataFrame | Training data used in reinforcement learning. | Required |
df_train_origin | pd.DataFrame | Train data. | Required |
df_test_origin | pd.DataFrame | Test data. | Required |
model | str | Model to be used for feature selection. Options: etree, randomforest. | Required |
target_names_list | List[str] | List of target names. | Required |
rl_raster | float | Sampling rate of input data. | Required |
alt_docu | str, optional | Alternative documentation path. | None |
alt_config | Dict, optional | Alternative configuration dictionary. | None |
unrelated_cols | List[str], optional | List of columns that are not considered in feature engineering. | None |
Prepare your training and testing datasets as pd.DataFrame. In addition to the original training and testing datasets, create a separate training dataset specifically for reinforcement learning.
Once your data frames are prepared, you can call the RL method. You need to specify additional parameters such as the model type, the target names list, and any other parameters tailored to your needs.
# Import function
from automotive_feature_engineering import rl
# Execute the rl method
results = rl(df_train, df_train_origin, df_test_origin, target_names_list, model, rl_raster, unrelated_cols, alt_config, alt_docu)
For more examples, please refer to the Documentation.
The instructions on how to contribute can be found in the file CONTRIBUTING.md in this repository.
The code is published under the MIT license. Further information on that can be found in the LICENSE.md file in this repository.
@article{key2023, title={}, author={}, year={2023}, url={} }