
[BUG] KeyError on DataPreprocessing #160

Open
LordGedelicious opened this issue Nov 23, 2024 · 6 comments
Labels
bug Something isn't working

Comments


LordGedelicious commented Nov 23, 2024

Describe the bug
Currently trying to create a drug activity binary classification model. Dataset provided here:
  • Active
  • Inactive

Bug found when trying to fit the dataset to FTTransformerClassifier, MambularClassifier, MLPClassifier, and TabTransformerClassifier. Source of the bug is located in the preprocessing module.

To Reproduce
After installing the mambular module through pip, run the following code:

import numpy as np
import pandas as pd

active = pd.read_csv('isoform_II_active_95_filtered_550.csv')
inactive = pd.read_csv('isoform_II_inactive_95_filtered_550.csv')

active = active.iloc[:, 1:]
inactive = inactive.iloc[:, 1:]

active_train = active.iloc[:450, :]
active_val = active.iloc[450:500, :]
active_test = active.iloc[500:, :]
inactive_train = inactive.iloc[:450, :]
inactive_val = inactive.iloc[450:500, :]
inactive_test = inactive.iloc[500:, :]

train_data = pd.concat([active_train, inactive_train])
val_data = pd.concat([active_val, inactive_val])
test_data = pd.concat([active_test, inactive_test])

X_train = train_data.iloc[:, :-1].values  # All columns except the last as features
y_train = train_data.iloc[:, -1].values   # Last column as the target
X_val = val_data.iloc[:, :-1].values
y_val = val_data.iloc[:, -1].values
X_test = test_data.iloc[:, :-1].values
y_test = test_data.iloc[:, -1].values

from mambular.models import TabTransformerClassifier # Change to any mambular model if needed

model = TabTransformerClassifier() # Change to any mambular model if needed

model.fit(X_train, y_train)

Error stack provided below:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-15-a8f5efccc824> in <cell line: 5>()
      3 model = TabTransformerClassifier()
      4 
----> 5 model.fit(X_train, y_train)

19 frames
/usr/local/lib/python3.10/dist-packages/mambular/models/sklearn_base_classifier.py in fit(self, X, y, val_size, X_val, y_val, max_epochs, random_state, batch_size, shuffle, patience, monitor, mode, lr, lr_patience, factor, weight_decay, checkpoint_path, dataloader_kwargs, rebuild, **trainer_kwargs)
    340             )
    341 
--> 342             self.data_module.preprocess_data(
    343                 X, y, X_val, y_val, val_size=val_size, random_state=random_state
    344             )

/usr/local/lib/python3.10/dist-packages/mambular/data_utils/datamodule.py in preprocess_data(self, X_train, y_train, X_val, y_val, val_size, random_state)
    131 
    132         # Fit the preprocessor
--> 133         self.preprocessor.fit(combined_X, combined_y)
    134 
    135         # Update feature info based on the actual processed data

/usr/local/lib/python3.10/dist-packages/mambular/preprocessing/preprocessor.py in fit(self, X, y)
    297             transformers=transformers, remainder="passthrough"
    298         )
--> 299         self.column_transformer.fit(X, y)
    300 
    301         self.fitted = True

/usr/local/lib/python3.10/dist-packages/sklearn/compose/_column_transformer.py in fit(self, X, y, **params)
    920         # we use fit_transform to make sure to set sparse_output_ (for which we
    921         # need the transformed data) to have consistent output type in predict
--> 922         self.fit_transform(X, y=y, **params)
    923         return self
    924 

/usr/local/lib/python3.10/dist-packages/sklearn/utils/_set_output.py in wrapped(self, X, *args, **kwargs)
    314     @wraps(f)
    315     def wrapped(self, X, *args, **kwargs):
--> 316         data_to_wrap = f(self, X, *args, **kwargs)
    317         if isinstance(data_to_wrap, tuple):
    318             # only wrap the first output for cross decomposition

/usr/local/lib/python3.10/dist-packages/sklearn/base.py in wrapper(estimator, *args, **kwargs)
   1471                 )
   1472             ):
-> 1473                 return fit_method(estimator, *args, **kwargs)
   1474 
   1475         return wrapper

/usr/local/lib/python3.10/dist-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y, **params)
    974             routed_params = self._get_empty_routing()
    975 
--> 976         result = self._call_func_on_transformers(
    977             X,
    978             y,

/usr/local/lib/python3.10/dist-packages/sklearn/compose/_column_transformer.py in _call_func_on_transformers(self, X, y, func, column_as_labels, routed_params)
    883                 )
    884 
--> 885             return Parallel(n_jobs=self.n_jobs)(jobs)
    886 
    887         except ValueError as e:

/usr/local/lib/python3.10/dist-packages/sklearn/utils/parallel.py in __call__(self, iterable)
     72             for delayed_func, args, kwargs in iterable
     73         )
---> 74         return super().__call__(iterable_with_config)
     75 
     76 

/usr/local/lib/python3.10/dist-packages/joblib/parallel.py in __call__(self, iterable)
   1916             output = self._get_sequential_output(iterable)
   1917             next(output)
-> 1918             return output if self.return_generator else list(output)
   1919 
   1920         # Let's create an ID that uniquely identifies the current call. If the

/usr/local/lib/python3.10/dist-packages/joblib/parallel.py in _get_sequential_output(self, iterable)
   1845                 self.n_dispatched_batches += 1
   1846                 self.n_dispatched_tasks += 1
-> 1847                 res = func(*args, **kwargs)
   1848                 self.n_completed_tasks += 1
   1849                 self.print_progress()

/usr/local/lib/python3.10/dist-packages/sklearn/utils/parallel.py in __call__(self, *args, **kwargs)
    134             config = {}
    135         with config_context(**config):
--> 136             return self.function(*args, **kwargs)
    137 
    138 

/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, params)
   1308     with _print_elapsed_time(message_clsname, message):
   1309         if hasattr(transformer, "fit_transform"):
-> 1310             res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
   1311         else:
   1312             res = transformer.fit(X, y, **params.get("fit", {})).transform(

/usr/local/lib/python3.10/dist-packages/sklearn/base.py in wrapper(estimator, *args, **kwargs)
   1471                 )
   1472             ):
-> 1473                 return fit_method(estimator, *args, **kwargs)
   1474 
   1475         return wrapper

/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py in fit_transform(self, X, y, **params)
    539             last_step_params = routed_params[self.steps[-1][0]]
    540             if hasattr(last_step, "fit_transform"):
--> 541                 return last_step.fit_transform(
    542                     Xt, y, **last_step_params["fit_transform"]
    543                 )

/usr/local/lib/python3.10/dist-packages/sklearn/utils/_set_output.py in wrapped(self, X, *args, **kwargs)
    314     @wraps(f)
    315     def wrapped(self, X, *args, **kwargs):
--> 316         data_to_wrap = f(self, X, *args, **kwargs)
    317         if isinstance(data_to_wrap, tuple):
    318             # only wrap the first output for cross decomposition

/usr/local/lib/python3.10/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
   1099         else:
   1100             # fit method of arity 2 (supervised transformation)
-> 1101             return self.fit(X, y, **fit_params).transform(X)
   1102 
   1103 

/usr/local/lib/python3.10/dist-packages/mambular/preprocessing/ple_encoding.py in fit(self, feature, target)
     97         dt.fit(feature, target)
     98 
---> 99         self.conditions = tree_to_code(dt, ["feature"])
    100         return self
    101 

/usr/local/lib/python3.10/dist-packages/mambular/preprocessing/ple_encoding.py in tree_to_code(tree, feature_names)
     68             # print(k,')',pathto[parent], tree_.value[node])
     69 
---> 70     recurse(0, 1, 0)
     71 
     72     return my_list

/usr/local/lib/python3.10/dist-packages/mambular/preprocessing/ple_encoding.py in recurse(node, depth, parent)
     65         else:
     66             k = k + 1
---> 67             my_list.append(pathto[parent])
     68             # print(k,')',pathto[parent], tree_.value[node])
     69 

KeyError: 0

Expected behavior
Normal model training, as this dataset works with other SKLearn models.

Desktop (please complete the following information):

  • Google Colab A100 GPU
  • Python version: 3.10.12
  • Mambular Version: 0.2.4
@LordGedelicious LordGedelicious added the bug Something isn't working label Nov 23, 2024
@AnFreTh
Collaborator

AnFreTh commented Nov 23, 2024

Hi,
could you maybe give us a reproducible code example with simulated data where this error occurs? I have so far not been able to reproduce it, but even if there is no further functional bug, this shows that the error messages in the preprocessor need to be improved.
Once we can reproduce the error, we will try to fix it asap.

@LordGedelicious
Author

I'm running the code in Google Colab; would a downloaded ipynb file be sufficient?

@AnFreTh
Collaborator

AnFreTh commented Nov 23, 2024

The problem is that we do not have access to the data, since you load it locally from .csv. Or is it a publicly available dataset?

@LordGedelicious
Author

Mambular_Experimentation.zip
Apologies, the dataset is publicly available, but I have curated it to some extent, so I'll provide a ZIP file containing the data and the ipynb file. Please let me know if you need anything else.

@AnFreTh
Collaborator

AnFreTh commented Nov 23, 2024

Thanks.

The problem lies in the decision tree used during PLE preprocessing, since X_train[:, 32] is a np.array of only zeroes. So either dropping that feature from training or using

model = FTTransformerClassifier(numerical_preprocessing="standardization")

model.fit(X_train, y_train)

could be a fast workaround for you.

I will leave this issue open so that we can add better error handling for situations like this.
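For reference, the failure mode can be sketched with simulated data (assuming, as the traceback indicates, that the PLE step fits a scikit-learn decision tree on each numeric feature): a zero-variance column yields a single-leaf tree with no decision paths, which is what tree_to_code trips over. A quick way to screen such columns out before fitting:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# A constant feature column, like X_train[:, 32] in the attached data
feature = np.zeros((100, 1))
target = rng.integers(0, 2, size=100)

# With zero variance the tree cannot split, so it stays a single leaf
# node and there are no decision paths for tree_to_code to collect.
dt = DecisionTreeClassifier(max_depth=3)
dt.fit(feature, target)
print(dt.tree_.node_count)  # 1

# Workaround: detect and drop zero-variance columns before model.fit
X = rng.random((100, 5))
X[:, 2] = 0.0  # simulate the constant column
constant_cols = np.flatnonzero(X.std(axis=0) == 0)
X_clean = np.delete(X, constant_cols, axis=1)
print(constant_cols, X_clean.shape)  # [2] (100, 4)
```

The column names here are made up for the sketch; the same check applied to the real X_train should flag column 32.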

@LordGedelicious
Author

Much appreciated for the fix, thank you!
