
[BUG] KeyError on DataPreprocessing #160

Open
LordGedelicious opened this issue Nov 23, 2024 · 6 comments
Labels
bug Something isn't working

Comments


LordGedelicious commented Nov 23, 2024

Describe the bug
Currently trying to create a drug activity binary classification model. Dataset provided here:
  • Active
  • Inactive

Bug found when trying to fit the dataset to FTTransformerClassifier, MambularClassifier, MLPClassifier, and TabTransformerClassifier. Source of the bug is located in the preprocessing module.

To Reproduce
After installing the mambular module through pip, run the following code:

import numpy as np
import pandas as pd

active = pd.read_csv('isoform_II_active_95_filtered_550.csv')
inactive = pd.read_csv('isoform_II_inactive_95_filtered_550.csv')

active = active.iloc[:, 1:]
inactive = inactive.iloc[:, 1:]

active_train = active.iloc[:450, :]
active_val = active.iloc[450:500, :]
active_test = active.iloc[500:, :]
inactive_train = inactive.iloc[:450, :]
inactive_val = inactive.iloc[450:500, :]
inactive_test = inactive.iloc[500:, :]

train_data = pd.concat([active_train, inactive_train])
val_data = pd.concat([active_val, inactive_val])
test_data = pd.concat([active_test, inactive_test])

X_train = train_data.iloc[:, :-1].values  # All columns except the last as features
y_train = train_data.iloc[:, -1].values   # Last column as the target
X_val = val_data.iloc[:, :-1].values
y_val = val_data.iloc[:, -1].values
X_test = test_data.iloc[:, :-1].values
y_test = test_data.iloc[:, -1].values

from mambular.models import TabTransformerClassifier # Change to any mambular model if needed

model = TabTransformerClassifier() # Change to any mambular model if needed

model.fit(X_train, y_train)

Error stack provided below:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-15-a8f5efccc824> in <cell line: 5>()
      3 model = TabTransformerClassifier()
      4 
----> 5 model.fit(X_train, y_train)

19 frames
/usr/local/lib/python3.10/dist-packages/mambular/models/sklearn_base_classifier.py in fit(self, X, y, val_size, X_val, y_val, max_epochs, random_state, batch_size, shuffle, patience, monitor, mode, lr, lr_patience, factor, weight_decay, checkpoint_path, dataloader_kwargs, rebuild, **trainer_kwargs)
    340             )
    341 
--> 342             self.data_module.preprocess_data(
    343                 X, y, X_val, y_val, val_size=val_size, random_state=random_state
    344             )

/usr/local/lib/python3.10/dist-packages/mambular/data_utils/datamodule.py in preprocess_data(self, X_train, y_train, X_val, y_val, val_size, random_state)
    131 
    132         # Fit the preprocessor
--> 133         self.preprocessor.fit(combined_X, combined_y)
    134 
    135         # Update feature info based on the actual processed data

/usr/local/lib/python3.10/dist-packages/mambular/preprocessing/preprocessor.py in fit(self, X, y)
    297             transformers=transformers, remainder="passthrough"
    298         )
--> 299         self.column_transformer.fit(X, y)
    300 
    301         self.fitted = True

/usr/local/lib/python3.10/dist-packages/sklearn/compose/_column_transformer.py in fit(self, X, y, **params)
    920         # we use fit_transform to make sure to set sparse_output_ (for which we
    921         # need the transformed data) to have consistent output type in predict
--> 922         self.fit_transform(X, y=y, **params)
    923         return self
    924 

/usr/local/lib/python3.10/dist-packages/sklearn/utils/_set_output.py in wrapped(self, X, *args, **kwargs)
    314     @wraps(f)
    315     def wrapped(self, X, *args, **kwargs):
--> 316         data_to_wrap = f(self, X, *args, **kwargs)
    317         if isinstance(data_to_wrap, tuple):
    318             # only wrap the first output for cross decomposition

/usr/local/lib/python3.10/dist-packages/sklearn/base.py in wrapper(estimator, *args, **kwargs)
   1471                 )
   1472             ):
-> 1473                 return fit_method(estimator, *args, **kwargs)
   1474 
   1475         return wrapper

/usr/local/lib/python3.10/dist-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y, **params)
    974             routed_params = self._get_empty_routing()
    975 
--> 976         result = self._call_func_on_transformers(
    977             X,
    978             y,

/usr/local/lib/python3.10/dist-packages/sklearn/compose/_column_transformer.py in _call_func_on_transformers(self, X, y, func, column_as_labels, routed_params)
    883                 )
    884 
--> 885             return Parallel(n_jobs=self.n_jobs)(jobs)
    886 
    887         except ValueError as e:

/usr/local/lib/python3.10/dist-packages/sklearn/utils/parallel.py in __call__(self, iterable)
     72             for delayed_func, args, kwargs in iterable
     73         )
---> 74         return super().__call__(iterable_with_config)
     75 
     76 

/usr/local/lib/python3.10/dist-packages/joblib/parallel.py in __call__(self, iterable)
   1916             output = self._get_sequential_output(iterable)
   1917             next(output)
-> 1918             return output if self.return_generator else list(output)
   1919 
   1920         # Let's create an ID that uniquely identifies the current call. If the

/usr/local/lib/python3.10/dist-packages/joblib/parallel.py in _get_sequential_output(self, iterable)
   1845                 self.n_dispatched_batches += 1
   1846                 self.n_dispatched_tasks += 1
-> 1847                 res = func(*args, **kwargs)
   1848                 self.n_completed_tasks += 1
   1849                 self.print_progress()

/usr/local/lib/python3.10/dist-packages/sklearn/utils/parallel.py in __call__(self, *args, **kwargs)
    134             config = {}
    135         with config_context(**config):
--> 136             return self.function(*args, **kwargs)
    137 
    138 

/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, params)
   1308     with _print_elapsed_time(message_clsname, message):
   1309         if hasattr(transformer, "fit_transform"):
-> 1310             res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
   1311         else:
   1312             res = transformer.fit(X, y, **params.get("fit", {})).transform(

/usr/local/lib/python3.10/dist-packages/sklearn/base.py in wrapper(estimator, *args, **kwargs)
   1471                 )
   1472             ):
-> 1473                 return fit_method(estimator, *args, **kwargs)
   1474 
   1475         return wrapper

/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py in fit_transform(self, X, y, **params)
    539             last_step_params = routed_params[self.steps[-1][0]]
    540             if hasattr(last_step, "fit_transform"):
--> 541                 return last_step.fit_transform(
    542                     Xt, y, **last_step_params["fit_transform"]
    543                 )

/usr/local/lib/python3.10/dist-packages/sklearn/utils/_set_output.py in wrapped(self, X, *args, **kwargs)
    314     @wraps(f)
    315     def wrapped(self, X, *args, **kwargs):
--> 316         data_to_wrap = f(self, X, *args, **kwargs)
    317         if isinstance(data_to_wrap, tuple):
    318             # only wrap the first output for cross decomposition

/usr/local/lib/python3.10/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
   1099         else:
   1100             # fit method of arity 2 (supervised transformation)
-> 1101             return self.fit(X, y, **fit_params).transform(X)
   1102 
   1103 

/usr/local/lib/python3.10/dist-packages/mambular/preprocessing/ple_encoding.py in fit(self, feature, target)
     97         dt.fit(feature, target)
     98 
---> 99         self.conditions = tree_to_code(dt, ["feature"])
    100         return self
    101 

/usr/local/lib/python3.10/dist-packages/mambular/preprocessing/ple_encoding.py in tree_to_code(tree, feature_names)
     68             # print(k,')',pathto[parent], tree_.value[node])
     69 
---> 70     recurse(0, 1, 0)
     71 
     72     return my_list

/usr/local/lib/python3.10/dist-packages/mambular/preprocessing/ple_encoding.py in recurse(node, depth, parent)
     65         else:
     66             k = k + 1
---> 67             my_list.append(pathto[parent])
     68             # print(k,')',pathto[parent], tree_.value[node])
     69 

KeyError: 0

Expected behavior
Normal model training, as this dataset works with other SKLearn models.

Desktop (please complete the following information):

  • Google Colab A100 GPU
  • Python version: 3.10.12
  • Mambular Version: 0.2.4
@LordGedelicious LordGedelicious added the bug Something isn't working label Nov 23, 2024
@AnFreTh
Collaborator

AnFreTh commented Nov 23, 2024

Hi,
could you maybe give us a reproducible code example with simulated data where this error occurs? I have so far not been able to reproduce it, but even if there is no further functional bug, this shows that the error messages in the preprocessor need to be improved.
Once we can reproduce the error, we will try to fix it asap.

@LordGedelicious
Author

I'm running the code in Google Colab; would a downloaded ipynb file be sufficient?

@AnFreTh
Collaborator

AnFreTh commented Nov 23, 2024

The problem is that we do not have access to the data, since you load it locally from .csv. Or is it a publicly available dataset?

@LordGedelicious
Author

Mambular_Experimentation.zip
Apologies, the dataset is publicly available, but I have curated it to some extent, so I'll provide a ZIP file containing the data and the ipynb file. Please let me know if you need anything else.

@AnFreTh
Collaborator

AnFreTh commented Nov 23, 2024

Thanks.

The problem lies in the decision tree used during PLE preprocessing, since X_train[:, 32] is a np.array of only zeroes. So either dropping that feature from training or using

model = FTTransformerClassifier(numerical_preprocessing="standardization")

model.fit(X_train, y_train)

could be a fast workaround for you.

I will leave this issue open so that we can add better error handling for situations like this.
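For reference, the failure mode can be sketched with simulated data (assuming, as the traceback indicates, that the PLE step fits a scikit-learn decision tree on each numeric feature): a zero-variance column yields a single-leaf tree with no decision paths, which is what tree_to_code trips over. A quick way to screen such columns out before fitting:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# A constant feature column, like X_train[:, 32] in the attached data
feature = np.zeros((100, 1))
target = rng.integers(0, 2, size=100)

# With zero variance the tree cannot split, so it stays a single leaf
# node and there are no decision paths for tree_to_code to collect.
dt = DecisionTreeClassifier(max_depth=3)
dt.fit(feature, target)
print(dt.tree_.node_count)  # 1

# Workaround: detect and drop zero-variance columns before model.fit
X = rng.random((100, 5))
X[:, 2] = 0.0  # simulate the constant column
constant_cols = np.flatnonzero(X.std(axis=0) == 0)
X_clean = np.delete(X, constant_cols, axis=1)
print(constant_cols, X_clean.shape)  # [2] (100, 4)
```

The column names here are made up for the sketch; the same check applied to the real X_train should flag column 32.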

@LordGedelicious
Author

Much appreciated for the fix, thank you!
