Using ColumnTransformer with OneHotEncoder creates lots of discrepancy between the raw predictions and onnx inference #1046
Comments
Is it possible to know if the predictions are all wrong or only a couple of them? You can use …
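One way to check this is a per-sample comparison (a minimal sketch, not necessarily the tool suggested above; it assumes the fitted pipeline `est_val`, the exported `xgb_pipeline.onnx`, a test frame `X_test`, and that the model was converted with `options={"zipmap": False}` so the probabilities come back as a plain tensor):

```python
import numpy as np
import onnxruntime as rt

sess = rt.InferenceSession("xgb_pipeline.onnx", providers=["CPUExecutionProvider"])

# One input per column, shaped (n, 1); float64 columns were declared as
# FloatTensorType in the schema, so cast them down to float32.
feeds = {}
for c in X_test.columns:
    a = X_test[c].to_numpy().reshape(-1, 1)
    feeds[c] = a.astype(np.float32) if a.dtype == np.float64 else a

onnx_proba = sess.run(None, feeds)[1]  # outputs are [label, probabilities]
skl_proba = est_val.predict_proba(X_test)

per_sample = np.abs(skl_proba - onnx_proba).max(axis=1)
print("max abs diff:", per_sample.max())
print("samples off by more than 1e-3:",
      int((per_sample > 1e-3).sum()), "of", len(per_sample))
```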
@xadupre I can reproduce the same issue. I'm trying to convert an sklearn pipeline with a OneHotEncoder, and I'm seeing vast differences in predicted probabilities between the "native" sklearn model and the ONNX-converted model (see chart on the left). Only if I comment out the categorical transformer do the predicted probabilities align (chart on the right). Here's my environment:
Here's the pipeline + ONNX conversion code:

```python
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from skl2onnx.common.data_types import FloatTensorType, StringTensorType, Int64TensorType, DoubleTensorType
from skl2onnx import convert_sklearn, to_onnx, update_registered_converter
from skl2onnx.common.shape_calculator import calculate_linear_classifier_output_shapes
from onnxmltools.convert.xgboost.operator_converters.XGBoost import convert_xgboost
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


def convert_dataframe_schema(df, drop=None):
    # map each dataframe column to a (name, tensor type) pair for skl2onnx
    inputs = []
    for k, v in zip(df.columns, df.dtypes):
        if drop is not None and k in drop:
            continue
        if v == "int64":
            t = Int64TensorType([None, 1])
        elif v == "float64":
            t = FloatTensorType([None, 1])
        else:
            t = StringTensorType([None, 1])
        inputs.append((k, t))
    return inputs


def get_categorical_features(df: pd.DataFrame):
    dtype_ser = df.dtypes
    categorical_features = dtype_ser[dtype_ser == "object"].index.tolist()
    return categorical_features


numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)
categorical_transformer = Pipeline(
    steps=[
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)
categorical_features = get_categorical_features(X_train)
numeric_features = [f for f in X_train.columns if f not in categorical_features]
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        # ("cat", categorical_transformer, categorical_features),
    ]
)
est_val = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", XGBClassifier(enable_categorical=False, random_state=42)),
    ]
)
est_val.fit(X_train, y_train)

# guess model input schema from training data
schema = convert_dataframe_schema(X_train)

# register XGBoost model converter
update_registered_converter(
    XGBClassifier,
    "XGBoostXGBClassifier",
    calculate_linear_classifier_output_shapes,
    convert_xgboost,
    options={"nocl": [True, False], "zipmap": [True, False, "columns"]},
)

# convert sklearn pipeline to ONNX
model_onnx = convert_sklearn(
    est_val,
    "xgb_pipeline",
    schema,
    target_opset={"": 12, "ai.onnx.ml": 3},
)

# save onnx model file
with open("xgb_pipeline.onnx", "wb") as f:
    f.write(model_onnx.SerializeToString())
```
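Note that `X_train` and `y_train` are not defined in the snippet. A hypothetical stand-in with mixed dtypes (my addition, not part of the original report) that makes it self-contained could look like:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only to make the snippet above runnable end to end.
rng = np.random.RandomState(42)
X_train = pd.DataFrame(
    {
        "age": rng.uniform(18.0, 90.0, size=500),                 # float64 column
        "visits": rng.randint(0, 20, size=500),                   # integer column
        "city": rng.choice(["paris", "lyon", "nice"], size=500),  # object column
    }
)
y_train = (X_train["age"] + rng.normal(0, 10, size=500) > 50).astype(int)
```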
Quick update here: after some deeper digging, I found the root cause. It's the sparsity of the OneHotEncoder output.
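If sparse output is indeed the culprit, the usual workaround (my assumption, not confirmed in this thread) is to force dense matrices end to end, reusing the names from the snippet above:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# scikit-learn >= 1.2 spells this `sparse_output`; older releases use `sparse`.
dense_onehot = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", dense_onehot, categorical_features),
    ],
    sparse_threshold=0.0,  # never let the ColumnTransformer emit a sparse matrix
)
```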
If you see any bug, could you complete PR #1140 with some dummy data that fails?
I am trying to build a simple sklearn pipeline with a StandardScaler for the numerical features and a OneHotEncoder for the categorical ones.
As shown here, I use the CastTransformer to reduce the discrepancies between the raw predictions and the ONNX inference induced by the StandardScaler, and it works fine.
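For reference, the CastTransformer trick mentioned above looks roughly like this (a sketch along the lines of the skl2onnx documentation):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from skl2onnx.sklapi import CastTransformer

# Scale in float64, then cast down to float32 so scikit-learn rounds the
# same way the ONNX graph will.
numeric_transformer = Pipeline(
    steps=[
        ("cast64", CastTransformer(dtype=np.float64)),
        ("scaler", StandardScaler()),
        ("cast32", CastTransformer()),  # default target dtype is float32
    ]
)
```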
The problem arises when I use the OneHotEncoder for the categorical features, which creates a lot of discrepancies.
Here is the complete code:
Without integrating the categorical columns (by commenting out the `("cat", categorical_transformer, categorical_features)` line), I get the following differences: `(1.7136335372924805e-07, 0.09922722248309723)`.
But when integrating the categorical columns (and so the OneHotEncoder), I get the following differences: `(0.4294445514678955, 11.320544924924578)`.
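For context, the pairs above are presumably (max absolute difference, max relative difference) over the predicted probabilities, computed with a helper like the one in the skl2onnx float-issues tutorial:

```python
import numpy as np

def diff(p1, p2):
    # returns (max absolute difference, max relative difference)
    p1, p2 = p1.ravel(), p2.ravel()
    d = np.abs(p2 - p1)
    return d.max(), (d / np.abs(p1)).max()
```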
I was expecting the discrepancy to grow somewhat, but not by such a large amount.
Is this the expected behavior? Is there any way to reduce it?