SNOW-1788391 Validation Error When Converting Pandas String Column to NumPy Array in Prediction Function #123

Open
kenkoooo opened this issue Oct 27, 2024 · 4 comments

Comments

@kenkoooo

When a Pandas DataFrame containing a string column is passed to the prediction function, it is converted to a NumPy array and then validated. During validation, the column's data type is compared with the type specified in the saved model's signature.

Even if the column's type is string[python] in the original Pandas DataFrame, it will be represented as 'O' (object) after conversion to a NumPy array.

As a result, when arr.dtype is 'O' and feature_type._numpy_type is np.str_, np.can_cast(arr.dtype, feature_type._numpy_type, casting='no') returns False and the validation fails.
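
The dtype mismatch can be reproduced without Snowflake. The following is a minimal sketch using only pandas and NumPy; it mirrors the np.can_cast check quoted above but is not the library's actual validation code, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A column using the pandas extension string dtype, as in the report.
series = pd.Series(["A", "B", "C"], dtype="string[python]")

# Converting to NumPy drops the extension dtype and yields an object array.
arr = series.to_numpy()
print(arr.dtype)

# With casting="no", can_cast only accepts identical dtypes,
# so a cast from object to np.str_ is rejected.
print(np.can_cast(arr.dtype, np.str_, casting="no"))
```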

@sfc-gh-shchen

Hi @kenkoooo, could you kindly provide a repro code snippet?

@kenkoooo
Author

kenkoooo commented Nov 2, 2024

When you register a model as in the following example, the registered model accepts a DataFrame with a string-typed column:

import snowflake.snowpark as snowpark
from snowflake.ml.model import custom_model
import pandas as pd
from snowflake.ml.model.model_signature import ModelSignature, FeatureSpec, DataType
from snowflake.ml.registry import Registry


class MyCustomModel(custom_model.CustomModel):
    def __init__(self, context: custom_model.ModelContext) -> None:
        super().__init__(context)

    @custom_model.inference_api
    def predict(self, X: pd.DataFrame) -> pd.DataFrame:
        return X


def main(session: snowpark.Session):
    mc = custom_model.ModelContext()
    model = MyCustomModel(mc)

    signature = ModelSignature(
        inputs=[FeatureSpec(name="COL", dtype=DataType.STRING)],
        outputs=[FeatureSpec(name="COL", dtype=DataType.STRING)],
    )

    reg = Registry(session=session)
    reg.log_model(model, model_name="MY_COOL_MODEL", signatures={"predict": signature})
    return session.create_dataframe([["OK"]])

You can use this model like this:

import snowflake.snowpark as snowpark
import pandas as pd
from snowflake.ml.registry import Registry


def main(session: snowpark.Session):
    reg = Registry(session=session)
    mv = reg.get_model("MY_COOL_MODEL").last()

    X = pd.DataFrame({"COL": ["A", "B", "C"]})

    res = mv.run(X, function_name="PREDICT", strict_input_validation=False)

    return session.create_dataframe(res)

This works as expected since the COL column is of type object, not string. However, if you explicitly change it to string, it will not work:

import snowflake.snowpark as snowpark
import pandas as pd
from snowflake.ml.registry import Registry


def main(session: snowpark.Session):
    reg = Registry(session=session)
    mv = reg.get_model("MY_COOL_MODEL").last()

    X = pd.DataFrame({"COL": ["A", "B", "C"]})

    # Ensure that the column is of type string
    X["COL"] = X["COL"].astype("string")

    res = mv.run(X, function_name="PREDICT", strict_input_validation=False)

    return session.create_dataframe(res)

You will encounter an error message like the following:

Traceback (most recent call last):
  File "snowflake/ml/_internal/telemetry.py", line 527, in wrap
    return ctx.run(execute_func_with_statement_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "snowflake/ml/_internal/telemetry.py", line 503, in execute_func_with_statement_params
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "snowflake/ml/model/_client/model/model_version_impl.py", line 461, in run
    return self._model_ops.invoke_method(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "snowflake/ml/model/_client/ops/model_ops.py", line 812, in invoke_method
    df = model_signature._convert_and_validate_local_data(X, signature.inputs, strict=strict_input_validation)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "snowflake/ml/model/model_signature.py", line 649, in _convert_and_validate_local_data
    _validate_pandas_df(df, features, strict=strict)
  File "snowflake/ml/model/model_signature.py", line 219, in _validate_pandas_df
    raise snowml_exceptions.SnowflakeMLException(
snowflake.ml._internal.exceptions.exceptions.SnowflakeMLException: ValueError('(2112) Data Validation Error in feature COL: Feature type DataType.STRING is not met by all elements in 0    A\n1    B\n2    C\nName: COL, dtype: string.')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  Worksheet, line 17, in main
    res = mv.run(X, function_name="PREDICT", strict_input_validation=False)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "snowflake/ml/_internal/telemetry.py", line 529, in wrap
    raise e.original_exception from e
ValueError: (2112) Data Validation Error in feature COL: Feature type DataType.STRING is not met by all elements in 0    A
1    B
2    C
Name: COL, dtype: string.

@sfc-gh-wzhao
Collaborator

Hi @kenkoooo, thank you for reporting this issue. We believe this happens because pandas.StringDtype is not supported by us yet. We will add support soon; until then, you can work around it with .astype(np.str_).
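
A minimal sketch of that workaround, assuming the DataFrame from the repro above. The name X_fixed is illustrative, and the mv.run call is left as a comment because it needs a live session and a registered model.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"COL": ["A", "B", "C"]})
X["COL"] = X["COL"].astype("string")  # the dtype that triggers the error

# Workaround from this thread: cast the column away from the pandas
# string extension dtype before passing the DataFrame to the model.
X_fixed = X.assign(COL=X["COL"].astype(np.str_))
print(X_fixed["COL"].dtype)  # inspect the resulting dtype

# res = mv.run(X_fixed, function_name="PREDICT", strict_input_validation=False)
```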

@sfc-gh-wzhao changed the title from "Validation Error When Converting Pandas String Column to NumPy Array in Prediction Function" to "SNOW-1788391 Validation Error When Converting Pandas String Column to NumPy Array in Prediction Function" on Nov 5, 2024
@sfc-gh-pramachandran

@kenkoooo This bug is fixed in snowflake-ml-python 1.7.0 and later. Thanks again for raising this issue.
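
To check whether an environment already has the fix, a sketch like the following may help. It uses only the standard library; has_string_dtype_fix is a hypothetical helper, not part of snowflake-ml-python, and it ignores pre-release suffixes.

```python
import re
from importlib import metadata


def has_string_dtype_fix(version: str) -> bool:
    """Return True if the version string is 1.7.0 or later.

    Hypothetical helper: only the leading numeric components are compared,
    so pre-release suffixes such as "rc1" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    major, minor, patch = (int(part or 0) for part in match.groups())
    return (major, minor, patch) >= (1, 7, 0)


try:
    installed = metadata.version("snowflake-ml-python")
    print(installed, has_string_dtype_fix(installed))
except metadata.PackageNotFoundError:
    print("snowflake-ml-python is not installed")
```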
