Fillna does not work if fields_group is not None #1851

Open
LeetaH666 opened this issue Sep 26, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@LeetaH666

🐛 Bug Description

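The Fillna processor has no effect when fields_group is not None. In that branch the filled values are written into df.values, which can be a copy of the data rather than a view, so the DataFrame itself is left unchanged.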

To Reproduce

Use any model and specify fields_group for the Fillna processor.
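
The underlying pandas behavior can be sketched without qlib. This is only an illustration under stated assumptions: the toy frame and its column names are hypothetical, and the integer column is there on purpose so that df.values has to build a fresh array instead of exposing the frame's own memory.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame. Mixing float and int columns means df.values must
# assemble a new float array, so it cannot be a view of the DataFrame's data.
df = pd.DataFrame({"feat": [1.0, np.nan, 3.0], "count": [1, 2, 3]})

nan_select = np.isnan(df.values)

# This writes into a temporary array that is discarded right away.
df.values[nan_select] = 0.0

# Count the NaNs that are still present in the DataFrame itself.
remaining_nans = int(df.isna().sum().sum())
print(remaining_nans)
```

Whether df.values is a copy or a view depends on the frame's dtypes and on the pandas version, which would explain why the in-place write appears to work in some setups and not in others.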

Expected Behavior

No NaN values remain after calling Fillna.

Additional Notes

Same as the issue here: #1307 (comment).

LeetaH666 added the bug label on Sep 26, 2024
@LeetaH666
Author

I think simply using slice assignment would be ok:

    def __call__(self, df):
        cols = get_group_columns(df, self.fields_group)
        df.loc[:, cols] = df.loc[:, cols].fillna(self.fill_value)
        return df
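
A self-contained way to try the slice-assignment idea outside qlib might look like the sketch below. select_group_columns is a hypothetical stand-in for qlib's get_group_columns, and the two-level column layout ("feature" / "label") is an assumption made only for this example.

```python
import numpy as np
import pandas as pd


def select_group_columns(df, group):
    """Hypothetical stand-in for qlib's get_group_columns."""
    if group is None:
        return df.columns
    # Keep the columns whose first level matches the requested group.
    return df.columns[df.columns.get_level_values(0) == group]


columns = pd.MultiIndex.from_tuples(
    [("feature", "f1"), ("feature", "f2"), ("label", "y")]
)
df = pd.DataFrame([[1.0, np.nan, np.nan], [np.nan, 2.0, 0.5]], columns=columns)

# Fill only the selected group and write the result back through .loc.
cols = select_group_columns(df, "feature")
df.loc[:, cols] = df.loc[:, cols].fillna(0.0)

feature_nans = int(df["feature"].isna().sum().sum())
label_nans = int(df["label"].isna().sum().sum())
print(feature_nans, label_nans)
```

Because the assignment goes through df.loc, it does not rely on df.values being a view; columns outside the group keep their NaNs.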

@LeetaH666
Author

Or, if you want to use numpy for speed (I see a 10x speedup), assign df.values (or df.to_numpy()) to a variable first, then fill the NaNs in that array and assign it back:

    def __call__(self, df):
        if self.fields_group is None:
            df.fillna(self.fill_value, inplace=True)
        else:
            cols = get_group_columns(df, self.fields_group)
            # this implementation is extremely slow
            # df.fillna({col: self.fill_value for col in cols}, inplace=True)

            #! similar to qlib.data.dataset.processor.Fillna, we use numpy to accelerate
            #! but instead, we assign the numpy array to a variable first
            df_values = df[cols].to_numpy()
            nan_select = np.isnan(df_values)
            #! then fill value and assign back
            df_values[nan_select] = self.fill_value
            df.loc[:, cols] = df_values
        return df
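
The same idea as a standalone function, for trying it outside the processor class. fill_group_nans and the toy columns are hypothetical names, and passing copy=True is an assumption added here so that the array is always writable, whether or not pandas would otherwise hand back a view.

```python
import numpy as np
import pandas as pd


def fill_group_nans(df, cols, fill_value=0.0):
    """Hypothetical helper: fill NaNs in `cols` through a single NumPy array."""
    # Hold the array in a variable so the filled values are not lost.
    values = df[cols].to_numpy(dtype=float, copy=True)
    values[np.isnan(values)] = fill_value
    # Write the filled block back into the original DataFrame.
    df.loc[:, cols] = values
    return df


df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0], "c": [np.nan, 6.0]})
fill_group_nans(df, ["a", "b"])

filled = df[["a", "b"]].to_numpy().tolist()
untouched_nans = int(df["c"].isna().sum())
print(filled, untouched_nans)
```

Only the listed columns are touched, so a NaN in any other column survives the call.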
