Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parameter logging a nested dict fails if too long #441

Closed
Joenetics opened this issue Aug 4, 2023 · 2 comments
Closed

Parameter logging a nested dict fails if too long #441

Joenetics opened this issue Aug 4, 2023 · 2 comments

Comments

@Joenetics
Copy link

Joenetics commented Aug 4, 2023

Description

When I set the long_params_strategy to 'tag' in my mlflow.yml file, it fails if my parameter is a nested dict.

The error is in https://github.com/Galileo-Galilei/kedro-mlflow/blob/master/kedro_mlflow/framework/hooks/mlflow_hook.py I think.

Context

I have some nested dict parameter file, and I dont even want to log it, but I am forced to. If I must, simply tagging is fine- but it fails as it is a nested dict.

Steps to Reproduce

1- set 'long_params_strategy' to 'tag' in my mlflow.yml file
2- use a large nested dict in a parameter yml
3- call the dict in a node
4- pipeline will crash as nested dict is greater than 5000 chars

Expected Result

log dict as simple tag in params in mlflow

Actual Result

crashes due to being nested dict

RestException: INVALID_PARAMETER_VALUE: Tag value '{'MY KEY': {'NESTED KEY': 'VALUE', 'NESTED KEY': 
'VALUE, 'NESTED KEY': 'VALUE', 'NESTED KEY': "VALUE"}, 'NESTED KEY':' had length 35886, which exceeded length limit of 5000


I replaced keys/values to hide my data.

Your Environment

I do not believe env is an issue

Does the bug also happen with the last version on master?

Yes

@Galileo-Galilei
Copy link
Owner

Galileo-Galilei commented Aug 22, 2023

Hi, sorry for the long delay for the reply. This limit is a mlflow limit and it is unlikely to change as they already stated they won't.

I have 2 workarounds in mind that I list hereafter, please tell me if they fit your need.

Log each key separately

You can use the dict_params argument with recursive=True option in your mlflow.yml:

tracking:
  params:
    dict_params:
      flatten: True # -> change to True. parameter which are dictionary will be splitted in multiple parameters when logged in mlflow, one for each key.
      recursive: True  # Should the dictionary flattening be applied recursively (i.e for nested dictionaries)? Not use if `flatten_dict_params` is False.
      sep: "." # In case of recursive flattening, what separator should be used between the keys? E.g. {hyperaparam1: {p1:1, p2:2}} will be logged as hyperaparam1.p1 and hyperaparam1.p2 in mlflow.

This will create a lot of parameters entry but is likely the "right way" (from the mlflow point of view) to do things.

Avoid logging with an extra file

I dont even want to log it, but I am forced to

It is True that by design kedro-mlflow forces you to log all parameters. If you don't want to log this dictionnary, I guess you think that it will almost never change? In this case, you can avoid logging by converting it to a yaml file and load it directly in the catalog:

# data/01_raw/your_dict_parameter.yaml

 key1:
    key11: a
    key12: b
key2:
    key21: 
        key 211: ca
        key 222: cb
    key22: d
your_dict: 
    type: yaml.YAMLDataSet
    filepath: data/01_raw/your_dict_parameter.yaml

Then replace params:your_dict by your_dict everywhere in your pipeline.

The rest of your pipeline will be unmodified and the dict will not be logged.

@Galileo-Galilei
Copy link
Owner

I close the issue since it is not a bug but a mlflow design. there are workarounds documented above, and i'll enable logging such dicts as artifacts in #442

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: ✅ Done
Development

No branches or pull requests

2 participants