Hardcoded paths in MDIMs make future updates harder #3723

Open
pabloarosado opened this issue Dec 16, 2024 · 2 comments
pabloarosado commented Dec 16, 2024

Context

Currently, mdim steps require a config YAML file that contains the full paths of indicators in tables of grapher steps.

Potential problems

Possible solution

Treat the MDIM yaml as a template, where we fill in some variables at the time we ship it.

Ideally, we would avoid hardcoding paths (in steps and yaml files), and all dependencies would be specified in the DAG.

After a discussion with @lucasrodes, we thought of a possible solution. The config YAML file (for example, the one for the covid mdim step) could use a special placeholder for a dataset path, {ds:short_name} (e.g. {ds:covid_cases}), where short_name is the short name of a dataset listed as a dependency of the mdim step in the DAG.
Then the function paths.load_mdim_config would read the config YAML file and replace those placeholders with the full URIs of the corresponding datasets.
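To make the proposal concrete, here is a minimal sketch of the placeholder substitution, operating on the raw config text. The function name `fill_dataset_placeholders`, the example URIs, and the rule that a dataset's short name is the last segment of its URI are all assumptions made for illustration; this is not the actual behaviour of paths.load_mdim_config.

```python
import re

# Matches placeholders such as {ds:covid_cases}.
PLACEHOLDER = re.compile(r"\{ds:([A-Za-z0-9_]+)\}")


def fill_dataset_placeholders(config_text: str, dependencies: list[str]) -> str:
    """Replace each {ds:short_name} with the URI of the matching dependency.

    Hypothetical helper: `dependencies` stands in for the dataset URIs that
    the DAG lists for the mdim step.
    """
    # Assumption: the short name is the last path segment of the URI.
    by_short_name = {uri.rstrip("/").split("/")[-1]: uri for uri in dependencies}

    def substitute(match: re.Match) -> str:
        short_name = match.group(1)
        if short_name not in by_short_name:
            # Fail loudly if the config references a dataset missing from the DAG.
            raise KeyError(f"No dependency with short name {short_name!r}")
        return by_short_name[short_name]

    return PLACEHOLDER.sub(substitute, config_text)


# Made-up dependency and config fragment, for illustration only.
dependencies = ["data://grapher/covid/2024-12-01/covid_cases"]
config_text = "indicator: {ds:covid_cases}/cases#weekly_cases"
resolved = fill_dataset_placeholders(config_text, dependencies)
```

Raising on an unknown short name means a config that references a dataset absent from the DAG fails when the config is loaded, rather than shipping a broken path.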

Possible rabbit holes or related issues

  • Multiple dependencies of an mdim step may have the same short name, and we may also want to create an mdim that compares different versions of the same dataset. For such cases, we could define custom placeholders, e.g. {ds:custom_short_name}, and then pass a dictionary to paths.load_mdim_config mapping those custom short names to the corresponding dataset URIs.
  • We also noticed that it is inconvenient that Table does not have a URI, so we rely on Table.metadata.dataset.uri. Maybe tables should also have a URI attribute.
  • We may need an additional function in paths to get the URI of a table in a dataset. Currently, the way we'd do that is with, e.g., `paths.load_dataset("dataset_path...") + "/table_name..."`.
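A minimal sketch of how the first and third points above might look in practice: custom placeholder names take precedence, short names shared by several dependencies are treated as ambiguous, and a small helper builds a table URI from a dataset URI. The names `build_placeholder_map`, `fill_placeholders` and `table_uri`, and all URIs, are hypothetical and not part of the existing paths API.

```python
import re
from collections import Counter
from typing import Optional

# Matches placeholders such as {ds:covid_cases}.
PLACEHOLDER = re.compile(r"\{ds:([A-Za-z0-9_]+)\}")


def build_placeholder_map(
    dependencies: list[str], custom: Optional[dict[str, str]] = None
) -> dict[str, str]:
    """Map placeholder names to dataset URIs (hypothetical helper)."""
    # Assumption: the short name is the last path segment of the URI.
    short_names = [uri.rstrip("/").split("/")[-1] for uri in dependencies]
    counts = Counter(short_names)
    # Short names shared by several dependencies are ambiguous, so skip them.
    mapping = {
        name: uri
        for name, uri in zip(short_names, dependencies)
        if counts[name] == 1
    }
    # Custom names, passed explicitly by the step, take precedence.
    mapping.update(custom or {})
    return mapping


def fill_placeholders(config_text: str, mapping: dict[str, str]) -> str:
    """Replace each {ds:name} in the text using the given mapping."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in mapping:
            raise KeyError(f"Unknown or ambiguous placeholder {name!r}")
        return mapping[name]

    return PLACEHOLDER.sub(substitute, config_text)


def table_uri(dataset_uri: str, table_name: str) -> str:
    """Build a table URI by appending the table name to the dataset URI."""
    return f"{dataset_uri.rstrip('/')}/{table_name}"


# Made-up dependencies: two versions of one dataset plus an unrelated dataset.
dependencies = [
    "data://grapher/covid/2024-01-01/covid_cases",
    "data://grapher/covid/2024-12-01/covid_cases",
    "data://grapher/demography/2024-07-15/population",
]
custom = {"cases_old": dependencies[0], "cases_new": dependencies[1]}
mapping = build_placeholder_map(dependencies, custom)
config_text = "old: {ds:cases_old}\nnew: {ds:cases_new}\npop: {ds:population}"
resolved = fill_placeholders(config_text, mapping)
```

With this design, a config that uses {ds:covid_cases} while two versions of that dataset are dependencies raises an error instead of silently picking one of them.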

Impact

We're not running into this problem much yet, but we are currently setting precedents for how a large amount of work will be done, so we'd like to save ourselves future work by getting this right.

@larsyencken larsyencken changed the title Track dependencies in mdim steps Some MDIM steps hardcode paths making Jan 31, 2025
@larsyencken larsyencken changed the title Some MDIM steps hardcode paths making Some MDIM steps hardcode paths Jan 31, 2025
@larsyencken larsyencken changed the title Some MDIM steps hardcode paths Hardcoded paths in MDIMs make future updates harder Jan 31, 2025
larsyencken commented Jan 31, 2025

We discussed this in triage today.

@Marigold thought this could be good to tackle at a time when we are fixing a related issue, to make MDIMs on the ETL side...

  • build from data://grapher/... steps (filesystem -> DB)
  • rather than grapher://grapher/... steps (filesystem -> DB -> filesystem -> DB)

as they do now.

@pabloarosado

To clarify, this should not include refactoring the currently existing mdims (e.g. covid). That should happen as a separate issue.
