Consolidating existing stores? #22

Open
jbusecke opened this issue Aug 28, 2023 · 8 comments
Comments

@jbusecke
Collaborator

Since we are currently not performing consolidation (waiting for pangeo-forge/pangeo-forge-recipes#575), we have two options for the future:

  • Go through all the stores manually and consolidate them.
  • Reprocess them one more time once the above PR is done (this might be cool once we move the target location to the old Google bucket).
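
To make the first option concrete, here is a minimal sketch of what consolidating one store by hand amounts to. It assumes "consolidation" here means Zarr metadata consolidation, and that the store is a Zarr v2 store on a local filesystem; the function name `consolidate_v2_directory_store` is made up for illustration and is not part of any library.

```python
import json
import os
from pathlib import Path

# Zarr v2 metadata documents that consolidation gathers into a single file.
META_NAMES = {".zgroup", ".zarray", ".zattrs"}


def consolidate_v2_directory_store(store_dir: str) -> dict:
    """Write a `.zmetadata` file for a local Zarr v2 directory store.

    Hypothetical helper for illustration only.
    """
    root = Path(store_dir)
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name in META_NAMES:
                full = Path(dirpath) / name
                # Keys are store-relative paths with forward slashes.
                key = full.relative_to(root).as_posix()
                found[key] = json.loads(full.read_text())
    consolidated = {
        "zarr_consolidated_format": 1,
        "metadata": {key: found[key] for key in sorted(found)},
    }
    (root / ".zmetadata").write_text(json.dumps(consolidated, indent=4))
    return consolidated
```

In practice `zarr.consolidate_metadata(store)` should do this for you and also handles remote stores, so the sketch is mainly useful for seeing what ends up in `.zmetadata`.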

@cisaacstern
Contributor

I think there's a third option of automated use of linked PR against existing stores?

Where the input PCollection is just the zarr.storage.FSStore of the existing store.

@cisaacstern
Contributor

Also pangeo-forge/pangeo-forge-recipes#556 falls into this category.

@cisaacstern
Contributor

> I think there's a third option of automated use of linked PR against existing stores?
>
> Where the input PCollection is just the zarr.storage.FSStore of the existing store.

I think this is what I will do for the ClimSim mlo data (leap-stc/ClimSim#38 (comment)), which is slow to load without pangeo-forge/pangeo-forge-recipes#556.

@jbusecke
Collaborator Author

I am afraid I do not quite understand what that third option is?

@cisaacstern
Contributor

> I am afraid I do not quite understand what that third option is?

Run a pipeline like this on Dataflow:

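import apache_beam as beam
import zarr
# Transform name as written in this thread (see pangeo-forge/pangeo-forge-recipes#556).
from pangeo_forge_recipes.transforms import ConsolidateCoordinateDimensions

# Placeholder helper: look up the paths of the stores that already exist.
existing_paths: list[str] = get_existing_paths_from_bigquery(...)

def path_to_fsstore(path: str) -> zarr.storage.FSStore:
    # Open the existing store at `path`; the setup is elided here.
    ...
    return store

# Feed each existing store into the consolidation stage.
recipe = (
    beam.Create(existing_paths)
    | beam.Map(path_to_fsstore)
    | ConsolidateCoordinateDimensions()
)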

@jbusecke
Collaborator Author

Ahhhhh, yes, that makes sense. I could do that once retroactively for the existing stores, and then add such a stage to new recipes.

@jbusecke
Collaborator Author

Still relevant. I am copying the successful ingestions over to the public buckets and cataloging them in leap-pangeo.cmip6_pgf_ingestion.leap_legacy. We could probably run a script over this and consolidate the coordinates afterwards.

@jbusecke
Collaborator Author

Just going through old issues. I think this might actually be addressed by our current QC (i.e. unconsolidated stores are not passing the tests?), but I would need to check that.
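
For reference, a check of this kind could look roughly like the sketch below. It is not taken from the actual QC code; it assumes a Zarr v2 store on a local filesystem and only tests for consolidated metadata, not for consolidated coordinates. The function name `has_consolidated_metadata` is hypothetical.

```python
import json
from pathlib import Path


def has_consolidated_metadata(store_dir: str) -> bool:
    """Return True if a local Zarr v2 directory store has a usable `.zmetadata`.

    Hypothetical helper for illustration only.
    """
    meta_path = Path(store_dir) / ".zmetadata"
    if not meta_path.is_file():
        return False
    try:
        doc = json.loads(meta_path.read_text())
    except json.JSONDecodeError:
        # A corrupt file is treated the same as a missing one.
        return False
    return (
        isinstance(doc, dict)
        and doc.get("zarr_consolidated_format") == 1
        and "metadata" in doc
    )
```

A QC test could then assert `has_consolidated_metadata(path)` for each store it inspects; for stores in a bucket, the same lookup would have to go through the store's key-value interface instead of `pathlib`.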
