Create and Manage Plan to Restore Lost EarthCODE Data #55

Open
GarinSmith opened this issue Jan 7, 2025 · 19 comments

@GarinSmith

GarinSmith commented Jan 7, 2025

EarthCODE Data Restore Plan

Data (S3 Object Store) - Ewelina to Lead
Assets and Catalogs

  1. What is backed up?
    Backups are stored on a local drive, a VM, and external sources. We are hopeful that we have most of the lost data.
  2. What is the priority?
    Probably references from any external sources
  3. What is not backed up?
    We will confirm this as part of 1)

MetaData - Garin to Lead
GitHub/EarthCODE Catalogue

  1. Confirm no metadata is lost and we can re-use this? Yes - Done (but we need to change the path)
    There is currently no reason to suspect this is an issue.
  2. This does assume that we use the same data location at CloudFerro? Yes - Done (but we need to change the path)
    We have asked CloudFerro if they can provide the same S3 instance location. They cannot do this
  3. EOX have indicated that we can do global replace on GitHub to change
    https://s3.waw2-1.cloudferro.com/swift/v1/AUTH_3f7e5dd853f54cebb046a29a69f1bba6/OSCAssets
    to
    https://s3.waw4-1.cloudferro.com/swift/v1/EarthCODE/OSCAssets
    and
    https://s3.waw2-1.cloudferro.com/swift/v1/AUTH_3f7e5dd853f54cebb046a29a69f1bba6/Catalogs
    to
    https://s3.waw4-1.cloudferro.com/swift/v1/EarthCODE/Catalogs

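E.g. https://s3.waw2-1.cloudferro.com/swift/v1/AUTH_3f7e5dd853f54cebb046a29a69f1bba6/OSCAssets/seasfire/seasfire-cube/SeasFireCube_v3.zarr would become https://s3.waw4-1.cloudferro.com/swift/v1/EarthCODE/OSCAssets/seasfire/seasfire-cube/SeasFireCube_v3.zarr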
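
The prefix change in point 3 can be sketched as a small Python function. This is only an illustration of the mapping listed above, not the replacement EOX proposed to run on GitHub; the names `migrate_url`, `OLD_BASE`, `NEW_BASE` and `FOLDERS` are hypothetical.

```python
# Old and new prefixes are copied from point 3 of the plan.
OLD_BASE = "https://s3.waw2-1.cloudferro.com/swift/v1/AUTH_3f7e5dd853f54cebb046a29a69f1bba6"
NEW_BASE = "https://s3.waw4-1.cloudferro.com/swift/v1/EarthCODE"
FOLDERS = ("OSCAssets", "Catalogs")


def migrate_url(url: str) -> str:
    """Return url with the old bucket prefix swapped for the new one.

    URLs that do not start with one of the old prefixes are returned unchanged.
    """
    for folder in FOLDERS:
        old_prefix = f"{OLD_BASE}/{folder}"
        # Match the folder itself or anything below it, but not e.g. "OSCAssets2".
        if url == old_prefix or url.startswith(old_prefix + "/"):
            return f"{NEW_BASE}/{folder}" + url[len(old_prefix):]
    return url


if __name__ == "__main__":
    old = OLD_BASE + "/OSCAssets/seasfire/seasfire-cube/SeasFireCube_v3.zarr"
    print(migrate_url(old))
```

Restricting the rewrite to the two known folders means any other reference to the old endpoint is left alone and can be reviewed by hand.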

Scripts/Process - Ewelina to Lead

  1. Do we have scripts to move data to S3 or is this done manually?
    Yes, we have scripts with some manual effort
  2. Confirm that we can just move data again without changing the existing metadata?
    Yes we think so if CloudFerro can help above.
  3. Assume that PRR script will be used later?
    Yes, we have suggested some PRR scenarios to support this.

Environment - Garin to Lead

  1. Can we use CloudFerro? - Yes (Done)
    Assumes yes subject to clarification of operational procedures.
  2. Can we use PRR? - Yes as an environment (Done)
    Yes in parallel to S3. We have more info from Salvatore.
    One possible bonus is that we can deploy the above products to PRR when the new script is ready.
  3. CloudFerro alternate S3. Not required (Done)
    Not currently planned unless there are problems with 1)

Operational Stability - Garin to Lead

  1. Review/Confirm CloudFerro Operational Procedures - Done (we know that there is no backup service)
    See MetaData point 2)
  2. ESA PRR
    When is PRR prototype app package available?
    When is PRR production environment available?
    What is the PRR SLA?
@GarinSmith

Hi @edobrowolska,
I assigned this task to both of us, because it seemed easier and more flexible.
I used the core plan we worked on together earlier and I suggested the tasks that we each lead on.
I will arrange a catch-up on Thursday.

@edobrowolska

I created a simple Excel file with the datasets to be restored (attached here). Column F indicates the priority of the data to be restored. In two cases we are missing a backup of the data itself; I will contact the data providers for an update on access to those assets. In the next step, the catalog.json collection files will need to be restored or re-created. I will check this next. Missing-data-list.xlsx

@GarinSmith

Hi @edobrowolska,
I have used the new ESA PRR API to add an asset to the PRR and register that asset in the PRR, so that it can be discovered in the PRR catalogue. To use the PRR we need:

i) A unique ID for a product asset like
https://s3.waw2-1.cloudferro.com/swift/v1/AUTH_3f7e5dd853f54cebb046a29a69f1bba6/OSCAssets/seasfire/seasfire-cube/SeasFireCube_v3.zarr ?
or
https://s3.waw2-1.cloudferro.com/swift/v1/AUTH_3f7e5dd853f54cebb046a29a69f1bba6/OSCAssets/Hydrocoastal/L2E/cs2_full/amur/HCA_L2E_CS_OFFL_SIR1SAR_FR_20191231T122515_20191231T122518_D001.nc ?

ii) A unique ID for a product collection like
https://s3.waw2-1.cloudferro.com/swift/v1/AUTH_3f7e5dd853f54cebb046a29a69f1bba6/Catalogs/seasfire/seasfire-cube/catalog.json

iii) To understand how we will use the references in say the catalog.json above.
We can discuss i) and ii) later, but I am wondering if you have an example catalog.json please so that I can look at the structure it uses and the way it references assets?
I am wondering if references are hardcoded or relative so that it might be re-used without change by the PRR.
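
One way to answer the hardcoded-versus-relative question for any given catalog.json or item.json is to list its hrefs and check which ones carry a URL scheme. The following is a minimal sketch assuming the usual STAC layout of a `links` list and an optional `assets` mapping; `classify_hrefs` is a hypothetical helper and the sample document is made up for illustration.

```python
from urllib.parse import urlparse


def classify_hrefs(stac_object: dict) -> dict:
    """Split the hrefs found in "links" and "assets" into absolute and relative."""
    hrefs = [link.get("href", "") for link in stac_object.get("links", [])]
    hrefs += [asset.get("href", "") for asset in stac_object.get("assets", {}).values()]

    result = {"absolute": [], "relative": []}
    for href in hrefs:
        # An href with a scheme (https, s3, ...) is tied to one storage location.
        kind = "absolute" if urlparse(href).scheme else "relative"
        result[kind].append(href)
    return result


if __name__ == "__main__":
    sample = {
        "links": [
            {"rel": "self", "href": "https://example.org/catalog.json"},
            {"rel": "root", "href": "./catalog.json"},
        ],
        "assets": {"data": {"href": "s3://bucket/file.nc"}},
    }
    print(classify_hrefs(sample))
```

Any href reported as absolute would need rewriting if the file is moved to another store; relative ones should survive a move unchanged.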

@edobrowolska

edobrowolska commented Jan 17, 2025

Hi @GarinSmith,
Thanks for taking action on that. Regarding point i), I am not sure I follow. Those IDs are the reference links that point to the assets themselves, and in this case they are two separate and different products, so both are correct. The structure of one differs from the other, but this should be maintained as it is.
ii) This ID (or rather, the path to the product) points to the catalog.json that describes this asset. So yes, in this case it stays as it is.
iii) The catalog.json for the SeasFire cube specifically has been lost, since I did not keep a copy of it, but it can be reproduced from the .zarr data by re-creating the file according to the documentation, using the stactools package: stac datacube create-item s3://OSCAssets/seasfire/seasfire-cube/SeasFireCube_v3.zarr/ item.json --use-driver ZARR
The source of that file is here: https://zenodo.org/records/8055879
You can find instructions here: https://github.com/ESA-EarthCODE/open-science-catalog-metadata/wiki/User-Guide%E2%80%90v.1.0.0

@edobrowolska

Also, regarding the example of the catalog: a simple item.json from the Hydrology dataset collection is attached.

HCA_L2E_S3A_SR_1_SRA_A__20160417T091130_20160417T100159_20180203T044515_3029_003_121______LR1_R_NT_003.json

@edobrowolska

The catalog.json can also be recreated by using the stac add tool. The catalog.json we've been using looks like this one:
{
  "type": "Catalog",
  "id": "examples",
  "title": "Example catalog",
  "stac_version": "1.0.0",
  "description": "This catalog is a simple demonstration of an example catalog that is used to organize STAC Items",
  "links": [
    {
      "rel": "self",
      "href": "https://raw.githubusercontent.com/radiantearth/stac-spec/v1.1.0/examples/catalog.json",
      "type": "application/json"
    },
    {
      "rel": "root",
      "href": "./catalog.json",
      "type": "application/json",
      "title": "Example catalog"
    }
  ]
}

Add STAC Items to a common catalog.json by applying the 'stac add' command:
for item_file in item_files/item*.json; do stac add "$item_file" catalog.json; done
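
For illustration, the effect of linking items into a catalog can also be sketched with plain JSON editing in Python. This is not a reimplementation of `stac add` (which may do more, such as updating links inside the items); it only appends one `item` link per file. `add_item_links` is a hypothetical helper, and it assumes the item files live somewhere below the directory that holds catalog.json.

```python
import json
from pathlib import Path


def add_item_links(catalog_path: Path, item_paths: list[Path]) -> dict:
    """Append one "item" link per item file to the catalog and save it."""
    catalog = json.loads(catalog_path.read_text())
    links = catalog.setdefault("links", [])
    existing = {link.get("href") for link in links}

    for item_path in sorted(item_paths):
        # Relative href, so the link does not depend on the bucket location.
        href = "./" + item_path.relative_to(catalog_path.parent).as_posix()
        if href not in existing:  # keep the function safe to re-run
            links.append({"rel": "item", "href": href, "type": "application/geo+json"})
            existing.add(href)

    catalog_path.write_text(json.dumps(catalog, indent=2))
    return catalog
```

A call such as `add_item_links(Path("catalog.json"), list(Path("item_files").glob("*.json")))` mirrors the shell loop; running it twice does not duplicate links.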

@edobrowolska

Hi @GarinSmith, I also have another example of a catalog.json from a dataset to be restored (this one used references to externally stored assets). Find it attached.

catalog-example.json.txt

@GarinSmith

  1. CloudFerro have confirmed we should have access to create S3 Object Storage.
    I need to create this S3 Object Storage (not done yet)

  2. CloudFerro have stated we need to use a different endpoint (see s3.waw4-1)
    E.g. https://s3.waw4-1.cloudferro.com/swift/v1/AUTH_3f7e5dd853f54cebb046a29a69f1bba6/OSCAssets/seasfire/seasfire-cube/SeasFireCube_v3.zarr
    instead of
    https://s3.waw2-1.cloudferro.com/swift/v1/AUTH_3f7e5dd853f54cebb046a29a69f1bba6/OSCAssets/seasfire/seasfire-cube/SeasFireCube_v3.zarr
    They cannot forward requests from https://s3.waw4-1 to https://s3.waw2-1

Hi @edobrowolska and @Schpidi
We have a lot of GitHub references using s3.waw2-1. Do you know if there is a way to update them in bulk using find and replace, changing them all to s3.waw4-1 (once we have moved the assets to CloudFerro)?
I have seen a few examples talking about this, but maybe we have done this before?
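
One possible approach to the bulk update, sketched in Python against a local checkout of the metadata repository. This is an assumption about how it could be done, not a procedure anyone in this thread has confirmed; `replace_in_tree` is a hypothetical helper, and it only touches `.json` files. The default is a dry run that reports what would change.

```python
from pathlib import Path


def replace_in_tree(root: Path, old: str, new: str, dry_run: bool = True) -> dict:
    """Replace old with new in every .json file under root.

    Returns {file path: number of occurrences}. With dry_run=True nothing is written.
    """
    changed = {}
    for path in sorted(root.rglob("*.json")):
        text = path.read_text(encoding="utf-8")
        count = text.count(old)
        if count:
            changed[str(path)] = count
            if not dry_run:
                path.write_text(text.replace(old, new), encoding="utf-8")
    return changed


if __name__ == "__main__":
    # Dry run over the current directory: report files that mention the old host.
    for name, hits in replace_in_tree(Path("."), "s3.waw2-1", "s3.waw4-1").items():
        print(f"{hits:4d}  {name}")
```

Reviewing the dry-run report first, then re-running with `dry_run=False` and checking `git diff` before committing, keeps the change auditable.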

@GarinSmith

GarinSmith commented Jan 21, 2025

Note that I have used the ESA PRR API to deploy a test asset to the EarthCODE Test Collection - https://eoresults.esa.int/stac/collections/EARTHCODE_TEST/items
I had to tweak the PRR instructions to get it to work, but that may have been because of my environment.

@edobrowolska

Hi @GarinSmith, thanks a lot for your work! I would suggest we start by uploading the assets and creating the STAC catalog first. Updating the 25 items with the reference link is not a lot of work, even if we have to do it manually. This link is only referenced in the osc-metadata/products catalog, so it should not be a problem. I'm not aware of an automated way of doing that, but maybe @Schpidi has a solution. As I said, the first action would be to upload the datasets to the new S3 bucket; the reference link is just a final step.

@GarinSmith

Hi @edobrowolska, I agree with the approach that you suggest. Thanks.

@GarinSmith

I have created a bucket and the folders in the new CloudFerro S3 Object Store. These are
OSCAssets
Catalogs
as before

I can access a test file using
https://s3.waw4-1.cloudferro.com/swift/v1/EarthCODE/OSCAssets/garintest.tgz

@GarinSmith

I have made progress on the PRR with regard to
PRR Integration Plan
PRR technical progress update – Garin/All
PRR/Cloud Ferro Publish Epic - All
PRR/Cloud Ferro Access Epic - All
PRR ESA Operational Implementation Strategy – ESA
PRR Technical Integration to EarthCODE - Garin/All
Please see https://esait.sharepoint.com/:p:/r/sites/EarthCODE/Shared%20Documents/General/Communication/Scrum%20of%20Scrums/EarthCODE%20PRR%20Notes.pptx?d=wd832840b01704d34a0f0a05ce38964d1&csf=1&web=1&e=FTzedI

@edobrowolska

> I have created a bucket and the folders in the new CloudFerro S3 Object Store. These are OSCAssets and Catalogs, as before.
>
> I can access a test file using https://s3.waw4-1.cloudferro.com/swift/v1/EarthCODE/OSCAssets/garintest.tgz

Hi Garin,
do you think it would make sense to have a quick catch-up today to discuss this? It looks good to me, and I think we can go ahead and move the products there as soon as possible. Let me know.

@GarinSmith

Hi @edobrowolska,
Glad it looks OK.
Sounds good. Are you free after the PRR meeting (1600 CET )?
We could just carry on after if OK.

@edobrowolska

Hi Garin, unfortunately I need to stop working at 4 pm today. On Fridays we only work until 16:00. Let's catch up on Monday morning instead (before our 11 am meeting?)

@GarinSmith

I am now able to access CloudFerro remotely using S3.
The following commands now work:
s3cmd ls
s3cmd ls s3://EarthCODE
s3cmd ls s3://EarthCODE/OSCAssets/
s3cmd put garintest2.tgz s3://EarthCODE/OSCAssets/garintest2.tgz
I have sent the access key and secret key to Stephan and Ewelina.
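
The `s3://` paths used with s3cmd and the public HTTPS links seen earlier in this thread appear to follow the pattern `<endpoint>/<bucket>/<key>`. Here is a minimal Python sketch of that mapping; the pattern is inferred from the test file URL above rather than from CloudFerro documentation, `public_url` is a hypothetical helper, and whether a given object is actually readable at that URL depends on the bucket's access settings.

```python
from urllib.parse import quote

# Endpoint taken from the test file URL posted earlier in this thread.
ENDPOINT = "https://s3.waw4-1.cloudferro.com/swift/v1"


def public_url(s3_uri: str, endpoint: str = ENDPOINT) -> str:
    """Turn s3://bucket/key into endpoint/bucket/key."""
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"not an s3:// URI: {s3_uri!r}")
    bucket_and_key = s3_uri[len(prefix):]
    # quote() leaves "/" alone and percent-encodes characters such as spaces.
    return f"{endpoint}/{quote(bucket_and_key)}"


if __name__ == "__main__":
    print(public_url("s3://EarthCODE/OSCAssets/garintest2.tgz"))
```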

@edobrowolska

The data are now available in the S3 bucket (OSCAssets). Only a few datasets are missing, because their upload was not successful. We still need to put back the remaining catalog.json files for these datasets (in progress).
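
To pin down exactly which uploads failed, the expected object list could be compared with a recursive bucket listing. The sketch below assumes the listing text comes from something like `s3cmd ls --recursive s3://EarthCODE/OSCAssets/` and that each object line ends with its `s3://` URI; keys containing spaces would not be handled. The helper names and the sample listing line are illustrative only.

```python
def parse_listing(listing: str) -> set:
    """Collect the s3:// URIs from the text output of a recursive listing."""
    uris = set()
    for line in listing.splitlines():
        parts = line.split()
        # Object lines end with the URI; blank or unrelated lines are skipped.
        if parts and parts[-1].startswith("s3://"):
            uris.add(parts[-1])
    return uris


def missing_objects(expected: list, listing: str) -> list:
    """Return the expected URIs that do not appear in the listing, in order."""
    present = parse_listing(listing)
    return [uri for uri in expected if uri not in present]


if __name__ == "__main__":
    # Illustrative listing line; date and size are made up.
    sample_listing = "2025-01-24 10:15     12345  s3://EarthCODE/OSCAssets/garintest.tgz\n"
    expected = [
        "s3://EarthCODE/OSCAssets/garintest.tgz",
        "s3://EarthCODE/OSCAssets/garintest2.tgz",
    ]
    print(missing_objects(expected, sample_listing))
```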
