Commit
more work on gcr-aware notebook
JoanneBogart committed Nov 8, 2024
1 parent a8c3beb commit de415af
Showing 1 changed file with 98 additions and 8 deletions.
106 changes: 98 additions & 8 deletions docs/source/tutorial_notebooks/query_gcr_datasets.ipynb
@@ -29,12 +29,12 @@
},
{
"cell_type": "markdown",
"id": "a4eac203-202b-47fd-b896-a37f54fc6a57",
"id": "40bc9de8-4451-412b-b5dd-025cd5254dc1",
"metadata": {},
"source": [
"## 1) Using the usual GCRCatalogs interface\n",
"\n",
"Note that, using this method, we will not be calling any data registry services directly, but the data registry database still must be accessible. That means you must have gone through at least part of the tutorial setup referred to above, in particular the steps of creating a couple small files needed for authentication. See details [here](http://lsstdesc.org/dataregistry/installation.html#one-time-setup)."
"Note that, using this method, we will not be calling any data registry services directly, but the data registry database still must be accessible. That means you must have gone through at least part of the tutorial setup referred to above, in particular the steps for creating a couple small files needed for authentication. See details [here](http://lsstdesc.org/dataregistry/installation.html#one-time-setup)."
]
},
{
@@ -151,14 +151,14 @@
},
{
"cell_type": "markdown",
"id": "e06e4d6f-de30-4c34-94bf-0756d3b7c017",
"id": "3d3209c8-8132-421e-9ea2-e7e17bca9f7a",
"metadata": {},
"source": [
"### Dataset attributes\n",
"### Dataset properties\n",
"\n",
"Recall that a `DataRegistry` instance has a member `Query` which provides all the query services.\n",
"Recall that a `DataRegistry` instance has a member `Query` which provides all the query services, the principal one being the ability to ask for values of attributes of datasets, subject to one or more filters. If you haven't already, we recommend you take a look at the tutorial \"Getting started: Part 3 - Simple queries\" before proceeding further.\n",
"\n",
"As described in \"Getting started: Part 3 - Simple queries\" you can ask for values of attributes of datasets, subject to one or more filters. You can find out what those attributes (\"columns\" in database parlance) are with one of those services:"
"You can find out what the dataset properties (\"columns\" in database parlance) are with another of the `Query` services: "
]
},
{
@@ -197,10 +197,12 @@
},
{
"cell_type": "raw",
"id": "c3baa2fd-d0fa-4acc-96be-e030cb63cf47",
"id": "47b0d238-0fb6-436d-a905-5b39e9831903",
"metadata": {},
"source": [
"Among the more interesting for our purposes are `name`, `relative_path`, `access_api` and `access_api_configuration`."
"Among the more interesting for our purposes are `name`, `relative_path`, `access_api`, `access_api_configuration` and `location_type`. In the case of catalogs registered with GCRCatalogs, `name` in the data registry is the same name GCRCatalogs uses to refer to it: the basename of the corresponding config file, not including the suffix `.yaml`. But keep in mind that, unlike GCRCatalogs, the data registry always respects case in names.\n",
"\n",
"Let's look at those properties for the dataset `cosmoDC2_v1.1.4`."
]
},
{
@@ -209,6 +211,94 @@
"id": "bdbe8537-6195-4239-bbb8-976daacdfab7",
"metadata": {},
"outputs": [],
"source": [
"# Define a filter on the \"name\" property and make query\n",
"from dataregistry.query import Filter\n",
"catname = 'cosmoDC2_v1.1.4'\n",
"filters = [Filter('dataset.name', '==', catname)]\n",
"property_names = ['dataset.name', 'dataset.relative_path', 'dataset.access_api', \n",
" 'dataset.access_api_configuration', 'dataset.location_type']\n",
"result = datareg.Query.find_datasets(property_names=property_names,\n",
" filters=filters)\n",
"# By default the return type is a dict\n",
"for k, v in result.items():\n",
" print(f'Key {k} has value \\n{v[0]}\\n')\n"
]
},
{
"cell_type": "markdown",
"id": "90f5fa67-60c8-4567-a959-94d3ccbc764d",
"metadata": {},
"source": [
"At NERSC (currently the only place this code can be run) the value for `relative_path` is relative to the DESC NERSC production shared area, `/global/cfs/cdirs/lsst/shared`, just like the path names used in GCRCatalogs. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63a7294a-4872-4e42-a956-c603e218c849",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
"abs_path = os.path.join('/global/cfs/cdirs/lsst/shared',\n",
" result['dataset.relative_path'][0])\n",
"abs_path"
]
},
{
"cell_type": "markdown",
"id": "22e96998-4862-466d-9b82-1e97a9f2777d",
"metadata": {},
"source": [
"The value \"GCRCatalogs\" for the property `dataset.access_api` is a clue that this\n",
"dataset may be read and interpreted using GCRCatalogs.\n",
"\n",
"The value for `dataset.access_api_configuration` should look familiar. It's just the contents of this catalog's config file. And the value for the location type, \"dataregistry\", just tells us this is a normal catalog whose data files are kept in the area managed by the data registry."
]
},
{
"cell_type": "markdown",
"id": "de9f1a61-22d4-436b-a428-bfda5bade37f",
"metadata": {},
"source": [
"Let's try this for another catalog. We'll just change the name and make the same query."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "34362bb2-83b5-403c-8a82-2de367affafa",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"catname = 'cosmoDC2_v1.1.4_small'\n",
"filters = [Filter('dataset.name', '==', catname)]\n",
"property_names = ['dataset.name', 'dataset.relative_path', 'dataset.access_api', \n",
" 'dataset.access_api_configuration', 'dataset.location_type']\n",
"result = datareg.Query.find_datasets(property_names=property_names,\n",
" filters=filters)\n",
"# By default the return type is a dict\n",
"for k, v in result.items():\n",
" print(f'Key {k} has value \\n{v[0]}\\n')"
]
},
{
"cell_type": "markdown",
"id": "bca5969a-e267-427a-8b15-e88b6556d9a8",
"metadata": {},
"source": [
"It all looks pretty much as you would expect, except what happened to `dataset.relative_path`? That doesn't look like a path. You can see the reason in the catalog's configuration: it's based on something else. The data registry makes no attempt to sort this out, as GCRCatalogs would. The same thing would happen for a composite catalog: the data registry just stores the catalog's configuration; it doesn't know how to parse it. You can also see this in the value for `dataset.location_type`. \"metadata_only\" means that the data registry is only storing metadata for the catalog; it is not attempting to manage the associated files."
]
},
{
"cell_type": "markdown",
"id": "5721858e-8e42-4285-9ef0-ead3d780e918",
"metadata": {},
"source": []
}
],
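
The last markdown cell notes that `dataset.relative_path` is only a real path when `dataset.location_type` is "dataregistry". A minimal sketch of guarding the path join on that property might look like the following. It assumes the query result is a dict mapping property names to lists of values, as the notebook's loops suggest. The helper name `resolve_dataset_path` and the sample dicts (including their `relative_path` values) are hypothetical, invented for illustration, and are not part of `dataregistry` or GCRCatalogs.

```python
import os

# Root of the DESC NERSC production shared area, as given in the notebook text.
SHARED_ROOT = '/global/cfs/cdirs/lsst/shared'


def resolve_dataset_path(result, index=0, root=SHARED_ROOT):
    """Return an absolute path for one row of a query result, or None.

    Hypothetical helper: `result` is assumed to be a dict of
    property name -> list of values, one entry per matching dataset.
    """
    location_type = result['dataset.location_type'][index]
    if location_type != 'dataregistry':
        # For example "metadata_only": the registry stores only metadata,
        # so relative_path is not a usable path.
        return None
    return os.path.join(root, result['dataset.relative_path'][index])


# Stand-ins for query results; every value below is made up for illustration.
managed = {
    'dataset.name': ['example_managed'],
    'dataset.relative_path': ['example/dir/catalog'],
    'dataset.location_type': ['dataregistry'],
}
metadata_only = {
    'dataset.name': ['example_derived'],
    'dataset.relative_path': ['not-a-real-path'],
    'dataset.location_type': ['metadata_only'],
}

print(resolve_dataset_path(managed))
print(resolve_dataset_path(metadata_only))
```

Returning `None` rather than raising keeps the helper usable in a loop over mixed results. A caller that needs the files behind a "metadata_only" entry would still have to interpret `dataset.access_api_configuration`, which the data registry does not parse.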
