Deploying to gh-pages from @ 567000a 🚀
pre-commit-ci[bot] committed Jan 24, 2025
1 parent 072ee05 commit 92dd175
Showing 120 changed files with 42,218 additions and 0 deletions.
4 changes: 4 additions & 0 deletions _preview/198/.buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: c4dbc1ef0c095e785423c388e4448957
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added _preview/198/_images/LEAP_knowledge_graph.png
Binary file added _preview/198/_images/email_org_invite.png
Binary file added _preview/198/_images/gh_org_invite_1.png
Binary file added _preview/198/_images/gh_org_invite_2.png
Binary file added _preview/198/_images/julius.jpg
Binary file added _preview/198/_images/raphael.png
Binary file added _preview/198/_images/sammy.png
Binary file added _preview/198/_images/vm_access_delete.png
Binary file added _preview/198/_images/vm_access_project.png

@@ -0,0 +1,7 @@
:root {
--tabs-color-label-active: hsla(231, 99%, 66%, 1);
--tabs-color-label-inactive: rgba(178, 206, 245, 0.62);
--tabs-color-overline: rgb(207, 236, 238);
--tabs-color-underline: rgb(207, 236, 238);
--tabs-size-label: 1rem;
}
196 changes: 196 additions & 0 deletions _preview/198/_sources/explanation/architecture.md
@@ -0,0 +1,196 @@
(explanation.architecture)=

# LEAP-Pangeo Architecture

LEAP-Pangeo is a cloud-based data and computing platform that will be used to support research, education, and knowledge transfer within the LEAP program.

## Design Principles

In the proposal, we committed to building this platform in a way that enables its tools and infrastructure to be reused and remixed.
The challenge for LEAP-Pangeo is to deploy an “enterprise quality” platform built entirely out of open-source tools and to make it as reusable and useful for the broader climate science community as possible.
We committed to the following design principles:

- Open source
- Modular system: built out of smaller, standalone pieces which interoperate through clearly documented interfaces / standards
- Agile development on GitHub
- Following industry-standard best practices for continuous deployment, testing, etc.
- Reuse of existing technologies and contribution to "upstream" open source projects on which LEAP-Pangeo depends
(rather than developing new tools just for the sake of it).
This is a key part of our sustainability plan.

## Design and Architecture

```{figure} https://i.imgur.com/PVhoQUu.png
---
name: architecture-diagram
---
LEAP-Pangeo high-level architecture diagram
```

There are four primary components to LEAP-Pangeo.

(explanation.architecture.data-library)=

### The Data Library

The data library will provide analysis-ready, cloud-optimized data for all aspects of LEAP.
The data library is directly inspired by the [IRI Data Library](https://iridl.ldeo.columbia.edu) described below; however, LEAP-Pangeo data will be hosted in the cloud for maximum impact, accessibility, and interoperability.

The contents of the data library will evolve dynamically based on the needs of the project.
Examples of data that may become part of the library are:

- NOAA [OISST](https://www.ncei.noaa.gov/products/optimum-interpolation-sst) sea-surface temperature data,
used in workshops and classes to illustrate the fundamentals of geospatial data science.
- High-resolution climate model simulations from the [NCAR "EarthWorks"](https://news.ucar.edu/132760/csu-ncar-develop-high-res-global-model-community-use)
project, used by LEAP researchers to develop machine-learning parameterizations of climate processes like cloud and ocean eddies.
- Machine-learning "challenge datasets," published by the LEAP team and accessible to the world, to help broaden participation
by ML researchers in climate science.
- Easily accessible syntheses of climate projections from [CMIP6 data](https://esgf-node.llnl.gov/projects/cmip6/), produced by the LEAP team,
for use by industry partners for business strategy and decision making.

(explanation.architecture.catalog)=

#### Data Catalog

A [STAC](https://stacspec.org/) data catalog will be used to enumerate all LEAP-Pangeo datasets and provide this information to the public.
The catalog will store all relevant metadata about LEAP datasets following established metadata standards (e.g. CF Conventions).
It will also provide direct links to raw data in cloud object storage.

The catalog will facilitate several different modes of access:

- Searching, crawling, and opening datasets from within notebooks or scripts
- "Crawling" by search indexes or other machine-to-machine interfaces
- A pretty web front-end interface for interactive public browsing

The [Radiant Earth MLHub](https://mlhub.earth/) is a great reference for how we imagine the LEAP data catalog will eventually look.
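
As an illustration of the first access mode, the sketch below searches a STAC API and opens one of the resulting datasets with Xarray, using [pystac-client](https://pystac-client.readthedocs.io/). The endpoint URL, collection name, and asset key are hypothetical placeholders rather than the actual LEAP catalog.

```python
# A minimal sketch of notebook-based catalog access; the endpoint URL,
# collection id, and asset key are hypothetical placeholders.
import pystac_client
import xarray as xr

catalog = pystac_client.Client.open("https://stac.leap.example.org")  # placeholder URL

# Search a (hypothetical) sea-surface-temperature collection for one year of data.
search = catalog.search(collections=["noaa-oisst"], datetime="2020-01-01/2020-12-31")

for item in search.items():
    # Each STAC item links to ARCO assets (e.g. a Zarr store) in object storage.
    zarr_href = item.assets["zarr"].href
    ds = xr.open_zarr(zarr_href, chunks={})
    print(item.id, list(ds.data_vars))
```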

### Data Storage Service

The underlying technology for the LEAP data catalog will be cloud object storage (e.g. Amazon S3),
which enables high-throughput concurrent access by many simultaneous users over the public internet.
Cloud object storage is the most performant, cost-effective, and straightforward way to serve such large volumes of data.

Initially, the LEAP data will be stored in Google Cloud Storage, in the same cloud region
as the JupyterHub.
Going forward, we will work with NCAR to obtain an [Open Storage Network](https://www.openstoragenetwork.org/)
pod, which will allow data to be accessible from both Google Cloud and NCAR's computing systems.
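
To make this concrete, here is a minimal sketch of opening a dataset directly from object storage in a notebook; the bucket and store names are hypothetical placeholders, and anonymous public read access is assumed.

```python
# Open a (hypothetical) public Zarr store in Google Cloud Storage over the
# network; nothing is downloaded up front and reads are lazy.
# Requires gcsfs for the gs:// protocol.
import xarray as xr

ds = xr.open_zarr(
    "gs://leap-example-data/oisst.zarr",    # placeholder bucket/path
    storage_options={"token": "anon"},      # anonymous read access
)
print(ds)
```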

#### Pangeo Forge

```{figure} https://raw.githubusercontent.com/pangeo-forge/flow-charts/main/renders/architecture.png
---
width: 600px
name: pangeo-forge-flow
---
Pangeo Forge high-level workflow. Diagram from https://github.com/pangeo-forge/flow-charts
```

A central tool for the population and maintenance of the LEAP-Pangeo data catalog is
[Pangeo Forge](https://pangeo-forge.readthedocs.io/en/latest/).
Pangeo Forge is an open source tool for data Extraction, Transformation, and Loading (ETL).
The goal of Pangeo Forge is to make it easy to extract data from traditional data repositories and deposit it in cloud object storage in analysis-ready, cloud-optimized (ARCO) formats.

Pangeo Forge works by allowing domain scientists to define "recipes" that describe data transformation pipelines.
These recipes are stored in GitHub repositories.
Continuous integration monitors GitHub and automatically executes the data pipelines when needed.
The use of distributed, cloud-based processing allows very large volumes of data to be processed quickly.
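
To give a flavor of what a recipe looks like, here is a schematic sketch loosely based on the pangeo-forge-recipes `FilePattern` / `XarrayZarrRecipe` interface; the source URL template and chunking are invented for illustration, and the exact API has evolved across Pangeo Forge releases.

```python
# A schematic recipe sketch, loosely following the pangeo-forge-recipes
# FilePattern / XarrayZarrRecipe interface (the exact API has evolved).
# The source URL template and chunking are illustrative only.
import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

dates = pd.date_range("2020-01-01", "2020-12-31", freq="D")

def make_url(time):
    # One NetCDF file per day at a hypothetical upstream archive.
    return f"https://data.example.org/sst/{time:%Y%m%d}.nc"

pattern = FilePattern(make_url, ConcatDim("time", dates, nitems_per_file=1))

# Combine the daily files along "time" and write a single chunked Zarr store.
recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 30})
```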

Pangeo Forge is a new project, funded by the NSF EarthCube program.
LEAP-Pangeo will provide a high-impact use case for Pangeo Forge, and Pangeo Forge
will empower and enhance LEAP research.
This synergistic relationship will be mutually beneficial to the two NSF-sponsored projects.
Using Pangeo Forge effectively will require LEAP scientists and data engineers to engage
with the open-source development process around Pangeo Forge and related technologies.

### The Hub

```{figure} https://jupyter.org/assets/homepage/labpreview.webp
---
width: 400px
name: jupyterlab-preview
---
Screenshot from JupyterLab. From <https://jupyter.org/>
```

Jupyter Notebook / Lab has emerged as the standard tool for doing interactive data science.
Jupyter supports combining rich text, code, and generated outputs (e.g. figures) into a single document, creating a way to communicate and share complete data-science research projects.

```{figure} https://jupyterhub.readthedocs.io/en/stable/_images/jhub-fluxogram.jpeg
---
width: 400px
name: jupyterhub-architecture
---
JupyterHub architecture. From <https://jupyterhub.readthedocs.io/>
```

JupyterHub is a multi-user Jupyter Notebook / Lab environment that runs on a server.
JupyterHub provides a gateway to highly customized software environments backed by dedicated computing with specified resources (CPU, RAM, GPU, etc.).
Running in the cloud, JupyterHub can scale up to accommodate any number of simultaneous users with no degradation in
performance.
JupyterHub environments can support basically [every existing programming language](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels).
We anticipate that LEAP users will primarily use **Python**, **R**, and **Julia** programming languages.
In addition to Jupyter Notebook / Lab, JupyterHub also supports launching [RStudio](https://www.rstudio.com/).

The Pangeo project already provides [curated Docker images](https://github.com/pangeo-data/pangeo-docker-images)
with full-featured Python software environments for environmental data science.
These environments will be the starting point for LEAP environments.
They may be augmented as LEAP evolves with more specific software as needed by research projects.
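
As a sketch of what scalable analysis inside such an environment could look like, the snippet below assumes the hub provides Dask Gateway, as many Pangeo-style deployments do; the cluster options, data path, and variable name are placeholders and will depend on the actual LEAP configuration.

```python
# A sketch of data-proximate, scalable computation from a hub session.
# Assumes Dask Gateway is available (common on Pangeo-style hubs); the
# dataset path and variable name below are hypothetical placeholders.
import xarray as xr
from dask_gateway import Gateway

gateway = Gateway()              # connect to the hub's Dask Gateway service
cluster = gateway.new_cluster()  # request a cluster of Dask workers
cluster.scale(4)                 # ask for four workers
client = cluster.get_client()    # attach this session to the cluster

# Lazy xarray operations now run on the workers, next to the data.
ds = xr.open_zarr("gs://leap-example-data/oisst.zarr", storage_options={"token": "anon"})
sst_mean = ds["sst"].mean("time").compute()

cluster.close()
```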

User management and access control for the Hub are described in [](reference.membership).

### The Knowledge Graph

LEAP "outputs" will be of four main types:

- **Datasets** (covered above)
- **Papers** - traditional scientific publications
- **Project Code** - the code behind the papers, used to actually generate the scientific results
- **Trained ML Models** - models that can be used directly for inference by others
- **Educational Modules** - used for teaching

All of these objects must be tracked and cataloged in a uniform way.
The [](explanation.code_policy) and [](explanation.data-policy) will help set these standards.

```{figure} ../images/LEAP_knowledge_graph.png
---
width: 600px
name: knowledge-graph
---
LEAP Knowledge Graph
```

By tracking the linked relationships between datasets, papers, code, models, and educational modules, we will generate a “knowledge graph”.
This graph will reveal the dynamic, evolving state of the outputs of LEAP research and the relationships between different elements of the project.
By also tracking participants (i.e. the humans involved), we will build a novel and inspiring track record of LEAP's impact over the project lifetime.
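
Purely to illustrate the idea, the sketch below encodes a handful of linked outputs as a small directed graph with networkx; the node names and relation labels are invented and do not reflect an actual LEAP schema.

```python
# A toy illustration of the knowledge-graph idea using networkx.
# Node names and relation labels are invented for illustration only.
import networkx as nx

kg = nx.DiGraph()

# Outputs and participants (nodes), typed by the categories listed above.
kg.add_node("noaa-oisst-arco", kind="dataset")
kg.add_node("doe-etal-2025", kind="paper")
kg.add_node("github.com/leap/example-analysis", kind="project code")
kg.add_node("cloud-param-v1", kind="trained ML model")
kg.add_node("jane-doe", kind="participant")

# Linked relationships (edges).
kg.add_edge("github.com/leap/example-analysis", "noaa-oisst-arco", relation="uses")
kg.add_edge("doe-etal-2025", "github.com/leap/example-analysis", relation="describes")
kg.add_edge("github.com/leap/example-analysis", "cloud-param-v1", relation="produces")
kg.add_edge("jane-doe", "doe-etal-2025", relation="authored")

# Example query: everything upstream of (i.e. linked to) a given dataset.
print(list(nx.ancestors(kg, "noaa-oisst-arco")))
```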

This is the most open-ended aspect of our infrastructure.
Organizing and displaying this information effectively is a challenging problem in
information architecture and systems design.

## Related Tools and Platforms

It’s useful to understand the recent history and related efforts in this space.

- **[Google Colab](https://research.google.com/colaboratory/faq.html)** is a free notebook-in-the-cloud service run by Google.
It is built around the open source Jupyter project, but with advanced notebook sharing capabilities (like Google Docs).
- **[Google Earth Engine](https://earthengine.google.org/)** is a reference point for all cloud geospatial analytics platforms.
It is a standalone application, separate from Google Cloud: a single instance of a highly customized, black-box (i.e. not open source) application that enables parallel computing on distributed data.
It’s very good at what it was designed for (analyzing satellite images), but isn’t easily adapted to other applications, such as machine learning.
- **[Columbia IRI Data Library](https://iridl.ldeo.columbia.edu/index.html)** is a powerful and freely accessible online data repository and analysis tool that allows a user to view, analyze, and download hundreds of terabytes of climate-related data through a standard web browser.
Due to its somewhat outdated architecture, the IRI Data Library cannot easily be updated or adapted to new projects.
- **[Pangeo](http://pangeo.io/)** is an open science community oriented around open-source Python tools for big-data geoscience.
It is a loose ecosystem of interoperable Python packages including [Jupyter](https://jupyter.org/), [Xarray](http://xarray.pydata.org/), [Dask](http://dask.pydata.org/), and [Zarr](https://zarr.readthedocs.io/).
The Pangeo tools have been deployed in nearly all commercial clouds (AWS, GCP, Azure) as well as HPC environments.
[Pangeo Cloud](https://pangeo.io/cloud.html) is a publicly accessible data-proximate computing environment based on Pangeo tools.
Pangeo is used heavily within NCAR.
- **[Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/)** is a collection of datasets and computational tools hosted by Microsoft in the Azure cloud.
It combines Pangeo-style computing environments with a data library based on the [SpatioTemporal Asset Catalog (STAC)](https://stacspec.org/).
- **[Radiant Earth ML Hub](https://www.radiant.earth/mlhub/)** is a cloud-based open library dedicated to Earth observation training data for use with machine learning algorithms.
It focuses mostly on data access and curation.
Data are cataloged using STAC.
- **[Pangeo Forge](https://pangeo-forge.org/)** is a new initiative, funded by the NSF EarthCube program, to build a platform for
"crowdsourcing" the production of analysis-ready, cloud-optimized data.
Once operational, Pangeo Forge will be a useful tool for many different projects which need data in the cloud.

Of these different tools, we opt to build on Pangeo because of its open-source, grassroots
foundations in the climate data science community, strong uptake within NCAR, and track record of support from NSF.
10 changes: 10 additions & 0 deletions _preview/198/_sources/explanation/code_policy.md
@@ -0,0 +1,10 @@
(explanation.code_policy)=

# LEAP-Pangeo Code Policy

(explanation.code-policy.dont-let-perfect-be-the-enemy-of-good)=

## Enable Science now, but keep evolving.

"Don't let perfect be the enemy of good"
🚧
51 changes: 51 additions & 0 deletions _preview/198/_sources/explanation/data_policy.md
@@ -0,0 +1,51 @@
---
abbreviations:
ARCO: Analysis-Ready Cloud-Optimized
---

(explanation.data-policy)=

# LEAP-Pangeo Data Policy

(explanation.data-policies.access)=

## Data Access

🚧

(explanation.data-policy.reproducibility)=

## Reproducibility

🚧

(explanation.data-policy.types)=

## Types of Data Used at LEAP

Within the LEAP project we distinguish between several types of data, based mostly on whether the data was used or produced at LEAP and whether it is already accessible in {abbr}`ARCO` formats in the cloud.

:::{admonition} LEAP produced
:class: dropdown
Data that has been created or modified by LEAP researchers. We currently do not provide any way of ensuring that data is archived, and users should never rely on LEAP-Pangeo resources as the only copy of valuable data (see also [](guides.data.ingestion)).
:::

:::{admonition} LEAP ingested
:class: dropdown
Data that is already publicly available but has been ingested into cloud storage in {abbr}`ARCO` formats. The actual data and metadata have not been modified from the original.
:::

:::{admonition} LEAP curated
:class: dropdown
Data that is already available in {abbr}`ARCO` formats in publicly accessible object storage. Adding this data to the LEAP-Pangeo Catalog enables us to visualize it with the Data Viewer and to collect all datasets of importance in a single location, but none of the data itself is modified.
:::

## Roles

Many different people at LEAP interact with data in various ways. Here is a list of typical roles (some people have multiple roles):

(explanation.data-policy.roles.data-expert)=

### Data Expert

🚧