From 293ed18a29d6d0b23274fc4e0d8e4a9e22b9fa9c Mon Sep 17 00:00:00 2001 From: Samarth Agrawal <41808786+SammyAgrawal@users.noreply.github.com> Date: Tue, 19 Dec 2023 16:26:48 -0500 Subject: [PATCH] Layout new PR (from correct branch this time) (#113) * Fixed links in jupyterhub page * Implemented changes proposed in Issue 101 * Reorganized to separate out 4 types of doc * Minor fixes * dummy commit --- README.md | 1 + book/_toc.yml | 16 +- book/guides/hub_guides.md | 172 ++++++++++++++++++++ book/intro.md | 8 + book/leap-pangeo/architecture.md | 68 ++++---- book/leap-pangeo/jupyterhub.md | 263 +------------------------------ book/leap-pangeo/tutorial.md | 73 +++++++++ book/support.rst | 13 +- 8 files changed, 308 insertions(+), 306 deletions(-) create mode 100644 book/guides/hub_guides.md create mode 100644 book/leap-pangeo/tutorial.md diff --git a/README.md b/README.md index a5264f7e..3673eadd 100644 --- a/README.md +++ b/README.md @@ -18,3 +18,4 @@ The website is located at . + diff --git a/book/_toc.yml b/book/_toc.yml index a9c3cf2d..947a435a 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -4,22 +4,24 @@ format: jb-book root: intro parts: - - caption: Policies - chapters: - - file: policies/code_policy - - file: policies/data_policy - - file: policies/infrastructure_policy - - file: policies/users_roles - caption: LEAP-Pangeo chapters: - file: leap-pangeo/jupyterhub.md + - file: leap-pangeo/tutorial.md - file: leap-pangeo/architecture - file: leap-pangeo/implementation - - file: leap-pangeo/solutions - caption: Guides chapters: + - file: guides/hub_guides + - file: leap-pangeo/solutions - file: guides/education - file: guides/team_docs + - caption: Policies + chapters: + - file: policies/code_policy + - file: policies/data_policy + - file: policies/infrastructure_policy + - file: policies/users_roles - caption: Miscellaneous chapters: - file: support diff --git a/book/guides/hub_guides.md b/book/guides/hub_guides.md new file mode 100644 index 00000000..ca7ca7ba --- /dev/null +++ b/book/guides/hub_guides.md @@ -0,0 +1,172 @@ +# How-To Guides for Using the Hub +These are a set of guides for using the JupyterHub Compute Environment effectively. +## Compute +### Dask + +To help you scale up calculations using a cluster, the Hub is configured with Dask Gateway. +For a quick guide on how to start a Dask Cluster, consult this page from the Pangeo docs: + +- https://pangeo.io/cloud.html#dask + +## Data +### I have a dataset and want to work with it on the hub. How do I upload it? + +If you would like to add a new dataset to the LEAP Data Library, please first raise an issue [here](https://github.com/leap-stc/data-management/issues/new?assignees=&labels=dataset&template=new_dataset.yaml&title=New+Dataset+%5BDataset+Name%5D). This enables us to track detailed information about proposed datasets and have an open discussion about how to upload it to the cloud. + +We distinguish between two primary *types* of data to upload: "Original" and "Published" data. + +- **Published Data** has been published and archived in a publically accessible location (e.g. a data repository like [zenodo](https://zenodo.org) or [figshare](https://figshare.com)). We do not recommend uploading this data to the cloud directly, but instead use [Pangeo Forge](https://pangeo-forge.readthedocs.io/en/latest/) to transform and upload it to the cloud. This ensures that the data is stored in an ARCO format and can be easily accessed by other LEAP members. 
- **Original Data** is any dataset that is produced by researchers at LEAP and has not been published yet. The main use case for this data is to share it with other LEAP members and collaborate on it. For original data we support direct upload to the cloud. *Be aware that original data could change rapidly as the data producer is iterating on their code*. We encourage all datasets to be archived and published before using them in scientific publications.

##### Transform and Upload published data to an ARCO format (with Pangeo Forge)

Coming Soon

##### Upload medium sized original data from your local machine

For medium-sized datasets that can be uploaded within an hour, you can use a temporary access token generated on the JupyterHub to upload data to the cloud.

- Set up a new environment on your local machine (e.g. laptop)

```shell
mamba create --name leap_pangeo_transfer python=3.9 google-auth gcsfs jupyterlab xarray zarr dask #add any other dependencies (e.g. netcdf4) that you need to read your data
```

- Activate the environment

```shell
conda activate leap_pangeo_transfer
```

and set up a Jupyter notebook (or a pure python script) that loads your data in as few xarray datasets as possible. For instance, if you have one dataset that consists of many files split in time, you should set your notebook up to read all the files using xarray into a single dataset, and then try to write out a small part of the dataset to a zarr store.

- Now start up a [LEAP-Pangeo server](https://leap.2i2c.cloud) and open a terminal. Install the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install) using mamba

```shell
mamba install google-cloud-sdk
```
Now you can generate a temporary token (valid for 1 hour) that allows you to upload data to the cloud.

```shell
gcloud auth print-access-token
```

Copy the resulting token into a plain text file `token.txt` in a convenient location on your **local machine**.

- Now start a JupyterLab notebook on your **local machine** and paste the following code into a cell:

```python
import gcsfs
import xarray as xr
from google.oauth2.credentials import Credentials

# read the temporary access token generated on the hub
with open("path/to/your/token.txt") as f:
    access_token = f.read().strip()

# set up an authenticated filesystem using the temporary credentials
credentials = Credentials(access_token)
fs = gcsfs.GCSFileSystem(token=credentials)
```

> Make sure to replace `path/to/your/token.txt` with the actual path to your token file.

Try to write a small dataset to the cloud:

```python
ds = xr.DataArray([1]).to_dataset(name='test')
# use the authenticated filesystem from above so the temporary token is used for the upload
mapper = fs.get_mapper('gs://leap-scratch/<username>/test_offsite_upload.zarr')
ds.to_zarr(mapper)
```

> Replace `<username>` with your actual username on the hub.

- Make sure that you can read the test dataset from within the hub (go back to [Basic writing to and reading from cloud buckets](hub:data:read_write); see also the sketch at the end of this section).

- Now the last step is to paste the code to load your actual dataset into the notebook and use `.to_zarr` to upload it.

> Make sure to give the store a meaningful name, and raise an issue in the [data-management repo](https://github.com/leap-stc/data-management/issues) to get the dataset added to the LEAP Data Library.

> Make sure to use a different bucket than `leap-scratch` for the full dataset, since `leap-scratch` is deleted every 7 days! For more info refer to the available [storage buckets](hub:data:buckets).
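To confirm the upload worked, a minimal sketch of the read-back check (run from a notebook **on the Hub**, not on your laptop) could look like the following. It assumes the test store was written to `leap-scratch/<username>/test_offsite_upload.zarr` as in the example above, and that the Hub environment provides Google Cloud credentials automatically so no token is needed (see [Basic writing to and reading from cloud buckets](hub:data:read_write) if that is not the case).

```python
import gcsfs
import xarray as xr

# On the Hub the default Google credentials should already be available,
# so the filesystem can be created without an explicit token.
fs = gcsfs.GCSFileSystem()

# Replace <username> with your hub username (same path as in the upload example above).
print(fs.ls("leap-scratch/<username>"))

# Open the test store lazily and inspect its contents.
ds = xr.open_zarr("gs://leap-scratch/<username>/test_offsite_upload.zarr")
print(ds)
```

If the listing or the open fails, double-check the bucket and path you passed to `.to_zarr` before uploading your full dataset.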
(hub:data:upload_hpc)=
##### Uploading large original data from an HPC system (no browser access on the system available)

A common scenario is the following: A researcher/student has run a simulation on a High Performance Computer (HPC) at their institution, but now wants to collaboratively work on the analysis or train a machine learning model with this data. For this they need to upload it to the cloud storage.

The following steps will guide you through authenticating and uploading data to the cloud; they might have to be slightly modified depending on the actual setup of the user's HPC.

**Conversion Script/Notebook**

In most cases you do not just want to upload the data in its current form (e.g. many netcdf files).

Instead we will load the data into an [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html) and then write that Dataset object directly to a zarr store in the cloud. For this you need a Python environment with `xarray`, `gcsfs`, and `zarr` installed (you might need additional dependencies for your particular use case).

1. Spend some time to set up a Python script/Jupyter notebook on the HPC system that opens your files and combines them into one or more xarray.Datasets (combine as many files as sensible into a single dataset). Make sure that your data is lazily loaded and that `Dataset.data` is a [dask array](https://docs.dask.org/en/stable/array.html).

2. Check your dataset:
   - Check that the metadata is correct.
   - Check that all the variables/dimensions are in the dataset.
   - Check the dask chunksize. A general rule is to aim for around 100MB size, but the size and structure of chunking that is optimal depends heavily on the later use case.

3. Try to write out a subset of the data locally by calling the [`.to_zarr`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.to_zarr.html) method on the dataset.

Once that works, we can move on to authentication.

**Upload Prerequisites**

Before we are able to set up authentication, we need to make sure our HPC and a local computer (required for the browser-based step) are set up correctly.
- We manage access rights through [Google Groups](https://groups.google.com). Please contact the [](support.data_compute_team) to get added to the appropriate group (a gmail address is required for this).
- Make sure to install the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install) both in your HPC environment and on your local computer that can open a web browser (e.g. your laptop).

**Steps**

Steps executed on your "local" computer (e.g. laptop) will be colored in green and steps on your "remote" computer (e.g. HPC) in purple.

1. SSH into the HPC.
2. Check that you have an internet connection with `ping www.google.com`.
3. Request no-browser authentication:
   ```
   gcloud auth application-default login --scopes=https://www.googleapis.com/auth/devstorage.read_write,https://www.googleapis.com/auth/iam.test --no-browser
   ```
   > 🚨 It is very important to include the `--scopes=` argument for security reasons. Do not run this command without it!
4. Follow the onscreen prompt and paste the command into a terminal on your local machine.
5. This will open a browser window. Authenticate with the gmail account that was added to the Google Group.
6. Go back to the terminal and follow the onscreen instructions. Copy the text from the command line and paste the command in the open dialog on the remote machine.
7. Make sure to note the path to the auth json!
It will be something like `.../.config/gcloud/....json`. + +Now you are have everything you need to authenticate. + +Lets verify that you can write a small dummy dataset to the cloud. In your notebook/script run the following (make sure to replace the filename and your username as instructed). + +Your dataset should now be available for all LEAP members ๐ŸŽ‰๐Ÿš€ + +```python +import xarray as xr +import gcsfs +import json + +with open("your_auth_file.json") as f: #๐Ÿšจ make sure to enter the `.json` file from step 7 + token=json.load(f) + +# test write a small dummy xarray dataset to zarr +ds = xr.DataArray([1, 4, 6]).to_dataset(name='data') +# Once you have confirmed + +fs = gcsfs.GCSFileSystem(token=token) +mapper = fs.get_mapper("gs://leap-persistent//testing/demo_write_from_remote.zarr") #๐Ÿšจ enter your leap (github) username here +ds.to_zarr(mapper) +``` + +Now you can repeat the same steps but replace your dataset with the full dataset from above and leave your python code running until the upload has finished. Depending on the internet connection speed and the size of the full dataset, this can take a while. + +If you want to see a progress bar, you can wrap the call to `.to_zarr` with a [dask progress bar](https://docs.dask.org/en/stable/diagnostics-local.html#progress-bar) + +```python +from dask.diagnostics import ProgressBar +with ProgressBar(): + ds.to_zarr(mapper) +``` + +Once the data has been uploaded, make sure to erase the `.../.config/gcloud/....json` file from step 7, and ask to be removed from the Google Group. diff --git a/book/intro.md b/book/intro.md index a19a09af..705d9c07 100644 --- a/book/intro.md +++ b/book/intro.md @@ -8,6 +8,14 @@ This website is the home for all technical documentation related to LEAP and LEA | -- | -- | -- | | [![GitHub last commit](https://img.shields.io/github/last-commit/leap-stc/leap-stc.github.io)](https://github.com/leap-stc/leap-stc.github.io) | ![GitHub contributors](https://img.shields.io/github/contributors/leap-stc/leap-stc.github.io) | [![publish-book](https://github.com/leap-stc/leap-stc.github.io/actions/workflows/publish-book.yaml/badge.svg?style=flat-square)](https://github.com/leap-stc/leap-stc.github.io/actions/workflows/publish-book.yaml) | +## Motivation + +The motivation and justification for developing LEAP-Pangeo are laid out in several recent peer-reviewed publications: {cite}`AbernatheyEtAl2021` and {cite}`GentemannEtAl2021`. +To summarize these arguments, a shared data and computing platform will: +- *Facilitate seamless collaboration between project members around data-intensive science, accelerating research progress.* +- *Empower LEAP participants with instant access to high-performance computing and analysis-ready data in order to support ambitious research objectives.* This access is provided through our [JupyterHub platform](leap-pangeo/jupyterhub.md). +- *Place actionable data in the hands of LEAP partners to support knowledge transfer.* Our data catalog can be found [here](https://leap-data-catalog.vercel.app/). See [here](guides/hub_guides.md) to learn how to upload your data to the hub. 
+- *Enable rich data-driven classroom experiences for learners, helping them transition successfully from coursework to research.* ## Contents diff --git a/book/leap-pangeo/architecture.md b/book/leap-pangeo/architecture.md index 8a0194c6..1bd2ffbf 100644 --- a/book/leap-pangeo/architecture.md +++ b/book/leap-pangeo/architecture.md @@ -3,15 +3,6 @@ LEAP-Pangeo is a cloud-based data and computing platform that will be used to support research, education, and knowledge transfer within the LEAP program. -## Motivation - -The motivation and justification for developing LEAP-Pangeo are laid out in several recent peer-reviewed publications: {cite}`AbernatheyEtAl2021` and {cite}`GentemannEtAl2021`. -To summarize these arguments, a shared data and computing platform will: -- Empower LEAP participants with instant access to high-performance computing and analysis-ready data in order to support ambitious research objectives -- Facilitate seamless collaboration between project members around data-intensive science, accelerating research progress -- Enable rich data-driven classroom experiences for learners, helping them transition successfully from coursework to research -- Place actionable data in the hands of LEAP partners to support knowledge transfer - ## Design Principles In the proposal, we committed to building this in a way that enables the tools and infrastructure to be reused and remixed. @@ -25,35 +16,6 @@ We committed to following the following design principles: (rather than development of new stuff just for the sake of it). This is a key part of our sustainability plan. -## Related Tools and Platforms - - -Itโ€™s useful to understand the recent history and related efforts in this space. - -- **[Google Colab](https://research.google.com/colaboratory/faq.html)** is a free notebook-in-the-cloud service run by Google. - It is built around the open source Jupyter project, but with advanced notebook sharing capabilities (like Google Docs). -- **[Google Earth Engine](https://earthengine.google.org/)** is a reference point for all cloud geospatial analytics platforms. - Itโ€™s actually a standalone application that is separate from Google Cloud, the single instance of a highly customized, black box (i.e. not open source) application that enables parallel computing on distributed data. - Itโ€™s very good at what it was designed for (analyzing satellite images), but isnโ€™t easily adapted to other applications, such as machine learning. -- **[Columbia IRI Data Library](https://iridl.ldeo.columbia.edu/index.html)** is a powerful and freely accessible online data repository and analysis tool that allows a user to view, analyze, and download hundreds of terabytes of climate-related data through a standard web browser. - Due to its somewhat outdated architecture, IRI data library cannot easily be updated or adapted to new projects. -- **[Pangeo](http://pangeo.io/)** is an open science community oriented around open-source python tools for big-data geoscience. - It is a loose ecosystem of interoperable python packages including [Jupyter](https://jupyter.org/), [Xarray](http://xarray.pydata.org/), [Dask](http://dask.pydata.org/), and [Zarr](https://zarr.readthedocs.io/). - The Pangeo tools have been deployed in nearly all commercial clouds (AWS, GCP, Azure) as well as HPC environments. - [Pangeo Cloud](https://pangeo.io/cloud.html) is a publicly accessible data-proximate computing environment based on Pangeo tools. - Pangeo is used heavily within NCAR. 
-- **[Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/)** is a collection of datasets and computational tools hosted by Microsoft in the Azure cloud. - It combines Pangeo-style computing environments with a data library based on [SpatioTemporal Asset Catalog](https://stacspec.org/) -- **[Radiant Earth ML Hub](https://www.radiant.earth/mlhub/)** is a cloud-based open library dedicated to Earth observation training data for use with machine learning algorithms. - It focuses mostly on data access and curation. - Data are cataloged using STAC. -- **[Pangeo Forge](https://pangeo-forge.org/)** is a new initiative, funded by the NSF EarthCube program, to build a platform for - "crowdsourcing" the production of analysis-ready, cloud-optimized data. - Once operational, Pangeo Forge will be a useful tool for many different projects which need data in the cloud. - -Of these different tools, we opt to build on Pangeo because of its open-source, grassroots -foundations in the climate data science community, strong uptake within NCAR, and track-record of support from NSF. - ## Design and Architecture ```{figure} https://i.imgur.com/PVhoQUu.png @@ -198,3 +160,33 @@ By also tracking participations (i.e. humans), we will build a novel and inspiri This is the most open-ended aspect of our infrastructure. Organizing and displaying this information effectively is a challenging problem in information architecture and systems design. + + + ## Related Tools and Platforms + + +Itโ€™s useful to understand the recent history and related efforts in this space. + +- **[Google Colab](https://research.google.com/colaboratory/faq.html)** is a free notebook-in-the-cloud service run by Google. + It is built around the open source Jupyter project, but with advanced notebook sharing capabilities (like Google Docs). +- **[Google Earth Engine](https://earthengine.google.org/)** is a reference point for all cloud geospatial analytics platforms. + Itโ€™s actually a standalone application that is separate from Google Cloud, the single instance of a highly customized, black box (i.e. not open source) application that enables parallel computing on distributed data. + Itโ€™s very good at what it was designed for (analyzing satellite images), but isnโ€™t easily adapted to other applications, such as machine learning. +- **[Columbia IRI Data Library](https://iridl.ldeo.columbia.edu/index.html)** is a powerful and freely accessible online data repository and analysis tool that allows a user to view, analyze, and download hundreds of terabytes of climate-related data through a standard web browser. + Due to its somewhat outdated architecture, IRI data library cannot easily be updated or adapted to new projects. +- **[Pangeo](http://pangeo.io/)** is an open science community oriented around open-source python tools for big-data geoscience. + It is a loose ecosystem of interoperable python packages including [Jupyter](https://jupyter.org/), [Xarray](http://xarray.pydata.org/), [Dask](http://dask.pydata.org/), and [Zarr](https://zarr.readthedocs.io/). + The Pangeo tools have been deployed in nearly all commercial clouds (AWS, GCP, Azure) as well as HPC environments. + [Pangeo Cloud](https://pangeo.io/cloud.html) is a publicly accessible data-proximate computing environment based on Pangeo tools. + Pangeo is used heavily within NCAR. +- **[Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/)** is a collection of datasets and computational tools hosted by Microsoft in the Azure cloud. 
+ It combines Pangeo-style computing environments with a data library based on [SpatioTemporal Asset Catalog](https://stacspec.org/) +- **[Radiant Earth ML Hub](https://www.radiant.earth/mlhub/)** is a cloud-based open library dedicated to Earth observation training data for use with machine learning algorithms. + It focuses mostly on data access and curation. + Data are cataloged using STAC. +- **[Pangeo Forge](https://pangeo-forge.org/)** is a new initiative, funded by the NSF EarthCube program, to build a platform for + "crowdsourcing" the production of analysis-ready, cloud-optimized data. + Once operational, Pangeo Forge will be a useful tool for many different projects which need data in the cloud. + +Of these different tools, we opt to build on Pangeo because of its open-source, grassroots +foundations in the climate data science community, strong uptake within NCAR, and track-record of support from NSF. diff --git a/book/leap-pangeo/jupyterhub.md b/book/leap-pangeo/jupyterhub.md index 07b8b003..faaf4a0a 100644 --- a/book/leap-pangeo/jupyterhub.md +++ b/book/leap-pangeo/jupyterhub.md @@ -11,67 +11,9 @@ For information who can access the hub with which privileges, please refer to | **Hub Operator**| [2i2c](https://2i2c.org/) | | **Hub Configuration** | https://github.com/2i2c-org/infrastructure/tree/master/config/clusters/leap | -## Getting Started - -To get started using the hub, check out this video by [James Munroe](https://github.com/jmunroe) from [2i2c](https://2i2c.org) explaining the architecture. - - - -## Getting Help - -For questions about how to use the Hub, please use the LEAP-Pangeo discussion forum: - -- https://github.com/leap-stc/leap-stc.github.io/discussions - -### Office Hours - -We also offer in-person and virtual Office Hours on Thursdays for questions about LEAP-Pangeo. -You can reserve an appointment [here](https://app.reclaim.ai/m/leap-pangeo-office-hours). - -## Hub Usage - -This is a rough and ready guide to using the Hub. -This documentation will be expanded as we learn and evolve. -Feel free to [edit it yourself](https://github.com/leap-stc/leap-stc.github.io/blob/main/book/leap-pangeo/jupyterhub.md) if you have suggetions for improvement! - -(hub:server:login)= -### Logging In - -1. ๐Ÿ‘€ Navigate to https://leap.2i2c.cloud/ and click the big orange button that says "Log in to continue" -2. ๐Ÿ” You will be prompted to authorize a GitHub application. Say "yes" to everything. - Note you must belong to the appropriate GitHub team in order to access the hub. - See {doc}`/policies/users_roles` for more information. -3. ๐Ÿ“  You will redirect to a screen with the following options. - -image - -> Note: Depending on your [membership]() you might see additional options (e.g. for GPU machines) - -You have to make 3 choices here: -- The machine type (Choose between "CPU only" or "GPU" if available) - **โš ๏ธThe GPU images should be used only when needed to accelerate model training.** -- The software environment ("Image"). Find more info in the [Software Environment Section](hub:image) below. -- The node share. These are shared resources, and you should try to use the smallest image you need. You can easily start up a new server with a larger share if you find your work to be limited by CPU/RAM - -4. ๐Ÿ•ฅ Wait for your server to start up. It can take up to few minutes. - -### Using JupyterLab - -After your server fires up, you will be dropped into a JupyterLab environment. 
- -If you are new to JupyterLab, you might want to peruse the [user guide](https://jupyterlab.readthedocs.io/en/stable/user/interface.html). - -### Shutting Down Your Server - -Your server will shut down automatically after a period of inactivity. -However, if you know you are done working, it's best to shut it down directly. -To shut it down, go to https://leap.2i2c.cloud/hub/home and click the big red button that says "Stop My Server" - -image - -You can also navigate to this page from JupyterLab by clicking the `File` menu and going to `Hub Control Panel`. - -(hub:image)= +This document goes over the primary technical details of the JupyterHub. +- For a quick tutorial on basic usage, please see [Getting Started](tutorial.md). +- To get an in-depth overview of the LEAP Pangeo Architecture and how the JupyterHub fits into it, please see the [Architecture](architecture.md) page. ### The Software Environment The software environment you encounter on the Hub is based upon [docker images](https://www.digitalocean.com/community/tutorials/the-docker-ecosystem-an-introduction-to-common-components) which you can run on other machines (like your laptop or an HPC cluster) for better reproducibility. @@ -221,201 +163,4 @@ fs.rm('leap-persistent/funky-user/file_to_delete.nc') If you want to remove zarr stores (which are an 'exploded' data format, and thus represented by a folder structure) you have to recursively delete the store. ```python fs.rm('leap-scratch/funky-user/processed_store.zarr', recursive=True) -``` -:::{warning} -The warning from above is even more important here! Make sure that the folder you are deleting does not contain any data you do not want to delete! -::: - -#### I have a dataset and want to work with it on the hub. How do I upload it? - -If you would like to add a new dataset to the LEAP Data Library, please first raise an issue [here](https://github.com/leap-stc/data-management/issues/new?assignees=&labels=dataset&template=new_dataset.yaml&title=New+Dataset+%5BDataset+Name%5D). This enables us to track detailed information about proposed datasets and have an open discussion about how to upload it to the cloud. - -We distinguish between two primary *types* of data to upload: "Original" and "Published" data. - -- **Published Data** has been published and archived in a publically accessible location (e.g. a data repository like [zenodo](https://zenodo.org) or [figshare](https://figshare.com)). We do not recommend uploading this data to the cloud directly, but instead use [Pangeo Forge](https://pangeo-forge.readthedocs.io/en/latest/) to transform and upload it to the cloud. This ensures that the data is stored in an ARCO format and can be easily accessed by other LEAP members. -- **Original Data** is any dataset that is produced by researchers at LEAP and has not been published yet. The main use case for this data is to share it with other LEAP members and collaborate on it. For original data we support direct uploaded to the cloud. *Be aware that original data could change rapidly as the data producer is iterating on their code*. We encourage all datasets to be archived and published before using them in scientific publications. - -##### Transform and Upload published data to an ARCO format (with Pangeo Forge) - -Coming Soon - -(hub:data:upload_manual)= -##### Upload medium sized original data from your local machine - -For medium sized datasets, that can be uploaded within an hour, you can use a temporary access token generated on the JupyterHub to upload data to the cloud. 
- -- Set up a new environment on your local machine (e.g. laptop) - -```shell -mamba create --name leap_pangeo_transfer python=3.9 google-auth gcsfs jupyterlab xarray zarr dask -``` -> add any other dependencies (e.g. netcdf4) that you need to read your data - -- Activate the environment - -```shell -conda activate leap_pangeo_transfer -``` - -and set up a jupyter notbook (or a pure python script) that loads your data in as few xarray datasets as possible. For instance, if you have one dataset that consists of many files split in time, you should set your notebook up to read all the files using xarray into a single dataset, and then try to write out a small part of the dataset to a zarr store. - -- Now start up a [LEAP-Pangeo server](leap.2i2c.cloud) and open a terminal. Install the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install) using mamba - -```shell -mamba install google-cloud-sdk -``` -Now you can generate a temporary token (valid for 1 hour) that allows you to upload data to the cloud. - -```shell -gcloud auth print-access-token -``` - -Copy the resulting token into a plain text file `token.txt` in a convenient location on your **local machine**. - -- Now start a JupyterLab notebook and paste the following code into a cell: - -```python -import gcsfs -import xarray as xr -from google.cloud import storage -from google.oauth2.credentials import Credentials - -# import an access token -# - option 1: read an access token from a file -with open("path/to/your/token.txt") as f: - access_token = f.read().strip() - -# setup a storage client using credentials -credentials = Credentials(access_token) -fs = gcsfs.GCSFileSystem(token=credentials) -``` - -> Make sure to replace the `path/to/your/token.txt` with the actual path to your token file. - -Try to write a small dataset to the cloud: - -```python -ds = xr.DataArray([1]).to_dataset(name='test') -mapper = fs.get_mapper('gs://leap-scratch//test_offsite_upload.zarr') -ds.to_zarr(mapper) -``` - -> Replace `` with your actual username on the hub. - -- Make sure that you can read the test dataset from within the hub (go back to [Basic writing to and reading from cloud buckets](hub:data:read_write)). - -- Now the last step is to paste the code to load your actual dataset into the notebook and use `.to_zarr` to upload it. - -> Make sure to give the store a meaningful name, and raise an issue in the [data-management repo](https://github.com/leap-stc/data-management/issues) to get the dataset added to the LEAP Data Library. - -> Make sure to use a different bucket than `leap-scratch`, since that will be deleted every 7 days! For more info refer to the available [storage buckets](hub:data:buckets). - -(hub:data:upload_hpc)= -##### Uploading large original data from an HPC system (no browser access on the system available) - -A commong scenario is the following: A researcher/student has run a simulation on a High Performance Computer (HPC) at their institution, but now wants to collaboratively work on the analysis or train a machine learning model with this data. For this they need to upload it to the cloud storage. - -The following steps will guide you through the steps needed to authenticate and upload data to the cloud, but might have to be slightly modified depending on the actual setup of the users HPC. - -**Conversion Script/Notebook** - -In most cases you do not just want to upload the data in its current form (e.g. many netcdf files). 
- -Instead we will load the data into an [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html) and then write that Dataset object directly to a zarr store in the cloud. For this you need a python environment with `xarray, gcsfs, zarr` installed (you might need additional dependencies for your particular use case). - -1. Spend some time to set up a python script/jupyter notebook on the HPC system that opens your files and combines them in to one or more xarray.Datasets (combine as many files as sensible into a single dataset). Make sure that your data is lazily loaded and the `Dataset.data` is a [dask array](https://docs.dask.org/en/stable/array.html) - -2. Check your dataset: - - Check that the metadata is correct. - - Check that all the variables/dimensions are in the dataset - - Check the dask chunksize. A general rule is to aim for around 100MB size, but the size and structure of chunking that is optimal depends heavily on the later use case. - - -3. Try to write out a subset of the data locally by calling the [`.to_zarr`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.to_zarr.html) method on the dataset. - -Once that works we can move on to the authentication. - -**Upload Prerequisites** - -Before we are able to set up authentication we need to make sure our HPC and local computer (required) are set up correctly. -- We manage access rights through [Google Groups](https://groups.google.com). Please contact the [](support.data_compute_team) to get added to the appropriate group (a gmail address is required for this). -- Make sure to install the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install) in both your HPC environment, and your local computer that can open a web browser (e.g. your laptop). - -**Steps** -Steps executed on your "local" computer (e.g. laptop) will be colored in green and steps on your "remote" computer (e.g. HPC) in purple. - -1. SSH into the HPC -2. Check that you have an internet connection with `ping www.google.com` -3. Request no browser authentication: - ``` - gcloud auth application-default login --scopes=https://www.googleapis.com/auth/devstorage.read_write,https://www.googleapis.com/auth/iam.test --no-browser - ``` - > ๐Ÿšจ It is very important to include the `--scopes=` argument for security reasons. Do not run this command without it! -4. Follow the onscreen prompt and paste the command into a terminal on your local machine. -5. This will open a browser window. Authenticate with the gmail account that was added to the google group. -6. Go back to the terminal and follow the onscreen instructions. Copy the text from the command line and paste the command in the open dialog on the remote machine. -7. Make sure to note the path to the auth json! It will be something like `.../.config/gcloud/....json`. - -Now you are have everything you need to authenticate. - -Lets verify that you can write a small dummy dataset to the cloud. In your notebook/script run the following (make sure to replace the filename and your username as instructed). 
- -Your dataset should now be available for all LEAP members ๐ŸŽ‰๐Ÿš€ - -```python -import xarray as xr -import gcsfs -import json - -with open("your_auth_file.json") as f: #๐Ÿšจ make sure to enter the `.json` file from step 7 - token=json.load(f) - -# test write a small dummy xarray dataset to zarr -ds = xr.DataArray([1, 4, 6]).to_dataset(name='data') -# Once you have confirmed - -fs = gcsfs.GCSFileSystem(token=token) -mapper = fs.get_mapper("gs://leap-persistent//testing/demo_write_from_remote.zarr") #๐Ÿšจ enter your leap (github) username here -ds.to_zarr(mapper) -``` - -Now you can repeat the same steps but replace your dataset with the full dataset from above and leave your python code running until the upload has finished. Depending on the internet connection speed and the size of the full dataset, this can take a while. - -If you want to see a progress bar, you can wrap the call to `.to_zarr` with a [dask progress bar](https://docs.dask.org/en/stable/diagnostics-local.html#progress-bar) - -```python -from dask.diagnostics import ProgressBar -with ProgressBar(): - ds.to_zarr(mapper) -``` - -Once the data has been uploaded, make sure to erase the `.../.config/gcloud/....json` file from step 7, and ask to be removed from the Google Group. - -### Dask - -To help you scale up calculations using a cluster, the Hub is configured with Dask Gateway. -For a quick guide on how to start a Dask Cluster, consult this page from the Pangeo docs: - -- https://pangeo.io/cloud.html#dask - -### GPUs - -Tier2 and Tier3 members (see [Users and Categories](../../policies/users_roles.md)) have access to a 'Large' Server instance with GPU. Currently the GPUs are [Nvidia T4](https://www.nvidia.com/en-us/data-center/tesla-t4/) models. To check what GPU is available on your server you can use [`nvidia-smi`](https://developer.nvidia.com/nvidia-system-management-interface) in the terminal window. You should get output similar to this: - -```shell - - nvidia-smi - - +-----------------------------------------------------------------------------+ - | NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 | - |-------------------------------+----------------------+----------------------+ - | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | - | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | - | | | MIG M. | - |===============================+======================+======================| - | 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 | - | N/A 41C P8 11W / 70W | 0MiB / 15360MiB | 0% Default | - | | | N/A | - +-------------------------------+----------------------+----------------------+ -``` - +``` \ No newline at end of file diff --git a/book/leap-pangeo/tutorial.md b/book/leap-pangeo/tutorial.md new file mode 100644 index 00000000..9c1b4a80 --- /dev/null +++ b/book/leap-pangeo/tutorial.md @@ -0,0 +1,73 @@ +# Getting Started + +To get started using the hub, check out this video by [James Munroe](https://github.com/jmunroe) from [2i2c](https://2i2c.org) explaining the architecture. + + + +## Hub Usage + +This is a rough and ready guide to using the Hub. +This documentation will be expanded as we learn and evolve. +Feel free to [edit it yourself](https://github.com/leap-stc/leap-stc.github.io/blob/main/book/leap-pangeo/jupyterhub.md) if you have suggetions for improvement! + +(hub:server:login)= +### Logging In + +1. ๐Ÿ‘€ Navigate to https://leap.2i2c.cloud/ and click the big orange button that says "Log in to continue" +2. 
๐Ÿ” You will be prompted to authorize a GitHub application. Say "yes" to everything. + Note you must belong to the appropriate GitHub team in order to access the hub. + See {doc}`/policies/users_roles` for more information. +3. ๐Ÿ“  You will redirect to a screen with the following options. + +image + +> Note: Depending on your [membership]() you might see additional options (e.g. for GPU machines) + +You have to make 3 choices here: +- The machine type (Choose between "CPU only" or "GPU" if available) + **โš ๏ธThe GPU images should be used only when needed to accelerate model training.** +- The software environment ("Image"). Find more info in the [Software Environment Section](hub:image) below. +- The node share. These are shared resources, and you should try to use the smallest image you need. You can easily start up a new server with a larger share if you find your work to be limited by CPU/RAM + +4. ๐Ÿ•ฅ Wait for your server to start up. It can take up to few minutes. + +#### GPUs + +Tier2 and Tier3 members (see [Users and Categories](../../policies/users_roles.md)) have access to a 'Large' Server instance with GPU. Currently the GPUs are [Nvidia T4](https://www.nvidia.com/en-us/data-center/tesla-t4/) models. To check what GPU is available on your server you can use [`nvidia-smi`](https://developer.nvidia.com/nvidia-system-management-interface) in the terminal window. You should get output similar to this: + +```shell + + nvidia-smi + + +-----------------------------------------------------------------------------+ + | NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 | + |-------------------------------+----------------------+----------------------+ + | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | + | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | + | | | MIG M. | + |===============================+======================+======================| + | 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 | + | N/A 41C P8 11W / 70W | 0MiB / 15360MiB | 0% Default | + | | | N/A | + +-------------------------------+----------------------+----------------------+ +``` + +### Using JupyterLab + +After your server fires up, you will be dropped into a JupyterLab environment. + +If you are new to JupyterLab, you might want to peruse the [user guide](https://jupyterlab.readthedocs.io/en/stable/user/interface.html). + +### Shutting Down Your Server + +Your server will shut down automatically after a period of inactivity. +However, if you know you are done working, it's best to shut it down directly. +To shut it down, go to https://leap.2i2c.cloud/hub/home and click the big red button that says "Stop My Server" + +image + +You can also navigate to this page from JupyterLab by clicking the `File` menu and going to `Hub Control Panel`. + +(hub:image)= + +For more information on specific use cases or workflows that might arise while using the Hub, please refer to our [Guides](../guides/hub_guides.md). \ No newline at end of file diff --git a/book/support.rst b/book/support.rst index 0bd213a5..7ba0cb14 100644 --- a/book/support.rst +++ b/book/support.rst @@ -1,10 +1,19 @@ -Support +Getting Help ======= +For questions about how to use the Hub, please use the LEAP-Pangeo discussion forum: + +- `LEAP-Pangeo Discussion Forum `_ + +Office Hours +~~~~~~~~~~~~ +We also offer in-person and virtual Office Hours on Thursdays for questions about LEAP-Pangeo. +You can reserve an appointment `here `_. + .. 
_support.data_compute_team:

Data and Computation Team
-~~~~~~~~~~~~~~~~~~~~~
+-------------------------

.. jinja:: team-data