diff --git a/episodes/01-access-data.md b/episodes/01-access-data.md
new file mode 100644
index 00000000..c114d300
--- /dev/null
+++ b/episodes/01-access-data.md
@@ -0,0 +1,429 @@
+---
+title: "Access satellite imagery using Python"
+teaching: 30
+exercises: 15
+---
+
+:::questions
+- Where can I find open-access satellite data?
+- How do I search for satellite imagery with the STAC API?
+- How do I fetch remote raster datasets using Python?
+:::
+
+:::objectives
+- Search public STAC repositories of satellite imagery using Python.
+- Inspect the metadata of search results.
+- Download (a subset of) the assets available for a satellite scene.
+- Open satellite imagery as raster data and save it to disk.
+:::
+
+## Introduction
+
+A number of satellites take snapshots of the Earth's surface from space. The images recorded by these remote sensors
+represent a very precious data source for any activity that involves monitoring changes on Earth. Satellite imagery is
+typically provided in the form of geospatial raster data, with the measurements in each grid cell ("pixel") being
+associated with accurate geographic coordinate information.
+
+In this episode we will explore how to access open satellite data using Python. In particular, we will
+consider [the Sentinel-2 data collection that is hosted on AWS](https://registry.opendata.aws/sentinel-2-l2a-cogs).
+This dataset consists of multi-band optical images acquired by the two satellites of
+[the Sentinel-2 mission](https://sentinel.esa.int/web/sentinel/missions/sentinel-2) and it is continuously updated with
+new images.
+
+## Search for satellite imagery
+
+### The SpatioTemporal Asset Catalog (STAC) specification
+
+Current sensor resolutions and satellite revisit periods are such that terabytes of data products are added daily to the
+corresponding collections. Such datasets cannot be made accessible to users via full-catalog download. Space agencies
+and other data providers often offer access to their data catalogs through interactive Graphical User Interfaces (GUIs),
+see for instance the [Copernicus Open Access Hub portal](https://scihub.copernicus.eu/dhus/#/home) for the Sentinel
+missions. Accessing data via a GUI is a nice way to explore a catalog and get familiar with its content, but it is a
+cumbersome and error-prone way to retrieve data systematically.
+
+A service that offers programmatic access to the data enables users to reach the desired data in a more reliable,
+scalable and reproducible manner. An important element of the software interface exposed to users, which is generally
+called the Application Programming Interface (API), is the use of standards. Standards can significantly facilitate
+the reusability of tools and scripts across datasets and applications.
+
+The SpatioTemporal Asset Catalog (STAC) specification is an emerging standard for describing geospatial data. By
+organizing metadata in a form that adheres to the STAC specifications, data providers make it possible for users to
+access data from different missions, instruments and collections using the same set of tools.
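+
+To give a flavour of what this looks like in practice, below is a heavily simplified sketch of a STAC item, written
+here as a Python dictionary. The field names follow the STAC specification, but all values are made up for
+illustration:
+
+```python
+# a minimal, made-up STAC item: a JSON document describing one "scene",
+# with links to the actual data files listed under "assets"
+stac_item = {
+    "type": "Feature",
+    "stac_version": "1.0.0",
+    "id": "example-scene",
+    "geometry": {  # footprint of the scene, as a GeoJSON geometry
+        "type": "Polygon",
+        "coordinates": [[[4.8, 52.2], [6.1, 52.2], [6.1, 53.0], [4.8, 53.0], [4.8, 52.2]]],
+    },
+    "properties": {"datetime": "2020-03-28T10:00:00Z"},  # acquisition time and other metadata
+    "assets": {"B01": {"href": "https://example.com/B01.tif"}},  # one entry per data file
+}
+```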
+
+![STAC-browser](../fig/E05-01-STAC-browser.jpg)
+
+:::callout
+## More Resources on STAC
+- [STAC specification](https://github.com/radiantearth/stac-spec#readme)
+- [Tools based on STAC](https://stacindex.org/ecosystem)
+- [STAC catalogs](https://stacindex.org/catalogs)
+:::
+
+## Search a STAC catalog
+
+The [STAC browser](https://radiantearth.github.io/stac-browser/#/) is a good starting point to discover available
+datasets, as it provides an up-to-date list of existing STAC catalogs. From the list, let's click on the
+"Earth Search" catalog, i.e. the access point to search the archive of Sentinel-2 images hosted on AWS.
+
+:::challenge
+## Exercise: Discover a STAC catalog
+Let's take a moment to explore the Earth Search STAC catalog, which is the catalog indexing the Sentinel-2 collection
+that is hosted on AWS. We can interactively browse this catalog using the STAC browser at [this link](https://radiantearth.github.io/stac-browser/#/external/earth-search.aws.element84.com/v0).
+
+1. Open the link in your web browser. Which (sub-)catalogs are available?
+2. Open the Sentinel-2 L2A COGs collection, and select one item from the list. Each item corresponds to a satellite
+"scene", i.e. a portion of the footage recorded by the satellite at a given time. Have a look at the metadata fields
+and the list of assets. What kind of data do the assets represent?
+
+::::solution
+
+![STAC-browser-exercise](../fig/E05-02-STAC-browser-exercise.jpg)
+
+1. Four subcatalogs are available, including both Sentinel-2 and Landsat 8 images (see left screenshot in the figure above).
+2. When you select the Sentinel-2 L2A COGs collection and randomly choose one of the items from the list, you should find yourself on a page similar to the right screenshot in the figure above. On the left side you will find a list of the available assets: overview images (thumbnail and true color images), metadata files and the "real" satellite images, one for each band captured by the Multispectral Instrument on board Sentinel-2.
+::::
+:::
+
+When opening a catalog with the STAC browser, you can access the API URL by clicking on the "Source" button on the top
+right of the page. By using this URL, we have access to the catalog content and, if supported by the catalog, to the
+functionality of searching its items. For the Earth Search STAC catalog the API URL is:
+
+```python
+api_url = "https://earth-search.aws.element84.com/v0"
+```
+
+You can query a STAC API endpoint from Python using the `pystac_client` library:
+
+```python
+from pystac_client import Client
+
+client = Client.open(api_url)
+```
+
+In the following, we ask for scenes belonging to the `sentinel-s2-l2a-cogs` collection. This dataset includes Sentinel-2
+data products pre-processed at level 2A (bottom-of-atmosphere reflectance) and saved in Cloud Optimized GeoTIFF (COG)
+format:
+
+```python
+collection = "sentinel-s2-l2a-cogs"  # Sentinel-2, Level 2A, COGs
+```
+
+:::callout
+## Cloud Optimized GeoTIFFs
+
+Cloud Optimized GeoTIFFs (COGs) are regular GeoTIFF files with some additional features that make them ideal for use
+in cloud computing and other web-based services. This format builds on the widely-employed
+GeoTIFF format, already introduced in [Episode 1: Introduction to Raster Data]({{site.baseurl}}/01-intro-raster-data/).
+In essence, COGs are regular GeoTIFF files with a special internal structure.
+One of the features of COGs is that data
+is organized in "blocks" that can be accessed remotely via independent HTTP requests. Data users can thus access only
+the blocks of a GeoTIFF that are relevant for their analysis, without having to download the full file. In addition,
+COGs typically include multiple lower-resolution versions of the original image, called "overviews", which can also be
+accessed independently. By providing this "pyramidal" structure, users that are not interested in the details provided
+by a high-resolution raster can directly access the lower-resolution versions of the same image, significantly saving
+on download time. More information on the COG format can be found [here](https://www.cogeo.org).
+:::
+
+We also ask for scenes intersecting a geometry defined using the `shapely` library (in this case, a point):
+
+```python
+from shapely.geometry import Point
+point = Point(4.89, 52.37)  # Amsterdam coordinates
+```
+
+Note: at this stage, we are only dealing with metadata, so no image is going to be downloaded yet. But even metadata can
+be quite bulky if a large number of scenes match our search! For this reason, we limit the search result to 10 items:
+
+```python
+search = client.search(
+    collections=[collection],
+    intersects=point,
+    max_items=10,
+)
+```
+
+We submit the query and find out how many scenes match our search criteria (please note that this output can be different as more data is added to the catalog):
+
+```python
+print(search.matched())
+```
+
+```output
+630
+```
+
+Finally, we retrieve the metadata of the search results:
+```python
+items = search.get_all_items()
+```
+
+The variable `items` is an `ItemCollection` object. We can check its size by:
+```python
+print(len(items))
+```
+
+```output
+10
+```
+
+which is consistent with the maximum number of items that we have set in the search criteria. We can iterate over
+the returned items and print these to show their IDs:
+
+```python
+for item in items:
+    print(item)
+```
+
+```output
+<Item id=...>
+<Item id=...>
+<Item id=...>
+<Item id=...>
+<Item id=...>
+<Item id=...>
+<Item id=...>
+<Item id=...>
+<Item id=...>
+<Item id=...>
+```
+
+Each of the items contains information about the scene geometry, its acquisition time, and other metadata that can be
+accessed as a dictionary from the `properties` attribute.
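+
+As a first taste of how these properties can be used, here is a sketch that ranks the returned items from least to
+most cloudy. This assumes the `eo:cloud_cover` property is present, as it is for this collection:
+
+```python
+# sort the retrieved items by their cloud cover percentage
+items_by_cloud = sorted(items, key=lambda item: item.properties["eo:cloud_cover"])
+```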
+
+Let's inspect the metadata associated with the first item of the search results:
+```python
+item = items[0]
+print(item.datetime)
+print(item.geometry)
+print(item.properties)
+```
+
+```output
+2022-03-13 10:46:26+00:00
+{'type': 'Polygon', 'coordinates': [[[6.071664488869862, 52.22257539160586], [4.81543624496389, 52.24859574950519], [5.134718264192548, 52.98684040773408], [6.115758198408397, 52.84931817668945], [6.071664488869862, 52.22257539160586]]]}
+{'datetime': '2022-03-13T10:46:26Z', 'platform': 'sentinel-2b', 'constellation': 'sentinel-2', 'instruments': ['msi'], 'gsd': 10, 'view:off_nadir': 0, 'proj:epsg': 32631, 'sentinel:utm_zone': 31, 'sentinel:latitude_band': 'U', 'sentinel:grid_square': 'FU', 'sentinel:sequence': '0', 'sentinel:product_id': 'S2B_MSIL2A_20220313T103729_N0400_R008_T31UFU_20220313T140234', 'sentinel:data_coverage': 48.71, 'eo:cloud_cover': 99.95, 'sentinel:valid_cloud_cover': True, 'sentinel:processing_baseline': '04.00', 'sentinel:boa_offset_applied': True, 'created': '2022-03-13T17:43:18.861Z', 'updated': '2022-03-13T17:43:18.861Z'}
+```
+
+:::challenge
+## Exercise: Search satellite scenes using metadata filters
+Search for all the available Sentinel-2 scenes in the `sentinel-s2-l2a-cogs` collection that satisfy the following
+criteria:
+- intersect a provided bounding box (use ±0.01 deg in lat/lon from the previously defined point);
+- have been recorded between 20 March 2020 and 30 March 2020;
+- have a cloud coverage smaller than 10% (hint: use the `query` input argument of `client.search`).
+
+How many scenes are available? Save the search results in GeoJSON format.
+
+::::solution
+```python
+bbox = point.buffer(0.01).bounds
+```
+
+```python
+search = client.search(
+    collections=[collection],
+    bbox=bbox,
+    datetime="2020-03-20/2020-03-30",
+    query=["eo:cloud_cover<10"]
+)
+print(search.matched())
+```
+
+```output
+4
+```
+
+```python
+items = search.get_all_items()
+items.save_object("search.json")
+```
+::::
+:::
+
+## Access the assets
+
+So far we have only discussed metadata - but how can one get to the actual images of a satellite scene (the "assets" in
+the STAC nomenclature)? These can be reached via links that are made available through the item's attribute `assets`.
+
+```python
+assets = items[0].assets  # first item's asset dictionary
+print(assets.keys())
+```
+
+```output
+dict_keys(['thumbnail', 'overview', 'info', 'metadata', 'visual', 'B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B09', 'B11', 'B12', 'AOT', 'WVP', 'SCL'])
+```
+
+We can print a minimal description of the available assets:
+
+```python
+for key, asset in assets.items():
+    print(f"{key}: {asset.title}")
+```
+
+```output
+thumbnail: Thumbnail
+overview: True color image
+info: Original JSON metadata
+metadata: Original XML metadata
+visual: True color image
+B01: Band 1 (coastal)
+B02: Band 2 (blue)
+B03: Band 3 (green)
+B04: Band 4 (red)
+B05: Band 5
+B06: Band 6
+B07: Band 7
+B08: Band 8 (nir)
+B8A: Band 8A
+B09: Band 9
+B11: Band 11 (swir16)
+B12: Band 12 (swir22)
+AOT: Aerosol Optical Thickness (AOT)
+WVP: Water Vapour (WVP)
+SCL: Scene Classification Map (SCL)
+```
+
+Among others, the assets include multiple raster data files (one per optical band, as acquired by the multi-spectral
+instrument), a thumbnail, a true-color image ("visual"), instrument metadata and scene-classification information
+("SCL").
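+
+Each asset carries an `href` attribute: a plain URL pointing to the actual file, which any HTTP client can fetch. As a
+minimal sketch using only the Python standard library (the output file name is chosen here for illustration):
+
+```python
+import urllib.request
+
+# fetch the thumbnail asset and save it to a local file
+urllib.request.urlretrieve(assets["thumbnail"].href, "thumbnail.jpg")
+```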
+Let's have a look at the URL of one of these assets, e.g. the thumbnail:
+
+```python
+print(assets["thumbnail"].href)
+```
+
+```output
+https://roda.sentinel-hub.com/sentinel-s2-l1c/tiles/31/U/FU/2020/3/28/0/preview.jpg
+```
+
+This is the link that the download sketch above retrieves; opening it in a web browser gives a quick preview of the scene:
+
+![STAC-s2-preview](../fig/E05-03-STAC-s2-preview.jpg)
+
+Remote raster data can be directly opened via the `rioxarray` library. We will
+learn more about this library in the next episodes.
+```python
+import rioxarray
+b01_href = assets["B01"].href
+b01 = rioxarray.open_rasterio(b01_href)
+print(b01)
+```
+
+```output
+<xarray.DataArray (band: 1, y: 1830, x: 1830)>
+[3348900 values with dtype=uint16]
+Coordinates:
+  * band         (band) int64 1
+  * x            (x) float64 6e+05 6.001e+05 6.002e+05 ... 7.097e+05 7.098e+05
+  * y            (y) float64 5.9e+06 5.9e+06 5.9e+06 ... 5.79e+06 5.79e+06
+    spatial_ref  int64 0
+Attributes:
+    _FillValue:    0.0
+    scale_factor:  1.0
+    add_offset:    0.0
+```
+
+We can then save the data to disk:
+
+```python
+# save image to disk
+b01.rio.to_raster("B01.tif")
+```
+
+:::challenge
+## Exercise: Downloading Landsat 8 Assets
+In this exercise we put in practice all the skills we have learned in this episode to retrieve images from a different
+mission: [Landsat 8](https://www.usgs.gov/landsat-missions/landsat-8). In particular, we browse images from the
+[Harmonized Landsat Sentinel-2 (HLS) project](https://lpdaac.usgs.gov/products/hlsl30v002/), which provides images
+from NASA's Landsat 8 and ESA's Sentinel-2 that have been made consistent with each other. The HLS catalog is indexed
+in the NASA Common Metadata Repository (CMR) and it can be accessed from the STAC API endpoint at the following URL:
+`https://cmr.earthdata.nasa.gov/stac/LPCLOUD`.
+
+- Using `pystac_client`, search for all assets of the Landsat 8 collection (`HLSL30.v2.0`) from February to March
+  2021, intersecting the point with longitude/latitude coordinates (-73.97, 40.78) deg.
+- Visualize an item's thumbnail (asset key `browse`).
+
+::::solution
+```python
+# connect to the STAC endpoint
+cmr_api_url = "https://cmr.earthdata.nasa.gov/stac/LPCLOUD"
+client = Client.open(cmr_api_url)
+
+# setup search
+search = client.search(
+    collections=["HLSL30.v2.0"],
+    intersects=Point(-73.97, 40.78),
+    datetime="2021-02-01/2021-03-30",
+)  # NASA CMR cloud cover filtering is currently broken: https://github.com/nasa/cmr-stac/issues/239
+
+# retrieve search results
+items = search.get_all_items()
+print(len(items))
+```
+
+```output
+5
+```
+
+```python
+# sort and then select by cloud cover
+items_sorted = sorted(items, key=lambda x: x.properties["eo:cloud_cover"])
+item = items_sorted[0]
+print(item)
+```
+
+```output
+<Item id=HLS.L30.T18TWL.2021039T153324.v2.0>
+```
+
+```python
+print(item.assets["browse"].href)
+```
+
+```output
+https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-public/HLSL30.020/HLS.L30.T18TWL.2021039T153324.v2.0/HLS.L30.T18TWL.039T153324.v2.0.jpg
+```
+
+![STAC-l8-preview](../fig/E05-04-STAC-l8-preview.jpg)
+::::
+:::
+
+:::callout
+## Public catalogs, protected data
+
+Publicly accessible catalogs and STAC endpoints do not necessarily imply publicly accessible data. Data providers, in
+fact, may limit data access to specific infrastructures and/or require authentication. For instance, the NASA CMR STAC
+endpoint considered in the last exercise offers publicly accessible metadata for the HLS collection, but most of the
+linked assets are available only for registered users (the thumbnail is publicly accessible).
+
+The authentication procedure for datasets with restricted access might differ depending on the data provider.
+For the
+NASA CMR, follow these steps in order to access data using Python:
+* Create a NASA Earthdata login account [here](https://urs.earthdata.nasa.gov);
+* Set up a netrc file with your credentials, e.g. by using [this script](https://git.earthdata.nasa.gov/projects/LPDUR/repos/ac_data_download_python/browse/EarthdataLoginSetup.py);
+* Define the following environment variables:
+
+```python
+import os
+os.environ["GDAL_HTTP_COOKIEFILE"] = "./cookies.txt"
+os.environ["GDAL_HTTP_COOKIEJAR"] = "./cookies.txt"
+```
+:::
+
+:::keypoints
+- Accessing satellite images via the providers' API enables more reliable and scalable data retrieval.
+- STAC catalogs can be browsed and searched using the same tools and scripts.
+- `rioxarray` allows you to open and download remote raster files.
+:::
\ No newline at end of file
diff --git a/episodes/02-raster-intro.md b/episodes/02-raster-intro.md
new file mode 100644
index 00000000..62a65ec8
--- /dev/null
+++ b/episodes/02-raster-intro.md
@@ -0,0 +1,444 @@
+---
+title: "Read and visualize raster data"
+teaching: 70
+exercises: 30
+---
+
+:::questions
+- How is a raster represented by rioxarray?
+- How do I read and plot raster data in Python?
+- How can I handle missing data?
+:::
+
+:::objectives
+- Describe the fundamental attributes of a raster dataset.
+- Explore raster attributes and metadata using Python.
+- Read rasters into Python using the `rioxarray` package.
+- Visualize single/multi-band raster data.
+:::
+
+Raster datasets have been introduced in [Episode 1: Introduction to Raster Data]({{site.baseurl}}/01-intro-raster-data). Here, we introduce the fundamental principles, packages and metadata/raster attributes for working with raster data in Python. We will also explore how Python handles missing and bad data values.
+
+[`rioxarray`](https://corteva.github.io/rioxarray/stable/) is the Python package we will use throughout this lesson to work with raster data. It is based on the popular [`rasterio`](https://rasterio.readthedocs.io/en/latest/) package for working with rasters and [`xarray`](http://xarray.pydata.org/en/stable/) for working with multi-dimensional arrays.
+`rioxarray` extends `xarray` by providing top-level functions (e.g. the `open_rasterio` function to open raster datasets) and by adding a set of methods to the main objects of the `xarray` package (the `Dataset` and the `DataArray`). These additional methods are accessed via the `rio` accessor and become available from `xarray` objects after importing `rioxarray`.
+
+We will also use the [`pystac`](https://github.com/stac-utils/pystac) package to load rasters from the search results we created in the previous episode.
+
+:::callout
+## Introduce the Raster Data
+We'll continue from the results of the satellite image search that we have carried out in an exercise from
+[a previous episode]({{site.baseurl}}/05-access-data). We will load data starting from the `search.json` file,
+using one scene from the search results as an example to demonstrate data loading and visualization.
+
+If you would like to work with the data for this lesson without downloading data on-the-fly, you can download the
+raster data ahead of time using this [link](https://figshare.com/ndownloader/files/36028100). Save the `geospatial-python-raster-dataset.tar.gz`
+file in your current working directory, and extract the archive file by double-clicking on it or by running the
+following command in your terminal: `tar -zxvf geospatial-python-raster-dataset.tar.gz`.
+Use the file `geospatial-python-raster-dataset/search.json`
+(instead of `search.json`) to get started with this lesson.
+
+This can be useful if you need to download the data ahead of time to work through the lesson offline, or if you want
+to work with the data in a different GIS.
+:::
+
+## Load a Raster and View Attributes
+In the previous episode, we searched for Sentinel-2 images, and then saved the search results to a file: `search.json`. This contains the information on where and how to access the target images from a remote repository. We can use the function `pystac.ItemCollection.from_file()` to load the search results as a list of `Item` objects.
+
+```python
+import pystac
+items = pystac.ItemCollection.from_file("search.json")
+```
+
+In the search results, we have 4 `Item` objects, corresponding to the 4 Sentinel-2 scenes from March 26th and 28th, 2020. We will focus on the first scene `S2A_31UFU_20200328_0_L2A`, and load band `B09` (central wavelength 945 nm). We can load this band using the function `rioxarray.open_rasterio()`, via the Hypertext Reference `href` (commonly referred to as a URL):
+```python
+import rioxarray
+raster_ams_b9 = rioxarray.open_rasterio(items[0].assets["B09"].href)
+```
+
+By calling the variable name in a Jupyter notebook we can get a quick look at the shape and attributes of the data.
+```python
+raster_ams_b9
+```
+
+```output
+<xarray.DataArray (band: 1, y: 1830, x: 1830)>
+[3348900 values with dtype=uint16]
+Coordinates:
+  * band         (band) int64 1
+  * x            (x) float64 6e+05 6.001e+05 6.002e+05 ... 7.097e+05 7.098e+05
+  * y            (y) float64 5.9e+06 5.9e+06 5.9e+06 ... 5.79e+06 5.79e+06
+    spatial_ref  int64 0
+Attributes:
+    _FillValue:    0.0
+    scale_factor:  1.0
+    add_offset:    0.0
+```
+
+The first call to `rioxarray.open_rasterio()` opens the file from remote or local storage, and then returns an `xarray.DataArray` object. The object is stored in a variable, i.e. `raster_ams_b9`. Reading in the data with `xarray` instead of `rioxarray` also returns an `xarray.DataArray`, but the output will not contain the geospatial metadata (such as projection information). You can use numpy functions or built-in Python math operators on an `xarray.DataArray` just like a numpy array. Calling the variable name of the `DataArray` also prints out all of its metadata information.
+
+The output tells us that we are looking at an `xarray.DataArray`, with `1` band, `1830` rows, and `1830` columns. We can also see the number of pixel values in the `DataArray`, and the type of those pixel values, which is unsigned integer (`uint16`). The `DataArray` also stores different values for the coordinates of the `DataArray`. When using `rioxarray`, the term coordinates refers to spatial coordinates like `x` and `y`, but also to the `band` coordinate. Each of these sequences of values has its own data type, like `float64` for the spatial coordinates and `int64` for the `band` coordinate.
+
+This `DataArray` object also has a couple of attributes that are accessed like `.rio.crs`, `.rio.nodata`, and `.rio.bounds()`, which contain the metadata for the file we opened. Note that many of the metadata are accessed as attributes without `()`, but `bounds()` is a method (i.e. a function in an object) and needs parentheses.
+
+```python
+print(raster_ams_b9.rio.crs)
+print(raster_ams_b9.rio.nodata)
+print(raster_ams_b9.rio.bounds())
+print(raster_ams_b9.rio.width)
+print(raster_ams_b9.rio.height)
+```
+
+```output
+EPSG:32631
+0
+(600000.0, 5790240.0, 709800.0, 5900040.0)
+1830
+1830
+```
+
+The Coordinate Reference System, or `raster_ams_b9.rio.crs`, is reported as the string `EPSG:32631`. The `nodata` value is encoded as 0, and the bounding box corners of our raster are represented by the output of `.bounds()` as a `tuple` (like a list, but immutable). The height and width match what we saw when we printed the `DataArray`, but by using `.rio.width` and `.rio.height` we can access these values if we need them in calculations.
+
+We will be exploring this data throughout this episode. By the end of this episode, you will be able to understand and explain the metadata output.
+
+:::callout
+## Tip - Variable names
+To improve code readability, use file and object names that make it clear what they contain. The data for this episode covers Amsterdam, and is from Band 9, so we'll use the naming convention `raster_ams_b9`.
+:::
+
+## Visualize a Raster
+
+After viewing the attributes of our raster, we can examine the raw values of the array with `.values`:
+
+```python
+raster_ams_b9.values
+```
+
+```output
+array([[[    0,     0,     0, ...,  8888,  9075,  8139],
+        [    0,     0,     0, ..., 10444, 10358,  8669],
+        [    0,     0,     0, ..., 10346, 10659,  9168],
+        ...,
+        [    0,     0,     0, ...,  4295,  4289,  4320],
+        [    0,     0,     0, ...,  4291,  4269,  4179],
+        [    0,     0,     0, ...,  3944,  3503,  3862]]], dtype=uint16)
+```
+
+This can give us a quick view of the values of our array, but only at the corners. Since our raster is loaded in Python as a `DataArray` type, we can plot it in one line, similar to a pandas `DataFrame`, with `DataArray.plot()`.
+
+```python
+raster_ams_b9.plot()
+```
+
+*(figure: raster plot with default settings)*
+
+Nice plot! Notice that `rioxarray` helpfully allows us to plot this raster with spatial coordinates on the x and y axis (this is not the default in many cases with other functions or libraries).
+
+This plot shows the satellite measurement of the spectral band `B09` for an area that covers part of the Netherlands. According to the [Sentinel-2 documentation](https://sentinels.copernicus.eu/web/sentinel/technical-guides/sentinel-2-msi/msi-instrument), this band has a central wavelength of 945 nm, which makes it sensitive to water vapor. It has a spatial resolution of 60 m. Note that the `band=1` in the image title refers to the ordering of all the bands in the `DataArray`, not the Sentinel-2 band number `B09` that we saw in the pystac search results.
+
+With a quick view of the image, we notice that half of the image is blank: no data was captured there. We also see that the cloudy pixels at the top have high reflectance values, while the contrast of everything else is quite low. This is expected, because this band is sensitive to water vapor. However, to get a better color contrast, one can add the option `robust=True`, which displays values between the 2nd and 98th percentile:
+
+```python
+raster_ams_b9.plot(robust=True)
+```
+
+*(figure: raster plot with robust setting)*
+
+Now the color limits are set to fit most of the values in the image, and we have a better view of the ground pixels.
+
+:::callout
+## Tool Tip
+The option `robust=True` always forces displaying values between the 2nd and 98th percentile. Of course, this will not work for every case.
+For a customized display range, you can also manually specify the keyword arguments `vmin` and `vmax`. For example, plotting between `100` and `7000`:
+
+```python
+raster_ams_b9.plot(vmin=100, vmax=7000)
+```
+:::
+
+## View Raster Coordinate Reference System (CRS) in Python
+Another piece of information that we're interested in is the CRS, which can be accessed with `.rio.crs`. We introduced the concept of a CRS in [an earlier
+episode]({{site.baseurl}}/03-crs).
+Now we will see how features of the CRS appear in our data file and what
+meanings they have. We can view the CRS string associated with our DataArray's `rio` object using the `crs`
+attribute.
+
+```python
+print(raster_ams_b9.rio.crs)
+```
+
+```output
+EPSG:32631
+```
+
+To print the EPSG code number as an `int`, we use the `.to_epsg()` method:
+
+```python
+raster_ams_b9.rio.crs.to_epsg()
+```
+
+```output
+32631
+```
+
+EPSG codes are great for succinctly representing a particular coordinate reference system. But what if we want to see more details about the CRS, like the units? For that, we can use `pyproj`, a library for representing and working with coordinate reference systems.
+
+```python
+from pyproj import CRS
+epsg = raster_ams_b9.rio.crs.to_epsg()
+crs = CRS(epsg)
+crs
+```
+
+```output
+<Projected CRS: EPSG:32631>
+Name: WGS 84 / UTM zone 31N
+Axis Info [cartesian]:
+- E[east]: Easting (metre)
+- N[north]: Northing (metre)
+Area of Use:
+- name: Between 0°E and 6°E, northern hemisphere between equator and 84°N, onshore and offshore. Algeria. Andorra. Belgium. Benin. Burkina Faso. Denmark - North Sea. France. Germany - North Sea. Ghana. Luxembourg. Mali. Netherlands. Niger. Nigeria. Norway. Spain. Togo. United Kingdom (UK) - North Sea.
+- bounds: (0.0, 0.0, 6.0, 84.0)
+Coordinate Operation:
+- name: UTM zone 31N
+- method: Transverse Mercator
+Datum: World Geodetic System 1984 ensemble
+- Ellipsoid: WGS 84
+- Prime Meridian: Greenwich
+```
+
+The `CRS` class from the `pyproj` library allows us to create a `CRS` object with methods and attributes for accessing specific information about a CRS, or the detailed summary shown above.
+
+A particularly useful attribute is `area_of_use`, which shows the geographic bounds within which the CRS is intended to be used.
+
+```python
+crs.area_of_use
+```
+
+```output
+AreaOfUse(west=0.0, south=0.0, east=6.0, north=84.0, name='Between 0°E and 6°E, northern hemisphere between equator and 84°N, onshore and offshore. Algeria. Andorra. Belgium. Benin. Burkina Faso. Denmark - North Sea. France. Germany - North Sea. Ghana. Luxembourg. Mali. Netherlands. Niger. Nigeria. Norway. Spain. Togo. United Kingdom (UK) - North Sea.')
+```
+
+:::challenge
+## Exercise: find the axes units of the CRS
+What units are our coordinates in? See if you can find a method to examine this information using `help(crs)` or `dir(crs)`.
+
+::::solution
+`crs.axis_info` tells us that the CRS for our raster has two axes, and both are in meters.
+We could also get this information from the attribute `raster_ams_b9.rio.crs.linear_units`.
+::::
+:::
+
+### Understanding pyproj CRS Summary
+Let's break down the pieces of the `pyproj` CRS summary. The string contains all of the individual CRS elements that Python or another GIS might need, separated into distinct sections: name, axis info, area of use, coordinate operation, and datum.
+
+```output
+<Projected CRS: EPSG:32631>
+Name: WGS 84 / UTM zone 31N
+Axis Info [cartesian]:
+- E[east]: Easting (metre)
+- N[north]: Northing (metre)
+Area of Use:
+- name: Between 0°E and 6°E, northern hemisphere between equator and 84°N, onshore and offshore. Algeria. Andorra. Belgium. Benin. Burkina Faso. Denmark - North Sea. France. Germany - North Sea. Ghana. Luxembourg. Mali. Netherlands. Niger. Nigeria. Norway. Spain. Togo. United Kingdom (UK) - North Sea.
+- bounds: (0.0, 0.0, 6.0, 84.0)
+Coordinate Operation:
+- name: UTM zone 31N
+- method: Transverse Mercator
+Datum: World Geodetic System 1984 ensemble
+- Ellipsoid: WGS 84
+- Prime Meridian: Greenwich
+```
+
+* **Name** of the projection is UTM zone 31N (UTM has 60 zones, each 6 degrees of longitude wide). The underlying datum is WGS84.
+* **Axis Info**: the CRS shows a Cartesian system with two axes, easting and northing, in meter units.
+* **Area of Use**: the projection is used for a particular range of longitudes, `0°E to 6°E`, in the northern hemisphere (`0.0°N to 84.0°N`).
+* **Coordinate Operation**: the operation to project the coordinates (if it is projected) onto a Cartesian (x, y) plane. Transverse Mercator is accurate for areas with longitudinal widths of a few degrees, hence the distinct UTM zones.
+* **Datum**: details about the datum, or the reference point for coordinates. `WGS 84` and `NAD 1983` are common datums. `NAD 1983` is [set to be replaced in 2022](https://en.wikipedia.org/wiki/Datum_of_2022).
+
+Note that the zone is unique to the UTM projection. Not all CRSs will have a
+zone. Below is a simplified view of US UTM zones. Image source: Chrismurf at English Wikipedia, via [Wikimedia Commons](https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system#/media/File:Utm-zones-USA.svg) (CC-BY).
+
+*(figure: the UTM zones across the continental United States)*
+
+## Calculate Raster Statistics
+
+It is useful to know the minimum and maximum values of a raster dataset. We can compute these and other descriptive statistics with `min`, `max`, `mean`, and `std`.
+
+```python
+print(raster_ams_b9.min())
+print(raster_ams_b9.max())
+print(raster_ams_b9.mean())
+print(raster_ams_b9.std())
+```
+
+```output
+<xarray.DataArray ()>
+array(0, dtype=uint16)
+Coordinates:
+    spatial_ref  int64 0
+<xarray.DataArray ()>
+array(15497, dtype=uint16)
+Coordinates:
+    spatial_ref  int64 0
+<xarray.DataArray ()>
+array(1652.44009944)
+Coordinates:
+    spatial_ref  int64 0
+<xarray.DataArray ()>
+array(2049.16447495)
+Coordinates:
+    spatial_ref  int64 0
+```
+
+The information above includes a report of the min, max, mean, and standard deviation values, along with the data type. If we want to see specific quantiles, we can use xarray's `.quantile()` method. For example, for the 0.25 and 0.75 quantiles:
+
+```python
+print(raster_ams_b9.quantile([0.25, 0.75]))
+```
+
+```output
+<xarray.DataArray (quantile: 2)>
+array([   0., 2911.])
+Coordinates:
+  * quantile  (quantile) float64 0.25 0.75
+```
+
+:::callout
+## Data Tip - NumPy methods
+You could also get each of these values one by one using `numpy`.
+
+```python
+import numpy
+print(numpy.percentile(raster_ams_b9, 25))
+print(numpy.percentile(raster_ams_b9, 75))
+```
+
+```output
+0.0
+2911.0
+```
+
+You may notice that `raster_ams_b9.quantile` and `numpy.percentile` didn't require an argument specifying the axis or dimension along which to compute the quantile. This is because `axis=None` is the default for most numpy functions, and therefore `dim=None` is the default for most xarray methods. It's always good to check out the docs on a function to see what the default arguments are, particularly when working with multi-dimensional image data. To do so, we can use `help(raster_ams_b9.quantile)` or `?raster_ams_b9.quantile` if you are using Jupyter Notebook or JupyterLab.
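+
+As a quick illustration, the same method also accepts an explicit `dim` argument. A minimal sketch, reducing only over
+the spatial dimensions so that one value per band is returned:
+
+```python
+# median per band, computed over the spatial dimensions only
+print(raster_ams_b9.quantile(0.5, dim=["x", "y"]))
+```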
+:::
+
+## Dealing with Missing Data
+So far, we have visualized a band of a Sentinel-2 scene and calculated its statistics. However, we need to take missing data into account. Raster data often has a "no data value" associated with it; for raster datasets read in by `rioxarray`, this value is referred to as `nodata`. This is a value assigned to pixels where data is missing or no data were collected. Missing data can arise in different ways, and it is common for distinct values in a raster to represent distinct cases. The most common example is missing data at the edges of rasters.
+
+The shape of a raster is always rectangular. So if we have a dataset whose coverage isn't rectangular, some pixels at the edge of the raster will have no data values. This often happens when the data were collected by a sensor which only flew over some part of a defined region.
+
+As we have seen above, the `nodata` value of this dataset (`raster_ams_b9.rio.nodata`) is 0. When we plotted the band data or calculated statistics, the missing values were not distinguished from the other values. Missing data may cause some unexpected results. For example, the 0.25 quantile we just calculated was 0, probably reflecting the presence of a lot of missing data in the raster.
+
+To distinguish missing data from real data, one possible way is to use `nan` to represent the former. This can be done by specifying `masked=True` when loading the raster:
+```python
+raster_ams_b9 = rioxarray.open_rasterio(items[0].assets["B09"].href, masked=True)
+```
+
+One can also use the `where` function to select all the pixels which are different from the `nodata` value of the raster:
+```python
+raster_ams_b9.where(raster_ams_b9 != raster_ams_b9.rio.nodata)
+```
+
+Either way will change the `nodata` value from 0 to `nan`. Now if we compute the statistics again, the missing data will not be considered:
+```python
+print(raster_ams_b9.min())
+print(raster_ams_b9.max())
+print(raster_ams_b9.mean())
+print(raster_ams_b9.std())
+```
+
+```output
+<xarray.DataArray ()>
+array(8., dtype=float32)
+Coordinates:
+    spatial_ref  int64 0
+<xarray.DataArray ()>
+array(15497., dtype=float32)
+Coordinates:
+    spatial_ref  int64 0
+<xarray.DataArray ()>
+array(2477.405, dtype=float32)
+Coordinates:
+    spatial_ref  int64 0
+<xarray.DataArray ()>
+array(2061.9539, dtype=float32)
+Coordinates:
+    spatial_ref  int64 0
+```
+
+And if we plot the image, the `nodata` pixels are not shown, because they are not 0 anymore:
+
+![Raster plot with rioxarray using the robust setting and no data](../fig/E06-03-overview-plot-B09-robust-with-nan.png)
+
+One should notice that there is a side effect of using `nan` instead of `0` to represent the missing data: the data type of the `DataArray` was changed from integer to float. This needs to be taken into consideration when the data type matters in your application.
+
+## Raster Bands
+So far we have looked at a single-band raster, i.e. the `B09` band of a Sentinel-2 scene. However, to get an overview of the scene, one may also want to visualize the true-color thumbnail of the region. This is provided as a multi-band raster -- a raster dataset that contains more than one band.
+
+![Multi-band raster image](../fig/E06-07-single_multi_raster.png)
+
+The `overview` asset in the Sentinel-2 scene is a multiband asset.
+Similar to `B09`, we can load it by:
+```python
+raster_ams_overview = rioxarray.open_rasterio(items[0].assets['overview'].href)
+raster_ams_overview
+```
+
+```output
+<xarray.DataArray (band: 3, y: 343, x: 343)>
+[352947 values with dtype=uint8]
+Coordinates:
+  * band         (band) int64 1 2 3
+  * x            (x) float64 6.002e+05 6.005e+05 ... 7.093e+05 7.096e+05
+  * y            (y) float64 5.9e+06 5.9e+06 5.899e+06 ... 5.791e+06 5.79e+06
+    spatial_ref  int64 0
+Attributes:
+    _FillValue:    0.0
+    scale_factor:  1.0
+    add_offset:    0.0
+```
+
+The band number comes first when GeoTIFFs are read with the `.open_rasterio()` function. As we can see in the `xarray.DataArray` object, the shape is now `(band: 3, y: 343, x: 343)`, with three bands in the `band` dimension. It's always a good idea to examine the shape of the raster array you are working with and make sure it's what you expect. Many functions, especially the ones that plot images, expect a raster array to have a particular shape. One can also check the shape using the `.shape` attribute:
+```python
+raster_ams_overview.shape
+```
+
+```output
+(3, 343, 343)
+```
+
+One can visualize the multi-band data with the `DataArray.plot.imshow()` function:
+```python
+raster_ams_overview.plot.imshow()
+```
+
+![Amsterdam true color overview](../fig/E06-04-overview-plot-true-color.png)
+
+Note that the `DataArray.plot.imshow()` function makes assumptions about the shape of the input DataArray: since it has three channels, these are interpreted as the RGB channels. It does not work directly on image arrays with more than 3 channels. One can replace one of the RGB channels with another band to make a false-color image.
+
+:::challenge
+## Exercise: set the plotting aspect ratio
+As seen in the figure above, the true-color image is stretched. Let's visualize it with the right aspect ratio. You can use the [documentation](https://xarray.pydata.org/en/stable/generated/xarray.DataArray.plot.imshow.html) of `DataArray.plot.imshow()`.
+
+::::solution
+Since we know the height/width ratio is 1:1 (check the `rio.height` and `rio.width` attributes), we can set the aspect ratio to be 1. For example, we can choose the size to be 5 inches, and set `aspect=1`. Note that according to the [documentation](https://xarray.pydata.org/en/stable/generated/xarray.DataArray.plot.imshow.html) of `DataArray.plot.imshow()`, when specifying the `aspect` argument, `size` also needs to be provided.
+
+```python
+raster_ams_overview.plot.imshow(size=5, aspect=1)
+```
+
+![Amsterdam true color overview equal aspect](../fig/E06-05-overview-plot-true-color-aspect-equal.png)
+
+::::
+:::
+
+:::keypoints
+- `rioxarray` and `xarray` are for working with multidimensional arrays like pandas is for working with tabular data.
+- `rioxarray` stores CRS information as a CRS object that can be converted to an EPSG code or PROJ4 string.
+- Missing raster data are filled with nodata values, which should be handled with care for statistics and visualization.
+:::
\ No newline at end of file
diff --git a/episodes/03-vector-data-in-python.md b/episodes/03-vector-data-in-python.md
new file mode 100644
index 00000000..37fe8dfb
--- /dev/null
+++ b/episodes/03-vector-data-in-python.md
@@ -0,0 +1,342 @@
+---
+title: "Vector data in Python"
+teaching: 30
+exercises: 20
+---
+
+:::questions
+- How can I load point, line and polygon vector data?
+- How can I visualize them?
+:::
+
+:::objectives
+- Know the difference between point, line, and polygon vector elements.
+- Load point, line, and polygon vector data with `geopandas`.
+- Access the attributes of a spatial object with `geopandas`.
+:::
+
+## Introduction
+
+As discussed in [Episode 2: Introduction to Vector Data]({{site.baseurl}}/02-intro-vector-data), vector data represents specific features on the Earth's surface using points, lines and polygons. These geographic elements can then have one or more attributes assigned to them, such as 'name' and 'population' for a city, or crop type for a field. Vector data can be much smaller in (file) size than raster data, while being very rich in terms of the information captured.
+
+In this episode, we will be moving from working with raster data to working with vector data. We will use Python to open
+and plot point, line and polygon vector data. In particular, we will make use of the [`geopandas`](https://geopandas.org/en/stable/)
+package to open, manipulate and write vector datasets. `geopandas` extends the popular `pandas` library for data
+analysis to geospatial applications. The main `pandas` objects (the `Series` and the `DataFrame`) are expanded by
+including geometric types, represented in Python using the `shapely` library, and by providing dedicated methods for
+spatial operations (union, intersection, etc.).
+
+:::callout
+## Introduce the Vector Data
+
+The data we will use comes from the Dutch government's open geodata sets, obtained from the [PDOK platform](https://www.pdok.nl/).
+It provides open data for various applications, e.g. real estate, infrastructure, agriculture, etc. In this episode we
+will use three data sets:
+
+- [Crop fields](https://www.pdok.nl/introductie/-/article/basisregistratie-gewaspercelen-brp-) (polygons)
+- [Waterways](https://www.pdok.nl/introductie/-/article/nationaal-wegen-bestand-nwb-) (lines)
+- [Ground water monitoring wells](https://www.pdok.nl/downloads/-/article/basisregistratie-ondergrond-bro-) (points)
+:::
+
+In later episodes, we will learn how to work with raster and vector data together and combine them into a single plot.
+
+## Import Vector Datasets
+
+```python
+import geopandas as gpd
+```
+
+We will use the `geopandas` module to load the crop field vector data we downloaded: `data/brpgewaspercelen_definitief_2020_small.gpkg`. This file contains data for the entire European portion of the Netherlands, resulting in a very large number of crop field parcels. Directly loading the whole file into memory can be slow. Let's take northern Amsterdam as our Area of Interest (AoI). Since it is only a small portion of the Netherlands, we only need to load a small part of the file.
+
+We define a bounding box, and will only read the data within the extent of the bounding box.
+```python
+# Define bounding box
+xmin, xmax = (110_000, 140_000)
+ymin, ymax = (470_000, 510_000)
+bbox = (xmin, ymin, xmax, ymax)
+```
+
+:::callout
+## How should I define my bounding box?
+For simplicity, here we assume the **Coordinate Reference System (CRS)** and **extent** of the vector file are known (for instance, they are provided in the dataset documentation). Some Python tools, e.g. [`fiona`](https://fiona.readthedocs.io/en/latest/) (which is also the backend of `geopandas`), provide file inspection functionality without the need to read the full dataset into memory. An example can be found in [the documentation of fiona](https://fiona.readthedocs.io/en/latest/manual.html#format-drivers-crs-bounds-and-schema).
+:::
+
+Using the `bbox` input argument, we can load only the spatial features intersecting the provided bounding box.
+
+```python
+# Partially load data within the bounding box
+cropfield = gpd.read_file("data/brpgewaspercelen_definitief_2020_small.gpkg", bbox=bbox)
+```
+
+## Vector Metadata & Attributes
+When we import the vector dataset to Python (as our `cropfield` object) it comes in as a `DataFrame`, specifically a `GeoDataFrame`. The `read_file()` function also automatically stores
+geospatial information about the data. We are particularly interested in describing the format, CRS, extent, and other components of
+the vector data, and the attributes which describe properties associated
+with each individual vector object.
+
+## Spatial Metadata
+Key metadata includes:
+
+1. **Object Type:** the class of the imported object.
+2. **Coordinate Reference System (CRS):** the projection of the data.
+3. **Extent:** the spatial extent (i.e. geographic area that the data covers). Note that the spatial extent for a vector dataset represents the combined
+extent for all spatial objects in the dataset.
+
+Each `GeoDataFrame` has a `"geometry"` column that contains geometries. In the case of our `cropfield` object, this geometry is represented by a `shapely.geometry.Polygon` object. `geopandas` uses the `shapely` library to represent polygons, lines, and points, so the types are inherited from `shapely`.
+
+We can view the metadata using the `.crs`, `.bounds` and `.type` attributes. First, let's view the
+geometry type for our crop field dataset. To view the geometry type, we use the `.type` attribute of the `GeoDataFrame` object, `cropfield`.
+
+```python
+cropfield.type
+```
+
+```output
+0        Polygon
+1        Polygon
+2        Polygon
+3        Polygon
+4        Polygon
+          ...
+22026    Polygon
+22027    Polygon
+22028    Polygon
+22029    Polygon
+22030    Polygon
+Length: 22031, dtype: object
+```
+
+To view the CRS metadata:
+
+```python
+cropfield.crs
+```
+
+```output
+<Projected CRS: EPSG:28992>
+Name: Amersfoort / RD New
+Axis Info [cartesian]:
+- X[east]: Easting (metre)
+- Y[north]: Northing (metre)
+Area of Use:
+- name: Netherlands - onshore, including Waddenzee, Dutch Wadden Islands and 12-mile offshore coastal zone.
+- bounds: (3.2, 50.75, 7.22, 53.7)
+Coordinate Operation:
+- name: RD New
+- method: Oblique Stereographic
+Datum: Amersfoort
+- Ellipsoid: Bessel 1841
+- Prime Meridian: Greenwich
+```
+
+Our data is in the CRS **RD New**. The CRS is critical to
+interpreting the object's extent values, as it specifies units. To find
+the extent of our dataset in the projected coordinates, we can use the `.total_bounds` attribute:
+
+```python
+cropfield.total_bounds
+```
+
+```output
+array([109222.03325 , 469461.512625, 140295.122125, 510939.997875])
+```
+
+This array contains, in order, the values for minx, miny, maxx and maxy, for the overall dataset. The spatial extent of a GeoDataFrame represents the geographic "edge" or location that is the furthest north, south, east, and west. Thus, it represents the overall geographic coverage of the spatial object. Image Source: National Ecological Observatory Network (NEON).
+
+![Extent image](../fig/E07-01-spatial_extent.png)
+
+We can convert these coordinates to a bounding box or acquire the index of the dataframe to access the geometry. Either of these polygons can be used to clip rasters (more on that later).
+
+## Selecting spatial features
+Sometimes, the loaded data can still be too large.
+We can cut it to an even smaller extent using the `.cx` indexer (note the use of square brackets instead of the round brackets used with functions and methods):
+
+```python
+# Define a bounding box in RD coordinates
+xmin, xmax = (120_000, 135_000)
+ymin, ymax = (485_000, 500_000)
+cropfield_crop = cropfield.cx[xmin:xmax, ymin:ymax]
+```
+
+This will cut out a smaller area, defined by a box in units of the projection, discarding the rest of the data. The resulting GeoDataFrame, which includes all the features intersecting the box, is found in the `cropfield_crop` object. Note that the edge elements are not 'cropped' themselves. We can check the total bounds of this new data as before:
+
+```python
+cropfield_crop.total_bounds
+```
+
+```output
+array([119594.384, 484949.292625, 135375.77025, 500782.531])
+```
+
+We can then save this cropped dataset for future use, using the `to_file()` method of our GeoDataFrame object:
+
+```python
+cropfield_crop.to_file('cropped_field.shp')
+```
+
+This will write it to disk (in this case, in 'shapefile' format), containing only the data from our cropped area. It can be read in again at a later time using the `read_file()` method we have been using above. Note that this actually writes multiple files to disk (`cropped_field.cpg`, `cropped_field.dbf`, `cropped_field.prj`, `cropped_field.shp`, `cropped_field.shx`). All these files should ideally be present in order to re-read the dataset later, although only the `.shp`, `.shx`, and `.dbf` files are mandatory (see the [Introduction to Vector Data]({{site.baseurl}}/02-intro-to-vector-data) lesson for more information).
+
+## Plotting a vector dataset
+
+We can now plot this data. Any `GeoDataFrame` can be plotted in CRS units to view the shape of the object with `.plot()`.
+
+```python
+cropfield_crop.plot()
+```
+
+We can customize our boundary plot by setting the
+`figsize`, `edgecolor`, and `facecolor`. Making some polygons transparent will come in handy when we need to add multiple spatial datasets to a single plot.
+
+```python
+cropfield_crop.plot(figsize=(5,5), edgecolor="purple", facecolor="None")
+```
+
+![Cropped fields plot image](../fig/E07-02-cropped_fields_plot_output.png)
+
+Under the hood, `geopandas` is using `matplotlib` to generate this plot. In the next episode we will see how we can add `DataArrays` and other vector datasets to this plot to start building an informative map of our area of interest.
+
+## Spatial Data Attributes
+We introduced the idea of spatial data attributes in [an earlier lesson]({{site.baseurl}}/02-intro-to-vector-data). Now we will explore
+how to use spatial data attributes stored in our data to plot
+different features.
+
+:::challenge
+## Exercise: Import Line and Point Vector Datasets
+Using the steps above, load the waterways and groundwater well vector datasets (`data/status_vaarweg.zip` and
+`data/brogmwvolledigeset.zip`, respectively) into Python using `geopandas`. Name your variables `waterways_nl` and
+`wells_nl` respectively.
+
+Answer the following questions:
+1. What type of spatial features (points, lines, polygons) are present in each dataset?
+2. What is the CRS and total extent (bounds) of each dataset?
+3. How many spatial features are present in each file?
+
+::::solution
+First we import the datasets:
+```python
+waterways_nl = gpd.read_file("data/status_vaarweg.zip")
+wells_nl = gpd.read_file("data/brogmwvolledigeset.zip")
+```
+
+Then we check the types:
+```python
+waterways_nl.type
+```
+
+```python
+wells_nl.type
+```
+We also check the CRS and extent of each object:
+```python
+print(waterways_nl.crs)
+print(waterways_nl.total_bounds)
+print(wells_nl.crs)
+print(wells_nl.total_bounds)
+```
+To see the number of features in each file, we can print the dataset objects in a Jupyter notebook or use the `len()` function.
+`waterways_nl` contains 91 lines and `wells_nl` contains 51664 points.
+::::
+:::
+
+:::challenge
+## Exercise: Investigate the waterway lines
+Now we will take a closer look at the Dutch waterway lines in `waterways_nl`. Let's visualize it with the `plot` function. Can you tell what is wrong with this vector file?
+
+::::solution
+By plotting out the vector file, we can tell that the latitude and longitude of the file are flipped.
+```python
+waterways_nl.plot()
+```
+
+![Wrong waterways](../fig/E07-03-waterways-wrong.png)
+
+::::
+:::
+
+:::callout
+## Axis ordering
+According to the standards, the axis ordering for a CRS should follow the definition provided by the competent authority. For the commonly used EPSG:4326 geographic coordinate system, the EPSG defines the ordering as first latitude, then longitude.
+However, in the GIS world, it is customary to work with coordinate tuples where the first component is aligned with the east/west direction and the second component is aligned with the north/south direction.
+Multiple software packages thus implement this convention also when dealing with EPSG:4326.
+As a result, one can encounter vector files that implement either convention - keep this in mind and always check your datasets!
+:::
+
+## Modify the geometry of a GeoDataFrame
+
+Sometimes we need to modify the `geometry` of a `GeoDataFrame`. For example, as we have seen in the previous exercise **Investigate the waterway lines**, the latitude and longitude are flipped in the vector data `waterways_nl`. This error needs to be fixed before performing further analysis.
+
+Let's first take a look at what makes up the `geometry` column of `waterways_nl`:
+
+```python
+waterways_nl['geometry']
+```
+
+```output
+0     LINESTRING (52.41810 4.84060, 52.42070 4.84090...
+1     LINESTRING (52.11910 4.67450, 52.11930 4.67340...
+2     LINESTRING (52.10090 4.25730, 52.10390 4.25530...
+3     LINESTRING (53.47250 6.84550, 53.47740 6.83840...
+4     LINESTRING (52.32270 5.14300, 52.32100 5.14640...
+                            ...
+86    LINESTRING (51.49270 5.39100, 51.48050 5.39160...
+87    LINESTRING (52.15900 5.38510, 52.16010 5.38340...
+88    LINESTRING (51.97340 4.12420, 51.97110 4.12220...
+89    LINESTRING (52.11910 4.67450, 52.11850 4.67430...
+90    LINESTRING (51.88940 4.61900, 51.89040 4.61350...
+Name: geometry, Length: 91, dtype: geometry
+```
+
+Each row is a `LINESTRING` object. We can further zoom into one of the rows, for example, the third row:
+
+```python
+print(waterways_nl['geometry'][2])
+print(type(waterways_nl['geometry'][2]))
+```
+
+```output
+LINESTRING (52.100900002 4.25730000099998, 52.1039 4.25529999999998, 52.111299999 4.24929999900002, 52.1274 4.23449999799999)
+<class 'shapely.geometry.linestring.LineString'>
+```
+
+As we can see in the output, the `LINESTRING` object contains a list of coordinates of the vertices. In our situation, we would like to find a way to flip the x and y of every coordinate pair.
+A good way to look for the solution is to use the [documentation](https://shapely.readthedocs.io/en/stable/manual.html) of the `shapely` package, since we are seeking to modify the `LINESTRING` object. Here we are going to use the [`shapely.ops.transform`](https://shapely.readthedocs.io/en/stable/manual.html?highlight=shapely.ops.transform#shapely.ops.transform) function, which applies a user-defined function to all coordinates of a geometry.
+
+```python
+import shapely
+
+# Define a function flipping the x and y coordinate values
+def flip(geometry):
+    return shapely.ops.transform(lambda x, y: (y, x), geometry)
+
+# Apply this function to all coordinates and all lines
+geom_corrected = waterways_nl['geometry'].apply(flip)
+```
+
+Then we can update the `geometry` column with the corrected geometry `geom_corrected`, and visualize it to check:
+```python
+# Update geometry
+waterways_nl['geometry'] = geom_corrected
+
+# Visualization
+waterways_nl.plot()
+```
+
+![Corrected waterways](../fig/E07-04-waterways-corrected.png)
+
+Now the waterways look good! We can save the vector data for later use:
+```python
+# Save the corrected geometry to disk
+waterways_nl.to_file('waterways_nl_corrected.shp')
+```
+:::keypoints
+- Vector dataset metadata include geometry type, CRS, and extent.
+- Load spatial objects into Python with the `geopandas.read_file()` function.
+- Spatial objects can be plotted directly with the `GeoDataFrame`'s `.plot()` method.
+:::
diff --git a/episodes/04-crop-raster-data.md b/episodes/04-crop-raster-data.md
new file mode 100644
index 00000000..6a3b475d
--- /dev/null
+++ b/episodes/04-crop-raster-data.md
@@ -0,0 +1,410 @@
+---
+title: "Crop raster data with rioxarray and geopandas"
+teaching: 70
+exercises: 30
+---
+
+:::questions
+- How can I crop my raster data to the area of interest?
+:::
+
+:::objectives
+- Crop raster data with a bounding box.
+- Crop raster data with a polygon.
+- Crop raster data within a geometry buffer.
+:::
+
+It is quite common that the raster data you have in hand is too large to process, or that not all the pixels are relevant to your area of interest (AoI). In both situations, you should consider cropping your raster data before performing data analysis.
+
+In this episode, we will introduce how to crop raster data to the desired area. We will use one Sentinel-2 image over Amsterdam as the example raster data, and introduce how to crop your data to different types of AoIs.
+
+:::callout
+## Introduce the Data
+
+We'll continue from the results of the satellite image search that we have carried out in an exercise from
+[a previous episode]({{site.baseurl}}/05-access-data). We will load data starting from the `search.json` file.
+
+If you would like to work with the data for this lesson without downloading data on-the-fly, you can download the
+raster data using this [link](https://figshare.com/ndownloader/files/36028100). Save the `geospatial-python-raster-dataset.tar.gz`
+file in your current working directory, and extract the archive file by double-clicking on it or by running the
+following command in your terminal: `tar -zxvf geospatial-python-raster-dataset.tar.gz`. Use the file `geospatial-python-raster-dataset/search.json`
+(instead of `search.json`) to get started with this lesson.
+
+We also use the vector data that has already been introduced in the episode [Vector data in Python]({{site.baseurl}}/07-vector-data-in-python).
+:::
+
+## Crop raster data with a bounding box
+
+We load a true color image using `pystac` and `rioxarray` and check the shape of the raster:
+
+```python
+import pystac
+import rioxarray
+
+# Load image and inspect the shape
+items = pystac.ItemCollection.from_file("search.json")
+true_color_image = rioxarray.open_rasterio(items[1].assets["visual"].href)  # Select a true color image
+print(true_color_image.shape)
+```
+
+```output
+(3, 10980, 10980)
+```
+
+The large size of the raster data makes it time- and memory-consuming to visualize in its entirety. Instead, we can plot the "overview" asset to investigate the coverage of the image.
+
+```python
+# Get the overview asset
+overview_image = rioxarray.open_rasterio(items[1].assets["overview"].href)
+print(overview_image.shape)
+
+# Visualize it
+overview_image.plot.imshow(figsize=(8,8))
+```
+![Overview of the raster](../fig/E08-01-crop-raster-overview-raster-00.png)
+
+As we can see, the overview image is much smaller than the original true color image, so the visualization is much faster. If we are interested in the crop fields, then we would like to know where these are located in the image. To compare their coverage with the raster data, we first check the coordinate systems of both raster and vector data. For raster data, we use `pyproj.CRS`:
+
+```python
+from pyproj import CRS
+
+# Check the coordinate system
+CRS(true_color_image.rio.crs)
+```
+
+```output
+<Projected CRS: EPSG:32631>
+Name: WGS 84 / UTM zone 31N
+Axis Info [cartesian]:
+- [east]: Easting (metre)
+- [north]: Northing (metre)
+Area of Use:
+- undefined
+Coordinate Operation:
+- name: UTM zone 31N
+- method: Transverse Mercator
+Datum: World Geodetic System 1984
+- Ellipsoid: WGS 84
+- Prime Meridian: Greenwich
+```
+
+To open and check the coordinate system of vector data, we use `geopandas`:
+
+```python
+import geopandas as gpd
+
+# Load the polygons of the crop fields
+cf_boundary_crop = gpd.read_file("cropped_field.shp")
+
+# Check the coordinate system
+cf_boundary_crop.crs
+```
+
+```output
+<Projected CRS: EPSG:28992>
+Name: Amersfoort / RD New
+Axis Info [cartesian]:
+- X[east]: Easting (metre)
+- Y[north]: Northing (metre)
+Area of Use:
+- name: Netherlands - onshore, including Waddenzee, Dutch Wadden Islands and 12-mile offshore coastal zone.
+- bounds: (3.2, 50.75, 7.22, 53.7)
+Coordinate Operation:
+- name: RD New
+- method: Oblique Stereographic
+Datum: Amersfoort
+- Ellipsoid: Bessel 1841
+- Prime Meridian: Greenwich
+```
+
+As seen, the coordinate systems differ. To crop the raster using the shapefile,
+we first convert the coordinate system of `cf_boundary_crop` to the coordinate
+system of `true_color_image`, and then check the coverage:
+
+```python
+from matplotlib import pyplot as plt
+
+# Convert the coordinate system
+cf_boundary_crop = cf_boundary_crop.to_crs(true_color_image.rio.crs)
+
+# Plot
+fig, ax = plt.subplots()
+fig.set_size_inches((8,8))
+
+# Plot image
+overview_image.plot.imshow(ax=ax)
+
+# Plot crop fields
+cf_boundary_crop.plot(
+    ax=ax,
+    edgecolor="red",
+)
+```
+
+![Bounding boxes of AoI over the raster](../fig/E08-02-crop-raster-bounding-box-01.png)
+
+Judging from the location of the polygons, the crop fields (red) only take up a small part of
+the raster. Therefore, before the actual processing, we can first crop the raster to
+our area of interest. The `clip_box` function allows one to crop a raster by the
+min/max of the x and y coordinates. Note that we are cropping the original image `true_color_image` now, and not the overview image `overview_image`.
+
+```python
+# Crop the raster with the bounding box
+raster_clip_box = true_color_image.rio.clip_box(*cf_boundary_crop.total_bounds)
+print(raster_clip_box.shape)
+```
+
+```output
+(3, 1565, 1565)
+```
+
+We successfully cropped the raster to a much smaller piece. We can visualize it now:
+
+```python
+raster_clip_box.plot.imshow(figsize=(8,8))
+```
+![Crop raster by a bounding box](../fig/E08-03-crop-raster-crop-by-bb-02.png)
+
+This cropped image can be saved for later usage:
+```python
+raster_clip_box.rio.to_raster("raster_clip.tif")
+```
+
+## Crop raster data with polygons
+
+We have a cropped image around the fields. To further analyze the fields, one may want to crop the image to the exact field boundaries.
+This can be done with the `clip` function:
+
+```python
+raster_clip_fields = raster_clip_box.rio.clip(cf_boundary_crop['geometry'])
+```
+
+And we can visualize the results:
+```python
+raster_clip_fields.plot.imshow(figsize=(8,8))
+```
+
+![Raster cropped by crop fields](../fig/E08-04-crop-raster-crop-fields-solution-06.png)
+
+We can save this image for later usage:
+```python
+raster_clip_fields.rio.to_raster("crop_fields.tif")
+```
+
+## Crop raster data with a geometry buffer
+
+It is not always the case that the AoI comes in the form of a polygon. Sometimes one would like to perform analysis around a (set of) point(s), or polyline(s). For example, in our AoI there are also some groundwater monitoring wells available as point vector data. One may also want to perform analysis around these wells. The location of the wells is stored in `data/groundwater_monitoring_well`.
+
+We can first load the wells vector data, and select the wells within the coverage of the image:
+
+```python
+# Load wells
+wells = gpd.read_file("data/brogmwvolledigeset.zip")
+wells = wells.to_crs(raster_clip_box.rio.crs)
+
+# Crop the wells to the image extent
+xmin, ymin, xmax, ymax = raster_clip_box.rio.bounds()
+wells = wells.cx[xmin:xmax, ymin:ymax]
+```
+
+Then we can check the location of the wells:
+```python
+# Plot the wells over the raster
+fig, ax = plt.subplots()
+fig.set_size_inches((8,8))
+raster_clip_box.plot.imshow(ax=ax)
+wells.plot(ax=ax, color='red', markersize=2)
+```
+
+![Groundwater monitoring wells](../fig/E08-05-crop-raster-wells-04.png)
+
+To select pixels around the geometries, one needs to first define a region including the geometries. This region is called a "buffer" and it is defined in the units of the projection. The size of the buffer depends on the analysis in your research. A buffer is also a polygon, which can be used to crop the raster data. `geopandas` objects have a `buffer` method to generate buffer polygons.
+
+```python
+# Create a 200m buffer around the wells
+wells_buffer = wells.buffer(200)
+```
+
+Now let's see what the buffers look like in the image:
+```python
+# Visualize the buffers over the raster
+fig, ax = plt.subplots()
+fig.set_size_inches((8,8))
+raster_clip_box.plot.imshow(ax=ax)
+wells_buffer.plot(ax=ax, color='red')
+```
+
+![Buffers around the wells over the raster](../fig/E08-06-crop-raster-well-buffers-over-raster-05.png)
+
+The red dots have grown larger, indicating the conversion from points to buffer polygons.
+
+:::challenge
+## Exercise: Select the raster data around the wells
+Now we have the buffer polygons around the groundwater monitoring wells, i.e. `wells_buffer`. Can you crop the image `raster_clip_box` to the buffer polygons? Can you visualize the results of the cropping?
+
+::::solution
+
+```python
+# Crop
+raster_clip_wells = raster_clip_box.rio.clip(wells_buffer)
+
+# Visualize the cropped buffers
+raster_clip_wells.plot.imshow()
+```
+
+![Raster cropped by buffers around wells](../fig/E08-07-crop-raster-crop-by-well-buffers-05.png)
+
+::::
+:::
+
+:::challenge
+## Exercise: Select the raster data around the waterways
+In the previous episode we corrected the waterway vector data and saved it in `waterways_nl_corrected.shp`. Can you select all the raster data within 100m around the waterways, and visualize the results?
+
+::::solution
+
+```python
+# Load waterways polyline and convert CRS
+waterways_nl = gpd.read_file("waterways_nl_corrected.shp")
+waterways_nl = waterways_nl.to_crs(raster_clip_box.rio.crs)
+
+# Crop the waterways to the image extent
+waterways_nl = waterways_nl.cx[xmin:xmax, ymin:ymax]
+
+# Create a 100m buffer around the waterways
+waterways_nl_buffer = waterways_nl.buffer(100)
+
+# Crop
+raster_clip_waterways = raster_clip_box.rio.clip(waterways_nl_buffer)
+
+# Visualize
+raster_clip_waterways.plot.imshow(figsize=(8,8))
+```
+
+![Raster cropped by buffers around waterways](../fig/E08-08-crop-waterways-07.png)
+
+::::
+:::
+
+## Crop raster data using another raster dataset
+
+So far we have learned how to crop a raster image with vector data. We can also crop a raster with another raster dataset. In this section, we will demonstrate how to crop the `true_color_image` image using the `crop_fields` image.
+
+:::callout
+## Using the `crop_fields` raster image
+For this section, we will use the `crop_fields.tif` image that was produced in the section "**Crop raster data with polygons**".
+:::
+
+We read in the `crop_fields.tif` image. For demonstration purposes, we will reproject it to the RD CRS, so it will be in a different CRS from the `true_color_image`:
+```python
+# Read crop_fields
+crop_fields = rioxarray.open_rasterio("crop_fields.tif")
+
+# Reproject to RD to make the CRS different from the "true_color_image"
+crop_fields = crop_fields.rio.reproject("EPSG:28992")
+CRS(crop_fields.rio.crs)
+```
+
+```output
+<Projected CRS: EPSG:28992>
+Name: Amersfoort / RD New
+Axis Info [cartesian]:
+- [east]: Easting (metre)
+- [north]: Northing (metre)
+Area of Use:
+- undefined
+Coordinate Operation:
+- name: unnamed
+- method: Oblique Stereographic
+Datum: Amersfoort
+- Ellipsoid: Bessel 1841
+- Prime Meridian: Greenwich
+```
+
+And let's check again the CRS of `true_color_image`:
+
+```python
+# Get CRS of true_color_image
+CRS(true_color_image.rio.crs)
+```
+
+```output
+<Projected CRS: EPSG:32631>
+Name: WGS 84 / UTM zone 31N
+Axis Info [cartesian]:
+- [east]: Easting (metre)
+- [north]: Northing (metre)
+Area of Use:
+- undefined
+Coordinate Operation:
+- name: UTM zone 31N
+- method: Transverse Mercator
+Datum: World Geodetic System 1984
+- Ellipsoid: WGS 84
+- Prime Meridian: Greenwich
+```
+
+Now the two images are in different coordinate systems. We can
+use the `reproject_match` method to crop the `true_color_image` image.
+It performs both the reprojection and the cropping operation.
+This might take a few minutes, because the `true_color_image` image is large.
+
+```python
+# Crop and reproject
+cropped_raster = true_color_image.rio.reproject_match(crop_fields)
+
+# Visualize
+cropped_raster.plot.imshow(figsize=(8,8))
+```
+
+![Raster cropped by raster](../fig/E08-09-crop-raster-raster-intro-08.png)
+
+In this way, we accomplish the reprojection and cropping in one go.
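+
+For comparison, here is a rough step-by-step sketch of what `reproject_match` combines. Note that this simplified version does not align the pixel grid and resolution exactly, and reprojecting the full image first is much slower:
+
+```python
+# A rough, step-by-step approximation of reproject_match (sketch only):
+# 1. Reproject the full image to the CRS of the target raster (slow!)
+reprojected = true_color_image.rio.reproject(crop_fields.rio.crs)
+
+# 2. Clip the reprojected image to the bounding box of the target raster
+cropped_manual = reprojected.rio.clip_box(*crop_fields.rio.bounds())
+```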
+
+:::challenge
+## Exercise
+
+This time let's do it the other way around, by cropping the `crop_fields` image using the `true_color_image` image. Discuss the results.
+
+::::solution
+
+```python
+# Crop
+cropped_raster = crop_fields.rio.reproject_match(true_color_image)
+
+# Visualize
+cropped_raster.plot.imshow(figsize=(8,8))
+```
+
+![Solution: raster cropped by raster](../fig/E08-10-crop-raster-raster-solution-08.png)
+
+::::
+:::
+
+In one line, `reproject_match` does a lot of helpful things:
+
+1. It reprojects.
+2. It matches the extent using `nodata` values or by clipping the data.
+3. It sets `nodata` values. This means we can run calculations on those two images.
+
+:::callout
+## Code Tip
+
+As we saw before, there also exists a method called `reproject()`, which only reprojects one raster to another projection. If you want more control over how rasters are resampled, clipped, and/or reprojected, you can use the `reproject()` method and other `rioxarray` methods individually.
+:::
+
+:::keypoints
+- Use `clip_box` in `DataArray.rio` to crop a raster with a bounding box.
+- Use `clip` in `DataArray.rio` to crop a raster with a given polygon.
+- Use `buffer` in `geopandas` to make a buffer polygon of a (multi)point or a polyline. This polygon can be used to crop data.
+- Use `reproject_match` in `DataArray.rio` to reproject and crop raster data using another raster dataset.
+:::
\ No newline at end of file
diff --git a/episodes/05-raster-calculations.md b/episodes/05-raster-calculations.md
new file mode 100644
index 00000000..45572f97
--- /dev/null
+++ b/episodes/05-raster-calculations.md
@@ -0,0 +1,381 @@
+---
+title: "Raster Calculations in Python"
+teaching: 60
+exercises: 15
+---
+
+:::questions
+- How do I perform calculations on rasters and extract pixel values for defined locations?
+:::
+
+:::objectives
+- Carry out operations with two rasters using Python's built-in math operators.
+- Reclassify a continuous raster to a categorical raster.
+:::
+
+## Introduction
+
+We often want to combine values of and perform calculations on rasters to create a new output raster. This episode
+covers how to perform basic math operations using raster datasets. It also illustrates how to match rasters with
+different resolutions so that they can be used in the same calculation. As an example, we will calculate a vegetation
+index over one of the satellite scenes.
+
+### Normalized Difference Vegetation Index (NDVI)
+Suppose we are interested in monitoring vegetation fluctuations using satellite remote sensors. Scientists have defined
+a vegetation index to quantify the amount of green leaf vegetation using the light reflected in different wavelengths.
+This index, named Normalized Difference Vegetation Index (NDVI), exploits the fact that healthy green leaves strongly
+absorb red visible light while they mostly reflect light in the near infrared (NIR). The NDVI is computed as:
+
+$$ NDVI = \frac{NIR - red}{NIR + red} $$
+
+where $NIR$ and $red$ label the reflectance values of the corresponding wavelengths. NDVI values range from -1 to +1.
+Values close to one indicate a high density of green leaves. Poorly vegetated areas typically have NDVI values close to
+zero. Negative NDVI values often indicate cloud and water bodies.
+
+![PONE-NDVI Source: Wu C-D, McNeely E, Cedeño-Laurent JG, Pan W-C, Adamkiewicz G, Dominici F, et al. (2014) Linking Student Performance in Massachusetts Elementary Schools with the “Greenness” of School Surroundings Using Remote Sensing. PLoS ONE 9(10): e108548. https://doi.org/10.1371/journal.pone.0108548](../fig/E09-01-PONE-NDVI.jpg)
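+
+As a quick sanity check of the formula, we can evaluate it for a single pixel with made-up reflectance values typical of healthy vegetation:
+
+```python
+# Illustrative (made-up) reflectances for a healthy-vegetation pixel
+nir, red = 0.45, 0.05
+print(round((nir - red) / (nir + red), 2))  # 0.8, i.e. dense green vegetation
+```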
+
+:::callout
+## More Resources
+Check out more on NDVI in the NASA Earth Observatory portal:
+[Measuring Vegetation](https://earthobservatory.nasa.gov/features/MeasuringVegetation/measuring_vegetation_2.php).
+:::
+
+### Load and crop the data
+For this episode, we will use one of the Sentinel-2 scenes that we have already employed in the previous episodes.
+
+:::callout
+## Introduce the Data
+
+We'll continue from the results of the satellite image search that we have carried out in an exercise from
+[a previous episode]({{site.baseurl}}/05-access-data). We will load data starting from the `search.json` file.
+
+If you would like to work with the data for this lesson without downloading data on-the-fly, you can download the
+raster data using this [link](https://figshare.com/ndownloader/files/36028100). Save the `geospatial-python-raster-dataset.tar.gz`
+file in your current working directory, and extract the archive file by double-clicking on it or by running the
+following command in your terminal: `tar -zxvf geospatial-python-raster-dataset.tar.gz`. Use the file `geospatial-python-raster-dataset/search.json`
+(instead of `search.json`) to get started with this lesson.
+:::
+
+Let's load the results of our initial imagery search using `pystac`:
+
+```python
+import pystac
+items = pystac.ItemCollection.from_file("search.json")
+```
+
+We then select the second item, and extract the URIs of the red and NIR bands ("B04" and "B8A", respectively):
+
+```python
+red_uri = items[1].assets["B04"].href
+nir_uri = items[1].assets["B8A"].href
+```
+
+Let's load the rasters with `open_rasterio`, using the argument `masked=True`:
+
+```python
+import rioxarray
+red = rioxarray.open_rasterio(red_uri, masked=True)
+nir = rioxarray.open_rasterio(nir_uri, masked=True)
+```
+
+Let's also restrict our analysis to the same crop field area defined in the previous episode by clipping the rasters
+using a bounding box:
+
+```python
+bbox = (629_000, 5_804_000, 639_000, 5_814_000)
+red_clip = red.rio.clip_box(*bbox)
+nir_clip = nir.rio.clip_box(*bbox)
+```
+
+We can now plot the two rasters. With `robust=True`, color values are stretched between the 2nd and 98th percentiles of
+the data, which results in a clearer distinction between high and low reflectances:
+
+```python
+red_clip.plot(robust=True)
+```
+
+![red-band](../fig/E09-02-red-band.png)
+
+```python
+nir_clip.plot(robust=True)
+```
+
+![NIR-band](../fig/E09-03-NIR-band.png)
+
+It is immediately evident how crop fields (rectangular shapes in the central part of the two figures) appear as dark and
+bright spots in the red-visible and NIR wavelengths, respectively, suggesting the presence of leafy crops at the time of
+observation (end of March). The same fields would instead appear as dark spots in the off season.
+
+## Raster Math
+We can perform raster calculations by subtracting (or adding,
+multiplying, etc.) two rasters. In the geospatial world, we call this
+"raster math", and typically it refers to operations on rasters that
+have the same width and height (including `nodata` pixels).
+We can check the shapes of the two rasters in the following way:
+```python
+print(red_clip.shape, nir_clip.shape)
+```
+
+```output
+(1, 1000, 1000) (1, 500, 500)
+```
+
+Both rasters include a single band, but their width and height do not match.
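+
+The mismatch reflects the different native resolutions of the two bands: B04 is distributed at 10 m and B8A at 20 m. We can verify this by inspecting the pixel sizes with the `rio` accessor:
+
+```python
+# Compare the pixel sizes of the two bands
+print(red_clip.rio.resolution())  # (10.0, -10.0) for the 10 m red band
+print(nir_clip.rio.resolution())  # (20.0, -20.0) for the 20 m NIR band
+```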
+
+We can now use the `reproject_match` function, which reprojects and clips
+a raster to match the CRS, extent, and resolution of another raster:
+
+```python
+red_clip_matched = red_clip.rio.reproject_match(nir_clip)
+print(red_clip_matched.shape)
+```
+
+```output
+(1, 500, 500)
+```
+
+Let's now compute the NDVI as a new raster using the formula presented above.
+We'll use `rioxarray` objects so that we can easily plot our result and keep
+track of the metadata.
+
+```python
+ndvi = (nir_clip - red_clip_matched) / (nir_clip + red_clip_matched)
+print(ndvi)
+```
+
+```output
+<xarray.DataArray (band: 1, y: 500, x: 500)>
+array([[[ 0.7379576 ,  0.77153456,  0.54531944, ...,  0.39254385,
+          0.49227372,  0.4465174 ],
+        [ 0.7024894 ,  0.7074668 ,  0.3903298 , ...,  0.423283  ,
+          0.4706971 ,  0.45964912],
+        [ 0.6557818 ,  0.5610572 ,  0.46742022, ...,  0.4510345 ,
+          0.43815723,  0.6005133 ],
+        ...,
+        [ 0.02391171,  0.21843003,  0.02479339, ..., -0.50923485,
+         -0.53367877, -0.4955414 ],
+        [ 0.11376493,  0.17681159, -0.1673566 , ..., -0.5221932 ,
+         -0.5271318 , -0.4852753 ],
+        [ 0.45398772, -0.00518135,  0.03346133, ..., -0.5019455 ,
+         -0.4987013 , -0.49081364]]], dtype=float32)
+Coordinates:
+  * band         (band) int64 1
+  * x            (x) float64 6.29e+05 6.29e+05 6.29e+05 ... 6.39e+05 6.39e+05
+  * y            (y) float64 5.814e+06 5.814e+06 ... 5.804e+06 5.804e+06
+    spatial_ref  int64 0
+```
+
+We can now plot the output NDVI:
+
+```python
+ndvi.plot()
+```
+
+![NDVI-map](../fig/E09-04-NDVI-map.png)
+
+Notice that the range of values for the output NDVI is between -1 and 1.
+Does this make sense for the selected region?
+
+Maps are great, but it can also be informative to plot histograms of values to better understand the distribution. We can accomplish this using the built-in xarray `plot` interface we have already been using, this time with its `hist` method:
+
+```python
+ndvi.plot.hist()
+```
+
+![NDVI-hist](../fig/E09-05-NDVI-hist.png)
+
+:::challenge
+## Exercise: Explore NDVI Raster Values
+
+It's often a good idea to explore the range of values in a raster dataset just like we might explore a dataset that we collected in the field. The histogram we just made is a good start, but there's more we can do to improve our understanding of the data.
+
+1. What are the minimum and maximum values for the NDVI raster (`ndvi`) that we just created? Are there missing values?
+2. Plot a histogram with 50 bins instead of 8. What do you notice that wasn't clear before?
+3. Plot the `ndvi` raster using breaks that make sense for the data.
+
+::::solution
+
+1) Recall that if there were nodata values in our raster,
+we would need to filter them out with `.where()`. Since we have loaded the rasters with `masked=True`, missing
+values are already encoded as `np.nan`'s. The `ndvi` array actually includes a single missing value.
+```python
+print(ndvi.min().values)
+print(ndvi.max().values)
+print(ndvi.isnull().sum().values)
+```
+
+```output
+-0.99864775
+0.9995788
+1
+```
+
+2) Increasing the number of bins gives us a much clearer view of the distribution. Also, there seem to be very few
+NDVI values larger than ~0.9.
+```python
+ndvi.plot.hist(bins=50)
+```
+
+![NDVI-hist-bins](../fig/E09-06-NDVI-hist-bins.png)
+
+3) We can discretize the color bar by specifying the intervals via the `levels` argument to `plot()`.
+Suppose we want to bin our data in the following intervals:
+* $-1 \le NDVI \lt 0$ for water;
+* $0 \le NDVI \lt 0.2$ for no vegetation;
+* $0.2 \le NDVI \lt 0.7$ for sparse vegetation;
+* $0.7 \le NDVI \lt 1$ for dense vegetation.
+
+```python
+class_bins = (-1, 0., 0.2, 0.7, 1)
+ndvi.plot(levels=class_bins)
+```
+
+![NDVI-map-binned](../fig/E09-07-NDVI-map-binned.png)
+::::
+:::
+
+Missing values can be interpolated from the values of neighbouring grid cells using the `.interpolate_na` method. We
+then save the interpolated raster `ndvi_nonan` as a GeoTIFF file:
+```python
+ndvi_nonan = ndvi.interpolate_na(dim="x")
+ndvi_nonan.rio.to_raster("NDVI.tif")
+```
+
+## Classifying Continuous Rasters in Python
+
+Now that we have a sense of the distribution of our NDVI raster, we
+can reduce the complexity of our map by classifying it. Classification involves
+assigning each pixel in the raster to a class based on its value. In Python, we
+can accomplish this using the `numpy.digitize` function.
+
+First, we define NDVI classes based on a list of values, as defined in the last exercise:
+`[-1, 0., 0.2, 0.7, 1]`. When bins are ordered from
+low to high, as here, `numpy.digitize` assigns classes like so:
+
+![NDVI-classes Source: Image created for this lesson ([license]({{ page.root }}{% link LICENSE.md %}))](../fig/E09-08-NDVI-classes.jpg)
+
+Note that, by default, each class includes the left but not the right bound. This is not an issue here, since the
+computed range of NDVI values is fully contained in the open interval (-1; 1) (see the exercise above).
+
+```python
+import numpy as np
+import xarray
+
+# Define the bins for pixel values
+class_bins = (-1, 0., 0.2, 0.7, 1)
+
+# The numpy.digitize function returns an unlabeled array, in this case, a
+# classified array without any metadata. That doesn't work here: we need the
+# coordinates and other spatial metadata. We can get around this using
+# xarray.apply_ufunc, which can run the function across the data array while
+# preserving metadata.
+ndvi_classified = xarray.apply_ufunc(
+    np.digitize,
+    ndvi_nonan,
+    class_bins
+)
+```
+
+Let's now visualize the classified NDVI, customizing the plot with a proper title and legend. We then export the
+figure in PNG format:
+
+```python
+import earthpy.plot as ep
+import matplotlib.pyplot as plt
+
+from matplotlib.colors import ListedColormap
+
+# Define the color map for the map legend
+ndvi_colors = ["blue", "gray", "green", "darkgreen"]
+ndvi_cmap = ListedColormap(ndvi_colors)
+
+# Define class names for the legend
+category_names = [
+    "Water",
+    "No Vegetation",
+    "Sparse Vegetation",
+    "Dense Vegetation"
+]
+
+# We need to know in what order the legend items should be arranged
+category_indices = list(range(len(category_names)))
+
+# Make the plot
+im = ndvi_classified.plot(cmap=ndvi_cmap, add_colorbar=False)
+plt.title("Classified NDVI")
+# earthpy helps us by drawing a legend given an existing image plot and legend items, plus indices
+ep.draw_legend(im_ax=im, classes=category_indices, titles=category_names)
+
+# Save the figure
+plt.savefig("NDVI_classified.png", bbox_inches="tight", dpi=300)
+```
+
+![NDVI-classified](../fig/E09-09-NDVI-classified.png)
+
+We can finally export the classified NDVI raster object to a GeoTIFF file. The `to_raster()` function
+by default writes the output file to your working directory unless you specify a
+full file path:
+
+```python
+ndvi_classified.rio.to_raster("NDVI_classified.tif", dtype="int32")
+```
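+
+As a quick (optional) check that the file contains what we expect, we could read it back and list the class values; with the bins above, `numpy.digitize` should have produced only the labels 1 to 4 (a sketch using the file we just wrote):
+
+```python
+# Read the classified raster back and inspect the unique class labels
+check = rioxarray.open_rasterio("NDVI_classified.tif")
+print(np.unique(check.values))  # expected: [1 2 3 4]
+```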
+
+:::challenge
+## Exercise: Compute the NDVI for the Texel island
+
+Data are often more interesting and powerful when we compare them across various
+locations. Let's compare the computed NDVI map with that of another region in the same Sentinel-2 scene:
+the [Texel island](https://en.wikipedia.org/wiki/Texel), located in the North Sea.
+
+0. You should have the red- and the NIR-band rasters already loaded (`red` and `nir` variables, respectively).
+1. Crop the two rasters using the following bounding box: `(610000, 5870000, 630000, 5900000)`. Don't forget to check the shape of the data, and make sure the cropped areas have the same CRSs, heights and widths.
+2. Compute the NDVI from the two raster layers and check the max/min values to make sure the data
+is what you expect.
+3. Plot the NDVI map and export the NDVI as a GeoTIFF file.
+4. Compare the distributions of NDVI values for the two regions investigated.
+
+::::solution
+1) We crop the area of interest using `clip_box`:
+```python
+bbox_texel = (610000, 5870000, 630000, 5900000)
+nir_texel = nir.rio.clip_box(*bbox_texel)
+red_texel = red.rio.clip_box(*bbox_texel)
+```
+
+2) Reproject and clip one raster to the extent of the smaller raster using `reproject_match`. The lines of code below
+assign a variable to the reprojected raster and calculate the NDVI.
+
+```python
+red_texel_matched = red_texel.rio.reproject_match(nir_texel)
+ndvi_texel = (nir_texel - red_texel_matched) / (nir_texel + red_texel_matched)
+```
+
+3) Plot the NDVI and save the raster data as a GeoTIFF file.
+
+```python
+ndvi_texel.plot()
+ndvi_texel.rio.to_raster("NDVI_Texel.tif")
+```
+
+![NDVI-map-Texel](../fig/E09-10-NDVI-map-Texel.png)
+
+4) Compute the NDVI histogram and compare it with the one of the region that we previously investigated. Many more grid
+cells have negative NDVI values, since the area of interest includes much more water. Also, NDVI values close to
+zero are more abundant, indicating the presence of bare ground (sand) regions.
+```python
+ndvi_texel.plot.hist(bins=50)
+```
+
+![NDVI-hist-Texel](../fig/E09-11-NDVI-hist-Texel.png)
+
+::::
+:::
+
+:::keypoints
+- Python's built-in math operators are fast and simple options for raster math.
+- `numpy.digitize` can be used to classify raster values in order to generate a less complicated map.
+:::
diff --git a/episodes/06-zonal-statistics.md b/episodes/06-zonal-statistics.md
new file mode 100644
index 00000000..22434e5f
--- /dev/null
+++ b/episodes/06-zonal-statistics.md
@@ -0,0 +1,178 @@
+---
+title: "Calculating Zonal Statistics on Rasters"
+teaching: 40
+exercises: 20
+---
+
+:::questions
+- How do I compute raster statistics on different zones delineated by vector data?
+:::
+
+:::objectives
+- Extract zones from a vector dataset.
+- Convert vector data to raster.
+- Calculate raster statistics over zones.
+:::
+
+## Introduction
+
+Statistics on predefined zones of the raster data are commonly used for analysis and to better understand the data. These zones are often provided within a single vector dataset, identified by certain vector attributes. For example, in the previous episodes, we used the crop field polygon dataset. The fields with the same crop type can be identified as a "zone", resulting in multiple zones in one vector dataset. One may be interested in performing statistical analysis over these crop zones.
+
+In this episode, we will explore how to calculate zonal statistics based on the types of crops in `fields_cropped.shp`. To do this, we will first identify zones from the vector data, then rasterize these vector zones. Finally, the zonal statistics for `ndvi` will be calculated over the rasterized zones.
+
+## Making vector and raster data compatible
+First, let's load the `NDVI.tif` file saved in the previous episode to obtain our calculated raster `ndvi` data. We also use the `squeeze()` function in order to reduce the raster data `ndvi` to 2D by removing the singular `band` dimension; this is necessary for use with the `rasterize` and `zonal_stats` functions:
+
+```python
+import rioxarray
+ndvi = rioxarray.open_rasterio("NDVI.tif").squeeze()
+```
+
+Let's also read the crop fields vector data from our saved `fields_cropped.shp` file:
+
+```python
+import geopandas as gpd
+fields = gpd.read_file('fields_cropped.shp')
+```
+
+In order to use the vector data as a classifier for our raster, we need to convert the vector data to the appropriate CRS. We can perform the CRS conversion from the vector CRS (EPSG:28992) to our raster `ndvi` CRS (EPSG:32631) with:
+```python
+# Unify the CRS
+fields_utm = fields.to_crs(ndvi.rio.crs)
+```
+
+## Rasterizing the vector data
+
+Before calculating zonal statistics, we first need to rasterize our `fields_utm` vector geodataframe with the `rasterio.features.rasterize` function. With this function, we aim to produce a grid with numerical values representing the types of crops, as defined by the column `gewascode` of `fields_utm`; `gewascode` stands for the crop codes as defined by the Netherlands Enterprise Agency (RVO) for different types of crop or `gewas` (Grassland, permanent; Grassland, temporary; corn fields; etc.). This grid of values thus defines the zones for the `xrspatial.zonal_stats` function, where each pixel in the zone grid overlaps with a corresponding pixel in our NDVI raster.
+
+We can generate the `geometry, gewascode` pairs for each vector feature to be used as the first argument to `rasterio.features.rasterize` as:
+
+```python
+geom = fields_utm[['geometry', 'gewascode']].values.tolist()
+geom
+```
+
+```output
+[[<shapely.geometry.polygon.Polygon object at 0x...>, 265],
+ [<shapely.geometry.polygon.Polygon object at 0x...>, 265],
+ [<shapely.geometry.polygon.Polygon object at 0x...>, 265],
+ [<shapely.geometry.polygon.Polygon object at 0x...>, 265],
+ [<shapely.geometry.polygon.Polygon object at 0x...>, 265],
+ [<shapely.geometry.polygon.Polygon object at 0x...>, 265],
+...
+ [<shapely.geometry.polygon.Polygon object at 0x...>, 265],
+ [<shapely.geometry.polygon.Polygon object at 0x...>, 265],
+ [<shapely.geometry.polygon.Polygon object at 0x...>, 265],
+ [<shapely.geometry.polygon.Polygon object at 0x...>, 331],
+ ...]
+```
+
+This generates a list of the shapely geometries from the `geometry` column, and the crop-type code from the `gewascode` column of the `fields_utm` geodataframe.
+
+We can now rasterize our vector data using `rasterio.features.rasterize`:
+
+```python
+from rasterio import features
+fields_rasterized = features.rasterize(geom, out_shape=ndvi.shape, transform=ndvi.rio.transform())
+```
+
+The argument `out_shape` specifies the shape of the output grid in pixel units, while `transform` represents the projection from pixel space to the projected coordinate space. By default, the pixels that are not contained within a polygon in our shapefile will be filled with 0. It's important to pick a fill value that is not the same as any value already defined in `gewascode`, otherwise we would not be able to distinguish that zone from the background.
+
+Let's inspect the results of the rasterization:
+
+```python
+import numpy as np
+print(fields_rasterized.shape)
+print(np.unique(fields_rasterized))
+```
+
+```output
+(500, 500)
+[  0 259 265 266 331 332 335 863]
+```
+
+The output `fields_rasterized` is an `np.ndarray` with the same shape as `ndvi`. It contains `gewascode` values at the location of fields, and 0 outside the fields. Let's visualize it:
+
+```python
+from matplotlib import pyplot as plt
+plt.imshow(fields_rasterized)
+plt.colorbar()
+```
+
+![Rasterization results](../fig/E10-01-rasterization-results.png)
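+
+Beyond listing the unique values, we can also count how many pixels each zone covers (a quick check with `numpy`):
+
+```python
+# Count the number of pixels per gewascode (0 is the background fill value)
+codes, counts = np.unique(fields_rasterized, return_counts=True)
+print(dict(zip(codes.tolist(), counts.tolist())))
+```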
+
+We will convert the output to an `xarray.DataArray`, which will be used in the next step. To do this, we will "borrow" the coordinates from `ndvi`, and fill in the rasterization data:
+```python
+import xarray as xr
+fields_rasterized_xarr = ndvi.copy()
+fields_rasterized_xarr.data = fields_rasterized
+
+# Visualize
+fields_rasterized_xarr.plot(robust=True)
+```
+
+![Rasterization results Xarray](../fig/E10-02-rasterization-results-xr.png)
+
+## Calculate zonal statistics
+
+In order to calculate the statistics for each crop zone, we call the `xrspatial.zonal_stats` function. It takes as input `zones`, a 2D `xarray.DataArray` that defines the different zones, and `values`, a 2D `xarray.DataArray` providing the input values for calculating statistics.
+
+We call the `zonal_stats` function with `fields_rasterized_xarr` as our classifier and the 2D raster with our values of interest `ndvi` to obtain the NDVI statistics for each crop type:
+
+```python
+from xrspatial import zonal_stats
+zonal_stats(fields_rasterized_xarr, ndvi)
+```
+
+```output
+   zone      mean       max       min           sum       std       var     count
+0     0  0.266531  0.999579 -0.998648  38887.648438  0.409970  0.168075  145903.0
+1   259  0.520282  0.885242  0.289196    449.003052  0.111205  0.012366     863.0
+2   265  0.775609  0.925955  0.060755  66478.976562  0.091089  0.008297   85712.0
+3   266  0.794128  0.918048  0.544686   1037.925781  0.074009  0.005477    1307.0
+4   331  0.703056  0.905304  0.142226  10725.819336  0.102255  0.010456   15256.0
+5   332  0.681699  0.849158  0.178113    321.080261  0.123633  0.015285     471.0
+6   335  0.648063  0.865804  0.239661    313.662598  0.146582  0.021486     484.0
+7   863  0.388575  0.510572  0.185987      1.165724  0.144245  0.020807       3.0
+```
+
+The `zonal_stats` function calculates the minimum, maximum, and sum for each zone, along with statistical measures such as the mean, variance and standard deviation for each rasterized vector zone. In our raster dataset, `zone = 0`, corresponding to non-crop areas, has the highest count, followed by `zone = 265`, which corresponds to 'Grasland, blijvend' or 'Grassland, permanent'. The highest mean NDVI is observed for `zone = 266`, 'Grassland, temporary', with the lowest mean, aside from the non-crop area, going to `zone = 863`, representing 'Forest without replanting obligation'. Thus, the `zonal_stats` function can be used to analyze and understand different sections of our raster data. The definition of the zones can be derived from vector data or from classified raster data, as presented in the challenge below:
+
+:::challenge
+## Exercise: Calculate zonal statistics for zones defined by `ndvi_classified`
+
+Let's calculate NDVI zonal statistics for the different zones as classified by `ndvi_classified` in the previous episode.
+
+Load both raster datasets: `NDVI.tif` and `NDVI_classified.tif`. Then, calculate zonal statistics for each of the classes defined by `class_bins`. Inspect the output of the `zonal_stats` function.
+
+::::solution
+1) Load and convert the raster data into suitable inputs for `zonal_stats`:
+```python
+ndvi = rioxarray.open_rasterio("NDVI.tif").squeeze()
+ndvi_classified = rioxarray.open_rasterio("NDVI_classified.tif").squeeze()
+```
+2) Create and display the zonal statistics table:
+```python
+zonal_stats(ndvi_classified, ndvi)
+```
+
+```output
+   zone      mean       max       min           sum       std       var    count
+0     1 -0.355660 -0.000257 -0.998648 -12838.253906  0.145916  0.021291  36097.0
+1     2  0.110731  0.199839  0.000000   1754.752441  0.055864  0.003121  15847.0
+2     3  0.507998  0.700000  0.200000  50410.167969  0.140193  0.019654  99233.0
+3     4  0.798281  0.999579  0.700025  78888.523438  0.051730  0.002676  98823.0
+```
+::::
+:::
+
+:::keypoints
+- Zones can be extracted from attribute columns of a vector dataset.
+- Zones can be rasterized using `rasterio.features.rasterize`.
+- Calculate zonal statistics with `xrspatial.zonal_stats` over the rasterized zones.
+:::
\ No newline at end of file
diff --git a/episodes/07-parallel-raster-computations.md b/episodes/07-parallel-raster-computations.md
new file mode 100644
index 00000000..86ac47ae
--- /dev/null
+++ b/episodes/07-parallel-raster-computations.md
@@ -0,0 +1,363 @@
+---
+title: "Parallel raster computations using Dask"
+teaching: 45
+exercises: 25
+---
+
+:::questions
+- How can I parallelize computations on rasters with Dask?
+- How can I determine if parallelization improves calculation speed?
+- What are good practices in applying parallelization to my raster calculations?
+:::
+
+:::objectives
+- Profile the timing of the raster calculations.
+- Open raster data as a chunked array.
+- Recognize good practices in selecting proper chunk sizes.
+- Set up raster calculations that take advantage of parallelization.
+:::
+
+## Introduction
+
+Very often, raster computations involve applying the same operation to different pieces of data. Think, for instance, of
+the "pixel"-wise sum of two raster datasets, where the same sum operation is applied to all the matching grid cells of
+the two rasters. This class of tasks can benefit from chunking the input raster(s) into smaller pieces: operations on
+different pieces can be run in parallel using multiple computing units (e.g., multi-core CPUs), thus potentially
+speeding up calculations. In addition, working on chunked data can lead to smaller memory footprints, since one
+may bypass the need to store the full dataset in memory by processing it chunk by chunk.
+
+In this episode, we will introduce the use of Dask in the context of raster calculations. Dask is a Python library for
+parallel and distributed computing. It provides a framework to work with different data structures, including chunked
+arrays (Dask Arrays). Dask is well integrated with (`rio`)`xarray` objects, which can use Dask arrays as underlying
+data structures.
+
+:::callout
+
+## Dask
+
+This episode shows how Dask can be used to parallelize operations on local CPUs. However, the same library can be
+configured to run tasks on large compute clusters.
+
+More resources on Dask:
+
+- [Dask](https://dask.org) and [Dask Array](https://docs.dask.org/en/stable/array.html).
+- [Xarray with Dask](https://xarray.pydata.org/en/stable/user-guide/dask.html).
+
+:::
+
+It is important to realize, however, that many details determine the extent to which using Dask's chunked arrays instead
+of regular Numpy arrays leads to faster calculations (and lower memory requirements). The actual operations to carry
+out, the size of the dataset, and parameters such as the chunks' shape and size all affect the performance of our
+computations. Depending on the specifics of the calculations, serial calculations might actually turn out to be faster!
+Being able to profile the computational time is thus essential, and we will see how to do that in a Jupyter environment
+in the next section.
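+
+As a minimal illustration of what "chunking" means, the sketch below builds a small Dask array and inspects how it is split into blocks (this assumes the `dask` package is installed; the chunk shape is chosen arbitrarily):
+
+```python
+import dask.array as da
+
+# A 10980 x 10980 array of ones, split into 2048 x 2048 blocks
+x = da.ones((10980, 10980), chunks=(2048, 2048))
+print(x.numblocks)  # number of blocks along each dimension, here (6, 6)
+```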
+
+## Time profiling in Jupyter
+
+:::callout
+
+## Introduce the Data
+
+We'll continue from the results of the satellite image search that we have carried out in an exercise from
+[a previous episode](./01-access-data.md). We will load data starting from the `search.json` file.
+
+If you would like to work with the data for this lesson without downloading data on-the-fly, you can download the
+raster data using this [link](https://figshare.com/ndownloader/files/36028100). Save the `geospatial-python-raster-dataset.tar.gz`
+file in your current working directory, and extract the archive file by double-clicking on it or by running the
+following command in your terminal: `tar -zxvf geospatial-python-raster-dataset.tar.gz`. Use the file `geospatial-python-raster-dataset/search.json`
+(instead of `search.json`) to get started with this lesson.
+:::
+
+Let's set up a raster calculation using assets from our previous search of satellite scenes. We first load the item
+collection using the `pystac` library:
+
+```python
+import pystac
+items = pystac.ItemCollection.from_file("search.json")
+```
+
+We select the last scene, and extract the URLs of two assets: the true-color image ("visual") and the scene
+classification layer ("SCL"). The latter is a mask where each grid cell is assigned a label that represents a specific
+class, e.g. "4" for vegetation, "6" for water, etc. (all classes and labels are reported in the
+[Sentinel-2 documentation](https://sentinels.copernicus.eu/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm),
+see Figure 3):
+
+```python
+assets = items[-1].assets  # last item's assets
+visual_href = assets["visual"].href  # true color image
+scl_href = assets["SCL"].href  # scene classification layer
+```
+
+Opening the two assets with `rioxarray` shows that the true-color image is available as a raster file with 10 m
+resolution, while the scene classification layer has a lower resolution (20 m):
+
+```python
+import rioxarray
+scl = rioxarray.open_rasterio(scl_href)
+visual = rioxarray.open_rasterio(visual_href)
+print(scl.rio.resolution(), visual.rio.resolution())
+```
+
+```output
+(20.0, -20.0) (10.0, -10.0)
+```
+
+In order to match the image and the mask pixels, we take advantage of a feature of the cloud-optimized GeoTIFF (COG)
+format, which is used to store these raster files. COGs typically include multiple lower-resolution versions of the
+original image, called "overviews", in the same file. This allows us to avoid downloading high-resolution images when only
+quick previews are required.
+
+Overviews are often computed using powers of 2 as down-sampling (or zoom) factors (e.g. 2, 4, 8, 16). For the true-color
+image we thus open the first level overview (zoom factor 2) and check that the resolution is now also 20 m:
+
+```python
+visual = rioxarray.open_rasterio(visual_href, overview_level=0)
+print(visual.rio.resolution())
+```
+
+```output
+(20.0, -20.0)
+```
+
+We can now time profile the first step of our raster calculation: the (down)loading of the rasters' content.
We do it by using the Jupyter magic `%%time`, which returns the time required to run the content of a cell:
+
+```python
+%%time
+scl = scl.load()
+visual = visual.load()
+```
+
+```output
+CPU times: user 729 ms, sys: 852 ms, total: 1.58 s
+Wall time: 40.5 s
+```
+
+```python
+visual.plot.imshow(figsize=(10,10))
+scl.squeeze().plot.imshow(levels=range(13), figsize=(12,10))
+```
+
+![Scene true color image](../fig/E11-01-true-color-image.png)
+![Scene classification](../fig/E11-02-scene-classification.png)
+
+After loading the raster files into memory, we run the following steps:
+
+- We create a mask of the grid cells that are labeled as "cloud" in the scene classification layer (values "8" and "9",
+  standing for medium- and high-cloud probability, respectively).
+- We use this mask to set the corresponding grid cells in the true-color image to null values.
+- We save the masked image to disk in COG format.
+
+Again, we measure the cell execution time using `%%time`:
+
+```python
+%%time
+mask = scl.squeeze().isin([8, 9])
+visual_masked = visual.where(~mask, other=visual.rio.nodata)
+visual_masked.rio.to_raster("band_masked.tif")
+```
+
+```output
+CPU times: user 270 ms, sys: 366 ms, total: 636 ms
+Wall time: 647 ms
+```
+
+We can inspect the masked image as:
+
+```python
+visual_masked.plot.imshow(figsize=(10, 10))
+```
+
+![True color image masked](../fig/E11-03-true-color-image_masked.png)
+
+In the following section we will see how to parallelize these raster calculations, and we will compare the timings with the
+serial calculations that we have just run.
+
+## Dask-powered rasters
+
+### Chunked arrays
+
+As we have mentioned, `rioxarray` supports the use of Dask's chunked arrays as underlying data structure. When opening
+a raster file with `open_rasterio` and providing the `chunks` argument, Dask arrays are employed instead of regular
+Numpy arrays. `chunks` describes the shape of the blocks into which the data will be split. As an example, we
+open the blue band raster ("B02") using a chunk shape of `(1, 4000, 4000)` (block size of `1` in the first dimension and
+of `4000` in the second and third dimensions):
+
+```python
+blue_band_href = assets["B02"].href
+blue_band = rioxarray.open_rasterio(blue_band_href, chunks=(1, 4000, 4000))
+```
+
+Xarray and Dask also provide a graphical representation of the raster data array and of its blocked structure.
+
+![Xarray Dask-backed DataArray](../fig/E11-04-xarray-with-dask.png)
+
+:::challenge
+
+### Exercise: Chunk sizes matter
+
+We have already seen how COGs are regular GeoTIFF files with a special internal structure. Another feature of COGs is
+that data is organized in "blocks" that can be accessed remotely via independent HTTP requests, enabling partial file
+readings. This is useful if you want to access only a portion of your raster file, but it also allows for efficient
+parallel reading. You can check the block size employed in a COG file with the following code snippet:
+
+```python
+import rasterio
+with rasterio.open(visual_href) as r:
+    if r.is_tiled:
+        print(f"Chunk size: {r.block_shapes}")
+```
+
+In order to optimally access COGs, it is best to align the block size of the file with the chunks employed when loading
+the file. Open the blue-band asset ("B02") of a Sentinel-2 scene as a chunked `DataArray` object using a suitable
+chunk size. Which elements do you think should be considered when choosing the chunk size?
+
+::::solution
+
+### Solution
+
+```python
+import rasterio
+with rasterio.open(blue_band_href) as r:
+    if r.is_tiled:
+        print(f"Chunk size: {r.block_shapes}")
+```
+
+```output
+Chunk size: [(1024, 1024)]
+```
+
+Ideal chunk size values for this raster are thus multiples of 1024. An element to consider is the number of
+resulting chunks and their size. Chunks should not be too big nor too small (i.e. not too many). As a rule of thumb,
+chunk sizes of 100 MB typically work well with Dask (see, e.g., this
+[blog post](https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes)). Also, the shape might be relevant,
+depending on the application! Here, we might select a chunk shape of `(1, 6144, 6144)`:
+
+```python
+band = rioxarray.open_rasterio(blue_band_href, chunks=(1, 6144, 6144))
+```
+
+which leads to chunks 72 MB large: ((1 x 6144 x 6144) x 2 bytes / 2^20 = 72 MB). Also, we can let `rioxarray` and Dask
+figure out appropriate chunk shapes by setting `chunks="auto"`:
+
+```python
+band = rioxarray.open_rasterio(blue_band_href, chunks="auto")
+```
+
+which leads to `(1, 8192, 8192)` chunks (128 MB).
+
+::::
+:::
+
+### Parallel computations
+
+Operations performed on a `DataArray` that has been opened as a chunked Dask array are executed using Dask. Dask
+coordinates how the operations should be executed on the individual chunks of data, and runs these tasks in parallel as
+much as possible.
+
+Let's now repeat the raster calculations that we have carried out in the previous section, but running the calculations in
+parallel over a multi-core CPU. We first open the relevant rasters as chunked arrays:
+
+```python
+scl = rioxarray.open_rasterio(scl_href, lock=False, chunks=(1, 2048, 2048))
+visual = rioxarray.open_rasterio(visual_href, overview_level=0, lock=False, chunks=(3, 2048, 2048))
+```
+
+Setting `lock=False` tells `rioxarray` that the individual data chunks can be loaded simultaneously from the source by
+the Dask workers.
+
+As the next step, we trigger the download of the data using the `.persist()` method, see below. This makes sure that the downloaded
+chunks are stored in the form of a chunked Dask array (calling `.load()` would instead merge the chunks into a single
+Numpy array).
+
+We explicitly tell Dask to parallelize the required workload over 4 threads. Don't forget to add the Jupyter magic to
+record the timing!
+
+```python
+%%time
+scl = scl.persist(scheduler="threads", num_workers=4)
+visual = visual.persist(scheduler="threads", num_workers=4)
+```
+
+```output
+CPU times: user 1.18 s, sys: 806 ms, total: 1.99 s
+Wall time: 12.6 s
+```
+
+So downloading chunks of data using 4 workers gave a speed-up of almost 4 times (40.5 s vs 12.6 s)!
+
+Let's now continue to the second step of the calculation. Note how the same syntax as for its serial version is employed
+for creating and applying the cloud mask. Only the raster saving includes additional arguments:
+
+- `tiled=True`: write the raster as a chunked GeoTIFF.
+- `lock=threading.Lock()`: the threads which are splitting the workload must "synchronize" when writing to the same file
+  (they might otherwise overwrite each other's output).
+- `compute=False`: do not immediately run the calculation; more on this later.
+
+```python
+from threading import Lock
+```
+
+```python
+%%time
+mask = scl.squeeze().isin([8, 9])
+visual_masked = visual.where(~mask, other=0)
+visual_store = visual_masked.rio.to_raster("band_masked.tif", tiled=True, lock=Lock(), compute=False)
+```
+
+```output
+CPU times: user 13.3 ms, sys: 4.98 ms, total: 18.3 ms
+Wall time: 17.8 ms
+```
+
+Did we just observe a 36x speed-up when comparing to the serial calculation (647 ms vs 17.8 ms)? Actually, no
+calculation has run yet. This is because operations performed on Dask arrays are executed "lazily", i.e. they are not
+immediately run.
+
+:::callout
+
+### Dask graph
+
+The sequence of operations to carry out is stored in a task graph, which can be visualized with:
+
+```python
+import dask
+dask.visualize(visual_store)
+```
+
+![Dask graph](../fig/E11-05-dask-graph.png)
+
+The task graph gives Dask the complete "overview" of the calculation, thus enabling a better management of tasks and
+resources when dispatching calculations to be run in parallel.
+:::
+
+While most methods of `DataArray` run operations lazily when Dask arrays are employed, some methods by default
+trigger immediate calculations, like the method `to_raster()` (we have changed this behaviour by specifying
+`compute=False`). In order to trigger the calculations, we can use the `.compute()` method. Again, we explicitly tell Dask
+to run tasks on 4 threads. Let's time the cell execution:
+
+```python
+%%time
+visual_store.compute(scheduler="threads", num_workers=4)
+```
+
+```output
+CPU times: user 532 ms, sys: 488 ms, total: 1.02 s
+Wall time: 791 ms
+```
+
+The timing that we have recorded for this step is now closer to the one recorded for the serial calculation (the
+parallel calculation actually took slightly longer). The explanation for this behaviour lies in the overhead that Dask
+introduces to manage the tasks in the Dask graph. This overhead, which is typically of the order of milliseconds per
+task, can be larger than the parallelization gain, and this is typically the case for calculations with small chunks
+(note that here we have used chunks that are only 4 to 32 MB large).
+
+:::keypoints
+- The `%%time` Jupyter magic command can be used to profile calculations.
+- Data "chunks" are the unit of parallelization in raster calculations.
+- (`rio`)`xarray` can open raster files as chunked arrays.
+- The chunk shape and size can significantly affect the calculation performance.
+- Cloud-optimized GeoTIFFs have an internal structure that enables performant parallel read.
+:::