diff --git a/01-intro-raster-data.md b/01-intro-raster-data.md
index c56b9dd5..22b4f4b9 100644
--- a/01-intro-raster-data.md
+++ b/01-intro-raster-data.md
@@ -15,49 +15,32 @@ exercises: 5
 - Describe the strengths and weaknesses of storing data in raster format.
 - Distinguish between continuous and categorical raster data and identify types of datasets that would be stored in each format.
 :::
+
 ## Introduction

-This episode introduces the two primary types of geospatial
-data: rasters and vectors. After briefly introducing these
-data types, this episode focuses on raster data, describing
-some major features and types of raster data.
+This episode introduces the two primary types of data models that are used to digitally represent the earth's surface: raster and vector. After briefly introducing these data models, this episode focuses on the raster representation, describing some major features and types of raster data. This workshop will focus on how to work with both raster and vector data sets; it is therefore essential that we understand the basic structures of these types of data and the types of phenomena that they can represent.

 ## Data Structures: Raster and Vector

-The two primary types of geospatial data are raster
-and vector data. Raster data is stored as a grid of values which are rendered on a
-map as pixels. Each pixel value represents an area on the Earth's surface. Vector data structures represent specific features on the
-Earth's surface, and
-assign attributes to those features. Vector data structures
-will be discussed in more detail in [the next episode](02-intro-vector-data.md).
-
-This workshop will focus on how to work with both raster and vector
-data sets, therefore it is essential that we understand the
-basic structures of these types of data and the types of data
-that they can be used to represent.
-
-### About Raster Data
-
-Raster data is any pixelated (or gridded) data where each pixel is associated
-with a specific geographic location. The value of a pixel can be
-continuous (e.g. elevation) or categorical (e.g. land use). If this sounds
-familiar, it is because this data structure is very common: it's how
-we represent any digital image. A geospatial raster is only different
-from a digital photo in that it is accompanied by spatial information
-that connects the data to a particular location. This includes the
-raster's extent and cell size, the number of rows and columns, and
-its coordinate reference system (or CRS).
+The two primary data models that are used to represent the earth's surface digitally are raster and vector. **Raster data** is stored as a grid of values which are rendered on a map as pixels—also known as cells—where each pixel (or cell) holds a value that represents an area of the earth's surface. Examples of raster data are satellite images or aerial photographs. Data stored according to the **vector data** model are represented by points, lines, or polygons. Examples of vector representations are points of interest, buildings—often represented as building footprints—or roads.
+
+Representing phenomena as vector data allows you to attach attribute information to them. For instance, a polygon of a house can carry multiple attributes with information about its address, such as the street name, house number, zip code, and city. Vector data will be discussed in more detail in the [next episode](02-intro-vector-data.md).
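+Before comparing the two models further, it helps to make the raster idea concrete: a raster is essentially a two-dimensional array of values. A minimal sketch using plain `numpy` (the elevation values are invented for illustration):
+
+```python
+import numpy as np
+
+# a tiny 3x3 "raster": each cell holds an elevation value in meters
+elevation = np.array([
+    [315, 320, 332],
+    [308, 312, 325],
+    [300, 305, 318],
+])
+print(elevation.shape)  # 3 rows x 3 columns
+```
+
+What turns such an array into a *geospatial* raster is the accompanying spatial metadata (extent, cell size, CRS), which is discussed in the next section.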
+When working with spatial information, you will find that many phenomena can be represented as either vector or raster data. A house, for instance, can be represented by a set of raster cells that all have the same value, or by a vector polygon carrying attribute information (figure 1). Which data model a phenomenon is stored in depends on the purpose for which the data was collected and is intended to be used. As a rule of thumb, discrete phenomena like buildings, roads, trees, and signs are represented as vector data, whereas continuous phenomena like temperature, wind speed, and elevation are represented as raster data. Still, one of the things a spatial data analyst often has to do is transform data from vector to raster or the other way around. Keep in mind that such conversions can degrade data quality.
+
+### Raster Data
+
+Raster data is any pixelated (or gridded) data where each pixel has a value and is associated with a specific geographic location. The value of a pixel can be continuous (e.g., elevation, temperature) or categorical (e.g., land-use type). If this sounds familiar, it is because this data structure is very common: it's how we represent any digital image. A geospatial raster is only different from a digital photo in that it is accompanied by spatial information that connects the data to a particular location. This includes the raster's extent and cell size, the number of rows and columns, and its Coordinate Reference System (CRS), which will be explained in [episode 3](03-crs.md) of this workshop.

 ![Raster Concept (Source: National Ecological Observatory Network (NEON))](fig/E01/raster_concept.png){alt="raster concept"}

 Some examples of continuous rasters include:

 1. Precipitation maps.
-2. Maps of tree height derived from LiDAR data.
-3. Elevation values for a region.
+2. Elevation maps.

-A map of elevation for Harvard Forest derived from the [NEON AOP LiDAR sensor](https://www.neonscience.org/data-collection/airborne-remote-sensing)
+A map of elevation for *Harvard Forest* derived from the [NEON AOP LiDAR sensor](https://www.neonscience.org/data-collection/airborne-remote-sensing)
 is below. Elevation is represented as a continuous numeric variable in this map. The legend
 shows the continuous range of values in the data from around 300 to 420 meters.

@@ -69,8 +52,7 @@ continuous value such as elevation or temperature.
 Some examples of classified maps include:

 1. Landcover / land-use maps.
-2. Tree height maps classified as short, medium, and tall trees.
-3. Elevation maps classified as low, medium, and high elevation.
+2. Elevation maps classified as low, medium, and high elevation.

 ![USA landcover classification](fig/E01/USA_landcover_classification.png){alt="USA landcover classification"}

@@ -147,12 +129,12 @@ of changes in resolution.
 ### Raster Data Format for this Workshop

 Raster data can come in many different formats. For this workshop, we will use
-the GeoTIFF format which has the extension `.tif`. A `.tif` file stores metadata
-or attributes about the file as embedded `tif tags`. For instance, your camera
+the GeoTIFF format which has the extension `.tif`, since it is one of the most commonly used formats.
+A `.tif` file stores metadata or attributes about the file as embedded `tif tags`. For instance, your camera
 might store a tag that describes the make and model of the camera or the date the
 photo was taken when it saves a `.tif`. A GeoTIFF is a standard `.tif` image
 format with additional spatial (georeferencing) information embedded in the file
-as tags. These tags should include the following raster metadata:
+as tags.
+These tags include the following raster metadata:

 1. Extent
 2. Resolution
@@ -174,14 +156,11 @@ from a GeoTIFF file.
 ### Multi-band Raster Data

 A raster can contain one or more bands. One type of multi-band raster
-dataset that is familiar to many of us is a color
-image. A basic color image consists of three bands: red, green, and blue.
-Each
+dataset that is familiar to many of us is a color image. A basic color
+image often consists of three bands: red, green, and blue (RGB). Each
 band represents light reflected from the red, green or blue portions of
-the
-electromagnetic spectrum. The pixel brightness for each band, when
-composited
-creates the colors that we see in an image.
+the electromagnetic spectrum. The pixel brightness for each band, when
+composited, creates the colors that we see in an image.

 ![RGB multi-band raster image (Source: National Ecological Observatory Network (NEON).)](fig/E01/RGBSTack_1.jpg){alt="multi-band raster"}

diff --git a/02-intro-vector-data.md b/02-intro-vector-data.md
index 0962201c..66a9f661 100644
--- a/02-intro-vector-data.md
+++ b/02-intro-vector-data.md
@@ -34,7 +34,9 @@ vertex that has a defined x, y location.

 * **Polygons:** A polygon consists of 3 or more vertices that are connected and
 closed. The outlines of survey plot boundaries, lakes, oceans, and states or
-countries are often represented by polygons.
+countries are often represented by polygons. Note that polygons can also contain one
+or multiple holes, for instance a plot boundary with a lake in it. Such polygons are
+considered *complex* or *donut* polygons.

 :::callout
 ## Data Tip
@@ -56,8 +58,11 @@ are represented by which vector type.

 ::::solution
 ## Solution
-State boundaries are polygons. The Fisher Tower location is
-a point. There are no line features shown.
+State boundaries are shown as polygons. The Fisher Tower location is
+represented by a purple point. There are no line features shown.
+Note that at a different scale the Fisher Tower could also have been represented as a polygon.
+Keep in mind that the purpose for which a dataset is created, and for which it is intended
+to be used, determines which vector type it uses.
 ::::
 :::

@@ -66,14 +71,14 @@ Vector data has some important advantages:

 * The geometry itself contains information about what the dataset creator thought was important
 * The geometry structures hold information in themselves - why choose point over polygon, for instance?
 * Each geometry feature can carry multiple attributes instead of just one, e.g. a database of cities can have attributes for name, country, population, etc
-* Data storage can be very efficient compared to rasters
+* Data storage can, depending on the scale, be very efficient compared to rasters
+* For network analysis, for instance calculating the shortest route between A and B, topologically correct lines are essential. This is not possible with raster data.

 The downsides of vector data include:

-* Potential loss of detail compared to raster
-* Potential bias in datasets - what didn't get recorded?
+* Potential bias in datasets - what didn't get recorded? Vector data are often interpreted datasets, like topographical maps, that have been collected by someone else, for another purpose.
 * Calculations involving multiple vector layers need to do math on the
-  geometry as well as the attributes, so can be slow compared to raster math.
+  geometry as well as the attributes, which can potentially be slow compared to raster calculations.
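+As a small illustration of geometries carrying attributes, here is a sketch using `geopandas` and `shapely` (assuming both are installed; the coordinates are approximate and the population figures invented):
+
+```python
+import geopandas as gpd
+from shapely.geometry import Point
+
+# two point features, each carrying several attributes
+cities = gpd.GeoDataFrame(
+    {
+        "name": ["Rhodes", "Athens"],
+        "population": [50000, 650000],
+        "geometry": [Point(28.22, 36.44), Point(23.73, 37.98)],
+    },
+    crs="EPSG:4326",
+)
+print(cities)
+```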
 Vector datasets are in use in many industries besides geospatial fields. For
 instance, computer graphics are largely vector-based, although the data
@@ -85,8 +90,9 @@ their features to real-world locations.
 ## Vector Data Format for this Workshop

 Like raster data, vector data can also come in many different formats. For this
-workshop, we will use the Shapefile format. A Shapefile format consists of multiple
-files in the same directory, of which `.shp`, `.shx`, and `.dbf` files are mandatory. Other non-mandatory but very important files are `.prj` and `shp.xml` files.
+workshop, we will use the GeoPackage format. GeoPackage is developed by the [Open Geospatial Consortium](https://www.ogc.org/) and is *an open, standards-based, platform-independent, portable, self-describing, compact format for transferring geospatial information* (source: [https://www.geopackage.org/](https://www.geopackage.org/)). A GeoPackage file, **.gpkg**, is a single file that contains the geometries of features, their attributes, and information about the CRS used.
+
+Another vector format that you will probably come across quite often is the Shapefile. Although we will not be using that format in this workshop, we do believe it is useful to understand how it works. A Shapefile consists of multiple files in the same directory, of which the `.shp`, `.shx`, and `.dbf` files are mandatory. Other non-mandatory but very important files are the `.prj` and `shp.xml` files.

 - The `.shp` file stores the feature geometry itself
 - `.shx` is a positional index of the feature geometry to allow quickly searching forwards
 and backwards the geographic coordinates of each vertex in the vector
@@ -114,8 +120,8 @@ objects in a single shapefile.

 More about shapefiles can be found on
 [Wikipedia.](https://en.wikipedia.org/wiki/Shapefile) Shapefiles are often publicly
-available from government services, such as [this page from the US Census Bureau][us-cb] or
-[this one from Australia's Data.gov.au website](https://data.gov.au/data/dataset?res_format=SHP).
+available from government services, such as [this page containing all administrative boundaries for countries in the world](https://gadm.org/download_country.html) or
+[topographical vector data from OpenStreetMap](https://download.geofabrik.de/).
 :::

 :::callout
@@ -131,9 +137,9 @@ effects of particular data manipulations are more predictable if you are
 confident that all of your input data has the same characteristics.
 :::

-[us-cb]: https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
 :::keypoints
 - Vector data structures represent specific features on the Earth's surface along with attributes of those features.
+- Vector data is often interpreted data, collected by someone else for a purpose different from the one you may want to use it for.
 - Vector objects are either points, lines, or polygons.
 :::
diff --git a/03-crs.md b/03-crs.md
index fcdd91f2..8acbf27a 100644
--- a/03-crs.md
+++ b/03-crs.md
@@ -27,22 +27,22 @@ CRS (coordinate reference system) and SRS (spatial reference system) are synonym
 will use only CRS throughout this workshop.
 :::

-The CRS associated with a dataset tells your mapping software (for example Python)
+The CRS associated with a dataset tells your mapping software
 where the raster is located in geographic space. It also tells the mapping
 software what method should be used to flatten or project the raster in
 geographic space.
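+As a quick illustration of what "flattening" or "projecting" means in code, here is a sketch using `pyproj` (assuming the library is available; the EPSG codes are chosen for illustration): it transforms a longitude/latitude pair into projected coordinates in meters.
+
+```python
+from pyproj import Transformer
+
+# from WGS84 (EPSG:4326, degrees) to UTM zone 35N (EPSG:32635, meters)
+transformer = Transformer.from_crs("EPSG:4326", "EPSG:32635", always_xy=True)
+x, y = transformer.transform(27.95, 36.20)  # longitude, latitude
+print(x, y)  # easting and northing in meters
+```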
-![Maps of the United States in different projections (Source: opennews.org)](https://media.opennews.org/cache/06/37/0637aa2541b31f526ad44f7cb2db7b6c.jpg){alt="US difference projections"}
-
-The above image shows maps of the United States in different projections. Notice
+The image below (figure 3.1) shows maps of the United States in different projections. Notice
 the differences in shape associated with each projection. These differences are
 a direct result of the calculations used to flatten the data onto a
 2-dimensional map.

+![Figure 3.1: Maps of the United States in different projections (Source: opennews.org)](https://media.opennews.org/cache/06/37/0637aa2541b31f526ad44f7cb2db7b6c.jpg){alt="US difference projections"}
+
 There are lots of great resources that describe coordinate reference systems and
 projections in greater detail. For the purposes of this workshop, what is
 important to understand is that data from the same location but saved in
-different projections will not line up in any GIS or other program. Thus, it's
+different projections will not line up. Thus, it is
 important when working with spatial data to identify the coordinate reference
 system applied to the data and retain it throughout data processing and
 analysis.

@@ -55,14 +55,14 @@ CRS information has three components:

 degrees) and defines the starting point (i.e. where is [0,0]?) so the angles
 reference a meaningful spot on the earth. Common global datums are WGS84 and
 NAD83. Datums can also be local - fit to a particular area of the globe, but
-ill-fitting outside the area of intended use. In this workshop, we will use the
-[WGS84
-datum](https://www.linz.govt.nz/data/geodetic-system/datums-projections-and-heights/geodetic-datums/world-geodetic-system-1984-wgs84).
+ill-fitting outside the area of intended use. For instance, local cadastre, land registry, and mapping agencies require high-quality datasets, which can be obtained using a local system. In this workshop, we will use the
+[WGS84 datum](https://www.linz.govt.nz/data/geodetic-system/datums-projections-and-heights/geodetic-datums/world-geodetic-system-1984-wgs84). The datum is often also referred to as the Geographic Coordinate System.

 * **Projection:** A mathematical transformation of the angular measurements on a
 round earth to a flat surface (i.e. paper or a computer screen). The units
 associated with a given projection are usually linear (feet, meters, etc.). In
 this workshop, we will see data in two different projections.
+Note that the projection is also often referred to as the Projected Coordinate System.

 * **Additional Parameters:** Additional parameters are often necessary to create
 the full coordinate reference system. One common additional parameter is a
@@ -91,6 +91,8 @@ stem of the fruit. What other parameters could be included in this analogy?

 ## Which projection should I use?

+A well-known projection is the [Mercator projection](https://en.wikipedia.org/wiki/Mercator_projection), introduced by the Flemish cartographer Gerardus Mercator in the 16th century. This is a so-called cylindrical projection, meaning that a virtual cylinder is placed on the globe to flatten it. It is relatively accurate near the equator, but blows things up towards the poles (see [Cylindrical projection](https://gisgeography.com/cylindrical-projection/)).
+The main advantage of the Mercator projection is that it is very suitable for navigation purposes, since it always shows north as *up* and south as *down*; in the 17th century this projection was essential for sailors to navigate the oceans.
+
 To decide if a projection is right for your data, answer these questions:

 * What is the area of minimal distortion?
@@ -212,9 +214,11 @@ generated and maintained manually.

 * [Choosing the Right Map Projection.](https://source.opennews.org/en-US/learning/choosing-right-map-projection/)
 * [Video](https://www.youtube.com/embed/KUF_Ckv8HbE) highlighting how map projections can make continents seems proportionally larger or smaller than they actually are.
+* [The True Size](https://www.thetruesize.com/): an intuitive web map that allows you to drag countries to another place in the Web Mercator projection.
 :::

 :::keypoints
 - All geospatial datasets (raster and vector) are associated with a specific coordinate reference system.
 - A coordinate reference system includes datum, projection, and additional parameters specific to the dataset.
+- All maps are distorted as a result of the projection.
 :::
diff --git a/04-geo-landscape.md b/04-geo-landscape.md
index 6d8fdc70..23483075 100644
--- a/04-geo-landscape.md
+++ b/04-geo-landscape.md
@@ -27,7 +27,7 @@ The [Open Source Geospatial Foundation (OSGEO)](https://www.osgeo.org/) supports

 * [QGIS](https://www.qgis.org/en/site/) is a professional GIS application that is
   built on top of and proud to be itself Free and Open Source Software (FOSS). QGIS is
-  written in Python, has a python console interface, and has several interfaces written in R including
+  written in Python and C++, has a Python console interface, allows you to develop plugins, and has several interfaces written in R including
   [RQGIS](https://cran.r-project.org/package=RQGIS).
 * [GRASS GIS](https://grass.osgeo.org/), commonly referred to as GRASS
   (Geographic Resources Analysis Support System), is a FOSS-GIS software suite used for
@@ -50,7 +50,9 @@ The [Open Source Geospatial Foundation (OSGEO)](https://www.osgeo.org/) supports
   visualisation options, and runs under Windows and Linux operating systems. Like
   GRASS GIS, it can also be installed and made accessible in QGIS3.
 * [PostGIS](https://postgis.net/) is a geospatial extension to the PostGreSQL
-  relational database.
+  relational database and is especially suited to working with large vector datasets.
+* [GeoDMS](https://geodms.nl/) is a powerful open-source GIS which allows for
+  fast calculations, also on large datasets. Furthermore, it allows for complex scenario analyses.

 ### Commercial software

@@ -63,7 +65,7 @@ The [Open Source Geospatial Foundation (OSGEO)](https://www.osgeo.org/) supports
   ArGIS Online which you host locally. ESRI welcomes development on their platforms
   through their [DevLabs](https://developers.arcgis.com/). ArcGIS software can be
   installed using
-  [Chef Cookbooks from Github](https://github.com/Esri/arcgis-cookbook).
+  [Chef Cookbooks from Github](https://github.com/Esri/arcgis-cookbook). In addition, ESRI offers the [arcpy Python library](https://pro.arcgis.com/en/pro-app/3.1/arcpy/get-started/what-is-arcpy-.htm) as part of an ArcGIS Pro licence, allowing you to bring all operations from the ArcGIS Pro GUI to the Python ecosystem.
 * Pitney Bowes produce [MapInfo Professional](https://www.pitneybowes.com/us/location-intelligence/geographic-information-systems/mapinfo-pro.html),
   which was one of the earliest desktop GIS programs on the market.
 * [Hexagon Geospatial Power Portfolio](https://www.hexagongeospatial.com/products/products)
@@ -76,7 +78,7 @@ The [Open Source Geospatial Foundation (OSGEO)](https://www.osgeo.org/) supports
 * [PANGEO](https://pangeo.io/) is a community organization dedicated to open and
   reproducible data science with python. They focus on the Pangeo software ecosystem
   for working with big data in the geosciences.
-* Google has created [Google Earth Engine](https://earthengine.google.com/) which
+* Google developed [Google Earth Engine](https://earthengine.google.com/) which
   combines a multi-petabyte catalog of satellite imagery and geospatial datasets with
   planetary-scale analysis capabilities and makes it available for scientists,
   researchers, and developers to detect changes, map trends, and quantify differences
@@ -112,15 +114,15 @@ Benefits of using a GUI include:

 Downsides of using a GUI include:

-  - Low reproducibility - you can't record your actions and replay
-  - Most are not designed for batch-processing files
+  - Low reproducibility - you can record your actions and replay them, but this is limited to the functionalities of the software
+  - Batch-processing is possible, but limited to the functionalities of the software and hard to integrate with other workflows
  - Limited ability to customise functions or write your own
  - Intimidating interface for new users - so many buttons!

 In scientific computing, the lack of reproducibility in point-and-click software has
 come to be viewed as a critical weakness. As such, scripted CLI-style workflows are
-again becoming popular, which leads us to another approach to doing GIS — via a
-programming language. This is the approach we will be using throughout this workshop.
+becoming popular, which leads us to another approach to doing GIS — via a
+programming language. This is therefore the approach we will be using throughout this workshop.

 ## GIS in programming languages

 A number of powerful geospatial processing libraries exist for general-purpose
 programming languages like Java and C++. However, the learning curve for these
 languages is steep and the effort required is excessive for users who only need a
 subset of their functionality.

-Higher-level scripting languages like Python and R are easier to learn and use. Both
+Higher-level scripting languages like Python and R are considered easier to learn and use. Both
 now have their own packages that wrap up those geospatial processing libraries and make
 them easy to access and use safely. A key example is the Java Topology Suite (JTS),
 which is implemented in C++ as GEOS. GEOS is accessible in Python via the `shapely`
@@ -161,7 +163,7 @@ are other popular options for data science.

 Traditional GIS apps are also moving back towards providing a scripting environment
 for users, further blurring the CLI/GUI divide. ESRI have adopted Python into their
-software, and QGIS is both Python and R-friendly.
+software by introducing [arcpy](https://developers.arcgis.com/documentation/arcgis-add-ins-and-automation/arcpy/), and QGIS is both Python and R-friendly.

 ## GIS File Types

 There are a growing number of file types available for storing GIS data. Some of
 the most common file types for vector and
 raster file types.

 | File Type | Extensions | Description |
 | --------- | ---------- | ----------- |
 | Esri Shapefile | .SHP .DBF .SHX | The most common geospatial file type. This has become the industry standard. The three required files are: SHP is the feature geometry. SHX is the shape index position. DBF is the attribute data. |
+| GeoPackage | .gpkg | An alternative to the Shapefile. This open file format is gaining ground and consists of a single file containing all the necessary attribute information. |
 | Geographic JavaScript Object Notation (GeoJSON) | .GEOJSON .JSON | Used for web-based mapping and uses JavaScript Object Notation to store the coordinates as text. |
 | Google Keyhole Markup Language (KML) | .KML .KMZ | KML stands for Keyhole Markup Language. This GIS format is XML-based and is primarily used for Google Earth. |
+| GPX or GPS Exchange Format | .gpx | An XML schema designed as a common GPS data format for software applications. This format is often used for tracking activities, e.g. hiking, cycling, running, etc. |
 | OpenStreetMap | .OSM | OSM files are the native file for OpenStreetMap which had become the largest crowdsourcing GIS data project in the world. These files are a collection of vector features from crowd-sourced contributions from the open community. |

 ### Raster
diff --git a/05-access-data.md b/05-access-data.md
index f0fee95d..8cd90d08 100644
--- a/05-access-data.md
+++ b/05-access-data.md
@@ -4,21 +4,19 @@ teaching: 30
 exercises: 15
 ---

-:::questions
+::: questions
 - Where can I find open-access satellite data?
 - How do I search for satellite imagery with the STAC API?
 - How do I fetch remote raster datasets using Python?
 :::

-:::objectives
+::: objectives
 - Search public STAC repositories of satellite imagery using Python.
 - Inspect search result's metadata.
 - Download (a subset of) the assets available for a satellite scene.
 - Open satellite imagery as raster data and save it to disk.
 :::

-
-
 ## Introduction

 A number of satellites take snapshots of the Earth's surface from space. The images recorded by these remote sensors
 represent a very precious data source for any activity that involves monitoring changes on Earth. Satellite imagery is
 typically provided in the form of geospatial raster data, with the measurements in each grid cell ("pixel") being
 associated to accurate geographic coordinate information.

-In this episode we will explore how to access open satellite data using Python. In particular, we will
-consider [the Sentinel-2 data collection that is hosted on AWS](https://registry.opendata.aws/sentinel-2-l2a-cogs).
-This dataset consists of multi-band optical images acquired by the two satellites of
+In this episode we will explore how to access open satellite data using Python. In particular, we will
+consider [the Sentinel-2 data collection that is hosted on Amazon Web Services (AWS)](https://registry.opendata.aws/sentinel-2-l2a-cogs).
+This dataset consists of multi-band optical images acquired by the constellation of two satellites from
 [the Sentinel-2 mission](https://sentinel.esa.int/web/sentinel/missions/sentinel-2) and it is continuously updated with
 new images.

 ### The SpatioTemporal Asset Catalog (STAC) specification

 Current sensor resolutions and satellite revisit periods are such that terabytes of data products are added daily to the
-corresponding collections. Such datasets cannot be made accessible to users via full-catalog download. Space agencies
-and other data providers often offer access to their data catalogs through interactive Graphical User Interfaces (GUIs),
-see for instance the [Copernicus Open Access Hub portal](https://scihub.copernicus.eu/dhus/#/home) for the Sentinel
+corresponding collections. Such datasets cannot be made accessible to users via full-catalog download.
+Therefore, space agencies and other data providers often offer access to their data catalogs through interactive Graphical User Interfaces (GUIs), see for instance the [Copernicus Browser](https://browser.dataspace.copernicus.eu) for the Sentinel
 missions. Accessing data via a GUI is a nice way to explore a catalog and get familiar with its content, but it
 represents a heavy and error-prone task that should be avoided if carried out systematically to retrieve data.

@@ -54,7 +50,7 @@ access data from different missions, instruments and collections using the same

 ![Views of the STAC browser](fig/E05/STAC-browser.jpg){alt="STAC browser screenshots"}

-:::callout
+::: callout
 ## More Resources on STAC
 - [STAC specification](https://github.com/radiantearth/stac-spec#readme)
 - [Tools based on STAC](https://stacindex.org/ecosystem)
@@ -67,7 +63,7 @@ The [STAC browser](https://radiantearth.github.io/stac-browser/#/) is a good sta
 datasets, as it provides an up-to-date list of existing STAC catalogs. From the list, let's click on the "Earth
 Search" catalog, i.e. the access point to search the archive of Sentinel-2 images hosted on AWS.

-:::challenge
+::: challenge
 ## Exercise: Discover a STAC catalog
 Let's take a moment to explore the Earth Search STAC catalog, which is the catalog indexing the Sentinel-2 collection
 that is hosted on AWS. We can interactively browse this catalog using the STAC browser at [this link](https://radiantearth.github.io/stac-browser/#/external/earth-search.aws.element84.com/v1).
@@ -77,11 +73,14 @@ that is hosted on AWS. We can interactively browse this catalog using the STAC b
 "scene", i.e. a portion of the footage recorded by the satellite at a given time. Have a look at the metadata fields
 and the list of assets. What kind of data do the assets represent?

-::::solution
+:::: solution
 ![Views of the Earth Search STAC endpoint](fig/E05/STAC-browser-exercise.jpg){alt="earth-search stac catalog views"}

-1. 7 subcatalogs are available, including a catalog for Landsat Collection 2, Level-2 and Sentinel-2 Level 2A (see left
screenshot in the figure above).
+1. 8 sub-catalogs are available. In the STAC nomenclature, these are actually "collections", i.e. catalogs with
+additional information about the elements they list: spatial and temporal extents, license, providers, etc.
+Among the available collections, we have Landsat Collection 2, Level-2 and Sentinel-2 Level 2A (see left screenshot in
+the figure above).
 2. When you select the Sentinel-2 Level 2A collection, and randomly choose one of the items from the list, you should
 find yourself on a page similar to the right screenshot in the figure above. On the left side you will find a list of
 the available assets: overview images (thumbnail and true color images), metadata files and the "real"
 satellite images, one for each band captured by the Multispectral Instrument on board Sentinel-2.
 ::::
 :::

 When opening a catalog with the STAC browser, you can access the API URL by clicking on the "Source" button on the top
-right of the page. By using this URL, we have access to the catalog content and, if supported by the catalog, to the
+right of the page. By using this URL, you have access to the catalog content and, if supported by the catalog, to the
 functionality of searching its items.
 For the Earth Search STAC catalog the API URL is:

 ```python
 api_url = "https://earth-search.aws.element84.com/v1"
 ```

-You can query a STAC API endpoint from Python using the `pystac_client` library:
+You can query a STAC API endpoint from Python using the [`pystac_client` library](https://pystac-client.readthedocs.io).
+To do so, we will first import `Client` from `pystac_client` and use the [method `open` from the Client object](https://pystac-client.readthedocs.io/en/stable/quickstart.html):

 ```python
 from pystac_client import Client

 client = Client.open(api_url)
 ```

-In the following, we ask for scenes belonging to the `sentinel-2-l2a` collection. This dataset includes Sentinel-2
-data products pre-processed at level 2A (bottom-of-atmosphere reflectance) and saved in Cloud Optimized GeoTIFF (COG)
-format:
+For this episode we will focus on scenes belonging to the `sentinel-2-l2a` collection.
+This dataset is useful for our case and includes Sentinel-2 data products pre-processed at level 2A (bottom-of-atmosphere reflectance).
+
+In order to see which collections are available at the provided `api_url`, the
+[`get_collections`](https://pystac-client.readthedocs.io/en/stable/api.html#pystac_client.Client.get_collections) method can be used on the Client object.
+
+```python
+collections = client.get_collections()
+```
+
+To print the collections, we can loop over them:
+
+```python
+for collection in collections:
+    print(collection)
+```
+
+```output
+<CollectionClient id=sentinel-2-pre-c1-l2a>
+<CollectionClient id=cop-dem-glo-30>
+<CollectionClient id=naip>
+<CollectionClient id=cop-dem-glo-90>
+<CollectionClient id=landsat-c2-l2>
+<CollectionClient id=sentinel-2-l2a>
+<CollectionClient id=sentinel-2-l1c>
+<CollectionClient id=sentinel-1-grd>
+```
+
+As mentioned, we want to focus on the `sentinel-2-l2a` collection. To do so, we assign this collection name to a variable:

 ```python
-collection = "sentinel-2-l2a"  # Sentinel-2, Level 2A, Cloud Optimized GeoTiffs (COGs)
+collection_sentinel_2_l2a = "sentinel-2-l2a"
 ```

+The data in this collection is stored in the Cloud Optimized GeoTIFF (COG) format and as JPEG2000 images. In this episode we will focus on COGs, as these offer useful functionalities for our purpose.
+
-:::callout
+::: callout
 ## Cloud Optimized GeoTIFFs

 Cloud Optimized GeoTIFFs (COGs) are regular GeoTIFF files with some additional features that make them ideal to be
@@ -128,52 +156,90 @@ by a high-resolution raster can directly access the lower-resolution versions of
 on the downloading time. More information on the COG format can be found [here](https://www.cogeo.org).
 :::

-We also ask for scenes intersecting a geometry defined using the `shapely` library (in this case, a point):
+In order to get data for a specific location, you can add longitude/latitude coordinates (World Geodetic System 1984, EPSG:4326) to your request.
+To do so, we use the `shapely` library to define a geometrical point.
+Below we have included a point on the island of Rhodes, which is the location of interest for our case study (i.e. longitude 27.95, latitude 36.20).

 ```python
 from shapely.geometry import Point

-point = Point(4.89, 52.37)  # AMS coordinates
+point = Point(27.95, 36.20)  # Coordinates of a point on Rhodes
 ```

 Note: at this stage, we are only dealing with metadata, so no image is going to be downloaded yet. But even metadata can
-be quite bulky if a large number of scenes match our search! For this reason, we limit the search result to 10 items:
+be quite bulky if a large number of scenes match our search! For this reason, we limit the search by the intersection of the point (by setting the parameter `intersects`) and assign the collection (by setting the parameter `collections`). More information about the possible search parameters can be found in the `pystac_client` documentation for the [Client's `search` method](https://pystac-client.readthedocs.io/en/stable/api.html#pystac_client.Client.search).
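+As an aside: if your area of interest is a region rather than a single point, `client.search` also accepts a `bbox` argument. A sketch deriving a small bounding box around the same point (the 0.01-degree buffer size is an arbitrary choice for illustration):
+
+```python
+# (minx, miny, maxx, maxy) in degrees, suitable for the bbox search parameter
+bbox = point.buffer(0.01).bounds
+print(bbox)
+```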
+We now set up our search of satellite images in the following way:

 ```python
 search = client.search(
-    collections=[collection],
+    collections=[collection_sentinel_2_l2a],
     intersects=point,
-    max_items=10,
 )
 ```

-We submit the query and find out how many scenes match our search criteria (please note that this output can be different as more data is added to the catalog):
+Now we submit the query to find out how many scenes match our search criteria with the parameters assigned above (please note that this output can differ, as more data may have been added to the catalog since this episode was created):

 ```python
 print(search.matched())
 ```

 ```output
 611
 ```

+You will notice that more than 500 scenes match our search criteria. We are, however, interested in the period right before and after the wildfire of Rhodes. In the following exercise you will therefore have to add a time filter to
+our search criteria to narrow down our search for images of that period.
+
+::: challenge
+## Exercise: Search satellite scenes with a time filter
+
+Search for all the available Sentinel-2 scenes in the `sentinel-2-l2a` collection that have been recorded between
+the 1st of July 2023 and the 31st of August 2023 (a few weeks before and after the time the wildfire took place).
+
+Hint: You can find the input argument and the required syntax in the documentation of `client.search` (which you can access from
+Python or [online](https://pystac-client.readthedocs.io/en/stable/api.html#pystac_client.Client.search)).
+
+How many scenes are available?
+
+:::: solution
+
+```python
+search = client.search(
+    collections=[collection_sentinel_2_l2a],
+    intersects=point,
+    datetime='2023-07-01/2023-08-31'
+)
+print(search.matched())
+```
+
+```output
+12
+```
+
+This means that 12 scenes satisfy the search criteria.
+::::
+:::
+
+Now that we have added a time filter, we retrieve the metadata of the search results by calling the method `item_collection`:

 ```python
 items = search.item_collection()
 ```

-The variable `items` is an `ItemCollection` object. We can check its size by:
+The variable `items` is an `ItemCollection` object. More information can be found in the [pystac documentation](https://pystac.readthedocs.io/en/latest/api/pystac.html#pystac.ItemCollection).
+
+Now let us check the size using `len`:

 ```python
 print(len(items))
 ```

 ```output
-10
+12
 ```

-which is consistent with the maximum number of items that we have set in the search criteria. We can iterate over
-the returned items and print these to show their IDs:
+which is consistent with the number of scenes matching our search results as found with `search.matched()`. We can
+iterate over the returned items and print these to show their IDs:

 ```python
 for item in items:
     print(item)
 ```

 ```output
-
-
-
-
-
-
-
-
-
-
+<Item id=S2A_35SNA_20230827_0_L2A>
+<Item id=S2B_35SNA_20230822_0_L2A>
+<Item id=S2A_35SNA_20230817_0_L2A>
+<Item id=S2B_35SNA_20230812_0_L2A>
+<Item id=S2A_35SNA_20230807_0_L2A>
+<Item id=S2B_35SNA_20230802_0_L2A>
+<Item id=S2A_35SNA_20230728_0_L2A>
+<Item id=S2B_35SNA_20230723_0_L2A>
+<Item id=S2A_35SNA_20230718_0_L2A>
+<Item id=S2B_35SNA_20230713_0_L2A>
+<Item id=S2A_35SNA_20230708_0_L2A>
+<Item id=S2B_35SNA_20230703_0_L2A>
 ```

 Each of the items contains information about the scene geometry, its acquisition time, and other metadata that can be
-accessed as a dictionary from the `properties` attribute.
+accessed as a dictionary from the `properties` attribute. To see which information `Item` objects can contain, you can have a look at the [pystac documentation](https://pystac.readthedocs.io/en/latest/api/pystac.html#pystac.Item).

-Let's inspect the metadata associated with the first item of the search results:
+Let us inspect the metadata associated with the first item of the search results. Let us first look at the collection date of the first item:

 ```python
 item = items[0]
 print(item.datetime)
+```
+
+```output
+2023-08-27 09:00:21.327000+00:00
+```
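+Note that the returned items are ordered from the most recent to the oldest scene. If you prefer not to rely on that ordering, you can sort the items explicitly on their `datetime` attribute; a small sketch:
+
+```python
+# oldest scene first, newest scene last
+items_sorted = sorted(items, key=lambda item: item.datetime)
+print(items_sorted[0].id, items_sorted[-1].id)
+```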
+Let us now look at the geometry and other properties as well.
+
+```python
 print(item.geometry)
 print(item.properties)
 ```

 ```output
-2023-07-01 10:46:30.262000+00:00
-{'type': 'Polygon', 'coordinates': [[[5.233744523520149, 53.228684673408296], [6.141754296879459, 53.20819279121764], [6.071664488869862, 52.22257539160585], [4.80943323800081, 52.2486879358387], [5.233744523520149, 53.228684673408296]]]}
-{'created': '2023-07-02T01:49:17.191Z', 'platform': 'sentinel-2a', 'constellation': 'sentinel-2', 'instruments': ['msi'], 'eo:cloud_cover': 99.952936, 'proj:epsg': 32631, 'mgrs:utm_zone': 31, 'mgrs:latitude_band': 'U', 'mgrs:grid_square': 'FU', 'grid:code': 'MGRS-31UFU', 'view:sun_azimuth': 154.716674921261, 'view:sun_elevation': 58.4960054056685, 's2:degraded_msi_data_percentage': 0.0346, 's2:nodata_pixel_percentage': 33.00232, 's2:saturated_defective_pixel_percentage': 0, 's2:dark_features_percentage': 0, 's2:cloud_shadow_percentage': 0.030847, 's2:vegetation_percentage': 0, 's2:not_vegetated_percentage': 0.004947, 's2:water_percentage': 0.011271, 's2:unclassified_percentage': 0, 's2:medium_proba_clouds_percentage': 5.838514, 's2:high_proba_clouds_percentage': 94.035202, 's2:thin_cirrus_percentage': 0.07922, 's2:snow_ice_percentage': 0, 's2:product_type': 'S2MSI2A', 's2:processing_baseline': '05.09', 's2:product_uri': 'S2A_MSIL2A_20230701T103631_N0509_R008_T31UFU_20230701T200058.SAFE', 's2:generation_time': '2023-07-01T20:00:58.000000Z', 's2:datatake_id': 'GS2A_20230701T103631_041904_N05.09', 's2:datatake_type': 'INS-NOBS', 's2:datastrip_id': 'S2A_OPER_MSI_L2A_DS_2APS_20230701T200058_S20230701T104159_N05.09', 's2:granule_id': 'S2A_OPER_MSI_L2A_TL_2APS_20230701T200058_A041904_T31UFU_N05.09', 's2:reflectance_conversion_factor': 0.967641353116838, 'datetime': '2023-07-01T10:46:30.262000Z', 's2:sequence': '0', 'earthsearch:s3_path': 's3://sentinel-cogs/sentinel-s2-l2a-cogs/31/U/FU/2023/7/S2A_31UFU_20230701_0_L2A', 'earthsearch:payload_id': 'roda-sentinel2/workflow-sentinel2-to-stac/7b1a81ed3fb8d763a0cecf8d9edd4d4a', 'earthsearch:boa_offset_applied': True, 'processing:software': {'sentinel2-to-stac': '0.1.0'}, 'updated': '2023-07-02T01:49:17.191Z'}
+{'type': 'Polygon', 'coordinates': [[[27.290401625602243, 37.04621863329741], [27.23303872472207, 36.83882218126937], [27.011145718480538, 36.05673246264742], [28.21878905911668, 36.05053734221328], [28.234426643135546, 37.04015200857309], [27.290401625602243, 37.04621863329741]]]}
+{'created': '2023-08-27T18:15:43.106Z', 'platform': 'sentinel-2a', 'constellation': 'sentinel-2', 'instruments': ['msi'], 'eo:cloud_cover': 0.955362, 'proj:epsg': 32635, 'mgrs:utm_zone': 35, 'mgrs:latitude_band': 'S', 'mgrs:grid_square': 'NA', 'grid:code': 'MGRS-35SNA', 'view:sun_azimuth': 144.36354987218, 'view:sun_elevation': 59.06665363921, 's2:degraded_msi_data_percentage': 0.0126, 's2:nodata_pixel_percentage': 12.146327, 's2:saturated_defective_pixel_percentage': 0, 's2:dark_features_percentage': 0.249403, 's2:cloud_shadow_percentage': 0.237454, 's2:vegetation_percentage': 6.073786, 's2:not_vegetated_percentage': 18.026696, 's2:water_percentage': 74.259061, 's2:unclassified_percentage': 0.198216, 's2:medium_proba_clouds_percentage': 0.613614, 's2:high_proba_clouds_percentage': 0.341423, 's2:thin_cirrus_percentage': 0.000325, 's2:snow_ice_percentage': 2.3e-05, 's2:product_type': 'S2MSI2A', 's2:processing_baseline': '05.09', 's2:product_uri': 'S2A_MSIL2A_20230827T084601_N0509_R107_T35SNA_20230827T115803.SAFE', 's2:generation_time': '2023-08-27T11:58:03.000000Z', 's2:datatake_id': 'GS2A_20230827T084601_042718_N05.09', 's2:datatake_type': 'INS-NOBS', 's2:datastrip_id': 'S2A_OPER_MSI_L2A_DS_2APS_20230827T115803_S20230827T085947_N05.09', 's2:granule_id': 'S2A_OPER_MSI_L2A_TL_2APS_20230827T115803_A042718_T35SNA_N05.09', 's2:reflectance_conversion_factor': 0.978189079756816, 'datetime': '2023-08-27T09:00:21.327000Z', 's2:sequence': '0', 'earthsearch:s3_path': 's3://sentinel-cogs/sentinel-s2-l2a-cogs/35/S/NA/2023/8/S2A_35SNA_20230827_0_L2A', 'earthsearch:payload_id': 'roda-sentinel2/workflow-sentinel2-to-stac/af0287974aaa3fbb037c6a7632f72742', 'earthsearch:boa_offset_applied': True, 'processing:software': {'sentinel2-to-stac': '0.1.1'}, 'updated': '2023-08-27T18:15:43.106Z'}
 ```

+You can access items from the `properties` dictionary as usual in Python. For instance, for the EPSG code of the projected coordinate system:
+
+```python
+print(item.properties['proj:epsg'])
+```
+
-:::challenge
-## Exercise: Search satellite scenes using metadata filters
-Search for all the available Sentinel-2 scenes in the `sentinel-2-l2a` collection that satisfy the following
-criteria:
-- intersect a provided bounding box (use ±0.01 deg in lat/lon from the previously defined point);
-- have been recorded between 20 March 2020 and 30 March 2020;
-- have a cloud coverage smaller than 10% (hint: use the `query` input argument of `client.search`).
-
-How many scenes are available? Save the search results in GeoJSON format.
-
-::::solution
-```python
-bbox = point.buffer(0.01).bounds
-```
+::: challenge
+## Exercise: Search satellite scenes using metadata filters
+
+Let's add a filter on the cloud cover to select only the scenes with less than 1% cloud coverage. How many scenes now
+match our search?
+
+Hint: generic metadata filters can be implemented via the `query` input argument of `client.search`, which requires the
+following syntax (see [docs](https://pystac-client.readthedocs.io/en/stable/usage.html#query-extension)):
+`query=['<property><operator><value>']`.
+
+:::: solution

 ```python
 search = client.search(
-    collections=[collection],
-    bbox=bbox,
-    datetime="2020-03-20/2020-03-30",
-    query=["eo:cloud_cover<15"]
+    collections=[collection_sentinel_2_l2a],
+    intersects=point,
+    datetime='2023-07-01/2023-08-31',
+    query=['eo:cloud_cover<1']
 )
 print(search.matched())
 ```

 ```output
-4
+11
 ```
+::::
+:::
+
+Once we are happy with our search, we save the search results in a file:

 ```python
 items = search.item_collection()
-items.save_object("search.json")
+items.save_object("rhodes_sentinel-2.json")
 ```
-::::
-:::
+
+This creates a file in GeoJSON format, which we will reuse here and in the next episodes. Note that this file contains the metadata of the files that meet our criteria. It does not include the data itself, only the metadata and links to where the data files can be accessed.
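+As a quick check, you can read the saved file back with `pystac` (the same call we will use again in the next episode):
+
+```python
+import pystac
+
+saved_items = pystac.ItemCollection.from_file("rhodes_sentinel-2.json")
+print(len(saved_items))
+```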

 ## Access the assets

 So far we have only discussed metadata - but how can one get to the actual images of a satellite scene (the "assets" in
 the STAC nomenclature)? These can be reached via links that are made available through the item's attribute `assets`.
+Let's focus on the last item in the collection: this is the oldest in time, and it thus corresponds to an image taken
+before the wildfires.

 ```python
-assets = items[0].assets  # first item's asset dictionary
+assets = items[-1].assets  # last item's asset dictionary
 print(assets.keys())
 ```

@@ -305,39 +392,58 @@ visual-jp2: True color image
 wvp-jp2: Water vapour (WVP)
 ```

-Among the others, assets include multiple raster data files (one per optical band, as acquired by the multi-spectral
+Among the other data files, assets include multiple raster data files (one per optical band, as acquired by the multi-spectral
 instrument), a thumbnail, a true-color image ("visual"), instrument metadata and scene-classification information
-("SCL"). Let's get the URL links to the actual asset:
+("SCL"). Let's get the URL link to the thumbnail, which gives us a glimpse of the Sentinel-2 scene:

 ```python
 print(assets["thumbnail"].href)
 ```

 ```output
-https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/31/U/FU/2020/3/S2A_31UFU_20200328_0_L2A/thumbnail.jpg
+https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/35/S/NA/2023/7/S2A_35SNA_20230708_0_L2A/thumbnail.jpg
 ```

 This can be used to download the corresponding file:

-![Overview of the true-color image ("thumbnail")](fig/E05/STAC-s2-preview.jpg){alt="thumbnail of the sentinel-2 scene"}
+![Overview of the true-color image ("thumbnail") before the wildfires on Rhodes](fig/E05/STAC-s2-preview-before.jpg){alt="thumbnail of the sentinel-2 scene before the wildfires"}
+
+For comparison, we can check out the thumbnail of the most recent scene of the sequence considered (i.e. the first item
+in the item collection), which has been taken after the wildfires:
+
+```python
+print(items[0].assets["thumbnail"].href)
+```
+
+```output
+https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/35/S/NA/2023/8/S2A_35SNA_20230827_0_L2A/thumbnail.jpg
+```
+
+![Overview of the true-color image ("thumbnail") after the wildfires on Rhodes](fig/E05/STAC-s2-preview-after.jpg){alt="thumbnail of the sentinel-2 scene after the wildfires"}
+
+From the thumbnails alone we can already observe some dark spots on the island of Rhodes, at the bottom right of the image!
+
+In order to open the high-resolution satellite images and investigate the scenes in more detail, we will be using the [`rioxarray` library](https://corteva.github.io/rioxarray). Note that this library can work with both local and remote raster data. At this moment, we will only take a sneak peek at the functionality of this library. We will learn more about it in the next episode.
+
+Now let us focus on the red band by accessing the `red` item from the assets dictionary and getting its Hypertext Reference (also known as URL) via the `.href` attribute.
+
+For now we are using [rioxarray to open the raster file](https://corteva.github.io/rioxarray/stable/rioxarray.html#rioxarray-open-rasterio).

-Remote raster data can be directly opened via the `rioxarray` library. We will
-learn more about this library in the next episodes.

 ```python
 import rioxarray

-nir_href = assets["nir"].href
-nir = rioxarray.open_rasterio(nir_href)
-print(nir)
+red_href = assets["red"].href
+red = rioxarray.open_rasterio(red_href)
+print(red)
 ```

 ```output
-<xarray.DataArray (band: 1, y: 10980, x: 10980)>
+<xarray.DataArray (band: 1, y: 10980, x: 10980)> Size: 241MB
 [120560400 values with dtype=uint16]
 Coordinates:
-  * band         (band) int64 1
-  * x            (x) float64 6e+05 6e+05 6e+05 ... 7.098e+05 7.098e+05 7.098e+05
-  * y            (y) float64 5.9e+06 5.9e+06 5.9e+06 ... 5.79e+06 5.79e+06
-    spatial_ref  int64 0
+  * band         (band) int32 4B 1
+  * x            (x) float64 88kB 5e+05 5e+05 5e+05 ... 6.098e+05 6.098e+05
+  * y            (y) float64 88kB 4.1e+06 4.1e+06 4.1e+06 ... 3.99e+06 3.99e+06
+    spatial_ref  int32 4B 0
 Attributes:
     AREA_OR_POINT:       Area
     OVR_RESAMPLING_ALG:  AVERAGE
     add_offset:          0.0
 ```
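+Before downloading anything, the `rio` accessor lets you sanity-check the spatial metadata of the remote raster; a quick sketch:
+
+```python
+print(red.rio.crs)           # coordinate reference system
+print(red.rio.resolution())  # pixel size, in CRS units
+print(red.rio.bounds())      # spatial extent
+```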

-We can then save the data to disk:
+Now we want to save the data to our local machine using the [to_raster](https://corteva.github.io/rioxarray/stable/rioxarray.html#rioxarray.raster_array.RasterArray.to_raster) method:

 ```python
 # save whole image to disk
-nir.rio.to_raster("nir.tif")
+red.rio.to_raster("red.tif")
 ```

-Since that might take a while, given there are over 10000 x 10000 = a hundred million pixels in the 10 meter NIR band, you can take a smaller subset before downloading it. Becuase the raster is a COG, we can download just what we need!
+That might take a while, given there are over 10000 x 10000 = a hundred million pixels in the 10-meter red band.
+But we can take a smaller subset before downloading it. Because the raster is a COG, we can download just what we need!

-Here, we specify that we want to download the first (and only) band in the tif file, and a slice of the width and height dimensions.
+In order to do that, we use rioxarray's [`clip_box`](https://corteva.github.io/rioxarray/stable/examples/clip_box.html), with which you can set a bounding box defining the area you want.

 ```python
-# save portion of an image to disk
-nir[0,1500:2200,1500:2200].rio.to_raster("nir_subset.tif")
+red_subset = red.rio.clip_box(
+    minx=560900,
+    miny=3995000,
+    maxx=570900,
+    maxy=4015000
+)
 ```
+
+Next, we save the subset using `to_raster` again.
+
+```python
+red_subset.rio.to_raster("red_subset.tif")
+```

+The difference is 241 Megabytes for the full image vs less than 10 Megabytes for the subset.
-The difference is 155 Megabytes for the large image vs about 1 Megabyte for the subset.

 :::challenge
 ## Exercise: Downloading Landsat 8 Assets
@@ -377,7 +494,7 @@ in the NASA Common Metadata Repository (CMR) and it can be accessed from the STA
 2021, intersecting the point with longitude/latitute coordinates (-73.97, 40.78) deg.
 - Visualize an item's thumbnail (asset key `browse`).

-::::solution
+:::: solution
 ```python
 # connect to the STAC endpoint
 cmr_api_url = "https://cmr.earthdata.nasa.gov/stac/LPCLOUD"
@@ -422,7 +539,7 @@ print(item.assets["browse"].href)
 ```
 ::::
 :::

-:::callout
+::: callout
 ## Public catalogs, protected data

 Publicly accessible catalogs and STAC endpoints do not necessarily imply publicly accessible data. Data providers, in

 os.environ["GDAL_HTTP_COOKIEJAR"] = "./cookies.txt"
 ```
 :::

-:::keypoints
+::: keypoints
 - Accessing satellite images via the providers' API enables a more reliable and scalable data retrieval.
 - STAC catalogs can be browsed and searched using the same tools and scripts.
 - `rioxarray` allows you to open and download remote raster files.
 :::
diff --git a/06-raster-intro.md b/06-raster-intro.md
index 99598cec..faa4cba5 100644
--- a/06-raster-intro.md
+++ b/06-raster-intro.md
@@ -17,101 +17,283 @@ exercises: 30
 - Visualize single/multi-band raster data.
 :::

-Raster datasets have been introduced in [Episode 1: Introduction to Raster Data](01-intro-raster-data.md). Here, we introduce the fundamental principles, packages and metadata/raster attributes for working with raster data in Python. We will also explore how Python handles missing and bad data values.
-[`rioxarray`](https://corteva.github.io/rioxarray/stable/) is the Python package we will use throughout this lesson to work with raster data. It is based on the popular [`rasterio`](https://rasterio.readthedocs.io/en/latest/) package for working with rasters and [`xarray`](https://xarray.pydata.org/en/stable/) for working with multi-dimensional arrays. -`rioxarray` extends `xarray` by providing top-level functions (e.g. the `open_rasterio` function to open raster datasets) and by adding a set of methods to the main objects of the `xarray` package (the `Dataset` and the `DataArray`). These additional methods are made available via the `rio` accessor and become available from `xarray` objects after importing `rioxarray`. +:::callout +## Raster Data -We will also use the [`pystac`](https://github.com/stac-utils/pystac) package to load rasters from the search results we created in the previous episode. +In the [first episode](01-intro-raster-data.md) of this course we provided an introduction on what Raster datasets are and how these divert from vector data. In this episode we will dive more into raster data and focus on how to work with them. We introduce fundamental principles, python packages, metadata and raster attributes for working with this type of data. In addition, we will explore how Python handles missing and bad data values. + +The Python package we will use throughout this episode to handle raster data is [`rioxarray`](https://corteva.github.io/rioxarray/stable/). This package is based on the popular [`rasterio`](https://rasterio.readthedocs.io/en/latest/) (which is build upon the GDAL library) for working with raster data and [`xarray`](https://xarray.pydata.org/en/stable/) for working with multi-dimensional arrays. + +`rioxarray` extends `xarray` by providing top-level functions like the [`open_rasterio`](https://corteva.github.io/rioxarray/html/rioxarray.html#rioxarray-open-rasterio) function to open raster datasets. Furthermore, it adds a set of methods to the main objects of the `xarray` package like the [`Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html) and the [`DataArray`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html#xarray.DataArray). These methods are made available via the `rio` accessor and become available from `xarray` objects after importing `rioxarray`. Since a lot of the functions, methods and attributes do not originate from `rioxarray`, but come from the other packages mentioned and are accessible through the accessor, the documentation is in some cases limited and requires a little puzzling. It is therefore recommended to foremost focus at the notebook´s functionality to use tab completion and go through the various functionalities. In addition, adding a question mark `?` after every function or method offers the opportunity to see the various options. + +For instance if you want to understand the options for rioxarray´s `open_rasterio` function: + +```python +rioxarray.open_rasterio? 
+``` + +```output +Signature: +rioxarray.open_rasterio( + filename: Union[str, os.PathLike, rasterio.io.DatasetReader, rasterio.vrt.WarpedVRT, rioxarray._io.SingleBandDatasetReader], + parse_coordinates: Optional[bool] = None, + chunks: Union[int, tuple, dict, NoneType] = None, + cache: Optional[bool] = None, + lock: Optional[Any] = None, + masked: bool = False, + mask_and_scale: bool = False, + variable: Union[str, list[str], tuple[str, ...], NoneType] = None, + group: Union[str, list[str], tuple[str, ...], NoneType] = None, + default_name: Optional[str] = None, + decode_times: bool = True, + decode_timedelta: Optional[bool] = None, + band_as_variable: bool = False, + **open_kwargs, +) -> Union[xarray.core.dataset.Dataset, xarray.core.dataarray.DataArray, list[xarray.core.dataset.Dataset]] +Docstring: +Open a file with rasterio (experimental). + +This should work with any file that rasterio can open (most often: +geoTIFF). The x and y coordinates are generated automatically from the +file's geoinformation, shifted to the center of each pixel (see +`"PixelIsArea" Raster Space +`_ +for more information). + +.. versionadded:: 0.13 band_as_variable + +Parameters +---------- +filename: str, rasterio.io.DatasetReader, or rasterio.vrt.WarpedVRT + Path to the file to open. Or already open rasterio dataset. +parse_coordinates: bool, optional + Whether to parse the x and y coordinates out of the file's + ``transform`` attribute or not. The default is to automatically + parse the coordinates only if they are rectilinear (1D). + It can be useful to set ``parse_coordinates=False`` + if your files are very large or if you don't need the coordinates. +chunks: int, tuple or dict, optional + Chunk sizes along each dimension, e.g., ``5``, ``(5, 5)`` or + ``{'x': 5, 'y': 5}``. If chunks is provided, it used to load the new + DataArray into a dask array. Chunks can also be set to + ``True`` or ``"auto"`` to choose sensible chunk sizes according to + ``dask.config.get("array.chunk-size")``. +cache: bool, optional + If True, cache data loaded from the underlying datastore in memory as + NumPy arrays when accessed to avoid reading from the underlying data- + store multiple times. Defaults to True unless you specify the `chunks` + argument to use dask, in which case it defaults to False. +lock: bool or dask.utils.SerializableLock, optional + + If chunks is provided, this argument is used to ensure that only one + thread per process is reading from a rasterio file object at a time. + + By default and when a lock instance is provided, + a :class:`xarray.backends.CachingFileManager` is used to cache File objects. + Since rasterio also caches some data, this will make repeated reads from the + same object fast. + + When ``lock=False``, no lock is used, allowing for completely parallel reads + from multiple threads or processes. However, a new file handle is opened on + each request. + +masked: bool, optional + If True, read the mask and set values to NaN. Defaults to False. +mask_and_scale: bool, default=False + Lazily scale (using the `scales` and `offsets` from rasterio) and mask. + If the _Unsigned attribute is present treat integer arrays as unsigned. +variable: str or list or tuple, optional + Variable name or names to use to filter loading. +group: str or list or tuple, optional + Group name or names to use to filter loading. +default_name: str, optional + The name of the data array if none exists. Default is None. 
+decode_times: bool, default=True + If True, decode times encoded in the standard NetCDF datetime format + into datetime objects. Otherwise, leave them encoded as numbers. +decode_timedelta: bool, optional + If True, decode variables and coordinates with time units in + {“days”, “hours”, “minutes”, “seconds”, “milliseconds”, “microseconds”} + into timedelta objects. If False, leave them encoded as numbers. + If None (default), assume the same value of decode_time. +band_as_variable: bool, default=False + If True, will load bands in a raster to separate variables. +**open_kwargs: kwargs, optional + Optional keyword arguments to pass into :func:`rasterio.open`. + +Returns +------- +:obj:`xarray.Dataset` | :obj:`xarray.DataArray` | list[:obj:`xarray.Dataset`]: + The newly created dataset(s). +File: c:\miniconda3\envs\geospatial\lib\site-packages\rioxarray\_io.py +Type: function +``` + +In addition to rioxarray, we will use the [`pystac`](https://github.com/stac-utils/pystac) package to load rasters from the search results that were created in [episode 5](05-access-data.md). -:::callout -## Introduce the Raster Data -We'll continue from the results of the satellite image search that we have carried out in an exercise from -[a previous episode](05-access-data.md). We will load data starting from the `search.json` file, -using one scene from the search results as an example to demonstrate data loading and visualization. - -If you would like to work with the data for this lesson without downloading data on-the-fly, you can download the -raster data ahead of time using this [link](https://figshare.com/ndownloader/files/36028100). Save the `geospatial-python-raster-dataset.tar.gz` -file in your current working directory, and extract the archive file by double-clicking on it or by running the -following command in your terminal `tar -zxvf geospatial-python-raster-dataset.tar.gz`. Use the file `geospatial-python-raster-dataset/search.json` -(instead of `search.json`) to get started with this lesson. - -This can be useful if you need to download the data ahead of time to work through the lesson offline, or if you want -to work with the data in a different GIS. ::: -## Load a Raster and View Attributes -In the previous episode, we searched for Sentinel-2 images, and then saved the search results to a file: `search.json`. This contains the information on where and how to access the target images from a remote repository. We can use the function `pystac.ItemCollection.from_file()` to load the search results as an `Item` list. +## Getting the data + +We will continue from the results of the satellite image search that we have carried out in an exercise from +[the previous episode](05-access-data.md). We will load data using the created metadata file and use one scene from the search results as an example to demonstrate data loading and visualization. To do so you can use the `rhodes_sentinel-2.json` you created or download and use the created JSON file [here](../data/stac_json/rhodes_sentinel-2.json). + +In case you would like to work with raster data for this lesson without downloading data on-the-fly, you can download the raster data using this [link](https://figshare.com/ndownloader/files/36028100). Save the `geospatial-python-raster-dataset.tar.gz` file in your current working directory, and extract the archive file by double-clicking on it or by running the +following command in your terminal `tar -zxvf geospatial-python-raster-dataset.tar.gz`. 
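+As a convenience, the same download and extraction can also be scripted from Python using only the standard library. The following is just a sketch of that alternative (it uses the same figshare URL as above and assumes you run it from your working directory), not part of the official setup instructions:
+
+```python
+import tarfile
+import urllib.request
+
+# Download the archive to the current working directory
+url = "https://figshare.com/ndownloader/files/36028100"
+urllib.request.urlretrieve(url, "geospatial-python-raster-dataset.tar.gz")
+
+# Extract it, creating the geospatial-python-raster-dataset/ directory
+with tarfile.open("geospatial-python-raster-dataset.tar.gz", "r:gz") as tar:
+    tar.extractall()
+```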
+If you choose to download the data, you can skip the following part and continue with
+
+**Load a Raster and View Attributes**.
+
+When you use the `rhodes_sentinel-2.json` file, remember that it contains information on where and how to access the target images from a remote repository. To load the saved search results as an `Item` list we will use [`pystac.ItemCollection.from_file()`](https://pystac.readthedocs.io/en/stable/api/item_collection.html). With this, we instruct Python to use the `from_file` method of the `ItemCollection` class in the `pystac` module to load data from a specified JSON file.

 ```python
 import pystac

-items = pystac.ItemCollection.from_file("search.json")
+file_items = pystac.ItemCollection.from_file("../data/stac_json/rhodes_sentinel-2.json")
+```
+In the search results, we have eleven `Item` type objects, corresponding to the Sentinel-2 scene parameters we set in episode 5 (all images from July 8th until August 27th 2023 with less than 1% cloud cover). In this episode we will focus on the newest item in our collection, since that shows the situation after the wildfire.
+
+Since the search results are stored as an `ItemCollection`, we can perform the same actions on them as when we read the data from the API in episode 5. For instance, to check the number of items in the collection you can use `len`:
+
+```python
+print(len(file_items))
+```
+To check the items themselves you can loop over them (as we did in episode 5). The parameters available on an item are documented [here](https://pystac.readthedocs.io/en/latest/api/item.html). Since we are interested in the most recent image, let us print the `id` and `datetime` of each item:
+
+```python
+for item in file_items:
+    print(item.id)
+    print(item.datetime)
+```
+
+```output
+S2A_35SNA_20230827_0_L2A
+2023-08-27 09:00:21.327000+00:00
+S2B_35SNA_20230822_0_L2A
+2023-08-22 09:00:20.047000+00:00
+S2A_35SNA_20230817_0_L2A
+2023-08-17 09:00:21.370000+00:00
+S2B_35SNA_20230812_0_L2A
+2023-08-12 09:00:21.069000+00:00
+S2A_35SNA_20230807_0_L2A
+2023-08-07 09:00:20.107000+00:00
+S2B_35SNA_20230802_0_L2A
+2023-08-02 09:00:20.321000+00:00
+S2A_35SNA_20230728_0_L2A
+2023-07-28 09:00:20.662000+00:00
+S2B_35SNA_20230723_0_L2A
+2023-07-23 09:00:21.476000+00:00
+S2A_35SNA_20230718_0_L2A
+2023-07-18 09:00:19.860000+00:00
+S2B_35SNA_20230713_0_L2A
+2023-07-13 09:00:20.114000+00:00
+S2A_35SNA_20230708_0_L2A
+2023-07-08 09:00:20.745000+00:00
+```
+
+You will notice that the item `S2A_35SNA_20230827_0_L2A` is the newest one, since it is printed first.
+
+```python
+print(file_items[0])
+```
+
+```output
+<Item id=S2A_35SNA_20230827_0_L2A>
+```
+
+To access an actual raster image from that date we look at the item's attribute `assets`, which is accessible as a dictionary. It contains all the various datasets that have been collected for that specific date.
+
+```python
+assets = file_items[0].assets
+print(assets.keys())
+```
+
+```output
+dict_keys(['aot', 'blue', 'coastal', 'granule_metadata', 'green', 'nir', 'nir08', 'nir09', 'red', 'rededge1', 'rededge2', 'rededge3', 'scl', 'swir16', 'swir22', 'thumbnail', 'tileinfo_metadata', 'visual', 'wvp', 'aot-jp2', 'blue-jp2', 'coastal-jp2', 'green-jp2', 'nir-jp2', 'nir08-jp2', 'nir09-jp2', 'red-jp2', 'rededge1-jp2', 'rededge2-jp2', 'rededge3-jp2', 'scl-jp2', 'swir16-jp2', 'swir22-jp2', 'visual-jp2', 'wvp-jp2'])
+```
+
+To analyse the burned areas, we are interested in the `red` band of the satellite scene. In [episode 9](/episodes/09-raster-calculations.md) we will further explain why the characteristics of that band are interesting in relation to wildfires.
+
+To select the red band we need to get its URL, or `href` (Hypertext Reference), and put it into a variable.
+
+```python
+rhodes_red_href = assets["red"].href
+```
+
+```python
+print(rhodes_red_href)
+```
+
+## Load a Raster and View Attributes
+
+Now we can load the `red` band using the function [`rioxarray.open_rasterio()`](https://corteva.github.io/rioxarray/html/rioxarray.html#rioxarray-open-rasterio), via the variable we created.
+
+```python
+import rioxarray
+rhodes_red = rioxarray.open_rasterio(rhodes_red_href)
 ```
+In case you used the downloaded data, you can load the file directly instead:

 ```python
 import rioxarray
-raster_ams_b9 = rioxarray.open_rasterio(items[0].assets["nir09"].href)
+rhodes_red = rioxarray.open_rasterio("../data/stac_json/rhodes_red.tif")
 ```

-By calling the variable name in the jupyter notebook we can get a quick look at the shape and attributes of the data.
+The first call to `rioxarray.open_rasterio()` opens the file from remote or local storage, and then returns a `xarray.DataArray` object. The object is stored in a variable, i.e. `rhodes_red`. Reading in the data with `xarray` instead of `rioxarray` also returns a `xarray.DataArray`, but the output will not contain the geospatial metadata (such as projection information). You can use numpy functions or built-in Python math operators on a `xarray.DataArray` just like a numpy array. Calling the variable name of the `DataArray` also prints out all of its metadata information.
+
+By calling the variable name we can get a quick look at the shape and attributes of the data.

 ```python
-raster_ams_b9
+print(rhodes_red)
 ```

 ```output
-<xarray.DataArray (band: 1, y: 1830, x: 1830)>
-[3348900 values with dtype=uint16]
+<xarray.DataArray (band: 1, y: 10980, x: 10980)> Size: 241MB
+[120560400 values with dtype=uint16]
 Coordinates:
-  * band (band) int64 1
-  * x (x) float64 6e+05 6.001e+05 6.002e+05 ... 7.097e+05 7.098e+05
-  * y (y) float64 5.9e+06 5.9e+06 5.9e+06 ... 5.79e+06 5.79e+06
-  spatial_ref int64 0
+  * band (band) int32 4B 1
+  * x (x) float64 88kB 5e+05 5e+05 5e+05 ... 6.098e+05 6.098e+05
+  * y (y) float64 88kB 4.1e+06 4.1e+06 4.1e+06 ... 3.99e+06 3.99e+06
+  spatial_ref int32 4B 0
 Attributes:
-  _FillValue: 0.0
-  scale_factor: 1.0
-  add_offset: 0.0
+  AREA_OR_POINT: Area
+  OVR_RESAMPLING_ALG: AVERAGE
+  _FillValue: 0
+  scale_factor: 1.0
+  add_offset: 0.0
 ```

-The first call to `rioxarray.open_rasterio()` opens the file from remote or local storage, and then returns a `xarray.DataArray` object. The object is stored in a variable, i.e. `raster_ams_b9`. 
Reading in the data with `xarray` instead of `rioxarray` also returns a `xarray.DataArray`, but the output will not contain the geospatial metadata (such as projection information). You can use numpy functions or built-in Python math operators on a `xarray.DataArray` just like a numpy array. Calling the variable name of the `DataArray` also prints out all of its metadata information.
+The output tells us that we are looking at an `xarray.DataArray`, with `1` band, `10980` rows, and `10980` columns. We can also see the number of pixel values in the `DataArray`, and the type of those pixel values, which is unsigned integer (or `uint16`). The `DataArray` also stores different values for the coordinates of the `DataArray`. When using `rioxarray`, the term coordinates refers to spatial coordinates like `x` and `y` but also the `band` coordinate. Each of these sequences of values has its own data type, like `float64` for the spatial coordinates and `int32` for the `band` coordinate.

-The output tells us that we are looking at an `xarray.DataArray`, with `1` band, `1830` rows, and `1830` columns. We can also see the number of pixel values in the `DataArray`, and the type of those pixel values, which is unsigned integer (or `uint16`). The `DataArray` also stores different values for the coordinates of the `DataArray`. When using `rioxarray`, the term coordinates refers to spatial coordinates like `x` and `y` but also the `band` coordinate. Each of these sequences of values has its own data type, like `float64` for the spatial coordinates and `int64` for the `band` coordinate.
-
-This `DataArray` object also has a couple of attributes that are accessed like `.rio.crs`, `.rio.nodata`, and `.rio.bounds()`, which contain the metadata for the file we opened. Note that many of the metadata are accessed as attributes without `()`, but `bounds()` is a method (i.e. a function in an object) and needs parentheses.
+This `DataArray` object also has a couple of attributes that are accessed like `.rio.crs`, `.rio.nodata`, and `.rio.bounds()` (in Jupyter you can browse through these attributes by using `tab` for auto-completion, or have a look at the documentation [here](https://corteva.github.io/rioxarray/stable/rioxarray.html#rioxarray-rio-accessors)), which contain the metadata for the file we opened. Note that many of the metadata are accessed as attributes without `()`; however, since `bounds()` is a method (i.e. a function in an object) it requires parentheses, and this is also the case for `.rio.resolution()`.

 ```python
-print(raster_ams_b9.rio.crs)
-print(raster_ams_b9.rio.nodata)
-print(raster_ams_b9.rio.bounds())
-print(raster_ams_b9.rio.width)
-print(raster_ams_b9.rio.height)
+print(rhodes_red.rio.crs)
+print(rhodes_red.rio.nodata)
+print(rhodes_red.rio.bounds())
+print(rhodes_red.rio.width)
+print(rhodes_red.rio.height)
+print(rhodes_red.rio.resolution())
 ```

 ```output
-EPSG:32631
+EPSG:32635
 0
-(600000.0, 5790240.0, 709800.0, 5900040.0)
-1830
-1830
+(499980.0, 3990240.0, 609780.0, 4100040.0)
+10980
+10980
+(10.0, -10.0)
 ```

-The Coordinate Reference System, or `raster_ams_b9.rio.crs`, is reported as the string `EPSG:32631`. The `nodata` value is encoded as 0 and the bounding box corners of our raster are represented by the output of `.bounds()` as a `tuple` (like a list but you can't edit it). The height and width match what we saw when we printed the `DataArray`, but by using `.rio.width` and `.rio.height` we can access these values if we need them in calculations. 
-
-We will be exploring this data throughout this episode. By the end of this episode, you will be able to understand and explain the metadata output.
-
-:::callout
-## Tip - Variable names
-To improve code readability, file and object names should be used that make it clear what is in the file. The data for this episode covers Amsterdam, and is from Band 9, so we'll use a naming convention of `raster_ams_b9`.
-:::
+The Coordinate Reference System, or `rhodes_red.rio.crs`, is reported as the string `EPSG:32635`. The `nodata` value is encoded as 0 and the bounding box corners of our raster are represented by the output of `.bounds()` as a `tuple` (like a list but you can't edit it). The height and width match what we saw when we printed the `DataArray`, but by using `.rio.width` and `.rio.height` we can access these values if we need them in calculations.

 ## Visualize a Raster

 After viewing the attributes of our raster, we can examine the raw values of the array with `.values`:

 ```python
-raster_ams_b9.values
+rhodes_red.values
 ```

 ```output
@@ -127,80 +309,110 @@ array([[[ 0, 0, 0, ..., 8888, 9075, 8139],

 This can give us a quick view of the values of our array, but only at the corners. Since our raster is loaded in python as a `DataArray` type, we can plot this in one line similar to a pandas `DataFrame` with `DataArray.plot()`.

 ```python
-raster_ams_b9.plot()
+rhodes_red.plot()
 ```

-![Raster plot with rioxarray](fig/E06/overview-plot-B09.png){alt="raster plot with defualt setting"}
+![Raster plot with rioxarray](fig/E06/rhodes_red_B04.png){alt="raster plot with default setting"}
+
+Notice that `rioxarray` helpfully allows us to plot this raster with spatial coordinates on the x and y axis (this is not the default in many cases with other functions or libraries). Nice plot! However, it probably took a while to load, so it makes sense to resample the image.

-Nice plot! Notice that `rioxarray` helpfully allows us to plot this raster with spatial coordinates on the x and y axis (this is not the default in many cases with other functions or libraries).
+## Resampling the raster image

-This plot shows the satellite measurement of the spectral band `nir09` for an area that covers part of the Netherlands. According to the [Sentinel-2 documentaion](https://sentinels.copernicus.eu/web/sentinel/technical-guides/sentinel-2-msi/msi-instrument), this is a band with the central wavelength of 945nm, which is sensitive to water vapor. It has a spatial resolution of 60m. Note that the `band=1` in the image title refers to the ordering of all the bands in the `DataArray`, not the Sentinel-2 band number `09` that we saw in the pystac search results.
+The red band image is available as a raster file with 10 m resolution, which makes it a relatively large file (a few hundred MB).
+In order to keep calculations "manageable" (reasonable execution time and memory usage) we select here a lower resolution version of the image, taking
+advantage of the so-called "pyramidal" structure of cloud-optimized GeoTIFFs (COGs). COGs, in fact, typically include
+multiple lower-resolution versions of the original image, called "overviews", in the same file. This allows us to avoid
+downloading high-resolution images when only quick previews are required.

-With a quick view of the image, we notice that half of the image is blank, no data is captured. We also see that the cloudy pixels at the top have high reflectance values, while the contrast of everything else is quite low. 
This is expected because this band is sensitive to the water vapor. However if one would like to have a better color contrast, one can add the option `robust=True`, which displays values between the 2nd and 98th percentile:
+Overviews are often computed using powers of 2 as down-sampling (or zoom) factors. So, typically, the first level
+overview (index 0) corresponds to a zoom factor of 2, the second level overview (index 1) corresponds to a zoom factor
+of 4, and so on. Here, we open the third level overview (index 2, zoom factor 8) and check that the resolution is about 80 m:

 ```python
-raster_ams_b9.plot(robust=True)
+import rioxarray
+rhodes_red_80 = rioxarray.open_rasterio(rhodes_red_href, overview_level=2)
+print(rhodes_red_80.rio.resolution())
 ```

-![Raster plot using the "robust" setting](fig/E06/overview-plot-B09-robust.png){alt="raster plot with robust setting"}
+```output
+(79.97086671522214, -79.97086671522214)
+```
+Let's plot this one.
+
+```python
+rhodes_red_80.plot()
+```
+![Raster plot 80 x 80 meter resolution with rioxarray](fig/E06/rhodes_red_80_B04.png){alt="raster plot with default setting"}
+
+This plot shows the satellite measurement of the band `red` for Rhodes after the wildfire. According to the [Sentinel-2 documentation](https://sentinels.copernicus.eu/web/sentinel/technical-guides/sentinel-2-msi/msi-instrument), this is a band with a central wavelength of 665nm. It has a spatial resolution of 10m. Note that the `band=1` in the image title refers to the ordering of all the bands in the `DataArray`, not the Sentinel-2 band number `04` that we saw in the pystac search results.

-Now the color limit is set in a way fitting most of the values in the image. We have a better view of the ground pixels.
 :::callout
 ## Tool Tip
-The option `robust=True` always forces displaying values between the 2nd and 98th percentile. Of course, this will not work for every case. For a customized displaying range, you can also manually specifying the keywords `vmin` and `vmax`. For example ploting between `100` and `7000`:
+The option `robust=True` always forces displaying values between the 2nd and 98th percentile. Of course, this will not work for every case.

 ```python
-raster_ams_b9.plot(vmin=100, vmax=7000)
+rhodes_red_80.plot(robust=True)
 ```
+![Raster plot using the "robust" setting](fig/E06/rhodes_red_80_B04_robust.png){alt="raster plot with robust setting"}
+
+Now the color limit is set in a way fitting most of the values in the image. We have a better view of the ground pixels.
+
+For a customized display range, you can also manually specify the keywords `vmin` and `vmax`. For example, plotting between `100` and `2000`:
+
+```python
+rhodes_red_80.plot(vmin=100, vmax=2000)
+```
+
+![Raster plot using vmin 100 and vmax 2000](fig/E06/rhodes_red_80_B04_vmin100_vmax2000.png){alt="raster plot with vmin and vmax set"}
+
+More options can be consulted [here](https://docs.xarray.dev/en/v2024.02.0/generated/xarray.plot.imshow.html). You will notice that these parameters belong to the `imshow` method that underlies the `plot` function. Since plotting is based on matplotlib and is so widely used, your Python environment passes these keyword arguments on for you without you having to call the method explicitly. This is a convenience, but it can be confusing when looking up documentation. We will explain more about this below.
+
 :::

 ## View Raster Coordinate Reference System (CRS) in Python

Another piece of information we are interested in is the CRS, and it can be accessed with `.rio.crs`. 
We introduced the concept of a CRS in [an earlier
-episode](03-crs.md).
-Now we will see how features of the CRS appear in our data file and what
-meanings they have. We can view the CRS string associated with our DataArray's `rio` object using the `crs`
-attribute.
+episode](03-crs.md). Now we will see how features of the CRS appear in our data file and what
+meanings they have. We can view the CRS string associated with our DataArray's `rio` object using the `crs` attribute.

 ```python
-print(raster_ams_b9.rio.crs)
+print(rhodes_red_80.rio.crs)
 ```

 ```output
-EPSG:32631
+EPSG:32635
 ```

-To print the EPSG code number as an `int`, we use the `.to_epsg()` method:
+To print the EPSG code number as an `int`, we use the `.to_epsg()` method (which originates from rasterio's [`to_epsg`](https://rasterio.readthedocs.io/en/stable/api/rasterio.crs.html#rasterio.crs.CRS.to_epsg)):

 ```python
-raster_ams_b9.rio.crs.to_epsg()
+rhodes_red_80.rio.crs.to_epsg()
 ```

 ```output
-32631
+32635
 ```
-
-EPSG codes are great for succinctly representing a particular coordinate reference system. But what if we want to see more details about the CRS, like the units? For that, we can use `pyproj`, a library for representing and working with coordinate reference systems.
+EPSG codes are great for succinctly representing a particular coordinate reference system. But what if we want to see more details about the CRS, like the units? For that, we can use [`pyproj`](https://pyproj4.github.io/pyproj/stable/api/index.html), a library for representing and working with coordinate reference systems.

 ```python
 from pyproj import CRS
-epsg = raster_ams_b9.rio.crs.to_epsg()
+epsg = rhodes_red_80.rio.crs.to_epsg()
 crs = CRS(epsg)
 crs
 ```

 ```output
-<Projected CRS: EPSG:32631>
-Name: WGS 84 / UTM zone 31N
+<Projected CRS: EPSG:32635>
+Name: WGS 84 / UTM zone 35N
 Axis Info [cartesian]:
 - E[east]: Easting (metre)
 - N[north]: Northing (metre)
 Area of Use:
-- name: Between 0°E and 6°E, northern hemisphere between equator and 84°N, onshore and offshore. Algeria. Andorra. Belgium. Benin. Burkina Faso. Denmark - North Sea. France. Germany - North Sea. Ghana. Luxembourg. Mali. Netherlands. Niger. Nigeria. Norway. Spain. Togo. United Kingdom (UK) - North Sea.
-- bounds: (0.0, 0.0, 6.0, 84.0)
+- name: Between 24°E and 30°E, northern hemisphere between equator and 84°N, onshore and offshore. Belarus. Bulgaria. Central African Republic. Democratic Republic of the Congo (Zaire). Egypt. Estonia. Finland. Greece. Latvia. Lesotho. Libya. Lithuania. Moldova. Norway. Poland. Romania. Russian Federation. Sudan. Svalbard. Türkiye (Turkey). Uganda. Ukraine.
+- bounds: (24.0, 0.0, 30.0, 84.0)
 Coordinate Operation:
-- name: UTM zone 31N
+- name: UTM zone 35N
 - method: Transverse Mercator
 Datum: World Geodetic System 1984 ensemble
 - Ellipsoid: WGS 84
@@ -209,14 +421,14 @@ Datum: World Geodetic System 1984 ensemble

 The `CRS` class from the `pyproj` library allows us to create a `CRS` object with methods and attributes for accessing specific information about a CRS, or the detailed summary shown above.

-A particularly useful attribute is `area_of_use`, which shows the geographic bounds that the CRS is intended to be used.
+A particularly useful attribute is [`area_of_use`](https://pyproj4.github.io/pyproj/stable/api/crs/crs.html#pyproj.crs.CRS.area_of_use), which shows the geographic bounds within which the CRS is intended to be used.
```python
 crs.area_of_use
 ```

 ```output
-AreaOfUse(west=0.0, south=0.0, east=6.0, north=84.0, name='Between 0°E and 6°E, northern hemisphere between equator and 84°N, onshore and offshore. Algeria. Andorra. Belgium. Benin. Burkina Faso. Denmark - North Sea. France. Germany - North Sea. Ghana. Luxembourg. Mali. Netherlands. Niger. Nigeria. Norway. Spain. Togo. United Kingdom (UK) - North Sea.')
+AreaOfUse(west=24.0, south=0.0, east=30.0, north=84.0, name='Between 24°E and 30°E, northern hemisphere between equator and 84°N, onshore and offshore. Belarus. Bulgaria. Central African Republic. Democratic Republic of the Congo (Zaire). Egypt. Estonia. Finland. Greece. Latvia. Lesotho. Libya. Lithuania. Moldova. Norway. Poland. Romania. Russian Federation. Sudan. Svalbard. Türkiye (Turkey). Uganda. Ukraine.')
 ```

 :::challenge
@@ -225,7 +437,7 @@ What units are our data in? See if you can find a method to examine this informa

 ::::solution
 `crs.axis_info` tells us that the CRS for our raster has two axes and both are in meters.
-We could also get this information from the attribute `raster_ams_b9.rio.crs.linear_units`.
+We could also get this information from the attribute `rhodes_red_80.rio.crs.linear_units`.
 ::::
 :::

 Let's break down the pieces of the `pyproj` CRS summary. The string contains all of the individual CRS elements that Python or another GIS might need, separated into distinct sections: name, axis info, area of use, coordinate operation, and datum.

 ```output
-<Projected CRS: EPSG:32631>
-Name: WGS 84 / UTM zone 31N
+<Projected CRS: EPSG:32635>
+Name: WGS 84 / UTM zone 35N
 Axis Info [cartesian]:
 - E[east]: Easting (metre)
 - N[north]: Northing (metre)
 Area of Use:
-- name: Between 0°E and 6°E, northern hemisphere between equator and 84°N, onshore and offshore. Algeria. Andorra. Belgium. Benin. Burkina Faso. Denmark - North Sea. France. Germany - North Sea. Ghana. Luxembourg. Mali. Netherlands. Niger. Nigeria. Norway. Spain. Togo. United Kingdom (UK) - North Sea.
-- bounds: (0.0, 0.0, 6.0, 84.0)
+- name: Between 24°E and 30°E, northern hemisphere between equator and 84°N, onshore and offshore. Belarus. Bulgaria. Central African Republic. Democratic Republic of the Congo (Zaire). Egypt. Estonia. Finland. Greece. Latvia. Lesotho. Libya. Lithuania. Moldova. Norway. Poland. Romania. Russian Federation. Sudan. Svalbard. Türkiye (Turkey). Uganda. Ukraine.
+- bounds: (24.0, 0.0, 30.0, 84.0)
 Coordinate Operation:
-- name: UTM zone 31N
+- name: UTM zone 35N
 - method: Transverse Mercator
 Datum: World Geodetic System 1984 ensemble
 - Ellipsoid: WGS 84
 - Prime Meridian: Greenwich
 ```

-* **Name** of the projection is UTM zone 31N (UTM has 60 zones, each 6-degrees of longitude in width). The underlying datum is WGS84.
+* **Name** of the projection is UTM zone 35N (UTM has 60 zones, each 6-degrees of longitude in width). The underlying datum is WGS84.
 * **Axis Info**: the CRS shows a Cartesian system with two axes, easting and northing, in meter units.
-* **Area of Use**: the projection is used for a particular range of longitudes `0°E to 6°E` in the northern hemisphere (`0.0°N to 84.0°N`)
+* **Area of Use**: the projection is used for a particular range of longitudes `24°E to 30°E` in the northern hemisphere (`0.0°N to 84.0°N`)
 * **Coordinate Operation**: the operation to project the coordinates (if it is projected) onto a cartesian (x, y) plane. Transverse Mercator is accurate for areas with longitudinal widths of a few degrees, hence the distinct UTM zones. 
* **Datum**: Details about the datum, or the reference point for coordinates. `WGS 84` and `NAD 1983` are common datums. `NAD 1983` is [set to be replaced in 2022](https://en.wikipedia.org/wiki/Datum_of_2022).

@@ -265,43 +477,43 @@ zone. Below is a simplified view of US UTM zones.

 It is useful to know the minimum or maximum values of a raster dataset. We can compute these and other descriptive statistics with `min`, `max`, `mean`, and `std`.

 ```python
-print(raster_ams_b9.min())
-print(raster_ams_b9.max())
-print(raster_ams_b9.mean())
-print(raster_ams_b9.std())
+print(rhodes_red_80.min())
+print(rhodes_red_80.max())
+print(rhodes_red_80.mean())
+print(rhodes_red_80.std())
 ```

 ```output
-<xarray.DataArray ()>
+<xarray.DataArray ()> Size: 2B
array(0, dtype=uint16)
Coordinates:
- spatial_ref int64 0
-<xarray.DataArray ()>
-array(15497, dtype=uint16)
+ spatial_ref int32 4B 0
+<xarray.DataArray ()> Size: 2B
+array(7277, dtype=uint16)
Coordinates:
- spatial_ref int64 0
-<xarray.DataArray ()>
-array(1652.44009944)
+ spatial_ref int32 4B 0
+<xarray.DataArray ()> Size: 8B
+array(404.07532588)
Coordinates:
- spatial_ref int64 0
-<xarray.DataArray ()>
-array(2049.16447495)
+ spatial_ref int32 4B 0
+<xarray.DataArray ()> Size: 8B
+array(527.5557502)
Coordinates:
- spatial_ref int64 0
+ spatial_ref int32 4B 0
 ```

 The information above includes a report of the min, max, mean, and standard deviation values, along with the data type. If we want to see specific quantiles, we can use xarray's `.quantile()` method. For example for the 25% and 75% quantiles:

 ```python
-print(raster_ams_b9.quantile([0.25, 0.75]))
+print(rhodes_red_80.quantile([0.25, 0.75]))
 ```

 ```output
-<xarray.DataArray (quantile: 2)>
-array([ 0., 2911.])
+<xarray.DataArray (quantile: 2)> Size: 16B
+array([165., 315.])
 Coordinates:
-  * quantile (quantile) float64 0.25 0.75
+  * quantile (quantile) float64 16B 0.25 0.75
 ```

 :::callout
@@ -310,90 +522,127 @@ You could also get each of these values one by one using `numpy`.

 ```python
 import numpy
-print(numpy.percentile(raster_ams_b9, 25))
-print(numpy.percentile(raster_ams_b9, 75))
+print(numpy.percentile(rhodes_red_80, 25))
+print(numpy.percentile(rhodes_red_80, 75))
 ```

 ```output
-0.0
-2911.0
+165.0
+315.0
 ```

-You may notice that `raster_ams_b9.quantile` and `numpy.percentile` didn't require an argument specifying the axis or dimension along which to compute the quantile. This is because `axis=None` is the default for most numpy functions, and therefore `dim=None` is the default for most xarray methods. It's always good to check out the docs on a function to see what the default arguments are, particularly when working with multi-dimensional image data. To do so, we can use`help(raster_ams_b9.quantile)` or `?raster_ams_b9.percentile` if you are using jupyter notebook or jupyter lab.
+You may notice that `rhodes_red_80.quantile` and `numpy.percentile` didn't require an argument specifying the axis or dimension along which to compute the quantile. This is because `axis=None` is the default for most numpy functions, and therefore `dim=None` is the default for most xarray methods. It's always good to check out the docs on a function to see what the default arguments are, particularly when working with multi-dimensional image data. To do so, we can use `help(rhodes_red_80.quantile)` or `?rhodes_red_80.quantile` if you are using Jupyter Notebook or Jupyter Lab.
 :::

 ## Dealing with Missing Data

-So far, we have visualized a band of a Sentinel-2 scene and calculated its statistics. However, we need to take missing data into account. Raster data often has a "no data value" associated with it and for raster datasets read in by `rioxarray`. This value is referred to as `nodata`. 
This is a value assigned to pixels where data is missing or no data were collected. There can be different cases that cause missing data, and it's common for other values in a raster to represent different cases. The most common example is missing data at the edges of rasters.
+So far, we have visualized a band of a Sentinel-2 scene and calculated its statistics. However, as you can see on the image, it also contains an artificial no-data strip in the top left where data is missing. In order to calculate meaningful statistics, we need to take missing data into account. Raster data often has a "no data value" associated with it; for raster datasets read in by `rioxarray` this value is referred to as `nodata`. This is a value assigned to pixels where data is missing or no data were collected. There can be different cases that cause missing data, and it's common for other values in a raster to represent different cases. The most common example is missing data at the edges of rasters.

-By default the shape of a raster is always rectangular. So if we have a dataset that has a shape that isn't rectangular, some pixels at the edge of the raster will have no data values. This often happens when the data were collected by a sensor which only flew over some part of a defined region.
+By default the shape of a raster is always rectangular. So if we have a dataset that has a shape that isn't rectangular, like most satellite images, some pixels at the edge of the raster will have no data values. This often happens when the data were collected by a sensor which only flew over some part of a defined region, and it is almost unavoidable because the earth's surface is curved while we work with geographic and projected coordinate systems.

-As we have seen above, the `nodata` value of this dataset (`raster_ams_b9.rio.nodata`) is 0. When we have plotted the band data, or calculated statistics, the missing value was not distinguished from other values. Missing data may cause some unexpected results. For example, the 25th percentile we just calculated was 0, probably reflecting the presence of a lot of missing data in the raster.
+To check the value of [`nodata`](https://corteva.github.io/rioxarray/html/rioxarray.html#rioxarray.raster_array.RasterArray.nodata) of this dataset you can use:

-To distinguish missing data from real data, one possible way is to use `nan` to represent them. This can be done by specifying `masked=True` when loading the raster:

 ```python
-raster_ams_b9 = rioxarray.open_rasterio(items[0].assets["nir09"].href, masked=True)
+rhodes_red_80.rio.nodata
 ```

-One can also use the `where` function to select all the pixels which are different from the `nodata` value of the raster:
+```output
+0
+```
+
+You will find out that this is 0. When we have plotted the band data, or calculated statistics, the missing value was not distinguished from other values. Missing data may cause some unexpected results.
+
+To distinguish missing data from real data, one possible way is to use `nan` (which stands for Not a Number) to represent it. This can be done by specifying `masked=True` when loading the raster. Let us reload our data and put it into a different variable with the mask. Remember the `rhodes_red_href` variable we created previously. 
(If you used the downloaded data you can use a direct link to the file):

+```python
+rhodes_red_mask_80 = rioxarray.open_rasterio(rhodes_red_href, masked=True, overview_level=2)
+```
+Let us have a look at the data.
+
+```python
+print(rhodes_red_mask_80)
+```
+
+One can also use the `where` method of the `DataArray` to select all the pixels which are different from the `nodata` value of the raster:
+
+```python
+rhodes_red_altmask_80 = rhodes_red_80.where(rhodes_red_80!=rhodes_red_80.rio.nodata)
+```
+
+Note that this approach has the same effect as loading the data with `masked=True` above: the pixels equal to the `nodata` value are replaced by `nan`.
+
+Either way will change the `nodata` value from 0 to `nan`. Now if we compute the statistics again, the missing data will not be considered. Let's compare them:
+```python
+print(rhodes_red_80.min())
+print(rhodes_red_mask_80.min())
+print(rhodes_red_80.max())
+print(rhodes_red_mask_80.max())
+print(rhodes_red_80.mean())
+print(rhodes_red_mask_80.mean())
+print(rhodes_red_80.std())
+print(rhodes_red_mask_80.std())
+```

 ```output
-<xarray.DataArray ()>
-array(8., dtype=float32)
+<xarray.DataArray ()> Size: 2B
+array(0, dtype=uint16)
 Coordinates:
-    spatial_ref int64 0
-<xarray.DataArray ()>
-array(15497., dtype=float32)
+    spatial_ref int32 4B 0
+<xarray.DataArray ()> Size: 4B
+array(1., dtype=float32)
 Coordinates:
-    spatial_ref int64 0
-<xarray.DataArray ()>
-array(2477.405, dtype=float32)
+    spatial_ref int32 4B 0
+<xarray.DataArray ()> Size: 2B
+array(7277, dtype=uint16)
 Coordinates:
-    spatial_ref int64 0
-<xarray.DataArray ()>
-array(2061.9539, dtype=float32)
+    spatial_ref int32 4B 0
+<xarray.DataArray ()> Size: 4B
+array(7277., dtype=float32)
 Coordinates:
-    spatial_ref int64 0
+    spatial_ref int32 4B 0
+<xarray.DataArray ()> Size: 8B
+array(404.07532588)
+Coordinates:
+    spatial_ref int32 4B 0
+<xarray.DataArray ()> Size: 4B
+array(461.78833, dtype=float32)
+Coordinates:
+    spatial_ref int32 4B 0
+<xarray.DataArray ()> Size: 8B
+array(527.5557502)
+Coordinates:
+    spatial_ref int32 4B 0
+<xarray.DataArray ()> Size: 4B
+array(539.82855, dtype=float32)
+Coordinates:
+    spatial_ref int32 4B 0
 ```
-
 And if we plot the image, the `nodata` pixels are not shown because they are not 0 anymore:

-![Raster plot after masking out missing values](fig/E06/overview-plot-B09-robust-with-nan.png){alt="raster plot masking missing values"}
+![Raster plot after masking out missing values](fig/E06/rhodes_red_80_B04_robust_nan.png){alt="raster plot masking missing values"}

-One should notice that there is a side effect of using `nan` instead of `0` to represent the missing data: the data type of the `DataArray` was changed from integers to float. This need to be taken into consideration when the data type matters in your application.
+One should notice that there is a side effect of using `nan` instead of `0` to represent the missing data: the data type of the `DataArray` was changed from integers to float (as can be seen when we printed the statistics). This needs to be taken into consideration when the data type matters in your application.

 ## Raster Bands

-So far we looked into a single band raster, i.e. the `nir09` band of a Sentinel-2 scene. However, to get a smaller, non georeferenced version of the scene, one may also want to visualize the true-color overview of the region. This is provided as a multi-band raster -- a raster dataset that contains more than one band.
+So far we looked into a single band raster, i.e. 
the `red` band of a Sentinel-2 scene. However, one may also want to visualize a smaller, true-color overview of the region. This is provided as a multi-band raster -- a raster dataset that contains more than one band.

 ![Sketch of a multi-band raster image](fig/E06/single_multi_raster.png){alt="multi-band raster"}

-The `overview` asset in the Sentinel-2 scene is a multiband asset. Similar to `nir09`, we can load it by:
+The `overview` asset in the Sentinel-2 scene is a multiband asset. Similar to `red`, we can load it by:

 ```python
-raster_ams_overview = rioxarray.open_rasterio(items[0].assets['visual'].href, overview_level=3)
-raster_ams_overview
+rhodes_overview = rioxarray.open_rasterio(file_items[0].assets['visual'].href, overview_level=2)
+rhodes_overview
 ```

 ```output
-<xarray.DataArray (band: 3, y: 687, x: 687)>
-[1415907 values with dtype=uint8]
+<xarray.DataArray (band: 3, y: 1373, x: 1373)> Size: 6MB
+[5655387 values with dtype=uint8]
 Coordinates:
-  * band (band) int64 1 2 3
-  * x (x) float64 6.001e+05 6.002e+05 ... 7.096e+05 7.097e+05
-  * y (y) float64 5.9e+06 5.9e+06 5.9e+06 ... 5.79e+06 5.79e+06
-  spatial_ref int64 0
+  * band (band) int32 12B 1 2 3
+  * x (x) float64 11kB 5e+05 5.001e+05 ... 6.097e+05 6.097e+05
+  * y (y) float64 11kB 4.1e+06 4.1e+06 4.1e+06 ... 3.99e+06 3.99e+06
+  spatial_ref int32 4B 0
 Attributes:
  AREA_OR_POINT: Area
 OVR_RESAMPLING_ALG: AVERAGE
@@ -403,21 +652,22 @@ Attributes:

 ```

-The band number comes first when GeoTiffs are read with the `.open_rasterio()` function. As we can see in the `xarray.DataArray` object, the shape is now `(band: 3, y: 687, x: 687)`, with three bands in the `band` dimension. It's always a good idea to examine the shape of the raster array you are working with and make sure it's what you expect. Many functions, especially the ones that plot images, expect a raster array to have a particular shape. One can also check the shape using the `.shape` attribute:
+The band number comes first when GeoTiffs are read with the `.open_rasterio()` function. As we can see in the `xarray.DataArray` object, the shape is now `(band: 3, y: 1373, x: 1373)`, with three bands in the `band` dimension. It's always a good idea to examine the shape of the raster array you are working with and make sure it's what you expect. Many functions, especially the ones that plot images, expect a raster array to have a particular shape. One can also check the shape using the [`.shape`](https://docs.xarray.dev/en/latest/generated/xarray.DataArray.shape.html) attribute:
+
 ```python
-raster_ams_overview.shape
+rhodes_overview.shape
 ```

 ```output
-(3, 687, 687)
+(3, 1373, 1373)
 ```

 One can visualize the multi-band data with the `DataArray.plot.imshow()` function:

 ```python
-raster_ams_overview.plot.imshow()
+rhodes_overview.plot.imshow()
 ```

-![Overview of the true-color image (multi-band raster)](fig/E06/overview-plot-true-color.png){alt="true-color image overview"}
+![Overview of the true-color image (multi-band raster)](fig/E06/rhodes_multiband_80.png){alt="true-color image overview"}

 Note that the `DataArray.plot.imshow()` function makes assumptions about the shape of the input DataArray, that since it has three channels, the correct colormap for these channels is RGB. It does not work directly on image arrays with more than 3 channels. One can replace one of the RGB channels with another band, to make a false-color image.

@@ -429,10 +679,10 @@ As seen in the figure above, the true-color image is stretched. 
Let's visualize it with the correct aspect ratio. Since we know the height/width ratio is 1:1 (check the `rio.height` and `rio.width` attributes), we can set the aspect ratio to be 1. For example, we can choose the size to be 5 inches, and set `aspect=1`. Note that according to the [documentation](https://xarray.pydata.org/en/stable/generated/xarray.DataArray.plot.imshow.html) of `DataArray.plot.imshow()`, when specifying the `aspect` argument, `size` also needs to be provided.

 ```python
-raster_ams_overview.plot.imshow(size=5, aspect=1)
+rhodes_overview.plot.imshow(size=5, aspect=1)
 ```

-![Overview of the true-color image with the correct aspect ratio](fig/E06/overview-plot-true-color-aspect-equal.png){alt="raster plot with correct aspect ratio"}
+![Overview of the true-color image with the correct aspect ratio](fig/E06/rhodes_multiband_80_equal_aspect.png){alt="raster plot with correct aspect ratio"}
 ::::
 :::

diff --git a/07-vector-data-in-python.md b/07-vector-data-in-python.md
index d8ee7716..61b9eb9c 100644
--- a/07-vector-data-in-python.md
+++ b/07-vector-data-in-python.md
@@ -19,426 +19,390 @@ questions:

 ## Introduction

-As discussed in [Episode 2: Introduction to Vector Data](02-intro-vector-data.md), vector data represents specific features on the Earth's surface using points, lines, and polygons. These geographic elements can then have one or more attributes assigned to them, such as 'name' and 'population' for a city, or crop type for a field. Vector data can be much smaller in (file) size than raster data, while being very rich in terms of the information captured.
+In the preceding episodes, we have prepared, selected and downloaded raster data from before and after the wildfire event in the summer of 2023 on the Greek island of Rhodes. To evaluate the impact of this wildfire on the vital infrastructure and built-up areas we are going to create a subset of vector data representing these assets. In this episode you will learn how to extract vector data with specific characteristics, like the type of attributes or their locations. The dataset that we will generate in this episode can later on be compared with the scorched areas, which we will determine by analyzing the satellite images (see [Episode 9: Raster Calculations in Python](09-raster-calculations.md)).

-In this episode, we will be moving from working with raster data to working with vector data. We will use Python to open and plot point, line, and polygon vector data. In particular, we will make use of the [`geopandas`](https://geopandas.org/en/stable/) package to open, manipulate and write vector datasets.
+We'll be examining vector datasets that represent the valuable assets of Rhodes. As mentioned in [Episode 2: Introduction to Vector Data](02-intro-vector-data.md), vector data uses points, lines, and polygons to depict specific features on the Earth's surface. These geographic elements can have one or more attributes, like 'name' and 'population' for a city. In this episode we'll be using two open data sources: the Database of Global Administrative Areas (GADM) dataset to generate a polygon for the island of Rhodes, and OpenStreetMap data for the vital infrastructure and valuable assets.

-![](fig/E07/pandas_geopandas_relation.png){alt="Pandas and Geopandas"}
+To handle the vector data in Python we use the package [`geopandas`](https://geopandas.org/en/stable/). This package allows us to open, manipulate, and write vector datasets through Python.

-`geopandas` extends the popular `pandas` library for data analysis to geospatial applications. 
The main `pandas` objects (the `Series` and the `DataFrame`) are expanded to `geopandas` objects (`GeoSeries` and `GeoDataFrame`). This extension is implemented by including geometric types, represented in Python using the `shapely` library, and by providing dedicated methods for spatial operations (union, intersection, etc.). The relationship between `Series`, `DataFrame`, `GeoSeries` and `GeoDataFrame` can be briefly explained as follow:
+![](fig/E07/pandas_geopandas_relation.png){alt="Pandas and Geopandas"}

- - A `Series` is a one-dimensional array with axis, holding any data type (integers, strings, floating-point numbers, Python objects, etc.)
- - A `DataFrame` is a two-dimensional labeled data structure with columns of potentially different types1.
- - A `GeoSeries` is a `Series` object designed to store shapely geometry objects.
- - A `GeoDataFrame` is an extened `pandas.DataFrame`, which has a column with geometry objects, and this column is a `GeoSeries`.
+`geopandas` enhances the widely-used `pandas` library for data analysis by extending its functionality to geospatial applications. The primary `pandas` objects (`Series` and `DataFrame`) are extended to `geopandas` objects (`GeoSeries` and `GeoDataFrame`). This extension is achieved by incorporating geometric types, represented in Python using the `shapely` library, and by offering dedicated methods for spatial operations such as unions, spatial joins and intersections. In order to understand how geopandas works, it is good to provide a brief explanation of the relationship between a `Series`, a `DataFrame`, a `GeoSeries`, and a `GeoDataFrame`:

-In later episodes, we will learn how to work with raster and vector data together and combine them into a single plot.
+- A `Series` is a one-dimensional array with an axis that can hold any data type (integers, strings, floating-point numbers, Python objects, etc.)
+- A `DataFrame` is a two-dimensional labeled data structure with columns that can potentially hold different types of data.
+- A `GeoSeries` is a `Series` object designed to store shapely geometry objects.
+- A `GeoDataFrame` is an extended `pandas.DataFrame` that includes a column with geometry objects, which is a `GeoSeries`.

 :::callout
 ## Introduce the Vector Data

-In this episode, we will use the downloaded vector data in the `data` directory. Please refer to the [setup page](../learners/setup.md) on how to download the data.
+In this episode, we will use the downloaded vector data from the `data` directory. Please refer to the [setup page](../learners/setup.md) on where to download the data. Note that we manipulated that data a little for the purposes of
+this workshop. The link to the original source can be found on the [setup page](../learners/setup.md).
 :::

-## Import Vector Datasets
+## Get the administrative boundary of the study area
+
+The first thing we want to do is to extract a polygon containing the boundary of the island of Rhodes from Greece. For this we will use the [GADM dataset](https://gadm.org/download_country.html) layer `ADM_ADM_3.gpkg` for Greece. For your convenience we saved a copy at: `data/gadm/ADM_ADM_3.gpkg`.
+We will use the `geopandas` package and its `read_file` function ([see the geopandas documentation](https://geopandas.org/en/stable/docs/user_guide/io.html)) to load the file. Note that geopandas is often abbreviated as gpd. 
+
 ```python
 import geopandas as gpd
+gdf_greece = gpd.read_file('../data/gadm/ADM_ADM_3.gpkg')
 ```
-
-We will use the `geopandas` package to load the crop field vector data we downloaded at: `data/brpgewaspercelen_definitief_2020_small.gpkg`.
+We can print out the `gdf_greece` variable:

 ```python
-fields = gpd.read_file("data/brpgewaspercelen_definitief_2020_small.gpkg")
-fields
+gdf_greece
 ```

+```output
+GID_3 GID_0 COUNTRY GID_1 NAME_1 \
+0 GRC.1.1.1_1 GRC Greece GRC.1_1 Aegean
+1 GRC.1.1.2_1 GRC Greece GRC.1_1 Aegean
+2 GRC.1.1.3_1 GRC Greece GRC.1_1 Aegean
+.. ... ... ... ... ...
+324 GRC.8.2.24_1 GRC Greece GRC.8_1 Thessaly and Central Greece
+325 GRC.8.2.25_1 GRC Greece GRC.8_1 Thessaly and Central Greece
+ NL_NAME_1 GID_2 NAME_2 NL_NAME_2 \
+0 Αιγαίο GRC.1.1_1 North Aegean Βόρειο Αιγαίο
+1 Αιγαίο GRC.1.1_1 North Aegean Βόρειο Αιγαίο
+2 Αιγαίο GRC.1.1_1 North Aegean Βόρειο Αιγαίο
+.. ... ... ... ...
+324 Θεσσαλία και Στερεά Ελλάδα GRC.8.2_1 Thessaly Θεσσαλία
+325 Θεσσαλία και Στερεά Ελλάδα GRC.8.2_1 Thessaly Θεσσαλία
+...
+324 POLYGON ((22.81903 39.27344, 22.81884 39.27332...
+325 POLYGON ((23.21375 39.36514, 23.21272 39.36469...
+
+[326 rows x 17 columns]
+```

-The data are read into the variable `fields` as a `GeoDataFrame`. This is an extened data format of `pandas.DataFrame`, with an extra column `geometry`.
+The data are read into the variable `gdf_greece` as a `GeoDataFrame`. This is an extended data format of `pandas.DataFrame`, with an extra column `geometry`. To explore the dataframe you can call this variable just like a pandas `DataFrame`, using attributes and methods such as `.shape`, `.head()` and `.tail()`.

-This file contains a relatively large number of crop field parcels. Directly loading a large file to memory can be slow. If the Area of Interest (AoI) is small, we can define a bounding box of the AoI, and only read the data within the extent of the bounding box.
+To visualize the polygons we can apply the [`plot()`](https://geopandas.org/en/stable/docs/user_guide/mapping.html) function to the `GeoDataFrame` we have loaded, `gdf_greece`:

-```python
-# Define bounding box
-xmin, xmax = (110_000, 140_000)
-ymin, ymax = (470_000, 510_000)
-bbox = (xmin, ymin, xmax, ymax)
-```

-Using the `bbox` input argument, we can load only the spatial features intersecting the provided bounding box.

 ```python
-# Partially load data within the bounding box
-fields = gpd.read_file("data/brpgewaspercelen_definitief_2020_small.gpkg", bbox=bbox)
+gdf_greece.plot()
 ```

-:::callout
-## How should I define my bounding box?
-For simplicity, here we assume the **Coordinate Reference System (CRS)** and **extent** of the vector file are known (for instance they are provided in the dataset documentation).
-
-You can also define your bounding box with online coordinates visualization tools. For example, we can use the "Draw Rectangular Polygon" tool in [geojson.io](https://geojson.io/#map=8.62/52.45/4.96).
-
-Some Python tools, e.g. [`fiona`](https://fiona.readthedocs.io/en/latest/) (which is also the backend of `geopandas`), provide the file inspection functionality without the need to read the full data set into memory. An example can be found in [the documentation of fiona](https://fiona.readthedocs.io/en/latest/manual.html#format-drivers-crs-bounds-and-schema). 
-:::

-And we can plot the overview by:
+![](fig/E07/greece_administration_areas.png){alt="greece_administrations"}

-```python
-fields.plot()
-```
+If you want to interactively explore your data you can use the [`.explore`](https://geopandas.org/en/stable/docs/user_guide/interactive_mapping.html) function in geopandas:

-![](fig/E07/fields.png){alt="Crop fields inside the AOI"}
+```python
+gdf_greece.explore()
+```
+In this interactive map you can easily zoom in and out and hover over the polygons to see which attributes, stored in the rows of your GeoDataFrame, are related to each polygon.

-## Vector Metadata & Attributes
+Next, we'll focus on isolating the administrative area of Rhodes Island. When you hover over the polygon of Rhodes (the relatively big island to the east) you will find out that Rhodes Island is listed as `Rhodos` in the `NAME_3` column of `gdf_greece`. Since our goal is to have a boundary of Rhodes, we'll now create a new variable that exclusively represents Rhodes Island.

-When we read the vector dataset with Python (as our `fields` variable) it is loaded as a `GeoDataFrame` object. The `read_file()` function also automatically stores geospatial information about the data. We are particularly interested in describing the format, CRS, extent, and other components of the vector data, and the attributes which describe properties associated
-with each vector object.
+Selecting rows of our GeoDataFrame with a specific value is done the same way as in a pandas `DataFrame`, using [`.loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html):

-We will explore
+```python
+gdf_rhodes = gdf_greece.loc[gdf_greece['NAME_3']=='Rhodos']
+```

-1. **Object Type:** the class of the imported object.
-2. **Coordinate Reference System (CRS):** the projection of the data.
-3. **Extent:** the spatial extent (i.e. geographic area that the data covers). Note that the spatial extent for a vector dataset represents the combined extent for all spatial objects in the dataset.
+And we can plot the overview (or show it interactively using `.explore`):

-Each `GeoDataFrame` has a `"geometry"` column that contains geometries. In the case of our `fields` object, this geometry is represented by a `shapely.geometry.Polygon` object. `geopandas` uses the `shapely` library to represent polygons, lines, and points, so the types are inherited from `shapely`.
+```python
+gdf_rhodes.plot()
+```

-We can view the metadata using the `.crs`, `.bounds` and `.type` attributes. First, let's view the
-geometry type for our crop field dataset. To view the geometry type, we use the `pandas` method `.type` on the `GeoDataFrame` object, `fields`.
+![](fig/E07/rhodes_administration_areas.png){alt="rhodes_administrations"}

-```python
-fields.type
-```
+Now that we have the administrative area of Rhodes Island, we can use the `to_file()` function to save it for future use.

-```output
-0 Polygon
-1 Polygon
-2 Polygon
-3 Polygon
-4 Polygon
- ...
-22026 Polygon
-22027 Polygon
-22028 Polygon
-22029 Polygon
-22030 Polygon
-Length: 22031, dtype: object
-```
+```python
+# Save the Rhodes boundary to a GeoPackage file
+gdf_rhodes.to_file('rhodes.gpkg')
+```

-To view the CRS metadata:
+## Get the vital infrastructure and built-up areas

-```python
-fields.crs
-```
+### Road data from OpenStreetMap (OSM)

-```output
-<Projected CRS: EPSG:28992>
-Name: Amersfoort / RD New
-Axis Info [cartesian]:
-- X[east]: Easting (metre)
-- Y[north]: Northing (metre)
-Area of Use:
-- name: Netherlands - onshore, including Waddenzee, Dutch Wadden Islands and 12-mile offshore coastal zone.
-- bounds: (3.2, 50.75, 7.22, 53.7)
-Coordinate Operation:
-- name: RD New
-- method: Oblique Stereographic
-Datum: Amersfoort
-- Ellipsoid: Bessel 1841
-- Prime Meridian: Greenwich
-```
+Now that we have the boundary of our study area, we will use this to select the main roads of our study area. We will take the following processing steps:

-Our data is in the CRS **RD New**. The CRS is critical to
-interpreting the object's extent values as it specifies units. To find
-the extent of our dataset in the projected coordinates, we can use the `.total_bounds` attribute:

-```python
-fields.total_bounds
-```

-```output
-array([109222.03325 , 469461.512625, 140295.122125, 510939.997875])
-```

-This array contains, in order, the values for minx, miny, maxx and maxy, for the overall dataset. The spatial extent of a GeoDataFrame represents the geographic "edge" or location that is the furthest north, south, east, and west. Thus, it represents the overall geographic coverage of the spatial object.

-We can convert these coordinates to a bounding box or acquire the index of the Dataframe to access the geometry. Either of these polygons can be used to clip rasters (more on that later).
+1. Select the roads of the study area
+2. Select the key infrastructure: 'primary', 'secondary', 'tertiary'
+3. 
Create a 100m buffer around the roads. This buffer will be regarded as the infrastructure region. (Note that this buffer is arbitrary and can be changed afterwards if you want!)
+
+#### Step 1: Select the roads of the study area
+
+For this workshop, to avoid having everyone download too much data, we created a subset of the [Openstreetmap](https://www.openstreetmap.org/) data we downloaded for Greece from [the Geofabrik](https://download.geofabrik.de/europe.html). This data comes in the form of a shapefile (see [episode 2](02-intro-vector-data.md)) from which we extracted all the roads for `Rhodes` and some surrounding islands. The data is stored in the osm folder as `osm_roads.gpkg`, but contains *all* the roads on the island (so also hiking paths, private roads etc.), whereas we are in particular interested in the key infrastructure, which we consider to be the roads classified as primary, secondary or tertiary roads.
+
+Let's load the file and plot it:

```python
+gdf_roads = gpd.read_file('../data/osm/osm_roads.gpkg')
```

We can explore it using the same commands as above:

```python
+gdf_roads.plot()
```

![](fig/E07/greece_highways.png){alt="greece_highways"}

+As you may have noticed, loading and plotting `osm_roads.gpkg` takes a while. This is because the file contains all the roads of Rhodes and some surrounding islands as well. Since we are only interested in the roads on Rhodes Island, we will use the [`mask`](https://geopandas.org/en/stable/docs/user_guide/io.html) parameter of the `read_file()` function to load only those.
+Let us now overwrite the GeoDataFrame `gdf_roads`, this time reading the file with the `gdf_rhodes` GeoDataFrame we created above as a mask.
+
+```python
+gdf_roads = gpd.read_file('../data/osm/osm_roads.gpkg', mask=gdf_rhodes)
+```

-## Further crop the dataset

-We might realize that the loaded dataset is still too large. 
+Now let us explore these roads using `.explore` (or `.plot`):

- ```python
- # A smaller bounding box in RD
- xmin, xmax = (120_000, 135_000)
- ymin, ymax = (485_000, 500_000)
+```python
+gdf_roads.explore()
+```

- fields_cx = fields.cx[xmin:xmax, ymin:ymax]
- ```
+![](fig/E07/rhodes_highways.png){alt="rhodes_highways"}

-## Export data to file
+#### Step 2: Select key infrastructure

-We will save the cropped results to a shapefile (`.shp`) and use it later. The `to_file` function can be used for exportation:
+As you will find out while exploring the roads dataset, information about the type of roads is stored in the `fclass` column. To get an overview of the different values that are present in the `fclass` column, we can use the [`unique()`](https://pandas.pydata.org/docs/reference/api/pandas.unique.html) function from pandas:

```python
-fields_cx.to_file('fields_cropped.shp')
+gdf_roads['fclass'].unique()
```
+```output
+array(['residential', 'service', 'unclassified', 'footway',
+       'track_grade4', 'primary', 'track', 'tertiary', 'track_grade3',
+       'path', 'track_grade5', 'steps', 'secondary', 'primary_link',
+       'track_grade2', 'track_grade1', 'pedestrian', 'tertiary_link',
+       'secondary_link', 'living_street', 'cycleway'], dtype=object)
+```
+It seems the variable `gdf_roads` contains all kinds of hiking paths and footpaths as well. Since we are only interested in vital infrastructure, classified as "primary", "secondary" and "tertiary" roads, we need to make a subselection.

-This will write it to disk (in this case, in 'shapefile' format), containing only the data from our cropped area. It can be read again at a later time using the `read_file()` method we have been using above. Note that this actually writes multiple files to disk (`fields_cropped.cpg`, `fields_cropped.dbf`, `fields_cropped.prj`, `fields_cropped.shp`, `fields_cropped.shx`). All these files should ideally be present in order to re-read the dataset later, although only the `.shp`, `.shx`, and `.dbf` files are mandatory (see the [Introduction to Vector Data](02-intro-vector-data.md) lesson for more information.)
+Let us first create a list with the labels we want to select:
+```python
+key_infra_labels = ['primary', 'secondary', 'tertiary']
+```

-## Selecting spatial features
+Now we use this list to make a subselection of the key infrastructure using pandas' [`.isin()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html):

-From now on, we will take in a point dataset `brogmwvolledigeset.zip`, which is the underground water monitoring wells. We will perform vector processing on this dataset, together with the cropped field polygons `fields_cropped.shp`.
+```python
+key_infra = gdf_roads.loc[gdf_roads['fclass'].isin(key_infra_labels)]
+```

-Let's read the two datasets.
+We can plot the key infrastructure:

```python
-fields = gpd.read_file("fields_cropped.shp")
-wells = gpd.read_file("data/brogmwvolledigeset.zip")
+key_infra.plot()
```
+![](fig/E07/rhodes_infra_highways.png){alt="rhodes_infra_highways"}

-And take a look at the wells:
+#### Step 3: Create a 100m buffer around the key infrastructure

-```python
-wells.plot(markersize=0.1)
-```
+Now that we have selected the key infrastructure, we want to create a 100m buffer around it. This buffer will be regarded as the infrastructure region.

-![](fig/E07/wells-nl.png){alt="all wells in the NL"}
+As you might have noticed, the numbers on the x and y axes of our plots represent longitude and latitude coordinates, meaning that the data is not yet projected. The current data has a geographic coordinate system with units in degrees, not meters, so creating a buffer of 100 meters is not directly possible. Therefore, in order to create a 100m buffer, we first need to project our data. In our case we decided to project the data to WGS 84 / UTM zone 31N, with EPSG code 32631 (see [episode 3](03-crs.md) for more information about CRSs and EPSG codes).

-The points represents all the wells over the Netherlands. Since the wells are in the lat/lon coordinates. To make it comparable with fields, we need to first transfer the CRS to the "RD New" projection:
+To project our data we use [`.to_crs`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_crs.html). We first define a variable with the EPSG value (in our case 32631), which we then use in the `to_crs` function.

```python
-wells = wells.to_crs(epsg=28992)
+epsg_code = 32631
+key_infra_meters = key_infra.to_crs(epsg_code)
```
+Now that our data is projected, we can create a buffer. For this we make use of geopandas' [`.buffer()` function](https://geopandas.org/en/stable/docs/user_guide/geometric_manipulations.html#GeoSeries.buffer):
+```python
+key_infra_meters_buffer = key_infra_meters.buffer(100)
+key_infra_meters_buffer
+```

-Now we would like to compare the wells with the cropped fields. We can select the wells within the fields using the `.clip` function:
+```output
+53       POLYGON ((2779295.383 4319805.295, 2779317.029...
+54       POLYGON ((2779270.962 4319974.441, 2779272.393...
+55       POLYGON ((2779172.341 4319578.062, 2779165.312...
+84       POLYGON ((2779615.109 4319862.058, 2779665.519...
+140      POLYGON ((2781330.698 4320046.538, 2781330.749...
+                               ...
+19020    POLYGON ((2780193.230 4337691.133, 2780184.279...
+19021    POLYGON ((2780330.823 4337772.262, 2780324.966...
+19022    POLYGON ((2780179.850 4337917.135, 2780188.871...
+19024    POLYGON ((2780516.550 4339028.863, 2780519.340...
+19032    POLYGON ((2780272.050 4338213.937, 2780274.519...
+Length: 1386, dtype: geometry
+```
+
+Note that `key_infra_meters_buffer` is a `GeoSeries` and not a `GeoDataFrame`. This is because the `buffer()` function returns a `GeoSeries` object. You can check this by calling `type()` on the variable:

```python
-wells_clip = wells.clip(fields)
-wells_clip
+type(key_infra_meters_buffer)
```
-
```output
-bro_id delivery_accountable_party quality_regime ...
-40744 GMW000000043703 27364178 IMBRO/A ...
-38728 GMW000000045818 27364178 IMBRO/A ...
-... ... ... ... ...
-40174 GMW000000043963 27364178 IMBRO/A ...
-19445 GMW000000024992 50200097 IMBRO/A ...
-[79 rows x 40 columns]
+geopandas.geoseries.GeoSeries
```

-After this selection, all the wells outside the fields are dropped. This takes a while to execute, because we are clipping a relatively large number of points with many polygons.
-
-If we do not want a precise clipping, but rather have the points in the neighborhood of the fields, we will need to create another polygon, which is slightly bigger than the coverage of the field. To do this, we can increase the size of the field polygons, to make them overlap with each other, and then merge the overlapping polygons together.
+Now that we have a buffer, we can convert it back to the geographic coordinate system to keep the data consistent.
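+
+Before converting back, you can verify that the buffer inherited the projected CRS. This is just a quick check on the variable created above:
+
+```python
+print(key_infra_meters_buffer.crs)  # expected: EPSG:32631
+```
+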
+Note that we are now using the CRS information from `key_infra`, instead of specifying the EPSG code directly (EPSG:4326):

-We will first use the `buffer` function to increase field size with a given `distance`. The unit of the `distance` argument is the same as the CRS. Here we use a 50-meter buffer. Also notice that the `.buffer` function produces a `GeoSeries`, so to keep the other columns, we assign it to the `GeoDataFrame` as a geometry column.
```python
-buffer = fields.buffer(50)
-fields_buffer = fields.copy()
-fields_buffer['geometry'] = buffer
-fields_buffer.plot()
+key_infra_buffer = key_infra_meters_buffer.to_crs(key_infra.crs)
+key_infra_buffer
```
+```output
+45       POLYGON ((27.72826 36.12409, 27.72839 36.12426...
+58       POLYGON ((27.71666 36.11678, 27.71665 36.11678...
+99       POLYGON ((27.75485 35.95242, 27.75493 35.95248...
+100      POLYGON ((27.76737 35.95086, 27.76733 35.95086...
+108      POLYGON ((27.76706 35.95199, 27.76702 35.95201...
+                               ...
+18876    POLYGON ((28.22855 36.41890, 28.22861 36.41899...
+18877    POLYGON ((28.22819 36.41838, 28.22825 36.41845...
+18878    POLYGON ((28.22865 36.41904, 28.22871 36.41912...
+18879    POLYGON ((28.23026 36.41927, 28.23034 36.41921...
+18880    POLYGON ((28.23020 36.41779, 28.23007 36.41745...
+Length: 1369, dtype: geometry
+```

-![](fig/E07/fields-buffer.png){alt="50m buffer around the fields"}
+As you can see, the buffers created in `key_infra_buffer` have the `Polygon` geometry type.

-To further simplify them, we can use the `dissolve` function to dissolve the buffers into one:
+To double-check the EPSG code of `key_infra`:

```python
-fields_buffer_dissolve = fields_buffer.dissolve()
-fields_buffer_dissolve
+print(key_infra.crs)
```
+```output
+EPSG:4326
+```

-All the fields will be dissolved into one multi-polygon, which can be used to `clip` the wells.
+Reprojecting and buffering our data is something that we are going to do multiple times during this episode. To avoid having to call the same sequence of functions over and over, it makes sense to wrap it in a function. Therefore, let us create a function that takes the buffer size as a parameter.

```python
-wells_clip_buffer = wells.clip(fields_buffer_dissolve)
-wells_clip_buffer.plot()
+def buffer_crs(gdf, size, meter_crs=32631, target_crs=4326):
+    return gdf.to_crs(meter_crs).buffer(size).to_crs(target_crs)
```
+For example, we can use this function to create a 200m buffer around the key infrastructure:

-![](fig/E07/wells-in-buffer.png){alt="Wells within 50m buffer of fields"}
-
-In this way, we selected all wells within the 50m range of the fields. It is also significantly faster than the previous `clip` operation, since the number of polygons is much smaller after `dissolve`.
-
-:::challenge
-## Exercise: clip fields within 500m from the wells
-This time, we will do a selection the other way around. Can you clip the field polygons (stored in fields_cropped.shp) with the 500m buffer of the wells (stored in brogmwvolledigeset.zip)? Please visualize the results.
-
-- Hint 1: The file `brogmwvolledigeset.zip` is in CRS 4326. Don’t forget the CRS conversion.
-
-- Hint 2: `brogmwvolledigeset.zip` contains all the wells in the Netherlands, which means it might be too big for the `.buffer()` function. To improve the performance, first crop it with the bounding box of the fields.
-
-::::solution
```python
-# Read in data
-fields = gpd.read_file("fields_cropped.shp")
-wells = gpd.read_file("data/brogmwvolledigeset.zip")
-
-# Crop points with bounding box
-xmin, ymin, xmax, ymax = fields.total_bounds
-wells = wells.to_crs(28992)
-wells_cx = wells.cx[xmin-500:xmax+500, ymin-500:ymax+500]
+key_infra_buffer_200 = buffer_crs(key_infra, 200)
+```

-# Create buffer
-wells_cx_500mbuffer = wells_cx.copy()
-wells_cx_500mbuffer['geometry'] = wells_cx.buffer(500)
+### Get built-up regions from Open Street Map (OSM)

-# Clip
-fields_clip_buffer = fields.clip(wells_cx_500mbuffer)
-fields_clip_buffer.plot()
-```
-![](fig/E07/fields-in-buffer-clip.png){alt="fields within 50m buffer of the wells, truncated"}
+Now that we have a buffered dataset for the key infrastructure of Rhodes, our next step is to create a dataset with all the built-up areas. To do so we will use the land use data from OSM, which we prepared for you in the file `data/osm_landuse.gpkg`. This file includes the land use data for the whole of Greece. We assume the built-up regions to be the union of three types of land use: "commercial", "industrial", and "residential".

-::::
-:::
+Note that for simplicity, we limit the built-up regions to these three types of land use. In reality, built-up regions can be more complex, and there are definitely higher-quality data sources available (e.g., from local governments).
+Now it will be up to you to create a dataset with valuable assets. You should be able to complete this task by yourself with the knowledge you have gained from the previous steps and the links to the documentation we provided.

-## Spatially join the features
+:::challenge
+## Exercise: Get the built-up regions

-In the exercise, we clipped the fields polygons with the 500m buffers of wells. The results from this clipping changed the shape of the polygons. If we would like to keep the original shape of the fields, one way is to use the `sjoin` function, which join two `GeoDataFrame`'s on the basis of their spatial relationship:
+Create a `builtup_buffer` from the file `data/osm_landuse.gpkg` by following these steps:

-```python
-# Join fields and wells_cx_500mbuffer
-fields_wells_buffer = fields.sjoin(wells_cx_500mbuffer)
-print(fields_wells_buffer.shape)
-```
+1. Load the land use data from `data/osm_landuse.gpkg` and mask it with the administrative boundary of Rhodes Island (`gdf_rhodes`).
+2. Select the land use data for "commercial", "industrial", and "residential".
+3. Create a 10m buffer around the land use data.
+4. Visualize the results.
+
+After completing the exercise, answer the following questions:

-```output
-(11420, 46)
-```
+1. How many unique land use types are there in `osm_landuse.gpkg`?
+2. After selecting the three types of land use, how many entries (rows) are there in the results?
+
+Hints:

-This will result in a `GeodataFrame` of all possible combinations of polygons and well buffers intersecting each other. Since a polygon can fall into multiple buffers, there will be duplicated field indexes in the results. To select the fields which intersects the well buffers, we can first get the unique indexes, and use the `iloc` indexer to select:
+- `data/osm_landuse.gpkg` contains the land use data for the whole of Greece. Use the administrative boundary of Rhodes Island (`gdf_rhodes`) to select the land use data for Rhodes Island.
+- The land use attribute is stored in the `fclass` column.
+- Reuse the `buffer_crs` function to create the buffer.

+::::solution
```python
-idx = fields_wells_buffer.index.unique()
-fiedls_in_buffer = fields.iloc[idx]
+# Read data with a mask of Rhodes
+gdf_landuse = gpd.read_file('../data/osm/osm_landuse.gpkg', mask=gdf_rhodes)

-fiedls_in_buffer.plot()
-```
+# Find the number of unique land use types
+print(len(gdf_landuse['fclass'].unique()))
+
+# Extract built-up regions
+builtup_labels = ['commercial', 'industrial', 'residential']
+builtup = gdf_landuse.loc[gdf_landuse['fclass'].isin(builtup_labels)]

-![](fig/E07/fields-in-buffer-sjoin.png){alt="Fields in 50m buffer of wells, not truncated"}
+# Create a 10m buffer around the built-up regions
+builtup_buffer = buffer_crs(builtup, 10)

-## Modify the geometry of a GeoDataFrame
+# Get the number of entries
+print(len(builtup_buffer))

-:::challenge
-## Exercise: Investigate the waterway lines
-Now we will take a deeper look at the Dutch waterway lines: `waterways_nl`. Let's load the file `status_vaarweg.zip`, and visualize it with the `plot()` function. Can you tell what is wrong with this vector file?
+# Visualize the buffer
+builtup_buffer.plot()
+```

-::::solution
-By plotting out the vector file, we can tell that the latitude and longitude of the file are flipped.
-```python
-waterways_nl = gpd.read_file('data/status_vaarweg.zip')
-waterways_nl.plot()
+```output
+19
+1336
```
-![](fig/E07/waterways-wrong.png){alt="waterways, rotated"}
-::::
-:::
+![](fig/E07/rhodes_builtup_buffer.png){alt="rhodes_builtup_buffer"}

-:::callout
-## Axis ordering
-According to the standards, the axis ordering for a CRS should follow the definition provided by the competent authority. For the commonly used EPSG:4326 geographic coordinate system, the EPSG defines the ordering as first latitude then longitude.
-However, in the GIS world, it is custom to work with coordinate tuples where the first component is aligned with the east/west direction and the second component is aligned with the north/south direction.
-Multiple software packages thus implement this convention also when dealing with EPSG:4326.
-As a result, one can encounter vector files that implement either convention - keep this in mind and always check your datasets!
+::::
:::

-Sometimes we need to modify the `geometry` of a `GeoDataFrame`. For example, as we have seen in the previous exercise **Investigate the waterway lines**, the latitude and longitude are flipped in the vector data `waterways_nl`. This error needs to be fixed before performing further analysis.
+## Merge the infrastructure regions and built-up regions
+
+Now that we have the infrastructure regions and the built-up regions, we can merge them into a single dataset. We would like to keep track of the region type after merging, so we will add two new columns, `type` and `code`, when converting each `GeoSeries` to a `GeoDataFrame`.

-Let's first take a look on what makes up the `geometry` column of `waterways_nl`:
+First we convert the buffer around the key infrastructure:

```python
-waterways_nl['geometry']
+data = {'geometry': key_infra_buffer, 'type': 'infrastructure', 'code': 1}
+gdf_infra = gpd.GeoDataFrame(data)
```
-```output
-0 LINESTRING (52.41810 4.84060, 52.42070 4.84090...
-1 LINESTRING (52.11910 4.67450, 52.11930 4.67340...
-2 LINESTRING (52.10090 4.25730, 52.10390 4.25530...
-3 LINESTRING (53.47250 6.84550, 53.47740 6.83840...
-4 LINESTRING (52.32270 5.14300, 52.32100 5.14640...
- ...
-86 LINESTRING (51.49270 5.39100, 51.48050 5.39160...
-87 LINESTRING (52.15900 5.38510, 52.16010 5.38340...
-88 LINESTRING (51.97340 4.12420, 51.97110 4.12220...
-89 LINESTRING (52.11910 4.67450, 52.11850 4.67430...
-90 LINESTRING (51.88940 4.61900, 51.89040 4.61350...
-Name: geometry, Length: 91, dtype: geometry
-```
-
-
-Each row is a `LINESTRING` object. We can further zoom into one of the rows, for example, the third row:
+Then we convert the built-up buffer:

```python
-print(waterways_nl['geometry'][2])
-print(type(waterways_nl['geometry'][2]))
-```
-
-```output
-LINESTRING (52.100900002 4.25730000099998, 52.1039 4.25529999999998, 52.111299999 4.24929999900002, 52.1274 4.23449999799999)
-<class 'shapely.geometry.linestring.LineString'>
+data = {'geometry': builtup_buffer, 'type': 'builtup', 'code': 2}
+gdf_builtup = gpd.GeoDataFrame(data)
```

-As we can see in the output, the `LINESTRING` object contains a list of coordinates of the vertices. In our situation, we would like to find a way to flip the x and y of every coordinates set. A good way to look for the solution is to use the [documentation](https://shapely.readthedocs.io/en/stable/manual.html) of the `shapely` package, since we are seeking to modify the `LINESTRING` object. Here we are going to use the [`shapely.ops.transform`](https://shapely.readthedocs.io/en/stable/manual.html?highlight=shapely.ops.transform#shapely.ops.transform) function, which applies a self-defined function to all coordinates of a geometry.
+After that, we can merge the two `GeoDataFrame`s into one:

```python
-import shapely
-
-# Define a function flipping the x and y coordinate values
-def flip(geometry):
-    return shapely.ops.transform(lambda x, y: (y, x), geometry)
-
-# Apply this function to all coordinates and all lines
-geom_corrected = waterways_nl['geometry'].apply(flip)
+import pandas as pd
+gdf_assets = pd.concat([gdf_infra, gdf_builtup]).reset_index(drop=True)
```
+In `gdf_assets`, we can distinguish the infrastructure regions from the built-up regions by the `type` and `code` columns. We can plot `gdf_assets` to visualize the merged regions. See the [geopandas documentation](https://geopandas.org/en/stable/docs/user_guide/mapping.html) on how to do this:

-Then we can update the `geometry` column with the corrected geometry `geom_corrected`, and visualize it to check:
```python
-# Update geometry
-waterways_nl['geometry'] = geom_corrected
-
-# Visualization
-waterways_nl.plot()
+gdf_assets.plot(column='type', legend=True)
```
+![](fig/E07/rhodes_assets.png){alt="rhodes_assets"}

-![](fig/E07/waterways-corrected.png){alt="waterways, corrected"}
+Finally, we can save `gdf_assets` to a file for future use:

-Now the waterways look good! We can save the vector data for later usage:
```python
-# Update geometry
-waterways_nl.to_file('waterways_nl_corrected.shp')
+gdf_assets.to_file('assets.gpkg')
```

:::keypoints
- Load spatial objects into Python with `geopandas.read_file()` function.
- Spatial objects can be plotted directly with `GeoDataFrame`'s `.plot()` method.
-- Crop spatial objects with `.cx[]` indexer.
-- Convert CRS of spatial objects with `.to_crs()`.
-- Select spatial features with `.clip()`.
+- Convert CRS of spatial objects with `.to_crs()`.
-- Create a buffer of spatial objects with `.buffer()`.
+- Create a buffer of spatial objects with `.buffer()`. Note that this generates a `GeoSeries` object.
-- Merge overlapping spatial objects with `.dissolve()`.
-- Join spatial features spatially with `.sjoin()`.
+- Merge spatial objects with `pd.concat()`.
:::

diff --git a/08-crop-raster-data.md b/08-crop-raster-data.md
index ed224d95..66a1ec9e 100644
--- a/08-crop-raster-data.md
+++ b/08-crop-raster-data.md
@@ -1,7 +1,7 @@
---
title: "Crop raster data with rioxarray and geopandas"
-teaching: 70
-exercises: 30
+teaching: 40
+exercises: 0
---

:::questions
@@ -9,15 +9,15 @@ exercises: 30
:::

:::objectives
+- Align the CRS of geospatial data.
- Crop raster data with a bounding box.
- Crop raster data with a polygon.
-- Match two raster datasets in different CRS.
:::

It is quite common that the raster data you have in hand is too large to process, or not all the pixels are relevant to your area of interest (AoI). In both situations, you should consider cropping your raster data before performing data analysis.

-In this episode, we will introduce how to crop raster data into the desired area. We will use one Sentinel-2 image over Amsterdam as the example raster data, and introduce how to crop your data to different types of AoIs.
+In this episode, we will introduce how to crop raster data into the desired area. We will use one Sentinel-2 image over Rhodes Island as the example raster data, and introduce how to crop your data to different types of AoIs.

:::callout
## Introduce the Data
@@ -36,259 +36,171 @@ We also use the cropped fields polygons `fields_cropped.shp`, which was generate

## Align the CRS of the raster and the vector data

-We load a true color image using `pystac` and `rioxarray` and check the shape of the raster:
+### Data loading
+
+First, we will load the visual image of Sentinel-2 over Rhodes Island, which we downloaded and stored in `data/sentinel2/visual.tif`.
+
+We can open this asset with `rioxarray`, and specify the overview level, since this is a Cloud-Optimized GeoTIFF (COG) file. As explained in episode 6, raster images can be quite big, so we resample the data using `rioxarray`'s overview parameter, setting it to `overview_level=1`.

```python
-import pystac
import rioxarray
-
-# Load image and inspect the shape
-items = pystac.ItemCollection.from_file("search.json")
-raster = rioxarray.open_rasterio(items[1].assets["visual"].href) # Select a true color image
-print(raster.shape)
+path_visual = 'data/sentinel2/visual.tif'
+visual = rioxarray.open_rasterio(path_visual, overview_level=1)
+visual
```
-
```output
-(3, 10980, 10980)
+<xarray.DataArray (band: 3, y: 2745, x: 2745)>
+[22605075 values with dtype=uint8]
+Coordinates:
+  * band         (band) int64 1 2 3
+  * x            (x) float64 5e+05 5e+05 5.001e+05 ... 6.097e+05 6.098e+05
+  * y            (y) float64 4.1e+06 4.1e+06 4.1e+06 ... 3.99e+06 3.99e+06
+    spatial_ref  int64 0
+Attributes:
+    AREA_OR_POINT:       Area
+    OVR_RESAMPLING_ALG:  AVERAGE
+    _FillValue:          0
+    scale_factor:        1.0
+    add_offset:          0.0
```
+As we introduced in the raster data introduction episode, this will perform a "lazy" loading of the image, meaning that the image will not be loaded into memory until necessary.

-This will perform a "lazy" loading of the image, i.e. the image will not be loaded into the memory until necessary, but we can still access some attributes, e.g. the shape of the image.
-
-The large size of the raster data makes it time and memory consuming to visualize in its entirety. Instead, we can fetch and plot the overviews of the raster. "Overviews" are precomputed lower resolution representations of a raster, stored in the same COG that contains the original raster.
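+
+To get a feeling for how much the overview level reduces the data volume, we can inspect the shape and resolution of the loaded raster. This is just a quick check on the `visual` variable from above; at `overview_level=1` each axis is downsampled by a factor of 4 with respect to the original 10980 by 10980 pixels:
+
+```python
+print(visual.shape)             # (3, 2745, 2745)
+print(visual.rio.resolution())  # pixel size in the raster's CRS, 40 m here instead of the native 10 m
+```
+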
+Let's also load the assets file generated in the vector data episode:

```python
-# Get the overview asset
-raster_overview = rioxarray.open_rasterio(items[1].assets["visual"].href, overview_level=3)
-print(raster_overview.shape)
-
-# Visualize it
-raster_overview.plot.imshow(figsize=(8,8))
+import geopandas as gpd
+assets = gpd.read_file('assets.gpkg')
```
+### Crop the raster with a bounding box

-![](fig/E08/crop-raster-overview-raster-00.png){alt="Overview of the raster"}
-
-As we can see, the overview image is much smaller compared to the original true color image.
-
-To align the raster and vector data, we first check each coordinate system. For raster data, we use `pyproj.CRS`:
+The assets file contains the information about the vital infrastructure and built-up areas on the island of Rhodes. The visual image, on the other hand, has a larger extent. Let us check this by visualizing the raster image:

```python
-from pyproj import CRS
-
-# Check the coordinate system
-CRS(raster.rio.crs)
-```
-
-
-```output
-<Projected CRS: EPSG:32631>
-Name: WGS 84 / UTM zone 31N
-Axis Info [cartesian]:
-- [east]: Easting (metre)
-- [north]: Northing (metre)
-Area of Use:
-- undefined
-Coordinate Operation:
-- name: UTM zone 31N
-- method: Transverse Mercator
-Datum: World Geodetic System 1984
-- Ellipsoid: WGS 84
-- Prime Meridian: Greenwich
+visual.plot.imshow()
```
+![](fig/E08/visual_large.png){alt="Large visual raster"}

-
-To open and check the coordinate system of vector data, we use `geopandas`:
+Let's check the extent of the assets to find out their rough location in the raster image.
+We can use the [`total_bounds`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.total_bounds.html) attribute from `GeoSeries` of `geopandas` to get the bounding box:

```python
-import geopandas as gpd
-
-# Load the polygons of the crop fields
-fields = gpd.read_file("fields_cropped.shp")
-
-# Check the coordinate system
-fields.crs
+assets.total_bounds
```

```output
-<Projected CRS: EPSG:28992>
-Name: Amersfoort / RD New
-Axis Info [cartesian]:
-- X[east]: Easting (metre)
-- Y[north]: Northing (metre)
-Area of Use:
-- name: Netherlands - onshore, including Waddenzee, Dutch Wadden Islands and 12-mile offshore coastal zone.
-- bounds: (3.2, 50.75, 7.22, 53.7)
-Coordinate Operation:
-- name: RD New
-- method: Oblique Stereographic
-Datum: Amersfoort
-- Ellipsoid: Bessel 1841
-- Prime Meridian: Greenwich
+array([27.7121001 , 35.87837949, 28.24591124, 36.45725024])
```

+The bounding box is composed of the `[minx, miny, maxx, maxy]` values of the dataset. Comparing these values with the raster image, we can see that the magnitude of the bounding box coordinates does not match that of the raster coordinates. This is because the two datasets have different coordinate reference systems (CRSs). This will cause problems when cropping the raster image, therefore we first need to align the CRSs of the two datasets.

-As seen, the coordinate systems differ. To crop the raster using the shapefile, we first need to reproject one dataset to the other's CRS. Since `raster` is large, we will convert the CRS of `fields` to the CRS of `raster` to avoid loading the entire image:
+Since the raster image has a larger data volume than the vector data, we will reproject the vector data to the CRS of the raster data.
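+
+Before reprojecting, we can make the CRS mismatch explicit by printing both CRSs. This is only a quick check on the two variables defined above:
+
+```python
+print(assets.crs)      # the vector data, expected: EPSG:4326
+print(visual.rio.crs)  # the raster data, a projected UTM CRS
+```
+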
+We can use the `to_crs` method:

```python
-fields = fields.to_crs(raster.rio.crs)
-```
-
+# Reproject
+assets = assets.to_crs(visual.rio.crs)

-## Crop raster data with a bounding box
-
-The `clip_box` function allows one to crop a raster by the
-min/max of the x and y coordinates. Note that we are cropping the original image `raster` now, and not the overview image `raster_overview`.
-
-```python
-# Crop the raster with the bounding box
-raster_clip_box = raster.rio.clip_box(*fields.total_bounds)
-print(raster_clip_box.shape)
+# Check the new bounding box
+assets.total_bounds
```

```output
-(3, 1565, 1565)
+array([ 564058.0257114 , 3970719.4080227 , 611743.71498815, 4035358.56340039])
```
-
-We successfully cropped the raster to a much smaller piece. We can visualize it now:
+Now the bounding box coordinates are updated. We can use the `clip_box` function, through the `rioxarray` accessor, to crop the raster image to the bounding box of the vector data. `clip_box` takes four positional input arguments in the order `xmin`, `ymin`, `xmax`, `ymax`, which is exactly the order of the values in `assets.total_bounds`. Since `assets.total_bounds` is a `numpy` array, we can use the `*` operator to unpack it into the corresponding positional arguments of `clip_box`.

```python
-raster_clip_box.plot.imshow(figsize=(8,8))
-```
-![](fig/E08/crop-raster-crop-by-bb-02.png){alt="Raster cropped by a bounding box"}
+# Crop the raster with the bounding box
+visual_clipbox = visual.rio.clip_box(*assets.total_bounds)

-This cropped image can be saved for later usage:
-```python
-raster_clip_box.rio.to_raster("raster_clip.tif")
+# Visualize the cropped image
+visual_clipbox.plot.imshow()
```
+![](fig/E08/visual_clip_box.png){alt="Clip box results"}

-## Crop raster data with polygons
+:::callout
+## Code Tip
+Cropping a raster with a bounding box is a quick way to reduce the size of the raster data. Since this operation is based on min/max coordinates, it is not as computationally expensive as cropping with polygons, which requires more accurate overlay operations.
+:::

-We have a cropped image around the fields. To further analyze the fields, one may want to crop the image to the exact field boundaries.
-This can be done with the `clip` function:
+### Crop the raster with a polygon

-```python
-raster_clip_fields = raster_clip_box.rio.clip(fields['geometry'])
-```
+We can also crop the raster with a polygon. In this case, we will use the raster `clip` function through the `rio` accessor. For this we will use the `geometry` column of the `assets` GeoDataFrame to specify the polygon:

-And we can visualize the results:
-```python
-raster_clip_fields.plot.imshow(figsize=(8,8))
-```
-![](fig/E08/crop-raster-crop-fields.png){alt="Ratser cropped by field polygons"}
-
-:::challenge
-## Exercise: crop raster data with a specific code
-In the column "gewascode" (translated as "crop code") of `fields`, you can find the code representing the types of plants grown in each field. Can you:
-
-1. Select the fields with "gewascode" equal to `257`;
-2. Crop the raster `raster_clip_box` with the selected fields;
-3. Visualize the cropped image.
-
-::::solution
```python
-mask = fields['gewascode']==257
-fields_gwascode = fields.where(mask)
-fields_gwascode = fields_gwascode.dropna()
-raster_clip_fields_gwascode = raster_clip_box.rio.clip(fields_gwascode['geometry'])
-raster_clip_fields_gwascode.plot.imshow(figsize=(8,8))
+# Crop the raster with the polygon
+visual_clip = visual_clipbox.rio.clip(assets["geometry"])

+# Visualize the cropped image
+visual_clip.plot.imshow()
```
-![](fig/E08/crop-raster-fields-gewascode.png){alt="Raster croped by fields with gewascode 257"}
-
-::::
-:::
-
+![](fig/E08/visual_clip.png){alt="Clip results"}

-## Crop raster data using `reproject_match()` function
+### Match two rasters

-So far we have learned how to crop raster images with vector data. We can also crop a raster with another raster data. In this section, we will demonstrate how to crop the `raster_clip_box` image using the `raster_clip_fields_gwascode` image. We will use the `reproject_match` function. As indicated by its name, it performs reprojection and clipping in one go.
+Sometimes you need to match two rasters with different extents, resolutions, or CRSs. The `reproject_match` function can be used for this purpose. We will demonstrate this by matching the cropped raster `visual_clip` with the Digital Elevation Model (DEM) of Rhodes, `rhodes_dem.tif`.
+
+First, let's load the DEM:

-To demonstrate the reprojection, we will first reproject `raster_clip_fields_gwascode` to the RD CRS system, so it will be in a different CRS from `raster_clip_box`:
```python
-# Reproject to RD to make the CRS different from the "raster"
-raster_clip_fields_gwascode = raster_clip_fields_gwascode.rio.reproject("EPSG:28992")
-CRS(raster_clip_fields_gwascode.rio.crs)
+dem = rioxarray.open_rasterio('./data/dem/rhodes_dem.tif')
```
+And visualize it:

```output
-<Projected CRS: EPSG:28992>
-Name: Amersfoort / RD New
-Axis Info [cartesian]:
-- [east]: Easting (metre)
-- [north]: Northing (metre)
-Area of Use:
-- undefined
-Coordinate Operation:
-- name: unnamed
-- method: Oblique Stereographic
-Datum: Amersfoort
-- Ellipsoid: Bessel 1841
-- Prime Meridian: Greenwich
+```python
+dem.plot()
```
+![](fig/E08/dem.png){alt="DEM"}

-And let's check again the CRS of `raster_clip_box`:
+From the visualization, we can see that the DEM has a different extent, resolution, and CRS compared to the cropped visual image. We can also confirm this by checking the CRS of the two images:

```python
-CRS(raster_clip_box.rio.crs)
+print(dem.rio.crs)
+print(visual_clip.rio.crs)
```
-
```output
-<Projected CRS: EPSG:32631>
-Name: WGS 84 / UTM zone 31N
-Axis Info [cartesian]:
-- [east]: Easting (metre)
-- [north]: Northing (metre)
-Area of Use:
-- undefined
-Coordinate Operation:
-- name: UTM zone 31N
-- method: Transverse Mercator
-Datum: World Geodetic System 1984
-- Ellipsoid: WGS 84
-- Prime Meridian: Greenwich
+EPSG:4326
+EPSG:32635
```
-
-Now the two images are in different coordinate systems. We can use `rioxarray.reproject_match()` function to crop `raster_clip_box` image.
+We can use the `reproject_match` function to match the two rasters. One can choose to match the DEM to the visual image or vice versa.
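+
+Which direction you match determines whether the data gets upsampled or downsampled. To inform that choice, it can help to compare the two resolutions first; this is just a quick inspection of the variables defined above:
+
+```python
+print(dem.rio.resolution())          # in degrees per pixel, since the DEM is in EPSG:4326
+print(visual_clip.rio.resolution())  # in meters per pixel, in the projected CRS
+```
+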
+Here we will match the DEM to the visual image:

```python
-raster_reproject_match = raster_clip_box.rio.reproject_match(raster_clip_fields_gwascode)
-raster_reproject_match.plot.imshow(figsize=(8,8))
+dem_matched = dem.rio.reproject_match(visual_clip)
```
-
-![](fig/E08/reprojectmatch-big-to-small.png){alt="Reproject match big to small"}
-
-We can also use it to expand `raster_clip_fields_gwascode` to the extent of `raster_clip_box`:
+And then visualize the matched DEM:

```python
-raster_reproject_match = raster_clip_fields_gwascode.rio.reproject_match(raster_clip_box)
-raster_reproject_match.plot.imshow(figsize=(8,8))
+dem_matched.plot()
```
-![](fig/E08/reprojectmatch-small-to-big.png){alt="Reproject match small to big"}
+![](fig/E08/dem_matched.png){alt="Matched DEM"}

-In one line `reproject_match` does a lot of helpful things:
+As we can see, `reproject_match` does a lot of helpful things in one line of code:

1. It reprojects.
-2. It matches the extent using `nodata` values or by clipping the data.
-3. It sets `nodata` values. This means we can run calculations on those two images.
+2. It matches the extent.
+3. It matches the resolution.

-:::callout
+Finally, we can save the matched DEM for later use. We save it as a Cloud-Optimized GeoTIFF (COG) file:
+
+```python
+dem_matched.rio.to_raster('dem_rhodes_match.tif', driver='COG')
+```
+:::callout
+
## Code Tip
-As we saw before, there also exists a method called `reproject()`, which only reprojects one raster to another projection. If you want more control over how rasters are resampled, clipped, and/or reprojected, you can use the `reproject()` method and other `rioxarray` methods individually.
+There is also a method in rioxarray: [`reproject()`](https://corteva.github.io/rioxarray/stable/rioxarray.html#rioxarray.raster_array.RasterArray.reproject), which only reprojects one raster to another projection. If you want more control over how rasters are resampled, clipped, and/or reprojected, you can use the `reproject()` method individually.
:::

:::keypoints

diff --git a/09-raster-calculations.md b/09-raster-calculations.md
index 2ca0f32a..608f019d 100644
--- a/09-raster-calculations.md
+++ b/09-raster-calculations.md
@@ -10,42 +10,43 @@ exercises: 15

:::objectives
- Carry out operations with two rasters using Python's built-in math operators.
-- Reclassify a continuous raster to a categorical raster.
:::
-
-
## Introduction

We often want to combine values of and perform calculations on rasters to create a new output raster. This episode
covers how to perform basic math operations using raster datasets. It also illustrates how to match rasters with
-different resolutions so that they can be used in the same calculation. As an example, we will calculate a vegetation
-index over one of the satellite scenes.
+different resolutions so that they can be used in the same calculation. As an example, we will calculate
+[a binary classification mask](https://custom-scripts.sentinel-hub.com/sentinel-2/burned_area_ms/) to identify burned
+area over a satellite scene.

-### Normalized Difference Vegetation Index (NDVI)
-Suppose we are interested in monitoring vegetation fluctuations using satellite remote sensors. Scientists have defined
-a vegetation index to quantify the amount of green leaf vegetation using the light reflected in different wavelengths.
-This index, named Normalized Difference Vegetation Index (NDVI), exploits the fact that healthy green leaves strongly
-absorb red visible light while they mostly reflect light in the near infrared (NIR). The NDVI is computed as:
+The classification mask requires the following [Sentinel-2 bands](https://gisgeography.com/sentinel-2-bands-combinations/)
+(and derived indices):
+
+* [Normalized difference vegetation index (NDVI)](https://en.wikipedia.org/wiki/Normalized_difference_vegetation_index),
+  derived from the **near-infrared (NIR)** and **red** bands:

$$ NDVI = \frac{NIR - red}{NIR + red} $$

-where $NIR$ and $red$ label the reflectance values of the corresponding wavelengths. NDVI values range from -1 to +1.
-Values close to one indicate high density of green leaves. Poorly vegetated areas typically have NDVI values close to
-zero. Negative NDVI values often indicate cloud and water bodies.
+* [Normalized difference water index (NDWI)](https://en.wikipedia.org/wiki/Normalized_difference_water_index), derived
+  from the **green** and **NIR** bands:

-![Source: Wu C-D, McNeely E, Cedeño-Laurent JG, Pan W-C, Adamkiewicz G, Dominici F, et al. (2014) Linking Student Performance in Massachusetts Elementary Schools with the “Greenness” of School Surroundings Using Remote Sensing. PLoS ONE 9(10): e108548. https://doi.org/10.1371/journal.pone.0108548](fig/E09/PONE-NDVI.jpg){alt="PONE-NDVI image"}
+$$ NDWI = \frac{green - NIR}{green + NIR} $$

-:::callout
-## More Resources
-Check out more on NDVI in the NASA Earth Observatory portal:
-[Measuring Vegetation](https://earthobservatory.nasa.gov/features/MeasuringVegetation/measuring_vegetation_2.php).
-:::
+* A custom index derived from two of the **short-wave infrared (SWIR)** bands (with wavelengths of ~1600 nm and ~2200 nm,
+  respectively):
+
+$$ INDEX = \frac{SWIR_{16} - SWIR_{22}}{SWIR_{16} + SWIR_{22}} $$
+
+* The **blue**, **NIR**, and **SWIR** (1600 nm) bands.
+
+In the following, we start by computing the NDVI.

### Load and crop the Data

For this episode, we will use one of the Sentinel-2 scenes that we have already employed in the previous episodes.

:::callout
+
## Introduce the Data

We'll continue from the results of the satellite image search that we have carried out in an exercise from
@@ -56,36 +57,43 @@ raster data using this [link](https://figshare.com/ndownloader/files/36028100).
file in your current working directory, and extract the archive file by double-clicking on it or by running the
following command in your terminal `tar -zxvf geospatial-python-raster-dataset.tar.gz`. Use the file
`geospatial-python-raster-dataset/search.json` (instead of `search.json`) to get started with this lesson.
+
:::

Let's load the results of our initial imagery search using `pystac`:

```python
import pystac

-items = pystac.ItemCollection.from_file("search.json")
+items = pystac.ItemCollection.from_file("rhodes_sentinel-2.json")
```
-
-We then select the second item, and extract the URIs of the red and NIR bands ("red" and "nir08", respectively):
+We then select the first item, and extract the URLs of the red and NIR bands ("red" and "nir", respectively):

```python
-red_uri = items[1].assets["red"].href
-nir_uri = items[1].assets["nir08"].href
+item = items[0]
+red_href = item.assets["red"].get_absolute_href()
+nir_href = item.assets["nir"].get_absolute_href()
```

Let's load the rasters with `open_rasterio` using the argument `masked=True`.
```python
import rioxarray
-red = rioxarray.open_rasterio(red_uri, masked=True)
-nir = rioxarray.open_rasterio(nir_uri, masked=True)
+red = rioxarray.open_rasterio(red_href, masked=True)
+nir = rioxarray.open_rasterio(nir_href, masked=True)
```

-Let's also restrict our analysis to the same crop field area defined in the previous episode by clipping the rasters
-using a bounding box:
+Let's also restrict our analysis to the island of Rhodes: we can extract the bounding box from the vector file written
+in an earlier episode (note that we need to match the CRS to that of the raster files):

```python
-bbox = (629_000, 5_804_000, 639_000, 5_814_000)
+# determine bounding box of Rhodes, in the projected CRS
+import geopandas
+rhodes = geopandas.read_file('rhodes.gpkg')
+rhodes_reprojected = rhodes.to_crs(red.rio.crs)
+bbox = rhodes_reprojected.total_bounds
+
+# crop the rasters
red_clip = red.rio.clip_box(*bbox)
nir_clip = nir.rio.clip_box(*bbox)
```
@@ -105,65 +113,49 @@ nir_clip.plot(robust=True)

![](fig/E09/NIR-band.png){alt="near infra-red band image"}

-It is immediately evident how crop fields (rectangular shapes in the central part of the two figures) appear as dark and
-bright spots in the red-visible and NIR wavelengths, respectively, suggesting the presence of leafy crop at the time of
-observation (end of March). The same fields would instead appear as dark spots in the off season.
+The burned area is immediately evident as a dark spot in the NIR wavelength, due to the lack of reflection from the
+vegetation in the scorched area.

## Raster Math
+
We can perform raster calculations by subtracting (or adding,
multiplying, etc.) two rasters. In the geospatial world, we call this
"raster math", and typically it refers to operations on rasters that
have the same width and height (including `nodata` pixels).
We can check the shapes of the two rasters in the following way:
+
```python
print(red_clip.shape, nir_clip.shape)
```

```output
-(1, 1000, 1000) (1, 500, 500)
+(1, 1131, 1207) (1, 1131, 1207)
```

-Both rasters include a single band, but their width and height do not match.
-We can now use the `reproject_match` function, which both reprojects and clips
-a raster to the CRS and extent of another raster.
-
-```python
-red_clip_matched = red_clip.rio.reproject_match(nir_clip)
-print(red_clip_matched.shape)
-```
-
-```output
-(1, 500, 500)
-```
+The shapes of the two rasters match (and, not shown, the coordinates and the CRSs match too).

Let's now compute the NDVI as a new raster using the formula presented above.
-We'll use `rioxarray` objects so that we can easily plot our result and keep
+We'll use `DataArray` objects so that we can easily plot our result and keep
track of the metadata.
```python
-ndvi = (nir_clip - red_clip_matched)/ (nir_clip + red_clip_matched)
+ndvi = (nir_clip - red_clip) / (nir_clip + red_clip)
print(ndvi)
```

```output
-<xarray.DataArray (band: 1, y: 500, x: 500)>
-array([[[ 0.7379576 ,  0.77153456,  0.54531944, ...,  0.39254385,
-          0.49227372,  0.4465174 ],
-        [ 0.7024894 ,  0.7074668 ,  0.3903298 , ...,  0.423283  ,
-          0.4706971 ,  0.45964912],
-        [ 0.6557818 ,  0.5610572 ,  0.46742022, ...,  0.4510345 ,
-          0.43815723,  0.6005133 ],
-        ...,
-        [ 0.02391171,  0.21843003,  0.02479339, ..., -0.50923485,
-         -0.53367877, -0.4955414 ],
-        [ 0.11376493,  0.17681159, -0.1673566 , ..., -0.5221932 ,
-         -0.5271318 , -0.4852753 ],
-        [ 0.45398772, -0.00518135,  0.03346133, ..., -0.5019455 ,
-         -0.4987013 , -0.49081364]]], dtype=float32)
+<xarray.DataArray (band: 1, y: 1131, x: 1207)>
+array([[[0., 0., 0., ..., 0., 0., 0.],
+        [0., 0., 0., ..., 0., 0., 0.],
+        [0., 0., 0., ..., 0., 0., 0.],
+        ...,
+        [0., 0., 0., ..., 0., 0., 0.],
+        [0., 0., 0., ..., 0., 0., 0.],
+        [0., 0., 0., ..., 0., 0., 0.]]], dtype=float32)
Coordinates:
  * band         (band) int64 1
-  * x            (x) float64 6.29e+05 6.29e+05 6.29e+05 ... 6.39e+05 6.39e+05
-  * y            (y) float64 5.814e+06 5.814e+06 ... 5.804e+06 5.804e+06
+  * x            (x) float64 5.615e+05 5.616e+05 ... 6.097e+05 6.098e+05
+  * y            (y) float64 4.035e+06 4.035e+06 4.035e+06 ... 3.99e+06 3.99e+06
    spatial_ref  int64 0
```

@@ -178,7 +170,7 @@
ndvi.plot()
```

![](fig/E09/NDVI-map.png){alt="NDVI map"}

Notice that the range of values for the output NDVI is between -1 and 1. Does this make sense for the selected region?

-Maps are great, but it can also be informative to plot histograms of values to better understand the distribution. We can accomplish this using a built-in xarray method we have already been using: `plot`
+Maps are great, but it can also be informative to plot histograms of values to better understand the distribution. We can accomplish this using the built-in xarray method `plot.hist()`:

```python
ndvi.plot.hist()
```

![](fig/E09/NDVI-hist.png){alt="NDVI histogram"}

:::challenge
-## Exercise: Explore NDVI Raster Values
-It's often a good idea to explore the range of values in a raster dataset just like we might explore a dataset that we collected in the field. The histogram we just made is a good start but there's more we can do to improve our understanding of the data.
+### Exercise: NDWI and custom index to detect burned areas

-1. What is the min and maximum value for the NDVI raster (`ndvi`) that we just created? Are there missing values?
-2. Plot a histogram with 50 bins instead of 8. What do you notice that wasn't clear before?
-3. Plot the `ndvi` raster using breaks that make sense for the data.
+Calculate the other two indices required to compute the burned area classification mask, specifically:
+
+* The [normalized difference water index (NDWI)](https://en.wikipedia.org/wiki/Normalized_difference_water_index), derived from the **green** and **NIR** bands (with band ID "green" and "nir", respectively):
+
+$$ NDWI = \frac{green - NIR}{green + NIR} $$
+
+* A custom index derived from the 1600 nm and the 2200 nm **short-wave infrared (SWIR)** bands ("swir16" and "swir22", respectively):
+
+$$ INDEX = \frac{SWIR_{16} - SWIR_{22}}{SWIR_{16} + SWIR_{22}} $$
+
+What challenge do you foresee in combining the data from the two indices?

::::solution
-1. Recall, if there were nodata values in our raster,
-we would need to filter them out with `.where()`. Since we have loaded the rasters with `masked=True`, missing
-values are already encoded as `np.nan`'s. The `ndvi` array actually includes a single missing value.
```python
-print(ndvi.min().values)
-print(ndvi.max().values)
-print(ndvi.isnull().sum().values)
+def get_band_and_clip(item, band_name, bbox):
+    href = item.assets[band_name].get_absolute_href()
+    band = rioxarray.open_rasterio(href, masked=True)
+    return band.rio.clip_box(*bbox)
```

-```output
--0.99864775
-0.9995788
-1
-```
-2. Increasing the number of bins gives us a much clearer view of the distribution. Also, there seem to be very few
-NDVI values larger than ~0.9.

```python
-ndvi.plot.hist(bins=50)
+green_clip = get_band_and_clip(item, 'green', bbox)
+swir16_clip = get_band_and_clip(item, 'swir16', bbox)
+swir22_clip = get_band_and_clip(item, 'swir22', bbox)
```
-![](fig/E09/NDVI-hist-bins.png){alt="NDVI histogram with 50 bins"}

-3. We can discretize the color bar by specifying the intervals via the `levels` argument to `plot()`.
-Suppose we want to bin our data in the following intervals:
-* $-1 \le NDVI \lt 0$ for water;
-* $0 \le NDVI \lt 0.2$ for no vegetation;
-* $0.2 \le NDVI \lt 0.7$ for sparse vegetation;
-* $0.7 \le NDVI \lt 1$ for dense vegetation.
+```python
+ndwi = (green_clip - nir_clip)/(green_clip + nir_clip)
+index = (swir16_clip - swir22_clip)/(swir16_clip + swir22_clip)
+```

```python
-class_bins = (-1, 0., 0.2, 0.7, 1)
-ndvi.plot(levels=class_bins)
+ndwi.plot(robust=True)
```
-![](fig/E09/NDVI-map-binned.png){alt="binned NDVI map"}
-::::
-:::
+![](fig/E09/NDWI.png){alt="NDWI index"}

-Missing values can be interpolated from the values of neighbouring grid cells using the `.interpolate_na` method. We
-then save `ndvi` as a GeoTiff file:
```python
-ndvi_nonan = ndvi.interpolate_na(dim="x")
-ndvi_nonan.rio.to_raster("NDVI.tif")
+index.plot(robust=True)
```
+![](fig/E09/custom-index.png){alt="custom index"}

-## Classifying Continuous Rasters in Python
+The challenge in combining the different indices is that the SWIR bands (and thus the derived custom index) have a lower resolution:

-Now that we have a sense of the distribution of our NDVI raster, we
-can reduce the complexity of our map by classifying it. Classification involves
-assigning each pixel in the raster to a class based on its value. In Python, we
-can accomplish this using the `numpy.digitize` function.
+```python
+ndvi.rio.resolution(), ndwi.rio.resolution(), index.rio.resolution()
+```

-First, we define NDVI classes based on a list of values, as defined in the last exercise:
-`[-1, 0., 0.2, 0.7, 1]`. When bins are ordered from
-low to high, as here, `numpy.digitize` assigns classes like so:
+```output
+((10.0, -10.0), (10.0, -10.0), (20.0, -20.0))
+```
+
+::::

-![Source: Image created for this lesson ([license](../LICENSE.md))](fig/E09/NDVI-classes.jpg){alt="NDVI classes"}
+:::

+In order to combine data from the computed indices, we use the `reproject_match` method, which reprojects, clips, and
+matches the resolution of a raster using another raster as a template. We use the `index` raster as a template, and match
+`ndvi` and `ndwi` to its resolution and extent:

-Note that, by default, each class includes the left but not the right bound. This is not an issue here, since the
-computed range of NDVI values is fully contained in the open interval (-1; 1) (see exercise above).
```python
-import numpy as np
-import xarray
-
-# Defines the bins for pixel values
-class_bins = (-1, 0., 0.2, 0.7, 1)
-
-# The numpy.digitize function returns an unlabeled array, in this case, a
-# classified array without any metadata. That doesn't work--we need the
-# coordinates and other spatial metadata. We can get around this using
-# xarray.apply_ufunc, which can run the function across the data array while
-# preserving metadata.
-ndvi_classified = xarray.apply_ufunc(
-    np.digitize,
-    ndvi_nonan,
-    class_bins
-)
+ndvi_match = ndvi.rio.reproject_match(index)
+ndwi_match = ndwi.rio.reproject_match(index)
```

-Let's now visualize the classified NDVI, customizing the plot with proper title and legend. We then export the
-figure in PNG format:
+Finally, we also fetch the blue band data and match this and the NIR band data to the template:

```python
-import earthpy.plot as ep
-import matplotlib.pyplot as plt
-
-from matplotlib.colors import ListedColormap
-
-# Define color map of the map legend
-ndvi_colors = ["blue", "gray", "green", "darkgreen"]
-ndvi_cmap = ListedColormap(ndvi_colors)
-
-# Define class names for the legend
-category_names = [
-    "Water",
-    "No Vegetation",
-    "Sparse Vegetation",
-    "Dense Vegetation"
-]
-
-# We need to know in what order the legend items should be arranged
-category_indices = list(range(len(category_names)))
-
-# Make the plot
-im = ndvi_classified.plot(cmap=ndvi_cmap, add_colorbar=False)
-plt.title("Classified NDVI")
-# earthpy helps us by drawing a legend given an existing image plot and legend items, plus indices
-ep.draw_legend(im_ax=im, classes=category_indices, titles=category_names)
-
-# Save the figure
-plt.savefig("NDVI_classified.png", bbox_inches="tight", dpi=300)
+blue_clip = get_band_and_clip(item, 'blue', bbox)
+blue_match = blue_clip.rio.reproject_match(index)
+nir_match = nir_clip.rio.reproject_match(index)
```

-![](fig/E09/NDVI-classified.png){alt="classified NDVI map"}
-
-We can finally export the classified NDVI raster object to a GeoTiff file. The `to_raster()` function
-by default writes the output file to your working directory unless you specify a
-full file path.
+We can now go ahead and compute the binary classification mask for burned areas. Note that we need to convert the unit
+of the Sentinel-2 bands [from digital numbers to reflectance](https://docs.sentinel-hub.com/api/latest/data/sentinel-2-l2a/#units)
+(this is achieved by dividing by 10,000):

```python
-ndvi_classified.rio.to_raster("NDVI_classified.tif", dtype="int32")
+burned = (
+    (ndvi_match <= 0.3) &
+    (ndwi_match <= 0.1) &
+    ((index + nir_match/10_000) <= 0.1) &
+    ((blue_match/10_000) <= 0.1) &
+    ((swir16_clip/10_000) >= 0.1)
+)
```

-:::challenge
-## Exercise: Compute the NDVI for the Texel island
-
-Data are often more interesting and powerful when we compare them across various
-locations. Let's compare the computed NDVI map with the one of another region in the same Sentinel-2 scene:
-the [Texel island](https://en.wikipedia.org/wiki/Texel), located in the North Sea.
-
-0. You should have the red- and the NIR-band rasters already loaded (`red` and `nir` variables, respectively).
-1. Crop the two rasters using the following bounding box: `(610000, 5870000, 630000, 5900000)`. Don't forget to check the shape of the data, and make sure the cropped areas have the same CRSs, heights and widths.
-2. Compute the NDVI from the two raster layers and check the max/min values to make sure the data
-is what you expect.
-3. Plot the NDVI map and export the NDVI as a GeoTiff.
-4. Compare the distributions of NDVI values for the two regions investigated.
-
-::::solution
-1) We crop the area of interest using `clip_box`:
-```python
-bbox_texel = (610000, 5870000, 630000, 5900000)
-nir_texel = nir.rio.clip_box(*bbox_texel)
-red_texel = red.rio.clip_box(*bbox_texel)
-```
-
-2) Reproject and clip one raster to the extent of the smaller raster using `reproject_match`. The lines of code below
-assign a variable to the reprojected raster and calculate the NDVI.
-
-```python
-red_texel_matched = red_texel.rio.reproject_match(nir_texel)
-ndvi_texel = (nir_texel - red_texel_matched)/ (nir_texel + red_texel_matched)
-```
-
-3) Plot the NDVI and save the raster data as a GeoTIFF file.
-```python
-ndvi_texel.plot()
-ndvi_texel.rio.to_raster("NDVI_Texel.tif")
-```
-![](fig/E09/NDVI-map-Texel.png){alt="NDVI map Texel"}
-
-4) Compute the NDVI histogram and compare it with the region that we have previously investigated. Many more grid
-cells have negative NDVI values, since the area of interest includes much more water. Also, NDVI values close to
-zero are more abundant, indicating the presence of bare ground (sand) regions.
-```python
-ndvi_texel.plot.hist(bins=50)
-```
-![](fig/E09/NDVI-hist-Texel.png){alt="NDVI histogram Texel"}
-
-::::
-:::
+The classification mask has a single element along the "band" axis; we can drop this dimension in the following way:

```python
+burned = burned.squeeze()
```

+Let's now fetch and visualize the true color image of Rhodes, after coloring the pixels identified as burned area in red:

```python
+visual_href = item.assets['visual'].get_absolute_href()
+visual = rioxarray.open_rasterio(visual_href)
+visual_clip = visual.rio.clip_box(*bbox)
```

```python
+# set the red channel to max (255), and the green and blue channels to min (0)
+visual_clip[0] = visual_clip[0].where(~burned, 255)
+visual_clip[1:3] = visual_clip[1:3].where(~burned, 0)
```

```python
+visual_clip.plot.imshow()
```

![](fig/E09/visual-burned-index.png){alt="RGB image with burned area in red"}

+We can save the burned classification mask to disk after converting booleans to integers:
+
+```python
+burned.rio.to_raster('burned.tif', dtype='int8')
+```

:::keypoints
- Python's built-in math operators are fast and simple options for raster math.
-- numpy.digitize can be used to classify raster values in order to generate a less complicated map.
:::

diff --git a/10-zonal-statistics.md b/10-zonal-statistics.md
index ed7631c4..9064b747 100644
--- a/10-zonal-statistics.md
+++ b/10-zonal-statistics.md
@@ -1,7 +1,7 @@
---
title: "Calculating Zonal Statistics on Rasters"
teaching: 40
-exercises: 20
+exercises: 0
---

:::questions
@@ -17,159 +17,131 @@ exercises: 20

-# Introduction
+## Introduction

-Statistics on predefined zones of the raster data are commonly used for analysis and to better understand the data. These zones are often provided within a single vector dataset, identified by certain vector attributes. For example, in the previous episodes, we used the crop field polygon dataset. The fields with the same crop type can be identified as a "zone", resulting in multiple zones in one vector dataset. One may be interested in performing statistical analysis over these crop zones.
+Statistics on predefined zones of the raster data are commonly used for analysis and to better understand the data. These zones are often provided within a single vector dataset, identified by certain vector attributes. For example, in the previous episodes, we defined infrastructure regions and built-up regions on Rhodes Island as polygons. Each region can then be identified as a separate "zone", resulting in two zones. One can evaluate the effect of the wildfire on the two zones by calculating zonal statistics.
One can evaluate the effect of the wildfire on the two zones by calculating zonal statistics.

-In this episode, we will explore how to calculate zonal statistics based on the types of crops in `fields_cropped.shp`. To do this, we will first identify zones from the vector data, then rasterize these vector zones. Finally the zonal statistics for `ndvi` will be calculated over the rasterized zones.
+## Data loading

-# Making vector and raster data compatible
-First, let's load the `NDVI.tif` file saved in the previous episode to obtained our calculated raster `ndvi` data. We also use the `squeeze()` function in order to reduce our raster data `ndvi` dimension to 2D by removing the singular `band` dimension - this is necessary for use with the `rasterize` and `zonal_stats` functions:
+We have created `assets.gpkg` in the episode "Vector data in Python", which contains the infrastructure regions and built-up regions. We also calculated the burned index in the episode "Raster Calculations in Python" and saved it in `burned.tif`. Let's load them:

```python
+# Load burned index
import rioxarray
-ndvi = rioxarray.open_rasterio("NDVI.tif").squeeze()
-```
-
-Let's also read the crop fields vector data from our saved `fields_cropped.shp` file.
+burned = rioxarray.open_rasterio('burned.tif')

-```python
+# Load asset polygons
import geopandas as gpd
-fields = gpd.read_file('fields_cropped.shp')
-```
-
-In order to use the vector data as a classifier for our raster, we need to convert the vector data to the appropriate CRS. We can perform the CRS conversion from the vector CRS (EPSG:28992) to our raster `ndvi` CRS (EPSG:32631) with:
-```python
-# Uniform CRS
-fields_utm = fields.to_crs(ndvi.rio.crs)
+assets = gpd.read_file('assets.gpkg')
```

-# Rasterizing the vector data
+## Align datasets

-Before calculating zonal statistics, we first need to rasterize our `fields_utm` vector geodataframe with the `rasterio.features.rasterize` function. With this function, we aim to produce a grid with numerical values representing the types of crops as defined by the column `gewascode` from `field_cropped` - `gewascode` stands for the crop codes as defined by the Netherlands Enterprise Agency (RVO) for different types of crop or `gewas` (Grassland, permanent; Grassland, temporary; corn fields; etc.). This grid of values thus defines the zones for the `xrspatial.zonal_stats` function, where each pixel in the zone grid overlaps with a corresponding pixel in our NDVI raster.
-
-We can generate the `geometry, gewascode` pairs for each vector feature to be used as the first argument to `rasterio.features.rasterize` as:
+Before we continue, let's check if the two datasets are in the same CRS:

```python
-geom = fields_utm[['geometry', 'gewascode']].values.tolist()
-geom
+print(assets.crs)
+print(burned.rio.crs)
```

```output
-[[<shapely.geometry.polygon.Polygon object at 0x...>, 265],
- [<shapely.geometry.polygon.Polygon object at 0x...>, 265],
- [<shapely.geometry.polygon.Polygon object at 0x...>, 265],
- [<shapely.geometry.polygon.Polygon object at 0x...>, 265],
- [<shapely.geometry.polygon.Polygon object at 0x...>, 265],
- [<shapely.geometry.polygon.Polygon object at 0x...>, 265],
-...
- [<shapely.geometry.polygon.Polygon object at 0x...>, 265],
- [<shapely.geometry.polygon.Polygon object at 0x...>, 265],
- [<shapely.geometry.polygon.Polygon object at 0x...>, 265],
- [<shapely.geometry.polygon.Polygon object at 0x...>, 331],
- ...]
+EPSG:4326
+EPSG:32635
```

-This generates a list of the shapely geometries from the `geometry` column, and the unique field ID from the `gewascode` column in the `fields_utm` geodataframe.
-
-We can now rasterize our vector data using `rasterio.features.rasterize`:
+The two datasets are in different CRSs.
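+
+Instead of eyeballing the printed strings, we can also compare the two CRSs programmatically. This is a minimal
+sketch added for illustration; it relies on both CRS objects exposing an EPSG code, which `geopandas` and
+`rioxarray` both provide via `to_epsg()`:
+
+```python
+# Sketch: compare the EPSG codes of the two datasets directly.
+print(assets.crs.to_epsg() == burned.rio.crs.to_epsg())  # False here: reprojection is needed
+```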
Let's reproject the assets to the same CRS as the burned index raster:

```python
+assets = assets.to_crs(burned.rio.crs)
```

-The argument `out_shape` specifies the shape of the output grid in pixel units, while `transform` represents the projection from pixel space to the projected coordinate space. By default, the pixels that are not contained within a polygon in our shapefile will be filled with 0. It's important to pick a fill value that is not the same as any value already defined in `gewascode` or else we won't distinguish between this zone and the background.
+## Rasterize the vector data

-Let's inspect the results of rasterization:
+One way to define the zones is to create a grid space with the same extent and resolution as the burned index raster, with the numerical values in the grid representing the type of infrastructure, i.e., the zones. This can be done by rasterizing the vector data `assets` to the grid space of `burned`.
+
+Let's first take two elements from `assets`: the geometry column and the code of the region.

```python
-import numpy as np
-print(fields_rasterized.shape)
-print(np.unique(fields_rasterized))
+geom = assets[['geometry', 'code']].values.tolist()
+geom
```

```output
-(500, 500)
-[  0 259 265 266 331 332 335 863]
+[[<POLYGON ...>,
+  1],
+ [<MULTIPOLYGON ...>,
+  2]]
```

-The output `fields_rasterized` is an `np.ndarray` with the same shape as `ndvi`. It contains `gewascode` values at the location of fields, and 0 outside the fields. Let's visualize it:
+The raster image `burned` is a 3D image with a "band" dimension.

```python
-from matplotlib import pyplot as plt
-plt.imshow(fields_rasterized)
-plt.colorbar()
+burned.shape
```

-![](fig/E10/rasterization-results.png){alt="rasterization results"}
-
-We will convert the output to `xarray.DataArray` which will be used further. To do this, we will "borrow" the coordinates from `ndvi`, and fill in the rasterization data:
-```python
-import xarray as xr
-fields_rasterized_xarr = ndvi.copy()
-fields_rasterized_xarr.data = fields_rasterized
-
-# visualize
-fields_rasterized_xarr.plot(robust=True)
+```output
+(1, 1131, 1207)
```

-![](fig/E10/rasterization-results-xr.png){alt="Rasterization results Xarray"}
+To create the grid space, we only need the two spatial dimensions. We can use `.squeeze()` to drop the band dimension:

-# Calculate zonal statistics
+```python
+burned_squeeze = burned.squeeze()
+burned_squeeze.shape
+```

-In order to calculate the statistics for each crop zone, we call the function, `xrspatial.zonal_stats`. The `xrspatial.zonal_stats` function takes as input `zones`, a 2D `xarray.DataArray`, that defines different zones, and `values`, a 2D `xarray.DataArray` providing input values for calculating statistics. 
+```output
+(1131, 1207)
+```

-We call the `zonal_stats` function with `fields_rasterized_xarr` as our classifier and the 2D raster with our values of interest `ndvi` to obtain the NDVI statistics for each crop type:
+Now we can use `features.rasterize` from `rasterio` to rasterize the vector data `assets` to the grid space of `burned`:

```python
-from xrspatial import zonal_stats
-zonal_stats(fields_rasterized_xarr, ndvi)
+from rasterio import features
+assets_rasterized = features.rasterize(geom, out_shape=burned_squeeze.shape, transform=burned.rio.transform())
+assets_rasterized
```

```output
-     zone      mean       max       min           sum       std       var     count
-0       0  0.266531  0.999579 -0.998648  38887.648438  0.409970  0.168075  145903.0
-1     259  0.520282  0.885242  0.289196    449.003052  0.111205  0.012366     863.0
-2     265  0.775609  0.925955  0.060755  66478.976562  0.091089  0.008297   85712.0
-3     266  0.794128  0.918048  0.544686   1037.925781  0.074009  0.005477    1307.0
-4     331  0.703056  0.905304  0.142226  10725.819336  0.102255  0.010456   15256.0
-5     332  0.681699  0.849158  0.178113    321.080261  0.123633  0.015285     471.0
-6     335  0.648063  0.865804  0.239661    313.662598  0.146582  0.021486     484.0
-7     863  0.388575  0.510572  0.185987      1.165724  0.144245  0.020807       3.0
+array([[0, 0, 0, ..., 0, 0, 0],
+       [0, 0, 0, ..., 0, 0, 0],
+       [0, 0, 0, ..., 0, 0, 0],
+       ...,
+       [0, 0, 0, ..., 0, 0, 0],
+       [0, 0, 0, ..., 0, 0, 0],
+       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)
```

-The `zonal_stats` function calculates the minimum, maximum, and sum for each zone along with statistical measures such as the mean, variance and standard deviation for each rasterized vector zone. In our raster dataset `zone = 0`, corresponding to non-crop areas, has the highest count followed by `zone = 265` which corresponds to 'Grasland, blijvend' or 'Grassland, permanent'. The highest mean NDVI is observed for `zone = 266` for 'Grasslands, temporary' with the lowest mean, aside from non-crop area, going to `zone = 863` representing 'Forest without replanting obligation'. Thus, the `zonal_stats` function can be used to analyze and understand different sections of our raster data. The definition of the zones can be derived from vector data or from classified raster data as presented in the challenge below:
+## Perform zonal statistics

-:::challenge
-## Exercise: Calculate zonal statistics for zones defined by `ndvi_classified`
+The rasterized zones in `assets_rasterized` form a `numpy` array. The Python package `xrspatial`, which we will use for the zonal statistics, accepts `xarray.DataArray` objects, so we first need to convert `assets_rasterized`. We can use `burned_squeeze` as a template:

-Let's calculate NDVI zonal statistics for the different zones as classified by `ndvi_classified` in the previous episode.
+```python
+assets_rasterized_xarr = burned_squeeze.copy()
+assets_rasterized_xarr.data = assets_rasterized
+assets_rasterized_xarr.plot()
+```

-Load both raster datasets: `NDVI.tif` and `NDVI_classified.tif`. Then, calculate zonal statistics for each `class_bins`. Inspect the output of the `zonal_stats` function.
+![](fig/E10/zones_rasterized_xarray.png){alt="Rasterized zones"}

-::::solution
-1) Load and convert raster data into suitable inputs for `zonal_stats`:
-```python
-ndvi = rioxarray.open_rasterio("NDVI.tif").squeeze()
-ndvi_classified = rioxarray.open_rasterio("NDVI_classified.tif").squeeze()
-```
-2) Create and display the zonal statistics table. 
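+
+Before computing the statistics, it can be useful to verify the rasterized grid by counting how many pixels each
+zone value covers. This is a quick check added here for illustration; the counts should match the `count` column
+of the zonal statistics computed below:
+
+```python
+import numpy as np
+
+# Count the pixels per zone value: 0 (background), 1 (infrastructure), 2 (built-up)
+values, counts = np.unique(assets_rasterized, return_counts=True)
+print(dict(zip(values, counts)))
+```
+
+Then we can calculate the zonal statistics using the `zonal_stats` function: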
```python
-zonal_stats(ndvi_classified, ndvi)
+from xrspatial import zonal_stats
+stats = zonal_stats(assets_rasterized_xarr, burned_squeeze)
+stats
```

```output
-   zone      mean       max       min           sum       std       var    count
-0     1 -0.355660 -0.000257 -0.998648 -12838.253906  0.145916  0.021291  36097.0
-1     2  0.110731  0.199839  0.000000   1754.752441  0.055864  0.003121  15847.0
-2     3  0.507998  0.700000  0.200000  50410.167969  0.140193  0.019654  99233.0
-3     4  0.798281  0.999579  0.700025  78888.523438  0.051730  0.002676  98823.0
+   zone      mean  max  min      sum       std       var      count
+0     0  0.022570  1.0  0.0  28929.0  0.148528  0.022061  1281749.0
+1     1  0.009607  1.0  0.0    568.0  0.097542  0.009514    59125.0
+2     2  0.000000  0.0  0.0      0.0  0.000000  0.000000    24243.0
```

-::::
-:::
+
+The results provide statistics for three zones: `1` represents infrastructure regions, `2` represents built-up regions, and `0` represents the rest of the area. Since the burned mask contains only zeros and ones, the `mean` column can be read as the fraction of burned pixels per zone: roughly 2.3% of the area outside the two zones, about 1% of the infrastructure zone, and none of the built-up zone.
+

:::keypoints
- Zones can be extracted by attribute columns of a vector dataset

diff --git a/11-parallel-raster-computations.md b/11-parallel-raster-computations.md
index c94287b7..17eee39a 100644
--- a/11-parallel-raster-computations.md
+++ b/11-parallel-raster-computations.md
@@ -1,7 +1,7 @@
---
title: "Parallel raster computations using Dask"
teaching: 45
-exercises: 25
+exercises: 10
---

:::questions
@@ -51,12 +51,6 @@ computations. Depending on the specifics of the calculations, serial calculation
Being able to profile the computational time is thus essential, and we will see how to do that in a Jupyter environment
in the next section.

-The example that we consider here is the application of a median filter to a satellite image.
-[Median filtering](https://en.wikipedia.org/wiki/Median_filter) is a common noise removal technique which
-replaces a pixel's value with the median value computed from its surrounding pixels.
-
-## Time profiling in Jupyter
-
:::callout

## Introduce the Data

@@ -79,96 +73,15 @@ import pystac
items = pystac.ItemCollection.from_file("search.json")
```

-We select the last scene, and extract the URL of the true-color image ("visual"):
-
-```python
-assets = items[-1].assets # last item's assets
-visual_href = assets["visual"].href # true color image
-```
-
-The true-color image is available as a raster file with 10 m resolution and 3 bands (you can verify this by opening the
-file with `rioxarray`), which makes it a relatively large file (few hundreds MBs). In order to keep calculations
-"manageable" (reasonable execution time and memory usage) we select here a lower resolution version of the image, taking
-advantage of the so-called "pyramidal" structure of cloud-optimized GeoTIFFs (COGs). COGs, in fact, typically include
-multiple lower-resolution versions of the original image, called "overviews", in the same file. This allows us to avoid
-downloading high-resolution images when only quick previews are required.
-
-Overviews are often computed using powers of 2 as down-sampling (or zoom) factors. So, typically, the first level
-overview (index 0) corresponds to a zoom factor of 2, the second level overview (index 1) corresponds to a zoom factor
-of 4, and so on. 
Here, we open the third level overview (zoom factor 8) and check that the resolution is about 80 m: +We select the last scene, extract the URLs of the red ("red") and near-infrared ("nir") bands, open them with +`rioxarray` and save them to disk: ```python import rioxarray -visual = rioxarray.open_rasterio(visual_href, overview_level=2) -print(visual.rio.resolution()) -``` - -```output -(79.97086671522214, -79.97086671522214) -``` - -Let's make sure the data has been loaded into memory before proceeding to time profile our raster calculation. Calling -the `.load()` method of a DataArray object triggers data loading: - -```python -visual = visual.load() -``` - -Note that by default data is loaded using Numpy arrays as underlying data structure. We can visualize the raster: - -```python -visual.plot.imshow(figsize=(10,10)) -``` - -![Scene's true-color image](fig/E11/true-color-image.png){alt="true color image scene"} - -Let's now apply a median filter to the image while keeping track of the execution time of this task. The filter is -carried out in two steps: first, we define the size and centering of the region around each pixel that will be -considered for the median calculation (the so-called "windows"), using the `.rolling()` method. We choose here windows -that are 7 pixel wide in both x and y dimensions, and, for each window, consider the central pixel as the window target. -We then call the `.median()` method, which initiates the construction of the windows and the actual calculation. - -For the time profiling, we make use of the Jupyter magic `%%time`, which returns the time required to run the content -of a cell (note that commands starting with `%%` needs to be on the first line of the cell!): - -```python -%%time -median = visual.rolling(x=7, y=7, center=True).median() -``` - -```output -CPU times: user 15.6 s, sys: 3.2 s, total: 18.8 s -Wall time: 19.6 s -``` - -Let's note down the calculation's "Wall time" (actual time to perform the task). We can inspect the image resulting -after the application of median filtering: - - -```python -median.plot.imshow(robust=True, figsize=(10,10)) +rioxarray.open_rasterio(items[-1].assets["red"].href).rio.to_raster("red.tif", driver="COG") +rioxarray.open_rasterio(items[-1].assets["nir"].href).rio.to_raster("nir.tif", driver="COG") ``` -![True-color image after median filtering](fig/E11/true-color-image_median-filter.png){alt="median filter true color image"} - -:::callout - -## Handling edges - -By looking closely, you might notice a tiny white edge at the plot boundaries. These are the pixels that are less than -3 pixels away from the border of the image. These pixels cannot be surrounded by a 7 pixel wide window. The default -behaviour is to assign these with nodata values. -::: - -Finally, let's write the data to disk: - -```python -median.rio.to_raster("visual_median-filter.tif") -``` - -In the following section we will see how to parallelize these raster calculations, and we will compare timings to the -serial calculations that we have just run. - ## Dask-powered rasters ### Chunked arrays @@ -176,12 +89,11 @@ serial calculations that we have just run. As we have mentioned, `rioxarray` supports the use of Dask's chunked arrays as underlying data structure. When opening a raster file with `open_rasterio` and providing the `chunks` argument, Dask arrays are employed instead of regular Numpy arrays. `chunks` describes the shape of the blocks which the data will be split in. 
As an example, we
-open the blue band raster using a chunk shape of `(1, 4000, 4000)` (block size of `1` in the first dimension and
+open the red band raster using a chunk shape of `(1, 4000, 4000)` (block size of `1` in the first dimension and
of `4000` in the second and third dimensions):

```python
-blue_band_href = assets["blue"].href
-blue_band = rioxarray.open_rasterio(blue_band_href, chunks=(1, 4000, 4000))
+red = rioxarray.open_rasterio("red.tif", chunks=(1, 4000, 4000))
```

Xarray and Dask also provide a graphical representation of the raster data array and of its blocked structure.

@@ -192,21 +104,19 @@ Xarray and Dask also provide a graphical representation of the raster data array

### Exercise: Chunk sizes matter

-We have already seen how COGs are regular GeoTIFF files with a special internal structure. other feature of COGs is
-that data is organized in "blocks" that can be accessed remotely via independent HTTP requests, abling partial file
-readings. This is useful if you want to access only a portion of your raster file, but it also lows for efficient
-parallel reading. You can check the blocksize employed in a COG file with the following code ippet:
+We have already seen how COGs are regular GeoTIFF files with a special internal structure. Another feature of COGs is
+that data is organized in "blocks" that can be accessed remotely via independent HTTP requests, enabling partial file
+readings. This is useful if you want to access only a portion of your raster file, but it also allows for efficient
+parallel reading. You can check the blocksize employed in a COG file with the following code snippet:

```python
import rasterio
-with rasterio.open(visual_href) as r:
+with rasterio.open("/path/or/URL/to/file.tif") as r:
    if r.is_tiled:
        print(f"Chunk size: {r.block_shapes}")
```
-
-In order to optimally access COGs it is best to align the blocksize of the file with the chunks ployed when loading
-the file. Open the blue-band asset (the "blue" asset) of a Sentinel-2 scene as a chunked `DataArray` object using a suitable
-chunk size. Which elements do you think should be considered when choosing the chunk size?
+In order to optimally access COGs it is best to align the blocksize of the file with the chunks employed when loading
+the file. Which other elements do you think should be considered when choosing the chunk size? What do you think are suitable chunk sizes for the red band raster?

::::solution

```python
import rasterio
-with rasterio.open(blue_band_href) as r:
+with rasterio.open("red.tif") as r:
    if r.is_tiled:
        print(f"Chunk size: {r.block_shapes}")
```

```output
Chunk size: [(1024, 1024)]
```

-Ideal chunk size values for this raster are thus multiples of 1024. An element to consider is number of
+Ideal chunk size values for this raster are multiples of 1024. An element to consider is the number of
resulting chunks and their size. While the optimal chunk size strongly depends on the specific application, chunks
should in general not be too big nor too small (i.e. too many). As a rule of thumb, chunk sizes of 100 MB typically
work well with Dask (see, e.g., this [blog post](https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes)). Also,
the shape might be relevant, depending on the application! 
Here, we might select `(1, 6144, 6144)`:

```python
-band = rioxarray.open_rasterio(blue_band_href, chunks=(1, 6144, 6144))
+red = rioxarray.open_rasterio("red.tif", chunks=(1, 6144, 6144))
```

-which leads to chunks 72 MB large: ((1 x 6144 x 6144) x 2 bytes / 2^20 = 72 MB). Also, we can t ioxarray` and Dask
+which leads to chunks 72 MB large: ((1 x 6144 x 6144) x 2 bytes / 2^20 = 72 MB). Also, we can let rioxarray and Dask
figure out appropriate chunk shapes by setting `chunks="auto"`:

```python
-band = rioxarray.open_rasterio(blue_band_href, chunks="auto")
+red = rioxarray.open_rasterio("red.tif", chunks="auto")
```

which leads to `(1, 8192, 8192)` chunks (128 MB).
::::
:::

@@ -252,41 +162,54 @@ Operations performed on a `DataArray` that has been opened as a chunked Dask arr
coordinates how the operations should be executed on the individual chunks of data, and runs these tasks in parallel as
much as possible.

-Let's now repeat the raster calculations that we have carried out in the previous section, but running calculations in
-parallel over a multi-core CPU. We first open the relevant rasters as chunked arrays:
+Let us set up an example where we calculate the NDVI for a full Sentinel-2 tile, and try to estimate the performance gain by running the calculation in parallel on a multi-core CPU.
+
+To run the calculation serially, we open the relevant raster bands, as we have learned in the previous episodes:

```python
-visual_dask = rioxarray.open_rasterio(visual_href, overview_level=2, lock=False, chunks=(3, 500, 500))
+red = rioxarray.open_rasterio('red.tif', masked=True)
+nir = rioxarray.open_rasterio('nir.tif', masked=True)
```

-Setting `lock=False` tells `rioxarray` that the individual data chunks can be loaded simultaneously from the source by
-the Dask workers.
+We then compute the NDVI. Note the Jupyter magic `%%time`, which returns the time required to run the content of a cell (note that commands starting with `%%` need to be on the first line of the cell!):
+
+```python
+%%time
+ndvi = (nir - red)/(nir + red)
+```
+
+```output
+CPU times: user 4.99 s, sys: 3.44 s, total: 8.43 s
+Wall time: 8.53 s
+```

-As the next step, we trigger the download of the data using the `.persist()` method, see below. This makes sure that
-the downloaded chunks are stored in the form of a chunked Dask array (calling `.load()` would instead merge the chunks
-in a single Numpy array).
+We note down the calculation's "Wall time" (actual time to perform the task).

-We explicitly tell Dask to parallelize the required workload over 4 threads:
+Now we run the same task in parallel using Dask. To do so, we open the relevant rasters as chunked arrays:

```python
-visual_dask = visual_dask.persist(scheduler="threads", num_workers=4)
+red_dask = rioxarray.open_rasterio('red.tif', masked=True, lock=False, chunks=(1, 6144, 6144))
+nir_dask = rioxarray.open_rasterio('nir.tif', masked=True, lock=False, chunks=(1, 6144, 6144))
```

-Let's now continue to the actual calculation. Note how the same syntax as for its serial version is employed for
-applying the median filter. Don't forget to add the Jupyter magic to record the timing!
+Setting `lock=False` tells `rioxarray` that the individual data chunks can be loaded simultaneously from the source by
+the Dask workers.
+
+We now continue to the actual calculation. Note how the same syntax as for its serial version is employed for
+computing the NDVI. Don't forget to add the Jupyter magic to record the timing!
```python
%%time
-median_dask = visual_dask.rolling(x=7, y=7, center=True).median()
+ndvi_dask = (nir_dask - red_dask)/(nir_dask + red_dask)
```

```output
-CPU times: user 20.6 ms, sys: 3.71 ms, total: 24.3 ms
-Wall time: 25.2 ms
+CPU times: user 7.71 ms, sys: 1.71 ms, total: 9.42 ms
+Wall time: 8.61 ms
```

-Did we just observe a 700x speed-up when comparing to the serial calculation (19.6 s vs 25.2 ms)? Actually, no
-calculation has run yet. This is because operations performed on Dask arrays are executed "lazily", i.e. they are not
+Did we just observe a 1000x speed-up when comparing to the serial calculation (~8 s vs ~8 ms)? Actually, no
+calculation has run yet. This is because operations performed on Dask arrays are executed "lazily", i.e. they are not
immediately run.

:::callout

@@ -297,12 +220,12 @@ The sequence of operations to carry out is stored in a task graph, which can be

```python
import dask
-dask.visualize(median_dask)
+dask.visualize(ndvi_dask)
```

![Dask graph](fig/E11/dask-graph.png){alt="dask graph"}

-The task graph gives Dask the complete "overview" of the calculation, thus enabling a better nagement of tasks and
+The task graph gives Dask the complete "overview" of the calculation, thus enabling a better management of tasks and
resources when dispatching calculations to be run in parallel.
:::

@@ -310,21 +233,21 @@ Most methods of `DataArray`'s run operations lazily when Dask arrays are employe
can use either `.persist()` or `.compute()`. The former keeps data in the form of chunked Dask arrays, and it should
thus be used to run intermediate steps that will be followed by additional calculations. The latter merges instead the
chunks in a single Numpy array, and it should be used at the very end of a sequence of calculations. Both methods
-accept the same parameters (here, we again explicitly tell Dask to run tasks on 4 threads). Let's again time the cell
+accept the same parameters. Here, we explicitly tell Dask to parallelize the required workload over 4 threads. Let's again time the cell
execution:

```python
%%time
-median_dask = median_dask.persist(scheduler="threads", num_workers=4)
+ndvi_dask = ndvi_dask.persist(scheduler="threads", num_workers=4)
```

```output
-CPU times: user 19.1 s, sys: 3.2 s, total: 22.3 s
-Wall time: 6.61 s
+CPU times: user 4.18 s, sys: 2.19 s, total: 6.37 s
+Wall time: 2.32 s
```

The timing that we have recorded makes much more sense now. When running the task on a 4-core CPU laptop, we observe a
-x3 speed-up when comparing to the analogous serial calculation (19.6 s vs 6.61 s).
+x3.6 speed-up when comparing to the analogous serial calculation (8.53 s vs 2.32 s).

Once again, we stress that one does not always obtain similar performance gains by exploiting the Dask-based
parallelization. Even if the algorithm employed is well suited for parallelization, Dask introduces some overhead time
@@ -332,18 +255,16 @@ to manage the tasks in the Dask graph. This overhead, which is typically of the
be larger than the parallelization gain. This is the typical situation with calculations with many small chunks.

Finally, let's have a look at how Dask can be used to save raster files. When calling `.to_raster()`, we provide the
-following additional arguments:
-* `tiled=True`: write raster as a chunked GeoTIFF.
-* `lock=threading.Lock()`: the threads which are splitting the workload must "synchronise" when writing to the same file
-  (they might otherwise overwrite each other's output).
+additional argument `lock=threading.Lock()`. 
This is because the threads which are splitting the workload must +"synchronise" when writing to the same file (they might otherwise overwrite each other's output). ```python from threading import Lock -median_dask.rio.to_raster("visual_median-filter_dask.tif", tiled=True, lock=Lock()) +ndvi_dask.rio.to_raster('ndvi.tif', driver='COG', lock=Lock()) ``` Note that `.to_raster()` is among the methods that trigger immediate calculations (one can change this behaviour by -specifying `compute=False`) +specifying `compute=False`). :::keypoints - The `%%time` Jupyter magic command can be used to profile calculations. diff --git a/fig/E05/STAC-s2-preview-after.jpg b/fig/E05/STAC-s2-preview-after.jpg new file mode 100644 index 00000000..363cc073 Binary files /dev/null and b/fig/E05/STAC-s2-preview-after.jpg differ diff --git a/fig/E05/STAC-s2-preview-before.jpg b/fig/E05/STAC-s2-preview-before.jpg new file mode 100644 index 00000000..ae105588 Binary files /dev/null and b/fig/E05/STAC-s2-preview-before.jpg differ diff --git a/fig/E06/rhodes_multiband_80.png b/fig/E06/rhodes_multiband_80.png new file mode 100644 index 00000000..30f58226 Binary files /dev/null and b/fig/E06/rhodes_multiband_80.png differ diff --git a/fig/E06/rhodes_multiband_80_equal_aspect.png b/fig/E06/rhodes_multiband_80_equal_aspect.png new file mode 100644 index 00000000..356f4d43 Binary files /dev/null and b/fig/E06/rhodes_multiband_80_equal_aspect.png differ diff --git a/fig/E06/rhodes_red_80_B04.png b/fig/E06/rhodes_red_80_B04.png new file mode 100644 index 00000000..f4a852db Binary files /dev/null and b/fig/E06/rhodes_red_80_B04.png differ diff --git a/fig/E06/rhodes_red_80_B04_robust.png b/fig/E06/rhodes_red_80_B04_robust.png new file mode 100644 index 00000000..0fa2711c Binary files /dev/null and b/fig/E06/rhodes_red_80_B04_robust.png differ diff --git a/fig/E06/rhodes_red_80_B04_robust_nan.png b/fig/E06/rhodes_red_80_B04_robust_nan.png new file mode 100644 index 00000000..1f8cdeae Binary files /dev/null and b/fig/E06/rhodes_red_80_B04_robust_nan.png differ diff --git a/fig/E06/rhodes_red_80_B04_vmin100_vmax2000.png b/fig/E06/rhodes_red_80_B04_vmin100_vmax2000.png new file mode 100644 index 00000000..7bbf05c4 Binary files /dev/null and b/fig/E06/rhodes_red_80_B04_vmin100_vmax2000.png differ diff --git a/fig/E06/rhodes_red_B04.png b/fig/E06/rhodes_red_B04.png new file mode 100644 index 00000000..e63930c6 Binary files /dev/null and b/fig/E06/rhodes_red_B04.png differ diff --git a/fig/E07/greece_administration_areas.png b/fig/E07/greece_administration_areas.png new file mode 100644 index 00000000..9f99b42b Binary files /dev/null and b/fig/E07/greece_administration_areas.png differ diff --git a/fig/E07/greece_highways.png b/fig/E07/greece_highways.png new file mode 100644 index 00000000..35162087 Binary files /dev/null and b/fig/E07/greece_highways.png differ diff --git a/fig/E07/rhodes_administration_areas.png b/fig/E07/rhodes_administration_areas.png new file mode 100644 index 00000000..d1c2024c Binary files /dev/null and b/fig/E07/rhodes_administration_areas.png differ diff --git a/fig/E07/rhodes_assets.png b/fig/E07/rhodes_assets.png new file mode 100644 index 00000000..38df2a30 Binary files /dev/null and b/fig/E07/rhodes_assets.png differ diff --git a/fig/E07/rhodes_builtup_buffer.png b/fig/E07/rhodes_builtup_buffer.png new file mode 100644 index 00000000..2944ee21 Binary files /dev/null and b/fig/E07/rhodes_builtup_buffer.png differ diff --git a/fig/E07/rhodes_highways.png b/fig/E07/rhodes_highways.png new file mode 100644 
index 00000000..de241aaf Binary files /dev/null and b/fig/E07/rhodes_highways.png differ diff --git a/fig/E07/rhodes_infra_highways.png b/fig/E07/rhodes_infra_highways.png new file mode 100644 index 00000000..641c23cc Binary files /dev/null and b/fig/E07/rhodes_infra_highways.png differ diff --git a/fig/E08/dem.png b/fig/E08/dem.png new file mode 100644 index 00000000..08437810 Binary files /dev/null and b/fig/E08/dem.png differ diff --git a/fig/E08/dem_matched.png b/fig/E08/dem_matched.png new file mode 100644 index 00000000..92ac04d3 Binary files /dev/null and b/fig/E08/dem_matched.png differ diff --git a/fig/E08/visual_clip.png b/fig/E08/visual_clip.png new file mode 100644 index 00000000..9a68c1c4 Binary files /dev/null and b/fig/E08/visual_clip.png differ diff --git a/fig/E08/visual_clip_box.png b/fig/E08/visual_clip_box.png new file mode 100644 index 00000000..2dcf31ab Binary files /dev/null and b/fig/E08/visual_clip_box.png differ diff --git a/fig/E08/visual_large.png b/fig/E08/visual_large.png new file mode 100644 index 00000000..b6380a57 Binary files /dev/null and b/fig/E08/visual_large.png differ diff --git a/fig/E09/NDVI-hist.png b/fig/E09/NDVI-hist.png index 1ecbcc39..7b16ad62 100644 Binary files a/fig/E09/NDVI-hist.png and b/fig/E09/NDVI-hist.png differ diff --git a/fig/E09/NDVI-map.png b/fig/E09/NDVI-map.png index 4bde6c21..635cac5d 100644 Binary files a/fig/E09/NDVI-map.png and b/fig/E09/NDVI-map.png differ diff --git a/fig/E09/NDWI.png b/fig/E09/NDWI.png new file mode 100644 index 00000000..fd025c1b Binary files /dev/null and b/fig/E09/NDWI.png differ diff --git a/fig/E09/NIR-band.png b/fig/E09/NIR-band.png index 73658e9e..827d2642 100644 Binary files a/fig/E09/NIR-band.png and b/fig/E09/NIR-band.png differ diff --git a/fig/E09/custom-index.png b/fig/E09/custom-index.png new file mode 100644 index 00000000..92b805a1 Binary files /dev/null and b/fig/E09/custom-index.png differ diff --git a/fig/E09/red-band.png b/fig/E09/red-band.png index fb3a5011..b0ea4aba 100644 Binary files a/fig/E09/red-band.png and b/fig/E09/red-band.png differ diff --git a/fig/E09/visual-burned-index.png b/fig/E09/visual-burned-index.png new file mode 100644 index 00000000..9f9fd6ac Binary files /dev/null and b/fig/E09/visual-burned-index.png differ diff --git a/fig/E10/zones_rasterized_xarray.png b/fig/E10/zones_rasterized_xarray.png new file mode 100644 index 00000000..5643b06b Binary files /dev/null and b/fig/E10/zones_rasterized_xarray.png differ diff --git a/fig/E11/dask-graph.png b/fig/E11/dask-graph.png index ab6faa5b..22952d33 100644 Binary files a/fig/E11/dask-graph.png and b/fig/E11/dask-graph.png differ diff --git a/index.md b/index.md index 02d54801..24c045bf 100644 --- a/index.md +++ b/index.md @@ -2,19 +2,35 @@ site: sandpaper::sandpaper_site --- -Data Carpentry’s teaching is hands-on, so participants are encouraged to use -their own computers to ensure the proper setup of tools for an efficient -workflow. To most effectively use these materials, please make sure to download -the data and install everything before working through this lesson. - -The data used in this lesson includes optical satellite images from -[the Copernicus Sentinel-2 mission][sentinel-2] and public geographical datasets -from [the dedicated distribution platform of the Dutch government][pdok]. -These are real-world data sets that entail sufficient complexity to teach many -aspects of data analysis and management. 
They have been selected to allow students
-to focus on the core ideas and skills being taught while offering the chance
-to encounter common challenges with geospatial data.
+## Data Carpentry
+
+[Data Carpentry’s](https://datacarpentry.org/) teaching is hands-on. Participants are encouraged to use their own computers to ensure the proper setup of tools for an efficient workflow.
+
+## Geospatial Raster and Vector Data with Python
+
+In this lesson you will learn how to work with geospatial data and how to process it with Python. Python is one of the most popular programming languages for data science and analytics, with a large and steadily growing community in the field of Earth and Space Sciences. The lesson is meant for participants with a basic working knowledge of Python and aims to familiarize them with the world of geospatial raster and vector data. (If you are unfamiliar with Python, we recommend following [this course](https://swcarpentry.github.io/python-novice-inflammation/) or having a look [here](https://greenteapress.com/thinkpython2/thinkpython2.pdf).) In the *Introduction to Geospatial Raster and Vector Data with Python* lesson you will be introduced to a set of tools from the Python ecosystem and learn how these can be used to carry out geospatial data analysis tasks. In particular, you will learn to work with satellite images (i.e. [the Copernicus Sentinel-2 mission][sentinel-2]) and open topographical geo-datasets (i.e. [OpenStreetMap][osm]). You will learn how these spatial datasets can be accessed, explored, manipulated and visualized using Python.
+
+## Case study - Wildfires
+
+As a case study for this lesson we will focus on wildfires. According to the IPCC assessment report, wildfire seasons are lengthening as a result of changes in temperature and increasing drought conditions [IPCC](https://www.ipcc.ch/report/ar6/wg2/about/frequently-asked-questions/keyfaq1/). To analyse the impact of these wildfires, we will focus on the wildfire that occurred on the Greek island [Rhodes in the summer of 2023](https://news.sky.com/story/wildfires-on-rhodes-force-hundreds-of-holidaymakers-to-flee-their-hotels-12925583), which had a devastating effect and led to the evacuation of [19,000 people](https://en.wikipedia.org/wiki/2023_Greece_wildfires). In this lesson we are going to analyse the effects of this disaster by estimating which built-up areas were affected by these wildfires. Furthermore, we will analyse which vegetation and land-use types have been affected the most by the wildfire in order to get an understanding of which areas are more vulnerable to wildfires. Finally, we are going to estimate which locations would be most suitable for placing watchtowers in the region. The analysis that we set up provides insights into the effects of the wildfire and generates input for wildfire mitigation strategies.
+
+*Note that the analyses presented in this lesson are developed for educational purposes. Therefore, in some cases the analysis steps have been simplified and assumptions have been made.*
+
+The data used in this lesson includes optical satellite images from [the Copernicus Sentinel-2 mission][sentinel-2] and topographical data from [OpenStreetMap (OSM)][osm]. These datasets are real-world open data sets that entail sufficient complexity to teach many aspects of data analysis and management.
The datasets have been selected to allow participants to focus on the core ideas and skills being taught while offering the chance to encounter common challenges with geospatial data. Furthermore, we have selected datasets that are available for any location on Earth.
+
+During this lesson we will set up an analysis pipeline which identifies scorched areas based on bands of satellite images collected after the disaster in July 2023. Next, we will analyse the vegetation that was present in these areas before the wildfire, by calculating the [NDVI](https://en.wikipedia.org/wiki/Normalized_difference_vegetation_index) from satellite images taken before the disaster and comparing it with the scorched areas. To identify the affected built-up areas and the most important roads, we will use OSM vector data and compare it with the scorched areas identified above. Finally, we will use elevation data to perform viewshed analyses in order to locate the best locations for (hypothetical) watchtowers, which in theory would allow a wildfire to be identified earlier, thus allowing a quicker response.
+
+To most effectively use these materials, make sure to download the data and install everything before working through this lesson (this applies especially to learners following this lesson in a workshop).
+
+## Python libraries used in this lesson
+
+The main Python libraries that are used in this lesson are:
+
+- [geopandas](https://geopandas.org/en/stable/)
+- [rioxarray](https://github.com/corteva/rioxarray)
+- [xarray-spatial](https://xarray-spatial.org/)
+- [dask](https://www.dask.org/)
+- [pystac](https://pystac.readthedocs.io/en/stable/)

[sentinel-2]: https://sentinel.esa.int/web/sentinel/missions/sentinel-2
-[pdok]: https://www.pdok.nl
-[workbench]: https://carpentries.github.io/sandpaper-docs
+[osm]: https://www.openstreetmap.org/#map=14/45.2935/18.7986
diff --git a/md5sum.txt b/md5sum.txt
index 48aa8fda..85193853 100644
--- a/md5sum.txt
+++ b/md5sum.txt
@@ -2,19 +2,19 @@
"CODE_OF_CONDUCT.md" "c93c83c630db2fe2462240bf72552548" "site/built/CODE_OF_CONDUCT.md" "2023-07-27"
"LICENSE.md" "b24ebbb41b14ca25cf6b8216dda83e5f" "site/built/LICENSE.md" "2023-07-27"
"config.yaml" "36ecf0d8c86d44ced0b1bd223bc22fe5" "site/built/config.yaml" "2024-02-26"
-"index.md" "7a1a621a2439b80c06afc6bb6665fc40" "site/built/index.md" "2023-07-27"
+"index.md" "d398c6c98681439c0a422d047d5751f1" "site/built/index.md" "2024-06-19"
"links.md" "f70df760c54e918207d3fafb9f48e1bf" "site/built/links.md" "2023-07-27"
-"episodes/01-intro-raster-data.md" "61745ee38a6c30a40eb66cfc323f176f" "site/built/01-intro-raster-data.md" "2023-07-27"
-"episodes/02-intro-vector-data.md" "48082eb809ec7caa0c3005fd6cdd5dae" "site/built/02-intro-vector-data.md" "2023-07-27"
-"episodes/03-crs.md" "85d370c723c2a6e8b39e853d25363aef" "site/built/03-crs.md" "2023-07-27"
-"episodes/04-geo-landscape.md" "04937f2c15667adbe259b68e76ac0d46" "site/built/04-geo-landscape.md" "2023-07-27"
-"episodes/05-access-data.md" "3f4586875eaa4c175f5ed7cd6730b6be" "site/built/05-access-data.md" "2023-09-12"
-"episodes/06-raster-intro.md" "686d165b48bea5469d3eb73f0f54a421" "site/built/06-raster-intro.md" "2023-08-23"
-"episodes/07-vector-data-in-python.md" "01a7e0adfcc44f6a1ad5c8ede060473d" "site/built/07-vector-data-in-python.md" "2023-08-08"
-"episodes/08-crop-raster-data.md" "87861d050911db9082000d4788140766" "site/built/08-crop-raster-data.md" "2023-07-27"
-"episodes/09-raster-calculations.md" "1d10f506ff34c4f024929719efcb1717" "site/built/09-raster-calculations.md" "2023-07-27"
-"episodes/10-zonal-statistics.md" "90edeb980d0bc716becda6a750ccb2df" "site/built/10-zonal-statistics.md" "2023-07-27"
-"episodes/11-parallel-raster-computations.md" "abf3942aef90750228d659914a8fec70" "site/built/11-parallel-raster-computations.md" "2023-07-27"
+"episodes/01-intro-raster-data.md" "8a040fd799ddc19bb53644930bc15e82" "site/built/01-intro-raster-data.md" "2024-06-19"
+"episodes/02-intro-vector-data.md" "3bf0a19a7e0e3990a0f2ee02815d1147" "site/built/02-intro-vector-data.md" "2024-06-19"
+"episodes/03-crs.md" "40e659ca7228c07f84c40b51b601dc55" "site/built/03-crs.md" "2024-06-19"
+"episodes/04-geo-landscape.md" "09e93cefc450045e23f31f3f0d00c6e1" "site/built/04-geo-landscape.md" "2024-06-19"
+"episodes/05-access-data.md" "7164c1f0fd4a387708fa771314158327" "site/built/05-access-data.md" "2024-06-19"
+"episodes/06-raster-intro.md" "9efe2ff7489365a77ea49140d96f9d1a" "site/built/06-raster-intro.md" "2024-06-19"
+"episodes/07-vector-data-in-python.md" "980d24ebc36dc8bb0746711fc5820e9a" "site/built/07-vector-data-in-python.md" "2024-06-19"
+"episodes/08-crop-raster-data.md" "a3fc0ee9b83b78e4f4c1d04bda052ffa" "site/built/08-crop-raster-data.md" "2024-06-19"
+"episodes/09-raster-calculations.md" "ef02d74e56538dfd2ca992d0e5017e95" "site/built/09-raster-calculations.md" "2024-06-19"
+"episodes/10-zonal-statistics.md" "651b45d7a1a13e035ca17c10eed5519c" "site/built/10-zonal-statistics.md" "2024-06-19"
+"episodes/11-parallel-raster-computations.md" "5017875610a559a34a07c6a416d6a926" "site/built/11-parallel-raster-computations.md" "2024-06-19"
"instructors/instructor-notes.md" "1b1cbfc8fff44565421842208ccdab4f" "site/built/instructor-notes.md" "2023-07-27"
-"learners/setup.md" "c8efe7f5a88db9f82d4c43820e8897ce" "site/built/setup.md" "2023-10-09"
+"learners/setup.md" "57a2bc11894512886ad1c3fbf51970d6" "site/built/setup.md" "2024-06-19"
"profiles/learner-profiles.md" "ab92fe3511bf951857b47c0ecbf2ab79" "site/built/learner-profiles.md" "2023-07-27"
diff --git a/setup.md b/setup.md
index 6858b442..b9573e51 100644
--- a/setup.md
+++ b/setup.md
@@ -6,10 +6,15 @@ title: Setup

1. Create a new directory on your Desktop called `geospatial-python`.
2. Within `geospatial-python`, create a directory called `data`.
-3. Download the following files and save them to the just created `data` directory (**do not unzip the files**, we will read from them directly):
-   * [brpgewaspercelen_definitief_2020_small.gpkg](https://figshare.com/ndownloader/files/37729413)
-   * [brogmwvolledigeset.zip](https://figshare.com/ndownloader/files/37729416)
-   * [status_vaarweg.zip](https://figshare.com/ndownloader/files/37729419)
+3. Download the data required for this lesson via [this link](https://figshare.com/ndownloader/articles/25721754/versions/3) (612 MB).
+4. Unzip the downloaded files and save them to the newly created `data` directory.
+
+Now you should have the following files in the `data` directory:
+
+- `sentinel-2` - This is a directory containing multiple bands of Sentinel-2 raster images over the island of Rhodes, acquired on Aug 27, 2023.
+- `dem/rhodes_dem.tif` - This is the Digital Elevation Model (DEM) of the island of Rhodes, retrieved from [Copernicus Digital Elevation Model (GLO-30 instance)](https://spacedata.copernicus.eu/collections/copernicus-digital-elevation-model) and modified for this course.
+- `gadm/ADM_ADM_3.gpkg` - These are the administrative boundaries of Rhodes, downloaded from [GADM](https://gadm.org/) and modified for this course.
+- `osm/osm_landuse.gpkg` and `osm/osm_roads.gpkg` - These are land-use polygons and road polylines of Rhodes, downloaded from [OpenStreetMap](https://www.openstreetmap.org) via [Geofabrik](http://www.geofabrik.de/data/download.html) and modified for this course.

## Software Setup

@@ -121,6 +126,15 @@ command in your terminal (use the **Anaconda prompt** on **Windows**).

3. Create the Python environment for the workshop by running:

+   The main Python libraries that are used in this lesson are:
+   - [geopandas](https://geopandas.org/en/stable/)
+   - [rioxarray](https://github.com/corteva/rioxarray)
+   - [xarray-spatial](https://xarray-spatial.org/)
+   - [dask](https://www.dask.org/)
+   - [pystac](https://pystac.readthedocs.io/en/stable/)
+
+   In order to **not** have to install all these libraries and their dependencies separately, a `.yaml` configuration file has been created that installs all of them automatically. (If you still want to install these packages separately, you could use `pip`.) The easiest way, however, is to use the `.yaml` configuration file by running the following command:
+
   ```bash
   mamba env create -n geospatial -f https://raw.githubusercontent.com/carpentries-incubator/geospatial-python/main/files/environment.yaml
   ```

@@ -137,7 +151,7 @@ command in your terminal (use the **Anaconda prompt** on **Windows**).
   # $ conda deactivate
   ```

-5. Activate the `geospatial` environment:
+5. Now activate the `geospatial` environment by running:

   ```bash
   conda activate geospatial
   ```

@@ -216,5 +230,4 @@ If all the steps above completed successfully you are ready to follow along with

[anaconda-windows]: https://www.anaconda.com/download/#windows
[python]: https://python.org
[video-mac]: https://www.youtube.com/watch?v=TcSAln46u9U
-[video-windows]: https://www.youtube.com/watch?v=xxQ0mzZ8UvA
-
+[video-windows]: https://www.youtube.com/watch?v=xxQ0mzZ8UvA
\ No newline at end of file