diff --git a/docs/src/SUMMARY.md b/docs/src/SUMMARY.md
index c3722f0..f1f282e 100644
--- a/docs/src/SUMMARY.md
+++ b/docs/src/SUMMARY.md
@@ -1,5 +1,4 @@
 # Summary
 
 - [Setup the script](./setup.md)
-- [Query the data](./query.md)
 
diff --git a/docs/src/query.md b/docs/src/query.md
deleted file mode 100644
index ca0bf92..0000000
--- a/docs/src/query.md
+++ /dev/null
@@ -1,18 +0,0 @@
-# Query radar track points
-The data is stored in partitioned parquet files, the recommended way of accessing the data is with `DuckDB`.
-It's possible to use DuckDB for different languages, in particular [R](https://duckdb.org/docs/api/r.html) and [Python](https://duckdb.org/docs/api/python/overview), so this overview will focus on the SQL aspect rather than the wrapper language.
-
-## How to read a parquet file
-
-```sql
-describe select * from read_parquet('track.parquet/year=2019/m201910');
-select * from read_parquet('track_points.parquet/year=2019/month=10/*.parquet');
-```
-
-[Reference](https://duckdb.org/docs/guides/file_formats/parquet_import)
-
-
-Example: read all the track points
-```sql
-select * from read_parquet('track_points.parquet/*/*/*.parquet', hive_partitioning = true) where "month" = 10;
-```
diff --git a/docs/src/setup.md b/docs/src/setup.md
index 813ba99..1022b58 100644
--- a/docs/src/setup.md
+++ b/docs/src/setup.md
@@ -13,39 +13,30 @@
 ### How to use
 ```bash
 nix-shell
-make param1=value
-```
-
-#### Available parameters
-from `.env`:
-- `PG_HOST` hostname for postgresql connection
-- `PG_USER` username for postgresql connection
-- `ELEVATION_MODEL` path to the raster that contains the elevation model
-
-from command line:
-- `YEAR` year to process
-- `MONTH` month to process
-- `PG_DBNAME` database to connect to
-
-
-## Tips
-It's possible to use GDAL to merge on the fly different raster datasets
-```bash
-gdalbuildvrt dem.vrt /path/to/dir/*.tif
+task -a
 ```
 
 ### How does it work?
-The software relies on different technologies to efficiently work, in particular to overcome issues of scalability the procedure works by partitioning the dataset on the fly and then parallelizing the operations.
+The software relies on several technologies to work efficiently; in particular, to overcome scalability issues, the procedure partitions the dataset on the fly and then parallelizes the operations.
 - Nix, for creating a reproducible environment
-- Makefile, for describing a pipeline
+- Taskfile, for describing a pipeline
 - GNU Parallel, for running tasks on partitions in parallel
-- GDAL, for extracting data from postgres and for efficiently compute the pixel value of the DEM
-- DuckDB, for ultra efficient data operations in-memory
+- GDAL, for efficiently computing the pixel value of the DEM
+- DuckDB, for efficient in-memory data operations and postgres extraction
 - Apache Parquet format, for storing intermediate and final results
 
 A short description of the procedure follows:
-- convert the `track` table in postgis (containing linestrings) to a local parquet file with GDAL
+- convert the `track` table in PostGIS (containing linestrings) to a local parquet file with DuckDB, as sketched below
 - chunk the parquet in partitions of N elements in-memory
-- using GNU Parallel a DuckDB query is run on each chunk, the query produces a row for each point in the linestring in the chunk and outputs to a parquet file
+- using GNU Parallel, a DuckDB query is run on each chunk; the query produces a row for each point in the linestrings of the chunk and writes the output to a parquet file
 - each chunk is then sent to GDAL and the pixel value of the raster at the coordinates of each point is computed in parallel
 
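+For example, the extraction step could be a single DuckDB query along these lines (a minimal sketch assuming the DuckDB `postgres` extension; the connection values and the `public` schema are illustrative, not taken from the actual Taskfile):
+```sql
+INSTALL postgres;
+LOAD postgres;
+-- attach the source database read-only (connection values are placeholders)
+ATTACH 'host=localhost dbname=radar user=postgres' AS pg (TYPE postgres, READ_ONLY);
+-- dump the track table to a local parquet file
+COPY (SELECT * FROM pg.public.track) TO 'track.parquet' (FORMAT parquet);
+```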