
NEBULA Pipeline

NEBULA Dataset Generation

This repository contains scripts for generating the NEBULA dataset, a postcode-level dataset for neighbourhood energy modelling.

[Figure: NEBULA pipeline overview]

Prerequisites

Environment Setup

# Create new environment
conda create -n nebula python=3.10

# Activate environment
conda activate nebula

# Install requirements
pip install -r requirements.txt
conda install conda-forge::libgdal==3.6.4
# libtiff==4.5.0 may also need pinning on some systems
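
To confirm the pinned GDAL build is picked up, a quick sanity check (assuming the osgeo Python bindings are present in the environment):

# Expect a 3.6.x release string
from osgeo import gdal
print(gdal.VersionInfo("RELEASE_NAME"))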

Required Data Sources

User-Provided Data (Non-Open License)

  • Building Stock Data (Verisk)
  • Postcode Shapefiles (Edina)

Conversations with Ordnance Survey (OS) indicated that postcode shapefiles are open-access data, but we recommend users download them themselves from accredited sources.

Provided Data (Open Government License)

Place these files in the input_data_sources directory, or download them from the zip archive we provide:

  1. Gas and Electricity Data (DESNZ, 2022)
  2. ONS UPRN to Postcode Mapping (2022)
  3. Building Floor Count Global Averages
  4. Census 2021 Statistics
  5. Census 2021 Postcode-Output Area-Region Mappings
  6. Output Areas 2011-2021 Mapping
  7. Postcode Areas: the area of each postcode (derived from the postcode shapefiles)
  8. Climate Data (HAD-UK Monthly Temperature, 2022)

Directory Structure

input_data_sources/                   # Input data files
├── census_2021/
├── climate_data/
├── energy_data/
├── lookups/
│   ├── oa_lsoa_2021/               # OA to LSOA mapping
│   └── oa_2011_2021/               # OA conversion lookup
├── ONS_UPRN_DATABASE/
├── postcode_areas/
└── urban_rural_2011/

batches/                         # Processing batch lists

src/                              # Source code


intermediate_data/                # Temporary processing files; per-theme results are stored here
├── age/
├── census_attrs/
├── fuel/ 
├── temp_data/         
└── type/

final_dataset/                   # Output files
├── NEBULA_data_filtered.csv
├── Unfiltered_processed_data.csv
└── attribute_logs/             # Logs for building stock batch calculations; shows record counts per batch
    ├── age_log_file.csv
    ├── fuel_log_file.csv
    └── type_log_file.csv


main.py                      # Generates the whole dataset when running locally

split_onsud.py               # If running on HPC - stage 1: generates batch files
generate_building_stock.py   # If running on HPC - Python wrapper
nebula_job.sh                # If running on HPC - bash script to submit multiple batches
submit_nebula.sh             # If running on HPC - Slurm submission for a single batch

create_global_averages.py    # Script for generating the global averages table. We include the 2022 global averages in intermediate_data; the script is provided for reference.
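
Before a run, a minimal pre-flight check can confirm the inputs are in place. This is a sketch only, using the directory names from the tree above and assuming the repository root as the working directory:

# Sketch: verify expected input directories before processing
from pathlib import Path

EXPECTED_DIRS = [
    "census_2021", "climate_data", "energy_data",
    "lookups/oa_lsoa_2021", "lookups/oa_2011_2021",
    "ONS_UPRN_DATABASE", "postcode_areas", "urban_rural_2011",
]

root = Path("input_data_sources")
missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
if missing:
    raise SystemExit(f"Missing input directories: {missing}")
print("All input directories present.")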

License

© 2024 Grace Colverd

This code is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/

For commercial use, please contact: [email protected].

The processed dataset is available under an open licence - please see the accompanying paper for details.

Usage

  1. Install dependencies from requirements.txt
  2. Place input data in appropriate directories

If running locally

  1. Configure variables in main.py as needed
  2. Run the processing pipeline:
    python main.py
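
For illustration only, the variables to configure are typically paths and parallelism settings along these lines (the names here are hypothetical; see main.py for the actual ones):

# Hypothetical settings; the real variable names live in main.py
INPUT_DIR = "input_data_sources"
INTERMEDIATE_DIR = "intermediate_data"
OUTPUT_DIR = "final_dataset"
N_WORKERS = 4  # multithreading can shorten the ~48 h single-region run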

If running on HPC

  1. Generate the batches of 10k postcodes:
    python split_onsud.py
  2. Update the Slurm scripts nebula_job.sh and submit_nebula.sh to run the fuel, age, and typology calculations
  3. Submit multiple jobs using nebula_job.sh
  4. When all themes have finished calculating, update main.py to run only the post-processing section
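
As a rough illustration of what each submitted job does (hypothetical; generate_building_stock.py and the Slurm scripts define the real interface):

# Hypothetical sketch of a per-batch HPC job: one theme, one batch file
import sys

def run_batch(theme, batch_file):
    # theme is one of "fuel", "age", "type"; batch_file lists ~10k postcodes
    with open(batch_file) as f:
        postcodes = [line.strip() for line in f if line.strip()]
    print(f"Processing {len(postcodes)} postcodes for theme '{theme}'")
    # ... per-theme calculation would run here ...

if __name__ == "__main__":
    run_batch(sys.argv[1], sys.argv[2])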

Output Dataset

The pipeline generates postcode-level statistics including:

  • Building age and type distributions
  • Temperature data (HDD/CDD)
  • Census demographics
  • Building statistics and averages
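
Heating and cooling degree days (HDD/CDD) summarise how far temperatures sit below or above a base temperature. A minimal sketch of the standard monthly-mean approximation, assuming illustrative base temperatures (the pipeline's actual bases and method may differ):

# Sketch: annual HDD/CDD from 12 monthly mean temperatures (deg C)
HDD_BASE = 15.5  # common UK heating base; an assumption here
CDD_BASE = 22.0  # illustrative cooling base; an assumption here

def degree_days(monthly_means, days_in_month):
    hdd = sum(max(0.0, HDD_BASE - t) * d for t, d in zip(monthly_means, days_in_month))
    cdd = sum(max(0.0, t - CDD_BASE) * d for t, d in zip(monthly_means, days_in_month))
    return hdd, cdd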

Notes

  • We batch up the process of converting the building stock dataset into postcode attributes (themes: fuel, typology, and age); a sketch of this batching appears after this list. Batching enables better logging and multithreading. The current setup processes each region separately, split into batches of 10k postcodes.
  • We provide two generation routes: local and HPC. Running one region locally takes an estimated 48 hours; multithreading can speed this up.
  • When running on HPC, we submit each theme/region/batch combination as a separate job. Using an 8 GB (3 CPU) job, each 10k batch takes approx. 1.5 hours for fuel and 20 minutes for age/type. Total serial run time: (152 * 1.5) + (2 * 152 * 0.3) ≈ 319 hours.
  • Check overlapping_pcs.txt for postcode boundary issues
  • See global_avs/ for reference statistics
  • Intermediate files can be safely deleted after final dataset generation
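
A minimal sketch of the batching step referenced above (the column name, file layout, and function are assumptions for illustration, not the pipeline's actual code):

# Sketch: split one region's postcodes into 10k batches for separate jobs
from pathlib import Path
import pandas as pd

BATCH_SIZE = 10_000

def write_batches(onsud_csv, region, out_dir="batches"):
    # "PCDS" is an assumed postcode column name in the ONS UPRN extract
    postcodes = pd.read_csv(onsud_csv)["PCDS"].dropna().unique()
    Path(out_dir).mkdir(exist_ok=True)
    for i in range(0, len(postcodes), BATCH_SIZE):
        chunk = postcodes[i : i + BATCH_SIZE]
        pd.Series(chunk).to_csv(
            f"{out_dir}/{region}_batch_{i // BATCH_SIZE}.txt",
            index=False, header=False)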
