This repository contains scripts for generating the NEBULA dataset, a postcode-level dataset for neighbourhood energy modelling.
- A conference paper introducing this dataset: NeurIPS 2024 Climate Change AI
- Work that uses this dataset: BuildSys 2024 benchmarking paper
- Data descriptor paper: coming soon.
# Create new environment
conda create -n nebula python=3.10
# Activate environment
conda activate nebula
# Install requirements
pip install -r requirements.txt
conda install conda-forge::libgdal==3.6.4
# libtiff==4.5.0
- Building Stock Data (Verisk)
- Postcode Shapefiles (Edina)
Conversations with OS indicated that postcode shapefiles are open-access data, but we recommend users download them themselves from accredited sources.
Place these files in the input_data_sources directory, or download them from our zip (a sanity-check sketch follows the list below):
- Gas and Electricity Data (DESNZ, 2022)
- ONS UPRN to Postcode Mapping (2022)
- Building Floor Count Global Averages
- Census 2021 Statistics
- Census 2021 Postcode-Output Area-Region Mappings
- Output Areas 2011-2021 Mapping
- Postcode Areas: area of postcodes (derived from postcode shapefiles)
- Climate Data (HAD-UK Monthly Temperature, 2022)
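A quick way to confirm the inputs are in place before running is to check that the expected sub-directories of input_data_sources exist. A minimal sketch; the directory names mirror the repository structure shown in the next section, so adjust if your local layout differs:

```python
# Optional sanity check that the expected input directories are present.
# Directory names mirror the repository structure below; adjust as needed.
from pathlib import Path

EXPECTED_DIRS = [
    "census_2021", "climate_data", "energy_data", "lookups",
    "ONS_UPRN_DATABASE", "postcode_areas", "urban_rural_2011",
]

missing = [d for d in EXPECTED_DIRS
           if not (Path("input_data_sources") / d).is_dir()]
if missing:
    print("Missing input directories:", missing)
```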
input_data_sources/ # Input data files
├── census_2021/
├── climate_data/
├── energy_data/
├── lookups/
│ ├── oa_lsoa_2021/ # OA to LSOA mapping
│ └── oa_2011_2021/ # OA conversion lookup
├── ONS_UPRN_DATABASE/
├── postcode_areas/
└── urban_rural_2011/
batches/ # Processing batch lists
src/ # Source code
intermediate_data/ # Temporary processing files; sub-theme results stored here
├── age/
├── census_attrs/
├── fuel/
├── temp_data/
└── type/
final_dataset/ # Output files
├── NEBULA_data_filtered.csv
├── Unfiltered_processed_data.csv
└── attribute_logs/ # Logs for building stock batch calculations (record counts per batch)
    ├── age_log_file.csv
    ├── fuel_log_file.csv
    └── type_log_file.csv
main.py # Generates the whole dataset when running locally
split_onsud.py # If running on HPC - stage 1: generates the batch files
generate_building_stock.py # HPC Python wrapper
nebula_job.sh # If running on HPC - bash script to submit multiple batches
submit_nebula.sh # If running on HPC - SLURM submission for a single batch
create_global_averages.py # Script for generating the global averages table. We include the 2022 global averages in intermediate data; script provided for reference.
© 2024 Grace Colverd
This code is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/
For commercial use, please contact: [email protected].
The processed dataset is available under an open licence - please see the accompanying paper for details.
- Install dependencies from requirements.txt
- Place input data in appropriate directories
- Configure variables in main.py as needed
- Run the processing pipeline:
python main.py
- Generate the batches of 10k postcodes: run split_onsud.py
- Update the SLURM scripts nebula_job.sh and submit_nebula.sh to run the fuel, age and typology calculations
- Submit multiple jobs using nebula_job.sh (see the sketch after this list)
- When all themes have finished calculating, update main.py to call only the post-processing section
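The exact submission flow is defined in nebula_job.sh; conceptually it amounts to submitting one job per theme and batch. A minimal sketch, assuming submit_nebula.sh takes the theme and batch file as arguments (check the scripts for the actual interface):

```bash
# Hypothetical illustration only: the real arguments expected by
# submit_nebula.sh are defined in the script itself.
for theme in fuel age type; do
    for batch in batches/*; do
        sbatch submit_nebula.sh "$theme" "$batch"
    done
done
```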
The pipeline generates postcode-level statistics including:
- Building age and type distributions
- Temperature data (heating and cooling degree days, HDD/CDD; see the sketch after this list)
- Census demographics
- Building statistics and averages
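Heating and cooling degree days are derived from mean temperatures relative to a base temperature. A minimal sketch of the idea, assuming monthly HAD-UK means and a base temperature of 15.5 °C (a common UK convention; the base actually used by the pipeline is defined in src/):

```python
# Illustrative only; see src/ for the pipeline's actual implementation.
BASE_TEMP_C = 15.5  # assumed base temperature (common UK convention)

def annual_degree_days(monthly_mean_temps_c, days_in_month):
    """Return (HDD, CDD) summed over the year from 12 monthly mean temps."""
    hdd = sum(max(BASE_TEMP_C - t, 0.0) * d
              for t, d in zip(monthly_mean_temps_c, days_in_month))
    cdd = sum(max(t - BASE_TEMP_C, 0.0) * d
              for t, d in zip(monthly_mean_temps_c, days_in_month))
    return hdd, cdd
```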
- We batch up the process of converting the building stock dataset into postcode attributes (themes: building stock, typology and age). This enables better logging and multithreading. The current setup processes each region separately, split into batches of 10k postcodes (see the sketch below).
- We provide two generation routes: local and HPC generation. For one region, running locally takes an estimated 48 hours; multithreading can speed this up.
- When running on HPC, we submit each type / region / batch as a separate job. Using an 8 GB (3 CPU) job, each 10k batch takes approx. 1.5 hours for fuel and 20 minutes for age/type. Total run time: (152 × 1.5) + (2 × 152 × 0.3) ≈ 319 hours.
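The batching step itself is simple chunking of each region's postcode list. A minimal sketch of the idea (the actual implementation is split_onsud.py and may differ):

```python
# Hypothetical sketch of the batching idea; see split_onsud.py for the
# implementation actually used to generate the batch files.
BATCH_SIZE = 10_000

def make_batches(postcodes, batch_size=BATCH_SIZE):
    """Split a region's list of postcodes into chunks of up to batch_size."""
    return [postcodes[i:i + batch_size]
            for i in range(0, len(postcodes), batch_size)]
```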
- Check overlapping_pcs.txt for postcode boundary issues
- See global_avs/ for reference statistics
- Intermediate files can be safely deleted after final dataset generation