This repository contains scripts for generating the NEBULA dataset, a postcode-level dataset for neighbourhood energy modelling.
- A conference paper introducing this dataset: NeurIPS 2024 Climate Change AI
- Work that uses this dataset: BuildSys 2024 benchmarking paper
- Data descriptor paper: coming soon.
# Create new environment
conda create -n nebula python=3.10
# Activate environment
conda activate nebula
# Install requirements
pip install -r requirements.txt
conda install conda-forge::libgdal==3.6.4
# libtiff==4.5.0
- Building Stock Data (Verisk)
- Postcode Shapefiles (Edina)
Conversations with OS indicated that postcode shapefiles are open-access data, but we recommend users download them themselves from accredited sources.
Place these files in the input_data_sources directory, or download them from our zip (a sanity-check sketch follows the list below):
- Gas and Electricity Data (DESNZ, 2022)
- ONS UPRN to Postcode Mapping (2022)
- Building Floor Count Global Averages
- Census 2021 Statistics
- Census 2021 Postcode-Output Area-Region Mappings
- Output Areas 2011-2021 Mapping
- Postcode Areas: area of postcodes (derived from postcode shapefiles)
- Climate Data (HAD-UK Monthly Temperature, 2022)
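A quick way to confirm the inputs are in place before running is to check that the expected sub-directories of input_data_sources exist. A minimal sketch; the directory names mirror the repository structure shown in the next section, so adjust if your local layout differs:

```python
# Optional sanity check that the expected input directories are present.
# Directory names mirror the repository structure below; adjust as needed.
from pathlib import Path

EXPECTED_DIRS = [
    "census_2021", "climate_data", "energy_data", "lookups",
    "ONS_UPRN_DATABASE", "postcode_areas", "urban_rural_2011",
]

missing = [d for d in EXPECTED_DIRS
           if not (Path("input_data_sources") / d).is_dir()]
if missing:
    print("Missing input directories:", missing)
```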
input_data_sources/ # Input data files
├── census_2021/
├── climate_data/
├── energy_data/
├── lookups/
│ ├── oa_lsoa_2021/ # OA to LSOA mapping
│ └── oa_2011_2021/ # OA conversion lookup
├── ONS_UPRN_DATABASE/
├── postcode_areas/
└── urban_rural_2011/
batches/ # Processing batch lists
src/ # Source code
intermediate_data/ # Temporary processing files; sub-theme results stored here
├── age/
├── census_attrs/
├── fuel/
├── temp_data/
└── type/
final_dataset/ # Output files
├── NEBULA_data_filtered.csv
├── Unfiltered_processed_data.csv
└── attribute_logs/ # Logs for building stock batch calculations (record counts per batch)
    ├── age_log_file.csv
    ├── fuel_log_file.csv
    └── type_log_file.csv
main.py # Generates the whole dataset when running locally
split_onsud.py # If running on HPC - stage 1: generates the batch files
generate_building_stock.py # HPC Python wrapper
nebula_job.sh # If running on HPC - bash script to submit multiple batches
submit_nebula.sh # If running on HPC - SLURM submission for a single batch
create_global_averages.py # Script for generating the global averages table. We include the 2022 global averages in intermediate data; script provided for reference.
© 2024 Grace Colverd
This code is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/
For commercial use, please contact: [email protected].
The processed dataset is available under an open licence - please see the accompanying paper for details.
- Install dependencies from requirements.txt
- Place input data in appropriate directories
- Configure variables in main.py as needed
- Run the processing pipeline:
python main.py
- Generate the batches of 10k postcodes: run split_onsud.py
- Update the SLURM scripts nebula_job.sh and submit_nebula.sh to run the fuel, age and typology calculations
- Submit multiple jobs using nebula_job.sh (see the sketch after this list)
- When all themes have finished calculating, update main.py to call only the post-processing section
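The exact submission flow is defined in nebula_job.sh; conceptually it amounts to submitting one job per theme and batch. A minimal sketch, assuming submit_nebula.sh takes the theme and batch file as arguments (check the scripts for the actual interface):

```bash
# Hypothetical illustration only: the real arguments expected by
# submit_nebula.sh are defined in the script itself.
for theme in fuel age type; do
    for batch in batches/*; do
        sbatch submit_nebula.sh "$theme" "$batch"
    done
done
```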
The pipeline generates postcode-level statistics including:
- Building age and type distributions
- Temperature data (heating and cooling degree days, HDD/CDD; see the sketch after this list)
- Census demographics
- Building statistics and averages
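Heating and cooling degree days are derived from mean temperatures relative to a base temperature. A minimal sketch of the idea, assuming monthly HAD-UK means and a base temperature of 15.5 °C (a common UK convention; the base actually used by the pipeline is defined in src/):

```python
# Illustrative only; see src/ for the pipeline's actual implementation.
BASE_TEMP_C = 15.5  # assumed base temperature (common UK convention)

def annual_degree_days(monthly_mean_temps_c, days_in_month):
    """Return (HDD, CDD) summed over the year from 12 monthly mean temps."""
    hdd = sum(max(BASE_TEMP_C - t, 0.0) * d
              for t, d in zip(monthly_mean_temps_c, days_in_month))
    cdd = sum(max(t - BASE_TEMP_C, 0.0) * d
              for t, d in zip(monthly_mean_temps_c, days_in_month))
    return hdd, cdd
```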
- We batch up the process of converting the building stock dataset into postcode attributes (themes: building stock, typology and age). This enables better logging and multithreading. The current setup processes each region separately, split into batches of 10k postcodes (see the sketch below).
- We provide two generation routes: local and HPC generation. For one region, running locally takes an estimated 48 hours; multithreading can speed this up.
- When running on HPC, we submit each type / region / batch as a separate job. Using an 8 GB (3 CPU) job, each 10k batch takes approx. 1.5 hours for fuel and 20 minutes for age/type. Total run time: (152 × 1.5) + (2 × 152 × 0.3) ≈ 319 hours.
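The batching step itself is simple chunking of each region's postcode list. A minimal sketch of the idea (the actual implementation is split_onsud.py and may differ):

```python
# Hypothetical sketch of the batching idea; see split_onsud.py for the
# implementation actually used to generate the batch files.
BATCH_SIZE = 10_000

def make_batches(postcodes, batch_size=BATCH_SIZE):
    """Split a region's list of postcodes into chunks of up to batch_size."""
    return [postcodes[i:i + batch_size]
            for i in range(0, len(postcodes), batch_size)]
```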
- Check overlapping_pcs.txt for postcode boundary issues
- See global_avs/ for reference statistics
- Intermediate files can be safely deleted after final dataset generation