PRC Data Challenge - Actual TakeOff Weight (ATOW) Prediction

Overview

The Performance Review Commission (PRC) Data Challenge is designed to engage data scientists, even without an aviation background, to create teams and compete in building an open Machine Learning (ML) model. The challenge is to accurately infer the Actual TakeOff Weight (ATOW) of flights across Europe in 2022.

We provide detailed flight information for 369,013 flights, including origin/destination airports, aircraft types, off-block and arrival times, and the estimated TakeOff Weight (ETOW). Thanks to collaboration with the OpenSky Network (OSN), we also provide the corresponding flight trajectories, sampled at a maximum 1-second granularity, accounting for 158 GiB of parquet files.

The challenge will be scored using two datasets:

The submission_set.csv, containing 105,959 flights, will be used for ranking intermediate submissions.
An additional 52,190 flights will be used for the final ranking and prize evaluation.

For more information, visit the Data page on the challenge website.

Acronyms

ADS-B: Automatic Dependent Surveillance–Broadcast
ATOW: Actual TakeOff Weight
ETOW: Estimated TakeOff Weight
ML: Machine Learning
MTOW: Maximum TakeOff Weight
OSN: OpenSky Network
PRC: Performance Review Commission
TOW: TakeOff Weight

Flight List

The dataset contains 369,013 flights that departed or arrived in Europe in 2022. It includes the following details:

Flight Identification: Unique ID (flight_id), obfuscated callsign (callsign)
Origin/Destination:
- Aerodrome of Departure (adep) [ICAO code]
- Aerodrome of Destination (ades) [ICAO code]
- Airport name (name_adep, name_ades)
- Country codes (country_code_adep, country_code_ades) [ISO2C]
Timing:
- Date of flight (date) [ISO 8601 UTC]
- Actual Off-Block Time (actual_offblock_time) [ISO 8601 UTC]
- Arrival Time (arrival_time) [ISO 8601 UTC]
Aircraft:
- Aircraft type code (aircraft_type) [ICAO aircraft type]
- Wake Turbulence Category (wtc)
Airline:
- Obfuscated Aircraft Operator (AO) code (airline)
Operational Values:
- Flight duration (flight_duration) [min]
- Taxi-out time (taxiout_time) [min]
- Route length (flown_distance) [nmi]
- Estimated TakeOff Weight (tow) [kg]
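As a minimal sketch of how one flight-list row could be turned into typed values, the snippet below parses the timing and operational columns listed above. The sample row is invented for illustration, and the exact timestamp formatting (a trailing Z for UTC) is an assumption rather than something this README specifies.

```python
from datetime import datetime, timezone


def parse_utc(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, treating a trailing 'Z' as UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Assume UTC when the string carries no offset at all.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def parse_flight(row: dict) -> dict:
    """Convert the string fields of one flight-list row into Python types."""
    offblock = parse_utc(row["actual_offblock_time"])
    arrival = parse_utc(row["arrival_time"])
    return {
        "flight_id": row["flight_id"],
        "aircraft_type": row["aircraft_type"],
        "actual_offblock_time": offblock,
        "arrival_time": arrival,
        "flight_duration": float(row["flight_duration"]),  # min
        "taxiout_time": float(row["taxiout_time"]),        # min
        "flown_distance": float(row["flown_distance"]),    # nmi
        "tow": float(row["tow"]),                          # kg
        # Derived value: off-block to arrival time, in minutes.
        "block_minutes": (arrival - offblock).total_seconds() / 60.0,
    }


# Hypothetical row, shaped like what csv.DictReader would yield.
sample_row = {
    "flight_id": "1",
    "aircraft_type": "A320",
    "actual_offblock_time": "2022-01-01T13:46:00Z",
    "arrival_time": "2022-01-01T15:04:56Z",
    "flight_duration": "61",
    "taxiout_time": "18",
    "flown_distance": "321",
    "tow": "54748.0",
}
flight = parse_flight(sample_row)
```

With real data, each row from csv.DictReader over the flight list file could be passed to parse_flight in the same way.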

Trajectory Data

Flight trajectories, provided as daily .parquet files, amount to approximately 158 GiB and include a 1-second granularity ADS-B position report for each flight. These trajectories cover most flights, though some might be incomplete due to limited ADS-B coverage.

Each trajectory file contains:

Flight Identification: Unique ID (flight_id), ICAO 24-bit address (icao24)
4D Position: Longitude, latitude, altitude, and timestamp
Speed: Ground speed (groundspeed), track angle (track, track_unwrapped), vertical rate of climb/descent (vertical_rate)
Meteorological Info (optional):
- Wind (u_component_of_wind, v_component_of_wind) [m/s]
- Temperature [Kelvin]

Files are named in the format <yyyy-mm-dd>.parquet and contain all position reports for that date in UTC.
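The naming scheme can be sketched as a small helper that lists the expected daily file names. This assumes one file per calendar day of 2022; the function names are illustrative and not part of the repository.

```python
from datetime import date, timedelta


def trajectory_file_name(day: date) -> str:
    """Return the daily trajectory file name for a UTC date."""
    return f"{day.isoformat()}.parquet"


def trajectory_files(first: date, last: date) -> list:
    """List the file names for every day from first to last, inclusive."""
    n_days = (last - first).days
    return [trajectory_file_name(first + timedelta(days=i)) for i in range(n_days + 1)]


# Assumption: the dataset has one file for each day of 2022.
files_2022 = trajectory_files(date(2022, 1, 1), date(2022, 12, 31))
```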

Getting Started

This project uses Poetry to manage dependencies and virtual environments. Poetry ensures that your project environment is consistent across different machines and provides an easy way to manage dependencies and package your application.

Prerequisites

Install Poetry: If you don’t have Poetry installed, you can install it with the official installer script:
```
curl -sSL https://install.python-poetry.org | python3 -
```
Verify Poetry Installation: Run the following command to ensure Poetry is correctly installed:
```
poetry --version
```

Installation and Setup

Clone the repository:

git clone https://github.com/your-username/prc-data-challenge.git
cd prc-data-challenge

Install Dependencies: Poetry will automatically create a virtual environment and install the required dependencies specified in the pyproject.toml file.
```
poetry install
```
Activate the Virtual Environment: Poetry manages virtual environments automatically. You can activate it using the following command (in Poetry 2.x, the shell command is provided by a separate plugin):
```
poetry shell
```

Adding New Dependencies

To add new dependencies, use:

poetry add <package-name>

This will automatically update your pyproject.toml and lock the package version in poetry.lock.

Running the Project

Once your virtual environment is activated, you can run your project with any of your custom commands or scripts:

poetry shell
python <your_script.py>

or directly, without activating the shell:

poetry run python <your_script.py>

Managing Dependencies

Poetry manages dependency versions and ensures your project remains consistent. To update dependencies:

poetry update

Dataset Access

Challenge participants were granted access to the dataset through the MinIO Client. The dataset files are hosted on OSN infrastructure. Upon registering your team, you should have received:

team name and ID
BUCKET_ACCESS_KEY and BUCKET_ACCESS_SECRET.

Additional datasets

Two additional datasets were used in this challenge:

The Global Airport Database (GADB)
CADO airplane database

The Global Airport Database

Description: The Global Airport Database (GADB) is a free, downloadable database of 9,300 airports, large and small, from around the world. The database is presented in a simple token-delimited format and provides detailed information about each listed airport, including:

ICAO code
IATA code
Name
Country
City
Latitude-Longitude position
Altitude

License: MIT License

CADO airplane database

Description: This database contains data on nearly 230 airplanes. Each airplane is described by 31 parameters, such as name, IATA code, category (general, commuter, regional, short-medium, long range), geometry, mass, maximum speed, typical cruise Mach number, typical range, typical approach speed, take-off field length, landing field length, number of engines, engine type, typical engine model, bypass ratio, and maximum thrust or maximum power.

Contribution: Kambiri, Y.A. et al. (2024) ‘Energy consumption of Aircraft with new propulsion systems and storage media’, in. AIAA SCITECH 2024 Forum, American Institute of Aeronautics and Astronautics. Available at: https://doi.org/10.2514/6.2024-1707.

License: ODbL 1.0 license

Model

The model used in this challenge is XGBoost.

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It suits ATOW prediction because it can capture complex, non-linear relationships across diverse features. It handles both categorical and continuous variables, which is useful for combining static factors such as aircraft type with dynamic ones such as weather and flight parameters. In addition, XGBoost supports missing values, which helps when dealing with noisy and incomplete trajectory data.

Run the experiments

We provide a script that performs data preparation, feature engineering, and XGBoost-based regression modeling to predict the takeoff weight (TOW) of flights, using Optuna for hyperparameter optimization. The final model's predictions are saved in a CSV file, and feature importance is visualized for analysis.

Configuration File

The script reads a configuration file located at configs/credentials.json, which must contain:

{
    "team_name": "your_team_name",
    "code": "your_code",
    "output_folder": "your_output_folder"
}
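A minimal sketch of how such a configuration file could be loaded and validated is shown below. The function name load_credentials is hypothetical; the repository's scripts may read the file differently.

```python
import json
from pathlib import Path

# Keys listed in the configuration example above.
REQUIRED_KEYS = ("team_name", "code", "output_folder")


def load_credentials(path="configs/credentials.json") -> dict:
    """Read the JSON configuration and fail early if a key is missing."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Missing keys in {path}: {', '.join(missing)}")
    return config


# Usage (requires the file to exist):
# config = load_credentials()
# print(config["output_folder"])
```

Failing early on a missing key avoids running the full feature extraction only to discover at the end that the output folder was never configured.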

For feature engineering, we tried many different approaches. In general, we distinguish between two:

General feature extraction: This is the simpler, more straightforward method for extracting features from trajectory data. In this approach, we analyze each signal in the dataset and extract basic statistics, including the mean, maximum, and standard deviation.
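The idea of per-signal summary statistics can be illustrated with the standard library alone. This is a sketch under assumptions, not the repository's implementation: the signal list, the handling of missing samples (None values are dropped), and the use of the population standard deviation are all choices made for this example.

```python
from statistics import fmean, pstdev

# Illustrative subset of trajectory signals.
SIGNALS = ("altitude", "groundspeed", "vertical_rate")


def summarize_signal(values):
    """Return mean, max and standard deviation, ignoring missing samples."""
    clean = [v for v in values if v is not None]
    if not clean:
        return {"mean": None, "max": None, "std": None}
    return {"mean": fmean(clean), "max": max(clean), "std": pstdev(clean)}


def extract_general_features(trajectory: dict) -> dict:
    """Flatten the statistics of each signal into one feature row."""
    features = {}
    for name in SIGNALS:
        for stat, value in summarize_signal(trajectory.get(name, [])).items():
            features[f"{name}_{stat}"] = value
    return features


# Invented sample trajectory with one missing altitude report.
trajectory = {
    "altitude": [0, 1000, 2000, None, 3000],
    "groundspeed": [140, 160, 180, 200],
    "vertical_rate": [0, 2000, 2000, 1500],
}
features = extract_general_features(trajectory)
```

Each flight then contributes a single row of features, which can be joined to the flight list on flight_id.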

Usage

Run the script using:
```
python feature_extractor/general_feature_extractor.py
```
Climb & takeoff segmentation: This method focuses on extracting statistics from the takeoff and climb phases. Many papers in the literature report that the Take-Off Weight (TOW) is strongly related to the vertical rate and speed of the aircraft during the early stages of flight. We therefore segment this phase with a handcrafted method that accounts for the various types of noise that may occur in the data, as well as occasional missing chunks in some trajectories. The pipeline uses Polars, which is optimized for speed and memory efficiency, particularly with large datasets. Polars supports lazy evaluation and parallelized operations, allowing faster data manipulation and transformation.

Usage

Run the script using:
```
python feature_extractor/feature_extractor_climb_takeoff.py
```
Note: Some parameters in this script were set intuitively and may need to change depending on the dataset:
- vertical_rate_threshold: Threshold for vertical rate
- min_duration_threshold_minutes: Minimum takeoff duration in minutes
We chose the current values after an extensive analysis of the trajectory data.
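To show how the two parameters could interact, here is a simplified sketch of a climb detector. It is not the handcrafted method used in the repository: the default values, the max_gap_seconds parameter, and the rule "first sustained run above the threshold" are assumptions made for illustration.

```python
def find_climb_segment(samples, vertical_rate_threshold=500.0,
                       min_duration_threshold_minutes=2.0, max_gap_seconds=30.0):
    """Return (start, end) timestamps of the first sustained climb, or None.

    samples: iterable of (timestamp_seconds, vertical_rate) sorted by time.
    A run ends when the vertical rate is missing or below the threshold,
    or when consecutive climbing reports are further apart than max_gap_seconds.
    """
    min_duration = min_duration_threshold_minutes * 60.0
    start = previous = None
    for timestamp, vertical_rate in samples:
        climbing = vertical_rate is not None and vertical_rate >= vertical_rate_threshold
        gap = previous is not None and timestamp - previous > max_gap_seconds
        if start is not None and (not climbing or gap):
            # The current run is over: keep it only if it lasted long enough.
            if previous - start >= min_duration:
                return start, previous
            start = None
        if climbing and start is None:
            start = timestamp
        previous = timestamp if climbing else None
    if start is not None and previous is not None and previous - start >= min_duration:
        return start, previous
    return None


# Invented 1 Hz trajectory: a short spike, then a sustained climb.
samples = [(t, 0.0) for t in range(0, 60)]
samples += [(t, 1500.0) for t in range(60, 90)]     # too short, ignored
samples += [(t, 0.0) for t in range(90, 100)]
samples += [(t, 2000.0) for t in range(100, 280)]   # sustained climb
samples += [(t, 0.0) for t in range(280, 300)]
segment = find_climb_segment(samples)
```

Raising min_duration_threshold_minutes makes the detector more robust to noisy spikes but risks rejecting flights whose climb is interrupted by missing data.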

Climb and takeoff feature overview

Here’s a concise description of each feature:
- Altitude: The height of the aircraft above sea level, which provides insights into the flight’s elevation profile.
- Groundspeed: The aircraft's speed relative to the ground, reflecting its actual travel speed over the Earth’s surface.
- Vertical Rate: The rate at which the aircraft changes altitude, indicating climb or descent behaviors during the flight.
- True Air Speed (TAS): The speed of the aircraft relative to the surrounding air, accounting for atmospheric conditions.
- Groundspeed Difference: The variation in groundspeed across the flight, highlighting speed changes and potential adjustments in flight pace.
- Track Deviation: The difference between the aircraft’s planned path and actual path, showing directional stability or adjustments made during flight.
- Track Variance: A measure of consistency in the aircraft’s directional changes, which provides an overview of path stability.
- Takeoff Duration: The total duration of the takeoff phase, calculated as the time difference between the earliest and latest timestamps.
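As an illustration of the True Air Speed feature, one common way to estimate horizontal TAS is to subtract the wind vector from the ground velocity vector. The sketch below assumes groundspeed has already been converted to m/s, that track is measured clockwise from north, and that u and v are the eastward and northward wind components; it ignores the vertical component and may differ from the computation used in the repository.

```python
import math

KNOT_TO_MS = 0.514444  # conversion factor if groundspeed is given in knots


def true_airspeed(groundspeed_ms, track_deg, u_wind_ms, v_wind_ms):
    """Estimate horizontal true airspeed in m/s from ground velocity and wind."""
    track = math.radians(track_deg)
    # Ground velocity split into east and north components.
    ground_east = groundspeed_ms * math.sin(track)
    ground_north = groundspeed_ms * math.cos(track)
    # Air velocity = ground velocity - wind velocity.
    return math.hypot(ground_east - u_wind_ms, ground_north - v_wind_ms)


# Flying east with a 10 m/s tailwind: the air-relative speed is lower.
tas_tailwind = true_airspeed(100.0, 90.0, 10.0, 0.0)
# Flying north into a 20 m/s headwind: the air-relative speed is higher.
tas_headwind = true_airspeed(100.0, 0.0, 0.0, -20.0)
```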

Train description

Main function for model training and tuning:

Defines an Optuna objective function for optimizing model hyperparameters.
Trains an XGBoost model using the best-found parameters.
Evaluates the model and calculates RMSE.
Generates feature importance plots.
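For reference, the RMSE used to evaluate the model can be written in a few lines. This is a plain-Python sketch of the metric itself; the values below are invented, and the training script may rely on a library implementation instead.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented TOW values in kg.
error = rmse([70000.0, 60000.0], [70300.0, 59600.0])
```

Because errors are squared before averaging, a few large misses on heavy aircraft can dominate the score, which is one motivation for the per-type approaches described in this section.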

Run the script using:

python module/xgboost_model.py

This method can work very well on a balanced dataset. However, the challenge_set showed an unbalanced representation of aircraft_type values, so we propose a method where, instead of predicting the TOW directly, we predict the difference (mean(TOW@ChallengeSet) - TOW). This method considerably boosted our performance on the final submission_set.

Before running the training script, some inputs must be generated with this script:
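The target transformation can be sketched as follows. The README does not state whether the mean is taken over the whole challenge set or per aircraft type; this example assumes a per-aircraft_type mean, and all function names and sample values are illustrative.

```python
from collections import defaultdict
from statistics import fmean


def mean_tow_by_type(flights):
    """Compute the mean TOW of each aircraft type in the training flights."""
    grouped = defaultdict(list)
    for flight in flights:
        grouped[flight["aircraft_type"]].append(flight["tow"])
    return {aircraft_type: fmean(tows) for aircraft_type, tows in grouped.items()}


def to_target(flight, type_means):
    """Training target: mean TOW of the type minus the flight's TOW."""
    return type_means[flight["aircraft_type"]] - flight["tow"]


def to_tow(aircraft_type, predicted_target, type_means):
    """Invert the transformation to recover a TOW prediction in kg."""
    return type_means[aircraft_type] - predicted_target


# Invented training flights.
flights = [
    {"aircraft_type": "A320", "tow": 60000.0},
    {"aircraft_type": "A320", "tow": 64000.0},
    {"aircraft_type": "B77W", "tow": 300000.0},
]
type_means = mean_tow_by_type(flights)
targets = [to_target(flight, type_means) for flight in flights]
recovered = to_tow("A320", targets[0], type_means)
```

The model is trained on the targets, and its outputs are passed through to_tow to obtain the final predictions.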

python module/pre_processing.py

Then run the final script using:

python module/xgboost_mean_diff.py

In addition, because of the class imbalance (related to aircraft type) in the dataset, we thought of another approach that can boost the performance of the model: training a separate model for each group of aircraft_types. Try this approach using:

python module/xgboost_model_categories.py

Notes:

You can redefine the sub-categories you want to use depending on your objectives.
This method uses the xgboost_mean_diff approach to compute TOW.

Our process also produces a feature importance plot showing the top 15 features.
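The per-group idea can be sketched with a stand-in model. The grouping below is hypothetical, and MeanModel is a deliberately trivial placeholder for a real regressor such as XGBoost; for brevity it predicts TOW directly rather than the mean difference.

```python
from statistics import fmean

# Hypothetical mapping from aircraft type to sub-category.
AIRCRAFT_GROUPS = {"A320": "narrow_body", "B738": "narrow_body", "B77W": "wide_body"}


class MeanModel:
    """Stand-in regressor that always predicts the mean of its training targets."""

    def fit(self, targets):
        self.mean_ = fmean(targets)
        return self

    def predict(self):
        return self.mean_


def group_of(aircraft_type):
    """Route an aircraft type to its sub-category, with a catch-all group."""
    return AIRCRAFT_GROUPS.get(aircraft_type, "other")


def fit_group_models(flights):
    """Train one model per sub-category of aircraft types."""
    targets = {}
    for flight in flights:
        targets.setdefault(group_of(flight["aircraft_type"]), []).append(flight["tow"])
    return {group: MeanModel().fit(values) for group, values in targets.items()}


def predict_tow(aircraft_type, models):
    """Predict with the model trained for the aircraft's sub-category."""
    return models[group_of(aircraft_type)].predict()


# Invented training flights.
flights = [
    {"aircraft_type": "A320", "tow": 60000.0},
    {"aircraft_type": "B738", "tow": 66000.0},
    {"aircraft_type": "B77W", "tow": 300000.0},
]
models = fit_group_models(flights)
```

Coarser groups give each model more training data, while finer groups let each model specialize, so the mapping is worth tuning.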

Model Submission

Submit your models for evaluation through the challenge submission platform. Models will be evaluated based on their ability to accurately predict the Actual TakeOff Weight (ATOW) for the flights in the provided dataset. Intermediate rankings will be done using submission_set.csv.

License

This project is licensed under the GNU General Public License v3.0. A copy of the license is available in the LICENSE file of the repository.
