🏎️  F1 Insights: Real-Time Replay & Historical Analytics 🏁

📖 Project Overview

Click the image below or this link for a video overview 👇 ▶️

Project overview explanation link: Click Here

Click the image below or this link for a dashboard overview 👇 ▶️

Dashboards explanation link: Click Here

A real-time Formula 1 telemetry and historical data analysis system that captures, processes, and visualizes car data during race sessions, enabling both live replay and historical analysis.

The project takes a two-fold analysis approach:

  • Real-time race replay - to replay how the race panned out and dive deep into what changed, how it changed, and what led to the outcome. (Works during the race as well :) )
  • Historical analysis - to answer important post-race questions about how weather impacts the outcome, tyre choices, race strategies, and pit strategies

The system serves two primary analytical purposes:

Real-Time Race Replay

Realtime

Using live telemetry data, the system reconstructs race sessions in real-time, allowing viewers to experience the race as it unfolds. This feature provides deep insights into crucial moments by visualizing:

  • Driver performance through speed, throttle, and brake data
  • Strategic DRS (Drag Reduction System) usage
  • Gear selection patterns across different track sections
  • Dynamic position changes and overtaking maneuvers

Historical Analysis & Strategic Insights

Beyond real-time replay, the system enables comprehensive post-race analysis to understand the intricate factors that influence race outcomes:

  • Weather impact on tire degradation and pit stop timing
  • Effectiveness of different pit strategies across varying track conditions
  • Analysis of tire compound choices and their influence on race pace

By combining real-time telemetry with historical data, this platform serves as both an engaging visualization tool and a powerful analytical resource for understanding the technical and strategic elements that define Formula 1 racing.

Motivation

Formula 1 racing represents the pinnacle of motorsport technology, where split-second decisions and minute performance differences determine victory. This project presents a comprehensive Formula 1 telemetry analysis system that brings together real-time data streaming and historical insights to unlock the complexities of race strategy and performance.

This has been a passion project of mine for a long time, and I hope to keep improving and deploying it down the line for anyone to use to make sense of the factors that led to a race outcome.

Project Scope

  • The project comprises three major sections:
    • Real-time streaming pipeline using Confluent Kafka, SingleStore, and an AWS EC2 deployment
    • Historical trend analysis using dbt, Snowflake, and Airflow (WAP pattern, maintaining idempotency)
    • Grafana (interactive visualizations and deep dives)

Dataset Choices

  • I came across multiple data sources that could help:

    | Data source | Type | Cost | Longevity | Support and community | Advantages |
    |-------------|------|------|-----------|-----------------------|------------|
    | Fast-F1 | Python / pandas-based | Free | Didn't offer a robust API | 2.5k stars on GitHub | Implements caching for all API requests to speed up your scripts |
    | Ergast | REST API | Free | Will be deprecated (end of 2024) | Been around for a while | Quite reliable with the data |
    | OpenF1 | Open-source API | Free | Quite active (as of today) | 368 stars | |
    | SportMonks | API | Paid (65 euros/month) | Reliable | Well maintained | Accurate and can be banked on |

    My choice was OpenF1, for the following reasons:

    • The API is sourced directly from F1 LiveTiming, so in the future, once I subscribe, the JSON data will look similar and I can source it directly from the official site
    • Offers in-depth data across different endpoints: essentially one API with multiple endpoints that return different kinds of data (high-frequency telemetry and low-frequency generic info)
    • Has structure and relationships between endpoints, which let me correlate the effect of one on another
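
For example, a minimal sketch of pulling two related endpoints and correlating them on session_key (the endpoints are from the public OpenF1 docs; the specific filter values are illustrative):

    import requests

    BASE = "https://api.openf1.org/v1"

    # find a 2024 race session, then pull data keyed on its session_key
    sessions = requests.get(f"{BASE}/sessions",
                            params={"year": 2024, "session_name": "Race"},
                            timeout=30).json()
    session_key = sessions[-1]["session_key"]

    # high-frequency telemetry for one driver, low-frequency weather for the session
    car = requests.get(f"{BASE}/car_data",
                       params={"session_key": session_key, "driver_number": 1},
                       timeout=30).json()
    weather = requests.get(f"{BASE}/weather",
                           params={"session_key": session_key},
                           timeout=30).json()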

High level Architecture Diagram

High Level Architecture

Technology choices:

Real-time Streaming Pipeline

The real-time component leverages Confluent Kafka hosted on AWS EC2, paired with SingleStore as our operational database. This architecture was chosen for several compelling reasons:

  • Confluent Kafka serves as our message broker because it excels at handling high-throughput, real-time data streams with minimal latency.

    • Formula 1 telemetry generates thousands of data points per second across multiple cars, especially for metrics like speed and throttle, and Kafka's publish-subscribe model perfectly suits this use case.

    • The platform's robust partitioning and fault tolerance ensure we never miss critical race data. My producer writes to four topics, which are consumed by SingleStore pipelines.

      Streaming
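
A minimal sketch of the producer side, assuming the confluent-kafka client (the broker address and topic name are placeholders):

    import json
    from confluent_kafka import Producer

    # SASL credentials for Confluent Cloud omitted for brevity
    producer = Producer({"bootstrap.servers": "<broker>:9092"})

    def publish(topic: str, rows: list[dict]) -> None:
        for row in rows:
            # key by driver so each driver's telemetry stays ordered within a partition
            producer.produce(topic,
                             key=str(row["driver_number"]),
                             value=json.dumps(row))
        producer.flush()

    # e.g. publish a batch of car telemetry to the car_data topic
    publish("car_data", [{"driver_number": 1, "speed": 312, "throttle": 100}])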

    SingleStore

    SingleStore was selected as our operational database due to its unique ability to handle both real-time data ingestion and analytical queries simultaneously. Its columnar storage format and vector processing capabilities make it ideal for processing time-series telemetry data while maintaining sub-second query response times for our dashboards. It serves as both an OLTP and an OLAP database in one, performing high-fidelity data ingests and analysis over millions of rows with millisecond latency.
    Process Flow:

    [Image: process flow]

    [Image: SingleStore pipeline]
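
A minimal sketch of one such pipeline using SingleStore's LOAD DATA KAFKA syntax (broker, topic, table, and column mapping are illustrative; Confluent Cloud credentials would go in CONFIG/CREDENTIALS clauses):

    -- map each JSON field from the car_data topic onto the car_data table
    CREATE PIPELINE car_data_pipeline AS
    LOAD DATA KAFKA '<broker>:9092/car_data'
    INTO TABLE car_data
    FORMAT JSON
    (date <- date, driver_number <- driver_number, session_key <- session_key,
     speed <- speed, rpm <- rpm, throttle <- throttle, brake <- brake,
     n_gear <- n_gear, drs <- drs);

    START PIPELINE car_data_pipeline;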

AWS EC2 ( t3.micro )

  • It has a Flask listener that does the following:
    • Listens on port 5000 at the /run-and-redirect URL (triggered from the Grafana Cloud dashboard) → calls the Python producer script with the necessary parameters → redirects to my Grafana EC2 deployment

             
    from flask import Flask, request, redirect
    import logging
    import subprocess

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    app = Flask(__name__)

    @app.route('/run-and-redirect', methods=['GET'])
    def run_script():
        # replay window, session, and driver arrive as query parameters from
        # the Grafana Cloud dashboard button (parameter names assumed from usage below)
        start_time = request.args.get('start_time')
        end_time = request.args.get('end_time')
        session = request.args.get('session')
        driver = request.args.get('driver')

        cmd = [
            'python3',
            '/home/ubuntu/kafka_producer_v6_cloud.py',
            '--param1', start_time,
            '--param2', end_time,
            '--session', session,
            '--data-types', 'car',
            '--driver', driver
        ]

        logger.info(f"Running command: {' '.join(cmd)}")
        # fire-and-forget: start the producer without blocking the redirect
        subprocess.Popen(cmd)

        redirect_url = (
            f'http://<ip_address>:3000/d/f1-telemetry/f1-real-time-telemetry'
            f'?orgId=1&from={start_time.replace("T", " ")}&to={end_time.replace("T", " ")}'
            f'&timezone=browser&var-driver={driver}&var-session={session}&refresh=1s'
        )
        return redirect(redirect_url)

    # ...

Historical Trend Analysis

For deeper historical analysis, we implemented a robust data warehouse solution using:

  • DBT (Data Build Tool) manages our data transformations, enabling us to create clean, tested, and documented data models. Its version control and modularity allow us to maintain complex racing analytics while ensuring data quality through built-in testing.

  • Snowflake serves as our data warehouse, chosen for its separation of storage and compute resources. This architecture is particularly valuable for F1 analysis, where we might need to process years of historical race data while simultaneously handling real-time queries.

  • Apache Airflow orchestrates our ETL workflows using the Write Audit Publish (WAP) pattern to maintain idempotency. This ensures our historical analysis remains accurate and reproducible, even when processing data from multiple racing seasons.
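
A minimal sketch of the WAP flow as an Airflow DAG (task names and bodies are illustrative assumptions, not the project's actual DAG):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def write_to_staging():
        # Write: load the new batch into a staging table only, never production
        pass

    def audit_staging():
        # Audit: run row-count / uniqueness / range checks; raising here halts the publish
        pass

    def publish():
        # Publish: MERGE staging into production on the merge keys, then drop staging;
        # re-running the same window converges to the same state (idempotency)
        pass

    with DAG("f1_wap", start_date=datetime(2024, 1, 1),
             schedule="@daily", catchup=False) as dag:
        write = PythonOperator(task_id="write", python_callable=write_to_staging)
        audit = PythonOperator(task_id="audit", python_callable=audit_staging)
        pub = PythonOperator(task_id="publish", python_callable=publish)
        write >> audit >> pub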

Interactive Visualization

Grafana ties our entire solution together through interactive dashboards that serve both real-time and historical analysis needs. We chose Grafana for its:

  • Ability to handle real-time data streams with refresh rates as low as 1s
  • Support for complex time-series visualizations essential for race telemetry
  • Flexible query builders that work seamlessly with both SingleStore and Snowflake

API responses converted into Tables (ERD)

erDiagram
    MEETINGS ||--o{ SESSIONS : contains
    MEETINGS ||--o{ DRIVERS : participates-in
    MEETINGS ||--o{ RACE_CONTROL : records
    MEETINGS ||--o{ WEATHER : tracks
    
    SESSIONS ||--o{ LAPS : includes
    SESSIONS ||--o{ CAR_DATA : records
    SESSIONS ||--o{ INTERVALS : tracks
    SESSIONS ||--o{ POSITION : monitors
    SESSIONS ||--o{ PIT : records
    SESSIONS ||--o{ STINTS : tracks
    SESSIONS ||--o{ TEAM_RADIO : captures
    SESSIONS ||--o{ LOCATION : tracks

    DRIVERS {
        int driver_number PK
        string first_name
        string last_name
        string full_name
        string country_code
        string team_name
        string team_colour
    }

    MEETINGS {
        int meeting_key PK
        int year
        string country_name
        string circuit_short_name
        string location
        datetime date_start
    }

    SESSIONS {
        int session_key PK
        int meeting_key FK
        string session_name
        string session_type
        datetime date_start
        datetime date_end
    }

    CAR_DATA {
        int driver_number FK
        int session_key FK
        datetime date
        int speed
        int rpm
        int throttle
        int brake
        int n_gear
        int drs
    }

    LAPS {
        int driver_number FK
        int session_key FK
        int lap_number
        float lap_duration
        float duration_sector_1
        float duration_sector_2
        float duration_sector_3
        int st_speed
    }

    INTERVALS {
        int driver_number FK
        int session_key FK
        datetime date
        float gap_to_leader
        float interval
    }

    POSITION {
        int driver_number FK
        int session_key FK
        datetime date
        int position
    }

    PIT {
        int driver_number FK
        int session_key FK
        datetime date
        int lap_number
        float pit_duration
    }

    RACE_CONTROL {
        int driver_number FK
        int session_key FK
        datetime date
        string category
        string flag
        string message
        string scope
    }

    STINTS {
        int driver_number FK
        int session_key FK
        string compound
        int lap_start
        int lap_end
        int tyre_age_at_start
    }

    TEAM_RADIO {
        int driver_number FK
        int session_key FK
        datetime date
        string recording_url
    }

    LOCATION {
        int driver_number FK
        int session_key FK
        datetime date
        int x
        int y
        int z
    }

    WEATHER {
        int session_key FK
        datetime date
        float air_temperature
        float track_temperature
        int humidity
        float wind_speed
        int wind_direction
    }

The pipeline has the following stages of data architecture:

Data Ingestion

Ingestion

I made sure these five properties are always maintained during ingestion:

Idempotency (merge keys)

  • No two non-unique records are added to the table, ensuring upserts only

  • Created a dynamic merge statement, since the structure of the statement remains the same across tables (see the sketch after the image below)

    [Image: idempotent merge statement]
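
  A minimal sketch of generating such a merge dynamically (table and column names are illustrative):

    def build_merge_sql(table: str, merge_keys: list[str], columns: list[str]) -> str:
        # the statement shape is identical for every table; only the names change
        on_clause = " AND ".join(f"t.{k} = s.{k}" for k in merge_keys)
        set_clause = ", ".join(f"t.{c} = s.{c}" for c in columns if c not in merge_keys)
        col_list = ", ".join(columns)
        src_list = ", ".join(f"s.{c}" for c in columns)
        return (f"MERGE INTO {table} AS t USING {table}_staging AS s "
                f"ON {on_clause} "
                f"WHEN MATCHED THEN UPDATE SET {set_clause} "
                f"WHEN NOT MATCHED THEN INSERT ({col_list}) VALUES ({src_list})")

    # e.g. keep (driver_number, session_key, date) unique in car_data
    sql = build_merge_sql("car_data",
                          ["driver_number", "session_key", "date"],
                          ["driver_number", "session_key", "date", "speed", "rpm"])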

API call retries and fallbacks

  • APIs are sometimes unreliable, hence the retry feature, to make sure no data is left behind during ingest

    [Image: API retries]

  • Handling high-frequency data

    • For high-frequency telemetry data, I broke my API calls into 5-minute chunks
    • If the API still produces too much data to handle, the call falls back to half the window size, then resumes the normal 5-minute windows for the next call (see the sketch after the image below)

    [Image: API retries]
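
  A minimal sketch combining both ideas, retries with backoff plus window halving (endpoint parameters are illustrative):

    import time
    from datetime import timedelta
    import requests

    def fetch(url: str, params: dict, retries: int = 3) -> list:
        # retry transient failures with exponential backoff so no chunk is dropped
        for attempt in range(retries):
            try:
                resp = requests.get(url, params=params, timeout=30)
                resp.raise_for_status()
                return resp.json()
            except requests.RequestException:
                if attempt == retries - 1:
                    raise
                time.sleep(2 ** attempt)

    def fetch_chunked(url, base_params, start, end):
        full = timedelta(minutes=5)
        rows, cur, window = [], start, full
        while cur < end:
            hi = min(cur + window, end)
            params = {**base_params, "date>": cur.isoformat(), "date<": hi.isoformat()}
            try:
                rows += fetch(url, params)
                cur, window = hi, full   # success: resume normal 5-minute windows
            except requests.RequestException:
                if window < timedelta(seconds=30):
                    raise                # give up rather than loop forever
                window = window / 2      # too much data: halve the window and retry
        return rows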

Always ingest into a temp staging table

  • Made sure the production table is never impacted by ingestion issues (always ingesting into a staging table and dropping it immediately after a successful merge)

    ingest

Compute resources ( save costs)

  • There is no race every single day, so I made sure I don't exhaust compute resources on non-race days (see the sketch after the image below)

    compute
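
  One way to implement such a guard, sketched below, is to check the OpenF1 sessions endpoint for today's calendar before any warehouse work kicks off (the guard script is an illustrative assumption, not necessarily the project's actual mechanism):

    from datetime import date, timedelta
    import requests

    def sessions_today() -> list:
        today = date.today()
        resp = requests.get("https://api.openf1.org/v1/sessions",
                            params={"date_start>": today.isoformat(),
                                    "date_start<": (today + timedelta(days=1)).isoformat()},
                            timeout=30)
        resp.raise_for_status()
        return resp.json()

    if not sessions_today():
        # no session today: skip ingestion so the warehouse compute stays suspended
        raise SystemExit(0)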

Data Transformation

These are all of my data transformations:

  • Models prefixed with "stg_" are staging models.

  • Models prefixed with "f1_" are data marts, used in the final dashboards.

  • Models prefixed with "int_" are intermediate models, reused heavily across multiple marts.

    data_transformations

My transformations at a high level include:

Staging models (pulled from tested source tables):

dbt_project/models/staging

  • Changing datatypes and names of my columns:

    • driver_number::VARCHAR(50) as driver_number,
    • TO_TIMESTAMP_NTZ(date) as timestamp,
    • NULLIF(gap_to_leader, 'None') as gap_to_leader

  • Adding columns for time ranges:

    • DATE_TRUNC('second', date_start) AS date_start,
    • DATEADD(second, lap_duration, DATE_TRUNC('second', date_start)) AS date_end,
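
  Put together, a staging model along these lines (a sketch; source and column names are illustrative):

    -- models/staging/stg_laps.sql
    select
        driver_number::VARCHAR(50)                                       as driver_number,
        session_key,
        lap_number,
        lap_duration,
        DATE_TRUNC('second', date_start)                                 as date_start,
        DATEADD(second, lap_duration, DATE_TRUNC('second', date_start)) as date_end
    from {{ source('f1_raw', 'laps') }}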

    Intermediate models (pulled from tested staging models and other int_ models only); a few examples below:

    dbt_project/models/intermediate

    • Weather (computing averages and categorizing conditions for my dashboards)

      Intermediate_1

    • Computing lap-wise start and end positions

      Intermediate_2

      Marts (dbt_project/models/marts)

      Used directly in dashboards

      • The source only records position changes, so I had to fill in positions for the laps with no change, to understand how positions evolved (see the sketch after this list)

        marts_1

      • A mart for dashboard filters as well

        marts_2
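
      A minimal sketch of that gap-fill using Snowflake's LAST_VALUE ... IGNORE NULLS window function to carry the last known position forward (model names are illustrative):

    select
        l.driver_number,
        l.session_key,
        l.lap_number,
        -- forward-fill: take the most recent non-null position up to this lap
        LAST_VALUE(p.position) IGNORE NULLS over (
            partition by l.driver_number, l.session_key
            order by l.lap_number
            rows between unbounded preceding and current row
        ) as position
    from {{ ref('stg_laps') }} l
    left join {{ ref('int_lap_positions') }} p
      on  p.driver_number = l.driver_number
      and p.session_key   = l.session_key
      and p.lap_number    = l.lap_number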

Data Validation

validation_1

  • Added generic tests:

    {% test range_check(model, column_name, min_value, max_value) %}
    
    with validation as (
        select
            {{ column_name }} as field_value
        from {{ model }}
        where {{ column_name }} is not null  -- Only test non-null values
    ),
    
    validation_errors as (
        select
            field_value
        from validation
        where field_value < {{ min_value }} or field_value > {{ max_value }}
    )
    
    select *
    from validation_errors
    
    {% endtest %}
  • More specific tests as well

    -- Add this to macros/f1_car_data_tests.sql
    
    {% test brake_throttle_combination(model) %}
    with high_values as (
        select
            date,
            brake,
            throttle,
            speed,
            rpm
        from {{ model }}
        where brake > 105  -- Allowing for some sensor overflow
          and throttle > 105  -- Allowing for some sensor overflow
          -- Only flagging when both brake and throttle are extremely high
          -- and when we're not at very low speeds (could be valid during starts/pit exits)
          and speed > 50
          -- Excluding engine warmup/testing scenarios
          and rpm > 5000
    ),
    
    -- Look for sustained periods of extreme values
    sustained_issues as (
        select *,
            lead(date) over (order by date) as next_reading,
            lag(date) over (order by date) as prev_reading
        from high_values
    )
    
    -- Return only the problematic readings that persist for multiple samples
    select
        date,
        brake,
        throttle,
        speed,
        rpm
    from sustained_issues
    where datediff('millisecond', prev_reading, date) < 500  -- Looking for sustained issues
       or datediff('millisecond', date, next_reading) < 500
    {% endtest %}

Data Visualization

Historical Dashboards:

F1 (LAP on LAP analysis)

Screen Recording 2024-12-07 at 8 17 21 PM

  • How do gaps between drivers evolve during the race?
  • How does race pace compare to qualifying pace per sector for a driver?
  • How does sector performance affect position changes?
  • How do pit-out laps affect overall race strategy?

F1 Telemetry Analysis (Historical)

TELEMETRY GIF

  • How do drivers use DRS throughout the race?
  • How do braking patterns differ between drivers?
  • How do speed, throttle, and DRS patterns differ between multiple drivers?

F1 Tyre preference and weather impact

Screen Recording 2024-12-07 at 8 26 14 PM

  • What's the optimal tire life for each compound at different circuits?
  • How do different teams manage their tire strategies?
  • How does track temperature affect tire strategy?
  • How do humidity and wet conditions affect tire performance?

Realtime Dashboard

Realtime

  • Real-time replay of a specific portion of the race
    • A driver race dashboard showing speed, braking, and DRS in real time
    • A real-time view of your favourite driver
    • Works during a live race as well

Challenges and FAQs

Why Grafana on EC2?

  • Grafana Cloud does not natively support a 1s refresh; it can only go as low as 5s, and what's the fun if there is no 1s refresh in Formula 1? Hence I deployed my own Grafana instance on the EC2 host and changed min_refresh_interval in its grafana.ini file
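
  The relevant grafana.ini change (min_refresh_interval lives in the [dashboards] section and defaults to 5s):

    [dashboards]
    min_refresh_interval = 1s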

Why two Grafanas (one deployed on EC2, one on Grafana Cloud)?

  • I wanted to keep the historical data, containing millions of rows, in Grafana Cloud: it offers an easy Snowflake connection, simple sharing, and is fast and reliable, and I can make as many filters and changes as I like since it's just a connection to Snowflake
  • Grafana on EC2 achieves the 1s latency; a button on the Grafana Cloud dashboard redirects to my Grafana on EC2

Why SingleStore pipelines ?

  • Extremely fast ingest
  • Has a free-forever shared tier
  • Millisecond latency, allowing me to perform high-fidelity queries in my Grafana dashboards

How do Snowflake, dbt, and Airflow fit into the workflow?

  • They power all of the trend analysis: tyre considerations, lap-on-lap position changes, driver driving patterns (anything that requires a lot of historical data to be viewed holistically)
