Welcome to the Google Cloud x MLB™ Hackathon!
Calling all Devs: Step up to the plate and knock it out of the park! Use Google Cloud's heavy-hitting data and AI lineup (Gemini, Imagen, Vertex AI... the whole roster!) and real data from Major League Baseball™ to build the future of fan engagement. Showcase your AI skills, craft impactful applications, and revolutionize how baseball fans experience the game.
Ready to hit a grand slam? Build a project using Google Cloud AI that revolutionizes the MLB™ fan experience.
Let's build the future of baseball together!
This repository contains resources and instructions to help you get started with accessing MLB's data services for your hackathon project. Stay tuned to this repo as more datasets become available.
A primary data source for this hackathon will be MLB's GUMBO (Grand Unified Master Baseball Object) data feeds, which are available without authentication.
The GUMBO live data feed returns a standardized JSON response that summarizes the entire state of a selected game on each update. Unlike previous live event data feeds, which delivered incremental updates, every GUMBO object contains the complete game information.
- Complete Game State: No need to maintain game state locally or build upon message sequences - each GUMBO object includes the complete, current dataset for the entire game
- Standard JSON Format: GUMBO follows true JSON standards, unlike previous feed formats
- Flexible Access Methods: Available via:
  - Websocket listener (push updates every 1-2 seconds)
  - Stats API (pull updates every 12 seconds)
- Development-Friendly: Use actual production data for development at any time without requiring scheduled test data delivery
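Because every GUMBO object carries the full game state, a client that pulls from the Stats API only needs to keep the most recent response. The following is a minimal sketch of such a polling loop, using the 12-second pull interval listed above. The names `poll_feed` and `fetch_json`, and the injectable `fetch` and `sleep` parameters, are hypothetical and introduced only for illustration; the standard library's `urllib` is used so the sketch has no extra dependencies.

```python
import json
import time
import urllib.request
from typing import Callable, Iterator

FEED_URL = "https://statsapi.mlb.com/api/v1.1/game/{game_pk}/feed/live"


def fetch_json(url: str) -> dict:
    """Fetch a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def poll_feed(
    game_pk: int,
    interval: float = 12.0,
    max_polls: int = 5,
    fetch: Callable[[str], dict] = fetch_json,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield the full game state each time the feed content changes."""
    url = FEED_URL.format(game_pk=game_pk)
    previous = None
    for i in range(max_polls):
        current = fetch(url)
        if current != previous:
            # Each object is complete, so the caller can replace its old copy
            # outright; no merging of incremental updates is needed.
            yield current
            previous = current
        if i < max_polls - 1:
            sleep(interval)
```

For example, iterating over `poll_feed(716463, max_polls=2)` would request the example game's feed twice, 12 seconds apart, and yield the second response only if it differs from the first.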
Access GUMBO data using these base URLs (replace {game_pk} with the specific game ID):
- Current Game State: https://statsapi.mlb.com/api/v1.1/game/{game_pk}/feed/live
- Specific Game Timestamp: https://statsapi.mlb.com/api/v1.1/game/{game_pk}/feed/live?timecode=yyyymmdd_######
- List of Game Update Timestamps: https://statsapi.mlb.com/api/v1.1/game/{game_pk}/feed/live/timestamps
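The three URL patterns differ only in their path suffix and the optional `timecode` query parameter, so they are easy to generate. A small sketch of helpers that build them follows; the function names are hypothetical, and the timecode is passed through unchanged on the assumption that the caller supplies a value in the `yyyymmdd_######` form shown above.

```python
from typing import Optional
from urllib.parse import urlencode

BASE = "https://statsapi.mlb.com/api/v1.1/game"


def live_feed_url(game_pk: int, timecode: Optional[str] = None) -> str:
    """URL for the current game state, or for a specific timestamp if given."""
    url = f"{BASE}/{game_pk}/feed/live"
    if timecode is not None:
        url += "?" + urlencode({"timecode": timecode})
    return url


def timestamps_url(game_pk: int) -> str:
    """URL listing the update timestamps available for a game."""
    return f"{BASE}/{game_pk}/feed/live/timestamps"
```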
Here are some useful examples of common Stats API queries:
- Get 2024 MLB Regular Season Schedule: https://statsapi.mlb.com/api/v1/schedule?sportId=1&season=2024&gameType=R
  Parameters explained:
  - sportId=1: MLB (1 represents Major League Baseball)
  - season=2024: The season year
  - gameType=R: Regular season games (R = Regular Season, P = Postseason, S = Spring Training)
- Get Los Angeles Dodgers 2024 Roster: https://statsapi.mlb.com/api/v1/teams/119/roster?season=2024
  Parameters explained:
  - 119: Team ID for LA Dodgers
  - season=2024: The season year
- Get Team Information: https://statsapi.mlb.com/api/v1/teams/119
  - Returns detailed information about a specific team (119 = Dodgers)
  - Add ?season=2024 to get team info for a specific season
- Get Player Information: https://statsapi.mlb.com/api/v1/people/660271
  - Returns detailed information about a specific player (660271 = Shohei Ohtani)
  - Add ?season=2024 to get player info for a specific season
- Get Live Game Data: https://statsapi.mlb.com/api/v1.1/game/716463/feed/live
  - Returns live GUMBO feed for a specific game
  - Game PKs can be obtained from the schedule endpoint
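Since game PKs come from the schedule endpoint, a common first step is to collect them from a schedule response. The sketch below assumes the response groups games under a top-level `dates` list, where each entry has a `games` list whose items carry a `gamePk` field. These field names are assumptions and should be checked against a real response or the StatsAPI documentation; `extract_game_pks` is a hypothetical helper name.

```python
def extract_game_pks(schedule: dict) -> list:
    """Collect game PKs from a parsed schedule response.

    Assumed layout (verify against a real response):
    {"dates": [{"games": [{"gamePk": 123, ...}, ...]}, ...]}
    """
    pks = []
    for day in schedule.get("dates", []):
        for game in day.get("games", []):
            if "gamePk" in game:
                pks.append(game["gamePk"])
    return pks
```

Each returned PK can then be substituted for {game_pk} in the GUMBO feed URLs.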
Common Query Parameters:
- hydrate: Add additional data to the response (e.g., ?hydrate=stats,team)
- fields: Limit the response to specific fields
- season: Specify a season year
- date: Specify a specific date (format: MM/DD/YYYY)
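These parameters can be combined in a single query string. The sketch below shows one way to assemble such a URL with the standard library; `build_url` is a hypothetical helper, it joins list values with commas on the assumption that multi-valued parameters such as `hydrate` are comma-separated as in the example above, and it does not check whether a given parameter is valid for a given endpoint.

```python
from urllib.parse import urlencode

API_ROOT = "https://statsapi.mlb.com/api/v1"


def build_url(endpoint: str, **params) -> str:
    """Build a Stats API URL from an endpoint path and query parameters."""
    query = {}
    for key, value in params.items():
        if value is None:
            continue  # skip parameters the caller left unset
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        query[key] = value
    if not query:
        return f"{API_ROOT}/{endpoint}"
    # safe="/," leaves MM/DD/YYYY dates and comma-separated lists readable
    return f"{API_ROOT}/{endpoint}?" + urlencode(query, safe="/,")
```

For instance, `build_url("schedule", sportId=1, date="04/01/2024")` yields a schedule query for a single day in the MM/DD/YYYY format.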
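Note: All endpoints return JSON data. You can use tools like curl or Python's requests library to fetch the data:
    import requests

    # Example: Get Dodgers roster
    url = "https://statsapi.mlb.com/api/v1/teams/119/roster?season=2024"
    response = requests.get(url, timeout=10)  # avoid hanging on a stalled connection
    response.raise_for_status()  # raise an exception on HTTP error status codes
    data = response.json()  # parsed JSON as a Python dict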
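Once the roster response has been parsed, player names can be pulled out of the resulting dictionary. The sketch below assumes the response contains a top-level `roster` list whose entries each hold a `person` object with `id` and `fullName` fields. These field names are assumptions to verify against an actual response, and `roster_names` is a hypothetical helper name.

```python
def roster_names(data: dict) -> dict:
    """Map player ID to full name from a parsed roster response.

    Assumed layout (verify against a real response):
    {"roster": [{"person": {"id": 1, "fullName": "..."}}, ...]}
    """
    names = {}
    for entry in data.get("roster", []):
        person = entry.get("person", {})
        if "id" in person and "fullName" in person:
            names[person["id"]] = person["fullName"]
    return names
```

The player IDs returned this way can be used with the people endpoint shown earlier.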
- Review the provided documentation in this repository for detailed information about:
  - Available endpoints
  - Data structure
  - Query parameters
  - Response formats
The MLB data feeds provide different levels of historical data granularity depending on the time period:
- 1901-1968: Boxscore level only
- 1969-1988: Play-by-play level
- 1989-2007: Pitch-by-pitch level
- 2008-2014: Pitch-by-pitch with pitch speed/break information (Pitch F/x)
- 2015-Present: Pitch-by-pitch with enhanced metrics:
  - Pitch speed
  - Exit velocity
  - Home Run distance
- 2021: Florida State League (A) added
- 2022: Pacific Coast League (AAA) added
- 2023: International League (AAA) added
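When a project spans several eras, it can help to check up front which level of detail to expect for a season. The sketch below encodes the MLB year ranges listed above as a lookup table; `data_granularity` and the table name are hypothetical, and the league additions for 2021-2023 are not modeled.

```python
# (first season, last season or None for open-ended, level of detail)
GRANULARITY = [
    (1901, 1968, "boxscore"),
    (1969, 1988, "play-by-play"),
    (1989, 2007, "pitch-by-pitch"),
    (2008, 2014, "pitch-by-pitch with pitch speed/break (Pitch F/x)"),
    (2015, None, "pitch-by-pitch with enhanced metrics"),
]


def data_granularity(season: int) -> str:
    """Return the level of historical detail listed for a given season."""
    for first, last, level in GRANULARITY:
        if season >= first and (last is None or season <= last):
            return level
    raise ValueError(f"no data granularity listed for season {season}")
```

A pipeline could use this to skip pitch-level features for seasons before 1989, for example.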
All datasets for the hackathon are available in our public Google Cloud Storage bucket. The datasets are organized into the following categories:
- MLB Caption Data
- MLB Fan Content Interaction Data
- MLB StatsAPI Documentation
- Game Data (including 2024 home runs dataset)
- Direct Browser Access: Access the datasets through the Google Cloud Console: GCP MLB Hackathon 2025 Bucket
- Command Line Access: Using gsutil (part of the Google Cloud SDK):
    # Download all files
    gsutil -m cp -r gs://gcp-mlb-hackathon-2025/* .
    # Download specific dataset
    gsutil cp gs://gcp-mlb-hackathon-2025/datasets/2024-mlb-homeruns.csv .
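Objects in a publicly readable bucket can also be fetched over HTTPS without gsutil, using Cloud Storage's `https://storage.googleapis.com/<bucket>/<object>` URL form. The sketch below converts a `gs://` URI into that form; `gs_to_https` is a hypothetical helper, and it assumes the object is publicly readable, as stated for this bucket.

```python
from urllib.parse import quote


def gs_to_https(gs_uri: str) -> str:
    """Convert gs://bucket/object into a storage.googleapis.com HTTPS URL."""
    prefix = "gs://"
    if not gs_uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {gs_uri!r}")
    bucket, _, obj = gs_uri[len(prefix):].partition("/")
    if not bucket or not obj:
        raise ValueError(f"expected gs://<bucket>/<object>, got {gs_uri!r}")
    # quote() percent-encodes special characters but keeps "/" separators
    return f"https://storage.googleapis.com/{bucket}/{quote(obj)}"
```

The resulting URL can be opened in a browser or passed to any HTTP client or CSV reader that accepts URLs.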
Check out our Google Colab Notebook for interactive examples and tutorials to help you get started with the MLB data and Google Cloud AI tools.
For help understanding baseball terminology and statistics, refer to the MLB Glossary. This comprehensive resource explains baseball terms, statistics, and metrics used throughout MLB's data services.
More information about available datasets and hackathon themes will be provided in this repository. Stay tuned for updates!