Data for this project was scraped from MLB Savant: Statcast. I first scraped the raw pitch-by-pitch data for the 2024 and 2025 seasons. These raw sheets were used both in the strike zone analysis and to create the cleaned spreadsheets for the simulation.
The following cleaned spreadsheets are the files needed to run the simulation:
If you're interested in how I turned the raw data into the cleaned sheets used for the simulation, download this R code, which walks through the steps!
The following code uses the Raw 2024 and 2025 Pitch-by-Pitch data. This section contains the exploratory data analysis (EDA) that started this project, including the work to discover where umpires actually call strikes compared to the set strike zone.
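To give a sense of what that comparison involves, here is a minimal Python sketch. It is not the project's actual EDA code. It assumes the raw sheets keep the standard Statcast column names (`description`, `plate_x`, `plate_z`, `sz_top`, `sz_bot`, with locations in feet), and the function name `tally_calls` and the horizontal tolerance of one ball radius are choices made only for this illustration.

```python
import math
from collections import Counter

# Half of the 17-inch plate plus roughly one ball radius (~1.45 in), in feet.
# The extra ball radius is an assumption made for this sketch.
HALF_ZONE_WIDTH_FT = (17 / 2 + 1.45) / 12


def tally_calls(rows):
    """Count umpire calls inside and outside the set zone.

    `rows` is an iterable of dicts, for example from csv.DictReader.
    Returns a Counter keyed by (location, call).
    """
    counts = Counter()
    for row in rows:
        call = row["description"]
        # Keep only pitches the umpire had to judge (no swings, no contact).
        if call not in ("called_strike", "ball"):
            continue
        try:
            x = float(row["plate_x"])
            z = float(row["plate_z"])
            top = float(row["sz_top"])
            bot = float(row["sz_bot"])
        except (TypeError, ValueError):
            continue  # skip rows with missing or non-numeric tracking values
        if any(math.isnan(v) for v in (x, z, top, bot)):
            continue
        inside = abs(x) <= HALF_ZONE_WIDTH_FT and bot <= z <= top
        counts[("inside" if inside else "outside", call)] += 1
    return counts


# Made-up rows, only to show the expected input shape.
toy_rows = [
    {"description": "called_strike", "plate_x": "0.0", "plate_z": "2.5", "sz_top": "3.5", "sz_bot": "1.5"},
    {"description": "called_strike", "plate_x": "1.0", "plate_z": "2.5", "sz_top": "3.5", "sz_bot": "1.5"},
    {"description": "ball", "plate_x": "0.2", "plate_z": "2.0", "sz_top": "3.5", "sz_bot": "1.5"},
    {"description": "swinging_strike", "plate_x": "0.0", "plate_z": "2.5", "sz_top": "3.5", "sz_bot": "1.5"},
]
for key, n in sorted(tally_calls(toy_rows).items()):
    print(key, n)
```

The `("outside", "called_strike")` and `("inside", "ball")` counts are the interesting ones: they are the pitches where the call disagrees with the set zone.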
The following code uses the cleaned spreadsheets Pitcher Options, Pitch Options, and Full MLB Dataset. This section shows how the simulation was first built in R, then converted to Python, and finally adapted into a Python/Flask user interface.