Creating the Simulation

Steps and Functions to Create the "Play Ball!" Application

Goal

To uncover the pitcher-batter duel, it is important to understand how pitchers use pitch sequencing and command to their advantage. A pitcher's decisions depend on many factors, such as the count, inning, score, runners on base, and even the specific batter they are facing. At this level of specificity, there is not enough data for each situation to model what a pitcher would do, so it is not possible to predict a specific outcome in any given situation. However, a simulation allows users to see what could result from varying pitches and locations for the pitcher and batter matchup they selected. Simulating each pitch 1,000 times also helps uncover underlying trends in the data.

Play Ball!

Playing the role of the pitcher, users choose to face a left- or right-handed batter for an at-bat and strategically vary their pitch type and location to try to strike the batter out! Through these simulations, users get to explore which locations and sequences are most advantageous to the pitcher.


Data

I am using data scraped from MLB Statcast for the 2024 and 2025 seasons. This data contains every pitch from each regular season. After downloading the raw MLB Statcast data, I created several datasets that are used to run the simulation and to display the pitcher and pitch type options available to users. I created the following datasets to be used in the simulation:

  1. Pitcher Options: Pitchers with enough data to run the simulation

    First, I had to figure out which pitchers users could select. To do this, I took the raw data from the 2024 and 2025 MLB seasons and created a list of pitchers who threw at least 1,000 pitches to both left- and right-handed batters. This ensures users can only select pitchers who have enough data for the simulation to run.

    This dataset holds the player ID number, the player name, and whether the pitcher throws left- or right-handed. The name and handedness of the pitcher are displayed to the user in the dropdown menu, and the ID number is used to identify the pitcher in the other datasets.
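    The eligibility rule above can be sketched in a few lines. This is only an illustration, not the application's actual code: it assumes each pitch is a dictionary with the Statcast-style fields pitcher (ID number) and stand (batter side, "L" or "R"), and it reads the 1,000-pitch threshold as applying to each batter side separately. The function name eligible_pitchers is hypothetical.

```python
from collections import Counter

MIN_PITCHES_PER_SIDE = 1000  # threshold stated in the text


def eligible_pitchers(pitches, min_pitches=MIN_PITCHES_PER_SIDE):
    """Return the IDs of pitchers with enough pitches to both batter sides.

    `pitches` is an iterable of dicts with a `pitcher` ID and a `stand`
    value of "L" or "R" (assumed field names).
    """
    # Count pitches per (pitcher, batter side) pair.
    counts = Counter((p["pitcher"], p["stand"]) for p in pitches)
    pitcher_ids = {pitcher_id for pitcher_id, _ in counts}
    # A Counter returns 0 for a missing pair, so a pitcher who never
    # faced one side is excluded automatically.
    return {
        pitcher_id
        for pitcher_id in pitcher_ids
        if counts[(pitcher_id, "L")] >= min_pitches
        and counts[(pitcher_id, "R")] >= min_pitches
    }


# Tiny made-up sample, checked with a lowered threshold of 3.
sample = (
    [{"pitcher": 1, "stand": "L"}] * 3
    + [{"pitcher": 1, "stand": "R"}] * 3
    + [{"pitcher": 2, "stand": "L"}] * 5
    + [{"pitcher": 2, "stand": "R"}] * 1
)
demo_ids = eligible_pitchers(sample, min_pitches=3)
```

    In the sample, pitcher 2 has enough pitches to left-handed batters but only one to right-handed batters, so only pitcher 1 is kept.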

  2. Pitch Options: Pitch types available for the selected pitcher and batter stance

    After creating the list of Pitcher Options, I had to determine which pitches each pitcher could throw to each batter stance. For each combination of pitcher and batter stance, I only included a pitch type if the following conditions were met:

    • The pitch type was thrown at least 50 times
    • The pitch type was swung at at least 25 times
    • There were at least 10 recorded takes (no swings), hits, fouls, and swings and misses for the pitch type

    For this simulation to work, I had to make sure there was enough variability in the results for each pitch type. For example, if the batter never swung at the pitch in the raw data, the simulation cannot estimate how likely the batter is to swing. Similarly, if every swing resulted in a foul, the simulation would not be able to estimate the likelihood of a swing and miss or a hit.
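    A minimal sketch of the three conditions follows. It rests on several assumptions that are not confirmed by the text: each pitch is a dictionary with the fields pitcher, stand, pitch_type, and a simplified result label simple_result taking the values "take", "whiff", "foul", or "inplay"; and the "at least 10" rule is read as at least 10 of each result. The function name valid_pitch_options is hypothetical.

```python
from collections import Counter, defaultdict

MIN_THROWN = 50      # thresholds stated in the text
MIN_SWINGS = 25
MIN_PER_RESULT = 10  # assumed to apply to each result type separately
RESULTS = ("take", "whiff", "foul", "inplay")  # hypothetical labels


def valid_pitch_options(pitches, min_thrown=MIN_THROWN,
                        min_swings=MIN_SWINGS,
                        min_per_result=MIN_PER_RESULT):
    """Return the (pitcher, stand, pitch_type) keys that pass all checks."""
    # Tally results for every pitcher / batter stance / pitch type.
    tallies = defaultdict(Counter)
    for p in pitches:
        key = (p["pitcher"], p["stand"], p["pitch_type"])
        tallies[key][p["simple_result"]] += 1

    valid = set()
    for key, counts in tallies.items():
        thrown = sum(counts.values())
        swings = thrown - counts["take"]  # every non-take is a swing
        enough_of_each = all(counts[r] >= min_per_result for r in RESULTS)
        if thrown >= min_thrown and swings >= min_swings and enough_of_each:
            valid.add(key)
    return valid


def _rows(pitch_type, **result_counts):
    """Build made-up rows for one pitcher facing right-handed batters."""
    return [
        {"pitcher": 1, "stand": "R", "pitch_type": pitch_type,
         "simple_result": result}
        for result, n in result_counts.items()
        for _ in range(n)
    ]


# Made-up data, checked with lowered thresholds.
demo = (_rows("FF", take=2, whiff=2, foul=2, inplay=2)
        + _rows("SL", take=6, whiff=0, foul=1, inplay=1))
demo_options = valid_pitch_options(demo, min_thrown=8, min_swings=4,
                                   min_per_result=2)
```

    In the made-up data, the slider ("SL") is dropped because it was swung at too rarely and never produced a swing and miss, which is exactly the lack of variability described above.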

  3. Full MLB Dataset: Raw data for pitchers in Pitcher Options from the 2024 and 2025 seasons

    Finally, I filtered the full MLB Statcast data from the 2024 and 2025 seasons to only include rows for the valid pitchers and pitch types listed in Pitcher Options and Pitch Options.


Flow of Simulation

On the Play Ball! page, once the user makes all of their selections (pitcher, batter, pitch, location) and hits "Throw Pitch!", all of the following steps are run.

When the user first hits "Throw Pitch!", the MLB Dataset is filtered to only the rows for the pitcher and batter stance selected by the user. Data for this matchup is then used for the rest of the steps in the simulation.

  1. Calculating Pitch Type Parameters

    With the pitch type selected, the pitcher's data is filtered down to rows where they threw that pitch. Two sets of parameters are then calculated: swing_decision and contact parameters. Both use the following predictors:

    • dist_x: horizontal distance from the center of the strike zone (the same as the horizontal location of the pitch, plate_x)
    • dist_z: vertical distance from the center of the strike zone (the vertical location of the pitch, plate_z, minus 2.5)
    • balls: number of balls in the count when the pitch was thrown
    • strikes: number of strikes in the count when the pitch was thrown
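    As a small illustration of how one pitch maps to these four predictors, here is a sketch. The function name pitch_predictors is hypothetical; the 2.5 offset is the one given above, and Statcast reports plate_x and plate_z in feet.

```python
ZONE_CENTER_Z = 2.5  # vertical center of the strike zone used in the text


def pitch_predictors(plate_x, plate_z, balls, strikes):
    """Build the four model predictors for one pitch."""
    return {
        "dist_x": plate_x,                  # zone is centered at plate_x = 0
        "dist_z": plate_z - ZONE_CENTER_Z,  # shift so the center is 0
        "balls": balls,                     # kept numeric, not categorical
        "strikes": strikes,
    }


# A pitch slightly off-center and half a foot above the middle, on a 1-2 count.
example = pitch_predictors(plate_x=0.4, plate_z=3.0, balls=1, strikes=2)
```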


    Note that balls and strikes are treated as numerical rather than categorical variables. They were originally categorical, but I encountered errors running the model because there were not enough observations or enough variability for each count. The model was fitting all 12 unique counts, and many counts had fewer than 10 rows of data or had only one outcome (for example, every observation was a pitch taken for a ball). I first tried limiting the dataset to pitchers and pitch types that had data for every count, but this left fastballs as the only available pitch for any given combination. I decided the better approach was to treat balls and strikes as numerical. This no longer requires pitchers to have enough data for each count, and it still captures effects such as batters being more likely to swing when there are fewer strikes and to foul pitches off on two-strike counts.

    Swing_decision_params uses a logistic regression model for the pitch type to determine whether the batter will swing (is_swing = 1) or not (is_swing = 0). Using the raw data for the pitcher, batter stance, and pitch, it fits the following model:
    is_swing ~ dist_x + dist_z + balls + strikes

    The coefficients of this model are stored in swing_dec_params for the pitch type and used in the Batter Decision function.
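    To show how stored coefficients of this kind turn into a swing probability, here is a minimal sketch of the standard logistic formula. The dictionary layout, the function name swing_probability, and the coefficient values are all made up for illustration; they are not fitted values from the data, and the Batter Decision function itself may work differently.

```python
import math


def swing_probability(params, dist_x, dist_z, balls, strikes):
    """Apply the logistic function to the model's linear predictor."""
    linear = (
        params["intercept"]
        + params["dist_x"] * dist_x
        + params["dist_z"] * dist_z
        + params["balls"] * balls
        + params["strikes"] * strikes
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical coefficients, chosen only so the arithmetic is easy to follow.
hypothetical_params = {
    "intercept": 0.0,
    "dist_x": -0.5,
    "dist_z": -0.5,
    "balls": 0.1,
    "strikes": 0.3,
}

# Linear predictor is 0 here, so the probability is exactly one half.
p_middle = swing_probability(hypothetical_params, 0.0, 0.0, balls=0, strikes=0)
# Two strikes add 0.6 to the linear predictor and raise the probability.
p_two_strikes = swing_probability(hypothetical_params, 0.0, 0.0, balls=0, strikes=2)
```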

    Contact_params then filters the same data used for swing_decisions to only include rows where the batter swung. It uses the following multinomial logistic regression model to determine what type of result the swing produced:
    simple_result ~ dist_x + dist_z + balls + strikes

    Whiff, or swinging_strike, is the baseline of this model. The coefficients of foul and inplay are relative to whiff, and are stored in contact_params and used in the Swing Result function.
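    Because whiff is the baseline, its linear predictor is fixed at 0 and the three probabilities come from a softmax over the three scores. The sketch below shows that calculation under assumptions: the layout of the coefficients (one tuple per non-baseline outcome, ordered intercept, dist_x, dist_z, balls, strikes), the function name contact_probabilities, and all numeric values are hypothetical.

```python
import math


def contact_probabilities(params, dist_x, dist_z, balls, strikes):
    """Return P(whiff), P(foul), P(inplay) for one swing."""
    features = (1.0, dist_x, dist_z, balls, strikes)  # 1.0 for the intercept
    scores = {"whiff": 0.0}  # baseline category: linear predictor fixed at 0
    for outcome in ("foul", "inplay"):
        scores[outcome] = sum(b * x for b, x in zip(params[outcome], features))
    # Softmax: exponentiate each score and normalize so the total is 1.
    denominator = sum(math.exp(s) for s in scores.values())
    return {outcome: math.exp(s) / denominator for outcome, s in scores.items()}


# Hypothetical coefficients: foul is twice as likely as whiff, inplay equally likely.
hypothetical_contact_params = {
    "foul": (math.log(2.0), 0.0, 0.0, 0.0, 0.0),
    "inplay": (0.0, 0.0, 0.0, 0.0, 0.0),
}
probs = contact_probabilities(hypothetical_contact_params, 0.0, 0.0, 1, 2)
```

    With these made-up coefficients the scores are 0, ln 2, and 0, so the probabilities are 0.25 for whiff, 0.5 for foul, and 0.25 for inplay.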


Results

For each pitch thrown, the simulation loops through its four functions 1,000 times. Here is a snapshot of the data stored and what it looks like:



So now what happens with this data?

First, the simulation has to choose one result from this data to display to the user. To decide the final outcome, the simulation takes a weighted random draw over all possible outcomes. This is the same method used in the Swing Result function. Each iteration has an associated outcome, as seen in the snapshot above. For example, for all 1,000 iterations for one pitch, assume the following outcomes occurred:

All possible outcomes have probabilities that sum to 1. The computer maps each probability onto a 0-1 scale and generates a random number. The random number falls into the range of one of the outcomes, and that outcome is the final result displayed to the user. A Ball takes up 45% of the total probability space in this example, so it is the most likely final outcome, but any of the outcomes can occur. This process mirrors the distribution of the data while allowing outcomes other than the most probable one to occur, which reflects what is seen in a real-life game of baseball.

The result chosen and displayed on the screen also changes the current count of the at-bat accordingly.
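The weighted draw described above can be sketched as follows. This is an illustration under assumptions, not the application's code: the function name choose_outcome and the outcome labels are hypothetical, and in the sample data only the 45% share for Ball comes from the example in the text; the split among the other outcomes is made up.

```python
import random
from collections import Counter


def choose_outcome(simulated_outcomes, rng=random):
    """Pick one outcome with probability proportional to how often it occurred."""
    if not simulated_outcomes:
        raise ValueError("need at least one simulated outcome")
    counts = Counter(simulated_outcomes)
    total = sum(counts.values())
    draw = rng.random()  # uniform random number in [0, 1)
    cumulative = 0.0
    for outcome, n in counts.items():
        cumulative += n / total  # upper edge of this outcome's range
        if draw < cumulative:
            return outcome
    return outcome  # guards against floating-point rounding at the top edge


# 1,000 iterations: 45% Ball as in the text, the rest split arbitrarily.
iterations = (
    ["Ball"] * 450
    + ["Called Strike"] * 200
    + ["Swinging Strike"] * 150
    + ["Foul"] * 120
    + ["In Play"] * 80
)
final_result = choose_outcome(iterations, rng=random.Random(2024))
```

Repeating the draw many times reproduces the simulated distribution, so Ball comes up about 45% of the time while the less likely outcomes still appear occasionally.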

Additionally, some of the data from all 1,000 iterations is displayed to the user, including:

Together, these insights allow users to understand which locations can maximize the number of called strikes, swings and misses, and fouls while minimizing the number of hits.