During my first week, I created an outline for my project, detailing the order of areas and questions I wanted to explore and answer. With these questions, I outlined some of the variables I would need and verified that the data was available. Luckily, baseball has data on pretty much everything you can think of. A premade Python package made it easy to scrape all of the pitch-by-pitch data I needed from MLB's Statcast website. During this week, I played around with some data and created simple visualizations in R that showed where balls and strikes were called against the strike zone. I identified my next step as trying to draw where this new strike zone was (where balls and strikes were actually called) compared to the set zone. I started experimenting with classification models, such as logistic regression and KNN, which I am continuing to work on heading into week two.
This week was very productive for me. Building off of week 1, I found that running a generalized additive model worked best to predict and
visualize the 'actual' strike zone. The model calculates the probability of the pitch being called a strike for any coordinate based on
the pitches called nearby. Different strike probabilities are created for each combination of coordinates, month, and batter stance. The black box
on the visuals represents the 'set' strike zone where balls and strikes should be determined. The smoother, colored lines represent the 50% probability
of a pitch being called a strike for that given month. In general, this allowed me to see that the 'actual' strike zone tends to expand from the 'set' strike
zone throughout the 2025 season for both left and right-handed batters. I also looked into comparing the 2024 season, when balls and strikes could not be
challenged, to 2025, when they could. I found that the strike zone tended to be larger each month in 2024 compared to 2025. With this understanding
of the strike zone, next week I will move into exploring how often pitchers throw outside of the strike zone.
Here is one of the visuals I created showing the strike zone for each month of the 2025 season for left vs right-handed batters.
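To illustrate the idea of estimating a strike probability "based on the pitches called nearby," here is a minimal Python sketch. It is not the generalized additive model itself: it swaps in a much simpler Gaussian kernel-weighted average as a stand-in. The function name, the bandwidth value, and the sample pitches are all assumptions made up for illustration.

```python
import math


def strike_probability(x, z, pitches, bandwidth=0.25):
    """Kernel-weighted share of nearby called pitches that were strikes.

    pitches: iterable of (plate_x, plate_z, is_strike) tuples, in feet.
    bandwidth: Gaussian kernel width in feet (an assumed value).
    """
    weighted_strikes = 0.0
    total_weight = 0.0
    for px, pz, is_strike in pitches:
        dist_sq = (px - x) ** 2 + (pz - z) ** 2
        # Pitches close to (x, z) get a weight near 1, distant ones near 0.
        weight = math.exp(-dist_sq / (2 * bandwidth ** 2))
        weighted_strikes += weight * is_strike
        total_weight += weight
    if total_weight == 0:
        return None  # no usable pitches near this coordinate
    return weighted_strikes / total_weight


# Made-up called pitches: (horizontal ft, height ft, 1 = strike / 0 = ball).
called_pitches = [
    (0.0, 2.5, 1), (0.1, 2.6, 1), (-0.2, 2.4, 1),
    (1.4, 2.5, 0), (1.5, 2.6, 0), (1.3, 2.4, 0),
    (0.8, 2.5, 1), (0.9, 2.5, 0),
]

middle = strike_probability(0.0, 2.5, called_pitches)
outside = strike_probability(1.4, 2.5, called_pitches)
```

Evaluating a function like this over a grid of coordinates and tracing where the value crosses 0.5 would give a boundary comparable in spirit to the colored lines described above.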
This week moved a bit slower for me. After a meeting with Dr. Dunbar, we realized the strike zone visualizations needed some changes, since they were not broken up by left and right-handed pitchers. From the umpire's perspective behind the plate, left-handed batters have an “outside zone” to the left of the plate and an “inside zone” on the right. This is the opposite for right-handed batters. I decided to put the pitch coordinates for left-handed batters in the same perspective as right-handed batters by negating the x pitch coordinate. After these edits, I started organizing data to answer how often pitches are thrown outside of the zone. Ideally, I want to break this up by the type of pitch thrown and/or whether it resulted in a strike, foul, ball, or ball in play. This is what I will be continuing heading into next week.
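The coordinate flip and the outside-the-zone share described above can be sketched in a few lines of Python. This is only an illustration: the field names (`plate_x`, `plate_z`, `stand`), the fixed zone heights, and the sample pitches are assumptions, and the zone here is a simplified rectangle that ignores the width of the ball and each batter's height.

```python
PLATE_HALF_WIDTH_FT = 17 / 12 / 2  # home plate is 17 inches wide


def to_common_perspective(plate_x, stand):
    """Negate x for left-handed batters so 'inside' is on the same side for everyone."""
    return -plate_x if stand == "L" else plate_x


def is_outside_zone(plate_x, plate_z, zone_bottom=1.5, zone_top=3.5):
    """True when the pitch misses a simplified rectangular zone (assumed heights, in feet)."""
    misses_sideways = abs(plate_x) > PLATE_HALF_WIDTH_FT
    misses_vertically = not (zone_bottom <= plate_z <= zone_top)
    return misses_sideways or misses_vertically


# Made-up pitches for illustration.
pitches = [
    {"plate_x": 0.9, "plate_z": 2.5, "stand": "L"},
    {"plate_x": -0.3, "plate_z": 2.0, "stand": "L"},
    {"plate_x": 0.5, "plate_z": 3.0, "stand": "R"},
    {"plate_x": 0.2, "plate_z": 4.0, "stand": "R"},
]

mirrored_x = [to_common_perspective(p["plate_x"], p["stand"]) for p in pitches]
outside_flags = [
    is_outside_zone(x, p["plate_z"]) for x, p in zip(mirrored_x, pitches)
]
share_outside = sum(outside_flags) / len(outside_flags)
```

Grouping the same flags by pitch type or by pitch result would give the breakdowns mentioned above.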
After finding the proportions of pitches thrown outside the zone and the results of those pitches, I moved into analyzing specific pitchers and their pitch locations with results. I decided after our weekly meeting that I would start by selecting a pitcher and finding the average x and z coordinates of his pitches, broken up by the type of pitch. With the average locations, I created a box (extending approximately the radius of a baseball in each direction) around each coordinate. I then filtered my dataset to pitches inside any of the boxes and found the proportion of pitches in each box that were called strikes, fouls, swings and misses, hits, etc. With this information, I can compare the expected results of throwing a pitch in a pitcher's average area versus somewhere else in the strike zone. Ideally, I would like the user to pick a scalar for the x and z coordinates of the average pitch location to see how the zones differ. This week I want to generalize this process so that users can pick a pitcher and a scale for the pitch coordinates. I think it would also be interesting to look at the optimal location/zone for each type of pitch for a specific pitcher.
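Here is a rough Python sketch of the box-around-the-average step, under some stated assumptions: the field names and sample data are made up, the box is only matched against pitches of the same type, and the optional `shift_x` / `shift_z` arguments are one hypothetical way to move the box away from the average location.

```python
from collections import Counter
from statistics import mean

BALL_RADIUS_FT = 1.45 / 12  # a baseball is roughly 2.9 inches across


def outcome_shares(pitches, pitch_type, shift_x=0.0, shift_z=0.0,
                   half_width=BALL_RADIUS_FT):
    """Share of each result among pitches inside a box around the (shifted) average."""
    same_type = [p for p in pitches if p["pitch_type"] == pitch_type]
    if not same_type:
        return {}
    # Box center: the average location for this pitch type, plus any shift.
    center_x = mean(p["plate_x"] for p in same_type) + shift_x
    center_z = mean(p["plate_z"] for p in same_type) + shift_z
    in_box = [
        p for p in same_type
        if abs(p["plate_x"] - center_x) <= half_width
        and abs(p["plate_z"] - center_z) <= half_width
    ]
    counts = Counter(p["result"] for p in in_box)
    total = sum(counts.values())
    return {result: n / total for result, n in counts.items()} if total else {}


# Made-up pitches: "FF" is a four-seam fastball, "CU" a curveball.
sample = [
    {"pitch_type": "FF", "plate_x": -0.1, "plate_z": 2.5, "result": "called_strike"},
    {"pitch_type": "FF", "plate_x": 0.1, "plate_z": 2.5, "result": "foul"},
    {"pitch_type": "FF", "plate_x": 0.0, "plate_z": 2.6, "result": "called_strike"},
    {"pitch_type": "FF", "plate_x": 0.0, "plate_z": 2.4, "result": "swing_and_miss"},
    {"pitch_type": "FF", "plate_x": 0.8, "plate_z": 2.5, "result": "ball"},
    {"pitch_type": "FF", "plate_x": -0.8, "plate_z": 2.5, "result": "ball"},
    {"pitch_type": "CU", "plate_x": 0.0, "plate_z": 1.0, "result": "ball"},
]

at_average = outcome_shares(sample, "FF")
shifted = outcome_shares(sample, "FF", shift_x=0.8)
```

Comparing `at_average` with `shifted` is the kind of "average area versus somewhere else" comparison described above.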
Building off of last week, Dr. Dunbar and I decided to create and run a simulation model for these average and shifted pitch locations. The steps for the simulation and each function are described below. With these functions, I can run a simulation based on the pitcher's current pitch locations, or use input values for the x and z coordinates to shift the average location. Running each simulation thousands of times yields the probabilities of called strikes, balls, and swings for the average and shifted locations. Once the swing function is working, the swing category will be replaced with hit into play, swing and miss, and foul categories. My next steps are to get the swing function to work, create a visual of the simulation data, and look for any improvements to make in my functions.
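To give a feel for how repeating a pitch simulation thousands of times turns into outcome probabilities, here is a minimal Python sketch. It is not the project's actual set of functions: the location spread, the zone limits, and especially the swing rule are placeholder assumptions chosen only so the example runs.

```python
import math
import random
from collections import Counter

ZONE_HALF_WIDTH_FT = 0.83               # assumed: half the plate plus a ball radius
ZONE_BOTTOM_FT, ZONE_TOP_FT = 1.5, 3.5  # assumed fixed zone heights


def simulate_pitch(rng, target_x, target_z, spread=0.5):
    """Simulate one pitch aimed at (target_x, target_z); return its outcome."""
    # The pitch lands near the target with some random miss (assumed spread, in feet).
    x = rng.gauss(target_x, spread)
    z = rng.gauss(target_z, spread)
    # Placeholder swing rule: batters swing less often the farther the pitch
    # is from the middle of the zone.
    distance = math.hypot(x, z - 2.5)
    swing_prob = max(0.05, 0.7 - 0.3 * distance)
    if rng.random() < swing_prob:
        return "swing"
    in_zone = abs(x) <= ZONE_HALF_WIDTH_FT and ZONE_BOTTOM_FT <= z <= ZONE_TOP_FT
    return "called_strike" if in_zone else "ball"


def outcome_probabilities(target_x, target_z, n=10_000, seed=0):
    """Estimate outcome probabilities by repeating the pitch n times."""
    rng = random.Random(seed)
    counts = Counter(simulate_pitch(rng, target_x, target_z) for _ in range(n))
    return {k: counts[k] / n for k in ("called_strike", "ball", "swing")}


average = outcome_probabilities(0.0, 2.5)        # a pitcher's average location
shifted = outcome_probabilities(0.0 + 1.0, 2.5)  # same pitch, shifted 1 ft sideways
```

With these placeholder rules, shifting the target off the plate should raise the share of balls, which is the kind of comparison the simulation is meant to support.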
This week I was able to finish building and refining my pitch simulation. I finished creating and integrating the swing function, which runs a multinomial
logistic regression with the results being a swing & miss, foul, or hit into play. This is based on the distance from the center of the plate. I think it would be interesting
to see how the results change by adding in other variables such as velocity or pitch movement, but I am going to leave that for now to focus on other parts of my project. The swing function
takes the probability of these results and passes it to the simulation, only running if the batter chooses to swing at a pitch.
With my simulation done, I am working on tidying up my visuals that show the results of the simulation and figuring out my next steps. I think this will take some time and experimentation to see where to
go from here, but I am hoping to build off of my simulation to allow users to pick a game situation (likely number of outs and inning to start) and simulate an entire at-bat.
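For readers unfamiliar with multinomial logistic regression, the sketch below shows how such a model turns a single predictor (distance from the center of the plate) into probabilities for the three swing results. The coefficients are invented for illustration and are not fitted values from this project; a real model would estimate them from the pitch data.

```python
import math

# Hypothetical (intercept, slope) pairs per outcome. "foul" is treated as the
# reference category, so both of its coefficients are zero.
COEFFICIENTS = {
    "swing_and_miss": (-1.0, 1.2),
    "foul": (0.0, 0.0),
    "hit_into_play": (0.3, -0.8),
}


def swing_outcome_probabilities(distance_ft):
    """Softmax over linear scores: one probability per swing outcome."""
    scores = {
        outcome: intercept + slope * distance_ft
        for outcome, (intercept, slope) in COEFFICIENTS.items()
    }
    # Subtracting the largest score keeps exp() numerically stable.
    max_score = max(scores.values())
    exp_scores = {k: math.exp(s - max_score) for k, s in scores.items()}
    total = sum(exp_scores.values())
    return {k: v / total for k, v in exp_scores.items()}


center = swing_outcome_probabilities(0.0)  # pitch over the middle of the plate
edge = swing_outcome_probabilities(1.5)    # pitch 1.5 ft from the center
```

In a simulation, these probabilities would only be consulted after the batter has decided to swing, matching the flow described above.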
This week I was able to narrow down the next steps for my project and begin to picture what exactly I want my user interface to look like. The plan is to have a user select a specific pitcher and whether the batter is left or right-handed. A specific pitcher and batter matchup only sees a maximum of ~20 at-bats in a season, which would not be nearly enough data for this simulation. Instead, since pitch locations and results vary by the handedness of the pitcher and batter, accounting for handedness gives more specificity than generalizing an at-bat to all batters that a pitcher faces. With these selections, for each pitch of the at-bat, the user would be able to drag a ball around the strike zone, select a type of pitch, and click a "pitch" button that would run the simulation with their selections. The simulation would then take the most probable outcome and add it to the count (ex: if "called strike" is the most probable after the first pitch, the count would display 0-1 for their next selection). The count would also be a factor in the simulation's logic. The interface could also display the probability of all other results happening on that pitch. This would continue until the at-bat has ended. This type of interface would allow users to "play" from a pitcher's perspective and could create a sort of competition for users to try to get the simulation to result in a strikeout or out. With this weekend being the start of Spring Break, I took time to document where I am currently at and specific steps to take when I pick up on this project again.
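As a small illustration of the "take the most probable outcome and add it to the count" step, here is a Python sketch. The outcome names and the example probabilities are assumptions, and the count rules are the standard ones (four balls is a walk, three strikes is a strikeout, and a foul cannot be the third strike).

```python
def update_count(balls, strikes, outcome):
    """Return (balls, strikes, at_bat_over) after applying one pitch outcome."""
    if outcome == "ball":
        balls += 1
    elif outcome in ("called_strike", "swing_and_miss"):
        strikes += 1
    elif outcome == "foul":
        # A foul adds a strike only when the batter has fewer than two.
        if strikes < 2:
            strikes += 1
    elif outcome == "hit_into_play":
        return balls, strikes, True
    else:
        raise ValueError(f"unknown outcome: {outcome}")
    return balls, strikes, balls == 4 or strikes == 3


# Hypothetical probabilities returned by the simulation for the first pitch.
first_pitch_probs = {
    "called_strike": 0.40,
    "ball": 0.25,
    "foul": 0.15,
    "swing_and_miss": 0.10,
    "hit_into_play": 0.10,
}

most_likely = max(first_pitch_probs, key=first_pitch_probs.get)
balls, strikes, over = update_count(0, 0, most_likely)
count_display = f"{balls}-{strikes}"  # shown as balls-strikes
```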
Getting back into my project after Spring Break, I was able to complete a full at-bat simulation in R! I was able to add the count into the simulation's logic in deciding whether a batter would swing at a pitch and the result of the contact made. This not only gave me a more accurate model, but the count is also a crucial variable when users run their own simulations on my dashboard. In R, I
tested running a full at-bat with a selected pitcher, pitch, and batter stance to test the code's logic and the results this would give me. In my dashboard, the user will be picking a different pitch and location each time, but for the sake of this test, it used the same selection until the at-bat was over. This worked so well and I definitely saw my results vary as I changed the location and
pitch type! Below are some screenshots from letting the simulation run on its own with initial selections. Something worth noting is that my simulation determines the outcome of each pitch using a random weighted sample. By not taking the most probable outcome every time, it avoids getting stuck throwing endless foul balls and allows for a more realistic, varying story for each at-bat.
With this working, I am currently transferring the simulation to a working Python version. Thanks to Dr. Mcvey, I am also working with some JavaScript files that would allow a user to move a baseball over an image and keep track of the coordinates. My goal for this week is to have an HTML page that uses the baseball location coordinates and runs my at-bat simulation in Python. To start, I am going to use
the same pitcher, pitch type, and batter stance, and add those selections once the coordinates work.
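The weighted random sample mentioned above can be sketched in Python with `random.choices`. This is a simplified illustration rather than the project's code: the probabilities are invented, and they stay fixed for the whole at-bat, whereas the real simulation also lets the count influence them.

```python
import random


def simulate_at_bat(outcome_probs, seed=None, max_pitches=50):
    """Sample pitch outcomes until the at-bat ends; return (pitch_log, ending)."""
    rng = random.Random(seed)
    outcomes = list(outcome_probs)
    weights = [outcome_probs[o] for o in outcomes]
    balls = strikes = 0
    pitch_log = []
    while len(pitch_log) < max_pitches:
        # Weighted draw: likely outcomes come up often, but not every time.
        outcome = rng.choices(outcomes, weights=weights, k=1)[0]
        pitch_log.append(outcome)
        if outcome == "ball":
            balls += 1
        elif outcome in ("called_strike", "swing_and_miss"):
            strikes += 1
        elif outcome == "foul":
            strikes = min(strikes + 1, 2)  # a foul cannot be strike three
        else:  # "hit_into_play"
            return pitch_log, "ball_in_play"
        if balls == 4:
            return pitch_log, "walk"
        if strikes == 3:
            return pitch_log, "strikeout"
    return pitch_log, "unfinished"  # safety cap, not expected in practice


# Invented probabilities where "foul" is the single most likely outcome.
probs = {
    "foul": 0.40,
    "ball": 0.20,
    "called_strike": 0.15,
    "swing_and_miss": 0.15,
    "hit_into_play": 0.10,
}

pitch_log, ending = simulate_at_bat(probs, seed=1)
```

With these numbers, always taking the most probable outcome would produce a foul on every pitch and the at-bat would never end, while the weighted draw lets it finish and vary from run to run.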
It's both exciting and scary to say that presentations are just about a month away! This fact is definitely keeping me motivated. This week has been very successful for me. With a working Python version of my at-bat simulation, I started creating my Flask and Python webpage. I started by creating a basic application that displayed a hard-coded pitcher, batter stance, and pitch type and allowed the user to hit a "Run Simulation" button to view the results of each pitch of the at-bat. From here, I slowly added user selections and tested with each addition. Now, I am at the point where a user can select all three of these elements before running the simulation. In my interface, I want users to select a pitcher and batter stance and submit their selections before doing anything else. This prevents users from changing the pitcher or batter while they are mid-simulation and helps create the dynamic list of pitches users can choose from. With the matchup set, the user can then pick a pitch type and run the simulation. There is also an option to reset the matchup and "restart" the simulation. I'm very happy with where this is at and I'm excited to keep expanding it.
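The "lock the matchup first, then offer that pitcher's pitches" flow can be sketched independently of Flask. The class below is a hypothetical illustration in plain Python, not the app's actual code; the class name, the pitcher names, and the pitch lists are all made up.

```python
class MatchupSession:
    """Tracks the selected matchup and which pitch types may be chosen."""

    def __init__(self, arsenals):
        # arsenals: dict mapping pitcher name -> list of pitch types he throws
        self._arsenals = arsenals
        self.pitcher = None
        self.stance = None

    @property
    def locked(self):
        return self.pitcher is not None

    def set_matchup(self, pitcher, stance):
        """Lock in a pitcher and batter stance; refuse changes mid-simulation."""
        if self.locked:
            raise RuntimeError("Reset the matchup before choosing a new one.")
        if pitcher not in self._arsenals:
            raise ValueError(f"unknown pitcher: {pitcher}")
        if stance not in ("L", "R"):
            raise ValueError("stance must be 'L' or 'R'")
        self.pitcher, self.stance = pitcher, stance

    def available_pitches(self):
        """Pitch types to offer; empty until a matchup has been submitted."""
        return list(self._arsenals[self.pitcher]) if self.locked else []

    def reset(self):
        """Clear the matchup so a new simulation can start."""
        self.pitcher = None
        self.stance = None


# Made-up data for illustration.
session = MatchupSession({
    "Pitcher A": ["Fastball", "Slider"],
    "Pitcher B": ["Sinker", "Changeup", "Curveball"],
})

before = session.available_pitches()
session.set_matchup("Pitcher B", "L")
after = session.available_pitches()
```

In a web app, an object like this would typically live in per-user session state, with the submit and reset buttons calling `set_matchup` and `reset`.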
Here are some of the next steps and things I'm working on: