During my first week, I created an outline for my project, detailing the order of the areas and questions I wanted to explore and answer. From these questions, I listed the variables I would need and verified that the data was available. Luckily, baseball has data on pretty much everything you can think of. A premade Python package made it easy to scrape all of the pitch-by-pitch data I needed from MLB's Statcast website. During this week, I played around with some data and created simple visualizations in R that showed where balls and strikes were called relative to the strike zone. I identified my next step as trying to draw where the called strike zone was (where balls and strikes were actually called) compared to the set zone. I started experimenting with classification models, such as logistic regression and KNN, which I am continuing to work on heading into week two.
This week was very productive for me. Building off of week 1, I found that running a generalized additive model worked best to predict and
visualize the 'actual' strike zone. The model calculates the probability of the pitch being called a strike for any coordinate based on
the pitches called nearby. Different strike probabilities are created for each combination of coordinates, month, and batter stance. The black box
on the visuals represents the 'set' strike zone where balls and strikes should be determined. The smoother, colored lines mark where a pitch has a 50% probability
of being called a strike in that given month. In general, this allowed me to see that the 'actual' strike zone tends to expand from the 'set' strike
zone throughout the 2025 season for both left and right-handed batters. I also compared the 2024 season, when balls and strikes could not be
challenged, to 2025, when they could. I found that the strike zone tended to be larger each month in 2024 compared to 2025. With this understanding
of the strike zone, next week I will move into exploring how often pitchers throw outside of the strike zone.
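The idea of estimating a strike probability at any coordinate from the pitches called nearby can be illustrated with a much simpler stand-in than a generalized additive model. The sketch below uses a Gaussian kernel-weighted average, which is not the GAM from this entry. The function name, the bandwidth value, and the sample pitches are all made up for illustration.

```python
import math

def strike_probability(x, z, pitches, bandwidth=0.25):
    """Kernel-weighted share of nearby called pitches that were strikes.

    pitches: iterable of (plate_x, plate_z, is_strike) tuples, is_strike is 0 or 1.
    bandwidth: Gaussian kernel width in feet (an assumed value, not tuned).
    """
    weighted_strikes = 0.0
    total_weight = 0.0
    for px, pz, is_strike in pitches:
        dist_sq = (px - x) ** 2 + (pz - z) ** 2
        # Pitches close to (x, z) get weights near 1, distant ones near 0.
        weight = math.exp(-dist_sq / (2 * bandwidth ** 2))
        weighted_strikes += weight * is_strike
        total_weight += weight
    if total_weight == 0.0:
        return float("nan")  # no usable pitches near this coordinate
    return weighted_strikes / total_weight

# Made-up called pitches: three strikes over the plate, three balls off it.
called = [
    (0.0, 2.5, 1), (0.2, 2.6, 1), (-0.3, 2.4, 1),
    (1.5, 2.5, 0), (1.6, 2.7, 0), (1.4, 2.3, 0),
]
p_middle = strike_probability(0.0, 2.5, called)
p_outside = strike_probability(1.5, 2.5, called)
```

Evaluating this on a grid of coordinates and tracing where the value crosses 0.5 would give a rough analogue of the 50% contour lines, one per month and batter stance.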
Here is one of the visuals I created showing the strike zone for each month of the 2025 season for left vs right-handed batters.
This week moved a bit slower for me. In a meeting with Dr. Dunbar, we realized the strike zone visualizations needed some changes because they were not broken up by left and right-handed pitchers. From the umpire's perspective behind the plate, left-handed batters have an “outside zone” to the left of the plate and an “inside zone” on the right. This is the opposite for right-handed batters. I decided to put the pitch coordinates for left-handed batters in the same perspective as right-handed batters by negating the x pitch coordinate. After these edits, I started organizing data to answer how often pitches are thrown outside of the zone. Ideally, I want to break this up by the type of pitch thrown and/or whether it resulted in a strike, foul, ball, or ball in play. This is what I will be continuing heading into next week.
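The mirroring step for left-handed batters can be sketched in a few lines. This is only an illustration: the field names `stand`, `plate_x`, and `plate_z` are assumed here, and the sample rows are made up.

```python
def mirror_for_lefties(pitches):
    """Return copies of the pitch records with plate_x negated for left-handed batters."""
    mirrored = []
    for pitch in pitches:
        row = dict(pitch)  # copy so the original records are left untouched
        if row["stand"] == "L":
            row["plate_x"] = -row["plate_x"]
        mirrored.append(row)
    return mirrored

# Made-up example: an outside pitch to a lefty and an outside pitch to a righty.
sample = [
    {"stand": "L", "plate_x": -0.8, "plate_z": 2.1},
    {"stand": "R", "plate_x": 0.8, "plate_z": 2.1},
]
aligned = mirror_for_lefties(sample)
```

After mirroring, both pitches share the same x coordinate, so "inside" and "outside" mean the same thing for every batter.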
After finding the proportions of pitches thrown outside the zone and the results of those pitches, I moved into analyzing specific pitchers and their pitch locations with results. I decided after our weekly meeting that I would start by selecting a pitcher and finding the average x and z coordinates of his pitches, broken up by the type of pitch. With the average locations, I created a box (extending approximately the radius of a baseball in each direction) around each coordinate. I then filtered my dataset to pitches inside any of the boxes and found the proportion of pitches in each box that were called strikes, fouls, swings and misses, hits, etc. With this information, I can compare the expected results of throwing a pitch in the pitcher's average area versus somewhere else in the strike zone. Ideally, I would like the user to pick a scalar for the x and z coordinates of the average pitch location to see how the zones differ. This week I want to generalize this process to set myself up for users to be able to pick a pitcher and a scale for the pitch coordinates. I think it would also be interesting to look at the optimal location/zone for each type of pitch for a specific pitcher.
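The box-and-proportion step above can be sketched as follows. This is an illustration under assumptions, not the project's code: the field names (`pitch_type`, `plate_x`, `plate_z`, `result`), the helper names, and the sample pitches are hypothetical, and the box half-width is an approximate baseball radius in feet.

```python
from collections import Counter, defaultdict

BALL_RADIUS_FT = 0.12  # approximate radius of a baseball in feet (assumed value)

def average_locations(pitches):
    """Average (plate_x, plate_z) for each pitch type."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for p in pitches:
        entry = sums[p["pitch_type"]]
        entry[0] += p["plate_x"]
        entry[1] += p["plate_z"]
        entry[2] += 1
    return {pt: (sx / n, sz / n) for pt, (sx, sz, n) in sums.items()}

def outcome_proportions(pitches, center, half_width=BALL_RADIUS_FT):
    """Share of each result among pitches inside the box around `center`."""
    cx, cz = center
    in_box = [
        p for p in pitches
        if abs(p["plate_x"] - cx) <= half_width
        and abs(p["plate_z"] - cz) <= half_width
    ]
    counts = Counter(p["result"] for p in in_box)
    total = sum(counts.values())
    return {result: n / total for result, n in counts.items()} if total else {}

# Made-up pitches for one pitcher.
pitches = [
    {"pitch_type": "FF", "plate_x": 0.0, "plate_z": 2.5, "result": "called_strike"},
    {"pitch_type": "FF", "plate_x": 0.1, "plate_z": 2.6, "result": "foul"},
    {"pitch_type": "FF", "plate_x": -0.1, "plate_z": 2.4, "result": "called_strike"},
    {"pitch_type": "SL", "plate_x": 1.0, "plate_z": 1.5, "result": "ball"},
]
centers = average_locations(pitches)
props = outcome_proportions(pitches, centers["FF"])
```

Passing a shifted center instead of `centers["FF"]` gives the comparison between the average location and another spot in the zone.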
Building off of last week, Dr. Dunbar and I decided to create and run a simulation model for these average and shifted pitch locations. The steps for the simulation and each function are described below. With these functions, I can run a simulation based on the pitcher's current pitch locations, or use input values for the x and z coordinates to shift the average location. Running each simulation thousands of times yields the probabilities of called strikes, balls, and swings for the average and shifted locations. Once the swing function is working, the swing category will be replaced with hit into play, swing and miss, and foul categories. My next steps are to get the swing function to work, create a visual of the simulation data, and look for any improvements to make in my functions.
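As a rough illustration of the idea (not the project's actual functions), a simulation of this kind could be sketched as below. Everything specific is an assumption made for the example: the zone boundaries, the normal scatter around the average location, the placeholder swing rates, and all function and parameter names.

```python
import random
from collections import Counter

# Rough 'set' zone in feet (assumed boundaries for illustration only).
ZONE = {"x_min": -0.83, "x_max": 0.83, "z_min": 1.5, "z_max": 3.5}

def in_zone(x, z):
    return ZONE["x_min"] <= x <= ZONE["x_max"] and ZONE["z_min"] <= z <= ZONE["z_max"]

def swing_probability(x, z):
    """Placeholder swing rates (made up): batters swing more at pitches in the zone."""
    return 0.65 if in_zone(x, z) else 0.30

def simulate(avg_x, avg_z, shift_x=0.0, shift_z=0.0, spread=0.5,
             n_sims=10_000, seed=0):
    """Estimate called strike, ball, and swing probabilities by repeated sampling."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    counts = Counter()
    for _ in range(n_sims):
        # Scatter each simulated pitch around the (possibly shifted) average location.
        x = rng.gauss(avg_x + shift_x, spread)
        z = rng.gauss(avg_z + shift_z, spread)
        if rng.random() < swing_probability(x, z):
            counts["swing"] += 1
        elif in_zone(x, z):
            counts["called_strike"] += 1
        else:
            counts["ball"] += 1
    return {k: counts[k] / n_sims for k in ("called_strike", "ball", "swing")}

baseline = simulate(0.0, 2.5)
shifted = simulate(0.0, 2.5, shift_x=3.0)  # moved far off the plate
```

Comparing `baseline` with `shifted` shows how the outcome probabilities move when the average location is pushed away from the zone.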
This week I was able to finish building and refining my pitch simulation. I finished creating and integrating the swing function, which runs a multinomial
logistic regression with the results being a swing & miss, foul, or hit into play. This is based on the distance from the center of the plate. I think it would be interesting
to see how the results change by adding in other variables such as velocity or pitch movement, but I am going to leave that for now to focus on other parts of my project. The swing function
takes the probability of these results and passes it to the simulation, only running if the batter chooses to swing at a pitch.
With my simulation done, I am working on tidying up my visuals that show the results of the simulation and figuring out my next steps. I think this will take some time and experimentation to see where to
go from here, but I am hoping to build off of my simulation to allow users to pick a game situation (likely number of outs and inning to start) and simulate an entire at-bat.
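The swing step described above turns a distance into three outcome probabilities. A minimal sketch of how a fitted multinomial logistic regression does that is shown below. The coefficients are made up rather than fitted, the reference point (0, 2.5) for "center" is an assumption, and the names are hypothetical.

```python
import math

# Hypothetical (intercept, slope on distance) pairs; made up, not fitted values.
COEFS = {
    "swing_and_miss": (0.0, 0.0),  # reference category
    "foul": (0.6, -0.5),
    "hit_into_play": (0.8, -1.2),
}

def swing_outcome_probs(x, z, center=(0.0, 2.5)):
    """Softmax over linear scores in distance, as in multinomial logistic regression."""
    distance = math.hypot(x - center[0], z - center[1])
    scores = {k: b0 + b1 * distance for k, (b0, b1) in COEFS.items()}
    top = max(scores.values())  # subtract the max score for numerical stability
    exps = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

near = swing_outcome_probs(0.0, 2.5)  # at the assumed center
far = swing_outcome_probs(2.0, 2.5)   # two feet away from it
```

With these made-up coefficients, a hit into play becomes less likely and a swing and miss more likely as the pitch moves away from the center. The simulation would only call this function on pitches where the batter swings.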
This week I was able to narrow down the next steps for my project and begin to picture what exactly I wanted my user interface to look like. The plan is to have a user select a specific pitcher and whether their batter is left or right-handed. A specific pitcher and batter matchup only sees a maximum of ~20 at-bats in a season, which would not be nearly enough data for this simulation. Since pitch locations and results vary by the handedness of the pitcher and batter, accounting for handedness gives more specificity than generalizing an at-bat to all batters that a pitcher faces. With these selections, for each pitch of the at-bat, the user would be able to drag a ball around the strike zone, select a type of pitch, and click a "pitch" button that would run the simulation with their selections. The simulation would then take the most probable outcome and add it to the count (ex: if "called strike" is the most probable after the first pitch, the count would display 0-1 for their next selection). The count would also be a factor in the simulation's logic. The interface could also display the probability of all other results happening on that pitch. This would continue until the at-bat has ended. This type of interface would allow users to "play" from a pitcher's perspective and could create a sort of competition for users to try to get the simulation to result in a strike-out or out. With this weekend being the start of Spring Break, I took time to document where I am currently at and specific steps to take when I pick up on this project again.
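The count logic for the planned interface can be sketched as below. This is a hypothetical illustration, since the interface is not built yet: the outcome labels, the function names, and the example probabilities are all assumptions.

```python
def most_probable(outcome_probs):
    """Pick the outcome with the highest simulated probability."""
    return max(outcome_probs, key=outcome_probs.get)

def update_count(balls, strikes, outcome):
    """Return (balls, strikes, at_bat_result); the result is None while the at-bat continues."""
    if outcome == "ball":
        balls += 1
        if balls == 4:
            return balls, strikes, "walk"
    elif outcome in ("called_strike", "swing_and_miss"):
        strikes += 1
        if strikes == 3:
            return balls, strikes, "strikeout"
    elif outcome == "foul":
        # A foul ball adds a strike only when there are fewer than two strikes.
        if strikes < 2:
            strikes += 1
    elif outcome == "hit_into_play":
        return balls, strikes, "in_play"
    else:
        raise ValueError(f"unknown outcome: {outcome}")
    return balls, strikes, None

# Made-up probabilities for the first pitch of an at-bat.
first_pitch = {"called_strike": 0.5, "ball": 0.3, "swing_and_miss": 0.2}
balls, strikes, result = update_count(0, 0, most_probable(first_pitch))
count_display = f"{balls}-{strikes}"
```

Here the most probable outcome is a called strike, so the display moves to 0-1 and the at-bat continues with the user's next selection.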