During my first week, I created an outline for my project, detailing the order of the areas I wanted to explore and the questions I wanted to answer. From these questions, I outlined the variables I would need and verified that the data was available. Luckily, baseball has data on pretty much everything you can think of. A premade Python package made it easy to scrape all of the pitch-by-pitch data I needed from MLB's Statcast website. During this week, I played around with some of the data and created simple visualizations in R that showed where balls and strikes were called relative to the strike zone. I identified my next step as trying to draw where the 'actual' strike zone was (where balls and strikes were actually called) compared to the set zone. I started experimenting with classification models, such as logistic regression and KNN, which I am continuing to work on heading into week two.
This week was very productive for me. Building off of week 1, I found that a generalized additive model worked best to predict and
visualize the 'actual' strike zone. The model calculates the probability of a pitch being called a strike at any coordinate based on
the pitches called nearby. Separate strike probabilities are estimated for each combination of coordinates, month, and batter stance. The black box
on the visuals represents the 'set' strike zone, where balls and strikes should be determined. The smoother, colored lines mark where a pitch has a 50%
probability of being called a strike in a given month. In general, this allowed me to see that the 'actual' strike zone tends to expand beyond the 'set' strike
zone throughout the 2025 season for both left- and right-handed batters. I also compared the 2024 season, when balls and strikes could not be
challenged, to 2025, when they could. I found that the strike zone tended to be larger in each month of 2024 than in the same month of 2025. With this understanding
of the strike zone, next week I will move into exploring how often pitchers throw outside of the strike zone.
Here is one of the visuals I created showing the strike zone for each month of the 2025 season for left- vs. right-handed batters.
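To make the idea of "probability based on the pitches called nearby" concrete, here is a minimal Python sketch. It is not the generalized additive model itself; it swaps in a simpler Gaussian kernel average as a stand-in, and the function name, the tuple layout, and the `bandwidth` value are all assumptions made for illustration. In practice the estimate would be computed separately for each month and batter stance.

```python
import math


def strike_probability(x, z, pitches, bandwidth=0.25):
    """Kernel-weighted share of nearby called pitches that were strikes.

    pitches: iterable of (px, pz, is_strike) tuples, coordinates in feet,
             with is_strike equal to 1 for a called strike and 0 for a ball.
    bandwidth: smoothing width in feet (a hypothetical choice, not tuned).
    Returns None when no pitch carries any weight at (x, z).
    """
    weighted_strikes = 0.0
    total_weight = 0.0
    for px, pz, is_strike in pitches:
        dist_sq = (px - x) ** 2 + (pz - z) ** 2
        # Closer pitches get weights near 1, distant pitches near 0.
        weight = math.exp(-dist_sq / (2 * bandwidth ** 2))
        weighted_strikes += weight * is_strike
        total_weight += weight
    if total_weight == 0.0:
        return None
    return weighted_strikes / total_weight
```

Evaluating this function on a grid of coordinates and tracing where the value crosses 0.5 would give a rough analogue of the colored 50% lines described above.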
This week moved a bit slower for me. After a meeting with Dr. Dunbar, we realized that some changes were needed to the strike zone visualizations, which were not broken up by left- and right-handed pitchers. From the umpire's perspective behind the plate, left-handed batters have an “outside zone” to the left of the plate and an “inside zone” to the right. This is the opposite for right-handed batters. I decided to put the pitch coordinates for left-handed batters in the same perspective as those for right-handed batters by negating the x pitch coordinate. After these edits, I started organizing data to answer how often pitches are thrown outside of the zone. Ideally, I want to break this up by the type of pitch thrown and/or whether it resulted in a strike, foul, ball, or ball in play. This is what I will be continuing heading into next week.
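The coordinate flip and the outside-the-zone tally described above could look something like the following Python sketch. The dictionary keys (`stand`, `plate_x`, `plate_z`, `sz_top`, `sz_bot`, `pitch_type`) are assumed to follow Statcast column naming, the function names are hypothetical, and the zone test uses only the 17-inch plate width and ignores the width of the ball, which is a simplifying assumption.

```python
from collections import Counter

# Half the width of home plate (17 inches), expressed in feet.
PLATE_HALF_WIDTH_FT = 17 / 12 / 2


def mirror_for_lefties(pitch):
    """Return a copy of the pitch with plate_x negated for left-handed batters."""
    flipped = dict(pitch)
    if pitch["stand"] == "L":
        flipped["plate_x"] = -pitch["plate_x"]
    return flipped


def is_outside_zone(pitch):
    """True when the pitch location falls outside the 'set' strike zone."""
    in_width = abs(pitch["plate_x"]) <= PLATE_HALF_WIDTH_FT
    in_height = pitch["sz_bot"] <= pitch["plate_z"] <= pitch["sz_top"]
    return not (in_width and in_height)


def outside_share_by_pitch_type(pitches):
    """Proportion of pitches thrown outside the zone, keyed by pitch type."""
    totals = Counter()
    outside = Counter()
    for pitch in pitches:
        totals[pitch["pitch_type"]] += 1
        if is_outside_zone(pitch):
            outside[pitch["pitch_type"]] += 1
    return {ptype: outside[ptype] / totals[ptype] for ptype in totals}
```

Mirroring does not change whether a pitch is inside the zone, since the zone is symmetric about the center of the plate, but it does put "inside" and "outside" on the same side for every batter.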
After finding the proportions of pitches thrown outside the zone and the results of those pitches, I moved into analyzing specific pitchers and their pitch locations with results. I decided after our weekly meeting that I would start by selecting a pitcher and finding the average x and z coordinates of his pitches, broken up by the type of pitch. I then created a box around each average location, extending approximately one baseball radius in each direction. Using these boxes, I filtered my dataset to the pitches that landed inside any of them. Then, I found the proportion of pitches in each box that were called strikes, fouls, swings and misses, hits, etc. With this information, I can compare the expected results of throwing a pitch in a pitcher's average area versus somewhere else in the strike zone. Ideally, I would like the user to pick a scalar for the x and z coordinates of the average pitch location to see how the zones differ. This week I want to generalize this process so that users can pick a pitcher and a scale for the pitch coordinates. I think it would also be interesting to look at the optimal location/zone for each type of pitch for a specific pitcher.
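The box-and-proportion steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in the project: the dictionary keys mimic Statcast column names, the ball radius is an approximate value, all function names are hypothetical, and `scaled_center` is only one possible reading of the user-chosen scalar (multiplying each average coordinate).

```python
from collections import Counter, defaultdict
from statistics import mean

# Approximate baseball radius in feet (about 1.45 inches); an assumption.
BALL_RADIUS_FT = 1.45 / 12


def average_locations(pitches):
    """Average (plate_x, plate_z) for each pitch type thrown by one pitcher."""
    by_type = defaultdict(list)
    for p in pitches:
        by_type[p["pitch_type"]].append((p["plate_x"], p["plate_z"]))
    return {
        ptype: (mean([x for x, _ in locs]), mean([z for _, z in locs]))
        for ptype, locs in by_type.items()
    }


def scaled_center(center, x_scale=1.0, z_scale=1.0):
    """Move a box center by scaling each coordinate (hypothetical reading)."""
    cx, cz = center
    return (cx * x_scale, cz * z_scale)


def outcome_shares_in_box(pitches, center, half_width=BALL_RADIUS_FT):
    """Proportion of each outcome among pitches inside a square box."""
    cx, cz = center
    counts = Counter(
        p["description"]
        for p in pitches
        if abs(p["plate_x"] - cx) <= half_width
        and abs(p["plate_z"] - cz) <= half_width
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {outcome: n / total for outcome, n in counts.items()}
```

Calling `outcome_shares_in_box` once with a pitch type's average location and once with a `scaled_center` of it would give the two sets of proportions to compare.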