Blog

The basic Python implementation is coming along well and should be finished in the next couple of days, along with the user tool in Flask. I now have a much clearer idea of how I want data to be stored and which options I want users to have. The next day or two will go toward implementing this in Python, while leaving myself just enough time for debugging and making things look a little nicer. The last hurdle is ensuring that everything is properly explained and documented.

Week 12:

This past week involved settling on parametric models as well as exploring the change detection algorithm. Change detection has been successfully employed and may continue to be tweaked slightly for performance, but for the most part I am now focused on moving everything into Python and building the user tool. Along with page design, I need to determine where model data will be stored when a model is run.

Week 11:

Walkthroughs were this week, which gave me the chance to identify targets that need to be met soon for this project to be successful, as well as to create a clear picture of what I want the deliverables to look like. The primary plan of action is to improve the individual player models for a set list of players by including a player-specific seasonality factor. Once the model is satisfactory, I will start working on the program in Python and importing the dataset.

Week 10:

Happy Easter! I've got a couple of eggscellent developments to discuss, so let's hop right to it! First off, my dataset of 500 players per week is currently limited to 300, as that is all that I can get R to run. Once testing has concluded I may be able to find extra space for R to grab the rest, but for now 300 is still plenty. Struggles with data size have applied to Excel as well, where I am still troubleshooting some functions that I would love to use for EDA but that just won't run across the whole dataset. Basic models have been running successfully on the 300-per-week sample, though results are less than ideal. Improvements should come from adding change detection and accounting for seasonality, as well as from using something that fits the data better than a linear model. Beyond these basic models, I am primarily thinking about panel data methods. I am in the midst of figuring out how to run smoothing techniques on panel data to create what is effectively a model for each individual player, which may improve performance. Regarding change detection, I have identified several players who broke out in this timeframe. Change detection may look something like running the change detection formula when a player breaks into the top 100 or top 50, but only applying it if his average rank has changed significantly.
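To make the rule at the end of that paragraph concrete, here is a minimal sketch in Python. It is not the change detection formula itself. The function name, the 8-week windows, and the 30-rank cutoff standing in for "significantly" are all assumptions chosen for illustration.

```python
from statistics import mean


def breakout_candidate(ranks, top_n=100, window=8, min_shift=30):
    """Flag a possible breakout for one player.

    ranks: weekly ranks, oldest first (a lower number is a better rank).
    top_n, window and min_shift are hypothetical tuning values.
    """
    if len(ranks) < 2 * window:
        return False  # not enough history to compare two windows

    # Trigger: the latest week is inside the top N and the week before was not.
    crossed = ranks[-1] <= top_n < ranks[-2]
    if not crossed:
        return False

    # Only accept the trigger if the average rank has actually moved:
    # compare the most recent window with the window before it.
    baseline = mean(ranks[-2 * window:-window])
    recent = mean(ranks[-window:])
    return baseline - recent >= min_shift


# Made-up rank histories, not real players.
steady_riser = [180, 175, 170, 165, 160, 158, 155, 150,
                140, 135, 128, 120, 115, 110, 104, 98]
hoverer = [103, 101, 104, 102, 105, 101, 103, 102,
           104, 101, 103, 102, 101, 103, 101, 99]

print(breakout_candidate(steady_riser))
print(breakout_candidate(hoverer))
```

The second check is what keeps a player who has been hovering just outside the cutoff from being flagged the first week he slips inside it.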

Week 9:

The full dataset of weekly player rankings for the top 500 from 2009 to the present has been procured, giving us a lot to play around with. I have resumed basic data exploration in hopes of finding specific criteria for breakouts and falloffs to use for change detection. I have also come to the conclusion that without match-level data, seasonality will have to be estimated using the timeframe in which each surface's season is played. This should be reasonable, as tournaments are played at about the same time every year, but we may run into issues. Either way, it will provide some level of information regarding a player's expectations.
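As a rough sketch of that idea, each ranking week could be tagged with a surface based only on its calendar date. The date windows below are approximate assumptions for illustration, not values taken from the dataset, and tournaments on a given surface can also fall outside its main window.

```python
from datetime import date

# Assumed, approximate surface windows as (month, day) bounds.
# Anything outside these windows is treated as hard court.
SURFACE_WINDOWS = [
    ((4, 1), (6, 10), "clay"),
    ((6, 11), (7, 15), "grass"),
]


def surface_for_week(week_start: date) -> str:
    """Return the assumed surface season for a ranking week."""
    key = (week_start.month, week_start.day)
    for start, end, surface in SURFACE_WINDOWS:
        if start <= key <= end:
            return surface
    return "hard"


for d in (date(2015, 1, 19), date(2015, 5, 4), date(2015, 7, 1)):
    print(d.isoformat(), surface_for_week(d))
```

The resulting label can then be used as a categorical seasonality factor when comparing models.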

Weeks 7-8:

SNC spring break was this week, which is why we have yet another multi-week blog post here. Dr. McVey has been kind enough to assist in acquiring a full dataset of player rankings from the ATP website, and we hope to have it very soon. I have been exploring change detection as an additional feature of the rankings analysis program, which may allow us to predict whether a player will continue positive or negative momentum or fall back towards mean-level results. Once the full dataset is acquired, I will begin comparing models with and without factors such as seasonality, and hopefully determine how the data is best explained.

Weeks 5-6:

I was sick for much of this period, but there were still developments in exploring how to fit models to the data we have. The conclusion I came to is that it will be much easier to model the data in a slightly different form, containing both total points for a player and the change in points from the previous period. Getting the data into this form is still in progress and will hopefully be solved shortly. This week also brought a shift to looking primarily at weekly rankings, so that other factors such as seasonality can be considered.
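The reshaped form described above can be sketched in plain Python as follows. The tuple layout, the function name, and the sample values are hypothetical, since the actual column names in the dataset are not given here.

```python
from collections import defaultdict


def add_point_changes(rows):
    """Attach the change in points from the previous period to each row.

    rows: iterable of (player, week, points) tuples, where week is an
    ISO date string so that sorting puts the weeks in time order.
    """
    by_player = defaultdict(list)
    for player, week, points in rows:
        by_player[player].append((week, points))

    out = []
    for player, history in by_player.items():
        history.sort()  # oldest week first
        previous = None
        for week, points in history:
            # The first observed period has no previous value to compare with.
            change = None if previous is None else points - previous
            out.append({"player": player, "week": week,
                        "points": points, "change": change})
            previous = points
    return out


# Made-up sample rows, deliberately out of order.
sample = [
    ("Player A", "2010-01-04", 1000),
    ("Player A", "2010-01-11", 1150),
    ("Player B", "2010-01-04", 800),
    ("Player A", "2010-01-18", 1100),
    ("Player B", "2010-01-11", 800),
]
table = add_point_changes(sample)
for row in table:
    print(row)
```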

Week 4:

Other modeling approaches, including K-nearest neighbors, have emerged as possibilities, though data wrangling has remained a problem. A success from this week was creating a solid timeline broken into smaller goals, which should make staying on track easier. The next step is reshaping the data so that it can be modeled correctly.

Week 3:

Running smoothing techniques on this cross-sectional data has proven to be a challenge, as workarounds such as summarizing average points by age have turned out to be non-intuitive. I have found a possible solution in vector exponential smoothing. Right now, the focus is primarily on the data containing end-of-year rankings rather than weekly rankings, so it will be nice to see what works well and what does not before moving on to more granular data.
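For reference, here is a minimal sketch of simple (single-series) exponential smoothing in Python. This is the basic univariate form, not the vector version mentioned above, and the function name, the smoothing weight `alpha`, and the sample values are assumptions for illustration.

```python
def exponential_smoothing(values, alpha=0.5):
    """Simple exponential smoothing.

    s[0] = x[0]
    s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))  # start from the first observation
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Made-up end-of-year point totals for a single player.
print(exponential_smoothing([10, 20, 30], alpha=0.5))
```

A larger `alpha` tracks recent values more closely, while a smaller one gives a smoother, slower-moving series.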

Week 2:

After meeting with my advisors, I was introduced to possible methods for continued EDA. I also found more data sources that may help in exploring how factors such as age and court surface performance affect player rankings. We discussed techniques that could remove the seasonality element so that smoothing techniques can be applied. I had to do a little bit of wrangling to create the table of top 100 rankings for the entire timeframe, but I should now have a much easier time exploring the data.

Weeks 0-1:

This week started with an unfocused idea for a project and ended with a much clearer sense of where we are headed. A big accomplishment from this week was finding the datasets that will drive this project. Additionally, I have been doing some basic exploratory data analysis, including identifying possibilities for functional form and modeling techniques.