Adan M.P. Blog

Weekly Progress & Updates

Week 6: Training/Validation Work

March 8, 2026

This week, we put the model to the test. Instead of letting the model see all the data at once, I trained it on the training data and completely held out a random set of 10 months to use as a validation set.
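A minimal sketch of that kind of hold-out split, assuming the data sits in rows indexed 0 to n-1. The function name `split_holdout`, the seed, and the row count of 35 are placeholders for illustration, not taken from the actual script:

```python
import numpy as np

def split_holdout(n_rows, n_holdout=10, seed=0):
    """Return (train_idx, val_idx): row positions for training and validation."""
    rng = np.random.default_rng(seed)
    # Draw the hold-out rows without replacement so no month is picked twice.
    val_idx = np.sort(rng.choice(n_rows, size=n_holdout, replace=False))
    # Every row that was not held out is available for training.
    train_idx = np.setdiff1d(np.arange(n_rows), val_idx)
    return train_idx, val_idx

# Placeholder row count; swap in the real number of months.
train_idx, val_idx = split_holdout(n_rows=35, n_holdout=10, seed=42)
print("validation rows:", val_idx)
```

One design point worth keeping in mind: the means and standard deviations used for scaling should come from the training rows only, so that nothing about the hidden months leaks into the model.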

I then fed the trained model the actual milk, sugar, and cocoa values from those 10 hidden months to see what coffee price it would predict. The model uses this formula to calculate the scaled prediction from the weights (the $\beta$ coefficients) it learned during training:

$$Scaled\_Prediction = \beta_0 + (\beta_1 \times Scaled\_Milk) + (\beta_2 \times Scaled\_Sugar) + (\beta_3 \times Scaled\_Cocoa)$$

To translate that statistical prediction back into real money that actually makes sense on our scorecard, it runs this unscaling equation:

$$Final\_Price = (Scaled\_Prediction \times SD_{price}) + Mean_{price}$$
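The two equations above can be sketched in a few lines of Python. Everything numeric here is a made-up placeholder (the means, standard deviations, and $\beta$ values are not the fitted ones), and `predict_price` is a hypothetical helper name:

```python
import numpy as np

def predict_price(raw_inputs, x_mean, x_sd, betas, price_mean, price_sd):
    """Scale raw inputs, apply the linear model, then unscale the result.

    raw_inputs, x_mean, x_sd: values ordered as (milk, sugar, cocoa).
    betas: (beta_0, beta_1, beta_2, beta_3) learned on the scaled data.
    """
    raw_inputs = np.asarray(raw_inputs, dtype=float)
    x_mean = np.asarray(x_mean, dtype=float)
    x_sd = np.asarray(x_sd, dtype=float)
    betas = np.asarray(betas, dtype=float)
    # Standardize each input with the training mean and standard deviation.
    scaled_x = (raw_inputs - x_mean) / x_sd
    # Scaled_Prediction = beta_0 + beta_1*milk + beta_2*sugar + beta_3*cocoa
    scaled_prediction = betas[0] + np.dot(betas[1:], scaled_x)
    # Final price = Scaled_Prediction * SD_price + Mean_price
    return scaled_prediction * price_sd + price_mean

# Placeholder statistics and weights, for illustration only.
x_mean = [4.0, 0.9, 2900.0]
x_sd = [0.1, 0.02, 150.0]
betas = [0.0, 0.4, 0.5, 0.1]
price_mean, price_sd = 3.40, 0.12

price_at_mean = predict_price(x_mean, x_mean, x_sd, betas, price_mean, price_sd)
price_milk_up = predict_price([4.1, 0.9, 2900.0], x_mean, x_sd, betas, price_mean, price_sd)
print(f"{price_at_mean:.3f} {price_milk_up:.3f}")
```

A handy sanity check: when every input equals its training mean and $\beta_0$ is zero, the scaled prediction is zero and the unscaled price is exactly the mean price.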

The Results: I am impressed with how this is coming together day by day. On completely unseen data, the predictions were off by about 5 cents on average, although one month missed by 25 cents. Here is the actual scorecard from my Python terminal:

--- VALIDATION SCORECARD (10 Random Months) ---
Actual Price  Predicted Price  Error ($)
        3.33             3.34       0.01
        3.45             3.40       0.05
        3.50             3.56       0.06
        3.50             3.49       0.01
        3.21             3.21       0.00
        3.50             3.46       0.04
        3.47             3.46       0.01
        3.26             3.30       0.04
        3.59             3.34       0.25
        3.23             3.24       0.01

==================================================
MEAN ABSOLUTE ERROR (MAE): $0.048 per cup
==================================================
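The MAE line can be reproduced directly from the ten rows of the scorecard. A minimal check, with the numbers copied from the table above (the variable names are just for illustration):

```python
actual =    [3.33, 3.45, 3.50, 3.50, 3.21, 3.50, 3.47, 3.26, 3.59, 3.23]
predicted = [3.34, 3.40, 3.56, 3.49, 3.21, 3.46, 3.46, 3.30, 3.34, 3.24]

# Absolute error for each held-out month.
errors = [abs(a - p) for a, p in zip(actual, predicted)]

# Mean absolute error across all ten months.
mae = sum(errors) / len(errors)

# The same average with the single largest miss left out.
worst = max(errors)
mae_without_worst = (sum(errors) - worst) / (len(errors) - 1)

print(f"MAE: ${mae:.3f}")
print(f"Largest single error: ${worst:.2f}")
print(f"MAE without the largest error: ${mae_without_worst:.3f}")
```

The 25-cent miss accounts for a little over half of the total error, so it is worth looking at that month separately before reading too much into the average.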
        

Looking Ahead: Now that the logic has held up on unseen data, Dr. Dunbar and I talked about applying this exact same formula to data going back 50 years. By training the model on a half-century dataset, we are going to try to predict the price of coffee for the upcoming year.

Week 5: Python Model

March 1, 2026

I got a lot of work done this weekend/week, as I finally made the leap into Python! After meeting with my professors, the immediate goal was to build a "Proof of Concept" model that takes actual user inputs and spits out a predicted price.

To do this, I took 35 months of recent data (Feb 2023 to Dec 2025) tracking the actual Median Price of a cup of coffee and merged it with my raw commodity costs. After doing all the heavy statistical lifting and diagnostics in R, I exported the clean dataset and built an interactive terminal simulator in Python.

The Raw Data: Here is a quick look under the hood at the merged dataset right before it gets scaled and fed into the Python script. Each row pairs the target variable (Median Price) with the milk, sugar, and cocoa costs for the same month:

   Median Price   milk  sugar    coco
0          3.00  4.163  0.893  2686.2
1          3.04  4.098  0.887  2744.1
2          3.07  4.042  0.900  2927.5
3          3.09  4.042  0.920  2950.8
4          3.13  3.985  0.940  3100.3
5          3.16  3.971  0.950  3150.0
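To see how the columns move together, a quick correlation check can be run on the six rows shown above. This is only a sketch on the visible sample, not on the full dataset, and the variable names are illustrative:

```python
import numpy as np

# The six rows shown above: Median Price, milk, sugar, coco.
rows = np.array([
    [3.00, 4.163, 0.893, 2686.2],
    [3.04, 4.098, 0.887, 2744.1],
    [3.07, 4.042, 0.900, 2927.5],
    [3.09, 4.042, 0.920, 2950.8],
    [3.13, 3.985, 0.940, 3100.3],
    [3.16, 3.971, 0.950, 3150.0],
])
names = ["Median Price", "milk", "sugar", "coco"]

# rowvar=False treats each column as a variable and each row as an observation.
corr = np.corrcoef(rows, rowvar=False)

# Correlation of the target with each commodity column.
for name, r in zip(names[1:], corr[0, 1:]):
    print(f"corr(Median Price, {name}) = {r:+.3f}")
```

In these six rows milk falls while the cup price rises, so that correlation comes out negative, while sugar and cocoa come out positive. Six rows are far too few to draw conclusions from, which is why the fitted model uses the full dataset.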
        

How it works: The Python script takes user inputs (simulating a 20% market shock going up or down), automatically scales the data behind the scenes to match the regression model, and then "unscales" the prediction back into real, readable dollars. Here are the results of my extreme stress tests:

========================================
   THE COMMODITY SHOCK SIMULATOR
========================================
Type 'up', 'down', or 'same' for each.

Milk price (up/down/same): down
Sugar price (up/down/same): down
Cocoa price (up/down/same): down

----------------------------------------
SCENARIO: Milk down | Sugar down | Cocoa down
PREDICTED MEDIAN CUP PRICE: $2.46
----------------------------------------

Milk price (up/down/same): up
Sugar price (up/down/same): up
Cocoa price (up/down/same): up

----------------------------------------
SCENARIO: Milk up | Sugar up | Cocoa up
PREDICTED MEDIAN CUP PRICE: $4.24
----------------------------------------
        

Report: The outputs look plausible. A 20% drop in all three commodities brings a cup down to $2.46, while a 20% rise in all three pushes it to $4.24. Both results move in the expected direction and stay in a believable range, which suggests the scaling and unscaling steps work as intended.
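The shock step described above can be sketched as follows. This is not the actual simulator: `apply_shock` and `shocked_inputs` are hypothetical helpers, and the baseline values are borrowed from the last row of the raw-data table purely for illustration:

```python
SHOCK = 0.20  # the 20% move described in the post

def apply_shock(baseline, direction, shock=SHOCK):
    """Move a baseline value up or down by `shock`, or leave it for 'same'."""
    multipliers = {"up": 1 + shock, "down": 1 - shock, "same": 1.0}
    if direction not in multipliers:
        raise ValueError(f"direction must be one of {sorted(multipliers)}")
    return baseline * multipliers[direction]

def shocked_inputs(baselines, scenario):
    """Apply one direction per commodity and return the adjusted raw inputs."""
    return {name: apply_shock(baselines[name], scenario[name]) for name in baselines}

# Illustrative baselines (assumption: the real script may use different ones).
baselines = {"milk": 3.971, "sugar": 0.950, "cocoa": 3150.0}

all_down = shocked_inputs(baselines, {"milk": "down", "sugar": "down", "cocoa": "down"})
all_up = shocked_inputs(baselines, {"milk": "up", "sugar": "up", "cocoa": "up"})
print(all_down)
print(all_up)
```

The adjusted raw values would then go through the same scale, predict, and unscale steps as any other input. Rejecting anything other than 'up', 'down', or 'same' keeps a typo at the prompt from silently producing an unshocked prediction.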

Looking Ahead: Now that I have a Python "brain" that successfully handles inputs and outputs, the next step is connecting this script to my PHP dashboard so users can run these scenarios directly on my website instead of in a terminal. Also, I'm looking to expand to US imports from different countries to see how shocks from other countries can affect the price of coffee.

Week 4: Model Diagnostics & The 30-Cup Brew

February 23, 2026

This has been a quieter week, as I am in between stages of my project, but there is still a lot of work ahead. I adjusted my methodology: instead of assuming 40 cups of coffee per pound, I recalculated the "Price Per Cup" (PPC) metric to assume a stronger brew of 30 cups per pound. I then ran a new regression model and, more importantly, put it through rigorous diagnostic testing.
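A minimal sketch of the recalculation, assuming PPC is simply the per-pound price divided by the number of cups brewed from a pound. The $6.00 per pound figure and the function name `price_per_cup` are placeholders, not values from the dataset:

```python
def price_per_cup(price_per_pound, cups_per_pound):
    """Convert a per-pound coffee price into a per-cup cost."""
    return price_per_pound / cups_per_pound

example_price_per_pound = 6.00  # placeholder value for illustration

ppc_40 = price_per_cup(example_price_per_pound, 40)  # weaker brew
ppc_30 = price_per_cup(example_price_per_pound, 30)  # stronger brew

print(f"40 cups/lb: ${ppc_40:.3f} per cup")
print(f"30 cups/lb: ${ppc_30:.3f} per cup")
print(f"ratio: {ppc_30 / ppc_40:.4f}")
```

If PPC really is computed this way, moving from 40 to 30 cups multiplies every PPC value by 4/3. A constant rescaling like that changes the size of the regression coefficients but not R-squared or the t-values.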

The Baseline Results: The model remains incredibly strong. Even with the adjusted price metric, the model explains about 84% of the variance in coffee prices, with Milk and Sugar remaining highly significant.

Call:
lm(formula = PPC ~ milk + sugar + coco + CPI_USA, data = master_clean)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.813e-03  7.866e-03   1.120 0.263309    
milk         1.321e-02  3.364e-03   3.926 0.000104 ***
sugar        2.610e-01  2.042e-02  12.781  < 2e-16 ***
coco         2.948e-07  9.056e-07   0.326 0.744971    
CPI_USA     -2.756e-04  6.300e-05  -4.374  1.6e-05 ***
---
Multiple R-squared:  0.8403,	Adjusted R-squared:  0.8385 
        

Testing: A high R-squared doesn't tell the whole story. I ran a Variance Inflation Factor (VIF) test and a Breusch-Pagan (BP) test to check the model's underlying health.

> vif(model_30)
     milk     sugar      coco   CPI_USA 
 3.172797 15.736831  2.293888 11.276804 

> bptest(model_30)
	studentized Breusch-Pagan test
BP = 55.013, df = 4, p-value = 3.229e-11
        

Why I am dropping CPI:

1. Multicollinearity (VIF): As a rule of thumb, any VIF score over 5 is problematic. Sugar (15.7) and CPI_USA (11.3) are both severely inflated. The two are so highly correlated that the regression cannot cleanly separate their effects, which is likely why the model output a negative coefficient for inflation. My next step is to completely drop CPI_USA from the model.

2. Heteroskedasticity (BP Test): The BP test checks whether the variance of the errors stays constant or changes with the predictors. With a p-value near zero (3.2e-11), the test rejects constant variance, so the reported standard errors and p-values are not fully reliable. I will need to address this, potentially by taking the log of my variables or using robust standard errors.
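To make the VIF point concrete, here is a small sketch of the standard definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the others. The data below is synthetic (a "sugar" series built to track a "cpi" series closely), so the numbers will not match the R output above. The helper `vif` is written from scratch for illustration:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors: sugar is built to be nearly collinear with cpi.
rng = np.random.default_rng(0)
n = 200
cpi = rng.normal(size=n)
sugar = cpi + 0.2 * rng.normal(size=n)
milk = rng.normal(size=n)

vif_with_cpi = vif(np.column_stack([milk, sugar, cpi]))
vif_without_cpi = vif(np.column_stack([milk, sugar]))

print("with cpi    (milk, sugar, cpi):", np.round(vif_with_cpi, 2))
print("without cpi (milk, sugar):     ", np.round(vif_without_cpi, 2))
```

In this toy setup, dropping the collinear column brings the remaining VIF scores back near 1, which is the effect that dropping CPI_USA is meant to have on the real model.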

Looking Ahead: Dropping the CPI and adjusting for these diagnostics is my next step. On the technical side, I am starting to build out a mini dashboard in PHP. Coming up, we have a lab scheduled to get more familiar with running a PHP front-end connected to a Python script on the back-end, which will be the exact architecture I need for my final interactive predictive model.

Week 3: Feature Engineering & Initial Models

February 15, 2026

This week was a major turning point. I moved from "Data Collection" to "Feature Engineering." To solve the problem of CPI being an abstract index, I engineered a new variable: "Price Per Cup" (PPC). By converting raw commodity costs (Price per Pound) into a per-cup metric, I can now model the actual dollar cost of a cup of coffee over the last 30 years.

Figure: Correlation Matrix of Coffee Factors

Current work: I ran my first Correlation Matrix and Linear Regression models. The results were fascinating: I discovered a massive "Inflation Trap" (multicollinearity) when looking at short-term data. However, my long-term model (1990–2026) showed that Milk and Sugar prices are stronger predictors of coffee costs than general inflation. This supports my decision to use the historical dataset over the short-term one.

Looking ahead: Now that I have evidence for the "Cost" side of the equation (Milk/Sugar), next week is about the "Supply" side. I plan to incorporate the harvest data from Brazil and Vietnam into the model to see if global production shocks can explain the remaining variance in price. I also plan to start coding the skeleton of the interactive dashboard.

Week 2: Data Auditing & Creating a Plan

February 8, 2026

Now that I have officially decided on my project, I can go full steam ahead and find as much data as is available on world coffee production, milk, chocolate, and other coffee variables. I have spent most of this week searching for data and downloading CSV files.

Current work: I am currently auditing my data, working out what each column means and how much usable data I actually have. I also have a plan to build an interactive dashboard for my project. It would be an interactive map of the world with trade routes and a slider for predicting different prices.

Looking ahead: For this upcoming week, I want to have my correlation matrix done and start thinking about what model I'm going to use. I also want to clean my website up to make it more coffee-themed.

Week 1: The Pivot to Coffee Data

February 4, 2026

My initial project idea was to track the impact of remote workers on inflation in Mexico City. However, after auditing the data from InsideAirbnb and Indeed, I realized the datasets were too disjointed to build a reliable Time Series model.

The Breakthrough: While reviewing economic data, I discovered a comprehensive 50-year dataset from the USDA and FRED regarding global coffee production. This data is clean, continuous, and allows for a much more rigorous statistical analysis.