Weekly Progress & Updates
This week was quieter because I am between stages of my project, but there is still a lot of work ahead. I spent time digging deeper into the regression analysis using the metrics we established: milk, sugar, cocoa, General CPI, and Coffee CPI.
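The Results: I ran the multi-factor model on the historical data. Once I added the commodity costs, General CPI was no longer significant at the 5% level (p ≈ 0.067), and cocoa was not significant either (p ≈ 0.18). Milk and Sugar were both highly significant (p < 0.001), and the model as a whole explains about 86% of the variance in PPC (R² = 0.86). This suggests that Milk and Sugar prices, not just general inflation, are the 'Real' drivers of the price changes, which supports my 'Cost of Goods' approach.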
Call:
lm(formula = PPC ~ milk + sugar + coco + CPI_USA, data = matrix_data)

Residuals:
      Min        1Q    Median        3Q       Max
-0.024819 -0.006702 -0.000568  0.005649  0.055078

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.651e-02  9.606e-03  -3.801 0.000185 ***
milk         1.927e-02  3.663e-03   5.260  3.3e-07 ***
sugar        1.805e-01  1.822e-02   9.907  < 2e-16 ***
coco        -1.089e-06  8.014e-07  -1.359 0.175520
CPI_USA     -1.136e-04  6.168e-05  -1.841 0.066915 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Multiple R-squared: 0.8599, Adjusted R-squared: 0.8574
F-statistic: 352.8 on 4 and 230 DF, p-value: < 2.2e-16

Model A (The Long-Term Model):
• Timeline: 1990–2026
• Result: Milk and Sugar are highly significant (p < 0.001). General inflation (CPI_USA) is not significant at the 5% level.
• Insight: Coffee prices track raw ingredient costs more closely than general inflation.
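
As a rough sanity check on the summary above, the reported t values, adjusted R², and F-statistic can be recomputed from the other printed numbers. The sketch below uses only values copied from the output; the observation count of 235 is inferred from the 230 residual degrees of freedom plus the five estimated terms, and small differences from the printed figures come from rounding in the output.

```python
# Values copied from the regression summary: (estimate, std. error, reported t).
coefficients = {
    "(Intercept)": (-3.651e-02, 9.606e-03, -3.801),
    "milk": (1.927e-02, 3.663e-03, 5.260),
    "sugar": (1.805e-01, 1.822e-02, 9.907),
    "coco": (-1.089e-06, 8.014e-07, -1.359),
    "CPI_USA": (-1.136e-04, 6.168e-05, -1.841),
}

r_squared = 0.8599
num_predictors = 4   # milk, sugar, coco, CPI_USA
residual_df = 230    # from "4 and 230 DF"
n_obs = residual_df + num_predictors + 1  # inferred number of observations


def t_value(estimate, std_error):
    """t statistic for a single coefficient."""
    return estimate / std_error


def adjusted_r_squared(r2, n, k):
    """R-squared penalised for the number of predictors k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2, k, df_resid):
    """Overall F statistic of the regression, computed from R-squared."""
    return (r2 / k) / ((1 - r2) / df_resid)


t_values = {term: t_value(est, se) for term, (est, se, _) in coefficients.items()}
adj_r2 = adjusted_r_squared(r_squared, n_obs, num_predictors)
f_stat = f_statistic(r_squared, num_predictors, residual_df)

for term, t in t_values.items():
    print(f"{term:12s} t = {t:7.3f}")
print(f"adjusted R^2 = {adj_r2:.4f}")
print(f"F = {f_stat:.1f}")
```

The recomputed figures should agree with the printed ones to within rounding, which is a quick way to confirm that the table was copied correctly.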
Current Work: Variable Adjustment
While the original Price Per Cup (PPC) metric divided the price per pound by 40 cups, I want to simulate a stronger brew. I am currently recalculating PPC assuming 25 to 30 cups per pound, as a sensitivity analysis, to see how the change affects the regression coefficients and the correlation matrix.
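
A minimal sketch of the recalculation, assuming PPC is simply the price per pound divided by the number of cups per pound. The price series below are hypothetical illustration values, not the project data, and the helper names are made up for this example. Under that assumption, changing the divisor only rescales PPC by a constant, so correlations with other variables should stay the same while a regression slope scales by 40/25.

```python
from math import sqrt


def price_per_cup(price_per_pound, cups_per_pound):
    """Convert a price-per-pound series into a price-per-cup series."""
    return [p / cups_per_pound for p in price_per_pound]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def slope(x, y):
    """Slope of a one-predictor least-squares fit of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov / var_x


# Hypothetical values for illustration only.
coffee_price_per_pound = [1.10, 1.45, 1.30, 1.90, 2.40, 2.10]
sugar_price = [0.20, 0.24, 0.22, 0.31, 0.38, 0.35]

ppc_40 = price_per_cup(coffee_price_per_pound, 40)  # original assumption
ppc_25 = price_per_cup(coffee_price_per_pound, 25)  # stronger brew

r_40 = pearson(sugar_price, ppc_40)
r_25 = pearson(sugar_price, ppc_25)
slope_40 = slope(sugar_price, ppc_40)
slope_25 = slope(sugar_price, ppc_25)

print(f"correlation at 40 cups: {r_40:.4f}, at 25 cups: {r_25:.4f}")
print(f"slope ratio (25 vs 40 cups): {slope_25 / slope_40:.2f}")
```

If the recalculated PPC also changes something other than the divisor (for example, adding per-cup milk or sugar costs), this invariance no longer holds and the correlations would need to be rerun.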
Looking Ahead: I have started to create a mini dashboard in PHP. Coming up, we have a lab scheduled to get more familiar with running a PHP front-end connected to a Python script on the back-end, which will be the exact architecture I need for my final interactive predictive model.
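
One possible shape for the Python side of that architecture is sketched below. It is only an assumption about how the pieces might fit together: the script takes a JSON string of input prices, applies the Model A coefficients copied from the regression summary, and prints a JSON response that a PHP page could read. The function names, the JSON field layout, and the demo input values are hypothetical, and real inputs would need to be in the same units as the training data.

```python
import json
import sys

# Coefficients copied from the regression summary (Model A).
COEFFICIENTS = {
    "milk": 1.927e-02,
    "sugar": 1.805e-01,
    "coco": -1.089e-06,
    "CPI_USA": -1.136e-04,
}
INTERCEPT = -3.651e-02

# Hypothetical request used when no command-line argument is supplied.
DEMO_REQUEST = '{"milk": 3.5, "sugar": 0.4, "coco": 2500, "CPI_USA": 250}'


def predict_ppc(inputs):
    """Apply the fitted linear model to a dict of predictor values."""
    return INTERCEPT + sum(
        coef * float(inputs[name]) for name, coef in COEFFICIENTS.items()
    )


def handle_request(json_text):
    """Parse a JSON request and return a JSON response string."""
    try:
        inputs = json.loads(json_text)
        result = {"ppc": predict_ppc(inputs)}
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        # Report malformed or incomplete requests instead of crashing.
        result = {"error": f"bad request: {exc}"}
    return json.dumps(result)


if __name__ == "__main__":
    payload = sys.argv[1] if len(sys.argv) > 1 else DEMO_REQUEST
    print(handle_request(payload))
```

On the PHP side, one option would be to call a script like this (saved under a hypothetical name such as predict.py) with shell_exec, passing the JSON through escapeshellarg and decoding the printed result with json_decode.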
This week was a major turning point. I moved from "Data Collection" to "Feature Engineering." To solve the problem of CPI being an abstract index, I engineered a new variable: "Price Per Cup" (PPC). By converting raw commodity costs (Price per Pound) into a per-cup metric, I can now model the actual dollar cost of a cup of coffee over the last 30 years.
Current work: I ran my first correlation matrix and linear regression models. The results were fascinating: I discovered a massive "Inflation Trap" (multicollinearity) in the short-term data, where the predictors move together so closely that the model cannot separate their individual effects. However, my long-term model (1990–2026) showed that Milk and Sugar prices are stronger predictors of coffee costs than general inflation. This supports my decision to use the historical dataset over the short-term one.
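
A minimal sketch of how such a multicollinearity check might look: build a pairwise correlation matrix of the predictors and flag pairs that move almost in lockstep. The series below are hypothetical short-window values chosen to show the effect, not the project data, and the 0.9 threshold is an arbitrary assumption. For a model with only two predictors, the variance inflation factor (VIF) reduces to 1 / (1 - r²), which is used here as a simple illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(series):
    """Nested dict of pairwise correlations for a dict of named series."""
    return {
        a: {b: pearson(series[a], series[b]) for b in series}
        for a in series
    }


def flag_collinear_pairs(corr, threshold=0.9):
    """List each predictor pair whose absolute correlation exceeds the threshold."""
    names = list(corr)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(corr[a][b]) > threshold
    ]


# Hypothetical short-window values for illustration only.
predictors = {
    "milk": [3.50, 3.53, 3.58, 3.60, 3.65, 3.69],
    "sugar": [0.40, 0.36, 0.43, 0.38, 0.41, 0.37],
    "CPI_USA": [250, 252, 255, 257, 260, 263],
}

corr = correlation_matrix(predictors)
flagged = flag_collinear_pairs(corr)

# Two-predictor VIF for the milk / CPI_USA pair.
r = corr["milk"]["CPI_USA"]
vif_two_predictor = 1 / (1 - r ** 2)

print("flagged pairs:", flagged)
print(f"two-predictor VIF for milk vs CPI_USA: {vif_two_predictor:.1f}")
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a warning sign. With more than two predictors, the VIF for each variable comes from regressing it on all the others, which this sketch does not do.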
Looking ahead: Now that I have established the "Cost" side of the equation (Milk/Sugar), next week is about the "Supply" side. I plan to incorporate the harvest data from Brazil and Vietnam into the model to see whether global production shocks can explain the remaining variance in price. I also plan to start coding the skeleton of the interactive dashboard.
Now that I have officially decided on my project, I can go full steam ahead and find as much data as possible on world coffee production, milk, chocolate, and other coffee-related variables. I have spent most of this week searching for data and downloading CSV files.
Current work: I am currently auditing my data, working out what each column contains and how much usable data I actually have. I also plan to build an interactive dashboard for my project: an interactive map of the world with trade routes and a slider for predicting different prices.
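
A first-pass audit of a downloaded CSV file could look something like the sketch below, which lists the columns, counts the rows, and counts empty cells per column. The column names and sample rows are hypothetical placeholders, not the real USDA or FRED files.

```python
import csv
import io


def audit_csv(handle):
    """Return the column names, row count, and empty-cell count per column."""
    reader = csv.DictReader(handle)
    columns = list(reader.fieldnames or [])
    row_count = 0
    missing = {name: 0 for name in columns}
    for row in reader:
        row_count += 1
        for name in columns:
            value = row.get(name)
            # Short rows yield None; blank cells yield an empty string.
            if value is None or value.strip() == "":
                missing[name] += 1
    return {"columns": columns, "rows": row_count, "missing": missing}


# Hypothetical sample standing in for a downloaded file.
SAMPLE = """year,country,production_bags
1990,Brazil,27000
1991,Brazil,
1992,Vietnam,2300
"""

report = audit_csv(io.StringIO(SAMPLE))
print(report)
```

For a real file, the same function can be given an open file handle, for example one created with open(path, newline="").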
Looking ahead: For this upcoming week, I want to have my correlation matrix done and start thinking about what model I'm going to use. I also want to clean my website up to make it more coffee-themed.
My initial project idea was to track the impact of remote workers on inflation in Mexico City. However, after auditing the data from InsideAirbnb and Indeed, I realized the datasets were too disjointed to build a reliable Time Series model.
The Breakthrough: While reviewing economic data, I discovered a comprehensive 50-year dataset from the USDA and FRED regarding global coffee production. This data is clean, continuous, and allows for a much more rigorous statistical analysis.