Running Predictor

The Model

I have been recording my runs in Strava for about five years. I wanted to see if I could use this data to make predictions about my racing pace. I downloaded my data from the website, including a spreadsheet collecting all my activity data (I’ve deleted some data from this file for privacy reasons).

I spent some time using visualizations and linear models to determine which variables would provide the most predictive power. I looked into trying to predict both race and non-race paces, and I considered factors including age, distance, elevation gain (both relative and absolute), amount of training, recovery, and season.

While I have much more data on my everyday running than on the small number of races I have run, my everyday pace varies too much for these data to explain. Was I taking it easy or doing a workout? Had I eaten breakfast? How hot was it?

With races, much of this variability no longer applies. In a race, I’m going to try to go as fast as possible and pay attention to proper rest and nutrition, as far as it is under my control.

After trying out various models, I found that the one with the strongest statistical significance and the most explanatory power was also a simple one: it takes into account only the length of the race and the amount of training I did over the previous twelve weeks.

Racing Pace Model

## Call:
## lm(formula = pace ~ Distance + training, data = races)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.537  -7.097   1.153  10.061  20.001 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 342.06054   10.41302  32.849 1.61e-11 ***
## Distance      2.80869    0.32545   8.630 6.02e-06 ***
## training     -0.17157    0.03272  -5.243 0.000377 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 13.32 on 10 degrees of freedom
## Multiple R-squared:  0.8875, Adjusted R-squared:  0.865 
## F-statistic: 39.46 on 2 and 10 DF,  p-value: 1.8e-05

Analysis of Variance Table

## Analysis of Variance Table
## Response: pace
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Distance   1 9122.4  9122.4  51.421 3.029e-05 ***
## training   1 4876.8  4876.8  27.489 0.0003772 ***
## Residuals 10 1774.1   177.4                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Confidence Intervals

##                   2.5 %       97.5 %
## (Intercept) 318.8588806 365.26219053
## Distance      2.0835409   3.53383942
## training     -0.2444886  -0.09866009

This simple model explains about 89% of the variability in my racing pace (adjusted R-squared 0.865), and both predictors are significant at the 0.001 level, though the fit rests on only 13 races. It seems that how much I train for a race does indeed have a direct and measurable impact on my performance. Adjusting for the length of the race, every kilometer I run in the twelve weeks prior to the race is associated with an improvement of 0.17 seconds per kilometer in race pace (with a 95% confidence interval between 0.10 and 0.24 seconds).
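To make the fitted equation concrete, here is a minimal sketch in R that plugs the coefficients from the summary above into the linear formula. It assumes that pace is measured in seconds per kilometer and that both race distance and training volume are in kilometers. The helper name `predict_pace` and the example inputs (a 10 km race after 300 km of training) are illustrative and not taken from the project.

```r
# Coefficients copied from the model summary above.
# Assumption: pace is in seconds per km; Distance and training are in km.
coefs <- c(intercept = 342.06054, distance = 2.80869, training = -0.17157)

# Hypothetical helper: predicted race pace in seconds per km.
predict_pace <- function(distance_km, training_km) {
  unname(coefs["intercept"] +
           coefs["distance"] * distance_km +
           coefs["training"] * training_km)
}

# Illustrative inputs: a 10 km race after 300 km of training.
pace <- predict_pace(10, 300)
finish <- pace * 10  # predicted finish time in seconds

cat(sprintf("Pace: %d:%04.1f per km\n", as.integer(pace %/% 60), pace %% 60))
cat(sprintf("Time: %d:%04.1f\n", as.integer(finish %/% 60), finish %% 60))
# Pace: 5:18.7 per km
# Time: 53:06.8
```

Holding the race distance fixed, raising the training volume from 300 km to 400 km lowers the predicted pace by about 17 seconds per kilometer, which is the training coefficient multiplied by 100.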

The Tool

With this model, it’s a simple task to craft a tool that allows me to predict my race pace based on the length of the race and how much I’ve trained.

In this case the tool is a Shiny app, and for now it is very simple. The user enters the length of the race in either kilometers or miles and the distance trained in the preceding twelve weeks, and the app returns a predicted race time and pace.
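An app along these lines might be sketched as follows. This is not the project's actual source, only a minimal illustration under a few assumptions: the coefficients are hard-coded from the model summary earlier on this page, pace is in seconds per kilometer, training distance is entered in kilometers, and the helper names (`format_time`, `predict_race`) and input IDs are hypothetical.

```r
library(shiny)

# Coefficients copied from the model summary earlier on this page.
coefs <- c(intercept = 342.06054, distance = 2.80869, training = -0.17157)
km_per_mile <- 1.609344

# Hypothetical helper: format a number of seconds as "m:ss".
format_time <- function(seconds) {
  seconds <- round(seconds)
  sprintf("%d:%02d", as.integer(seconds %/% 60), as.integer(seconds %% 60))
}

# Hypothetical helper: predicted pace (s/km) and finish time (s).
predict_race <- function(distance, unit, training_km) {
  distance_km <- if (unit == "mi") distance * km_per_mile else distance
  pace <- unname(coefs["intercept"] + coefs["distance"] * distance_km +
                   coefs["training"] * training_km)
  list(pace = pace, time = pace * distance_km)
}

ui <- fluidPage(
  titlePanel("Race pace predictor (sketch)"),
  numericInput("distance", "Race distance", value = 10, min = 0),
  radioButtons("unit", "Unit", choices = c("Kilometers" = "km", "Miles" = "mi")),
  numericInput("training", "Training in the last 12 weeks (km)",
               value = 300, min = 0),
  textOutput("prediction")
)

server <- function(input, output, session) {
  output$prediction <- renderText({
    req(input$distance, input$training)  # wait until both fields are filled in
    p <- predict_race(input$distance, input$unit, input$training)
    paste0("Predicted time ", format_time(p$time),
           ", pace ", format_time(p$pace), " per km")
  })
}

app <- shinyApp(ui, server)  # launch with runApp(app) in an interactive session
```

Keeping the prediction in a plain function, separate from the reactive code, means the arithmetic can be checked without starting the app.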

The next steps are to enable the app to communicate uncertainty by adding (and perhaps visualizing) confidence intervals.
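As a rough idea of what that could look like, the sketch below attaches an approximate 95% band to a prediction, using only numbers printed in the model summary (the residual standard error of 13.32 on 10 degrees of freedom). It is an approximation: it ignores the uncertainty in the estimated coefficients, so the band is somewhat too narrow. The helper name `rough_interval` and the example inputs are hypothetical, and pace is assumed to be in seconds per kilometer.

```r
# Values copied from the model summary earlier on this page.
coefs <- c(intercept = 342.06054, distance = 2.80869, training = -0.17157)
resid_se <- 13.32  # residual standard error (assumed seconds per km)
resid_df <- 10     # residual degrees of freedom

# Hypothetical helper: point prediction with a rough band around it.
# Approximation: coefficient uncertainty is ignored, so the band is too narrow.
rough_interval <- function(distance_km, training_km, level = 0.95) {
  fit <- unname(coefs["intercept"] + coefs["distance"] * distance_km +
                  coefs["training"] * training_km)
  half_width <- qt(1 - (1 - level) / 2, df = resid_df) * resid_se
  c(fit = fit, lwr = fit - half_width, upr = fit + half_width)
}

# Illustrative inputs: a 10 km race after 300 km of training.
rough_interval(10, 300)

# With the fitted model object at hand (called `fit_lm` here as a placeholder),
# a proper prediction interval would come from predict():
# predict(fit_lm, newdata = data.frame(Distance = 10, training = 300),
#         interval = "prediction", level = 0.95)
```

The half-width works out to roughly 30 seconds per kilometer either side of the prediction, which is wide enough that showing it next to the point estimate seems worthwhile.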

The source code for this project is available on GitHub.

COVID-19 in Metro Atlanta

Early in the pandemic, I became frustrated by the lack of quality visualizations of local COVID-19 data, particularly for Metro Atlanta, where I live, so I set out to create my own. Since then, the situation has improved greatly, but these plots still provide some details and comparisons that I have not seen elsewhere.

The first item I set out to create is an R Markdown document focused on Metro Atlanta that visualizes the distribution of cases and deaths as they change over time. The charts in this document track new cases and deaths in the core Metro Atlanta counties, both in absolute and relative terms. Subsequent charts provide context by tracking new cases and deaths in Metro Atlanta, Georgia, and the United States.
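The New York Times county file reports cumulative counts, so "new" cases have to be derived by differencing within each county, and "relative" figures need a population denominator. The sketch below shows one way to do this in base R. It is not the project's code: the case counts and population figures are made-up placeholders, and a real analysis would more safely group by the `fips` column than by county name.

```r
# Toy data in the shape of the NYT county file (cumulative counts).
# The case counts and populations are made up for illustration only.
covid <- data.frame(
  date   = as.Date(c("2020-07-01", "2020-07-02", "2020-07-03",
                     "2020-07-01", "2020-07-02", "2020-07-03")),
  county = rep(c("Fulton", "DeKalb"), each = 3),
  state  = "Georgia",
  cases  = c(100, 130, 175, 80, 95, 125)
)
population <- c(Fulton = 1000000, DeKalb = 750000)  # placeholder values

# Sort so that differencing runs in date order within each county.
covid <- covid[order(covid$county, covid$date), ]

# Absolute terms: new cases per day are the differences of the cumulative
# counts within each county (the first day has no previous value).
covid$new_cases <- ave(covid$cases, covid$county,
                       FUN = function(x) c(NA, diff(x)))

# Relative terms: new cases per 100,000 residents.
covid$new_per_100k <- unname(
  covid$new_cases / population[as.character(covid$county)] * 1e5
)

print(covid)
```

The same two steps apply to the `deaths` column, and to state or national totals once the county rows are summed by date.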

The second product of this project is a Shiny web app, which visualizes new case and death data for the entire United States, with a focus on county-level data in the context of state and national data. This reactive app allows the user to look at data from any state and county in the United States, subject to the geographic limitations of the original data set.

The source code for this project is available on GitHub.

All data come from the New York Times’ ongoing repository of COVID-19 cases and deaths in the United States.