I Created a Play-By-Play Dataset for the 2007 College Football Season Because I Couldn't Find One Online
Take a trip into the past with me.
I had the bright idea of delivering an interesting write-up this week to my faithful readers. First, I wanted to pivot to college football for a week because I love writing about both college and professional football. So I thought it would be interesting to compare some quarterbacks from before the advanced stats era to those who have played in the advanced stats era. Or to put it in simpler terms, comparing some BCS-era quarterbacks to some CFP-era quarterbacks.
With this new bright idea, I went into my repo on Monday and was greeted with an interesting error when trying to scrape play-by-play data from the usually trusty cfbfastR package:
This error marked a fork in the road for me. Do I pivot my planned content to something else this week, or do I do everything I can to find play-by-play data from before 2014? I chose the latter. Next week I’ll drop two tutorials: one for if you feel like scraping some old college football data yourself, and one for the sports-almanac style data visualization I’ll show below. Let’s dive right into it.
Analytics Content I Enjoyed This Week
I enjoyed this piece by
about Raiders WR/former Montana State QB Tommy Mellott. Check it out here. Mellott was one of the most interesting prospects I saw for the 2025 NFL Draft. Here’s his comparison scorecard I posted back in April:
I’ve enjoyed
’s continued coverage of the Stanley Cup Finals. Check out his most recent NHL Elo ratings here.
Over the past few weeks I’ve enjoyed
’s QB GOAT series, found here.
Grabbing College Football Play By Play Data From 2007
I will show you the code in a tutorial next week, but here is a rough outline of the steps I took to grab play by play data and calculate some advanced stats on it. Join me in putting on your data engineering hat.
First, I used requests and json in a Python script to grab a JSON file using ESPN’s public API. I used a specific game as an example in order to limit the request size, so I grabbed a game id from an ESPN URL and passed that to the API. By the way, shoutout to ESPN for having one of these. I don’t think there’d be any free advanced NFL stats if it weren’t for Pro Football Reference and ESPN.
With this new .json file for this specific game, I used pandas to impose the schema of the usual college football data that comes from cfbfastR onto this “new” data from 2007. I mapped the json values to this schema and exported this as a csv file.
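To make the two steps above a little more concrete, here is a minimal sketch of what the fetch-and-map script could look like. Treat it as an illustration and not as the actual script: the endpoint URL, the JSON keys (`drives`, `previous`, `plays`, `statYardage`, and so on), and the helper names `fetch_game_summary`, `split_clock`, and `plays_to_frame` are all assumptions on my part, and it only fills a subset of the columns listed below.

```python
import pandas as pd

# ASSUMPTION: the post only says "ESPN's public API"; this endpoint shape is a guess.
SUMMARY_URL = (
    "https://site.api.espn.com/apis/site/v2/sports/football/"
    "college-football/summary"
)


def fetch_game_summary(game_id: str) -> dict:
    """Download the JSON summary for one game (hypothetical helper)."""
    import requests  # imported here so the parsing code below works offline

    resp = requests.get(SUMMARY_URL, params={"event": game_id}, timeout=30)
    resp.raise_for_status()
    return resp.json()


def split_clock(display: str) -> tuple[int, int]:
    """Turn a clock string such as '14:32' into (14, 32)."""
    minutes, _, seconds = display.partition(":")
    return int(minutes or 0), int(seconds or 0)


def plays_to_frame(summary: dict, game_id: str, year: int, week: int) -> pd.DataFrame:
    """Flatten the nested drives/plays JSON into cfbfastR-style columns."""
    rows = []
    game_play_number = 0
    # ASSUMPTION: completed drives live under summary["drives"]["previous"].
    for drive in summary.get("drives", {}).get("previous", []):
        for drive_play_number, play in enumerate(drive.get("plays", []), start=1):
            game_play_number += 1
            start = play.get("start", {})
            clock = play.get("clock", {}).get("displayValue", "")
            minutes, seconds = split_clock(clock)
            rows.append(
                {
                    "year": year,
                    "week": week,
                    "id_play": play.get("id"),
                    "game_id": game_id,
                    "game_play_number": game_play_number,
                    "drive_play_number": drive_play_number,
                    "period": play.get("period", {}).get("number"),
                    "clock.minutes": minutes,
                    "clock.seconds": seconds,
                    "play_type": play.get("type", {}).get("text"),
                    "play_text": play.get("text"),
                    "down": start.get("down"),
                    "distance": start.get("distance"),
                    "yards_to_goal": start.get("yardsToEndzone"),
                    "yards_gained": play.get("statYardage"),
                    "drive_id": drive.get("id"),
                    "drive_result": drive.get("result"),
                }
            )
    return pd.DataFrame(rows)
```

With a game id copied from an ESPN URL, usage would be something like `plays_to_frame(fetch_game_summary(game_id), game_id, 2007, 1).to_csv("game.csv", index=False)`, where the file name is just a placeholder.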
The results of this script left me with a csv of play by play data with the following values filled in:
year
week
id_play
game_id
game_play_number
half_play_number
drive_play_number
pos_team
def_pos_team
pos_team_score
def_pos_team_score
half
period
clock.minutes
clock.seconds
play_type
play_text
down
distance
yards_to_goal
yards_gained
drive_result_detailed
drive_id
drive_result
drive_time_minutes_start
drive_time_seconds_start
drive_time_minutes_end
drive_time_seconds_end
drive_time_minutes_elapsed
drive_time_seconds_elapsed
number_of_drives
season
However, the following fields I’m used to receiving were still missing:
epa
ep_before
ep_after
wpa
wp_before
wp_after
def_wp_before
def_wp_after
home_epa
away_epa
…etc. You get the point: I was missing advanced stats. So where does that leave us? What options do we have if we want to calculate expected points and expected points added? We’ll switch from Python to R for this portion.
The nflfastR package has a function called calculate_expected_points which takes parameters like season, home team, yard line, down, distance, etc. All things that we received from our ESPN data grab. While it’s not exactly perfect to pass college football data into a function designed for NFL data, it can still yield an interesting result. The output gave us an expected points column called ‘ep’, a touchdown probability column, field goal probability column, safety probability column, and opponent touchdown probability column, but for this exercise we’re going to move forward with just expected points.
After this, we can build the ep_before and ep_after columns needed for EPA by applying a SQL-style lead function to the plays ordered by play_id ascending: ep_before is the play’s own expected points, and ep_after is the expected points of the next play. EPA is then ep_after minus ep_before.
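For anyone who thinks better in code, here is a rough pandas sketch of that lead step. It is a Python stand-in for illustration, not the R code I ran. It assumes a data frame that already has an `ep` column plus the `game_id` and `id_play` columns from the list above, and `add_epa` is a made-up helper name.

```python
import pandas as pd


def add_epa(plays: pd.DataFrame) -> pd.DataFrame:
    """Add ep_before, ep_after, and epa using a lead over ordered plays."""
    # Order plays within each game. ASSUMPTION: id_play is numeric, so this
    # sort matches the order in which the plays actually happened.
    out = plays.sort_values(["game_id", "id_play"]).copy()

    # Expected points at the start of the play itself.
    out["ep_before"] = out["ep"]

    # Lead: expected points at the start of the NEXT play in the same game.
    # The last play of each game has no next play, so it gets NaN.
    out["ep_after"] = out.groupby("game_id")["ep"].shift(-1)

    # EPA is the change in expected points across the play.
    out["epa"] = out["ep_after"] - out["ep_before"]
    return out
```

This follows the description above literally. It does not adjust for changes of possession or scoring plays, which a fuller EPA calculation would need to handle.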
I repeated these steps for all 13 games of this team’s 2007 season.
I used this data to calculate EPA per dropback for this 2007 quarterback, then calculated EPA per dropback for every quarterback from the 2024 college football season to build a comparison visualization that puts in context just how good this QB was back then. I used Game On Paper’s EPA/DB stat as my guiding light when calculating.
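As a rough illustration of the EPA per dropback idea, here is a small pandas sketch. The set of play types that count as a dropback is my own guess for this example (it ignores scrambles, for instance), so it will not necessarily match Game On Paper’s definition, and `epa_per_dropback` is a hypothetical helper name. It expects `pos_team`, `play_type`, and `epa` columns.

```python
import pandas as pd

# ASSUMPTION: these play_type labels are illustrative; check them against the
# labels that actually appear in your own data.
DROPBACK_TYPES = [
    "Pass Reception",
    "Pass Incompletion",
    "Passing Touchdown",
    "Pass Interception Return",
    "Sack",
]


def epa_per_dropback(plays: pd.DataFrame, team: str) -> float:
    """Average EPA over the dropback plays run by one offense."""
    is_dropback = plays["play_type"].isin(DROPBACK_TYPES)
    is_team = plays["pos_team"] == team
    # Drop plays with no EPA (for example, the last play of a game).
    dropbacks = plays.loc[is_dropback & is_team].dropna(subset=["epa"])
    return float(dropbacks["epa"].mean())
```

Filtering on `pos_team` treats every dropback by the offense as belonging to one quarterback. For a team that rotates passers, you would also need to pull the passer’s name out of `play_text`.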
I visualized this table using the gt package in R.
So Which Quarterback Did I Choose?
Take a moment to travel back to 2007 with me. Graham Harrell and Michael Crabtree are running the Big 12 at Texas Tech. Tim Tebow and Percy Harvin are tearing up the SEC, but Les Miles and LSU win the National Championship over Jim Tressel’s Ohio State. Current Atlanta Falcons Offensive Coordinator Zac Robinson racks up 3,671 yards of total offense at Oklahoma State. Matt Ryan has Boston College playing great. No one knows what a mortgage-backed security or completion percentage over expected is. “Irreplaceable” by Beyoncé tops the Billboard Hot 100. Instant classic movies like Wild Hogs and Norbit hit theaters this year. Any guesses on who I chose? I would say if ball knowledge is a scale of 1-10, you should know who I’m zeroing in on if you consider yourself to rank at 3 or above.
If you don’t know by now, let’s take a trip to the sandy shores of O’ahu in a faraway land known as Hawai’i.
I calculated EPA per dropback for the University of Hawai’i’s Colt Brennan in the 2007 season.
For the young folks reading this, there used to be this magical and mysterious conference in college football called the WAC (Western Athletic Conference). Founded in 1962, it originally included New Mexico, Arizona State, Arizona, Wyoming, BYU, and Utah. At its height in 1998, the WAC had two divisions and featured teams like Air Force, Colorado State, Rice, TCU, Tulsa, UNLV, SMU, Fresno State, and of course, Hawai’i. The Rainbow Warriors won the WAC in 2007 by going undefeated in conference play on their way to a perfect regular season and a 12-1 final record. The conference that season included the Boise State team that had won the Fiesta Bowl over Oklahoma at the end of the prior season using a hook and ladder play followed by the Statue of Liberty. Please excuse the 240p video, it’s ancient footage. I hope it’s in the National Archives.
There are no words I can type that sufficiently explain the nationwide hypnosis Colt Brennan and the Rainbow Warriors had us mainlanders under. Their games, just like today, sometimes began at 11pm and sometimes at 2am. And everyone stayed up to watch them. It was magical, especially to a 5th grader like me.
Excluding their bowl game, I grabbed every Rainbow Warriors game and calculated Brennan’s EPA per dropback. Here’s how he compared to the top 10 players of 2024 in the same category:
And there you have it! I thought I should model the table after this sports almanac I found in a thrift store a few weeks ago since we’re stepping into the past:
Let me know what you think!