
How ESPN's NFL Football Power Index was developed

Over the past few years, ESPN's production analytics team has developed a full portfolio of metrics intended to rate players, teams and decisions. At times, fans have expressed an interest in learning more about the process and back-end details of some of the metrics, which include ESPN's Total Quarterback Rating (QBR), College Football Power Index (FPI), Soccer Power Index (SPI), Basketball Power Index (BPI), win probability models and team efficiency ratings, among others.

With the release of ESPN's NFL Football Power Index (FPI), the production analytics team would like to open the "black box" and provide insight into the process of creating a metric like this one.

With this in mind, three members of the analytics team who worked on this project -- Brian Burke, Zach Bradshaw and Alok Pattani -- have agreed to discuss what went into developing FPI. Burke, Bradshaw and Pattani come from varying backgrounds, but each is an expert in the field of sports analytics.

What is something fans should know about the process of creating a power ranking like NFL FPI?

BB: When developing a power ranking such as NFL FPI, we try to make the best use of all available information. Millions of data points dating back to 2006 are used to learn from the past to predict the future. This rigorous process requires a thorough exploration of the available information and the testing of many hypotheses.

AP: Along those same lines, the process is extremely objective and data-driven. We all have favorite teams and such, but the goal isn't to make our teams or any other team look "better" or "worse" -- the goal is to be right as much as we can. We rely entirely on data and the power of statistical analysis to aid us in the process.

How long does creating a metric like NFL FPI generally take?

ZB: It takes at least six weeks to fully experiment and develop underlying models, but that time doesn't include all the experience and research that went into developing many of the models that NFL FPI builds upon.

AP: A lot of work also has to be done to get things ready for analysis, such as ensuring the accuracy of our play-by-play data. Beyond generating FPI, programming in the logic to determine division rankings, playoff seeding and tiebreakers (NFL tiebreaker logic is extremely complicated) takes time. Finally, integrating all the pieces together and setting it up to run automatically backward and forward has its own set of challenges and associated work that takes a couple of weeks.

Let's talk more about the actual metric. You decided to use expected points added (EPA) per play as the basis of FPI. Why?

ZB: EPA is at the foundation of most advanced football analysis. Unlike other sports such as basketball, scoring points in the NFL is the cumulative result of many previous plays. EPA measures how much each individual play contributes to the scoreboard. Important factors that contribute to winning but are not measured in the box score are reflected in EPA. Lastly, evaluating performance on a per-play basis controls for the pace of play, leading to a better measure of performance.
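To make the idea concrete, here is a minimal sketch of EPA per play. It assumes each play already carries an expected-points value before and after the snap, from the offense's point of view. The `Play` class, the field names and the sample numbers are all hypothetical; the article does not describe how ESPN computes expected points.

```python
from dataclasses import dataclass


@dataclass
class Play:
    offense: str      # team with the ball (hypothetical label)
    ep_before: float  # assumed expected points before the snap
    ep_after: float   # assumed expected points after the play, offense's view


def epa(play: Play) -> float:
    """Expected points added: the change in expected points on one play."""
    return play.ep_after - play.ep_before


def epa_per_play(plays: list[Play], team: str) -> float:
    """Average EPA over a team's offensive plays; 0.0 if it ran none."""
    team_plays = [p for p in plays if p.offense == team]
    if not team_plays:
        return 0.0
    return sum(epa(p) for p in team_plays) / len(team_plays)


# Made-up plays for illustration only.
plays = [
    Play("A", 1.0, 2.5),  # solid gain
    Play("A", 2.5, 2.0),  # small loss
    Play("B", 0.5, 1.5),
    Play("A", 2.0, 7.0),  # touchdown
]

print(epa_per_play(plays, "A"))
print(epa_per_play(plays, "B"))
```

Dividing by the number of plays is what controls for pace: a team that runs 75 plays is not rewarded over a team that runs 55 simply for having more chances.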

What was something that you unexpectedly found mattered (or mattered more than anticipated) in the preseason ratings or game projections?

ZB: As I expected, performance from past seasons is a strong indicator of performance in subsequent seasons. However, to my surprise, offensive performance tends to be noticeably more consistent from season to season than defensive performance. The NFL is a QB-driven league, which contributes to this trend.

What was something that you unexpectedly found didn't matter (or mattered less than anticipated) in the preseason ratings or game projections?

AP: This has more to do with the season projections, but compared to college football, strength of schedule matters a lot less than people might think in the NFL. From an average NFL team's perspective, the difference between the easiest schedule and the hardest schedule, in terms of projected win total before the season, is about one win across the 16-game regular season. In college football, in comparison, the difference between the extremes is about five or six projected wins for an average team over a 12- to 13-game schedule -- a massive difference.

What was the hardest part of creating this specific metric?

BB: The most complex and challenging aspect is the QB injury factor. As everyone knows, or should know, the NFL is a QB-driven league, so when a starting QB is injured, his team's ratings and game probabilities will often change dramatically. Determining when a quarterback is going to play or has played (needed for accurate opponent adjustments), as well as the strength of the team's backup, is a challenge in terms of creating an accurate model and managing the complexity involved.

So the QB portion appears to be a unique aspect of NFL FPI. Can you go into more detail about how you determine how much a QB is worth, compared to his backup?

ZB: Quarterback impact is measured using a similar approach to Total QBR, with a few differences. The biggest difference from QBR is the incorporation of an aging curve. Young quarterbacks tend to improve with time and the performance of older quarterbacks starts to decline. Given a quarterback's age in the current season and age/efficiency in past seasons, we can estimate a quarterback's expected impact. To obtain a more predictive than descriptive quarterback metric, the effect of extremely lucky and unlucky plays is limited, and there is no down-weighting for when the game is less close as there is in QBR. Players without previous experience are set at replacement level.
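One way to picture an aging-curve projection is sketched below. This is not ESPN's model: the quadratic curve, the peak age, the recency weights and the replacement-level value are all assumptions chosen for illustration. The sketch strips the assumed age effect out of each past season, averages what is left with more weight on recent seasons, then adds back the age effect for the current season.

```python
REPLACEMENT_LEVEL = -1.0  # assumed impact for a QB with no prior experience
PEAK_AGE = 28             # assumed age at which the curve peaks
CURVE_STEEPNESS = 0.02    # assumed penalty per squared year away from the peak


def aging_curve(age: int) -> float:
    """Hypothetical age effect: zero at the peak, negative on either side."""
    return -CURVE_STEEPNESS * (age - PEAK_AGE) ** 2


def projected_qb_impact(current_age: int, past_seasons: list[tuple[int, float]]) -> float:
    """Project impact from (age, efficiency) pairs, oldest season first."""
    if not past_seasons:
        # No track record: fall back to replacement level, as the article describes.
        return REPLACEMENT_LEVEL
    total = 0.0
    weight_sum = 0.0
    for i, (age, efficiency) in enumerate(past_seasons):
        weight = i + 1  # assumed recency weighting: later seasons count more
        total += weight * (efficiency - aging_curve(age))  # remove the age effect
        weight_sum += weight
    age_neutral = total / weight_sum
    return age_neutral + aging_curve(current_age)  # re-apply age for this season


print(projected_qb_impact(24, [(23, 1.0)]))  # young QB: projected above last season
print(projected_qb_impact(37, [(36, 1.0)]))  # older QB: projected below last season
print(projected_qb_impact(22, []))           # rookie: replacement level
```

Under these assumed numbers, two quarterbacks with identical efficiency last season get different projections purely because of where they sit on the curve.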

Many fans are asking for a specific formula. Why is this so difficult to provide?

BB: As with QBR, it's not as if there is one equation -- two times passing yards plus one third times the number of touchbacks divided by pi minus -- that's not how these kinds of models work. There are so many layers to the analysis, from calculating play-level EPA to the team adjustments to the injury factors and beyond. It's not a single regression model, either, where we could list all the variables and coefficients in the model.

But we definitely don't want this to be a black box because there's nothing to hide and we want fans to trust the analysis. At its heart the system is based on team EPA per play, adjusted for "trash time" and opponent strength. There's a QB factor that goes into it when there's a backup QB. Home field, of course, is a major factor as well.
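As a rough illustration of how those pieces could fit together for a single game, the sketch below combines two team ratings, an optional backup-QB adjustment and a home-field bonus into a projected margin, then converts that margin into a win probability with a normal distribution. Every number and function name here is an assumption; the article does not give FPI's rating scale, its home-field value or the way it turns ratings into probabilities.

```python
import math

HOME_FIELD_POINTS = 2.5  # assumed home-field advantage, in points
MARGIN_SD = 13.5         # assumed standard deviation of the final margin


def projected_margin(home_rating: float, away_rating: float,
                     home_qb_adj: float = 0.0, away_qb_adj: float = 0.0) -> float:
    """Projected home margin; ratings are assumed to be points vs. an average team.

    A QB adjustment is 0.0 when the starter plays and negative when a
    weaker backup is expected to start (hypothetical convention).
    """
    home_strength = home_rating + home_qb_adj
    away_strength = away_rating + away_qb_adj
    return home_strength - away_strength + HOME_FIELD_POINTS


def home_win_probability(margin: float) -> float:
    """Normal CDF of the projected margin: the chance the home team wins."""
    return 0.5 * (1.0 + math.erf(margin / (MARGIN_SD * math.sqrt(2.0))))


full_strength = projected_margin(3.0, 1.0)
backup_for_away = projected_margin(3.0, 1.0, away_qb_adj=-4.0)

print(full_strength, home_win_probability(full_strength))
print(backup_for_away, home_win_probability(backup_for_away))
```

The second call shows why the QB factor matters so much: a four-point drop at quarterback for the road team moves the projected margin by the same four points and shifts the win probability with it.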

Is there anything else an average fan should know?

AP: The ratings and projections consider every play of every game, look forward, and will update. If you think we rated some team horribly (one way or the other), and that team plays more like you expect, the model will learn quickly, particularly early in the season.

BB: ESPN's models are completely nonarbitrary. In other words, we're not deciding what to include in the model or how much to weight each input based on our own opinions about how football works. We let the data speak for itself. For advanced statistically minded folks, I'd like to reassure them about how rigorous the model is under the hood. It was built with cross-validated data sets and fully respects the uncertainties and covariance among the inputs.

ZB: A team's FPI rating is just a means to an end. The ultimate goal of FPI is to correctly predict games and season outcomes such as the chance to make the playoffs or win the Super Bowl. Some teams are stronger than their records (lots of close losses) and others have favorable schedules, which are reflected in the game- and season-level projections.
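Season-level projections of this kind are commonly produced by simulation, and a minimal sketch of that idea follows. It assumes a list of per-game win probabilities is already available and treats games as independent. It uses a simple win threshold as a stand-in for making the playoffs, because real playoff odds would require simulating every team along with the NFL's division, seeding and tiebreaker rules. The function names, the seed and the sample schedule are hypothetical.

```python
import random


def simulate_win_totals(win_probs: list[float], n_sims: int = 10_000,
                        seed: int = 0) -> list[int]:
    """Simulate a season n_sims times; return the win total from each run."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    totals = []
    for _ in range(n_sims):
        wins = sum(1 for p in win_probs if rng.random() < p)
        totals.append(wins)
    return totals


def chance_of_at_least(totals: list[int], threshold: int) -> float:
    """Share of simulated seasons with at least `threshold` wins."""
    return sum(1 for t in totals if t >= threshold) / len(totals)


# Made-up schedule: an average team that is a coin flip in all 16 games.
schedule = [0.5] * 16
totals = simulate_win_totals(schedule)
mean_wins = sum(totals) / len(totals)

print(mean_wins)
print(chance_of_at_least(totals, 10))
```

Swapping in a schedule with higher or lower per-game probabilities shows how a favorable schedule feeds through to the season-level numbers even when the team's own rating is unchanged.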