In late March, the Diamondbacks named Robbie Ray their fifth starter. As is the case with most fifth starters, Ray's outlook was uncertain. He might scuffle along close to the line known by statheads as "replacement level," eating innings but allowing roughly as many runs as a scrubby Triple-A understudy would. He might carry forward the gains he made in 2015 and measure out as an average or better major league starter. Or he might continue adding velocity and tweaking his repertoire and turn into something close to an ace. With baseball, who can say until it has happened?
With Ray, though, it's hard to say even after it has happened. By wins above replacement, he was worth a pitiful 0.7 wins last year, worse than Archie Bradley (the pitcher Ray beat out for the fifth starter's job) and Chase Anderson (the pitcher Arizona traded the previous winter to help make room for Ray). On the other hand, by wins above replacement, he was worth a robust 3.0 wins last year, tied with Cole Hamels and Marco Estrada. On a third hand, by wins above replacement, he was worth 4.8 wins last year, better than Johnny Cueto, better than Kyle Hendricks, better than all but 14 pitchers in the majors.
Welcome to baseball's "Two Watches" era, in which despite having more advanced data than ever, it sometimes feels as if we know less for certain than ever. (Or, as the saying goes: A man with one watch always knows what time it is, while a man with two can never be sure.) That first WAR comes from Baseball Reference, the second from FanGraphs and the third from Baseball Prospectus. All three models are held in high esteem and sit in the top tier of baseball's publicly available metrics. I use all three because all are much smarter than I am. All three promise to get us closer to truth than our barroom arguments ever could. In most cases, all will agree about the general merits of a player. But every few dozen players, Robbie Ray happens.
Through Ray -- perhaps the most extreme WAR-splitter last season -- we have a lens into each of the three frameworks for measuring player value. Through the different frameworks, we have a lens into Ray, a pitcher you either can ignore entirely or absolutely must know about this minute, depending on the URL.
The first major decision a WAR model needs to make is whether it's going to measure the pitcher's skill or the pitcher's results. Baseball Reference's WAR, or bWAR, focuses on the results. It starts with an estimate of how many runs an average pitcher would have allowed under similar circumstances to Ray's -- the typical offensive output of his opponents, adjusted for the ballparks he pitched in, the role he was used in and the defense that backs him -- and compares that figure to the number of runs Ray allowed. In Ray's case, bWAR sees a National League starter pitching in a hitter's park in front of a below-average defense; it expects a pitcher in his spot to allow 5.01 runs (earned and unearned) per nine innings. Ray allowed 5.42 runs per nine, which means that over the course of 174 innings, he allowed eight more runs than an average pitcher. He was, in other words, below average. Simple.
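The arithmetic in that last step is short enough to check by hand. Here is a minimal sketch of it, using only the figures quoted above; the function name is made up for illustration, and it covers only the runs-versus-average step, not Baseball Reference's full conversion from runs to wins above replacement.

```python
def runs_vs_average(ra9: float, expected_ra9: float, innings: float) -> float:
    """Runs allowed beyond what an average pitcher in the same context
    would be expected to allow over the same number of innings."""
    return (ra9 - expected_ra9) * innings / 9.0


# Figures quoted in the article: 5.42 allowed vs. 5.01 expected, over 174 innings.
extra_runs = runs_vs_average(ra9=5.42, expected_ra9=5.01, innings=174)
print(round(extra_runs, 1))  # 7.9, i.e. roughly eight runs worse than average
```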
But is it? Take his start against the Cincinnati Reds on July 23. Ray struck out 10 batters in five innings and walked only one, a sensational ratio. Rarely does a "bad" pitcher strike out two batters per inning with good control, but Baseball Reference's WAR would have marked him as "bad" that game because he allowed six runs. Three were unearned, thanks to a first-inning error by the third baseman. Three were earned, and those came in an inning that went like this:
Single
Single
Home run
Groundout
Lineout
Strikeout
That's a bad inning, to be sure. But rearrange the sequence so it goes like this:
Groundout
Lineout
Home run
Single
Single
Strikeout
That way, Ray would have allowed only one run. Same basic pitcher, same six batters -- two singles, a home run, two outs in play and a strikeout -- but only one version's WAR is sunk.
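To see how much the order matters, the two sequences above can be run through a toy inning scorer. This is only a sketch under simplifying assumptions that are mine, not the article's: every single moves each runner up exactly one base, outs never advance a runner, and the inning stops at the third out.

```python
OUTS = {"groundout", "lineout", "strikeout"}


def runs_in_inning(events):
    """Count runs for a sequence of plate appearances (toy base-running rules)."""
    bases = [False, False, False]  # runner on first, second, third
    outs = 0
    runs = 0
    for event in events:
        if event == "single":
            # Assumption: every runner advances exactly one base on a single.
            if bases[2]:
                runs += 1
            bases = [True, bases[0], bases[1]]
        elif event == "home run":
            runs += 1 + sum(bases)  # the batter plus everyone on base
            bases = [False, False, False]
        elif event in OUTS:
            outs += 1
            if outs == 3:
                break
        else:
            raise ValueError(f"unknown event: {event}")
    return runs


actual = ["single", "single", "home run", "groundout", "lineout", "strikeout"]
rearranged = ["groundout", "lineout", "home run", "single", "single", "strikeout"]

print(runs_in_inning(actual))      # 3
print(runs_in_inning(rearranged))  # 1
```

The six events are identical in both lists; only the order changes, and with it two of the three runs.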
FanGraphs' WAR (or fWAR) doesn't care how many runs Ray allows. It cares how well he threw on a per-batter basis, and it focuses its attention on the three "fielding independent pitching" outcomes -- strikeouts, walks and home runs -- for which a pitcher has the most individual responsibility. A FIP-based WAR assumes a league-average batting average on balls in play (BABIP) for every pitcher because BABIPs tend to regress toward that league average and because exceptions are often more about luck or defense than the pitcher's skill. (It also adjusts for ballpark and other contextual factors.)
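For the curious, here is a minimal sketch of the FIP idea, using the commonly published form of the formula. The article doesn't spell the formula out, so treat the weights, the league constant (set to an illustrative 3.10) and both stat lines as assumptions for demonstration; they are not Ray's numbers, and fWAR layers park and other adjustments on top that aren't modeled here.

```python
from dataclasses import dataclass


@dataclass
class PitchingLine:
    innings: float
    strikeouts: int
    walks: int
    hit_by_pitch: int
    home_runs: int
    hits: int  # ignored by FIP
    runs: int  # ignored by FIP


def fip(line: PitchingLine, league_constant: float = 3.10) -> float:
    """Fielding independent pitching in its commonly published form.
    league_constant is a placeholder; the real value varies by season."""
    numerator = (
        13 * line.home_runs
        + 3 * (line.walks + line.hit_by_pitch)
        - 2 * line.strikeouts
    )
    return numerator / line.innings + league_constant


# Two hypothetical seasons: identical strikeouts, walks and home runs,
# but very different results on balls in play.
unlucky = PitchingLine(innings=180.0, strikeouts=200, walks=60, hit_by_pitch=5,
                       home_runs=20, hits=190, runs=105)
lucky = PitchingLine(innings=180.0, strikeouts=200, walks=60, hit_by_pitch=5,
                     home_runs=20, hits=150, runs=70)

print(round(fip(unlucky), 2), round(fip(lucky), 2))  # 3.41 3.41
```

Hits and runs never enter the calculation, which is the whole point: a FIP-based model treats the two seasons as the same pitcher.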
In the case of July 23 against the Reds, Ray's fWAR would be the same, no matter the order in which the third inning played out. It would be almost exactly the same, regardless of whether the third baseman botched the play in the first.
The question is whether Ray deserves to be treated as "normal" on those things FanGraphs WAR strips out. In his career, he has allowed a BABIP nearly 50 points higher than the league average -- and the highest BABIP in baseball during that time (minimum 250 innings). He has been significantly worse with runners on base, which means he has been more prone to sequences like the one above that lead to runs. Either Ray is still flushing the bad luck out of his system -- a reasonable, if uncertain, assumption -- or he has individual deficiencies that FanGraphs WAR doesn't pick up.
Each of these metrics is complicated enough, but they are, essentially, arithmetic. A fan familiar with each model would know intuitively why one likes Ray and the other doesn't and where the discrepancies lie. But the third, Baseball Prospectus' WARP, based on a stat called deserved run average, is the relative black box of the trio, as it uses a method called mixed modeling that accounts for both fixed and random effects. DRA-based WARP accounts for every factor that can be measured, from the catcher's framing, blocking and throwing ability to the size of each umpire's typical strike zone to the baserunning skills of the pitcher's opponents to the weather to each defender's abilities. It puts heavy emphasis on those things a pitcher controls, then attempts to determine the pitcher's share of responsibility for everything else. According to tests run by its creator, Jonathan Judge, it's the most predictive pitcher-evaluation model. It's also beyond the average fan's ability to dive into. Why does WARP like Robbie Ray? For most of us, the answer stalls at "because WARP likes Robbie Ray."
ERA: 4.90
FIP: 3.80
DRA: 2.95
I can easily accept any one of these three assessments of Robbie Ray, but the human brain has a hard time accepting all three at once.
Who is Robbie Ray, really? In 2010, he was drafted in the 12th round by the Washington Nationals. That doesn't do justice to his talent at the time: He signed for nearly $800,000, about what a low first-round pick might have expected. He was a prospect but not a top prospect, ranking near the bottom of the Nationals' top 10 prospects before 2012. Kevin Goldstein, then at Baseball Prospectus, wrote that he had "average velocity" and a "usable" breaking ball with a changeup that "flashes plus" and "loose, easy arm action that allows him to maintain his stuff deep into games." His upside was limited by a fairly slight frame and by unreliable command and control. Goldstein's "perfect-world" projection: a No. 3 starter.
Two years later, he was traded to Detroit for Doug Fister, a move that was (from the Tigers' perspective) universally panned. Ray was "a solid, if unspectacular, prospect who most likely profiles best as a back-end starter in a big league rotation," according to Baseball Prospectus. He worked in the low 90s, with a curveball and a changeup that were still works in progress. The Tigers GM at the time, Dave Dombrowski, was considerably higher on him, calling him "a guy that we think is a No. 1 left-hander." A few months later, Dombrowski said Ray projected to be "at least" a No. 3 starter. But a year later, after a lackluster Triple-A season and a disastrous first showing in the majors, Ray was traded again, this time to the Diamondbacks.
That's when everything started to change. Ray's average fastball went from 92.4 mph in 2014 to 94.4 mph in 2015 -- the league's biggest jump that year -- and to 95.3 mph in 2016. Velocity isn't destiny, but Ray's leap was striking and unexpected. He now throws harder than Clayton Kershaw, than David Price, than every left-handed starting pitcher except James Paxton. He all but junked the perpetually-in-progress changeup in 2016, leaning more on a slider that had the eighth-best whiff rate among starters.
The results were staggering. Two years ago, Ray struck out fewer than seven batters per nine innings in Triple-A. In 2016, he struck out 11.3 batters per nine innings in the majors. Only 10 other starting pitchers in history struck out more batters per nine innings in a season, and there isn't a fluke in the bunch. Ray still walks too many batters, but not prohibitively so. He slotted in between Jake Arrieta and Yordano Ventura last year, with 3.7 walks per nine innings. His ground ball rate is above average.
Ray also allowed more runs last year than all but three National League pitchers. He averaged 5.4 innings per start, worse than every starting pitcher in the majors but Tom Koehler. His average exit velocity was the eighth-highest among starters last year. His team is 20-41 when he starts, a winning percentage only slightly better than that of the 2013 Astros. His record, for what that's worth, was 8-15 in 2016 and is 14-31 in his major league career. No active pitcher with at least 60 career starts has a worse winning percentage.
You can see, then, how a fan with eyes on every Diamondbacks game or with access to Ray's basic stat page or even with a prime seat between two scouts would come to some wildly different conclusions about how he has pitched. Usually, WAR, which smooths out any number of factors that would mislead that fan -- run support, ballpark, quality of his team, etc. -- is our great hope for figuring it out. Instead, WAR is almost as baffled as the rest of us.
The Man Who Has Two Watches problem has its own academic-sounding name: Segal's Law. The law is, most likely, intended to be taken sarcastically. (Its origin is apparently a San Diego newspaper that was mocking watch retailers.)
We know that the man with one watch doesn't actually know what time it is; that's the sarcasm. He is certain of what time it is, but his one watch is as fallible as two watches would be. We can intuit that, in a world of flawed watches, an average of many watches would usually be closer to the true time than a single watch. Thus, having two watches -- or two or three models of WAR -- is actually a defense against certainty in an uncertain world. To see the brokenness of our watches requires constant reminding, and a second watch serves as that reminder.
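That intuition about averaging can be checked with a small simulation. This is a sketch with made-up parameters: each watch's error is modeled as normally distributed noise plus an optional error shared by every watch, and the true time is set to zero. The shared-error case is included because it is the one situation in which averaging cannot help.

```python
import random
import statistics


def average_reading_error(n_watches, shared_bias, noise_sd, trials=10_000, seed=0):
    """Mean absolute error of the average of n_watches readings.
    The true time is 0; every watch is off by shared_bias plus its own noise."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        readings = [shared_bias + rng.gauss(0, noise_sd) for _ in range(n_watches)]
        errors.append(abs(statistics.fmean(readings)))
    return statistics.fmean(errors)


# Hypothetical settings: independent noise only, then noise plus a shared error.
one_watch = average_reading_error(1, shared_bias=0.0, noise_sd=1.0)
three_watches = average_reading_error(3, shared_bias=0.0, noise_sd=1.0)
three_biased = average_reading_error(3, shared_bias=2.0, noise_sd=1.0)

print(f"one watch:             {one_watch:.2f}")
print(f"three watches:         {three_watches:.2f}")
print(f"three, shared error:   {three_biased:.2f}")
```

With independent errors, the three-watch average lands closer to the truth than a single watch. When all three share the same error, the average inherits it in full.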
The three-WAR model, then, should be -- and probably is -- a strength of modern analytics. We know a lot of inconclusive facts about Robbie Ray, but the most important fact is that baseball buries many truths under near-infinite layers of ambiguity. Ray is a player whose performance can be perfectly described but never truly understood. With one WAR, we'd forget that. But a man with three WARs can never get too sure.
There's also a way the three WARs can circle back and lead us astray. It's not in the case of Ray but in the cases of the overwhelming majority of players, whose WARs are pretty much in alignment. It's in the cases where, as with the polling in the recent election, a series of shared assumptions led to a polling error that was consistent across nearly all models. As the author of the Segal's Law Wikipedia page writes, "One might even think that it is better (to have two watches) since if the two watches are in approximate agreement one might assume that both are working and an average of them will yield the correct time to within some accuracy depending on the specification of the timepieces."
"One might assume." One might assume that if Martin Perez's WARs are 1.8, 1.8 and 1.9 (as they were in 2016), Martin Perez is a 1.8- or 1.9-win pitcher. But we've already seen, with Ray, that one model of WAR can be wildly misleading. We've seen, in fact, that two can be. It's not much of a leap to think even three could be, all at once, and in all the same ways, especially for unique players such as Zach Britton or Yadier Molina or David Ross or for players on the extremes.
Last week, Justin Verlander finished second in the AL Cy Young voting, as he was left off two five-person ballots entirely. I would've voted for him. His WARs were fairly tightly bunched, all at or near the top of his league: 6.8, 6.6 and 5.2 wins. The internet did what it does: It went after the two voters, tweeting at them that they were classless, clueless, hacks, morons, should have their voting privileges taken away, etc. It threw around slurs. "I hope you and everyone you love dies," one person, predictably, tweeted (now deleted) at one of the voters. If you ever want an argument against certainty, pick any well-traveled tweet and read the replies.
Those WARs for Verlander are compelling, and WAR as a framework is clearly the best, most comprehensive, most objective way for most of us to assess players. But baseball is not something we've cracked, and Robbie Ray is a good reminder that, until we do, we're all still staggering, out of the darkness but blinded by too many lights. It isn't a matter of whether that's a good thing or a bad thing; it's just a thing. Be grateful you aren't the general manager of a team that has to make hundred-million-dollar decisions in this era.