11 April 2014
Rating Systems Challenge: 2014 Review and Historical Results
This post is part of a series tracking the successes and failures of various NCAA Men's Basketball ranking systems and bracket models throughout the 2014 NCAA Tournament. Click here to check out the full series.
In defeating the 8 Kentucky Wildcats last Monday night, the 7 Connecticut Huskies completed an improbable run for the 2014 National Championship. No system under review here picked Connecticut to advance to the Sweet Sixteen. Along the way they beat three teams that at least one system put in the Final Four: Kentucky, 4 Michigan State and the top-seeded Florida Gators.
The surprise of a National Championship game featuring two schools whose seeds added up to fifteen resulted in the lowest-scoring Rating Systems Challenge since 2011, when a 3rd-seeded Connecticut defeated an 8th-seeded Butler in a Final Four that also included an 11th-seeded Virginia Commonwealth. The preseason polls defeated the other systems this year, but it was a Pyrrhic victory.
Considering that the best systems won with only a 62% accuracy rate, this tells us more about how unpredictable the 2014 NCAA Tournament was than about how predictive the preseason polls were. Both FiveThirtyEight and Vegas did a better job of picking individual games, but finished third due to timing: their smart picks came early, while the preseason polls earned the late, more heavily weighted points resulting from Kentucky's run. No other systems outperformed chalk.
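To see how a bracket with fewer correct picks can still outscore a sharper one, here is a minimal Python sketch. It assumes an ESPN-style scheme in which points per correct pick double each round (the weights are an assumption, not necessarily the ones used in this series), and both brackets are made-up examples rather than any system reviewed here.

```python
# Points per correct pick in each of the six rounds.
# ASSUMPTION: ESPN-style doubling weights; the post does not state its weights.
ROUND_POINTS = [10, 20, 40, 80, 160, 320]

def score(correct_by_round):
    """Return (total points, total correct picks) for one bracket."""
    points = sum(n * p for n, p in zip(correct_by_round, ROUND_POINTS))
    return points, sum(correct_by_round)

# HYPOTHETICAL brackets: correct picks in each round, earliest round first.
early_bird = [25, 10, 4, 1, 0, 0]    # sharp early, misses the late rounds
late_bloomer = [22, 8, 3, 2, 1, 0]   # fewer correct picks, but hits a finalist

early_points, early_correct = score(early_bird)
late_points, late_correct = score(late_bloomer)

print(f"early_bird:   {early_correct} correct, {early_points} points")
print(f"late_bloomer: {late_correct} correct, {late_points} points")
```

Under these assumed weights the second bracket wins on points despite four fewer correct picks, which is the same kind of timing effect described above.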
We now have four years' worth of data for thirteen systems, three years' worth of data for nineteen systems and two years' worth of data for twenty.* Though the sample size is still uncomfortably small for my taste, in these data we're seeing some patterns start to emerge.
*In 2014 I aggregated the ESPN Computer and ESPN Decision Tree systems because they picked the same bracket (the Computer is heavily influenced by the Decision Tree). But they don't always pick the same bracket. For this post, I break them back out again.
The first pattern shows us how hard it is to beat chalk in the long run. Over four years, only FiveThirtyEight is consistently better than chalk in individual picks, and that only by one point. In the last three years combined, only three systems have beaten chalk on a game-by-game basis, and the two highest-performing (Survival Model and the USA Today [formerly ESPN] preseason poll) only by four extra correct picks.
The second pattern is that the bad systems remain. The RPI systems (NCAA and Lunardi) as well as the Nolan Power Index miss picks consistently: the former two because they're not bold enough to pick the right upsets, the latter because it predicts upsets that don't happen. Use them to inform your bracket at your own risk.
In terms of points and percentiles, FiveThirtyEight has beaten all on average these past four years, but being the best average bracket filler-outer isn't all that helpful if you perennially finish fourth behind Guy Who Gets Lucky Picking His Alma Mater and Guy Who Picks the Meanest Mascots #1 and #2. So how has FiveThirtyEight performed in each individual season? Pretty well, actually. Only 2012 saw FiveThirtyEight finish outside the top three. The only systems to crack the top three more than once have been Sagarin and Vegas (with the caveat that I didn't track all of these systems back in 2011).
In pools of a decent size (where the buy-in is a very small portion of the final pot) picking for the long run is a loser's strategy. You want to pick brackets that will, at least once in a while, beat all others even if they don't perform well in other years. In other words, you want to rely on systems that will help you maximize gains rather than minimize losses. Percentile finishes are one way to identify such systems. Finding single-season peaks is another.
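A toy simulation can illustrate why a boom-or-bust bracket may suit a large pool better than a steady one. This is only a sketch under invented assumptions: every score is drawn from a normal distribution, and the means, spreads and pool size below are hypothetical numbers, not estimates from the data in this series.

```python
import random

def win_rate(mean, spread, opponents=50, trials=5000, seed=2014):
    """Share of simulated pools won outright by a bracket with this score profile.

    ASSUMPTION: all scores are normally distributed; rival brackets average
    100 points with a spread of 15. These numbers are purely illustrative.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        mine = rng.gauss(mean, spread)
        best_rival = max(rng.gauss(100, 15) for _ in range(opponents))
        if mine > best_rival:
            wins += 1
    return wins / trials

# A steady bracket: better on average, rarely spectacular.
steady_rate = win_rate(mean=110, spread=5)
# A swingy bracket: worse on average, occasionally spectacular.
swingy_rate = win_rate(mean=100, spread=30)

print(f"steady bracket wins {steady_rate:.1%} of pools")
print(f"swingy bracket wins {swingy_rate:.1%} of pools")
```

In this setup the steady bracket almost never clears the best of fifty rivals, while the swingy one does so noticeably more often despite its lower average. Shrinking `opponents` narrows the gap, which is why the advice applies to pools of a decent size.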
Among systems I've been tracking for four years, Sagarin has finished with the highest correct pick percentage and the most points in a single year, whereas the AP preseason poll's 97.8 percentile finish this past year is tops in this group. These are good systems to consider when looking for gain maximizers.
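The difference between ranking systems by average and ranking them by peak can be sketched in a few lines of Python. The system names and percentile values here are placeholders invented for illustration, not the results reported in this series.

```python
from statistics import mean

# HYPOTHETICAL percentile finishes over four seasons (made-up numbers).
percentiles = {
    "System A": [80.0, 75.0, 82.0, 78.0],   # steady
    "System B": [60.0, 97.0, 40.0, 55.0],   # one big year
    "System C": [70.0, 72.0, 68.0, 90.0],
}

# Loss minimizers: rank by average finish.
by_average = sorted(percentiles, key=lambda s: mean(percentiles[s]), reverse=True)
# Gain maximizers: rank by best single-season finish.
by_peak = sorted(percentiles, key=lambda s: max(percentiles[s]), reverse=True)

print("best on average:", by_average)
print("best at peak:   ", by_peak)
```

With these placeholder numbers the steadiest system tops the average ranking but comes last on peak, so the two rankings answer different questions.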
When we exclude 2011, when I reviewed a smaller set of systems, the Survival Model's excellent 2012 performance shines through, as does the USA Today preseason poll's performance in 2014 and the ESPN Computer's steady scoring over the past three years. These are also good candidates for gain maximizers.
That said, both the Survival and FiveThirtyEight models have been tweaked since I started charting these systems, and I assume the ESPN systems are subject to modification as well. FiveThirtyEight seems to have done just as well post-tweak, Survival not so much. Make sure you research your systems before you use them to see if their inputs and weighting differ from year to year.
And so ends the 2014 Rating Systems Challenge. Thank you all for reading along. If you have any suggestions for other systems you'd like to see in the RSC, or other data you'd like me to explore, feel free to say so in the comments. Please tune in next year when I add Massey Ratings to the group, and check back later this spring for Rational Pastime baseball analytics.
by JD Mathewson