11 April 2014

Rating Systems Challenge: 2014 Review and Historical Results


Photo Credit: Puneet & Anshu Nanda

This post is part of a series tracking the successes and failures of various NCAA Men's Basketball ranking systems and bracket models throughout the 2014 NCAA Tournament. Click here to check out the full series.

In defeating the 8 Kentucky Wildcats last Monday night, the 7 Connecticut Huskies completed an improbable run for the 2014 National Championship. No system under review here picked Connecticut to advance to the Sweet Sixteen. Along the way they beat three teams that at least one system put in the Final Four: Kentucky, 4 Michigan State and the top-seeded Florida Gators.

The surprise of a National Championship game featuring two schools whose seeds added up to fifteen resulted in the lowest-scoring Rating Systems Challenge since 2011, when a 3rd-seeded Connecticut defeated an 8th-seeded Butler in a Final Four that also included a 10th-seeded Virginia Commonwealth. The preseason polls defeated the other systems this year, but it was a Pyrrhic victory.




Considering that the best systems won with only a 62% accuracy rate, this tells us more about how unpredictable the 2014 NCAA Tournament was and less about how predictive the preseason polls were. Both FiveThirtyEight and Vegas did a better job of picking individual games, but finished third due to timing: their smart picks came early, while the preseason polls earned the late, more heavily weighted points resulting from Kentucky's run. No other systems outperformed chalk.
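To see how timing can trump accuracy, here is a minimal sketch of round-weighted scoring. The doubling point values (10, 20, 40, 80, 160, 320) are an assumption borrowed from common bracket contests rather than a stated rule of this challenge, and the two brackets are hypothetical, not actual 2014 results.

```python
# Assumed per-round point values (doubling each round). The post does not
# state the exact weights, so treat these as illustrative only.
ROUND_POINTS = [10, 20, 40, 80, 160, 320]
GAMES_PER_ROUND = [32, 16, 8, 4, 2, 1]


def score_bracket(correct_by_round):
    """Return (total_points, accuracy) from the correct picks in each round."""
    if len(correct_by_round) != len(ROUND_POINTS):
        raise ValueError("expected one entry per round")
    for correct, games in zip(correct_by_round, GAMES_PER_ROUND):
        if not 0 <= correct <= games:
            raise ValueError("correct picks must be between 0 and games played")
    points = sum(c * p for c, p in zip(correct_by_round, ROUND_POINTS))
    accuracy = sum(correct_by_round) / sum(GAMES_PER_ROUND)
    return points, accuracy


# Hypothetical brackets, NOT real results from the challenge.
early_bird = [25, 10, 4, 1, 0, 0]    # more correct picks, mostly early
late_bloomer = [22, 9, 4, 2, 1, 0]   # fewer correct picks, but later ones

for name, picks in [("early_bird", early_bird), ("late_bloomer", late_bloomer)]:
    points, accuracy = score_bracket(picks)
    print(f"{name}: {points} points, {accuracy:.1%} correct")
```

Under these assumed weights, the bracket with fewer correct picks still finishes with more points, because a single correct Final Four pick is worth as much as sixteen first-round picks.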


We now have four years' worth of data for thirteen systems, three years' worth of data for nineteen systems and two years' worth of data for twenty.* Though the sample size is still uncomfortably small for my taste, we're seeing some patterns start to emerge in these data.

*In 2014 I aggregated the ESPN Computer and ESPN Decision Tree systems because they picked the same bracket (the Computer is heavily influenced by the Decision Tree). But they don't always pick the same bracket. For this post, I break them back out again.

The first pattern shows us how hard it is to beat chalk in the long run. Over four years, only FiveThirtyEight is consistently better than chalk in individual picks, and that only by one point. In the last three years combined, only three systems have beaten chalk on a game-by-game basis, and the two highest performing (Survival Model and the USA Today [formerly ESPN] preseason poll) only by four extra correct picks.

The second pattern is that the bad systems remain. The RPI systems (NCAA and Lunardi) as well as Nolan Power Index miss picks consistently—the former two because they're not bold enough to pick the right upsets, the latter because it predicts upsets that don't happen. Use them to inform your bracket at your own risk.

System Performance 2011-14


In terms of points and percentiles, FiveThirtyEight has beaten all others on average these past four years, but being the best average bracket filler-outer isn't all that helpful if you perennially finish fourth behind Guy Who Gets Lucky Picking His Alma Mater and Guy Who Picks the Meanest Mascots #1 and #2. So how has FiveThirtyEight performed in each individual season? Pretty well, actually. Only 2012 saw FiveThirtyEight finish outside the top three. The only other systems to crack the top three more than once have been Sagarin and Vegas (with the caveat that I didn't track all of these systems back in 2011).

Top Three Systems 2011-14


In pools of a decent size (where the buy-in is a very small portion of the final pot), picking for the long run is a loser's strategy. You want to pick brackets that will, at least once in a while, beat all others even if they don't perform well in other years. In other words, you want to rely on systems that will help you maximize gains rather than minimize losses. Percentile finishes are one way to identify such systems. Finding peak maxima is another.

Among systems I've been tracking for four years, Sagarin has finished with the highest correct pick percentage and the most points in a single year, whereas the AP preseason poll's 97.8 percentile finish this past year is tops in this group. These are good systems to consider when looking for gain maximizers.
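The difference between a steady system and a gain maximizer can be sketched by ranking systems on their mean percentile versus their best single-year percentile. The system names and numbers below are placeholders invented for illustration; they are not the figures from the charts in this post.

```python
from statistics import mean

# Hypothetical percentile finishes by year. These are made-up placeholder
# values, NOT the results tracked in the Rating Systems Challenge.
percentiles = {
    "System A": [70.0, 72.0, 68.0, 71.0],  # steady, never spectacular
    "System B": [40.0, 98.0, 35.0, 60.0],  # streaky, one huge year
    "System C": [55.0, 50.0, 90.0, 45.0],
}


def summarize(history):
    """Compute the mean and peak (max) percentile for each system."""
    return {
        name: {"mean": mean(values), "max": max(values)}
        for name, values in history.items()
    }


def rank_by(summary, key):
    """Return system names sorted from best to worst on the given statistic."""
    return sorted(summary, key=lambda name: summary[name][key], reverse=True)


summary = summarize(percentiles)
print("Best on average:", rank_by(summary, "mean"))
print("Best peak year: ", rank_by(summary, "max"))
```

With these placeholder numbers the two rankings are reversed: the steadiest system leads on the mean, while the streaky one leads on peak performance, which is the statistic that matters if only first place in a large pool pays.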

System Performance 2012-14


When we exclude 2011, when I reviewed a smaller set of systems, the Survival Model's excellent 2012 performance shines through, as do the USA Today preseason poll's performance in 2014 and the ESPN Computer's steady scoring over the past three years. These are also good candidates for gain maximizers.

That said, both the Survival and FiveThirtyEight models have been tweaked since I started charting these systems, and I assume the ESPN systems are subject to modification as well. FiveThirtyEight seems to have done just as well post-tweak, Survival not so much. Make sure you research your systems before you use them to see if their inputs and weighting differ from year to year.

And so ends the 2014 Rating Systems Challenge. Thank you all for reading along. If you have any suggestions for other systems you'd like to see in the RSC, or other data you'd like me to explore, feel free to say so in the comments. Please tune in next year when I add Massey Ratings to the group, and check back later this spring for Rational Pastime baseball analytics.

6 comments:

Mark said...

Thanks for the great review! It's nice to see all of these stats on the systems. Last year, before I realized that there were many systems to look at, I went with the default system in the website I put my bracket into. Since that website was Fox Sports, the system was WhatIfSports. It looks like it's not a great system, but it might be worth adding to the list.

Two years ago I went with a different system, which I valued based on its usefulness in chess: Elo. If you look at Sagarin's ratings, the Pure Elo rating is to the far right. So you might also consider adding that one.

Also, I wonder if you might even try to create your own system based on the systems you've researched? It could be as simple as taking the top X systems based on a particular statistic and giving them each a vote. Or even take the top Y systems from each statistic. You might not like that system, though, since it will probably give you a system that performs above average but never outstanding.

JD Mathewson said...

Thanks for the suggestion. I'll look into incorporating WhatIfSports into next year's system.

I have thought about breaking out Sagarin's Elo to see how well it performs. You're actually not the first to ask about that. I probably would have done it had I more time before the start of the Tourney. I'm a big fan of Elo systems: the power ratings I use for my MLB postseason projections are based on my own Elo-based MLB power ratings system, which I'll be debuting later this spring.

I've not put much thought into developing my own system considering that developing a good one would require quite a bit of testing and data collection. Plus, that's essentially what FiveThirtyEight does (they use Pomeroy, Sagarin, LRMC, Chalk and the Preseason Polls to develop an ideal bracket). One rule I have is that if I don't think I can improve on what Nate Silver's doing, I don't worry about it.

Thanks for reading!

Mark said...

Also, looking at the chart "System Performance 2012-14", I believe you have max percentile for FiveThirtyEight incorrect. Didn't it get a 95.5 percentile last year? It also wouldn't make sense for them to get an 80.7 percentile this year and have Max = Mean != 80.7.

JD Mathewson said...

Thanks, Mark. There were some sorting issues that I had to fix, then I must have left some typos in when I fixed them.

Mickey Hill said...

Great article, found your site by chance while searching Google for "success rates for the different sports metrics systems." Look forward to reading even more. Any chance you have already rated analytic system(s) performance in any other sports (e.g., NBA, MLB, NFL, NCAAF or UEFA)?

JD Mathewson said...

Glad you enjoyed my work. Unfortunately, I don't test rating systems for the other sports, simply because the amount of work required just wouldn't equal the amount of interest. I'll be doing my own analytic rating system for MLB this spring/summer.
