Math-Based Ratings
Now known as computer ratings, math-based
ratings actually predate computers (depending on one's definition of a computer). The Dickinson system started in
1926, Houlgate in 1927, Dunkel in 1929, Boand in 1930, Williamson in
1932, Litkenhous in 1934, and Poling in 1935. All except the Dunkel
system are now defunct, and even Dunkel is practically defunct as far as the general public is concerned. I have covered the Dickinson system separately.
Today there are hundreds of rating formulas out there, probably one for each left-brained football fan, but the most important in history were the ever-changing systems used by the BCS for its ratings. For the last few years of the BCS, those were the following six systems: Massey, Sagarin, Billingsley, Anderson-Hester, Colley, and Wolfe.
The advantage of math-based systems, of course,
is their objectivity and consistency of criteria from year to year.
Sometimes a system owner will change his math formula, but when he
does, theoretically all of his "national champions" for all years change, so that
they are all selected by the same criteria. Because of that, an older
copy of the NCAA Records Book may list different "champions" for a
system than a newer copy does. But each list should be consistent within its
current formula.
The flaws of these systems are many, some systems more than others. But one flaw they all share lies in their
bureaucratic approach to rating college football teams: a bureaucratic
system is incapable of recognizing and handling exceptions.
The other flaws depend on the system in question, but nearly all rating formulas are based on premises that I, for one, would not agree with in the first place. This can be hard to judge when a system owner keeps his recipe a closely guarded secret, but even in that case you can judge the "national champions" his system has selected.
And that is where we see the big fail for computer systems as national championship selectors:
their choices are too often clearly ridiculous. As a result, for the
most part, no one sees them as constituting national championships.
Math Systems That Are Treated as National Championships
An
unfortunate exception is the Dickinson system's 17 "champions" selected
for 1924-1940. Some people do see those as "legitimate," which is ironic,
since it is the most primitive and one of the worst systems of them all.
Also, for some reason, Alabama absurdly recognizes their 1941 Houlgate system selection as a "national championship." Alabama was 9-2 that year; consensus champion Minnesota was 8-0. Alabama finished fifth in the SEC and was ranked #20 in the final AP poll (though they would have been higher had the final poll come out after the bowls; I myself would rank them about #10). Worst "national championship" recognized by any school? I would have to think so.
Other
schools may claim similar titles, but Alabama's 1941 farce is the most
famous case. For modern (post-WWII) years, however, no one recognizes
math formulas as national championships. If they did, just using the
systems listed in the NCAA Records Book, we would have an additional 42
so-called national champions between 1970 and the present (many of whom
lost to the legitimate national champion).
Inability to Handle Exceptions
As
I said, the bureaucratic approach of a math formula rating cannot
recognize or handle exceptions. In other words, computer ratings do not
account for injuries, illness, suspensions, expulsions, or any major
losses in team personnel at all. They do not account for the effect of
weather on a result. And they do not account for any of the many
psychological factors that most of us humans can recognize and account
for.
Player Losses
Of
course, if a team loses a game because of injuries or other player
losses, it shouldn't matter as far as that team's rating goes. A loss
is a loss. Obviously that team was lacking in depth, and its rating
should decline. But where it does matter is for the opponents that team
plays.
As an example, let's say Team A loses to Team B in their
finale in overtime. Team B loses ten key starters to mass suspensions
for their bowl game, and Team C defeats them there in overtime. For
that result alone, any math system will rank Team C higher than Team A,
because Team B is judged by a math formula to be the same team
regardless of who played. However, this is clearly unfair, as the
opponents Team A and Team C faced in this example were, in reality,
quite different.
Sometimes the loss of even one player can have
a huge impact on how good a team is, such as Oregon in 2007. Through 9
games, Oregon was 8-1 and #2 in the BCS ratings that year, with big
wins over final AP top 25 teams Michigan 39-7, Southern Cal 24-17, and
Arizona State 35-23. Then they lost QB Dennis Dixon, their offense
disappeared, and they lost their last 3 regular season games.
At
that point, most systems will see USC and ASU as having lost to an 8-4
team, even though the team they faced was clearly far better than that
(Billingsley's system is one exception that I will address next). And
Arizona and UCLA, who beat Oregon in their first two games without
Dixon, will be given credit for wins far beyond what they should be
credited for.
Oregon's last loss, to Oregon State, presents a
further complication, because by that point Oregon's offense had
recovered, and they scored 31 on Oregon State in a loss and 56 in their
bowl win over South Florida. So Oregon State faced an Oregon team that
was better than Arizona and UCLA faced, but not quite as good as the
one USC and ASU faced. All of this is too complex a set of exceptions
for any math formula to properly handle, including Billingsley's.
Billingsley's Approach
Billingsley's approach is different from most in that he takes strength of schedule into account only as of the point when each game was played. So in the Oregon
2007 example, his system sees both USC and ASU as losing to a
highly-rated Oregon team, and USC and ASU's ratings are unaffected by
anything Oregon does thereafter. That is a good thing in this case, at
least where USC and ASU are concerned. But this approach has its own
problems, both within this Oregon example and in general.
Where
Oregon 2007 is concerned, his system still gives Arizona and UCLA
enormous undue credit for having beaten a highly ranked
Oregon team. It still does not account at all for the major injury
underpinning those wins. Furthermore, even though Oregon's offense
recovered by the Oregon State game, and Oregon State thus faced a much
more powerful Oregon team than Arizona and UCLA did, Billingsley's
system gives Oregon State less credit for beating Oregon, as Oregon was much lower-rated by then.
And in a general sense, I think his
approach is poor. Let's say a team goes 6-0 against unrated teams, but
plays its six strongest opponents to end the season and loses all six
games. The first team that beats them gets a big rating boost for it in
Billingsley's system. The second gets less. The last gets very little,
having beaten a lowly-rated (6-6) team. Even though they all defeated the same team.
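To make that contrast concrete, here is a minimal sketch in Python, using made-up rating numbers for the 6-6 team in this example. It is not Billingsley's actual formula; it only illustrates the difference between crediting a win by the beaten team's rating when the game was played and crediting it by that team's rating at season's end.

# Hypothetical ratings for the 6-6 team described above, as each of the
# six teams that beat it met it (the rating falls with every loss).
rating_when_beaten = [0.90, 0.80, 0.68, 0.55, 0.42, 0.30]
rating_at_season_end = 0.30

for game, rating_then in enumerate(rating_when_beaten, start=1):
    game_time_credit = rating_then               # credit tied to the moment the game was played
    season_end_credit = rating_at_season_end     # credit tied to where the team finished
    print(f"Winner #{game}: game-time credit {game_time_credit:.2f}, "
          f"season-end credit {season_end_credit:.2f}")

Under the game-time approach, the first winner gets three times the credit of the last one, even though all six beat the same 6-6 team.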
In
2009, Alabama opened the season with a 34-24 win over Virginia Tech,
who finished 10-3 and very highly rated. But where Alabama is concerned, Billingsley's system
ignores everything Virginia Tech did after that game. It had no effect
on Alabama's rating. I think we can all agree that is a poor approach, and yet Billingsley was used in the BCS ratings, which had a rather profound effect on the fortunes of so many schools.
In general, it is
just a bad idea to ignore so much data in a sport where there isn't
enough as it is (due to 120 teams playing 12-game schedules).
Weather
Similar
to player losses, if a team loses a game in part due to weather, it is
still a loss. But should it be considered equal to a loss unaffected by
weather? And should a poor performance in a win played in good weather
be judged the same as a poor performance in bad weather? I think not.
But math systems cannot account for weather.
For
example, let's say that you are trying to rate two teams that are equal
in every way, and that each was upset by a losing team. But one of them
suffered their upset in a game that was played in a rainstorm or
blizzard. You might therefore rightly rank that team higher. But a
computer cannot see that. Similarly, a powerful team might win a game
in an ice storm, but barely, and see their rating adversely affected by
a system that measures performance. That is not fair, and humans can
see and account for it.
Psychological Factors
It
is easy to dismiss psychological factors as "a bunch of hooey," as my
grandmother would say, but statistics will bear them out as real. And
they are just a few more things computers cannot account for.
Take
a rivalry game. The kind of game where you can
"throw out the records," as Keith Jackson would say. It is, in other
words, much tougher than a game against a non-rival of the same record.
Just ask 8-0-1 Army, who was tied by 0-8-1 Navy in 1948. Should this
result be judged the same as though Army had been tied by 0-8-1
Indiana?
And it isn't just the famous rivalries,
like Army-Navy, Ohio State-Michigan, and Auburn-Alabama. Stanford-Cal,
South Carolina-Clemson, and Missouri-Kansas are just as heated. And
there are a lot of them. Texas has two (Oklahoma and Texas A&M).
Now,
just like the above factors, a loss in a rivalry game is still a loss.
But for a given year, if you are looking at, for example, a powerful
Missouri team that was upset by a mediocre Kansas, and comparing
them to a powerful team that was upset by a mediocre non-rival, you
might well rank Missouri higher. Because a rivalry game is tougher. But
a computer cannot see that.
Teams are also much, much tougher in
their last game when their coach has announced his retirement, and
often in the last game when their coach has been fired effective at
season's end. In past years we have seen this in Franchione's last
game at Texas A&M, Carr's last game at Michigan, and Bowden's last
game at Florida State, all big upset wins for the outgoing coach. Computers cannot account for this in judging the
opponents those teams defeated.
Humans can also consider factors
such as the difficulty of playing well the week before or after a big
game, or the effect of a tragedy or public scandal on a team (it can
inspire them to play better or depress/distract them into playing
worse).
A win is a win and a loss is a loss, but all of these
factors can and should affect the degree of impact some wins and losses
have on a team's rating. And math systems are incapable of making those distinctions.
Strength of Schedule
Although
all math systems attempt to account for strength of schedule, virtually
none properly do so. I covered most of this in the Strength of Schedule section of my How to Rate Teams guide.
First of all, let's say that two teams are playing the following schedules, and that all of these rankings are fairly accurate:
Team A: #5, #10, #15, #30, #70, #80, #90, #100, #105, #110
Team B: #30, #40, #45, #50, #55, #60, #65, #70, #72, #73
Which
is the tougher schedule? Many math systems will say Team B's is tougher, since its
average opponent ranking is 56, and Team A's is 61.5. But which team
plays the tougher schedule is entirely dependent on how good Team A and
Team B are.
If these teams are top ten teams, then Team A's schedule is vastly tougher
than Team B's. It is not even close. If these teams are vying for a
national title, and a computer selects Team B based on its average
opponent ranking, that is just ridiculous. Team A will have played 3
top 25 teams, and Team B none.
For national championship
contenders, whether the weak teams on their schedule are #100 or #70 is
virtually irrelevant. Yet math systems will see that difference as the
same as that between a #10 and #40 opponent.
On the other hand,
if these teams are both about #75, then Team B has the far tougher
schedule, as all of their opponents are ranked higher, whereas Team A
is looking at a 5-5 season. So strength of schedule is very much
relative to the power level of the team playing the schedule.
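To put numbers on that, here is a minimal sketch in Python comparing the naive average-opponent-ranking measure to counts taken relative to the team's own level. The schedules are the hypothetical ones above; the cutoffs are illustrative, not any real system's formula.

team_a_opponents = [5, 10, 15, 30, 70, 80, 90, 100, 105, 110]
team_b_opponents = [30, 40, 45, 50, 55, 60, 65, 70, 72, 73]

def average_rank(opponents):
    # The naive measure: mean of opponent rankings.
    return sum(opponents) / len(opponents)

def count_ranked_at_or_better(opponents, cutoff):
    # How many opponents are ranked at or better than a cutoff
    # (a lower number means a better ranking).
    return sum(1 for rank in opponents if rank <= cutoff)

print(average_rank(team_a_opponents), average_rank(team_b_opponents))   # 61.5 vs 56.0
print(count_ranked_at_or_better(team_a_opponents, 25),
      count_ranked_at_or_better(team_b_opponents, 25))                  # 3 vs 0 top-25 opponents
print(count_ranked_at_or_better(team_a_opponents, 74),
      count_ranked_at_or_better(team_b_opponents, 74))                  # 5 vs 10 opponents ranked ahead of a #75 team

By the naive average, Team B's schedule looks tougher. From a contender's perspective (top-25 opponents faced), Team A's is far tougher; from a #75 team's perspective, Team B's is.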
This
principle is the same for a system that judges strength of schedule by
win-loss records (though such systems are poorer and prone to other
problems as well). If Team A goes unbeaten against teams that finished
9-1, 8-2, 7-3, and seven teams that were 1-9, and Team B defeats teams
that finished 6-4, 5-5, and eight teams that finished 4-6, Team B's
schedule will be deemed by a math system to be much stronger. But if
these teams are national championship contenders, Team A's schedule is
the far tougher. If they are #60-80 type teams, Team B's schedule is
tougher.
The Straight Record Fallacy
Systems
that simply judge strength of schedule by the straight records of
opponents have other problems as well. Typically, there is a big power
difference between a 7-3 SEC team and a 7-3 WAC team. But a simple
system, particularly the older ones (such as Houlgate), will judge such
opponents as the same. And even modern systems that alleviate this
problem with a more sophisticated formula can still be affected by it,
if to a lesser degree.
This is particularly a problem when
systems try to name champions for seasons long ago, when there were
fewer intersectional games. A team from the South, for example, could
be selected as champion for a season early in the 20th century when
neither they nor any of their opponents played a team from outside the
South. They just played a lot of regional teams with strong records.
The
problem is, major teams from the South fared consistently poorly
against other regions at that time. They have a losing record against
every other region they played 1901-1920. And a losing record against
every region but the West Coast 1921-1929 (3-0-1 against the West).
From
1901 through 1929, major Southern teams were 7-49-3 against the Big 10
region, 17-36-2 against the East, and 3-9-3 against the Missouri Valley
(which was far weaker than the Big 10 and East). So when Billingsley's
system selects 8-0 Auburn as national champion of 1913, over 7-0
Chicago, 7-0 Notre Dame, 9-0 Harvard, and 8-0 Nebraska, it seems more
than a little ridiculous.
Auburn's all-Southern schedule was
strong on its face, with teams that finished 4-3, 6-1-1, 6-1-2, 7-2,
5-3, and 6-2. That is why they came out on top in Billingsley's system. But the flaw
here is in treating 7-2 Southern teams as equal to 7-2 Big
10 and Eastern region teams, when they very clearly were not equal.
Auburn's
opponents played only one game against a power region, and that was 5-3
Vanderbilt losing to Michigan 33-2 (Auburn beat Vandy 14-6). They went
3-0-1 against Southwest teams (irrelevant) and 0-2 against mid-Atlantic
team Virginia (relevant only because they lost to a team from a weak
region). Virginia was 7-1, but lost their only game against an Eastern
team (4-4 Georgetown).
Performance
I
covered much of this in the Performance section of my How to Rate Teams
guide. Performance is basically score differential. Many, if not most,
modern math formulas measure performance, though the BCS did not allow
its computer rating systems to take it into account.
I do think
math formulas have a lot of trouble properly measuring performance
(details to follow), but on the other hand, people voting in polls can
(and should) account for performance, so ultimately I don't think it was
a good idea to force computer ratings to ignore so much potentially
useful data. I have no doubt that Sagarin's original system (his
real rating list) is better than the dumbed-down system he used for the
BCS.
But of course, I don't think the BCS should have been using computer ratings at all.
Problems With Measuring Performance
Computers
cannot account for many exceptions that artificially affect score
differential, including things I've already covered, such as player
losses, weather, and psychological factors. But unlike humans, math
formulas also cannot see how games unfold. For example, if a team is
down by 14, and scores a touchdown on a Hail Mary pass on the last
play, the final score will look much closer than the actual game was.
But the winner was not really threatened.
And on the other end,
one team might take a lead with 30 seconds left. The other team,
desperate to come back, throws risky passes, and one is returned for a
touchdown (a sequence I've seen many times). This final score will not
be as close as the actual game was.
Humans can see and account for these things, and computers cannot.
Furthermore,
since different teams have different strengths and offensive
approaches, it can be difficult to compare them using score
differential. Some teams are more ball-control and defense oriented,
and others run a no-huddle passing offense even when they are winning
big. And some coaches like to run up scores, while others shut it down
when they have a big lead. Just because one team wins by an average of
30 and another by an average of 15 doesn't mean that the higher-scoring
team is better.
The most important dividing line in performance
is the touchdown. A team that wins by 10 has performed better than a
team that wins by 7, because the latter team was an unlucky bounce or
Hail Mary pass away from a potential loss. When comparing national
championship candidates, the main thing to look at is not whether they
won by 20 or 30 points a game, but how many times they were threatened.
How many close games they had.
All else being equal, a team that
has one close game (winning by a touchdown or less) should be rated
higher than a team that has 3 close games. Even though the team with 3
close games can have a higher average score differential due to big
blowouts in the rest of their games.
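Here is a minimal sketch in Python of that distinction, with made-up margins of victory, treating "a touchdown or less" as seven points or fewer.

team_x_margins = [7, 14, 17, 20, 21, 13, 16, 10, 18, 24]   # one close game
team_y_margins = [3, 6, 7, 45, 49, 52, 38, 28, 21, 31]     # three close games, bigger blowouts

def average_margin(margins):
    # Raw score differential, the figure a performance formula rewards.
    return sum(margins) / len(margins)

def close_games(margins, touchdown=7):
    # Games in which the winner was a single score from a potential loss.
    return sum(1 for margin in margins if margin <= touchdown)

print(average_margin(team_x_margins), close_games(team_x_margins))   # 16.0, 1
print(average_margin(team_y_margins), close_games(team_y_margins))   # 28.0, 3

By raw average margin, Team Y looks far more dominant; by times threatened, Team X had the more convincing season.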
The Pudding
There
is much more to say about the flaws of computer rating systems, and
more detail to get into. But this overview is long enough. Let's cut to
the chase. Or the pudding, as it were. That's where I've been told the
proof is. All you really need to know about computer ratings as
national championship selectors is evident in the vast list of silly
choices they have made over the years.
2002:
Three systems preferred 11-2 Southern Cal to 14-0 Ohio State. Some 60
years from now, I fully expect Southern Cal to officially claim a national championship for 2002.
1998: Sagarin had 11-1 Ohio State over 13-0 Tennessee.
1996: Alderson picked Florida State over Florida, who defeated FSU 52-20 in the national championship game.
1994: 13-0 Nebraska or 12-0 Penn State? Dunkel said 10-1-1 Florida State.
1988 & 1989: Notre
Dame beat Miami in '88, but one system selected Miami anyway. Miami won
the next year, but a couple of systems took Notre Dame anyway.
1987: Berryman took 11-1 Florida State over 12-0 Miami, who beat them.
1986: Almost all systems tabbed 11-1 Oklahoma #1. 12-0 Penn State beat 11-1 Miami, who beat Oklahoma.
1984: If ever there was titanium-clad proof that Billingsley's computer formula has a problem, it's 1984, when his computer said Brigham Young was the best team in the land. Why
bother with a computer at all if it's just programmed to be as dumb
as that year's writers?
1980: Three systems would like to give the trophy to 10-2 Oklahoma over 12-0 Georgia.
1976: Five systems like 11-1 Southern Cal over 12-0 Pittsburgh.
1969: Matthews says no to 11-0 Texas. For 11-0 Penn State's sake? No. 10-0-1 Southern Cal? No. For 8-1 Ohio State.
1968: Litkenhous tells us that 8-1-2 Georgia is more worthy than 10-0 Ohio State (as well as 11-0 Penn State).
1961: Poling has 8-0-1 Ohio State over 11-0 Alabama.
1955: Boand pushes 9-1 Michigan State past 11-0 Oklahoma.
1941:
Houlgate's aforementioned selection of fifth-place 9-2 SEC team Alabama
over 8-0 Minnesota. Not to mention 8-0-1 Notre Dame and 8-0 Duquesne.
Or 8-1-1 Mississippi State, the SEC champion who beat Alabama, but lost
to Duquesne. And Alabama claims it. They even made "national
championship" rings (more than 40 years after the fact). I have seen one.
1940:
Williamson stays out of all the debate by picking 10-1 Tennessee over
8-0 Minnesota, 10-0 Stanford, and 11-0 Boston College (who beat
Tennessee).
1939: Dickinson
stands alone in 8-0-2 Southern Cal's corner, opposite 11-0 Texas
A&M. As noted in my separate review of Dickinson, and as you can see in the link above, in recent years
USC did some "research" and decided to claim this as a "national
championship." Funny stuff.
There are plenty more. But you get the idea.
Conclusion
It
may seem as if I dislike math-based ratings, but I don't. It's true
that I have no use for the older, simpler systems, but modern ones that
measure performance are interesting. I've been reading Sagarin's lists
for decades. I just don't see them as legitimate national championship
selectors, or even top 25 ratings. They are power ratings.
Power
ratings don't care who beat who. They are lists of who the best teams
are (or who the formula/selector measures/believes the best teams
to be). And the best team doesn't always win. The computers are
correct, I believe, that Oklahoma was a better team than Penn State in
1986 (and so was Miami). PSU is the only legitimate #1 team for that
season, but that doesn't mean they were the best.
Similarly,
there is no doubt in my mind that Miami was better than Ohio State in
2002. But Ohio State is the only legitimate #1 for that year.
Therefore,
I think the math rating #1 teams throughout history should be put into
a separate list from the human national championship selections in the
NCAA Records Book. Math formulas are selecting the best teams (or
attempting to, at least), not the national champions, and there is
definitely a difference.
For
all their limitations, math-based ratings can do a good, objective job
of putting together a power rating. And they can be a helpful tool for
humans who rate teams. If Sagarin has a 6-4 team at #15, for example,
it might be a good idea to at least take a look at that team. Chances
are, there's a reason for it.