Every year, more and more games are released. With so many titles competing for our limited attention, we must often make choices about which games to buy and play. The rise of video streaming affords modern gamers an unprecedented ability to preview the content of a game, but we often still fall back on the most expedient metric for a game’s quality: the review score.
Review scores are appealing because of their simplicity and convenience. At a mere glance, they offer authoritative reassurance that an anticipated title is “good”, or cautionary guidance to avoid a release deemed “bad”, “broken”, or perhaps worst: “meh”.
For assessing the quality of a game, this is obviously a shortcut; games are complex works of art that cannot be reduced to a number, and tastes are not interchangeable. But sometimes shortcuts are worth taking. We don’t have time to research every new game that comes along, so it’s not unreasonable to write off a game roundly panned as a “1 out of 5”. Similarly, it’s often worth looking into games that release to universal acclaim.
The value of ratings becomes more muddied when there is disagreement, either between different critics or between professional critics and regular consumers (often represented as an aggregated “user score”). To make the best inferences from the available data, it’s important to understand how different types of reviewers rate games.
Getting a Professional Opinion
Professional critics review games as part of their job. They might work for a media organization as a contributor or produce content independently for their blog or YouTube channel. Such individuals rate and review games for others, and balance their unfiltered opinions with external factors imposed by the expectations of their audience and the norms of the industry.
Some of the biggest limitations imposed on a professional critic in assigning a review score are the common interpretation of which scores are “good” or “bad”, and which scores are considered appropriate to give out to different types of games.
Review aggregators like Metacritic have had a major normalizing effect on how readers interpret the range of review scores. By directly mapping scores from all publications onto a 0–100 scale, Metacritic has conditioned us to think of them in the same way as letter grades (~90 and above is an A, below ~60 is an F).
Without this mapping, a more limited 5-point scale is interpreted differently; a 3 out of 5 might suggest that a game is competent and worthwhile for fans of a genre or series. But if a reviewer gives a game a 3 out of 5, Metacritic will convert it to a 60, which most readers will perceive as a “failing grade”. Reviewers are thus pushed to adopt a more granular scale and give such a game a 70% or higher, leaving most of the scale unused most of the time.
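The conversion itself is just linear scaling. As a rough sketch (this is an illustration of the idea, not Metacritic’s actual implementation), a score is divided by the maximum of its native scale and multiplied by 100:

```python
# Illustrative sketch: map a score from an arbitrary scale onto 0-100
# by simple linear scaling (not Metacritic's actual code).

def to_hundred_scale(score, scale_max):
    """Convert a score out of scale_max to a score out of 100."""
    return round(score / scale_max * 100)

print(to_hundred_scale(3, 5))    # 3 out of 5 -> 60, read as a "failing grade"
print(to_hundred_scale(3.5, 5))  # 3.5 out of 5 -> 70
```

Under this mapping, the middle of a 5-point scale lands squarely in letter-grade “D/F” territory, which is exactly the mismatch described above.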
If you wonder why reviewers care about how Metacritic interprets their scores, consider that some game studios award bonuses based on their game’s Metacritic score. Reviewers may be averse to adhering rigidly to a scale with a lower average knowing that it could hurt people whose work they enjoyed.
Everyone’s a Critic
Gaming sites and services like Metacritic often give users the ability to rate games themselves. For an anonymous user submitting a rating, there is no built-in audience for their specific opinion. Since their rating simply feeds into an aggregated “user score”, they are incentivized to provide a score that maximizes their impact on the resulting number.
User ratings on Metacritic are aggregated by taking the mean (the average calculated by adding up all the scores and dividing the sum by the number of votes). This incentivizes extreme votes because scores that are further from the average have more of an impact on the new average.
Suppose I play a game and decide that it deserves a score of 5/10, but when I check Metacritic I see the current user score is 9, based on 20 user reviews. If I submit an honest rating of 5, then the mean average is reduced to about 8.8. However, if I instead rate it 0/10, the average is reduced further, to about 8.6. By submitting a more extreme rating, I can have a greater impact on the resulting average and skew it closer to the number that I feel it ought to be.
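The arithmetic behind that example can be sketched in a few lines. This is just a restatement of the mean formula using the numbers above (20 existing reviews averaging 9):

```python
# Sketch: how much one new vote moves a mean-based user score.

def new_mean(current_mean, num_votes, my_vote):
    """Mean after adding one vote to an existing average of num_votes."""
    return (current_mean * num_votes + my_vote) / (num_votes + 1)

print(round(new_mean(9, 20, 5), 2))  # honest rating of 5  -> 8.81
print(round(new_mean(9, 20, 0), 2))  # extreme rating of 0 -> 8.57
```

The further my vote sits from the current average, the more the average moves, which is precisely the incentive to exaggerate.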
We can eliminate this incentive by aggregating user ratings a different way. Instead of the mean average, we could use the median score, the one that is in the exact middle when all scores are ordered from lowest to highest. This score is average in the sense that half of all the scores are above it, and the other half are below it. If the median is 9 and I think that the score should be 5, then I am incentivized to vote honestly because a more extreme vote will not be any more effective than an honest vote in changing the median.
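We can see this property directly. Using a hypothetical set of 20 ratings clustered near 9 (the numbers are illustrative, not real data), an honest 5 and a tactical 0 move the median by exactly the same amount:

```python
from statistics import median

# Hypothetical ratings for a game with a user score of 9 (illustrative data).
ratings = [7, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10]

print(median(ratings))        # current median: 9
print(median(ratings + [5]))  # after an honest vote of 5:  still 9
print(median(ratings + [0]))  # after an extreme vote of 0: still 9
```

Any vote below the median shifts it by the same amount regardless of how low it goes, so there is nothing to gain by exaggerating.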
This resistance to tactical voting is a useful property to have for aggregated user scores for games because it avoids amplifying the influence of a vocally dissatisfied minority. Since, as previously discussed, game scores tend to be distributed toward the upper end of the scale (~70% or higher), a small proportion of users submitting extremely low ratings can quickly tank the rating to the point where it is no longer a useful representation of the consensus opinion.
Make it Meaningful
User scores like those on Metacritic are particularly susceptible to “tactical voting” because users tend to submit a rating and then forget about it. Users might be more likely to care about providing an honest rating if they felt like they were recording their rating not just for others, but for their own record. The new AllMyGames rating feature was designed first as a way for gamers to keep track of how much they liked the games they’ve played relative to others.
Another way to encourage users to rate games honestly is to provide them value in exchange. Netflix and similar services have incentivized users to rate content by promising to use that data to make personalized recommendations for other content.
Review scores don’t tell the whole story, but they can be a handy shortcut for getting a sense of how a game has been received and as a concise metric for recording our own opinions. Just remember to look a bit deeper when you can. Don’t judge a game by its number!