Back in my usability days, I talked often about measurement error, the idea that something throws a monkey wrench into an otherwise careful attempt at accurate observation. Biases, for example, pop up in all sorts of interesting and confounding ways: I've seen users struggle with a site and then say they loved their user experience. Even when the Web design itself isn't the issue, users bring their own expectations with them as they use Web sites; one man I interviewed for the Social Security Administration consistently used an interface incorrectly because he didn't personally identify as disabled, even though it took him 10 minutes to cross the room with his walker before taking his seat.
Usability scientists who aim to measure the success and efficiency of online systems have built an arsenal of tools for gathering more accurate information, tools that can stem the effect of whatever measurement error is in play. One of those tools is the Likert Scale. We're all familiar with them: the "rate this from 1 to 5, 1 being least and 5 being most" items that float around opinion surveys and rating systems like Yelp. In truth, a scale can run from 1 to 3, 1 to 4, or 1 to 10, or whatever range the designer thinks it should cover.
But Likert Scales are notorious in the world of usability data collection, because very few people design them correctly and very few respondents react to them appropriately. The first problem with the scales is the assumption of equal distance between options: the scale demands that the difference between 1 and 2 be equivalent to the difference between 2 and 3. But emotionally, if we are judging our own satisfaction with something, can we parse out our feelings that way? Is happiness always even across a continuum? In terms of satisfaction, there is a lot of evidence that scores tend to drift toward the extremes of the scale, no matter how many markers sit in between. And for items that aren't controversial, many people will select the middlemost answer. Some scale designers set up even-numbered scales to eliminate the lazy neutral response, but that doesn't address the pull of the poles for respondents.
Another issue with Likert Scales stems from individual differences among the people making selections. When a nurse asks us, "What is your pain level?" and points to the familiar happy-to-sad face scale, what is a 3 for me may be a 6 for someone else. This particular scale is also problematic because, while most of us know what the "no pain" rank feels like, we have very different experiences of (or no experience with) the "worst possible pain" point on the scale. Thus that range of 1 to 10 can be markedly different for different people. Perhaps the medical practitioner is only looking to see how someone feels generally, but this problem persists in other applications of the scale.
Which brings me to the now-standard 5-point scale for rating books. Amazon, Goodreads, and Barnes & Noble all offer a kind of satisfaction rating so that readers can make their assessment quickly, and this widget is typically separate from the text box where a written review would go. Given such easy access to quick or snap judgments, one would think users would be careful about where on the scale they made their mark. Instead, most responses land in the 4 and 5 slots, a strong preponderance at the top of the scale. Here are some examples I pulled from Amazon and Goodreads:
- The Mill River Recluse, by Darcy Chan—4-star average, 547 responses: 292 5-stars, 131 4-stars, 54 3-stars, 47 2-stars, 51 1-star on Amazon
- One September Morning, by Rosalind Noonan—4.5-star average, 7 responses: 5 5-stars, 2 4-stars on Amazon
- The Anubis Gates, by Tim Powers—4-star average, 2,669 responses: 1,022 5-stars, 927 4-stars, 525 3-stars, 140 2-stars, 49 1-star on Goodreads (so 73% of the scores were in the 4-5 star marks; the sketch after this list shows the arithmetic)
- Falling for Me, by Anna David—4-star average, 41 responses: 14 5-stars, 15 4-stars, 11 3-stars, 1 2-star on Goodreads
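A quick way to see that top-heaviness is to run the arithmetic yourself. Here is a minimal sketch using the Goodreads counts for The Anubis Gates listed above (the per-star counts as listed sum to 2,663 rather than the stated 2,669, but the share lands at 73% either way):

```python
# Check the top-heavy distribution using the Goodreads counts
# for The Anubis Gates listed above (keys are star values).
counts = {5: 1022, 4: 927, 3: 525, 2: 140, 1: 49}

total = sum(counts.values())                                    # 2,663
average = sum(stars * n for stars, n in counts.items()) / total
top_two = (counts[4] + counts[5]) / total

print(f"responses: {total}")
print(f"average:   {average:.2f} stars")   # ~4.03
print(f"4-5 stars: {top_two:.0%}")         # 73%
```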
First, it seems to me that there are different kinds of readers out there: readers who like to rank the books they've read, and casual rankers who will do it if the scale is presented to them. Obviously more popular books generate more responses, but there appears to be a consistent bias toward positive rankings over neutral or negative ones. I'll wonder aloud whether the active book rankers, the first group, have a different style of ranking; that is, are people who look forward to rating books more inclined to rate in the middle than the casual reader who only sometimes rates books?
That said, when a poor book hits the scene, readers will respond with very negative numbers. And Amazon's whole history of fake ratings propping up certain titles can't mask a very crappy book: the 1- and 2-star ratings will start to climb despite an author's or publisher's attempts to fluff up the average rating.
All of this leads me to think that the scale rating scheme just doesn't work well for books. On the one hand, some people try to manipulate the numbers and results; on the other, measurement error seems firmly entrenched in this design for recommending or rating titles. Even just splitting out that distinction, "Did you like this book?" versus "Would you recommend this book?", may garner different results.
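To make that distinction concrete, here is a hypothetical sketch of what capturing the two questions separately could look like; the names and structure are mine for illustration, not any retailer's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class BookFeedback:
    liked: bool        # "Did you like this book?"
    recommend: bool    # "Would you recommend this book?"
    reason: str = ""   # short free-text box: why or why not

@dataclass
class BookTally:
    responses: list[BookFeedback] = field(default_factory=list)

    def add(self, fb: BookFeedback) -> None:
        self.responses.append(fb)

    def summary(self) -> str:
        n = len(self.responses)
        liked = sum(fb.liked for fb in self.responses)
        rec = sum(fb.recommend for fb in self.responses)
        return f"{liked}/{n} liked it; {rec}/{n} would recommend it"

tally = BookTally()
tally.add(BookFeedback(liked=True, recommend=False,
                       reason="Fun premise, but not one I'd hand to a friend."))
tally.add(BookFeedback(liked=True, recommend=True, reason="Loved it."))
print(tally.summary())  # -> "2/2 liked it; 1/2 would recommend it"
```

Notice that two readers who both "liked" a book can still split on whether they'd recommend it, a difference a single star number flattens away.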
What I would rather see is a pair of yes/no widgets along those lines, with a short box for readers to say why or why not. I can understand the impetus behind the ratings game, but I'd rather use more traditional sales markers and qualitative reviews from readers to gauge a book's interestingness or quality. And I have high hopes that someday, others will agree with me.