
[Translation] How Not To Sort By Average Rating

Original article: http://www.evanmiller.org/how-not-to-sort-by-average-rating.html

All rights belong to Evan Miller; I only did the translation.

======================================================================

How Not To Sort By Average Rating

如何不使用平均评价来进行排序

By Evan Miller

February 6, 2009

PROBLEM: You are a web programmer. You have users. Your users rate stuff on your site. You want to put the highest-rated stuff at the top and lowest-rated at the bottom. You need some sort of “score” to sort by.

问题: 假设你是一名苦逼的写网站的程序员。你(的网站)拥有用户,你的用户大爷们在你的网站上对各种东西进行评分。你想要把评分最高的东西放在页面最上头,评分最低的放在页面最下头(或者,第二页,如图,lol)。你需要依据一些分数来进行排序。

[Image: googleFunny]

WRONG SOLUTION #1: Score = (Positive ratings) – (Negative ratings)

Why it is wrong: Suppose one item has 600 positive ratings and 400 negative ratings: 60% positive. Suppose item two has 5,500 positive ratings and 4,500 negative ratings: 55% positive. This algorithm puts item two (score = 1000, but only 55% positive) above item one (score = 200, and 60% positive). WRONG.

错误的解决方法1:分数 = 好评 – 差评

错误的原因: 假设一个东西有600个好评和400个差评,那么60%的评价是好得。 又假设另一个东西有5500个好评和4500个差评,那么它的好评率只有55%。 这种算法会把第二件物品(分数=1000,但是只有55%的好评率)排在第一件物品前面(分数=200,但好评率有60%)错错错!

Sites that make this mistake: Urban Dictionary

犯此错误的网站:Urban Dictionary
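The arithmetic behind Wrong Solution #1 can be checked with a minimal Ruby sketch. The helper name `net_score` is made up for this illustration; the counts are the two items from the example above.

```ruby
# Hypothetical helper: Wrong Solution #1, positive ratings minus negative ratings.
def net_score(positive, negative)
  positive - negative
end

# Item one: 600 positive, 400 negative (60% positive).
# Item two: 5,500 positive, 4,500 negative (55% positive).
score_one = net_score(600, 400)
score_two = net_score(5500, 4500)

# Item two gets the higher score even though a smaller share of its ratings is positive.
puts "item one: #{score_one}, item two: #{score_two}"
```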

WRONG SOLUTION #2: Score = Average rating = (Positive ratings) / (Total ratings)

Why it is wrong: Average rating works fine if you always have a ton of ratings, but suppose item 1 has 2 positive ratings and 0 negative ratings. Suppose item 2 has 100 positive ratings and 1 negative rating. This algorithm puts item two (tons of positive ratings) below item one (very few positive ratings). WRONG.

错误的解决方法2:分数 = 平均评价 = 好评 / 总评 (好评率)

错误的原因: 如果你经常能得到成千上万的评价的话,平均评价其实还能说明问题。但是假设item1有个两个好评和0个差评,item2有100个好评和1个差评。这种算法会把item2(有一堆评价)排在item1(基本没有评价)后面。错错错!

Sites that make this mistake: Amazon.com

犯此错误的网站:Amazon.com
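The same kind of check for Wrong Solution #2, again as a small Ruby sketch. The helper name `average_rating` and the choice to score unrated items as 0.0 are assumptions made for this illustration.

```ruby
# Hypothetical helper: Wrong Solution #2, the fraction of ratings that are positive.
def average_rating(positive, negative)
  total = positive + negative
  return 0.0 if total.zero? # assumption: an item with no ratings scores 0
  positive.to_f / total
end

item_one = average_rating(2, 0)   # 2 positive, 0 negative
item_two = average_rating(100, 1) # 100 positive, 1 negative

# Item one scores a perfect 1.0 from only two ratings and lands above item two.
puts "item one: #{item_one}, item two: #{item_two.round(3)}"
```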

CORRECT SOLUTION: Score = Lower bound of Wilson score confidence interval for a Bernoulli parameter

正解:分数 = 伯努利分布下的威尔逊置信区间的下限值

Say what: We need to balance the proportion of positive ratings with the uncertainty of a small number of observations. Fortunately, the math for this was worked out in 1927 by Edwin B. Wilson. What we want to ask is: Given the ratings I have, there is a 95% chance that the “real” fraction of positive ratings is at least what? Wilson gives the answer. Considering only positive and negative ratings (i.e. not a 5-star scale), the lower bound on the proportion of positive ratings is given by:

我们需要平衡好评中由于小样本而导致的不确定性。幸运的是,这个数学难题在1927年被Edwin B.Wilson解决了。 我们想问的是: 已知所有的评分,那么有95%的几率真实的好评率是至少多少呢?Wilson童鞋给出的答案是。 如果只考虑好评和差评(比如不是5星评价),那么好评率的下限值为:
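For reference, the Wilson score interval in its standard textbook form, written with the symbols this section uses (p̂ for the observed fraction of positive ratings, z_{α/2} for the normal quantile, n for the total number of ratings). This is a reconstruction of the standard statement of the interval, not a copy of the original figure:

```latex
\frac{\hat{p} + \dfrac{z_{\alpha/2}^{2}}{2n} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z_{\alpha/2}^{2}}{4n^{2}}}}{1 + \dfrac{z_{\alpha/2}^{2}}{n}}
```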

(Use minus where it says plus/minus to calculate the lower bound.) Here p̂ is the observed fraction of positive ratings, z_{α/2} is the (1-α/2) quantile of the standard normal distribution, and n is the total number of ratings.

(在用加减来计算下限值的地方用减)。 p̂ 是观察到得好评率,zα/2是标准正态分布下的(1-α/2)分位数,n是总评分个数

The same formula implemented in Ruby:

pos is the number of positive ratings, n is the total number of ratings, and confidence refers to the statistical confidence level: pick 0.95 to have a 95% chance that your lower bound is correct, 0.975 to have a 97.5% chance, etc. The z-score in this function never changes, so if you don’t have a statistics package handy or if performance is an issue you can always hard-code a value here for z. (Use 1.96 for a confidence level of 0.95.)

pos 是好评数,n是总评价数,然后confidence表示统计置信水平:选择0.95来拥有95%的机会得出正确的下限值,0.975来拥有97.5%的机会,以此类推。z-score函数是不变的,所以如果你没有一个靠谱的统计包或者程序的表现不给力,你可以自己给它一个值(在置信水平0.95的情况下用1.96吧)
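A minimal Ruby sketch that matches the description above, using the parameters `pos`, `n` and `confidence`. It is not the original listing: the function name `ci_lower_bound` and the `Z_SCORES` table are assumptions for this illustration, and the z values are hard-coded (as the text suggests) rather than computed by a statistics package, so only the confidence levels listed in the table are supported.

```ruby
# Assumption: hard-coded z values, i.e. the (1 - (1 - confidence) / 2) quantile
# of the standard normal distribution for a few common confidence levels.
Z_SCORES = { 0.90 => 1.6449, 0.95 => 1.96, 0.99 => 2.5758 }.freeze

# Lower bound of the Wilson score interval.
# pos:        number of positive ratings
# n:          total number of ratings
# confidence: statistical confidence level; must be a key of Z_SCORES
def ci_lower_bound(pos, n, confidence)
  return 0.0 if n.zero? # no ratings yet: score the item at the bottom
  z = Z_SCORES.fetch(confidence) # raises KeyError for unsupported levels
  phat = pos.to_f / n            # observed fraction of positive ratings
  (phat + z * z / (2 * n) -
    z * Math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)
end

# The two items from Wrong Solution #2: 2 of 2 positive versus 100 of 101 positive.
puts ci_lower_bound(2, 2, 0.95).round(3)
puts ci_lower_bound(100, 101, 0.95).round(3)
```

With this score the item with 100 positive ratings out of 101 ranks above the item with 2 out of 2, which is the ordering the average rating got wrong.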

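The following passage from someone else's translation is inserted here as a supplement to this section:

“So what is the correct algorithm?

We start from the following assumptions:

(1) Each user's vote is an independent event.

(2) Each user has only two choices: vote in favor or vote against.

(3) If n people vote in total and k of them vote in favor, the proportion of favorable votes p equals k/n.

If you are familiar with statistics, you may already have recognized this as a statistical distribution called the “binomial distribution”. This matters, and it is used right below.

The idea is that the larger p is, the higher the item's share of positive ratings, and the higher it should rank. But how far p can be trusted depends on how many people voted; if the sample is too small, p is not reliable. Fortunately, we know that p is the probability of an event in a binomial distribution, so we can compute a confidence interval for p. A “confidence interval” is the range that p falls into with a given probability. For example, a product may have a positive rate of 80%, but that value is not necessarily reliable. Statistically, all we can say is that we are 95% confident the positive rate lies between 75% and 85%, that is, the confidence interval is [75%, 85%].

With that, the ranking algorithm becomes fairly clear:

Step 1: compute each item's “positive rate” (the proportion of favorable votes).

Step 2: compute the confidence interval of each “positive rate” (at a 95% probability).

Step 3: rank by the lower bound of the confidence interval. The larger this value, the higher the rank.

This works because the width of the confidence interval depends on the sample size. For example, A has 8 favorable votes and 2 opposing votes; B has 80 favorable votes and 20 opposing votes. Both items have 80% favorable votes, but B's confidence interval (say [75%, 85%]) is much narrower than A's (say [70%, 90%]), so the lower bound of B's interval (75%) is larger than A's (70%), and B should rank ahead of A.

In essence, the confidence interval applies a reliability correction that compensates for a small sample. With many samples the estimate is fairly reliable and needs little correction, so the interval is narrow and the lower bound is relatively high; with few samples the estimate is not necessarily reliable and needs a large correction, so the interval is wide and the lower bound is relatively low.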

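There are several formulas for the confidence interval of a binomial distribution. The most common is the “normal approximation interval”, which is the method found in almost every textbook. However, it only suits fairly large samples (np > 5 and n(1 − p) > 5); for small samples its accuracy is poor.

In 1927 the American mathematician Edwin Bidwell Wilson proposed a corrected formula, known as the “Wilson interval”, which handles the small-sample accuracy problem well.

$$\frac{\hat{p} + \dfrac{z_{\alpha/2}^{2}}{2n} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z_{\alpha/2}^{2}}{4n^{2}}}}{1 + \dfrac{z_{\alpha/2}^{2}}{n}}$$

In the formula above, $\hat{p}$ is the sample's “proportion of favorable votes”, n is the sample size, and $z_{\alpha/2}$ is the z statistic for a given confidence level. It is a constant that can be looked up in a table or obtained from a statistics package. Typically, at a 95% confidence level the z statistic is 1.96.

The center of the Wilson confidence interval is

$$\frac{\hat{p} + \dfrac{z_{\alpha/2}^{2}}{2n}}{1 + \dfrac{z_{\alpha/2}^{2}}{n}}$$

Its lower bound is

$$\frac{\hat{p} + \dfrac{z_{\alpha/2}^{2}}{2n} - z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z_{\alpha/2}^{2}}{4n^{2}}}}{1 + \dfrac{z_{\alpha/2}^{2}}{n}}$$

As you can see, when n is large enough this lower bound tends toward $\hat{p}$. If n is very small (few people voted), the lower bound is far below $\hat{p}$. In effect it lowers the “proportion of favorable votes”, so the item's score shrinks and its rank drops.”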
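The A and B example from the quoted passage can be worked through numerically. The Ruby sketch below computes the lower bound of both the normal approximation interval and the Wilson interval at a 95% level, and checks the np > 5 and n(1 − p) > 5 rule of thumb. The names `normal_lower`, `wilson_lower` and `normal_ok?` are made up for this illustration, and the interval values quoted in the passage ([70%, 90%], [75%, 85%]) were only assumed there, so the computed bounds differ from them.

```ruby
Z = 1.96 # z statistic for a 95% confidence level

# Lower bound of the normal approximation interval for k favorable votes out of n.
def normal_lower(k, n)
  p_hat = k.to_f / n
  p_hat - Z * Math.sqrt(p_hat * (1 - p_hat) / n)
end

# Lower bound of the Wilson interval for k favorable votes out of n.
def wilson_lower(k, n)
  p_hat = k.to_f / n
  (p_hat + Z**2 / (2 * n) -
    Z * Math.sqrt(p_hat * (1 - p_hat) / n + Z**2 / (4.0 * n * n))) / (1 + Z**2 / n)
end

# Rule of thumb for the normal approximation: np > 5 and n(1 - p) > 5.
# With p = k/n these are simply k > 5 and n - k > 5.
def normal_ok?(k, n)
  k > 5 && (n - k) > 5
end

# A: 8 of 10 votes in favor. B: 80 of 100 votes in favor.
{ "A" => [8, 10], "B" => [80, 100] }.each do |name, (k, n)|
  puts format("%s: normal %.3f, Wilson %.3f, normal approximation usable: %s",
              name, normal_lower(k, n), wilson_lower(k, n), normal_ok?(k, n))
end
```

Both methods rank B above A, but for A (only 2 opposing votes) the rule of thumb fails and the two lower bounds disagree noticeably, which is the small-sample case the Wilson interval was designed for.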


UPDATE, April 2012: Here is an illustrative SQL statement that will do the trick, assuming you have a widgets table with positive and negative ratings, and you want to sort them on the lower bound of a 95% confidence interval:

此处有个能够说明问题的SQL代码可以实现这个小把戏,假设你有一个记载好评差评的数据库表widgets,然后你想根据95%的置信水平下的下限值来排列它们:

If your boss doesn’t believe that such a complicated SQL statement could possibly return a useful result, just compare the results to the other two methods described above:

如果你的老板不相信一个这个如天书一般的SQL查询代码能够得到一丁点儿有用的结果,那么对比一下上面的方法的结果就知道了

You will quickly see that the extra bit of math makes all the good stuff bubble up to the top. (But before running this SQL on a massive database, talk to your friendly neighborhood database administrator about proper use of indexes.)
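The same comparison can be tried without a database. The Ruby sketch below is an in-memory illustration, not the SQL statements referred to above: it ranks the four example items from this article by each of the three scores. The `Item` struct, the `ITEMS` list and all method names are assumptions made for the illustration, and z is hard-coded to 1.96 for a 95% confidence level.

```ruby
Item = Struct.new(:name, :positive, :negative)

# The four example items used earlier in the article.
ITEMS = [
  Item.new("600+/400-", 600, 400),
  Item.new("5500+/4500-", 5500, 4500),
  Item.new("2+/0-", 2, 0),
  Item.new("100+/1-", 100, 1)
].freeze

# Wrong Solution #1: positive minus negative.
def net_score(item)
  item.positive - item.negative
end

# Wrong Solution #2: fraction of positive ratings.
def average_rating(item)
  n = item.positive + item.negative
  n.zero? ? 0.0 : item.positive.to_f / n
end

# Lower bound of the Wilson score interval, z fixed at 1.96 (95% confidence).
def wilson_lower_bound(item, z = 1.96)
  n = item.positive + item.negative
  return 0.0 if n.zero?
  phat = item.positive.to_f / n
  (phat + z * z / (2 * n) -
    z * Math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)
end

# Item names sorted from highest to lowest score; the block supplies the score.
def ranking(items)
  items.sort_by { |item| -yield(item) }.map(&:name)
end

puts "net score:   #{ranking(ITEMS) { |i| net_score(i) }.join(', ')}"
puts "average:     #{ranking(ITEMS) { |i| average_rating(i) }.join(', ')}"
puts "Wilson bound: #{ranking(ITEMS) { |i| wilson_lower_bound(i) }.join(', ')}"
```

Only the Wilson lower bound puts the item with 100 positive and 1 negative rating first and pushes the item with just two ratings to the bottom.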


I initially devised this method for a Chuck Norris-style fact generator in honor of one of my professors, but it has since caught on at places like Reddit, Yelp, and Digg.

OTHER APPLICATIONS

The Wilson score confidence interval isn’t just for sorting, of course. It is useful whenever you want to know with confidence what percentage of people took some sort of action. For example, it could be used to:

  • Detect spam/abuse: What percentage of people who see this item will mark it as spam?
  • Create a “best of” list: What percentage of people who see this item will mark it as “best of”?
  • Create a “Most emailed” list: What percentage of people who see this page will click “Email”?

Indeed, it may be more useful in a “top rated” list to display those items with the highest number of positive ratings per page view, download, or purchase, rather than positive ratings per rating. Many people who find something mediocre will not bother to rate it at all; the act of viewing or purchasing something and declining to rate it contains useful information about that item’s quality.

 

当然,威尔逊置信区间并非只能用来排序。 当你想要了解人们做某些事情的比率的置信度的时候也可以使用。举个栗子:

  1. 检测垃圾邮件/和谐词:看到这东西并将其标记为垃圾邮件的人得比率是多少?
  2. 生成“最好的xx”的清单:看到这东西并将其标记为“最好的xx”的人的比率是多少?
  3. 生成“被email转发最多的页面“的清单:看到这页面并点击”Email”的人的比率是多少?

当然,在一个”评分最高“的列表里显示在不同情况如被观看页面,下载页面,或支付页面下拥有最高的好评率的商品显然比单纯显示好评率要实用得多。(因为)当人们觉得一个东西一般的时候就不会想要去评分,但是人们的页面查看行为或支付行为里包含着暗示产品质量的有用信息。
