Skip to content

【翻译】How Reddit ranking algorithms work

原文地址:http://amix.dk/blog/post/19588

All rights belong to Amir Salihefendic, I just do the translation. 🙂

=========================分割线======================

How Reddit ranking algorithms work

Reddit Icon

This is a follow up post to How Hacker News ranking algorithm works. This time around I will examine how Reddit’s default story and comment rankings work. Reddit’s algorithms are fairly simple to understand and to implement and in this post I’ll dig deeper into them.

女士们绅士们,这次我们来探索发现Reddit网站的默认的故事和评论的排行算法是怎么运作的。Reddit的算法相对来说很简单易懂而且也很好开发,在这个博文里我会更深入的探究这个算法。

The first part of this post will focus on story ranking, i.e. how are Reddit stories ranked? The second part of this post will focus on comment ranking, which does not use the same ranking as stories (unlike Hacker News), Reddit’s comment ranking algorithm is quite interesting and the idea guy behind it is Randall Munroe (the author of xkcd).

博文的第一部分会专注于故事的排行,举个栗子:这些Reddit故事是怎么排的呢?博文的第二部分会专注于评论的排行,这不像Hacker News那样,Reddit的评论排行和故事排行用的是不同的算法。 Reddit的评论排行算法十分有趣,然后这个算法背后的男人是Randall Munroe.

Digging into the story ranking code

Reddit is open sourced and the code is freely available. Reddit is implemented in Python and their code is located here. Their sorting algorithms are implemented in Pyrex, which is a language to write Python C extensions. They have used Pyrex for speed reasons. I have rewritten their Pyrex implementation into pure Python since it’s easier to read.

Reddit网站是开源的,其源码在网上可以免费得到。Reddit是用Python开发的,然后其排序算法是用Pyrex研发,Pyrex是一种融合python和C语言的类Python编程语言。Reddit使用Pyrex的原因是因为速度。 我已经用纯Python重写了他们的算法所以大家可以愉快的食用。

(来自官网:Pyrex is a Python-like language for rapidly and easily writing python extension modules. It can be described as python with C data types. With Pyrex, one can produce Python-like code that runs as fast as in C, with easy access to C libraries and functions.)

The default story algorithm called the hot ranking is implemented like this:

默认的故事排序算法叫做热门排序,代码如下:

In mathematical notation the hot algorithm looks like this (I have this from SEOmoz, but I doubt they are the author of this):

Reddit's algorithm

Effects of submission time

发帖时间的影响

Following things can be said about submission time related to story ranking:

  • Submission time has a big impact on the ranking and the algorithm will rank newer stories higher than older
  • The score won’t decrease as time goes by, but newer stories will get a higher score than older. This is a different approach than the Hacker News’s algorithm which decreases the score as time goes by

以下因素代表了发帖时间对故事排行的影响:

  1. 发帖时间对于故事排行有巨大影响,算法会把新帖子排在旧帖子的前面(其他条件相同的情况下)
  2. 分数并不会因时间的流逝而降低,但是新的故事会得到更高的分数,这个方法跟Hacker News那个分数随时间消逝的算法不太一致

Here is a visualization of the score for a story that has same amount of up and downvotes, but different submission time:

Reddit's score

The logarithm scale

Reddit’s hot ranking uses the logarithm function to weight the first votes higher than the rest. Generally this applies:

  • The first 10 upvotes have the same weight as the next 100 upvotes which have the same weight as the next 1000 etc…

Reddit的热门排行算法使用了对数函数来给予刚得赞的帖子相对于其他帖子来说更高的权重。通常来说:

  1. 头十个赞和接下来100个赞得相同的权重,然后接下来的1000个赞也是一样

Here is a visualization:

The logarithm function's effects

Without using the logarithm scale the score would look like this:

如果没有使用对数函数:

Without logarithm function

Effects of downvotes

不赞的影响力

Reddit is one of the few sites that has downvotes. As you can read in the code a story’s “score” is defined to be:

  • up_votes – down_votes

Reddit是为数不多可以给不赞的网站。正如代码里显示的,故事的分数的定义是:

点赞 – 不赞

The meaning of this can be visualized like this:

Effects of down votes
This has a big impact for stories that get a lot of upvotes and downvotes (e.g. controversial stories) as they will get a lower ranking than stories that just get upvotes. This could explain why kittens (and other non-controversial stories) rank so high 🙂

这对于那些有很多赞和不赞的故事来说有巨大影响(比如争议性故事),因此它们会比那些只得到赞的故事的排行更低。这可能说明了为什么有关猫咪的故事排行都那么高。

Conclusion of Reddit’s story ranking

  • Submission time is a very important parameter, generally newer stories will rank higher than older
  • The first 10 upvotes count as high as the next 100. E.g. a story that has 10 upvotes and a story that has 50 upvotes will have a similar ranking
  • Controversial stories that get similar amounts of upvotes and downvotes will get a low ranking compared to stories that mainly get upvotes

Reddit故事排行的总结:

  1. 发帖时间是一个非常重要的参数,通常来说新故事的排行会比旧的高
  2. 头10个赞的重要性和接下来100个赞的重要性一样。比如一个有10个赞的故事的排行可能和一个有50个赞的故事的排行差不多
  3. 那些被赞数和被不赞数差不多的具有争议性的故事会比那些只得赞的故事的排行会低

How Reddit’s comment ranking works

Randall Munroe of xkcd is the idea guy behind Reddit’s best ranking. He has written a great blog post about it:

Randall Munroe 是Reddit 评论排行算法背后的男人。他曾写过一篇给力的博文来介绍这个算法。(猛击上面的链接)

You should read his blog post as it explains the algorithm in a very understandable way. The outline of his blog post is following:

  • Using the hot algorithm for comments isn’t that smart since it seems to be heavily biased toward comments posted early
  • In a comment system you want to rank the best comments highest regardless of their submission time
  • A solution for this has been found in 1927 by Edwin B. Wilson and it’s called “Wilson score interval”, Wilson’s score interval can be made into “the confidence sort”
  • The confidence sort treats the vote count as a statistical sampling of a hypothetical full vote by everyone – like in an opinion poll
  • How Not To Sort By Average Rating outlines the confidence ranking in higher detail, definitely recommended reading!

这篇博文通俗易懂,是居家旅行必备良药。博文的大纲大致如此:

  1. 在评论上使用热门算法不给力啊!因为这个算法对于那些发的早的评论的重要性的估计有强烈的误差
  2. 在评论系统里,你希望根据发布时间来排行那些得分最高的评论
  3. 一个解决方法在1927年被Edwin大神发现了,那种方法叫做“威尔逊置信区间”,威尔逊置信区间可以用来进行置信度排行
  4. 置信度排序把投票数当做一个得满票假设的统计样本,就好像一个民意测验
  5. How Not To Sort By Average Rating这篇博文更为详细的讲述了这个方法,强推!

Digging into the comment ranking code

The confidence sort algorithm is implemented in _sorts.pyx, I have rewritten their Pyrex implementation into pure Python (do also note that I have removed their caching optimization):

这置信度排行算法在sort.pyx文件中,我用纯python重写了这个算法方便大家理解。

The confidence sort uses Wilson score interval and the mathematical notation looks like this:

Wilson's score interval

使用威尔逊置信区间的置信度排行和其数学表达式如上:

In the above formula the parameters are defined in a following way:

  • p is the observed fraction of positive ratings
  • n is the total number of ratings
  • zα/2 is the (1-α/2) quantile of the standard normal distribution

在上面的方程里,各项参数定义如下:

  1. p 是观察到的好评率
  2. n 是评分总数
  3. zα/2是标准正态分布的(1-α/2)位分位数

Let’s summarize the above in a following manner:

  • The confidence sort treats the vote count as a statistical sampling of a hypothetical full vote by everyone
  • The confidence sort gives a comment a provisional ranking that it is 85% sure it will get to
  • The more votes, the closer the 85% confidence score gets to the actual score
  • Wilson’s interval has good properties for a small number of trials and/or an extreme probability

让我们来用以下方式来总结一下:

  1. 置信度排序把投票数当做一个得满票假设的统计样本
  2. 置信度排序给予一个评论一个有85%可能性的临时的排名
  3. 投票越多,其85%的置信度分数更接近与真实分数
  4. 威尔逊置信区间拥有好的特性来应对小样本或者极端概率的发生

Randall has a great example of how the confidence sort ranks comments in his blog post:

If a comment has one upvote and zero downvotes, it has a 100% upvote rate, but since there’s not very much data, the system will keep it near the bottom. But if it has 10 upvotes and only 1 downvote, the system might have enough confidence to place it above something with 40 upvotes and 20 downvotes — figuring that by the time it’s also gotten 40 upvotes, it’s almost certain it will have fewer than 20 downvotes. And the best part is that if it’s wrong (which it is 15% of the time), it will quickly get more data, since the comment with less data is near the top.

如果一个评论有一个赞和0个不赞,它就有100%的得赞率,但是因为这数据量不大,系统会让其保持在末尾。若如果它有10个赞和只有1个不赞,系统则有足够的置信度来将其排在另一个有40个赞和20个不赞的帖子前面—计算其随着时间的推移,它也会得到40个赞,但是可以十分肯定的是它将会得到少于20个不赞。还有最美好的部分是如果它错了(有15%的时间里会发生),它会快速的得到更多的数据,因为没啥数据的评论在排名顶端那附近。

Effects of submission time: there are none!

The great thing about the confidence sort is that submission time is irrelevant (much unlike the hot sort or Hacker News’s ranking algorithm). Comments are ranked by confidence and by data sampling – – i.e. the more votes a comment gets the more accurate its score will become.

置信度排序的一个好处是发帖时间变得不相关了 (就跟热门排序或者Hacker New’s的排序算法不太一样)。 评论是根据置信度和样本来排序的–比如:如果一个评论得到越多的投票,那它的得分(或者在排序中的所得)会更加准确

Visualization

Let’s visualize the confidence sort and see how it ranks comments. We can use Randall’s example:

Reddit's confidence score visualized
As you can see the confidence sort does not care about how many votes a comment have received, but about how many upvotes it has compared to the total number of votes and to the sampling size!

如图,置信度排序不在乎评论收到了多少投票,但是它在乎在样本大小的全部投票里有多少赞

Application outside of ranking

Like Evan Miller notes Wilson’s score interval has applications outside of ranking. He lists 3 examples:

  • Detect spam/abuse: What percentage of people who see this item will mark it as spam?
  • Create a “best of” list: What percentage of people who see this item will mark it as “best of”?
  • Create a “Most emailed” list: What percentage of people who see this page will click “Email”?

To use it you only need two things:

  • the total number of ratings/samplings
  • the positive number of ratings/samplings

Given how powerful and simple this is, it’s amazing that most sites today use the naive ways to rank their content. This includes billion dollar companies like Amazon.com, which define Average rating = (Positive ratings) / (Total ratings).

这段不翻啦,在另一篇博文有。:)

Conclusion

I hope you have found this useful and leave comments if you have any questions or remarks.

Happy hacking as always 🙂

Related

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.