Bayesian average for email response rates

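I have a dataset containing different email codes, email recipients, and a flag for whether they responded to the email. For each person I calculated the past response rate, based on the emails before the current one (sum of responses / number of emails). It looks like this: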

email_code  responded  person  number_of_emails  response_rate  date
wy2         1         A       0                 0              2022/01/12
na3         1         A       1                 100            2022/01/22
li3         0         A       2                 100            2022/01/23
pa4         1         A       3                 66             2022/01/24   

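However, this doesn't seem right. Imagine A received 1 email and responded to it, so their response rate is 100%. B received 10 emails and responded to 9 of them, so their response rate is 90%. Yet B is arguably the one more likely to respond, since their rate is based on far more emails.

I think I need to calculate some kind of Bayesian average, similar to this post and this website. However, those sites show how to do this for ratings, and I don't know how to adapt the formula to my case.

Any help/suggestions would be greatly appreciated!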

This post on SO describes how to calculate a Bayesian rating perfectly, IMO.

I quote:

rating = (v / (v + m)) * R +
         (m / (v + m)) * C;

The variables are:

  • R – The item's own rating. R is the average of the item's votes. (For example, if an item has no votes, its R is 0. If someone gives it 5 stars, R becomes 5. If someone else gives it 1 star, R becomes 3, the average of [1, 5]. And so on.)
  • C – The average item's rating. Find the R of every single item in the database, including the current one, and take the average of them; that is C. (Suppose there are 4 items in the database, and their ratings are [2, 3, 5, 5]. C is 3.75, the average of those numbers.)
  • v – The number of votes for an item. (To give another example, if 5 people have cast votes on an item, v is 5.)
  • m – The tuneable parameter. The amount of "smoothing" applied to the rating is based on the number of votes (v) in relation to m. Adjust m until the results satisfy you. And don't misinterpret IMDb's description of m as "minimum votes required to be listed" – this system is perfectly capable of ranking items with less votes than m.
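
As a minimal sketch, the quoted formula could be written in Python like this. The function name `bayesian_rating`, the choice of m = 10, and the vote counts are assumptions made for illustration; only the formula and the ratings [2, 3, 5, 5] come from the quote.

```python
def bayesian_rating(R, v, C, m):
    """Weighted average of an item's own rating R and the global mean C."""
    if v + m == 0:
        # No real votes and no smoothing: nothing to average, fall back to C.
        return C
    return (v / (v + m)) * R + (m / (v + m)) * C


# The ratings [2, 3, 5, 5] from the quote give C = 3.75.
C = sum([2, 3, 5, 5]) / 4

m = 10  # assumed smoothing strength

# An item with a single 5-star vote is pulled strongly toward C ...
few_votes = bayesian_rating(R=5, v=1, C=C, m=m)
# ... while an item with 100 five-star votes stays close to 5.
many_votes = bayesian_rating(R=5, v=100, C=C, m=m)

print(few_votes, many_votes)
```

Both items have the same raw average, but the one with more votes keeps more of it.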

So in your case:

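  • R is the response rate, i.e. number of replies / number of received emails. If someone hasn't received any emails, set R to 0 to avoid division by zero. If they haven't responded to any received emails, their R is of course 0.

  • C is the sum of R over all recipients divided by the number of recipients.

  • v is the number of emails received. If someone has received 10 emails, their v is 10. If they haven't received any emails, their v is zero.

  • m, as described in the original post, is the tunable parameter.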
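
Putting that mapping together, here is a rough Python sketch. A (1 reply out of 1) and B (9 out of 10) are taken from the question; recipients D and E, the value m = 5, and the function name `smoothed_response_rate` are made up for illustration.

```python
def smoothed_response_rate(replies, received, C, m):
    """Bayesian-averaged response rate for one recipient."""
    v = received
    # R is 0 when nothing was received, which avoids division by zero.
    R = replies / received if received > 0 else 0.0
    if v + m == 0:
        return C
    return (v / (v + m)) * R + (m / (v + m)) * C


# recipient -> (replies, emails received); D and E are hypothetical.
people = {"A": (1, 1), "B": (9, 10), "D": (2, 10), "E": (0, 4)}

# C: the mean of the raw response rates over all recipients.
raw = {p: (r / n if n else 0.0) for p, (r, n) in people.items()}
C = sum(raw.values()) / len(raw)

m = 5  # assumed smoothing strength; adjust until the ranking looks sensible

smoothed = {p: smoothed_response_rate(r, n, C, m) for p, (r, n) in people.items()}
print(smoothed)
```

With these numbers B ends up ranked above A, because A's single reply is outweighed by the m imaginary emails at the average rate. Note that the effect depends on C: if every recipient had a high raw rate, A's 100% would still sit above B's 90% after smoothing, just by a smaller margin.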

Quoting the original post further, it describes m nicely:

All the formula does is: add m imaginary votes, each with a value of C, before calculating the average. In the beginning, when there isn't enough data (i.e. the number of votes is dramatically less than m), this causes the blanks to be filled in with average data. However, as votes accumulate, eventually the imaginary votes will be drowned out by real ones.
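
The "imaginary votes" reading can be checked numerically. The sketch below assumes C = 0.5 and m = 10 and a recipient who replies to every email (all illustrative values, as are the function names). It compares the weighted form of the formula with the form that literally adds m votes of value C, and shows the estimate moving from C toward the real rate as emails accumulate.

```python
def weighted_form(R, v, C, m):
    # The formula as quoted: blend of own rate R and global mean C.
    return (v / (v + m)) * R + (m / (v + m)) * C


def imaginary_votes_form(total, v, C, m):
    # total is the sum of the real votes (for emails: the number of replies);
    # m extra votes, each worth C, are added before averaging.
    return (total + m * C) / (v + m)


C, m = 0.5, 10  # assumed values for illustration

results = []
for v in (1, 10, 100, 1000):
    replies = v  # this recipient always replies, so R = 1.0
    a = weighted_form(replies / v, v, C, m)
    b = imaginary_votes_form(replies, v, C, m)
    assert abs(a - b) < 1e-12  # both forms agree
    results.append((v, b))

for v, rate in results:
    print(v, round(rate, 3))
```

The second form is also convenient in practice, since it only needs the running count of replies and emails per person.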