Comparing exit rates for two groups

Question

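We have 2 datasets in R, good and bad. Each contains users and games. The game column takes 10 different game types, 1,2,...,10. The dataset good contains users who played for a long time, and bad contains users who played for a short time and then stopped playing.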

head(good)
user   game
1      4
2      3
3      4
1      1
15     4
1      2

and

head(bad)
user   game
10      4
22      3
37      4
37      1
38      4
46      2

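I found the latest game each user played before he/she stopped playing. So, for a fixed game within a group, we have the number of times it was the 'last game played' divided by the total number of times it was played. This gives us an exit rate. A high exit rate means the game is likely to be the last game played; a low exit rate means the game is probably not the last game played.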
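As a rough sketch of that definition, the exit rate per game could be computed as below. It uses only the six rows shown in head(good) and assumes the rows are in chronological order, which the question does not state; the helper names is_last, times_played and times_last are illustrative.

```r
# The six rows shown in head(good)
good <- data.frame(user = c(1, 2, 3, 1, 15, 1),
                   game = c(4, 3, 4, 1, 4, 2))

# ASSUMPTION: rows are in chronological order, so the last row of each
# user is the last game that user played
is_last <- !duplicated(good$user, fromLast = TRUE)

games <- sort(unique(good$game))

# how often each game was played, and how often it was the last game played
times_played <- table(factor(good$game, levels = games))
times_last   <- table(factor(good$game[is_last], levels = games))

# exit rate = times it was the last game played / times it was played
exitrate_good <- data.frame(game = games,
                            exitrate = as.numeric(times_last / times_played))
exitrate_good
```

With so few rows the rates are not meaningful; the point is only the ratio being computed.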

In R we can see the exit rates for the group good

exitrate_good
game  exitrate
1     0.133333
2     0.127772
3     0.090332
...
9     0.317307
10    0.190854

and similarly for the other group, bad

exitrate_bad
game  exitrate
1     0.186522
2     0.045888
3     0.192556
...
9     0.365899
10    0.119331

Here we can see that game 9 has a high exit rate in both good and bad.

My question is: how do I find the games that are unpopular and cause players to stop playing?

The last game a user played might be what caused the user to stop playing. How should I compare the exit rates of the two groups?

-------- (Extended)

Let's look at the group good. In R I type last_game_good and we get this output

latest_game_played   not_latest_game_played
734                  3917
645                  3507
...
765                  2100
112                  535

So the first row simply says that this game was played 734+3917 times, and in 734 of those cases it was the latest game played.

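Here we can also see that game id 9 (the second-to-last row) has a very high latest_game_played compared with not_latest_game_played. For this I used pairwise.prop.test and got all pairwise comparisons, some with low p-values and some with p-values above 0.05. If I run the same thing for the other group, say the group bad, how can I use this information, and how do I compare the two?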

Answer 1

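One thing you can do is compare the games within each group. E.g. does game X have a higher exit rate than game Y in the good group? What about the bad group? Is it the same pattern? Or perhaps a completely different one?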

Another thing you can do is compare a game with itself across groups. E.g. does game X have a higher exit rate in the good group than in the bad group?

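A third thing is to specify in advance an exit rate that is bad for you, and compare all games in all groups against that rate. E.g. I know that a 40% exit rate is bad for me. Does any game in any group have an exit rate above 40%?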
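The third option could be sketched roughly as follows, using the good-group counts from the question. This is one possible approach, not the only one: it assumes a one-sided exact binomial test (binom.test) per game against the 40% example threshold, followed by a Holm adjustment because several games are tested. The column names p_above and p_above_adj are illustrative.

```r
# good-group counts taken from the question
dt <- data.frame(game_id = c(1, 2, 9, 10),
                 latest_game_played = c(734, 645, 765, 112),
                 not_latest_game_played = c(3917, 3507, 2100, 535))
dt$totals <- dt$latest_game_played + dt$not_latest_game_played

threshold <- 0.40  # the example threshold; choose your own in advance

# one-sided test per game; alternative hypothesis: exit rate > threshold
dt$p_above <- mapply(function(x, n) {
  binom.test(x, n, p = threshold, alternative = "greater")$p.value
}, dt$latest_game_played, dt$totals)

# adjust for having tested several games
dt$p_above_adj <- p.adjust(dt$p_above, method = "holm")

dt[, c("game_id", "p_above", "p_above_adj")]
```

A small adjusted p-value would indicate a game whose exit rate is credibly above the threshold. None of these four games comes close, since their observed rates are all well below 40%.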

I'll focus on the first case.

I created the dataset like this

dt = read.table(text=
"latest_game_played   not_latest_game_played
734                  3917
645                  3507
765                  2100
112                  535", header=TRUE)

# create game id
dt$game_id = c(1,2,9,10)

# create total numbers
dt$totals = dt$latest_game_played + dt$not_latest_game_played

dt

#   latest_game_played not_latest_game_played game_id totals
# 1                734                   3917       1   4651
# 2                645                   3507       2   4152
# 3                765                   2100       9   2865
# 4                112                    535      10    647

Then I compute the percentages and check whether there is at least one statistically significant difference

# check percentages
prop.test(dt$latest_game_played, dt$totals)

# 4-sample test for equality of proportions without continuity correction
# 
# data:  dt$latest_game_played out of dt$totals
# X-squared = 176.51, df = 3, p-value < 2.2e-16
# alternative hypothesis: two.sided
# sample estimates:
#    prop 1    prop 2    prop 3    prop 4 
# 0.1578155 0.1553468 0.2670157 0.1731066

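Note that you could also save these percentages in a new column of your own. The p-value you see is below 0.05, so at least one game has a higher exit rate than another. Or, in other words, it is reasonable to check the pairwise differences/comparisons. We don't know (yet) which difference is statistically significant, or whether there is more than one. The next step is to find out.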
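Saving the percentages in a column might look like the sketch below. The column name exit_rate is just an illustrative choice; the estimates returned by prop.test are the same numbers.

```r
dt = read.table(text=
"latest_game_played   not_latest_game_played
734                  3917
645                  3507
765                  2100
112                  535", header=TRUE)
dt$game_id = c(1,2,9,10)
dt$totals = dt$latest_game_played + dt$not_latest_game_played

# store the observed exit rate per game in a new column
dt$exit_rate = dt$latest_game_played / dt$totals

# the same proportions are available from the test object
res = prop.test(dt$latest_game_played, dt$totals)
unname(res$estimate)

dt[, c("game_id", "exit_rate")]
```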

# check pairwise comparisons
pairwise.prop.test(dt$latest_game_played, dt$totals)

# Pairwise comparisons using Pairwise comparison of proportions 
# 
# data:  dt$latest_game_played out of dt$totals 
# 
#      1       2       3      
# 2 0.82    -       -      
# 3 < 2e-16 < 2e-16 -      
# 4 0.82    0.82    3.2e-06
# 
# P value adjustment method: holm

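This is a table of the pairwise p-values. You can see that game 9 (prop 3) has a statistically significantly higher exit rate than all the others. The exit rates of the other games do not differ significantly from one another.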
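Because the table is indexed by position (1 to 4) rather than by game id, it can be easier to read if the counts are named first. The following is a small sketch of that idea; the "game_" labels and the sig_pairs data frame are illustrative names, and 0.05 is simply the usual cutoff.

```r
dt = data.frame(game_id = c(1, 2, 9, 10),
                latest_game_played = c(734, 645, 765, 112),
                not_latest_game_played = c(3917, 3507, 2100, 535))
dt$totals = dt$latest_game_played + dt$not_latest_game_played

# name the counts so the p-value table is labelled by game id
x = setNames(dt$latest_game_played, paste0("game_", dt$game_id))
pw = pairwise.prop.test(x, dt$totals)  # Holm adjustment is the default
pw$p.value

# list the pairs whose adjusted p-value is below 0.05
sig = which(pw$p.value < 0.05, arr.ind = TRUE)
sig_pairs = data.frame(game_a = rownames(pw$p.value)[sig[, "row"]],
                       game_b = colnames(pw$p.value)[sig[, "col"]],
                       p_adj  = pw$p.value[sig])
sig_pairs
```

Every pair listed involves game 9, which matches the reading of the table above.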

You can follow a similar process for your other group and see whether you find the same thing/pattern.
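Once both groups have been looked at, the second option mentioned earlier (comparing a game with itself across groups) could be sketched as a two-sample prop.test. The function name compare_across_groups is hypothetical. The good-group counts for game 9 come from the question, but the bad-group counts were not posted, so the values used here are placeholders, not real data.

```r
# Hypothetical helper: two-sample test of one game's exit rate, good vs bad
compare_across_groups = function(last_good, total_good, last_bad, total_bad) {
  prop.test(x = c(last_good, last_bad), n = c(total_good, total_bad))
}

# game 9 in the good group (from the question)
last_good  = 765
total_good = 765 + 2100

# PLACEHOLDER counts for game 9 in the bad group; replace with your own
placeholder_last_bad  = 300
placeholder_total_bad = 800

res = compare_across_groups(last_good, total_good,
                            placeholder_last_bad, placeholder_total_bad)
res$estimate  # exit rate of the game in each group
res$p.value   # small value: the two exit rates differ
```

If you repeat this for all 10 games, the resulting p-values should be adjusted for multiple comparisons, e.g. with p.adjust(..., method = "holm").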
