prop.test on count data with multi-level factors in R

Question

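I can't figure out whether prop.test can handle this much data at once (see below), or whether I need to run the code separately for each level of the factor "Zone". So far I've seen plenty of examples written in this format, but I have more factor levels: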

# Whipray "Zone"
prop.test(c(4,4,0), c(9,7,15))

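I want to know whether the proportion of fish with empty stomachs differs statistically across years and between sites, i.e. to test the null hypothesis that the proportion of empty stomachs is the same at every time and place (followed by pairwise tests to see where any differences lie, if they exist).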

Number of fish with empty stomachs

> table1 <- xtabs(empty_count ~ Zone + Year, data = df)
> table1
           Year
Zone        2016 2017 2018
  Crocodile    0    8    2
  Rankin       3   17    8
  West         7   31   17
  Whipray      4    4    0

Number of all fish caught

> table2 <- xtabs(total_count ~ Zone + Year, data = df)
> table2
           Year
Zone        2016 2017 2018
  Crocodile    1   18    7
  Rankin      14   46   69
  West        29   67   58
  Whipray      9    7   15

Answer 1

I think I've managed to reverse-engineer your original data frame from the cross-tabs:

df
#>         Zone Year total_count empty_count
#> 1  Crocodile 2016           1           0
#> 2     Rankin 2016          14           3
#> 3       West 2016          29           7
#> 4    Whipray 2016           9           4
#> 5  Crocodile 2017          18           8
#> 6     Rankin 2017          46          17
#> 7       West 2017          67          31
#> 8    Whipray 2017           7           4
#> 9  Crocodile 2018           7           2
#> 10    Rankin 2018          69           8
#> 11      West 2018          58          17
#> 12   Whipray 2018          15           0

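It seems to me that, rather than attempting multiple pairwise comparisons, you would be better off fitting a single logistic regression to find out where the significant differences lie. Just make sure you have a "non-empty" count column and that your year is a factor: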

df$non_empty_count <- df$total_count - df$empty_count
df$Year <- as.factor(df$Year)

Your logistic regression would look like this:

model <- glm(cbind(empty_count, non_empty_count) ~ Zone + Year, 
             data = df, family = binomial)

summary(model)
#> 
#> Call:
#> glm(formula = cbind(empty_count, non_empty_count) ~ Zone + Year, 
#>     family = binomial, data = df)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -2.47563  -0.59548   0.04512   0.56564   1.26509  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)  
#> (Intercept)  -0.9588     0.5360  -1.789   0.0737 .
#> ZoneRankin   -0.4864     0.4740  -1.026   0.3048  
#> ZoneWest      0.1281     0.4552   0.281   0.7784  
#> ZoneWhipray  -0.1395     0.6038  -0.231   0.8173  
#> Year2017      0.7975     0.3659   2.180   0.0293 *
#> Year2018     -0.3861     0.3831  -1.008   0.3136  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 39.255  on 11  degrees of freedom
#> Residual deviance: 11.800  on  6  degrees of freedom
#> AIC: 57.905
#> 
#> Number of Fisher Scoring iterations: 5

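You can interpret this as follows: there is no significant difference in the proportion of empty stomachs between sites, but across all sites the proportion of fish with empty stomachs was significantly higher in 2017 than in the other years. (Each coefficient is a contrast with the reference levels, here Crocodile and 2016.)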
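The summary table only shows contrasts against the reference levels, so one way to check the "no zone effect, but a year effect" reading is to test each factor as a whole. The following is a minimal sketch, not part of the original answer: it rebuilds the data frame so it runs on its own, uses `drop1()` for likelihood-ratio tests of each term, and converts the coefficients to odds ratios. The object names `lr_tests` and `odds_ratios` are introduced here purely for illustration.

```r
# Rebuild the data from the answer so this block runs on its own
df <- data.frame(
  Zone = factor(rep(c("Crocodile", "Rankin", "West", "Whipray"), times = 3)),
  Year = factor(rep(c(2016, 2017, 2018), each = 4)),
  total_count = c(1, 14, 29, 9, 18, 46, 67, 7, 7, 69, 58, 15),
  empty_count = c(0, 3, 7, 4, 8, 17, 31, 4, 2, 8, 17, 0)
)
df$non_empty_count <- df$total_count - df$empty_count

# Same model as in the answer
model <- glm(cbind(empty_count, non_empty_count) ~ Zone + Year,
             data = df, family = binomial)

# Likelihood-ratio test for dropping each term while keeping the other:
# one p-value for Zone as a whole and one for Year as a whole
lr_tests <- drop1(model, test = "Chisq")
print(lr_tests)

# Odds ratios relative to the reference levels (Crocodile, 2016)
odds_ratios <- exp(coef(model))
print(round(odds_ratios, 2))
```

An odds ratio above 1 for `Year2017` means higher odds of an empty stomach in 2017 than in 2016, holding zone fixed; it does not by itself compare 2017 with 2018.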

Reproducible data

df <- structure(list(Zone = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 
4L, 1L, 2L, 3L, 4L), .Label = c("Crocodile", "Rankin", "West", 
"Whipray"), class = "factor"), Year = c(2016, 2016, 2016, 2016, 
2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018), total_count = c(1L, 
14L, 29L, 9L, 18L, 46L, 67L, 7L, 7L, 69L, 58L, 15L), empty_count = c(0L, 
3L, 7L, 4L, 8L, 17L, 31L, 4L, 2L, 8L, 17L, 0L)), row.names = c(NA, 
-12L), class = "data.frame")
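For comparison, here is a rough sketch of what the prop.test route from the question could look like with these data. It is an illustration under a strong assumption, not a recommendation: counts are pooled over zones within each year, which ignores any zone effect (the reason the regression above is preferred). `pairwise.prop.test()` from the stats package then compares the years pairwise with a Holm adjustment; the names `by_year`, `overall`, and `pairwise` are made up for this example.

```r
# Same data as above, rebuilt compactly so this block runs on its own
df <- data.frame(
  Zone = factor(rep(c("Crocodile", "Rankin", "West", "Whipray"), times = 3)),
  Year = rep(c(2016, 2017, 2018), each = 4),
  total_count = c(1, 14, 29, 9, 18, 46, 67, 7, 7, 69, 58, 15),
  empty_count = c(0, 3, 7, 4, 8, 17, 31, 4, 2, 8, 17, 0)
)

# Assumption: pool empty and total counts over zones within each year
by_year <- aggregate(cbind(empty_count, total_count) ~ Year,
                     data = df, FUN = sum)
empty <- setNames(by_year$empty_count, by_year$Year)
total <- by_year$total_count

# Overall test that the three yearly proportions are equal
overall <- prop.test(empty, total)
print(overall)

# Pairwise comparisons between years, p-values adjusted by Holm's method
pairwise <- pairwise.prop.test(empty, total, p.adjust.method = "holm")
print(pairwise)
```

Running the same calls on a single zone, as in the Whipray example from the question, is possible too, but the counts there are small enough that the chi-squared approximation becomes unreliable.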

Created on 2022-02-09 by the reprex package (v2.0.1)

