Correlation of categorical data to binomial response in R

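I want to analyze the correlation between categorical input variables and a binomial response variable, but I'm not sure how to organize my data or whether I'm planning the right analysis.

Here is my data table (the variables are explained below):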

species<-c("Aaeg","Mcin","Ctri","Crip","Calb","Tole","Cfus","Mdes","Hill","Cpat","Mabd","Edim","Tdal","Tmin","Edia","Asus","Ltri","Gmor","Sbul","Cvic","Egra","Pvar")
scavenge<-c(1,1,0,1,1,1,1,0,1,0,1,1,1,0,0,1,0,0,0,0,1,1)
dung<-c(0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0)
pred<-c(0,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0)
nectar<-c(1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0)
plant<-c(0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0)
blood<-c(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0)
mushroom<-c(0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0)
loss<-c(0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0) #1 means yes, 0 means no
data<-data.frame(species,scavenge,dung,pred,nectar,plant,blood,mushroom,loss) #data.frame keeps the 0/1 columns numeric; cbind would coerce everything to character
data #check data table

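Explanation of the data table

I list the individual species, and the following columns are their annotated feeding types. A 1 in a given column means yes, and a 0 means no. Some species have several feeding types, while others have only one. The response variable I'm interested in is "loss", which indicates the loss of a trait. I'd like to know whether any of the feeding types predicts, or is correlated with, the state of "loss".

Thoughts

I'm not sure whether there is a good way to include feeding type as a single categorical variable with multiple categories. I don't think I can organize it as one variable of the form c("scavenge","dung","pred", etc...), because some species have more than one feeding type, so I split the types into separate columns and coded each as 1 (yes) or 0 (no). At the moment I'm thinking of trying a log-linear analysis, but the examples I've found don't have comparable data... I'm happy to take suggestions.

Any help or a pointer in the right direction is much appreciated!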

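The sample is too small: you have 18 species with loss == 0 and only 4 with loss == 1. You will run into problems fitting a full logistic regression (i.e. one that includes all variables). I suggest using Fisher's exact test to test the association of each feeding habit with loss: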

library(dplyr)
library(purrr)
library(tibble) # provides add_column()

# run Fisher's exact test of x against y and return the key results as a one-row data frame
FISHER <- function(x, y) {
  FT <- fisher.test(table(x, y))

  data.frame(
    pvalue = FT$p.value,
    oddsratio = as.numeric(FT$estimate),
    lower_limit_OR = FT$conf.int[1],
    upper_limit_OR = FT$conf.int[2]
  )
}

# define variables to test
FEEDING <- c("scavenge", "dung", "pred", "nectar", "plant", "blood", "mushroom")

# loop through and test the association between each variable and "loss"
# (data must be a data frame, so that map_dfr() iterates over columns and data$loss works)
results <- data[, FEEDING] %>%
  map_dfr(FISHER, y = data$loss) %>%
  add_column(var = FEEDING, .before = 1)

You get one row of results for each feeding habit:

> results
       var      pvalue oddsratio lower_limit_OR upper_limit_OR
1 scavenge 0.264251538 0.1817465    0.002943469       2.817560
2     dung 1.000000000 1.1582683    0.017827686      20.132849
3     pred 0.263157895 0.0000000    0.000000000       3.189217
4   nectar 0.535201640 0.0000000    0.000000000       5.503659
5    plant 0.002597403       Inf    2.780171314            Inf
6    blood 1.000000000 0.0000000    0.000000000      26.102285
7 mushroom 0.337662338 5.0498688    0.054241930     467.892765

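pvalue is the p-value from fisher.test. Basically, an odds ratio > 1 means the variable is positively associated with loss. Of all your variables, plant shows the strongest association, which you can check: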

> table(loss,plant)
    plant
loss  0  1
   0 18  0
   1  1  3
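
The plant p-value can be reproduced by hand, which shows what fisher.test does with this table. The following is a minimal sketch, assuming the usual two-sided rule of summing the hypergeometric probabilities of every table that is no more likely than the observed one. The variable names (n_loss, k_obs, p_manual and so on) are made up for this illustration.

```r
# plant and loss vectors copied from the question
plant <- c(0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0)
loss  <- c(0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0)

n_loss   <- sum(loss == 1)               # species with trait loss
n_noloss <- sum(loss == 0)               # species without trait loss
n_plant  <- sum(plant == 1)              # plant feeders
k_obs    <- sum(plant == 1 & loss == 1)  # plant feeders with trait loss

# all counts of "plant feeder with loss" that are possible with the margins fixed
k_all <- max(0, n_plant - n_noloss):min(n_plant, n_loss)

# hypergeometric probability of each possible table
probs <- dhyper(k_all, m = n_loss, n = n_noloss, k = n_plant)
p_obs <- probs[k_all == k_obs]

# two-sided p-value: total probability of tables no more likely than the observed one
# (the small factor guards against floating-point ties)
p_manual <- sum(probs[probs <= p_obs * (1 + 1e-7)])

p_fisher <- fisher.test(table(plant, loss))$p.value
print(c(manual = p_manual, fisher = p_fisher))
```

Here the observed table is the least likely one, so the p-value is just its own probability, 4/1540, or about 0.0026, which matches the plant row in the results.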

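All three species with plant=1 have loss=1, and they account for three of the four species with loss=1. So for your current dataset, I think this is the best you can do. You should get a larger sample and see whether this still holds.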
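
For comparison, here is a sketch of the full logistic regression that the answer advises against. It assumes the same data as in the question, rebuilt as a data frame called dat (a name chosen for this example), and it collects whatever warnings glm raises rather than predicting them. With only 4 loss events spread over 7 predictors, the resulting estimates should not be trusted.

```r
# feeding-type indicators and response copied from the question
dat <- data.frame(
  scavenge = c(1,1,0,1,1,1,1,0,1,0,1,1,1,0,0,1,0,0,0,0,1,1),
  dung     = c(0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0),
  pred     = c(0,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0),
  nectar   = c(1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0),
  plant    = c(0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0),
  blood    = c(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0),
  mushroom = c(0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0),
  loss     = c(0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0)
)

# number of loss events per predictor in the full model
events_per_variable <- sum(dat$loss) / (ncol(dat) - 1)
print(events_per_variable)

# fit the full model, recording any warnings instead of letting them scroll past
warn_msgs <- character(0)
fit <- withCallingHandlers(
  glm(loss ~ ., data = dat, family = binomial),
  warning = function(w) {
    warn_msgs <<- c(warn_msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(warn_msgs)                  # warnings raised during fitting, if any
print(summary(fit)$coefficients)  # inspect the estimates and their standard errors
```

If the coefficient table shows very large estimates or standard errors, that is a sign the model cannot be estimated reliably from 22 species, which is why the per-variable Fisher tests are the safer choice here.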