Descriptive Statistics for Multilevel (Clustered) Data

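I am unable to generate a complex cross-tabulation of descriptive statistics for data that are multilevel in nature. I have tried to approach the problem from several angles, to no avail. Below is some code from my failed plyr solution. The issue is that schools are nested within districts, and I need the district-level summary statistics matched to every school in that district. The plyr solution only produces descriptive statistics for each school's own subsample, rather than applying the aggregated district information to each school.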

I have been trying to find a solution to this for several days.

In general, would data.table offer a better solution?

    # Generate data
    set.seed(500)
    School <- rep(1:20, 2)
    District <- rep(c(rep("East", 10), rep("West", 10)), 2)
    Score <- rnorm(40, 100, 15)
    # Only 8 IDs are drawn; data.frame() recycles them across the 40 rows
    Student.ID <- sample(1:1000, 8, replace = TRUE)
    items <- data.frame(replicate(10, sample(1:4, 40, replace = TRUE)))
    gender <- rep(c("Male", "Female"), 100 * c(0.4, 0.6))
    gender <- sample(gender, 40)
    low.inc <- rep(c("Status.A", "Status.B", "Status.c"), 100 * c(0.3, 0.2, 0.5))
    low.inc <- sample(low.inc, 40)
    # Convert the item responses to ordered factors with Likert labels
    items <- data.frame(lapply(items, factor, ordered = TRUE,
                               levels = 1:4,
                               labels = c("Strongly disagree", "Disagree",
                                          "Agree", "Strongly Agree")))
    school.data <- data.frame(Student.ID, School, District, Score, items, gender, low.inc)
    # Categorise scores relative to one standard deviation around the mean
    sd1 <- sd(school.data$Score)
    m1 <- mean(school.data$Score)
    sd.above <- m1 + sd1
    sd.below <- m1 - sd1
    school.data$scorecat[Score >= sd.above] <- "High"
    school.data$scorecat[Score > sd.below & Score < sd.above] <- "Moderate"
    school.data$scorecat[Score <= sd.below] <- "Low"

    # Attempt to generate the table (one row per gender x district x school)
    library(plyr)
    b1 <- ddply(school.data, .variables = c("gender", "District", "School"), .fun = summarise,
      n = length(scorecat),
      high = sum(scorecat %in% c("High")),
      high.prop = high / n, # Referring to vars I just created
      mod = sum(scorecat %in% c("Moderate")),
      mod.prop = mod / n, # Referring to vars I just created
      low = sum(scorecat %in% c("Low")),
      low.prop = low / n # Referring to vars I just created
    )
    drops <- c("high", "mod", "low") # columns to drop
    b1 <- b1[, !(names(b1) %in% drops)]
    colnames(b1)[1] <- "Demographic Variable"

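Note: the table below produces the correct district-level values, and these are the values that should be assigned to each school. I want a table like the first example, with every school carrying the corresponding district values.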

    # District-level summary (one row per gender x district)
    b1 <- ddply(school.data, .variables = c("gender", "District"), .fun = summarise,
      n = length(scorecat),
      high = sum(scorecat %in% c("High")),
      high.prop = high / n, # Referring to vars I just created
      mod = sum(scorecat %in% c("Moderate")),
      mod.prop = mod / n, # Referring to vars I just created
      low = sum(scorecat %in% c("Low")),
      low.prop = low / n # Referring to vars I just created
    )
    drops <- c("high", "mod", "low") # columns to drop
    b1 <- b1[, !(names(b1) %in% drops)]
    colnames(b1)[1] <- "Demographic Variable"

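If I understand correctly, what you want is to compute a variable at the district level and then attach it at the school level. I could barely follow the rest of your post.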

In base R, you would use aggregate() followed by merge().
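As a rough illustration of that two-step pattern, here is a minimal sketch. The data frame `toy` and the helper columns `high`, `mod`, and `low` are made up for the example (they stand in for school.data and are not part of the original post), so treat this as one possible approach rather than the definitive one.

```r
# Toy data: a hypothetical stand-in for school.data
set.seed(1)
toy <- data.frame(
  School   = rep(1:4, each = 5),
  District = rep(c("East", "West"), each = 10),
  gender   = rep(c("Male", "Female"), times = 10),
  scorecat = sample(c("High", "Moderate", "Low"), 20, replace = TRUE)
)

# Indicator columns, so that a group mean is a proportion
toy$high <- toy$scorecat == "High"
toy$mod  <- toy$scorecat == "Moderate"
toy$low  <- toy$scorecat == "Low"

# Step 1: aggregate to the district x gender level
district <- aggregate(cbind(high, mod, low) ~ District + gender,
                      data = toy, FUN = mean)
names(district)[3:5] <- c("high.prop", "mod.prop", "low.prop")

# Step 2: merge the district values onto one row per gender x district x school
schools <- unique(toy[, c("gender", "District", "School")])
result  <- merge(schools, district, by = c("District", "gender"))
result  <- result[order(result$gender, result$District, result$School), ]
print(result)
```

Every school in a district ends up with the same district-level proportions for a given gender, which seems to be the layout the question asks for.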

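Given that you have already computed the summary table b1 using plyr, you can merge it back into the initial school.data dataset. Note that your code renames the gender column of b1 to "Demographic Variable", so the key columns have to be matched by name on each side.

    school.data2 <- merge(school.data, b1,
                          by.x = c("District", "gender"),
                          by.y = c("District", "Demographic Variable"))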
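Since the question also asks about data.table: a grouped assignment with `:=` can attach the district-level values directly, without a separate merge. The sketch below assumes the data.table package is installed and again uses made-up toy data in place of school.data. Whether it is "better" is a matter of taste and data size.

```r
library(data.table)

# Toy data: a hypothetical stand-in for school.data
set.seed(1)
dt <- data.table(
  School   = rep(1:4, each = 5),
  District = rep(c("East", "West"), each = 10),
  gender   = rep(c("Male", "Female"), times = 10),
  scorecat = sample(c("High", "Moderate", "Low"), 20, replace = TRUE)
)

# Compute district x gender statistics and add them to every student row
dt[, `:=`(
  n         = .N,
  high.prop = mean(scorecat == "High"),
  mod.prop  = mean(scorecat == "Moderate"),
  low.prop  = mean(scorecat == "Low")
), by = .(District, gender)]

# Collapse to one row per gender x district x school
result <- unique(dt[, .(gender, District, School, n, high.prop, mod.prop, low.prop)])
setorder(result, gender, District, School)
print(result)
```

Here `n` is the district-level count for that gender, repeated for each school in the district.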

Let me know if that works.