SCF data issue from lodown package
I ran into a strange problem while analyzing the SCF with the lodown package. There must be something wrong with the data for Black respondents under 35 with a college degree: the share/mean for this group comes out far too high.
I crossed the three factors race, age, and education to see what share of total net worth each group holds.
# input data
library(survey)    # svrepdesign() and replicate-weight analysis
library(mitools)   # imputationList() bundles the five implicates

scf_imp <- readRDS( file.path( path.expand( "~" ) , "SCF" , "scf 2016.rds" ) )
scf_rw <- readRDS( file.path( path.expand( "~" ) , "SCF" , "scf 2016 rw.rds" ) )

scf_design <-
    svrepdesign(
        weights = ~wgt ,
        repweights = scf_rw[ , -1 ] ,
        data = imputationList( scf_imp ) ,
        scale = 1 ,
        rscales = rep( 1 / 998 , 999 ) ,
        mse = FALSE ,
        type = "other" ,
        combined.weights = TRUE
    )
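(The two .rds extracts are assumed to be the standard lodown downloads; if they are not on disk yet, a one-time call along these lines should create them. The scf_MIcombine() helper used below is likewise assumed to be the one defined in the lodown/asdfree SCF analysis example.)

library(lodown)
lodown( "scf" , output_dir = file.path( path.expand( "~" ) , "SCF" ) )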
# variable recoding: the SCF stores these as integer codes 1..k, so relabel them as factors
scf_design <-
    update(
        scf_design ,
        racecl4 = factor( racecl4 , labels = c( "White" , "Black" , "Hispanic/Latino" , "Other" ) ) ,
        edcl = factor( edcl , labels = c( "less than high school" , "high school or GED" , "some college" , "college degree" ) ) ,
        agecl = factor( agecl , labels = c( "less than 35" , "35-44" , "45-54" , "55-64" , "65-74" , "75 or more" ) )
    )
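As a quick sanity check that the recodes landed correctly, an unweighted cross-tab on the first implicate can help (a sketch that assumes the usual mitools structure, where the svyimputationList keeps its component designs in $designs and each design keeps its data in $variables):

# unweighted cross-tab of race by education on implicate 1
with( scf_design$designs[[1]]$variables , table( racecl4 , edcl ) )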
# calculation
library(dplyr)     # %>% and mutate()
library(stringr)   # str_detect()
library(tibble)    # rownames_to_column()

# total net worth in every race x education x age cell, combined across implicates
trible <- scf_MIcombine( with( scf_design ,
    svyby( ~ networth , ~ interaction( racecl4 , edcl , agecl ) , svytotal )
) )

# pull out the Black cells and express each as a share of total Black net worth;
# interaction() varies racecl4 fastest, then edcl, then agecl, so filling a
# 4-row matrix puts the four education levels in rows and the six age groups in columns
sum_black <- trible[[1]][ str_detect( names( trible[[1]] ) , "Black" ) ] %>% sum()
black <- trible[[1]][ str_detect( names( trible[[1]] ) , "Black" ) ] %>% matrix( nrow = 4 )
black <- as.data.frame( black / sum_black )
colnames( black ) <- c( "less than 35" , "35-44" , "45-54" , "55-64" , "65-74" , "75 or more" )

# append row and column totals, then format everything as percentages
black <- black %>% mutate( total = rowSums( black ) )
black <- rbind( black , total = colSums( black ) )
black <- sapply( black , scales::percent ) %>% as.data.frame()
rownames( black ) <- c( "less than high school" , "high school or GED" , "some college" , "college degree" , "total" )
black <- rownames_to_column( black , "share for black" )
I computed the means the same way. The result: for Black households younger than 35 with a college degree, the share/mean comes out extremely high, which should not happen. Is there a problem with the data, or with my method?
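("The same way" here means the identical pipeline with svymean() substituted for svytotal(); the object name means is just for illustration:)

means <- scf_MIcombine( with( scf_design ,
    svyby( ~ networth , ~ interaction( racecl4 , edcl , agecl ) , svymean )
) )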
[two screenshots of the resulting share and mean tables; images hosted at sinaimg.cn]
The Survey of Consumer Finances has roughly 6,000 unweighted records, and you are splitting the results into nearly 100 groups (4 races × 4 education levels × 6 age brackets = 96 cells), so the average cell holds only about N = 60. Take a look at this to see just how small each one is.
counts <- scf_MIcombine( with( scf_design ,
    svyby( ~ networth , ~ interaction( racecl4 , edcl , agecl ) , unwtd.count )
) )
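For example (assuming coef() on the combined result returns the per-cell unweighted counts):

# the ten smallest cells; single-digit sample sizes are what break the estimates
sort( coef( counts ) )[ 1:10 ]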
It is not a hard-and-fast rule, but when a standard error exceeds 30% of the statistic itself, that statistic is generally considered unstable. Look at SE( trible ) / coef( trible ) > 0.3 and you will see that almost every statistic here is unstable.
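A compact way to tabulate that check (a sketch; abs() guards against any cells whose net-worth totals are negative):

# relative standard error of every cell total
rse <- SE( trible ) / abs( coef( trible ) )
# how many of the ~96 cells cross the 30% instability threshold
table( rse > 0.3 )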
The SCF is a wonderful dataset, but the sample size probably just isn't large enough to support a breakdown this fine-grained. Thanks!