Create dataframe with correlation and p-value by group?
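I am trying to correlate several variables within specific groups (COUNTY) in R. With the approach below I can get the correlation for each column, but I can't find a way to also save the p-values in a table for each group. Any suggestions?
Sample data: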
crops <- data.frame(
COUNTY = sample(37001:37900),
CropYield = sample(c(1:100), 10, replace = TRUE),
MaxTemp =sample(c(40:80), 10, replace = TRUE),
precip =sample(c(0:10), 10, replace = TRUE),
ColdDays =sample(c(1:73), 10, replace = TRUE))
Sample code:
crops %>%
group_by(COUNTY) %>%
do(data.frame(Cor=t(cor(.[,2:5], .[,2]))))
^ This gives me the correlation for each column, but I also need the p-value for each one. Ideally, the final output would look like this.
Desired Output
Your example has only 1 observation per county, so it doesn't work. I set up an example with more observations per county:
set.seed(111)
crops <- data.frame(
COUNTY = sample(37001:37002,10,replace=TRUE),
CropYield = sample(c(1:100), 10, replace = TRUE),
MaxTemp =sample(c(40:80), 10, replace = TRUE),
precip =sample(c(0:10), 10, replace = TRUE),
ColdDays =sample(c(1:73), 10, replace = TRUE))
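As a quick sanity check before computing anything, it may help to count the rows in each county, since `cor.test` with the default Pearson method needs at least three complete observations per group. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the asker's data) and keeps only the counties that have enough rows.

```r
library(dplyr)

# Hypothetical toy data: county 37001 has 4 rows, county 37003 has only 1
toy <- data.frame(
  COUNTY    = c(37001, 37001, 37001, 37001, 37003),
  CropYield = c(10, 25, 31, 47, 52)
)

# Number of observations per county
sizes <- toy %>% count(COUNTY)
print(sizes)

# Keep only counties with at least 3 rows, the minimum for a Pearson cor.test
usable <- toy %>%
  group_by(COUNTY) %>%
  filter(n() >= 3) %>%
  ungroup()
print(usable)
```

Here `usable` retains only the rows for county 37001; the single-row county is dropped before any correlation is attempted.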
I think you need to convert to long format and run cor.test for each county and variable:
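library(dplyr)
library(tidyr)
# Correlation estimate and p-value of CropYield vs. value, as a one-row data frame
calcor <- function(da) {
  data.frame(cor.test(da$CropYield, da$value)[c("estimate", "p.value")])
}
# Reshape to long format, then run cor.test once per county and variable
crops %>%
  pivot_longer(-c(COUNTY, CropYield)) %>%
  group_by(COUNTY, name) %>%
  do(calcor(.))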
# A tibble: 6 x 4
# Groups: COUNTY, name [6]
  COUNTY name     estimate p.value
   <int> <chr>       <dbl>   <dbl>
1  37001 ColdDays    0.466   0.292
2  37001 MaxTemp    -0.225   0.628
3  37001 precip     -0.356   0.433
4  37002 ColdDays    0.888   0.304
5  37002 MaxTemp     0.941   0.220
6  37002 precip     -0.489   0.674
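The above gives you the correlation of each variable with crop yield for each county. Now it is just a matter of converting it to wide format: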
crops %>%
  pivot_longer(-c(COUNTY, CropYield)) %>%
  group_by(COUNTY, name) %>%
  do(calcor(.)) %>%
  pivot_wider(values_from = c(estimate, p.value), names_from = name)
  COUNTY estimate_ColdDa… estimate_MaxTemp estimate_precip p.value_ColdDays
   <int>            <dbl>            <dbl>           <dbl>            <dbl>
1  37001            0.466           -0.225          -0.356            0.292
2  37002            0.888            0.941          -0.489            0.304
# … with 2 more variables: p.value_MaxTemp <dbl>, p.value_precip <dbl>
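Since `do()` is marked as superseded in recent dplyr releases, the same long-format idea could also be written with `summarise()` and a list column. This is only a sketch under assumptions: the `toy` data frame and its values are made up for illustration, and it keeps just two predictor columns to stay short.

```r
library(dplyr)
library(tidyr)

# Hypothetical deterministic data: two counties, five rows each
toy <- data.frame(
  COUNTY    = rep(c(37001L, 37002L), each = 5),
  CropYield = c(12, 35, 48, 60, 77, 20, 31, 45, 58, 90),
  MaxTemp   = c(55, 61, 58, 70, 74, 66, 49, 72, 63, 80),
  precip    = c(3, 7, 2, 9, 5, 8, 1, 6, 4, 10)
)

result <- toy %>%
  # One row per county, variable and observation
  pivot_longer(-c(COUNTY, CropYield)) %>%
  group_by(COUNTY, name) %>%
  # Store each cor.test result in a list column so the test runs once per group
  summarise(test = list(cor.test(CropYield, value)), .groups = "drop") %>%
  # Pull the correlation estimate and p-value out of each test object
  mutate(
    estimate = vapply(test, function(ct) unname(ct$estimate), numeric(1)),
    p.value  = vapply(test, function(ct) ct$p.value, numeric(1))
  ) %>%
  select(-test) %>%
  # One row per county, one estimate and one p-value column per variable
  pivot_wider(names_from = name, values_from = c(estimate, p.value))

print(result)
```

The result has one row per county with columns named `estimate_<variable>` and `p.value_<variable>`, the same layout as the wide table above.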