Making more complex tables using dplyr and tidyr
I have a dataset that looks like this, although the real example has many more columns. There is only one row (for now).
Results <- structure(list(PCV2_CT_Min = 7.15, PPV2_CT_Min = 11.4, PPV3_CT_Min = 8.6,
PPV4_CT_Min = 16.3, PPV_CT_Min = 29.58, NI_BOCA_CT_Min = 20.51,
SW_BOCA_CT_Min = 23.49, PCV2_CT_Count = 695L, PPV2_CT_Count = 695L,
PPV3_CT_Count = 695L, PPV4_CT_Count = 695L, PPV_CT_Count = 695L,
NI_BOCA_CT_Count = 695L, SW_BOCA_CT_Count = 695L),
.Names = c("PCV2_CT_Min", "PPV2_CT_Min", "PPV3_CT_Min", "PPV4_CT_Min", "PPV_CT_Min", "NI_BOCA_CT_Min", "SW_BOCA_CT_Min", "PCV2_CT_Count", "PPV2_CT_Count", "PPV3_CT_Count", "PPV4_CT_Count", "PPV_CT_Count", "NI_BOCA_CT_Count", "SW_BOCA_CT_Count"),
row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))
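Each column name is made up of a variable name and a function name, so PCV2_CT_Min is the minimum count (CT) for the PCV2 virus test, PCV2_CT_Count is the total number of animals tested, and so on.
It was produced by running summarize_all from dplyr on another dataset, Pig, using a longer version of this code: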
V <- Pig %>%
  select(ends_with('CT')) %>%
  summarise_all(funs(Min = min(., na.rm = TRUE),
                     Count = n()))
In the real example there are more functions, and they take different arguments. What I would like to end up with is a data frame like this:
Parameter  PCV2_CT  PPV2_CT  PPV3_CT  PPV4_CT  PPV_CT  NI_BOCA_CT  SW_BOCA_CT
Min        7.15     11.4     8.6      16.3     29.58   20.51       23.49
Count      695      695      695      695      695     695         695
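I thought there would be a simple way to do this, perhaps using something like the separate command from tidyr, but I have racked my brains, searched SO and the wider web, and looked through the tidyr documentation, all to no avail. I feel the answer should be obvious, but I can't see it.
I would be grateful for any help.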
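You need to gather all the columns, separate the names into the relevant parts that you want (the code below uses extract, which splits on regex capture groups), and then spread the data back into a wide form: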
library(tidyverse)
Results %>%
gather(var, val, everything()) %>%
extract(var, into = c("var", "measure"), regex = "(.*)_(Min|Count)") %>%
spread(var, val)
# # A tibble: 2 x 8
# measure NI_BOCA_CT PCV2_CT PPV_CT PPV2_CT PPV3_CT PPV4_CT SW_BOCA_CT
# * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Count 695.00 695.00 695.00 695.0 695.0 695.0 695.00
# 2 Min 20.51 7.15 29.58 11.4 8.6 16.3 23.49
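For readers on tidyr 1.0.0 or later, where gather and spread are superseded by pivot_longer and pivot_wider, the same reshaping might be sketched as follows. The two-variable Results here is a cut-down stand-in for the real data, and the names long and wide are only illustrative.

```r
library(tidyr)

# Cut-down stand-in for Results (assumption: two of the original variables only).
Results <- data.frame(PCV2_CT_Min = 7.15, NI_BOCA_CT_Min = 20.51,
                      PCV2_CT_Count = 695L, NI_BOCA_CT_Count = 695L)

# Lengthen: split each column name into the variable and the summary function.
long <- pivot_longer(Results,
                     cols = everything(),
                     names_to = c("var", "measure"),
                     names_pattern = "(.*)_(Min|Count)",
                     values_to = "val")

# Widen again: one column per variable, one row per summary function.
wide <- pivot_wider(long, names_from = "var", values_from = "val")
wide
```

names_pattern plays the same role here as the regex argument of extract in the answer above.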
A more general regex would be regex = "(.*)_(.*)", which could be useful if you have used several other summary functions.
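To see why the general pattern works, note that the first (.*) is greedy, so the split happens at the last underscore. A small base R check, using two of the column names from the question (the vector name nms is just for illustration):

```r
# Two column names taken from the question.
nms <- c("NI_BOCA_CT_Min", "PCV2_CT_Count")

# The greedy first group swallows everything up to the LAST underscore.
var     <- sub("(.*)_(.*)", "\\1", nms)
measure <- sub("(.*)_(.*)", "\\2", nms)

var      # "NI_BOCA_CT" "PCV2_CT"
measure  # "Min"        "Count"
```

One caveat with this sketch: a summary name that itself contains an underscore would be split in the wrong place, so the general pattern assumes underscore-free function names.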
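I recognise that you have a reason for having the data in this form, but it is somewhat contrary to what you should really be looking at. Ideally, it makes more sense for each column to contain data of the same type of measure.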
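One possible reading of that advice is a layout with one row per variable and one column per summary function, so that each column holds a single kind of measure. A minimal sketch, assuming tidyr 1.0.0 or later and a cut-down stand-in for Results (the name tidy is illustrative):

```r
library(tidyr)

# Cut-down stand-in for Results (assumption: two of the original variables only).
Results <- data.frame(PCV2_CT_Min = 7.15, NI_BOCA_CT_Min = 20.51,
                      PCV2_CT_Count = 695L, NI_BOCA_CT_Count = 695L)

# ".value" tells pivot_longer that the second name part (Min, Count)
# should become the output column names rather than a key column.
tidy <- pivot_longer(Results,
                     cols = everything(),
                     names_to = c("var", ".value"),
                     names_pattern = "(.*)_(.*)")
tidy
```

With this shape the Min column stays numeric and the Count column stays integer, rather than both being coerced to a common type.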
Two different ideas using base R / reshape2 might be:
Split and stack:
dfs <- lapply(c("Min", "Count"), function(x) {
  res <- Results[, grepl(x, names(Results))]
  res <- setNames(res, gsub(paste0("_", x), "", names(res)))
  res$measure <- x
  return(res)
})
do.call(rbind, dfs)
# A tibble: 2 x 8
# PCV2_CT PPV2_CT PPV3_CT PPV4_CT PPV_CT NI_BOCA_CT SW_BOCA_CT measure
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#1 7.15 11.4 8.6 16.3 29.58 20.51 23.49 Min
#2 695.00 695.0 695.0 695.0 695.00 695.00 695.00 Count
Melt and cast:
library(reshape2)
melted <- melt(data.frame(Results))
melted$measure <- gsub(".*_(Min|Count)", "\\1", melted$variable)
melted$variable <- gsub("_(Min|Count)", "", melted$variable)
dcast(melted, measure ~ variable)
# measure NI_BOCA_CT PCV2_CT PPV_CT PPV2_CT PPV3_CT PPV4_CT SW_BOCA_CT
#1 Count 695.00 695.00 695.00 695.0 695.0 695.0 695.00
#2 Min 20.51 7.15 29.58 11.4 8.6 16.3 23.49