Pivoting a table with multiple binary measures of the same variable in R
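I have a data.frame with hundreds of columns, each of which is a binary ("Yes"/"No") measure of a possible outcome. These columns can be collapsed into a short list of variables with multiple options (none of the collapsed variables has more than one YES). I'm trying to work out how to do the collapsing, but the only solution I could come up with is very inelegant and time-consuming on my large dataset.
Here's a toy example to explain what I mean: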
> groceries<-tribble(~item, ~prod_potato, ~prod_apple, ~prod_banana, ~day_monday, ~day_tuesday, ~day_wednesday,
1, "N","N","Y","N","N","Y",
2, "Y","N","N","N","N","Y",
3, "Y","N","N","N","Y","N",
4,"N","Y","N","Y","N","N")
# A tibble: 4 x 7
   item prod_potato prod_apple prod_banana day_monday day_tuesday day_wednesday
  <dbl> <chr>       <chr>      <chr>       <chr>      <chr>       <chr>
1     1 N           N          Y           N          N           Y
2     2 Y           N          N           N          N           Y
3     3 Y           N          N           N          Y           N
4     4 N           Y          N           Y          N           N
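Each item can only be one of potato, banana, or apple, and can only be purchased on one particular day, so having all these separate columns really isn't useful.
The result I want looks like this: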
item prod   day
1    banana wednesday
2    potato wednesday
3    potato tuesday
4    apple  monday
This is the solution I came up with. It gets the job done, but it isn't great:
pivot_longer(groceries,2:4,names_to=c("prod"),names_prefix = "prod_") %>%
filter(value=="Y") %>% select(-value) %>%
pivot_longer(2:4,names_to="day",names_prefix="day_") %>%
filter(value=="Y") %>% select(-value)
# A tibble: 4 x 3
   item prod   day
  <dbl> <chr>  <chr>
1     1 banana wednesday
2     2 potato wednesday
3     3 potato tuesday
4     4 apple  monday
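But I'm 100% sure there is a less cumbersome solution, one that doesn't require me to repeat this process for some 20+ collapsed variables.
My absolute ideal solution would group the columns by the string before the _ and use that as the new column name, with the string after the _ as the value wherever the original variable is "YES". But I'm open to a slightly more manual solution in which I specify the columns to group and the variable name each time.
Can anyone suggest a solution (preferably tidyverse; I'm sure data.table has a super efficient solution, but I can never wrap my head around it)?
Thanks!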
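You can use pivot_longer to split all of the column names into group and options, then reduce to the rows where the value is Y (I used summarize just to drop the value column, but you could easily use filter here instead), and then pivot_wider.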
library(dplyr)
library(tidyr)
# library(tidyverse)
groceries %>%
pivot_longer(-item,
names_to = c("group", "options"),
names_sep = "_") %>%
group_by(item, group) %>%
summarize(options = options[value == "Y"],
.groups = "drop") %>%
pivot_wider(names_from = "group",
values_from = "options")
#> # A tibble: 4 × 3
#>    item day       prod
#>   <dbl> <chr>     <chr>
#> 1     1 wednesday banana
#> 2     2 wednesday potato
#> 3     3 tuesday   potato
#> 4     4 monday    apple
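The answer above mentions that filter could stand in for summarize. A minimal sketch of that variant, assuming the same toy groceries data from the question (the name collapsed is introduced here only for illustration):

```r
library(dplyr)
library(tidyr)

# Toy data copied from the question
groceries <- tribble(
  ~item, ~prod_potato, ~prod_apple, ~prod_banana, ~day_monday, ~day_tuesday, ~day_wednesday,
  1, "N", "N", "Y", "N", "N", "Y",
  2, "Y", "N", "N", "N", "N", "Y",
  3, "Y", "N", "N", "N", "Y", "N",
  4, "N", "Y", "N", "Y", "N", "N"
)

collapsed <- groceries %>%
  # Split each column name at "_" into a group part and an option part
  pivot_longer(-item,
               names_to = c("group", "options"),
               names_sep = "_") %>%
  # Keep only the option that was marked "Y", then drop the now-constant column
  filter(value == "Y") %>%
  select(-value) %>%
  # One column per group, holding the selected option
  pivot_wider(names_from = group, values_from = options)

collapsed
```

Because there is no group_by step here, the new columns should appear in the order the groups first occur in the data rather than sorted alphabetically.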
library(data.table)
setDT(groceries)
dt <- melt(groceries, id.vars = c("item"))[value == "Y"]
dt[, c("A", "B") := tstrsplit(variable, "_")]
dcast(dt, item ~ A, value.var = c("B"))
#    item       day   prod
# 1:    1 wednesday banana
# 2:    2 wednesday potato
# 3:    3   tuesday potato
# 4:    4    monday  apple
Or as a one-liner:
dcast(melt(groceries, id.vars = c("item"))[value == "Y"][, c("A", "B") := tstrsplit(variable, "_")], item ~ A, value.var = c("B"))
A base R option
dfout <- reshape(
transform(
subset(
cbind(groceries[1], stack(groceries[-1])),
values == "Y"
),
p = gsub("_.*", "", ind),
q = gsub(".*_", "", ind)
)[c("item", "p", "q")],
direction = "wide",
idvar = "item",
timevar = "p"
)
dfout[order(dfout$item), ]
which gives
  item q.prod     q.day
9    1 banana wednesday
2    2 potato wednesday
3    3 potato   tuesday
8    4  apple    monday
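All of the approaches above rely on the question's premise that every item has exactly one "Y" within each prefix group. On a real dataset with hundreds of columns it may be worth checking that before collapsing. A rough sketch of such a check, again using the toy data (y_counts and violations are names made up for this example):

```r
library(dplyr)
library(tidyr)

# Toy data copied from the question
groceries <- tribble(
  ~item, ~prod_potato, ~prod_apple, ~prod_banana, ~day_monday, ~day_tuesday, ~day_wednesday,
  1, "N", "N", "Y", "N", "N", "Y",
  2, "Y", "N", "N", "N", "N", "Y",
  3, "Y", "N", "N", "N", "Y", "N",
  4, "N", "Y", "N", "Y", "N", "N"
)

# Count how many "Y" values each item has within each prefix group
y_counts <- groceries %>%
  pivot_longer(-item,
               names_to = c("group", "options"),
               names_sep = "_") %>%
  group_by(item, group) %>%
  summarize(n_yes = sum(value == "Y"), .groups = "drop")

# Any row here is an item/group pair with zero or several "Y" values
violations <- filter(y_counts, n_yes != 1)
violations
```

If violations has no rows, the one-"Y"-per-group assumption holds and the collapsed table will have a single value per cell.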