Splitting a column containing comma-separated string values into new header columns in R
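I have a data frame in which one column holds comma-separated strings. Is there an efficient way to turn each comma-separated value into its own column header, and to fill those new columns with binary indicators showing whether the value appears in the original row? A sample of my data can be reproduced with: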
data <- structure(list(
  id = c(6901257L, 6304928L, 7919400L),
  amenities = c(
    "Wireless Internet,Air conditioning,Kitchen,Heating,Family/kid friendly,Essentials,Hair dryer,Iron,translation missing: en.hosting_amenity_50",
    "Wireless Internet,Air conditioning,Kitchen,Heating,Family/kid friendly,Washer,Dryer,Smoke detector,Fire extinguisher,Essentials,Shampoo,Hangers,Hair dryer,Iron,translation missing: en.hosting_amenity_50",
    "TV,Cable TV,Wireless Internet,Air conditioning,Kitchen,Breakfast,Buzzer/wireless intercom,Heating,Family/kid friendly,Smoke detector,Carbon monoxide detector,Fire extinguisher,Essentials,Shampoo,Hangers,Hair dryer,Iron,Laptop friendly workspace,translation missing: en.hosting_amenity_50"
  )
), .Names = c("id", "amenities"), class = "data.frame", row.names = c(NA, 3L))
I have an inefficient way of producing the result: reshape the data to long format, then use dcast from reshape2. It can be reproduced as follows:
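library(dplyr)  # provides %>% and mutate()
library(tidyr)  # provides unnest()
# One row per (id, amenity) pair
data.long <- data %>%
  mutate(amenities = strsplit(as.character(amenities), ",")) %>%
  unnest(amenities)
data.long$amenities.value <- 1
# Back to wide format: one column per amenity
data.wide <- reshape2::dcast(data.long, id ~ amenities,
                             value.var = "amenities.value")  # desired result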
Is there a more efficient way to get the desired result from the original data structure?
Here is one approach using the splitstackshape package:
library(splitstackshape)
library(tidyverse)
# Use the question's `data` object (`df` is not defined in the question)
cSplit(data, "amenities", sep = ",", direction = "long") %>%
  mutate(value = 1) %>%
  spread(amenities, value) -> df.wide
all.equal(df.wide, data.wide)
#TRUE
As suggested by @A5C1D2H2I1M1N2O1R2T1, a more compact and faster solution is:
cSplit_e(data, "amenities", ",", mode = "binary", type = "character", drop = TRUE)
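To make explicit what the binary indicator columns contain, here is a minimal base R sketch of the same idea with no package dependencies. The `toy` data frame and its shortened amenity values are assumptions made for illustration, not the question's data, and this is not how cSplit_e is implemented internally.

```r
# Hypothetical toy data with the same shape as the question's `data`.
toy <- data.frame(
  id = c(1L, 2L, 3L),
  amenities = c("Kitchen,Heating", "Kitchen,Washer,Dryer", "TV,Kitchen"),
  stringsAsFactors = FALSE
)

# Split each string into a character vector of amenities.
parts <- strsplit(toy$amenities, ",", fixed = TRUE)

# Every distinct amenity becomes a column header.
levels_all <- sort(unique(unlist(parts)))

# For each row, mark which of the distinct amenities it contains (1) or not (0).
# vapply() returns one column per row of `toy`, so transpose the result.
indicator <- t(vapply(
  parts,
  function(p) as.integer(levels_all %in% p),
  integer(length(levels_all))
))
colnames(indicator) <- levels_all

# Attach the indicator columns to the id column.
toy_wide <- cbind(toy["id"], indicator)
toy_wide
```

Unlike the dcast result in the question, which leaves absent amenities as NA, this sketch fills them with 0.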
Using only the tidyverse:
library(tidyverse)
data %>%
separate_rows(amenities, sep = ",") %>%
table() %>%
data.frame() %>%
spread(amenities,Freq)
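Since spread() has been superseded in newer tidyr releases, a variant with pivot_wider() may be worth considering. The following is a sketch under assumptions: it uses a hypothetical `toy` data frame rather than the question's data, and it assumes tidyr 1.1.0 or later so that `values_fill` accepts a scalar.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same shape as the question's `data`.
toy <- data.frame(
  id = c(1L, 2L, 3L),
  amenities = c("Kitchen,Heating", "Kitchen,Washer,Dryer", "TV,Kitchen"),
  stringsAsFactors = FALSE
)

wide <- toy %>%
  separate_rows(amenities, sep = ",") %>%   # one row per (id, amenity) pair
  mutate(value = 1L) %>%                    # mark each pair as present
  pivot_wider(
    names_from  = amenities,                # amenities become column headers
    values_from = value,
    values_fill = 0L                        # absent amenities become 0
  )
wide
```

This keeps `id` as an integer column, whereas the table() route above converts it to a factor when the table is turned back into a data frame.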