Creating dummy variables in R based on multiple chr values within each cell

Question

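I am trying to create multiple dummy variables based on a column called 'Tags' in my df (14 rows, 2 columns: Score and Tags). My problem is that each cell can contain any number of chr values (up to about 30).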

When I ask for:

 str(df$Tags)

R returns:

chr [1:14] "\"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"gebruik streekproducten\", \"lactose intolera"| __truncated__ ...

When I ask for:

df$Tags[1]

R returns:

[1] "\"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"gebruik streekproducten\", \"lactose intolerantie\", \"met familie\", \"met vrienden\", \"noten allergie\", \"pinda allergie\", \"vegetarische gerechten\", chinees, gastronomisch, glutenvrij, kindvriendelijk, romantisch, traditioneel, trendy, verjaardag, zakelijk"

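So it seems the values within the first cell are not formatted consistently: they are separated by commas, but some are wrapped in quotation marks and others are not.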

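What I want is a dummy variable for every possible value that occurs in any cell. The first new dummy should therefore be called "biologische gerechten" (or something similar) and should indicate, for each case, whether that value is present (1) or absent (0) in the 'Tags' column.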

I tried several things with 'dplyr', for example:

df = mutate(df, biologisch = ifelse(Tags == "biologische gerechten", 1, 0))

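R does create a new column 'biologisch', but it contains only zeros. Is there another way to separate all the values and then create dummy variables for every possible value? I hope someone can help me, thanks!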

Answer 1

Here's one solution:

# make some toy data to test
set.seed(1)
df <- data.frame(Score = rnorm(10),
                 Tags = replicate(10, paste(sample(LETTERS, 5), collapse = ", ")),
                 stringsAsFactors = FALSE)

# load stringr, which we'll use to trim whitespace from the split-up tags
library(stringr)

# use strsplit to break your jumbles of tags into separate elements, with a
# list element for each position in the original vector. i've split on commas
# here; for your data you'll probably also want to strip the quotation marks
# (the backslashes in your printed output are just R's escaping of them).
t <- strsplit(df$Tags, split = ",")

# get a vector of the unique elements of those lists. you may need to use str_trim
# or something like it to cut leading and trailing whitespace. you might also
# need to use stringr's `str_subset` and a regular expression to cut the result
# down to, say, only alphanumeric strings. without a reproducible example, though,
# i can't do that for your specific case here.
tags <- unique(str_trim(unlist(t)))

# now, use `sapply` and `grepl` to look for each element of `tags` in each list;
# use `any` to summarize the results; 
# use `+` to convert those summaries to binary;
# use `lapply` to iterate that process over all elements of `tags`;
# use `Reduce(cbind, ...)` to collapse the results into a table; and
# use `as.data.frame` to turn that table into a df.
df2 <- as.data.frame(Reduce(cbind, lapply(tags, function(i) sapply(t, function(j) +(any(grepl(i, j), na.rm = TRUE))))))

# assign the tags as column names
names(df2) <- tags

Voilà:

> df2
   Y F P C Z K A J U H M O L E S R T Q V B I X G
1  1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2  0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3  0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0
4  0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0
5  0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0
6  0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0
7  0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0
8  0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0
9  1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0
10 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 1
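
For tags in the format shown in the question, here is a rough base R sketch of the same idea. Two things to keep in mind. First, `Tags == "biologische gerechten"` compares the whole cell string with the tag, which is why the `mutate` attempt only produced zeros. Second, `grepl(i, j)` treats each tag as a regular expression and also matches substrings, so this sketch uses `%in%` on the trimmed tags for exact matching instead. The data frame below is made up to imitate the question's format, and the names `tag_list`, `all_tags`, `dummies` and `df_dummies` are only illustrative.

```r
# Hypothetical example data imitating the format in the question:
# comma-separated tags, some wrapped in literal double quotes.
df <- data.frame(
  Score = c(8.5, 7.0, 9.1),
  Tags = c(
    "\"biologische gerechten\", \"met familie\", chinees, trendy",
    "\"met familie\", romantisch",
    "chinees, \"biologische gerechten\""
  ),
  stringsAsFactors = FALSE
)

# Drop the literal quotation marks, split on commas, trim whitespace.
tag_list <- lapply(
  strsplit(gsub("\"", "", df$Tags, fixed = TRUE), ",", fixed = TRUE),
  trimws
)

# Every distinct tag that occurs anywhere in the column.
all_tags <- sort(unique(unlist(tag_list)))

# One row per case, one column per tag. %in% tests exact membership,
# so a tag is never matched as a substring of a longer tag.
# (Assumes more than one distinct tag, so vapply returns a matrix.)
dummies <- t(vapply(
  tag_list,
  function(x) as.integer(all_tags %in% x),
  integer(length(all_tags))
))
colnames(dummies) <- all_tags

# Attach the dummies to the original data; cbind keeps the tag names
# as column names even though some contain spaces.
df_dummies <- cbind(df, as.data.frame(dummies))
```

Columns whose names contain spaces can then be accessed with backticks or double brackets, e.g. `df_dummies[["met familie"]]`.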
