Create dummy variables from string with multiple values
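I have a dataset in which one column contains multiple values, separated by ;.

  name    sex good_at
1 Tom     M   Drawing;Hiking
2 Mary    F   Cooking;Joking
3 Sam     M   Running
4 Charlie M   Swimming

I would like to create a dummy variable for each unique value in good_at, such that each dummy variable contains a TRUE or FALSE indicating whether or not that individual has that particular value.

Desired output

Drawing Cooking
True    False
False   True
False   False
False   False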
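I created a function that gives the desired output:

dum <- function(kw, col, type = c(TRUE, FALSE)) {
  # indices of the elements of col that match kw (case-insensitive)
  hit <- grep(as.character(kw), col, ignore.case = TRUE)
  # rep() keeps the data frame valid even when nothing matches
  t <- data.frame(col1 = hit, dummy = rep(type[1], length(hit)))
  # indices of the elements that do not match
  miss <- grep(as.character(kw), col, ignore.case = TRUE, invert = TRUE)
  t2 <- data.frame(col1 = miss, dummy = rep(type[2], length(miss)))
  # stack both sets and restore the original row order
  t3 <- rbind(t, t2)
  t3 <- t3[order(t3$col1), ]
  return(t3$dummy)
}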
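It may not be especially elegant, but it works. Using your example, your data frame is df and the column you want to reference is df$good_at (R is case-sensitive, so the column name must be spelled exactly as in the data):

Drawing <- dum("drawing", df$good_at)
> Drawing
TRUE
FALSE
...

Cooking <- dum("cooking", df$good_at)
> Cooking
FALSE
TRUE
...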
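For comparison, the same keyword test can be sketched with grepl(), which already returns one logical value per element, so no index bookkeeping is needed. This is only a sketch, not the function from the answer above: the helper name dum_grepl is made up for illustration, and the data frame is rebuilt here from the sample data in the question. Like dum(), it matches substrings, so a keyword such as "run" would also match "Running".

```r
# Sample data from the question, rebuilt so the block runs on its own
df <- data.frame(
  name = c("Tom", "Mary", "Sam", "Charlie"),
  sex = c("M", "F", "M", "M"),
  good_at = c("Drawing;Hiking", "Cooking;Joking", "Running", "Swimming"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: type[1] where kw is found in col, type[2] elsewhere
dum_grepl <- function(kw, col, type = c(TRUE, FALSE)) {
  ifelse(grepl(kw, col, ignore.case = TRUE), type[1], type[2])
}

Drawing <- dum_grepl("drawing", df$good_at)
Cooking <- dum_grepl("cooking", df$good_at)
# A keyword that never occurs simply yields all FALSE
Knitting <- dum_grepl("knitting", df$good_at)

print(data.frame(name = df$name, Drawing, Cooking, Knitting))
```

Passing type = c(1, 0) returns a numeric 1/0 vector instead of logicals.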
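Overview

To create dummy variables for each unique value in good_at, the following steps are needed:
- Split good_at into multiple rows
- Generate dummy variables, using dummy::dummy(), for each value in good_at for each name-sex pair
- Reshape the data into 4 columns: name, sex, key and value
  - key contains all the dummy variable column names
  - value contains the values in each dummy variable
- Keep only the records where value is not equal to zero
- Reshape the data into one record per name-sex pair, with as many columns as there are values in key
- Cast the dummy columns into logical vectors.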
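The split-then-reshape idea behind these steps can be sketched in base R alone. This is an illustrative sketch under assumptions, not the tidyverse pipeline itself: the data frame is rebuilt from the question's sample, the names parts, long and wide are made up, and name is assumed to identify a person uniquely.

```r
# Sample data from the question
df <- data.frame(
  name = c("Tom", "Mary", "Sam", "Charlie"),
  sex = c("M", "F", "M", "M"),
  good_at = c("Drawing;Hiking", "Cooking;Joking", "Running", "Swimming"),
  stringsAsFactors = FALSE
)

# Step 1: split good_at so that each value gets its own row
parts <- strsplit(df$good_at, ";", fixed = TRUE)
long <- data.frame(
  name = rep(df$name, lengths(parts)),
  sex = rep(df$sex, lengths(parts)),
  good_at = unlist(parts),
  stringsAsFactors = FALSE
)

# Remaining steps in one go: cross-tabulate name against value,
# then turn the counts into logicals (one row per name, one column per value)
wide <- unclass(table(long$name, long$good_at)) > 0

print(long)
print(wide)
```

Here table() orders both rows and columns alphabetically, so the row order differs from the input.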
Code

# load necessary packages ----
library(dummy)
library(tidyverse)

# load necessary data ----
df <-
  read.table(text = "name sex good_at
1 Tom M Drawing;Hiking
2 Mary F Cooking;Joking
3 Sam M Running
4 Charlie M Swimming"
             , header = TRUE
             , stringsAsFactors = FALSE)

# create a longer version of df -----
# where one record represents
# one unique name, sex, good_at value
df_clean <-
  df %>%
  separate_rows(good_at, sep = ";")

# create dummy variables for all unique values in "good_at" column ----
df_dummies <-
  df_clean %>%
  select(good_at) %>%
  dummy() %>%
  bind_cols(df_clean) %>%
  # drop "good_at" column
  select(-good_at) %>%
  # make the tibble long by reshaping it into 4 columns:
  # name, sex, key and value
  # where key holds all the dummy variable column names
  # and value holds the values in each dummy variable
  gather(key, value, -name, -sex) %>%
  # keep records where
  # value is not equal to zero
  # note: this is due to "Tom" having both a
  # "good_at_Drawing" value of 0 and 1.
  filter(value != 0) %>%
  # make the tibble wide
  # with one record per name-sex pair
  # and as many columns as there are in key
  # with their values from value
  # and filling NA values to 0
  spread(key, value, fill = 0) %>%
  # for each name-sex pair
  # cast the dummy variables into logical vectors
  group_by(name, sex) %>%
  mutate_all(~ as.logical(as.integer(.))) %>%
  ungroup() %>%
  # just for safety let's join
  # the original "good_at" column
  left_join(y = df, by = c("name", "sex")) %>%
  # bring the original "good_at" column to the left-hand side
  # of the tibble
  select(name, sex, good_at, matches("good_at_"))

# view result ----
df_dummies
# A tibble: 4 x 9
#   name  sex   good_at good_at_Cooking good_at_Drawing good_at_Hiking
#   <chr> <chr> <chr>   <lgl>           <lgl>           <lgl>
# 1 Char… M     Swimmi… FALSE           FALSE           FALSE
# 2 Mary  F     Cookin… TRUE            FALSE           FALSE
# 3 Sam   M     Running FALSE           FALSE           FALSE
# 4 Tom   M     Drawin… FALSE           TRUE            TRUE
# ... with 3 more variables: good_at_Joking <lgl>, good_at_Running <lgl>,
#   good_at_Swimming <lgl>

# end of script #
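If installing packages is not an option, a result with the same column layout can be sketched in base R by splitting each string and testing exact membership. This is a sketch under assumptions rather than an equivalent of the pipeline above: the names parts, skills, dummies and result are made up for illustration, values are assumed to contain no stray whitespace around the ; separator, and rows keep their input order instead of being sorted by name.

```r
# Sample data from the question
df <- data.frame(
  name = c("Tom", "Mary", "Sam", "Charlie"),
  sex = c("M", "F", "M", "M"),
  good_at = c("Drawing;Hiking", "Cooking;Joking", "Running", "Swimming"),
  stringsAsFactors = FALSE
)

# One character vector of values per person
parts <- strsplit(df$good_at, ";", fixed = TRUE)

# Every distinct value that occurs anywhere in the column
skills <- sort(unique(unlist(parts)))

# Logical matrix: one row per person, one column per distinct value.
# %in% tests whole values, so "Run" would not match "Running".
dummies <- sapply(skills, function(s) {
  vapply(parts, function(p) s %in% p, logical(1))
})
colnames(dummies) <- paste0("good_at_", skills)

# Attach the dummy columns to the original data
result <- cbind(df, dummies)
print(result)
```

Unlike a grep-based test, this treats each value as a whole token, which matters when one value is a substring of another.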