Create dummy columns if a row contains a string
My data set looks something like this:
x <- data.frame(id=c(1,2,3),
col1=c("UX1", "UX3", "UX2"),
col2=c("UX2", "UX1", "UX1"),
col3=c("PROC1", "PROC2", "PROC3"),
col4=c("PROC3", "PROC3", "PROC1")
)
Output:
id col1 col2 col3 col4
1 1 UX1 UX2 PROC1 PROC3
2 2 UX3 UX1 PROC2 PROC3
3 3 UX2 UX1 PROC3 PROC1
I would like the output to look like this:
x2 <- data.frame(id=c(1,2,3),
col1=c("UX1", "UX3", "UX2"),
col2=c("UX2", "UX1", "UX1"),
col3=c("PROC1", "PROC2", "PROC3"),
col4=c("PROC3", "PROC3", "PROC1"),
UX1=c(1,1,1),
UX2=c(1,0,1),
UX3=c(0,1,0),
PROC1=c(1,0,1),
PROC2=c(0,1,0),
PROC3=c(1,1,1))
Desired output:
id col1 col2 col3 col4 UX1 UX2 UX3 PROC1 PROC2 PROC3
1 1 UX1 UX2 PROC1 PROC3 1 1 0 1 0 1
2 2 UX3 UX1 PROC2 PROC3 1 0 1 0 1 1
3 3 UX2 UX1 PROC3 PROC1 1 1 0 1 0 1
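So basically I want a dummy column that is 1 if a row contains a given string in any of its columns. I can use dummy.data.frame from library(dummies), e.g.
y <- dummy.data.frame(x)
But this method treats (for example) UX1 in the first column as different from UX1 in the second column, so dummy.data.frame does not work here.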
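To make the requirement concrete, here is a minimal base R sketch of "1 if the string appears anywhere in the row". It is only an illustration, not part of the original question: the helper names code_cols, codes and dummies are made up for this example, and it assumes the code columns are stored as character rather than factor.

```r
# Sample data from the question (strings kept as character, not factor)
x <- data.frame(id = c(1, 2, 3),
                col1 = c("UX1", "UX3", "UX2"),
                col2 = c("UX2", "UX1", "UX1"),
                col3 = c("PROC1", "PROC2", "PROC3"),
                col4 = c("PROC3", "PROC3", "PROC1"),
                stringsAsFactors = FALSE)

# Columns to scan for codes (everything except id); hypothetical helper name
code_cols <- setdiff(names(x), "id")

# All distinct codes found anywhere in those columns, sorted alphabetically
codes <- sort(unique(unlist(x[code_cols], use.names = FALSE)))

# For each code, flag the rows where at least one scanned column equals it
dummies <- sapply(codes, function(code) {
  as.integer(rowSums(x[code_cols] == code, na.rm = TRUE) > 0)
})

# Append the 0/1 columns (named after the codes) to the original data
x2 <- cbind(x, dummies)
x2
```

Because the comparison runs across all code columns at once, UX1 in col1 and UX1 in col2 end up in the same dummy column, which is the behaviour dummy.data.frame does not provide.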
Here is an idea using the tidyverse. We first gather() every variable except id. We then spread() to get the required structure and use a simple replace() to 'dummify' our data, i.e.
library(tidyverse)
x %>%
gather(var, val, -id) %>%
spread(val, var, fill = 0) %>%
mutate_at(vars(-id), funs(replace(., . != 0, 1)))
which gives,
id PROC1 PROC2 PROC3 UX1 UX2 UX3
1 1 1 0 1 1 1 0
2 2 0 1 1 1 0 1
3 3 1 0 1 1 1 0
You can then easily cbind() it to the original data frame, i.e.
x2 <- x %>%
gather(var, val, -id) %>%
spread(val, var, fill = 0) %>%
mutate_at(vars(-id), funs(replace(., . != 0, 1)))
cbind(x, x2)
# id col1 col2 col3 col4 id PROC1 PROC2 PROC3 UX1 UX2 UX3
#1 1 UX1 UX2 PROC1 PROC3 1 1 0 1 1 1 0
#2 2 UX3 UX1 PROC2 PROC3 2 0 1 1 1 0 1
#3 3 UX2 UX1 PROC3 PROC1 3 1 0 1 1 1 0
Note: As @mmn points out, we can join on id instead of using cbind(), which avoids the duplicated id column, i.e.
x %>%
gather(var, val, - id) %>%
spread(val, var, fill = 0) %>%
mutate_at(vars(-id), funs(replace(., . != 0, 1))) %>%
left_join(x, ., by = 'id')
# id col1 col2 col3 col4 PROC1 PROC2 PROC3 UX1 UX2 UX3
#1 1 UX1 UX2 PROC1 PROC3 1 0 1 1 1 0
#2 2 UX3 UX1 PROC2 PROC3 0 1 1 1 0 1
#3 3 UX2 UX1 PROC3 PROC1 1 0 1 1 1 0
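The same reshape-then-flag idea can also be sketched with tidyr's newer pivot functions. This is an assumed alternative, not the answer's original code: it presumes a tidyr version that provides pivot_longer() and pivot_wider() (1.0.0 or later), and the column name flag and the object names dummies and result are illustrative only.

```r
library(dplyr)
library(tidyr)

# Sample data from the question (strings kept as character, not factor)
x <- data.frame(id = c(1, 2, 3),
                col1 = c("UX1", "UX3", "UX2"),
                col2 = c("UX2", "UX1", "UX1"),
                col3 = c("PROC1", "PROC2", "PROC3"),
                col4 = c("PROC3", "PROC3", "PROC1"),
                stringsAsFactors = FALSE)

dummies <- x %>%
  # long format: one row per id / source column / code
  pivot_longer(-id, names_to = "var", values_to = "val") %>%
  # keep a single row per id/code pair, so repeated codes still yield 1
  distinct(id, val) %>%
  # hypothetical indicator column that becomes the cell value
  mutate(flag = 1) %>%
  # wide format: one column per code, 0 where the id never has that code
  pivot_wider(names_from = val, values_from = flag,
              values_fill = list(flag = 0))

# attach the dummy columns to the original data
result <- left_join(x, dummies, by = "id")
result
```

Dropping duplicates before widening keeps every cell at 0 or 1 even when the same code shows up in two columns of one row. Note that pivot_wider() orders the new columns by first appearance rather than alphabetically.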
For the sake of completeness, here is also a data.table alternative:
# load the data table package
library(data.table)
# create the sample data set
x <- data.frame(id=c(1,2,3),
col1=c("UX1", "UX3", "UX2"),
col2=c("UX2", "UX1", "UX1"),
col3=c("PROC1", "PROC2", "PROC3"),
col4=c("PROC3", "PROC3", "PROC1")
)
# convert data frame to data table
x <- data.table(x)
# first convert the data to long format using melt
# then use dcast to go back to wide format, turning each distinct "value" into a column
# and counting its non-missing occurrences per id (0 if the id never has that value)
# then join the result onto the original data set
x[dcast(melt(x, "id"), id ~ value, fun.aggregate = function(v) sum(!is.na(v))), on = "id"]
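The one-liner above can be unpacked into separate steps, which makes the intermediate tables easier to inspect. The following is a sketch under a few assumptions: the names long, wide and result are illustrative, the code columns are character, and the aggregate is changed to a presence test so that a code occurring twice in one row still yields 1 rather than a count of 2.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
x <- data.table(id = c(1, 2, 3),
                col1 = c("UX1", "UX3", "UX2"),
                col2 = c("UX2", "UX1", "UX1"),
                col3 = c("PROC1", "PROC2", "PROC3"),
                col4 = c("PROC3", "PROC3", "PROC1"))

# Step 1: long format with columns id, variable, value
long <- melt(x, id.vars = "id", variable.factor = FALSE)

# Step 2: wide format with one column per distinct value;
# the aggregate returns 1 if the id has the value at least once, else 0
wide <- dcast(long, id ~ value,
              fun.aggregate = function(v) as.integer(length(v) > 0),
              value.var = "value")

# Step 3: join the dummy columns back onto the original table by id
result <- x[wide, on = "id"]
result
```

For id/value combinations that never occur, dcast obtains the fill value by calling the aggregate on an empty vector, so those cells come out as 0.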