How to create dummy variables using various columns sharing same levels
I am trying to get dummy variables for the following table:
df:
Value1 var1 var2 var3 var4
9.330154398 HomeATL AwayHOU HomeEast AwayWest
32.43881489 AwaySDN HomeATL HomeWest AwayWest
54.77178387 AwayLAN HomeATL AwayEast HomeSame
54.77178387 AwayLAN HomeATL AwayWest HomeEast
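Columns var1 and var2 share the same levels, and columns var3 and var4 likewise share their own set of levels. So when the dummy variables are created, the new columns should not contain duplicated levels. For example, for var3 and var4, rows 1 and 3 both contain AwayWest, so I only need a single column named AwayWest, filled with 1 on each of those rows.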
My desired output is:
Value1 HomeEast HomeWest AwayEast AwayWest HomeSame HomeATL AwayHOU AwaySDN AwayLAN
9.330154398 1 0 0 1 0 1 1 0 0
32.43881489 0 1 0 1 0 1 0 1 0
54.77178387 0 0 1 0 1 1 0 0 1
54.77178387 1 0 0 1 0 1 0 0 1
I tried adding a new column of 1s (col1) and spreading each column I want to convert:
spread(df,var1, col1) %>%
spread(var2, col1)%>%
spread(var3, col1)%>%
spread(var1, col1)
But it does not work.
Thanks
A base R option is to use model.matrix:
df <- cbind(df[, "Value1", drop = F], model.matrix(Value1 ~ . - 1, data = df))
df
# Value1 var1AwayLAN var1AwaySDN var1HomeATL var2HomeATL var3AwayWest
#1 9.330154 0 0 1 0 0
#2 32.438815 0 1 0 1 0
#3 54.771784 1 0 0 1 0
#4 54.771784 1 0 0 1 1
# var3HomeEast var3HomeWest var4HomeEast var4HomeSame
#1 1 0 0 0
#2 0 1 0 0
#3 0 0 0 1
#4 0 0 1 0
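If necessary, we can clean up the column names with
names(df) <- sub("var\\d", "", names(df))
to bring them closer to your expected output. Note that model.matrix keeps every level only for the first factor (var1); for var2 to var4 the first level is dropped as the reference, and levels shared between columns (such as HomeATL) still get one column per source variable.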
Sample data
df <- read.table(text =
"Value1 var1 var2 var3 var4
9.330154398 HomeATL AwayHOU HomeEast AwayWest
32.43881489 AwaySDN HomeATL HomeWest AwayWest
54.77178387 AwayLAN HomeATL AwayEast HomeSame
54.77178387 AwayLAN HomeATL AwayWest HomeEast", header = T)
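You can also do this with data.table:
library(data.table)
# add a row id so each original row can be identified after melting
setDT(df)[, id := 1:.N]
# prefix every value with its column name, melt to long form, then count per id
cbind(df[, .(Value1)], dcast(
  melt(setDT(df)[, c(.(id = id), lapply(c("var1","var2","var3","var4"), function(x) paste0(x, get(x))))], id.vars = "id"),
  id ~ value,
  length))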
Output:
Value1 id var1AwayLAN var1AwaySDN var1HomeATL var2AwayHOU var2HomeATL var3AwayEast var3AwayWest var3HomeEast var3HomeWest
1: 9.330154 1 0 0 1 1 0 0 0 1 0
2: 32.438815 2 0 1 0 0 1 0 0 0 1
3: 54.771784 3 1 0 0 0 1 1 0 0 0
4: 54.771784 4 1 0 0 0 1 0 1 0 0
A base R option would be table, cross-tabulating the row index against the values:
tbl <- +(table(c(row(df1[-1])), unlist(df1[-1])) > 0)
cbind(df1[1], as.data.frame.matrix(tbl))
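For readers who find the table one-liner dense, here is a more explicit base R sketch of the same idea. It is an illustration only, not the answerer's method. It assumes the var columns are character vectors, and the names vals, lvls, dummies and out are made up for this example.

```r
# Example data, same values as the question (character columns assumed)
df1 <- data.frame(
  Value1 = c(9.330154398, 32.43881489, 54.77178387, 54.77178387),
  var1 = c("HomeATL", "AwaySDN", "AwayLAN", "AwayLAN"),
  var2 = c("AwayHOU", "HomeATL", "HomeATL", "HomeATL"),
  var3 = c("HomeEast", "HomeWest", "AwayEast", "AwayWest"),
  var4 = c("AwayWest", "AwayWest", "HomeSame", "HomeEast"),
  stringsAsFactors = FALSE
)

vals <- df1[-1]                                   # only the categorical columns
lvls <- unique(unlist(vals, use.names = FALSE))   # every level, listed once

# For each level: 1 if it occurs anywhere in the row, 0 otherwise
dummies <- sapply(lvls, function(l) as.integer(rowSums(vals == l) > 0))

out <- cbind(df1["Value1"], dummies)
out
```

Because each level is looked up across all var columns at once, a level that appears in both var1 and var2 (or var3 and var4) ends up in a single column.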
Or using tidyverse:
library(tidyverse)
rownames_to_column(df1, 'rn') %>%
gather(key, val, var1:var4) %>%
count(rn, val) %>%
spread(val, n, fill = 0) %>%
select(-rn) %>%
bind_cols(df1[1], .)
Data
df1 <- structure(list(Value1 = c(9.330154398, 32.43881489, 54.77178387,
54.77178387), var1 = c("HomeATL", "AwaySDN", "AwayLAN", "AwayLAN"
), var2 = c("AwayHOU", "HomeATL", "HomeATL", "HomeATL"), var3 = c("HomeEast",
"HomeWest", "AwayEast", "AwayWest"), var4 = c("AwayWest", "AwayWest",
"HomeSame", "HomeEast")), class = "data.frame", row.names = c(NA,
-4L))
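gather() and spread() are superseded in current tidyr, so the tidyverse pipeline above can also be written with pivot_longer() and pivot_wider(). The following is a sketch under that assumption (tidyr 1.0 or later), not part of the original answer. distinct() is added so that a level appearing twice in one row still yields 1 rather than a count, and the names rn, key, val, n and out are illustrative.

```r
library(dplyr)
library(tidyr)

# Example data, same values as df1 above
df1 <- data.frame(
  Value1 = c(9.330154398, 32.43881489, 54.77178387, 54.77178387),
  var1 = c("HomeATL", "AwaySDN", "AwayLAN", "AwayLAN"),
  var2 = c("AwayHOU", "HomeATL", "HomeATL", "HomeATL"),
  var3 = c("HomeEast", "HomeWest", "AwayEast", "AwayWest"),
  var4 = c("AwayWest", "AwayWest", "HomeSame", "HomeEast"),
  stringsAsFactors = FALSE
)

out <- df1 %>%
  mutate(rn = row_number()) %>%                  # numeric row id keeps row order
  pivot_longer(var1:var4, names_to = "key", values_to = "val") %>%
  distinct(rn, Value1, val) %>%                  # one record per row and level
  mutate(n = 1L) %>%
  pivot_wider(names_from = val, values_from = n,
              values_fill = list(n = 0L)) %>%    # absent levels become 0
  select(-rn)

out
```

Using a numeric row id instead of character row names avoids the "1", "10", "2" ordering that character sorting would give on longer tables.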