Turn a long dataset of classes taken into a wide dataset where the variables are dummy codes for each class
Suppose I have a dataset in which each row is a class a person took:
attendance <- data.frame(id = c(1, 1, 1, 2, 2),
class = c("Math", "English", "Math", "Reading", "Math"))
I.e.,
id class
1 1 "Math"
2 1 "English"
3 1 "Math"
4 2 "Reading"
5 2 "Math"
I would like to create a new dataset where the rows are ids and the variables are the class names, like so:
class.names <- names(table(attendance$class))
attedance2 <- matrix(nrow=length(table(attendance$id)),
ncol=length(class.names))
colnames(attedance2) <- class.names
attedance2 <- as.data.frame(attedance2)
attedance2$id <- unique(attendance$id)
I.e.,
English Math Reading id
1 NA NA NA 1
2 NA NA NA 2
I want to fill the NAs with whether or not that particular id took the class. The values can be Yes/No, 1/0, or counts of the classes
I.e.,
English Math Reading id
1 "Yes" "Yes" "No" 1
2 "No" "Yes" "Yes" 2
I'm familiar with dplyr, so a solution that uses it would be easier for me, but it's not required. Thanks for your help!
Using:
library(reshape2)
attendance$val <- 'yes'
dcast(unique(attendance), id ~ class, value.var = 'val', fill = 'no')
gives:
id English Math Reading
1 1 yes yes no
2 2 no yes yes
A similar approach with data.table:
library(data.table)
dcast(unique(setDT(attendance))[,val:='yes'], id ~ class, value.var = 'val', fill = 'no')
Or with dplyr/tidyr:
library(dplyr)
library(tidyr)
attendance %>%
distinct() %>%
mutate(var = 'yes') %>%
spread(class, var, fill = 'no')
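In more recent versions of tidyr, spread() is superseded by pivot_wider(). The following is a minimal sketch of the same reshape, assuming tidyr 1.1.0 or later is installed (where values_fill accepts a single value). The column name taken is a hypothetical helper introduced only for this illustration:

```r
library(dplyr)
library(tidyr)

attendance <- data.frame(id = c(1, 1, 1, 2, 2),
                         class = c("Math", "English", "Math", "Reading", "Math"))

wide <- attendance %>%
  distinct(id, class) %>%            # drop repeated id/class pairs
  mutate(taken = "yes") %>%          # 'taken' is a hypothetical helper column
  pivot_wider(names_from = class,    # one new column per class
              values_from = taken,
              values_fill = "no")    # id/class pairs that never occur become "no"

wide
```

Note that pivot_wider() orders the new columns by first appearance in the data rather than alphabetically.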
Another, somewhat more convoluted option is to reshape first and then replace the counts with yes and no (see here for an explanation of the default aggregation option of dcast):
att2 <- dcast(attendance, id ~ class, value.var = 'class')
gives:
id English Math Reading
1 1 1 2 0
2 2 0 1 1
Now you can replace the counts with:
# create index which counts are above zero
idx <- att2[,-1] > 0
# replace the non-zero values with 'yes'
att2[,-1][idx] <- 'yes'
# replace the zero values with 'no'
att2[,-1][!idx] <- 'no'
which finally gives:
> att2
id English Math Reading
1 1 yes yes no
2 2 no yes yes
We can do this with base R:
attendance$val <- "yes"
d1 <- reshape(attendance, idvar = 'id', direction = 'wide', timevar = 'class')
d1[is.na(d1)] <- "no"
names(d1) <- sub("val\\.", '', names(d1))
d1
# id Math English Reading
#1 1 yes yes no
#4 2 yes no yes
Or with xtabs:
xtabs(val ~id + class, transform(unique(attendance), val = 1))
# class
# id English Math Reading
# 1 1 1 0
# 2 0 1 1
Note: the binary values can easily be converted to 'yes'/'no', but it is better to keep them as 1/0 or TRUE/FALSE.
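The conversion mentioned in the note can be sketched in base R as follows. This is only an illustration, not part of the answers above: it uses table() rather than xtabs(), and the names counts, taken, dummy, labels and result are hypothetical.

```r
attendance <- data.frame(id = c(1, 1, 1, 2, 2),
                         class = c("Math", "English", "Math", "Reading", "Math"))

# Cross-tabulate id by class; unclass() turns the table into a plain integer matrix
counts <- unclass(table(attendance$id, attendance$class))

# Logical indicator: TRUE where the id took the class at least once
taken <- counts > 0

# 1/0 dummy codes; multiplying a logical matrix by 1 keeps dim and dimnames
dummy <- taken * 1

# 'yes'/'no' labels, derived from the logical matrix only when needed for display
labels <- ifelse(taken, "yes", "no")

# One row per id, with the dummy codes as columns
result <- data.frame(id = as.numeric(rownames(counts)),
                     dummy,
                     check.names = FALSE)
result
```

Keeping the logical or 1/0 form as the working representation means sums and means per class stay meaningful; the 'yes'/'no' labels are then a one-line transformation at the end.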