Create several dummy variables from one string variable
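I have tried almost everything from this similar question, but I can't get the results that everyone else seems to get. Here is my problem:
I have a data frame like this, listing the grades that each teacher teaches:
> profs <- data.frame(teaches = c("1st", "1st, 2nd",
                                  "2nd, 3rd",
                                  "1st, 2nd, 3rd"))
> profs
        teaches
1           1st
2      1st, 2nd
3      2nd, 3rd
4 1st, 2nd, 3rd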
I have been looking for a way to split the teaches variable into columns, like this:
  teaches1st teaches2nd teaches3rd
1          1          0          0
2          1          1          0
3          0          1          1
4          1          1          1
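Given the answerer's explanations, I understand that this solution, which involves the splitstackshape library and the apparently deprecated concat.split.expanded function, should do exactly what I want. However, I can't seem to get the same result: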
> concat.split.expanded(profs, "teaches", fill = 0, drop = TRUE)
Fehler in seq.default(min(vec), max(vec)) :
'from' cannot be NA, NaN or infinite
Using cSplit, which I understand supersedes "most of the earlier concat.split* functions", I get this:
> cSplit(profs, "teaches")
   teaches_1 teaches_2 teaches_3
1:       1st        NA        NA
2:       1st       2nd        NA
3:       2nd       3rd        NA
4:       1st       2nd       3rd
I have tried reading the help for cSplit and tweaking every single argument, but I just can't get the split to work. I appreciate any help.
I found a workaround. If you have a string variable that contains only separators and numbers, concat.split.expanded seems to work, i.e.:
> profs <- data.frame(teaches = c("1", "1, 2", "2, 3", "1, 2, 3"))
> profs
  teaches
1       1
2    1, 2
3    2, 3
4 1, 2, 3
Now concat.split.expanded behaves the same way as in Dummy variables from a string variable:
> concat.split.expanded(profs, "teaches", fill = 0, drop = TRUE)
  teaches_1 teaches_2 teaches_3
1         1         0         0
2         1         1         0
3         0         1         1
4         1         1         1
However, I am still looking for a solution that does not involve removing all the letters from my teaches variable.
Here is another option:
Vectorize(grepl, 'pattern')(c('1st', '2nd', '3rd'), profs$teaches)
#        1st   2nd   3rd
# [1,]  TRUE FALSE FALSE
# [2,]  TRUE  TRUE FALSE
# [3,] FALSE  TRUE  TRUE
# [4,]  TRUE  TRUE  TRUE
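One thing to keep in mind with the grepl approach is that it matches substrings, so a pattern like '1st' would also match a label such as '11st'. If that matters for your data, a base R sketch along the following lines splits each row first and then compares whole labels. The names tokens, grades and dummies are just placeholders I made up, and I am assuming the separator is always a comma optionally followed by whitespace:

```r
# Same example data as in the question
profs <- data.frame(teaches = c("1st", "1st, 2nd", "2nd, 3rd", "1st, 2nd, 3rd"),
                    stringsAsFactors = FALSE)

# Split each row into its individual labels (assumes "," plus optional spaces)
tokens <- strsplit(profs$teaches, ",\\s*")

# Collect the distinct labels that occur anywhere in the column
grades <- sort(unique(unlist(tokens)))

# One row per teacher, one column per label; 1 if the row contains that label
dummies <- do.call(rbind, lapply(tokens, function(x) as.integer(grades %in% x)))
colnames(dummies) <- paste0("teaches", grades)
dummies
#      teaches1st teaches2nd teaches3rd
# [1,]          1          0          0
# [2,]          1          1          0
# [3,]          0          1          1
# [4,]          1          1          1
```

Because the labels are discovered from the data, you don't have to hard-code c('1st', '2nd', '3rd'), and cbind(profs, dummies) would attach the result to the original data frame.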
You could try mtabulate from qdapTools:
library(qdapTools)
res <- mtabulate(strsplit(as.character(profs$teaches), ', '))
colnames(res) <- paste0('teaches', colnames(res))
res
#  teaches1st teaches2nd teaches3rd
#1          1          0          0
#2          1          1          0
#3          0          1          1
#4          1          1          1
Or using stringi:
library(stringi)
(vapply(c('1st', '2nd', '3rd'), stri_detect_fixed, logical(4L),
        str=profs$teaches))+0L
#     1st 2nd 3rd
#[1,]   1   0   0
#[2,]   1   1   0
#[3,]   0   1   1
#[4,]   1   1   1
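Since your concatenated data are concatenated strings (not concatenated numeric values), you need to add type = "character" to get the function to work as you expect.
The function's default setting is for numeric values, hence the error about NaN and so on.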
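To see where that error plausibly comes from, here is a small base R sketch. It does not call the package at all; it only mimics the seq(min(vec), max(vec)) call shown in the error message, under the assumption that vec holds the split values after numeric coercion:

```r
# Labels such as "1st" are not numbers, so coercing them yields NA
# (R also emits a coercion warning, suppressed here)
vec <- suppressWarnings(as.numeric(c("1st", "2nd", "3rd")))
vec
# [1] NA NA NA

# min() and max() of an all-NA vector are NA, and seq() rejects an NA
# endpoint; capture the error message instead of stopping the script
result <- tryCatch(seq(min(vec), max(vec)),
                   error = function(e) conditionMessage(e))
result
```

The exact wording of the message stored in result depends on your R version and locale, but it is the same complaint about 'from' that appears in the question.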
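The function has also been renamed to be more consistent with the short names of the other functions in the same family, so it is now cSplit_e (though the old function name still works).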
library(splitstackshape)
cSplit_e(profs, "teaches", ",", type = "character", fill = 0)
#         teaches teaches_1st teaches_2nd teaches_3rd
# 1           1st           1           0           0
# 2      1st, 2nd           1           1           0
# 3      2nd, 3rd           0           1           1
# 4 1st, 2nd, 3rd           1           1           1
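The help page for ?concat.split.expanded is the same as the one for cSplit_e. If you have any tips on making it easier to understand, please file an issue on the package's GitHub page.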