R: extract education degree with str_match_all
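I have a string column in R that holds education information (university + degree). I want to extract the degrees and create two categorical variables that indicate an undergraduate or a graduate degree (undergrad.dummy and grad.dummy).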
df = data.frame(educ = c("Angelo State University (BBA, Finance; BBA, Economics;)", "University of Oxford (MA, Philosophy; MA, Economics;", "Ross School of Business, University of Michigan (MBA; BBA;)"))
My approach was to create a list of undergraduate degrees and a list of graduate degrees, as follows:
undergrad.list = c("BBA", "BA")
grad.list = c("MA", "MBA", "PhD")
What I tried first was to extract the degrees from educ:
df$degree = str_match_all(df$educ, "BBA|MS|MBA|BA")
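The problem is that the result for a single row can be a vector with several matches, and I am struggling to turn that into undergraduate and graduate indicators across my large dataset. In the end, I want: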
df = data.frame(educ = c("Angelo State University (BBA, Finance; BBA, Economics;)", "University of Oxford (MA, Philosophy; MA, Economics;", "Ross School of Business, University of Michigan (MBA; BBA;)"), undergrad.dummy = c(1,0,1), grad.dummy = c(0,1,1))
I would appreciate any suggestions for solving this problem.
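A note on the attempt above: str_match_all returns a list with one character matrix per row, which is why the result is awkward to store in a single column. The following is only a minimal sketch of one way to continue along the extraction route, assuming the two degree lists from the question are complete. It swaps in str_extract_all (one character vector per row) and then reduces each vector to a 0/1 flag with %in%.

```r
library(stringr)

educ <- c(
  "Angelo State University (BBA, Finance; BBA, Economics;)",
  "University of Oxford (MA, Philosophy; MA, Economics;",
  "Ross School of Business, University of Michigan (MBA; BBA;)"
)
undergrad.list <- c("BBA", "BA")
grad.list <- c("MA", "MBA", "PhD")

# One alternation built from both lists; \\b stops "BA" from matching inside "MBA".
pattern <- str_c("\\b(", str_c(c(undergrad.list, grad.list), collapse = "|"), ")\\b")

# str_extract_all() returns a list holding one character vector per row.
degrees <- str_extract_all(educ, pattern)

# Reduce each row's vector of matches to a single 0/1 flag per degree group.
undergrad.dummy <- as.integer(
  vapply(degrees, function(d) any(d %in% undergrad.list), logical(1))
)
grad.dummy <- as.integer(
  vapply(degrees, function(d) any(d %in% grad.list), logical(1))
)
```

Rows without any match yield an empty vector, so both flags come out as 0 for them.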
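Keep the pattern vectors in a list and loop over the list with map (from purrr). For each vector, paste the elements into a single string by collapsing them with | (OR), and use that string as the pattern in str_detect, which returns a logical vector. Coerce the logical vector to binary (with as.integer or +), rename the columns returned by map_dfc, and bind those columns to the original dataset.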
library(dplyr)
library(purrr)
library(stringr)
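# For each degree list, build "\\b(A|B)\\b" and test every row of educ.
# The unary + turns the logical vector into 0/1.
map_dfc(list(undergrad.list, grad.list), ~
  +(str_detect(df$educ, str_c("\\b(", str_c(.x, collapse = "|"), ")\\b")))) %>%
  set_names(c("undergrad.dummy", "grad.dummy")) %>%
  bind_cols(df, .)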
-output
# educ undergrad.dummy grad.dummy
#1 Angelo State University (BBA, Finance; BBA, Economics;) 1 0
#2 University of Oxford (MA, Philosophy; MA, Economics; 0 1
#3 Ross School of Business, University of Michigan (MBA; BBA;) 1 1
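To make the pattern-building step easier to inspect, here is a rough base-R sketch of the same idea. It is not the answer's code: build_pattern is a hypothetical helper introduced only for illustration, and grepl with perl = TRUE is swapped in for str_detect.

```r
undergrad.list <- c("BBA", "BA")
grad.list <- c("MA", "MBA", "PhD")

# Hypothetical helper: collapse a degree vector into "\b(A|B)\b".
build_pattern <- function(degrees) {
  paste0("\\b(", paste(degrees, collapse = "|"), ")\\b")
}

# Naming the list elements up front gives the columns their final names.
patterns <- lapply(
  list(undergrad.dummy = undergrad.list, grad.dummy = grad.list),
  build_pattern
)

# cat() prints the regex as the engine sees it, with single backslashes.
cat(patterns$undergrad.dummy, "\n")  # \b(BBA|BA)\b

df <- data.frame(educ = c(
  "Angelo State University (BBA, Finance; BBA, Economics;)",
  "University of Oxford (MA, Philosophy; MA, Economics;",
  "Ross School of Business, University of Michigan (MBA; BBA;)"
))

# One 0/1 vector per pattern, bound to the original data frame.
dummies <- lapply(patterns, function(p) as.integer(grepl(p, df$educ, perl = TRUE)))
result <- cbind(df, as.data.frame(dummies))
```

The doubled backslash matters: inside an R string literal, "\b" is a backspace character, while "\\b" reaches the regex engine as the word-boundary escape.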
for (i in 1:nrow(df)) {
if (grepl("BA", df$educ[i]) == TRUE) {
df$unddummy[i] <- 1
} else {
df$unddummy[i] <- 0
}
if (grepl("MBA|MA|PhD", df$educ[i]) == TRUE) {
df$graddummy[i] <- 1
} else {
df$graddummy[i] <- 0
}
}
df
educ unddummy graddummy
1 Angelo State University (BBA, Finance; BBA, Economics;) 1 0
2 University of Oxford (MA, Philosophy; MA, Economics; 0 1
3 Ross School of Business, University of Michigan (MBA; BBA;) 1 1
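This solution checks the content of each cell in df$educ and writes the corresponding value into unddummy and graddummy. Two things to note. First, when checking for an undergraduate degree it is enough to look for "BA", because "BBA" contains "BA". Second, this hints at an important problem: my regex skills are almost non-existent. The code will also flag a row as "undergraduate" whenever an uppercase "BA" appears outside the degree list, for example in a university name such as "University of Bali" written in capitals. I will edit this once I find the right way to do it with regex...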
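One possible way to address the caveat above is to wrap the alternatives in word boundaries and drop the loop, since grepl is already vectorised. This is a sketch only: the fourth row is made up purely to exercise the edge cases (a name containing "BA" plus a lone MBA), and the degree alternatives are taken from the lists in the question.

```r
df <- data.frame(educ = c(
  "Angelo State University (BBA, Finance; BBA, Economics;)",
  "University of Oxford (MA, Philosophy; MA, Economics;",
  "Ross School of Business, University of Michigan (MBA; BBA;)",
  "UNIVERSITY OF BALI (MBA;)"  # made-up row, not from the original data
))

# \\b requires a word boundary on both sides, so "BA" no longer matches
# inside "MBA" or "BALI". perl = TRUE selects the PCRE engine.
df$unddummy  <- as.integer(grepl("\\b(BBA|BA)\\b", df$educ, perl = TRUE))
df$graddummy <- as.integer(grepl("\\b(MBA|MA|PhD)\\b", df$educ, perl = TRUE))

# For comparison: the plain substring test used in the loop above.
naive <- as.integer(grepl("BA", df$educ))
```

On the made-up row, the plain substring test reports an undergraduate degree while the word-boundary version does not.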