R: extract education degree with str_match_all

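I have a character column in an R data frame that contains education information (university + degree). I want to extract the degrees and create two dummy variables that indicate whether a person holds an undergraduate or a graduate degree (undergrad.dummy, grad.dummy).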

df = data.frame(educ = c("Angelo State University  (BBA, Finance; BBA, Economics;)", "University of Oxford  (MA, Philosophy; MA, Economics;", "Ross School of Business, University of Michigan  (MBA; BBA;)"))

My approach is to create one list of undergraduate degrees and one of graduate degrees, as follows:

undergrad.list = c("BBA", "BA")
grad.list = c("MA", "MBA", "PhD")

What I tried was to first extract the degrees from educ:

library(stringr)
df$degree = str_match_all(df$educ, "BBA|MS|MBA|BA")

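The problem is that the result for each row can be a vector (str_match_all returns a list with one matrix of matches per row), so I am struggling to derive the undergraduate and graduate indicators across my large dataset. In the end, I would like to have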

df = data.frame(educ = c("Angelo State University  (BBA, Finance; BBA, Economics;)", "University of Oxford  (MA, Philosophy; MA, Economics;", "Ross School of Business, University of Michigan  (MBA; BBA;)"), undergrad.dummy = c(1,0,1), grad.dummy = c(0,1,1))

I would appreciate any suggestions on how to solve this.

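Keep the pattern vectors in a list and loop over the list with map (from purrr). Paste the elements of each vector into a single string by collapsing them with | (OR), and use that string as the pattern in str_detect, which returns a logical vector. Coerce it to binary (as.integer or unary +), rename the columns returned by map_dfc, and bind those columns to the original dataset.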

library(dplyr)
library(purrr)
library(stringr)
# "\\b" is a regex word boundary; a single "\b" in an R string would be a backspace
map_dfc(list(undergrad.list, grad.list), ~
       +(str_detect(df$educ, str_c("\\b(", str_c(.x, collapse = "|"), ")\\b")))) %>%
   set_names(c("undergrad.dummy", "grad.dummy")) %>%
  bind_cols(df, .)

-output

#                                                          educ undergrad.dummy grad.dummy
#1     Angelo State University  (BBA, Finance; BBA, Economics;)               1          0
#2        University of Oxford  (MA, Philosophy; MA, Economics;               0          1
#3 Ross School of Business, University of Michigan  (MBA; BBA;)               1          1
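
To make the pattern construction in the pipeline above easier to follow, here is a minimal sketch that builds the same pattern step by step for the undergraduate list and compares it with a pattern without word boundaries. The names pat.plain and pat.bounded are illustrative only, and the fourth string is a hypothetical row added to show the difference; it is not part of the question's data.

```r
library(stringr)

educ <- c("Angelo State University  (BBA, Finance; BBA, Economics;)",
          "University of Oxford  (MA, Philosophy; MA, Economics;",
          "Ross School of Business, University of Michigan  (MBA; BBA;)",
          "Example Business School  (MBA;)")  # hypothetical extra row
undergrad.list <- c("BBA", "BA")

# Collapse the degree vector into one alternation: "BBA|BA"
pat.plain <- str_c(undergrad.list, collapse = "|")

# Wrap the alternation in word boundaries so "BA" cannot match inside "MBA"
pat.bounded <- str_c("\\b(", pat.plain, ")\\b")

# str_detect returns a logical vector; unary + coerces it to integer 0/1
plain.hit   <- +str_detect(educ, pat.plain)
bounded.hit <- +str_detect(educ, pat.bounded)

data.frame(plain.hit, bounded.hit)
```

Without the boundaries, the hypothetical MBA-only row would be counted as undergraduate because "MBA" contains "BA"; with them, only standalone BBA or BA tokens match.
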
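
    # Loop over the rows and set each dummy from a substring match on educ
    for (i in seq_len(nrow(df))) {
      # "BA" also matches inside "BBA" (and inside "MBA")
      if (grepl("BA", df$educ[i])) {
        df$unddummy[i] <- 1
      } else {
        df$unddummy[i] <- 0
      }
      # Graduate degrees: MBA, MA or PhD
      if (grepl("MBA|MA|PhD", df$educ[i])) {
        df$graddummy[i] <- 1
      } else {
        df$graddummy[i] <- 0
      }
    }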

 df
                                                          educ unddummy graddummy
1     Angelo State University  (BBA, Finance; BBA, Economics;)        1         0
2        University of Oxford  (MA, Philosophy; MA, Economics;        0         1
3 Ross School of Business, University of Michigan  (MBA; BBA;)        1         1

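This solution checks the contents of each cell in df$educ and writes the corresponding values into unddummy and graddummy. Two things to note: when checking for undergraduate degrees it is enough to look for "BA", because "BBA" contains "BA". That hints at an important caveat: my regex skills are almost nonexistent. Because this is a plain substring match, the code will also flag a row as "undergraduate" when the university name contains "BA" (for example "UNIVERSITY OF BALI" written in capitals; grepl is case-sensitive, so "Bali" would not match), and it flags rows whose only degree is an MBA, since "MBA" contains "BA" as well. I will edit once I find the right way to do this with regular expressions...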
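
One possible way to avoid the substring problem described above is to wrap each degree in word boundaries (\\b) and let grepl work on the whole column at once, since it is vectorized and needs no loop. This is only a sketch, assuming degrees always appear as standalone tokens such as BBA or MBA; the helper has_degree is a hypothetical name introduced here, not part of the answer above.

```r
df <- data.frame(educ = c(
  "Angelo State University  (BBA, Finance; BBA, Economics;)",
  "University of Oxford  (MA, Philosophy; MA, Economics;",
  "Ross School of Business, University of Michigan  (MBA; BBA;)"
))
undergrad.list <- c("BBA", "BA")
grad.list <- c("MA", "MBA", "PhD")

# Hypothetical helper: returns 1 if any element of `degrees` occurs
# as a whole word in `x`, otherwise 0
has_degree <- function(x, degrees) {
  pattern <- paste0("\\b(", paste(degrees, collapse = "|"), ")\\b")
  as.integer(grepl(pattern, x, perl = TRUE))
}

# grepl is vectorized, so each dummy is filled for all rows in one call
df$unddummy  <- has_degree(df$educ, undergrad.list)
df$graddummy <- has_degree(df$educ, grad.list)
df
```

With the boundaries in place, a row that lists only an MBA would get 0 for unddummy, and uppercase letter runs inside longer words no longer trigger a match.
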