Match a sentence with a sentence in R?

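I have two data frames, occupation and data. I want to match each Occupation in data against the sentences in occupation, and assign the corresponding class by adding a column to the occupation data frame.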

occupation <- c("I am Civil Engineer human being", "Graphic Designer too late", "Architect by profession", "Sales Manager Bank", "Love my profession of Professor", "NA")

occupation <- data.frame(occupation)

data <- data.frame(class = c("Engineers","Designer","Artist","Designer","Poetry","Banker and Prof"), Occupation = c("Civil Engineer", "Graphic Designer", "Painter","Poetry","Architect(prof)", "Sales Manager Bank"))

I want output like this:

 occupation                             class
    I am Civil Engineer human being        Engineers
    Painter  Architect Poetry              Artists
    Graphic Designer too late              Designers
    Architect by Painter profession        Architect
    Sales Manager Bank                     Banker and Prof
    Love my profession of Professor        NA
      NA                                   NA

I tried this, but it doesn't do anything:

occupation$value <- sapply(data$occupation, grepl, x = occupation)
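For comparison, here is a minimal sketch of how the grepl idea could be made to work for literal (non-fuzzy) matches. It assumes character columns (stringsAsFactors = FALSE), loops over data$Occupation (capital O, as the column is named in the question), and takes the first matching entry per sentence; the hits matrix and the first-match rule are illustrative choices, not something stated in the question.

```r
# Toy data from the question; stringsAsFactors = FALSE is an assumption.
occupation <- data.frame(
  occupation = c("I am Civil Engineer human being", "Graphic Designer too late",
                 "Architect by profession", "Sales Manager Bank",
                 "Love my profession of Professor", "NA"),
  stringsAsFactors = FALSE
)
data <- data.frame(
  class = c("Engineers", "Designer", "Artist", "Designer", "Poetry", "Banker and Prof"),
  Occupation = c("Civil Engineer", "Graphic Designer", "Painter", "Poetry",
                 "Architect(prof)", "Sales Manager Bank"),
  stringsAsFactors = FALSE
)

# One column per lookup entry, one row per sentence: TRUE where the
# Occupation string occurs literally inside the sentence.
hits <- sapply(data$Occupation, grepl, x = occupation$occupation, fixed = TRUE)

# For each sentence take the class of the first matching entry, or NA.
occupation$class <- apply(hits, 1, function(row) {
  idx <- which(row)
  if (length(idx)) data$class[idx[1]] else NA_character_
})

print(occupation)
```

Because fixed = TRUE treats the pattern literally, "Architect(prof)" only matches a sentence that contains exactly that text, which is why fuzzy matching or cleaning the lookup strings comes up in the answers.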

agrep gets very close. I couldn't get it to work for Architect(prof), but it works if you remove the parenthesised part:

data$Occupation <- sub("\\(.*", "", data$Occupation)  # drop everything from the first "(" onwards
data
            class         Occupation
1       Engineers     Civil Engineer
2        Designer   Graphic Designer
3        Designer          Architect
4 Banker and Prof Sales Manager Bank
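As a small illustration of the cleaning step on its own: inside an R string the backslash has to be doubled, so the regular expression \(.* is written "\\(.*". The trimws call is an extra assumption on my part, useful only if the lookup strings have a space before the parenthesis.

```r
x <- c("Civil Engineer", "Architect(prof)", "Sales Manager Bank")

# Remove the first "(" and everything after it, then trim stray whitespace.
cleaned <- trimws(sub("\\(.*", "", x))

print(cleaned)
```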

occ.class <- data$class[unlist(sapply(data$Occupation, function(x) agrep(x, occupation$occupation)))]
occ.class
[1] Engineers       Designer        Designer        Banker and Prof
Levels: Banker and Prof Designer Engineers

If you want the third one to show Architect, you should change it accordingly in your data data.frame.

As for the edit:

occ.class <- unlist(sapply(data$Occupation, function(x) agrep(x, occupation$occupation)))
# ifelse() with a length-one test would return only the first element, so use if/else
if (length(occ.class)) data$class[occ.class] else NA
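If you need one class per sentence, with NA where nothing matches, something along these lines might be easier to reason about. It is only a sketch: classify_one is a hypothetical helper, the names sentences, lookup and result are made up for the example, and it assumes the parentheses have already been stripped and that taking the first matching entry is acceptable.

```r
# Toy data mirroring the answer above (parentheses already stripped).
sentences <- c("I am Civil Engineer human being", "Graphic Designer too late",
               "Architect by profession", "Sales Manager Bank",
               "Love my profession of Professor")
lookup <- data.frame(
  class = c("Engineers", "Designer", "Designer", "Banker and Prof"),
  Occupation = c("Civil Engineer", "Graphic Designer", "Architect", "Sales Manager Bank"),
  stringsAsFactors = FALSE
)

# classify_one() is a hypothetical helper: it returns the class of the first
# lookup entry that approximately matches the sentence, or NA if none does.
classify_one <- function(sentence) {
  hit <- which(vapply(lookup$Occupation, agrepl, logical(1), x = sentence))
  if (length(hit)) lookup$class[hit[1]] else NA_character_
}

result <- data.frame(
  occupation = sentences,
  class = vapply(sentences, classify_one, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

print(result)
```

Matching per sentence rather than per lookup entry keeps the result aligned with the rows of the sentence data frame, so it does not depend on both tables being in the same order.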

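I don't know how complex your data is, but this works well for low-complexity strings. The agrep function lets you set a tolerance parameter (max.distance), so you can match strings that are not exactly equal: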

occupation <- data.frame(occupation = c("I am Civil Engineer human being", "Graphic Designer too late", "Architect by profession", "Sales Manager Bank"), 
                         stringsAsFactors = FALSE)
data <- data.frame(class = c("Engineers","Designer","Architect","Banker and Prof"), 
                   occupation = c("Civil Engineer", "Graphic Designer", "Architect(prof)", "Sales Manager Bank"),
                   stringsAsFactors = FALSE)

occupation$value <- sapply(occupation$occupation, function(x) {
    # agrepl returns TRUE/FALSE for each class name, so a class with no match is simply FALSE
    match.class <- sapply(data$class, function(y) agrepl(y, x, max.distance = 0.2))
    data$class[match.class]
  }
)

If you raise max.distance you can get the last string to match as well, but the earlier strings will then pick up extra matches too.

                       occupation            value
1 I am Civil Engineer human being   Civil Engineer
2       Graphic Designer too late Graphic Designer
3         Architect by profession  Architect(prof)
4              Sales Manager Bank      
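To get a feel for the tolerance, it may help to try max.distance on a single pair of strings. This is just a sketch with variable names of my own; as far as I understand the documentation, a value below 1 is treated as a fraction of the pattern length, while a whole number is an absolute number of edits.

```r
sentence <- "I am Civil Engineer human being"

# "Engineers" is not a literal substring of the sentence (note the trailing "s").
exact <- agrepl("Engineers", sentence, max.distance = 0)  # no edits allowed
fuzzy <- agrepl("Engineers", sentence, max.distance = 1)  # one edit allowed

# A fraction below 1 is scaled by the pattern length instead.
scaled <- agrepl("Engineers", sentence, max.distance = 0.2)

# "Banker and Prof" is too far from "Sales Manager Bank" at this tolerance.
far <- agrepl("Banker and Prof", "Sales Manager Bank", max.distance = 0.2)

print(c(exact = exact, fuzzy = fuzzy, scaled = scaled, far = far))
```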

A second option matches word by word, but for 'I am Civil Engineer human being' the words 'I' and 'am' match everything.

occupation$value <- sapply(occupation$occupation, function(x) {
    match.class <- sapply(data$class, function(y) {
      any(sapply(strsplit(x, ' ')[[1]], function(z)
        any(agrep(z, y, max.distance = 0.2))))
    })
    data$class[which(match.class)]
  }
)

The result looks like this:

                       occupation                                                                 value
1 I am Civil Engineer human being Civil Engineer, Graphic Designer, Architect(prof), Sales Manager Bank
2       Graphic Designer too late                                                      Graphic Designer
3         Architect by profession                                                       Architect(prof)
4              Sales Manager Bank                                                    Sales Manager Bank
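One way to soften the 'I' / 'am' problem could be to drop very short words before matching. The sketch below assumes a cut-off of four characters, which is an arbitrary choice for illustration; a proper stop-word list may suit real data better.

```r
classes  <- c("Engineers", "Designer", "Architect", "Banker and Prof")
sentence <- "I am Civil Engineer human being"

# Split into words and keep only those with at least 4 characters,
# which removes "I" and "am" (the threshold is an assumption).
words      <- strsplit(sentence, " ", fixed = TRUE)[[1]]
long_words <- words[nchar(words) >= 4]

# A class counts as matched if any remaining word approximately occurs in it.
is_match <- vapply(classes, function(y) {
  any(vapply(long_words, agrepl, logical(1), x = y, max.distance = 0.2))
}, logical(1))

matched <- classes[is_match]
print(matched)
```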

Here is the link where you can download the code.