Matching word list to individual words in strings using R

Question

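Edit: fixed the data example issue.

Background/Data: I'm working on a merge between two data sets. The first is a list of the legal names of various publicly traded companies. The second is a fairly dirty field containing company names, individual titles, and all sorts of other hard-to-predict words. The company name list is about 14,000 rows and the dirty data is about 1.3 million rows. Not every publicly traded company will appear in the dirty data, and some may appear multiple times in different forms (Exxon Mobil, Exxon, ExxonMobil, etc.).

Accordingly, my current approach is to break the list of publicly traded company names into the individual words used in each name (after removing some common words such as company, corporation, inc, etc.), which gives data that looks like Have1 below. A sample of the dirty data is shown as Have2. In my working version I have also cleaned these strings to remove words like Inc and Company, but in case someone has a better idea than my current approach, I'm leaving the data as-is here. Additionally, we can assume there are very few (if any) exact matches in the data, and that the Have2 data is too noisy for fuzzy matching to succeed without additional work.

Question: What is the best way to identify which items in Have2 contain the words from Have1? Specifically, I think I need the final data to look like Want, so that I can link the public company names to the dirty data names. The plan is to hand-verify the matches, given how difficult the Have2 data is, but if anyone has suggestions for other ways to approach this, I am definitely open to them (please, someone, have a suggestion haha).

Tried so far: I have code that works, but it takes a very long time to run and seems inefficient. That is: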

library(data.table)
library(stringr)

company_name_data <- c("amazon inc", "apple inc", "radiation inc", "xerox inc", "notgoingtomatch inc")

have1 <- data.table(table(str_split(company_name_data, "\\W+", simplify = TRUE)))[!V1 == "inc"]

have2 <- c("ceo and director, apple inc",
           "current title - senior manager amazon, inc., division of radiation exposure, subdivision of corporate anarchy",
           "xerox inc., president and ceo",
           "president and ceo of the amazon apple assn., division 4")


#Uses Have2 and creates a matrix where each column is a word and each row reflects one of the items from Have2
have3 <- data.table(str_split(have2, "\\W+", simplify = TRUE))

#Creates container
store <- data.table()

#Loops through each of the Have1 company names and sees whether that word appears in the have3 matrix

for (i in 1:nrow(have1)){
  
  matches <- data.table(have2[sapply(1:nrow(have3), function(x) any(grepl(paste0("\\b",have1$V1[i],"\\b"), have3[x,])))])
  
  if (nrow(matches) == 0){
    next
  }
  
  #Create combo data
  matches[, have1_word := have1$V1[i]]
  
  #Storage
  store <- rbind(store, matches)
}

Want

Name (from Have2)	Word (from Have1)
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy	amazon
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy	radiation
vp and general bird aficionado of the amazon apple assn. branch F	amazon
vp and general bird aficionado of the amazon apple assn. branch F	apple
ceo and director, apple inc	apple
xerox inc., president and ceo	xerox

Have1

Word	N
amazon	1
apple	3
xerox	1
notgoingtomatch	2
radiation	1

Have2

Name
ceo and director, apple inc
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy
xerox inc., president and ceo
vp and general bird aficionado of the amazon apple assn. branch F

Answer 1

Using what you've documented, with only the data from company_name_data and have2:

library(tidytext)
library(tidyverse)

#------------ remove stop words before tokenization ---------------

# now split each phrase, remove the stop words, rejoin the phrases 
# lapply() works through one phrase at a time
comp2 <- unlist(lapply(company_name_data,     # split the phrases into individual words,
                       # remove stop words then reassemble phrases
                       function(x) {
                         paste(unlist(strsplit(x,
                                               " ")
                         )[!(unlist(strsplit(x, 
                                             " ")) %in% (stop_words$word %>% 
                                                           unlist())
                         ) # end 2nd unlist
                         ], # end subscript of string split
                         collapse=" ")})) # reassemble string

haveItAll <- data.frame(have2)
haveItAll$comp <- unlist(lapply(have2,
                           function(x){
                             paste(unlist(strsplit(x,
                                                   " ")
                             )[(unlist(strsplit(x, 
                                                 " ")) %in% comp2
                             ) # end 2nd unlist
                             ], # end subscript of string split
                             collapse=" ")})) # reassemble string

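Based on this text analysis, the results in the second column are "apple", "radiation", "xerox", and "amazon apple".

I'm sure this code wasn't originally mine. I'm sure I got these ideas from somewhere on Stack Overflow...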
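
One thing to watch in the result above: splitting on a single space leaves punctuation attached to words, so "amazon," in the second string is not matched and only "radiation" comes back for it. If the long layout from Want is what you need, the following is a minimal base R sketch, assuming that splitting on runs of non-word characters is acceptable for your data. The names have1_words, tokens, long and want are made up for this illustration, and have1_words is a plain character vector standing in for the word column of Have1.

```r
# Illustrative data copied from the question
have1_words <- c("amazon", "apple", "radiation", "xerox", "notgoingtomatch")
have2 <- c("ceo and director, apple inc",
           "current title - senior manager amazon, inc., division of radiation exposure, subdivision of corporate anarchy",
           "xerox inc., president and ceo",
           "president and ceo of the amazon apple assn., division 4")

# Split every string on runs of non-word characters,
# so that "amazon," is tokenized as "amazon"
tokens <- strsplit(tolower(have2), "\\W+")

# One row per (string, token) pair
long <- data.frame(Name = rep(have2, lengths(tokens)),
                   Word = unlist(tokens),
                   stringsAsFactors = FALSE)

# Keep only tokens that appear in the word list,
# and drop repeats of the same word within one string
want <- unique(long[long$Word %in% have1_words, ])
rownames(want) <- NULL

print(want)
```

Because each string is tokenized once and the lookup is a single %in% call, this avoids running one regular expression per word per row, which may help at 1.3 million rows. It only finds whole-token matches, so a variant such as "ExxonMobil" would still not match "exxon".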

r

matching

string-matching

data.table