Matching word list to individual words in strings using R

Edit: fixed an issue with the data example.

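Background/Data: I'm working on a merge between two data sets. One is a list of the legal names of various publicly traded companies; the second is a fairly dirty field containing company names, personal titles, and all sorts of other hard-to-predict words. The company list has about 14,000 rows and the dirty data about 1.3 million rows. Not every publicly traded company will appear in the dirty data, and some may appear multiple times under different spellings (Exxon Mobil, Exxon, ExxonMobil, etc.).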

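My current approach, then, is to break the list of publicly traded company names into the individual words used in each name (after removing some common words such as company, corporation, inc, etc.), which yields data like Have1 below. A sample of the dirty data is shown in Have2. In my working version I have also cleaned these strings to remove words like Inc and Company, but I'm leaving the data as-is here in case someone has a better idea than my current approach. Also, we can assume there are few (if any) exact matches in the data, and that the Have2 data is too noisy for fuzzy matching to succeed without additional work.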

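Question: What is the best way to determine which items in Have2 contain words from Have1? Specifically, I think I need the final data to look like Want, so that I can link the public company names to the dirty-data names. Given how difficult the Have2 data is, the plan is to hand-verify the matches, but if anyone has suggestions for other ways to tackle this, I am definitely open to them (please, someone, have a suggestion haha).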

Tried so far: I have code like the following, but it takes a very long time to run and seems quite inefficient. Namely:

library(data.table)
library(stringr)

company_name_data <- c("amazon inc", "apple inc", "radiation inc", "xerox inc", "notgoingtomatch inc")

have1 <- data.table(table(str_split(company_name_data, "\\W+", simplify = TRUE)))[!V1 == "inc"]

have2 <- c("ceo and director, apple inc",
           "current title - senior manager amazon, inc., division of radiation exposure, subdivision of corporate anarchy",
           "xerox inc., president and ceo",
           "president and ceo of the amazon apple assn., division 4")


#Uses Have2 and creates a matrix where each column is a word and each row reflects one of the items from Have2
have3 <- data.table(str_split(have2, "\\W+", simplify = TRUE))

#Creates container
store <- data.table()

#Loops through each of the Have1 company names and sees whether that word appears in the have3 matrix

for (i in 1:nrow(have1)){
  
  matches <- data.table(have2[sapply(1:nrow(have3), function(x) any(grepl(paste0("\\b",have1$V1[i],"\\b"), have3[x,])))])
  
  if (nrow(matches) == 0){
    next
  }
  
  #Create combo data
  matches[, have1_word := have1$V1[i]]
  
  #Storage
  store <- rbind(store, matches)
}

Want

Name (from Have2) | Word (from Have1)
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy | amazon
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy | radiation
vp and general bird aficionado of the amazon apple assn. branch F | amazon
vp and general bird aficionado of the amazon apple assn. branch F | apple
ceo and director, apple inc | apple
xerox inc., president and ceo | xerox

Have1

Word N
amazon 1
apple 3
xerox 1
notgoingtomatch 2
radiation 1

Have2

Name
ceo and director, apple inc
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy
xerox inc., president and ceo
vp and general bird aficionado of the amazon apple assn. branch F

Working only from company_name_data and have2, as you documented them:

library(tidytext)
library(tidyverse)

#------------ remove stop words before tokenization ---------------

# Split each phrase into words, remove the stop words, then rejoin the phrase.
# lapply() works through company_name_data one element at a time.
comp2 <- unlist(lapply(company_name_data, function(x) {
  words <- unlist(strsplit(x, " "))              # split the phrase into individual words
  paste(words[!(words %in% stop_words$word)],    # drop stop words (tidytext::stop_words)
        collapse = " ")                          # reassemble string
}))

haveItAll <- data.frame(have2)
haveItAll$comp <- unlist(lapply(have2, function(x) {
  words <- unlist(strsplit(x, " "))              # split on spaces; punctuation stays attached
  paste(words[words %in% comp2],                 # keep only the words found in comp2
        collapse = " ")                          # reassemble string
}))

Based on the text analysis, the results in the second column are "apple", "radiation", "xerox", and "amazon apple".

I'm sure this code wasn't originally mine. I'm sure I got these ideas from somewhere on Stack Overflow...
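
If you need the long layout from Want (one row per name/word pair), here is a minimal base-R sketch of one way it might be done. It is a variation on the approach above and not a tested solution for the full 1.3 million rows. The names company_words, tokens, long, and want are made up for illustration, and company_words is typed in by hand here; in practice it would presumably come from have1$V1. Unlike splitting on a single space, it splits on runs of non-word characters, so "amazon," is treated as "amazon".

```r
# Hypothetical sketch using the toy data from the question (base R only).
# Assumption: company_words stands in for the cleaned word list (have1$V1).
company_words <- c("amazon", "apple", "radiation", "xerox", "notgoingtomatch")

have2 <- c("ceo and director, apple inc",
           "current title - senior manager amazon, inc., division of radiation exposure, subdivision of corporate anarchy",
           "xerox inc., president and ceo",
           "president and ceo of the amazon apple assn., division 4")

# Tokenize each name on runs of non-word characters (spaces, commas, periods, dashes).
tokens <- strsplit(tolower(have2), "\\W+")

# Build one row per (name, token) pair.
long <- data.frame(
  name = rep(have2, lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)

# Keep only tokens that are in the company word list, and drop repeats
# of the same word within the same name.
want <- unique(long[long$word %in% company_words, ])
rownames(want) <- NULL

print(want)
```

On the toy data this gives six name/word rows, and "notgoingtomatch" drops out because it never appears in have2. The matching is a single %in% lookup over all tokens rather than one regex pass per company word, which is the part that seemed slow in the loop; I have not timed it at full scale, so treat that as a guess.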