Matching word list to individual words in strings using R
Edit: Fixed the data example issue.
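Background/Data: I'm working on a merge between two datasets. One is a list of the legal names of various publicly traded companies; the second is a fairly dirty field containing company names, individual titles, and all sorts of other hard-to-predict words. The company list is about 14,000 rows and the dirty data is about 1.3 million rows. Not every publicly traded company will appear in the dirty data, and some companies may appear multiple times in different forms (Exxon Mobil, Exxon, ExxonMobil, etc.).

Accordingly, my current approach is to break the list of publicly traded company names into the individual words used in each name (after removing some common words such as company, corporation, inc, etc.), which gives data like Have1 below. Some examples of the dirty data are shown in Have2 below. In my work in progress I have also cleaned these strings to remove words like Inc and Company, but in case anyone has a better idea than my current approach, I'm leaving the data as-is. Additionally, we can assume there are very few (if any) exact matches in the data, and that the Have2 data is too noisy for fuzzy matching to succeed without additional work.

Question: What is the best way to identify which items in Have2 contain the words from Have1? Specifically, I think I need the final data to look like Want, so that I can link the public company names to the dirty data names. The plan is to hand-verify the matches given the difficulty of the Have2 data, but if anyone has suggestions for other ways to approach this problem, I am definitely open to them (please, someone, have a suggestion haha).

Tried so far: I have code like the following, but it takes a very long time to run and seems very inefficient: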

    library(data.table)
    library(stringr)

    company_name_data <- c("amazon inc", "apple inc", "radiation inc", "xerox inc", "notgoingtomatch inc")

    # Word counts across the company names, dropping the common word "inc"
    have1 <- data.table(table(str_split(company_name_data, "\\W+", simplify = TRUE)))[V1 != "inc"]

    have2 <- c("ceo and director, apple inc",
               "current title - senior manager amazon, inc., division of radiation exposure, subdivision of corporate anarchy",
               "xerox inc., president and ceo",
               "president and ceo of the amazon apple assn., division 4")

    # Uses Have2 and creates a matrix where each column is a word and each row reflects one of the items from Have2
    have3 <- data.table(str_split(have2, "\\W+", simplify = TRUE))

    # Creates container
    store <- data.table()

    # Loops through each of the Have1 words and sees whether that word appears in the have3 matrix
    for (i in seq_len(nrow(have1))) {
      matches <- data.table(have2[sapply(seq_len(nrow(have3)), function(x) any(grepl(paste0("\\b", have1$V1[i], "\\b"), have3[x, ])))])
      if (nrow(matches) == 0) {
        next
      }
      # Create combo data
      matches[, have1_word := have1$V1[i]]
      # Storage
      store <- rbind(store, matches)
    }

Want

Name (from Have2) | Word (from Have1) |
---|---|
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy | amazon |
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy | radiation |
vp and general bird aficionado of the amazon apple assn. branch F | amazon |
vp and general bird aficionado of the amazon apple assn. branch F | apple |
ceo and director, apple inc | apple |
xerox inc., president and ceo | xerox |

Have1

Word | N |
---|---|
amazon | 1 |
apple | 3 |
xerox | 1 |
notgoingtomatch | 2 |
radiation | 1 |

Have2

Name |
---|
ceo and director, apple inc |
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy |
xerox inc., president and ceo |
vp and general bird aficionado of the amazon apple assn. branch F |

Using only the data from company_name_data and have2, as you've documented them:

    library(tidytext)
    library(tidyverse)

    #------------ remove stop words before tokenization ---------------
    # now split each phrase, remove the stop words, rejoin the phrases
    # this works through one row at a time
    comp2 <- unlist(lapply(company_name_data, function(x) {
      words <- unlist(strsplit(x, " "))           # split the phrase into individual words
      paste(words[!(words %in% stop_words$word)], # remove stop words
            collapse = " ")                       # reassemble string
    }))

    haveItAll <- data.frame(have2)
    haveItAll$comp <- unlist(lapply(have2, function(x) {
      words <- unlist(strsplit(x, " "))           # split the phrase into individual words
      paste(words[words %in% comp2],              # keep only words found in comp2
            collapse = " ")                       # reassemble string
    }))

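Based on this text analysis, the results in the second column are "apple", "radiation", "xerox", and "amazon apple".

I'm sure this code wasn't originally mine. I'm sure I got these ideas from somewhere on Stack Overflow...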
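
One thing to note about splitting on a single space: punctuation stays attached to the word, so a token like "amazon," will not match "amazon". A minimal sketch of a variant that avoids this, assuming the same example data as in the question, is to tokenize each string once on non-word characters and then filter the tokens against the word list. The names `words`, `split_words`, `tokens`, and `want` are made up for illustration, and this has not been timed against the full 1.3 million rows.

```r
library(data.table)

# Word list (stands in for Have1) and dirty strings (stands in for Have2)
words <- c("amazon", "apple", "radiation", "xerox", "notgoingtomatch")
have2 <- c("ceo and director, apple inc",
           "current title - senior manager amazon, inc., division of radiation exposure, subdivision of corporate anarchy",
           "xerox inc., president and ceo",
           "president and ceo of the amazon apple assn., division 4")

# Split every string once on runs of non-word characters, so that
# "amazon," yields the token "amazon"
split_words <- strsplit(tolower(have2), "\\W+")

# Long format: one row per distinct (string, token) pair
tokens <- unique(data.table(
  name = rep(have2, lengths(split_words)),
  word = unlist(split_words)
))

# Keep only tokens that appear in the word list; %chin% is data.table's
# fast character version of %in%, so no regex is run per word
want <- tokens[word %chin% words]
print(want)
```

This produces one row per matching (name, word) pair, which is the shape of the Want table. It only finds whole-token matches, so a run-together variant such as "ExxonMobil" would still need separate handling.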