How do I match a column of a data frame of a particular length with another vector that has certain keywords to match against?
My data frame Expenses looks like this:
date       name            expenditure  type
23MAR2013  KOSH ENTRP      4000         COMPANY
23MAR2013  JOHN DOE        800          INDIVIDUAL
24MAR2013  S KHAN          300          INDIVIDUAL
24MAR2013  JASINT PVT LTD  8000         COMPANY
25MAR2013  KOSH ENTRPRISE  2000         COMPANY
25MAR2013  JOHN S DOE      220          INDIVIDUAL
25MAR2013  S KHAN          300          INDIVIDUAL
26MAR2013  S KHAN          300          INDIVIDUAL
Earlier, I identified recurring names and patterns in the name column and stored them in the vector NameVector, shown below.
KOSH JOHN DOE KHAN JASINT
My question is: how do I match each string in Expenses$name against the patterns in NameVector and add the result to the main data frame as a category column?
date       name            expenditure  type        category
23MAR2013  KOSH ENTRP      4000         COMPANY     KOSH
23MAR2013  JOHN DOE        800          INDIVIDUAL  JOHN DOE
24MAR2013  S KHAN          300          INDIVIDUAL  KHAN
24MAR2013  JASINT PVT LTD  8000         COMPANY     JASINT
25MAR2013  KOSH ENTRPRISE  2000         COMPANY     KOSH
25MAR2013  JOHN S DOE      220          INDIVIDUAL  JOHN DOE
25MAR2013  SALM KHAN       300          INDIVIDUAL  KHAN
26MAR2013  S KHAN          300          INDIVIDUAL  KHAN
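I tried using strsplit() to split the different parts of each name into separate columns and then matching the patterns with agrep(), but I did not get the desired output. On inspecting the data further, I noticed leading whitespace and removed it, but I still don't know why I am not getting the output shown above.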
CSV of the above table:
"Date","name","expenditure","type"
"23MAR2013","KOSH ENTRP",4000,"COMPANY"
"23MAR2013 ","JOHN DOE",800,"INDIVIDUAL"
"24MAR2013","S KHAN",300,"INDIVIDUAL"
"24MAR2013","JASINT PVT LTD",8000,"COMPANY"
"25MAR2013","KOSH ENTRPRISE",2000,"COMPANY"
"25MAR2013","JOHN S DOE",220,"INDIVIDUAL"
"25MAR2013","S KHAN",300,"INDIVIDUAL"
"26MAR2013","S KHAN",300,"INDIVIDUAL"
And the name vector, which has already been calculated/identified as:
NameVector <- c("KOSH","JOHN DOE","KHAN","JASINT")
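To make the example reproducible, the CSV above can be read straight from a string. This is only a minimal sketch, assuming the data is pasted inline rather than read from a file; csv_text is a name introduced here for illustration. trimws() (base R 3.2.0 or later) removes stray whitespace such as the trailing space in the second date.

```r
# csv_text is a hypothetical name; the contents are the CSV from the question.
csv_text <- '"Date","name","expenditure","type"
"23MAR2013","KOSH ENTRP",4000,"COMPANY"
"23MAR2013 ","JOHN DOE",800,"INDIVIDUAL"
"24MAR2013","S KHAN",300,"INDIVIDUAL"
"24MAR2013","JASINT PVT LTD",8000,"COMPANY"
"25MAR2013","KOSH ENTRPRISE",2000,"COMPANY"
"25MAR2013","JOHN S DOE",220,"INDIVIDUAL"
"25MAR2013","S KHAN",300,"INDIVIDUAL"
"26MAR2013","S KHAN",300,"INDIVIDUAL"'

# Keep strings as character vectors rather than factors.
Expenses <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Trim leading/trailing whitespace in every character column.
Expenses[] <- lapply(Expenses, function(col) {
  if (is.character(col)) trimws(col) else col
})

str(Expenses)
```

Note that the CSV header spells the first column "Date", so that is the column name read.csv() will produce.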
You could try:
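library(stringi)
# Split multi-word keys into single words and join them into one
# alternation pattern: "KOSH|JOHN|DOE|KHAN|JASINT"
pat <- paste(unlist(strsplit(NameVector, ' ')), collapse="|")
# Extract every matching word from each name, then paste the hits together
Expenses$category <- vapply(stri_extract_all_regex(Expenses$name, pat),
                            paste, collapse=' ', character(1L))
Expenses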
#       date           name expenditure       type category
#1 23MAR2013     KOSH ENTRP        4000    COMPANY     KOSH
#2 23MAR2013       JOHN DOE         800 INDIVIDUAL JOHN DOE
#3 24MAR2013         S KHAN         300 INDIVIDUAL     KHAN
#4 24MAR2013 JASINT PVT LTD        8000    COMPANY   JASINT
#5 25MAR2013 KOSH ENTRPRISE        2000    COMPANY     KOSH
#6 25MAR2013     JOHN S DOE         220 INDIVIDUAL JOHN DOE
#7 25MAR2013         S KHAN         300 INDIVIDUAL     KHAN
#8 26MAR2013         S KHAN         300 INDIVIDUAL     KHAN
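If stringi is not installed, the same idea can probably be expressed in base R with gregexpr() and regmatches(). The following is a sketch under assumptions rather than a drop-in replacement: names_col is a hypothetical stand-in for Expenses$name, and returning NA for names with no match is a choice made here, not something the question specifies.

```r
NameVector <- c("KOSH", "JOHN DOE", "KHAN", "JASINT")

# Hypothetical stand-in for Expenses$name, copied from the question's data.
names_col <- c("KOSH ENTRP", "JOHN DOE", "S KHAN", "JASINT PVT LTD",
               "KOSH ENTRPRISE", "JOHN S DOE", "S KHAN", "S KHAN")

# Same alternation pattern as above, built from the individual words.
pat <- paste(unlist(strsplit(NameVector, " ")), collapse = "|")

# gregexpr() locates every match; regmatches() pulls out the matched text.
hits <- regmatches(names_col, gregexpr(pat, names_col))

# Join the hits per name; use NA when nothing matched (an assumption here).
category <- vapply(hits, function(h) {
  if (length(h) > 0) paste(h, collapse = " ") else NA_character_
}, character(1L))

category
```

Because the pattern matches substrings, a key such as KHAN would also match inside a longer word; wrapping each word in \\b word boundaries is one way to tighten that if it matters for the real data.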