Matching words from two files and extracting the matched one
I have the following data frame:
dataFrame <- data.frame(sent = c(1,1,2,2,3,3,3,4,5),
                        word = c("good printer", "wireless easy", "just right size",
                                 "size perfect weight", "worth price", "website great tablet",
                                 "pan nice tablet", "great price", "product easy install"),
                        val = c(1,2,3,4,5,6,7,8,9))
The data frame dataFrame looks like this:
sent word val
1 good printer 1
1 wireless easy 2
2 just right size 3
2 size perfect weight 4
3 worth price 5
3 website great tablet 6
3 pan nice tablet 7
4 great price 8
5 product easy install 9
Then I have a vector of words:
nouns <- c("printer", "wireless", "weight", "price", "tablet")
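I need to extract only these words (nouns) from dataFrame, and only the extracted words should be added to a new column (e.g. extract) in dataFrame.
I would really appreciate any help or suggestions. Thanks in advance.
Desired output: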
sent word val extract
1 good printer 1 printer
1 wireless easy 2 wireless
2 just right size 3 size
2 size perfect weight 4 weight
3 worth price 5 price
3 website great tablet 6 tablet
3 pan nice tablet 7 tablet
4 great price 8 price
5 product easy install 9 remove this row (no match)
Here is a simple solution using the stringi package (by the way, size is not in your nouns list).
library(stringi)
transform(dataFrame,
          extract = stri_extract_all(word,
                                     regex = paste(nouns, collapse = "|"),
                                     simplify = TRUE))
# sent word val extract
# 1 1 good printer 1 printer
# 2 1 wireless easy 2 wireless
# 3 2 just right size 3 <NA>
# 4 2 size perfect weight 4 weight
# 5 3 worth price 5 price
# 6 3 website great tablet 6 tablet
# 7 3 pan nice tablet 7 tablet
# 8 4 great price 8 price
# 9 5 product easy install 9 <NA>
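The same regex idea can also be sketched in base R, for readers who prefer not to install a package. This is only a sketch, not the answer's own method. It assumes that whole-word matching is wanted (the \b anchors are an addition) and that rows without a match should be dropped, as the question asks. The names pattern, m and result are hypothetical.

```r
dataFrame <- data.frame(
  sent = c(1, 1, 2, 2, 3, 3, 3, 4, 5),
  word = c("good printer", "wireless easy", "just right size",
           "size perfect weight", "worth price", "website great tablet",
           "pan nice tablet", "great price", "product easy install"),
  val = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  stringsAsFactors = FALSE
)
nouns <- c("printer", "wireless", "weight", "price", "tablet")

# Alternation wrapped in word boundaries (assumption: whole words only),
# so "tablet" would not match inside a longer word such as "tablets"
pattern <- paste0("\\b(", paste(nouns, collapse = "|"), ")\\b")

# regexpr() returns the position of the first match per element, or -1 if none
m <- regexpr(pattern, dataFrame$word, perl = TRUE)

# regmatches() returns substrings only for the elements that matched,
# so they are assigned back to the matching positions
extract <- rep(NA_character_, nrow(dataFrame))
extract[m != -1] <- regmatches(dataFrame$word, m)
dataFrame$extract <- extract

# Drop the rows without a match
result <- dataFrame[!is.na(dataFrame$extract), ]
```

With the question's data, result keeps seven rows and drops rows 3 and 9, which contain none of the nouns.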
Here is another solution. It is a bit more complicated, but it also removes the rows where there is no match between nouns and dataFrame$word.
# Base R only: strsplit() and intersect() need no extra package
dataFrame <- data.frame(sent = c(1,1,2,2,3,3,3,4,5),
                        word = c("good printer", "wireless easy", "just right size",
                                 "size perfect weight", "worth price", "website great tablet",
                                 "pan nice tablet", "great price", "product easy install"),
                        val = c(1,2,3,4,5,6,7,8,9))
nouns <- c("printer", "wireless", "weight", "price", "tablet")
test <- character()   # matched noun for each row that is kept
df.del <- integer()   # indices of the rows without a match
for (i in seq_len(nrow(dataFrame))) {
  # Split the phrase into words and keep the ones that appear in nouns
  hits <- intersect(nouns, unlist(strsplit(as.character(dataFrame$word[i]), " ")))
  if (length(hits) == 0) {
    df.del <- c(df.del, i)
  } else {
    test <- c(test, hits[1])   # if several nouns match, keep the first (in nouns order)
  }
}
# Drop the unmatched rows; the guard avoids dropping everything when df.del is empty
if (length(df.del) > 0) dataFrame <- dataFrame[-df.del, ]
dataFrame$extract <- test
Output:
sent word val extract
1 1 good printer 1 printer
2 1 wireless easy 2 wireless
4 2 size perfect weight 4 weight
5 3 worth price 5 price
6 3 website great tablet 6 tablet
7 3 pan nice tablet 7 tablet
8 4 great price 8 price
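The split-and-intersect approach above can also be written without an explicit loop. The following is a minimal sketch under the same assumptions as the answer (words separated by single spaces, exact word matches); the names words, first_hit and result are hypothetical, and keeping only the first matching noun per row is a choice made here, not something the question specifies.

```r
dataFrame <- data.frame(
  sent = c(1, 1, 2, 2, 3, 3, 3, 4, 5),
  word = c("good printer", "wireless easy", "just right size",
           "size perfect weight", "worth price", "website great tablet",
           "pan nice tablet", "great price", "product easy install"),
  val = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  stringsAsFactors = FALSE
)
nouns <- c("printer", "wireless", "weight", "price", "tablet")

# One character vector of words per row
words <- strsplit(dataFrame$word, " ", fixed = TRUE)

# First noun found in each row, or NA when the row contains none
first_hit <- vapply(words, function(w) {
  hits <- intersect(nouns, w)
  if (length(hits) > 0) hits[1] else NA_character_
}, character(1))

dataFrame$extract <- first_hit

# Keep only the rows that contain a noun
result <- dataFrame[!is.na(first_hit), ]
```

Because intersect() compares whole words, this version does not pick up a noun that merely occurs inside a longer word.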
Here is another solution using a loop and an if statement.
word <- dataFrame$word
# Default label for rows in which no noun is found
extract <- rep("remove", length(word))
for (i in seq_along(word)) {
  g <- as.character(word[i])
  for (j in seq_along(nouns)) {
    # grepl() matches substrings; if several nouns match, the last one wins
    if (grepl(nouns[j], g)) extract[i] <- nouns[j]
  }
}
dataFrame$extract <- extract
# sent word val extract
#1 1 good printer 1 printer
#2 1 wireless easy 2 wireless
#3 2 just right size 3 remove
#4 2 size perfect weight 4 weight
#5 3 worth price 5 price
#6 3 website great tablet 6 tablet
#7 3 pan nice tablet 7 tablet
#8 4 great price 8 price
#9 5 product easy install 9 remove
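In the nested loop above, a row that contains more than one noun ends up with only the last matching one. If all matches are of interest, something like the following sketch might help. The phrases vector is hypothetical test data (not from the question), and joining the matches with ", " is just one possible convention.

```r
nouns <- c("printer", "wireless", "weight", "price", "tablet")

# Hypothetical phrases; the second one contains two of the nouns
phrases <- c("good printer", "wireless printer stand", "just right size")

all_hits <- vapply(phrases, function(p) {
  # fixed = TRUE treats each noun as a literal string rather than a regex
  found <- nouns[vapply(nouns, grepl, logical(1), x = p, fixed = TRUE)]
  if (length(found) > 0) paste(found, collapse = ", ") else NA_character_
}, character(1), USE.NAMES = FALSE)
```

The matches are listed in the order of nouns, not in the order in which they appear in the phrase.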