Matching text between two different data frames in R

I have the following data in a data frame:

structure(list(`head(ker$text)` = structure(1:6, .Label = c("@_rpg_17 little league travel tourney. These parents about to be wild.", 
"@auscricketfan @davidwarner31 yes WI tour is coming soon", "@keralatourism #favourite #destination #munnar #topstation https://t.co/sm9qz7Z9aR", 
"@NWAWhatsup tour of duty in NWA considered a dismal assignment?  Companies send in their best ppl and then those ppl don't want to leave", 
"Are you Looking for a trip to Kerala? #Kerala-prime tourist attractions of India.Visit:http://t.co/zFCoaoqCMP http://t.co/zaGNd0aOBy", 
"Are you Looking for a trip to Kerala? #Kerala, God's own country, is one of the prime tourist attractions of... http://t.co/FLZrEo7NpO"
), class = "factor")), .Names = "head(ker$text)", row.names = c(NA, 
-6L), class = "data.frame")

I have another data frame containing hashtags extracted from the data frame above. It looks like this:

structure(list(destination = c("#topstation", "#destination", "#munnar", 
"#Kerala", "#Delhi", "#beach")), .Names = "destination", row.names = c(NA, 
6L), class = "data.frame")

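I want to create a new column in my first data frame that contains only the hashtags matching the second data frame. For example, the first row of df1 has no hashtags, so its cell in the new column would be blank. The third row, however, contains 4 hashtags, three of which match the second data frame. I have tried the functions

str_match
str_extract

and came very close using code given in one of the posts here: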

new_col <- ker[unlist(lapply(destn$destination, agrep, ker$text)), ]

That gets me part of the way, but the output is a list, and I get an error saying:

replacement has 1472 rows, data has 644

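I have tried setting max.distance to different values, and each one gives me a different error. Can someone help me work this out? One alternative I am considering is to put each hashtag in a separate column, but I am not sure whether that would help with further analysis of the other variables I have. The output I am looking for is: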

text          new_col          new_col2    new_col3
statement1    
statement2
statement3    #destination     #munnar     #topstation
statement4
statement5    #Kerala
statement6    #Kerala

You could do:

library(stringr)
results <- sapply(df$`head(ker$text)`, 
                  function(x) { str_match_all(x, paste(df2$destination, collapse = "|")) })

df$matches <- results

If you want to split the results into separate columns, you can use:

df <- cbind(df, do.call(rbind, lapply(results, `[`, 1:max(sapply(results, length)))))
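
One caveat with the alternation pattern above: it matches a listed hashtag anywhere it occurs, including as the prefix of a longer hashtag. The sketch below illustrates this with base R only, on made-up toy vectors (`text` and `tags` are hypothetical stand-ins, not the asker's data), and compares it with extracting every hashtag first and then keeping exact matches.

```r
# Hypothetical toy data, not the asker's real data frames
text <- c("no tags here",
          "#munnar and #munnarhills",
          "#Kerala, God's own country")
tags <- c("#munnar", "#Kerala")

# Alternation pattern: matches any listed tag, even inside a longer hashtag
pattern <- paste(tags, collapse = "|")
by_pattern <- regmatches(text, gregexpr(pattern, text))

# Extract every hashtag first, then keep only exact matches against the list
all_tags <- regmatches(text, gregexpr("#\\w+", text))
by_filter <- lapply(all_tags, function(x) x[x %in% tags])

by_pattern[[2]]  # "#munnar" twice: the second hit comes from "#munnarhills"
by_filter[[2]]   # "#munnar" once
```

If the hashtags in the lookup table never occur as prefixes of other hashtags, both approaches should give the same result.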

library(stringi);
df1$tags <- sapply(stri_extract_all(df1[[1]],regex='#\\w+'),function(x) paste(x[x%in%df2[[1]]],collapse=','));
df1;
##                                                                                                                             head(ker$text)                             tags
## 1                                                                   @_rpg_17 little league travel tourney. These parents about to be wild.
## 2                                                                                 @auscricketfan @davidwarner31 yes WI tour is coming soon
## 3                                                       @keralatourism #favourite #destination #munnar #topstation https://t.co/sm9qz7Z9aR #destination,#munnar,#topstation
## 4 @NWAWhatsup tour of duty in NWA considered a dismal assignment?  Companies send in their best ppl and then those ppl don't want to leave
## 5     Are you Looking for a trip to Kerala? #Kerala-prime tourist attractions of India.Visit:http://t.co/zFCoaoqCMP http://t.co/zaGNd0aOBy                          #Kerala
## 6   Are you Looking for a trip to Kerala? #Kerala, God's own country, is one of the prime tourist attractions of... http://t.co/FLZrEo7NpO                          #Kerala

Edit: if you want a separate column for each tag:

library(stringi);
m <- sapply(stri_extract_all(df1[[1]],regex='#\\w+'),function(x) x[x%in%df2[[1]]]);
df1 <- cbind(df1,do.call(rbind,lapply(m,`[`,1:max(sapply(m,length)))));
df1;
##                                                                                                                             head(ker$text)            1       2           3
## 1                                                                   @_rpg_17 little league travel tourney. These parents about to be wild.         <NA>    <NA>        <NA>
## 2                                                                                 @auscricketfan @davidwarner31 yes WI tour is coming soon         <NA>    <NA>        <NA>
## 3                                                       @keralatourism #favourite #destination #munnar #topstation https://t.co/sm9qz7Z9aR #destination #munnar #topstation
## 4 @NWAWhatsup tour of duty in NWA considered a dismal assignment?  Companies send in their best ppl and then those ppl don't want to leave         <NA>    <NA>        <NA>
## 5     Are you Looking for a trip to Kerala? #Kerala-prime tourist attractions of India.Visit:http://t.co/zFCoaoqCMP http://t.co/zaGNd0aOBy      #Kerala    <NA>        <NA>
## 6   Are you Looking for a trip to Kerala? #Kerala, God's own country, is one of the prime tourist attractions of... http://t.co/FLZrEo7NpO      #Kerala    <NA>        <NA>
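
For anyone who prefers to avoid extra packages, the same extract-then-filter idea can be sketched in base R. This is only an illustration under assumptions: the toy data frames, the `text` column name, and the `tag1`, `tag2`, ... column names are all made up here, and the `max(..., 1L)` guard is added so that the code still yields one (all-NA) column when no row matches.

```r
# Toy stand-ins for the two data frames (hypothetical names and values)
df1 <- data.frame(
  text = c("plain tweet",
           "@keralatourism #favourite #destination #munnar #topstation",
           "trip to #Kerala-prime spots"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  destination = c("#topstation", "#destination", "#munnar", "#Kerala"),
  stringsAsFactors = FALSE
)

# Extract every hashtag per row, then keep only those listed in df2
found <- regmatches(df1$text, gregexpr("#\\w+", df1$text))
kept  <- lapply(found, function(x) x[x %in% df2$destination])

# Pad each row to a common width with NA and stack the rows into a matrix
width <- max(lengths(kept), 1L)
wide  <- do.call(rbind, lapply(kept, function(x) x[seq_len(width)]))
colnames(wide) <- paste0("tag", seq_len(width))

# One column per matched tag, NA where a row has fewer matches
out <- cbind(df1, wide, stringsAsFactors = FALSE)
out
```

Naming the columns explicitly avoids the bare `1`, `2`, `3` headers, which are awkward to reference later with `$`.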