Find close words in many articles in R

Question

I have a tibble (mydf) with 100 rows and 5 columns. Each article consists of many paragraphs.

ID<-c(1,2)
Date<-c("31/01/2018","15/02/2018") 

article1<-c("This is the first article. It is not long. It is not short. It 
comprises of many words and many sentences. This ends paragraph one.  
Parapraph two starts here. It is just a continuation.")

article2<-c("This is the second article. It is longer than first article by 
number of words. It also does not communicate anyything of value. Reading it 
can put you to sleep or jumpstart your imagination. Let your imagination 
take you to some magical place. Enjoy the ride.")

Articles<-c(article1,article2)

FirstWord<-c("first","starts")
SecondWord<-c("jumpstart","magical")

library(tibble)

mydf <- tibble(ID, Date, FirstWord, SecondWord, Articles)

ID    Date    FirstWord    SecondWord    Articles
 1    xxxx     xxx           xxx          xxx
 2     etc
 3     etc

I want to add a new column to the table that says whether FirstWord occurs within 30 words of SecondWord in the article.

ID    Date    FirstWord    SecondWord    Articles   distance
 1    xxxx     xxx           xxx          xxx        TRUE
 2     etc                                           FALSE
 3     etc

I followed this example from Stack Overflow to compute the distance:

library(tidytext)
library(dplyr)

all_words <- mydf %>%
  unnest_tokens(word, Articles) %>%
  mutate(position = row_number())

library(fuzzyjoin)

nearby_words <- all_words %>%
  filter(word == FirstWord) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 30) %>%
  mutate(distance = abs(focus_position - position))

I get a table like this:

  focus_term   focus_position  ID    Date    FirstWord    SecondWord   word  position

How can I get the result in this format instead:

ID    Date    FirstWord    SecondWord    Articles   distance
 1    xxxx     xxx           xxx          xxx        TRUE
 2     etc                                           FALSE
 3     etc

Thanks for your help :)

Answer 1

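Since you are tokenizing the Articles column, it gets converted into a word column. To keep the original Articles column, just copy it into a new column (say, new_column) with mutate before tokenizing. In nearby_words I simply selected the columns you want in the output. I also turned distance into a boolean, depending on whether it equals 30 or not.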

mydf <- tibble(ID, Date, FirstWord, SecondWord, Articles)

# Keep a copy of the article text, because unnest_tokens() drops the input column
all_words <- mydf %>%
  mutate(new_column = Articles) %>%
  unnest_tokens(word, Articles) %>%
  mutate(position = row_number())

nearby_words <- all_words %>%
  filter(word == FirstWord) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 30) %>%
  mutate(distance = abs(focus_position - position)) %>%
  mutate(distance = ifelse(distance == 30, yes = TRUE, no = FALSE)) %>%
  select(ID, Date, FirstWord, SecondWord, new_column, distance)
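Note that the code above numbers word positions across all articles at once and returns one row per pair of nearby words, not one row per article. If the goal is a single TRUE/FALSE per article, one possible approach is sketched below. It is only a sketch under some assumptions: ID uniquely identifies each article, FirstWord and SecondWord are lowercase (unnest_tokens() lowercases tokens by default), and the word pairs used here are illustrative ones chosen so that one row comes out TRUE; they are not the pairs from the question. The names max_gap, word_positions and is_close are made up for this example.

```r
library(dplyr)
library(tidytext)
library(tibble)

# Illustrative data: article texts from the question, hypothetical word pairs
mydf <- tibble(
  ID = c(1, 2),
  FirstWord = c("first", "second"),
  SecondWord = c("starts", "magical"),
  Articles = c(
    paste("This is the first article. It is not long. It is not short. It",
          "comprises of many words and many sentences. This ends paragraph one.",
          "Parapraph two starts here. It is just a continuation."),
    paste("This is the second article. It is longer than first article by",
          "number of words. It also does not communicate anyything of value.",
          "Reading it can put you to sleep or jumpstart your imagination. Let",
          "your imagination take you to some magical place. Enjoy the ride.")
  )
)

max_gap <- 30  # assumed meaning of "close": at most 30 word positions apart

# One row per word; position restarts at 1 inside each article
word_positions <- mydf %>%
  unnest_tokens(word, Articles) %>%
  group_by(ID) %>%
  mutate(position = row_number()) %>%
  ungroup()

# For each article, compare every FirstWord position with every SecondWord
# position; any() over an empty comparison is FALSE, so a missing word gives FALSE
is_close <- word_positions %>%
  group_by(ID) %>%
  summarise(
    distance = any(abs(outer(position[word == FirstWord],
                             position[word == SecondWord], "-")) <= max_gap),
    .groups = "drop"
  )

# Attach the flag to the original table, keeping one row per article
result <- mydf %>%
  left_join(is_close, by = "ID")
```

Here result keeps the original columns, including Articles, and adds a logical distance column. Grouping by ID before row_number() is what stops a word near the end of one article from being matched against a word at the start of the next.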

r

fuzzyjoin