Find close words in many articles in R
I have a tibble (mydf) with 100 rows x 5 columns.
Each article consists of many paragraphs.
ID<-c(1,2)
Date<-c("31/01/2018","15/02/2018")
article1<-c("This is the first article. It is not long. It is not short. It
comprises of many words and many sentences. This ends paragraph one.
Parapraph two starts here. It is just a continuation.")
article2<-c("This is the second article. It is longer than first article by
number of words. It also does not communicate anyything of value. Reading it
can put you to sleep or jumpstart your imagination. Let your imagination
take you to some magical place. Enjoy the ride.")
Articles<-c(article1,article2)
FirstWord<-c("first","starts")
SecondWord<-c("jumpstart","magical")
mydf<-tibble(ID,Date, FirstWord,SecondWord,Articles)
ID Date FirstWord SecondWord Articles
1 xxxx xxx xxx xxx
2 etc
3 etc
I want to add a new column to the table that indicates whether FirstWord occurs within 30 words of SecondWord in the article.
ID Date FirstWord SecondWord Articles distance
1 xxxx xxx xxx xxx TRUE
2 etc FALSE
3 etc
I followed this example from Stack Overflow to calculate the distances:
library(tidytext)
library(dplyr)
all_words <- mydf %>%
unnest_tokens(word, Articles) %>%
mutate(position = row_number())
library(fuzzyjoin)
nearby_words <- all_words %>%
filter(word == FirstWord) %>%
select(focus_term = word, focus_position = position) %>%
difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 30) %>%
mutate(distance = abs(focus_position - position))
I get a table like this:
focus_term focus_position ID Date FirstWord SecondWord word position
How can I get the result in this format:
ID Date FirstWord SecondWord Articles distance
1 xxxx xxx xxx xxx TRUE
2 etc FALSE
3 etc
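Thanks for your help :)
Because you are tokenizing the Articles column, it gets converted into a word column. To keep the original article text, just copy it into a new column (say new_column) with mutate before tokenizing. In nearby_words, I simply selected the columns you wanted in the output. I also turned distance into a boolean, depending on whether it equals 30 or not.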
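library(dplyr)
library(tidytext)
library(fuzzyjoin)

mydf <- tibble(ID, Date, FirstWord, SecondWord, Articles)

# Copy Articles into new_column first: unnest_tokens() replaces it with one word per row
all_words <- mydf %>%
  mutate(new_column = Articles) %>%
  unnest_tokens(word, Articles) %>%
  mutate(position = row_number())

# Pair each FirstWord occurrence with every word at most 30 positions away
nearby_words <- all_words %>%
  filter(word == FirstWord) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 30) %>%
  mutate(distance = abs(focus_position - position)) %>%
  # Boolean flag: TRUE when the distance equals 30, FALSE otherwise
  mutate(distance = ifelse(distance == 30, yes = TRUE, no = FALSE)) %>%
  select(ID, Date, FirstWord, SecondWord, new_column, distance)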
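If the goal is exactly one TRUE/FALSE per article, a variant without the fuzzy join may be easier to reason about. The following is only a sketch under some assumptions: the `toy` data is made up for illustration, token positions are numbered within each article (via `group_by(ID)`) so that words from different articles are never treated as neighbours, and "close" is taken to mean at most 30 tokens apart. It also uses the `drop = FALSE` argument of `unnest_tokens()` to keep the Articles column instead of copying it first.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data with the same column layout as mydf
toy <- tibble(
  ID         = c(1, 2),
  Date       = c("31/01/2018", "15/02/2018"),
  FirstWord  = c("first", "starts"),
  SecondWord = c("article", "magical"),
  Articles   = c("This is the first article. It is not long.",
                 "Paragraph two starts here. It is just a continuation.")
)

# One row per token; drop = FALSE keeps the original Articles text,
# and positions restart at 1 for every article
tokens <- toy %>%
  unnest_tokens(word, Articles, drop = FALSE) %>%
  group_by(ID) %>%
  mutate(position = row_number()) %>%
  ungroup()

# Collapse back to one row per article: TRUE if any occurrence of FirstWord
# is at most 30 tokens away from any occurrence of SecondWord
result <- tokens %>%
  group_by(ID, Date, FirstWord, SecondWord, Articles) %>%
  summarise(
    distance = {
      p1 <- position[word == tolower(FirstWord)]
      p2 <- position[word == tolower(SecondWord)]
      # outer() builds all pairwise position differences; any() of an
      # empty comparison is FALSE, which covers a missing word
      any(abs(outer(p1, p2, "-")) <= 30)
    },
    .groups = "drop"
  )
```

With this toy data, the first article contains both words next to each other and the second article does not contain "magical" at all, so `result$distance` distinguishes the two cases. The threshold 30 is hard-coded here; it could be pulled out into a variable if it needs to change.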