Finding row-wise important words in a text data frame

I have a data frame like the following:

sentences <- data.frame(sentences = 
                          c('You can apply for or renew your Medical Assistance benefits online by using COMPASS.',
                            'COMPASS is the name of the website where you can apply for Medical Assistance and many other services that can help you make ends meet.',
                          'Medical tourism refers to people traveling to a country other than their own to obtain medical treatment. In the past this usually referred to those who traveled from less-developed countries to major medical centers in highly developed countries for treatment unavailable at home.',
                          'Health tourism is a wider term for travel that focus on medical treatments and the use of healthcare services. It covers a wide field of health-oriented, tourism ranging from preventive and health-conductive treatment to rehabilitational and curative forms of travel.',
                          'Medical tourism carries some risks that locally provided medical care either does not carry or carries to a much lesser degree.',
                          'Receiving medical care abroad may subject medical tourists to unfamiliar legal issues. The limited nature of litigation in various countries is a reason for accessbility of care overseas.', 
                          'While some countries currently presenting themselves as attractive medical tourism destinations provide some form of legal remedies for medical malpractice, these legal avenues may be unappealing to the medical tourist.'))

What I want to do is find the important words in each row and create a new column, which should look like this:

sentences$ImpWords <- c("apply, renew, Medical, Assistance, benefits, online, COMPASS",
                    "COMPASS, name, website, apply, Medical, Assistance, services, help, meet") 

and so forth

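I don't know how to do this.

I have been trying bag-of-words, cleaning, preprocessing and so on with various packages such as tm and tidytext, but I could not get the desired result.

Are there any other possible options?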

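This will do what you want. If you want to remove more words, just find a bigger or different list (many are available through various packages). Here I used tm's English stop words.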

library(tm)
library(magrittr)  # provides the %>% pipe used below
stopwords <- stopwords('en')

sentences <- data.frame(sentences = 
                          c('You can apply for or renew your Medical Assistance benefits online by using COMPASS.',
                            'COMPASS is the name of the website where you can apply for Medical Assistance and many other services that can help you make ends meet.',
                            'Medical tourism refers to people traveling to a country other than their own to obtain medical treatment. In the past this usually referred to those who traveled from less-developed countries to major medical centers in highly developed countries for treatment unavailable at home.',
                            'Health tourism is a wider term for travel that focus on medical treatments and the use of healthcare services. It covers a wide field of health-oriented, tourism ranging from preventive and health-conductive treatment to rehabilitational and curative forms of travel.',
                            'Medical tourism carries some risks that locally provided medical care either does not carry or carries to a much lesser degree.',
                            'Receiving medical care abroad may subject medical tourists to unfamiliar legal issues. The limited nature of litigation in various countries is a reason for accessbility of care overseas.', 
                            'While some countries currently presenting themselves as attractive medical tourism destinations provide some form of legal remedies for medical malpractice, these legal avenues may be unappealing to the medical tourist.'))


sentences[,"sentences"] <- sentences[,"sentences"] %>% as.character()


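ImpWords <- character(nrow(sentences))
for (i in seq_len(nrow(sentences))) {

  # Replace runs of punctuation/spaces with a single space, then split into words
  originalWords <- gsub('[[:punct:] ]+',' ',sentences[i, "sentences"]) %>% trimws(.) %>% strsplit(., " ") 
  # Same split on a lower-cased copy, used only for matching against the stop word list
  lowerCaseWords <- gsub('[[:punct:] ]+',' ',tolower(sentences[i, "sentences"])) %>% trimws(.) %>% strsplit(., " ")
  # Keep the original-case words whose lower-case form is not a stop word
  wordsNotInStopWords <- originalWords[[1]][which(!lowerCaseWords[[1]] %in% stopwords)]
  # Drop short words (three characters or fewer)
  wordsNotInStopWordsGreaterThanThreeChar <- wordsNotInStopWords[which(nchar(wordsNotInStopWords) > 3)]
  ImpWords[i] <- paste(wordsNotInStopWordsGreaterThanThreeChar, collapse = ", ")

}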

sentences$ImpWords <- ImpWords
sentences$ImpWords
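
The same steps can also be written without an explicit loop. The following is only a sketch in base R: the helper name `extract_imp_words`, its `stop_words` and `min_chars` parameters, and the tiny hand-written stop word list are all made up for illustration and are not part of tm. In practice `tm::stopwords("en")`, or a longer list, could presumably be passed in instead.

```r
# Hypothetical helper (not from any package): returns one comma-separated
# string of "important" words per input sentence.
extract_imp_words <- function(text, stop_words, min_chars = 4) {
  # Replace runs of punctuation/spaces with one space, then split into words.
  words <- strsplit(trimws(gsub("[[:punct:] ]+", " ", text)), " ", fixed = TRUE)
  vapply(words, function(w) {
    # Match case-insensitively against the stop words, keep original casing.
    keep <- !(tolower(w) %in% stop_words) & nchar(w) >= min_chars
    paste(w[keep], collapse = ", ")
  }, character(1))
}

# Tiny hand-written stop word list, for illustration only.
demo_stop_words <- c("you", "can", "for", "or", "your", "by", "using",
                     "is", "the", "of")

demo <- data.frame(
  sentences = c(
    "You can apply for or renew your Medical Assistance benefits online by using COMPASS.",
    "COMPASS is the name of the website."
  ),
  stringsAsFactors = FALSE
)

demo$ImpWords <- extract_imp_words(demo$sentences, demo_stop_words)
demo$ImpWords
```

Because the stop word list is an argument, removing more words only means passing a longer vector, for example `c(demo_stop_words, "medical")`.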

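If you like, here is an approach that uses tidy data principles. One benefit of this approach is that it is very flexible about the choice of stop word dictionary: you can switch dictionaries through the arguments to get_stopwords().
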
library(tidyverse)
library(tidytext)

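sentences %>%
  mutate(line = row_number()) %>%                    # track which sentence each word came from
  unnest_tokens(word, sentences) %>%                 # one lower-cased word per row
  anti_join(get_stopwords(source = "smart")) %>%     # drop stop words
  nest(data = word) %>%                              # tidyr >= 1.0 syntax; older versions used nest(word)
  mutate(words = map(data, unlist),
         words = map_chr(words, paste, collapse = " "))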

#> Joining, by = "word"
#> # A tibble: 7 x 3
#>    line data           words                                              
#>   <int> <list>         <chr>                                              
#> 1     1 <tibble [7 × … apply renew medical assistance benefits online com…
#> 2     2 <tibble [9 × … compass website apply medical assistance services …
#> 3     3 <tibble [23 ×… medical tourism refers people traveling country ob…
#> 4     4 <tibble [25 ×… health tourism wider term travel focus medical tre…
#> 5     5 <tibble [12 ×… medical tourism carries risks locally provided med…
#> 6     6 <tibble [18 ×… receiving medical care abroad subject medical tour…
#> 7     7 <tibble [17 ×… countries presenting attractive medical tourism de…

Created on 2018-08-14 by the reprex package (v0.2.0).

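The first line creates a column to keep track of each sentence, and the next line uses unnest_tokens() to tokenize the text and convert it to a tidy format. You can then remove stop words with anti_join(). After that, the last few lines convert from the tidy format (which, FYI, already has the information you are looking for, just in a different shape) back to the kind of data structure you described. If you like, you can drop the data column with select(-data).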
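
If the nested `data` column is not needed at all, the last steps can be collapsed into a grouped summary. This is a sketch rather than the method used above: it assumes dplyr and tidytext are installed (plus the stopwords package that `get_stopwords()` relies on), it swaps in the "snowball" dictionary to show how the source can be changed, and the `ImpWords` column name and the shortened two-row data frame are chosen here only for illustration.

```r
library(dplyr)
library(tidytext)

# Shortened sample data, for illustration only.
sentences <- data.frame(
  sentences = c(
    "You can apply for or renew your Medical Assistance benefits online by using COMPASS.",
    "Medical tourism carries some risks that locally provided medical care does not carry."
  ),
  stringsAsFactors = FALSE
)

imp_words <- sentences %>%
  mutate(line = row_number()) %>%                      # one id per sentence
  unnest_tokens(word, sentences) %>%                   # one lower-cased word per row
  anti_join(get_stopwords(source = "snowball"),        # a different stop word dictionary
            by = "word") %>%                           # explicit key, so no join message
  group_by(line) %>%
  summarise(ImpWords = paste(word, collapse = ", "))   # one comma-separated string per sentence

imp_words
```

Note that unnest_tokens() lower-cases words by default, and that a sentence consisting only of stop words would drop out of the result entirely, so joining back to the original data frame by `line` may be safer than assigning the column directly.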