Stemming function in R

The corpus package supports custom stemmers. A stemming function should, when given a term as input, return the stem of that term as output.
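
To make that contract concrete (one term in, one stem out), here is a minimal sketch of a toy stemmer in base R. The name `stem_suffix` and its suffix rules are hypothetical and only for illustration; they are not part of corpus and are far cruder than a dictionary-based stemmer.

```r
# Hypothetical toy stemmer: takes a single term and returns a single stem.
stem_suffix <- function(term) {
  # Strip a trailing "ing", "ed" or "s" from the term.
  stem <- sub("(ing|ed|s)$", "", term)
  # Keep the original term if stripping leaves fewer than 3 characters.
  if (nchar(stem) < 3) term else stem
}

stem_suffix("reflections")  # "reflection"
stem_suffix("looked")       # "look"
stem_suffix("is")           # "is" (too short after stripping, so unchanged)
```

Any function with this shape can presumably be passed as the `stemmer` argument in the same way as the hunspell-based function in this question.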

I took the following example from the Stemming Words vignette; it uses the hunspell dictionary to do the stemming.

First I define the sentences to test this function on:

sentences<-c("The color blue neutralizes orange yellow reflections.", 
             "Zod stabbed me with blue Kryptonite.", 
             "Because blue is your favourite colour.",
             "Red is wrong, blue is right.",
             "You and I are going to yellowstone.",
             "Van Gogh looked for some yellow at sunset.",
             "You ruined my beautiful green dress.",
             "You do not agree.",
             "There's nothing wrong with green.")

The custom stemming function is:

stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]

  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }

  stem
}

This code

library(corpus)
sentences <- text_tokens(sentences, stemmer = stem_hunspell)

yields:

> sentences
[[1]]
[1] "the"        "color"      "blue"       "neutralize" "orange"     "yellow"    
[7] "reflection" "."         

[[2]]
[1] "zod"        "stabbed"    "me"         "with"       "blue"       "kryptonite"
[7] "."         

[[3]]
[1] "because"   "blue"      "i"         "your"      "favourite" "colour"   
[7] "."        

[[4]]
[1] "re"    "i"     "wrong" ","     "blue"  "i"     "right" "."    

[[5]]
[1] "you"         "and"         "i"           "are"         "go"         
[6] "to"          "yellowstone" "."          

[[6]]
[1] "van"    "gogh"   "look"   "for"    "some"   "yellow" "at"     "sunset" "."     

[[7]]
[1] "you"       "ruin"      "my"        "beautiful" "green"     "dress"    
[7] "."        

[[8]]
[1] "you"   "do"    "not"   "agree" "."    

[[9]]
[1] "there"   "nothing" "wrong"   "with"    "green"   "." 

After stemming I would like to apply other operations to the text, e.g. removing stop words. However, when I apply the tm function

removeWords(sentences,stopwords)

to my sentences, I get the following error:

Error in UseMethod("removeWords", x) : 
 no applicable method for 'removeWords' applied to an object of class "list"

If I use

unlist(sentences)

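I do not get the desired result, because I end up with a single chr vector of 65 elements and the sentence boundaries are lost. The desired result should be (e.g. for the first sentence):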

"the color blue neutralize orange yellow reflection."       

If you want to remove stop words from each sentence, you can use lapply:

library(tm)
lapply(sentences, removeWords, stopwords())

#[[1]]
#[1] ""           "color"      "blue"       "neutralize" "orange"     "yellow"     "reflection" "."         

#[[2]]
#[1] "zod"        "stabbed"    ""           ""           "blue"       "kryptonite" "."  
#...
#...

However, judging from your expected output, it looks like you want to paste the tokens back together.

lapply(sentences, paste0, collapse = " ")

#[[1]]
#[1] "the color blue neutralize orange yellow reflection ."

#[[2]]
#[1] "zod stabbed me with blue kryptonite ."
#....
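
If the goal is one string per sentence with the stop words already removed, one option is to drop the stop-word tokens before pasting, so that no empty strings are left in the result. This is a minimal sketch in base R, assuming the token list has the shape shown in the question. The two-sentence `tokens` list and the short `stops` vector are stand-ins for illustration; in practice `stopwords()` from tm could supply the stop-word list.

```r
# Stand-in for the stemmed token list from the question (first two sentences).
tokens <- list(
  c("the", "color", "blue", "neutralize", "orange", "yellow", "reflection", "."),
  c("zod", "stabbed", "me", "with", "blue", "kryptonite", ".")
)

# Illustrative stop-word list (an assumption, not the tm list).
stops <- c("the", "me", "with")

# For each sentence: keep tokens that are not stop words, then join with spaces.
clean <- vapply(
  tokens,
  function(tok) paste(tok[!tok %in% stops], collapse = " "),
  character(1)
)

clean
# [1] "color blue neutralize orange yellow reflection ."
# [2] "zod stabbed blue kryptonite ."
```

Using `vapply` with `character(1)` returns a plain character vector with one element per sentence instead of a list.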

We can also use map from purrr:

library(tm)
library(purrr)
map(sentences, removeWords, stopwords())
#[[1]]
#[1] ""           "color"      "blue"       "neutralize" "orange"     "yellow"     "reflection"
#[8] "."         

#[[2]]
#[1] "zod"        "stabbed"    ""           ""           "blue"       "kryptonite" "."