Stemming function in R
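The corpus package supports custom stemming functions. A stemming function should, when given a term as input, return the stem of that term as output.
From Stemming Words I took the following example, which uses the hunspell dictionary to do the stemming.
First I define the sentences to test this function on: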
sentences <- c("The color blue neutralizes orange yellow reflections.",
               "Zod stabbed me with blue Kryptonite.",
               "Because blue is your favourite colour.",
               "Red is wrong, blue is right.",
               "You and I are going to yellowstone.",
               "Van Gogh looked for some yellow at sunset.",
               "You ruined my beautiful green dress.",
               "You do not agree.",
               "There's nothing wrong with green.")
The custom stemming function is:
stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]
  if (length(stems) == 0) {
    # if there are no stems, use the original term
    stem <- term
  } else {
    # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }
  stem
}
This code
library(corpus)  # provides text_tokens()
sentences <- text_tokens(sentences, stemmer = stem_hunspell)
produces:
> sentences
[[1]]
[1] "the" "color" "blue" "neutralize" "orange" "yellow"
[7] "reflection" "."
[[2]]
[1] "zod" "stabbed" "me" "with" "blue" "kryptonite"
[7] "."
[[3]]
[1] "because" "blue" "i" "your" "favourite" "colour"
[7] "."
[[4]]
[1] "re" "i" "wrong" "," "blue" "i" "right" "."
[[5]]
[1] "you" "and" "i" "are" "go"
[6] "to" "yellowstone" "."
[[6]]
[1] "van" "gogh" "look" "for" "some" "yellow" "at" "sunset" "."
[[7]]
[1] "you" "ruin" "my" "beautiful" "green" "dress"
[7] "."
[[8]]
[1] "you" "do" "not" "agree" "."
[[9]]
[1] "there" "nothing" "wrong" "with" "green" "."
After stemming I would like to apply other operations to the text, e.g. removing stop words. However, when I apply the tm function:
removeWords(sentences,stopwords)
to my sentences, I get the following error:
Error in UseMethod("removeWords", x) :
no applicable method for 'removeWords' applied to an object of class "list"
If I use
unlist(sentences)
I do not get the desired result, because I end up with a chr vector of 65 elements. The desired result should be (e.g. for the first sentence):
"the color blue neutralize orange yellow reflection."
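The 65 elements are the token counts of the nine sentences added together (8 + 7 + 7 + 8 + 8 + 9 + 7 + 5 + 6), because unlist flattens the list and discards the sentence boundaries. A minimal base R sketch of that effect, using two of the stemmed sentences from the output above typed in by hand (the variable names are only illustrative):

```r
# Two of the stemmed sentences from the output above, as a plain list.
tokens <- list(
  c("you", "do", "not", "agree", "."),
  c("there", "nothing", "wrong", "with", "green", ".")
)

# Number of tokens in each sentence.
per_sentence <- lengths(tokens)

# unlist() concatenates everything into one character vector,
# so the information about which token belongs to which sentence is lost.
flat <- unlist(tokens)

print(per_sentence)
print(length(flat) == sum(per_sentence))
```

Any per-sentence operation therefore has to be applied to each list element separately rather than to the flattened vector.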
If you want to remove the stop words from each sentence, you can use lapply:
library(tm)
lapply(sentences, removeWords, stopwords())
#[[1]]
#[1] "" "color" "blue" "neutralize" "orange" "yellow" "reflection" "."
#[[2]]
#[1] "zod" "stabbed" "" "" "blue" "kryptonite" "."
#...
#...
However, from your expected output it looks like you want to paste the text together.
lapply(sentences, paste0, collapse = " ")
#[[1]]
#[1] "the color blue neutralize orange yellow reflection ."
#[[2]]
#[1] "zod stabbed me with blue kryptonite ."
#....
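The pasted strings still have a space before the final period, whereas the expected output in the question ends in "reflection.". One possible way to close that gap, assuming a base R regular expression is acceptable here, is to drop the space in front of punctuation. The input strings below are copied from the output above:

```r
# Strings as produced by pasting the tokens together with " ".
pasted <- c("the color blue neutralize orange yellow reflection .",
            "zod stabbed me with blue kryptonite .")

# Remove the single space that precedes a punctuation character.
cleaned <- gsub(" ([[:punct:]])", "\\1", pasted)

print(cleaned)
```

This is only a sketch: it treats every punctuation character the same way, which may not be what you want for tokens such as opening brackets or quotes.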
We can use map:
library(tm)
library(purrr)
map(sentences, removeWords, stopwords())
#[[1]]
#[1] "" "color" "blue" "neutralize" "orange" "yellow" "reflection"
#[8] "."
#[[2]]
#[1] "zod" "stabbed" "" "" "blue" "kryptonite" "."
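Note that in the outputs above removeWords leaves an empty string "" where a stop word used to be, so pasting the result would give doubled spaces. A minimal sketch of an alternative that drops the stop words entirely and then joins each sentence, using base R only. The short stop_words vector is a hand-written stand-in for tm's stopwords(), not the real list, and the tokens are copied from the stemmed output:

```r
# Two stemmed sentences copied from the output shown earlier.
tokens <- list(
  c("the", "color", "blue", "neutralize", "orange", "yellow", "reflection", "."),
  c("zod", "stabbed", "me", "with", "blue", "kryptonite", ".")
)

# Assumption: a tiny illustrative stop list standing in for tm::stopwords().
stop_words <- c("the", "me", "with")

# Keep only the tokens that are not stop words (no "" placeholders remain).
kept <- lapply(tokens, function(x) x[!x %in% stop_words])

# Join the remaining tokens of each sentence into a single string.
joined <- vapply(kept, paste, character(1), collapse = " ")

print(joined)
```

With the real list you would replace stop_words by stopwords() after loading tm; the filtering and pasting steps stay the same.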