Stemming by `hunspell` dictionary
From Stemming Words I adopted the following custom stemming function:
stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]
  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }
  stem
}
It stems words with a hunspell dictionary (via the corpus package). I tried the function on the sentences below.
sentences <- c("We're taking proactive steps to tackle ...",
               "A number of measures we are taking to support ...",
               "We caught him committing an indecent act.")
Then I ran the following:
library(qdap)
library(tm)
sentences <- iconv(sentences, "latin1", "ASCII", sub = "")
sentences <- gsub("http\\S+\\s*", "", sentences)  # remove URLs
sentences <- bracketX(sentences, bracket = "all")
sentences <- gsub("[[:punct:]]", "", sentences)
sentences <- removeNumbers(sentences)
sentences <- tolower(sentences)
# Stemming
library(corpus)
stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]
  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }
  stem
}
sentences <- text_tokens(sentences, stemmer = stem_hunspell)
sentences <- lapply(sentences, removeWords, stopwords("en"))
sentences <- lapply(sentences, stripWhitespace)
I can't explain the results:
[[1]]
[1] "" "taking" "active" "step" "" "tackle"
[[2]]
[1] "" "numb" "" "measure" "" "" "taking" ""
[9] "support"
[[3]]
[1] "" "caught" "" "committing" "" "decent"
[7] "act"
For example, why do commit and take appear in their ing forms? Why did number become "numb"?
I think the answer is mostly that this is simply how hunspell stems words. We can check in a simpler example:
hunspell::hunspell_stem("taking")
#> [[1]]
#> [1] "taking"
hunspell::hunspell_stem("committing")
#> [[1]]
#> [1] "committing"
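For "numb" there is a second factor besides hunspell itself: your custom function keeps the *last* stem in the vector hunspell returns. Since the pipeline turned "number" into "numb", the dictionary evidently lists "numb" as a later analysis. You can inspect the full vector directly (the exact result depends on the hunspell dictionary installed on your system):

```r
# hunspell can return several candidate stems per word; the custom
# function above picks the last one, which is how "number" became "numb"
hunspell::hunspell_stem("number")
```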
The ing form is the only option hunspell offers. That doesn't make much sense to me either, and my suggestion would be to use a different stemmer. While we're at it, I think you could also benefit from switching from tm to quanteda:
library(quanteda)
sentences <- c("We're taking proactive steps to tackle ...",
               "A number of measures we are taking to support ...",
               "We caught him committing an indecent act.")
tokens(sentences, remove_numbers = TRUE) %>%
  tokens_tolower() %>%
  tokens_wordstem()
#> Tokens consisting of 3 documents.
#> text1 :
#> [1] "we'r" "take" "proactiv" "step" "to" "tackl" "."
#> [8] "." "."
#>
#> text2 :
#> [1] "a" "number" "of" "measur" "we" "are" "take"
#> [8] "to" "support" "." "." "."
#>
#> text3 :
#> [1] "we" "caught" "him" "commit" "an" "indec" "act" "."
I think the workflow is cleaner and the results make more sense to me. quanteda uses the SnowballC package for stemming here, which you could also integrate into a tm workflow if you prefer. A tokens object holds the texts in the same order as the input, but tokenized (i.e., split into words).
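To see what the Snowball stemmer does on its own, you can call SnowballC directly; a minimal sketch (assuming SnowballC is installed):

```r
library(SnowballC)

# the same Porter/Snowball stemmer quanteda uses under the hood;
# tm's stemDocument() also wraps this function
wordStem(c("taking", "committing", "proactive", "measures"), language = "en")
#> [1] "take"     "commit"   "proactiv" "measur"
```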
If you still want to use hunspell, you can do so with the following function, which fixes some of the problems you seem to have run into ("number" is now correct):
stem_hunspell <- function(toks) {
  # look up each token type in the dictionary and keep the first stem
  stems <- vapply(hunspell::hunspell_stem(types(toks)), "[", 1,
                  FUN.VALUE = character(1))
  # if there are no stems, keep the original type
  stems[is.na(stems) | nchar(stems) == 0] <- types(toks)[is.na(stems) | nchar(stems) == 0]
  tokens_replace(toks, types(toks), stems, valuetype = "fixed")
}
tokens(sentences, remove_numbers = TRUE) %>%
  tokens_tolower() %>%
  stem_hunspell()
#> Tokens consisting of 3 documents.
#> text1 :
#> [1] "we're" "taking" "active" "step" "to" "tackle" "." "."
#> [9] "."
#>
#> text2 :
#> [1] "a" "number" "of" "measure" "we" "are" "taking"
#> [8] "to" "support" "." "." "."
#>
#> text3 :
#> [1] "we" "caught" "him" "committing" "an"
#> [6] "decent" "act" "."