How to perform stemming and lemmatization in R?
I am working with text data that contains strings like the following:
"significant step towards large scale hydrogen production iisc team
collaboration jncasr researcher develop low cost catalyst speed split
water generate hydrogen gas"
To get the words into their proper forms, stemming or lemmatization is needed. I tried both, but neither gives the desired output:
stemDocument(p[1], language = "english")
[1] "signific step toward larg scale hydrogen product iisc team
collabor jncasr research develop low cost catalyst speed split water
generat hydrogen gas"
lemmatize_strings(p[1], dictionary = lexicon::hash_lemmas)
[1] "significant step towards large scale hydrogen production iisc
team collaboration jncasr researcher develop low cost catalyst speed
split water generate hydrogen gas"
How can I get output like this?
significant step toward large scale hydrogen produce iisc team
collaborate jncasr research develop low cost catalyst speed split
water generate hydrogen gas
It would help to state which packages you are using. To do what you are after, you can use the following two packages:
library(udpipe)
# This takes a minute to download the English model
x <- udpipe(x = "significant step towards large scale hydrogen production iisc team
collaboration jncasr researcher develop low cost catalyst
speed split water generate hydrogen gas",
object = "english")
This returns a data frame with a range of annotations to analyse, including tokens, lemmas, and more. You can do a lot with it.
x$lemma
[1] "significant" "step" "towards" "large" "scale" "hydrogen" "production"
[8] "iisc" "team" "collaboration" "jncasr" "researcher" "develop" "low"
[15] "cost" "catalyst" "speed" "split" "water" "generate" "hydrogen"
[22] "gas"
To stem the words, you can use the tm package. If you want to stem the lemmas, you already have them:
library(tm)
tm::stemDocument(x$lemma)
This gives you the following:
[1] "signific" "step" "toward" "larg" "scale" "hydrogen" "product" "iisc" "team" "collabor"
[11] "jncasr" "research" "develop" "low" "cost" "catalyst" "speed" "split" "water" "generat"
[21] "hydrogen" "gas"
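Neither the lemmas nor the stems shown above match the target in the question exactly (for example "produce" and "collaborate"). One possible workaround, offered only as a minimal base R sketch, is to apply a small hand-written lookup to the tokens. The names `custom_lemmas` and `normalize_tokens` and the four mappings are hypothetical, chosen to reproduce the asker's example; they do not come from udpipe, tm, or textstem.

```r
# Hypothetical hand-written mapping from surface form to the preferred base form.
# These four entries are assumptions taken from the desired output in the question.
custom_lemmas <- c(
  towards       = "toward",
  production    = "produce",
  collaboration = "collaborate",
  researcher    = "research"
)

# Replace every whitespace-separated token that has an entry in `mapping`;
# tokens without an entry are left unchanged.
normalize_tokens <- function(text, mapping) {
  tokens <- strsplit(text, "\\s+")[[1]]
  hits <- tokens %in% names(mapping)
  tokens[hits] <- unname(mapping[tokens[hits]])
  paste(tokens, collapse = " ")
}

txt <- "significant step towards large scale hydrogen production iisc team collaboration jncasr researcher develop low cost catalyst speed split water generate hydrogen gas"

normalize_tokens(txt, custom_lemmas)
# [1] "significant step toward large scale hydrogen produce iisc team collaborate jncasr research develop low cost catalyst speed split water generate hydrogen gas"
```

A lookup like this only covers the words you list, so it is best suited to a small, known vocabulary; the same function could also be applied to each element of x$lemma before pasting the tokens back together.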