How to use the hunspell package to suggest correct words in a column in R?
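I am currently working with a large data frame that contains a lot of text in each row, and I would like to use the hunspell package to efficiently identify and replace the misspelled words in each sentence. I can identify the misspelled words, but I cannot figure out how to run hunspell_suggest on a list.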
Here is an example of the data frame:
df1 <- data.frame("Index" = 1:7, "Text" = c("A complec sentence joins an independet",
"Mary and Samantha arived at the bus staton before noon",
"I did not see thm at the station in the mrning",
"The participnts read 60 sentences in radom order",
"how to fix mispelled words in R languge",
"today is Tuesday",
"bing sports quiz"))
I converted the text column to character and used hunspell to identify the misspelled words in each row.
library(hunspell)
df1$Text <- as.character(df1$Text)
df1$word_check <- hunspell(df1$Text)
I tried
df1$suggest <- hunspell_suggest(df1$word_check)
but it keeps throwing an error:
Error in hunspell_suggest(df1$word_check) :
is.character(words) is not TRUE
I am new to this, so I am not sure how to build the suggestion column with the hunspell_suggest function. Any help would be greatly appreciated.
Check your intermediate step. The structure of df1$word_check looks like this:
List of 5
$ : chr [1:2] "complec" "independet"
$ : chr [1:2] "arived" "staton"
$ : chr [1:2] "thm" "mrning"
$ : chr [1:2] "participnts" "radom"
$ : chr [1:2] "mispelled" "languge"
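It is a list, whereas hunspell_suggest expects a character vector, which is why the call fails. If you run lapply(df1$word_check, hunspell_suggest) instead, you get the suggestions.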
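A minimal sketch of that step, assuming a shortened version of df1 from the question and the default English dictionary that ships with hunspell. The column name suggest is the one the question tried to create; the is.character() check reproduces the condition named in the error message.

```r
library(hunspell)

df1 <- data.frame(
  Index = 1:3,
  Text = c("A complec sentence joins an independet",
           "I did not see thm at the station in the mrning",
           "today is Tuesday"),
  stringsAsFactors = FALSE
)

# hunspell() returns one character vector of misspelled words per input string
df1$word_check <- hunspell(df1$Text)

# A list is not a character vector, which is what hunspell_suggest() checks for
is.character(df1$word_check)   # FALSE

# Apply hunspell_suggest() to each row's misspelled words separately
df1$suggest <- lapply(df1$word_check, hunspell_suggest)

# Each element of df1$suggest is itself a list:
# one vector of suggestions per misspelled word in that row
str(df1$suggest[[1]])
```

The result is a list column with one entry per row, so rows without misspelled words simply hold an empty list.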
Edit
I decided to go into this in more detail, since I did not see any simple alternative. This is what I came up with:
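cleantext <- function(x) {
  # Correct each string independently; vapply guarantees a character vector back
  vapply(x, function(s) {
    bad <- hunspell(s)[[1]]  # misspelled words in this string
    # First suggestion per word; `[` yields NA when a word has no suggestions
    good <- vapply(hunspell_suggest(bad), `[`, character(1), 1)
    for (i in seq_along(bad)) {
      # Skip words without a suggestion; fixed = TRUE matches the word literally
      if (!is.na(good[i])) s <- gsub(bad[i], good[i], s, fixed = TRUE)
    }
    s
  }, character(1), USE.NAMES = FALSE)
}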
There may be a more elegant approach, but this function returns a vector of character strings corrected as follows:
> df1$Text
[1] "A complec sentence joins an independet"
[2] "Mary and Samantha arived at the bus staton before noon"
[3] "I did not see thm at the station in the mrning"
[4] "The participnts read 60 sentences in radom order"
[5] "how to fix mispelled words in R languge"
[6] "today is Tuesday"
[7] "bing sports quiz"
> cleantext(df1$Text)
[1] "A complex sentence joins an independent"
[2] "Mary and Samantha rived at the bus station before noon"
[3] "I did not see them at the station in the morning"
[4] "The participants read 60 sentences in radon order"
[5] "how to fix misspelled words in R language"
[6] "today is Tuesday"
[7] "bung sports quiz"
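Note that this returns the first suggestion given by hunspell, which may or may not be correct. In the output above, for example, "arived" became "rived" and "radom" became "radon".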
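Because the first suggestion can be wrong, one option is to review the candidates before replacing anything. The sketch below is only an illustration under the same assumptions (default English dictionary); suggestion_table is a hypothetical name, and keeping three suggestions per word is an arbitrary choice. Looking up each distinct misspelled word once may also help on a large data frame.

```r
library(hunspell)

text <- c("Mary and Samantha arived at the bus staton before noon",
          "The participnts read 60 sentences in radom order")

# Collect each distinct misspelled word once, so every word is looked up a single time
bad <- unique(unlist(hunspell(text)))

# Keep up to three suggestions per word for manual review
# (head() simply returns fewer when hunspell offers fewer)
top3 <- vapply(hunspell_suggest(bad),
               function(s) paste(head(s, 3), collapse = ", "),
               character(1))

# suggestion_table is a hypothetical name used only in this sketch
suggestion_table <- data.frame(word = bad, suggestions = top3,
                               stringsAsFactors = FALSE)
print(suggestion_table)
```

After checking the table by hand, the accepted word pairs could be fed into a replacement loop like the one in cleantext.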