Find non-English tokens in a dfm and remove them
How can I detect non-English words in a dfm and remove them?
dftest <- data.frame(id = 1:3,
                     text = c("Holla this is a spanish word",
                              "English online here",
                              "Bonjour, comment ça va?"))
An example of how the dfm is constructed:
library(quanteda)
library(magrittr)  # provides %>%
testDfm <- dftest$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_wordstem() %>%
  dfm()
I found the textcat package as one alternative solution, but on the real dataset there are many cases where an entire row is in English and it classifies the row as another language based on just a single character. Is there another way, using quanteda, to find the non-English rows in the data frame, or the non-English tokens in the dfm?
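For reference, this is roughly what I tried (a minimal sketch, assuming textcat's default settings; it guesses one language per string):
library(textcat)
textcat(dftest$text)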
You can do this with a word list that contains all English words. One comes with the hunspell package, which is used for spell checking.
library(quanteda)
# find the path in which the right dictionary file is stored
hunspell::dictionary(lang = "en_US")
#> <hunspell dictionary>
#> affix: /home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.aff
#> dictionary: /home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.dic
#> encoding: UTF-8
#> wordchars: ’
#> added: 0 custom words
# read this into a vector
english_words <- readLines("/home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.dic") %>%
# the vector contains extra information on the words, which is removed
gsub("/.+", "", .)
# let's display a sample of the words
set.seed(1)
sample(english_words, 50)
#> [1] "furnace" "steno" "Hadoop" "alumna"
#> [5] "gonorrheal" "multichannel" "biochemical" "Riverside"
#> [9] "granddad" "glum" "exasperation" "restorative"
#> [13] "appropriate" "submarginal" "Nipponese" "hotting"
#> [17] "solicitation" "pillbox" "mealtime" "thunderbolt"
#> [21] "chaise" "Milan" "occidental" "hoeing"
#> [25] "debit" "enlightenment" "coachload" "entreating"
#> [29] "grownup" "unappreciative" "egret" "barre"
#> [33] "Queen" "Tammany" "Goodyear" "horseflesh"
#> [37] "roar" "fictionalization" "births" "mediator"
#> [41] "resitting" "waiter" "instructive" "Baez"
#> [45] "Muenster" "sleepless" "motorbike" "airsick"
#> [49] "leaf" "belie"
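As an aside, the hard-coded path above only works on that particular machine. A portable sketch (assuming the dictionary file ships inside the installed hunspell package, as the printed paths above suggest) lets system.file() resolve the location instead:
english_words <- readLines(system.file("dict", "en_US.dic", package = "hunspell")) %>%
  gsub("/.+", "", .)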
Armed with this vector, which in theory should contain all English words but nothing else, we can remove the non-English tokens:
testDfm <- dftest$text %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
tokens_keep(english_words, valuetype = "fixed") %>%
tokens_wordstem() %>%
dfm()
testDfm
#> Document-feature matrix of: 3 documents, 9 features (66.7% sparse).
#> features
#> docs this a spanish word english onlin here comment va
#> text1 1 1 1 1 0 0 0 0 0
#> text2 0 0 0 0 1 1 1 0 0
#> text3 0 0 0 0 0 0 0 1 1
As you can see, this works well but isn't perfect. "va" from "ça va" is picked up as an English word, and so is "comment". What you want to do, then, is to find the right word list and/or clean it up. You can also think about removing texts from which too many words were removed; a sketch of that idea follows below.
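A minimal sketch of that last idea, assuming a 50% retention cutoff (an arbitrary threshold you would want to tune):
toks_all  <- tokens(dftest$text, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
toks_kept <- tokens_keep(toks_all, english_words, valuetype = "fixed")
# share of each document's tokens that survived the English word list
retention <- ntoken(toks_kept) / ntoken(toks_all)
# keep only documents that are mostly English, then stem and build the dfm
testDfm <- dfm(tokens_wordstem(toks_kept[retention >= 0.5]))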
The question isn't entirely clear as to whether you want to remove the non-English "rows" first, or the non-English words later. There are many cognates between European languages (words that occur as homographs in more than one language), so the tokens_keep() strategy will be imperfect.
You can remove non-English documents after detecting their language, using the cld3 library:
dftest <- data.frame(
id = 1:3,
text = c(
"Holla this is a spanish word",
"English online here",
"Bonjour, comment ça va?"
)
)
library("cld3")
subset(dftest, detect_language(dftest$text) == "en")
## id text
## 1 1 Holla this is a spanish word
## 2 2 English online here
And then feed this into quanteda::dfm().
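For example, reusing the tokens pipeline from the question (library(magrittr) is loaded here for %>%):
library("quanteda")
library("magrittr")
dftest_en <- subset(dftest, detect_language(dftest$text) == "en")
testDfm <- dftest_en$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_wordstem() %>%
  dfm()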