Find non-English tokens in a dfm and remove them

How can I detect non-English words in a dfm and remove them?

dftest <- data.frame(id = 1:3, 
                     text = c("Holla this is a spanish word", 
                              "English online here", 
                              "Bonjour, comment ça va?"))

The dfm is constructed like this:

testDfm <- dftest$text %>%
             tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
             tokens_wordstem() %>%
             dfm()

I found the textcat package as an alternative solution, but on a real dataset there are many cases where an entire row is in English and it classifies the row as another language based on a single token. Is there another way, using quanteda, to find non-English rows in a data frame or in the tokens of a dfm?

You can do this with a word list that contains all English words. One exists in the hunspell package, where it is used for spell checking.

library(quanteda)
# find the path in which the right dictionary file is stored
hunspell::dictionary(lang = "en_US")
#> <hunspell dictionary>
#>  affix: /home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.aff 
#>  dictionary: /home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.dic 
#>  encoding: UTF-8 
#>  wordchars: ’ 
#>  added: 0 custom words

# read this into a vector
english_words <- readLines("/home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.dic") %>% 
# the vector contains extra information on the words, which is removed
  gsub("/.+", "", .)
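The path above is specific to my machine. As a sketch, `system.file()` can resolve the dictionary location portably on any installation (this assumes the dictionary files ship inside the hunspell package, which they do in current versions):

```r
# locate the dictionary file inside the installed hunspell package
dic_path <- system.file("dict", "en_US.dic", package = "hunspell")

# read the word list and strip the affix flags that follow "/"
english_words <- sub("/.*$", "", readLines(dic_path))
```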

# let's display a sample of the words
set.seed(1)
sample(english_words, 50)
#>  [1] "furnace"          "steno"            "Hadoop"           "alumna"          
#>  [5] "gonorrheal"       "multichannel"     "biochemical"      "Riverside"       
#>  [9] "granddad"         "glum"             "exasperation"     "restorative"     
#> [13] "appropriate"      "submarginal"      "Nipponese"        "hotting"         
#> [17] "solicitation"     "pillbox"          "mealtime"         "thunderbolt"     
#> [21] "chaise"           "Milan"            "occidental"       "hoeing"          
#> [25] "debit"            "enlightenment"    "coachload"        "entreating"      
#> [29] "grownup"          "unappreciative"   "egret"            "barre"           
#> [33] "Queen"            "Tammany"          "Goodyear"         "horseflesh"      
#> [37] "roar"             "fictionalization" "births"           "mediator"        
#> [41] "resitting"        "waiter"           "instructive"      "Baez"            
#> [45] "Muenster"         "sleepless"        "motorbike"        "airsick"         
#> [49] "leaf"             "belie"

With this vector, which in theory should contain all English words but only English words, we can remove the non-English tokens:

testDfm <- dftest$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)  %>%
  tokens_keep(english_words, valuetype = "fixed") %>% 
  tokens_wordstem() %>%
  dfm()

testDfm
#> Document-feature matrix of: 3 documents, 9 features (66.7% sparse).
#>        features
#> docs    this a spanish word english onlin here comment va
#>   text1    1 1       1    1       0     0    0       0  0
#>   text2    0 0       0    0       1     1    1       0  0
#>   text3    0 0       0    0       0     0    0       1  1

As you can see, this works quite well, but it isn't perfect. "va" from "ça va" was picked up as an English word, as was "comment". So what you want to do is find the right word list and/or clean it up. You could also think about removing texts in which too many words were removed.
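One way to implement that last suggestion, reusing `dftest` and `english_words` from above, is to compare each document's token count before and after `tokens_keep()` and drop documents that lost too many tokens. This is only a sketch; the 0.5 threshold is an arbitrary assumption you would tune for your data:

```r
library("quanteda")

# tokenise once, then keep only words found in the English word list
toks_all <- tokens(dftest$text, remove_punct = TRUE,
                   remove_numbers = TRUE, remove_symbols = TRUE)
toks_en  <- tokens_keep(toks_all, english_words, valuetype = "fixed")

# share of each document's tokens that survived the English filter
share_en <- ntoken(toks_en) / ntoken(toks_all)

# keep only documents where at least half of the tokens are English
testDfm <- dfm(tokens_wordstem(toks_en[share_en >= 0.5]))
```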

The question is not entirely clear as to whether you want to remove the non-English "rows" first, or remove the non-English words later. There are a lot of cognates between European languages (homographs that occur in more than one language), so the tokens_keep() strategy will be imperfect.

You could remove the non-English documents after detecting their language, using the cld3 library:

dftest <- data.frame(
  id = 1:3,
  text = c(
    "Holla this is a spanish word",
    "English online here",
    "Bonjour, comment ça va?"
  )
)
library("cld3")
subset(dftest, detect_language(dftest$text) == "en")
##   id                         text
## 1  1 Holla this is a spanish word
## 2  2          English online here

Then feed this into quanteda::dfm().
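Putting both steps together might look like this (a sketch; note that `detect_language()` can return `NA` on very short texts, and `subset()` drops those rows as well):

```r
library("cld3")
library("quanteda")

dftest <- data.frame(
  id = 1:3,
  text = c(
    "Holla this is a spanish word",
    "English online here",
    "Bonjour, comment ça va?"
  )
)

# keep only the documents detected as English
dftest_en <- subset(dftest, detect_language(dftest$text) == "en")

# build the dfm from the remaining documents
testDfm <- dftest_en$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_wordstem() %>%
  dfm()
```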