Remove 2 stopwords lists with Quanteda package R

Question

I'm working with the quanteda package on a corpus data frame, and here is the basic code I use:

library(quanteda)

fmsi_des <- dfm(corpus_des, remove=stopwords("spanish"), verbose=TRUE,
                remove_punct=TRUE, remove_numbers=TRUE)

However, I have another list of stopwords, stored as a data frame called stpw, that I would also like to take into account.

I tried:

fmsi_des <- dfm(corpus_des, remove=stopwords("spanish","stpw"), verbose=TRUE,
                remove_punct=TRUE, remove_numbers=TRUE)

Error in stopwords("spanish", "stpw") : unused argument ("stpw")

Then I created a list combining the "spanish" stopwords with the stopwords from stpw:

all_stops <- c("bogota","vias","medellin","valle","departamento",stopwords("spanish"))

fmsi_des <- dfm(corpus_des, remove=stopwords("all_stops"), verbose=TRUE,
                remove_punct=TRUE, remove_numbers=TRUE)

Error in stopwords("all_stops") : no stopwords available for 'all_stops'

I also created a txt file with my stopwords, in order to try:

library(tm)

stopwords = readLines('stpw.txt') 
x  = fd$contract_description        
x  =  removeWords(x,stopwords)

des <- subset(x, !is.na(x))
corpus_des <- corpus(des$fd.contract_description)
fmsi_des <- dfm(corpus_des, remove=stopwords("spanish"), verbose=TRUE,
                remove_punct=TRUE, remove_numbers=TRUE)

Warning message: In readLines("stp.txt") : Incomplete final line found in 'stpw.txt'

Error in gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : incorrect regular expression '(*UCP)\b(bogota|vias|medellin|valle|departamento|+)\b' In addition : Warning message: In gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : PCRE pattern compilation error 'nothing to repeat' at '+)\b'

Answer 1

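This is a case where knowing the value of return objects in R is the key to getting the result you want. Specifically, you need to know what stopwords() returns and what its first argument is. That first argument is a language name, not the name of an R object, which is why stopwords("all_stops") fails with "no stopwords available for 'all_stops'".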

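stopwords(language = "es"), or equivalently stopwords("spanish") as in the question, returns a character vector of Spanish stopwords, using the default source = "snowball" list. (See ?stopwords for details.)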
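To make the return value concrete, here is a small sketch for inspecting it. It assumes quanteda is installed and uses the ISO code "es" for Spanish.

```r
library(quanteda)

# stopwords() returns an ordinary character vector, one word per element
sw <- stopwords("es")

class(sw)    # the type of the returned object
length(sw)   # how many stopwords the snowball list contains
head(sw)     # the first few entries
```

Because the result is just a character vector, anything that works on character vectors, such as c(), unique() or setdiff(), can be used to extend or trim it.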

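So if you want to remove the default Spanish list plus your own words, you can concatenate the returned character vector with additional elements. This is what you did when you created all_stops.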
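Since the extra stopwords in the question live in a data frame (stpw) or in a text file, the following sketch shows one way to turn either into a character vector before concatenating. The column name word, the file contents and the temporary file are assumptions made for illustration, not details taken from the question.

```r
library(quanteda)

# Hypothetical stand-in for the asker's stpw data frame; the column name
# `word` is an assumption, so substitute the real column name.
stpw <- data.frame(
  word = c("bogota", "vias", "medellin", "valle", "departamento"),
  stringsAsFactors = FALSE
)

# Alternative: one stopword per line in a text file. A temporary file is
# written here only so the sketch runs on its own.
stop_file <- tempfile(fileext = ".txt")
writeLines(c("bogota", "vias", "medellin"), stop_file)
file_stops <- trimws(readLines(stop_file, warn = FALSE))
file_stops <- file_stops[nzchar(file_stops)]  # drop empty lines

# Concatenate everything into one character vector and drop duplicates
all_stops <- unique(c(stpw$word, file_stops, stopwords("es")))
```

The resulting object is then passed directly, as all_stops, rather than wrapped in another call such as stopwords("all_stops").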

So to remove all_stops, using the usage recommended for quanteda v3, you simply do the following:

fmsi_des <- corpus_des %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
    tokens_remove(pattern = all_stops) %>%
    dfm()
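For anyone who wants to try this without the original data, here is a self-contained sketch of the same tokens, tokens_remove, dfm sequence on a toy corpus. The two example texts are invented, and the steps are written as separate calls so that no pipe operator needs to be loaded.

```r
library(quanteda)

# Toy corpus standing in for corpus_des; the texts are made up
corpus_des <- corpus(c(
  d1 = "Mantenimiento de las vias en Bogota, 2019.",
  d2 = "Construccion de un puente en el departamento del Valle."
))

# Custom words plus the Spanish snowball list, as in the question
all_stops <- c("bogota", "vias", "medellin", "valle", "departamento",
               stopwords("es"))

# Tokenize, dropping punctuation and numbers
toks <- tokens(corpus_des, remove_punct = TRUE, remove_numbers = TRUE)

# Remove every token that matches an entry in all_stops
# (matching is case-insensitive by default)
toks <- tokens_remove(toks, pattern = all_stops)

# Build the document-feature matrix from the remaining tokens
fmsi_des <- dfm(toks)
featnames(fmsi_des)
```

On the tm attempt in the question: the "nothing to repeat" error shows that the "+" entry in the list was interpreted as a regular expression quantifier. tokens_remove() uses valuetype = "glob" by default, so its patterns are not treated as raw regular expressions.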

r

corpus

text-mining

stop-words

quanteda