How do I create two subsets out of a corpus based on multiple keywords?

Question

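I am working with a large number of political speeches in quanteda and would like to create two subsets. The first should contain the speeches that include one or more terms from a specific keyword list (e.g. "migrant*", "migration*", "asylum*"). The second should contain the documents that include none of these terms (the speeches that do not fall into the first subset).

Any input would be greatly appreciated. Thanks!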

#first suggestion
> corp_labcon$criteria <- ifelse(stringi::stri_detect_regex(corp_labcon, pattern=paste0(regex_pattern), ignore_case = TRUE, collapse="|"), "yes", "no")

Warning messages:
1: In (function (case_insensitive, comments, dotall, dot_all = dotall,  :
  Unknown option to `stri_opts_regex`.
2: In stringi::stri_detect_regex(corp_labcon, pattern = paste0(regex_pattern),  :
  longer object length is not a multiple of shorter object length
  
> table(corp_labcon$criteria)

    no    yes 
556921   6139 

#Second suggestion
> corp_labcon$criteria <- ifelse(stringi::stri_detect_regex(corp_labcon, pattern = paste0(glob2rx(regex_pattern), collapse = "|")), "yes","no")

> table(corp_labcon$criteria)

    no 
563060

Answer 1

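I'm not sure how your data is organized, but you could try grepl(), the logical counterpart of grep(). Assuming your data is a data frame in which each row is one text, you could try:

words <- c("migrant", "migration", "asylum")
pattern <- paste(words, collapse = "|") # grepl() takes a single pattern, so join the words with "|"
hits <- grepl(pattern, df$text)         # TRUE where a row contains at least one of the words

df[hits, ]  # This will give you those lines with the words
df[!hits, ] # This will give you those lines without the words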

However, your data may not be structured this way! You should explain more clearly what your data looks like.

Answer 2

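You didn't provide a reproducible example, but I will show how to do this with quanteda and the corpus data_corpus_inaugural that ships with it. You can use document variables (docvars), which can be attached to a corpus. This works just like adding a variable to a data.frame.

With stringi::stri_detect_regex you check, for each document, whether the text contains any of the words you are looking for. If it does, the value in the criteria column is set to "yes", otherwise to "no". After that, you can use corpus_subset to create two new corpora based on the criteria value. See the example code below.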

library(quanteda)

# words used in regex selection
regex_pattern <- c("migrant*", "migration*", "asylum*")

# add selection to corpus
data_corpus_inaugural$criteria <- ifelse(
  stringi::stri_detect_regex(data_corpus_inaugural,
                             pattern = paste0(regex_pattern, collapse = "|")),
  "yes", "no"
)

# Check docvars and new criteria column
head(docvars(data_corpus_inaugural))
  Year  President FirstName                 Party criteria
1 1789 Washington    George                  none      yes
2 1793 Washington    George                  none       no
3 1797      Adams      John            Federalist       no
4 1801  Jefferson    Thomas Democratic-Republican       no
5 1805  Jefferson    Thomas Democratic-Republican       no
6 1809    Madison     James Democratic-Republican       no

# split corpus into segment 1 and 2
segment1 <- corpus_subset(data_corpus_inaugural, criteria == "yes")
segment2 <- corpus_subset(data_corpus_inaugural, criteria == "no")
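A note on the pattern itself, as a hedged aside: "migrant*" is glob syntax, and when it is read as a regular expression the trailing "*" means "zero or more of the preceding character" rather than "any ending". Also, stringi's case option is named case_insensitive, and collapse has to be an argument of paste0(), not of stri_detect_regex(). The following is a minimal sketch on a made-up character vector (the texts and the names texts, stems, hit, with_terms and without_terms are illustrative assumptions, not from the question) showing one way to match keyword stems at the start of a word, ignoring case:

```r
library(stringi)

# Hypothetical toy texts standing in for speeches (not the original data)
texts <- c(
  d1 = "Migration policy was debated at length.",
  d2 = "The budget was approved without changes.",
  d3 = "Asylum seekers and migrants were mentioned.",
  d4 = "We must stay calm."
)

# Keyword stems without the glob "*"; "\\b" anchors each stem at the
# start of a word, so "migrant" also matches "migrants" but not "immigrant"
stems <- c("migrant", "migration", "asylum")
pattern <- paste0("\\b(", paste(stems, collapse = "|"), ")")

# case_insensitive is passed through to stri_opts_regex()
hit <- stri_detect_regex(texts, pattern, case_insensitive = TRUE)

with_terms    <- texts[hit]   # texts containing at least one keyword
without_terms <- texts[!hit]  # texts containing none of the keywords
```

The same logical vector could be assigned to a docvar such as criteria and then passed to corpus_subset as in the code above.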
