How to randomly select paragraphs from a corpus by excluding, from the randomization, those paragraphs that include a specific list of words?

I have a corpus. I want to randomly extract paragraphs from this corpus. However, the randomization must be done so that paragraphs containing certain specific words cannot be sampled.

Here is an example:

library("quanteda")

txt <- c("PAGE 1. A single sentence.  Short sentence. Three word sentence. \n\n Quarentine is hard",
         "PAGE 2. Very short! Shorter.\n\n quarantine is very very hard",
         "Very long sentence, with three parts, separated by commas.  PAGE 3.\n\n quarantine it's good tough to focus on paper.",
         "Fiscal policy is a bad thing. \n\n SO is a great place where skilled people solve coding problems.",
         "Fiscal policy is not as good as people may think",
         "Economics is fun. \n\n I prefer Macro.")
corp <- corpus(txt, docvars = data.frame(serial = 1:6))

Doing this without any restriction is straightforward:

reshaped <- corpus_reshape(corp, to = "paragraphs")
sampled <- corpus_sample(reshaped, size = 4)

# Result

[1] "Economics is fun."                                "Fiscal policy is not as good as people may think"
[3] "Fiscal policy is a bad thing."                    "Quarentine is hard"

As you can see, the randomization picked "paragraphs" that contain fiscal policy. I would like to sample from the corpus while excluding the paragraphs/sentences in which fiscal policy appears.

Can I delete the sentences associated with that word from the original dataset before sampling? How would you do it?

Note that in the real dataset I will need to exclude sentences containing more than just one or two keywords. So, please suggest something that can easily be extended to many words.

Many thanks!

If you have the texts, then you can subset with grepl to omit "fiscal policy" (or any other words) before creating the corpus.

txt2 <- txt[!grepl("fiscal policy|I am groot", tolower(txt))]
txt2

[1] "PAGE 1. A single sentence.  Short sentence. Three word sentence. \n\n Quarentine is hard"                             
[2] "PAGE 2. Very short! Shorter.\n\n quarantine is very very hard"                                                        
[3] "Very long sentence, with three parts, separated by commas.  PAGE 3.\n\n quarantine it's good tough to focus on paper."
[4] "Economics is fun. \n\n I prefer Macro."

Items 4 and 5 have now been excluded. Now do your sampling.
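Since you ask for something that scales to many words, here is a minimal sketch of the same grepl filter driven by a keyword vector: collapse the words into a single regex alternation. The `exclude_words` vector is illustrative; extend it with your own terms.

```r
txt <- c(
  "PAGE 1. A single sentence.  Short sentence. Three word sentence. \n\n Quarentine is hard",
  "PAGE 2. Very short! Shorter.\n\n quarantine is very very hard",
  "Very long sentence, with three parts, separated by commas.  PAGE 3.\n\n quarantine it's good tough to focus on paper.",
  "Fiscal policy is a bad thing. \n\n SO is a great place where skilled people solve coding problems.",
  "Fiscal policy is not as good as people may think",
  "Economics is fun. \n\n I prefer Macro."
)

# Any number of exclusion terms, joined into one pattern with "|"
exclude_words <- c("fiscal policy", "I am groot")
pattern <- paste(exclude_words, collapse = "|")

txt2 <- txt[!grepl(pattern, txt, ignore.case = TRUE)]
length(txt2)  # 4 texts remain
```

Adding a new keyword is then just appending to `exclude_words`; the pattern is rebuilt automatically.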

If you only have the corpus, then extract the texts and use the code above.

txt <- texts(corp)

If you want to exclude the paragraphs/sentences that contain "fiscal policy", you need to reshape the texts into paragraphs first, then filter out the paragraphs containing the excluded phrase, and only then sample.

If you filter the texts before creating the corpus, you will also exclude paragraphs that do not contain the filtered phrase but happen to come from an input text that does.
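To see why the order matters, here is a small base-R illustration, using `strsplit()` on the blank line as a stand-in for `corpus_reshape()`: filtering the whole document throws away a harmless paragraph that merely shares a document with the phrase.

```r
# One document: the first paragraph contains the phrase, the second does not.
doc <- "Fiscal policy is a bad thing. \n\n SO is a great place where skilled people solve coding problems."

# Filtering the whole document drops both paragraphs:
length(doc[!grepl("fiscal policy", doc, ignore.case = TRUE)])  # 0

# Splitting into paragraphs first keeps the innocent second paragraph:
paras <- trimws(unlist(strsplit(doc, "\n\n")))
paras[!grepl("fiscal policy", paras, ignore.case = TRUE)]
# "SO is a great place where skilled people solve coding problems."
```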

library("quanteda")
## Package version: 2.0.1
set.seed(10)

txt <- c(
  "PAGE 1. A single sentence.  Short sentence. Three word sentence. \n\n Quarentine is hard",
  "PAGE 2. Very short! Shorter.\n\n quarantine is very very hard",
  "Very long sentence, with three parts, separated by commas.  PAGE 3.\n\n quarantine it's good tough to focus on paper.",
  "Fiscal policy is a bad thing. \n\n SO is a great place where skilled people solve coding problems.",
  "Fiscal policy is not as good as people may think",
  "Economics is fun. \n\n I prefer Macro."
)

corp <- corpus(txt, docvars = data.frame(serial = 1:6)) %>%
  corpus_reshape(to = "paragraphs")
tail(corp)
## Corpus consisting of 6 documents and 1 docvar.
## text3.2 :
## "quarantine it's good tough to focus on paper."
## 
## text4.1 :
## "Fiscal policy is a bad thing."
## 
## text4.2 :
## "SO is a great place where skilled people solve coding proble..."
## 
## text5.1 :
## "Fiscal policy is not as good as people may think"
## 
## text6.1 :
## "Economics is fun."
## 
## text6.2 :
## "I prefer Macro."

Now we can subset based on the pattern match.

corp2 <- corpus_subset(corp, !grepl("fiscal policy", corp, ignore.case = TRUE))
tail(corp2)
## Corpus consisting of 6 documents and 1 docvar.
## text2.2 :
## "quarantine is very very hard"
## 
## text3.1 :
## "Very long sentence, with three parts, separated by commas.  ..."
## 
## text3.2 :
## "quarantine it's good tough to focus on paper."
## 
## text4.2 :
## "SO is a great place where skilled people solve coding proble..."
## 
## text6.1 :
## "Economics is fun."
## 
## text6.2 :
## "I prefer Macro."

corpus_sample(corp2, size = 4)
## Corpus consisting of 4 documents and 1 docvar.
## text6.2 :
## "I prefer Macro."
## 
## text1.2 :
## "Quarentine is hard"
## 
## text2.2 :
## "quarantine is very very hard"
## 
## text3.2 :
## "quarantine it's good tough to focus on paper."

包含 "fiscal policy" 的段落已消失。

Note that I used grepl() here, but a better high-performance alternative is stri_detect_fixed() from stringi (or the equivalent stringr wrapper str_detect()). These give you more control, including faster fixed-pattern matching, while also letting you control case sensitivity.

all.equal(
  grepl("fiscal policy", txt, ignore.case = TRUE),
  stringi::stri_detect_fixed(txt, "fiscal policy", case_insensitive = TRUE)
)
## [1] TRUE
all.equal(
  grepl("fiscal policy", txt, ignore.case = TRUE),
  stringr::str_detect(txt, stringr::fixed("fiscal policy", ignore_case = TRUE))
)
## [1] TRUE
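If some of your keywords contain regex metacharacters (e.g. "+" or "("), fixed-string matching is also safer than a regex alternation. A base-R sketch that combines fixed matches over a whole keyword vector (the vector here is illustrative):

```r
txt <- c(
  "Fiscal policy is a bad thing.",
  "I like C++ (a lot).",
  "Economics is fun."
)
exclude_words <- c("fiscal policy", "c++ (a lot)")

# fixed = TRUE treats "+" and "(" literally; lower-casing both sides
# emulates case-insensitive matching. Reduce() ORs the per-word hits.
hits <- Reduce(`|`, lapply(exclude_words, function(w) {
  grepl(w, tolower(txt), fixed = TRUE)
}))
txt[!hits]  # only "Economics is fun." survives
```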