Tell `kwic()` to ignore stopwords when situating keywords in context?

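I have another question about the kwic() function in the quanteda package. I want to extract the five words on either side of specific keywords (in the example below, "stack overflow" and "radio star"). However, once stopwords have been removed during tokenization, kwic() no longer returns a true window of 5 words before and after the keyword, but fewer words than that. Is there a way to tell kwic() to ignore stopwords when computing the keyword's context?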

Reprex below:

library(quanteda)

speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of. Now I am also adding a few words that would not be removed as stopwords, as follows: Maintenance, Television, Superstar, Textual Analysis. Video killed the radio star is another sentence I would like to include.", 
           "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow. Once again adding some non-stopwords: Maintenance, television, superstar, textual analysis. Video killed the radio star is another sentence I would like to include.", 
           "Finally, this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech. Here are some more non-stopwords: Maintenance, television, superstar, textual analysis")

data <- data.frame(id=1:3, 
                   speechContent = speech)

test_corpus <- corpus(data,
                      docid_field = "id",
                      text_field = "speechContent")

test_tokens <- tokens(test_corpus, 
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_remove(stopwords("en"), padding = TRUE) %>%
  tokens_compound(pattern = phrase(c("stack overflow*", "radio star*")),
                  concatenator = " ")

test_kwic <- kwic(test_tokens,
                  pattern = c("stack overflow", "radio star"),
                  window = 5)

As @phiver suggested, using padding = FALSE when removing stopwords solved the problem. Thanks!
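For anyone landing here later, a minimal sketch of the difference, assuming a recent quanteda version. With padding = TRUE, tokens_remove() leaves an empty string at each position where a stopword was removed, and those empty slots appear to count toward the kwic() window, which would explain why fewer real words come back. The one-sentence text and the names txt, toks, toks_pad, toks_nopad, kw_pad and kw_nopad are made up for illustration; they are not from the reprex above.

```r
library(quanteda)

# Hypothetical one-document example, shortened from the speeches above
txt <- c(d1 = "Video killed the radio star is another sentence I would like to include.")

toks <- tokens(txt, remove_punct = TRUE)

# padding = TRUE: removed stopwords leave "" placeholders behind
toks_pad <- tokens_remove(toks, stopwords("en"), padding = TRUE)
toks_pad <- tokens_compound(toks_pad, pattern = phrase("radio star"),
                            concatenator = " ")

# padding = FALSE: removed stopwords are dropped entirely
toks_nopad <- tokens_remove(toks, stopwords("en"), padding = FALSE)
toks_nopad <- tokens_compound(toks_nopad, pattern = phrase("radio star"),
                              concatenator = " ")

# Same keyword and window size on both token objects
kw_pad   <- kwic(toks_pad,   pattern = "radio star", window = 5)
kw_nopad <- kwic(toks_nopad, pattern = "radio star", window = 5)

print(kw_pad)    # context should include empty slots where stopwords were
print(kw_nopad)  # context should consist of non-stopwords only
```

One thing to keep in mind: with padding = FALSE the window spans the remaining tokens only, so the context words may have been further from the keyword in the original text than the window size suggests.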