Tokenization of Compound Words not Working in Quanteda

Question

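I am trying to use the kwic() function to create a data frame of specific keywords in context, but unfortunately I am running into problems when tokenizing the underlying dataset.

This is the subset of the dataset I am using as a reproducible example: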

test_cluster <- speeches_subset %>%
  filter(grepl('Schwester Agnes',
                speechContent,
                ignore.case = TRUE))

test_corpus <- corpus(test_cluster,
                      docid_field = "id",
                      text_field = "speechContent")

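Here, test_cluster contains six observations of 12 variables, i.e. the six rows whose speechContent column contains the compound word "Schwester Agnes". test_corpus converts the underlying data into a quanteda corpus object.

When I then run the code below, I expect that, first, the content of the speechContent variable is tokenized and, because of tokens_compound(), the compound word "Schwester Agnes" is kept as a single token. In a second step, I expect kwic() to return a data frame with six rows, in which the keyword variable contains the compound word "Schwester Agnes". Instead, kwic() returns an empty data frame with 0 observations of 7 variables. I think this is because I made a mistake when using tokens_compound(), but I'm not sure... Any help would be much appreciated!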

test_tokens <- tokens(test_corpus, 
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = phrase("Schwester Agnes"))

test_kwic <- kwic(test_tokens,
                  pattern = "Schwester Agnes",
                  window = 5)

Edit: I realize the example above is not easy to reproduce, so please refer to the reprex below:

speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")

data <- data.frame(id=1:3, 
                   speechContent = speech)

test_corpus <- corpus(data,
                      docid_field = "id",
                      text_field = "speechContent")

test_tokens <- tokens(test_corpus, 
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = c("stack", "overflow"))

test_kwic <- kwic(test_tokens,
                  pattern = "stack overflow",
                  window = 5)

Answer 1

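You need to apply phrase("stack overflow") and set concatenator = " " in tokens_compound(). phrase() tells quanteda to treat "stack overflow" as a sequence of two tokens rather than as separate one-word patterns. Since the default concatenator is "_", the compound would otherwise be stored as "stack_overflow", which the kwic() pattern "stack overflow" does not match.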
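To see why the original call comes back empty, it can help to look at the token types after compounding. The following is a minimal sketch on a made-up one-sentence text (txt and the object names are illustrative, not taken from the question); it relies on the documented default concatenator "_" of tokens_compound().

```r
library(quanteda)

# Illustrative text, not the asker's data
txt <- "One relevant word is the word stack overflow"
toks <- tokens(txt, remove_punct = TRUE)

# phrase() makes this a two-token sequence; the default concatenator is "_"
joined <- tokens_compound(toks, pattern = phrase("stack overflow"))

# List the distinct token types to see how the compound is stored
types(joined)

# A plain character pattern is matched against single tokens, so
# "stack overflow" (with a space) finds no matching token here
empty_kwic <- kwic(joined, pattern = "stack overflow", window = 5)
nrow(as.data.frame(empty_kwic))
```

Checking types() first is a cheap way to confirm what kwic() will actually be matching against.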

require(quanteda)
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1

speech <- c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of", 
           "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow.", 
           "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")

data <- data.frame(id = 1:3, 
                   speechContent = speech)

test_corpus <- corpus(data,
                      docid_field = "id",
                      text_field = "speechContent")

test_tokens <- tokens(test_corpus, 
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = phrase("stack overflow"), concatenator = " ")

test_kwic <- kwic(test_tokens,
                  pattern = "stack overflow",
                  window = 5)
test_kwic
#> Keyword-in-context with 2 matches.                                                                             
#>  [1, 29] for example is the word | stack overflow | However there are so many
#>  [2, 24]     but at the very end | stack overflow |

Created on 2022-05-06 by the reprex package (v2.0.1)
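Two other variants should give the same two matches without changing the concatenator. This is only a sketch on shortened, made-up texts; the object names (toks, kwic_a, kwic_b, ...) are hypothetical. Variant A skips compounding and passes phrase() directly to kwic(); variant B keeps the default "_" concatenator and searches for the joined token.

```r
library(quanteda)

# Shortened illustrative texts, not the original speeches
speech <- c("One relevant word is the word stack overflow.",
            "It includes the word of interest at the very end. stack overflow.",
            "This speech does not include the word of interest.")
toks <- tokens(speech, remove_punct = TRUE, remove_numbers = TRUE)

# Variant A: no compounding; kwic() matches the two-token sequence itself
kwic_a <- kwic(toks, pattern = phrase("stack overflow"), window = 5)

# Variant B: compound with the default "_" and search for the joined token
toks_b <- tokens_compound(toks, pattern = phrase("stack overflow"))
kwic_b <- kwic(toks_b, pattern = "stack_overflow", window = 5)

# Both results can be converted to ordinary data frames
df_a <- as.data.frame(kwic_a)
df_b <- as.data.frame(kwic_b)
```

Variant A leaves the tokens object untouched, which may be preferable if the same tokens are reused for other analyses; variant B is closer to the original code and keeps the compound as one feature for later steps.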

