In Quanteda, how can we match quotation marks literally?

Question

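A short question: I am trying to match quotation marks in tokenized sentences using Quanteda's tokens_lookup() function with valuetype="regex". Based on the information provided here about the regex flavor Quanteda uses, I thought \Q ... \E would be the way to go, but that did not work.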

library(quanteda) 
# package version: 1.5.2

text <- c("text „some quoted text“ more text", "text « some quoted text » more text")

dict <- dictionary(list(MY_KEY = c("\Q*\E")))
# Error: '\Q' is an unrecognized escape in character string starting ""\Q"

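I also tried matching the quotation mark "“" directly, which at least appears to be a valid regex pattern, but that did not work either. Neither did variants of \Q...\E with double backslashes, as they are used for word boundaries, e.g. (\\b).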

So I think the more general question is whether the regex flavor described here is compatible with what Quanteda understands as valuetype="regex".

Edit:

This works for the first string, but not for the second.

dict <- dictionary(list(MY_KEY = c(".\".")))

Answer 1

Could this be a language or locale issue? Your "quotation marks" do not look like quotation marks on my screen, and when I change the pattern I can find them.

library(quanteda) 
#> Package version: 2.0.1

text <- c("text „some quoted text“ more text", "text « some quoted text » more text")

dict <- dictionary(list(found_it = c("„"), found_other = c("«")))

toks2 <- tokens(text)
tokens_lookup(toks2, dict)

#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "found_it"
#> 
#> text2 :
#> [1] "found_other"
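If the marks render differently from one machine to another, one way to sidestep the ambiguity is to write them as Unicode escapes and match them with valuetype = "fixed", so no regex interpretation is involved at all. The following is only a sketch under that assumption; the key names low_quote and guillemet are made up for illustration.

```r
library(quanteda)

# Same two sentences, with the quotation marks spelled as code points:
# U+201E and U+201C in the first, U+00AB and U+00BB in the second.
text <- c("text \u201esome quoted text\u201c more text",
          "text \u00ab some quoted text \u00bb more text")

# Show which code points a string really contains (stringi is a quanteda dependency).
escaped <- stringi::stri_escape_unicode(text[1])

# Hypothetical key names; the values are the literal characters to look up.
dict <- dictionary(list(
  low_quote = "\u201e",
  guillemet = c("\u00ab", "\u00bb")
))

toks <- tokens(text)

# "fixed" compares each token to the dictionary value exactly, with no wildcards.
res <- tokens_lookup(toks, dict, valuetype = "fixed")
print(res)
```

Printing escaped is a quick way to check whether the character in your data is really the one you typed into the pattern.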

Answer 2

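Regular expressions in quanteda are built on the stringi package, which supports Unicode character categories. You can retrieve all quotation marks by using these categories in the search pattern:

Ps, Pe - punctuation, open and close
Pi, Pf - punctuation, initial and final quote

I included all four because, for example, „ is in Ps but not in Pi, while « is in Pi but not in Ps.

More details here.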

library("quanteda")
## Package version: 2.0.1

text <- c(
  "text „some quoted text“ more text",
  "text « some quoted text » more text"
)
toks <- tokens(text)

# Backslashes must be doubled inside an R string literal
tokens_select(toks, "[\\p{Pf}\\p{Pi}\\p{Ps}\\p{Pe}]", valuetype = "regex")
## Tokens consisting of 2 documents.
## text1 :
## [1] "„"
## 
## text2 :
## [1] "«" "»"
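Since the pattern is handed on to stringi, it can be tried there directly, which also shows why the \Q...\E attempt in the question failed at the R parser rather than in the regex engine. This is a minimal sketch, assuming stringi is installed; the variable names are illustrative.

```r
library(stringi)

text <- c("text \u201esome quoted text\u201c more text",
          "text \u00ab some quoted text \u00bb more text")

# Each regex backslash is written twice in R source, so the engine sees \p{Pi} etc.
quote_class <- "[\\p{Pi}\\p{Pf}\\p{Ps}\\p{Pe}]"

# Does each string contain at least one character from these categories?
has_quote <- stri_detect_regex(text, quote_class)

# Pull out every matching character per string.
found <- stri_extract_all_regex(text, quote_class)

# \Q...\E quoting works the same way once the backslashes are doubled:
# here the asterisk is matched literally instead of acting as a quantifier.
literal_hit <- stri_detect_regex("a*b", "\\Q*\\E")

print(has_quote)
print(found)
print(literal_hit)
```

Note that Ps and Pe also cover brackets and parentheses, so on real text a narrower class such as "[\\p{Pi}\\p{Pf}]" plus the specific low-quote characters may be preferable.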
