Discard longer dictionary matches which contain a nested target word

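I am using tokens_lookup to check whether some texts contain the words in my dictionary. Now I am trying to find a way to discard the matches that occur when a dictionary word is part of an ordered sequence of words. For example, suppose Ireland is in the dictionary. I would like to exclude the cases where Northern Ireland is mentioned (or any fixed set of words that contains Britain). The only indirect solution I have come up with is to build another dictionary with these sets of words (e.g. Great Britain). However, that solution would not work when both Britain and Great Britain are mentioned. Thank you.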

library("quanteda")

dict <- dictionary(list(IE = "Ireland"))

txt <- c(
  doc1 = "Ireland lorem ipsum",
  doc2 = "Lorem ipsum Northern Ireland",
  doc3 = "Ireland lorem ipsum Northern Ireland"
)

toks <- tokens(txt)

tokens_lookup(toks, dictionary = dict)

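You can do this by specifying another dictionary key for "Northern Ireland", whose value is also "Northern Ireland". If you use the argument nested_scope = "dictionary" in tokens_lookup(), the longer phrase is matched first and only once, which separates "Ireland" from "Northern Ireland". Because the key is the same as the value, the phrase is replaced by itself, like this (with the side benefit that the two tokens "Northern" and "Ireland" are now combined into a single token).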

library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

dict <- dictionary(list(IE = "Ireland", "Northern Ireland" = "Northern Ireland"))

txt <- c(
  doc1 = "Ireland lorem ipsum",
  doc2 = "Lorem ipsum Northern Ireland",
  doc3 = "Ireland lorem ipsum Northern Ireland"
)

toks <- tokens(txt)

tokens_lookup(toks,
  dictionary = dict, exclusive = FALSE,
  nested_scope = "dictionary", capkeys = FALSE
)
## Tokens consisting of 3 documents.
## doc1 :
## [1] "IE"    "lorem" "ipsum"
## 
## doc2 :
## [1] "Lorem"            "ipsum"            "Northern Ireland"
## 
## doc3 :
## [1] "IE"               "lorem"            "ipsum"            "Northern Ireland"

I used exclusive = FALSE here for illustration, so that you can see what was looked up and replaced. When you run it, you can remove that argument and the capkeys argument.

If you want to discard the "Northern Ireland" tokens, just use

tokens_lookup(toks, dictionary = dict, nested_scope = "dictionary") %>%
  tokens_remove("Northern Ireland")
## Tokens consisting of 3 documents.
## doc1 :
## [1] "IE"
## 
## doc2 :
## character(0)
## 
## doc3 :
## [1] "IE"
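
The same pattern should carry over to the Britain / Great Britain case raised in the question. The following is only a sketch under assumptions: the key name GB and the example texts are made up for illustration, and it uses only the functions already shown above (dictionary(), tokens(), tokens_lookup() with nested_scope = "dictionary", and tokens_remove()).

```r
library("quanteda")

# Hypothetical dictionary: "GB" is an illustrative key for the target word,
# and the longer phrase is given a key identical to its value.
dict_gb <- dictionary(list(
  GB = "Britain",
  "Great Britain" = "Great Britain"
))

# Made-up example texts, mirroring the Ireland example.
txt_gb <- c(
  doc1 = "Britain lorem ipsum",
  doc2 = "Lorem ipsum Great Britain",
  doc3 = "Britain lorem ipsum Great Britain"
)

toks_gb <- tokens(txt_gb)

# The longer phrase is matched first, so the "Britain" nested inside
# "Great Britain" is not also counted under GB.
looked_up <- tokens_lookup(toks_gb,
  dictionary = dict_gb,
  nested_scope = "dictionary"
)

# Drop the placeholder tokens for the longer phrase, keeping only GB matches.
res_gb <- tokens_remove(looked_up, "Great Britain")
print(res_gb)
```

If there are several longer phrases to exclude, each one can be added to the dictionary as its own self-named key and then passed to tokens_remove() as a character vector.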