Replace quanteda tokens through regex

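I would like to explicitly replace specific tokens defined in an object of class tokens from the quanteda package. I have not been able to replicate the standard approach that works with stringr.

The objective is to replace every token of the form "XXXof" with two tokens of the form c("XXX", "of").

Please see the minimal example below: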

suppressPackageStartupMessages(library(quanteda))
library(stringr)

text = "It was a beautiful day down to the coastof California."

# I would solve this with stringr as follows: 
text_stringr = str_replace( text, "(^.*?)(of)", "\\1 \\2" )
text_stringr
#> [1] "It was a beautiful day down to the coast of California."

# I fail to find a similar solution with quanteda that works on objects of class tokens
tok = tokens( text )

# I want to replace "coastof" with "coast" and "of"
tokens_replace( tok, "(^.*?)(of)", "\\1 \\2", valuetype = "regex" )
#> Tokens consisting of 1 document.
#> text1 :
#>  [1] "It"         "was"        "a"          "beautiful"  "day"       
#>  [6] "down"       "to"         "the"        "\\1 \\2"   "California"
#> [11] "."

Is there a workaround?

Created on 2021-03-16 by the reprex package (v1.0.0)

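You can use a mixed approach: build a list of the words that need to be split, together with their split forms, and then use tokens_replace() to perform the replacement. The advantage is that you can curate the list before applying it, so you can check that you have not caught replacements you do not want to make.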

suppressPackageStartupMessages(library("quanteda"))

toks <- tokens("It was a beautiful day down to the coastof California.")

keys <- as.character(tokens_select(toks, "(^.*?)(of)", valuetype = "regex"))
# Insert a space before "of", then split each key into its two parts
vals <- strsplit(stringr::str_replace(keys, "(^.*?)(of)", "\\1 \\2"), " ")

keys
## [1] "coastof"
vals
## [[1]]
## [1] "coast" "of"

tokens_replace(toks, keys, vals)
## Tokens consisting of 1 document.
## text1 :
##  [1] "It"         "was"        "a"          "beautiful"  "day"       
##  [6] "down"       "to"         "the"        "coast"      "of"        
## [11] "California" "."
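
The curation step can be sketched in base R, assuming you have already pulled the token types out as a character vector (for example with as.character(toks)). The vector `types`, the stricter pattern `^(.+)(of)$`, and the hand-picked exclusion of "proof" are all illustrative assumptions, not part of the answer above. The point is that the looser pattern "(^.*?)(of)" also matches tokens such as "of" and "office", so it helps to inspect the candidates first:

```r
# Hypothetical token types, standing in for as.character(toks)
types <- c("coastof", "of", "office", "often", "California", "proof")

# Stricter pattern (an assumption): at least one character, then a trailing "of"
pattern <- "^(.+)(of)$"

# Candidate keys and their split forms
keys <- types[grepl(pattern, types)]
vals <- strsplit(sub(pattern, "\\1 \\2", keys), " ", fixed = TRUE)

# Inspect the candidates before applying anything
print(data.frame(
  key = keys,
  replacement = vapply(vals, paste, character(1), collapse = " + ")
))

# Drop false positives by hand; "proof" is a real word that should stay whole
keep <- !keys %in% c("proof")
keys <- keys[keep]
vals <- vals[keep]
```

The curated `keys` and `vals` can then be passed to tokens_replace(toks, keys, vals) exactly as in the example above.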
