How to use regex with kwic to get all matches?

Question

I can't seem to get the output I want from quanteda's kwic. Here is what I've tried:

library(quanteda)
library(tidyverse)

Given this text

text <- "This is a phone number: 222-222-2222. Here's another phone number...(111)111 1111. This -- 333-3333 -- aint a complete phone number."

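Here is a regex that matches most US phone numbers, plus one non-digit character on each side of the number (backslashes must be doubled inside an R string):

regex.phone1 <- "\\D\\(?\\d{3}\\)?[.\\s-]?\\s*\\d{3}[.\\s-]?\\s*[.\\s-]*\\d{4}\\D"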

It matches the first number here, which suggests the regex works as expected.

regmatches(text,regexpr(regex.phone1,text))

" 222-222-2222."
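Note that `regexpr()` reports only the first match. As a minimal base-R sketch (not part of the original attempt), `gregexpr()` can be swapped in to collect every match of the same pattern. `perl = TRUE` is an assumption added here so that `\s` inside the bracket expressions is handled by PCRE; `all_matches` is a name introduced only for illustration.

```r
# Same text and pattern as above, repeated so the snippet runs on its own.
text <- "This is a phone number: 222-222-2222. Here's another phone number...(111)111 1111. This -- 333-3333 -- aint a complete phone number."
regex.phone1 <- "\\D\\(?\\d{3}\\)?[.\\s-]?\\s*\\d{3}[.\\s-]?\\s*[.\\s-]*\\d{4}\\D"

# gregexpr() finds all non-overlapping matches; regmatches() returns a list
# with one element per input string, so take the first element.
all_matches <- regmatches(text, gregexpr(regex.phone1, text, perl = TRUE))[[1]]
print(all_matches)
```

Each match still carries the single non-digit character that the pattern consumes on either side of the number.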

But kwic doesn't match anything. This:

 kwic(
  x = text,
  pattern = regex.phone1,
  window = 5,
  valuetype = "regex",
  case_insensitive = TRUE
) %>%
  as_tibble

returns:

A tibble: 0 x 7
… with 7 variables: docname <chr>, from <int>, to <int>, pre <chr>, keyword <chr>,
  post <chr>, pattern <fct>

I'd like it to match all of the phone numbers, which in this example are:

"222-222-2222."

".(111)111 1111."

(and to have them in the normal kwic output format, showing pre, post, etc.).

Answer 1

I tried matching the phone numbers by building phrases out of regular expressions.

library(quanteda)
library(tidyverse)

text <- "This is a number: 541 145-8884 also 222-222-2222 Here's (444)111-1111. No. 555 666 7774"

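# kwic() matches patterns token by token, so tokenize first; phrase() splits
# each pattern on whitespace into a sequence of per-token regexes.
kwic(
  x = tokens(text),
  phrase(c("^[\\d]{10}$","^[\\d]{3} [\\d]{3}-[\\d]{4}$","^[\\d]{3}-[\\d]{3}-[\\d]{4}$","^[\\d]{3} [\\d]{3} [\\d]{4}$","^[(] [\\d]{3} [)] [\\d]{3}-[\\d]{4}$")),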
  window = 3,
  valuetype = "regex",
  separator = " ",
  case_insensitive = FALSE
) %>%
  print()

# Output:                                                                                                 
#   [text1, 6:7]                a number: |   541 145-8884   | also 222-222-2222 Here's
#   [text1, 9:9]        541 145-8884 also |   222-222-2222   | Here's( 444             
# [text1, 11:14] also 222-222-2222 Here's | ( 444 ) 111-1111 | . No.                   
# [text1, 18:20]                    . No. |   555 666 7774   |
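The reason the single regex from the question finds nothing is that kwic compares a pattern against individual tokens, not against the raw string. The following base-R sketch illustrates that idea under a loud assumption: `strsplit()` on whitespace is only a rough stand-in for quanteda's tokenizer, which also splits off punctuation. The names `rough_tokens`, `whole_regex_hits`, `token_pattern` and `token_hits` are introduced here for illustration only.

```r
# Text and pattern from the question, repeated so the snippet runs on its own.
text <- "This is a phone number: 222-222-2222. Here's another phone number...(111)111 1111. This -- 333-3333 -- aint a complete phone number."
regex.phone1 <- "\\D\\(?\\d{3}\\)?[.\\s-]?\\s*\\d{3}[.\\s-]?\\s*[.\\s-]*\\d{4}\\D"

# Rough stand-in for tokenization: split on runs of whitespace.
# (Assumption: quanteda's tokens() splits differently, e.g. around punctuation.)
rough_tokens <- strsplit(text, "\\s+")[[1]]

# The original regex needs a non-digit on BOTH sides of the number, so it
# cannot be satisfied by any single token taken in isolation.
whole_regex_hits <- grepl(regex.phone1, rough_tokens, perl = TRUE)
print(any(whole_regex_hits))

# A pattern anchored to one token, without the surrounding \\D, does match.
token_pattern <- "^\\d{3}-\\d{3}-\\d{4}\\.?$"
token_hits <- rough_tokens[grepl(token_pattern, rough_tokens, perl = TRUE)]
print(token_hits)
```

This is why the phrase-based patterns above are written one token at a time: each whitespace-separated piece of a phrase has to match exactly one token.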


regex

r

quanteda