How to use regex with kwic to get all matches?
I can't seem to get the desired output from quanteda's kwic. Here is what I tried:
library(quanteda)
library(tidyverse)
Given this text:
text <- "This is a phone number: 222-222-2222. Here's another phone number...(111)111 1111. This -- 333-3333 -- aint a complete phone number."
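Here is a regex that matches most US phone numbers, plus one non-digit character on each side of the number:
regex.phone1 <- "\\D\\(?\\d{3}\\)?[.\\s-]?\\s*\\d{3}[.\\s-]?\\s*[.\\s-]*\\d{4}\\D"
It matches the first number here (regexpr only returns the first match), which suggests the regex works as intended.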
regmatches(text,regexpr(regex.phone1,text))
" 222-222-2222."
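As a side note, regexpr() reports only the first match in each string. A minimal base-R sketch for listing every match is to use gregexpr() instead. The perl = TRUE argument is an assumption on my part, added so that \\s inside the bracket expressions is read as whitespace, and all_matches is just an illustrative name:

```r
text <- "This is a phone number: 222-222-2222. Here's another phone number...(111)111 1111. This -- 333-3333 -- aint a complete phone number."
regex.phone1 <- "\\D\\(?\\d{3}\\)?[.\\s-]?\\s*\\d{3}[.\\s-]?\\s*[.\\s-]*\\d{4}\\D"

# gregexpr() finds every non-overlapping match; regmatches() then returns a
# list with one character vector per input string, so take the first element.
all_matches <- regmatches(text, gregexpr(regex.phone1, text, perl = TRUE))[[1]]
print(all_matches)
```

For the sample text this gives two strings, " 222-222-2222." and ".(111)111 1111."; the incomplete 333-3333 is skipped. So the regex itself finds both numbers when applied to the raw string.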
But kwic does not match anything. This:
kwic(
x = text,
pattern = regex.phone1,
window = 5,
valuetype = "regex",
case_insensitive = TRUE
) %>%
as_tibble
returns:
A tibble: 0 x 7
… with 7 variables: docname <chr>, from <int>, to <int>, pre <chr>, keyword <chr>,
post <chr>, pattern <fct>
I want it to match all the phone numbers, which in this case are:
"222-222-2222."
".(111)111 1111."
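(and to put them in the usual kwic output form, showing pre, post, and so on).
I tried building phrases out of regexes to match the phone numbers. kwic matches patterns against tokens rather than against the raw string, so each space-separated piece of a phrase pattern has to match one token.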
library(quanteda)
library(tidyverse)
text <- "This is a number: 541 145-8884 also 222-222-2222 Here's (444)111-1111. No. 555 666 7774"
kwic(
x = text,
phrase(c("^[\\d]{10}$","^[\\d]{3} [\\d]{3}-[\\d]{4}$","^[\\d]{3}-[\\d]{3}-[\\d]{4}$","^[\\d]{3} [\\d]{3} [\\d]{4}$","^[(] [\\d]{3} [)] [\\d]{3}-[\\d]{4}$")),
window = 3,
valuetype = "regex",
separator = " ",
case_insensitive = FALSE
) %>%
print()
# Output:
# [text1, 6:7] a number: | 541 145-8884 | also 222-222-2222 Here's
# [text1, 9:9] 541 145-8884 also | 222-222-2222 | Here's( 444
# [text1, 11:14] also 222-222-2222 Here's | ( 444 ) 111-1111 | . No.
# [text1, 18:20] . No. | 555 666 7774 |
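The reason the phrase() patterns are written piece by piece, as far as I understand it, is that kwic() compares each pattern element with one token at a time. A rough sketch for checking this is below. The names toks, tok_vec and single_token_hits are illustrative, and the exact token boundaries depend on the quanteda version and tokenizer, so treat the printed split as something to inspect rather than assume:

```r
library(quanteda)

text <- "This is a number: 541 145-8884 also 222-222-2222 Here's (444)111-1111. No. 555 666 7774"

# Tokenize explicitly so the units that kwic() matches against are visible.
toks <- tokens(text)
tok_vec <- as.character(toks)
print(tok_vec)

# A regex with no spaces is compared with one token at a time, so it can only
# hit a number that survives tokenization as a single token.
single_token_hits <- kwic(toks, pattern = "^\\d{3}-\\d{3}-\\d{4}$",
                          valuetype = "regex", window = 3)
print(single_token_hits)
```

If a number is split across several tokens, as "( 444 ) 111-1111" is in the output above, a single regex cannot span it, which is why the multi-token forms need phrase().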