How to define an optional element in a regex pattern with quanteda's kwic?
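I'm struggling to 'translate' a regular expression from stringi/stringr to quanteda's kwic function.
How can I get all instances of "Jane Mayer", regardless of whether she has a middle name?
Note that I don't have a list of all the middle names that occur in the data, so defining multiple patterns (one per middle name) is not possible.
Many thanks!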
library(quanteda)
library(tidyverse)
txt <- c("this is Jane Alexandra Mayer",
         "this is Jane Mayer",
         "this is Jane Eli Mayer",
         "this is Jane Burger")
txt_token <- tokens(txt)
my_pattern <- c("Jane .* Mayer")
kwic(txt_token, pattern=phrase(my_pattern), valuetype = "regex")
#> Keyword-in-context with 2 matches.
#> [text1, 3:5] this is | Jane Alexandra Mayer |
#> [text3, 3:5] this is | Jane Eli Mayer |
my_pattern <- c("Jane .? Mayer")
kwic(txt_token, pattern=phrase(my_pattern), valuetype = "regex")
#> Keyword-in-context with 2 matches.
#> [text1, 3:5] this is | Jane Alexandra Mayer |
#> [text3, 3:5] this is | Jane Eli Mayer |
my_pattern <- c("Jane.* Mayer")
kwic(txt_token, pattern=phrase(my_pattern), valuetype = "regex")
#> Keyword-in-context with 1 match.
#> [text2, 3:4] this is | Jane Mayer |
my_pattern <- c("Jane . Mayer")
kwic(txt_token, pattern=phrase(my_pattern), valuetype = "regex")
#> Keyword-in-context with 2 matches.
#> [text1, 3:5] this is | Jane Alexandra Mayer |
#> [text3, 3:5] this is | Jane Eli Mayer |
With stringr, I would simply use:
str_extract(txt, regex("Jane.* Mayer"))
#> [1] "Jane Alexandra Mayer" "Jane Mayer" "Jane Eli Mayer"
#> [4] NA
<sup>Created on 2021-11-28 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
It looks like you need to pass a second pattern to match Jane Mayer exactly:
kwic(txt_token, pattern=phrase(c("Jane .* Mayer", "Jane Mayer")), valuetype = "regex")
# => Keyword-in-context with 3 matches.
# [text1, 3:5] this is | Jane Alexandra Mayer |
# [text2, 3:4] this is | Jane Mayer |
# [text3, 3:5] this is | Jane Eli Mayer |
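As far as I understand it, the reason is that phrase() splits the pattern on whitespace and each piece is then matched against exactly one token, so a three-piece pattern can never match the two-token sequence "Jane Mayer". The sketch below imitates that behaviour in base R to make the point visible. It is only an illustration under that assumption, not quanteda's actual implementation, and `match_phrase()` and `find_all()` are hypothetical helper names. Unlike kwic, it is case-sensitive.

```r
# Base-R imitation of token-wise phrase matching (an assumption about the
# behaviour, not quanteda's code): every whitespace-separated piece of the
# pattern is a regex that must match one token in a fixed-length window.
match_phrase <- function(tokens, pattern) {
  parts <- strsplit(pattern, " ", fixed = TRUE)[[1]]
  n <- length(parts)
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  # keep the start positions where all pieces match their token
  hits <- Filter(function(i) {
    all(mapply(grepl, parts, tokens[i:(i + n - 1)]))
  }, starts)
  vapply(hits,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Try several fixed-length patterns and collect every match
find_all <- function(tokens, patterns) {
  unlist(lapply(patterns, match_phrase, tokens = tokens))
}

txt <- c("this is Jane Alexandra Mayer",
         "this is Jane Mayer",
         "this is Jane Eli Mayer",
         "this is Jane Burger")
toks <- strsplit(txt, " ", fixed = TRUE)

# One pattern per possible length: no middle name, one middle name
patterns <- c("Jane Mayer", "Jane .+ Mayer")
res <- lapply(toks, find_all, patterns = patterns)
```

So you need one pattern per possible length rather than one per middle name. If two middle names can occur, a third pattern such as "Jane .+ .+ Mayer" would cover that case in the same way.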