How to do fuzzy pattern matching with quanteda and kwic?
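I have texts written by doctors and I want to be able to highlight specific words in their context (5 words before and 5 words after the word I search for in their text). Say I want to search for the word 'suicidal'. I would then use the kwic function in the quanteda package:
kwic(dataset, pattern = "suicidal", window = 5)
So far, so good, but say I want to allow for the possibility of typos. In this case I want to allow for three deviating characters, with no restriction on where in the word these errors occur.
Is it possible to do so using quanteda's kwic function?
Example: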
dataset <- data.frame("patient" = 1:9, "text" = c("On his first appointment, the patient was suicidal when he showed up in my office",
"On his first appointment, the patient was suicidaa when he showed up in my office",
"On his first appointment, the patient was suiciaaa when he showed up in my office",
"On his first appointment, the patient was suicaaal when he showed up in my office",
"On his first appointment, the patient was suiaaaal when he showed up in my office",
"On his first appointment, the patient was saacidal when he showed up in my office",
"On his first appointment, the patient was suaaadal when he showed up in my office",
"On his first appointment, the patient was icidal when he showed up in my office",
"On his first appointment, the patient was uicida when he showed up in my office"))
dataset$text <- as.character(dataset$text)
kwic(dataset$text, pattern = "suicidal", window = 5)
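would only give me the first, correctly spelled sentence.
Great question. We don't have approximate matching as a "valuetype", but that's an interesting idea for future development. In the meantime, I'd suggest generating a list of fixed fuzzy matches with base::agrep() and then matching on those. So this would look like: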
library("quanteda")
## Package version: 1.5.2
dataset <- data.frame(
"patient" = 1:9, "text" = c(
"On his first appointment, the patient was suicidal when he showed up in my office",
"On his first appointment, the patient was suicidaa when he showed up in my office",
"On his first appointment, the patient was suiciaaa when he showed up in my office",
"On his first appointment, the patient was suicaaal when he showed up in my office",
"On his first appointment, the patient was suiaaaal when he showed up in my office",
"On his first appointment, the patient was saacidal when he showed up in my office",
"On his first appointment, the patient was suaaadal when he showed up in my office",
"On his first appointment, the patient was icidal when he showed up in my office",
"On his first appointment, the patient was uicida when he showed up in my office"
),
stringsAsFactors = FALSE
)
corp <- corpus(dataset)
# get unique words
vocab <- tokens(corp, remove_numbers = TRUE, remove_punct = TRUE) %>%
types()
Use agrep() to generate the closest fuzzy matches. Here I ran this a few times, slightly increasing max.distance each time from the default of 0.1.
# get closest matches to "suicidal"
near_matches <- agrep("suicidal", vocab,
max.distance = 0.3,
ignore.case = TRUE, fixed = TRUE, value = TRUE
)
near_matches
## [1] "suicidal" "suicidaa" "suiciaaa" "suicaaal" "suiaaaal" "saacidal" "suaaadal"
## [8] "icidal" "uicida"
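A note on the threshold: agrep() reads a max.distance below 1 as a fraction of the pattern length, rounded up, so for the 8-character pattern "suicidal" a value of 0.3 works out to 3 edits (0.3 × 8 = 2.4). That lines up with the three deviating characters asked for in the question. As a minimal sketch, the same limit can be given directly as an integer; the shortened vocab below is a hypothetical stand-in for the types extracted from the corpus.

```r
# Hypothetical, shortened vocabulary standing in for types(tokens(corp))
vocab <- c("patient", "was", "suicidal", "suicidaa", "suiaaaal",
           "suaaadal", "icidal", "uicida", "office")

# A max.distance of 1 or more is read as an absolute number of edits
near_matches <- agrep("suicidal", vocab,
  max.distance = 3,
  ignore.case = TRUE, fixed = TRUE, value = TRUE
)
near_matches
```

Stating the limit as an integer keeps the tolerance at three edits even if the search term changes length, whereas a fraction scales with the pattern.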
Then, use these as the pattern argument to kwic():
# use these for fuzzy matching
kwic(corp, near_matches, window = 3)
##
## [text1, 9] the patient was | suicidal | when he showed
## [text2, 9] the patient was | suicidaa | when he showed
## [text3, 9] the patient was | suiciaaa | when he showed
## [text4, 9] the patient was | suicaaal | when he showed
## [text5, 9] the patient was | suiaaaal | when he showed
## [text6, 9] the patient was | saacidal | when he showed
## [text7, 9] the patient was | suaaadal | when he showed
## [text8, 9] the patient was | icidal | when he showed
## [text9, 9] the patient was | uicida | when he showed
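There are other possibilities based on similar approaches, such as the fuzzyjoin or stringdist packages, but this is a simple solution from base R that should work well.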
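One caveat worth keeping in mind: agrep() looks for an approximate match of the pattern anywhere inside each string, so a longer word that merely contains something close to "suicidal" also counts as a match. If whole-word edit distance is what is wanted, a possible variation (a sketch, not part of the answer above) is utils::adist(), which ships with R and by default computes the generalized Levenshtein distance between complete strings. The vocab vector and the word "parasuicidal" are hypothetical illustrations; in practice vocab would be the types extracted from the corpus.

```r
# Hypothetical vocabulary standing in for types(tokens(corp))
vocab <- c("On", "his", "first", "appointment", "the", "patient", "was",
           "suicidal", "suicidaa", "suiciaaa", "suicaaal", "suiaaaal",
           "saacidal", "suaaadal", "icidal", "uicida", "parasuicidal",
           "when", "he", "showed", "up", "in", "my", "office")

# Whole-string edit distance from the target to every vocabulary entry
dists <- as.vector(adist("suicidal", vocab, ignore.case = TRUE))

# Keep only entries within three edits (insertions, deletions, substitutions)
max_edits <- 3
near_matches <- vocab[dists <= max_edits]
near_matches
```

The resulting near_matches vector can be passed to kwic() as the pattern in the same way as the agrep() result. Here "parasuicidal" is left out because it is four insertions away from "suicidal", even though it contains the search term.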