quanteda kwic to extract number followed by percentage
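I have some text containing phrases that consist of a number followed by a symbol, and I would like to extract them, for example numbers followed by a percent sign. The kwic function from the quanteda package seems to work when the number is given as a regular expression (for example "\\d{1,}").
However, I have not found a way to make quanteda extract a number followed by a percent sign.
The following text can serve as an example: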
Thirteen (7%) of 187 patients acquired C. difficile in ICU-1, 9 (36%)
of 25 on ICU-2 and 3 (5.9%) of 51 patients in BU. Eight (32%)
developed diarrhoea attributable only to C. difficile and/ or toxin,
and the remaining 17 (68%) were asymptomat- ic: none had
pseudomembranous colitis.
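The quanteda package handles regular expressions in a rather odd way. I am not sure why this solution works, but I think it has to do with how kwic handles the specified pattern. Wrapping the pattern in the phrase function and adding a space returns the correct result: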
s <- c("Thirteen (7%) of 187 patients acquired C. difficile in ICU-1, 9 (36%) of 25 on ICU-2 and 3 (5.9%) of 51 patients in BU. Eight (32%) developed diarrhoea attributable only to C. difficile and/ or toxin, and the remaining 17 (68%) were asymptomat- ic: none had pseudomembranous colitis.")
kwic(s, phrase("\\d+ %"), valuetype = "regex")
I suggest contacting the package maintainers and pointing this out. It seems counter-intuitive.
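The reason is that when you call kwic() directly on a corpus or character object, it passes some arguments on to tokens(), and these affect how tokenization happens before the keywords-in-context analysis. (This is documented under the ... argument in ?kwic.)
The default tokenization in quanteda uses the stringi word boundary definitions, so that: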
tokens("Thirteen (7%) of 187")
# tokens from 1 document.
# text1 :
# [1] "Thirteen" "(" "7" "%" ")" "of" "187"
If you want to use the simpler whitespace tokenizer instead, you can do so with:
tokens("Thirteen (7%) of 187", what = "fasterword")
# tokens from 1 document.
# text1 :
# [1] "Thirteen" "(7%)" "of" "187"
So the way to use this in kwic() is:
kwic(s, "\\d+%", valuetype = "regex", what = "fasterword")
# [text1, 2] Thirteen | (7%) | of 187 patients acquired C.
# [text1, 12] C. difficile in ICU-1, 9 | (36%) | of 25 on ICU-2 and
# [text1, 19] 25 on ICU-2 and 3 | (5.9%) | of 51 patients in BU.
# [text1, 26] 51 patients in BU. Eight | (32%) | developed diarrhoea attributable only to
# [text1, 41] toxin, and the remaining 17 | (68%) | were asymptomat- ic: none had
Otherwise, you need to wrap the regular expression in the phrase() function and separate the elements with whitespace:
kwic(s, phrase("\\d+ %"), valuetype = "regex")
# [text1, 3:4] Thirteen( | 7 % | ) of 187 patients acquired
# [text1, 18:19] in ICU-1, 9( | 36 % | ) of 25 on ICU-2
# [text1, 28:29] on ICU-2 and 3( | 5.9 % | ) of 51 patients in
# [text1, 39:40] in BU. Eight( | 32 % | ) developed diarrhoea attributable only
# [text1, 60:61] and the remaining 17( | 68 % | ) were asymptomat- ic
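This behaviour may take some getting used to, but it is the best way to ensure that users have full control over searching for multi-token sequences, rather than implementing a single way of deciding what the elements of a sequence should be when the multi-token input has not yet been tokenized.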
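To make the token-wise matching concrete, here is a minimal base R sketch. It does not call quanteda and is not how quanteda is implemented internally; the token vectors are copied by hand from the tokens() output shown earlier, and match_sequence() is a hypothetical helper written only for this illustration. It shows why a single pattern such as "\\d+%" cannot match once the digits and the percent sign are separate tokens, and why a sequence of two patterns can.

```r
# Token vectors copied by hand from the tokens() output above (assumption:
# these stand in for the default and the "fasterword" tokenization).
word_tokens  <- c("Thirteen", "(", "7", "%", ")", "of", "187")
space_tokens <- c("Thirteen", "(7%)", "of", "187")

# A single pattern is tested against each token separately.
single_word  <- grepl("\\d+%", word_tokens)   # no token holds digits and "%"
single_space <- grepl("\\d+%", space_tokens)  # "(7%)" holds both

# A phrase is treated as one pattern per consecutive token.
# match_sequence() is a hypothetical helper, not a quanteda function.
match_sequence <- function(tokens, patterns) {
  n <- length(patterns)
  if (length(tokens) < n) return(integer(0))
  starts <- seq_len(length(tokens) - n + 1)
  hit <- vapply(starts, function(i) {
    # Every pattern must match the token at the corresponding offset.
    all(mapply(grepl, patterns, tokens[i:(i + n - 1)]))
  }, logical(1))
  starts[hit]
}

# Start position of the two-token sequence "digits" then "%".
phrase_hits <- match_sequence(word_tokens, c("\\d+", "%"))

print(which(single_word))   # positions matched by the single pattern, default tokens
print(which(single_space))  # positions matched by the single pattern, whitespace tokens
print(phrase_hits)          # start of the two-token match
```

In this sketch the two-token match starts at the third token, which lines up with the 3:4 position reported in the phrase() output above.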