Why is RegEx not picking up on the following segments within a string?

Question

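I am new to R.

I am working on a project that involves scraping content from Reddit and analyzing it for sentiment and other variables. I am using the RedditExtractoR package and its get_reddit() function to retrieve my data.

Here is an example of a Reddit comment: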

Honestly it\031s right up there with Out of the Woods and Cruel Summer for me so maybe that\031s saying more about me than it does about the songs \005\002 meh I\031m still going to bop

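As you can see, instead of punctuation such as apostrophes, I see a backslash followed by 3 digits. When I split the string with strsplit(comment, ""), the letters and spaces come out as individual characters, but each backslash-plus-digits sequence also comes out as a single character (e.g. "m", "e", "h", " ", "I", "\031", "m").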

I have tried several ways to isolate these odd sequences, but nothing has worked so far. My attempts include:

grepl("[\\]+[[:digit:]]+", comment)
grepl("^.*[\\]+[0-9]{3,}.*$", comment)
iconv(comment, from = "ASCII", to = "latin1", sub = "", toRaw = FALSE)

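...and many more variations, all of which return FALSE. In addition, when I split the string and save "\031" as a variable, its class is "character", and every grepl variant returns FALSE unless the pattern is the complete "\031".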

What can I try? I do not understand why the regex fails to recognize the backslash and the digits.

Answer 1

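The \031 written in an R string literal as "\031" is a control character called END OF MEDIUM. \005 is an ENQUIRY character and \002 is START OF TEXT. They all belong to the Cc (Other, control) Unicode category. Each one is a single character, so the string contains no literal backslash or digits for your patterns to match.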
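To illustrate the point above, here is a minimal sketch that compares the control character with a string that really does contain a backslash followed by three digits. The variable names ctrl and literal are made up for this example.

```r
# "\031" is an octal escape: R stores it as one character (code point 25).
ctrl <- "\031"
# "\\031" is a real backslash followed by the digits 0, 3, 1.
literal <- "\\031"

nchar(ctrl)       # 1
nchar(literal)    # 4
utf8ToInt(ctrl)   # 25, i.e. END OF MEDIUM

# Searching for a literal backslash only succeeds on the second string.
grepl("\\", ctrl, fixed = TRUE)      # FALSE
grepl("\\", literal, fixed = TRUE)   # TRUE
```

The backslash you see when the comment is printed is only R's way of displaying a non-printable character; it is not part of the data.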

To check for these characters, you can use

grepl("\\p{Cc}", comment, perl=TRUE)

R demo:

comment <- "Honestly it\031s right up there with Out of the Woods and Cruel Summer for me so maybe that\031s saying more about me than it does about the songs \005\002 meh I\031m still going to bop"
regmatches(comment, gregexpr("\\p{Cc}+", comment, perl=TRUE))
# => [1] "\031"     "\031"     "\005\002" "\031"    
grepl("\\p{Cc}", comment, perl=TRUE)
# => [1] TRUE
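If the goal is to clean the comments before sentiment analysis, the same pattern can be passed to gsub(). This is only a sketch and goes beyond what the demo above does: it assumes, as the question suggests, that \031 stands in for an apostrophe, and that the remaining control characters can simply be dropped. The shortened sample string and the names restored and cleaned are made up for illustration.

```r
# Shortened sample with the same control characters as the Reddit comment.
comment <- "it\031s fine \005\002 meh"

# Assumption: "\031" replaced an apostrophe, so put one back.
restored <- gsub("\031", "'", comment, fixed = TRUE)

# Drop every remaining run of control characters (Unicode category Cc).
cleaned <- gsub("\\p{Cc}+", "", restored, perl = TRUE)

cleaned
grepl("\\p{Cc}", cleaned, perl = TRUE)   # FALSE: no control characters left
```

Removing the \005\002 run leaves two adjacent spaces behind; whether to collapse them depends on what the downstream analysis expects.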
