Regex with positive lookahead still splits string in wrong place using strsplit()
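I am trying to split a character vector containing messages right in front of each date-time indicator.
I was thinking of using strsplit() with a regular expression and perl = TRUE.
Here is some sample data: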
TEST <- c("05.10.17, 09:26 - Person One: How about we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n05.10.17, 09:27 - Person Two: ???\n05.10.17, 09:28 - Person Two: You guys have history?\n05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n")
Here is what I have tried so far:
Cut <- unlist(strsplit(TEST,"(?=[0-3][0-9][.][0-9]{2}[.][0-9]{2}[,][ ][0-9]{2}:[0-9]{2})", perl = TRUE))
Cut
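According to this website, the regex should cut the string right in front of the date-time indicator. However, the result I get looks like this, with the first character split off: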
[1] "0"
[2] "5.10.17, 09:26 - Person One: How about we chill on sunday\n"
[3] "0"
[4] "5.10.17, 09:27 - Person One: I could bring some beer\n"
[5] "0"
[6] "5.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"
[7] "0"
[8] "5.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"
[9] "0"
[10] "5.10.17, 09:27 - Person Two: ???"
[11] "0"
[12] "5.10.17, 09:28 - Person Two: You guys have history?\n"
[13] "0"
[14] "5.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
This is what the result should look like:
[1] "05.10.17, 09:26 - Person One: How about we chill on sunday\n"
[2] "05.10.17, 09:27 - Person One: I could bring some beer\n"
[3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"
[4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"
[5] "05.10.17, 09:27 - Person Two: ???\n"
[6] "05.10.17, 09:28 - Person Two: You guys have history?\n"
[7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
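Note: I cannot split the data at the newline characters, because some messages contain one or more newlines in the middle of the message.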
You only need to create the split pattern where \n is followed by a date.
strsplit(gsub("(.*?\n)(\\d+[.]\\d+[.]\\d+)","\\1SPLITHERE\\2",TEST),"SPLITHERE")
[[1]]
[1] "05.10.17, 09:26 - Person One: How about we chill on sunday\n"
[2] "05.10.17, 09:27 - Person One: I could bring some beer\n"
[3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"
[4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"
[5] "05.10.17, 09:27 - Person Two: ???\n"
[6] "05.10.17, 09:28 - Person Two: You guys have history?\n"
[7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
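The same idea (split only where \n is followed by a date) can probably also be expressed directly in strsplit(). As far as I understand its behaviour, an empty match at the very start of the remaining string makes strsplit() peel off a single character, which is where the stray "0" pieces come from. Adding a lookbehind for \n keeps the match zero-width but stops it from matching at the start. The sketch below assumes a shortened, hypothetical sample (msgs) rather than the original TEST vector:

```r
# Hypothetical sample: two messages, the first one contains an embedded newline.
msgs <- "05.10.17, 09:26 - Person One: How about\n we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n"

# Zero-width split point: preceded by a newline AND followed by a date-time stamp.
pattern <- "(?<=\\n)(?=[0-3][0-9][.][0-9]{2}[.][0-9]{2}, [0-9]{2}:[0-9]{2})"

# perl = TRUE is required for lookbehind/lookahead support.
parts <- unlist(strsplit(msgs, pattern, perl = TRUE))
print(parts)
```

Because nothing is consumed by the pattern, each piece keeps its trailing \n, and the newline inside the first message is left alone.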
You can also use regmatches from base R:
regmatches(TEST,gregexpr(".*?\n",TEST))
[[1]]
[1] "05.10.17, 09:26 - Person One: How about we chill on sunday\n"
[2] "05.10.17, 09:27 - Person One: I could bring some beer\n"
[3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"
[4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"
[5] "05.10.17, 09:27 - Person Two: ???\n"
[6] "05.10.17, 09:28 - Person Two: You guys have history?\n"
[7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
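Note that ".*?\n" on its own ends a match at every newline, so a message with an embedded newline would be cut in two. A possible variant, sketched here under the assumption that every message starts with a stamp of the form dd.mm.yy, hh:mm, only ends a match at a newline that is followed by such a stamp or by the end of the string. The msgs sample is hypothetical:

```r
# Hypothetical sample: two messages, the first one contains an embedded newline.
msgs <- "05.10.17, 09:26 - Person One: How about\n we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n"

# (?s) lets "." match newlines; the lazy ".*?" stops at the first "\n" that is
# followed by a date-time stamp or by the very end of the string (\z).
pattern <- "(?s).*?\\n(?=[0-3][0-9][.][0-9]{2}[.][0-9]{2}, [0-9]{2}:[0-9]{2}|\\z)"

# Extract the matches instead of splitting.
parts <- regmatches(msgs, gregexpr(pattern, msgs, perl = TRUE))[[1]]
print(parts)
```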
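You can add a whitespace character class \s before the positive lookahead.
I changed your example slightly so that it matches your problem more closely (i.e. I added a \n inside the first message).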
> TEST <- c("05.10.17, 09:26 - Person One: How about\n we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n05.10.17, 09:27 - Person Two: ???\n05.10.17, 09:28 - Person Two: You guys have history?\n05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n")
> unlist(strsplit(TEST,"\\s(?=[0-3][0-9][.][0-9]{2}[.][0-9]{2}[,][ ][0-9]{2}:[0-9]{2})", perl = TRUE))
## [1] "05.10.17, 09:26 - Person One: How about\n we chill on sunday"
## [2] "05.10.17, 09:27 - Person One: I could bring some beer"
## [3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards"
## [4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-"
## [5] "05.10.17, 09:27 - Person Two: ???"
## [6] "05.10.17, 09:28 - Person Two: You guys have history?"
## [7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
strsplit(TEST, '(?<=\\n|^)(0)',perl=T)[[1]][2:7]