Regex with positive lookahead still splits string in the wrong place using strsplit()

Question

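I am trying to split a character vector containing messages directly in front of each date-time stamp.

I was thinking of using strsplit() with a regular expression and perl = TRUE.

Here is some sample data: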

TEST <- c("05.10.17, 09:26 - Person One: How about we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n05.10.17, 09:27 - Person Two: ???\n05.10.17, 09:28 - Person Two: You guys have history?\n05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n")

This is what I have tried so far:

Cut <- unlist(strsplit(TEST,"(?=[0-3][0-9][.][0-9]{2}[.][0-9]{2}[,][ ][0-9]{2}:[0-9]{2})", perl = TRUE))
Cut

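According to this website, the regular expression should cut the string directly in front of each date-time stamp. However, the result I get looks like this, with the first character split off: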

 [1] "0"                                                                                   
 [2] "5.10.17, 09:26 - Person One: How about we chill on sunday\n"                         
 [3] "0"                                                                                   
 [4] "5.10.17, 09:27 - Person One: I could bring some beer\n"                              
 [5] "0"                                                                                   
 [6] "5.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"  
 [7] "0"                                                                                   
 [8] "5.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"                              
 [9] "0"                                                                                   
[10] "5.10.17, 09:27 - Person Two: ???\n"
[11] "0"                                                                                   
[12] "5.10.17, 09:28 - Person Two: You guys have history?\n"                               
[13] "0"                                                                                   
[14] "5.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"

This is what the result should look like:

 [1] "05.10.17, 09:26 - Person One: How about we chill on sunday\n"
 [2] "05.10.17, 09:27 - Person One: I could bring some beer\n"
 [3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"
 [4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"
 [5] "05.10.17, 09:27 - Person Two: ???\n"
 [6] "05.10.17, 09:28 - Person Two: You guys have history?\n"
 [7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"

Note: I cannot simply split the data at newlines, because some messages contain one or more newlines in the middle of the message.

Answer 1

You just need to create a split point wherever a \n is followed by a date.

 strsplit(gsub("(.*?\n)(\\d+[.]\\d+[.]\\d+)","\\1SPLITHERE\\2",TEST),"SPLITHERE")
[[1]]
[1] "05.10.17, 09:26 - Person One: How about we chill on sunday\n"                         
[2] "05.10.17, 09:27 - Person One: I could bring some beer\n"                              
[3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"  
[4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"                              
[5] "05.10.17, 09:27 - Person Two: ???\n"                                                  
[6] "05.10.17, 09:28 - Person Two: You guys have history?\n"                               
[7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"

You can also use regmatches from base R:

 regmatches(TEST,gregexpr(".*?\n",TEST))
[[1]]
[1] "05.10.17, 09:26 - Person One: How about we chill on sunday\n"                         
[2] "05.10.17, 09:27 - Person One: I could bring some beer\n"                              
[3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"  
[4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"                              
[5] "05.10.17, 09:27 - Person Two: ???\n"                                                  
[6] "05.10.17, 09:28 - Person Two: You guys have history?\n"                               
[7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
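
Note that ".*?\n" matches one line at a time, so a message with an embedded newline would be cut in two. The following is only a minimal sketch of a variant that matches whole messages instead, assuming every message starts with a stamp of the form dd.mm.yy, hh:mm. The input `x` and the names `stamp`, `pattern` and `msgs` are made up for illustration and are not from the question.

```r
# Illustrative input: the first message contains an embedded newline.
x <- "05.10.17, 09:26 - Person One: How about\n we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n"

# Assumed stamp format, e.g. "05.10.17, 09:26".
stamp <- "\\d{2}[.]\\d{2}[.]\\d{2}, \\d{2}:\\d{2}"

# (?s) lets "." match newlines; the lazy ".*?" stops right before the
# next stamp, or at the end of the string (\z) for the last message.
pattern <- paste0("(?s)", stamp, ".*?(?=", stamp, "|\\z)")

msgs <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
msgs
```

With this input, `msgs` should hold two elements, and the first one keeps its embedded newline. A message whose text itself contains something that looks like a stamp would still be split there.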

Answer 2

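You can add the whitespace character class \s in front of the positive lookahead.

I changed your example slightly so that it matches your problem more accurately (i.e. I added a \n inside the first message).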

> TEST <- c("05.10.17, 09:26 - Person One: How about\n we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n05.10.17, 09:27 - Person Two: ???\n05.10.17, 09:28 - Person Two: You guys have history?\n05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n")
> unlist(strsplit(TEST,"\\s(?=[0-3][0-9][.][0-9]{2}[.][0-9]{2}[,][ ][0-9]{2}:[0-9]{2})", perl = TRUE))

## [1] "05.10.17, 09:26 - Person One: How about\n we chill on sunday"                         
## [2] "05.10.17, 09:27 - Person One: I could bring some beer"                                
## [3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards"    
## [4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-"                                
## [5] "05.10.17, 09:27 - Person Two: ???"                                                    
## [6] "05.10.17, 09:28 - Person Two: You guys have history?"                                 
## [7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
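
As far as I can tell, the stray "0" pieces in the question come from how strsplit() treats an empty match at the very start of the remaining input: it splits off a single character and carries on. Putting \s in front avoids this because the match is no longer empty, at the cost of dropping the trailing newline. If you want to keep the newline, a lookbehind can be combined with the lookahead instead. This is only a sketch, assuming every message after the first is preceded by a newline; the shortened input `x` and the names `pattern` and `parts` are made up for illustration.

```r
# Illustrative, shortened input with an embedded newline in the first message.
x <- "05.10.17, 09:26 - Person One: How about\n we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n05.10.17, 09:27 - Person Two: ???\n"

# (?<=\n) requires a newline just before the split point and
# (?=...) requires a date-time stamp right after it.
# Both are zero-width, so no character is removed from the pieces.
pattern <- "(?<=\\n)(?=[0-3][0-9][.][0-9]{2}[.][0-9]{2}, [0-9]{2}:[0-9]{2})"

parts <- unlist(strsplit(x, pattern, perl = TRUE))
parts
```

Here the empty match can never sit at the start of the remaining input, because nothing precedes that position for the lookbehind to see, so each piece should keep both its leading "0" and its trailing "\n".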

Answer 3

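# Split on a "0" that follows a newline or the start of the string.
# Note that the matched "0" itself is consumed, so each piece starts with "5.10.17".
# [-1] drops the empty first element produced by the match at the very start.
strsplit(TEST, '(?<=\\n|^)(0)', perl = TRUE)[[1]][-1]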
