Cleaning 'stringr str_replace_all' automatic concatenation when matching multiple times

Question

I used police_officer <- str_extract_all(txtparts, "ID:.*\n") to extract the names of all police officers involved in a 911 call from a text file. Example:
2237 DISTURBANCE Report Call Taker: Telephone Operators Sharon L Moran Location/Address: [BRO 6949] 61 WILSON ST ID: Patrolman Darvin Anderson Disp-22:43:39 Arvd-22:48:57 Clrd-23:49:45 ID: Patrolman Stephen T Pina Disp-22:43:48 Clrd-22:46:10 ID: Sergeant Michael V Damiano Disp-22:46:33 Arvd-22:47:14 Clrd-22:55:22

In some sections, where it matches more than one ID:, I get: "c(\" Patrolman Darvin Anderson\n\", \" Patrolman Stephen T Pina\n\", \" Sergeant Michael V Damiano\n\")". So far, this is how I have tried to clean the data:
police_officer <- str_replace_all(police_officer, "c\\(.", "")
police_officer <- str_replace_all(police_officer, "\\)", "")
police_officer <- str_replace_all(police_officer, "ID:", "")
police_officer <- str_replace_all(police_officer, "\\n\"", "") # I can't get rid of \n\.

This is what I end up with:
" Patrolman Darvin Anderson\n\", \" Patrolman Stephen T Pina\n\", \" Sergeant Michael V Damiano\n\""

I need help removing the \n\.

Answer 1

You can use the following regex with str_match_all:

\bID:\s*(\w+(?:\h+\w+)*)

See the regex demo.

> library(stringr)
> txt <- "Call Taker:    Telephone Operators Sharon L Moran\n  Location/Address:    [BRO 6949] 61 WILSON ST\n                ID:    Patrolman Darvin Anderson\n                       Disp-22:43:39                 Arvd-22:48:57  Clrd-23:49:45\n                ID:    Patrolman Stephen T Pina\n                       Disp-22:43:48                                Clrd-22:46:10\n                ID:    Sergeant Michael V Damiano\n                       Disp-22:46:33                 Arvd-22:47:14  Clrd-22:55:22"
> str_match_all(txt, "\\bID:\\s*(\\w+(?:\\h+\\w+)*)")
[[1]]
     [,1]                                [,2]                        
[1,] "ID:    Patrolman Darvin Anderson"  "Patrolman Darvin Anderson" 
[2,] "ID:    Patrolman Stephen T Pina"   "Patrolman Stephen T Pina"  
[3,] "ID:    Sergeant Michael V Damiano" "Sergeant Michael V Damiano"

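The regex matches ID: as a whole word, then zero or more whitespace characters (with \s*), and then captures into Group 1 one run of word characters followed by any number of further runs, each separated by horizontal whitespace. Because \h does not match a line break, the capture stops at the end of the line. str_match_all returns the captured part in its own column, whereas str_extract_all only returns the whole match, so you cannot use str_extract_all with this regex. In R code the backslashes must be doubled inside the string literal, for example "\\bID:\\s*(\\w+(?:\\h+\\w+)*)".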
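As a small sketch of where the "c(...)" text probably comes from: str_extract_all returns a list, and coercing a list to character deparses each element. The shortened txt below is a made-up excerpt of the log, and as.character is used here only to illustrate the coercion; presumably something similar happens when the list is passed to str_replace_all.

```r
library(stringr)

# Hypothetical, shortened excerpt of one log entry
txt <- "ID:    Patrolman Darvin Anderson\n  Disp-22:43:39\n  ID:    Patrolman Stephen T Pina\n"

# str_extract_all returns a list with one character vector per input string
found <- str_extract_all(txt, "ID:.*\n")

# Coercing that list to character deparses the vector into "c(...)" text
as_text <- as.character(found)

# Taking column 2 (the capture group) of the str_match_all matrix
# gives a plain character vector with no prefix and no trailing newline
officers <- str_match_all(txt, "\\bID:\\s*(\\w+(?:\\h+\\w+)*)")[[1]][, 2]
```

Working with the matrix column directly means there is nothing left to strip with str_replace_all afterwards.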

Update:

> time <- str_trim(str_extract(txt, " [[:digit:]]{4}"))
> Call_taker <- str_replace_all(str_extract(txt, "Call Taker:.*\n"),"Call Taker:","" ) %>% str_replace_all("\n","")
> address <- str_extract(txt, "Location/Address:.*\n")
> Police_officer <- str_match_all(txt, "\\bID:\\s*(\\w+(?:\\h+\\w+)*)")
> BPD_log <- cbind(time,Call_taker,address,list(Police_officer[[1]][,2]))
> BPD_log <- as.data.frame(BPD_log)
> colnames(BPD_log) <- c("time", "Call_taker", "address", "Police_officer")
> BPD_log
  time                             Call_taker                                        address
1 6949     Telephone Operators Sharon L Moran Location/Address:    [BRO 6949] 61 WILSON ST\n
                                                                   Police_officer
1 Patrolman Darvin Anderson, Patrolman Stephen T Pina, Sergeant Michael V Damiano
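If the file holds several log entries, the same idea might be applied to a whole vector at once. The following is only a sketch under assumptions: entries, id_pattern, and log_df are hypothetical names, the two entries are made-up excerpts, and the officer names are collapsed into one comma-separated string per entry rather than kept as a list column.

```r
library(stringr)

# Hypothetical input: one string per log entry
entries <- c(
  "Call Taker:    Telephone Operators Sharon L Moran\n  ID:    Patrolman Darvin Anderson\n  ID:    Patrolman Stephen T Pina\n",
  "Call Taker:    Telephone Operators Sharon L Moran\n  ID:    Sergeant Michael V Damiano\n"
)

id_pattern <- "\\bID:\\s*(\\w+(?:\\h+\\w+)*)"

# str_match_all returns one matrix per entry; column 2 holds the captured name
matches <- str_match_all(entries, id_pattern)

# Collapse each entry's names into a single comma-separated string
officers <- vapply(matches, function(m) toString(m[, 2]), character(1))

# Capture the rest of the "Call Taker:" line (. does not match a newline)
call_taker <- str_trim(str_match(entries, "Call Taker:\\s*(.*)")[, 2])

log_df <- data.frame(call_taker = call_taker, officers = officers,
                     stringsAsFactors = FALSE)
```

An entry with no ID: line yields an empty string in officers, since toString on an empty vector returns "".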
