Cleaning up the automatic concatenation from stringr str_replace_all when it matches multiple times
I used police_officer <- str_extract_all(txtparts, "ID:.*\n")
to extract the names of all police officers involved in a 911 call from a text file.
Example:
2237 DISTURBANCE Report
Call Taker: Telephone Operators Sharon L Moran
Location/Address: [BRO 6949] 61 WILSON ST
ID: Patrolman Darvin Anderson
Disp-22:43:39 Arvd-22:48:57 Clrd-23:49:45
ID: Patrolman Stephen T Pina
Disp-22:43:48 Clrd-22:46:10
ID: Sergeant Michael V Damiano
Disp-22:46:33 Arvd-22:47:14 Clrd-22:55:22
In some sections, when it matches more than one ID: line, I get:
"c(\" Patrolman Darvin Anderson\n\", \" Patrolman Stephen T Pina\n\", \" Sergeant Michael V Damiano\n\")"
This is what I have tried so far to clean the data:
police_officer <- str_replace_all(police_officer,"c\\(.","")
police_officer <- str_replace_all(police_officer,"\\)","")
police_officer <- str_replace_all(police_officer,"ID:","")
police_officer <- str_replace_all(police_officer,"\n\","") # I can't get rid of\n\.
This is what I end up with:
" Patrolman Darvin Anderson\n\", \" Patrolman Stephen T Pina\n\", \" Sergeant Michael V Damiano\n\""
I need help removing the \n\ sequences.
You can use the following regex with str_match_all:
\bID:\s*(\w+(?:\h+\w+)*)
> library(stringr)
> txt <- "Call Taker: Telephone Operators Sharon L Moran\n Location/Address: [BRO 6949] 61 WILSON ST\n ID: Patrolman Darvin Anderson\n Disp-22:43:39 Arvd-22:48:57 Clrd-23:49:45\n ID: Patrolman Stephen T Pina\n Disp-22:43:48 Clrd-22:46:10\n ID: Sergeant Michael V Damiano\n Disp-22:46:33 Arvd-22:47:14 Clrd-22:55:22"
> str_match_all(txt, "\\bID:\\s*(\\w+(?:\\h+\\w+)*)")
[[1]]
     [,1]                             [,2]
[1,] "ID: Patrolman Darvin Anderson"  "Patrolman Darvin Anderson"
[2,] "ID: Patrolman Stephen T Pina"   "Patrolman Stephen T Pina"
[3,] "ID: Sergeant Michael V Damiano" "Sergeant Michael V Damiano"
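The regex matches ID: as a whole word (\b is a word boundary), then zero or more whitespace characters (\s*), and then captures one or more word characters, optionally followed by further runs of word characters separated by horizontal whitespace (\h+).
Because \h does not match a line break, the capture stops at the end of the line.
str_match_all returns the captured groups alongside the full match, whereas str_extract_all returns only the full match (including the ID: prefix), so you cannot use str_extract_all with this regex.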
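As a minimal sketch of where the "c(...)" text in the question probably comes from: str_extract_all returns a list, and coercing a list to character deparses each element. The shortened txt and the variable names raw, deparsed, officers and officers_joined are illustrative assumptions, not part of the original code.

```r
library(stringr)

# Shortened sample text (an assumption, trimmed from the example above)
txt <- "ID: Patrolman Darvin Anderson\n Disp-22:43:39\n ID: Patrolman Stephen T Pina\n Disp-22:43:48"

# str_extract_all() returns a list holding one character vector per input string
raw <- str_extract_all(txt, "ID:.*\n")

# Coercing that list to character deparses each element, which yields
# a single string that starts with "c(" and contains escaped quotes
deparsed <- as.character(raw)

# Column 2 of the str_match_all() matrix holds capture group 1: the bare names
officers <- str_match_all(txt, "\\bID:\\s*(\\w+(?:\\h+\\w+)*)")[[1]][, 2]

# Collapse to one string per call if a single text field is wanted
officers_joined <- paste(officers, collapse = ", ")
```

Working on the list elements directly (raw[[1]], or the match matrix) avoids the coercion, so there is nothing left to clean up afterwards.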
Update:
> time <- str_trim(str_extract(txt, " [[:digit:]]{4}"))
> Call_taker <- str_replace_all(str_extract(txt, "Call Taker:.*\n"),"Call Taker:","" ) %>% str_replace_all("\n","")
> address <- str_extract(txt, "Location/Address:.*\n")
> Police_officer <- str_match_all(txt, "\\bID:\\s*(\\w+(?:\\h+\\w+)*)")
> BPD_log <- cbind(time,Call_taker,address,list(Police_officer[[1]][,2]))
> BPD_log <- as.data.frame(BPD_log)
> colnames(BPD_log) <- c("time", "Call_taker", "address", "Police_officer")
> BPD_log
  time                          Call_taker                                      address
1 6949  Telephone Operators Sharon L Moran Location/Address: [BRO 6949] 61 WILSON ST\n
                                                                    Police_officer
1 Patrolman Darvin Anderson, Patrolman Stephen T Pina, Sergeant Michael V Damiano
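In the session above, txt has no leading call number, so the four-digit pattern picks up 6949 from the address field. The following is a hedged variant, not the original method: it assumes each record starts with the four-digit call number as in the question's sample, uses capture groups so the labels and line breaks never reach the result, and stores the officers in a list column. The header line in txt and the helper name officers are assumptions for illustration.

```r
library(stringr)

# Sample record; the first line follows the question's example (assumed layout)
txt <- "2237 DISTURBANCE Report\n Call Taker: Telephone Operators Sharon L Moran\n Location/Address: [BRO 6949] 61 WILSON ST\n ID: Patrolman Darvin Anderson\n Disp-22:43:39 Arvd-22:48:57 Clrd-23:49:45\n ID: Patrolman Stephen T Pina\n Disp-22:43:48 Clrd-22:46:10"

# Multi-valued field: every officer name in the record
officers <- str_match_all(txt, "\\bID:\\s*(\\w+(?:\\h+\\w+)*)")[[1]][, 2]

# Single-valued fields: "." does not match a line break,
# so each capture ends at the end of its line
BPD_log <- data.frame(
  time           = str_match(txt, "^\\s*(\\d{4})\\b")[, 2],
  Call_taker     = str_match(txt, "Call Taker:\\s*(.*)")[, 2],
  address        = str_match(txt, "Location/Address:\\s*(.*)")[, 2],
  Police_officer = I(list(officers)),  # I() keeps the vector as one list cell
  stringsAsFactors = FALSE
)
```

Keeping Police_officer as a list column preserves the individual names; paste(officers, collapse = ", ") can be used instead when a flat text column is preferred.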