Regular expression matching inside dplyr
While answering another question, I wrote the following code:
df <- data.frame(Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor", "QE511.4 .G53 1982 Circulating Collection, 3rd Floor", "TL515 .M63 Circulating Collection, 3rd Floor", "D753 .F4 Circulating Collection, 3rd Floor", "DB89.F7 D4 Circulating Collection, 3rd Floor"))
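library(stringr)
# Backslashes must be doubled inside an R string literal
matches <- str_match(df$Call_Num, "([A-Z]+)(\\d+)\\s*\\.")
# Column 1 of the result holds the full match; columns 2 and 3 hold the two groups
df2 <- data.frame(df, letter = matches[, 2], number = matches[, 3])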
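Now my question is: is there a simple way to combine the last two lines (the str_match() call and the data.frame() call) into a single dplyr call, presumably using mutate()? Alternatively, I would also be interested in a solution that uses do(). For the mutate() approach, since we are extracting 2 groups, I will accept a solution that calls str_match() twice with different regular expressions, one for each desired group.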
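Edit: To clarify, the main challenge I see here is that str_match() returns a matrix, and I would like to know how to handle that inside mutate() or do(). I am not interested in solving the original problem with other methods of extracting the information. Plenty of such solutions have already been given.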
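To make the difficulty concrete, here is a minimal sketch of what str_match() returns for this pattern. It uses only two of the call numbers from above, and the variable names call_num and m are chosen purely for illustration.

```r
library(stringr)

# Two of the call numbers from the question (illustrative subset)
call_num <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
              "D753 .F4 Circulating Collection, 3rd Floor")

# Backslashes are doubled because the pattern is an R string literal
m <- str_match(call_num, "([A-Z]+)(\\d+)\\s*\\.")

is.matrix(m)  # the result is a character matrix, not a vector
dim(m)        # one row per input string, one column per match component
m[, 1]        # full match
m[, 2]        # first capture group (the letters)
m[, 3]        # second capture group (the digits)
```

Because mutate() expects one vector per new column, the matrix has to be split into its columns before it can be used there.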
You can use extract() from the tidyr package to do this:
extract(df, Call_Num, into = c("letter", "number"), regex = "([A-Z]+)(\\d+)\\s*\\.", remove = FALSE)
Call_Num letter number
1 HV5822.H4 C47 Circulating Collection, 3rd Floor HV 5822
2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor QE 511
3 TL515 .M63 Circulating Collection, 3rd Floor TL 515
4 D753 .F4 Circulating Collection, 3rd Floor D 753
5 DB89.F7 D4 Circulating Collection, 3rd Floor DB 89
It is not dplyr, but as the package's CRAN page states, tidyr "is designed specifically for data tidying (not general reshaping or aggregating) and works well with dplyr data pipelines."
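As a rough sketch of how that fits into a pipeline, the call can be chained with %>% and followed by ordinary dplyr verbs. The convert = TRUE argument and the filter() step are additions for illustration and are not part of the answer above; the name result is likewise made up.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
               "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
               "TL515 .M63 Circulating Collection, 3rd Floor",
               "D753 .F4 Circulating Collection, 3rd Floor",
               "DB89.F7 D4 Circulating Collection, 3rd Floor"),
  stringsAsFactors = FALSE
)

result <- df %>%
  # tidyr:: is spelled out to avoid any clash with magrittr::extract()
  tidyr::extract(Call_Num, into = c("letter", "number"),
                 regex = "([A-Z]+)(\\d+)\\s*\\.",
                 remove = FALSE,
                 convert = TRUE) %>%   # assumption: numeric values are wanted
  # Illustrative follow-up step using the new column in a dplyr verb
  filter(number > 500)
```

With convert = TRUE the number column is converted from character to integer, so it can be compared numerically in later steps.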
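You can try do():
df %>%
  # Bind the capture-group columns (full match dropped with [, -1]) to the data
  do(data.frame(., str_match(.$Call_Num, "([A-Z]+)(\\d+)\\s*\\.")[, -1],
                stringsAsFactors = FALSE)) %>%
  # rename_() is deprecated in newer dplyr versions
  rename_(.dots = setNames(names(.)[-1], c('letter', 'number')))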
# Call_Num letter number
#1 HV5822.H4 C47 Circulating Collection, 3rd Floor HV 5822
#2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor QE 511
#3 TL515 .M63 Circulating Collection, 3rd Floor TL 515
#4 D753 .F4 Circulating Collection, 3rd Floor D 753
#5 DB89.F7 D4 Circulating Collection, 3rd Floor DB 89
Or, as @SamFirke commented, the columns can also be renamed with
--- %>%
setNames(., c(names(.)[1], "letter", "number"))
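For the mutate() route that the question asks about, one possible sketch is to index the matrix inside each mutate() argument so that every new column receives a plain vector. This is not taken from the answers above. It calls str_match() twice with the same pattern rather than with two different regular expressions, and the names pattern and df2 are illustrative.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
               "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
               "TL515 .M63 Circulating Collection, 3rd Floor",
               "D753 .F4 Circulating Collection, 3rd Floor",
               "DB89.F7 D4 Circulating Collection, 3rd Floor"),
  stringsAsFactors = FALSE
)

# Same pattern as in the question, stored once so both calls agree
pattern <- "([A-Z]+)(\\d+)\\s*\\."

df2 <- df %>%
  mutate(
    letter = str_match(Call_Num, pattern)[, 2],  # first capture group
    number = str_match(Call_Num, pattern)[, 3]   # second capture group
  )
```

The match is computed twice, which is wasteful on large data but keeps everything inside a single mutate() call.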