Formatting and Replacing Multiple Dates within a Single String in R

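I have a problem that is very similar to an earlier question. The difference is that my text can contain multiple dates within a single string. All of the dates share the same format, as shown below: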

rep <- "on the evening of june 11 2022, i was too tired to complete my homework that was due on august 4 2022. on august 25 2022 there will be a test "

All of my sentences are lowercase, and every date follows the %B %d %Y format. I can extract all of the dates with the following code:

> pattern <-  paste(month.name, "[:digit:]{1,2}", "[:digit:]{4}", collapse = "|") %>% 
     regex(ignore_case = TRUE)
> str_extract_all(rep, pattern)
[[1]]
[1] "june 11 2022"   "august 4 2022"  "august 25 2022"

What I want to do is replace every date in the %B %d %Y format with the same date in the %Y-%m-%d format. I tried something like this:

str_replace_all(rep, pattern, as.character(as.Date(str_extract_all(rep, pattern),format = "%B %d %Y")))

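This throws the error do not know how to convert 'str_extract_all' to class "Date". That makes sense to me, since I am trying to replace several different dates and R does not know which date to use for each replacement.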
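For reference, here is a minimal sketch of where that error comes from. The shortened string `txt` and the simplified pattern are illustrative assumptions, not taken from the post, and the date parsing assumes an English locale. str_extract_all() returns a list with one element per input string, and as.Date() has no method for lists; taking the first list element gives a plain character vector that as.Date() can parse.

```r
library(stringr)

# Hypothetical shortened input and simplified pattern, for illustration only
txt <- "due on august 4 2022, test on august 25 2022"
matches <- str_extract_all(txt, "[a-z]+ \\d{1,2} \\d{4}")

# The result is a list, which is why as.Date() cannot convert it directly
is.list(matches)
failed <- inherits(try(as.Date(matches, format = "%B %d %Y"), silent = TRUE),
                   "try-error")

# The first list element is a character vector of matches, which does parse
dates <- as.Date(matches[[1]], format = "%B %d %Y")
```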

If I change str_extract_all to str_extract, I get:

"on the evening of 2022-06-11, i was too tired to complete my homework that was due on 2022-06-11. on 2022-06-11 there will be a test "

This also makes sense: str_extract takes the first date, converts its format, and then applies that same date to every date match.

I would prefer a solution that uses the stringr package, since most of my string wrangling so far has used it, but I am 100% open to any solution that gets the job done.

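We can capture the pattern as a group ((...)): one or more word characters (\w+), followed by a space, then one or two digits (\d{1,2}), then a space, then four digits (\d{4}). In the replacement we pass a function that converts each captured match to Date class and returns it as a string.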

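library(stringr)
# Backslashes are doubled inside an R string literal; as.character() makes the function return strings
str_replace_all(rep, "(\\w+ \\d{1,2} \\d{4})", function(x) as.character(as.Date(x, format = "%B %d %Y")))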

-output

[1] "on the evening of 2022-06-11, i was too tired to complete my homework that was due on 2022-08-04. on 2022-08-25 there will be a test "

Note: it is better to give the object a different name, because rep is the name of a base R function.
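One caveat with a generic \w+ is that it also matches non-date text such as "version 4 2021", which as.Date() would then turn into NA. A possible variation, sketched below under some assumptions, restricts the first token to the month names. The sample string `txt`, the name `date_pattern`, and the helper `to_iso()` are all hypothetical, and the parsing of %B assumes an English locale.

```r
library(stringr)

# Hypothetical sample text; named txt to avoid masking base::rep()
txt <- "on june 11 2022 i installed version 4 2021 of the app. the test is on august 4 2022"

# Build an alternation of lowercase month names so only real months can match
month_alt <- paste(tolower(month.name), collapse = "|")
date_pattern <- paste0("\\b(?:", month_alt, ") \\d{1,2} \\d{4}\\b")

# Hypothetical helper: parse the matched dates and return ISO-formatted strings
to_iso <- function(x) format(as.Date(x, format = "%B %d %Y"), "%Y-%m-%d")

# The function is applied to the matches; "version 4 2021" is left untouched
result <- str_replace_all(txt, date_pattern, to_iso)
```

Because the helper returns a character vector, it does not depend on how str_replace_all() coerces Date objects.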

You can pass a named vector containing multiple replacements to str_replace_all():

library(stringr)

rep <- "on the evening of june 11 2022, i was too tired to complete my homework that was due on august 4 2022. on august 25 2022 there will be a test "
pattern <-  paste(month.name, "[:digit:]{1,2}", "[:digit:]{4}", collapse = "|") %>% 
  regex(ignore_case = TRUE)
extracted <- str_extract_all(rep, pattern)[[1]]
replacements <- setNames(as.character(as.Date(extracted, format = "%B %d %Y")), 
                     extracted)
str_replace_all(rep, replacements)
#> [1] "on the evening of 2022-06-11, i was too tired to complete my homework that was due on 2022-08-04. on 2022-08-25 there will be a test "

Created on 2022-05-26 by the reprex package (v2.0.1)
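The named-vector approach should also extend to several strings at once, since the names are treated as patterns and applied to every element. The sketch below assumes a hypothetical character vector `texts` and a slightly different pattern built with paste0() (both are illustrative, not from the answer above), and it assumes an English locale for parsing %B.

```r
library(stringr)

# Hypothetical input: several strings instead of a single one
texts <- c(
  "homework was due on august 4 2022",
  "no dates here",
  "tests on june 11 2022 and august 25 2022"
)

# Month names as one alternation, followed by day and four-digit year
date_pattern <- regex(
  paste0("(?:", paste(month.name, collapse = "|"), ") \\d{1,2} \\d{4}"),
  ignore_case = TRUE
)

# Collect every distinct date string found across all inputs
extracted <- unique(unlist(str_extract_all(texts, date_pattern)))

# Names are the text to find, values are the ISO-formatted replacements
replacements <- setNames(
  as.character(as.Date(extracted, format = "%B %d %Y")),
  extracted
)

converted <- str_replace_all(texts, replacements)
```

This sketch assumes at least one date is found; an input with no dates at all would need a guard before calling str_replace_all().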