Split numbers and dates into separate columns

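My data contains text strings that end with three important pieces of information: an ID number whose two parts are separated by ":", a start date, and an end date. I need to split these three values into three separate columns. I have tried different solutions, from unnest_tokens and grepl/grep to separate, but I can't get any of them to work properly. I may manage to pull out one date, but I can't get the values into the right order or into a data frame.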

Input data:

input<- data.frame(
  id=c(1,2,3),
  value=c("a long title containing all sorts - off `characters` 2022:03 29.10.2021 
  21.02.2022",
  "but the strings always end with the same - document id, start date: and end date  2022:02 
  30.04.2020 18.02.2022",
  "so I need to split document id, start and end dates into separate columns 2000:01 
  07.10.2000 15.02.2021")
  )

Desired output:

output <-data.frame(
 id=c(1,2,3),
 value=c("a long title containing all sorts - off `characters`",
 "but the strings always end with the same - document id, start date: and end date",
 "so I need to split document id, start and end dates into separate columns"),
 docid=c("2022:03", "2022:02", "2000:01"),
 start=c("29.10.2021", "30.04.2020", "07.10.2000"),
 end=c("21.02.2022", "18.02.2022", "15.02.2021")
  )

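This is most conveniently done with extract: in its regex argument, we describe the whole string we want to split as one pattern, and the parts that should go into the new columns are wrapped in capture groups (...):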

library(tidyr)
# Backslashes must be doubled inside an R string literal
input %>%
  extract(value,
          into = c("value", "docid", "start", "end"),
          regex = "(.*)\\s(\\d{4}:\\d{2})\\s{1,}(.*)\\s{1,}(.*)")
  id                                                                             value   docid      start
1  1                              a long title containing all sorts - off `characters` 2022:03 29.10.2021
2  2 but the strings always end with the same - document id, start date: and end date  2022:02 30.04.2020
3  3         so I need to split document id, start and end dates into separate columns 2000:01 07.10.2000
         end
1 21.02.2022
2 18.02.2022
3 15.02.2021
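
The same capture-group idea can also be sketched in base R, for cases where tidyr is not available. This is only an illustration, not part of the answer above: the sample vector x, the stricter date pattern, and the names pattern, parts and result are all assumptions made for this example. It also assumes every string matches the pattern, since a non-matching string would yield an empty match and break the rbind step.

```r
# Hypothetical sample strings shaped like the question's data:
# free text, then document id, start date, end date (one contains a line break).
x <- c("a long title - off `characters` 2022:03 29.10.2021 \n  21.02.2022",
       "start date: and end date  2022:02 30.04.2020 18.02.2022")

# Each field gets an explicit shape; \\s+ absorbs any run of whitespace,
# including line breaks. The lazy (.*?) keeps trailing spaces out of the title.
pattern <- paste0("^(.*?)\\s+(\\d{4}:\\d{2})",
                  "\\s+(\\d{2}\\.\\d{2}\\.\\d{4})",
                  "\\s+(\\d{2}\\.\\d{2}\\.\\d{4})\\s*$")

# regexec() returns the full match followed by one entry per capture group.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Stack the matches into a matrix and drop the first column (the full match).
parts <- do.call(rbind, m)[, -1, drop = FALSE]
colnames(parts) <- c("value", "docid", "start", "end")
result <- as.data.frame(parts, stringsAsFactors = FALSE)
```

Spelling out the date shape instead of using (.*) means a malformed row fails to match rather than being split in the wrong place, which may or may not be what you want for your data.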