Split numbers and dates into separate columns
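My data contains text strings with three important features: an ID number separated by ":", a start date, and an end date. I need to split these three values into three separate columns. I have tried different solutions, from unnest_tokens and grepl/grep to separate, but I can't get any of them to work. I may get one date out, but I can't get the values in the right order or into a data frame.
Input data: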
input<- data.frame(
id=c(1,2,3),
value=c("a long title containing all sorts - off `characters` 2022:03 29.10.2021
21.02.2022",
"but the strings always end with the same - document id, start date: and end date 2022:02
30.04.2020 18.02.2022",
"so I need to split document id, start and end dates into separate columns 2000:01
07.10.2000 15.02.2021")
)
Desired output:
output <-data.frame(
id=c(1,2,3),
value=c("a long title containing all sorts - off `characters`",
"but the strings always end with the same - document id, start date: and end date",
"so I need to split document id, start and end dates into separate columns"),
docid=c("2022:03", "2022:02", "2000:01"),
start=c("29.10.2021", "30.04.2020", "07.10.2000"),
end=c("21.02.2022", "18.02.2022", "15.02.2021")
)
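This is most conveniently done with extract: in its regex argument we describe the whole string we want to split as one pattern, and wrap each part that should go into its own column in a capture group (...):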
library(tidyr)  # tidyr also re-exports the %>% pipe
input %>%
  extract(value,
          into = c("value", "docid", "start", "end"),
          # backslashes must be doubled inside an R string literal
          regex = "(.*)\\s(\\d{4}:\\d{2})\\s{1,}(.*)\\s{1,}(.*)")
id value docid start
1 1 a long title containing all sorts - off `characters` 2022:03 29.10.2021
2 2 but the strings always end with the same - document id, start date: and end date 2022:02 30.04.2020
3 3 so I need to split document id, start and end dates into separate columns 2000:01 07.10.2000
end
1 21.02.2022
2 18.02.2022
3 15.02.2021
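For readers who want to see the capture-group idea without tidyr, here is a minimal base R sketch. It is not part of the answer above. The sample vector `x`, the stricter `dd.mm.yyyy` date groups, and the conversion to `Date` are assumptions made for illustration. Rows that do not match the pattern would need separate handling.

```r
# Hypothetical sample strings shaped like the question's data
# (an embedded newline may sit before either date).
x <- c("a long title 2022:03 29.10.2021\n21.02.2022",
       "another title: with punctuation 2022:02\n30.04.2020 18.02.2022")

# One capture group per target column; \\s+ also matches the newline.
pattern <- paste0(
  "^(.*)\\s+",                        # group 1: free text before the id
  "(\\d{4}:\\d{2})\\s+",              # group 2: document id, e.g. 2022:03
  "(\\d{2}\\.\\d{2}\\.\\d{4})\\s+",   # group 3: start date, dd.mm.yyyy
  "(\\d{2}\\.\\d{2}\\.\\d{4})$"       # group 4: end date, dd.mm.yyyy
)

# regexec() returns the full match plus one entry per capture group.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Stack the matches into a matrix and drop the full-match column.
parts <- do.call(rbind, m)[, -1, drop = FALSE]
colnames(parts) <- c("value", "docid", "start", "end")

res <- as.data.frame(parts, stringsAsFactors = FALSE)

# Assumption: real Date columns are wanted rather than character strings.
res$start <- as.Date(res$start, format = "%d.%m.%Y")
res$end   <- as.Date(res$end,   format = "%d.%m.%Y")
```

Spelling out the date format in groups 3 and 4, rather than using `(.*)`, makes the pattern fail visibly on malformed rows instead of silently capturing unexpected text.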