Extracting dates in R, from a string variable with different date formats exhibiting lack of general structure / difficult pattern

Question

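I have a column of about 1300 character strings from which I need to extract a date, if the string contains one (i.e. if it is NA there is no date to get). If it contains more than one date, I need only one; if it contains a date interval, I don't need those dates.

For example, here are 10 observations that best illustrate the different cases, with a comment next to each: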

string1 <- 'Eff. 1/1/96 ACWD Res #96-006 Service' # need the date
string2 <- 'NA' # irrelevant 
string3 <- 'Effective 2/1/07' # need the date
string4 <- 'Effective: 3/01/2011' # need the date
string5 <- 'Eff. July 1, 1995 Ord. #92 Includes Cap Exp Ch' # need the date
string6 <- 'Effective: 2010-11' # need the date
string7 <- 'Eff. January 02' # need the date 
string8 <- 'Effective 1/1/09 Billing (svc prd 10/15 - 12/15/08)' # need first date only, not intervals
string9 <- 'Eff. 9/1/95 Resolution No. 63-95 1st 1000 g. free' # need the date
string10 <- '(svc prd 10/15-12/15/08)' # don't need interval dates

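So strings 1, 3, 8 and 9 share one format, while string4, string5, string6 and string7 each have a different date format. On top of that, string6 and string7 have further problems. string6 can be taken as 1/1/10 (in general, 1/1/FIRST YEAR), while the year for string7 can be identified from another character column, named FY, which contains values such as FY 9596, so string7 can be taken as 1/2/95.

The desired output for the 10 strings should be the following (all YYYY-MM-DD would be fine too, as long as it is consistent):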
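To make the two special rules above concrete, here is a minimal base R sketch. The helper names parse_year_range and parse_month_name are hypothetical (they are not from the post), and the FY value is passed in by hand here, whereas in the real data it would come from the FY column. Month names are matched against R's built-in month.name, so this assumes full English month names as in the examples.

```r
# Hypothetical helpers, written only to illustrate the rules described above.

# "Effective: 2010-11" -> January 1 of the first year.
parse_year_range <- function(x) {
  m <- regmatches(x, regexpr("\\d{4}-\\d{2}", x))   # e.g. "2010-11"
  if (length(m) == 0) return(as.Date(NA))
  as.Date(paste0(substr(m, 1, 4), "-01-01"))
}

# "<Month> <day>" optionally followed by ", <4-digit year>".
month_pattern <- paste0("(", paste(month.name, collapse = "|"),
                        ") +(\\d{1,2})(, *(\\d{4}))?")

# If the string has no year, take the first two digits of fy ("FY 9596" -> 95).
parse_month_name <- function(x, fy = NA_character_) {
  m <- regmatches(x, regexec(month_pattern, x))[[1]]
  if (length(m) == 0) return(as.Date(NA))
  mon <- match(m[2], month.name)
  day <- as.integer(m[3])
  if (nzchar(m[5])) {
    as.Date(sprintf("%s-%02d-%02d", m[5], mon, day))
  } else {
    yy <- substr(sub("^FY *", "", fy), 1, 2)
    # %y maps 69-99 to 19xx and 00-68 to 20xx
    as.Date(sprintf("%s-%02d-%02d", yy, mon, day), format = "%y-%m-%d")
  }
}

parse_year_range("Effective: 2010-11")                             # "2010-01-01"
parse_month_name("Eff. July 1, 1995 Ord. #92 Includes Cap Exp Ch") # "1995-07-01"
parse_month_name("Eff. January 02", fy = "FY 9596")                # "1995-01-02"
```

Both helpers handle one string at a time; wrapping them in sapply or mapply would be one way to apply them to a whole column.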

1/1/96
NULL
2/1/07
3/1/11
7/1/95
1/1/10
1/2/95
1/1/09
9/1/95
NULL

When I test it on all 10 at once, using the following

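library(gsubfn)  # provides strapplyc()
string <- c(string1, string2, string3, string4, string5, string6, string7, string8, string9, string10)
for (j in 1:10) {
  print(strapplyc(string[j], "\\d+/\\d+/\\d+", simplify = TRUE))
}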

because of the structural differences between the date formats, I get the following:

Error in if (nchar(s) > 0 && substring(s, 1, 1) == "\002") { : 
  missing value where TRUE/FALSE needed

In particular, string5, string6 and string7 fail to return what I need: as expected, I get NULL. Also, string8 fails to return what I need, because I get

      [,1]      
[1,] "1/1/09"  
[2,] "12/15/08"

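Finally, string10 fails to return what I need: I get 12/15/08.

For string5, string6 and string7, would ifelse inside mutate be the most efficient approach? For string10, I was thinking of assigning NULL whenever the date is preceded by a -, since I think that may indicate an interval that is irrelevant for my purposes, but string6 contains a hyphen that I do need.

To the best of my knowledge, I have seen some related posts here, & here, but I think this case is quite different. Apologies in advance if that is not so.

Any help is greatly appreciated!!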
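One way to sidestep the hyphen problem, sketched below in base R, is to drop parenthesised text before matching and then keep only the first d/d/d date. This rests on an assumption that is not stated in the post: that intervals always appear inside parentheses, as they do in string8 and string10. The function names extract_first_date and to_date are hypothetical.

```r
# Hypothetical helper: first m/d/y date outside parentheses, NA if none.
extract_first_date <- function(x) {
  cleaned <- gsub("\\([^)]*\\)", "", x)            # drop "(...)" groups
  m <- regexpr("\\d{1,2}/\\d{1,2}/\\d{2,4}", cleaned)
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m > 0                         # regexpr gives -1 for no match
  out[hit] <- regmatches(cleaned, m)
  out
}

# Hypothetical helper: pick the year format from the number of year digits.
to_date <- function(s) {
  fmt <- ifelse(grepl("/\\d{4}$", s), "%m/%d/%Y", "%m/%d/%y")
  as.Date(s, format = fmt)
}

x <- c("Effective 1/1/09 Billing (svc prd 10/15 - 12/15/08)",
       "(svc prd 10/15-12/15/08)",
       NA,
       "Effective: 3/01/2011")
extract_first_date(x)            # "1/1/09" NA NA "3/01/2011"
to_date(extract_first_date(x))   # "2009-01-01" NA NA "2011-03-01"
```

Because only slash-separated dates are matched, the hyphen in 'Effective: 2010-11' is untouched here and can be handled by a separate rule.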

Answer 1

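Following @mnist's comment and the pattern agreed on in my subsequent comment, I split the data with grepl (let myData denote my data frame and String the column holding all 1300 string observations):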

myData <- myData %>% filter(grepl("Eff|eff|Ef",String))

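Then I split myData again into 2 subsets: case 1 (the nice case), corresponding to filter(grepl("\\d+/\\d+/\\d+", String)), and case 2, corresponding to filter(!grepl("\\d+/\\d+/\\d+", String)). It turns out that case 2 (the annoying case) makes up only 3% of the observations (<50 obs), so I figured I would handle it manually since it is not many.

It also turns out that case 1 contains only one observation like string8, so I corrected that one manually.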
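For readers who want to try the split, here is a rough base R equivalent on a toy data frame. The five rows are taken from the question's examples, not from the real data, and the Date column added at the end is an assumption about what the extraction step could look like; it is not part of the answer above.

```r
# Toy stand-in for the real myData (about 1300 rows in the original).
myData <- data.frame(
  String = c("Eff. 1/1/96 ACWD Res #96-006 Service",
             NA,
             "Eff. July 1, 1995 Ord. #92 Includes Cap Exp Ch",
             "Effective: 2010-11",
             "(svc prd 10/15-12/15/08)"),
  stringsAsFactors = FALSE
)
date_pattern <- "\\d+/\\d+/\\d+"

# Keep rows mentioning an effective date; grepl() is FALSE for NA, so NA rows drop out.
eff <- myData[grepl("Eff|eff|Ef", myData$String), , drop = FALSE]

has_date <- grepl(date_pattern, eff$String)
case1 <- eff[has_date, , drop = FALSE]    # nice case: contains a d/d/d date
case2 <- eff[!has_date, , drop = FALSE]   # annoying case: handle by hand

# Every row of case1 matches, so regmatches() returns one value per row.
case1$Date <- regmatches(case1$String, regexpr(date_pattern, case1$String))

nrow(case1)   # 1
nrow(case2)   # 2
case1$Date    # "1/1/96"
```

regexpr only returns the first match per string, so a row like string8 would yield 1/1/09 here rather than both dates.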

regex

r

dplyr

grepl

gsubfn