How to extract a date from text

Question

I am trying to extract dates from the text below. Unfortunately, I keep getting a warning and the result is NA.

I have the following text:

"IRA-401K Investment Assets Under Management (AUM)  As of July 31, 2018 BMG Funds  
7,743,573 BMG BullionBars  ,176,561 TOTAL  2,920,134 Physical Holdings Download 
Scotiabank BMG BullionBars List Download Brinks BMG BullionBars List Holdings by Ounces As 
of July 31, 2018  Gold Bars 21,132.496 Silver Bars 453,531.574 Silver Coins 
80,500 Platinum Bars"

The text contains the date July 31, 2018, which appears twice.

I used the following code to extract the dates from the text.

test_take <- lapply(cleanurl_text, parse_date_time, orders = "mdy", 
             locale = Sys.setlocale('LC_TIME', locale = "English_Canada.1252"))

I get the following warning:

Warning message: All formats failed to parse. No formats found.

When I add exact = TRUE:

test_take <- lapply(as.character(cleanurl_text), parse_date_time, orders = "mdy", 
       locale = Sys.setlocale('LC_TIME', locale = "English_Canada.1252"), exact = TRUE)

I get the following warning:

Warning message: 1 failed to parse.

The resulting object still contains NA.

Answer 1

The following regular expression extracts dates in the format shown in the question.

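# Alternation of the full English month names: "January|February|...|December"
pattern <- paste(month.name, collapse = "|")
# Month name, whitespace, 1-2 digit day, 1-2 separator characters, 4-digit year.
# Backslashes must be doubled inside an R string literal.
pattern <- paste0("(", pattern, ")\\s\\d{1,2}.{1,2}\\d{4}")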

m <- gregexpr(pattern, cleanurl_text)
regmatches(cleanurl_text, m)
#[[1]]
#[1] "July 31, 2018" "July 31, 2018"

Note that this can be done in a single line of code, regmatches(gregexpr(.)), but I split it into two lines to make it more readable.
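The matches are still character strings, so if actual Date values are needed they have to be converted in a second step. A minimal sketch in base R, assuming the same pattern as in the answer and a shortened copy of the sample text (the variable names date_strings, dates and old_locale are only illustrative):

```r
# Sample input, shortened from the text in the question.
cleanurl_text <- "Assets Under Management (AUM)  As of July 31, 2018 BMG Funds Holdings by Ounces As of July 31, 2018  Gold Bars 21,132.496"

# Same pattern as in the answer: month name, day, separator, four-digit year.
pattern <- paste0("(", paste(month.name, collapse = "|"), ")\\s\\d{1,2}.{1,2}\\d{4}")

# One-line form of the extraction; [[1]] takes the matches for the single input string.
date_strings <- regmatches(cleanurl_text, gregexpr(pattern, cleanurl_text))[[1]]

# %B matches month names in the current LC_TIME locale.
# The "C" locale uses English month names, so switch to it temporarily.
old_locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")
dates <- as.Date(date_strings, format = "%B %d, %Y")
Sys.setlocale("LC_TIME", old_locale)

unique(dates)
#[1] "2018-07-31"
```

Extracting first and parsing second keeps the parser away from the surrounding text, which is probably what made parsing the whole paragraph fail in the question.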
