How to extract dates in dd/mm/yyyy format from an unstructured string?
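I have several strings like the ones below:
'Thursday;60 days;Monday, days;the last two years;the six months;October 2017;March 2018;three days;Jan. 4;Last year;Dec. 21;'
Expected result: October 2017
'January 7;30;39;24;46;1750;April 2017;April 30;February;'
Expected result: April 2017
'Thursday;a day;another six days;the day;Tuesday;three days;mid-October;Wednesday;'
Expected result: mid-October
I know these strings are completely unstructured, but is there a way to pull the dates out of them with Python?
This is part of an NER model in which I am trying to extract date entities.
I have tried a few approaches, but because the strings follow no consistent pattern, none of them came anywhere close to the expected results.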
You can use datefinder to find candidate date strings, and then a regex to keep only those that contain a month name:
import re
import datefinder

strs = ['Thursday;60 days;Monday, days;the last two years;the six months;October 2017;March 2018;three days;Jan. 4;Last year;Dec. 21;',
        'January 7;30;39;24;46;1750;April 2017;April 30;February;',
        'Thursday;a day;another six days;the day;Tuesday;three days;mid-October;Wednesday;']
# Matches full and abbreviated English month names (case-insensitive)
month_rx = re.compile(r'(?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?|J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?|Oct(?:ober)?|Sep(?:tember)?)', re.I)
for s in strs:
    # With source=True, find_dates yields (datetime, matched_text) tuples
    raw_dates = list(datefinder.find_dates(s, source=True))
    # Keep only the matched substrings that mention a month
    print([text for _, text in raw_dates if month_rx.search(text)])
Output:
['October 2017', 'March 2018', 'Jan. 4', 'Dec. 21']
['January 7', 'April 2017', 'April 30']
[]
Note that mid-October cannot be converted to a valid datetime, so it is not extracted. For cases like this you will need a more specific regex, for example re.search(r'\b(?:half|mid)-(?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?|J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?|Oct(?:ober)?|Sep(?:tember)?)', text).
The (?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?|J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?|Oct(?:ober)?|Sep(?:tember)?) pattern matches full and abbreviated English month names.
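If the goal is exactly one value per string, as in the expected results in the question, a regex-only variant may be enough. The sketch below is an assumption about the selection rule, not something stated in the question: it prefers the first "Month YYYY" item and falls back to a "mid-Month" or "half-Month" item. The names MONTH, MONTH_YEAR_RX, MID_MONTH_RX and pick_date are hypothetical, and datefinder is not used here.

```python
import re

# Full and abbreviated English month names (same alternation as above).
MONTH = (r'(?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?|'
         r'J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?|'
         r'Oct(?:ober)?|Sep(?:tember)?)')

# "October 2017", "Oct. 2017": a month followed by a four-digit year.
MONTH_YEAR_RX = re.compile(MONTH + r'\.?\s+\d{4}', re.I)
# "mid-October", "half-May": a fuzzy reference to a month.
MID_MONTH_RX = re.compile(r'(?:half|mid)-' + MONTH, re.I)


def pick_date(s):
    """Return the first 'Month YYYY' item, else the first 'mid-Month' item.

    Assumed rule: items are separated by ';' and an item only counts
    if the whole item matches one of the patterns.
    """
    parts = [p.strip() for p in s.split(';')]
    for rx in (MONTH_YEAR_RX, MID_MONTH_RX):
        for part in parts:
            if rx.fullmatch(part):
                return part
    return None


strs = [
    'Thursday;60 days;Monday, days;the last two years;the six months;'
    'October 2017;March 2018;three days;Jan. 4;Last year;Dec. 21;',
    'January 7;30;39;24;46;1750;April 2017;April 30;February;',
    'Thursday;a day;another six days;the day;Tuesday;three days;'
    'mid-October;Wednesday;',
]
for s in strs:
    print(pick_date(s))
# October 2017
# April 2017
# mid-October
```

Using fullmatch on each item keeps entries such as 'Jan. 4' or 'April 30' from being picked up, since they lack a four-digit year. If several "Month YYYY" items should be returned instead of only the first, collecting the matching parts in a list would be the natural change.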