Extraction of some date formats failed when using Dateutil in Python
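Before posting this question I went through several links, so please read it carefully. Here are the two answers that solved 90% of my problem:
parse multiple dates using dateutil
How to parse multiple dates from a block of text in Python (or another language)
Problem: I need to parse multiple dates in multiple formats in Python.
Solution from the links above: I can do this, but there are still some formats they cannot handle.
The formats that still cannot be parsed are:
text='I want to visit from May 16-May 18'
text='I want to visit from May 16-18'
text='I want to visit from May 6 May 18'
I also tried regular expressions, but since the dates can be in any format, I ruled that option out because the code became very complex. So please suggest modifications to the code in the links so that the 3 formats above can be handled by the same code.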
This kind of problem will always need tweaking for new edge cases, but the following approach is fairly robust:
from itertools import groupby, zip_longest
from datetime import datetime
import calendar
import string
import re

def get_date_part(x):
    # Month names pass through unchanged; numeric tokens keep only their digits
    if x.lower() in month_list:
        return x
    day = re.match(r'(\d+)(\b|st|nd|rd|th)', x, re.I)
    if day:
        return day.group(1)
    return False

def month_full(month):
    # Accepts a full or abbreviated month name and returns the abbreviation
    try:
        return datetime.strptime(month, '%B').strftime('%b')
    except ValueError:
        return datetime.strptime(month, '%b').strftime('%b')

tests = [
    'I want to visit from May 16-May 18',
    'I want to visit from May 16-18',
    'I want to visit from May 6 May 18',
    'May 6,7,8,9,10',
    '8 May to 10 June',
    'July 10/20/30',
    'from June 1, july 5 to aug 5 please',
    '2nd March to the 3rd January',
    '15 march, 10 feb, 5 jan',
    '1 nov 2017',
    '27th Oct 2010 until 1st jan',
    '27th Oct 2010 until 1st jan 2012'
]

cur_year = 2017
month_list = [m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if len(m)]
# Translation table that turns every punctuation character into a space
remove_punc = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

for date in tests:
    date_parts = [get_date_part(part) for part in date.translate(remove_punc).split() if get_date_part(part)]
    days = []
    months = []
    years = []
    # Sorting on isdigit() puts the month names first and the numbers last
    for k, g in groupby(sorted(date_parts, key=lambda x: x.isdigit()), lambda y: not y.isdigit()):
        values = list(g)
        if k:
            months = list(map(month_full, values))
        else:
            for v in values:
                if 1900 <= int(v) <= 2100:
                    years.append(int(v))
                else:
                    days.append(v)
    if days and months:
        if years:
            dates_raw = [datetime.strptime('{} {} {}'.format(m, d, y), '%b %d %Y') for m, d, y in zip_longest(months, days, years, fillvalue=years[0])]
        else:
            # No year in the text: assume cur_year and reuse the first month for extra days
            dates_raw = [datetime.strptime('{} {} {}'.format(m, d, cur_year), '%b %d %Y') for m, d in zip_longest(months, days, fillvalue=months[0])]
            years = [cur_year]
        # Fix for jumps in year
        dates = []
        start_date = datetime(years[0], 1, 1)
        next_year = years[0] + 1
        for d in dates_raw:
            if d < start_date:
                d = d.replace(year=next_year)
                next_year += 1
            start_date = d
            dates.append(d)
        print("{} -> {}".format(date, ', '.join(d.strftime("%d/%m/%Y") for d in dates)))
This converts the test strings as follows:
I want to visit from May 16-May 18 -> 16/05/2017, 18/05/2017
I want to visit from May 16-18 -> 16/05/2017, 18/05/2017
I want to visit from May 6 May 18 -> 06/05/2017, 18/05/2017
May 6,7,8,9,10 -> 06/05/2017, 07/05/2017, 08/05/2017, 09/05/2017, 10/05/2017
8 May to 10 June -> 08/05/2017, 10/06/2017
July 10/20/30 -> 10/07/2017, 20/07/2017, 30/07/2017
from June 1, july 5 to aug 5 please -> 01/06/2017, 05/07/2017, 05/08/2017
2nd March to the 3rd January -> 02/03/2017, 03/01/2018
15 march, 10 feb, 5 jan -> 15/03/2017, 10/02/2018, 05/01/2019
1 nov 2017 -> 01/11/2017
27th Oct 2010 until 1st jan -> 27/10/2010, 01/01/2011
27th Oct 2010 until 1st jan 2012 -> 27/10/2010, 01/01/2012
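The year jumps in the last few results (for example `15 march, 10 feb, 5 jan`) come from the rollover rule at the end of the loop. As a minimal sketch, that rule can be pulled out into a standalone function; the name `roll_years_forward` and its parameters are hypothetical and not part of the answer's code.

```python
from datetime import datetime

def roll_years_forward(dates_raw, first_year):
    """Move any date that falls before its predecessor into a later year."""
    dates = []
    start_date = datetime(first_year, 1, 1)
    next_year = first_year + 1
    for d in dates_raw:
        if d < start_date:
            # Out of order: push this date into the next unused year
            d = d.replace(year=next_year)
            next_year += 1
        start_date = d
        dates.append(d)
    return dates

# Same values as the '15 march, 10 feb, 5 jan' test with cur_year = 2017
raw = [datetime(2017, 3, 15), datetime(2017, 2, 10), datetime(2017, 1, 5)]
rolled = roll_years_forward(raw, 2017)
print(', '.join(d.strftime('%d/%m/%Y') for d in rolled))
# 15/03/2017, 10/02/2018, 05/01/2019
```

Note that `replace(year=...)` raises `ValueError` when a 29 February is moved into a non-leap year, so real input may need a guard for that case.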
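It works as follows:
First, build a list of valid month names, both full and abbreviated.
Build a translation table so that any punctuation in the text can be quickly replaced with spaces.
Split the text and keep only the date parts, using a function with a regular expression to pick out days or months.
Sort the list by whether each part is a digit, which groups the months at the front and the numbers at the end.
Pair the months with the days, reusing the first month when there are more days than months. Normalize each month name to its abbreviated form, e.g. August to Aug, and convert each pair into a datetime object.
If a date falls before the previous one, move it forward into the next year.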
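To make the first four steps concrete, here is a small sketch that shows the intermediate values for one of the test strings. It follows the same idea as `get_date_part`, but the names `MONTHS`, `PUNCT_TO_SPACE` and `date_tokens` are illustrative assumptions rather than names from the answer.

```python
import calendar
import re
import string

# Full and abbreviated month names, lowercased (the empty entry at index 0 is dropped)
MONTHS = {m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if m}
# Translation table mapping every punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def date_tokens(text):
    """Return month names and leading digit runs from text, in their original order."""
    tokens = []
    for word in text.translate(PUNCT_TO_SPACE).split():
        if word.lower() in MONTHS:
            tokens.append(word)
            continue
        match = re.match(r'(\d+)(\b|st|nd|rd|th)', word, re.I)
        if match:
            tokens.append(match.group(1))
    return tokens

tokens = date_tokens('from June 1, july 5 to aug 5 please')
print(tokens)   # ['June', '1', 'july', '5', 'aug', '5']

# sorted() is stable, so months and numbers each keep their relative order
ordered = sorted(tokens, key=str.isdigit)
print(ordered)  # ['June', 'july', 'aug', '1', '5', '5']
```

One consequence of matching on bare words is that any word that is also a month name, such as "may" or "march" used as a verb, is treated as a month.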