Properly parsing Excel format for dates
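I have been struggling with this problem for a while. I have tried several Python Excel libraries, and they all seem to have the same issue. For xlsx files, the desired end result is basically "what you see is what you get". Every Python library that interacts with Excel returns the value stored in Excel along with the format that may apply to that value. What I am struggling with is using that format to produce a value that looks like what you would see in Excel or another spreadsheet application such as LibreOffice Calc.
Suppose we have a sheet in which one row looks like this:
The formats (shown using LibreOffice Calc) are here:
Here is some code that opens the sheet and prints the stored values and formats: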
import openpyxl

book = openpyxl.load_workbook(
    'test.xlsx',
    read_only=True,
    data_only=False,
)
sheet = book.worksheets[0]
for row in sheet.iter_rows():
    for cell in row:
        print('FORMAT:', cell.number_format)
        print('VALUE:', cell.value)
        print('TYPE:', type(cell.value))
Running that code (Python 3.6.7, openpyxl 3.0.1) produces the following truncated output:
FORMAT: yyyy\-mm\-dd\Thh:mm\Z
VALUE: 2017-04-19 15:17:00.000004
TYPE: <class 'datetime.datetime'>
...
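My question is: how do I parse that format string (yyyy\-mm\-dd\Thh:mm\Z) into a valid Python strftime datetime representation? I started writing a simple function that uses string replacement to turn yyyy into %Y, yy into %y, and so on. But then I noticed that the format string contains two instances of mm, one for the month and one for the minutes! How are you expected to parse that? Does the month always come first? What happens when there are only minutes? What if you want a datetime format with the time first and the date second?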
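To make the ambiguity concrete, here is a minimal sketch of the blind string-replacement approach described above. The names `NAIVE_CODES` and `naive_convert` are made up for this illustration and are not part of any library. Because every mm is mapped to %m, the minutes field ends up showing the month.

```python
from datetime import datetime

# Naive mapping: every "mm" becomes the month code, regardless of context.
NAIVE_CODES = [('yyyy', '%Y'), ('yy', '%y'), ('mm', '%m'), ('dd', '%d'), ('hh', '%H')]


def naive_convert(excel_format):
    """Blind string replacement (illustrative only, not a real converter)."""
    result = excel_format.replace('\\', '')  # drop Excel's escape backslashes
    for excel_code, strftime_code in NAIVE_CODES:
        result = result.replace(excel_code, strftime_code)
    return result


value = datetime(2017, 4, 19, 15, 17)
naive_format = naive_convert(r'yyyy\-mm\-dd\Thh:mm\Z')
print(naive_format)                  # %Y-%m-%dT%H:%mZ
print(value.strftime(naive_format))  # 2017-04-19T15:04Z (should be 15:17)
```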
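Any help would be appreciated: either a Python library that already does this, a well-documented specification of the xlsx file format that would let me build my own parser (I found this, but it does not seem to have what I want: https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/300280fd-e4fe-4675-a924-4d383af48d3b), or an example in another language. It would also be nice if this could be generalized beyond dates and work consistently for all Excel formats.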
Question: Parse string ("yyyy-mm-dd\Thh:mm\Z") into a valid datetime.strftime Format Code.
import re
from datetime import datetime


class XLSXdatetime:
    # Excel codes (and code groups) mapped to their strftime equivalents
    translate = {'yyyy': '%Y', 'mm': '%m', 'dd': '%d',
                 'hh:mm': '%H:%M', 'hh:mm:ss': '%H:%M:%S'}
    # Matches a run of word characters/colons, or one backslash-escaped character
    rec = re.compile(r'([\w:]+|\\.)')

    def __init__(self, xlsx_format):
        self.xlsx_format = xlsx_format

    @property
    def format(self):
        _format = []
        for item in XLSXdatetime.rec.findall(self.xlsx_format):
            # Drop the escape backslash and keep the literal character
            if item.startswith('\\'):
                item = item[1:]
            _format.append(XLSXdatetime.translate.get(item, item))
        return ''.join(_format)

    def strftime(self, data):
        return data.strftime(self.format)
Usage:
data = datetime.strptime('2017-04-19 15:17:00.000004', '%Y-%m-%d %H:%M:%S.%f')
print('data: {}'.format(data))

# Long version
# Raw strings keep the Excel escape backslashes intact
for _format in ['yyyy-mm-dd hh:mm:ss',
                r'yyyy\-mm\-dd\Thh:mm\Z'
                ]:
    xlsx_datetime = XLSXdatetime(_format)
    print("{} => {} = '{}'".format(_format,
                                   xlsx_datetime.format,
                                   xlsx_datetime.strftime(data)))
Output:
data: 2017-04-19 15:17:00.000004
yyyy-mm-dd hh:mm:ss => %Y%m%d%H:%M:%S = '2017041915:17:00'
yyyy\-mm\-dd\Thh:mm\Z => %Y-%m-%dT%H:%MZ = '2017-04-19T15:17Z'
# Short version
for _format in ['yyyy-mm-dd hh:mm:ss',
                r'yyyy\-mm\-dd\Thh:mm\Z'
                ]:
    print("'{}'".format(XLSXdatetime(_format).strftime(data)))
Output:
data: 2017-04-19 15:17:00.000004
'2017041915:17:00'
'2017-04-19T15:17Z'
Tested with Python: 3.6
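Unfortunately, stovfl's solution does not actually generalize to all xlsx formats. After a lot of searching through Microsoft's documentation, I finally found this page, which documents some of the rules for Excel number_format.
Important notes:
- mm and m mean minutes only if the immediately preceding code is hh or h (hours) or the immediately following code is ss or s (seconds); otherwise mm and m mean months.
- Most non-code characters must be preceded by a backslash.
- Characters enclosed in quotation marks are interpreted literally (not as codes).
- A large set of characters is displayed without any escape character or quotation marks.
- There is the concept of sections, separated by semicolons. For the purposes of this solution I chose to ignore sections (because if they are actually used, the resulting output does not look like a date).
- Some Excel codes have no equivalent in strftime. For example, mmmmm displays the first letter of the month. For my solution I chose to replace these with similar strftime codes (for mmmmm I chose %b, which displays the abbreviated month name). I noted these codes in the comments.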
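The minute-or-month rule from the first note can be sketched in isolation as follows. This is a simplified illustration, not the full converter: `CODE_RE` and `classify_m_codes` are hypothetical names, and the sketch deliberately ignores quoted text, escaped letters, sections, and AM/PM codes.

```python
import re

# Matches one run of a repeated date/time letter, e.g. "yyyy", "mm", "h".
CODE_RE = re.compile(r'y+|m+|d+|h+|s+', re.IGNORECASE)


def classify_m_codes(excel_format):
    """Return (code, meaning) pairs for each code found in the format."""
    codes = [match.group(0).lower() for match in CODE_RE.finditer(excel_format)]
    classified = []
    for i, code in enumerate(codes):
        if code in ('m', 'mm'):
            prev_code = codes[i - 1] if i > 0 else ''
            next_code = codes[i + 1] if i + 1 < len(codes) else ''
            # Minutes only when right after hours or right before seconds
            is_minute = prev_code in ('h', 'hh') or next_code in ('s', 'ss')
            classified.append((code, 'minute' if is_minute else 'month'))
        else:
            classified.append((code, 'other'))
    return classified


print(classify_m_codes(r'yyyy\-mm\-dd\Thh:mm\Z'))
# [('yyyy', 'other'), ('mm', 'month'), ('dd', 'other'), ('hh', 'other'), ('mm', 'minute')]
print(classify_m_codes('mm:ss'))
# [('mm', 'minute'), ('ss', 'other')]
```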
Anyway, I built a function that, given an Excel number_format date string, returns the Python strftime equivalent. I hope this helps anyone looking for a way to carry "what you see is what you get" over from Excel to Python.
EXCEL_CODES = {
    'yyyy': '%Y',
    'yy': '%y',
    'dddd': '%A',
    'ddd': '%a',
    'dd': '%d',
    'd': '%-d',
    # Different from excel as there is no J-D in strftime
    'mmmmm': '%b',
    'mmmm': '%B',
    'mmm': '%b',
    'hh': '%H',
    'h': '%-H',
    'ss': '%S',
    's': '%-S',
    # Possibly different from excel as there is no am/pm in strftime
    'am/pm': '%p',
    # Different from excel as there is no A/P or a/p in strftime
    'a/p': '%p',
}
EXCEL_MINUTE_CODES = {
    'mm': '%M',
    'm': '%-M',
}
EXCEL_MONTH_CODES = {
    'mm': '%m',
    'm': '%-m',
}
EXCEL_MISC_CHARS = [
    '$',
    '+',
    '(',
    ':',
    '^',
    '\'',
    '{',
    '<',
    '=',
    '-',
    '/',
    ')',
    '!',
    '&',
    '~',
    '}',
    '>',
    ' ',
]
EXCEL_ESCAPE_CHAR = '\\'
EXCEL_SECTION_DIVIDER = ';'
def convert_excel_date_format_string(excel_date):
    '''
    Created using documentation here:
    https://support.office.com/en-us/article/review-guidelines-for-customizing-a-number-format-c0a1d1fa-d3f4-4018-96b7-9c9354dd99f5
    '''
    # The python date string that is being built
    python_date = ''
    # The excel code currently being parsed
    excel_code = ''
    prev_code = ''
    # If the previous character was the escape character
    char_escaped = False
    # If we are in a quotation block (surrounded by "")
    quotation_block = False
    # Variables used for checking if a code should be a minute or a month
    checking_minute_or_month = False
    minute_or_month_buffer = ''

    for c in excel_date:
        ec = excel_code.lower()
        # The previous character was an escape, the next character should be added normally
        if char_escaped:
            if checking_minute_or_month:
                minute_or_month_buffer += c
            else:
                python_date += c
            char_escaped = False
            continue
        # Inside a quotation block
        if quotation_block:
            if c == '"':
                # Quotation block should now end
                quotation_block = False
            elif checking_minute_or_month:
                minute_or_month_buffer += c
            else:
                python_date += c
            continue
        # The start of a quotation block
        if c == '"':
            quotation_block = True
            continue
        if c == EXCEL_SECTION_DIVIDER:
            # We ignore excel sections for datetimes
            break

        is_escape_char = c == EXCEL_ESCAPE_CHAR
        # The am/pm and a/p code add some complications, need to make sure we are not that code
        is_misc_char = c in EXCEL_MISC_CHARS and (c != '/' or (ec != 'am' and ec != 'a'))
        # Code is finished, check if it is a proper code
        if (is_escape_char or is_misc_char) and ec:
            # Checking if the previous code should have been minute or month
            if checking_minute_or_month:
                if ec == 'ss' or ec == 's':
                    # It should be a minute!
                    minute_or_month_buffer = EXCEL_MINUTE_CODES[prev_code] + minute_or_month_buffer
                else:
                    # It should be a month!
                    minute_or_month_buffer = EXCEL_MONTH_CODES[prev_code] + minute_or_month_buffer
                python_date += minute_or_month_buffer
                checking_minute_or_month = False
                minute_or_month_buffer = ''

            if ec in EXCEL_CODES:
                python_date += EXCEL_CODES[ec]
            # Handle months/minutes differently
            elif ec in EXCEL_MINUTE_CODES:
                # If preceded by hours, we know this is referring to minutes
                if prev_code == 'h' or prev_code == 'hh':
                    python_date += EXCEL_MINUTE_CODES[ec]
                else:
                    # Have to check if the next code is ss or s
                    checking_minute_or_month = True
                    minute_or_month_buffer = ''
            else:
                # Have to abandon this attempt to convert because the code is not recognized
                return None
            prev_code = ec
            excel_code = ''
        if is_escape_char:
            char_escaped = True
        elif is_misc_char:
            # Add the misc char
            if checking_minute_or_month:
                minute_or_month_buffer += c
            else:
                python_date += c
        else:
            # Just add to the code
            excel_code += c

    # Complete, check if there is still a buffer
    ec = excel_code.lower()
    if checking_minute_or_month:
        if ec == 'ss' or ec == 's':
            # The trailing code is seconds, so the buffered code is a minute
            minute_or_month_buffer = EXCEL_MINUTE_CODES[prev_code] + minute_or_month_buffer
        else:
            # No seconds code follows, so the buffered code is a month
            minute_or_month_buffer = EXCEL_MONTH_CODES[prev_code] + minute_or_month_buffer
        python_date += minute_or_month_buffer
    if excel_code:
        if ec in EXCEL_CODES:
            python_date += EXCEL_CODES[ec]
        elif ec in EXCEL_MINUTE_CODES:
            if prev_code == 'h' or prev_code == 'hh':
                python_date += EXCEL_MINUTE_CODES[ec]
            else:
                python_date += EXCEL_MONTH_CODES[ec]
        else:
            return None
    return python_date
Tested with openpyxl 3.0.1 and Python 3.6.7.
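One caveat about the unpadded codes (%-d, %-m, %-H, %-M, %-S): as far as I know they are a platform extension of the underlying C strftime and are not accepted on Windows. A possible workaround, sketched below under that assumption, is to expand those codes by hand before calling strftime. `UNPADDED_FIELDS` and `strftime_unpadded` are hypothetical names introduced only for this sketch, and it does not handle a literal %% followed by -d.

```python
import re
from datetime import datetime

# Maps each no-padding code letter to the datetime attribute it reads.
UNPADDED_FIELDS = {'d': 'day', 'm': 'month', 'H': 'hour', 'M': 'minute', 'S': 'second'}


def strftime_unpadded(value, fmt):
    """Format `value` with `fmt`, expanding %-d style codes by hand."""
    def expand(match):
        return str(getattr(value, UNPADDED_FIELDS[match.group(1)]))
    # Substitute the %-X codes first, then let strftime handle the rest.
    return value.strftime(re.sub(r'%-([dmHMS])', expand, fmt))


value = datetime(2017, 4, 9, 5, 7)
print(strftime_unpadded(value, '%-d/%-m/%Y %-H:%M'))  # 9/4/2017 5:07
```

The string returned by the conversion function can be passed to this helper in place of calling strftime on the value directly.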