Properly parsing Excel format for dates

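I have been struggling with this problem for a while. I have tried several Python Excel libraries, and they all seem to have the same issue. For xlsx files, the desired end result is basically "what you see is what you get". All the Python libraries that interact with Excel return the value stored in Excel together with the format that applies to that value. What I am struggling with is using that format to actually produce a value that looks like what you see in Excel or another spreadsheet application (such as LibreOffice Calc).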

Say we have a sheet in which one row looks like this:

The format (shown using LibreOffice Calc) is here:

Now here is some code that opens the sheet and prints the stored value and format:

import openpyxl
book = openpyxl.load_workbook(
    'test.xlsx',
    read_only=True,
    data_only=False,
)
sheet = book.worksheets[0]
for row in sheet.iter_rows():
    for cell in row:
        print('FORMAT:', cell.number_format)
        print('VALUE:', cell.value)
        print('TYPE:', type(cell.value))

Running that code (Python 3.6.7, openpyxl 3.0.1) produces the following truncated output:

FORMAT: yyyy\-mm\-dd\Thh:mm\Z
VALUE: 2017-04-19 15:17:00.000004
TYPE: <class 'datetime.datetime'>
...

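My question is: how do I parse that format string (yyyy\-mm\-dd\Thh:mm\Z) into a valid Python strftime format? I started writing a simple function that uses string replacement to turn yyyy into %Y, yy into %y, and so on. But then I noticed that the format string contains two instances of mm, one for the month and one for the minutes! How are you supposed to parse that? Does the month always come first? What happens when there are only minutes? What if you want a datetime format with the time first and the date second?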
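The ambiguity described above can be reproduced with a few lines. This is only a sketch of the naive replacement approach; the names NAIVE_CODES and naive_convert are hypothetical and not taken from any library.

```python
# Hypothetical naive converter, shown only to illustrate the pitfall.
# Longer codes come first so that 'yyyy' is not consumed as two 'yy'.
NAIVE_CODES = [
    ('yyyy', '%Y'),
    ('yy', '%y'),
    ('mm', '%m'),   # always treated as month, which is the problem
    ('dd', '%d'),
    ('hh', '%H'),
    ('ss', '%S'),
]


def naive_convert(excel_format):
    """Replace each Excel code with one fixed strftime code."""
    result = excel_format
    for excel_code, strftime_code in NAIVE_CODES:
        result = result.replace(excel_code, strftime_code)
    return result


# The second 'mm' is meant as minutes but is converted to %m (month).
print(naive_convert('yyyy-mm-dd hh:mm'))
```

A plain replacement table has no way to tell the two mm codes apart, so any working converter has to look at the neighbouring codes.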

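Any help would be appreciated: either a Python library that already does this, a well-documented specification of the xlsx file format that would let me build my own parser (I found this, but it does not seem to have what I need: https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/300280fd-e4fe-4675-a924-4d383af48d3b), or an example in another language. It would also be nice if this generalized beyond dates and worked consistently for all Excel formats.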

Question: Parse string ("yyyy-mm-dd\Thh:mm\Z") into a valid datetime.strftime Format Code.



import re
from datetime import datetime

class XLSXdatetime:
    translate = {'yyyy': '%Y', 'mm': '%m', 'dd': '%d', 
                 'hh:mm': '%H:%M', 'hh:mm:ss': '%H:%M:%S'}
    # Match either a run of code characters or a backslash-escaped literal
    rec = re.compile(r'([\w:]+|\\.)')

    def __init__(self, xlsx_format):
        self.xlsx_format = xlsx_format

    @property
    def format(self):
        _format = []
        for item in XLSXdatetime.rec.findall(self.xlsx_format):
            # Drop the escape backslash and keep the literal character
            if item.startswith('\\'):
                item = item[1:]
            _format.append(XLSXdatetime.translate.get(item, item))

        return ''.join(_format)

    def strftime(self, data):
        return data.strftime(self.format)

Usage:

  • data = datetime.strptime('2017-04-19 15:17:00.000004', '%Y-%m-%d %H:%M:%S.%f')
    print('data: {}'.format(data))
    
    # Long version
    for _format in ['yyyy-mm-dd hh:mm:ss', 
                    r'yyyy\-mm\-dd\Thh:mm\Z'
                   ]:
        xlsx_datetime = XLSXdatetime(_format)    
        print("{} => {} = '{}'".format(_format, 
                                       xlsx_datetime.format, 
                                       xlsx_datetime.strftime(data)))
    

    Output:

    data: 2017-04-19 15:17:00.000004
    yyyy-mm-dd hh:mm:ss => %Y%m%d%H:%M:%S = '2017041915:17:00'
    yyyy\-mm\-dd\Thh:mm\Z => %Y-%m-%dT%H:%MZ = '2017-04-19T15:17Z'
    

  • # Short version
    for _format in ['yyyy-mm-dd hh:mm:ss', 
                    r'yyyy\-mm\-dd\Thh:mm\Z'
                   ]:
        print("'{}'".format(XLSXdatetime(_format).strftime(data)))
    

    Output:

    '2017041915:17:00'
    '2017-04-19T15:17Z'
    

Tested with Python: 3.6

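Unfortunately, stovfl's solution does not actually generalize to all xlsx formats. After a lot of searching through Microsoft's documentation, I finally found this page (the support.office.com article linked in the docstring of the function below), which documents some of the rules for Excel's number_format.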

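Important things to note:

  • mm and m mean minutes only if the immediately preceding code is hh or h (hours) or the immediately following code is ss or s (seconds); otherwise mm and m mean months.
  • Most non-code characters must be preceded by a backslash.
  • Characters surrounded by quotation marks are interpreted literally (not as codes).
  • There is a long list of characters that are displayed without any escape character or quotation marks.
  • There is the concept of sections, which are separated by semicolons. For the purposes of this solution I chose to ignore sections (because if sections are actually used, the resulting output does not look like a date).
  • Some Excel codes have no equivalent in strftime. For example, mmmmm displays the first letter of the month. In my solution I replace these codes with similar strftime codes (for mmmmm I chose %b, which displays the abbreviated month name). These codes are noted in the comments.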
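The first rule in the list (when mm and m mean minutes) can be sketched on its own, separately from the full converter. This is a simplified illustration, not the function from this answer: the name classify_m_codes and the regular expressions are assumptions, and it ignores sections and the unescaped literal characters.

```python
import re

# Hypothetical pattern for the date/time codes this sketch cares about.
CODE_RE = re.compile(r'y+|m+|d+|h+|s+', re.IGNORECASE)
# Escaped characters, quoted literals and the am/pm codes are not m-codes.
LITERAL_RE = re.compile(r'\\.|"[^"]*"|am/pm|a/p', re.IGNORECASE)


def classify_m_codes(excel_format):
    """Return (code, meaning) pairs; meaning is 'minute', 'month' or 'other'."""
    stripped = LITERAL_RE.sub('', excel_format)
    codes = [code.lower() for code in CODE_RE.findall(stripped)]
    result = []
    for i, code in enumerate(codes):
        if code in ('m', 'mm'):
            # Minutes only if directly after hours or directly before seconds.
            prev_is_hour = i > 0 and codes[i - 1] in ('h', 'hh')
            next_is_second = i + 1 < len(codes) and codes[i + 1] in ('s', 'ss')
            meaning = 'minute' if prev_is_hour or next_is_second else 'month'
        else:
            meaning = 'other'
        result.append((code, meaning))
    return result


print(classify_m_codes(r'yyyy\-mm\-dd\Thh:mm\Z'))
print(classify_m_codes('mm:ss'))
```

For the format from the question, the first mm is classified as a month and the second as a minute, because only the second one directly follows hh.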

Anyway, I have just built a function that, given an Excel number_format date string, returns the equivalent Python strftime format. I hope this helps anyone looking for a way to get "what you see is what you get" from Excel into Python.

EXCEL_CODES = {
        'yyyy': '%Y',
        'yy': '%y',
        'dddd': '%A',
        'ddd': '%a',
        'dd': '%d',
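        # The '-' flag (no zero padding) is a platform extension and is not supported by strftime on Windows
        'd': '%-d',
        # Different from excel (first letter of the month) as there is no J-D in strftime
        'mmmmm': '%b',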
        'mmmm': '%B',
        'mmm': '%b',
        'hh': '%H',
        'h': '%-H',
        'ss': '%S',
        's': '%-S',
        # Possibly different from excel as there is no am/pm in strftime
        'am/pm': '%p',
        # Different from excel as there is no A/P or a/p in strftime
        'a/p': '%p',
}

EXCEL_MINUTE_CODES = {
    'mm': '%M',
    'm': '%-M',
}
EXCEL_MONTH_CODES = {
    'mm': '%m',
    'm': '%-m',
}

EXCEL_MISC_CHARS = [
    '$',
    '+',
    '(',
    ':',
    '^',
    '\'',
    '{',
    '<',
    '=',
    '-',
    '/',
    ')',
    '!',
    '&',
    '~',
    '}',
    '>',
    ' ',
]

EXCEL_ESCAPE_CHAR = '\\'
EXCEL_SECTION_DIVIDER = ';'

def convert_excel_date_format_string(excel_date):
    '''
    Created using documentation here:
    https://support.office.com/en-us/article/review-guidelines-for-customizing-a-number-format-c0a1d1fa-d3f4-4018-96b7-9c9354dd99f5

    '''
    # The python date string that is being built
    python_date = ''
    # The excel code currently being parsed
    excel_code = ''
    prev_code = ''
    # If the previous character was the escape character
    char_escaped = False
    # If we are in a quotation block (surrounded by "")
    quotation_block = False
    # Variables used for checking if a code should be a minute or a month
    checking_minute_or_month = False
    minute_or_month_buffer = ''

    for c in excel_date:
        ec = excel_code.lower()
        # The previous character was an escape, the next character should be added normally
        if char_escaped:
            if checking_minute_or_month:
                minute_or_month_buffer += c
            else:
                python_date += c
            char_escaped = False
            continue
        # Inside a quotation block
        if quotation_block:
            if c == '"':
                # Quotation block should now end
                quotation_block = False
            elif checking_minute_or_month:
                minute_or_month_buffer += c
            else:
                python_date += c
            continue
        # The start of a quotation block
        if c == '"':
            quotation_block = True
            continue
        if c == EXCEL_SECTION_DIVIDER:
            # We ignore excel sections for datetimes
            break

        is_escape_char = c == EXCEL_ESCAPE_CHAR
        # The am/pm and a/p code add some complications, need to make sure we are not that code
        is_misc_char = c in EXCEL_MISC_CHARS and (c != '/' or (ec != 'am' and ec != 'a'))
        # Code is finished, check if it is a proper code
        if (is_escape_char or is_misc_char) and ec:
            # Checking if the previous code should have been minute or month
            if checking_minute_or_month:
                if ec == 'ss' or ec == 's':
                    # It should be a minute!
                    minute_or_month_buffer = EXCEL_MINUTE_CODES[prev_code] + minute_or_month_buffer
                else:
                    # It should be a month!
                    minute_or_month_buffer = EXCEL_MONTH_CODES[prev_code] + minute_or_month_buffer
                python_date += minute_or_month_buffer
                checking_minute_or_month = False
                minute_or_month_buffer = ''

            if ec in EXCEL_CODES:
                python_date += EXCEL_CODES[ec]
            # Handle months/minutes differently
            elif ec in EXCEL_MINUTE_CODES:
                # If preceded by hours, we know this is referring to minutes
                if prev_code == 'h' or prev_code == 'hh':
                    python_date += EXCEL_MINUTE_CODES[ec]
                else:
                    # Have to check if the next code is ss or s
                    checking_minute_or_month = True
                    minute_or_month_buffer = ''
            else:
                # Have to abandon this attempt to convert because the code is not recognized
                return None
            prev_code = ec
            excel_code = ''
        if is_escape_char:
            char_escaped = True
        elif is_misc_char:
            # Add the misc char
            if checking_minute_or_month:
                minute_or_month_buffer += c
            else:
                python_date += c
        else:
            # Just add to the code
            excel_code += c

    # Complete, check if there is still a buffer
    ec = excel_code.lower()
    if checking_minute_or_month:
        if ec == 's' or ec == 'ss':
            # The final code is seconds, so the buffered code is a minute
            minute_or_month_buffer = EXCEL_MINUTE_CODES[prev_code] + minute_or_month_buffer
        else:
            # No seconds code follows, so the buffered code is a month
            minute_or_month_buffer = EXCEL_MONTH_CODES[prev_code] + minute_or_month_buffer
        python_date += minute_or_month_buffer
    if ec:
        if ec in EXCEL_CODES:
            python_date += EXCEL_CODES[ec]
        elif ec in EXCEL_MINUTE_CODES:
            if prev_code == 'h' or prev_code == 'hh':
                python_date += EXCEL_MINUTE_CODES[ec]
            else:
                python_date += EXCEL_MONTH_CODES[ec]
        else:
            return None
    return python_date

Tested with openpyxl 3.0.1 and Python 3.6.7.
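One caveat about the codes used above: directives such as %-d and %-H rely on a platform extension of strftime and raise an error on Windows. A minimal sketch of one way to apply such a format string portably follows; the helper name portable_strftime is hypothetical, and it only covers the non-padded directives that the converter above can produce.

```python
import re
from datetime import datetime

# Either a literal '%%' (left alone) or one of the non-padded directives.
NO_PAD_RE = re.compile(r'%%|%-([dmHMS])')


def portable_strftime(value, fmt):
    """Apply fmt to value, expanding '%-X' directives without platform support."""
    def expand(match):
        directive = match.group(1)
        if directive is None:
            # A literal '%%': keep it for the final strftime call.
            return match.group(0)
        # Format with zero padding, then strip the padding by hand.
        padded = value.strftime('%' + directive)
        return padded.lstrip('0') or '0'

    return value.strftime(NO_PAD_RE.sub(expand, fmt))


value = datetime(2017, 4, 9, 5, 7, 3)
print(portable_strftime(value, '%-d/%-m/%Y %-H:%-M:%-S'))
```

The expanded pieces are plain digits, so passing the result through strftime afterwards does not change them.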