How to convert a Pandas column to a date type when values don't respect a pattern?

Question

I have the following dataframe:

    Timestamp           real time
0   17FEB20:23:59:50    0.003
1   17FEB20:23:59:55    0.003
2   17FEB20:23:59:57    0.012
3   17FEB20:23:59:57    02:54.8
4   17FEB20:24:00:00    0.03
5   18FEB20:00:00:00    0
6   18FEB20:00:00:02    54.211
7   18FEB20:00:00:02    0.051

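How can I convert the columns to datetime64?

Two things make this problem tricky:

In the Timestamp column, the value at index 4 is 17FEB20:24:00:00, which does not seem to be a valid datetime (even though it was output by a SAS program...).
The real time column doesn't follow a consistent pattern and apparently cannot be matched by a date_parser.

This is what I tried for the first column (Timestamp):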

data['Timestamp'] = pd.to_datetime(
    data['Timestamp'],
    format='%d%b%y:%H:%M:%S')

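But because of the value at index 4 (17FEB20:24:00:00), I get: ValueError: time data '17FEB20:24:00:00' does not match format '%d%b%y:%H:%M:%S' (match). If I remove this row it works, but I have to find a way to handle it, because my dataset has thousands of rows and I cannot simply ignore them. Maybe there is a way to convert it to 00:00:00 of the next day?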

Below is a snippet that creates the dataframe sample above, to save some time when working on the problem (if needed):

import pandas as pd

data = pd.DataFrame({
    'Timestamp':[
        '17FEB20:23:59:50',
        '17FEB20:23:59:55',
        '17FEB20:23:59:57',
        '17FEB20:23:59:57',
        '17FEB20:24:00:00',
        '18FEB20:00:00:00',
        '18FEB20:00:00:02',
        '18FEB20:00:00:02'],
    'real time': [
        '0.003',
        '0.003',
        '0.012',
        '02:54.8',
        '0.03',
        '0',
        '54.211',
        '0.051',
        ]})

Thanks for your help!

Answer 1

If your data is not too large, you may want to consider iterating over the dataframe. You can do it like this.

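for index, row in data.iterrows():
    # Characters 8-9 hold the hour; the SAS output can contain the invalid hour 24.
    if row['Timestamp'][8:10] == '24':
        # Parse the date part, move to the next day, and rebuild the string at 00:00:00.
        next_day = pd.to_datetime(row['Timestamp'][:7], format='%d%b%y') + pd.DateOffset(1)
        data.loc[index, 'Timestamp'] = next_day.strftime('%d%b%y').upper() + ':00:00:00'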

Here is the result.

        Timestamp      real time
0   17FEB20:23:59:50    0.003
1   17FEB20:23:59:55    0.003
2   17FEB20:23:59:57    0.012
3   17FEB20:23:59:57    02:54.8
4   18FEB20:00:00:00    0.03
5   18FEB20:00:00:00    0
6   18FEB20:00:00:02    54.211
7   18FEB20:00:00:02    0.051
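
For larger datasets, the same idea can probably be done without a Python-level loop. The following is only a sketch, not the approach of the answer above: the helper name parse_sas_timestamps is hypothetical, and it assumes the only invalid values are exactly 24:00:00 (not, say, 24:00:05).

```python
import pandas as pd

def parse_sas_timestamps(values: pd.Series) -> pd.Series:
    """Parse 'DDMONYY:HH:MM:SS' strings, rolling 24:00:00 into the next day.

    Hypothetical helper; assumes the only invalid time is exactly 24:00:00.
    """
    # The format has exactly three colons, so ':24:00:00' can only match
    # the hour/minute/second part at the end of the string.
    is_hour_24 = values.str.endswith(':24:00:00')
    cleaned = values.str.replace(':24:00:00', ':00:00:00', regex=False)
    parsed = pd.to_datetime(cleaned, format='%d%b%y:%H:%M:%S')
    # Add one day only to the rows that originally had hour 24.
    return parsed + pd.to_timedelta(is_hour_24.astype(int), unit='D')

sample = pd.Series(['17FEB20:23:59:57', '17FEB20:24:00:00', '18FEB20:00:00:02'])
result = parse_sas_timestamps(sample)
print(result)
```

The boolean mask is computed before the replacement, so the rows to shift are still known after the strings have been cleaned.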

Answer 2

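Here is how I handled it:

For the Timestamp column, I used this (thanks to @merit_2 for sharing it in the first comment).
For the real time column, I parse it using a few conditions.

The code is as follows: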

import pandas as pd
from datetime import timedelta

# Parsing "real time" column:

## Right-pad the fractional part with zeros to three digits (milliseconds)
data['real time'] = [sub if len(sub.split('.')) == 1 else sub.split('.')[0]+'.'+'{:<03s}'.format(sub.split('.')[1]) for sub in data['real time'].values]

## Left-pad every value to the full '00:00:00.000' layout, based on its length
placeholders = {
    1: '00:00:00.00',
    2: '00:00:00.0',
    3: '00:00:00.',
    4: '00:00:00',
    5: '00:00:0',
    6: '00:00:',
    7: '00:00',
    8: '00:0',
    9: '00:',
    10:'00',
    11:'0'}

for cond_len in placeholders:
    condition = data['real time'].str.len() == cond_len
    data.loc[(condition),'real time'] = placeholders[cond_len] + data.loc[(condition),'real time']

# Parsing "Timestamp" column:
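# Match the whole time part ':24:00:00', so that values such as '10:24:00'
# (minute 24, second 00) are left untouched.
selrow = data['Timestamp'].str.endswith(':24:00:00')
data['Timestamp'] = data['Timestamp'].str.replace(':24:00:00', ':00:00:00', regex=False)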
data['Timestamp'] = pd.to_datetime(data['Timestamp'], format='%d%b%y:%H:%M:%S')
data['Timestamp'] = data['Timestamp'] + selrow * timedelta(days=1)

# Convert the columns to datetime type:
data['Timestamp'] = pd.to_datetime(data['Timestamp'], format='%d%b%y:%H:%M:%S')
data['real time'] = pd.to_datetime(data['real time'], format='%H:%M:%S.%f')

# check results:
display(data)
display(data.dtypes)

Here is the output:

    Timestamp           real time
0   2020-02-17 23:59:50 1900-01-01 00:00:00.003
1   2020-02-17 23:59:55 1900-01-01 00:00:00.003
2   2020-02-17 23:59:57 1900-01-01 00:00:00.012
3   2020-02-17 23:59:57 1900-01-01 00:02:54.800
4   2020-02-18 00:00:00 1900-01-01 00:00:00.030
5   2020-02-18 00:00:00 1900-01-01 00:00:00.000
6   2020-02-18 00:00:02 1900-01-01 00:00:54.211
7   2020-02-18 00:00:02 1900-01-01 00:00:00.051

Timestamp    datetime64[ns]
real time    datetime64[ns]

Maybe there is a smarter way to do this, but for now it works.
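
One possible alternative for the real time column: the 1900-01-01 date in the output appears because only a time of day was parsed. If the values are really elapsed times (an assumption, the question does not say so), a timedelta column may be a closer fit than datetime64. A minimal sketch, where to_seconds is a hypothetical helper and the values are assumed to look like '[[HH:]MM:]SS[.fff]':

```python
import pandas as pd

def to_seconds(text: str) -> float:
    """Convert '[[HH:]MM:]SS[.fff]' to a number of seconds (hypothetical helper)."""
    seconds = 0.0
    # Each additional colon-separated field multiplies what came before by 60.
    for part in text.split(':'):
        seconds = seconds * 60 + float(part)
    return seconds

real_time = pd.Series(['0.003', '02:54.8', '0', '54.211'])
durations = pd.to_timedelta(real_time.map(to_seconds), unit='s')
print(durations)
```

This avoids the length-based placeholder table, at the cost of ending up with a timedelta64 column instead of the datetime64 the question asked for.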
