Pandas: Cannot subtract date-time objects (timedelta, datetime)

The setup is as follows:

Python 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:00:30)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.22.0 -- An enhanced Interactive Python. Type '?' for help.


import pandas as pd
import numpy as np
from datetime import datetime, timedelta

df = pd.DataFrame({
                      'user_id': [1,2,3,4,5,6],
                      'created_at': [
                              '2017-01-01 10:10:15',
                              '2017-01-01 11:11:11',
                              '2017-01-01 12:12:12',
                              '2017-01-01 10:10:20',
                              '2017-01-01 10:10:34',
                              '2017-01-01 11:11:21'],
                      'transaction_value': [10, 20, 10, 30, 40, 50]
                      })

# convert string to datetime obj
df['created_at'] = pd.to_datetime(df['created_at'])


# convert other columns to numeric
cols = df.columns.drop('created_at')

df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

# creating lag1 and lag2
df['lag1'] = (
        df.sort_values(by=['created_at'], ascending=True)['created_at']
        .shift(periods=1, axis=0).fillna(0)
        )

df['lag2'] = (
        df.sort_values(by=['created_at'], ascending=True)['created_at']
        .shift(periods=-1, axis=0).fillna(0)
        )

# 0's to NaN
df = df.replace(0, np.nan, inplace=False)

# convert to datetime
cols = [col for col in df if col.startswith('lag')]

df[cols] = df[cols].apply(pd.to_datetime, errors='coerce')

Out[62]:
   user_id          created_at  transaction_value                lag1                lag2
0        1 2017-01-01 10:10:15                 10                 NaT                 NaT
1        2 2017-01-01 11:11:11                 20 2017-01-01 10:10:34 2017-01-01 10:10:34
2        3 2017-01-01 12:12:12                 10 2017-01-01 11:11:21 2017-01-01 11:11:21
3        4 2017-01-01 10:10:20                 30 2017-01-01 10:10:15 2017-01-01 10:10:15
4        5 2017-01-01 10:10:34                 40 2017-01-01 10:10:20 2017-01-01 10:10:20
5        6 2017-01-01 11:11:21                 50 2017-01-01 11:11:11 2017-01-01 11:11:11
In [63]: df.dtypes
Out[63]:
user_id                       int64
created_at           datetime64[ns]
transaction_value             int64
lag1                 datetime64[ns]
lag2                 datetime64[ns]
dtype: object

I want the difference between all of the timestamp columns, with the result in seconds.

Here are the many things I have tried:

Attempt #1:

def x(a,b):
    return timedelta(a - b).total_seconds()

df.apply(lambda f: x(f['created_at'],f['lag1']), axis=1)

TypeError: unsupported type for timedelta days component: NaTType


OK, attempt #2:

pd.Timedelta(df['lag1'].difference(df['lag2']))

AttributeError: 'Series' object has no attribute 'difference'

OK... attempt #3:


pd.Timedelta(df['lag1'].subtract(df['lag2']).to_seconds())

AttributeError: 'Series' object has no attribute 'to_seconds'

Now I'm just throwing things at the wall to see what sticks, because none of this makes any sense to me:

df['lag1'].subtract(df['lag2']).to_timedelta64

AttributeError: 'Series' object has no attribute 'to_timedelta64'

t1 = df['lag1']
t2 = df['lag2']

pd.Timedelta(t2 - t1).seconds

ValueError: Value must be Timedelta, string, integer, float, timedelta or convertible, not Series

I shouldn't have to write a whole block of code just to get the difference between two datetimes.

The machine I'm on:

MacBook Air M1 2020, 16 GB RAM (macOS Big Sur version 11.2.1)

Since both columns hold pandas Timestamps, you can do this:

def x(a, b):
    return (a - b).total_seconds()
    
df.apply(lambda f: x(f['created_at'],f['lag1']), axis=1)

I don't know what you want to do in the NaT case (this function returns NaN there), but you can easily change that.
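One way to change the NaT behaviour is sketched below. The helper name `seconds_between`, its `default` parameter, and the two-row sample frame are illustrative assumptions, not part of the original code; the choice of 0.0 as the fallback is arbitrary.

```python
import pandas as pd

# Small illustrative frame; the first row has a missing lag1 (NaT).
df = pd.DataFrame({
    'created_at': pd.to_datetime(['2017-01-01 10:10:15', '2017-01-01 10:10:20']),
    'lag1': pd.to_datetime([None, '2017-01-01 10:10:15']),
})

def seconds_between(a, b, default=0.0):
    """Return (a - b) in seconds, or `default` if either value is missing."""
    if pd.isna(a) or pd.isna(b):
        return default
    return (a - b).total_seconds()

# Row-wise application, mirroring the approach in this answer.
result = df.apply(
    lambda row: seconds_between(row['created_at'], row['lag1']), axis=1
)
```

Passing `default=float('nan')` would reproduce the NaN behaviour described above.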

I have not touched the NaT values; feel free to fill them with 0 or another value if needed.

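We can use pd.to_timedelta together with the dt accessor, then apply the total_seconds method.
If the output needs to be of int type, append .astype(int) to the end of the code; note that this raises an error while NaN values remain, so fill them first.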

Code

df['lag_diff'] = pd.to_timedelta(df.lag1 - df.lag2, unit='s').dt.total_seconds()
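As a rough sketch of the integer conversion mentioned above: subtracting two datetime columns already gives a timedelta Series, so the dt accessor can be applied to the difference directly. The three-row sample Series below are an assumption made up for illustration, and the two integer variants (fill with 0, or use the nullable `Int64` dtype) are just two possible choices.

```python
import pandas as pd

# Illustrative data; the first lag1 value is missing (NaT).
lag1 = pd.Series(pd.to_datetime(
    [None, '2017-01-01 10:10:34', '2017-01-01 10:10:15']))
lag2 = pd.Series(pd.to_datetime(
    ['2017-01-01 10:10:20', '2017-01-01 11:11:21', '2017-01-01 10:10:34']))

# datetime - datetime gives a timedelta Series (NaT where either side is NaT).
diff = lag1 - lag2

# Float seconds; NaT becomes NaN.
seconds = diff.dt.total_seconds()

# Option 1: replace missing values before casting to a plain int dtype.
filled = seconds.fillna(0).astype(int)

# Option 2: keep missing values by using pandas' nullable integer dtype.
nullable = seconds.astype('Int64')
```

Option 2 keeps the missing entries as `<NA>` instead of inventing a value for them.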

Input from the provided setup

   user_id          created_at  transaction_value                lag1                lag2
0        1 2017-01-01 10:10:15                 10                 NaT 2017-01-01 10:10:20
1        2 2017-01-01 11:11:11                 20 2017-01-01 10:10:34 2017-01-01 11:11:21
2        3 2017-01-01 12:12:12                 10 2017-01-01 11:11:21                 NaT
3        4 2017-01-01 10:10:20                 30 2017-01-01 10:10:15 2017-01-01 10:10:34
4        5 2017-01-01 10:10:34                 40 2017-01-01 10:10:20 2017-01-01 11:11:11
5        6 2017-01-01 11:11:21                 50 2017-01-01 11:11:11 2017-01-01 12:12:12

Output (subset of columns)

                 lag1                lag2  lag_diff
0                 NaT 2017-01-01 10:10:20       NaN
1 2017-01-01 10:10:34 2017-01-01 11:11:21   -3647.0
2 2017-01-01 11:11:21                 NaT       NaN
3 2017-01-01 10:10:15 2017-01-01 10:10:34     -19.0
4 2017-01-01 10:10:20 2017-01-01 11:11:11   -3651.0
5 2017-01-01 11:11:11 2017-01-01 12:12:12   -3661.0
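
Since the question asks for the difference between all of the timestamp columns, the same idea could be extended to every pair of datetime columns. This is only a sketch: the two-row sample frame and the `<a>_minus_<b>` column naming are assumptions chosen for illustration.

```python
from itertools import combinations

import pandas as pd

# Illustrative frame with three datetime columns.
df = pd.DataFrame({
    'created_at': pd.to_datetime(['2017-01-01 10:10:15', '2017-01-01 10:10:20']),
    'lag1': pd.to_datetime([None, '2017-01-01 10:10:15']),
    'lag2': pd.to_datetime(['2017-01-01 10:10:20', '2017-01-01 10:10:34']),
})

# Pick out every datetime column, whatever it is called.
ts_cols = df.select_dtypes(include='datetime').columns

# One float-seconds column per pair of timestamp columns.
diffs = pd.DataFrame({
    f'{a}_minus_{b}': (df[a] - df[b]).dt.total_seconds()
    for a, b in combinations(ts_cols, 2)
})
```

Each resulting column is NaN wherever either timestamp in the pair is NaT.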