Calculating the duration of days between two formatted dates in Python results in "OverflowError: int too big to convert"

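I have a DataFrame with 320,000 rows and 18 columns. Two of the columns are the project start date and the project end date. I simply want to add a column with the project duration in days.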

df['proj_duration'] = df['END_FORMATED'] - df['START_FORMATED']

The data is imported from a SQL server.

The dates are formatted as yyyy-mm-dd.

When I run the code above, I get this error:

Traceback (most recent call last):
  File "pandas\_libs\tslibs\timedeltas.pyx", line 234, in pandas._libs.tslibs.timedeltas.array_to_timedelta64
TypeError: Expected unicode, got datetime.timedelta

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in
    df['proj_duration'] = df['END_FORMATED'] - df['START_FORMATED']
  File "C:\Users797\Anaconda3\lib\site-packages\pandas\core\ops\common.py", line 64, in new_method
    return method(self, other)
  File "C:\Users797\Anaconda3\lib\site-packages\pandas\core\ops\__init__.py", line 502, in wrapper
    return _construct_result(left, result, index=left.index, name=res_name)
  File "C:\Users797\Anaconda3\lib\site-packages\pandas\core\ops\__init__.py", line 475, in _construct_result
    out = left._constructor(result, index=index)
  File "C:\Users797\Anaconda3\lib\site-packages\pandas\core\series.py", line 305, in __init__
    data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
  File "C:\Users797\Anaconda3\lib\site-packages\pandas\core\construction.py", line 424, in sanitize_array
    subarr = _try_cast(data, dtype, copy, raise_cast_failure)
  File "C:\Users797\Anaconda3\lib\site-packages\pandas\core\construction.py", line 537, in _try_cast
    subarr = maybe_cast_to_datetime(arr, dtype)
  File "C:\Users797\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 1346, in maybe_cast_to_datetime
    value = maybe_infer_to_datetimelike(value)
  File "C:\Users797\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 1198, in maybe_infer_to_datetimelike
    value = try_timedelta(v)
  File "C:\Users797\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 1187, in try_timedelta
    return to_timedelta(v)._ndarray_values.reshape(shape)
  File "C:\Users797\Anaconda3\lib\site-packages\pandas\core\tools\timedeltas.py", line 102, in to_timedelta
    return _convert_listlike(arg, unit=unit, errors=errors)
  File "C:\Users797\Anaconda3\lib\site-packages\pandas\core\tools\timedeltas.py", line 140, in _convert_listlike
    value = sequence_to_td64ns(arg, unit=unit, errors=errors, copy=False)[0]
  File "C:\Users797\Anaconda3\lib\site-packages\pandas\core\arrays\timedeltas.py", line 943, in sequence_to_td64ns
    data = objects_to_td64ns(data, unit=unit, errors=errors)
  File "C:\Users797\Anaconda3\lib\site-packages\pandas\core\arrays\timedeltas.py", line 1052, in objects_to_td64ns
    result = array_to_timedelta64(values, unit=unit, errors=errors)
  File "pandas\_libs\tslibs\timedeltas.pyx", line 239, in pandas._libs.tslibs.timedeltas.array_to_timedelta64
  File "pandas\_libs\tslibs\timedeltas.pyx", line 198, in pandas._libs.tslibs.timedeltas.convert_to_timedelta64
  File "pandas\_libs\tslibs\timedeltas.pyx", line 143, in pandas._libs.tslibs.timedeltas.delta_to_nanoseconds
OverflowError: int too big to convert

I suspected a problem with the date format, so I tried:

a = df.head(50000)['END_FORMATED']
b = df.head(50000)['START_FORMATED']
c = a-b

and got the same error. However, when I run it on the last 50,000 rows, it works fine:

x = df.tail(50000)['END_FORMATED']
y = df.tail(50000)['START_FORMATED']
z = x-y

This suggests that the problem does not affect the whole dataset, only some rows.

Any idea how to fix this? Thanks!

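Your SQL dataset appears to contain a date set to 1009-01-06. According to the official documentation, pandas only understands dates between 1677-09-21 and 2262-04-11, because by default it stores timestamps as 64-bit counts of nanoseconds. A date outside that range overflows, which is what the OverflowError is reporting.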
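To see which rows fall outside that range, a check along the following lines may help. This is only a sketch: it assumes the two columns hold Python datetime.date objects (which the datetime.timedelta in the traceback suggests), the sample frame is made up, and out_of_bounds is a hypothetical helper name.

```python
import datetime

import pandas as pd

# Bounds of a nanosecond-resolution pandas Timestamp, as plain dates.
lower = pd.Timestamp.min.date()  # Timestamp.min falls partway through this day
upper = pd.Timestamp.max.date()

# Made-up sample data standing in for the SQL import (assumption).
df = pd.DataFrame({
    "START_FORMATED": [datetime.date(2020, 5, 5), datetime.date(1009, 1, 6)],
    "END_FORMATED": [datetime.date(2020, 6, 5), datetime.date(2020, 6, 8)],
})


def out_of_bounds(series):
    """Boolean mask of non-missing dates that a nanosecond Timestamp cannot hold."""
    # The lower bound is exclusive because midnight on that day is already too early.
    return series.map(lambda d: pd.notna(d) and not (lower < d <= upper))


bad_rows = df[out_of_bounds(df["START_FORMATED"]) | out_of_bounds(df["END_FORMATED"])]
print(bad_rows)
```

Rows flagged this way can then be corrected at the source or excluded before subtracting the columns.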

Try converting each series to datetime objects with infer_datetime_format=True and errors='coerce', so that entries which are not in the expected format are caught, as follows:

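import pandas as pd

df = pd.DataFrame()  # small example frame standing in for the real data
df['START_FORMATED'] = ['2020-05-05', '2020-05-06', '2020-05-07', '1009-01-06']  # last date is out of range
df['END_FORMATED'] = ['2020-06-05', '2020-06-06', '2020-06-07', '2020-06-08']

# errors='coerce' turns unconvertible values into NaT (infer_datetime_format is deprecated since pandas 2.0 and can be dropped there)
df['proj_duration'] = pd.to_datetime(df['END_FORMATED'], infer_datetime_format=True, errors='coerce') - pd.to_datetime(df['START_FORMATED'], infer_datetime_format=True, errors='coerce')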

This sets NaT wherever pd.to_datetime() cannot convert a value, resulting in this df:

      START_FORMATED END_FORMATED proj_duration
0         2020-05-05   2020-06-05       31 days
1         2020-05-06   2020-06-06       31 days
2         2020-05-07   2020-06-07       31 days
3         1009-01-06   2020-06-08           NaT
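Since the goal was a duration in days, one possible follow-up is to list the rows that were coerced to NaT and turn the remaining timedeltas into day counts with .dt.days. The sketch below reuses the sample data; the column name proj_duration_days is an assumption, and whether 1009-01-06 is coerced to NaT depends on the datetime resolution your pandas version uses by default.

```python
import pandas as pd

df = pd.DataFrame({
    "START_FORMATED": ["2020-05-05", "2020-05-06", "2020-05-07", "1009-01-06"],
    "END_FORMATED": ["2020-06-05", "2020-06-06", "2020-06-07", "2020-06-08"],
})

# Convert both columns; values that cannot be converted become NaT.
start = pd.to_datetime(df["START_FORMATED"], errors="coerce")
end = pd.to_datetime(df["END_FORMATED"], errors="coerce")
df["proj_duration"] = end - start

# Rows where either date could not be converted, for manual inspection.
unparsed = df[start.isna() | end.isna()]
print(unparsed)

# Whole days as numbers; if any NaT is present the result is a float column with NaN.
df["proj_duration_days"] = df["proj_duration"].dt.days
print(df)
```

Inspecting unparsed before dropping or fixing anything makes it clear how many source rows carry suspicious dates.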