Interpolate (or extrapolate) only small gaps in pandas dataframe
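I have a pandas DataFrame with time as the index (1-minute frequency) and several columns of data. Sometimes the data contains NaN. If so, I want to interpolate only if the gap is not longer than 5 minutes. In this case that means a maximum of 5 consecutive NaNs. The data may look like this (several test cases that show the problems):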
import numpy as np
import pandas as pd
from datetime import datetime, timedelta  # timedelta is needed for the index below

start = datetime(2014, 2, 21, 14, 50)
# np.nan is used instead of the np.NaN alias, which was removed in NumPy 2.0.
data = pd.DataFrame(index=[start + timedelta(minutes=1*x) for x in range(0, 8)],
                    data={'a': [123.5, np.nan, 136.3, 164.3, 213.0, 164.3, 213.0, 221.1],
                          'b': [433.5, 523.2, 536.3, 464.3, 413.0, 164.3, 213.0, 221.1],
                          'c': [123.5, 132.3, 136.3, 164.3] + [np.nan]*4,
                          'd': [np.nan]*8,
                          'e': [np.nan]*7 + [2330.3],
                          'f': [np.nan]*4 + [2763.0, 2142.3, 2127.3, 2330.3],
                          'g': [2330.3] + [np.nan]*7,
                          'h': [2330.3] + [np.nan]*6 + [2777.7]})
It looks like this:
In [147]: data
Out[147]:
                           a       b       c       d       e       f       g       h
2014-02-21 14:50:00   123.5   433.5   123.5     NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00     NaN   523.2   132.3     NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00   136.3   536.3   136.3     NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00   164.3   464.3   164.3     NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00   213.0   413.0     NaN     NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00   164.3   164.3     NaN     NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00   213.0   213.0     NaN     NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00   221.1   221.1     NaN     NaN  2330.3  2330.3     NaN  2777.7
I am aware of data.interpolate(), but it has several flaws. It produces the result below, which is fine for columns a-e, but fails for columns f-h for different reasons:
                           a       b       c       d       e       f       g            h
2014-02-21 14:50:00   123.5   433.5   123.5     NaN     NaN     NaN  2330.3  2330.300000
2014-02-21 14:51:00   129.9   523.2   132.3     NaN     NaN     NaN  2330.3  2394.214286
2014-02-21 14:52:00   136.3   536.3   136.3     NaN     NaN     NaN  2330.3  2458.128571
2014-02-21 14:53:00   164.3   464.3   164.3     NaN     NaN     NaN  2330.3  2522.042857
2014-02-21 14:54:00   213.0   413.0   164.3     NaN     NaN  2763.0  2330.3  2585.957143
2014-02-21 14:55:00   164.3   164.3   164.3     NaN     NaN  2142.3  2330.3  2649.871429
2014-02-21 14:56:00   213.0   213.0   164.3     NaN     NaN  2127.3  2330.3  2713.785714
2014-02-21 14:57:00   221.1   221.1   164.3     NaN  2330.3  2330.3  2330.3  2777.700000
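f) The gap consists of 4 minutes of NaNs at the beginning; they should be replaced by the value 2763.0 (i.e. extrapolating backwards in time).
g) The gap is longer than 5 minutes, but it still gets extrapolated.
h) The gap is longer than 5 minutes, but it still gets interpolated.
I understand those reasons; of course I never specified that gaps longer than 5 minutes should not be interpolated. I understand that interpolate only extrapolates forward in time, but I want it to extrapolate backward in time as well. Are there any known methods I can use for my problem without reinventing the wheel?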
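To make the forward-only behaviour concrete, here is a minimal sketch on a toy Series. The Series `s` and its values are made up for illustration and are not part of the data above. With default arguments, the leading NaN is left alone while the trailing NaN is filled with the last valid value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: one leading, one interior and one trailing NaN.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Default arguments: linear method, filling in the forward direction only.
result = s.interpolate()
print(result.tolist())  # [nan, 1.0, 2.0, 3.0, 3.0]
```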
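Edit:
The method data.interpolate accepts the input parameter limit, which defines the maximum number of consecutive NaNs to be replaced by interpolation. However, this still interpolates up to the limit, whereas in that case I want to leave all NaNs in the gap untouched.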
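A small sketch of what that looks like in practice, again on a made-up Series rather than the data above: with `limit=2`, a gap of three NaNs is not skipped but partly filled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with a gap of three NaNs between two valid values.
s = pd.Series([0.0, np.nan, np.nan, np.nan, 4.0])

# limit=2 caps how many consecutive NaNs are filled; the rest of the gap stays NaN.
partly_filled = s.interpolate(limit=2)
print(partly_filled.tolist())  # [0.0, 1.0, 2.0, nan, 4.0]
```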
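So here is a mask that ought to solve the problem. Just interpolate, and then apply the mask to reset the appropriate values to NaN. Honestly, this was a bit more work than I expected, because I had to loop through each column, and then groupby didn't quite work without me providing a dummy column like 'ones'.
Anyway, I can explain if anything is unclear, but really only a couple of the lines are somewhat hard to understand. The trick on the df['new'] line is the least obvious part; printing out the individual lines is a good way to see what is going on.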
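mask = data.copy()
for i in list('abcdefgh'):
    df = pd.DataFrame( data[i] )
    # Label each run of consecutive null / non-null values with its own number.
    df['new'] = ((df.notnull() != df.shift().notnull()).cumsum())
    df['ones'] = 1
    # Keep existing values and NaN runs shorter than 5 (use <= 5 to also keep runs of exactly 5).
    mask[i] = (df.groupby('new')['ones'].transform('count') < 5) | data[i].notnull()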
In [7]: data
Out[7]:
                           a       b       c       d       e       f       g       h
2014-02-21 14:50:00   123.5   433.5   123.5     NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00     NaN   523.2   132.3     NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00   136.3   536.3   136.3     NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00   164.3   464.3   164.3     NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00   213.0   413.0     NaN     NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00   164.3   164.3     NaN     NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00   213.0   213.0     NaN     NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00   221.1   221.1     NaN     NaN  2330.3  2330.3     NaN  2777.7
In [8]: mask
Out[8]:
                           a       b       c       d       e       f       g       h
2014-02-21 14:50:00    True    True    True   False   False    True    True    True
2014-02-21 14:51:00    True    True    True   False   False    True   False   False
2014-02-21 14:52:00    True    True    True   False   False    True   False   False
2014-02-21 14:53:00    True    True    True   False   False    True   False   False
2014-02-21 14:54:00    True    True    True   False   False    True   False   False
2014-02-21 14:55:00    True    True    True   False   False    True   False   False
2014-02-21 14:56:00    True    True    True   False   False    True   False   False
2014-02-21 14:57:00    True    True    True   False    True    True   False    True
If you're not doing anything fancier with respect to extrapolation, it's easy from there:
In [9]: data.interpolate().bfill()[mask]
Out[9]:
                           a       b       c       d       e       f       g       h
2014-02-21 14:50:00   123.5   433.5   123.5     NaN     NaN  2763.0  2330.3  2330.3
2014-02-21 14:51:00   129.9   523.2   132.3     NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:52:00   136.3   536.3   136.3     NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:53:00   164.3   464.3   164.3     NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:54:00   213.0   413.0   164.3     NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00   164.3   164.3   164.3     NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00   213.0   213.0   164.3     NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00   221.1   221.1   164.3     NaN  2330.3  2330.3     NaN  2777.7
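The run-labelling trick used on the df['new'] line can be seen in isolation on a small Series. This is only an illustrative sketch; the Series `s` and the names `changed`, `run_id` and `run_len` are made up here and do not appear in the answer's code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: value, two NaNs, value, one NaN.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# True wherever the null / non-null state differs from the previous row.
changed = s.notnull() != s.shift().notnull()
# The cumulative sum turns those change points into one label per run.
run_id = changed.cumsum()
# Number of rows in the run that each row belongs to.
run_len = run_id.groupby(run_id).transform('count')

print(run_id.tolist())   # [1, 2, 2, 3, 4]
print(run_len.tolist())  # [1, 2, 2, 1, 1]
```

Comparing `run_len` against the maximum gap length, and OR-ing with `s.notnull()`, gives the same kind of mask as in the loop above.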
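Edit to add: here's a faster (about 2x on this sample data) and slightly simpler way, which moves some of the work outside the loop:
mask = data.copy()
# Label the null / non-null runs of every column in one pass.
grp = ((mask.notnull() != mask.shift().notnull()).cumsum())
grp['ones'] = 1
for i in list('abcdefgh'):
    mask[i] = (grp.groupby(i)['ones'].transform('count') < 5) | data[i].notnull()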
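I had to solve a similar problem and came up with a numpy-based solution before I found the answer above. Since my code is approximately ten times faster, I provide it here in case it is useful to someone in the future. It handles NaNs at the end of a series differently from the answer above: if a series ends with NaNs, it flags this last gap as invalid.
Here is the code: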
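import numpy as np


def bfill_nan(arr):
    """ Backward-fill NaNs """
    mask = np.isnan(arr)
    # Valid elements keep their own index; NaN positions start at the last index.
    idx = np.where(~mask, np.arange(mask.shape[0]), mask.shape[0]-1)
    # A running minimum from the right yields the index of the next valid
    # element (or the last index if no valid element follows).
    idx = np.minimum.accumulate(idx[::-1], axis=0)[::-1]
    out = arr[idx]
    return out


def calc_mask(arr, maxgap):
    """ Mask NaN gaps longer than `maxgap` """
    isnan = np.isnan(arr)
    cumsum = np.cumsum(isnan).astype('float')
    diff = np.zeros_like(arr)
    # At each valid element: number of NaNs since the previous valid element.
    diff[~isnan] = np.diff(cumsum[~isnan], prepend=0)
    diff[isnan] = np.nan
    # Propagate each gap length back onto the NaNs of that gap.
    diff = bfill_nan(diff)
    # Trailing gaps stay NaN and compare as False, so they are flagged invalid.
    # Note: `<` also masks gaps of exactly `maxgap`; use `<=` to keep them.
    return (diff < maxgap) | ~isnan


mask = data.copy()

for column_name in data:
    x = data[column_name].values
    mask[column_name] = calc_mask(x, 5)

print('data:')
print(data)
print('\nmask:')
print(mask)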
Output:
data:
                           a       b       c       d       e       f       g       h
2014-02-21 14:50:00   123.5   433.5   123.5     NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00     NaN   523.2   132.3     NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00   136.3   536.3   136.3     NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00   164.3   464.3   164.3     NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00   213.0   413.0     NaN     NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00   164.3   164.3     NaN     NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00   213.0   213.0     NaN     NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00   221.1   221.1     NaN     NaN  2330.3  2330.3     NaN  2777.7
mask:
                           a       b       c       d       e       f       g       h
2014-02-21 14:50:00    True    True    True   False   False    True    True    True
2014-02-21 14:51:00    True    True    True   False   False    True   False   False
2014-02-21 14:52:00    True    True    True   False   False    True   False   False
2014-02-21 14:53:00    True    True    True   False   False    True   False   False
2014-02-21 14:54:00    True    True   False   False   False    True   False   False
2014-02-21 14:55:00    True    True   False   False   False    True   False   False
2014-02-21 14:56:00    True    True   False   False   False    True   False   False
2014-02-21 14:57:00    True    True   False   False    True    True   False    True
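A related way to get at gap lengths with NumPy alone is to locate the start and end of each NaN run. The following is a sketch under that approach, not a drop-in replacement for the code above; the helper name `nan_run_lengths` and the toy array are hypothetical. Unlike `calc_mask`, it also reports the length of a trailing gap, so the caller decides how trailing NaNs are treated.

```python
import numpy as np


def nan_run_lengths(arr):
    """For each element, return the length of the NaN run it sits in (0 if not NaN)."""
    isnan = np.isnan(arr)
    # Pad with non-NaN markers so every run has both a start and an end edge.
    padded = np.concatenate(([False], isnan, [False])).astype(np.int8)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)   # first index of each NaN run
    ends = np.flatnonzero(edges == -1)    # one past the last index of each run
    run_len = ends - starts
    lengths = np.zeros(isnan.shape[0], dtype=int)
    # NaN positions appear run by run, so repeating each length fills them in order.
    lengths[isnan] = np.repeat(run_len, run_len)
    return lengths


# Hypothetical toy array: an interior gap of 2 and a trailing gap of 3.
x = np.array([1.0, np.nan, np.nan, 4.0, np.nan, np.nan, np.nan])
lengths = nan_run_lengths(x)
small_gap_mask = lengths <= 2  # True for valid values and gaps of at most 2

print(lengths.tolist())         # [0, 2, 2, 0, 3, 3, 3]
print(small_gap_mask.tolist())  # [True, True, True, True, False, False, False]
```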
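According to the interpolate documentation, limit_area is new in version 0.23.0; it is used as shown below. I am not sure if this is the desired output for columns e and g, as you didn't specify the desired output in detail.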
import numpy as np
import pandas as pd
from datetime import datetime
from datetime import timedelta

start = datetime(2014, 2, 21, 14, 50)
# np.nan is used instead of the np.NaN alias, which was removed in NumPy 2.0.
df = data = pd.DataFrame(index=[start + timedelta(minutes=1*x) for x in range(0, 8)],
                         data={'a': [123.5, np.nan, 136.3, 164.3, 213.0, 164.3, 213.0, 221.1],
                               'b': [433.5, 523.2, 536.3, 464.3, 413.0, 164.3, 213.0, 221.1],
                               'c': [123.5, 132.3, 136.3, 164.3] + [np.nan]*4,
                               'd': [np.nan]*8,
                               'e': [np.nan]*7 + [2330.3],
                               'f': [np.nan]*4 + [2763.0, 2142.3, 2127.3, 2330.3],
                               'g': [2330.3] + [np.nan]*7,
                               'h': [2330.3] + [np.nan]*6 + [2777.7]})

# Fill at most 5 NaNs in each direction, and only outside the valid values.
df.interpolate(
    limit=5,
    inplace=True,
    limit_direction='both',
    limit_area='outside',
)
print(df)
Output:
                           a       b       c       d       e       f       g       h
2014-02-21 14:50:00   123.5   433.5   123.5     NaN     NaN  2763.0  2330.3  2330.3
2014-02-21 14:51:00     NaN   523.2   132.3     NaN     NaN  2763.0  2330.3     NaN
2014-02-21 14:52:00   136.3   536.3   136.3     NaN  2330.3  2763.0  2330.3     NaN
2014-02-21 14:53:00   164.3   464.3   164.3     NaN  2330.3  2763.0  2330.3     NaN
2014-02-21 14:54:00   213.0   413.0   164.3     NaN  2330.3  2763.0  2330.3     NaN
2014-02-21 14:55:00   164.3   164.3   164.3     NaN  2330.3  2142.3  2330.3     NaN
2014-02-21 14:56:00   213.0   213.0   164.3     NaN  2330.3  2127.3     NaN     NaN
2014-02-21 14:57:00   221.1   221.1   164.3     NaN  2330.3  2330.3     NaN  2777.7
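Note that with limit_area='outside' the interior NaN in column a is not filled. A minimal sketch of the difference between the two limit_area settings, on a made-up Series rather than the data above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: leading NaN, interior NaN, trailing NaN.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# 'inside' only fills NaNs surrounded by valid values (interpolation).
inside = s.interpolate(limit_area='inside')
# 'outside' only fills NaNs beyond the first / last valid value (extrapolation).
outside = s.interpolate(limit_direction='both', limit_area='outside')

print(inside.tolist())   # [nan, 1.0, 2.0, 3.0, nan]
print(outside.tolist())  # [1.0, 1.0, nan, 3.0, 3.0]
```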
I went ahead and adapted @JohnE's answer into a function (with some tweaks/improvements). I'm using Python 3.8, and I believe type hinting changed in 3.9, so you may need to adapt.
from typing import Union

import pandas as pd


def fill_with_hard_limit(
        df_or_series: Union[pd.DataFrame, pd.Series], limit: int,
        fill_method='interpolate',
        **fill_method_kwargs) -> Union[pd.DataFrame, pd.Series]:
    """The fill methods from Pandas such as ``interpolate`` or ``bfill``
    will fill ``limit`` number of NaNs, even if the total number of
    consecutive NaNs is larger than ``limit``. This function instead
    does not fill any data when the number of consecutive NaNs
    is > ``limit``.

    Adapted from:

    :param df_or_series: DataFrame or Series to perform interpolation
        on.
    :param limit: Maximum number of consecutive NaNs to allow. Any
        occurrences of more consecutive NaNs than ``limit`` will have no
        filling performed.
    :param fill_method: Filling method to use, e.g. 'interpolate',
        'bfill', etc.
    :param fill_method_kwargs: Keyword arguments to pass to the
        fill_method, in addition to the given limit.

    :returns: A filled version of the given df_or_series according
        to the given inputs.
    """
    # Keep things simple, ensure we have a DataFrame.
    try:
        df = df_or_series.to_frame()
    except AttributeError:
        df = df_or_series

    # Initialize our mask.
    mask = pd.DataFrame(True, index=df.index, columns=df.columns)

    # Label each run of consecutive NaN / non-NaN values per column.
    grp = (df.notnull() != df.shift().notnull()).cumsum()

    # Add columns of ones.
    grp['ones'] = 1

    # Loop through columns and update the mask.
    for col in df.columns:
        mask.loc[:, col] = (
            (grp.groupby(col)['ones'].transform('count') <= limit)
            | df[col].notnull()
        )

    # Now, interpolate and use the mask to create NaNs for the larger
    # gaps.
    method = getattr(df, fill_method)
    out = method(limit=limit, **fill_method_kwargs)[mask]

    # Be nice to the caller and return a Series if that's what they
    # provided.
    if isinstance(df_or_series, pd.Series):
        # Return a Series.
        return out.loc[:, out.columns[0]]

    return out
Usage:
>>> data_filled = fill_with_hard_limit(data, 5)
>>> data_filled
                           a       b       c       d       e       f       g       h
2014-02-21 14:50:00   123.5   433.5   123.5     NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00   129.9   523.2   132.3     NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00   136.3   536.3   136.3     NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00   164.3   464.3   164.3     NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00   213.0   413.0   164.3     NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00   164.3   164.3   164.3     NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00   213.0   213.0   164.3     NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00   221.1   221.1   164.3     NaN  2330.3  2330.3     NaN  2777.7