Interpolate (or extrapolate) only small gaps in pandas dataframe

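I have a pandas DataFrame with time as the index (1-minute frequency) and several columns of data. Sometimes the data contains NaN. When it does, I only want to interpolate if the gap is no longer than 5 minutes, which here means at most 5 consecutive NaNs. The data may look like this (several test cases that show the problems):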

import numpy as np
import pandas as pd
from datetime import datetime, timedelta

start = datetime(2014,2,21,14,50)
data = pd.DataFrame(index=[start + timedelta(minutes=1*x) for x in range(0, 8)],
                         data={'a': [123.5, np.nan, 136.3, 164.3, 213.0, 164.3, 213.0, 221.1],
                               'b': [433.5, 523.2, 536.3, 464.3, 413.0, 164.3, 213.0, 221.1],
                               'c': [123.5, 132.3, 136.3, 164.3] + [np.nan]*4,
                               'd': [np.nan]*8,
                               'e': [np.nan]*7 + [2330.3],
                               'f': [np.nan]*4 + [2763.0, 2142.3, 2127.3, 2330.3],
                               'g': [2330.3] + [np.nan]*7,
                               'h': [2330.3] + [np.nan]*6 + [2777.7]})

It looks like this:

In [147]: data
Out[147]: 
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00    NaN  523.2  132.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00  213.0  413.0    NaN NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3    NaN NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0    NaN NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1    NaN NaN  2330.3  2330.3     NaN  2777.7

I am aware of data.interpolate(), but it has several flaws. It produces the result below, which is fine for columns a-e but fails for columns f-h, each for a different reason:

                         a      b      c   d       e       f       g  \
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3   
2014-02-21 14:51:00  129.9  523.2  132.3 NaN     NaN     NaN  2330.3   
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN  2330.3   
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN  2330.3   
2014-02-21 14:54:00  213.0  413.0  164.3 NaN     NaN  2763.0  2330.3   
2014-02-21 14:55:00  164.3  164.3  164.3 NaN     NaN  2142.3  2330.3   
2014-02-21 14:56:00  213.0  213.0  164.3 NaN     NaN  2127.3  2330.3   
2014-02-21 14:57:00  221.1  221.1  164.3 NaN  2330.3  2330.3  2330.3   

                               h  
2014-02-21 14:50:00  2330.300000  
2014-02-21 14:51:00  2394.214286  
2014-02-21 14:52:00  2458.128571  
2014-02-21 14:53:00  2522.042857  
2014-02-21 14:54:00  2585.957143  
2014-02-21 14:55:00  2649.871429  
2014-02-21 14:56:00  2713.785714  
2014-02-21 14:57:00  2777.700000 

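f) The gap at the start consists of 4 minutes of NaNs; they should be replaced by the value 2763.0 (i.e. extrapolating backwards in time).

g) The gap is longer than 5 minutes, but it is still extrapolated.

h) The gap is longer than 5 minutes, but it is still interpolated.

I understand the reasons: of course, I never specified that gaps longer than 5 minutes should not be interpolated. I also understand that interpolate only extrapolates forward in time, but I want it to extrapolate backward in time as well. Is there a known method I can use for my problem without reinventing the wheel?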

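Edit: The method data.interpolate accepts the input parameter limit, which defines the maximum number of consecutive NaNs to be replaced by interpolation. But this still interpolates up to the limit, whereas in that case I want to keep all of the NaNs in the gap.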
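To make the behaviour described in the edit concrete, here is a minimal sketch that applies interpolate(limit=5) to column h from the question on its own. The variable names h and filled are only for illustration.

```python
import numpy as np
import pandas as pd

# Column 'h' from the question: a 6-NaN gap between two valid values.
h = pd.Series([2330.3] + [np.nan] * 6 + [2777.7])

# limit=5 caps how many consecutive NaNs get filled, but it does not
# skip the gap: the first 5 NaNs are interpolated and the 6th is left.
filled = h.interpolate(limit=5)

print(filled)
print("NaNs remaining:", int(filled.isna().sum()))
```

The gap is longer than the limit, yet most of it is filled, which is exactly what the question wants to avoid.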

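So here is a mask that ought to solve the problem. Just interpolate, then apply the mask to reset the appropriate values to NaN. Honestly, this was a bit more work than I expected, because I had to loop through each column, and then groupby didn't quite work without me providing a dummy column like 'ones'.

Anyway, I can explain if anything is unclear, but really only a couple of the lines are somewhat hard to understand. For a little more insight into the trick on the df['new'] line, print out the individual lines to better see what is going on.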

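mask = data.copy()
for i in list('abcdefgh'):
    df = pd.DataFrame( data[i] )
    # Label each run of consecutive NaN / non-NaN values with its own ID.
    df['new'] = ((df[i].notnull() != df[i].shift().notnull()).cumsum())
    df['ones'] = 1
    # Keep valid values and NaN runs of at most 5 rows ("<=" so that a gap
    # of exactly 5 NaNs still counts as small, as the question asks).
    mask[i] = (df.groupby('new')['ones'].transform('count') <= 5) | data[i].notnull()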

In [7]: data
Out[7]: 
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00    NaN  523.2  132.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00  213.0  413.0    NaN NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3    NaN NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0    NaN NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1    NaN NaN  2330.3  2330.3     NaN  2777.7

In [8]: mask
Out[8]: 
                        a     b     c      d      e     f      g      h
2014-02-21 14:50:00  True  True  True  False  False  True   True   True
2014-02-21 14:51:00  True  True  True  False  False  True  False  False
2014-02-21 14:52:00  True  True  True  False  False  True  False  False
2014-02-21 14:53:00  True  True  True  False  False  True  False  False
2014-02-21 14:54:00  True  True  True  False  False  True  False  False
2014-02-21 14:55:00  True  True  True  False  False  True  False  False
2014-02-21 14:56:00  True  True  True  False  False  True  False  False
2014-02-21 14:57:00  True  True  True  False   True  True  False   True

If you are not doing anything fancier with respect to extrapolation, it is easy to finish from there:

In [9]: data.interpolate().bfill()[mask]
Out[9]: 
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN  2763.0  2330.3  2330.3
2014-02-21 14:51:00  129.9  523.2  132.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:54:00  213.0  413.0  164.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3  164.3 NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0  164.3 NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1  164.3 NaN  2330.3  2330.3     NaN  2777.7

Edit to add: here is a faster (about 2x on this sample data) and slightly simpler way, which moves some of the work outside the loop:

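mask = data.copy()
# Run IDs for every column at once: the ID increments whenever a column
# switches between NaN and non-NaN.
grp = ((mask.notnull() != mask.shift().notnull()).cumsum())
grp['ones'] = 1
for i in list('abcdefgh'):
    # "<=" so that a gap of exactly 5 NaNs still counts as small.
    mask[i] = (grp.groupby(i)['ones'].transform('count') <= 5) | data[i].notnull()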
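To see what the cumsum line in the code above is doing, here is a small sketch on a toy Series. The names s, changed, run_id and run_len are only for illustration, and it uses transform("size") instead of a dummy 'ones' column, which is a substitution on my part rather than what the answer does.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0, np.nan])

# True wherever the NaN/valid state differs from the previous row.
# shift() puts NaN in the first row, so a valid first row counts as a change.
changed = s.notnull() != s.shift().notnull()

# The cumulative sum of the change flags gives every run its own ID.
run_id = changed.cumsum()

# Number of rows in the run that each row belongs to.
run_len = run_id.groupby(run_id).transform("size")

print(pd.DataFrame({"s": s, "run_id": run_id, "run_len": run_len}))
# run_id : 1 2 2 3 3 4
# run_len: 1 2 2 2 2 1
```

Comparing run_len against the maximum gap length, and OR-ing with s.notnull() so that runs of valid values are always kept, gives the same kind of mask as above.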

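I had to solve a similar problem and came up with a numpy-based solution before I found the answer above. Since my code is roughly ten times faster, I am posting it here in case it is useful to someone in the future. It handles NaNs at the end of a series differently from the answer above: if a series ends with NaNs, it flags that last gap as invalid.

Here is the code: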


def bfill_nan(arr):
    """ Backward-fill NaNs """
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[0]), mask.shape[0]-1)
    idx = np.minimum.accumulate(idx[::-1], axis=0)[::-1]
    out = arr[idx]
    return out

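def calc_mask(arr, maxgap):
    """ Mask NaN gaps longer than `maxgap` """
    isnan = np.isnan(arr)
    cumsum = np.cumsum(isnan).astype('float')
    diff = np.zeros_like(arr)
    # At each valid value, store the number of NaNs since the previous valid value.
    diff[~isnan] = np.diff(cumsum[~isnan], prepend=0)
    diff[isnan] = np.nan
    # Propagate each gap length backwards onto the NaNs that form the gap.
    diff = bfill_nan(diff)
    # "<=" keeps gaps of exactly `maxgap` NaNs, matching the docstring.
    return (diff <= maxgap) | ~isnan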


mask = data.copy()

for column_name in data:
    x = data[column_name].values
    mask[column_name] = calc_mask(x, 5)

print('data:')
print(data)

print('\nmask:')
print(mask)

Output:

data:
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00    NaN  523.2  132.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00  213.0  413.0    NaN NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3    NaN NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0    NaN NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1    NaN NaN  2330.3  2330.3     NaN  2777.7

mask:
                        a     b      c      d      e     f      g      h
2014-02-21 14:50:00  True  True   True  False  False  True   True   True
2014-02-21 14:51:00  True  True   True  False  False  True  False  False
2014-02-21 14:52:00  True  True   True  False  False  True  False  False
2014-02-21 14:53:00  True  True   True  False  False  True  False  False
2014-02-21 14:54:00  True  True  False  False  False  True  False  False
2014-02-21 14:55:00  True  True  False  False  False  True  False  False
2014-02-21 14:56:00  True  True  False  False  False  True  False  False
2014-02-21 14:57:00  True  True  False  False   True  True  False   True
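The answer stops at the mask, so here is a hedged sketch of one way to apply it. It restates the two helpers so the block runs on its own, with the comparison written as `<=` so that a gap of exactly maxgap NaNs counts as small (an assumption about the intended semantics). The three-column frame df and the interpolate().bfill().where(mask) step are illustrative choices, not part of the answer.

```python
import numpy as np
import pandas as pd

def bfill_nan(arr):
    """Backward-fill NaNs in a 1-D float array."""
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[0]), mask.shape[0] - 1)
    idx = np.minimum.accumulate(idx[::-1], axis=0)[::-1]
    return arr[idx]

def calc_mask(arr, maxgap):
    """True for valid values and for NaNs in gaps of at most `maxgap` rows."""
    isnan = np.isnan(arr)
    cumsum = np.cumsum(isnan).astype("float")
    diff = np.zeros_like(arr)
    diff[~isnan] = np.diff(cumsum[~isnan], prepend=0)
    diff[isnan] = np.nan
    diff = bfill_nan(diff)
    return (diff <= maxgap) | ~isnan

# Three of the question's test columns.
df = pd.DataFrame({
    "c": [123.5, 132.3, 136.3, 164.3] + [np.nan] * 4,       # trailing gap
    "f": [np.nan] * 4 + [2763.0, 2142.3, 2127.3, 2330.3],   # short leading gap
    "h": [2330.3] + [np.nan] * 6 + [2777.7],                # long inner gap
})

mask = pd.DataFrame(
    {name: calc_mask(df[name].to_numpy(), 5) for name in df},
    index=df.index,
)

# Fill everything first, then blank out whatever the mask rejects.
result = df.interpolate().bfill().where(mask)
print(result)
```

With this mask the trailing NaNs in column c stay NaN even though the gap is short, which is the difference in end-of-series handling mentioned above.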

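According to the interpolate documentation, the limit_area parameter used below is new in version 0.23.0. I am not sure whether this is the desired output for columns e and g, because you did not specify the desired output in detail.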

import numpy as np
import pandas as pd
from datetime import datetime
from datetime import timedelta

start = datetime(2014,2,21,14,50)
df = data = pd.DataFrame(index=[start + timedelta(minutes=1*x) for x in range(0, 8)],
                         data={'a': [123.5, np.nan, 136.3, 164.3, 213.0, 164.3, 213.0, 221.1],
                               'b': [433.5, 523.2, 536.3, 464.3, 413.0, 164.3, 213.0, 221.1],
                               'c': [123.5, 132.3, 136.3, 164.3] + [np.nan]*4,
                               'd': [np.nan]*8,
                               'e': [np.nan]*7 + [2330.3],
                               'f': [np.nan]*4 + [2763.0, 2142.3, 2127.3, 2330.3],
                               'g': [2330.3] + [np.NaN]*7,
                               'h': [2330.3] + [np.NaN]*6 + [2777.7]})

df.interpolate(
    limit=5,
    inplace=True,
    limit_direction='both',
    limit_area='outside',
    )

print(df)

Output:

                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN  2763.0  2330.3  2330.3
2014-02-21 14:51:00    NaN  523.2  132.3 NaN     NaN  2763.0  2330.3     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN  2330.3  2763.0  2330.3     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN  2330.3  2763.0  2330.3     NaN
2014-02-21 14:54:00  213.0  413.0  164.3 NaN  2330.3  2763.0  2330.3     NaN
2014-02-21 14:55:00  164.3  164.3  164.3 NaN  2330.3  2142.3  2330.3     NaN
2014-02-21 14:56:00  213.0  213.0  164.3 NaN  2330.3  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1  164.3 NaN  2330.3  2330.3     NaN  2777.7
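As a small illustration of what limit_area selects, the sketch below compares 'inside' and 'outside' on a toy Series (s, inside and outside are names chosen here). Note that limit still fills part of a long gap, as column g in the output above shows (5 of its 7 NaNs are filled), so on its own this does not give the all-or-nothing behaviour the question asks for.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1.0, np.nan, 3.0, np.nan, np.nan])

# 'inside': only NaNs surrounded by valid values are interpolated.
inside = s.interpolate(limit_area="inside")

# 'outside': only leading/trailing NaNs are filled (with the nearest
# valid value); limit_direction='both' is needed to reach the leading ones.
outside = s.interpolate(limit_direction="both", limit_area="outside")

print(pd.DataFrame({"s": s, "inside": inside, "outside": outside}))
```

This is why column a keeps its single inner NaN in the output above: with limit_area='outside', inner gaps are never touched.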

I went ahead and adapted @JohnE's answer into a function (with some tweaks/improvements). I am using Python 3.8, and I believe type hinting changed in 3.9, so you may need to adapt it.

from typing import Union

def fill_with_hard_limit(
        df_or_series: Union[pd.DataFrame, pd.Series], limit: int,
        fill_method='interpolate',
        **fill_method_kwargs) -> Union[pd.DataFrame, pd.Series]:
    """The fill methods from Pandas such as ``interpolate`` or ``bfill``
    will fill ``limit`` number of NaNs, even if the total number of
    consecutive NaNs is larger than ``limit``. This function instead
    does not fill any data when the number of consecutive NaNs
    is > ``limit``.

    Adapted from: 

    :param df_or_series: DataFrame or Series to perform interpolation
        on.
    :param limit: Maximum number of consecutive NaNs to allow. Any
        occurrences of more consecutive NaNs than ``limit`` will have no
        filling performed.
    :param fill_method: Filling method to use, e.g. 'interpolate',
        'bfill', etc.
    :param fill_method_kwargs: Keyword arguments to pass to the
        fill_method, in addition to the given limit.

    :returns: A filled version of the given df_or_series according
        to the given inputs.
    """

    # Keep things simple, ensure we have a DataFrame.
    try:
        df = df_or_series.to_frame()
    except AttributeError:
        df = df_or_series

    # Initialize our mask.
    mask = pd.DataFrame(True, index=df.index, columns=df.columns)

    # Label each run of consecutive NaN / non-NaN values with an ID.
    grp = (df.notnull() != df.shift().notnull()).cumsum()

    # Add columns of ones.
    grp['ones'] = 1

    # Loop through columns and update the mask.
    for col in df.columns:

        mask.loc[:, col] = (
                (grp.groupby(col)['ones'].transform('count') <= limit)
                | df[col].notnull()
        )

    # Now, interpolate and use the mask to create NaNs for the larger
    # gaps.
    method = getattr(df, fill_method)
    out = method(limit=limit, **fill_method_kwargs)[mask]

    # Be nice to the caller and return a Series if that's what they
    # provided.
    if isinstance(df_or_series, pd.Series):
        # Return a Series.
        return out.loc[:, out.columns[0]]

    return out

Usage:

>>> data_filled = fill_with_hard_limit(data, 5)
>>> data_filled
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00  129.9  523.2  132.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00  213.0  413.0  164.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3  164.3 NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0  164.3 NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1  164.3 NaN  2330.3  2330.3     NaN  2777.7
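In the output above, the short leading gap in column f is left unfilled, which the question's case f) wanted replaced by 2763.0. Below is a compact, self-contained sketch of the same run-length idea that also fills short leading and trailing gaps by using limit_direction='both'. The name interpolate_small_gaps and its signature are hypothetical, and this is one possible variant rather than the function above.

```python
import numpy as np
import pandas as pd

def interpolate_small_gaps(df, max_gap):
    """Hypothetical helper: fill NaN runs of at most `max_gap` rows only."""
    is_nan = df.isna()
    # Per-column run ID: increments whenever the NaN/valid state flips.
    run_id = (is_nan != df.shift().isna()).cumsum()
    # Length of the run each cell belongs to, computed column by column.
    run_len = run_id.apply(lambda col: col.groupby(col).transform("size"))
    # Keep valid cells and NaN cells that sit in a short enough run.
    keep = ~is_nan | (run_len <= max_gap)
    # Fill everything in both directions, then blank out the long gaps.
    return df.interpolate(limit_direction="both").where(keep)

nan = np.nan
df = pd.DataFrame(
    {
        "a": [123.5, nan, 136.3, 164.3, 213.0, 164.3, 213.0, 221.1],
        "f": [nan] * 4 + [2763.0, 2142.3, 2127.3, 2330.3],
        "h": [2330.3] + [nan] * 6 + [2777.7],
    },
    index=pd.date_range("2014-02-21 14:50", periods=8, freq="min"),
)

result = interpolate_small_gaps(df, 5)
print(result)
```

Here column a has its single NaN interpolated, column f is back-filled with 2763.0, and the 6-NaN gap in column h is left untouched.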