Pandas: Forward fill with overlap on Series containing List Objects

Question

I have a Series/DataFrame like the one below. Its elements are lists of one or more values:

0      NaN
1     [40]
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
8      NaN
9     [35]
10     NaN
11     NaN
12    [28]
13     NaN
14     NaN
15     NaN
16     NaN
17     NaN
Name: tags, dtype: object

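I want to fill the missing values with the most recent value, for at most five consecutive entries. An ffill with a limit of 5 comes closest. However, my use case requires the forward fills to overlap. My expected output looks like this: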

0          NaN
1         [40]
2         [40]
3         [40]
4         [40]
5         [40]
6         [40]
7          NaN
8          NaN
9         [35]
10        [35]
11        [35]
12        [28]
13    [35, 28]
14    [35, 28]
15        [28]
16        [28]
17        [28]
Name: tags, dtype: object
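
For comparison, here is a minimal sketch of what a plain ffill with limit=5 does on the same data. The variable names `tags` and `plain` are chosen only for this illustration. Each gap receives just the single most recent value, so rows 13 and 14 end up with [28] rather than the overlapping [35, 28] shown above.

```python
import numpy as np
import pandas as pd

# The example data from the question.
tags = pd.Series(
    [np.nan, [40]] + [np.nan] * 7 + [[35], np.nan, np.nan, [28]] + [np.nan] * 5,
    name="tags",
)

# Plain forward fill: every missing row takes only the latest preceding value,
# and at most 5 consecutive missing rows are filled.
plain = tags.ffill(limit=5)
print(plain)
```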

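The example above is simplified. The function I describe is part of a larger pd.groupby operation with many more tags, so Python loops aren't much help. I don't care about the rows that hold the tags themselves; only the filled ones matter to me. Maybe an approach using pandas cumsum and slicing based on index differences could work here?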

Any ideas on how to solve this would be a great help. Thanks in advance!

Answer 1

import numpy as np
import pandas as pd

# init the DataFrame
temp = pd.DataFrame({"tags":[
    np.nan, [40], np.nan, np.nan, np.nan, 
    np.nan, np.nan, np.nan, np.nan, [35], 
    np.nan, np.nan, [28], np.nan, np.nan, 
    np.nan, np.nan, np.nan]})

# initialize the result with empty lists for list concatenation
temp['ctags'] = temp['tags'].apply(lambda x: [] if type(x) == float else x)

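# the window counts the source row itself, so each value reaches
# window - 1 = 4 later rows; use window = 6 for a fill limit of 5
window = 5
for i in range(1, window):
    # append the list found i rows earlier (the NaN introduced by shift becomes [])
    temp['ctags'] = temp['ctags'] + temp['tags'].shift(i).apply(lambda x: [] if type(x) == float else x)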

temp['ctags']

This gives the output:

0           []
1         [40]
2         [40]
3         [40]
4         [40]
5         [40]
6           []
7           []
8           []
9         [35]
10        [35]
11        [35]
12    [28, 35]
13    [28, 35]
14        [28]
15        [28]
16        [28]
17          []

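I was able to come up with this quick solution to my problem. The drawback is that it is not as efficient as I would like, and it gets slower still if I raise the fill limit to 10.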

Edit: Added a loop for reusability. The result is accumulated in place, so it is more memory efficient.
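
The output above differs from the expected output in the question in three ways: values are carried 4 rows rather than 5, older values come last, and empty rows hold [] rather than NaN. Below is a minimal sketch of the same shift-based idea wrapped in a function that addresses those points. The name `overlapping_ffill` and its `limit` parameter are hypothetical, chosen for this illustration. Note that row 12 becomes [35, 28] instead of [28]; the question states that rows holding the tags themselves do not matter.

```python
import numpy as np
import pandas as pd


def overlapping_ffill(tags, limit=5):
    """Concatenate each row's list with the lists of the previous `limit` rows."""
    # Replace missing entries with empty lists so that `+` concatenates.
    filled = tags.apply(lambda x: x if isinstance(x, list) else [])
    result = filled.copy()
    for i in range(1, limit + 1):
        # shift introduces NaN at the top; turn those into empty lists as well.
        shifted = filled.shift(i).apply(lambda x: x if isinstance(x, list) else [])
        # Prepend, so that older values come first as in the expected output.
        result = shifted + result
    # Rows that collected nothing go back to NaN.
    return result.apply(lambda x: x if x else np.nan)


# The example data from the question.
tags = pd.Series(
    [np.nan, [40]] + [np.nan] * 7 + [[35], np.nan, np.nan, [28]] + [np.nan] * 5,
    name="tags",
)
print(overlapping_ffill(tags, limit=5))
```

This still performs one shift and one concatenation per step of the limit, so the cost grows with the limit just as in the loop above.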

Answer 2

You can try:

# df is the DataFrame holding the 'tags' column; replace NaN with empty lists
df['tags'] = [[] if na else s for s, na in zip(df['tags'], df['tags'].isna())]

# flatten the lists found in each rolling window of 5 rows
df['res'] = [[l for ls in window for l in ls] for window in df['tags'].rolling(5)]
print(df)

Output

    tags       res
0     []        []
1   [40]      [40]
2     []      [40]
3     []      [40]
4     []      [40]
5     []      [40]
6     []        []
7     []        []
8     []        []
9   [35]      [35]
10    []      [35]
11    []      [35]
12  [28]  [35, 28]
13    []  [35, 28]
14    []      [28]
15    []      [28]
16    []      [28]
17    []        []

As an alternative, you can use chain.from_iterable:

from itertools import chain

# compute rolling windows
df['res'] = [list(chain.from_iterable(window)) for window in df['tags'].rolling(5)]

See this for a comparison of several ways to flatten lists in pandas.
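
Since the question mentions that this runs inside a larger groupby, here is a minimal sketch of how the windowed flattening could be applied per group so that values never leak across group boundaries. The `group` column, the `windowed_concat` helper, and its `limit` parameter are hypothetical names introduced for this illustration. The sketch uses plain list slicing in place of rolling, which is a substitution of mine and not part of the answer above.

```python
from itertools import chain

import numpy as np
import pandas as pd


def windowed_concat(tags, limit=5):
    """Flatten the lists found in each row and in the `limit` rows before it."""
    lists = [x if isinstance(x, list) else [] for x in tags]
    merged = [
        list(chain.from_iterable(lists[max(0, i - limit): i + 1]))
        for i in range(len(lists))
    ]
    # Keep the original index so the result aligns with the input rows;
    # rows that collected nothing become NaN.
    return pd.Series([m if m else np.nan for m in merged], index=tags.index, dtype=object)


# Hypothetical data: two groups, each with its own run of tags.
df = pd.DataFrame({
    "group": ["a"] * 4 + ["b"] * 4,
    "tags": [[1], np.nan, [2], np.nan, np.nan, [3], np.nan, np.nan],
})

# group_keys=False keeps the original row index on the combined result.
df["res"] = df.groupby("group", group_keys=False)["tags"].apply(windowed_concat, limit=2)
print(df)
```

The first row of group "b" stays NaN even though group "a" ends with a tag two rows earlier, because each group is processed on its own.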

Tags: python, list, dataframe, pandas, fillna