在 pandas DataFrame 中取消嵌套（分解）多个列表列的有效方法

Question

我正在将多个 JSON 对象读取到一个 DataFrame 中。问题是有些列是列表。此外，数据非常大，因此我无法使用互联网上可用的解决方案。它们非常慢且内存效率低

我的数据如下所示：

df = pd.DataFrame({'A': ['x1','x2','x3', 'x4'], 'B':[['v1','v2'],['v3','v4'],['v5','v6'],['v7','v8']], 'C':[['c1','c2'],['c3','c4'],['c5','c6'],['c7','c8']],'D':[['d1','d2'],['d3','d4'],['d5','d6'],['d7','d8']], 'E':[['e1','e2'],['e3','e4'],['e5','e6'],['e7','e8']]})
    A       B          C           D           E
0   x1  [v1, v2]    [c1, c2]    [d1, d2]    [e1, e2]
1   x2  [v3, v4]    [c3, c4]    [d3, d4]    [e3, e4]
2   x3  [v5, v6]    [c5, c6]    [d5, d6]    [e5, e6]
3   x4  [v7, v8]    [c7, c8]    [d7, d8]    [e7, e8]

这是我的数据的形状：(441079, 12)

我想要的输出是：

    A       B          C           D           E
0   x1      v1         c1         d1          e1
0   x1      v2         c2         d2          e2
1   x2      v3         c3         d3          e3
1   x2      v4         c4         d4          e4
.....

编辑：在被标记为重复之后，我想强调一个事实，即在这个问题中我正在寻找一种高效分解多列的方法。因此，批准的答案能够有效地在非常大的数据集上分解任意数量的列。其他问题的答案未能做到的事情（这就是我在测试这些解决方案后问这个问题的原因）。

Answer 1

在 A 和其余列 apply 和 stack 上使用 set_index 值。所有这些都浓缩在一个衬垫中。

In [1253]: (df.set_index('A')
              .apply(lambda x: x.apply(pd.Series).stack())
              .reset_index()
              .drop('level_1', 1))
Out[1253]:
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

Answer 2

def explode(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)

    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()

    if (lens > 0).all():
        # ALL lists in cells aren't empty
        return pd.DataFrame({
            col:np.repeat(df[col].values, df[lst_cols[0]].str.len())
            for col in idx_cols
        }).assign(**{col:np.concatenate(df[col].values) for col in lst_cols}) \
          .loc[:, df.columns]
    else:
        # at least one list in cells is empty
        return pd.DataFrame({
            col:np.repeat(df[col].values, df[lst_cols[0]].str.len())
            for col in idx_cols
        }).assign(**{col:np.concatenate(df[col].values) for col in lst_cols}) \
          .append(df.loc[lens==0, idx_cols]).fillna(fill_value) \
          .loc[:, df.columns]

用法：

In [82]: explode(df, lst_cols=list('BCDE'))
Out[82]:
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

Answer 3

pandas >= 0.25

假设所有列都有相同数量的列表，您可以在每一列上调用 Series.explode。

df.set_index(['A']).apply(pd.Series.explode).reset_index()

    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

想法是将所有必须 NOT 展开的列设置为索引，然后在展开后重置索引。

它也更快。

%timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()
%%timeit
(df.set_index('A')
   .apply(lambda x: x.apply(pd.Series).stack())
   .reset_index()
   .drop('level_1', 1))


2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Answer 4

基于@cs95 的回答，我们可以在 lambda 函数中使用 if 子句，而不是将所有其他列设置为 index。这样做有以下优点：

保留列顺序
让您使用要修改的集合轻松指定列，x.name in [...]，或不修改 x.name not in [...]。

df.apply(lambda x: x.explode() if x.name in ['B', 'C', 'D', 'E'] else x)

     A   B   C   D   E
0   x1  v1  c1  d1  e1
0   x1  v2  c2  d2  e2
1   x2  v3  c3  d3  e3
1   x2  v4  c4  d4  e4
2   x3  v5  c5  d5  e5
2   x3  v6  c6  d6  e6
3   x4  v7  c7  d7  e7
3   x4  v8  c8  d8  e8

Answer 5

这是我使用 'apply' 函数的解决方案。主要 features/differences:

提供选项以指定已选择多列或所有列
提供选项指定值填充'missing'位置（通过参数fill_mode = 'external'；'internal'；或'trim'，解释会很长，请参阅下面的示例并尝试自己更改选项并检查结果）

注意：选项 'trim' 是根据我的需要开发的，超出了这个问题的范围

def cell_size_equalize2(row, cols='', fill_mode='internal', fill_value=''):
    jcols = [j for j,v in enumerate(row.index) if v in cols]
    if len(jcols)<1:
        jcols = range(len(row.index))
    Ls = [lenx(x) for x in row.values]
    if not Ls[:-1]==Ls[1:]:
        vals = [v if isinstance(v,list) else [v] for v in row.values]
        if fill_mode=='external':
            vals = [[e] + [fill_value]*(max(Ls)-1) if (not j in jcols) and (isinstance(row.values[j],list))
                    else e + [fill_value]*(max(Ls)-lenx(e))
                    for j,e in enumerate(vals)]
        elif fill_mode == 'internal':
            vals = [[e]+[e]*(max(Ls)-1) if (not j in jcols) and (isinstance(row.values[j],list))
                    else e+[e[-1]]*(max(Ls)-lenx(e)) 
                    for j,e in enumerate(vals)]
        else:
            vals = [e[0:min(Ls)] for e in vals]
        row = pd.Series(vals,index=row.index.tolist())
    return row

示例：

df=pd.DataFrame({
    'a':[[1],2,3],
    'b':[[4,5,7],[5,4],4],
    'c':[[4,5],5,[6]]
})
print(df)
df1 = df.apply(cell_size_equalize2, cols='', fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', all columns, fill_value = \'OK\'\n', df1)
df2 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', cols = [\'a\', \'b\'], fill_value = \'OK\'\n', df2)
df3 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='internal', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'internal\', cols = [\'a\', \'b\']\n', df3)
df4 = df.apply(cell_size_equalize2, cols='', fill_mode='trim', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'trim\', all columns\n', df4)

输出：

     a          b       c
0  [1]  [4, 5, 7]  [4, 5]
1    2     [5, 4]       5
2    3          4     [6]

fill_mode='external', all columns, fill_value = 'OK'
     a  b   c
0   1  4   4
0  OK  5   5
0  OK  7  OK
1   2  5   5
1  OK  4  OK
2   3  4   6

fill_mode='external', cols = ['a', 'b'], fill_value = 'OK'
     a  b       c
0   1  4  [4, 5]
0  OK  5      OK
0  OK  7      OK
1   2  5       5
1  OK  4      OK
2   3  4       6

fill_mode='internal', cols = ['a', 'b']
    a  b       c
0  1  4  [4, 5]
0  1  5  [4, 5]
0  1  7  [4, 5]
1  2  5       5
1  2  4       5
2  3  4       6

fill_mode='trim', all columns
    a  b  c
0  1  4  4
1  2  5  5
2  3  4  6

Answer 6

截至 pandas 1.3.0：

DataFrame.explode() now supports exploding multiple columns. Its column argument now also accepts a list of str or tuples for exploding on multiple columns at the same time (GH39240)

What’s new in 1.3.0 (July 2, 2021)

所以现在这个操作很简单：

df.explode(['B', 'C', 'D', 'E'])

    A   B   C   D   E
0  x1  v1  c1  d1  e1
0  x1  v2  c2  d2  e2
1  x2  v3  c3  d3  e3
1  x2  v4  c4  d4  e4
2  x3  v5  c5  d5  e5
2  x3  v6  c6  d6  e6
3  x4  v7  c7  d7  e7
3  x4  v8  c8  d8  e8

或者如果想要唯一索引：

df.explode(['B', 'C', 'D', 'E'], ignore_index=True)

    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

在 pandas DataFrame 中取消嵌套（分解）多个列表列的有效方法

Efficient way to unnest (explode) multiple list columns in a pandas DataFrame

python

json

dataframe

pandas

pandas-explode

pandas >= 0.25