Unpacking a column of nested tuples of different lengths into multiple columns in a pandas DataFrame

Question

I have the following dataframe, which contains a column of nested tuples:

index  nested_tuples
1      (('a',(1,0)),('b',(2,0)),('c',(3,0)))
2      (('a',(5,0)),('d',(6,0)),('e',(7,0)),('f',(8,0)))
3      (('c',(4,0)),('d',(5,0)),('g',(6,0)),('h',(7,0)))

I am trying to unpack the tuples to get the following dataframe:

index  a   b   c   d   e   f   g   h
1      1   2   3
2      5           6   7   8
3              4   5           6   7

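That is, for each tuple (char, (num1, num2)), I want char to be a column and num1 to be the entry. I initially tried various approaches with to_list(), but because the number of mini-tuples and the characters in them differ from row to row, I couldn't use it without losing information. In the end, the only solution I could come up with was: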

for index, row in df.iterrows():
    tuples = row['nested_tuples']
    if not tuples:
        continue
    for mini_tuple in tuples:
        df.loc[index, mini_tuple[0]] = mini_tuple[1][0]

However, in my actual dataframe the nested tuples are long and df is very large, so iterrows is very slow. Is there a better, vectorized way to do this?

Answer 1

It is probably more efficient to clean the data in vanilla Python before constructing the DataFrame:

out = pd.DataFrame([{k:v[0] for k,v in tpl} for tpl in df['nested_tuples'].tolist()])

A bit more concisely:

out = pd.DataFrame(map(dict, df['nested_tuples'])).stack().str[0].unstack()
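The chained version above can be read one step at a time. The following is a minimal sketch that rebuilds the sample data from the question (the index labels 1 to 3 are taken from the question's table) and names each intermediate result; the variable names `wide`, `long_form`, and `first` are illustrative only.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame(
    {
        "nested_tuples": [
            (("a", (1, 0)), ("b", (2, 0)), ("c", (3, 0))),
            (("a", (5, 0)), ("d", (6, 0)), ("e", (7, 0)), ("f", (8, 0))),
            (("c", (4, 0)), ("d", (5, 0)), ("g", (6, 0)), ("h", (7, 0))),
        ]
    },
    index=[1, 2, 3],
)

# Step 1: dict() turns each row into {char: (num1, num2)}; building a
# DataFrame from those dicts aligns the keys as columns (missing -> NaN).
wide = pd.DataFrame(list(map(dict, df["nested_tuples"])))

# Step 2: stack() reshapes the frame into a long Series indexed by (row, column).
long_form = wide.stack()

# Step 3: .str[0] picks the first element of each (num1, num2) tuple.
first = long_form.str[0]

# Step 4: unstack() pivots the column level back out into a wide frame.
out = first.unstack()
print(out)
```

Note that building the frame from a list of dicts produces a fresh 0-based index, so the original labels of df are not carried over here.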

Another option uses apply:

out = pd.DataFrame(df['nested_tuples'].apply(lambda x: {k:v[0] for k,v in x}).tolist())

Output:

     a    b    c    d    e    f    g    h
0  1.0  2.0  3.0  NaN  NaN  NaN  NaN  NaN
1  5.0  NaN  NaN  6.0  7.0  8.0  NaN  NaN
2  NaN  NaN  4.0  5.0  NaN  NaN  6.0  7.0
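The output above has a 0-based index and float columns, whereas the question's desired frame keeps the original index and shows integers. As a hedged sketch, the dict-comprehension approach can be extended to cover both points, plus the empty rows that the question's loop skips with `if not tuples`. The extra row with `None` at index 4 is hypothetical, and the cast to the nullable `Int64` dtype assumes every num1 is a whole number.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical empty row (index 4).
df = pd.DataFrame(
    {
        "nested_tuples": [
            (("a", (1, 0)), ("b", (2, 0)), ("c", (3, 0))),
            (("a", (5, 0)), ("d", (6, 0)), ("e", (7, 0)), ("f", (8, 0))),
            (("c", (4, 0)), ("d", (5, 0)), ("g", (6, 0)), ("h", (7, 0))),
            None,
        ]
    },
    index=[1, 2, 3, 4],
)

# `tpl or ()` maps None and empty tuples to an empty dict, which becomes an
# all-missing row. (A float NaN would not be caught by this guard.)
records = [{k: v[0] for k, v in (tpl or ())} for tpl in df["nested_tuples"]]

# index=df.index keeps the original row labels; Int64 keeps the values as
# integers while still allowing missing entries (shown as <NA>).
out = pd.DataFrame(records, index=df.index).astype("Int64")
print(out)
```

If the result should sit next to the other columns of the original frame, `df.join(out)` would align on the shared index.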

