get_dummies for a Pandas column containing lists

Question

Suppose I have a DataFrame in which one column contains lists of strings, like this:

    Name    Fruit
0   Curly   [Apple]
1   Moe     [Orange]
2   Larry   [Apple, Banana]

How can I turn it into something like this?

    Name     Fruit_Apple   Fruit_Orange   Fruit_Banana
0   Curly              1              0              0
1   Moe                0              1              0
2   Larry              1              0              1

I think I need to use pandas.get_dummies() somehow, but I can't figure out how. Any help?

Answer 1

import pandas as pd

df = pd.DataFrame({'Name': ['Curly', 'Moe', 'Larry'],
                   'Fruit': [['Apple'], ['Orange'], ['Apple', 'Banana']]},
                  columns=['Name', 'Fruit'])

# a one-liner... that's pretty long    
dummies_df = pd.get_dummies(
  df.join(pd.Series(df['Fruit'].apply(pd.Series).stack().reset_index(1, drop=True),
                    name='Fruit1')).drop('Fruit', axis=1).rename(columns={'Fruit1': 'Fruit'}),
  columns=['Fruit']).groupby('Name', as_index=False).sum()

print(dummies_df)

I'll break it down into the following steps:

Step 1:

df['Fruit'].apply(pd.Series).stack().reset_index(1, drop=True)

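This step applies pd.Series to each list, splitting the items of each list into separate columns. stack then stacks those columns into a single column while preserving the original row index, which is needed later. reset_index(1, drop=True) drops level 1 of the resulting MultiIndex (the item's position within its list), since it isn't needed. You end up with this: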

0     Apple
1    Orange
2     Apple
2    Banana
dtype: object
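
To make Step 1 easier to follow, here is a minimal sketch that runs each call separately. The intermediate names (wide, stacked, fruits) are mine, not part of the answer, and the explicit dropna() is a defensive assumption in case the installed pandas version keeps the NaN padding when stacking.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Curly', 'Moe', 'Larry'],
                   'Fruit': [['Apple'], ['Orange'], ['Apple', 'Banana']]},
                  columns=['Name', 'Fruit'])

# One column per list position; shorter lists are padded with NaN.
wide = df['Fruit'].apply(pd.Series)

# Stack the columns into one Series indexed by (row label, list position).
# dropna() removes the padding if this pandas version did not already drop it.
stacked = wide.stack().dropna()

# Drop the inner "list position" level so only the original row label remains.
fruits = stacked.reset_index(1, drop=True)

print(fruits)
```

Row label 2 appears twice in fruits because Larry's list has two items; that repeated label is what lets the later join line each fruit up with the right Name.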

Step 2:

You'll notice that pd.Series(*Step 1 here*, name='Fruit1') is wrapped around the Step 1 code above. Next we will join this Series to the existing DataFrame, and join requires the Series to have a name.

Step 3:

df.join(* steps 1 and 2 code *).drop('Fruit', axis=1).rename(columns={'Fruit1': 'Fruit'})

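Because we now have a named pd.Series (Fruit1), we can join it to the original df, giving three columns: Name, Fruit, and Fruit1. We then call drop to remove the original Fruit column. That leaves two columns, Name and Fruit1, but we want the second one to be called Fruit, so we rename it with rename.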
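
As an illustration of Steps 2 and 3, the sketch below runs the join, drop, and rename as separate statements. The names fruits, joined, and tidy are hypothetical, and the dropna() call is an assumption added so the sketch behaves the same whether or not the installed pandas drops NaN when stacking.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Curly', 'Moe', 'Larry'],
                   'Fruit': [['Apple'], ['Orange'], ['Apple', 'Banana']]},
                  columns=['Name', 'Fruit'])

# Steps 1 and 2: one fruit per row, wrapped in a Series named 'Fruit1'.
fruits = pd.Series(
    df['Fruit'].apply(pd.Series).stack().dropna().reset_index(1, drop=True),
    name='Fruit1')

# Step 3a: join on the index; row 2 is duplicated, once per fruit.
joined = df.join(fruits)

# Step 3b: drop the list column and give the new column the old name.
tidy = joined.drop('Fruit', axis=1).rename(columns={'Fruit1': 'Fruit'})

print(tidy)
```

The frame now has one row per (Name, fruit) pair, which is the shape get_dummies can work with.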

Step 4:

pd.get_dummies(* steps 1, 2, and 3 here*, columns=['Fruit'])

Here we finally call get_dummies, passing columns=['Fruit'] to tell it to create dummy variables only for the Fruit column.

    Name  Fruit_Apple  Fruit_Banana  Fruit_Orange
0  Curly          1.0           0.0           0.0
1    Moe          0.0           0.0           1.0
2  Larry          1.0           0.0           0.0
2  Larry          0.0           1.0           0.0
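
The output above shows floats, but the default dtype of the indicator columns depends on the pandas version. As a small sketch, the long-format frame is written out by hand here (the name tidy is mine), and dtype=int is passed to request plain 0/1 integers.

```python
import pandas as pd

# The long-format frame produced by steps 1 to 3, written out by hand.
tidy = pd.DataFrame({'Name': ['Curly', 'Moe', 'Larry', 'Larry'],
                     'Fruit': ['Apple', 'Orange', 'Apple', 'Banana']},
                    index=[0, 1, 2, 2])

# Only the Fruit column is encoded; Name passes through unchanged.
# dtype=int asks for 0/1 integers instead of the version-dependent default.
dummies = pd.get_dummies(tidy, columns=['Fruit'], dtype=int)

print(dummies)
```

Larry still occupies two rows at this point, one per fruit, which is why a final aggregation step is needed.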

Step 5:

dummies_df = (*steps 1, 2, 3, and 4*).groupby('Name', as_index=False).sum()

Finally, you group by the Name column with groupby, passing as_index=False so that Name is not set as the index, and then add up each group's rows with .sum().

Final result:

    Name  Fruit_Apple  Fruit_Banana  Fruit_Orange
0  Curly          1.0           0.0           0.0
1  Larry          1.0           1.0           0.0
2    Moe          0.0           0.0           1.0
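
For comparison, here is a shorter sketch of the same idea. It is not the approach from the answer above: it swaps the apply/stack step for Series.explode, which assumes pandas 0.25 or later, and it aggregates by the original row index rather than by Name. The names exploded, indicators, and result are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Curly', 'Moe', 'Larry'],
                   'Fruit': [['Apple'], ['Orange'], ['Apple', 'Banana']]},
                  columns=['Name', 'Fruit'])

# explode() gives one row per list item, repeating the original index label.
exploded = df['Fruit'].explode()

# One indicator column per fruit; dtype=int keeps 0/1 instead of booleans.
indicators = pd.get_dummies(exploded, prefix='Fruit', dtype=int)

# Collapse the repeated index labels back to one row per original row.
indicators = indicators.groupby(level=0).sum()

# Attach the indicators to the Name column by index.
result = df[['Name']].join(indicators)

print(result)
```

Grouping on the index instead of on Name keeps the rows in their original order and avoids merging two different rows that happen to share the same Name.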
