Efficient way to loop over Pandas Dataframe to make dummy variables (1 or 0 input)

Question

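I am learning data science and want to create dummy variables for my dataset.

I have a dataframe with a "Product Category" column. Each value in that column is a list of matching categories, which looks like ["Category1", "Category2", ..., "CategoryN"].

I know Pandas has a nice function that generates dummy variables automatically (pandas.get_dummies), but I guess I can't use it in this case (?).

I know how to loop over each row and set a 1 in every column that matches an element of the list. My current code looks like this: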

# The first column is "Product Category"; the dummy columns (one per category name) were appended to its right beforehand.
for column_name in df.columns[1:]:
    for index, _ in enumerate(df[column_name][:10]):  # limit to the first 10 rows
        if column_name in df["Product Category"][index]:
            # .loc avoids chained assignment; this assumes the default 0..n-1 index
            df.loc[index, column_name] = 1

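However, the code above is inefficient, and I can't use it because I have more than 100,000 rows. I would like to operate on the whole array at once, but I don't know how to do that.

Can anyone help?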

Answer 1

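With get_dummies(), you can specify which columns to convert to dummy variables. Consider the following example, where several items can share the same category but each item falls into only one dummy variable: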

import pandas as pd

df = pd.DataFrame({'Languages':  ['R', 'Python', 'C#', 'PHP', 'Java', 'XSLT', 'SQL'],
                   'ProductCategory':  ['Statistical', 'General Purpose', 
                                        'General Purpose', 'Web', 'General Purpose', 
                                        'Special Purpose', 'Special Purpose']})
# BEFORE
print(df)

#    Languages  ProductCategory
# 0          R      Statistical
# 1     Python  General Purpose
# 2         C#  General Purpose
# 3        PHP              Web
# 4       Java  General Purpose
# 5       XSLT  Special Purpose
# 6        SQL  Special Purpose

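# dtype=int keeps the 1/0 output shown below; recent pandas versions return True/False by default
newdf = pd.get_dummies(df, columns=['ProductCategory'], prefix=['Categ'], dtype=int)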
# AFTER
print(newdf)

#    Languages  Categ_General Purpose  Categ_Special Purpose  Categ_Statistical  Categ_Web
# 0         R                      0                      0                  1          0
# 1    Python                      1                      0                  0          0
# 2        C#                      1                      0                  0          0
# 3       PHP                      0                      0                  0          1
# 4      Java                      1                      0                  0          0
# 5      XSLT                      0                      1                  0          0
# 6       SQL                      0                      1                  0          0
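
If each "Product Category" cell holds a list of categories rather than a single label, get_dummies on that column is not expected to work directly. One possible workaround, sketched here under the assumption that no category name contains the "|" character, is to join each list into a delimited string and then call Series.str.get_dummies. The sample data below is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: each cell is a list of category names.
df = pd.DataFrame({"Product Category": [["Category1", "Category2"],
                                        ["Category3"],
                                        ["Category1", "Category4"]]})

# Join each list into one "|"-separated string, then split it back into
# one 0/1 column per distinct category.
dummies = df["Product Category"].str.join("|").str.get_dummies(sep="|")

# Attach the dummy columns to the original frame.
df_new = df.join(dummies)
print(df_new)
```

The separator is arbitrary; any character that never appears in a category name should do.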

Answer 2

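I assume your problem is that each row can have several dummies set, so the values of "Product Category" are lists of categories. Maybe this will work, though I'm not sure how memory-efficient it is.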

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"Product Category": [['Category1', 'Category2'],
   ...:                                         ['Category3'],
   ...:                                         ['Category1', 'Category4'],
   ...:                                         ['Category1', 'Category3', 'Category5']]})

In [3]: df
Out[3]:
                    Product Category
0             [Category1, Category2]
1                        [Category3]
2             [Category1, Category4]
3  [Category1, Category3, Category5]

In [4]: def list_to_dict(category_list):
   ...:         n_categories = len(category_list)
   ...:         return dict(zip(category_list, [1]*n_categories))
   ...:

In [5]: df_dummies = pd.DataFrame(list(df['Product Category'].apply(list_to_dict).values)).fillna(0).astype(int)

In [6]: df_new = df.join(df_dummies)

In [7]: df_new
Out[7]:
                    Product Category  Category1  Category2  Category3  Category4  Category5
0             [Category1, Category2]          1          1          0          0          0
1                        [Category3]          0          0          1          0          0
2             [Category1, Category4]          1          0          0          1          0
3  [Category1, Category3, Category5]          1          0          1          0          1
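
For a larger frame (the question mentions more than 100,000 rows), another option that avoids calling a Python function per row is to flatten the lists with Series.explode and let pd.get_dummies do the encoding. This is only a sketch: it assumes pandas 0.25 or later (for explode) and a unique index, and it has not been benchmarked here.

```python
import pandas as pd

df = pd.DataFrame({"Product Category": [["Category1", "Category2"],
                                        ["Category3"],
                                        ["Category1", "Category4"],
                                        ["Category1", "Category3", "Category5"]]})

# One row per (original row, category) pair; the original index is repeated.
exploded = df["Product Category"].explode()

# Encode each category as a 0/1 column, then collapse the repeated index
# back to one row per original row (max acts as a logical OR here).
dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()

df_new = df.join(dummies)
print(df_new)
```

Rows whose list is empty should come out as all zeros, since explode turns them into a missing value that get_dummies ignores by default.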
