Feature engineering using Python

Question

I have a pandas dataset in which one column looks like this:

         Genre
        ------------
         Documentary
         Documentary
         Comedy|Mystery|Thriller
         Animation|Comedy|Family
         Documentary
         Documentary|Family
         Action|Adventure|Fantasy|Sci-Fi
         Crime|Drama|Mystery
         Action|Crime|Mystery|Thriller

How can I create one column per genre name, filled with 1 if the row contains that genre and 0 otherwise?

Expected output (a pandas DataFrame):

  Documentary  Comedy  Mystery  Thriller  Animation  Family  ......
            1       0        0         0          0       0
            1       0        0         0          0       0
            0       1        1         1          0       0

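and so on.

I tried converting the column to a list first and then splitting it, but that is not a Pythonic approach.

Can this be done efficiently with apply or some other technique?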

Answer 1

Use Series.explode + pd.get_dummies:

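import pandas as pd

# Split each Genre string on '|' and give every genre its own row (index labels repeat)
s_explode = df['Genre'].str.split('|').explode()
# One-hot encode, then sum per original index label to get back one row per record
dfc = pd.get_dummies(s_explode).groupby(level=0).sum()
new_df = pd.concat([df['Genre'], dfc], axis=1)
print(new_df)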

                              Genre  Action  Adventure  Animation  Comedy  \
0                      Documentary       0          0          0       0   
1                      Documentary       0          0          0       0   
2          Comedy|Mystery|Thriller       0          0          0       1   
3          Animation|Comedy|Family       0          0          1       1   
4                      Documentary       0          0          0       0   
5               Documentary|Family       0          0          0       0   
6  Action|Adventure|Fantasy|Sci-Fi       1          1          0       0   
7              Crime|Drama|Mystery       0          0          0       0   
8    Action|Crime|Mystery|Thriller       1          0          0       0   

   Crime  Documentary  Drama  Family  Fantasy  Mystery  Sci-Fi  Thriller  
0      0            1      0       0        0        0       0         0  
1      0            1      0       0        0        0       0         0  
2      0            0      0       0        0        1       0         1  
3      0            0      0       1        0        0       0         0  
4      0            1      0       0        0        0       0         0  
5      0            1      0       1        0        0       0         0  
6      0            0      0       0        1        0       1         0  
7      1            0      1       0        0        1       0         0  
8      1            0      0       0        0        1       0         1
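
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same explode + get_dummies idea. The DataFrame construction is my assumption about how df was built from the question's sample, and the names exploded, indicators and result are only illustrative. It also makes two small substitutions that are not in the answer above: dtype=int so the dummy columns are integers rather than booleans on newer pandas versions, and max() instead of sum() so a genre accidentally listed twice in one row would still yield 1.

```python
import pandas as pd

# Sample data copied from the question (construction is an assumption).
df = pd.DataFrame({
    "Genre": [
        "Documentary",
        "Documentary",
        "Comedy|Mystery|Thriller",
        "Animation|Comedy|Family",
        "Documentary",
        "Documentary|Family",
        "Action|Adventure|Fantasy|Sci-Fi",
        "Crime|Drama|Mystery",
        "Action|Crime|Mystery|Thriller",
    ]
})

# One row per (original row, genre); the original index label is repeated.
exploded = df["Genre"].str.split("|").explode()

# One indicator column per genre, then collapse back to one row per index label.
# max() (a substitution for sum()) keeps values at 0/1 even with repeated genres.
indicators = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()

# Put the indicator columns next to the original Genre column.
result = pd.concat([df["Genre"], indicators], axis=1)
print(result)
```

The groupby(level=0) step is what undoes the explode: without it there would be one row per genre rather than one row per film.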

Answer 2

With str.get_dummies, which is simple and direct:

df1 = df.Genre.str.get_dummies('|')

Out[385]:
   Action  Adventure  Animation  Comedy  Crime  Documentary  Drama  Family  \
0       0          0          0       0      0            1      0       0
1       0          0          0       0      0            1      0       0
2       0          0          0       1      0            0      0       0
3       0          0          1       1      0            0      0       1
4       0          0          0       0      0            1      0       0
5       0          0          0       0      0            1      0       1
6       1          1          0       0      0            0      0       0
7       0          0          0       0      1            0      1       0
8       1          0          0       0      1            0      0       0

   Fantasy  Mystery  Sci-Fi  Thriller
0        0        0       0         0
1        0        0       0         0
2        0        1       0         1
3        0        0       0         0
4        0        0       0         0
5        0        0       0         0
6        1        0       1         0
7        0        1       0         0
8        0        1       0         1
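
If the indicator columns should sit next to the original column, as in the question's expected output, one option is to join them back onto df. The sketch below is only an illustration: the three-row DataFrame is a shortened, assumed version of the question's data, and new_df is a name I introduced.

```python
import pandas as pd

# Shortened sample in the same format as the question (an assumption for illustration).
df = pd.DataFrame({
    "Genre": ["Documentary", "Comedy|Mystery|Thriller", "Documentary|Family"]
})

# Split on '|' and build one 0/1 column per distinct genre in a single call.
df1 = df["Genre"].str.get_dummies(sep="|")

# join aligns on the index, so each row keeps its own indicators.
new_df = df.join(df1)
print(new_df)
```

Because str.get_dummies does the splitting itself, no explode or groupby step is needed here.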
