Feature engineering using Python
I have a pandas dataset in which one column looks like this:
Genre
------------
Documentary
Documentary
Comedy|Mystery|Thriller
Animation|Comedy|Family
Documentary
Documentary|Family
Action|Adventure|Fantasy|Sci-Fi
Crime|Drama|Mystery
Action|Crime|Mystery|Thriller
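How can I create one column per genre name, filled with 1 if the row contains that genre and 0 otherwise?
Expected output: a pandas DataFrame
Documentary  Comedy  Mystery  Thriller  Animation  Family  ......
          1       0        0         0          0       0
          1       0        0         0          0       0
          0       1        1         1          0       0
and so on.
I tried converting the column to a list first and then splitting it, but that is not a Pythonic approach.
Can this be done efficiently with the apply function or some other technique?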
Use Series.explode + pd.get_dummies:
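import pandas as pd

# Split on '|' and give each genre its own row (the original index is repeated)
s_explode = df['Genre'].str.split('|').explode()
# One-hot encode, then sum per original index to get one row per film again
dfc = pd.get_dummies(s_explode).groupby(level=0).sum()
new_df = pd.concat([df['Genre'], dfc], axis=1)
print(new_df)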
                             Genre  Action  Adventure  Animation  Comedy  \
0                      Documentary       0          0          0       0
1                      Documentary       0          0          0       0
2          Comedy|Mystery|Thriller       0          0          0       1
3          Animation|Comedy|Family       0          0          1       1
4                      Documentary       0          0          0       0
5               Documentary|Family       0          0          0       0
6  Action|Adventure|Fantasy|Sci-Fi       1          1          0       0
7              Crime|Drama|Mystery       0          0          0       0
8    Action|Crime|Mystery|Thriller       1          0          0       0

   Crime  Documentary  Drama  Family  Fantasy  Mystery  Sci-Fi  Thriller
0      0            1      0       0        0        0       0         0
1      0            1      0       0        0        0       0         0
2      0            0      0       0        0        1       0         1
3      0            0      0       1        0        0       0         0
4      0            1      0       0        0        0       0         0
5      0            1      0       1        0        0       0         0
6      0            0      0       0        1        0       1         0
7      1            0      1       0        0        1       0         0
8      1            0      0       0        0        1       0         1
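For anyone who wants to run the explode approach end to end, here is a minimal, self-contained sketch. The DataFrame is rebuilt from the sample rows in the question; passing `dtype=int` is an addition of mine, not part of the answer above, so that the indicators stay 0/1 on pandas versions where `pd.get_dummies` defaults to booleans.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"Genre": [
    "Documentary",
    "Documentary",
    "Comedy|Mystery|Thriller",
    "Animation|Comedy|Family",
    "Documentary",
    "Documentary|Family",
    "Action|Adventure|Fantasy|Sci-Fi",
    "Crime|Drama|Mystery",
    "Action|Crime|Mystery|Thriller",
]})

# Split each string into a list, then give every genre its own row;
# explode() repeats the original index for each list element.
s_explode = df["Genre"].str.split("|").explode()

# dtype=int (my assumption, see above) keeps the indicators as integers.
# Summing per original index folds the rows back into one row per film.
dfc = pd.get_dummies(s_explode, dtype=int).groupby(level=0).sum()

# Put the original column next to the indicator columns
new_df = pd.concat([df["Genre"], dfc], axis=1)
print(new_df)
```

Note that `.sum()` counts occurrences, so a genre listed twice in one row would produce 2 rather than 1; swapping in `.max()` should give a strict 0/1 indicator if that matters for your data.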
Or with str.get_dummies, which is direct and simple:
df1 = df['Genre'].str.get_dummies('|')
print(df1)
   Action  Adventure  Animation  Comedy  Crime  Documentary  Drama  Family  \
0       0          0          0       0      0            1      0       0
1       0          0          0       0      0            1      0       0
2       0          0          0       1      0            0      0       0
3       0          0          1       1      0            0      0       1
4       0          0          0       0      0            1      0       0
5       0          0          0       0      0            1      0       1
6       1          1          0       0      0            0      0       0
7       0          0          0       0      1            0      1       0
8       1          0          0       0      1            0      0       0

   Fantasy  Mystery  Sci-Fi  Thriller
0        0        0       0         0
1        0        0       0         0
2        0        1       0         1
3        0        0       0         0
4        0        0       0         0
5        0        0       0         0
6        1        0       1         0
7        0        1       0         0
8        0        1       0         1
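As a rough, runnable sketch of the str.get_dummies route: the snippet below rebuilds the sample data from the question and attaches the indicator columns to the original frame with `DataFrame.join`. The `dup` series at the end is a hypothetical edge case I made up to show one difference between the two answers; it does not come from the question.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"Genre": [
    "Documentary",
    "Documentary",
    "Comedy|Mystery|Thriller",
    "Animation|Comedy|Family",
    "Documentary",
    "Documentary|Family",
    "Action|Adventure|Fantasy|Sci-Fi",
    "Crime|Drama|Mystery",
    "Action|Crime|Mystery|Thriller",
]})

# One indicator column per distinct genre, split on the '|' separator
dummies = df["Genre"].str.get_dummies(sep="|")

# Keep the original column alongside the indicators
result = df.join(dummies)
print(result)

# Hypothetical edge case: the same genre listed twice in one row
dup = pd.Series(["Comedy|Comedy"])
as_indicator = dup.str.get_dummies(sep="|")  # presence only
as_count = (
    pd.get_dummies(dup.str.split("|").explode(), dtype=int)
    .groupby(level=0)
    .sum()                                   # number of occurrences
)
```

On clean data like the sample the two approaches should agree. With repeated genres, `str.get_dummies` reports presence while the explode-and-sum route reports a count, so pick whichever matches the feature you want.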