Python Pandas: convert one column of string combinations into multiple columns of categorical data
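I am working on a project that analyzes weather data.
Below is an abbreviated version of my CSV file (only the last column, "Conditions", matters here):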
Year,Month,Day,Hour,DOW,Maximum Temperature,Minimum Temperature,Temperature,Precipitation,Snow,SnowDepth,Wind Speed,Visibility,Cloud Cover,Relative Humidity,Conditions
2020,3,5,8,3,48.0,48.0,48.0,0.0,0.0,0.0,10.3,9.9,0.0,81.44,Clear
2020,3,5,10,3,56.9,56.9,56.9,0.0,0.0,0.0,6.3,9.9,25.1,55.29,Partially cloudy
2020,3,9,8,0,60.7,60.7,60.7,0.0,0.0,0.0,14.5,8.1,79.6,91.95,Overcast
2020,3,9,10,0,62.5,62.5,62.5,0.01,0.0,0.0,16.0,7.0,94.7,89.95,"Rain, Overcast"
2020,3,17,20,1,66.4,66.4,66.4,0.02,0.0,0.0,8.7,4.3,68.6,88.78,"Rain, Partially cloudy"
I want to convert it into this:
Clear,Partially cloudy,Rain,Overcast
1,0,0,0
0,1,0,0
0,0,0,1
0,0,1,1
0,1,1,0
I saw that I could use the code below, but I don't know how to handle rows where a single value contains two categories.
dataset['Conditions'] = dataset['Conditions'].map({1: 'Clear', 2: 'Partially cloudy', 3: 'Rain', 4: 'Snow'})
dataset = pd.get_dummies(dataset, columns=['Conditions'], prefix='', prefix_sep='')
Thanks in advance :)
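You can use pd.get_dummies:
# df is the DataFrame loaded from the CSV (called dataset in the question)
result = (
    pd.get_dummies(
        df.Conditions.str.split(', ', expand=True)
        .stack())
    # .sum(level=0) was removed in pandas 2.0; group on the outer index level instead
    .groupby(level=0).sum()
)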
Output:
   Clear  Overcast  Partially cloudy  Rain
0      1         0                 0     0
1      0         0                 1     0
2      0         1                 0     0
3      0         1                 0     1
4      0         0                 1     1
Try str.split + explode, then sum over index level 0:
dummies = pd.get_dummies(
    dataset['Conditions'].str.split(', ').explode()
).groupby(level=0).sum()  # replaces .sum(level=0), which was removed in pandas 2.0
print(dummies)
dummies:
   Clear  Overcast  Partially cloudy  Rain
0      1         0                 0     0
1      0         0                 1     0
2      0         1                 0     0
3      0         1                 0     1
4      0         0                 1     1
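To make the split / explode / sum chain easier to follow, here is a minimal sketch of the intermediate steps. The two-row `conditions` Series is a made-up stand-in for the question's "Conditions" column, and `dtype=int` is passed only so the result shows 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical two-row sample mirroring the "Conditions" column in the question.
conditions = pd.Series(["Clear", "Rain, Overcast"], name="Conditions")

# Step 1: split each string into a list of labels.
as_lists = conditions.str.split(", ")

# Step 2: explode yields one label per row and repeats the original index,
# so both labels from row 1 keep index 1.
exploded = as_lists.explode()

# Step 3: one-hot encode every label, then add up the rows that share an index.
dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).sum()

print(exploded.index.tolist())
print(dummies)
```

Because explode keeps the original row labels, grouping on index level 0 is what folds the per-label rows back into one row per observation.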
To join back to the original DataFrame:
dummies = pd.get_dummies(
    dataset['Conditions'].str.split(', ').explode()
).groupby(level=0).sum()  # replaces .sum(level=0), which was removed in pandas 2.0
# Join back to dataset
dataset = dataset.drop(columns='Conditions').join(dummies)
print(dataset.to_string())
   Year  Month  Day  Hour  ...  Clear  Overcast  Partially cloudy  Rain
0  2020      3    5     8  ...      1         0                 0     0
1  2020      3    5    10  ...      0         0                 1     0
2  2020      3    9     8  ...      0         1                 0     0
3  2020      3    9    10  ...      0         1                 0     1
4  2020      3   17    20  ...      0         0                 1     1
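The indicator columns come out in alphabetical order, while the question asks for Clear, Partially cloudy, Rain, Overcast. One possible way to fix the order, and to reserve a column for a category that never appears in the sample, is `reindex`. This is only a sketch: the `wanted` list is an assumption (the "Snow" entry is borrowed from the mapping in the question), and `conditions` is a stand-in Series rather than the real CSV.

```python
import pandas as pd

# Hypothetical stand-in for the "Conditions" column from the question.
conditions = pd.Series(
    ["Clear", "Partially cloudy", "Overcast",
     "Rain, Overcast", "Rain, Partially cloudy"],
    name="Conditions",
)

dummies = (
    pd.get_dummies(conditions.str.split(", ").explode(), dtype=int)
    .groupby(level=0)
    .sum()
)

# Assumed target order; "Snow" never occurs in the sample, so it is filled with 0.
wanted = ["Clear", "Partially cloudy", "Rain", "Overcast", "Snow"]
dummies = dummies.reindex(columns=wanted, fill_value=0)

print(dummies)
```

Fixing the column list up front can also help when different files contain different subsets of conditions, since every result then has the same shape.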
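import pandas as pd
# Small example frame with a string column "s"
xx = pd.DataFrame([[1, 2, "ss"], [2, 3, "cc"], [4, 2, "d"]], columns=["v1", "v2", "s"])
# One indicator column per distinct string in "s" (values are split on "|" by default)
xx["s"].str.get_dummies()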
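Applied to the weather data, the same string accessor can probably do the whole job in one call, because `Series.str.get_dummies` accepts a `sep` argument. A minimal sketch, using a made-up three-row `conditions` Series in place of the real column:

```python
import pandas as pd

# Hypothetical stand-in for the "Conditions" column from the question.
conditions = pd.Series(
    ["Clear", "Rain, Overcast", "Rain, Partially cloudy"],
    name="Conditions",
)

# sep must match the delimiter exactly, including the space after the comma;
# with sep="," the labels would keep a leading space.
dummies = conditions.str.get_dummies(sep=", ")

print(dummies)
```

The result has one 0/1 column per distinct label, so it can be joined back with `dataset.drop(columns='Conditions').join(dummies)` in the same way as in the answer above.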