PANDAS - converting a column with lists as values to dummy variables
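I'm working with a dataset of Airbnb listings. One of the columns is called amenities and contains all the amenities the listing has to offer.
A few examples: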
[Internet, Wifi, Paid parking off premises]
[Internet, Wifi, Kitchen]
[Wifi, Smoking allowed, Heating]
I would like to replace this column with multiple binary columns, one per amenity.
For example, one of them would be:
wifi --> 0,0,0,1,1,0,1,1,0,1,0,1
I found a way to do this using for loops:
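# Collect every distinct amenity across all listings
all_amenities = []
for row in amenities:
    all_amenities += row
all_amenities = set(all_amenities)

# Start with one all-zero column per amenity
for col in all_amenities:
    df[col] = 0

# Flag each amenity a listing has (slow: one .loc write per cell)
for i, amenities_of_listing in enumerate(amenities):
    for amenity in amenities_of_listing:
        df.loc[i, amenity] = 1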
But this takes a very long time to run. Can anyone here think of a more efficient way to do this?
I believe you need MultiLabelBinarizer, which works well even when the DataFrame is large:
print(df)
                                   amenisities
0  [Internet, Wifi, Paid parking off premises]
1                    [Internet, Wifi, Kitchen]
2             [Wifi, Smoking allowed, Heating]
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 matrix; mlb.classes_ holds the sorted amenity names
df1 = pd.DataFrame(mlb.fit_transform(df['amenisities']), columns=mlb.classes_)
print(df1)
   Heating  Internet  Kitchen  Paid parking off premises  Smoking allowed  \
0        0         1        0                          1                0
1        0         1        1                          0                0
2        1         0        0                          0                1

   Wifi
0     1
1     1
2     1
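To actually replace the list column, as the question asks, the binarized frame can be joined back onto df. The following is only a minimal sketch: the column name amenities and the toy data are assumptions, and every cell is assumed to hold a Python list. Passing index=df.index keeps the rows aligned when the index is not the default 0..n-1.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data mirroring the examples in the question (column name is assumed)
df = pd.DataFrame(
    {"amenities": [
        ["Internet", "Wifi", "Paid parking off premises"],
        ["Internet", "Wifi", "Kitchen"],
        ["Wifi", "Smoking allowed", "Heating"],
    ]},
    index=[10, 11, 12],  # deliberately not the default RangeIndex
)

mlb = MultiLabelBinarizer()
dummies = pd.DataFrame(
    mlb.fit_transform(df["amenities"]),
    columns=mlb.classes_,
    index=df.index,  # align the 0/1 rows with the original rows
)

# Drop the list column and attach one 0/1 column per amenity
result = df.drop(columns="amenities").join(dummies)
print(result)
```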
IIUC, you could also try pd.get_dummies() or series.str.get_dummies():
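# .max(level=0) was removed in pandas 2.0; groupby(level=0).max() is the equivalent
pd.get_dummies(s.explode(), dtype=int).groupby(level=0).max()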
Or:
s.str.join('|').str.get_dummies()
Replace s with df['column_name'].
   Heating  Internet  Kitchen  Paid parking off premises  Smoking allowed  \
0        0         1        0                          1                0
1        0         1        1                          0                0
2        1         0        0                          0                1

   Wifi
0     1
1     1
2     1
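For anyone who wants to try both pandas-only variants side by side, here is a small self-contained sketch. The toy Series is an assumption based on the question's examples. One caveat on the str.join('|') route: it splits on '|', so it would misbehave if an amenity name itself contained that character.

```python
import pandas as pd

# Toy Series of lists, standing in for df['column_name']
s = pd.Series([
    ["Internet", "Wifi", "Paid parking off premises"],
    ["Internet", "Wifi", "Kitchen"],
    ["Wifi", "Smoking allowed", "Heating"],
])

# Option 1: one row per (listing, amenity), dummy-encode,
# then collapse back to one row per listing via the repeated index
via_explode = pd.get_dummies(s.explode(), dtype=int).groupby(level=0).max()

# Option 2: join each list into "a|b|c"; str.get_dummies splits on "|" by default
via_str = s.str.join("|").str.get_dummies()

print(via_explode)
print(via_str)
```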