PANDAS - converting a column with lists as values to dummy variables

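I'm working with a dataset of Airbnb listings. One of the columns is called amenities and holds a list of all the amenities each listing offers. A few examples: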

[Internet, Wifi, Paid parking off premises]

[Internet, Wifi, Kitchen]

[Wifi, Smoking allowed, Heating]

I want to replace this column with multiple binary columns, one per amenity. For example, one of them would be:

wifi --> 0,0,0,1,1,0,1,1,0,1,0,1 

I found a way to do this with for loops:

amenities = df['amenities']  # assumed: the column of amenity lists described above
all_amenities = []
for row in amenities:
    all_amenities += row

all_amenities = set(all_amenities)
for col in all_amenities:
    df[col] = 0

for i,amenities_of_listing in enumerate(amenities):
    for amenity in amenities_of_listing:
        df.loc[i,amenity] = 1

But this takes a very long time to run. Can anyone here think of a more efficient way to do it?

I believe you need MultiLabelBinarizer, which works well even when the DataFrame is large:

print(df)
                                     amenities
0  [Internet, Wifi, Paid parking off premises]
1                    [Internet, Wifi, Kitchen]
2             [Wifi, Smoking allowed, Heating]

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 array; mlb.classes_ holds the sorted amenity names
df1 = pd.DataFrame(mlb.fit_transform(df['amenities']), columns=mlb.classes_)
print(df1)
   Heating  Internet  Kitchen  Paid parking off premises  Smoking allowed  \
0        0         1        0                          1                0   
1        0         1        1                          0                0   
2        1         0        0                          0                1   

   Wifi  
0     1  
1     1  
2     1 
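
Since the question asks to replace the list column rather than build a separate frame, here is a minimal sketch of attaching the binarized columns back to the original DataFrame. The toy data, the extra price column, and the non-default index are assumptions made purely for illustration; the column name amenities follows the question.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data mirroring the three example rows from the question.
df = pd.DataFrame({
    'price': [80, 120, 95],  # hypothetical extra column, for illustration only
    'amenities': [
        ['Internet', 'Wifi', 'Paid parking off premises'],
        ['Internet', 'Wifi', 'Kitchen'],
        ['Wifi', 'Smoking allowed', 'Heating'],
    ],
}, index=[10, 11, 12])  # non-default index, to show why index=df.index matters

mlb = MultiLabelBinarizer()
dummies = pd.DataFrame(
    mlb.fit_transform(df['amenities']),
    columns=mlb.classes_,
    index=df.index,  # reuse the original row labels so the join lines up
)

# Drop the list column and attach one 0/1 column per amenity.
result = df.drop(columns='amenities').join(dummies)
print(result)
```

Passing index=df.index is the design choice worth noting: without it the new frame gets a fresh 0..n-1 index, and the join would misalign on any DataFrame that has been filtered or re-indexed.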

IIUC, you can also try pd.get_dummies() or Series.str.get_dummies():

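# .max(level=0) was removed in pandas 2.0; groupby(level=0).max() is the equivalent.
# dtype=int keeps the 0/1 output (pandas 2.0+ returns booleans by default).
pd.get_dummies(s.explode(), dtype=int).groupby(level=0).max()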

Or:

s.str.join('|').str.get_dummies()

Replace s with df['column_name']. Either way, the result is:

   Heating  Internet  Kitchen  Paid parking off premises  Smoking allowed  \
0        0         1        0                          1                0   
1        0         1        1                          0                0   
2        1         0        0                          0                1   

   Wifi  
0     1  
1     1  
2     1
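
For anyone who wants to run both pandas-only variants end to end, here is a small self-contained sketch on the three example rows. The Series s is built by hand here as an assumption; in practice it would be df['column_name']. The first variant uses groupby(level=0).max(), the form that works on current pandas versions.

```python
import pandas as pd

# Hand-built stand-in for df['column_name'], using the rows from the question.
s = pd.Series([
    ['Internet', 'Wifi', 'Paid parking off premises'],
    ['Internet', 'Wifi', 'Kitchen'],
    ['Wifi', 'Smoking allowed', 'Heating'],
])

# Variant 1: one row per amenity, one-hot encode, then collapse back to one row per listing.
via_explode = pd.get_dummies(s.explode(), dtype=int).groupby(level=0).max()

# Variant 2: join each list into a '|'-separated string and split it into dummy columns.
via_str = s.str.join('|').str.get_dummies()

# Check that both variants produce the same columns and the same 0/1 values.
same = bool(
    list(via_explode.columns) == list(via_str.columns)
    and (via_explode.to_numpy() == via_str.to_numpy()).all()
)
print(same)
```

One caveat on the second variant: it relies on '|' as the separator, so it would split incorrectly if an amenity name itself contained that character.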