Python: One-Hot Encode a Multi-Valued Variable
Here is my DataFrame:
Name  Job
A     Back-end Engineer
B     Front-end Engineer;Product Manager
C     Product Manager;Business Development;System Analyst
I want to convert this DataFrame into dummy (one-hot encoded) columns like this:
Name  Back-end Engineer  Business Development  Front-end Engineer  Product Manager  System Analyst
A     1                  0                     0                   0                0
B     0                  0                     1                   1                0
C     0                  1                     0                   1                1
I tried pandas.get_dummies, but it fails because each cell in the Job column can hold several values separated by semicolons.
You can try something like this:
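import pandas as pd
from collections import defaultdict

df = pd.read_csv("path/to/your.csv")

# Collect every distinct job title across all rows.
job_list = set()
for jobs in df["Job"]:
    for job_name in jobs.split(";"):
        job_list.add(job_name.strip())

# Build one 0/1 column per job title.
new_df = defaultdict(list)
for index, row in df.iterrows():
    new_df["Name"].append(row["Name"])
    # Split the cell so titles are matched exactly, not as substrings.
    row_jobs = {job_name.strip() for job_name in row["Job"].split(";")}
    for job in sorted(job_list):
        new_df[job].append(1 if job in row_jobs else 0)

new_df = pd.DataFrame.from_dict(new_df)
# index=False keeps the row index out of the output file.
new_df.to_csv("/path/to/new.csv", index=False)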
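As a shorter alternative to the explicit loops, pandas also provides Series.str.get_dummies, which splits each string on a separator and returns one indicator column per distinct token. The sketch below is only an illustration: it rebuilds the sample data from the question inline instead of reading a CSV, and the variable names dummies and encoded are made up for this example.

```python
import pandas as pd

# Sample data copied from the question (assumed here instead of a CSV file).
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Job": [
        "Back-end Engineer",
        "Front-end Engineer;Product Manager",
        "Product Manager;Business Development;System Analyst",
    ],
})

# Split each cell on ";" and create one 0/1 column per distinct job title.
dummies = df["Job"].str.get_dummies(sep=";")

# Put the Name column back in front of the indicator columns.
encoded = pd.concat([df[["Name"]], dummies], axis=1)
print(encoded)
```

If the real data has spaces around the semicolons, the titles would probably need to be stripped first, since the split is done on the separator exactly as given.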