How to combine 2 dataframes, add rows for employees that appear only in the second dataframe but not in the first, and use groupby to get the sum?
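I want to combine 2 dataframes. I have tried several approaches but am not sure how to get to the final dataframe. I'd appreciate any advice on how to do this.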
import pandas as pd

data_list_1 = [['Employee', 'Course Name', 'Status'],
['Abel', "Course_A", "Completed"],
['Bain', "Course_A", "Incomplete"]]
data_list_2 = [['Employee', 'Course Name', 'Lesson Name', 'Lesson Score', 'Status'],
['Abel', 'Course_B', 'Lesson_1', 100, ""],
['Abel', 'Course_B', 'Lesson_2', 100, ""],
['Abel', 'Course_B', 'Lesson_3', 100, ""],
['Abel', 'Course_B', 'Lesson_4', 100, ""],
['Bain', 'Course_B', 'Lesson_1', 100, ""],
['Bain', 'Course_B', 'Lesson_2', 100, ""],
['Coot', 'Course_B', 'Lesson_1', 100, ""],
['Coot', 'Course_B', 'Lesson_2', 100, ""],
['Coot', 'Course_B', 'Lesson_3', 100, ""],
['Coot', 'Course_B', 'Lesson_4', 100, ""],
['Coot', 'Course_B', 'Lesson_5', 100, ""]]
Course_A_df = pd.DataFrame(data_list_1[1:], columns = data_list_1[0])
Course_B_df = pd.DataFrame(data_list_2[1:], columns = data_list_2[0])
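I want the following dataframe so that I can use it in Tableau for visualization. Basically, the final df should also contain the None values, and for Course_B the status should be Completed only if all 5 lesson scores are 100.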
to_achieved = [['Employee', 'Course Name', 'Lesson Name', 'Lesson Score', 'Status'],
['Abel', "Course_A", None, None, "Completed"],
['Bain', "Course_A", None, None, "Incomplete"],
['Coot', "Course_A", None, None, None],
['Abel', 'Course_B', 'Lesson_1', 100, "Incomplete"],
['Abel', 'Course_B', 'Lesson_2', 100, "Incomplete"],
['Abel', 'Course_B', 'Lesson_3', 100, "Incomplete"],
['Abel', 'Course_B', 'Lesson_4', 100, "Incomplete"],
['Bain', 'Course_B', 'Lesson_1', 100, "Incomplete"],
['Bain', 'Course_B', 'Lesson_2', 100, "Incomplete"],
['Coot', 'Course_B', 'Lesson_1', 100, "Completed"],
['Coot', 'Course_B', 'Lesson_2', 100, "Completed"],
['Coot', 'Course_B', 'Lesson_3', 100, "Completed"],
['Coot', 'Course_B', 'Lesson_4', 100, "Completed"],
['Coot', 'Course_B', 'Lesson_5', 100, "Completed"]]
to_achieved_df = pd.DataFrame(to_achieved[1:], columns = to_achieved[0])
to_achieved_df
I tried concat and merge, but neither seems to give me what I want.
df_concat = pd.concat([Course_A_df, Course_B_df], axis=0, ignore_index=True)
df_concat
merged = pd.merge(left=Course_A_df, right=Course_B_df, left_on='Employee', right_on='Employee', how='left')
merged
For calculating the status, I tried groupby, but is there a way to check whether the sum is 500 and update the status accordingly?
Thanks!
You can .reindex Course_A_df to add the missing employees:
Course_A_df = (
    Course_A_df.set_index("Employee")
    .reindex(Course_B_df["Employee"].unique())
    .reset_index()
)
Course_A_df["Course Name"] = Course_A_df["Course Name"].ffill().bfill()
Prints:
Employee Course Name Status
0 Abel Course_A Completed
1 Bain Course_A Incomplete
2 Coot Course_A NaN
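Note that ffill().bfill() gives the right course name here only because Course_A_df holds a single course. As a minimal sketch of a more explicit variant (not part of the original answer), the missing name can be filled with a constant instead. The `course_a` frame and the `all_employees` list below are hypothetical stand-ins for the question's data:

```python
import pandas as pd

# Hypothetical stand-in for Course_A_df from the question.
course_a = pd.DataFrame(
    {
        "Employee": ["Abel", "Bain"],
        "Course Name": ["Course_A", "Course_A"],
        "Status": ["Completed", "Incomplete"],
    }
)

# Hypothetical full employee list; in the answer this comes from
# Course_B_df["Employee"].unique().
all_employees = ["Abel", "Bain", "Coot"]

course_a = (
    course_a.set_index("Employee")
    .reindex(all_employees)       # adds a row of NaN for each missing employee
    .rename_axis("Employee")      # keep the index name explicit for reset_index
    .reset_index()
)

# Fill the course name with a constant rather than relying on neighbouring rows.
course_a["Course Name"] = course_a["Course Name"].fillna("Course_A")
```

Status is left as NaN for the added employee, matching the None in the desired output.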
Then add the "Status" column to Course_B_df:
Course_B_df["Status"] = Course_B_df.groupby(
    ["Employee", "Course Name"], as_index=False
)["Lesson Score"].transform(
    lambda x: "Complete" if x.sum() == 500 else "Incomplete"
)
Prints:
Employee Course Name Lesson Name Lesson Score Status
0 Abel Course_B Lesson_1 100 Incomplete
1 Abel Course_B Lesson_2 100 Incomplete
2 Abel Course_B Lesson_3 100 Incomplete
3 Abel Course_B Lesson_4 100 Incomplete
4 Bain Course_B Lesson_1 100 Incomplete
5 Bain Course_B Lesson_2 100 Incomplete
6 Coot Course_B Lesson_1 100 Complete
7 Coot Course_B Lesson_2 100 Complete
8 Coot Course_B Lesson_3 100 Complete
9 Coot Course_B Lesson_4 100 Complete
10 Coot Course_B Lesson_5 100 Complete
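The same rule can also be written without a Python lambda. The sketch below is an alternative, not the answer's code: it broadcasts each group's total with transform("sum") and maps it to a label with numpy.where. The `scores` frame and the `FULL_SCORE` constant are hypothetical, and the label "Completed" follows the spelling in the question's desired output rather than the "Complete" printed above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Course_B_df (without the empty Status column).
scores = pd.DataFrame(
    {
        "Employee": ["Abel"] * 4 + ["Bain"] * 2 + ["Coot"] * 5,
        "Course Name": ["Course_B"] * 11,
        "Lesson Score": [100] * 11,
    }
)

FULL_SCORE = 500  # assumption: 5 lessons, 100 points each

# Per-row copy of the group's total score, aligned with the original index.
totals = scores.groupby(["Employee", "Course Name"])["Lesson Score"].transform("sum")

# Label every row of a group by whether the group reached the full score.
scores["Status"] = np.where(totals == FULL_SCORE, "Completed", "Incomplete")
```

Checking the sum against 500 assumes every lesson is worth exactly 100 points; if that may vary, comparing the lesson count and the minimum score per group would be a stricter test.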
Finally, .concat the two:
out = pd.concat([Course_A_df, Course_B_df])
print(out[["Employee", "Course Name", "Lesson Name", "Lesson Score", "Status"]])
Prints:
Employee Course Name Lesson Name Lesson Score Status
0 Abel Course_A NaN NaN Completed
1 Bain Course_A NaN NaN Incomplete
2 Coot Course_A NaN NaN NaN
0 Abel Course_B Lesson_1 100.0 Incomplete
1 Abel Course_B Lesson_2 100.0 Incomplete
2 Abel Course_B Lesson_3 100.0 Incomplete
3 Abel Course_B Lesson_4 100.0 Incomplete
4 Bain Course_B Lesson_1 100.0 Incomplete
5 Bain Course_B Lesson_2 100.0 Incomplete
6 Coot Course_B Lesson_1 100.0 Complete
7 Coot Course_B Lesson_2 100.0 Complete
8 Coot Course_B Lesson_3 100.0 Complete
9 Coot Course_B Lesson_4 100.0 Complete
10 Coot Course_B Lesson_5 100.0 Complete
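In the output above, the index labels repeat (0, 1, 2, 0, 1, ...) and "Lesson Score" has become float because of the NaN rows. If that matters for the Tableau export, one possible cleanup is sketched below. It is an optional extra rather than part of the answer, and the small `course_a` and `course_b` frames are hypothetical stand-ins for the two prepared dataframes.

```python
import pandas as pd

# Hypothetical, shortened stand-ins for the prepared Course_A_df / Course_B_df.
course_a = pd.DataFrame(
    {
        "Employee": ["Abel", "Bain", "Coot"],
        "Course Name": ["Course_A"] * 3,
        "Status": ["Completed", "Incomplete", None],
    }
)
course_b = pd.DataFrame(
    {
        "Employee": ["Abel", "Coot"],
        "Course Name": ["Course_B"] * 2,
        "Lesson Name": ["Lesson_1", "Lesson_1"],
        "Lesson Score": [100, 100],
        "Status": ["Incomplete", "Incomplete"],
    }
)

columns = ["Employee", "Course Name", "Lesson Name", "Lesson Score", "Status"]

out = (
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating labels.
    pd.concat([course_a, course_b], ignore_index=True)[columns]
    # The nullable integer dtype keeps 100 as an integer next to missing values.
    .astype({"Lesson Score": "Int64"})
)
```

The missing scores then show up as <NA> instead of NaN, and the remaining scores stay integers.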