How to take the difference of matching df['keys'] and create a new column for them
I'm trying to find the wage gap between genders for a given set of majors.
Here is a text version of my table:
    gender   field group   logwage
0   male     BUSINESS      7.229572
10  female   BUSINESS      7.072464
1   male     COMM/JOURN    7.108538
11  female   COMM/JOURN    7.015018
2   male     COMPSCI/STAT  7.340410
12  female   COMPSCI/STAT  7.169401
3   male     EDUCATION     6.888829
13  female   EDUCATION     6.770255
4   male     ENGINEERING   7.397082
14  female   ENGINEERING   7.323996
5   male     HUMANITIES    7.053048
15  female   HUMANITIES    6.920830
6   male     MEDICINE      7.319011
16  female   MEDICINE      7.193518
17  female   NATSCI        6.993337
7   male     NATSCI        7.089232
18  female   OTHER         6.881126
8   male     OTHER         7.091698
9   male     SOCSCI/PSYCH  7.197572
19  female   SOCSCI/PSYCH  6.968322
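diff() doesn't work for me because it takes the difference between every pair of consecutive rows, not just the two rows that share a major.
Here is my code right now:
for row in sorted_mfield:
    if sorted_mfield['field group']==sorted_mfield['field group'].shift(1):
        diff= lambda x: x[0]-x[1]
My next strategy would be to go back to the unsorted dataframe, where male and female were their own columns, and take the difference from there. But since I've already spent an hour trying to do this, and I'm fairly new to pandas, I thought I'd ask and find out how this works. Thanks.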
I would consider using pivot to reshape your DataFrame so that it is easier to compute on.
Code:
# Assign the pivoted result back to df; pivot does not modify the frame in place
df = df.pivot(index='field group', columns='gender', values='logwage').rename_axis([None], axis=1)

#                female      male
#field group
#BUSINESS      7.072464  7.229572
#COMM/JOURN    7.015018  7.108538
#COMPSCI/STAT  7.169401  7.340410
#EDUCATION     6.770255  6.888829
#ENGINEERING   7.323996  7.397082
#HUMANITIES    6.920830  7.053048
#MEDICINE      7.193518  7.319011
#NATSCI        6.993337  7.089232
#OTHER         6.881126  7.091698
#SOCSCI/PSYCH  6.968322  7.197572

df.male - df.female

#field group
#BUSINESS        0.157108
#COMM/JOURN      0.093520
#COMPSCI/STAT    0.171009
#EDUCATION       0.118574
#ENGINEERING     0.073086
#HUMANITIES      0.132218
#MEDICINE        0.125493
#NATSCI          0.095895
#OTHER           0.210572
#SOCSCI/PSYCH    0.229250
#dtype: float64
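Since the question title asks for a new column, here is a minimal sketch of how the pivoted difference could be stored and then mapped back onto the original long-format rows. It uses three of the field groups from the question's table; the names `wide` and `gap` are illustrative choices and not part of the original answer.

```python
import pandas as pd

# Three field groups copied from the question's table
df = pd.DataFrame({
    'gender': ['male', 'female', 'male', 'female', 'female', 'male'],
    'field group': ['BUSINESS', 'BUSINESS', 'EDUCATION', 'EDUCATION',
                    'NATSCI', 'NATSCI'],
    'logwage': [7.229572, 7.072464, 6.888829, 6.770255, 6.993337, 7.089232],
})

# One row per field group, one column per gender
wide = df.pivot(index='field group', columns='gender', values='logwage')

# Store the male-minus-female difference as a new column on the wide frame
wide['gap'] = wide['male'] - wide['female']

# Optionally copy the gap back onto every row of the long frame,
# looking it up by field group
df['gap'] = df['field group'].map(wide['gap'])

print(wide)
print(df)
```

Keeping the pivoted frame in its own variable leaves the original `df` untouched, which may be convenient if the long format is still needed later.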
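A solution using pandas' shift() on a sorted version of the data:
# Sort so that, within each field group, the female row comes directly before the male row
df.sort_values(by=['field group', 'gender'], inplace=True)
# Subtract the previous row's logwage from each row's logwage
df['gap'] = df.logwage - df.logwage.shift(1)
# Keep only the male rows, where gap is male minus female for the same field group
df[df.gender == 'male'][['field group', 'gap']]
With the sample data, this produces the following output: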
     field group       gap
0       BUSINESS  0.157108
2     COMM/JOURN  0.093520
4   COMPSCI/STAT  0.171009
6      EDUCATION  0.118574
8    ENGINEERING  0.073086
10    HUMANITIES  0.132218
12      MEDICINE  0.125493
15        NATSCI  0.095895
17         OTHER  0.210572
18  SOCSCI/PSYCH  0.229250
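Note: this assumes that every field group always has exactly one pair of values. If you want to verify that, or drop the field groups that lack a pair, the following code does the filtering: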
df_grouped = df.groupby('field group')
df_filtered = df_grouped.filter(lambda x: len(x) == 2)
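To see how the filter and the shift-based difference fit together, here is a small self-contained sketch. The `LAW` row and its wage are made up purely to show an unpaired group being dropped; the other rows come from the question's table, and the names `paired` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'gender': ['male', 'female', 'male', 'male', 'female'],
    # 'LAW' is a hypothetical group with only one row (no female counterpart)
    'field group': ['BUSINESS', 'BUSINESS', 'LAW', 'NATSCI', 'NATSCI'],
    'logwage': [7.229572, 7.072464, 7.100000, 7.089232, 6.993337],
})

# Keep only the field groups that have exactly two rows
paired = df.groupby('field group').filter(lambda g: len(g) == 2)

# Sort so each female row sits directly above the male row of the same group
paired = paired.sort_values(by=['field group', 'gender'])

# Difference between each row and the row above it
paired['gap'] = paired['logwage'] - paired['logwage'].shift(1)

# On male rows, the row above is the female row of the same group
result = paired.loc[paired['gender'] == 'male', ['field group', 'gap']]
print(result)
```

Note that `len(g) == 2` only checks the row count. If a group could contain two rows of the same gender, a stricter condition such as `g['gender'].nunique() == 2` may be a safer check.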