How to find the count of a specific value in a column of a pandas DataFrame and use it for calculations
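I have a pandas DataFrame like the one shown below. For each unique Domain, I want to calculate (Count(EV) + Count(PV) + Count(DV) + Count(GV) where the value is Green) / (total number of values in that Domain).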
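Domain | EV | PV | DV | GV | Numerator(part) | denominator(part) | ideal Output |
---|---|---|---|---|---|---|---|
KA-BLR | Green | Blue | Green | | 1 | 6 | 0.166 |
KA-BLR | Green | Green | Blue | | 1 | 6 | 0.166 |
KL-TRV | Green | Blue | Yellow | Red | 0.5 | 7 | 0.071 |
KL-TRV | Green | Blue | Blue | | 0.5 | 7 | 0.071 |
KL-COK | Blue | Blue | Yellow | Green | 0.25 | 4 | 0.0625 |
TN-CHN | Green | Blue | | | 0.5 | 5 | 0.1 |
TN-CHN | Green | Blue | Yellow | | 0.5 | 5 | 0.1 |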
Sample code
OVER_ALL_SCORE = {}
for Domain in df_RR["Domain"].unique():
    # count of greens
    EV_G = (df_RR['EV'] == 'Green').sum()
    PV_G = (df_RR['PV'] == 'Green').sum()
    DV_G = (df_RR['DV'] == 'Green').sum()
    GV_G = (df_RR['GV'] == 'Green').sum()
    # count of all values excluding null
    EV = df_RR['EV'].sum()
    PV = df_RR['PV'].sum()
    DV = df_RR['DV'].sum()
    GV = df_RR['GV'].sum()
    # so (0.25*(SUM for "DV" of greens (totally correct))+0.25*(SUM for "PV" of greens (totally correct))+0.25*(SUM for "EV" of greens (totally correct))+0.25*(SUM for "GV" of greens (totally correct)) / total count of values
    Numerator = (0.25*EV_G) + (0.25*PV_G) + (0.25*DV_G) + (0.25*GV_G)
    denominator = EV+PV+DV+GV
    try:
        OVER_ALL_SCORE[domain] = (Numerator / denominator)
    except:
        OVER_ALL_SCORE[domain] = 0
df_RR['Overall_score'] = df_RR['Domain'].map(OVER_ALL_SCORE)
Currently this logic returns the same value for every domain. Please help me fix it.
Thanks in advance.
Here is a solution that produces the ideal output:
OVER_ALL_SCORE = {}
for Domain in df_RR["Domain"].unique():
    # restrict the calculation to the rows of the current domain
    sub_df = df_RR.loc[df_RR['Domain'] == Domain]
    # count of greens
    EV_G = (sub_df['EV'] == 'Green').sum()
    PV_G = (sub_df['PV'] == 'Green').sum()
    DV_G = (sub_df['DV'] == 'Green').sum()
    GV_G = (sub_df['GV'] == 'Green').sum()
    # count of all non-null values
    EV = sub_df['EV'].count()
    PV = sub_df['PV'].count()
    DV = sub_df['DV'].count()
    GV = sub_df['GV'].count()
    numerator = (0.25*EV_G) + (0.25*PV_G) + (0.25*DV_G) + (0.25*GV_G)
    denominator = EV + PV + DV + GV
    # NumPy scalars do not raise on division by zero (they give nan/inf with a
    # warning), so test the denominator explicitly instead of using try/except
    OVER_ALL_SCORE[Domain] = (numerator / denominator) if denominator else 0
df_RR['Overall_score'] = df_RR['Domain'].map(OVER_ALL_SCORE)
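As a cross-check on the loop above, here is a minimal sketch that rebuilds the sample table and computes the same score without an explicit loop, using groupby and transform. The placement of the empty cells (None) and the helper names value_cols, greens_per_row and filled_per_row are assumptions made for illustration; the per-domain totals do not depend on which column holds each blank.

```python
import pandas as pd

# Sample data from the question; None marks an empty cell. Which column holds
# each blank is an assumption and does not change the per-domain totals.
df_RR = pd.DataFrame({
    "Domain": ["KA-BLR", "KA-BLR", "KL-TRV", "KL-TRV", "KL-COK", "TN-CHN", "TN-CHN"],
    "EV": ["Green", "Green", "Green", "Green", "Blue", "Green", "Green"],
    "PV": ["Blue", "Green", "Blue", "Blue", "Blue", "Blue", "Blue"],
    "DV": ["Green", "Blue", "Yellow", "Blue", "Yellow", None, "Yellow"],
    "GV": [None, None, "Red", None, "Green", None, None],
})
value_cols = ["EV", "PV", "DV", "GV"]

# Per-row number of greens and of non-null values across the four columns.
greens_per_row = df_RR[value_cols].eq("Green").sum(axis=1)
filled_per_row = df_RR[value_cols].notna().sum(axis=1)

# Sum both per domain; transform("sum") broadcasts each group total back to its rows.
greens = greens_per_row.groupby(df_RR["Domain"]).transform("sum")
filled = filled_per_row.groupby(df_RR["Domain"]).transform("sum")

# where() turns a zero denominator into NaN, so fillna(0) mirrors the fallback to 0.
df_RR["Overall_score"] = (0.25 * greens / filled.where(filled > 0)).fillna(0)

print(df_RR[["Domain", "Overall_score"]].drop_duplicates())
```

The scores should agree with the "ideal Output" column of the question up to rounding.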
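A few changes are key:
count() vs. sum()
When counting all values, you need to use the count method rather than the sum method (otherwise the code just concatenates the string values in the column):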
df_RR['EV'].sum()
returns 'GreenGreenGreenBlueGreen' (because the sum method simply adds all the values in the Series together).
Use this instead:
df_RR['EV'].count()
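The reason sum works for your count of greens is that df_RR['EV'] == 'Green' returns a Series of booleans, which can be summed correctly to get the number of greens (True is added as 1 and False as 0):
True True True False True
is equivalent to 1 1 1 0 1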
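The difference between the two counts can be seen on a small column. This is only an illustrative sketch: the Series ev and its values are made up, with None standing in for an empty cell.

```python
import pandas as pd

# Hypothetical sample column; None stands in for a missing cell.
ev = pd.Series(["Green", "Green", "Blue", None, "Green"], name="EV")

is_green = ev == "Green"            # boolean Series: True where the value is "Green"
green_count = int(is_green.sum())   # True counts as 1, False as 0
non_null_count = int(ev.count())    # count() skips missing values

print(green_count, non_null_count)  # 3 4
```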
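The main problem
Currently, your counts are the same on every iteration of the loop because you are not filtering by domain. As the first step of the loop, I would create a sub-DataFrame for the domain you are looking at:
sub_df = df_RR.loc[df_RR['Domain'] == Domain]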
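To see why the filter matters, the following sketch compares the count of greens with and without it. The frame df and the domain labels "A" and "B" are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical two-domain frame used only to illustrate the filtering step.
df = pd.DataFrame({
    "Domain": ["A", "A", "B", "B"],
    "EV": ["Green", "Green", "Green", "Blue"],
})

# Without filtering, every iteration looks at the whole column.
unfiltered = {}
for d in df["Domain"].unique():
    unfiltered[d] = int((df["EV"] == "Green").sum())

# With filtering, each iteration looks only at its own domain's rows.
filtered = {}
for d in df["Domain"].unique():
    sub_df = df.loc[df["Domain"] == d]
    filtered[d] = int((sub_df["EV"] == "Green").sum())

print(unfiltered)  # {'A': 3, 'B': 3}
print(filtered)    # {'A': 2, 'B': 1}
```

The unfiltered loop reports the same number for every domain, which is the symptom described in the question.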