Pandas 检查 运行 pd.crosstab 之前的行中的键值,如果不存在则添加已知键值

Pandas Check for key values in rows prior to running pd.crosstab and add known key values if absent

大家下午好,

我 运行 一个脚本来创建一个交叉表样式的报告,从中将计算添加到输出中,这假设计算中使用的所有字段都存在。这个假设是不正确的,尤其是在月初,每个 Client 还没有 Results 的数量。我想在 运行 之前添加一些验证。 当且仅当每个 Client

的字段 Result 都存在以下值时,才没有问题
Covered
Customer Reject
Customer Timeout
Dealer Reject
Dealer Timeout
Done
No RFQ
Tied Covered
Tied Done
Tied Traded Away
Traded Away

所有值都存在并显示所需输出的情况:

df = {'Client': ['A', 'A', 'A',
                 'B', 'B', 'B', 'B','B','B','B','B',
                 'C',
                 'D','D','D','D','D','D','D','D','D','D','D'],
'Result': ['Covered', 'Customer Reject', 'Customer Timeout',
           'Dealer Reject','Dealer Timeout','Done','Tied Covered','Tied Done','Tied Traded Away','Traded Away','No RFQ',
           'No RFQ',
           'Covered','Customer Reject','Customer Timeout','Dealer Reject','Dealer Timeout','Done','Tied Covered','Tied Done','Tied Traded Away','Traded Away','No RFQ']}
df = pd.DataFrame.from_dict(df)
print(df)

   Client            Result
0       A           Covered
1       A   Customer Reject
2       A  Customer Timeout
3       B     Dealer Reject
4       B    Dealer Timeout
5       B              Done
6       B      Tied Covered
7       B         Tied Done
8       B  Tied Traded Away
9       B       Traded Away
10      B            No RFQ
11      C            No RFQ
12      D           Covered
13      D   Customer Reject
14      D  Customer Timeout
15      D     Dealer Reject
16      D    Dealer Timeout
17      D              Done
18      D      Tied Covered
19      D         Tied Done
20      D  Tied Traded Away
21      D       Traded Away
22      D            No RFQ

df = pd.crosstab(df.Client,df.Result,margins=True, margins_name='Total_Result_Per_Client').drop('Total_Result_Per_Client')
# Percentage_Priced_Back_Count = (Total - Dealer Reject - Dealer_Timeout) / (Total - Customer Reject)
df['Percentage_Priced_Back_Count'] = ( df['Total_Result_Per_Client'] - df['Dealer Reject'] - df['Dealer Timeout'] ) / ( df['Total_Result_Per_Client'] - df['Customer Reject'] )
# Hit_Rate_Count = (Total – (Covered + Tied Covered) – Customer Reject – Dealer Reject – Dealer Timeout – (Traded Away + Tied Traded Away)) / (Total – Dealer Reject – Dealer Timeout)
df['Hit_Rate_Count'] = ( df['Total_Result_Per_Client'] - df['Covered'] - df['Tied Covered'] - df['Customer Reject'] - df['Dealer Reject'] - df['Dealer Timeout'] - df['Traded Away'] - df['Tied Traded Away']) / \
                       ( df['Total_Result_Per_Client'] - df['Dealer Reject'] - df['Dealer Timeout'] )
df = df.stack().reset_index(name='value')


df2 = pd.DataFrame([[''] * len(df.columns), df.columns], columns=df.columns)
df1= (df.groupby('Client', group_keys=False)
        .apply(lambda d: d.append(df2))
        .iloc[:-2]
        .reset_index(drop=True))
print (df1)

示例输出:

15  Client                        Result     value
16       B                       Covered         0
17       B               Customer Reject         0
18       B              Customer Timeout         0
19       B                 Dealer Reject         1
20       B                Dealer Timeout         1
21       B                          Done         1
22       B                        No RFQ         1
23       B                  Tied Covered         1
24       B                     Tied Done         1
25       B              Tied Traded Away         1
26       B                   Traded Away         1
27       B       Total_Result_Per_Client         8
28       B  Percentage_Priced_Back_Count      0.75
29       B                Hit_Rate_Count       0.5

如果 ClientResult 中的值(用于计算)不存在,则会引发错误。在下面的示例中,Traded Away 被删除为 Client BDpd.crosstab 调用中出现错误:

df = {'Client': ['A', 'A', 'A',
                 'B', 'B', 'B', 'B','B','B','B',
                 'C',
                 'D','D','D','D','D','D','D','D','D','D'],
'Result': ['Covered', 'Customer Reject', 'Customer Timeout',
           'Dealer Reject','Dealer Timeout','Done','Tied Covered','Tied Done','Tied Traded Away','No RFQ',
           'No RFQ',
           'Covered','Customer Reject','Customer Timeout','Dealer Reject','Dealer Timeout','Done','Tied Covered','Tied Done','Tied Traded Away','No RFQ']}
df = pd.DataFrame.from_dict(df)
print(df)

df = pd.crosstab(df.Client,df.Result,margins=True, margins_name='Total_Result_Per_Client').drop('Total_Result_Per_Client')

KeyError: 'Traded Away'

即如果任何 Client 的已知 Result 值不存在,请在函数调用之前添加它(具有零值)以便完成所有计算。

我建议使用 DataFrame.reindex 添加列表中的所有缺失值并用 0:

填充它们
vals = ['Covered',
'Customer Reject',
'Customer Timeout',
'Dealer Reject',
'Dealer Timeout',
'Done',
'No RFQ',
'Tied Covered',
'Tied Done',
'Tied Traded Away',
'Traded Away']

df = (pd.crosstab(df.Client,
                 df.Result,
                 margins=True, 
                 margins_name='Total_Result_Per_Client')
        .drop('Total_Result_Per_Client')
        .reindex(vals + ['Total_Result_Per_Client'], axis=1, fill_value=0))

print(df)

Result  Covered  Customer Reject  Customer Timeout  Dealer Reject  \
Client                                                              
A             1                1                 1              0   
B             0                0                 0              1   
C             0                0                 0              0   
D             1                1                 1              1   

Result  Dealer Timeout  Done  No RFQ  Tied Covered  Tied Done  \
Client                                                          
A                    0     0       0             0          0   
B                    1     1       1             1          1   
C                    0     0       1             0          0   
D                    1     1       1             1          1   

Result  Tied Traded Away  Traded Away  Total_Result_Per_Client  
Client                                                          
A                      0            0                        3  
B                      1            0                        7  
C                      0            0                        1  
D                      1            0                       10