Parsing through a pandas DataFrame and applying rules based on different conditions

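I have a made-up DataFrame that reproduces a real problem I am trying to solve in Python: reconciling the account rates held on a mainframe system against the rates that the rate tables say the accounts should have.

I have 3 tables, but for this example they have been merged into one DataFrame.

  1. Account information with the rate conditions (the first 5 columns of df). These are the actual rates applied to the accounts, and they need to be matched to make sure they are set correctly.
  2. Non-standard rates: once specific conditions are met, some accounts have these non-standard rates applied.
  3. Standard rates: same as above, these apply once certain conditions are met.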
import pandas as pd
import numpy as np
df = pd.DataFrame([[1234567890,3.5,'GG','N','N','Y',np.nan,np.nan,'N','N',3.5,'GG'],
                    [7854567890,np.nan,'GG','N','N','N',np.nan,'GG','N','N',3.5,'GG'],
                    [9876542190,3.5,'FF','N','N','Y',np.nan,np.nan,'N','Y',3.5,'FI'],
                    [9632587415,3.5,'GG','N','N','N',3,'GG','N','N',3.5,'GG']],
columns = ['Account','Account_Spread','Account_Swing','indict_1','indict_2','Negotiated_Rate',
           'Non_std_Spread','Non_std_Code','Non_std_indict_1','Non_std_indict_2','Std_Spread','Std_Swing'])
df

Conditions:

Desired outcome:

Sample output with the expected results:

df=pd.DataFrame([[1234567890,3.5,'GG','N','N','Y',np.nan,np.nan,'N','N',3.5,'GG','MatchOnSR',True,True],
                    [7854567890,np.nan,'GG','N','N','N',np.nan,'GG','N','N',3.5,'GG','MatchOnNSR',np.nan,np.nan],
                    [9876542190,3.5,'FF','N','N','Y',np.nan,np.nan,'N','Y',3.5,'FI','MismatchOnSR',True,False],
                    [9632587415,3.5,'GG','N','N','N',3,'GG','N','N',3.5,'GG','MismatchOnNSR',np.nan,np.nan]],
columns = ['Account','Account_Spread','Account_Swing','indict_1','indict_2','Negotiated_Rate',
           'Non_std_Spread','Non_std_Code','Non_std_indict_1','Non_std_indict_2','Std_Spread','Std_Swing','Is_Match','Match_indict_1','Match_indict_2'])
df

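At the moment I don't have anything that solves this. I'm struggling to work out the best way to get started. Any help is much appreciated.

Finally got it: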

def compute_match(row):
    # Expects a row in which NaN has already been replaced by the string 'nan'.
    # Rows that satisfy neither branch keep NaN in all three results.
    m = match_indict_1 = match_indict_2 = np.nan

    # No non-standard rate at all: compare the account against the standard rate.
    if row['Non_std_Spread'] == 'nan' and row['Non_std_Code'] == 'nan':
        match_indict_1 = row['indict_1'] == row['Non_std_indict_1']
        match_indict_2 = row['indict_2'] == row['Non_std_indict_2']
        if row['Account_Spread'] == row['Std_Spread'] and row['Account_Swing'] == row['Std_Swing']:
            m = 'MatchOnSR'
        else:
            m = 'MismatchOnSR'

    # Complete non-standard rate on a non-negotiated account: compare against it instead.
    elif row['Non_std_Spread'] != 'nan' and row['Non_std_Code'] != 'nan' and row['Negotiated_Rate'] == 'N':
        if row['Account_Spread'] == row['Non_std_Spread'] and row['Account_Swing'] == row['Non_std_Code']:
            m = 'MatchOnNSR'
        else:
            m = 'MismatchOnNSR'

    return (m, match_indict_1, match_indict_2)


df = (
    pd.concat([
        df,
        (
            df
            .fillna('nan')
            .apply(compute_match, axis=1, result_type='expand')
            .rename({0:'Is_Match', 1:'Match_indict_1', 2:'Match_indict_2'}, axis=1)
        ),
    ], axis=1)
)

Test:

      Account  Account_Spread Account_Swing indict_1 indict_2 Negotiated_Rate  Non_std_Spread Non_std_Code Non_std_indict_1 Non_std_indict_2  Std_Spread Std_Swing       Is_Match Match_indict_1 Match_indict_2
0  1234567890             3.5            GG        N        N               Y             NaN          NaN                N                N         3.5        GG      MatchOnSR           True           True
1  7854567890             NaN            GG        N        N               N             NaN           GG                N                N         3.5        GG            NaN            NaN            NaN
2  9876542190             3.5            FF        N        N               Y             NaN          NaN                N                Y         3.5        FI   MismatchOnSR           True          False
3  9632587415             3.5            GG        N        N               N             3.0           GG                N                N         3.5        GG  MismatchOnNSR            NaN            NaN

Note that Is_Match in the second row (index 1) is NaN. This is because Non_std_Spread in that row is NaN while Non_std_Code is not NaN, so neither branch of the function applies.
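The fall-through described above can be checked directly. This is a minimal sketch, assuming the two branch conditions from compute_match are rebuilt as boolean masks; the names sample, standard_rule, non_standard_rule and uncovered are made up for illustration and only the columns the branches inspect are reproduced.

```python
import numpy as np
import pandas as pd

# Only the columns the two rule branches look at; values copied from the sample rows.
sample = pd.DataFrame({
    "Account": [1234567890, 7854567890, 9876542190, 9632587415],
    "Negotiated_Rate": ["Y", "N", "Y", "N"],
    "Non_std_Spread": [np.nan, np.nan, np.nan, 3.0],
    "Non_std_Code": [np.nan, "GG", np.nan, "GG"],
})

spread_missing = sample["Non_std_Spread"].isna()
code_missing = sample["Non_std_Code"].isna()

# Branch 1: no non-standard rate at all, so the standard rate is compared.
standard_rule = spread_missing & code_missing
# Branch 2: a complete non-standard rate on a non-negotiated account.
non_standard_rule = ~spread_missing & ~code_missing & (sample["Negotiated_Rate"] == "N")

# Accounts matched by neither branch are the ones left with Is_Match = NaN.
uncovered = sample.loc[~(standard_rule | non_standard_rule), "Account"].tolist()
print(uncovered)
```

Listing the uncovered accounts this way makes it easier to decide whether such rows need a rule of their own or are simply bad data.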

This answer raised some version-related issues; see the comments.

A purely pandas approach, though not very idiomatic and probably not efficient:

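# .copy() lets us add columns to the filtered frame without a SettingWithCopyWarning.
# engine="python" is passed because method calls such as .isna() inside query()
# may not be supported by the numexpr engine in some pandas versions.
nandf = df.query("Non_std_Spread.isna() and Non_std_Code.isna()", engine="python").copy()
nandf["Match_indict_1"] = nandf["indict_1"] == nandf["Non_std_indict_1"]
nandf["Match_indict_2"] = nandf["indict_2"] == nandf["Non_std_indict_2"]
nandf["Is_Match"] = np.where(
    (nandf["Account_Spread"] == nandf["Std_Spread"]) & (nandf["Account_Swing"] == nandf["Std_Swing"]),
    "MatchOnSR", "MismatchOnSR",
)

nonandf = df.query("not(Non_std_Spread.isna()) and not(Non_std_Code.isna()) and Negotiated_Rate == 'N'", engine="python").copy()
nonandf["Is_Match"] = np.where(
    (nonandf["Account_Spread"] == nonandf["Non_std_Spread"]) & (nonandf["Account_Swing"] == nonandf["Non_std_Code"]),
    "MatchOnNSR", "MismatchOnNSR",
)

# Merge the computed columns back; rows covered by neither subset keep NaN.
df = nandf.combine_first(df)
df = nonandf.combine_first(df)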

This answer is a response to some version-related issues raised about my other answer.

You could try using boolean masks ('cond_...'), for example:

# Rows with no non-standard rate: compare against the standard rate.
cond_nan = df['Non_std_Spread'].isna() & df['Non_std_Code'].isna()

df.loc[cond_nan,'Match_indict_1'] = df.loc[cond_nan,'indict_1'] == df.loc[cond_nan,'Non_std_indict_1']
df.loc[cond_nan,'Match_indict_2'] = df.loc[cond_nan,'indict_2'] == df.loc[cond_nan,'Non_std_indict_2']
df.loc[cond_nan,'Is_Match'] = np.where(
    (df.loc[cond_nan,'Account_Spread'] == df.loc[cond_nan,'Std_Spread']) & (df.loc[cond_nan,'Account_Swing'] == df.loc[cond_nan,'Std_Swing']),
    "MatchOnSR", "MismatchOnSR",
)


# Rows with a complete non-standard rate on a non-negotiated account.
cond_no_nan = ~df['Non_std_Spread'].isna() & ~df['Non_std_Code'].isna() & (df['Negotiated_Rate'] == 'N')

df.loc[cond_no_nan,'Is_Match'] = np.where(
    (df.loc[cond_no_nan,'Account_Spread'] == df.loc[cond_no_nan,'Non_std_Spread']) & (df.loc[cond_no_nan,'Account_Swing'] == df.loc[cond_no_nan,'Non_std_Code']),
    "MatchOnNSR", "MismatchOnNSR",
)
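
As a possible variation on the mask idea, the four outcomes can also be assigned in one pass with np.select. This is only a sketch, not the method used in the answers above: the names use_std, use_non_std, std_ok and non_std_ok are made up for illustration, the indicator columns are left out, and rows covered by neither rule are assumed to stay empty (None).

```python
import numpy as np
import pandas as pd

# Sample rows from the question, reduced to the columns needed for Is_Match.
df = pd.DataFrame(
    [[1234567890, 3.5, 'GG', 'Y', np.nan, np.nan, 3.5, 'GG'],
     [7854567890, np.nan, 'GG', 'N', np.nan, 'GG', 3.5, 'GG'],
     [9876542190, 3.5, 'FF', 'Y', np.nan, np.nan, 3.5, 'FI'],
     [9632587415, 3.5, 'GG', 'N', 3, 'GG', 3.5, 'GG']],
    columns=['Account', 'Account_Spread', 'Account_Swing', 'Negotiated_Rate',
             'Non_std_Spread', 'Non_std_Code', 'Std_Spread', 'Std_Swing'],
)

# Which rule applies to each row.
use_std = df['Non_std_Spread'].isna() & df['Non_std_Code'].isna()
use_non_std = (df['Non_std_Spread'].notna() & df['Non_std_Code'].notna()
               & (df['Negotiated_Rate'] == 'N'))

# Whether the account agrees with the standard or the non-standard rate.
std_ok = (df['Account_Spread'] == df['Std_Spread']) & (df['Account_Swing'] == df['Std_Swing'])
non_std_ok = ((df['Account_Spread'] == df['Non_std_Spread'])
              & (df['Account_Swing'] == df['Non_std_Code']))

conditions = [use_std & std_ok, use_std & ~std_ok,
              use_non_std & non_std_ok, use_non_std & ~non_std_ok]
labels = ['MatchOnSR', 'MismatchOnSR', 'MatchOnNSR', 'MismatchOnNSR']

# The first condition that is True picks the label; default=None marks uncovered rows.
df['Is_Match'] = np.select(conditions, labels, default=None)
print(df[['Account', 'Is_Match']])
```

Keeping the rule masks separate from the comparison masks means each label is spelled out exactly once, which makes label mix-ups between the two branches easier to spot.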