Parsing through a pandas DataFrame and applying rules based on different conditions
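I have a fictitious dataframe that replicates a real problem I am trying to solve in Python: reconciling the account rates held on a mainframe system against the rates that should be set according to the rate tables.
I have 3 tables, but for this example they have been merged into a single dataframe.
- Account information with rate conditions (first 5 columns of the df). These are the actual rates applied to the accounts, and they need to be matched to make sure they are set up correctly.
- Non-standard rates: these are applied to some accounts once specific conditions are met.
- Standard rates: same as above, these apply once certain conditions are met.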
import pandas as pd
import numpy as np

# np.nan is used instead of the np.NaN alias, which was removed in NumPy 2.0
df = pd.DataFrame([[1234567890,3.5,'GG','N','N','Y',np.nan,np.nan,'N','N',3.5,'GG'],
                   [7854567890,np.nan,'GG','N','N','N',np.nan,'GG','N','N',3.5,'GG'],
                   [9876542190,3.5,'FF','N','N','Y',np.nan,np.nan,'N','Y',3.5,'FI'],
                   [9632587415,3.5,'GG','N','N','N',3,'GG','N','N',3.5,'GG']],
                  columns = ['Account','Account_Spread','Account_Swing','indict_1','indict_2','Negotiated_Rate',
                             'Non_std_Spread','Non_std_Code','Non_std_indict_1','Non_std_indict_2','Std_Spread','Std_Swing'])
df
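Conditions:
- Account data (Account spread and Account swing) should only be matched against the non-standard rates if column "Non_std_Spread" or "Non_std_Code", or both, are populated and column "Negotiated_Rate" is set to N.
- Account data (spread and swing) should only be matched against the standard rates if columns "Non_std_Spread" and "Non_std_Code" are both null; column "Negotiated_Rate" may be set to either N or Y.
- For accounts where the above indicator ("Negotiated_Rate") is set to Y, the indicators in the non-standard data, "Non_std_indict_1" and "Non_std_indict_2", need to be compared with "indict_1" and "indict_2" respectively, reporting on matches and mismatches.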
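One way to read the three conditions above is as boolean masks that decide which rate table each account is checked against. The following is only a sketch of that reading, not part of the original question: the small `rules` frame, the `Rate_Table` column, and the "NSR"/"SR"/"none" labels are names made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame holding only the columns the routing rules look at
# (values copied from the four sample accounts).
rules = pd.DataFrame({
    "Negotiated_Rate": ["Y", "N", "Y", "N"],
    "Non_std_Spread": [np.nan, np.nan, np.nan, 3.0],
    "Non_std_Code": [np.nan, "GG", np.nan, "GG"],
})

# True where at least one of the non-standard columns is populated.
any_non_std = rules["Non_std_Spread"].notna() | rules["Non_std_Code"].notna()

# Condition 1: non-standard data present and Negotiated_Rate is N.
use_nsr = any_non_std & rules["Negotiated_Rate"].eq("N")
# Condition 2: both non-standard columns null (Negotiated_Rate may be N or Y).
use_sr = ~any_non_std
# Condition 3: indicator columns are only compared for negotiated accounts.
check_indicators = rules["Negotiated_Rate"].eq("Y")

# "Rate_Table" is a hypothetical helper column; "none" marks rows that
# satisfy neither condition 1 nor condition 2.
rules["Rate_Table"] = np.select([use_nsr, use_sr], ["NSR", "SR"], default="none")
print(rules)
```

With the sample values, the first and third accounts are routed to the standard rates and the second and fourth to the non-standard rates.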
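Desired outcome:
- A new column added to the dataframe that identifies whether a match or mismatch was detected when comparing the account spread and code against their equivalents in either the non-standard or the standard rates. Something like "MatchOnNSR" or "MismatchOnSR".
- Another column, or columns, comparing whether a mismatch occurs between the indicator columns when Negotiated_Rate is flagged as Y.
Sample output with the expected result: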
df = pd.DataFrame([[1234567890,3.5,'GG','N','N','Y',np.nan,np.nan,'N','N',3.5,'GG','MatchOnSR',True,True],
                   [7854567890,np.nan,'GG','N','N','N',np.nan,'GG','N','N',3.5,'GG','MatchOnNSR',np.nan,np.nan],
                   [9876542190,3.5,'FF','N','N','Y',np.nan,np.nan,'N','Y',3.5,'FI','MismatchOnSR',True,False],
                   [9632587415,3.5,'GG','N','N','N',3,'GG','N','N',3.5,'GG','MismatchOnNSR',np.nan,np.nan]],
                  columns = ['Account','Account_Spread','Account_Swing','indict_1','indict_2','Negotiated_Rate',
                             'Non_std_Spread','Non_std_Code','Non_std_indict_1','Non_std_indict_2','Std_Spread','Std_Swing',
                             'Is_Match','Match_indict_1','Match_indict_2'])
df
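At the moment I don't have anything that solves this. I am struggling to work out the best way to get started. Any help is much appreciated.
Finally figured it out: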
def compute_match(row):
    # Default all outputs to NaN so rows that satisfy neither rule stay unclassified.
    m = match_indict_1 = match_indict_2 = np.nan
    # Missing values arrive here as the sentinel string 'nan' (see fillna below).
    if row['Non_std_Spread'] == 'nan' and row['Non_std_Code'] == 'nan':
        # No non-standard data: compare the indicators and check against the standard rates.
        match_indict_1 = row['indict_1'] == row['Non_std_indict_1']
        match_indict_2 = row['indict_2'] == row['Non_std_indict_2']
        if row['Account_Spread'] == row['Std_Spread'] and row['Account_Swing'] == row['Std_Swing']:
            m = 'MatchOnSR'
        else:
            m = 'MismatchOnSR'
    elif row['Non_std_Spread'] != 'nan' and row['Non_std_Code'] != 'nan' and row['Negotiated_Rate'] == 'N':
        # Both non-standard columns populated and not negotiated: check against the non-standard rates.
        if row['Account_Spread'] == row['Non_std_Spread'] and row['Account_Swing'] == row['Non_std_Code']:
            m = 'MatchOnNSR'
        else:
            m = 'MismatchOnNSR'
    return (m, match_indict_1, match_indict_2)

# Apply row by row on a NaN-filled copy, then attach the three result columns to df.
df = (
    pd.concat([
        df,
        (
            df
            .fillna('nan')
            .apply(compute_match, axis=1, result_type='expand')
            .rename({0:'Is_Match', 1:'Match_indict_1', 2:'Match_indict_2'}, axis=1)
        ),
    ], axis=1)
)
Test:
      Account  Account_Spread  Account_Swing  indict_1  indict_2  Negotiated_Rate  Non_std_Spread  Non_std_Code  Non_std_indict_1  Non_std_indict_2  Std_Spread  Std_Swing       Is_Match  Match_indict_1  Match_indict_2
0  1234567890             3.5             GG         N         N                Y             NaN           NaN                 N                 N         3.5         GG      MatchOnSR            True            True
1  7854567890             NaN             GG         N         N                N             NaN            GG                 N                 N         3.5         GG            NaN             NaN             NaN
2  9876542190             3.5             FF         N         N                Y             NaN           NaN                 N                 Y         3.5         FI   MismatchOnSR            True           False
3  9632587415             3.5             GG         N         N                N             3.0            GG                 N                 N         3.5         GG  MismatchOnNSR             NaN             NaN
Note that Is_Match is NaN for the second row (index 1). This is because Non_std_Spread is NaN for that row but Non_std_Code is not NaN, so neither branch of compute_match applies.
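The question's expected output shows MatchOnNSR for that account even though both Account_Spread and Non_std_Spread are missing, which suggests that two missing values should count as equal. Whether that is the right business rule is an assumption; if it is, a NaN-aware comparison could look like the sketch below. The helper name `eq_nan` and the sample Series are made up for illustration.

```python
import numpy as np
import pandas as pd

def eq_nan(left: pd.Series, right: pd.Series) -> pd.Series:
    """Element-wise equality that also treats NaN on both sides as a match."""
    return left.eq(right) | (left.isna() & right.isna())

# Hypothetical sample values: equal, both missing, different.
account_spread = pd.Series([3.5, np.nan, 3.5])
non_std_spread = pd.Series([3.5, np.nan, 3.0])

plain = account_spread.eq(non_std_spread)          # NaN never equals NaN
relaxed = eq_nan(account_spread, non_std_spread)   # NaN vs NaN counts as equal

print(pd.DataFrame({"plain": plain, "relaxed": relaxed}))
```

Combined with a rule that routes a row to the non-standard rates when either non-standard column is populated, this kind of comparison would classify the second account instead of leaving it as NaN.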
This answer raised some version-related issues; see the comments.
In a pure Pandas way, though not very idiomatic and probably not very efficient:
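# engine="python" is a conservative choice: calling methods such as .isna() inside
# query() has not worked with every pandas/numexpr combination.
# .copy() avoids SettingWithCopyWarning on the column assignments below.
nandf = df.query("Non_std_Spread.isna() and Non_std_Code.isna()", engine="python").copy()
nandf["Match_indict_1"] = nandf["indict_1"] == nandf["Non_std_indict_1"]
nandf["Match_indict_2"] = nandf["indict_2"] == nandf["Non_std_indict_2"]
# No non-standard data: compare against the standard rates.
nandf["Is_Match"] = np.where(
    (nandf["Account_Spread"] == nandf["Std_Spread"]) & (nandf["Account_Swing"] == nandf["Std_Swing"]),
    "MatchOnSR", "MismatchOnSR",
)
nonandf = df.query(
    "not(Non_std_Spread.isna()) and not(Non_std_Code.isna()) and Negotiated_Rate == 'N'",
    engine="python",
).copy()
# Non-standard data populated and not negotiated: compare against the non-standard rates.
nonandf["Is_Match"] = np.where(
    (nonandf["Account_Spread"] == nonandf["Non_std_Spread"]) & (nonandf["Account_Swing"] == nonandf["Non_std_Code"]),
    "MatchOnNSR", "MismatchOnNSR",
)
# Merge the two partial results back into the full frame.
df = nandf.combine_first(df)
df = nonandf.combine_first(df)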
This answer is a response to some version-related issues raised on my other answer.
You can try using masks ('cond_...'), for example:
# Rows with no non-standard data: compare indicators and check against the standard rates.
cond_nan = df['Non_std_Spread'].isna() & df['Non_std_Code'].isna()
df.loc[cond_nan,'Match_indict_1'] = df.loc[cond_nan,'indict_1'] == df.loc[cond_nan,'Non_std_indict_1']
df.loc[cond_nan,'Match_indict_2'] = df.loc[cond_nan,'indict_2'] == df.loc[cond_nan,'Non_std_indict_2']
df.loc[cond_nan,'Is_Match'] = np.where(
    (df.loc[cond_nan,'Account_Spread'] == df.loc[cond_nan,'Std_Spread']) & (df.loc[cond_nan,'Account_Swing'] == df.loc[cond_nan,'Std_Swing']),
    "MatchOnSR", "MismatchOnSR",
)
# Rows with both non-standard columns populated and Negotiated_Rate == 'N': check against the non-standard rates.
cond_no_nan = ~df['Non_std_Spread'].isna() & ~df['Non_std_Code'].isna() & (df['Negotiated_Rate'] == 'N')
df.loc[cond_no_nan,'Is_Match'] = np.where(
    (df.loc[cond_no_nan,'Account_Spread'] == df.loc[cond_no_nan,'Non_std_Spread']) & (df.loc[cond_no_nan,'Account_Swing'] == df.loc[cond_no_nan,'Non_std_Code']),
    "MatchOnNSR", "MismatchOnNSR",
)
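For comparison, here is a possible fully vectorized variant that follows the conditions as worded in the question rather than any of the answers. It rests on two assumptions: a row goes to the non-standard rates when either non-standard column is populated, and two missing values compare as equal. The helper `eq_nan` and the "Unclassified" default label are illustrative names, not anything from the original posts.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1234567890, 3.5, 'GG', 'N', 'N', 'Y', np.nan, np.nan, 'N', 'N', 3.5, 'GG'],
     [7854567890, np.nan, 'GG', 'N', 'N', 'N', np.nan, 'GG', 'N', 'N', 3.5, 'GG'],
     [9876542190, 3.5, 'FF', 'N', 'N', 'Y', np.nan, np.nan, 'N', 'Y', 3.5, 'FI'],
     [9632587415, 3.5, 'GG', 'N', 'N', 'N', 3, 'GG', 'N', 'N', 3.5, 'GG']],
    columns=['Account', 'Account_Spread', 'Account_Swing', 'indict_1', 'indict_2',
             'Negotiated_Rate', 'Non_std_Spread', 'Non_std_Code',
             'Non_std_indict_1', 'Non_std_indict_2', 'Std_Spread', 'Std_Swing'])

def eq_nan(left, right):
    # Assumption: NaN on both sides counts as equal.
    return left.eq(right) | (left.isna() & right.isna())

# Which rate table applies to each row.
any_non_std = df['Non_std_Spread'].notna() | df['Non_std_Code'].notna()
use_nsr = any_non_std & df['Negotiated_Rate'].eq('N')
use_sr = ~any_non_std

# Whether spread and swing agree with each table.
nsr_ok = (eq_nan(df['Account_Spread'], df['Non_std_Spread'])
          & eq_nan(df['Account_Swing'], df['Non_std_Code']))
sr_ok = (eq_nan(df['Account_Spread'], df['Std_Spread'])
         & eq_nan(df['Account_Swing'], df['Std_Swing']))

# np.select picks the first condition that is True for each row.
df['Is_Match'] = np.select(
    [use_nsr & nsr_ok, use_nsr, use_sr & sr_ok, use_sr],
    ['MatchOnNSR', 'MismatchOnNSR', 'MatchOnSR', 'MismatchOnSR'],
    default='Unclassified',
)

# Indicator comparison is reported only where Negotiated_Rate is Y; other rows become NaN.
negotiated = df['Negotiated_Rate'].eq('Y')
df['Match_indict_1'] = df['indict_1'].eq(df['Non_std_indict_1']).where(negotiated)
df['Match_indict_2'] = df['indict_2'].eq(df['Non_std_indict_2']).where(negotiated)

print(df[['Account', 'Is_Match', 'Match_indict_1', 'Match_indict_2']])
```

On the four sample accounts this labels the rows MatchOnSR, MatchOnNSR, MismatchOnSR and MismatchOnNSR. If missing values should not compare as equal, replacing `eq_nan` with a plain `==` would turn the second row into MismatchOnNSR.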