Make a calculation or function skip rows based on NaN values

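I have two dataframes and want to apply two separate functions to them. Each function performs a validation check on each dataframe independently, and any differences that turn up are then concatenated into one transformed list.

The problem I am facing is that the first validation check should only run on rows where every numeric column of the dataframe being analyzed contains a value. If a row has ANY NaN values, the first validation check should skip that row.

The second validation check does not need this restriction.

Here are the dataframes, functions, and transformations: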

import pandas as pd
import numpy as np

df1 = {'Fruits': ["Banana","Blueberry","Apple","Cherry","Mango","Pineapple","Watermelon","Papaya","Pear","Coconut"],
        'Price': [2,1.5,np.nan,2.5,3,4,np.nan,3.5,1.5,2],'Amount':[40,19,np.nan,np.nan,60,70,80,np.nan,45,102],
        'Quantity Frozen':[3,4,np.nan,15,np.nan,9,12,8,np.nan,80],
        'Quantity Fresh':[37,12,np.nan,45,np.nan,61,np.nan,24,14,20],
        'Multiple':[74,17,np.nan,112.5,np.nan,244,np.nan,84,21,40]}
df1 = pd.DataFrame(df1, columns = ['Fruits', 'Price','Amount','Quantity Frozen','Quantity Fresh','Multiple'])

df2 = {'Fruits': ["Banana","Blueberry","Apple","Cherry","Mango","Pineapple","Watermelon","Papaya","Pear","Coconut"],
        'Price': [2,1.5,np.nan,2.6,3,4,np.nan,3.5,1.5,2],'Amount':[40,16,np.nan,np.nan,60,72,80,np.nan,45,100],
        'Quantity Frozen':[3,4,np.nan,np.nan,np.nan,9,12,8,np.nan,80],
        'Quantity Fresh':[np.nan,12,np.nan,45,np.nan,61,np.nan,24,15,20],
        'Multiple':[74,17,np.nan,112.5,np.nan,244,np.nan,84,20,40]}

df2 = pd.DataFrame(df2, columns = ['Fruits', 'Price','Amount','Quantity Frozen','Quantity Fresh','Multiple'])

#Validation Check 1:

for name, dataset in {'Fruit Dataset1':df1,'Fruit Dataset2':df2}.items():
        dataset['dif_Stock on Hand'] = dataset['Quantity Fresh']+dataset['Quantity Frozen']
        for varname,var in {'Stock on Hand vs. Quantity Fresh + Quantity Frozen':'dif_Stock on Hand'}.items():
                print('{} differences in {}:'.format(name, varname))
                print(dataset[var].value_counts())
                print('\n')

#Validation Check 2:

for name, dataset in {'Fruit Dataset1':df1,'Fruit Dataset2':df2}.items():
        dataset['dif_Multiple'] = dataset['Price'] * dataset['Quantity Fresh']
        for varname,var in {'Multiple vs. Price x Quantity Fresh':'dif_Multiple'}.items():
                print('{} differences in {}:'.format(name, varname))
                print(dataset[var].value_counts())
                print('\n')

# #Wrangling internal inconsistency data frames to be in correct format
inconsistency_vars = ['dif_Stock on Hand','dif_Multiple']
inconsistency_var_betternames = {'dif_Stock on Hand':'Stock on Hand = Quantity Fresh + Quantity Frozen','dif_Multiple':'Multiple = Price x Quantity on Hand'}

# #Rollup1
idvars1=['Fruits']
df1 = df1[idvars1 + inconsistency_vars]
df2 = df2[idvars1 + inconsistency_vars]
df1 = df1.melt(id_vars = idvars1, value_vars = inconsistency_vars, value_name = 'Difference Magnitude')
df2 = df2.melt(id_vars = idvars1, value_vars = inconsistency_vars, value_name = 'Difference Magnitude')
df1['dataset'] = 'Fruit Dataset1'
df2['dataset'] = 'Fruit Dataset2'

# #First table in Internal Inconsistencies Sheet (Table 5)
inconsistent = pd.concat([df1,df2])
inconsistent = inconsistent[['variable','Difference Magnitude','dataset','Fruits']]
inconsistent['variable'] = inconsistent['variable'].map(inconsistency_var_betternames)
inconsistent = inconsistent[inconsistent['Difference Magnitude'] != 0]

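This is the desired output. For the first validation check, it skips the rows in either dataframe that have ANY NaN values in the numeric columns (every column except 'Fruits'):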

#Desired output
inconsistent_true = {'variable': ["Stock on Hand = Quantity Fresh + Quantity Frozen","Stock on Hand = Quantity Fresh + Quantity Frozen","Multiple = Price x Quantity on Hand",
"Multiple = Price x Quantity on Hand","Multiple = Price x Quantity on Hand"],
        'Difference Magnitude': [1,2,1,4.5,2.5],
        'dataset':["Fruit Dataset1","Fruit Dataset1","Fruit Dataset2","Fruit Dataset2","Fruit Dataset2"],
        'Fruits':["Blueberry","Coconut","Blueberry","Cherry","Pear"]}
inconsistent_true = pd.DataFrame(inconsistent_true, columns = ['variable', 'Difference Magnitude','dataset','Fruits'])

One pandas function that may come in handy is pd.isnull(), which returns True for np.nan values.

Taking an example from df1:
pd.isnull(df1['Amount'][2])
True

This can be added as a check across all the numeric columns, after which only the rows whose 'numeric_check' value is 1 are used:

df1['numeric_check'] = df1.apply(lambda x: 0 if (pd.isnull(x['Amount']) or
pd.isnull(x['Price']) or pd.isnull(x['Quantity Frozen']) or 
pd.isnull(x['Quantity Fresh']) or pd.isnull(x['Multiple'])) else 1, axis =1)
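
The same flag can probably be built without a row-wise apply, which tends to be slow on larger frames. The following is a minimal sketch, not a drop-in replacement: `demo` is a small stand-in frame with made-up values, and `numeric_cols` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with the same column names as df1 (values are illustrative)
demo = pd.DataFrame({
    'Fruits': ['Banana', 'Apple', 'Cherry'],
    'Price': [2.0, np.nan, 2.5],
    'Amount': [40.0, np.nan, np.nan],
    'Quantity Frozen': [3.0, np.nan, 15.0],
    'Quantity Fresh': [37.0, np.nan, 45.0],
    'Multiple': [74.0, np.nan, 112.5],
})

# Hypothetical helper list: the columns that must all be populated
numeric_cols = ['Price', 'Amount', 'Quantity Frozen', 'Quantity Fresh', 'Multiple']

# 1 where every listed column is non-NaN, 0 where any of them is NaN
demo['numeric_check'] = demo[numeric_cols].notna().all(axis=1).astype(int)
print(demo[['Fruits', 'numeric_check']])
```

Here `notna()` produces a boolean frame and `all(axis=1)` collapses it to one value per row, so the result should match the lambda version row for row.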

See the modified Validation Check 1:

#Validation Check 1:
for name, dataset in {'Fruit Dataset1':df1,'Fruit Dataset2':df2}.items():

    if '1' in name: # check to implement condition for only df1

        # Adding the 'numeric_check' column to dataset df
        dataset['numeric_check'] = dataset.apply(lambda x: 0 if (pd.isnull(x['Amount']) or
        pd.isnull(x['Price']) or pd.isnull(x['Quantity Frozen']) or 
        pd.isnull(x['Quantity Fresh']) or pd.isnull(x['Multiple'])) else 1, axis =1)

        # filter out NaN rows, they will not be considered for this check;
        # .copy() avoids a SettingWithCopyWarning on the column assignment below
        dataset = dataset.loc[dataset['numeric_check']==1].copy()

        dataset['dif_Stock on Hand'] = dataset['Quantity Fresh']+dataset['Quantity Frozen']
        for varname,var in {'Stock on Hand vs. Quantity Fresh + Quantity Frozen':'dif_Stock on Hand'}.items():
                print('{} differences in {}:'.format(name, varname))
                print(dataset[var].value_counts())
                print('\n')

I hope I've understood what you mean.

# make boolean mask, True if all numeric values are not NaN
mask = df1.select_dtypes('number').notna().all(axis=1)

print(df1[mask])

      Fruits  Price  Amount  Quantity Frozen  Quantity Fresh  Multiple
0     Banana    2.0    40.0              3.0            37.0      74.0
1  Blueberry    1.5    19.0              4.0            12.0      17.0
5  Pineapple    4.0    70.0              9.0            61.0     244.0
9    Coconut    2.0   102.0             80.0            20.0      40.0
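
One way the mask might be wired into the two checks is sketched below, assuming the rule is applied per dataframe. `demo` and `complete_rows` are hypothetical names with illustrative values; the check expressions are copied from the question as written.

```python
import numpy as np
import pandas as pd

def complete_rows(df):
    """Boolean mask: True where every numeric column in the row is non-NaN."""
    return df.select_dtypes('number').notna().all(axis=1)

# Stand-in frame with the question's column names (values are illustrative)
demo = pd.DataFrame({
    'Fruits': ['Banana', 'Cherry', 'Pineapple'],
    'Price': [2.0, 2.5, 4.0],
    'Amount': [40.0, np.nan, 70.0],
    'Quantity Frozen': [3.0, 15.0, 9.0],
    'Quantity Fresh': [37.0, 45.0, 61.0],
    'Multiple': [74.0, 112.5, 244.0],
})

# Build the mask before adding derived columns, so they do not affect it
mask = complete_rows(demo)

# Check 1: only rows that pass the mask get a value, skipped rows become NaN
demo['dif_Stock on Hand'] = (demo['Quantity Fresh'] + demo['Quantity Frozen']).where(mask)

# Check 2: no NaN rule, computed on every row
demo['dif_Multiple'] = demo['Price'] * demo['Quantity Fresh']

# Reshape as in the question, then drop the skipped (NaN) entries
long = demo.melt(id_vars=['Fruits'],
                 value_vars=['dif_Stock on Hand', 'dif_Multiple'],
                 value_name='Difference Magnitude')
long = long.dropna(subset=['Difference Magnitude'])
print(long)
```

The dropna step matters because NaN != 0 evaluates to True, so a filter like `inconsistent['Difference Magnitude'] != 0` would otherwise keep the skipped rows.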