How to compare two CSV files using DataFrames and retrieve the different cells? Why am I getting so many decimal figures in the float cells?

Question

I have a CSV file like this (example.csv):

STRING_COL,INT_1,INT_2,FLOAT,INT_3
Hello,9,65151651,3234.54848,7832
This is a string,2,5484651,34.234,-999
Another,2,62189548,51.51658,-999
Test,2,2131514,5.2156,-999
Ham,9,6546548,2.15,-999
String,9,3216546,2.15468,-999

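Each cell has a different number of decimal places. The cells can also hold strings or integers (Int64, Int8, ...). I also have a similar CSV file in which some values have been changed, and I want to check the differences between the two files.

So I wrote code similar to this to compare the values cell by cell: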

import pandas as pd

df = pd.read_csv(
    'example.csv', delimiter=',', comment='#', skip_blank_lines=True,
    verbose=False, engine='python', dtype=str
)
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore', downcast='integer'))

df_2 = pd.read_csv(
    'example_2.csv', delimiter=',', comment='#', skip_blank_lines=True,  # file with small changes
    verbose=False, engine='python', dtype=str
)
df_2 = df_2.apply(lambda x: pd.to_numeric(x, errors='ignore', downcast='integer'))

for i in list(df.index):
    for column in list(df.columns):
        old = df.loc[i, column]
        new = df_2.loc[i, column]
        if old != new:
            print('DIFFERENT VALUE >> INDEX: {} | OLD: {} | NEW: {}'.format(i, old, new))

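If you run the example with small CSV files, I am pretty sure it will work fine. But with a huge CSV file something strange happens. I do not understand why many values sometimes come out like this: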

1.6440000000000001  >> original value 1.644
7.7189999999999985  >> original value 7.7189

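Then, when I compare them, they are reported as different, which is not true because the values are the same. What is going wrong? Is there a way to solve this? Is there a better way to compare the values of the DataFrames?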

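Note: Maybe I am doing something wrong in other parts of the original code, but I think I have included the most important and relevant part here.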

Note 2: I am aware that the != operator does not work for NaN values. I use np.isnan to check for this case.
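For illustration, the NaN behaviour mentioned in Note 2 can be reproduced with the standard library alone. This is a minimal sketch; the helper name cell_changed is hypothetical and is not part of the original code, and math.isnan stands in for np.isnan.

```python
import math

nan = float("nan")

def cell_changed(old, new):
    """Hypothetical helper: True when two cells differ, treating NaN/NaN as unchanged."""
    both_nan = (isinstance(old, float) and math.isnan(old)
                and isinstance(new, float) and math.isnan(new))
    return not both_nan and old != new

print(nan != nan)                  # True, although nothing changed
print(cell_changed(nan, nan))      # False
print(cell_changed(-999, 7832))    # True
```

The isinstance check keeps math.isnan away from string cells, which it cannot handle.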

Update: I do not need a comparison that only says "yes, it is equal" or "no, it is not equal". I need to retrieve the changed values cell by cell.

Answer 1

I finally found a suitable way to do the comparison: np.isclose(). I read the duplicate question I found and some other questions about the epsilon value (numpy.finfo(), epsilon):

Numbers which differ by less than machine epsilon are numerically the same

    abs(a - b) < epsilon
    absolute(a - b) <= (atol + rtol * absolute(b))      # np.isclose() method
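To make the second inequality concrete, here is a minimal sketch in plain Python that applies it to the first value pair from the question. The helper name is_close is an illustration only (it is not a NumPy function), and sys.float_info.epsilon is used here as the standard-library counterpart of np.finfo(np.float64).eps.

```python
import sys
from decimal import Decimal

def is_close(a, b, rtol, atol=0.0):
    """Same inequality as quoted above: |a - b| <= atol + rtol * |b|."""
    return abs(a - b) <= atol + rtol * abs(b)

eps = sys.float_info.epsilon     # machine epsilon for 64-bit floats (2**-52)

parsed = 1.6440000000000001      # value seen in the DataFrame
original = 1.644                 # value written in the CSV file

print(parsed == original)                    # False: two neighbouring doubles
print(Decimal(original))                     # exact binary value stored for 1.644
print(is_close(parsed, original, rtol=eps))  # True
```

Note that the inequality scales the tolerance by b only, so swapping the arguments can change the result for borderline pairs.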

So I need to do something like this. I still have to check what happens if I compare float32 with float64 or float16.

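import numpy as np

eps64 = np.finfo(np.float64).eps
is_equal = {}  # column name -> boolean array, True where the cells match
# np.isclose only accepts numeric data, so string columns are skipped here
for col in df.select_dtypes(include='number').columns:
    is_equal[col] = np.isclose(
        df[col],
        df_2[col],
        equal_nan=False,
        atol=0.0,
        rtol=eps64
    )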
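Since the question asks for the changed values cell by cell, the comparison can be extended into a small report. The following is a sketch under stated assumptions, not the author's code: the function name changed_cells and the default rtol=1e-9 are illustrative choices, and both frames are assumed to share the same index and columns.

```python
import numpy as np
import pandas as pd

def changed_cells(old_df, new_df, rtol=1e-9, atol=0.0):
    """Return one row per differing cell: index, column, old, new.

    Assumes both frames share the same index and columns.
    rtol=1e-9 is an illustrative default, not a value from the post.
    """
    records = []
    for col in old_df.columns:
        old, new = old_df[col], new_df[col]
        if (pd.api.types.is_numeric_dtype(old)
                and pd.api.types.is_numeric_dtype(new)):
            # Tolerant comparison; NaN in both files counts as unchanged.
            same = np.isclose(old.to_numpy(dtype=float),
                              new.to_numpy(dtype=float),
                              rtol=rtol, atol=atol, equal_nan=True)
        else:
            # Strings and mixed columns: exact comparison, NaN matches NaN.
            same = ((old == new) | (old.isna() & new.isna())).to_numpy()
        for idx in old.index[~same]:
            records.append({"index": idx, "column": col,
                            "old": old.loc[idx], "new": new.loc[idx]})
    return pd.DataFrame(records, columns=["index", "column", "old", "new"])

# Small stand-in frames instead of the real CSV files.
old_df = pd.DataFrame({"STRING_COL": ["Hello", "Test"],
                       "FLOAT": [1.6440000000000001, 2.15]})
new_df = pd.DataFrame({"STRING_COL": ["Hello", "Ham"],
                       "FLOAT": [1.644, 2.5]})
print(changed_cells(old_df, new_df))
```

A relative tolerance of exactly one machine epsilon only absorbs an error of roughly one unit in the last place, so parsing errors of two units would still be flagged. That is why the default here is looser; it should be tuned to the precision of the data.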

But now I face another problem: if I want to copy the values to other variables, I copy the inexact value 1.6440000000000001. What I do now is convert the value to float >> float(1.6440000000000001)
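One caveat, sketched below: calling float() on a value that is already a float returns the same number, so it does not remove the extra digits. A possible alternative is to let pandas parse the text with its round-trip float converter. This assumes the extra digits come from the default string-to-float conversion, which the post does not confirm. The float_precision option applies to the C parser, not to engine='python', and the in-memory CSV text is a stand-in for example.csv.

```python
import io
import pandas as pd

csv_text = "FLOAT\n1.644\n7.719\n"   # stand-in for example.csv

# float() applied to an existing float is a no-op for the stored value.
print(float(1.6440000000000001) == 1.6440000000000001)   # True

# Parse with the round-trip converter of the default C engine.
df = pd.read_csv(io.StringIO(csv_text), float_precision="round_trip")
print(df["FLOAT"].tolist())
```

With values parsed this way, exact comparison against the literals written in the file becomes meaningful again, although a tolerant comparison is still the safer choice for computed values.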


python

floating-accuracy

dataframe

python-3.x

pandas