Pandas reindex to higher resolution

I have a pandas DataFrame whose index runs from 3 to 15 in steps of 0.5, and I want to reindex it to steps of 0.1. I tried this code, but it doesn't work:

import numpy as np
import pandas as pd

# create data, set the index, and print for verification
df = pd.DataFrame({'A': np.arange(3, 5, 0.5), 'B': np.arange(3, 5, 0.5)})
df.set_index('A', inplace=True)
df.reindex(np.arange(3, 5, 0.1)).head(15)

The code above outputs the following:

A B
3.0 3.0
3.1 NaN
3.2 NaN
3.3 NaN
3.4 NaN
3.5 NaN * expected 3.5 in this position, since this label exists in the original df
3.6 NaN
3.7 NaN
3.8 NaN

Strangely, the problem goes away when reindexing from 0 instead of 3, as in the code below:

df = pd.DataFrame({'A':np.arange(3,5,0.5),'B':np.arange(3,5,0.5)})
df.set_index('A', inplace = True)
print(df.head())
df.reindex(np.arange(0,5,0.1)).head(60)

The output is now correct:

A B
0.0 NaN
... ...
3.0 3.0
3.1 NaN
3.2 NaN
3.3 NaN
3.4 NaN
3.5 3.5
3.6 NaN
3.7 NaN
3.8 NaN

I am running Python 3.8.5 on Windows 10.

The pandas version is 1.4.07.

The NumPy version is 1.22.1.

Does anyone know why this happens? Is it a known bug or a new one? If it is a bug, has it been fixed in a newer version of Python, pandas, or NumPy?

Thanks

Good question.

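The reason is that the value 3.5 created by np.arange(3,5,0.1) is not exactly 3.5; it is 3.5000000000000004. np.arange(0,5,0.1), on the other hand, does create a 3.5 that is exactly 3.5, and so does np.arange(3,5,0.5). Because reindex matches labels by exact equality, 3.5000000000000004 does not match the original label 3.5, and that row comes back as NaN.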

pd.Index(np.arange(3,5,0.1)) 

Float64Index([               3.0,                3.1,                3.2,
              3.3000000000000003, 3.4000000000000004, 3.5000000000000004,
              3.6000000000000005, 3.7000000000000006, 3.8000000000000007,
               3.900000000000001,  4.000000000000001,  4.100000000000001,
               4.200000000000001,  4.300000000000001,  4.400000000000001,
               4.500000000000002,  4.600000000000001,  4.700000000000001,
               4.800000000000002,  4.900000000000002],
             dtype='float64')

pd.Index(np.arange(0,5,0.1))

Float64Index([                0.0,                 0.1,                 0.2,
              0.30000000000000004,                 0.4,                 0.5,
               0.6000000000000001,  0.7000000000000001,                 0.8,
                              0.9,                 1.0,                 1.1,
               1.2000000000000002,                 1.3,  1.4000000000000001,
                              1.5,                 1.6,  1.7000000000000002,
                              1.8,  1.9000000000000001,                 2.0,
                              2.1,                 2.2,  2.3000000000000003,
               2.4000000000000004,                 2.5,                 2.6,
                              2.7,  2.8000000000000003,  2.9000000000000004,
                              3.0,                 3.1,                 3.2,
               3.3000000000000003,  3.4000000000000004,                 3.5,
                              3.6,                 3.7,  3.8000000000000003,
               3.9000000000000004,                 4.0,  4.1000000000000005,
                              4.2,                 4.3,                 4.4,
                              4.5,  4.6000000000000005,                 4.7,
                4.800000000000001,                 4.9],
             dtype='float64')

pd.Index(np.arange(3,5,0.5))

Float64Index([3.0, 3.5, 4.0, 4.5], dtype='float64')

This definitely comes from NumPy:

np.arange(3,5,0.1)[5]

3.5000000000000004

np.arange(3,5,0.1)[5] == 3.5

False
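To check which of the original labels are actually present in each candidate index, a quick diagnostic along these lines may help. It is only a sketch: the names old_labels, from_three, and from_zero are mine, not from the question. Index.isin compares exactly, whereas np.isclose compares within a tolerance:

```python
import numpy as np
import pandas as pd

old_labels = pd.Index(np.arange(3, 5, 0.5))  # 3.0, 3.5, 4.0, 4.5
from_three = np.arange(3, 5, 0.1)            # new labels, starting at 3
from_zero = np.arange(0, 5, 0.1)             # new labels, starting at 0

# Exact membership test: one boolean per original label.
found_from_three = old_labels.isin(from_three)
found_from_zero = old_labels.isin(from_zero)
print(found_from_three)
print(found_from_zero)

# Tolerance-based comparison of the problematic element.
close_match = np.isclose(from_three[5], 3.5)
print(close_match)
```

With the values printed above, only 3.0 should be found in the index that starts at 3, while all four labels should be found in the one that starts at 0.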

This behavior is documented in the NumPy arange docs:

https://numpy.org/doc/stable/reference/generated/numpy.arange.html

The length of the output might not be numerically stable.

Another stability issue is due to the internal implementation of numpy.arange. The actual step value used to populate the array is dtype(start + step) - dtype(start) and not step. Precision loss can occur here, due to casting or due to using floating points when start is much larger than step. This can lead to unexpected behaviour.
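The arithmetic in that note can be reproduced by hand. The sketch below assumes that element i is filled as start + i * effective_step; that is my reading of the note, not something the quoted text states outright, and the variable names are illustrative:

```python
import numpy as np

start, step = 3.0, 0.1

# Per the quoted note, the step actually used is (start + step) - start,
# which is not the same float as step because 3.1 cannot be stored exactly.
effective_step = (start + step) - start

# Assumed fill rule: element i is start + i * effective_step.
by_hand = start + 5 * effective_step

print(repr(effective_step))
print(repr(by_hand))
print(by_hand == np.arange(3, 5, 0.1)[5])
```

The rounding error in the effective step is multiplied by the element position, which is why the labels drift further from the intended values toward the end of the array.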

It looks like np.linspace can help here:

pd.Index(np.linspace(3,5,num=21))

Float64Index([3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2,
              4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0],
             dtype='float64')
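Note that np.linspace includes the endpoint 5.0, unlike np.arange. Two other options are sketched below, in case they suit the original DataFrame better. Both are suggestions of mine rather than part of the answer above, and the tolerance 1e-9 is an arbitrary choice. The first builds the labels from integers and divides once; the second keeps the float arange but lets reindex match within a tolerance:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(3, 5, 0.5), 'B': np.arange(3, 5, 0.5)})
df = df.set_index('A')

# Option 1: integer arange scaled once, so no rounding error accumulates.
exact_index = np.arange(30, 50) / 10
by_integers = df.reindex(exact_index)

# Option 2: match each new label to the nearest old label, but only if it
# lies within the tolerance; everything else stays NaN.
by_tolerance = df.reindex(np.arange(3, 5, 0.1), method='nearest', tolerance=1e-9)

print(by_integers.loc[3.5, 'B'])
print(by_tolerance['B'].notna().sum())
```

With the second option the result keeps the inexact labels (for example 3.5000000000000004), so the first option is probably the better choice if the index is looked up by value later.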