A Lexicographical Bug in Pandas?

Please take this question purely out of curiosity.

While trying to see how slicing works on a MultiIndex, I came across the following situation:

import numpy as np
import pandas as pd

# Simple MultiIndex Creation
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])

# Making Series with that MultiIndex
data = pd.Series(np.random.randint(10, size=6), index=index)

Returns:

a  1    5
   2    0
c  1    8
   2    6
b  1    6
   2    3
dtype: int32

Note that the index is not in sorted order: a, c, b is the order that should produce the error we expect when slicing.

# When we do slicing
data.loc["a":"c"]

This raises an error:

UnsortedIndexError

----> 1 data.loc["a":"c"]
UnsortedIndexError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'

That is expected. But now, after performing the following steps:

# Making a DataFrame
data = data.unstack()

# Reindexing - to unsort the indices like before
data = data.reindex(["a", "c", "b"])

# Which looks like 
   1  2
a  5  0
c  8  6
b  6  3

# Then again making series
data = data.stack()

# Reindex Again!
data = data.reindex(["a", "c", "b"], level=0)


# Which looks like before
a  1    5
   2    0
c  1    8
   2    6
b  1    6
   2    3
dtype: int32

Problem

So the process is now: Series → Unstack → DataFrame → Stack → Series

Now, if I do the same slicing as before (with the index still unsorted), I do not get any error!

# The same slicing
data.loc["a":"c"]

Result, without any error:

a  1    5
   2    0
c  1    8
   2    6
dtype: int32

This happens even though data.index.is_monotonic is False. So why does the slicing still work?

So the question is: WHY?

I hope the situation is clear: the same Series that raised the error before no longer raises it after the unstack and stack operations.

So is this a bug, or a new concept that I am missing here?

Thanks!
阿尤什∞沙阿

UPDATE: I have used data.reindex() to unsort the index once more. Please have a look again.

The difference between your two objects is the following:

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])

data = pd.Series(np.random.randint(10, size=6), index=index)

data2 = data.unstack().reindex(["a", "c", "b"]).stack()

>>> data.index.codes
FrozenList([[0, 0, 2, 2, 1, 1], [0, 1, 0, 1, 0, 1]])

>>> data2.index.codes
FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

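Even though your two indexes have the same appearance (values), their internal representation (codes) is different.

Check the docstring of this MultiIndex method: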
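As a rough illustration of the point above, the sketch below compares the two indexes through their public `levels` and `codes` attributes. It uses fixed values (`np.arange(6)`) instead of random ones, and `first_level_codes_sorted` is a hypothetical helper written for this example, not a pandas function. That the slicing check depends on the order of the codes rather than on the labels is an assumption drawn from the behaviour shown in the question.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["a", "c", "b"], [1, 2]])
data = pd.Series(np.arange(6), index=index)  # fixed values instead of random ones
data2 = data.unstack().reindex(["a", "c", "b"]).stack()

# Both indexes show the same labels in the same order ...
same_labels = data.index.tolist() == data2.index.tolist()

# ... but the first level is stored in a different order in each of them.
levels_1 = list(data.index.levels[0])
levels_2 = list(data2.index.levels[0])


def first_level_codes_sorted(mi):
    """Hypothetical helper: True if the level-0 codes never decrease."""
    codes = np.asarray(mi.codes[0])
    return bool((np.diff(codes) >= 0).all())


print(same_labels)
print(levels_1, first_level_codes_sorted(data.index))
print(levels_2, first_level_codes_sorted(data2.index))
```

The helper only looks at the integer codes, which is why it can disagree with a check based on the labels themselves.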

        Create a new MultiIndex from the current to monotonically sorted
        items IN the levels. This does not actually make the entire MultiIndex
        monotonic, JUST the levels.

        The resulting MultiIndex will have the same outward
        appearance, meaning the same .values and ordering. It will also
        be .equals() to the original.
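The effect described in that docstring can be approximated with public API only. The sketch below rebuilds the index of the stacked Series from its labels with `MultiIndex.from_tuples`, which factorizes the labels again and, for sortable labels like these, ends up with sorted levels. Whether this follows the same code path as the quoted method is an assumption; the names `rebuilt` and `data3` are made up for this example.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["a", "c", "b"], [1, 2]])
data = pd.Series(np.arange(6), index=index)
data2 = data.unstack().reindex(["a", "c", "b"]).stack()

# Rebuild the index from its labels: the levels come out sorted,
# while the visible order of the rows stays a, c, b.
rebuilt = pd.MultiIndex.from_tuples(data2.index.tolist())
data3 = pd.Series(data2.to_numpy(), index=rebuilt)

levels_rebuilt = list(rebuilt.levels[0])
codes_rebuilt = rebuilt.codes[0].tolist()
same_appearance = rebuilt.tolist() == data2.index.tolist()

# With sorted levels the codes are out of order again, so the label
# slice is expected to fail the same way as in the question.
try:
    data3.loc["a":"c"]
    slice_failed = False
except KeyError:  # UnsortedIndexError is a subclass of KeyError
    slice_failed = True

print(levels_rebuilt, codes_rebuilt, same_appearance, slice_failed)
```

The rows look identical before and after rebuilding; only the internal levels and codes change, and with them the outcome of the slice.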

Old answer

# Making a DataFrame
data = data.unstack()

# Which looks like         # <- WRONG
   1  2                    #    1  2
a  5  0                    # a  8  0
c  8  6                    # b  4  1
b  6  3                    # c  7  6

# Then again making series
data = data.stack()

# Which looks like before  # <- WRONG
a  1    5                  # a  1    2
   2    0                  #    2    1
c  1    8                  # b  1    0
   2    6                  #    2    1
b  1    6                  # c  1    3
   2    3                  #    2    9
dtype: int32

If you want to use slicing, you have to check whether the index is monotonic:

# Simple MultiIndex Creation
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])

# Making Series with that MultiIndex
data = pd.Series(np.random.randint(10, size=6), index=index)

>>> data.index.is_monotonic
False

>>> data.unstack().stack().index.is_monotonic
True

>>> data.sort_index().index.is_monotonic
True