Use a multidimensional index on a MultiIndex pandas dataframe?

Question

I have a MultiIndex pandas dataframe that looks like this (called p_z):

                     p_z
entry subentry
0     0         0.338738
      1         0.636035
      2        -0.307365
      3        -0.167779
      4         0.243284
...                  ...
26692 891      -0.459227
      892       0.055993
      893      -0.469857
      894       0.192554
      895       0.155738

[11742280 rows x 1 columns]

I want to select certain rows based on another multidimensional dataframe (or numpy array). As a pandas dataframe (called tofpid), it looks like this:

                tofpid
entry subentry
0     0              0
      1              2
      2              4
      3              5
      4              7
...                ...
26692 193          649
      194          670
      195          690
      196          725
      197          737

[2006548 rows x 1 columns]

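I also have it as an awkward array, which is a (26692, ) array (each entry has a non-standard number of subentries). This is a selection df/array that tells the p_z df which rows to keep. So in entry 0 of p_z, it should keep subentries 0, 2, 4, 5, 7, etc.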

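I can't find a way to do this in pandas. I'm new to pandas, and even newer to MultiIndex, but I feel there should be a way to do it. If it can be broadcast, even better, since I'll be doing this over roughly 1500 dataframes of similar size. If it helps, these dataframes come from *.root files imported with uproot (if there is another way to do this without pandas, I'll take it; but I'd love to use pandas to keep things organized).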

Edit: Here is a reproducible example (courtesy of Jim Pavinski's answer; thanks!).

>>> import awkward as ak
>>> import pandas as pd
>>>
>>> p_z = ak.Array([[ 0.338738, 0.636035, -0.307365, -0.167779, 0.243284,
...                   0.338738, 0.636035],
...                 [-0.459227, 0.055993, -0.469857,  0.192554, 0.155738,
...                  -0.459227]])
>>> p_z = ak.to_pandas(p_z)
>>> tofpid = ak.Array([[0, 2, 4, 5], [1, 2, 4]])
>>> tofpid = ak.to_pandas(tofpid)

Both dataframes are generated natively in uproot, but this reproduces the same dataframes that uproot would produce (using the awkward library).

Answer 1

IIUC:

Input data:

>>> p_z
                     p_z
entry subentry
0     0         0.338738
      1         0.636035
      2        -0.307365
      3        -0.167779
      4         0.243284

>>> tofpid
                tofpid
entry subentry
0     0              0
      1              2
      2              4
      3              5
      4              7

Create a new MultiIndex from the (entry, tofpid) columns of the second dataframe:

mi = pd.MultiIndex.from_frame(tofpid.reset_index(level='subentry', drop=True)
                                    .reset_index())

Output:

>>> p_z.loc[mi.intersection(p_z.index)]
              p_z
entry
0     0  0.338738
      2 -0.307365
      4  0.243284
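
For reference, here is a minimal self-contained sketch of the same steps applied to the two-entry example from the question's edit. It builds the frames with plain pandas instead of awkward; jagged_to_frame is a hypothetical helper written only for this illustration (it is not part of pandas, awkward, or uproot), and the frames it builds are assumed to match what uproot produces.

```python
import pandas as pd

def jagged_to_frame(nested, column):
    # Hypothetical helper: flatten nested lists into an
    # entry/subentry-indexed DataFrame with a single column.
    tuples = [(i, j) for i, row in enumerate(nested) for j in range(len(row))]
    index = pd.MultiIndex.from_tuples(tuples, names=["entry", "subentry"])
    flat = [x for row in nested for x in row]
    return pd.DataFrame({column: flat}, index=index)

p_z = jagged_to_frame(
    [[0.338738, 0.636035, -0.307365, -0.167779, 0.243284, 0.338738, 0.636035],
     [-0.459227, 0.055993, -0.469857, 0.192554, 0.155738, -0.459227]],
    "p_z",
)
tofpid = jagged_to_frame([[0, 2, 4, 5], [1, 2, 4]], "tofpid")

# Drop the old subentry level, then use (entry, tofpid) as the lookup index.
mi = pd.MultiIndex.from_frame(
    tofpid.reset_index(level="subentry", drop=True).reset_index()
)

# The intersection keeps only the pairs that actually exist in p_z.
selected = p_z.loc[mi.intersection(p_z.index)]
print(selected)
```

The intersection is what makes the lookup tolerant of tofpid values that have no matching subentry in p_z, as with subentries 5 and 7 in the truncated input shown above.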

Answer 2

Here's a reproducible example with enough structure to represent the problem (using the awkward library):

>>> import awkward as ak
>>> 
>>> p_z = ak.Array([
...     [ 0.338738, 0.636035, -0.307365, -0.167779, 0.243284,  0.338738, 0.636035],
...     [-0.459227, 0.055993, -0.469857,  0.192554, 0.155738, -0.459227],
... ])
>>> p_z
<Array [[0.339, 0.636, ... 0.156, -0.459]] type='2 * var * float64'>
>>> 
>>> tofpid = ak.Array([[0, 2, 4, 5], [1, 2, 4]])
>>> tofpid
<Array [[0, 2, 4, 5], [1, 2, 4]] type='2 * var * int64'>

In Pandas form, this is:

>>> df_p_z = ak.to_pandas(p_z)
>>> df_p_z
                  values
entry subentry          
0     0         0.338738
      1         0.636035
      2        -0.307365
      3        -0.167779
      4         0.243284
      5         0.338738
      6         0.636035
1     0        -0.459227
      1         0.055993
      2        -0.469857
      3         0.192554
      4         0.155738
      5        -0.459227
>>> df_tofpid = ak.to_pandas(tofpid)
>>> df_tofpid
                values
entry subentry        
0     0              0
      1              2
      2              4
      3              5
1     0              1
      1              2
      2              4

As an Awkward Array, what you want to do is slice the first array by the second. That is, you want p_z[tofpid]:

>>> p_z[tofpid]
<Array [[0.339, -0.307, ... -0.47, 0.156]] type='2 * var * float64'>
>>> p_z[tofpid].tolist()
[[0.338738, -0.307365, 0.243284, 0.338738], [0.055993, -0.469857, 0.155738]]
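
To make explicit what this jagged slice computes, here is a rough pure-Python equivalent using nested lists in place of Awkward Arrays. It is only an illustration of the semantics (for each entry, pick the subentries listed in tofpid), not how awkward implements it.

```python
p_z = [
    [0.338738, 0.636035, -0.307365, -0.167779, 0.243284, 0.338738, 0.636035],
    [-0.459227, 0.055993, -0.469857, 0.192554, 0.155738, -0.459227],
]
tofpid = [[0, 2, 4, 5], [1, 2, 4]]

# For each entry, keep only the subentries whose positions appear in tofpid.
selected = [[row[i] for i in picks] for row, picks in zip(p_z, tofpid)]
print(selected)
```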

With Pandas, I managed to do it with:

>>> df_p_z.loc[df_tofpid.reset_index(level=0).apply(lambda x: tuple(x.values), axis=1).tolist()]
                  values
entry subentry          
0     0         0.338738
      2        -0.307365
      4         0.243284
      5         0.338738
1     1         0.055993
      2        -0.469857
      4         0.155738

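What's happening here is that df_tofpid.reset_index(level=0) turns the "entry" part of the MultiIndex into a column. Then apply, with axis=1, executes a Python function on each row (x.values holds that row's entry and tofpid value), and tolist() turns the result into a list of tuples, like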

>>> df_tofpid.reset_index(level=0).apply(lambda x: tuple(x.values), axis=1).tolist()
[(0, 0), (0, 2), (0, 4), (0, 5), (1, 1), (1, 2), (1, 4)]

That is what loc needs in order to select entry/subentry pairs from its MultiIndex.

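My Pandas solution has two drawbacks: it's complicated, and it goes through Python iteration and objects, so it doesn't scale as well as arrays do. It's quite likely that a Pandas expert would find a better solution than mine. There's a lot I don't know about Pandas.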
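
One possible way to avoid the per-row Python function, sketched here under a few assumptions: the key pairs are built directly with pd.MultiIndex.from_arrays from the "entry" index level and the tofpid values. The two frames are rebuilt with plain pandas and are assumed to match the ak.to_pandas output shown above; the sketch also assumes every (entry, tofpid) pair exists in df_p_z.

```python
import pandas as pd

names = ["entry", "subentry"]

# Rebuild the example frames (assumed equivalent to the ak.to_pandas output).
df_p_z = pd.DataFrame(
    {"values": [0.338738, 0.636035, -0.307365, -0.167779, 0.243284,
                0.338738, 0.636035,
                -0.459227, 0.055993, -0.469857, 0.192554, 0.155738,
                -0.459227]},
    index=pd.MultiIndex.from_arrays(
        [[0] * 7 + [1] * 6, list(range(7)) + list(range(6))], names=names
    ),
)
df_tofpid = pd.DataFrame(
    {"values": [0, 2, 4, 5, 1, 2, 4]},
    index=pd.MultiIndex.from_arrays(
        [[0] * 4 + [1] * 3, [0, 1, 2, 3, 0, 1, 2]], names=names
    ),
)

# Pair each row's entry label with its tofpid value as whole arrays,
# instead of calling a Python function once per row.
keys = pd.MultiIndex.from_arrays(
    [df_tofpid.index.get_level_values("entry"), df_tofpid["values"].to_numpy()],
    names=names,
)

result = df_p_z.loc[keys]
print(result)
```

If some pairs might be missing from df_p_z, intersecting keys with df_p_z.index first, as Answer 1 does, is one way to guard the lookup.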

Tags: python, multi-index, pandas, uproot, awkward-array