Slice pandas dataframe using .loc with both index values and multiple column values, then set values

Question

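I have a dataframe, and I would like to select a subset of it using both index values and column values. I can do each of these separately, but I cannot figure out the syntax to do them simultaneously. Example: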

import pandas as pd

# sample dataframe:
cid=[1,2,3,4,5,6,17,18,91,104]
c1=[1,2,3,1,2,3,3,4,1,3]
c2=[0,0,0,0,1,1,1,1,0,1]

df=pd.DataFrame(list(zip(c1,c2)),columns=['col1','col2'],index=cid)
df

Returns:

    col1    col2
1   1   0
2   2   0
3   3   0
4   1   0
5   2   1
6   3   1
17  3   1
18  4   1
91  1   0
104 3   1

Using .loc, I can select by index:

rel_index=[5,6,17]
relc1=[2,3]
relc2=[1]
df.loc[rel_index]

Returns:

    col1    col2
5   2   1
6   3   1
17  3   1

Or I can select by column values:

df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]

Returns:

    col1    col2
5   2   1
6   3   1
17  3   1
104 3   1

However, I cannot do both. When I try the following:

df.loc[rel_index,df['col1'].isin(relc1) & df['col2'].isin(relc2)]

Returns:

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

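I have tried a few other variations (such as "&" instead of ","), but these return the same error or other errors.

Once I have this slice, I want to reassign values on the main dataframe. I imagine this will be trivial once the above is done, but I note it here in case it is not. My goal is to assign something like df2 below: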

c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)

to the slice referenced by the index and multiple-column conditions (overwriting what is in the original dataframe).

Answer 1

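The IndexingError occurs because you are calling df.loc with two indexers of different sizes. In df.loc[rows, columns], the position after the comma selects columns, so a row-wise boolean Series placed there cannot be aligned with the column labels.

df.loc[rel_index] has a length of 3, whereas df['col1'].isin(relc1) has a length of 10.

You need the index condition to also have a length of 10. If you look at the output of df['col1'].isin(relc1), it is a boolean Series.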
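To make the size mismatch concrete, here is a minimal sketch that rebuilds the question's dataframe and inspects both indexers. The name col_mask is only illustrative and does not appear in the question.

```python
import pandas as pd

# Rebuild the sample dataframe from the question.
cid = [1, 2, 3, 4, 5, 6, 17, 18, 91, 104]
c1 = [1, 2, 3, 1, 2, 3, 3, 4, 1, 3]
c2 = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
df = pd.DataFrame({'col1': c1, 'col2': c2}, index=cid)

rel_index = [5, 6, 17]
relc1 = [2, 3]

# Boolean Series with one entry per row of df, labelled by df's row index.
col_mask = df['col1'].isin(relc1)

print(len(df.loc[rel_index]))  # number of rows selected by label
print(len(col_mask))           # number of entries in the boolean mask
print(col_mask)
```

Because the mask is labelled by the row index, it can only be used in the row position of .loc, or combined with other row-wise masks of the same length.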

You can get a boolean array of the appropriate length by replacing df.loc[rel_index] with df.index.isin([5,6,17]).

So you end up with:

df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)]

Which returns:

    col1  col2
5      2     1
6      3     1
17     3     1
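The same selection can be sketched step by step, which also shows where a column selector would go. The names row_mask, col_mask, mask, selected and selected_col1 are illustrative only.

```python
import pandas as pd

cid = [1, 2, 3, 4, 5, 6, 17, 18, 91, 104]
c1 = [1, 2, 3, 1, 2, 3, 3, 4, 1, 3]
c2 = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
df = pd.DataFrame({'col1': c1, 'col2': c2}, index=cid)

rel_index = [5, 6, 17]
relc1 = [2, 3]
relc2 = [1]

# Row condition on the index labels: a boolean array with one entry per row.
row_mask = df.index.isin(rel_index)
# Row condition on the column values: a boolean Series with one entry per row.
col_mask = df['col1'].isin(relc1) & df['col2'].isin(relc2)

# All conditions describe rows, so they are combined with & into a single mask.
mask = row_mask & col_mask

selected = df.loc[mask]               # every column of the matching rows
selected_col1 = df.loc[mask, 'col1']  # after the comma: which column(s) to keep
print(selected)
```

The comma in .loc separates the row indexer from the column indexer, which is why the row conditions are joined with & rather than split across the comma.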

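That said, I'm not sure why your index looks like this. Typically, when slicing by row position you would use df.iloc, and your index would follow the 0, 1, 2, ... format.

Alternatively, you can first filter by value, then assign the resulting dataframe to a new variable df2: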
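For comparison, a small sketch of the difference between label-based and position-based selection on this dataframe. The positions 4, 5, 6 are simply where the labels 5, 6, 17 happen to sit in the sample data.

```python
import pandas as pd

cid = [1, 2, 3, 4, 5, 6, 17, 18, 91, 104]
c1 = [1, 2, 3, 1, 2, 3, 3, 4, 1, 3]
c2 = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
df = pd.DataFrame({'col1': c1, 'col2': c2}, index=cid)

# .loc looks rows up by their index labels.
by_label = df.loc[[5, 6, 17]]
# .iloc looks rows up by integer position, counting from 0.
by_position = df.iloc[[4, 5, 6]]

print(by_label)
print(by_position)
```

With a non-contiguous index such as this one, labels and positions differ, so .loc is the natural choice when the identifiers in rel_index are meaningful IDs.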

df2 = df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]

Then df2.loc[rel_index] will work without issue.

As for your overall goal, you can simply do the following:
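One caveat with the two-step route: selecting with a list of labels raises a KeyError if any of those labels were removed by the value filter. A hedged sketch of a more forgiving version follows; the names filtered, wanted and subset are illustrative, and the extra label 91 is a made-up example of a row that fails the column filter.

```python
import pandas as pd

cid = [1, 2, 3, 4, 5, 6, 17, 18, 91, 104]
c1 = [1, 2, 3, 1, 2, 3, 3, 4, 1, 3]
c2 = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
df = pd.DataFrame({'col1': c1, 'col2': c2}, index=cid)

relc1 = [2, 3]
relc2 = [1]

# Step 1: filter by column values.
filtered = df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]

# Step 2: keep only the wanted labels that survived step 1.
# Label 91 is hypothetical here: it exists in df but not in filtered.
wanted = [5, 6, 17, 91]
subset = filtered[filtered.index.isin(wanted)]
print(subset)
```

Note that filtered and subset are new objects, so assigning to them would not change df; to overwrite values in df, the mask has to be applied to df itself.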

c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)

df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)] = df2
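A quick way to check the assignment is to run it on a copy and inspect the affected rows. The sketch below does that and also shows a variant that assigns raw values through .to_numpy(), which matches rows by position instead of by label. The names mask, aligned and positional are illustrative only.

```python
import pandas as pd

cid = [1, 2, 3, 4, 5, 6, 17, 18, 91, 104]
c1 = [1, 2, 3, 1, 2, 3, 3, 4, 1, 3]
c2 = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
df = pd.DataFrame({'col1': c1, 'col2': c2}, index=cid)

rel_index = [5, 6, 17]
relc1 = [2, 3]
relc2 = [1]
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [5, 6, 7]}, index=rel_index)

mask = df.index.isin(rel_index) & df['col1'].isin(relc1) & df['col2'].isin(relc2)

# Variant 1: assign the DataFrame; pandas aligns df2 to the selected labels.
aligned = df.copy()
aligned.loc[mask] = df2

# Variant 2: assign a plain array; rows are matched by position, so the
# shape of the array must equal the shape of the selection.
positional = df.copy()
positional.loc[mask, ['col1', 'col2']] = df2.to_numpy()

print(aligned.loc[rel_index])
print(positional.loc[rel_index])
```

Label alignment is safer when df2 might be ordered differently from the selection; the array variant assumes the orders already agree.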

Answer 2

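@Rexovas explained it well; here is an alternative in which you compute the filter on the index before assigning. It is a bit long and involves a MultiIndex, but it should be intuitive once you are familiar with MultiIndex: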

(df
# move columns into the index
.set_index(['col1', 'col2'], append = True)
# filter based on the index
.loc(axis = 0)[rel_index, relc1, relc2]
# return cols 1 and 2
.reset_index(level = [-2, -1])
# assign values
.assign(col1 = c3, col2 = c4)
)

    col1  col2
5      1     5
6      2     6
17     3     7
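The chain above returns a new dataframe and leaves df itself untouched. If the goal is to overwrite the original, one possible follow-up is to write the result back by label. This is a sketch under that assumption; the names result and before are illustrative and not part of the answer.

```python
import pandas as pd

cid = [1, 2, 3, 4, 5, 6, 17, 18, 91, 104]
c1 = [1, 2, 3, 1, 2, 3, 3, 4, 1, 3]
c2 = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
df = pd.DataFrame({'col1': c1, 'col2': c2}, index=cid)

rel_index = [5, 6, 17]
relc1 = [2, 3]
relc2 = [1]
c3 = [1, 2, 3]
c4 = [5, 6, 7]

result = (
    df
    # move the columns into the index, next to the existing row labels
    .set_index(['col1', 'col2'], append=True)
    # filter on all three index levels at once
    .loc(axis=0)[rel_index, relc1, relc2]
    # turn col1 and col2 back into ordinary columns
    .reset_index(level=[-2, -1])
    # replace their values
    .assign(col1=c3, col2=c4)
)

# df has not been modified by the chain.
before = df.loc[rel_index].values.tolist()

# Write the new values back into df, matching rows and columns by label.
df.loc[result.index, result.columns] = result
print(df)
```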

python

pandas

dataframe

slice