Why does .loc assignment with two sets of brackets result in NaN in a pandas.DataFrame?

I have a DataFrame:

name age
0 Paul 25
1 John 27
2 Bill 23

I know that if I run:

df[['name']] = df[['age']]

I get the following:

name age
0 25 25
1 27 27
2 23 23

But I expected the same result from this command:

df.loc[:, ['name']] = df.loc[:, ['age']]

However, what I get is:

name age
0 NaN 25
1 NaN 27
2 NaN 23

For some reason, if I omit the square brackets [] around the column names, I get what I expected. That is, the command:

df.loc[:, 'name'] = df.loc[:, 'age']

gives the correct result:

name age
0 25 25
1 27 27
2 23 23

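Why do the two pairs of brackets with .loc result in NaN? Is this some sort of bug, or is it intended behaviour? I cannot figure out the reason for it.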

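That is because for a .loc assignment all index axes are aligned, including the columns. Since the labels age and name do not match, there is no data to assign, hence the NaNs.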

You can make it work by renaming the column:

df.loc[:, ["name"]] = df.loc[:, ["age"]].rename(columns={"age": "name"})

or by accessing the underlying numpy array:

df.loc[:, ["name"]] = df.loc[:, ["age"]].values
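As a quick check, here is a minimal sketch that applies both workarounds to the frame from the question and prints the resulting name column. The helper make_df and the names by_rename and by_values are made up for this illustration, and the name column is created with dtype=object (an assumption, not part of the question) so that it can hold integers on any pandas version.

```python
import pandas as pd

def make_df():
    # The frame from the question; dtype=object is an assumption so that
    # the name column can hold integers regardless of the pandas version.
    names = pd.Series(["Paul", "John", "Bill"], dtype=object)
    return pd.DataFrame({"name": names, "age": [25, 27, 23]})

# Workaround 1: rename the RHS column so its label matches the LHS label.
by_rename = make_df()
by_rename.loc[:, ["name"]] = by_rename.loc[:, ["age"]].rename(columns={"age": "name"})

# Workaround 2: pass a bare array, which carries no labels to align on.
by_values = make_df()
by_values.loc[:, ["name"]] = by_values.loc[:, ["age"]].values

print(by_rename["name"].tolist())
print(by_values["name"].tolist())
```

Both routes end up assigning the age values; the first keeps label alignment and makes the labels agree, the second bypasses alignment entirely.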

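When you use double brackets [[]], you are assigning a DataFrame. What you want is to assign a (column) Series, and for that you use a single pair of brackets [].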

Here is some code:

import pandas as pd
df = pd.DataFrame({'name':['Paul','John','Bill'], 'age':[25,27,23]})
print('Inital Dataframe:\n',df)

df[['name']] = df[['age']]
print("\ndf[['name']] = df[['age']]\n",df)

print("df.loc[:, ['age']]:", type(df.loc[:, ['age']]))
print("df.loc[:, ['name']]:", type(df.loc[:, ['name']]))
df.loc[:, ['name']] = df.loc[:, ['age']]
print("\ndf.loc[:, ['name']] = df.loc[:, ['age']]\n",df)
    
print('=======================')
df = pd.DataFrame({'name':['Paul','John','Bill'], 'age':[25,27,23]})
print('Inital Dataframe:\n',df)

print("type(df.loc[:, 'age']):", type(df.loc[:, 'age']))
print("type(df.loc[:, 'name']):", type(df.loc[:, 'name']))
df.loc[:, 'name'] = df.loc[:, 'age']
print("\ndf.loc[:, 'name'] = df.loc[:, 'age']\n",df)

And the output:

Inital Dataframe:
    name  age
0  Paul   25
1  John   27
2  Bill   23

df[['name']] = df[['age']]
    name  age
0    25   25
1    27   27
2    23   23
df.loc[:, ['age']]: <class 'pandas.core.frame.DataFrame'>
df.loc[:, ['name']]: <class 'pandas.core.frame.DataFrame'>

df.loc[:, ['name']] = df.loc[:, ['age']]
    name   age
0   NaN  25.0
1   NaN  27.0
2   NaN  23.0
=======================
Inital Dataframe:
    name  age
0  Paul   25
1  John   27
2  Bill   23
type(df.loc[:, 'age']): <class 'pandas.core.series.Series'>
type(df.loc[:, 'name']): <class 'pandas.core.series.Series'>

df.loc[:, 'name'] = df.loc[:, 'age']
    name  age
0    25   25
1    27   27
2    23   23

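However, here is another seemingly odd behaviour: store the double-bracket selections in separate variables, say df1 and df2, and then df1 = df2 works! Here is more code: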

df = pd.DataFrame({'name':['Paul','John','Bill'], 'age':[25,27,23]})
print('Inital Dataframe:\n',df)

df1 = df.loc[:, ['name']]
df2 = df.loc[:, ['age']]
print("\ndf1 = df.loc[:, ['name']]\n",df1)
print("\ndf2 = df.loc[:, ['age']]\n",df2)

df1=df2
print("\ndf1=df2\ndf1:\n",df1)

And the output:

Inital Dataframe:
    name  age
0  Paul   25
1  John   27
2  Bill   23

df1 = df.loc[:, ['name']]
    name
0  Paul
1  John
2  Bill

df2 = df.loc[:, ['age']]
    age
0   25
1   27
2   23

df1=df2
df1:
    age
0   25
1   27
2   23
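It is worth noting why df1 = df2 appears to work: a plain = on a bare name is ordinary Python name rebinding, so pandas never gets a chance to align anything and the original df is not modified. A minimal sketch of that, using the same frame:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Paul", "John", "Bill"], "age": [25, 27, 23]})

df1 = df.loc[:, ["name"]]
df2 = df.loc[:, ["age"]]

# Plain `=` on a bare name rebinds that name; it never calls into pandas.
df1 = df2

print(df1 is df2)           # both names now refer to the same object
print(df["name"].tolist())  # the original frame still holds the names
```

So this case is not comparable to df.loc[...] = ..., which writes into an existing object and is where alignment applies.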

From the pandas docs on data alignment (emphasis mine):

pandas aligns all AXES when setting Series and DataFrame from .loc, and .iloc.

You can find this excerpt, marked as a warning, under the Basics header. They illustrate it with an example.

In [9]: df[['A', 'B']]
Out[9]: 
                   A         B
2000-01-01 -0.282863  0.469112
2000-01-02 -0.173215  1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771  0.721555
2000-01-05  0.567020 -0.424972
2000-01-06  0.113648 -0.673690
2000-01-07  0.577046  0.404705
2000-01-08 -1.157892 -0.370647

In [10]: df.loc[:, ['B', 'A']] = df[['A', 'B']]

In [11]: df[['A', 'B']]
Out[11]: 
                   A         B
2000-01-01 -0.282863  0.469112
2000-01-02 -0.173215  1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771  0.721555
2000-01-05  0.567020 -0.424972
2000-01-06  0.113648 -0.673690
2000-01-07  0.577046  0.404705
2000-01-08 -1.157892 -0.370647

From the docs (emphasis mine):

This will not modify df because the column alignment is before value assignment.

Explicitly avoiding automatic alignment

Accessing the array can be useful when you need to do some operation without the index (to disable automatic alignment, for example).

Alignment kicks in when both the LHS and the RHS are DataFrames. To avoid alignment, try using:

df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()
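To make the docs example reproducible, here is a small sketch with fixed numbers instead of random data (the values and the names before and unchanged are made up for illustration). It first shows that the label-aligned assignment leaves the frame as it was, then that the to_numpy() version swaps the two columns.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})
before = df.copy()

# Labels are aligned first, so A is written back to A and B back to B.
df.loc[:, ["B", "A"]] = df[["A", "B"]]
unchanged = df.equals(before)

# An array has no labels, so values are assigned by position:
# the first array column (old A) goes to B, the second (old B) goes to A.
df.loc[:, ["B", "A"]] = df[["A", "B"]].to_numpy()

print(unchanged)
print(df)
```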

You have two cases at hand:

  • .loc assignment with a pd.DataFrame.
  • .loc assignment with a pd.Series (in your edit).

.loc assignment with a pd.DataFrame

A pd.DataFrame has two axes, index and columns. So, when you do

df.loc[:, ['name']] = df.loc[:, ['age']]

the column name on the LHS does not align with the column age on the RHS, so you get all NaN after the assignment.
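One way to picture the alignment step is as a reindex of the RHS to the labels selected on the LHS. The sketch below is only a rough model of that idea, not pandas' internal code path, and the names rhs and aligned are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Paul", "John", "Bill"], "age": [25, 27, 23]})

rhs = df.loc[:, ["age"]]

# Roughly what alignment does to the RHS: reshape it to the LHS labels.
# The RHS has no column called "name", so every cell comes out missing.
aligned = rhs.reindex(index=df.index, columns=["name"])
print(aligned)

# The actual assignment from the question gives the same all-NaN column.
df.loc[:, ["name"]] = rhs
print(df)
```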

From the docs, Data Alignment (emphasis mine):

Data alignment between DataFrame objects automatically align on both the columns and the index (row labels). Again, the resulting object will have the union of the column and row labels.

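You can find this behaviour in most, if not all, pandas operations, for example addition, subtraction and multiplication. Index and column labels that do not match are filled with NaN.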

Example from Data alignment and arithmetic:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=["A", "B", "C"])

df + df2 

         A         B         C   D
0  0.045691 -0.014138  1.380871 NaN
1 -0.955398 -1.501007  0.037181 NaN
2 -0.662690  1.534833 -0.859691 NaN
3 -2.452949  1.237274 -0.133712 NaN
4  1.414490  1.951676 -2.320422 NaN
5 -0.494922 -1.649727 -1.084601 NaN
6 -1.047551 -0.748572 -0.805479 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN

To answer your question:

But why do column indexes need to match? I can see why one want row indexes to match, but why column indexes?

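Consider the example above: if the columns were not aligned, how would you add the two DataFrames? It makes sense to align them on both the columns and the index.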


.loc assignment with a pd.Series

A pd.Series has only one axis, the index. That is why it works when you do

df.loc[:, 'name'] = df.loc[:, 'age']

Since a pd.Series has only one axis, pandas tries to align on the index, and it succeeds. Of course, if the index does not align, you get NaN values.
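To illustrate the last point, here is a minimal sketch in which the assigned Series has an index that only partly overlaps the frame's index. The Series shifted and its values are made up, and age is stored as float here (an assumption) so that NaN fits without a dtype change.

```python
import pandas as pd

# age is float so that a missing value can be stored without changing dtype.
df = pd.DataFrame({"name": ["Paul", "John", "Bill"], "age": [25.0, 27.0, 23.0]})

# Index labels 1, 2, 3 versus the frame's 0, 1, 2: only 1 and 2 overlap.
shifted = pd.Series([100.0, 200.0, 300.0], index=[1, 2, 3])

# Values are matched by index label, not by position.
df.loc[:, "age"] = shifted
print(df)
```

Row 0 has no matching label in shifted, so it ends up as NaN, while the value at label 3 is dropped because the frame has no such row.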

From the docs, Series Alignment (emphasis mine):

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing NaN.
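A small sketch of that union behaviour, with made-up Series s1 and s2 whose indexes overlap only on "b" and "c":

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# The result index is the union of both indexes; labels present in only
# one of the two Series ("a" and "d") come out as NaN.
total = s1 + s2
print(total)
```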