Why does .loc assignment with two sets of brackets result in NaN in a pandas.DataFrame?
I have a DataFrame:
| | name | age |
|---|---|---|
| 0 | Paul | 25 |
| 1 | John | 27 |
| 2 | Bill | 23 |
I know that if I type:
df[['name']] = df[['age']]
I get the following:
| | name | age |
|---|---|---|
| 0 | 25 | 25 |
| 1 | 27 | 27 |
| 2 | 23 | 23 |
But I expected the same result from this command:
df.loc[:, ['name']] = df.loc[:, ['age']]
However, what I get is:
| | name | age |
|---|---|---|
| 0 | NaN | 25 |
| 1 | NaN | 27 |
| 2 | NaN | 23 |
For some reason, if I omit the square brackets [] around the column names, I get the result I expected. That is, the command:
df.loc[:, 'name'] = df.loc[:, 'age']
gives the correct result:
| | name | age |
|---|---|---|
| 0 | 25 | 25 |
| 1 | 27 | 27 |
| 2 | 23 | 23 |
Why do two pairs of brackets with .loc result in NaN? Is this a bug or intended behavior? I can't figure out the reason for it.
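That's because for loc assignment all index axes are aligned, including the columns. Since the labels age and name do not match, there is no data to assign, hence the NaNs.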
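One way to picture this alignment is to reindex the right-hand side to the column labels of the left-hand side. The sketch below only illustrates the effect with the public `reindex` method; it is not meant as a description of the exact internal code path pandas takes.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Paul", "John", "Bill"], "age": [25, 27, 23]})

rhs = df.loc[:, ["age"]]                 # one-column DataFrame labelled "age"
aligned = rhs.reindex(columns=["name"])  # ask for a column it does not have

print(aligned)
# "age" holds no data under the label "name", so every value is missing.
print(aligned["name"].isna().all())
```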
You can make it work by renaming the column:
df.loc[:, ["name"]] = df.loc[:, ["age"]].rename(columns={"age": "name"})
or by accessing the underlying NumPy array:
df.loc[:, ["name"]] = df.loc[:, ["age"]].values
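The two workarounds can be compared side by side with the failing assignment. This is a minimal sketch: `make_df` is a hypothetical helper, and forcing `dtype=object` on the `name` column is an assumption made here so that the column can hold integers or NaN regardless of the pandas version. It uses `.to_numpy()`, which serves the same purpose as `.values` here.

```python
import pandas as pd

def make_df():
    # dtype=object is an assumption of this sketch, not part of the question.
    return pd.DataFrame({
        "name": pd.Series(["Paul", "John", "Bill"], dtype=object),
        "age": [25, 27, 23],
    })

# Label-aligned assignment: the right-hand side has no column called "name".
aligned = make_df()
aligned.loc[:, ["name"]] = aligned.loc[:, ["age"]]

# Workaround 1: give the right-hand side the label the left-hand side expects.
renamed = make_df()
renamed.loc[:, ["name"]] = renamed.loc[:, ["age"]].rename(columns={"age": "name"})

# Workaround 2: drop the labels entirely by passing a NumPy array.
raw = make_df()
raw.loc[:, ["name"]] = raw.loc[:, ["age"]].to_numpy()

print(aligned, renamed, raw, sep="\n\n")
```

Renaming keeps label alignment active, so it still protects against mismatched row labels. The array version assigns purely by position.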
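When you use double brackets [[]], you are assigning a DataFrame. What you want is to assign a (column) Series, and for that you use single brackets [].
Here is some code: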
import pandas as pd
# Double brackets: both sides are DataFrames
df = pd.DataFrame({'name':['Paul','John','Bill'], 'age':[25,27,23]})
print('Initial Dataframe:\n',df)
df[['name']] = df[['age']]
print("\ndf[['name']] = df[['age']]\n",df)
print("df.loc[:, ['age']]:", type(df.loc[:, ['age']]))
print("df.loc[:, ['name']]:", type(df.loc[:, ['name']]))
df.loc[:, ['name']] = df.loc[:, ['age']]
print("\ndf.loc[:, ['name']] = df.loc[:, ['age']]\n",df)
print('=======================')
# Single brackets: both sides are Series
df = pd.DataFrame({'name':['Paul','John','Bill'], 'age':[25,27,23]})
print('Initial Dataframe:\n',df)
print("type(df.loc[:, 'age']):", type(df.loc[:, 'age']))
print("type(df.loc[:, 'name']):", type(df.loc[:, 'name']))
df.loc[:, 'name'] = df.loc[:, 'age']
print("\ndf.loc[:, 'name'] = df.loc[:, 'age']\n",df)
And the output:
Initial Dataframe:
name age
0 Paul 25
1 John 27
2 Bill 23
df[['name']] = df[['age']]
name age
0 25 25
1 27 27
2 23 23
df.loc[:, ['age']]: <class 'pandas.core.frame.DataFrame'>
df.loc[:, ['name']]: <class 'pandas.core.frame.DataFrame'>
df.loc[:, ['name']] = df.loc[:, ['age']]
name age
0 NaN 25.0
1 NaN 27.0
2 NaN 23.0
=======================
Initial Dataframe:
name age
0 Paul 25
1 John 27
2 Bill 23
type(df.loc[:, 'age']): <class 'pandas.core.series.Series'>
type(df.loc[:, 'name']): <class 'pandas.core.series.Series'>
df.loc[:, 'name'] = df.loc[:, 'age']
name age
0 25 25
1 27 27
2 23 23
However, here is another odd behavior: assign the double-bracket selections to separate variables, say df1 and df2, and then df1 = df2 works!
Here is more code:
df = pd.DataFrame({'name':['Paul','John','Bill'], 'age':[25,27,23]})
print('Initial Dataframe:\n',df)
df1 = df.loc[:, ['name']]
df2 = df.loc[:, ['age']]
print("\ndf1 = df.loc[:, ['name']]\n",df1)
print("\ndf2 = df.loc[:, ['age']]\n",df2)
df1=df2
print("\ndf1=df2\ndf1:\n",df1)
And the output:
Initial Dataframe:
name age
0 Paul 25
1 John 27
2 Bill 23
df1 = df.loc[:, ['name']]
name
0 Paul
1 John
2 Bill
df2 = df.loc[:, ['age']]
age
0 25
1 27
2 23
df1=df2
df1:
age
0 25
1 27
2 23
From the docs, Pandas Data Alignment (emphasis mine):
pandas aligns all AXES when setting Series and DataFrame from .loc, and .iloc.
You can find this excerpt, marked as a warning, under the Basics header.
They illustrate it with an example:
In [9]: df[['A', 'B']]
Out[9]:
A B
2000-01-01 -0.282863 0.469112
2000-01-02 -0.173215 1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771 0.721555
2000-01-05 0.567020 -0.424972
2000-01-06 0.113648 -0.673690
2000-01-07 0.577046 0.404705
2000-01-08 -1.157892 -0.370647
In [10]: df.loc[:, ['B', 'A']] = df[['A', 'B']]
In [11]: df[['A', 'B']]
Out[11]:
A B
2000-01-01 -0.282863 0.469112
2000-01-02 -0.173215 1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771 0.721555
2000-01-05 0.567020 -0.424972
2000-01-06 0.113648 -0.673690
2000-01-07 0.577046 0.404705
2000-01-08 -1.157892 -0.370647
From the docs (emphasis mine):
This will not modify df because the column alignment is before value assignment.
To explicitly avoid automatic alignment:
Accessing the array can be useful when you need to do some operation without the index (to disable automatic alignment, for example).
Alignment comes into play when both the LHS and the RHS are DataFrames. To avoid it, try:
df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()
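The difference between the label-aligned and the array-based assignment can be checked on a small frame. The column names `A` and `B` follow the documentation example; the values are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

# Label-aligned: each column is matched back to its own label, so nothing changes.
aligned = df.copy()
aligned.loc[:, ["B", "A"]] = aligned[["A", "B"]]

# Positional: the array carries no labels, so its first column goes to "B"
# and its second column goes to "A".
swapped = df.copy()
swapped.loc[:, ["B", "A"]] = swapped[["A", "B"]].to_numpy()

print(aligned)
print(swapped)
```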
You have two cases at hand:
1. .loc assignment with a pd.DataFrame.
2. .loc assignment with a pd.Series (in the edit).
.loc assignment with pd.DataFrame
A pd.DataFrame has 2 axes, index and columns. So when you do
df.loc[:, ['name']] = df.loc[:, ['age']]
the LHS column name does not align with the RHS column age, so you get all NaN after the assignment.
From the docs, Data Alignment (emphasis mine):
Data alignment between DataFrame objects automatically align on both the columns and the index (row labels). Again, the resulting object will have the union of the column and row labels.
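The "union of the column and row labels" can be made visible with `DataFrame.align`, which returns both operands reindexed to a common set of labels. This is an illustrative sketch on the question's data, not a trace of what `.loc` does internally.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Paul", "John", "Bill"], "age": [25, 27, 23]})

lhs = df[["name"]]
rhs = df[["age"]]

# With the default outer join, both results carry the union of the labels.
lhs_aligned, rhs_aligned = lhs.align(rhs)

print(lhs_aligned)  # the "age" column is entirely missing here
print(rhs_aligned)  # the "name" column is entirely missing here
```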
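You can find this behavior in most, if not all, pandas operations, for example addition, subtraction and multiplication. Indexes and columns that do not match are filled with NaN.
An example from Data alignment and arithmetic: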
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=["A", "B", "C"])
df + df2
A B C D
0 0.045691 -0.014138 1.380871 NaN
1 -0.955398 -1.501007 0.037181 NaN
2 -0.662690 1.534833 -0.859691 NaN
3 -2.452949 1.237274 -0.133712 NaN
4 1.414490 1.951676 -2.320422 NaN
5 -0.494922 -1.649727 -1.084601 NaN
6 -1.047551 -0.748572 -0.805479 NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
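A repeatable variant of the same experiment is sketched below. The seed is arbitrary, and `DataFrame.add` with `fill_value=0` is shown only as one possible way to treat a missing operand as zero when NaN is not what you want.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
df = pd.DataFrame(rng.standard_normal((10, 4)), columns=["A", "B", "C", "D"])
df2 = pd.DataFrame(rng.standard_normal((7, 3)), columns=["A", "B", "C"])

total = df + df2                    # unmatched rows and columns become NaN
filled = df.add(df2, fill_value=0)  # a missing operand is treated as 0

print(total.isna().sum())   # NaN count per column
print(filled.isna().sum())
```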
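To answer your question:
But why do column indexes need to match? I can see why one want row indexes to match, but why column indexes?
Take the example above: if the columns were not aligned, how would you add two DataFrames? Aligning them on both the columns and the index makes sense.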
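.loc assignment with pd.Series
A pd.Series has only one axis, the index. That is why it works when you do
df.loc[:, 'name'] = df.loc[:, 'age']
Since a pd.Series has only one axis, pandas tries to align on the index, and it succeeds. Of course, if the indexes do not align, the result is NaN values.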
From the docs, Series Alignment (emphasis mine):
The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing NaN.
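A short illustration of that quote, using two made-up Series that share only the labels `b` and `c`:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

total = s1 + s2  # the result is indexed by the union a, b, c, d

print(total)
```

Only `b` and `c` exist on both sides, so only those two positions carry a sum; `a` and `d` come out as NaN.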