pandas version 0.16.0: after changing the DataFrame index, all values become NaN

Question

I am using the IPython notebook and following the pandas cookbook examples for version 0.16.0. I ran into trouble on page 237. I created a DataFrame like this:

from pandas import *
data1=DataFrame({'AAA':[4,5,6,7],'BBB':[10,20,30,40],'CCC':[100,50,-30,-50]})

Then I did this to try to change the index:

df=DataFrame(data=data1,index=(['a','b','c','d']))

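But what I get is a DataFrame in which every value is NaN! Does anyone know why this happens and how to fix it? I also tried the set_index function, but it gave me errors.

Thanks a lot!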

Answer 1

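If you want to change the index, assign directly to the index (reindex also changes the index, but it aligns on the existing labels rather than relabelling the rows):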

In [5]:

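import pandas as pd

data1 = pd.DataFrame({'AAA': [4, 5, 6, 7], 'BBB': [10, 20, 30, 40], 'CCC': [100, 50, -30, -50]})
print(data1)
df = pd.DataFrame(data=data1)    # no index passed, so nothing is reindexed
df.index = ['a', 'b', 'c', 'd']  # relabel the rows directly
df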
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50
Out[5]:
   AAA  BBB  CCC
a    4   10  100
b    5   20   50
c    6   30  -30
d    7   40  -50

I don't know whether this is a bug, but it works if you do the following:

In [7]:

df=pd.DataFrame(data=data1.values,index=(['a','b','c','d']))
df
Out[7]:
   0   1    2
a  4  10  100
b  5  20   50
c  6  30  -30
d  7  40  -50

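So if you pass the underlying values rather than the df itself, the constructor does not try to align the data with the passed index. Note that the column names are lost this way, as the output above shows.

Edit

After stepping through the code, the problem is that the constructor reindexes the df using the passed index. We can reproduce this behaviour as follows: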

In [46]:

data1 = pd.DataFrame({'AAA':[4,5,6,7],'BBB':[10,20,30,40],'CCC':[100,50,-30,-50]})
# reindex_axis was removed in pandas 1.0; on newer versions use data1.reindex(list('abcd'))
data1.reindex_axis(list('abcd'))
Out[46]:
   AAA  BBB  CCC
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c  NaN  NaN  NaN
d  NaN  NaN  NaN
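
To make the alignment visible, here is a minimal sketch (assuming a recent pandas where reindex is the available spelling; the names df and partial are only illustrative). Labels that already exist keep their rows, and only the unknown label is filled with NaN:

```python
import pandas as pd

data1 = pd.DataFrame({'AAA': [4, 5, 6, 7],
                      'BBB': [10, 20, 30, 40],
                      'CCC': [100, 50, -30, -50]})

# Give the rows string labels first, so that some requested labels can match.
df = data1.copy()
df.index = list('abcd')

# 'a' and 'b' exist in df.index, 'x' does not.
partial = df.reindex(['a', 'b', 'x'])

# Rows 'a' and 'b' keep their values; row 'x' is entirely NaN.
print(partial)
```

In the question, none of 'a' to 'd' exist in the default 0 to 3 index, which is why every row comes back as NaN.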

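This happens because the df constructor detects that the data is an instance of BlockManager and tries to construct a df from it.

Stepping through the code, I see it reach this point in frame.py: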

        if isinstance(data, BlockManager):
            mgr = self._init_mgr(data, axes=dict(index=index, columns=columns),
                                 dtype=dtype, copy=copy)

and then end up in generic.py:

    def _init_mgr(self, mgr, axes=None, dtype=None, copy=False):
        """ passed a manager and a axes dict """
        for a, axe in axes.items():
            if axe is not None:
                mgr = mgr.reindex_axis(
                    axe, axis=self._get_block_manager_axis(a), copy=False)

An issue about this has been posted.

Update: this is the expected behaviour. If you pass an index, the constructor reindexes the passed-in df with that index. From @Jeff:

This is the defined behavior, to reindex the provided input to the passed index and/or columns.

See the related issue.
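
A small sketch of that defined behaviour, assuming a recent pandas (the names kept and lost are only illustrative): when the index passed to the constructor contains labels that exist in the input df, those rows survive, and when none of them exist, everything becomes NaN.

```python
import pandas as pd

data1 = pd.DataFrame({'AAA': [4, 5, 6, 7],
                      'BBB': [10, 20, 30, 40],
                      'CCC': [100, 50, -30, -50]})

# Labels 2 and 3 exist in data1.index, so the constructor selects those rows.
kept = pd.DataFrame(data=data1, index=[2, 3])

# None of 'a'..'d' exist in data1.index, so every value is NaN.
lost = pd.DataFrame(data=data1, index=list('abcd'))

print(kept)
print(lost)
```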

Answer 2

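EdChum is entirely right to suggest reindex, but I think what is happening here is that when you pass a DataFrame as the data argument, the constructor uses the entire existing DataFrame, index included, when creating the new DataFrame.

If you want to accomplish what you were trying to do, you need to explicitly give the DataFrame class the actual data (not data wrapped in another DataFrame). You can use data1.values to do this. You also have to explicitly give the class the column names, so the whole thing looks like this: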

In [1]: pd.DataFrame(data=data1.values,columns=data1.columns,index=(['a','b','c','d']))

Out[1]: 
   AAA  BBB  CCC
a    4   10  100
b    5   20   50
c    6   30  -30
d    7   40  -50
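
For comparison, a hedged sketch of the two routes side by side (the names relabelled and via_values are only illustrative). Copying and assigning the index leaves data1 untouched. Going through data1.values is equivalent here because every column is an integer; on a frame with mixed column types, .values would return a single object-dtype array and the per-column dtypes would be lost.

```python
import pandas as pd

data1 = pd.DataFrame({'AAA': [4, 5, 6, 7],
                      'BBB': [10, 20, 30, 40],
                      'CCC': [100, 50, -30, -50]})

# Route 1: copy the frame, then relabel the rows on the copy.
relabelled = data1.copy()
relabelled.index = list('abcd')

# Route 2: rebuild from the raw values, passing columns and index explicitly.
via_values = pd.DataFrame(data=data1.values,
                          columns=data1.columns,
                          index=list('abcd'))

print(relabelled.equals(via_values))
```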

Answer 3

also tried to use set_index function, and it gave me errors.

Why does that happen? set_index is meant to set the index from one or more existing columns. So data1.set_index('a') raises a KeyError because a is not a column in data1, whereas data1.set_index('AAA') produces

     BBB  CCC
AAA          
4     10  100
5     20   50
6     30  -30
7     40  -50
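
The sketch below reproduces both cases. It also shows, as an assumption that holds on recent pandas but that I have not checked against 0.16.0, that set_index accepts an Index object of the right length when the new labels are not a column. The names by_column, by_labels and raised are only illustrative.

```python
import pandas as pd

data1 = pd.DataFrame({'AAA': [4, 5, 6, 7],
                      'BBB': [10, 20, 30, 40],
                      'CCC': [100, 50, -30, -50]})

# An existing column becomes the index and is dropped from the columns.
by_column = data1.set_index('AAA')

# 'a' is not a column, so set_index raises KeyError.
raised = False
try:
    data1.set_index('a')
except KeyError as exc:
    raised = True
    print('KeyError:', exc)

# Assumption: recent pandas also accepts an Index of new labels here.
by_labels = data1.set_index(pd.Index(list('abcd')))
print(by_labels)
```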

The other two answers cover the rest of the question.

python

ipython

pandas