Using fillna with two multi-index dataframes throws InvalidIndexError
I have two dataframes like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'key1': list('ABAACCA'),
'key2': list('1675987'),
'prop1': list('xyzuynb'),
'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])
df2 = pd.DataFrame({
'key1': list('ABCCADD'),
'key2': list('1598787'),
'prop1': [np.nan] * 7,
'prop2': [np.nan] * 7
}).set_index(['key1', 'key2'])
          prop1 prop2
key1 key2
A    1        x     m
B    6        y     n
A    7        z     b
     5        u     b
C    9        y     b
     8        n     a
A    7        b     s

          prop1 prop2
key1 key2
A    1      NaN   NaN
B    5      NaN   NaN
C    9      NaN   NaN
     8      NaN   NaN
A    7      NaN   NaN
D    8      NaN   NaN
     7      NaN   NaN
I now want to use df1 to fill df2, using
df2.fillna(df1)
However, I get

site-packages/pandas/core/generic.py in _where(self, cond, other, inplace, axis, level, errors, try_cast)
   8694                 other._get_axis(i).equals(ax) for i, ax in enumerate(self.axes)
   8695             ):
-> 8696                 raise InvalidIndexError
   8697
   8698             # slice me out of the other

InvalidIndexError:
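I have used this approach successfully in the past, so I don't understand why it fails here. Any ideas on how to make it work?
EDIT
Here is a very similar example that works fine: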
filler1 = pd.DataFrame({
'key': list('AAABCCDD'),
'prop1': list('xyzuyasj'),
'prop2': list('mnbbbqwo')
})
tobefilled1 = pd.DataFrame({
'key': list('AAABBCACDF'),
'keep_me': ['stuff'] * 10,
'prop1': [np.nan] * 10,
'prop2': [np.nan] * 10,
})
filler1['g'] = filler1.groupby('key').cumcount()
tobefilled1['g'] = tobefilled1.groupby('key').cumcount()
filler1 = filler1.set_index(['key', 'g'])
tobefilled1 = tobefilled1.set_index(['key', 'g'])
print(tobefilled1.fillna(filler1))
prints

      keep_me prop1 prop2
key g
A   0   stuff     x     m
    1   stuff     y     n
    2   stuff     z     b
B   0   stuff     u     b
    1   stuff   NaN   NaN
C   0   stuff     y     b
A   3   stuff   NaN   NaN
C   1   stuff     a     q
D   0   stuff     s     w
F   0   stuff   NaN   NaN
The problem here is that some index values do not match. For me, DataFrame.combine_first works:
df = df2.combine_first(df1)
print (df)
          prop1 prop2
key1 key2
A    1        x     m
     5        u     b
     7        z     b
     7        b     s
B    5      NaN   NaN
     6        y     n
C    8        n     a
     9        y     b
D    7      NaN   NaN
     8      NaN   NaN
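Note that combine_first aligns on the union of both indexes, which is why the result above also contains rows such as (A, 5) and (B, 6) that are not in df2. The following is a minimal sketch on a smaller frame with a unique index; the names target and source are made up for illustration. It shows the union behaviour and one possible way to restrict the result back to the rows of the frame being filled, using reindex:

```python
import numpy as np
import pandas as pd

# Hypothetical small frames with unique MultiIndexes (not from the question).
target = pd.DataFrame(
    {"prop1": [np.nan, np.nan]},
    index=pd.MultiIndex.from_tuples([("A", "1"), ("B", "5")], names=["key1", "key2"]),
)
source = pd.DataFrame(
    {"prop1": ["x", "y"]},
    index=pd.MultiIndex.from_tuples([("A", "1"), ("B", "6")], names=["key1", "key2"]),
)

# combine_first keeps every key found in either frame.
combined = target.combine_first(source)
print(combined)

# Keep only the keys that were in target to begin with.
restricted = combined.reindex(target.index)
print(restricted)
```

The reindex step assumes a unique index, so it would not apply directly to the df1 from the question.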
The problem here is the duplicated index defined in df1:
df1 = pd.DataFrame({
'key1': list('ABAACCA'),
'key2': list('1675987'),
'prop1': list('xyzuynb'),
'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])
Note: key1=A, key2=7 appears twice, so the index of df1 is not unique.
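A quick way to confirm this, sketched here with the standard Index.is_unique and Index.duplicated calls (the variable name dupes is just for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({
    'key1': list('ABAACCA'),
    'key2': list('1675987'),
    'prop1': list('xyzuynb'),
    'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])

# False when at least one (key1, key2) pair occurs more than once.
print(df1.index.is_unique)

# keep=False marks every occurrence of a duplicated key, not just the repeats.
dupes = df1[df1.index.duplicated(keep=False)]
print(dupes)
```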
Let's change the second A7 to A9:
df1 = pd.DataFrame({
'key1': list('ABAACCA'),
'key2': list('1675989'),
'prop1': list('xyzuynb'),
'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])
df2 = pd.DataFrame({
'key1': list('ABCCADD'),
'key2': list('1598787'),
'prop1': [np.nan] * 7,
'prop2': [np.nan] * 7
}).set_index(['key1', 'key2'])
This gives df1 a unique index. Now try df.fillna:
df2.fillna(df1)
Output:

          prop1 prop2
key1 key2
A    1        x     m
B    5      NaN   NaN
C    9        y     b
     8        n     a
A    7        z     b
D    8      NaN   NaN
     7      NaN   NaN
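Editing the key by hand is only for demonstration. If the repeated rows in the real data can be discarded, one possible alternative (an assumption on my part, since it silently drops the second A7 row) is to keep only the first occurrence of each key before filling:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'key1': list('ABAACCA'),
    'key2': list('1675987'),
    'prop1': list('xyzuynb'),
    'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])
df2 = pd.DataFrame({
    'key1': list('ABCCADD'),
    'key2': list('1598787'),
    'prop1': [np.nan] * 7,
    'prop2': [np.nan] * 7
}).set_index(['key1', 'key2'])

# Drop later repeats of each (key1, key2) pair; the first A7 row (z, b) survives.
df1_unique = df1[~df1.index.duplicated(keep='first')]

result = df2.fillna(df1_unique)
print(result)
```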
I got the hint while trying the reindex_like method. First, with the unique index:
df1 = pd.DataFrame({
'key1': list('ABAACCA'),
'key2': list('1675989'),
'prop1': list('xyzuynb'),
'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])
df2 = pd.DataFrame({
'key1': list('ABCCADD'),
'key2': list('1598787'),
'prop1': [np.nan] * 7,
'prop2': [np.nan] * 7
}).set_index(['key1', 'key2'])
print(df1.reindex_like(df2))
Output:

          prop1 prop2
key1 key2
A    1        x     m
B    5      NaN   NaN
C    9        y     b
     8        n     a
A    7        z     b
D    8      NaN   NaN
     7      NaN   NaN
Now, let's revert to the original dataframes from the post:
df1 = pd.DataFrame({
'key1': list('ABAACCA'),
'key2': list('1675987'),
'prop1': list('xyzuynb'),
'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])
df2 = pd.DataFrame({
'key1': list('ABCCADD'),
'key2': list('1598787'),
'prop1': [np.nan] * 7,
'prop2': [np.nan] * 7
}).set_index(['key1', 'key2'])
print(df1.reindex_like(df2))
This raises a ValueError:

ValueError: cannot handle a non-unique multi-index!
Another workaround is to make the index unique by adding another index level with cumcount.
df1 = pd.DataFrame({
'key1': list('ABAACCA'),
'key2': list('1675987'),
'prop1': list('xyzuynb'),
'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])
df2 = pd.DataFrame({
'key1': list('ABCCADD'),
'key2': list('1598787'),
'prop1': [np.nan] * 7,
'prop2': [np.nan] * 7
}).set_index(['key1', 'key2'])
df1 = df1.set_index(df1.groupby(df1.index).cumcount(), append=True)
df2 = df2.set_index(df2.groupby(df2.index).cumcount(), append=True)
df2.fillna(df1)
Output:

            prop1 prop2
key1 key2
A    1    0     x     m
B    5    0   NaN   NaN
C    9    0     y     b
     8    0     n     a
A    7    0     z     b
D    8    0   NaN   NaN
     7    0   NaN   NaN
Then you can drop index level 2:
df2.fillna(df1).reset_index(level=2, drop=True)
Output:

          prop1 prop2
key1 key2
A    1        x     m
B    5      NaN   NaN
C    9        y     b
     8        n     a
A    7        z     b
D    8      NaN   NaN
     7      NaN   NaN
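The cumcount steps above could be wrapped in a small helper. This is only a sketch: add_occurrence_level and the level name "occurrence" are hypothetical, and it groups by index level rather than by df.index, which should be equivalent here:

```python
import numpy as np
import pandas as pd

def add_occurrence_level(df, name="occurrence"):
    """Append a counter level so repeated keys become (key..., 0), (key..., 1), ..."""
    levels = list(range(df.index.nlevels))
    counter = df.groupby(level=levels).cumcount().rename(name)
    return df.set_index(counter, append=True)

df1 = pd.DataFrame({
    'key1': list('ABAACCA'),
    'key2': list('1675987'),
    'prop1': list('xyzuynb'),
    'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])
df2 = pd.DataFrame({
    'key1': list('ABCCADD'),
    'key2': list('1598787'),
    'prop1': [np.nan] * 7,
    'prop2': [np.nan] * 7
}).set_index(['key1', 'key2'])

# Fill on the three-level unique index, then drop the helper level again.
filled = (
    add_occurrence_level(df2)
    .fillna(add_occurrence_level(df1))
    .droplevel("occurrence")
)
print(filled)
```

With this pairing, the n-th occurrence of a key in df2 is filled from the n-th occurrence of the same key in df1, so the second A7 row of df1 (b, s) is unused because df2 contains A7 only once.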
However, I think pandas should give a better error message for fillna with a non-unique MultiIndex, as reindex_like does.
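Until then, a thin wrapper can produce a clearer message. The sketch below is not a pandas API: fillna_from is a hypothetical helper, and checking both frames for uniqueness is a deliberately conservative choice.

```python
import numpy as np
import pandas as pd

def fillna_from(target, filler):
    """Hypothetical guard around DataFrame.fillna that names the duplicated keys."""
    for label, frame in (("target", target), ("filler", filler)):
        if not frame.index.is_unique:
            dupes = frame.index[frame.index.duplicated()].unique().tolist()
            raise ValueError(
                f"{label} has a non-unique index; duplicated keys: {dupes}"
            )
    return target.fillna(filler)

df1 = pd.DataFrame({
    'key1': list('ABAACCA'),
    'key2': list('1675987'),
    'prop1': list('xyzuynb'),
    'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])
df2 = pd.DataFrame({
    'key1': list('ABCCADD'),
    'key2': list('1598787'),
    'prop1': [np.nan] * 7,
    'prop2': [np.nan] * 7
}).set_index(['key1', 'key2'])

# df1 contains (A, 7) twice, so the guard raises before fillna is attempted.
try:
    fillna_from(df2, df1)
    message = None
except ValueError as exc:
    message = str(exc)
print(message)
```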