How do I use combine_first to merge specific columns of two DataFrames into one, matching rows on a key column?
After mapping on column c:
if column A has a value, insert the value from column A; if not, insert column B.
data1              data2
a    b    c        a    c    d
a1   b1   c1       1a   c1   1d
     b2   c2       2a   c2   2d
a3        c3       3a   c3   3d
                   4a   c4   4d
The result I want:
result
a    b    c
a1   b1   c1
2a   b2   c2
a3        c3
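I tried the following, but I am not satisfied with the results.
>>> result = data1.merge(data2, on=['c'])
Suffixes _x and _y are created, and combine_first is not applied.
>>> result = data1.combine_first(data2)
The rows are not matched on column c.
How can I get the desired result?
I would appreciate your help.
Thanks.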
I'm not 100% clear on how you indexed your DataFrames (data1 and data2), but if you index them on column 'c' it should work.
This is how I created your data:
import pandas as pd
data1 = pd.DataFrame({'a': ['a1', None, 'a3'],
                      'b': ['b1', 'b2', None],
                      'c': ['c1', 'c2', 'c3']})
data2 = pd.DataFrame({'a': ['1a', '2a', '3a', '4a'],
                      'c': ['c1', 'c2', 'c3', 'c4'],
                      'd': ['1d', '2d', '3d', '4d']})
Then I set the index of both to column 'c':
data1 = data1.set_index('c')
data2 = data2.set_index('c')
Then I used combine_first, just as you did:
data_combined = data1.combine_first(data2)
I got:
      a     b   d
c
c1   a1    b1  1d
c2   2a    b2  2d
c3   a3  None  3d
c4   4a   NaN  4d
I'm not sure why you don't want the row with index 'c4' or the column 'd', but dropping them is easy:
data_combined = data_combined.drop('d', axis=1)
data_combined = data_combined.loc[data_combined.index != 'c4']
Then I would do some reordering to get the result you want:
data_combined = data_combined.reset_index()
data_combined = data_combined[['a', 'b', 'c']]
data_combined = data_combined.fillna('')
    a   b   c
0  a1  b1  c1
1  2a  b2  c2
2  a3      c3
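The extra 'c4' row and 'd' column show up because combine_first aligns the two frames on the union of their indexes and columns. As a minimal sketch (not part of the original answer), one way to skip the drop step is to trim data2 to the rows and columns data1 already has before combining. The names shared_cols, shared_rows, and data2_trimmed are illustrative only.

```python
import pandas as pd

data1 = pd.DataFrame({'a': ['a1', None, 'a3'],
                      'b': ['b1', 'b2', None],
                      'c': ['c1', 'c2', 'c3']}).set_index('c')
data2 = pd.DataFrame({'a': ['1a', '2a', '3a', '4a'],
                      'c': ['c1', 'c2', 'c3', 'c4'],
                      'd': ['1d', '2d', '3d', '4d']}).set_index('c')

# Keep only the columns and index labels of data2 that data1 also has,
# so the union taken by combine_first adds nothing new.
shared_cols = data2.columns.intersection(data1.columns)
shared_rows = data2.index.intersection(data1.index)
data2_trimmed = data2.loc[shared_rows, shared_cols]

# Values from data1 win; gaps are filled from the trimmed data2.
result = data1.combine_first(data2_trimmed).reset_index()[['a', 'b', 'c']]
print(result)
```

The missing value in column 'b' is left as a missing value here; a final fillna('') would blank it out as in the steps above.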
You can also try it this way:
# set indexes
data1 = data1.set_index('c')
data2 = data2.set_index('c')
# join data on indexes
datax = data1.join(data2.drop('d', axis=1), rsuffix='_rr').reset_index()
# fill missing value in column a
datax['a'] = datax['a'].fillna(datax['a_rr'])
# drop unwanted columns
datax.drop('a_rr', axis=1, inplace=True)
# fill missing values with blank spaces
datax.fillna('', inplace=True)
# output
    a   b   c
0  a1  b1  c1
1  2a  b2  c2
2  a3      c3
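# data used (missing cells are None so that fillna can detect them;
# an empty string '' would not be treated as missing)
data1 = pd.DataFrame({'a': ['a1', None, 'a3'],
                      'b': ['b1', 'b2', None],
                      'c': ['c1', 'c2', 'c3']})
data2 = pd.DataFrame({'a': ['1a', '2a', '3a', '4a'],
                      'c': ['c1', 'c2', 'c3', 'c4'],
                      'd': ['1d', '2d', '3d', '4d']})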
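One caveat with the join approach: fillna only fills missing values (None/NaN), not empty strings. If the blank cells in data1 are stored as '', the fill step leaves them untouched. The following is a rough sketch under that assumption, converting empty strings to NaN before joining; the names left and right are illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: blank cells in data1 are stored as empty strings.
data1 = pd.DataFrame({'a': ['a1', '', 'a3'],
                      'b': ['b1', 'b2', ''],
                      'c': ['c1', 'c2', 'c3']})
data2 = pd.DataFrame({'a': ['1a', '2a', '3a', '4a'],
                      'c': ['c1', 'c2', 'c3', 'c4'],
                      'd': ['1d', '2d', '3d', '4d']})

# Mark empty strings as missing so that fillna can see them.
left = data1.replace('', np.nan).set_index('c')
# Only column 'a' is needed from data2.
right = data2.set_index('c')[['a']]

# Left join on the index; data2's 'a' arrives as 'a_rr'.
datax = left.join(right, rsuffix='_rr').reset_index()
datax['a'] = datax['a'].fillna(datax['a_rr'])

# Drop the helper column, blank out remaining gaps, restore column order.
datax = datax.drop(columns='a_rr').fillna('')[['a', 'b', 'c']]
print(datax)
```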
Using @IdoS's setup:
import pandas as pd
data1 = pd.DataFrame({'a': ['a1', None, 'a3'],
                      'b': ['b1', 'b2', None],
                      'c': ['c1', 'c2', 'c3']})
data2 = pd.DataFrame({'a': ['1a', '2a', '3a', '4a'],
                      'c': ['c1', 'c2', 'c3', 'c4'],
                      'd': ['1d', '2d', '3d', '4d']})
You can use set_index, combine_first, and reindex:
df_out = data1.set_index('c').combine_first(data2.set_index('c'))\
.reindex(data1.c)\
.reset_index()
df_out
Output:
    c   a     b   d
0  c1  a1    b1  1d
1  c2  2a    b2  2d
2  c3  a3  None  3d
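Since the question started from merge, here is a hedged sketch of that route as well: a left merge on 'c' with explicit suffixes, followed by Series.combine_first on the two 'a' columns. This is an alternative to the answers above, not a restatement of them, and the '_other' suffix is just an illustrative choice.

```python
import pandas as pd

data1 = pd.DataFrame({'a': ['a1', None, 'a3'],
                      'b': ['b1', 'b2', None],
                      'c': ['c1', 'c2', 'c3']})
data2 = pd.DataFrame({'a': ['1a', '2a', '3a', '4a'],
                      'c': ['c1', 'c2', 'c3', 'c4'],
                      'd': ['1d', '2d', '3d', '4d']})

# A left merge keeps only data1's rows; only 'a' is brought over from data2.
merged = data1.merge(data2[['c', 'a']], on='c', how='left',
                     suffixes=('', '_other'))

# Prefer data1's 'a'; fall back to data2's value where it is missing.
merged['a'] = merged['a'].combine_first(merged['a_other'])

# Drop the helper column and blank out the remaining gaps.
result = merged.drop(columns='a_other').fillna('')
print(result)
```

Because the merge is a left join, the 'c4' row never appears, and selecting only ['c', 'a'] from data2 keeps column 'd' out of the result.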