Concat Python dataframes based on unique rows
My dataframes look like this:
df1
user_id username firstname lastname
123 abc abc abc
456 def def def
789 ghi ghi ghi
df2
user_id username firstname lastname
111 xyz xyz xyz
456 def def def
234 mnp mnp mnp
Now I want an output dataframe like this:
user_id username firstname lastname
123 abc abc abc
456 def def def
789 ghi ghi ghi
111 xyz xyz xyz
234 mnp mnp mnp
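user_id 456 is common to both dataframes, so it should appear only once. I have tried groupby on user_id, groupby(['user_id']), but it looks like groupby needs to be followed by some aggregation, which I don't want here.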
Use concat + drop_duplicates:
df = pd.concat([df1, df2]).drop_duplicates('user_id').reset_index(drop=True)
print (df)
user_id username firstname lastname
0 123 abc abc abc
1 456 def def def
2 789 ghi ghi ghi
3 111 xyz xyz xyz
4 234 mnp mnp mnp
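For anyone who wants to reproduce this, here is a minimal self-contained sketch. The way df1 and df2 are built is an assumption based on the tables in the question. keep="first" is the default and is spelled out only to make it explicit that the row coming from df1 is the one that survives.

```python
import pandas as pd

cols = ["user_id", "username", "firstname", "lastname"]

# Sample frames rebuilt from the tables in the question (construction is assumed).
df1 = pd.DataFrame([[123, "abc", "abc", "abc"],
                    [456, "def", "def", "def"],
                    [789, "ghi", "ghi", "ghi"]], columns=cols)
df2 = pd.DataFrame([[111, "xyz", "xyz", "xyz"],
                    [456, "def", "def", "def"],
                    [234, "mnp", "mnp", "mnp"]], columns=cols)

# Stack both frames, then keep the first row seen for each user_id.
df = (pd.concat([df1, df2])
        .drop_duplicates(subset="user_id", keep="first")
        .reset_index(drop=True))
print(df)
```

If the row from df2 should win instead, keep="last" is the switch to try.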
A solution with groupby and aggregating with first is slower:
df = pd.concat([df1, df2]).groupby('user_id', as_index=False, sort=False).first()
print (df)
user_id username firstname lastname
0 123 abc abc abc
1 456 def def def
2 789 ghi ghi ghi
3 111 xyz xyz xyz
4 234 mnp mnp mnp
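One thing worth knowing before swapping one for the other: groupby(...).first() picks the first non-null value per column, while drop_duplicates keeps the first row as a whole. On the question's data the two agree, but they can differ when values are missing. A small sketch with made-up frames (left and right are hypothetical, not the question's data):

```python
import numpy as np
import pandas as pd

# Hypothetical frames: the first row for user_id 456 has a missing lastname.
left = pd.DataFrame({"user_id": [123, 456],
                     "username": ["abc", "def"],
                     "lastname": ["abc", np.nan]})
right = pd.DataFrame({"user_id": [456],
                      "username": ["def2"],
                      "lastname": ["def"]})
both = pd.concat([left, right])

by_drop = both.drop_duplicates("user_id").reset_index(drop=True)
by_first = both.groupby("user_id", as_index=False, sort=False).first()

print(by_drop)   # row for 456 is kept whole, missing lastname included
print(by_first)  # row for 456 is assembled from the first non-null value per column
```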
EDIT:
Another solution with boolean indexing and numpy.in1d:
df = pd.concat([df1, df2[~np.in1d(df2['user_id'], df1['user_id'])]], ignore_index=True)
print (df)
user_id username firstname lastname
0 123 abc abc abc
1 456 def def def
2 789 ghi ghi ghi
3 111 xyz xyz xyz
4 234 mnp mnp mnp
One approach with masking -
def app1(df1, df2):
    # Keep only the df2 rows whose user_id does not appear in df1.
    df20 = df2[~df2.user_id.isin(df1.user_id)]
    return pd.concat([df1, df20], axis=0)
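Two more techniques work on the underlying array data, np.in1d (app2, app3) and np.searchsorted (app4), to get the mask of matches, then stack the two arrays and build the output dataframe from the stacked array data -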
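def app2(df1, df2):
    # np.in1d returns one flag per element of its first argument, so df2's ids
    # go first to make the mask line up with df2.values.
    df20_arr = df2.values[~np.in1d(df2.user_id.values, df1.user_id.values)]
    arr = np.vstack((df1.values, df20_arr))
    df_out = pd.DataFrame(arr, columns=df1.columns)
    return df_out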
def app3(df1, df2):
    a = df1.values
    b = df2.values
    # Column 0 holds user_id; test b's ids against a's so the mask indexes b.
    df20_arr = b[~np.in1d(b[:, 0], a[:, 0])]
    arr = np.vstack((a, df20_arr))
    df_out = pd.DataFrame(arr, columns=df1.columns)
    return df_out
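def app4(df1, df2):
    a = df1.values
    b = df2.values
    b0 = b[:, 0].astype(int)
    as0 = np.sort(a[:, 0].astype(int))
    # searchsorted returns len(as0) for ids larger than every id in df1,
    # so cap the position before using it as an index.
    idx = np.minimum(np.searchsorted(as0, b0), len(as0) - 1)
    df20_arr = b[as0[idx] != b0]
    arr = np.vstack((a, df20_arr))
    df_out = pd.DataFrame(arr, columns=df1.columns)
    return df_out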
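Two details in the array-based versions are easy to get wrong, so here is a small sketch on made-up ids (ids1 and ids2 are hypothetical, not the question's data). The membership test returns one flag per element of its first argument, so the array being filtered has to come first. And np.searchsorted returns len(sorted_array) for a value larger than every element, so the position has to be capped before it is used for a lookup. The sketch uses np.isin, the newer spelling of np.in1d.

```python
import numpy as np

ids1 = np.array([123, 456, 789])        # ids already present (made up)
ids2 = np.array([111, 456, 999, 234])   # candidate ids; 999 exceeds max(ids1)

# One flag per element of the first argument, so the mask lines up with ids2.
dup_isin = np.isin(ids2, ids1)

# Same membership test via binary search on the sorted ids.
s = np.sort(ids1)
pos = np.searchsorted(s, ids2)          # 999 maps to len(s), which is out of bounds
pos = np.minimum(pos, len(s) - 1)       # cap so that s[pos] is always valid
dup_sorted = s[pos] == ids2

print(dup_isin)
print(dup_sorted)
print(ids2[~dup_sorted])                # ids from ids2 that are not in ids1
```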
Timings on the given sample -
In [49]: %timeit app1(df1,df2)
...: %timeit app2(df1,df2)
...: %timeit app3(df1,df2)
...: %timeit app4(df1,df2)
...:
1000 loops, best of 3: 753 µs per loop
10000 loops, best of 3: 192 µs per loop
10000 loops, best of 3: 181 µs per loop
10000 loops, best of 3: 171 µs per loop
# @jezrael's edited solution
In [85]: %timeit pd.concat([df1, df2[~np.in1d(df2['user_id'], df1['user_id'])]], ignore_index=True)
1000 loops, best of 3: 614 µs per loop
It would be interesting to see how these perform on a larger dataset.
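A rough way to check that, as a sketch only: the frame size, the synthetic data, and the helper names (make_frame, via_drop_duplicates, via_isin) are all assumptions for illustration, and the numbers will depend on the machine and on the pandas/NumPy versions, so no results are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50_000  # arbitrary size chosen for illustration

def make_frame(ids):
    # Synthetic frame with the question's columns; the values are made up.
    s = pd.Series(ids).astype(str)
    return pd.DataFrame({"user_id": ids, "username": s, "firstname": s, "lastname": s})

# Ids are unique within each frame, with some overlap between the two frames.
df1 = make_frame(rng.choice(2 * n, size=n, replace=False))
df2 = make_frame(rng.choice(2 * n, size=n, replace=False))

def via_drop_duplicates(a, b):
    return pd.concat([a, b]).drop_duplicates("user_id").reset_index(drop=True)

def via_isin(a, b):
    return pd.concat([a, b[~b.user_id.isin(a.user_id)]], ignore_index=True)

# The two approaches should agree before their timings are compared.
assert via_drop_duplicates(df1, df2).equals(via_isin(df1, df2))

for fn in (via_drop_duplicates, via_isin):
    best = min(timeit.repeat(lambda: fn(df1, df2), number=1, repeat=5))
    print(f"{fn.__name__}: {best * 1e3:.1f} ms")
```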
Another approach is to use np.setdiff1d to find the user_id values that occur only in df2, and isin to select those rows.
pd.concat([df1,df2[df2.user_id.isin(np.setdiff1d(df2.user_id,df1.user_id))]])
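Or use a set to get the unique rows from the combined records of df1 and df2. This one seems to be a few times faster. Note that it deduplicates on whole rows rather than on user_id alone, and that a set does not preserve row order. The set is wrapped in list() because recent NumPy versions require a sequence for np.vstack.
pd.DataFrame(data=np.vstack(list({tuple(row) for row in np.r_[df1.values, df2.values]})), columns=df1.columns)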
Timings:
%timeit pd.concat([df1,df2[df2.user_id.isin(np.setdiff1d(df2.user_id,df1.user_id))]])
1000 loops, best of 3: 2.48 ms per loop
%timeit pd.DataFrame(data=np.vstack({tuple(row) for row in np.r_[df1.values,df2.values]}),columns=df1.columns)
1000 loops, best of 3: 632 µs per loop
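To make the difference between whole-row and key-based deduplication concrete, here is a small sketch on made-up data (the frames are hypothetical; user_id 456 appears in both with different usernames). A set of row tuples behaves like drop_duplicates() with no subset, not like drop_duplicates('user_id').

```python
import pandas as pd

cols = ["user_id", "username"]

# Hypothetical data: same user_id 456, but the usernames differ.
df1 = pd.DataFrame([[123, "abc"], [456, "def"]], columns=cols)
df2 = pd.DataFrame([[456, "DEF"], [234, "mnp"]], columns=cols)
both = pd.concat([df1, df2], ignore_index=True)

unique_rows = {tuple(row) for row in both.values}  # whole-row comparison
by_row = both.drop_duplicates()                    # also whole-row comparison
by_key = both.drop_duplicates("user_id")           # compares user_id only

print(len(unique_rows), len(by_row))  # both rows for 456 survive
print(len(by_key))                    # only the first row for 456 survives
```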
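You can also append df2 to df1 and then call drop_duplicates. append returns a new dataframe instead of modifying df1, so the result has to be assigned. DataFrame.append was removed in pandas 2.0, so on current versions use pd.concat. Without a subset, drop_duplicates compares whole rows; pass 'user_id' to deduplicate on that column only.
df = pd.concat([df1, df2])  # on pandas < 2.0: df = df1.append(df2)
df.drop_duplicates(inplace=True)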