Pandas Query Multiple Dataframes

Question

I have two DataFrames. The first has contract ID numbers and names. The second has contract IDs and transaction types. The first DataFrame is

contract id	first name	last name
1	John	Smith
2	Rob	Brown
3	Rob	Brown

The second DataFrame is

contract id	transaction
1	cash
1	cash
1	cash
2	bank transfer
2	bank transfer
2	bank transfer
3	cash

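I want to count how many people have used only one transaction type. In the example there are two people: the first paid only in cash, while the second used both bank transfer and cash. So the answer is 1.

The DataFrames are large, so joining them is not feasible. What other options are there?

Data: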

df1:

{'contract id': [1, 2, 3],
 'first name': ['John', 'Rob', 'Rob'],
 'last name': ['Smith', 'Brown', 'Brown']}

df2:

{'contract id': [1, 1, 1, 2, 2, 2, 3],
 'transaction': ['cash', 'cash', 'cash', 'bank transfer',
                 'bank transfer', 'bank transfer', 'cash']}

Answer 1

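You can create a column for the full name in df1, then map the names onto the contract IDs in df2. If df2 contains many duplicate rows, it may be worth calling drop_duplicates first. Then use value_counts + eq + sum on the name column to count how many people made only a single type of transaction: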

mapping = df1.assign(name=df1['first name'] + ' ' + df1['last name']).set_index('contract id')['name']
df2 = df2.drop_duplicates().copy()
df2['name'] = df2['contract id'].map(mapping)
out = df2[['name', 'transaction']].drop_duplicates()['name'].value_counts().eq(1).sum()
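
For reference, here is a self-contained sketch of the mapping approach run on the sample data from the question. The variable names `pairs`, `types_per_person` and `single_type_count` are only illustrative. It assumes that a person is identified by first name plus last name, and it deduplicates on (name, transaction) pairs so that each name is counted once per transaction type.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({'contract id': [1, 2, 3],
                    'first name': ['John', 'Rob', 'Rob'],
                    'last name': ['Smith', 'Brown', 'Brown']})
df2 = pd.DataFrame({'contract id': [1, 1, 1, 2, 2, 2, 3],
                    'transaction': ['cash', 'cash', 'cash', 'bank transfer',
                                    'bank transfer', 'bank transfer', 'cash']})

# Series indexed by contract id whose values are the full names.
mapping = (df1.assign(name=df1['first name'] + ' ' + df1['last name'])
              .set_index('contract id')['name'])

# One row per distinct (contract id, transaction) pair.
pairs = df2.drop_duplicates().copy()
pairs['name'] = pairs['contract id'].map(mapping)

# One row per distinct (name, transaction) pair, then count the rows per name.
types_per_person = (pairs[['name', 'transaction']]
                    .drop_duplicates()['name']
                    .value_counts())

# Number of people who used exactly one transaction type.
single_type_count = int(types_per_person.eq(1).sum())
print(single_type_count)
```

Only the small `mapping` Series is built from df1, so the two frames are never joined.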

Another option is to groupby the names and build a boolean mask to filter them (though I suspect this would be slower than the other approach).

df2['transaction'].groupby(df2['contract id'].map(mapping)).nunique().eq(1).sum()

Output:

1
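
If the names themselves are needed rather than only the count, the per-name nunique result can be used as a boolean mask. This is a minimal sketch on the question's sample data. `n_types`, `mask` and `single_type_names` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({'contract id': [1, 2, 3],
                    'first name': ['John', 'Rob', 'Rob'],
                    'last name': ['Smith', 'Brown', 'Brown']})
df2 = pd.DataFrame({'contract id': [1, 1, 1, 2, 2, 2, 3],
                    'transaction': ['cash', 'cash', 'cash', 'bank transfer',
                                    'bank transfer', 'bank transfer', 'cash']})

# Series indexed by contract id whose values are the full names.
mapping = (df1.assign(name=df1['first name'] + ' ' + df1['last name'])
              .set_index('contract id')['name'])

# Distinct transaction types per person; the grouping key is the mapped
# name, so no extra column is added to df2.
n_types = df2['transaction'].groupby(df2['contract id'].map(mapping)).nunique()

# Boolean mask: True for people with exactly one transaction type.
mask = n_types.eq(1)
single_type_names = n_types.index[mask].tolist()
print(single_type_names, int(mask.sum()))
```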

Answer 2

A solution with merge + groupby:

# Merge the two datasets on their common column ('contract id') to get one table with all the information.
# For the real dataset, the merge type (how='left', 'outer', ...) may need to be chosen more carefully.
data = df1.merge(df2)
data['name'] = data['first name'] + ' ' + data['last name']  # full names, "unique" only if no two people share a name

dfg = (data.groupby('name')['transaction']  # group the data by name and select the transaction column
           .nunique()                       # number of distinct transaction types for each name
)
res = dfg.index[dfg.values == 1].tolist()   # list of the names for which the value is 1
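
Since the question says the frames are large, one possible variation is to reduce df2 to its distinct (contract id, transaction) pairs before merging, so that the merged table has at most one row per contract and transaction type. This deduplication step is an assumption added here and is not part of the answer above. Whether it makes the merge feasible depends on how repetitive the real data is. `pairs`, `n_types` and `count` are illustrative names.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({'contract id': [1, 2, 3],
                    'first name': ['John', 'Rob', 'Rob'],
                    'last name': ['Smith', 'Brown', 'Brown']})
df2 = pd.DataFrame({'contract id': [1, 1, 1, 2, 2, 2, 3],
                    'transaction': ['cash', 'cash', 'cash', 'bank transfer',
                                    'bank transfer', 'bank transfer', 'cash']})

# Assumption: repeated rows carry no extra information for this question,
# so keep only the distinct (contract id, transaction) pairs.
pairs = df2.drop_duplicates()

# Merge the reduced frame with the names on the shared key.
data = df1.merge(pairs, on='contract id', how='inner')
data['name'] = data['first name'] + ' ' + data['last name']

# Distinct transaction types per person, then keep those with exactly one.
n_types = data.groupby('name')['transaction'].nunique()
res = n_types.index[n_types == 1].tolist()
count = len(res)
print(res, count)
```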
