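How to merge with wildcard? - Pandas
I have two dataframes that I want to merge. The join column of the right dataframe may contain a wildcard value (e.g. "ALL"), which should match every value in the join column of the left dataframe.
Consider the following minimal example: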
import pandas as pd

entities = pd.DataFrame.from_dict([
{ 'name' : 'Boson', 'type' : 'Material' },
{ 'name' : 'Atman', 'type' : 'Ideal' },
])
recommendations = pd.DataFrame.from_dict([
{ 'action' : 'recognize', 'entity_type' : 'ALL'},
{ 'action' : 'disdain', 'entity_type' : 'Material'},
{ 'action' : 'worship', 'entity_type' : 'Ideal'},
])
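`recommendations` can be read as a set of recommendations: "Recognize all Entities, regardless of their type, disdain Entities which are Material, and worship Entities which are Ideal". I now want a dataframe that contains every entity together with its recommended actions. For this example, the resulting dataframe should look like this: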
name recommendation type
0 Boson recognize Material
1 Boson disdain Material
2 Atman recognize Ideal
3 Atman worship Ideal
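Is there any way to do this?
I know I could do it by building a dataframe that holds the Cartesian product of `entities` and `recommendations` and then cutting it down with conditions.
I can also think of a solution where I collect all the types present in `entities` and, for every wildcard row in `recommendations`, create one row per type.
But in my actual problem I have multiple columns that I want to join on with wildcard values, so a clever and efficient pandas-idiomatic way would help me a lot.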
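One possible solution is to replace the wildcard with all the values that actually occur and then merge, i.e.
Data:
import numpy as np
import pandas as pd

edf = pd.DataFrame.from_dict([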
{ 'name' : 'Boson', 'type' : 'Material' },
{ 'name' : 'Atman', 'type' : 'Ideal' },
])
rdf = pd.DataFrame.from_dict([
{ 'action' : 'recognize', 'entity_type' : 'ALL'},
{ 'action' : 'disdain', 'entity_type' : 'Material'},
{ 'action' : 'worship', 'entity_type' : 'Ideal'},
])
Preprocessing:
mask = rdf['entity_type'] == 'ALL'
# Join all values of `edf['type']` with `;` (the types themselves may contain `,`); set() removes duplicates (thank you @John)
all_ = ';'.join(set(edf['type']))  # e.g. 'Material;Ideal' (set order is not guaranteed)
# Replace the wildcard with the newly obtained string
rdf['entity_type'] = np.where(mask, all_, rdf['entity_type'])
rdf
action entity_type
0 recognize Material;Ideal
1 disdain Material
2 worship Ideal
# Split and stack so we can make `entity_type` one dimensional
rdf = rdf.set_index('action')['entity_type'].str.split(';',expand=True)\
.stack().reset_index('action').rename(columns={0:'type'})
rdf
action type
0 recognize Material
1 recognize Ideal
0 disdain Material
0 worship Ideal
Merge:
ndf = edf.merge(rdf,on='type').rename(columns={'action':'recommendation'})
ndf
name type recommendation
0 Boson Material recognize
1 Boson Material disdain
2 Atman Ideal recognize
3 Atman Ideal worship
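On pandas 0.25 or later, the preprocessing step above could probably also be written with `DataFrame.explode`, which avoids joining the types into a delimited string and so does not depend on no type containing `;`. This is only a minimal sketch of that variant, not part of the original answer; the helper name `expand_wildcard` and its parameters are made up for illustration.

```python
import pandas as pd


def expand_wildcard(rdf, types, column="entity_type", wildcard="ALL"):
    """Return a copy of rdf in which each wildcard row becomes one row per type.

    Hypothetical helper, not a pandas API.
    """
    all_types = list(dict.fromkeys(types))  # de-duplicate, keep first-seen order
    out = rdf.copy()
    # Wildcard cells become the full list of types; other cells a one-element list.
    out[column] = [all_types if value == wildcard else [value] for value in out[column]]
    # explode() emits one row per list element.
    return out.explode(column).reset_index(drop=True)


edf = pd.DataFrame([
    {"name": "Boson", "type": "Material"},
    {"name": "Atman", "type": "Ideal"},
])
rdf = pd.DataFrame([
    {"action": "recognize", "entity_type": "ALL"},
    {"action": "disdain", "entity_type": "Material"},
    {"action": "worship", "entity_type": "Ideal"},
])

expanded = expand_wildcard(rdf, edf["type"]).rename(columns={"entity_type": "type"})
ndf = edf.merge(expanded, on="type").rename(columns={"action": "recommendation"})
print(ndf)
```

The function works on a copy, so the original `rdf` keeps its "ALL" rows and can be reused against a different set of entities.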
Sample run on a different dataframe:
edf = pd.DataFrame.from_dict([
{ 'name' : 'Boson', 'type' : 'Material' },
{ 'name' : 'Atman', 'type' : 'Ideal' },
{ 'name' : 'Chaos', 'type' : 'Void, but emphasized' },
{ 'name' : 'Tohuwabohu', 'type' : 'Void' },
])
rdf = pd.DataFrame.from_dict([
{ 'action' : 'recognize', 'entity_type' : 'ALL'},
{ 'action' : 'disdain', 'entity_type' : 'Material'},
{ 'action' : 'worship', 'entity_type' : 'Ideal'},
{ 'action' : 'drink', 'entity_type' : 'ALL'}
])
Then:
mask = rdf['entity_type']=='ALL'
all_ = ';'.join(set(edf['type']))
rdf['entity_type'] = np.where(mask,all_,rdf['entity_type'])
rdf = rdf.set_index('action')['entity_type'].str.split(';',expand=True)\
.stack().reset_index('action').rename(columns={0:'type'})
ndf = edf.merge(rdf,on='type').rename(columns={'action':'recommendation'})
ndf
name type recommendation
0 Boson Material recognize
1 Boson Material disdain
2 Boson Material drink
3 Atman Ideal recognize
4 Atman Ideal worship
5 Atman Ideal drink
6 Chaos Void, but emphasized recognize
7 Chaos Void, but emphasized drink
8 Tohuwabohu Void recognize
9 Tohuwabohu Void drink
Compared with the Cartesian product, this approach is faster and uses less memory. Hope it helps :)
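A related way to limit the size of the intermediate result, sketched here as an assumption rather than as something from this thread: merge the non-wildcard rows with an ordinary equality merge, cross-join only the wildcard rows, and concatenate the two parts. `how="cross"` requires pandas 1.2 or later, and the function name `merge_with_wildcard` is hypothetical.

```python
import pandas as pd


def merge_with_wildcard(left, right, left_on, right_on, wildcard="ALL"):
    """Equality merge for concrete keys plus a cross join for wildcard rows.

    Hypothetical helper; assumes left and right share no column names.
    """
    is_wild = right[right_on] == wildcard
    # Rows with a concrete key: plain inner merge on equality.
    exact = left.merge(right[~is_wild], left_on=left_on, right_on=right_on)
    # Wildcard rows match every left row, so pair them with all of left.
    wild = left.merge(right[is_wild], how="cross")
    combined = pd.concat([exact, wild], ignore_index=True)
    return combined.drop(columns=right_on)


edf = pd.DataFrame([
    {"name": "Boson", "type": "Material"},
    {"name": "Atman", "type": "Ideal"},
    {"name": "Chaos", "type": "Void, but emphasized"},
    {"name": "Tohuwabohu", "type": "Void"},
])
rdf = pd.DataFrame([
    {"action": "recognize", "entity_type": "ALL"},
    {"action": "disdain", "entity_type": "Material"},
    {"action": "worship", "entity_type": "Ideal"},
    {"action": "drink", "entity_type": "ALL"},
])

ndf = merge_with_wildcard(edf, rdf, left_on="type", right_on="entity_type")
print(ndf.sort_values(["name", "action"]))
```

The cross join here only involves the wildcard rows, so the intermediate frame grows with the number of wildcard rows rather than with the full size of `rdf`. The row order differs from the output above because the exact matches come first.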
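After some thought, I think the approach using the Cartesian product of the two dataframes may not be as bad as I previously assumed. So, for anyone reading this thread later, I just want to show how it can be done: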
# get Cartesian Product of the two dfs
entities['join'] = recommendations['join'] = 0
results = entities.merge(recommendations, on='join')
# extract matching rows
results = results[(( results['type'] == results['entity_type']) | (results['entity_type'] == "ALL"))]
results = results[['name', 'type', 'action']]
With the input
entities = pd.DataFrame.from_dict([
{ 'name' : 'Boson', 'type' : 'Material' },
{ 'name' : 'Atman', 'type' : 'Ideal' },
{ 'name' : 'Chaos', 'type' : 'Void, but emphasized' },
{ 'name' : 'Tohuwabohu', 'type' : 'Void' },
])
recommendations = pd.DataFrame.from_dict([
{ 'action' : 'recognize', 'entity_type' : 'ALL'},
{ 'action' : 'disdain', 'entity_type' : 'Material'},
{ 'action' : 'disdain', 'entity_type' : 'Void'},
{ 'action' : 'worship', 'entity_type' : 'Ideal'},
])
this gives `results`:
name type action
0 Boson Material recognize
1 Boson Material disdain
4 Atman Ideal recognize
7 Atman Ideal worship
8 Chaos Void, but emphasized recognize
12 Tohuwabohu Void recognize
14 Tohuwabohu Void disdain
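Since the question mentions several join columns with wildcards, the Cartesian-product approach might be extended roughly as follows. This is a sketch under assumptions: it uses `how="cross"` (pandas 1.2 or later) instead of the dummy `join` column, the function name `cross_filter` is hypothetical, and the `realm` / `entity_realm` columns and their values are invented purely to illustrate a second wildcard column.

```python
import pandas as pd


def cross_filter(left, right, pairs, wildcard="ALL"):
    """Cross join, then keep rows where every (left, right) column pair
    is either equal or has the wildcard on the right-hand side.

    Hypothetical helper; assumes left and right share no column names.
    """
    both = left.merge(right, how="cross")
    keep = pd.Series(True, index=both.index)
    for left_col, right_col in pairs:
        keep &= (both[left_col] == both[right_col]) | (both[right_col] == wildcard)
    return both[keep].reset_index(drop=True)


# Illustrative data only: the 'realm' columns are not part of the original example.
entities = pd.DataFrame([
    {"name": "Boson", "type": "Material", "realm": "West"},
    {"name": "Atman", "type": "Ideal", "realm": "East"},
])
recommendations = pd.DataFrame([
    {"action": "recognize", "entity_type": "ALL", "entity_realm": "ALL"},
    {"action": "disdain", "entity_type": "Material", "entity_realm": "ALL"},
    {"action": "worship", "entity_type": "Ideal", "entity_realm": "East"},
    {"action": "study", "entity_type": "ALL", "entity_realm": "West"},
])

results = cross_filter(
    entities,
    recommendations,
    pairs=[("type", "entity_type"), ("realm", "entity_realm")],
)
print(results[["name", "type", "realm", "action"]])
```

The full cross join still materializes `len(entities) * len(recommendations)` rows before filtering, so this stays practical only while that product fits comfortably in memory.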