How to merge with wildcard? - Pandas

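I have two dataframes that I want to merge. The join column of the right dataframe may contain wildcard values (for example "ALL"), which should match every value of the join column in the left dataframe.

Consider the following minimal example: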

import pandas

entities = pandas.DataFrame.from_dict([
    { 'name' : 'Boson', 'type' : 'Material' },
    { 'name' : 'Atman', 'type' : 'Ideal' },
])

recommendations = pandas.DataFrame.from_dict([
    { 'action' : 'recognize', 'entity_type' : 'ALL'},
    { 'action' : 'disdain', 'entity_type' : 'Material'},
    { 'action' : 'worship', 'entity_type' : 'Ideal'},
])

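recommendations can be read as a set of recommendations ("Recognize all Entities, regardless of their type, disdain Entities which are Material, and worship Entities which are Ideal"). I now want a dataframe that contains all entities together with their recommended actions. So in this example, the resulting dataframe should look like this: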

    name recommendation      type
0  Boson      recognize  Material
1  Boson        disdain  Material
2  Atman      recognize     Ideal
3  Atman        worship     Ideal

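Is there any way to do this?

I know how to do it by creating a dataframe that contains the Cartesian product of entities and recommendations and then cutting it down based on a condition.

I can also think of a solution where I collect all the types present in entities and, for every row in recommendations that has the wildcard type, create one row per type.

But in my real problem I have several columns that I want to join on, each of which may hold wildcard values. So a clever and efficient pandaic way would help me a lot.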

One possible solution is to replace the wildcard with all the other values that exist and then merge, i.e.

Data:

import numpy as np
import pandas as pd

edf = pd.DataFrame.from_dict([
    { 'name' : 'Boson', 'type' : 'Material' },
    { 'name' : 'Atman', 'type' : 'Ideal' },
])

rdf = pd.DataFrame.from_dict([
    { 'action' : 'recognize', 'entity_type' : 'ALL'},
    { 'action' : 'disdain', 'entity_type' : 'Material'},
    { 'action' : 'worship', 'entity_type' : 'Ideal'},
])

Preprocessing:

mask = rdf['entity_type']=='ALL'

# Join all the elements from `edf['type']` with `;` rather than `,`, since the types
# themselves might contain `,`; set() gets rid of duplicates (Thank you @John)
all_ = ';'.join(set(edf['type']))  # all_ : 'Material;Ideal' (set order is not guaranteed)

# Replace ALL with the newly obtained string
rdf['entity_type'] = np.where(mask, all_, rdf['entity_type'])

rdf
      action     entity_type
0  recognize  Material;Ideal
1    disdain        Material
2    worship           Ideal

# Split and stack so we can make `entity_type` one dimensional
rdf = rdf.set_index('action')['entity_type'].str.split(';',expand=True)\
        .stack().reset_index('action').rename(columns={0:'type'})

rdf
      action      type
0  recognize  Material
1  recognize     Ideal
0    disdain  Material
0    worship     Ideal
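
On pandas 0.25 or later, the split-and-stack step can probably be replaced by `DataFrame.explode`, which also avoids having to pick a separator that never occurs in a type. A minimal sketch on the same data, starting again from the original `rdf`; the names `all_types` and `expanded` are made up for this example:

```python
import pandas as pd

edf = pd.DataFrame([
    {'name': 'Boson', 'type': 'Material'},
    {'name': 'Atman', 'type': 'Ideal'},
])
rdf = pd.DataFrame([
    {'action': 'recognize', 'entity_type': 'ALL'},
    {'action': 'disdain', 'entity_type': 'Material'},
    {'action': 'worship', 'entity_type': 'Ideal'},
])

# Every distinct type, in order of first appearance (deterministic, unlike set()).
all_types = list(edf['type'].unique())

# Wildcard rows get the full list of types, all other rows a one-element list.
expanded = rdf.copy()
expanded['type'] = [
    all_types if t == 'ALL' else [t] for t in expanded['entity_type']
]

# explode() turns each list element into its own row.
expanded = (
    expanded.drop(columns='entity_type')
            .explode('type')
            .reset_index(drop=True)
)
```

`expanded` then has one `action`/`type` pair per row and can be merged with `edf` on `type` in the same way as the stacked `rdf`.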

Merge:

ndf = edf.merge(rdf,on='type').rename(columns={'action':'recommendation'})

ndf

    name      type recommendation
0  Boson  Material      recognize
1  Boson  Material        disdain
2  Atman     Ideal      recognize
3  Atman     Ideal        worship

Sample run on a different dataframe:

edf = pd.DataFrame.from_dict([
    { 'name' : 'Boson', 'type' : 'Material' },
    { 'name' : 'Atman', 'type' : 'Ideal' },
    { 'name' : 'Chaos', 'type' : 'Void, but emphasized' },
    { 'name' : 'Tohuwabohu', 'type' : 'Void' },
]) 

rdf = pd.DataFrame.from_dict([
    { 'action' : 'recognize', 'entity_type' : 'ALL'},
    { 'action' : 'disdain', 'entity_type' : 'Material'},
    { 'action' : 'worship', 'entity_type' : 'Ideal'},
    { 'action' : 'drink', 'entity_type' : 'ALL'}
])

Then:

mask = rdf['entity_type']=='ALL'
all_ =  ';'.join(set(edf['type']))
rdf['entity_type'] = np.where(mask,all_,rdf['entity_type'])

rdf = rdf.set_index('action')['entity_type'].str.split(';',expand=True)\
        .stack().reset_index('action').rename(columns={0:'type'})
ndf = edf.merge(rdf,on='type').rename(columns={'action':'recommendation'})

ndf

         name                  type recommendation
0       Boson              Material      recognize
1       Boson              Material        disdain
2       Boson              Material          drink
3       Atman                 Ideal      recognize
4       Atman                 Ideal        worship
5       Atman                 Ideal          drink
6       Chaos  Void, but emphasized      recognize
7       Chaos  Void, but emphasized          drink
8  Tohuwabohu                  Void      recognize
9  Tohuwabohu                  Void          drink

Compared to the Cartesian product, this approach is fast and uses less memory. Hope it helps :)
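
The idea of expanding only the wildcard rows could also be packaged as a small helper that never goes through strings: join the ordinary rows by equality and pair just the wildcard rows with every left row. This is only a sketch under some assumptions: `merge_with_wildcard` is a hypothetical name, the key columns must be named differently in the two frames, and `how='cross'` needs pandas 1.2 or later.

```python
import pandas as pd

def merge_with_wildcard(left, right, left_on, right_on, wildcard='ALL'):
    """Inner join where rows of `right` whose key equals `wildcard` match every row of `left`.

    Assumes left_on and right_on are different column names.
    """
    is_wild = right[right_on] == wildcard

    # Ordinary rows: a plain equality join.
    exact = left.merge(right[~is_wild], left_on=left_on, right_on=right_on)

    # Wildcard rows: pair each of them with every row of `left`.
    wild = left.merge(right[is_wild], how='cross')

    # Both parts have the same columns, so they can simply be stacked.
    out = pd.concat([exact, wild], ignore_index=True)
    return out.drop(columns=right_on)

edf = pd.DataFrame([
    {'name': 'Boson', 'type': 'Material'},
    {'name': 'Atman', 'type': 'Ideal'},
])
rdf = pd.DataFrame([
    {'action': 'recognize', 'entity_type': 'ALL'},
    {'action': 'disdain', 'entity_type': 'Material'},
    {'action': 'worship', 'entity_type': 'Ideal'},
])

ndf = merge_with_wildcard(edf, rdf, 'type', 'entity_type')
```

Note that the rows come out grouped by kind (exact matches first, then wildcard matches) rather than by entity, so sort afterwards if the order matters.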

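After some thought, I think the approach using the Cartesian product of the two dataframes may not be as bad as I first assumed. So, for anyone reading this thread in the future, I just want to show how to do it: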

# get Cartesian Product of the two dfs
entities['join'] = recommendations['join'] = 0
results = entities.merge(recommendations, on='join')
# extract matching rows
results = results[(( results['type'] == results['entity_type']) | (results['entity_type'] == "ALL"))]
results = results[['name', 'type', 'action']]

With the input

entities = pd.DataFrame.from_dict([
    { 'name' : 'Boson', 'type' : 'Material' },
    { 'name' : 'Atman', 'type' : 'Ideal' },
    { 'name' : 'Chaos', 'type' : 'Void, but emphasized' },
    { 'name' : 'Tohuwabohu', 'type' : 'Void' },
])

recommendations = pd.DataFrame.from_dict([
    { 'action' : 'recognize', 'entity_type' : 'ALL'},
    { 'action' : 'disdain', 'entity_type' : 'Material'},
    { 'action' : 'disdain', 'entity_type' : 'Void'},
    { 'action' : 'worship', 'entity_type' : 'Ideal'},
])

this gives the following results:

          name                  type     action
0        Boson              Material  recognize
1        Boson              Material    disdain
4        Atman                 Ideal  recognize
7        Atman                 Ideal    worship
8        Chaos  Void, but emphasized  recognize
12  Tohuwabohu                  Void  recognize
14  Tohuwabohu                  Void    disdain
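
On pandas 1.2 or later the helper `join` column should not be needed, because `merge` accepts `how='cross'`. The same filter also extends to several wildcard columns, which is what my real problem needs. A minimal sketch; the `realm` / `entity_realm` columns, the `measure` action and the name `key_pairs` are hypothetical and only there for illustration:

```python
import pandas as pd

WILDCARD = 'ALL'

# 'realm' and 'entity_realm' are made-up columns, added only to show a second key.
entities = pd.DataFrame([
    {'name': 'Boson', 'type': 'Material', 'realm': 'Physics'},
    {'name': 'Atman', 'type': 'Ideal', 'realm': 'Metaphysics'},
])
recommendations = pd.DataFrame([
    {'action': 'recognize', 'entity_type': 'ALL', 'entity_realm': 'ALL'},
    {'action': 'disdain', 'entity_type': 'Material', 'entity_realm': 'ALL'},
    {'action': 'worship', 'entity_type': 'Ideal', 'entity_realm': 'Metaphysics'},
    {'action': 'measure', 'entity_type': 'ALL', 'entity_realm': 'Physics'},
])

# (left column, right column) pairs that must agree unless the right side is a wildcard.
key_pairs = [('type', 'entity_type'), ('realm', 'entity_realm')]

# Cartesian product without a helper column (needs pandas >= 1.2).
product = entities.merge(recommendations, how='cross')

# A row survives only if every key pair either matches or is wildcarded.
keep = pd.Series(True, index=product.index)
for left_col, right_col in key_pairs:
    keep &= (product[left_col] == product[right_col]) | (product[right_col] == WILDCARD)

results = product.loc[keep, ['name', 'type', 'realm', 'action']].reset_index(drop=True)
```

The product still has len(entities) * len(recommendations) rows before filtering, so this is only attractive while both frames are reasonably small.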