Python Pandas: How to optimally compare each value of a dictionary with all other values?

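I have a dictionary 'orgs_dict' and I want to compare each value with all the other values. To do this I put all the values in a set and then compare each one against the dictionary; if they are the same, I add the key to the 'final_hosts' dictionary: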

orgs_dict = {'Ridgway School': 'ridgway','Ridgway Uni': 'ridgway', 'Aktieselskapet': 'aktieselskapet','Aktieselskapet_1': 'aktieselskapet', 'Chinese Education Association Ex': 'chinese association ex', 'Gestora de Infraestructuras de Telecomunicaciones': 'gestora infraestructuras telecomunicaciones','Aktieselskapet_5': 'aktieselskapet'}

Here is my code:

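from collections import defaultdict

# collect the distinct values
set_neworgs = set()
for key in orgs_dict.keys():
    set_neworgs.add(orgs_dict[key])

# for every distinct value, rescan the whole dict for matching keys
final_hosts = defaultdict(list)
for i in set_neworgs:
    for k, v in orgs_dict.items():
        if i == v:
            final_hosts[i].append(k)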

This works fine, but when my 'orgs_dict' is very large it takes 3 hours to finish. Does anyone know a more efficient way to do this?

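You can build a df with the keys as the column 'new_orgs' and the values as 'hosts', then use value_counts() > 1 as a boolean filter, and finally use isin to keep only the rows whose host appears in that filtered series: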

In [150]:

orgs_dict = {'Ridgway School': 'ridgway','Ridgway Uni': 'ridgway', 'Aktieselskapet': 'aktieselskapet','Aktieselskapet_1': 'aktieselskapet', 'Chinese Education Association Ex': 'chinese association ex', 'Gestora de Infraestructuras de Telecomunicaciones': 'gestora infraestructuras telecomunicaciones','Aktieselskapet_5': 'aktieselskapet'}
import pandas as pd
df = pd.DataFrame({'new_orgs':list(orgs_dict.keys()), 'hosts':list(orgs_dict.values())})
df
Out[150]:
                                         hosts  \
0                               aktieselskapet   
1                                      ridgway   
2                               aktieselskapet   
3                                      ridgway   
4                       chinese association ex   
5  gestora infraestructuras telecomunicaciones   
6                               aktieselskapet   

                                            new_orgs  
0                                   Aktieselskapet_1  
1                                     Ridgway School  
2                                   Aktieselskapet_5  
3                                        Ridgway Uni  
4                   Chinese Education Association Ex  
5  Gestora de Infraestructuras de Telecomunicaciones  
6                                     Aktieselskapet  

In [157]:

df[df['hosts'].isin((df['hosts'].value_counts()[df['hosts'].value_counts()> 1].index))]
Out[157]:
            hosts          new_orgs
0  aktieselskapet  Aktieselskapet_1
1         ridgway    Ridgway School
2  aktieselskapet  Aktieselskapet_5
3         ridgway       Ridgway Uni
6  aktieselskapet    Aktieselskapet
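The same filter can probably be written more directly with Series.duplicated. This is a different technique from the value_counts/isin approach above, shown only as a sketch: keep=False marks every occurrence of a repeated value, not just the later ones. The names mask and shared are made up for illustration.

```python
import pandas as pd

orgs_dict = {'Ridgway School': 'ridgway', 'Ridgway Uni': 'ridgway',
             'Aktieselskapet': 'aktieselskapet', 'Aktieselskapet_1': 'aktieselskapet',
             'Chinese Education Association Ex': 'chinese association ex',
             'Gestora de Infraestructuras de Telecomunicaciones': 'gestora infraestructuras telecomunicaciones',
             'Aktieselskapet_5': 'aktieselskapet'}

df = pd.DataFrame({'new_orgs': list(orgs_dict.keys()),
                   'hosts': list(orgs_dict.values())})

# True for every row whose 'hosts' value occurs more than once
mask = df['hosts'].duplicated(keep=False)

# rows that share their host with at least one other row
shared = df[mask]
```

On the sample data this should select the same five rows as the isin filter.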

Another method is to groupby 'hosts', count the number of 'new_orgs' in each group, and use that count to filter:

In [167]:

df['host_count'] = df.groupby('hosts')['new_orgs'].transform('count')
df[df['host_count'] > 1]
Out[167]:
            hosts          new_orgs  host_count
0  aktieselskapet  Aktieselskapet_1           3
1         ridgway    Ridgway School           2
2  aktieselskapet  Aktieselskapet_5           3
3         ridgway       Ridgway Uni           2
6  aktieselskapet    Aktieselskapet           3
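If the goal is the same shape as 'final_hosts' in the question (host mapped to a list of organisation names), the filtered frame can be grouped once more and converted to a dict. A minimal sketch, assuming the same df layout as above; the variable names host_count and dupes are made up for illustration.

```python
import pandas as pd

orgs_dict = {'Ridgway School': 'ridgway', 'Ridgway Uni': 'ridgway',
             'Aktieselskapet': 'aktieselskapet', 'Aktieselskapet_1': 'aktieselskapet',
             'Chinese Education Association Ex': 'chinese association ex',
             'Gestora de Infraestructuras de Telecomunicaciones': 'gestora infraestructuras telecomunicaciones',
             'Aktieselskapet_5': 'aktieselskapet'}

df = pd.DataFrame({'new_orgs': list(orgs_dict.keys()),
                   'hosts': list(orgs_dict.values())})

# per-row size of the group that the row's host belongs to
host_count = df.groupby('hosts')['new_orgs'].transform('count')

# keep only hosts shared by more than one organisation
dupes = df[host_count > 1]

# collapse each group into a list of names, then into a plain dict
final_hosts = dupes.groupby('hosts')['new_orgs'].apply(list).to_dict()
```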

Timings

On this small sample set I get:

In [168]:

%%timeit
df['host_count'] = df.groupby('hosts')['new_orgs'].transform('count')
df[df['host_count'] > 1]
1000 loops, best of 3: 1.65 ms per loop

In [169]:

%timeit df[df['hosts'].isin((df['hosts'].value_counts()[df['hosts'].value_counts()> 1].index))]
1000 loops, best of 3: 1.49 ms per loop

There is not much difference between the two, and your current method is faster:

In [175]:

%%timeit
set_neworgs=set()
for key in orgs_dict.keys():
    set_neworgs.add(orgs_dict[key])

final_hosts = defaultdict(list)
for i in set_neworgs:
    for k,v in orgs_dict.items():
        if i == v:
            final_hosts[i].append(k) 
100000 loops, best of 3: 6.85 µs per loop

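However, it will not scale well to your real dataset size, because the nested loop rescans the whole dict once per distinct value, whereas the 2 methods above will.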
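For comparison, the nested loop in the question can also be collapsed into a single pass in plain Python, without pandas. This is a sketch rather than something timed in this thread; the function name group_keys_by_value and the variable duplicated_hosts are made up for illustration.

```python
from collections import defaultdict

orgs_dict = {'Ridgway School': 'ridgway', 'Ridgway Uni': 'ridgway',
             'Aktieselskapet': 'aktieselskapet', 'Aktieselskapet_1': 'aktieselskapet',
             'Chinese Education Association Ex': 'chinese association ex',
             'Gestora de Infraestructuras de Telecomunicaciones': 'gestora infraestructuras telecomunicaciones',
             'Aktieselskapet_5': 'aktieselskapet'}


def group_keys_by_value(mapping):
    """Map each value to the list of keys that carry it, in one pass."""
    groups = defaultdict(list)
    for key, value in mapping.items():
        groups[value].append(key)
    return dict(groups)


# same content as final_hosts in the question, built with one scan of the dict
final_hosts = group_keys_by_value(orgs_dict)

# optionally keep only the values shared by more than one key
duplicated_hosts = {v: ks for v, ks in final_hosts.items() if len(ks) > 1}
```

Each item is visited once, so the work grows linearly with the size of the dict instead of with the number of items times the number of distinct values.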

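Python 2.7: keys that share a value can be found with this dict comprehension (note that count rescans the list of values for every key):

{k: orgs_dict[k] for k in orgs_dict if orgs_dict.values().count(orgs_dict[k]) > 1}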

Python 3.x: wrap orgs_dict.values() in a call to list:

{k: orgs_dict[k] for k in orgs_dict if list(orgs_dict.values()).count(orgs_dict[k]) > 1}

Output:

{'Aktieselskapet_1': 'aktieselskapet', 'Ridgway School': 'ridgway', 'Aktieselskapet': 'aktieselskapet', 'Ridgway Uni': 'ridgway', 'Aktieselskapet_5': 'aktieselskapet'}

Another approach: use Counter from the collections module, which works in both 2.7+ and 3.x and counts the values only once:

from collections import Counter
c = Counter(orgs_dict.values()) # count values
{k : orgs_dict[k] for k in orgs_dict.keys() if c[orgs_dict[k]]>1}
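As a quick check, the Counter version can be wrapped in a function and compared with the list.count comprehension. A small Python 3 sketch; the function name keys_with_shared_values is made up for illustration.

```python
from collections import Counter

orgs_dict = {'Ridgway School': 'ridgway', 'Ridgway Uni': 'ridgway',
             'Aktieselskapet': 'aktieselskapet', 'Aktieselskapet_1': 'aktieselskapet',
             'Chinese Education Association Ex': 'chinese association ex',
             'Gestora de Infraestructuras de Telecomunicaciones': 'gestora infraestructuras telecomunicaciones',
             'Aktieselskapet_5': 'aktieselskapet'}


def keys_with_shared_values(mapping):
    """Return the items whose value is carried by more than one key."""
    counts = Counter(mapping.values())  # one pass to count every value
    return {k: v for k, v in mapping.items() if counts[v] > 1}


shared = keys_with_shared_values(orgs_dict)

# the list.count comprehension from above, for comparison
values = list(orgs_dict.values())
shared_slow = {k: orgs_dict[k] for k in orgs_dict if values.count(orgs_dict[k]) > 1}
```

Both dicts should hold the same five entries; the Counter version avoids rescanning the values for every key.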