Faster than double for loop on Counter object

Question

I want to run a double loop over a Counter object that is the result of subtracting two different counters. My counter looks like this:

{'sun': 5,
 'abstract': 0.0,
 'action': 10,
 'ad': 0.0,
  ....}

I have a DataFrame like this:

    0           1   
0   sun         sunlight        
2   river       water   
3   stair       staircase
4   morning     sunrise 
n   ......

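My goal is to keep only the word pairs in the DataFrame where the first word of the row has frequency 0 and the second word has frequency greater than 0 (or vice versa: the first word has frequency greater than 0 and the second word has frequency 0). In other words, I want to exclude pairs whose frequencies are both 0 or both greater than 0.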

I tried this, but it is too slow (it takes more than 5 hours to finish):

for i,j in counter_diff.items():       #extract i word and j counter number of a item
  for t,k in counter_diff.items():     #extract t word and k counter number of a item
    for s in range(len(df)):
      if ((df[0][s] == i and j==0) and (df[1][s] == t and k==0)):
        df = df.drop([s])
      elif ((df[0][s] == i and j>0) and (df[1][s] == t and k>0)):
        df = df.drop([s])
    df = df.reset_index(drop=True)

Do you have a better suggestion? Thanks for your time!

Answer 1

IIUC, you can try:

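import pandas as pd

d = {'sun': 5, 'abstract': 0.0, 'action': 10, 'ad': 0.0}
df = pd.DataFrame({0: ["sun", "river", "stair", "morning"],
                   1: ["sunlight", "water", "staircase", "sunrise"]})

# Use ^ (xor) rather than +: adding two boolean Series acts like a logical or,
# so a pair with both counts > 0 would not be filtered out.
>>> df.loc[(df[0].map(d) > 0) ^ (df[1].map(d) > 0)]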
     0         1
0  sun  sunlight

If you have other columns in df and want to check whether exactly one column has a count greater than 0:

>>> df.loc[df.apply(lambda x: x.map(d)>0).sum(axis=1)==1]
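
The row filter can also be wrapped in a small helper. This is only a sketch under a few assumptions: the name `keep_mixed_pairs` is hypothetical, the word pairs are assumed to live in columns 0 and 1, and words missing from the counter are treated as having frequency 0. The sample rows are made up to cover the "both greater than 0" and "both 0" cases.

```python
import pandas as pd

def keep_mixed_pairs(df, counts, first=0, second=1):
    """Keep rows where exactly one of the two words has a count > 0."""
    # Words absent from `counts` map to NaN; fillna(0) treats them as frequency 0.
    first_pos = df[first].map(counts).fillna(0) > 0
    second_pos = df[second].map(counts).fillna(0) > 0
    # xor: True only when the two flags differ
    return df[first_pos ^ second_pos]

d = {'sun': 5, 'abstract': 0.0, 'action': 10, 'ad': 0.0}
# illustrative pairs (not from the question's data)
pairs = pd.DataFrame({0: ["sun", "sun", "ad", "river"],
                      1: ["sunlight", "action", "abstract", "water"]})

result = keep_mixed_pairs(pairs, d)
print(result)
```

Only the first row (`sun`, `sunlight`) survives: `sun`/`action` is dropped because both counts are positive, and the other two rows are dropped because both counts are 0 or missing.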

Answer 2

One approach is to use applymap + numpy.logical_xor:

from collections import Counter
import pandas as pd
import numpy as np

# toy Counter object
counts = Counter({'sun': 5, 'abstract': 0, 'action': 10, 'ad': 0})

# toy DataFrame object
df = pd.DataFrame(data=[["sun", "sunlight"],
        ["river", "water"],
        ["stair", "staircase"],
        ["morning", "sunrise"]])

# map the counts element-wise over all the elements of the DataFrame
# and create boolean mask
indicators = df.applymap(lambda x: counts.get(x, 0)) > 0

# use a logical xor to find the combinations where the count is 0 and >0 (and the other way around)
mask = np.logical_xor(indicators[0], indicators[1])

# finally filter using a mask
res = df[mask]
print(res)

Output

     0         1
0  sun  sunlight

The time complexity of this approach is O(n), where n is the size of the DataFrame (the number of cells). More information about xor (exclusive or) can be found here.
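
To see why xor matches the requirement in the question, here is a minimal sketch of its truth table. The array names are illustrative only; each position stands for one possible combination of "first word has a count greater than 0" and "second word has a count greater than 0".

```python
import numpy as np

# The four possible combinations for a word pair
first_positive = np.array([False, False, True, True])
second_positive = np.array([False, True, False, True])

# True only when exactly one of the two flags is True
keep = np.logical_xor(first_positive, second_positive)
print(keep)  # [False  True  True False]
```

Pairs where both counts are 0 or both are greater than 0 come out as False and are dropped, which is exactly the filtering rule described in the question.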

Tags: python, counter, for-loop, dataframe