How can I make this function more (time) efficient?

Question

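I have a DataFrame Series containing sentences. (Some are fairly long.)

I also have 2 dictionaries with words as keys and integers as counts.

Not every word in the strings appears in both dictionaries. Some appear in only one, some in neither.

The DataFrame is 124011 rows long. The function takes about 0.4 per string. That is too long.

W is just a reference key for the dictionaries (weights = {}, weights[W] = {})

Here is the function: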

from collections import Counter
import numpy as np

def match_share(string, W, weights, rel_weight):
    words = string.split()
    words_counts = Counter(words)
    ratios = []
    for word in words:
        if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
            if (weights[W][word]!=0):
                ratios.append(words_counts[word]*rel_weight[W][word]/weights[W][word])
        else:
            ratios.append(0)
    if len(words)>0:
        ratios = np.divide(ratios, float(len(words)))
    ratio = np.sum(ratios)
    return ratio

Thanks

Answer 1

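I think the slowness may come from using Counter instead of dict. There are some hints that parts of the dict class are written in pure C, while Counter is written in Python.

I would suggest changing your code to use a dict and testing whether that gives faster times.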
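One way to test this suggestion is to time both counting approaches on a sample sentence. This is only a minimal sketch: the helper names count_with_counter and count_with_dict and the sample sentence are made up for illustration, and the timings depend on the Python version and machine. In recent CPython versions Counter also relies on a C helper for counting, so the difference may turn out small or even reversed.

```python
import timeit
from collections import Counter

def count_with_counter(words):
    # Counting via collections.Counter, as in the question.
    return Counter(words)

def count_with_dict(words):
    # Counting via a plain dict, as suggested in this answer.
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Hypothetical sample sentence.
words = "the quick brown fox jumps over the lazy dog the end".split()

# Both approaches must produce the same counts.
assert count_with_dict(words) == dict(count_with_counter(words))

for fn in (count_with_counter, count_with_dict):
    elapsed = timeit.timeit(lambda: fn(words), number=100_000)
    print(f"{fn.__name__}: {elapsed:.3f} s per 100,000 calls")
```

Measuring on a sample of the real sentences would give a more reliable answer than this toy input.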

Also, why is this code repeated?

words = string.split()
words_counts = Counter(words)
words = string.split()
words_counts = Counter(words)
ratios = []

Answer 2

Let's clean this up a bit:

def match_share(string, W, weights, rel_weight):
    words = string.split()
    words_counts = Counter(words)
    words = string.split()
    words_counts = Counter(words)

That is redundant! Replace the 4 statements with 2:

def match_share(string, W, weights, rel_weight):

    words = string.split()    
    words_counts = Counter(words)

Next:

    ratios = []
    for word in words:
        if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
            if (weights[W][word]!=0):
                ratios.append(words_counts[word]*rel_weight[W][word]/weights[W][word])
        else:
            ratios.append(0)

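I'm not sure what you think that code does. I hope you're not being sneaky. But .keys() returns an iterable, and X in <iterable> is much slower than X in <dict>. (In Python 2, .keys() builds a list, so the test is a linear scan; in Python 3 it returns a view with fast lookups, but the call is still unnecessary.)

Also, note: if the innermost (weights[W][word] != 0) condition fails, nothing is appended. That may be a bug, since you do append 0 in the other else branch. (I don't know what you're trying to do, so I'm just pointing it out.)

This is Python, not Perl or C or Java, so you don't need parentheses around the condition in if <test>: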

Let's go with this:

    ratios = []

    for word in words:
        if word in weights[W] and word in rel_weight[W]:
            if weights[W][word] != 0:    
                ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])

        else:
            ratios.append(0)
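To make the note about the zero-weight case concrete, here is a small sketch that assumes Python 3 division. The dictionaries and words are made up, and collect_ratios is a hypothetical helper that runs the same loop as above on the inner dictionaries directly and returns the raw list. A word that is in both dictionaries but has weight 0 contributes nothing, while a missing word contributes a 0, so the two cases are treated differently:

```python
from collections import Counter

def collect_ratios(string, weights_W, rel_W):
    """Same loop as above, but returning the raw list of ratios."""
    words = string.split()
    words_counts = Counter(words)
    ratios = []
    for word in words:
        if word in weights_W and word in rel_W:
            if weights_W[word] != 0:
                ratios.append(words_counts[word] * rel_W[word] / weights_W[word])
            # No else here: a zero weight silently appends nothing.
        else:
            ratios.append(0)
    return ratios

weights_W = {"apple": 4, "pear": 0}  # hypothetical counts
rel_W = {"apple": 2, "pear": 3}      # hypothetical counts

# Three words go in, but only two entries come out: "pear" is dropped.
print(collect_ratios("apple pear kiwi", weights_W, rel_W))  # [0.5, 0]
```

Because the sum is later divided by the number of words, the dropped entry does not change the final result here, but it would matter if the list length were ever used.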

Next:

    if len(words)>0:
        ratios = np.divide(ratios, float(len(words)))

You are trying to guard against division by zero. But you can use the truthiness of the list to check this and avoid the comparison:

    if words:
        ratios = np.divide(ratios, float(len(words)))

The rest is fine, but you don't need the variable.

    ratio = np.sum(ratios)
    return ratio

With those changes applied, your function looks like this:

def match_share(string, W, weights, rel_weight):

    words = string.split()    
    words_counts = Counter(words)
    ratios = []

    for word in words:
        if word in weights[W] and word in rel_weight[W]:
            if weights[W][word] != 0:    
                ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])

        else:
            ratios.append(0)

    if words:
        ratios = np.divide(ratios, float(len(words)))

    ratio = np.sum(ratios)
    return ratio

Looking a little closer, I see you are doing this:

word_counts = Counter(words)

for word in words:
    append(   word_counts[word] * ...)

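As I read it, this means that if "apple" appears 6 times, you append 6*... to the list once for each occurrence of the word. So your list will contain 6 separate entries of 6*.... Are you sure that is what you want? Or should it be for word in word_counts, to iterate over the distinct words?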
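The difference between the two loops can be seen on a tiny input. This sketch uses a made-up sentence and simply records what each loop would visit:

```python
from collections import Counter

words = "apple pie apple tart apple".split()  # hypothetical sentence
word_counts = Counter(words)

# Iterating over the word list visits "apple" once per occurrence,
# and every visit already carries the full count as a factor.
per_occurrence = [(word, word_counts[word]) for word in words]

# Iterating over the Counter visits each distinct word exactly once.
per_distinct = [(word, count) for word, count in word_counts.items()]

print(per_occurrence)
print(per_distinct)
```

With the first loop, "apple" shows up three times with count 3 each time; with the second it shows up once.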

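Another optimization is to move lookups out of the loop. You keep looking up weights[W] and rel_weight[W] even though the value of W never changes. Let's cache those values outside the loop. Also, let's cache a reference to the ratios.append method.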

def match_share(string, W, weights, rel_weight):

    words = string.split()    
    words_counts = Counter(words)
    ratios = []

    # Cache these values for speed in loop
    ratios_append = ratios.append
    weights_W = weights[W]
    rel_W = rel_weight[W]

    for word in words:
        if word in weights_W and word in rel_W:
            if weights_W[word] != 0:    
                ratios_append(words_counts[word] * rel_W[word] / weights_W[word])

        else:
            ratios_append(0)

    if words:
        ratios = np.divide(ratios, float(len(words)))

    ratio = np.sum(ratios)
    return ratio

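Give it a try and see how it works. Please look at the note and the question raised above. There may be a bug, and there may be more ways to speed this up.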
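As one possible further speed-up, the sketch below keeps the result of the function above but loops over the distinct words only and replaces the NumPy calls with plain Python arithmetic. It assumes the current behavior (including the repeated count factor and the silent skip on zero weights) is what you intend. match_share_fast is a made-up name, the sample dictionaries are hypothetical, and whether it is actually faster on your data should be measured.

```python
from collections import Counter

def match_share_fast(string, W, weights, rel_weight):
    words = string.split()
    if not words:
        return 0.0
    weights_W = weights[W]
    rel_W = rel_weight[W]
    total = 0.0
    for word, count in Counter(words).items():
        weight = weights_W.get(word)
        # Missing words and zero weights both add nothing to the sum.
        if weight and word in rel_W:
            # The original loop adds count * rel / weight once per
            # occurrence, that is, count times in total.
            total += count * count * rel_W[word] / weight
    return total / len(words)

# Hypothetical sample data.
weights = {"w": {"apple": 4, "pear": 0}}
rel_weight = {"w": {"apple": 2, "pear": 3}}
print(match_share_fast("apple pear kiwi apple", "w", weights, rel_weight))  # 0.5
```

Dividing the sum once at the end is equivalent to dividing every entry by the number of words first, up to floating-point rounding.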

Answer 3

It would help to have a profile of that function's execution, but here are some general ideas:
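For reference, such a profile could be collected with the standard cProfile module, roughly as sketched here. The match_share below is a NumPy-free stand-in for the function from the question, profile_calls is a hypothetical helper, and the sample data is made up:

```python
import cProfile
import io
import pstats
from collections import Counter

def match_share(string, W, weights, rel_weight):
    """Stand-in for the function from the question, without NumPy."""
    words = string.split()
    words_counts = Counter(words)
    ratios = []
    for word in words:
        if word in weights[W] and word in rel_weight[W]:
            if weights[W][word] != 0:
                ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])
        else:
            ratios.append(0)
    return sum(ratios) / len(words) if words else 0.0

def profile_calls(func, sentences, *args, top=10):
    """Run func over all sentences under cProfile and return the report text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for sentence in sentences:
        func(sentence, *args)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()

# Hypothetical sample data standing in for the real Series and dictionaries.
sentences = ["apple pear kiwi apple"] * 1000
weights = {"w": {"apple": 4, "pear": 0}}
rel_weight = {"w": {"apple": 2, "pear": 3}}
print(profile_calls(match_share, sentences, "w", weights, rel_weight))
```

Independent of what the profile shows, the ideas that follow are cheap to apply.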

You are needlessly fetching some elements on every iteration. You can extract these before the loop.

For example:

weights_W = weights[W]
rel_weights_W = rel_weight[W]

You don't need to call .keys() on the dict.

These are equivalent:

word in weights_W.keys()
word in weights_W

Try to fetch the value without testing for membership first. This saves you a lookup.

For example, instead of:

if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
    if (weights[W][word]!=0):

You can do:

word_weight = weights_W.get(word)
if word_weight is not None:
    word_rel_weight = rel_weights_W.get(word)
    if word_rel_weight is not None:
        if word_weight != 0:  # lookup saved here
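Completed into a full function, the idea might look like the following sketch. It keeps the result of the original function (nothing is added for a zero weight or for a missing word) and uses plain Python arithmetic instead of NumPy to stay self-contained. match_share_get is a hypothetical name, and the sketch assumes None is never stored as a value in either dictionary.

```python
from collections import Counter

def match_share_get(string, W, weights, rel_weight):
    words = string.split()
    if not words:
        return 0.0
    words_counts = Counter(words)
    # Fetch the inner dictionaries once, before the loop.
    weights_W = weights[W]
    rel_weights_W = rel_weight[W]
    total = 0.0
    for word in words:
        word_weight = weights_W.get(word)
        if word_weight is not None:
            word_rel_weight = rel_weights_W.get(word)
            if word_rel_weight is not None:
                if word_weight != 0:  # no second lookup needed here
                    total += words_counts[word] * word_rel_weight / word_weight
        # Missing words would only add 0, so there is nothing to do for them.
    return total / len(words)

# Hypothetical sample data.
weights = {"w": {"apple": 4, "pear": 0}}
rel_weight = {"w": {"apple": 2, "pear": 3}}
print(match_share_get("apple pear kiwi apple", "w", weights, rel_weight))  # 0.5
```

Each word now costs at most one lookup per dictionary instead of a membership test followed by a second access.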
