How to make this function more (time) efficient?
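I have a DataFrame Series containing sentences (some of them are fairly long).
I also have 2 dictionaries that contain words as keys and integers as counts.
Not all words from the strings are present in both dictionaries. Some are in only one, some are in neither.
The DataFrame is 124011 rows long. The function takes me around 0.4 per string. This is way too long.
W is just a reference key for the dictionaries (weights = {}, weights[W] = {}).
Here is the function: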
from collections import Counter
import numpy as np

def match_share(string, W, weights, rel_weight):
    words = string.split()
    words_counts = Counter(words)
    ratios = []
    for word in words:
        if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
            if (weights[W][word]!=0):
                ratios.append(words_counts[word]*rel_weight[W][word]/weights[W][word])
        else:
            ratios.append(0)
    if len(words)>0:
        ratios = np.divide(ratios, float(len(words)))
    ratio = np.sum(ratios)
    return ratio
Thanks
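I think your time inefficiency may come from the fact that you are using Counter instead of dict. There are some hints that parts of the dict class are written in pure C, whereas Counter is written in Python.
I suggest changing your code to use a dict and testing whether that gives you a faster time.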
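One way to test that suggestion is to time both counting approaches on the same input. The sketch below is only an illustration: the helper names and the sample sentence are made up, and the relative timings will depend on the Python version and on the real data.

```python
import timeit
from collections import Counter


def count_with_counter(words):
    """Count occurrences using collections.Counter."""
    return Counter(words)


def count_with_dict(words):
    """Count occurrences using a plain dict and dict.get."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical sample sentence; substitute rows from the real Series.
sample_words = ("the quick brown fox jumps over the lazy dog " * 20).split()

# Both approaches must agree before their timings are worth comparing.
assert dict(count_with_counter(sample_words)) == count_with_dict(sample_words)

counter_time = timeit.timeit(lambda: count_with_counter(sample_words), number=200)
dict_time = timeit.timeit(lambda: count_with_dict(sample_words), number=200)
print(f"Counter: {counter_time:.4f}s  dict: {dict_time:.4f}s")
```

If the two timings come out close, the counting step is probably not the bottleneck and the loop body deserves more attention.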
Also, why is this code duplicated?
words = string.split()
words_counts = Counter(words)
words = string.split()
words_counts = Counter(words)
ratios = []
Let's clean this up a bit:
def match_share(string, W, weights, rel_weight):
    words = string.split()
    words_counts = Counter(words)
    words = string.split()
    words_counts = Counter(words)
This is redundant! Replace the 4 statements with 2:
def match_share(string, W, weights, rel_weight):
    words = string.split()
    words_counts = Counter(words)
Next:
    ratios = []
    for word in words:
        if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
            if (weights[W][word]!=0):
                ratios.append(words_counts[word]*rel_weight[W][word]/weights[W][word])
        else:
            ratios.append(0)
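I'm not sure what you think that code does. I hope you weren't trying to be tricky. But .keys returns an iterable, and X in <iterable> is much slower than X in <dict> (at least in Python 2, where .keys() builds a list; in Python 3 it returns a view whose membership test is also a hash lookup, so there the saving is mainly the extra method call). Also, note: if the innermost condition (weights[W][word] != 0) fails, you don't append anything. That may be a bug, since you do append 0 in the other else branch. (I don't know what you're trying to compute, so I'm just pointing it out.) And this is Python, not Perl or C or Java, so no parentheses are needed around if <test>: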
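One more difference between & and and is worth seeing in isolation: & is the bitwise operator and always evaluates both operands, while and stops as soon as the left side is false. The sketch below uses made-up sample dictionaries and hypothetical helper functions that record when they are called.

```python
# Hypothetical sample data standing in for weights[W] and rel_weight[W].
weights_W = {"apple": 3, "pear": 0}
rel_W = {"apple": 6}

calls = []  # records which membership tests actually ran


def in_weights(word):
    calls.append("weights")
    return word in weights_W


def in_rel(word):
    calls.append("rel")
    return word in rel_W


# `&` evaluates both operands, even though the first is already False.
calls.clear()
result_amp = in_weights("kiwi") & in_rel("kiwi")
amp_calls = list(calls)

# `and` short-circuits: the second test is skipped when the first is False.
calls.clear()
result_and = in_weights("kiwi") and in_rel("kiwi")
and_calls = list(calls)

print(amp_calls)  # ['weights', 'rel']
print(and_calls)  # ['weights']
```

Both expressions give the same truth value here, so the change does not alter the result; it only avoids the second lookup for words missing from the first dictionary.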
Let's go with this:
    ratios = []
    for word in words:
        if word in weights[W] and word in rel_weight[W]:
            if weights[W][word] != 0:
                ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])
        else:
            ratios.append(0)
Next:
    if len(words)>0:
        ratios = np.divide(ratios, float(len(words)))
You are trying to guard against dividing by zero. But you can use the truthiness of a list to check this, and avoid the comparison:
    if words:
        ratios = np.divide(ratios, float(len(words)))
The rest is fine, but you don't need the intermediate variable; you could return np.sum(ratios) directly.
    ratio = np.sum(ratios)
    return ratio
With those changes applied, your function looks like this:
def match_share(string, W, weights, rel_weight):
    words = string.split()
    words_counts = Counter(words)
    ratios = []
    for word in words:
        if word in weights[W] and word in rel_weight[W]:
            if weights[W][word] != 0:
                ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])
        else:
            ratios.append(0)
    if words:
        ratios = np.divide(ratios, float(len(words)))
    ratio = np.sum(ratios)
    return ratio
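Looking a little closer, I see that you're doing this:
    word_counts = Counter(words)
    for word in words:
        append( word_counts[word] * ...)
By my reading, this means that if "apple" appears 6 times, you append 6*... to the list once for every occurrence of the word. So your list will contain 6 separate entries of 6*.... Are you sure that's what you want? Or should it be for word in word_counts, to iterate over just the distinct words?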
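To make the difference concrete, here is a small sketch comparing the two loops on made-up data. The function names and sample values are hypothetical, and the zero entries for unmatched words are left out because they do not change the comparison.

```python
from collections import Counter


def ratios_per_occurrence(words, weights_W, rel_W):
    """Mirror the loop in the question: one entry per occurrence of a word."""
    word_counts = Counter(words)
    return [
        word_counts[word] * rel_W[word] / weights_W[word]
        for word in words
        if word in weights_W and word in rel_W and weights_W[word] != 0
    ]


def ratios_per_distinct_word(words, weights_W, rel_W):
    """Alternative: one entry per distinct word."""
    word_counts = Counter(words)
    return [
        count * rel_W[word] / weights_W[word]
        for word, count in word_counts.items()
        if word in weights_W and word in rel_W and weights_W[word] != 0
    ]


# Hypothetical sample data.
words = "apple apple apple pear".split()
weights_W = {"apple": 2, "pear": 4}
rel_W = {"apple": 1, "pear": 2}

print(ratios_per_occurrence(words, weights_W, rel_W))     # [1.5, 1.5, 1.5, 0.5]
print(ratios_per_distinct_word(words, weights_W, rel_W))  # [1.5, 0.5]
```

The first version weights a repeated word by its count twice (once in the factor, once in the number of entries), which is only right if that is the intended formula.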
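Another optimization is to remove lookups from inside your loop. You keep looking up weights[W] and rel_weight[W], even though the value of W never changes. Let's cache those values outside the loop. Also, let's cache a reference to the ratios.append method.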
def match_share(string, W, weights, rel_weight):
    words = string.split()
    words_counts = Counter(words)
    ratios = []
    # Cache these values for speed in loop
    ratios_append = ratios.append
    weights_W = weights[W]
    rel_W = rel_weight[W]
    for word in words:
        if word in weights_W and word in rel_W:
            if weights_W[word] != 0:
                ratios_append(words_counts[word] * rel_W[word] / weights_W[word])
        else:
            ratios_append(0)
    if words:
        ratios = np.divide(ratios, float(len(words)))
    ratio = np.sum(ratios)
    return ratio
Try it and see how it works. Please look at the note and the question above. There may be bugs, and there may be more ways to speed this up.
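One further idea, which is an assumption on my part and not something measured on the real data: for short word lists, converting a Python list to a NumPy array on every call may cost more than the arithmetic itself. A minimal sketch that keeps the same arithmetic but uses a running total instead of np.divide and np.sum could look like this (the function name match_share_no_numpy is made up):

```python
from collections import Counter


def match_share_no_numpy(string, W, weights, rel_weight):
    """Same arithmetic as match_share, using a running total instead of NumPy."""
    words = string.split()
    if not words:
        return 0.0  # np.sum of an empty list is also 0.0
    words_counts = Counter(words)
    weights_W = weights[W]
    rel_W = rel_weight[W]
    total = 0.0
    for word in words:
        if word in weights_W and word in rel_W:
            weight = weights_W[word]
            if weight != 0:
                total += words_counts[word] * rel_W[word] / weight
        # Words missing from either dict would contribute 0, so they are skipped.
    return total / len(words)


# Hypothetical sample data.
weights = {"w": {"apple": 2, "pear": 4}}
rel_weight = {"w": {"apple": 1, "pear": 2}}
share = match_share_no_numpy("apple apple apple pear kiwi", "w", weights, rel_weight)
print(share)  # 1.0
```

Dividing the total once at the end is mathematically the same as dividing every element first, although the floating-point result can differ in the last digits. Time it against the NumPy version on real rows before adopting it.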
It would be nice to have a profile of that function's execution, but here are some general ideas:
- You needlessly fetch some elements on every iteration. You can extract them before the loop, e.g.
    weights_W = weights[W]
    rel_weights_W = rel_weight[W]
- You don't need to call .keys() on a dict. These are equivalent:
    word in weights_W.keys()
    word in weights_W
- Try to get the values without checking for the key first. This saves you one lookup. E.g. instead of:
    if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
        if (weights[W][word]!=0):
  you can do:
    word_weight = weights_W.get(word)
    if word_weight is not None:
        word_rel_weight = rel_weights_W.get(word)
        if word_rel_weight is not None:
            if word_weight != 0:  # lookup saved here
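The last idea can be wrapped in a small helper so the loop body stays flat. This is a sketch under two assumptions: the helper name matched_ratio is made up, and None is never stored as a value in either dictionary (otherwise .get could not distinguish a missing key from a stored None).

```python
def matched_ratio(word, count, weights_W, rel_weights_W):
    """Return count * rel / weight for one word, or 0 when it cannot be computed."""
    word_weight = weights_W.get(word)  # one lookup instead of `in` plus indexing
    if word_weight is None or word_weight == 0:
        return 0
    word_rel_weight = rel_weights_W.get(word)
    if word_rel_weight is None:
        return 0
    return count * word_rel_weight / word_weight


# Hypothetical sample data.
weights_W = {"apple": 2, "pear": 0}
rel_weights_W = {"apple": 1, "pear": 5}

print(matched_ratio("apple", 3, weights_W, rel_weights_W))  # 1.5
print(matched_ratio("pear", 1, weights_W, rel_weights_W))   # 0
print(matched_ratio("kiwi", 1, weights_W, rel_weights_W))   # 0
```

Returning 0 for a zero weight differs slightly from the question's loop, which appends nothing in that case, but the final sum is unaffected.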