How to efficiently tally co-occurrences in a Python list of lists
I have a relatively large (~3 GB, 3+ million entries) list of sublists, where each sublist contains a set of tags. Here is a very simplified example:
from collections import Counter

tag_corpus = [['cat', 'fish'], ['cat'], ['fish', 'dog', 'cat']]
unique_tags = ['dog', 'cat', 'fish']
co_occurences = {key: Counter() for key in unique_tags}
for tags in tag_corpus:
    tallies = Counter(tags)  # per-sublist tag counts
    for key in tags:
        # Counter addition builds a new Counter on every iteration
        co_occurences[key] = co_occurences[key] + tallies
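This works like a charm, but it is super slow on the real dataset, which has very large sublists (~30K unique tags in total). Do any Python pros know how to speed this up?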
This might go faster. You'll have to measure.
from collections import Counter
from collections import defaultdict

tag_corpus = [['cat', 'fish'], ['cat'], ['fish', 'dog', 'cat']]
co_occurences = defaultdict(Counter)
for tags in tag_corpus:
    for key in tags:
        # update() adds the counts in place instead of building a new Counter
        co_occurences[key].update(tags)
unique_tags = sorted(co_occurences)
print(co_occurences)
print(unique_tags)
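Since the advice is to measure, here is a minimal timing sketch using the standard `timeit` module. The function names `tally_with_addition` and `tally_with_update` and the repeated three-sublist corpus are assumptions made for illustration; numbers on a real 3 GB dataset will look different.

```python
import timeit
from collections import Counter, defaultdict

# Synthetic corpus (an assumption for illustration): the three example
# sublists repeated many times.
tag_corpus = [['cat', 'fish'], ['cat'], ['fish', 'dog', 'cat']] * 1000
unique_tags = ['dog', 'cat', 'fish']


def tally_with_addition():
    # The question's approach: Counter addition creates a new Counter each time.
    co_occurences = {key: Counter() for key in unique_tags}
    for tags in tag_corpus:
        tallies = Counter(tags)
        for key in tags:
            co_occurences[key] = co_occurences[key] + tallies
    return co_occurences


def tally_with_update():
    # The defaultdict(Counter) approach: each Counter is updated in place.
    co_occurences = defaultdict(Counter)
    for tags in tag_corpus:
        for key in tags:
            co_occurences[key].update(tags)
    return co_occurences


if __name__ == "__main__":
    for func in (tally_with_addition, tally_with_update):
        seconds = timeit.timeit(func, number=3)
        print("%s: %.3f s for 3 runs" % (func.__name__, seconds))
```

Both functions should produce the same counts, so it is worth comparing their results once before trusting the timings.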
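I was just messing around and didn't expect to end up with something more efficient, but with 100000 cats, dogs and fish this is a lot faster, timing at 0.07 seconds instead of 1.25.
I was trying to end up with a shorter solution, but this way turned out to be the fastest, even though it looks very naive :)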
occurances = {}
for tags in tag_corpus:
    for key in tags:
        for key2 in tags:
            try:
                occurances[key][key2] += 1
            except KeyError:
                try:
                    # key exists, but key2 has not been seen with it yet
                    occurances[key][key2] = 1
                except KeyError:
                    # key itself has not been seen yet
                    occurances[key] = {key2: 1}
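As a sanity check, the sketch below wraps the nested try/except loop in a function and compares its result with the `defaultdict(Counter)` version on the example corpus. The function names `tally_try_except` and `tally_counter_update` are assumptions introduced for illustration.

```python
from collections import Counter, defaultdict

tag_corpus = [['cat', 'fish'], ['cat'], ['fish', 'dog', 'cat']]


def tally_try_except(corpus):
    # Same logic as the nested try/except loop, wrapped in a function.
    occurances = {}
    for tags in corpus:
        for key in tags:
            for key2 in tags:
                try:
                    occurances[key][key2] += 1
                except KeyError:
                    try:
                        occurances[key][key2] = 1
                    except KeyError:
                        occurances[key] = {key2: 1}
    return occurances


def tally_counter_update(corpus):
    # The defaultdict(Counter) version, converted to plain dicts for comparison.
    co_occurences = defaultdict(Counter)
    for tags in corpus:
        for key in tags:
            co_occurences[key].update(tags)
    return {key: dict(counts) for key, counts in co_occurences.items()}


result = tally_try_except(tag_corpus)
assert result == tally_counter_update(tag_corpus)
print(result['cat'])
```

For the three example sublists, `'cat'` co-occurs with itself three times, with `'fish'` twice, and with `'dog'` once.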
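You could try using a defaultdict, which avoids the initialization at the start, combined with the logic from Peters' answer; the running time is significantly faster: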
In [35]: %%timeit
   ....: co_occurences = defaultdict(Counter)
   ....: for tags in tag_corpus:
   ....:     for key in tags:
   ....:         co_occurences[key].update(tags)
   ....:
1 loop, best of 3: 513 ms per loop

In [36]: %%timeit
   ....: occurances = {k1: defaultdict(int) for k1 in unique_tags}
   ....: for tags in tag_corpus:
   ....:     for key in tags:
   ....:         for key2 in tags:
   ....:             occurances[key][key2] += 1
   ....:
10 loops, best of 3: 65.7 ms per loop

In [37]: %%timeit
   ....: co_occurences = {key:Counter() for key in unique_tags}
   ....: for tags in tag_corpus:
   ....:     tallies = Counter(tags)
   ....:     for key in tags:
   ....:         co_occurences[key] = co_occurences[key] + tallies
   ....:
1 loop, best of 3: 1.13 s per loop

In [38]: %%timeit
   ....: occurances = defaultdict(lambda: defaultdict(int))
   ....: for tags in tag_corpus:
   ....:     for key in tags:
   ....:         for key2 in tags:
   ....:             occurances[key][key2] += 1
   ....:
10 loops, best of 3: 66.5 ms per loop
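At least in Python 2, a Counter dict is not actually the fastest way to get counts; a defaultdict, however, is fast even when built with a lambda.
Even rolling your own count function is faster: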
def count(x):
    d = defaultdict(int)
    for ele in x:
        d[ele] += 1
    return d
Not as fast as the fastest, but still not bad:
In [42]: %%timeit
   ....: co_occurences = {key: defaultdict(int) for key in unique_tags}
   ....: for tags in tag_corpus:
   ....:     tallies = count(tags)
   ....:     for key in tags:
   ....:         for k, v in tallies.items():
   ....:             co_occurences[key][k] += v
   ....:
10 loops, best of 3: 164 ms per loop
If you want more speed, a little Cython would probably go a long way.
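Staying in pure Python, one variation that may be worth measuring is to tally each sublist once and add the product of the two counts for every pair of distinct tags. This is only a sketch: the names `tally_by_products` and `tally_nested_loops` are assumptions, and it has not been timed here. It can only reduce work when sublists contain repeated tags; if every tag in a sublist is distinct, it performs the same number of increments plus the cost of building the tallies.

```python
from collections import Counter, defaultdict


def tally_by_products(corpus):
    # For each sublist, count every tag once, then add count(key) * count(key2)
    # for each pair of distinct tags. This matches what the nested loops
    # accumulate one increment at a time.
    co_occurences = defaultdict(lambda: defaultdict(int))
    for tags in corpus:
        tallies = Counter(tags)
        for key, n_key in tallies.items():
            row = co_occurences[key]
            for key2, n_key2 in tallies.items():
                row[key2] += n_key * n_key2
    return co_occurences


def tally_nested_loops(corpus):
    # Reference version: the nested defaultdict loops from the timings.
    occurances = defaultdict(lambda: defaultdict(int))
    for tags in corpus:
        for key in tags:
            for key2 in tags:
                occurances[key][key2] += 1
    return occurances


def to_plain(nested):
    # Convert nested defaultdicts to plain dicts for comparison.
    return {key: dict(row) for key, row in nested.items()}


# Sample with a repeated tag (an assumption for illustration).
sample = [['cat', 'cat', 'fish'], ['cat'], ['fish', 'dog', 'cat']]
assert to_plain(tally_by_products(sample)) == to_plain(tally_nested_loops(sample))
```

The assertion checks that both functions agree on a small sample containing a repeated tag; any speed difference on real data would need to be measured.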