Python optimization of count() for n-grams

I am trying to count the items in a list of strings using the count() function and sort the results from largest to smallest. Although the function performs reasonably well on small lists, it does not scale well at all, as the small experiment below shows with only 5 cycles of growing the input (waiting for the 6th cycle took too long). Is there a way to optimize the first list comprehension, or an alternative to count() that scales better?

import nltk
from operator import itemgetter
import time

t = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Curabitur pretium tincidunt lacus. Nulla gravida orci a odio. Nullam varius, turpis et commodo pharetra, est eros bibendum elit, nec luctus magna felis sollicitudin mauris. Integer in mauris eu nibh euismod gravida. Duis ac tellus et risus vulputate vehicula. Donec lobortis risus a elit. Etiam tempor. Ut ullamcorper, ligula eu tempor congue, eros est euismod turpis, id tincidunt sapien risus a quam. Maecenas fermentum consequat mi. Donec fermentum. Pellentesque malesuada nulla a mi. Duis sapien sem, aliquet nec, commodo eget, consequat quis, neque. Aliquam faucibus, elit ut dictum aliquet, felis nisl adipiscing sapien, sed malesuada diam lacus eget erat. Cras mollis scelerisque nunc. Nullam arcu. Aliquam consequat. Curabitur augue lorem, dapibus quis, laoreet et, pretium ac, nisi. Aenean magna nisl, mollis quis, molestie eu, feugiat in, orci. In hac habitasse platea dictumst."

unigrams = nltk.word_tokenize(t.lower())

for size in range(1, 6):

    unigrams = unigrams*size

    start = time.time()

    unigram_freqs = [unigrams.count(word) for word in unigrams]    
    freq_pairs = set((zip(unigrams, unigram_freqs)))
    freq_pairs = sorted(freq_pairs, key=itemgetter(1))[::-1]

    end = time.time()

    time_elapsed = round(end-start, 3)

    print("Runtime: " + str(time_elapsed) + "s for " + str(size) + "x the size")

# Runtime: 0.001s for 1x the size
# Runtime: 0.003s for 2x the size
# Runtime: 0.022s for 3x the size
# Runtime: 0.33s for 4x the size 
# Runtime: 8.065s for 5x the size

Using Counter from collections and sorting via its member function most_common(), I get almost 0 seconds regardless of the size:

import nltk
nltk.download('punkt')


from operator import itemgetter
from collections import Counter
import time
t = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Curabitur pretium tincidunt lacus. Nulla gravida orci a odio. Nullam varius, turpis et commodo pharetra, est eros bibendum elit, nec luctus magna felis sollicitudin mauris. Integer in mauris eu nibh euismod gravida. Duis ac tellus et risus vulputate vehicula. Donec lobortis risus a elit. Etiam tempor. Ut ullamcorper, ligula eu tempor congue, eros est euismod turpis, id tincidunt sapien risus a quam. Maecenas fermentum consequat mi. Donec fermentum. Pellentesque malesuada nulla a mi. Duis sapien sem, aliquet nec, commodo eget, consequat quis, neque. Aliquam faucibus, elit ut dictum aliquet, felis nisl adipiscing sapien, sed malesuada diam lacus eget erat. Cras mollis scelerisque nunc. Nullam arcu. Aliquam consequat. Curabitur augue lorem, dapibus quis, laoreet et, pretium ac, nisi. Aenean magna nisl, mollis quis, molestie eu, feugiat in, orci. In hac habitasse platea dictumst."

unigrams = nltk.word_tokenize(t.lower())

for size in range(1, 5):

    unigrams = unigrams*size

    start = time.time()

    unigram_freqs = [unigrams.count(word) for word in unigrams]    
    freq_pairs = set((zip(unigrams, unigram_freqs)))
    freq_pairs = sorted(freq_pairs, key=itemgetter(1))[::-1]

    end = time.time()

    time_elapsed = round(end-start, 3)

    print("Slow Runtime: " + str(time_elapsed) + "s for " + str(size) + "x the size")

    start = time.time()
    a = Counter(unigrams).most_common()
    #print(a)
    end = time.time()

    time_elapsed = round(end-start, 3)

    print("Fast Runtime: " + str(time_elapsed) + "s for " + str(size) + "x the size")

Slow Runtime: 0.003s for 1x the size
Fast Runtime: 0.0s for 1x the size
Slow Runtime: 0.006s for 2x the size
Fast Runtime: 0.0s for 2x the size
Slow Runtime: 0.157s for 3x the size
Fast Runtime: 0.0s for 3x the size
Slow Runtime: 1.891s for 4x the size
Fast Runtime: 0.001s for 4x the size
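For reference, here is a minimal sketch of the Counter approach in isolation, using a made-up token list rather than the NLTK output above. It shows why it scales: Counter builds its counts in a single pass over the list, while calling list.count(word) once per word rescans the whole list every time, which is quadratic.

```python
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat", "the"]

# Counter builds a hash map of counts in one O(n) pass, whereas
# [tokens.count(w) for w in tokens] is O(n^2).
freqs = Counter(tokens)

# most_common() returns (item, count) pairs sorted from most to least
# frequent, replacing the manual zip/set/sorted pipeline entirely.
print(freqs.most_common())
```

most_common() with no argument returns all pairs; most_common(k) returns only the top k, which is cheaper still when you only need the head of the distribution.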