Faster way to store an NLTK FreqDist?

Question

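I'm trying to speed up my application, and I found that the simple little function below (compute_ave_freq) is actually one of the most time-consuming ones. The culprit appears to be the line that unpickles an NLTK FreqDist; that step alone takes most of the time.

Granted, even that is less than half the time it would take to recompute the FreqDist from scratch. Is there a better way to save an NLTK FreqDist object? I tried serializing it to JSON, but that stores it as a plain dictionary and loses a lot of the NLTK functionality I need.

Here is the code: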

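import pickle

def compute_ave_freq(word_forms):
    # The pickle is re-read from disk on every call; per the profile below,
    # this is the slow line.
    with open("data/fd.txt", 'rb') as f:
        fd = pickle.load(f)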
    total_freq = 0
    for form in word_forms:
        freq = fd.freq(form)
        total_freq += freq
    try:
        ave_freq = total_freq/len(word_forms)
    except ZeroDivisionError:
        ave_freq = 0
    return ave_freq

Here is the LineProfiler output:

Total time: 0.197121 s
File: /home/username/development/appname/filename.py
Function: compute_ave_freq at line 25
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
25                                           def compute_ave_freq(word_forms, debug=False):
26                                               # word_forms is a list of morphological variations of a word, such as
27                                               # ['كتبوا', 'كتبو', 'كتبنا', 'كتبت']
28                                           
29         1        78580  78580.0     79.1      fd = pickle.load(open("data/fd.txt", 'rb'))
30         1            3      3.0      0.0      total_freq = 0
31         5           10      2.0      0.0      for form in word_forms:
32         4        20676   5169.0     20.8          freq = fd.freq(form)
33         4            9      2.2      0.0          if debug==True:
34                                                       print(form, '\n', freq)
35         4            6      1.5      0.0          total_freq += freq
36         1            1      1.0      0.0      try:
37         1            3      3.0      0.0          ave_freq = total_freq/len(word_forms)
38                                               except ZeroDivisionError:
39                                                   ave_freq = 0
40         1            1      1.0      0.0      return ave_freq

Thanks!

Answer 1

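As suggested in the comments, moving the fd variable outside the function should solve the problem, because the pickle is then loaded once when the module is imported rather than on every call: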

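import pickle

# Load the pickled FreqDist once, at import time, instead of on every call.
with open("data/fd.txt", 'rb') as f:
    fd = pickle.load(f)

def compute_ave_freq(word_forms):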
    total_freq = 0
    for form in word_forms:
        freq = fd.freq(form)
        total_freq += freq
    try:
        ave_freq = total_freq/len(word_forms)
    except ZeroDivisionError:
        ave_freq = 0
    return ave_freq
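
If loading at import time is not desirable (for example, when the module is imported but the function is never called), the same idea can be sketched with a cached loader. This is only an illustration under stated assumptions: load_fd and FD_PATH are hypothetical names, the temporary file stands in for "data/fd.txt", and a collections.Counter stands in for the FreqDist so the example runs without NLTK.

```python
import pickle
import tempfile
from collections import Counter
from functools import lru_cache
from pathlib import Path

# Stand-in data: a Counter pickled to a temporary file. In the real
# application this would be the pickled FreqDist at "data/fd.txt".
FD_PATH = Path(tempfile.mkdtemp()) / "fd.pkl"
with open(FD_PATH, "wb") as f:
    pickle.dump(Counter({"a": 3, "b": 1}), f)


@lru_cache(maxsize=None)
def load_fd(path):
    """Unpickle the distribution once per path; later calls hit the cache."""
    with open(path, "rb") as f:
        return pickle.load(f)


def compute_ave_freq(word_forms, path=FD_PATH):
    if not word_forms:
        return 0
    fd = load_fd(path)
    # A Counter has no .freq(); with a real FreqDist this would be fd.freq(form).
    total = sum(fd.values())
    return sum(fd[form] / total for form in word_forms) / len(word_forms)
```

The first call pays the unpickling cost; every later call reuses the object already in memory.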

Since the function just sums and averages, it can also be written more compactly:

fd = pickle.load(open("data/fd.txt", 'rb'))

def compute_ave_freq(word_forms):
    try:
        return sum([fd.freq(form) for form in word_forms]) / len(word_forms)
    except ZeroDivisionError:
        return 0

Or:

fd = pickle.load(open("data/fd.txt", 'rb'))

def compute_ave_freq(word_forms):
    l = len(word_forms)
    if  l > 0:
        return sum([fd.freq(form) for form in word_forms]) / l
    else:
        return 0

Or even simpler:

fd = pickle.load(open("data/fd.txt", 'rb'))

def compute_ave_freq(word_forms):
    l = len(word_forms)
    return sum([fd.freq(form) for form in word_forms]) / l if l > 0 else 0

Or as a lambda:

fd = pickle.load(open("data/fd.txt", 'rb'))
compute_ave_freq = lambda x: sum(fd.freq(i) for i in x)/len(x)
ave_freq = compute_ave_freq(word_forms) if len(word_forms) > 0 else 0
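
The profile in the question also shows roughly 20% of the time going to fd.freq(form), at about 5 ms per call. One plausible explanation, assuming an NLTK version in which freq() recomputes the total sample count on each call, is that the total is summed again for every word form. A hedged sketch of computing it once follows; it uses a collections.Counter as a stand-in for the FreqDist, and TOTAL is a hypothetical name.

```python
from collections import Counter

# Stand-in for the unpickled FreqDist (assumption: FreqDist behaves like a
# Counter of sample counts).
fd = Counter({"wrote": 6, "writes": 3, "written": 1})

# Sum the counts once, at load time, rather than once per lookup.
TOTAL = sum(fd.values())


def compute_ave_freq(word_forms):
    if not word_forms or TOTAL == 0:
        return 0
    # Missing forms count as 0, just as they would in a Counter lookup.
    return sum(fd[form] for form in word_forms) / (TOTAL * len(word_forms))
```

Whether this helps in practice depends on the installed NLTK version, so it is worth profiling before and after.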

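For background on the try/except and if-check variants above, take a look at EAFP ("easier to ask forgiveness than permission") and LBYL ("look before you leap").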
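
On the JSON attempt mentioned in the question: FreqDist subclasses collections.Counter, so a plain dictionary of counts read back from JSON can be passed to the FreqDist constructor to get the NLTK methods back. The sketch below assumes the samples are strings (JSON object keys must be) and falls back to Counter only so that it runs where NLTK is not installed. Whether this loads faster than the pickle is not something this example establishes; it would need to be measured.

```python
import json
from collections import Counter

try:
    from nltk import FreqDist  # the real class, if NLTK is installed
except ImportError:
    FreqDist = Counter  # stand-in so the sketch runs without NLTK

fd = FreqDist({"a": 3, "b": 1})

# Serialize the counts as a plain JSON object.
serialized = json.dumps(dict(fd), ensure_ascii=False)

# Rebuild a FreqDist from the plain dict that JSON gives back.
restored = FreqDist(json.loads(serialized))
```

After the round trip, restored is a FreqDist again (or a Counter in the fallback case), not a bare dict.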
