需要建议 - Python 代码性能改进

Suggestion required - Python code performance improvement

需要一些建议来提高我的代码的性能。

我有两个文件(Keyword.txt,description.txt)。 Keyword.txt 包含关键字列表(具体而言超过 11,000 个),descriptions.txt 包含非常大的文本描述(超过 9,000 个)。

我正在尝试从 keyword.txt 中一次读取一个关键字,并检查描述中是否存在该关键字。如果关键字存在,我将其写入新文件。所以这就像一个多对多关系 (11,000 * 9,000)。

示例关键字:

Xerox
VMWARE CLOUD

样本描述(很大):

Planning and implementing entire IT Infrastructure. Cyberoam firewall implementation and administration in head office and branch office. Report generation and analysis. Including band width conception, internet traffic and application performance. Windows 2003/2008 Server Domain controller implementation and managing. VERITAS Backup for Clients backup, Daily backup of applications and database. Verify the backed up database for data integrity. Send backup tapes to remote location for safe storage Installing and configuring various network devices; Routers, Modems, Access Points, Wireless ADSL+ modems / Routers Monitoring, managing & optimizing Network. Maintaining Network Infrastructure for various clients. Creating Users and maintaining the Linux Proxy servers for clients. Trouble shooting, diagnosing, isolating & resolving Windows / Network Problems. Configuring CCTV camera, Biometrics attendance machine, Access Control System Kaspersky Internet Security / ESET NOD32

下面是我写的代码:

import csv
import nltk
import re
wr = open(OUTPUTFILENAME,'w')
def match():
    c = 0
    ft = open('DESCRIPTION.TXT','r')
    ky2 = open('KEYWORD.TXT','r')
    reader = csv.reader(ft)
    keywords = []
    keyword_reader2 = csv.reader(ky2)
    for x in keyword_reader2: # Storing all the keywords to a list
        keywords.append(x[1].lower())

    string = ' '
    c = 0
    for row in reader:
        sentence = row[1].lower()
        id = row[0]
        for word in keywords:
            if re.search(r'\b{}\b'.format(re.escape(word.lower())),sentence):
                    string = string + id+'$'+word.lower()+'$'+sentence+ '\n'
                    c = c + 1
        if c > 5000:  # I am writing 5000 lines at a time.
            print("Batch printed")
            c = 0
            wr.write(string)
            string = ' '
    wr.write(string)
    ky2.close()
    ft.close()
    wr.close()

match()

现在完成此代码大约需要 120 分钟。我尝试了几种方法来提高速度。

  1. 一开始我是一次写一行,然后我把它改为一次 5000 行,因为它是一个小文件,我可以把所有东西都放在内存中。没有看到太大的改善。
  2. 我将所有内容推送到标准输出并使用来自控制台的管道将所有内容附加到文件。这甚至更慢。

我想知道是否有更好的方法,因为我可能在代码中做错了什么。

我的电脑规格:内存:15gb 处理器:i7 第 4 代

我猜您想加快搜索速度。在这种情况下,如果您不关心描述中关键字的频率,只关心它们是否存在,您可以尝试以下操作:

对于每个描述文件,将文本拆分成单独的词,并生成一组唯一的词。

然后,对于关键字列表中的每个关键字,检查集合是否包含关键字,如果为真则写入文件。

这应该会使您的迭代更快。它还应该可以帮助您跳过正则表达式,这也可能是您性能问题的一部分。

PS:我的方法假定您过滤掉标点符号。

如果您所有的搜索词短语都由完整的单词组成(begin/end 在单词边界上),那么并行索引到单词树中的效率会尽可能高。

类似

# keep lowercase characters and digits
# keep apostrophes for contractions (isn't, couldn't, etc)
# convert uppercase characters to lowercase
# replace all other printable symbols with spaces
TO_ALPHANUM_LOWER = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ'!#$%&()*+,-./:;<=>?@[]^_`{|}~ \t\n\r\x0b\x0c\"\",
    "abcdefghijklmnopqrstuvwxyz'                                     "
)

def clean(s):
    """
    Convert string `s` to canonical form for searching
    """
    return s.translate(TO_ALPHANUM_LOWER)

class WordTree:
    __slots__ = ["children", "terminal"]

    def __init__(self, phrases=None):
        self.children = {}   # {"word": WordTrie}
        self.terminal = ''   # if end of search phrase, full phrase is stored here
        # preload tree
        if phrases:
            for phrase in phrases:
                self.add_phrase(phrase)

    def add_phrase(self, phrase):
        tree  = self
        words = clean(phrase).split()
        for word in words:
            ch = tree.children
            if word in ch:
                tree = ch[word]
            else:
                tree = ch[word] = WordTree()
        tree.terminal = " ".join(words)

    def inc_search(self, word):
        """
        Search one level deeper into the tree

        Returns
          (None,    ''    )  if word not found
          (subtree, ''    )  if word found but not terminal
          (subtree, phrase)  if word found and completes a search phrase
        """
        ch = self.children
        if word in ch:
            wt = ch[word]
            return wt, wt.terminal
        else:
            return (None, '')

    def parallel_search(self, text):
        """
        Return all search phrases found in text
        """
        found  = []
        fd = found.append
        partials = []
        for word in clean(text).split():
            new_partials = []
            np = new_partials.append
            # new search from root
            wt, phrase = self.inc_search(word)
            if wt:     np(wt)
            if phrase: fd(phrase)
            # continue existing partial matches
            for partial in partials:
                wt, phrase = partial.inc_search(word)
                if wt:     np(wt)
                if phrase: fd(phrase)
            partials = new_partials
        return found

    def tree_repr(self, depth=0, indent="  ", terminal=" *"):
        for word,tree in self.children.items():
            yield indent * depth + word + (terminal if tree.terminal else '')
            yield from tree.tree_repr(depth + 1, indent, terminal)

    def __repr__(self):
        return "\n".join(self.tree_repr())

那么你的程序就变成了

import csv

SEARCH_PHRASES = "keywords.csv"
SEARCH_INTO    = "descriptions.csv"
RESULTS        = "results.txt"

# get search phrases, build WordTree
with open(SEARCH_PHRASES) as inf:
    wt = WordTree(*(phrase for _,phrase in csv.reader(inf)))

with open(SEARCH_INTO) as inf, open(RESULTS, "w") as outf:
    # bound methods (save some look-ups)
    find_phrases = wt.parallel_search
    fmt          = "{}${}${}\n".format
    write        = outf.write
    # sentences to search
    for id,sentence in csv.reader(inf):
        # search phrases found
        for found in find_phrases(sentence):
            # store each result
            write(fmt(id, found, sentence))

应该快一千倍。