How can I make this remove_stopwords function faster in Python?

Question

I have a remove_stopwords function like the one below. How can I make it run faster?

temp.reverse()

def drop_stopwords(text):
    
    for x in temp:
        if len(x.split()) > 1:
            text_list = text.split()  
            for y in range(len(text_list)-len(x.split())):
                if " ".join(text_list[y:y+len(x.split())]) == x:
                    del text_list[y:y+len(x.split())]
                    text = " ".join(text_list)
        
        else:
            text = " ".join(text for text in text.split() if text not in vietnamese)

    return text

Processing the text in my data takes 14 seconds. With a trick like the one below, the time drops to 3 seconds:


temp.reverse()

def drop_stopwords(text):
    
    for x in temp:
        if len(x.split()) >2:
            if x in text:
                text = text.replace(x,'')

        elif len(x.split()) > 1:
            text_list = text.split()  
            for y in range(len(text_list)-len(x.split())):
                if " ".join(text_list[y:y+len(x.split())]) == x:
                    del text_list[y:y+len(x.split())]
                    text = " ".join(text_list)
        
        else:
            text = " ".join(text for text in text.split() if text not in vietnamese)

    return text

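But I think this may give wrong results in some places for my language. How can I rewrite this function in Python so that it is faster? (In C and C++ I could solve this easily with a function like the one above :(( )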

Answer 1

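Your function does a lot of the same work over and over, in particular repeatedly splitting and joining the same text. Doing a single split, operating on the list, and then doing a single join at the end will probably be faster, and will certainly lead to simpler code. Unfortunately I don't have any of your sample data to test performance with, but hopefully this gives you something to experiment with: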

temp = ["foo", "baz ola"]


def drop_stopwords(text):
    text_list = text.split()
    text_len = len(text_list)
    for word in temp:
        word_list = word.split()
        word_len = len(word_list)
        for i in range(text_len + 1 - word_len):
            if text_list[i:i+word_len] == word_list:
                text_list[i:i+word_len] = [None] * word_len
    return ' '.join(t for t in text_list if t)


print(drop_stopwords("the quick brown foo jumped over the baz ola dog"))
# the quick brown jumped over the dog
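
The function above still splits every stopword on every call. If the stopword list is fixed, that work can be hoisted out and done once. The following is only a sketch under that assumption: `build_index` and `drop_stopwords_indexed` are hypothetical names, not part of the original code, and the matching rule (longest phrase first, one left-to-right pass) is a design choice that can differ from the loop above when stopwords overlap.

```python
from collections import defaultdict


def build_index(stopwords):
    """Split the stopword list once (hypothetical helper).

    Returns a set of single-word stopwords and a dict that maps the first
    word of each multi-word stopword to its candidate phrases.
    """
    singles = set()
    phrases = defaultdict(list)
    for entry in stopwords:
        words = entry.split()
        if len(words) == 1:
            singles.add(words[0])
        elif words:
            phrases[words[0]].append(words)
    # Try longer phrases first so "a b c" wins over "a b".
    for candidates in phrases.values():
        candidates.sort(key=len, reverse=True)
    return singles, phrases


def drop_stopwords_indexed(text, singles, phrases):
    """Remove stopwords in a single pass over the tokens of `text`."""
    tokens = text.split()
    kept = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        matched = False
        # Only phrases starting with this token need to be compared.
        for phrase in phrases.get(token, ()):
            if tokens[i:i + len(phrase)] == phrase:
                i += len(phrase)  # skip the whole phrase
                matched = True
                break
        if matched:
            continue
        if token not in singles:  # set lookup, O(1) on average
            kept.append(token)
        i += 1
    return " ".join(kept)


singles, phrases = build_index(["foo", "baz ola"])
print(drop_stopwords_indexed(
    "the quick brown foo jumped over the baz ola dog", singles, phrases))
# the quick brown jumped over the dog
```

With this layout the cost per text depends mostly on the number of tokens rather than on the size of the stopword list, which may matter if the list is long. Whether it is actually faster on your data is something you would need to measure.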

You could also try simply calling text.replace iteratively in all cases, and see how its performance compares with the more complex split-based solution:

temp = ["foo", "baz ola"]


def drop_stopwords(text):
    for word in temp:
        text = text.replace(word, '')
    return ' '.join(text.split())


print(drop_stopwords("the quick brown foo jumped over the baz ola dog"))
# the quick brown jumped over the dog
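
One caveat with plain `str.replace` is that it matches substrings, so a stopword such as "foo" would also be cut out of a longer word like "football". If that matters for your data, one possible alternative is a single precompiled regular expression that only matches whole whitespace-delimited words. This is a sketch, not a benchmarked solution; `build_stopword_pattern` and `drop_stopwords_regex` are hypothetical names introduced here for illustration.

```python
import re


def build_stopword_pattern(stopwords):
    """Compile one regex for all stopwords (hypothetical helper)."""
    # Longest entries first, so a phrase is preferred over its first word.
    ordered = sorted(stopwords, key=len, reverse=True)
    alternatives = "|".join(
        # Allow any run of whitespace between the words of a phrase.
        r"\s+".join(re.escape(word) for word in entry.split())
        for entry in ordered
        if entry.split()
    )
    # (?<!\S) and (?!\S) require whitespace or the string edge on both
    # sides, so only whole words are matched.
    return re.compile(r"(?<!\S)(?:" + alternatives + r")(?!\S)")


def drop_stopwords_regex(text, pattern):
    """Remove all matches, then normalise the remaining whitespace."""
    return " ".join(pattern.sub("", text).split())


pattern = build_stopword_pattern(["foo", "baz ola"])
print(drop_stopwords_regex(
    "the quick brown foo jumped over the baz ola dog", pattern))
# the quick brown jumped over the dog
print(drop_stopwords_regex("football foo", pattern))
# football
```

The pattern is compiled once and reused for every text, so the per-call work is a single `sub` plus one split and join. With a very large stopword list the alternation itself can become slow, so it is worth timing against the other versions.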

python

stop-words

pandas