How to create a searchable tree on Persian text?

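I want to remove stop words from Persian text. I have provided my stop-word data in the link below. It seems to me that a prebuilt tree of stop words would save a lot of time: I would look up each word of the text in the tree, remove the word from the text if it is in the tree, and keep it if it is not.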

I expect this to reduce the cost from O(n * l) to O(n * log(l)).

These are my stop words

If you have a better suggestion than searching a prebuilt tree, I would appreciate it if you shared it with me.

Here is an answer that uses a trie:

Reading the data:

# reading the stop-word data (one word per line, no header row)
import pandas as pd

stopwords = pd.read_csv('STOPWORDS', header=None)

Trie:

# creating the trie
class TrieNode: 

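    # Trie node class 
    def __init__(self): 
        # one slot per possible character index (see _charToIndex);
        # 15000 slots are enough for the Arabic-script block U+0600-U+06FF
        self.children = [None]*15000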

        # isEndOfWord is True if node represent the end of the word 
        self.isEndOfWord = False

class Trie: 

    # Trie data structure class 
    def __init__(self): 
        self.root = self.getNode() 

    def getNode(self): 

        # Returns new trie node (initialized to NULLs) 
        return TrieNode() 

    def _charToIndex(self,ch): 

        # private helper function 
        # Converts the current character of the key into a list index, 
        # offset from '!' (code point 33); the code point must lie 
        # between 33 and 15032 to fit the 15000-slot children list 

        return ord(ch)-ord('!') 


    def insert(self,key): 

        # If not present, inserts key into trie 
        # If the key is prefix of trie node, 
        # just marks leaf node 
        pCrawl = self.root 
        length = len(key) 
        for level in range(length): 
            index = self._charToIndex(key[level]) 

            # if current character is not present 
            if not pCrawl.children[index]: 
                pCrawl.children[index] = self.getNode() 
            pCrawl = pCrawl.children[index] 

        # mark last node as leaf 
        pCrawl.isEndOfWord = True

    def search(self, key): 

        # Search for key in the trie 
        # Returns True if the key is present 
        # in the trie, else False 
        pCrawl = self.root 
        length = len(key) 
        for level in range(length): 
            index = self._charToIndex(key[level]) 
            if not pCrawl.children[index]: 
                return False
            pCrawl = pCrawl.children[index] 

        return pCrawl is not None and pCrawl.isEndOfWord 
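
The class above allocates a 15000-slot list for every node, which can use a lot of memory for a large stop-word list. As a possible variation (not part of the original answer), each node can keep a dict that holds only the children that actually exist. The names `DictTrieNode` and `DictTrie` and the three sample words are assumptions made for illustration only.

```python
class DictTrieNode:
    """Trie node that stores only the children that actually exist."""

    def __init__(self):
        self.children = {}           # maps a character to its child node
        self.is_end_of_word = False  # True if a stored word ends here


class DictTrie:
    """Dict-based trie; works for any Unicode character, no index math."""

    def __init__(self):
        self.root = DictTrieNode()

    def insert(self, key):
        node = self.root
        for ch in key:
            # create the child only when it is missing
            if ch not in node.children:
                node.children[ch] = DictTrieNode()
            node = node.children[ch]
        node.is_end_of_word = True

    def search(self, key):
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end_of_word


trie = DictTrie()
for word in ["از", "به", "که"]:  # illustrative sample, not the real list
    trie.insert(word)

print(trie.search("از"))    # True
print(trie.search("کتاب"))  # False
```

With a dict there is no need for `_charToIndex`, so characters outside the 33 to 15032 code-point range no longer cause an index error.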

Usage example:

# Input keys: the stop words in the first column of the CSV 
keys = list(stopwords.loc[:,0])

output = ["Not present in trie", 
        "Present in trie"] 

# Trie object 
t = Trie() 

# Construct trie 
for key in keys: 
    t.insert(key) 


print("{} ---- {}".format("از",output[t.search("از")])) 

Output:

از ---- Present in trie
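
For the original goal of removing stop words from a text, the lookup has to be applied to every word. The sketch below assumes plain whitespace tokenization, which may be too simple for real Persian text, and it uses Python's built-in `set` for the membership test instead of the trie, since a hash lookup takes constant time on average. The function name `remove_stopwords` and the sample data are hypothetical.

```python
def remove_stopwords(text, stopwords):
    """Return text without the whitespace-separated tokens found in stopwords."""
    stopword_set = set(stopwords)  # hash-based membership test
    kept = [word for word in text.split() if word not in stopword_set]
    return " ".join(kept)


sample_stopwords = ["از", "به", "که"]  # illustrative sample, not the real list
sample_text = "من از خانه به مدرسه رفتم"

print(remove_stopwords(sample_text, sample_stopwords))
# من خانه مدرسه رفتم
```

To filter with the trie instead, the condition `word not in stopword_set` could be replaced by `not t.search(word)`, using the `t` object built in the usage example.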