How to create a searchable tree on Persian text?
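I want to remove stop words from Persian text. I have provided the stop word data in the link below. In my opinion, a prebuilt tree of stop words would save a lot of time: I would look up each word of the text in this prebuilt tree, delete the word from the text if it is in the tree, and keep it otherwise.
This should reduce the cost from O(n * l) to O(n*log(l)).
If you have a better suggestion than searching a prebuilt tree, I would be grateful if you shared it with me.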
Here is an answer based on a trie:
Reading the data:
# reading stopword data
import pandas as pd

stopwords = pd.read_csv('STOPWORDS', header=None)
The trie:
# creating the trie
class TrieNode:
    # Trie node class
    def __init__(self):
        # one child slot per character index (see Trie._charToIndex)
        self.children = [None]*15000
        # isEndOfWord is True if the node represents the end of a word
        self.isEndOfWord = False

class Trie:
    # Trie data structure class
    def __init__(self):
        self.root = self.getNode()

    def getNode(self):
        # Returns a new trie node (children initialized to None)
        return TrieNode()

    def _charToIndex(self, ch):
        # private helper function
        # Converts the current character of the key into an index,
        # counted from the code point of '!'
        return ord(ch)-ord('!')

    def insert(self, key):
        # If not present, inserts key into the trie
        # If the key is a prefix of an existing entry,
        # just marks its last node as the end of a word
        pCrawl = self.root
        length = len(key)
        for level in range(length):
            index = self._charToIndex(key[level])
            # if the current character is not present, add a node for it
            if not pCrawl.children[index]:
                pCrawl.children[index] = self.getNode()
            pCrawl = pCrawl.children[index]
        # mark the last node as the end of a word
        pCrawl.isEndOfWord = True

    def search(self, key):
        # Searches for key in the trie
        # Returns True if key is present, else False
        pCrawl = self.root
        length = len(key)
        for level in range(length):
            index = self._charToIndex(key[level])
            if not pCrawl.children[index]:
                return False
            pCrawl = pCrawl.children[index]
        return pCrawl is not None and pCrawl.isEndOfWord
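The trie above allocates a 15000-slot list for every node, which can use a lot of memory for a large stop word list. As a hedged alternative, here is a minimal sketch of the same insert/search logic with a dict of children keyed by character. The names `DictTrieNode` and `DictTrie` and the sample words are hypothetical and are not part of the answer.

```python
class DictTrieNode:
    """Trie node that stores only the children that actually exist."""
    __slots__ = ("children", "is_end_of_word")

    def __init__(self):
        self.children = {}           # maps a character to its child node
        self.is_end_of_word = False  # True if a stored word ends here


class DictTrie:
    """Hypothetical dict-based variant of the trie shown above."""

    def __init__(self):
        self.root = DictTrieNode()

    def insert(self, key):
        node = self.root
        for ch in key:
            child = node.children.get(ch)
            if child is None:
                # create the missing node for this character
                child = DictTrieNode()
                node.children[ch] = child
            node = child
        node.is_end_of_word = True

    def search(self, key):
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end_of_word


# Sample words chosen for illustration only
trie = DictTrie()
for word in ["از", "به", "که"]:
    trie.insert(word)
```

Because children are looked up by the character itself, this variant needs no character-to-index helper and accepts any Unicode character.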
Usage example:
# Input keys: the stop words from the first column of the file
keys = list(stopwords.loc[:,0])
output = ["Not present in trie",
          "Present in trie"]
# Trie object
t = Trie()
# Construct trie
for key in keys:
    t.insert(key)
print("{} ---- {}".format("از",output[t.search("از")]))
Output:
از ---- Present in trie
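The question's actual goal, dropping every word of a text that appears in the stop word list, could be sketched as follows. This is a minimal sketch that assumes whitespace tokenization is good enough for the text. The function `remove_stopwords`, the sample sentence, and the three sample stop words are hypothetical. The lookup is passed in as a predicate, and the example uses a Python set, whose membership test takes constant time on average, as one possible alternative to a prebuilt tree.

```python
def remove_stopwords(text, is_stopword):
    """Return text without the whitespace-separated words for which
    is_stopword(word) is true."""
    kept = [word for word in text.split() if not is_stopword(word)]
    return " ".join(kept)


# Sample stop words for illustration; in practice they would come from
# the stop word file.
stopword_set = frozenset(["از", "به", "که"])

cleaned = remove_stopwords("من از خانه به مدرسه رفتم", stopword_set.__contains__)
print(cleaned)  # من خانه مدرسه رفتم
```

The trie's `t.search` from the answer above could be passed as `is_stopword` instead of the set lookup.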