SPIMI algorithm misunderstanding
I am trying to implement a single-pass in-memory indexer (SPIMI) in C++, but I think there is a problem in the algorithm, or (more likely) I am misunderstanding something.
SPIMI-INVERT(token_stream)
    output_file = NEWFILE()
    dictionary = NEWHASH()
    while (free memory available)
        token ← next(token_stream)
        if term(token) ∉ dictionary
            then postings_list = ADDTODICTIONARY(dictionary, term(token))
            else postings_list = GETPOSTINGSLIST(dictionary, term(token))
        if full(postings_list)
            then postings_list = DOUBLEPOSTINGSLIST(dictionary, term(token))
        ADDTOPOSTINGSLIST(postings_list, docID(token))
    sorted_terms ← SORTTERMS(dictionary)
    WRITEBLOCKTODISK(sorted_terms, dictionary, output_file)
    return output_file
Assume I have done all the parsing and turned every document into a stream of tokens, where each token is a (term, doc_id) pair.
http://nlp.stanford.edu/IR-book/html/htmledition/single-pass-in-memory-indexing-1.html says that the SPIMI-INVERT function is called once for each block.
OK then, let's walk through it:
- We read the stream block by block, so now I have one block and pass it to the SPIMI-INVERT function as an argument.
- The function processes the block's tokens into the dictionary.
- At some point (perhaps because the dictionary has grown too large) there is no free memory left while we are inside the while loop.
- The algorithm breaks out of the loop and writes the current dictionary to disk.
But from the outside world (as the caller of the function), I have no idea whether the block I passed as an argument was processed completely or not. Don't you think something is wrong here?
Since there has been no answer so far, I am answering my own question after talking to my professor.
I must say the algorithm is not very clear, because even my professor was not sure about it. I am answering according to my own interpretation.
The token stream is a file containing tokens (term, doc-id pairs).

while (there is a token in the token stream)
    dict = new dictionary()
    while (there is free memory available)
        token = next_token()
        dict.add_to_posting_list(token)
    write_dict_to_file(dict)
    delete dict

// Assuming the postings list is dynamically sized and the dictionary knows whether a term exists.
Here is my implementation in C++; it may be useful.