SPIMI algorithm misunderstanding

I am trying to implement a single-pass in-memory indexer (SPIMI) in C++, but I think there is a problem in the algorithm, or (more likely) I am misunderstanding something:

SPIMI-INVERT(token_stream)
  output_file = NEWFILE() 
  dictionary = NEWHASH() 

  while (free memory available) 
     token ← next(token_stream)
     if term(token) ∉ dictionary
        then postings_list = ADDTODICTIONARY(dictionary, term(token))
        else postings_list = GETPOSTINGSLIST(dictionary, term(token))
     if full(postings_list)
        then postings_list = DOUBLEPOSTINGSLIST(dictionary, term(token))

     ADDTOPOSTINGSLIST(postings_list, docID(token)) 
  sorted_terms ← SORTTERMS(dictionary)  
  WRITEBLOCKTODISK(sorted_terms,dictionary,output_file) 
  return output_file
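
For reference, this is roughly how I would translate that pseudocode into C++. It is only a sketch: the Token struct, the 64 MB memory budget, and the estimated_size() heuristic are my own assumptions, and I use std::map so the terms come out sorted, which stands in for SORTTERMS.

// Sketch only: the memory budget and the size estimate below are assumptions,
// not part of the textbook pseudocode.
#include <cstddef>
#include <fstream>
#include <istream>
#include <map>
#include <string>
#include <vector>

struct Token {
    std::string term;
    int doc_id;
};

// term -> postings list; std::map keeps the terms sorted, so SORTTERMS is implicit.
using Dictionary = std::map<std::string, std::vector<int>>;

// Rough guess of how much memory the dictionary occupies, used as the
// "free memory available" test.
std::size_t estimated_size(const Dictionary& dict) {
    std::size_t bytes = 0;
    for (const auto& [term, postings] : dict)
        bytes += term.size() + postings.size() * sizeof(int) + 64;  // 64 B of bookkeeping per entry (guess)
    return bytes;
}

// One call builds one block: it consumes tokens until the budget is reached
// (or the stream ends) and writes the sorted block to disk.
std::string spimi_invert(std::istream& token_stream, int block_number,
                         std::size_t memory_budget = 64 * 1024 * 1024) {
    const std::string output_file = "block_" + std::to_string(block_number) + ".txt";
    Dictionary dictionary;                                    // dictionary = NEWHASH()

    Token token;
    while (estimated_size(dictionary) < memory_budget &&      // while free memory available
           token_stream >> token.term >> token.doc_id) {      // token <- next(token_stream)
        // ADDTODICTIONARY / GETPOSTINGSLIST collapse into operator[]: it creates
        // the postings list for a new term and returns the existing one otherwise.
        dictionary[token.term].push_back(token.doc_id);       // ADDTOPOSTINGSLIST
    }

    std::ofstream out(output_file);                           // WRITEBLOCKTODISK
    for (const auto& [term, postings] : dictionary) {
        out << term;
        for (int doc_id : postings) out << ' ' << doc_id;
        out << '\n';
    }
    return output_file;
}

Written this way, the caller can at least inspect the state of token_stream afterwards to see whether the whole input was consumed; the pseudocode itself does not say how the caller learns this, which is what my question below is about.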

Assume that I have done all the parsing and converted all the documents into a token stream, where each token is a (term, doc_id) pair.

http://nlp.stanford.edu/IR-book/html/htmledition/single-pass-in-memory-indexing-1.html says that the SPIMI-INVERT function is called for each block.

OK, so here we go: somehow (maybe because the dictionary got too big), we run out of free memory while we are still inside the while loop.

But from the outside (as the caller of the function), I have no idea whether the block that I passed as an argument was processed completely or not. Don't you think there is something wrong here?

Since there has been no answer so far, and after talking to my professor, I am answering my own question.

I have to say the algorithm is not very clear, since my professor was not sure about it either. I am answering the question according to my own interpretation.

The token stream is a file that contains tokens (term, doc-id pairs).
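
For example, assuming a plain text file with one "term doc_id" pair per line (the file name and format are my own choice, not something the algorithm prescribes), the stream can be read like this:

// Minimal token-stream reader; "tokens.txt" and its whitespace-separated
// "term doc_id" format are assumptions for the example.
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream token_stream("tokens.txt");
    std::string term;
    int doc_id;
    while (token_stream >> term >> doc_id)   // one (term, doc-id) pair per iteration
        std::cout << term << " -> " << doc_id << '\n';
}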

while (there is a token in the token stream)
      dict = new dictionary()
      while (there is free memory available and there is a token in the token stream)
            token = next_token()
            dict.add_to_postingList(token)
      write_dict_to_file(dict)
      delete dict

// Assuming the posting list is dynamically sized and the dictionary knows whether a term already exists.
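
A minimal, self-contained sketch of this interpretation could look like the code below. It is not my actual implementation: the token file name, the fixed number of tokens per block that stands in for the "free memory available" check, and the block file naming are all assumptions.

// Sketch of the interpretation above: one in-memory dictionary per block,
// written to its own file and discarded before the next block starts.
#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::ifstream token_stream("tokens.txt");         // assumed token file
    const std::size_t tokens_per_block = 1000000;     // stand-in for a real memory check

    std::string term;
    int doc_id;
    int block_number = 0;
    bool more_tokens = static_cast<bool>(token_stream >> term >> doc_id);

    while (more_tokens) {                              // while there is a token in the token stream
        std::unordered_map<std::string, std::vector<int>> dict;   // dict = new dictionary()
        std::size_t tokens_in_block = 0;

        // Fill the dictionary until the "memory" budget is used up or the stream ends.
        while (more_tokens && tokens_in_block < tokens_per_block) {
            dict[term].push_back(doc_id);              // dict.add_to_postingList(token)
            ++tokens_in_block;
            more_tokens = static_cast<bool>(token_stream >> term >> doc_id);
        }

        // write_dict_to_file(dict); a real SPIMI block would sort the terms
        // before (or while) writing so the blocks can be merged later.
        std::ofstream out("block_" + std::to_string(block_number++) + ".txt");
        for (const auto& [t, postings] : dict) {
            out << t;
            for (int d : postings) out << ' ' << d;
            out << '\n';
        }
    }                                                  // dict goes out of scope each iteration, i.e. "delete dict"
}

The block files still have to be merged into the final inverted index afterwards; that merge step is the second half of SPIMI and is not shown here.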

Here is my implementation in C++; it might be useful.