SPIMI algorithm misunderstanding
I am trying to implement a single-pass in-memory indexer (SPIMI) in C++, but I think there is a problem in the algorithm, or (more likely) I am misunderstanding something.
SPIMI-INVERT(token_stream)
    output_file = NEWFILE()
    dictionary = NEWHASH()
    while (free memory available)
        token ← next(token_stream)
        if term(token) ∉ dictionary
            then postings_list = ADDTODICTIONARY(dictionary, term(token))
            else postings_list = GETPOSTINGSLIST(dictionary, term(token))
        if full(postings_list)
            then postings_list = DOUBLEPOSTINGSLIST(dictionary, term(token))
        ADDTOPOSTINGSLIST(postings_list, docID(token))
    sorted_terms ← SORTTERMS(dictionary)
    WRITEBLOCKTODISK(sorted_terms, dictionary, output_file)
    return output_file
Assume I have done all the parsing and turned every document into a stream of tokens, where each token is a (term, doc_id) pair.
http://nlp.stanford.edu/IR-book/html/htmledition/single-pass-in-memory-indexing-1.html says that the SPIMI-INVERT function is called once for each block.
OK then, let's walk through it:
- We read the stream block by block, so now I have one block and pass it to the SPIMI-INVERT function as an argument.
- The function processes the block's tokens into the dictionary.
- At some point (perhaps because the dictionary has grown too large) there is no free memory left while we are inside the while loop.
- The algorithm breaks out of the loop and writes the current dictionary to disk.
But from the outside world (as the caller of the function), I have no idea whether the block I passed as an argument was processed completely or not. Don't you think something is wrong here?
Since there has been no answer so far, I am answering my own question after talking to my professor.
I must say the algorithm is not very clear, because even my professor was not sure about it. I am answering according to my own interpretation.
The token stream is a file containing tokens (term, doc-id pairs).

while (there is a token in the token stream)
    dict = new dictionary()
    while (there is free memory available)
        token = next_token()
        dict.add_to_posting_list(token)
    write_dict_to_file(dict)
    delete dict

// Assuming the postings list is dynamically sized and the dictionary knows whether a term exists.
Here is my implementation in C++; it may be useful.