在 C 中重建索引

Question

我有一个程序以这种格式读取一个巨大的行文本文件，我需要从这个文本文件构建一个数据结构。

microfinance 5 41 5 1650 2 1667 1 1811 1 1988 5 
subminiature 1 432 1

单词后的第一个数字是找到该单词的文档数。以下数字在文档 ID 号和文档中找到的单词出现次数之间交替出现。所以对于 microfinance，有 5 个文件，第一个是文件 41，出现 5 次，接下来是 doc 1650，出现 2 次，依此类推

我使用 strtok 获取每个元素并组织它们。我知道 strtok 工作正常。问题是将元素正确附加到我的数据结构。

DocumentNode *myDoc;
while (fgets(theLine, sizeof(theLine), newPointer) != NULL)
    {
        counter = 0;
        pch = strtok (theLine," ");
        while (pch != NULL)
        {
         if (0 == counter)
         {
            WordNode *toInsertPtr = (malloc(sizeof(struct WordNode)));
            word = (malloc(100));
            strncpy (word, pch, strlen(pch));
            toInsertPtr->word = word;
            toInsertPtr->next = NULL;

            currIndex = JenkinsHash(word, MAX_HASH_SLOT);
            if ((TheIndex->index[currIndex]) == NULL)
            {
                TheIndex->index[currIndex] = toInsertPtr;
            }
            else 
            {
                TheIndex->index[currIndex]->next = toInsertPtr;
            }   
         }

         if (1 == counter)
         {
            numOfDocs = atoi(pch);
         }

         if (counter % 2 == 0 && counter != 0 && pch != NULL)
         {
            myDoc= (malloc(sizeof(struct DocumentNode)));
            myDoc->next = NULL;
            int doc_id = atoi(pch);
            myDoc->documentID = doc_id;         
         }

         if (counter % 2 != 0 && counter != 1 && pch != NULL)
         {
            myDoc->occurences = atoi(pch);

            if (TheIndex->index[currIndex]->page == NULL)
            {
                TheIndex->index[currIndex]->page = myDoc;
            }
            else
            {
                TheIndex->index[currIndex]->page->next = myDoc;
            }
         }
          pch = strtok (NULL, " ");
          counter++;
        }
    }

我已经 GDBed 发现问题出在这里。第一个 if 语句检查索引处是否有 doc 节点总是捕获为 null（即使索引中的那个点明显有东西）并且它一次又一次地覆盖一个槽。为什么它总是认为它不是 NULL？

        if (TheIndex->index[currIndex]->page == NULL)
        {
            TheIndex->index[currIndex]->page = myDoc;
        }
        else
        {
            TheIndex->index[currIndex]->page->next = myDoc;
        }

数据结构如下：

typedef struct DocumentNode {
    struct DocumentNode *next;      // pointer to next member of the list.
    int documentID;                 //doc identifier (filename, ie. 1, 2, etc.)
    int occurences;                 //num. occurances. 
} DocumentNode;

typedef struct WordNode {                
    struct WordNode *next;           //pointer to the next word (for collisions)
    char *word;                      //the word itself.
    DocumentNode *page;              // pointer to the first element of the page list.
} WordNode;

typedef struct InvertedIndex {
    WordNode *index[MAX_HASH_SLOT];   
} InvertedIndex;

Answer 1

您的方法太复杂：循环试图维持状态，并且有很长的条件链来决定如何处理下一个令牌。

与其一次做 strtok 一个，不如做第一个算出单词，第二个算出数，然后将其余的成对做。它应该如下所示：

while (fgets(theLine, sizeof(theLine), newPointer) != NULL) {
    pch = strtok (theLine," ");
    char *word = malloc(strlen(pch)+1);
    strcpy(word, pch);
    ... // Add the word
    pch = strtok(NULL, " ");
    int pairCount = atoi(pch);
    for (int i = 0 ; i != pairCount ; i++) {
         pch = strtok(NULL, " ");
         int id = atoi(pch);
         pch = strtok(NULL, " ");
         int count = atoi(pch);
         ... // Add the document
    }
}

P.S。如果您很好地理解这种方法，您可能会喜欢 Edsger Dijkstra 的 this tale。

在 C 中重建索引

Reconstructing an Index in C

c

strtok

data-structures