Splitting a text file into words in C

I have 2 types of text files that I want to split into words.

The first type of text file is just words separated by newlines.

Milk
Work
Chair
...

The second type of text file is text from a book, with only spaces between the words (no commas, question marks, etc.).

And then she tried to run 
but she was stunned by the view of 
...

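Do you know which approach is best?

I tried the following two approaches, but I seem to be getting a segmentation fault.

For the first type of text I use: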

while(fgets(line,sizeof(line),wordlist) != NULL)
{
    /* Checks Words |
    printf("%s",line);*/
    InsertWord(W,line);/*Function that inserts the word to a tree*/
}

For the second type of text I use:

while(fgets(line,sizeof(line),out) != NULL)
{
    bp = line ;
    while(1)
    {
        cp = strtok(bp," ");
        bp = NULL ;

        if(cp == NULL)
            break;

        /*printf("Word by Word : %s \n",cp);*/
        CheckWord(Words, cp);/*Function that checks if the word from the book is the same with one in a tree */
    }
}

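If these are wrong, could you suggest something better or correct me?

EDIT: (about the segmentation fault)

InsertWord is a function that inserts a word into a tree. When I use this code: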

for (i = 0 ; i <=2 ; i++)
{
    if (i==0)
        InsertWord(W,"A");
    if (i==1)
        InsertWord(W,"B");
    if (i==2)
        InsertWord(W,"c");
}

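the tree inserts the words fine and prints them, which means my tree and its functions work correctly (they were also given to us by our teacher). But when I try to do the same thing like this: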

char this_word[15];
while (fscanf(wordlist, "%14s", this_word) == 1) 
{
    printf("Latest word that was read: '%s'\n", this_word);
    InsertWord(W,this_word);
}

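I get an error from the tree. So I guess it is some kind of segmentation fault. Any ideas?

The simplest approach is probably to go character by character: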

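char word[50];
char *word_pos = word;
int ch;   /* int, not char, so that EOF can be told apart from real characters */

/* Skip separators until the first word character */
while ((ch = fgetc(out)) == '\n' || ch == ' ')
    ;

while (ch != EOF) {
    if (ch == '\n' || ch == ' ') {
        /* End of a word: terminate it and hand it off */
        *word_pos = '\0';
        word_pos = word;
        CheckWord(Words, word);

        /* Skip the rest of this run of separators */
        while ((ch = fgetc(out)) == '\n' || ch == ' ')
            ;
    } else {
        /* Store the character, leaving room for the terminating '\0' */
        if (word_pos < word + sizeof word - 1)
            *word_pos++ = (char)ch;
        ch = fgetc(out);
    }
}

/* Flush a last word that was not followed by a separator */
if (word_pos != word) {
    *word_pos = '\0';
    CheckWord(Words, word);
}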

You are limited by the size of word, and every stop character has to be added to the while and if conditions.
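Both limitations can be softened a little. The following is only a sketch, not a drop-in replacement: next_word is a hypothetical helper name, it uses isspace() from <ctype.h> so that the stop characters do not have to be listed by hand, and it truncates a word that does not fit instead of writing past the buffer.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Read the next whitespace-delimited word from `in` into `buf`.
   `cap` is the size of `buf`, including room for the terminating '\0'.
   Returns 1 if a word was stored, 0 when the input is exhausted.
   Characters that do not fit are read and dropped, so an over-long
   word is truncated rather than overflowing the buffer. */
int next_word(FILE *in, char *buf, size_t cap)
{
    assert(in != NULL && buf != NULL && cap > 0);
    int ch;          /* int so that EOF is distinguishable */
    size_t len = 0;

    /* Skip leading whitespace (spaces, tabs, newlines, ...). */
    while ((ch = fgetc(in)) != EOF && isspace(ch))
        ;
    if (ch == EOF)
        return 0;

    /* Collect characters until the next whitespace or end of input. */
    do {
        if (len + 1 < cap)
            buf[len++] = (char)ch;
    } while ((ch = fgetc(in)) != EOF && !isspace(ch));

    buf[len] = '\0';
    return 1;
}
```

It would be called in a loop such as while (next_word(out, word, sizeof word)) CheckWord(Words, word); where out, word, Words and CheckWord are the names from the question.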

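You want to read from a file, so fgets() should come to mind.

You want to split a line into tokens by a delimiter (whitespace), so remember strtok().

So, you could do something like this: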

#include <stdio.h>
#include <string.h>

int main(void)
{
   FILE * pFile;
   char mystring [100];
   char* pch;

   pFile = fopen ("text_newlines.txt" , "r");
   if (pFile == NULL) perror ("Error opening file");
   else {
     while ( fgets (mystring , 100 , pFile) != NULL )
       printf ("%s", mystring);
     fclose (pFile);
   }

   pFile = fopen ("text_wspaces.txt" , "r");
   if (pFile == NULL) perror ("Error opening file");
   else {
     while ( fgets (mystring , 100 , pFile) != NULL ) {
       printf ("%s", mystring);
       pch = strtok (mystring," ");
       while (pch != NULL)
       {
         printf ("%s\n",pch);
         pch = strtok (NULL, " ");
       }
     }
     fclose (pFile);
   }

   return 0;
}

Output:

linux25:/home/users/grad1459>./a.out
Milk
Work
Chair
And then she tried to run 
And
then
she
tried
to
run


but she was stunned by the view of
but
she
was
stunned
by
the
view
of
//newline here as well
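The blank lines in this output appear because fgets() keeps the trailing '\n' in the buffer, and with " " as the only delimiter that newline ends up in the last token. One way around it, sketched here under the assumption that tabs and carriage returns should count as separators too, is to pass all of them to strtok(). The helper name split_words and the fixed-size token array are illustrative only.

```c
#include <assert.h>
#include <string.h>

/* Split `line` in place on spaces, tabs and line terminators.
   Stores up to `max_tokens` pointers into `line` in `tokens` and
   returns how many were stored. strtok() overwrites the delimiters
   in `line` and keeps hidden static state, so it is not reentrant. */
size_t split_words(char *line, char *tokens[], size_t max_tokens)
{
    assert(line != NULL && tokens != NULL);
    static const char delims[] = " \t\r\n";
    size_t n = 0;

    for (char *tok = strtok(line, delims);
         tok != NULL && n < max_tokens;
         tok = strtok(NULL, delims)) {
        tokens[n++] = tok;   /* points into `line`, no copy is made */
    }
    return n;
}
```

Because the tokens point into the line buffer, they stay valid only until the next fgets() call overwrites it; anything that has to outlive the loop iteration needs its own copy.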

This is exactly the kind of input that fscanf and %s were made for:

char this_word[15];
while (fscanf(tsin, "%14s", this_word) == 1) {
    printf("Latest word that was read: '%s'.\n", this_word);
    // Process the word...
}
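Since %s skips any leading whitespace, newlines included, the same loop handles both kinds of file. Two things are worth keeping in mind: a word longer than 14 characters is returned in pieces, and this_word is overwritten on every iteration. If the code that processes the word keeps only the pointer (an assumption, since InsertWord is not shown in the question), every stored word would end up referring to the same buffer, so each word should be copied. A minimal sketch, with read_words and WORD_MAX as made-up names:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define WORD_MAX 15   /* buffer size; the "%14s" below must stay WORD_MAX - 1 */

/* Read whitespace-separated words from `in`, storing a private copy
   of each one in a row of `out`. Returns the number of words stored
   (at most `max_words`). */
size_t read_words(FILE *in, char out[][WORD_MAX], size_t max_words)
{
    assert(in != NULL && out != NULL);
    char this_word[WORD_MAX];   /* reused on every iteration */
    size_t n = 0;

    while (n < max_words && fscanf(in, "%14s", this_word) == 1) {
        /* Copy the word: `this_word` is overwritten by the next fscanf.
           strcpy is safe here because both buffers are WORD_MAX bytes. */
        strcpy(out[n], this_word);
        n++;
    }
    return n;
}
```

A tree that stores words for the lifetime of the program would typically do the same thing with dynamically allocated copies rather than a fixed array.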