How to use fgets when you need to split on blank lines

I need to split some sentences apart. For example, the txt file looks like this:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Donec commodo metus sit amet mauris facilisis, fringilla convallis erat dictum. 


Quisque scelerisque turpis hendrerit, sodales erat et, convallis nisl. 
Etiam ultrices vulputate purus, id tincidunt purus semper vel. 

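There are a lot of blocks (by "block" I mean a run of consecutive sentences, two in this case), so I can't separate them by hand. I need to split them at the blank lines between them. However, fgets works line by line, so it gives me: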

Lorem ipsum dolor sit amet, consectetur adipiscing elit. 

Donec commodo metus sit amet mauris facilisis, fringilla convallis erat dictum. 

Quisque scelerisque turpis hendrerit, sodales erat et, convallis nisl. 

Etiam ultrices vulputate purus, id tincidunt purus semper vel. 

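What should I do? I don't even have a starting point yet. Thanks for your help.

Edit: Since many people did not understand, I realize I was not clear. The point is: from the txt file above, I need to split these sentences at the blank lines and add each group to an array (in this case, an array of strings).

So, when the process is finished, arrayofstrings[0] must give us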

Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Donec commodo metus sit amet mauris facilisis, fringilla convallis erat dictum. 

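The next index should then look similar. One problem is that I can't be sure the multi-sentence entries always consist of exactly two sentences. I mean, for index i, arrayofstrings[i] could be: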

Ut mattis mi ac purus tempor bibendum. 
Praesent sed metus enim. 
Pellentesque at orci id mauris consectetur consequat. 

So the process cannot be done by simply taking two lines at a time.

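You can read each contiguous block of text (including the embedded '\n' characters) in several ways, but one of the simplest is to keep a simple flag that records whether you are between paragraphs reading blank lines, or inside a block reading text. (The flag is a simple state variable.)

Then simply read each line. If it is part of a block, append it to the current index of the array; if it is a blank line, advance to the next index and reset your variables, ready to read the next block. If you use a fixed-size array, don't forget to protect the array bounds by checking that each new line appended at an index fits. A rough outline is: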

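  • (For a fixed array of strings) declare an array of rows and columns, with columns sufficient to hold each block of text.
  • Start with the read-state variable set to 0 (false), indicating you are before or between blocks of text.
  • While your array is not full, read each line.
  • If the line contains only a '\n' character,
    • check your flag to determine whether you were reading text before this line; if so, you are done filling the current array index, so
      • advance the index to the next,
    • reset your flag to 0, and
    • reset the number of characters used at the index to 0.
  • The only other alternative (the else part) is that you have read a line containing text that is part of a block. Here you will:
    • compute the total bytes required for what is currently stored at the index plus the length of the new line (plus 1 for the nul-terminating character).
    • If the line fits at the index,
      • append the current line at the index, and
      • update the total number of characters stored at the index.
    • Otherwise (else) the line does not fit at the index, so handle the error.
    • Set the in-block flag to 1 (true).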

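Obviously, instead of a fixed array you can use an array of pointers and allocate storage for each index as needed, or you can use a pointer-to-pointer and allocate both the pointers and the per-block storage as needed. It's up to you.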
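As a rough illustration of the array-of-pointers alternative, the sketch below appends one line to a heap-allocated block while tracking its length. The name append_line and its parameters are hypothetical, not part of the answer's code, and the error handling is deliberately minimal.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (an assumption for illustration): append `line` to
 * the heap-allocated string in *slot, which currently holds *len
 * characters. *slot may be NULL for a block that has no text yet.
 * Returns 0 on success, -1 on allocation failure, in which case *slot
 * and *len are left untouched. */
int append_line(char **slot, size_t *len, const char *line)
{
    size_t add = strlen(line);
    char *tmp = realloc(*slot, *len + add + 1);   /* +1 for the nul terminator */

    if (tmp == NULL)
        return -1;

    memcpy(tmp + *len, line, add + 1);            /* copies the terminator too */
    *slot = tmp;
    *len += add;
    return 0;
}
```

With something like this, each array element would be a char * that starts out as NULL, and the per-index length would play the role that offset plays in the fixed-array version.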

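Turning the outline into a short example that uses the inblk variable as your flag to determine whether you are in a block reading lines, or before or between blocks, and uses offset to track the number of characters currently used at each index so the fixed array bounds are protected, you could do:

(Updated to handle the final-block issue pointed out by @AndreasWenzel, and to add some type clean-ups.)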

#include <stdio.h>
#include <string.h>

#define NROWS  128    /* max number of rows (sentences) in array */
#define MAXCHR 256    /* max number of chars in read-buffer */

int main (int argc, char **argv) {
    
  char buf[MAXCHR] = "",              /* buffer to hold each line */
       array[NROWS][MAXCHR] = {""};   /* array of strings */
  int  inblk = 0,                     /* flag - in block reading text */
       ndx = 0;                       /* array index */
  size_t offset = 0;                  /* offset in index to write string */
  /* use filename provided as 1st argument (stdin by default) */
  FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;

  if (!fp) {  /* validate file open for reading */
      perror ("file open failed");
      return 1;
  }
  
  /* while array not full, read line into buf */
  while (ndx < NROWS && fgets (buf, MAXCHR, fp)) {
    if (*buf == '\n') {               /* 1st char is \n ? */
      if (inblk) {                    /* if in block ? */
        ndx += 1;                     /* end of block, advance index */
      }
      inblk = 0;                      /* reset flag 0 (false) */
      offset = 0;                     /* reset offset */
    }
    else {  /* otherwise reading line in block */
      size_t buflen = strlen (buf),           /* length of string in buf */
             reqd = offset + buflen + 1;      /* get total required chars */
      if (reqd < MAXCHR) {                    /* line will fit in array */
        strcpy (array[ndx] + offset, buf);    /* append buf to index */
        offset += buflen;                     /* update offset to end */
      }
      else {  /* line won't fit in remaining space, handle error */
        fputs ("error: line exceeds storage for array.\n", stderr);
        return 1;
      }
      inblk = 1;                      /* set in block flag 1 (true) */
    }
  }
  if (inblk) {    /* close and write final block */
    ndx += 1;     /* end of block, advance index */
  }
  
  if (fp != stdin)   /* close file if not stdin */
      fclose (fp);

  for (int i = 0; i < ndx; i++) {     /* output results */
    printf ("array[%2d]:\n%s\n", i, array[i]);
  }
}

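Example Input File

Given the inconsistent number of lines per block that you describe, and a possibly inconsistent number of blank lines between blocks, the following was used: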

$ cat dat/blks.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Donec commodo metus sit amet mauris facilisis, fringilla convallis erat dictum.


Quisque scelerisque turpis hendrerit, sodales erat et, convallis nisl.
Etiam ultrices vulputate purus, id tincidunt purus semper vel.

Ut mattis mi ac purus tempor bibendum.
Praesent sed metus enim.
Pellentesque at orci id mauris consectetur consequat.

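Example Use/Output

Providing the filename to read as the first argument to the program (or redirecting the file to the program on stdin) results in the following: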

$ ./bin/combineblks dat/blks.txt
array[ 0]:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Donec commodo metus sit amet mauris facilisis, fringilla convallis erat dictum.

array[ 1]:
Quisque scelerisque turpis hendrerit, sodales erat et, convallis nisl.
Etiam ultrices vulputate purus, id tincidunt purus semper vel.

array[ 2]:
Ut mattis mi ac purus tempor bibendum.
Praesent sed metus enim.
Pellentesque at orci id mauris consectetur consequat.

Each array index contains a complete block of text from the file, including the embedded and trailing '\n' characters.
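The program above treats a line as a separator only when its first character is '\n'. If the separator lines in a real file might contain stray spaces or tabs, or end in "\r\n", a whitespace-only test may be more forgiving. The following is a minimal sketch under that assumption; is_blank_line is a hypothetical name, not part of the code above.

```c
#include <ctype.h>

/* Hypothetical helper (an assumption for illustration): returns 1 if the
 * line is empty or holds nothing but whitespace (spaces, tabs, '\r',
 * '\n'), and 0 otherwise. The cast to unsigned char keeps isspace()
 * well defined for characters with negative values. */
int is_blank_line(const char *line)
{
    while (*line != '\0') {
        if (!isspace((unsigned char)*line))
            return 0;
        line++;
    }
    return 1;
}
```

In the example program, the test `*buf == '\n'` could then be written as `is_blank_line (buf)`.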

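I solved the problem using the following algorithm:

I create an array char *strings[MAX_STRINGS] in which every pointer is initialized to NULL, so that a non-NULL pointer indicates a valid string. I read one line at a time with fgets and append that line to the current string. I use dynamic memory allocation (i.e. malloc) to store and grow the actual strings, but the array strings itself is fixed-length.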

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_STRINGS 200
#define MAX_LINE_LENGTH 300

int main( void )
{
    char *strings[MAX_STRINGS] = { NULL };
    int num_strings = 0;

    char line[MAX_LINE_LENGTH];

    //read one line of input per loop iteration
    while ( fgets( line, sizeof line, stdin ) != NULL )
    {
        //make sure that line was not too long for input buffer
        if ( strchr( line, '\n' ) == NULL )
        {
            size_t len;

            //a missing newline character is not wrong
            //if end-of-file has been reached
            if ( !feof(stdin) )
            {
                fprintf( stderr, "Line too long for input buffer!\n" );
                exit( EXIT_FAILURE );
            }

            //newline character is missing at end-of-file, so add it
            len = strlen( line );
            if ( len + 1 == sizeof line )
            {
                fprintf( stderr, "No room for adding newline character!\n" );
                exit( EXIT_FAILURE );
            }
            line[len]   = '\n';
            line[len+1] = '\0';
        }

        //determine whether line is empty
        if ( line[0] == '\n' )
        {
            //determine whether current string already has content
            if ( num_strings < MAX_STRINGS && strings[num_strings] != NULL ) //index check avoids reading past the array
            {
                num_strings++;
            }

            //skip to next line
            continue;
        }

        //make sure that maximum number of strings has not been exceeded
        if ( num_strings == MAX_STRINGS )
        {
            fprintf( stderr, "Maximum number of strings exceeded!\n" );
            exit( EXIT_FAILURE );
        }

        //determine whether current string already exists
        if ( strings[num_strings] == NULL )
        {
            //allocate memory for new string
            strings[num_strings] = malloc( strlen(line) + 1 );
            if ( strings[num_strings] == NULL )
            {
                fprintf( stderr, "Memory allocation failure!\n" );
                exit( EXIT_FAILURE );
            }

            //copy string to allocated memory
            strcpy( strings[num_strings], line );
        }
        else
        {
            size_t len;

            //resize memory buffer for adding new string
            len = strlen( strings[num_strings] );
            len += strlen(line) + 1;
            strings[num_strings] = realloc( strings[num_strings], len );
            if ( strings[num_strings] == NULL )
            {
                fprintf( stderr, "Memory allocation failure!\n" );
                exit( EXIT_FAILURE );
            }

            //concatenate the current line with the existing string
            strcat( strings[num_strings], line );
        }
    }

    //mark last string as complete, if it exists
    if ( num_strings < MAX_STRINGS && strings[num_strings] != NULL ) //index check avoids reading past the array
    {
        num_strings++;
    }

    //print results

    printf( "Found a total of %d strings.\n\n", num_strings );

    for ( int i = 0; i < num_strings; i++ )
    {
        printf( "strings[%d] has the following content:\n%s\n", i, strings[i] );

        //perform cleanup
        free( strings[i] );
    }
}

For the input posted in the question, this program has the following output:

Found a total of 2 strings.

strings[0] has the following content:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Donec commodo metus sit amet mauris facilisis, fringilla convallis erat dictum. 

strings[1] has the following content:
Quisque scelerisque turpis hendrerit, sodales erat et, convallis nisl. 
Etiam ultrices vulputate purus, id tincidunt purus semper vel. 

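However, the code above is not very well optimized, because:

  1. It does not remember the length of the string, but instead uses strcat and strlen, which scan the string again and again to find its length. If the strings become very large, this is very inefficient.

  2. It calls realloc every time a new line is added. This may cause the entire string to be copied to a new memory buffer each time, which is very inefficient if the strings become very large.

Another issue is that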

#define MAX_STRINGS 200

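creates a hard limit on the number of strings this program can handle. Although this number can be increased if necessary, it would probably be better to do this automatically.

Therefore, instead of defining strings like this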

char *strings[MAX_STRINGS] = { NULL };

it would probably be better to define it like this

char **strings;

and to use dynamic memory allocation for the string pointers, growing the allocation as necessary.
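As a minimal sketch of growing the pointer array on demand, the idea could be wrapped in a small helper such as the one below. The names struct strvec and strvec_push, and the initial capacity of 8, are assumptions made for illustration only.

```c
#include <stdlib.h>

/* Hypothetical growable array of string pointers (illustrative names). */
struct strvec {
    char **items;   /* the pointer-to-pointer storage */
    size_t count;   /* slots in use */
    size_t cap;     /* slots allocated */
};

/* Append pointer `s`, doubling the capacity when the array is full.
 * Returns 0 on success, -1 on allocation failure. On failure the
 * existing array is still valid, because realloc's result is stored in
 * a temporary before it replaces v->items. */
int strvec_push(struct strvec *v, char *s)
{
    if (v->count == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 8;
        char **tmp = realloc(v->items, new_cap * sizeof *tmp);
        if (tmp == NULL)
            return -1;
        v->items = tmp;
        v->cap = new_cap;
    }
    v->items[v->count++] = s;
    return 0;
}
```

Doubling the capacity keeps the number of realloc calls logarithmic in the number of strings, which is the same reason for growing the string buffers geometrically.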

Here is another solution that is more complicated, but solves these issues:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE_LENGTH 300

int main( void )
{
    char **strings;
    size_t num_strings = 0; //number of valid strings
    size_t cap_strings = 100; //current capacity of strings

    char line[MAX_LINE_LENGTH];

    size_t current_string_length = 0;
    size_t current_string_capacity = 0;

    //allocate initial memory for array "strings"
    strings = malloc( cap_strings * sizeof *strings );
    if ( strings == NULL )
    {
        fprintf( stderr, "Memory allocation error!\n" );
        exit( EXIT_FAILURE );
    }

    //read one line of input per loop iteration
    while ( fgets( line, sizeof line, stdin ) != NULL )
    {
        size_t len;

        //make sure that line was not too long for input buffer
        len = strlen( line );
        if ( len == 0 || line[len-1] != '\n' )
        {
            //a missing newline character is not wrong
            //if end-of-file has been reached
            if ( !feof(stdin) )
            {
                fprintf( stderr, "Line too long for input buffer!\n" );
                exit( EXIT_FAILURE );
            }

            //newline character is missing at end-of-file, so add it
            if ( len + 1 == sizeof line )
            {
                fprintf( stderr, "No room for adding newline character!\n" );
                exit( EXIT_FAILURE );
            }
            line[len]   = '\n';
            line[len+1] = '\0';
            len++;
        }

        //determine whether line is empty
        if ( line[0] == '\n' )
        {
            //determine whether current string already has content
            if ( current_string_length > 0 )
            {
                //shrink allocated memory to required size
                strings[num_strings] = realloc( strings[num_strings], current_string_length + 1  );
                if ( strings[num_strings] == NULL )
                {
                    exit( EXIT_FAILURE );
                }

                //mark string as complete
                num_strings++;
                current_string_length = 0;
            }

            //skip to next line
            continue;
        }

        //grow array "strings" if necessary
        if ( num_strings == cap_strings )
        {
            cap_strings *= 2;
            strings = realloc( strings, cap_strings * sizeof *strings );
            if ( strings == NULL )
            {
                fprintf( stderr, "Memory allocation error!\n" );
                exit( EXIT_FAILURE );
            }
        }

        //determine whether current string already exists
        if ( current_string_length == 0 )
        {
            //allocate memory for new string
            current_string_capacity = 200;
            if ( current_string_capacity < len + 1 )
                current_string_capacity = len + 1;
            strings[num_strings] = malloc( current_string_capacity );
            if ( strings[num_strings] == NULL )
            {
                fprintf( stderr, "Memory allocation failure!\n" );
                exit( EXIT_FAILURE );
            }

            //copy string to allocated memory
            strcpy( strings[num_strings], line );

            //update length of current string
            current_string_length = len;
        }
        else
        {
            //remember previous length of string
            size_t prev_string_length = current_string_length;

            //resize memory buffer, if necessary
            current_string_length += len;
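            if ( current_string_capacity < current_string_length + 1 )
            {
                //at least double the capacity, so that the whole string fits
                current_string_capacity *= 2;
                if ( current_string_capacity < current_string_length + 1 )
                    current_string_capacity = current_string_length + 1;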
                strings[num_strings] = realloc( strings[num_strings], current_string_capacity );
                if ( strings[num_strings] == NULL )
                {
                    fprintf( stderr, "Memory allocation failure!\n" );
                    exit( EXIT_FAILURE );
                }
            }

            //add the current line to the existing string
            strcpy( strings[num_strings]+prev_string_length, line );
        }
    }

    //shrink last string and mark as complete, if it exists
    if ( current_string_length != 0 )
    {
        //shrink allocated memory to required size
        strings[num_strings] = realloc( strings[num_strings], current_string_length + 1  );
        if ( strings[num_strings] == NULL )
        {
            exit( EXIT_FAILURE );
        }
        num_strings++;
    }

    //print results

    printf( "Found a total of %zu strings.\n\n", num_strings );

    for ( size_t i = 0; i < num_strings; i++ )
    {
        printf( "strings[%zu] has the following content:\n%s\n", i, strings[i] );

        //perform cleanup
        free( strings[i] );
    }

    //more cleanup
    free( strings );
}

Ooh, me too!
Another option is to simply read the file en bloc, and then tokenize on sequences of two or more newlines.

#include <ctype.h>
#include <iso646.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>


char * file_to_string( const char * filename, const char * mode )  // NULL, "", "text", "r" --> text mode
{                                                                  // "b", "binary", "rb" --> binary mode
    struct stat st;
    size_t size  = stat( filename, &st ) ? 0 : st.st_size;
    char * s     = calloc( size + 1, 1 );
    FILE * f     = fopen( filename, (mode and ((*mode == 'b') or (mode[1] == 'b'))) ? "rb" : "r" );
    bool   ok    = size and s and f;
    if (ok) fread( s, 1, size, f );
    if (f) fclose( f );
    if (ok) return s;
    free( s );
    return NULL;
}


char * skip_whitespace( char ** s )
{
    while (isspace( (unsigned char)**s )) ++(*s);  // cast: isspace() needs a non-negative value
    return *s;
}


char * find_end_of_paragraph( char ** s )
{
    char  * p = strstr( *s, "\n\n" );           // a paragraph ends with two (or more) newlines
    if (!p) p = strchr( *s, '\0' );             // or at the end of the string
    while ((*s < p) and isspace( (unsigned char)p[-1] )) --p;  // (also trim trailing whitespace)
    return (*s = p);
}


size_t sp_loop( char * s, char ** ss )  // worker function for split_paragraphs()
{                                       // loops through the "paragraphs" in the argument string
    size_t count = 0;
    while (*skip_whitespace( &s ))
    {
        count += 1;
        if (ss) *ss++ = s;
        if (*find_end_of_paragraph( &s ) and ss) *s++ = '\0';
    }
    return count;
}

char ** split_paragraphs( char * s )  // returns a NULL-terminated array referencing
{                                     // paragraphs in the modified argument string
    size_t count = sp_loop( s, NULL );
    char ** ss = calloc( sizeof(char *), count + 1 );  // (returns NULL on alloc failure)
    if (ss) sp_loop( s, ss );
    return ss;
}


int main( int argc, char ** argv )  // main takes FILENAME as argument
{
    if (argc != 2) return 0;
    
    char * s = file_to_string( argv[1], "text" );
    if (s)
    {
        char ** paragraphs = split_paragraphs( s );
        if (paragraphs)
        {
            if (!*paragraphs)
            {
                puts( "\n(empty file: no paragraphs)" );
            }
            else 
            {
                size_t n = 0;
                while (paragraphs[n]) printf( "\n'''%s'''\n", paragraphs[n++] );
                printf( "\n(%zu paragraph%s)\n", n, n == 1 ? "" : "s" );
            }
            free( paragraphs );
        }
        free( s );
    }
    return EXIT_SUCCESS;
}

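This does mean that you cannot modify a paragraph without modifying the source string, but that may be all you need.

This is a two-pass tokenization. Add the read and that makes three passes in total, plus the allocations.

The paragraph tokenizer strips leading and trailing whitespace from the tokenized strings, and works on files with leading and trailing sequences of two or more newlines.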
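If the counting pass and the pointer array are not needed, a single-pass variant is conceivable that hands out one paragraph at a time, much like strtok_r does for tokens. The sketch below is only an illustration under that assumption; next_paragraph is a hypothetical name, and it follows the same paragraph rule as the code above (a paragraph ends at the first "\n\n" or at the end of the string).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (an assumption for illustration): returns the next
 * paragraph found at *cursor, or NULL when no text is left. The paragraph
 * is nul-terminated in place and *cursor is advanced past it, so the
 * buffer is modified, just as with strtok(). */
char *next_paragraph(char **cursor)
{
    char *s = *cursor;

    /* skip leading whitespace, including any run of blank lines */
    while (isspace((unsigned char)*s))
        s++;
    if (*s == '\0') {
        *cursor = s;
        return NULL;
    }

    /* a paragraph ends at the first "\n\n" or at the end of the string */
    char *end = strstr(s, "\n\n");
    if (end == NULL)
        end = s + strlen(s);

    /* remember where to resume before the text is cut */
    char *resume = (*end != '\0') ? end + 1 : end;

    /* trim trailing whitespace, then terminate the paragraph */
    while (end > s && isspace((unsigned char)end[-1]))
        end--;
    *end = '\0';

    *cursor = resume;
    return s;
}
```

A caller would set a char * cursor to the start of the loaded file contents and call next_paragraph(&cursor) in a loop until it returns NULL.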