Remove white chars between commas, but not inside what is between the commas

Question

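I'm new to C and am learning C90. I'm trying to parse a string into a command, but I'm having a hard time removing white characters.

My goal is to parse a string like this: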

NA ME, NAME   , 123 456, 124   , 14134, 134. 134   ,   1

into this:

NA ME,NAME,123 456,124,14134,134. 134,1

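So the white characters inside the arguments are kept, but the other white characters are removed.

I thought about using strtok, but I still want to keep the commas, even when there are several consecutive commas.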
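
For illustration, here is a minimal sketch of the strtok behavior in question. The function name count_strtok_fields and the sample input mentioned below are made up for this example. strtok treats a run of consecutive delimiters as a single separator, so empty fields never come back as tokens:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: splits line on ',' with strtok, prints every token
 * in brackets and returns how many tokens strtok produced.
 * Note that strtok modifies line. */
int count_strtok_fields(char *line)
{
    char *tok;
    int fields = 0;

    assert(line != NULL);

    for (tok = strtok(line, ","); tok != NULL; tok = strtok(NULL, ",")) {
        printf("[%s]\n", tok);
        fields++;
    }
    return fields;
}
```

Called on a writable copy of "a,,b,c", this reports 3 tokens although the line has four fields: the empty field between the two commas is silently dropped.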

So far I have used:

void removeWhiteChars(char *s)
{
    int i = 0;
    int count = 0;
    int inNum = 0;
    while (s[i])
    {
        if (isdigit(s[i]))
        {
            inNum = 1;
        }
        if (s[i] == ',')
        {
            inNum = 0;
        }
        if (!isspace(s[i]) && !inNum)
            s[count++] = s[i];
        else if (inNum)
        {
            s[count++] = s[i];
        }

        ++i;
    }
    s[count] = '\0'; /* adding NULL-terminate to the string */
}

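But it only skips for numbers, and it doesn't remove the white characters between a number and the following comma, which is quite wrong.

I would appreciate any kind of help; I've been stuck on this for two days now.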

Answer 1

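Whenever you encounter whitespace that might be skippable, you need to look ahead. The function below, every time it sees a space, checks ahead to see whether the run of spaces ends in a comma. Likewise, for every comma, it checks for and removes all the spaces that follow.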

/* Remove the len characters str[index] .. str[index+len-1] in place */
void splice (char * str, int index, int len) {
  while (str[index+len]) {
    str[index] = str[index+len];
    index++;
  }
  str[index] = 0;
}

void removeWhiteChars (char * str) {
  int index=0, seq_len;

  while (str[index]) {
    if (str[index] == ' ') {
      seq_len = 0;

      while (str[index+seq_len] == ' ') seq_len++;

      if (str[index+seq_len] == ',') {
        splice(str, index, seq_len);
      }
    }
    if (str[index] == ',') {
      seq_len = 0;
      while (str[index+seq_len+1] == ' ') seq_len++;

      if (seq_len) {
        splice(str, index+1, seq_len);
      }
    }
    index++;
  }
}

Answer 2

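The following works, at least for your input string. I make absolutely no claims about its efficiency or elegance. Rather than trying to modify s in place, I write to a new string. The algorithm I followed is:

Initialize startPos to 0.
Loop over s until you find a comma.
Back up from that position until you find the first non-space character.
memcpy from startPos up to that position into a new string.
Add a comma at the next position in the new string.
Look forward from the comma position until you find the first non-space character, and set startPos to that.
Rinse and repeat.
At the very end, append the final token with strcat.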

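#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void removeWhiteChars(char *s)
{
    size_t i = 0;
    size_t len = strlen(s);
    char *newS = calloc(len + 1, 1);   /* +1 for the terminating nul */
    size_t newSIndex = 0;
    size_t startPos = 0;

    if (newS == NULL)
        return;

    while (i < len)
    {
        /* find the comma */
        if (s[i] == ',')
        {
            /* find the end of the field: one past the last nonspace char
               before the comma, never backing up past startPos */
            size_t end = i;
            while (end > startPos && isspace((unsigned char)s[end - 1]))
            {
                end--;
            }

            /* copy [startPos, end) into our new string */
            memcpy(newS + newSIndex, s + startPos, end - startPos);
            newSIndex += end - startPos;
            newS[newSIndex++] = ',';

            /* update startPos: first nonspace char after the comma */
            startPos = i + 1;
            while (isspace((unsigned char)s[startPos]))
            {
                startPos++;
            }

            /* update i; do not skip s[startPos], it may be another comma */
            i = startPos;
        }
        else
        {
            i++;
        }
    }

    /* finally tack on the end (newS is zero-filled, so it is nul-terminated) */
    strcat(newS, s + startPos);

    /* Alternatively, return newS (and let the caller free it) if you're
       allowed to change your function signature, or strcpy it to s */
    printf("%s\n", newS);
    free(newS);
}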

Again, I have only tested this with your input string; it may break in other cases.

Answer 3

Please try this:

void removeWhiteChars(char *s)
{
    int i = 0;
    int count = 0;
    int isSomething = 0;
    while (s[i])
    {
        if (s[i] == ',' && isSomething == 0)
            isSomething = 2;
        else if (s[i] == ',' && isSomething == 1)
            isSomething = 2;
        else if (s[i] == ',' && isSomething == 2)
        {
            s[count++] = ',';
            s[count++] = s[i];
            isSomething = 0;
        }
        else if (isspace(s[i]) && isSomething == 0)
            isSomething = 1;
        else if (isspace(s[i]) && isSomething == 1)
            isSomething = 1;
        else if (isspace(s[i]) && isSomething == 2)
            isSomething = 2;
        else if (isSomething == 1)
        {
            s[count++] = ' ';
            s[count++] = s[i];
            isSomething = 0;
        }
        else if (isSomething == 2)
        {
            s[count++] = ',';
            s[count++] = s[i];
            isSomething = 0;
        }
        else
            s[count++] = s[i];

        ++i;
    }
    if (isSomething == 2)
        s[count++] = ','; /* flush a comma still pending at the end of the string */
    s[count] = '\0'; /* adding NULL-terminate to the string */
}

Answer 4

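This is one possible algorithm. It is not necessarily well-optimized as shown here, but exists to demonstrate one possible implementation of an algorithm. It is intentionally partially abstract.

The following is a very robust O(n)-time algorithm that you can use to trim whitespace (and other things, if you generalize and extend it).

This implementation has not been verified to work as-is, however.

You should track the previous character and the relevant spaces, so that if you see { ',', ' ' } or { CHAR_IN_ALPHABET, ' ' } you begin a chain, with a value representing the current execution path. When you see any other character, the chain should break if the first sequence was detected, and vice versa if the second sequence was detected. We will define a function: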

// const char *const in: indicates intent to read from in only
void trim_whitespace(const char *const in, char *out, uint64_t const out_length);

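We are defining a well-defined algorithm in which all execution paths are known, so for each unique possible execution state you should assign a numeric value that increases linearly from zero, using an enum defined inside the function for readability, and then dispatch on it with a switch statement (unless goto and labels model the algorithm's behavior better):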

void trim_whitespace(const char *const in, char *out, uint64_t const out_length) {
    // better to use ifdefs first or avoid altogether with auto const variable,
    // but you get the point here without all that boilerplate
    #define CHAR_NULL 0

    enum {
        DEFAULT = 0,
        WHITESPACE_CHAIN
    } execution_state = DEFAULT;
    
    // track if loop is executing; makes the logic more readable;
    // can also detect environment instability
    // volatile: don't want this to be optimized out of existence
    volatile bool executing = true;

    while(executing) {
        switch(execution_state) {
        case DEFAULT:
            ...
        case WHITESPACE_CHAIN:
            ...
        default:
            ...
        }
    }

    function_exit:
        return;

    // don't forget to undefine once finished so another function can use
    // the same macro name!
    #undef CHAR_NULL
}

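The number of possible execution states is equal to 2**ceil(log_2(n)), where n is the number of actual execution states relevant to the operation of the current algorithm. You should name them explicitly and create cases for them in the switch statement.

In the DEFAULT case, we only check for commas and "legal" characters. If the previous character is a comma or a legal character and the current character is a space, we set the state to WHITESPACE_CHAIN.

In the WHITESPACE_CHAIN case, we test whether the current chain can be trimmed, based on whether the character we started from was a comma or a legal character. If the current character can be trimmed, it is simply skipped and we move on to the next iteration, until we find another comma or legal character (depending on what we are looking for), and then set the execution state to DEFAULT. If we determine that the chain is not trimmable, we add back all the characters we skipped and set the execution state back to DEFAULT.

The loop should look something like this: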

...
// black boxing subjectives for portability, maintenance, and readability
bool is_whitespace(char);
bool is_comma(char);
// true if the character is allowed in the current context
bool is_legal_char(char);
...

volatile bool executing = true;

// previous character (only updated at loop start, line #LL)
char previous = CHAR_NULL;
// current character (only updated at loop start, line #LL)
char current = CHAR_NULL;
// writes to out if true at end of current iteration; doesn't write otherwise
bool write = false;
// COMMA: the start was a comma/delimiter
// CHAR_IN_ALPHABET: the start was a character in the current context's input alphabet
enum { COMMA=0, CHAR_IN_ALPHABET } comma_or_char = COMMA;

// current character index (only updated at loop end, line #LL)
uint64_t i = 0, j = 0;

while(executing) {
    previous = current;
    current = in[i];

    if (!current) {
        executing = false;
        break;
    }

    switch(execution_state) {
        case DEFAULT:
            if (is_comma(previous) && is_whitespace(current)) {
                execution_state = WHITESPACE_CHAIN;
                write = false;
                comma_or_char = COMMA;
            } else if (is_whitespace(current) && is_legal_char(previous)) { // whitespace check first for short circuiting
                execution_state = WHITESPACE_CHAIN;
                write = false;
                comma_or_char = CHAR_IN_ALPHABET;
            } else {
                // no chain starts here: copy the current character through
                write = true;
            }

            break;

        case WHITESPACE_CHAIN:
            if (is_whitespace(current)) {
                // the chain continues: keep skipping until it ends
                write = false;
                break;
            }

            switch(comma_or_char) {
                case COMMA:
                    if (is_whitespace(previous) && is_legal_char(current)) {
                        execution_state = DEFAULT;
                        write = true;
                    } else if (is_whitespace(previous) && is_comma(current)) {
                        execution_state = DEFAULT;
                        write = true;
                    } else {
                        // illegal condition: logic error, unstable environment, or SEU
                        executing = true;
                        out = NULL;
                        goto function_exit;
                    }

                    break;

                case CHAR_IN_ALPHABET:
                    if (is_whitespace(previous) && is_comma(current)) {
                        execution_state = DEFAULT;
                        write = true;
                    } else if (is_whitespace(previous) && is_legal_char(current)) {
                        // abort: within valid input string/token
                        execution_state = DEFAULT;
                        write = true;
                        // make sure to write all the elements we skipped; 
                        // function should update the value of j when finished
                        write_skipped(in, out, &i, &j);
                    } else {
                        // illegal condition: logic error, unstable environment, or SEU
                        executing = true;
                        out = NULL;
                        goto function_exit;
                    }

                    break;

                default:
                    // impossible condition: unstable environment or SEU
                    executing = true;
                    out = NULL;
                    goto function_exit;
            }
            
            break;

        default:
            // impossible condition: unstable environment or SEU
            executing = true;
            out = NULL;
            goto function_exit;
    }

    if (write) {
        out[j] = current;
        ++j;
    }

    ++i;
}

if (executing) {
    // memory error: unstable environment or SEU
    out = NULL;
} else {
    // execution successful
    goto function_exit;
}

// end of function
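
The two states described above can also be sketched as one self-contained function. This is a minimal sketch under stated assumptions, not the implementation outlined above: the name trim_around_commas, the int return code, the after_comma and chain_start variables, and the use of size_t and isspace are all choices made for illustration. It keeps whitespace runs inside a field, drops runs that touch a comma, and drops trailing whitespace at the end of the string.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: copies in to out, removing whitespace runs that
 * touch a comma. Returns 0 on success, -1 if out_length is too small. */
int trim_around_commas(const char *in, char *out, size_t out_length)
{
    enum { DEFAULT, WHITESPACE_CHAIN } state = DEFAULT;
    int after_comma = 0;    /* nonzero if the last character written was ',' */
    size_t chain_start = 0; /* index in in[] where the current chain began */
    size_t i, j = 0;

    assert(in != NULL && out != NULL);
    if (out_length == 0)
        return -1;

    for (i = 0; in[i] != '\0'; ++i) {
        unsigned char c = (unsigned char)in[i];
        size_t run = 0;     /* deferred whitespace characters to copy before c */

        switch (state) {
        case DEFAULT:
            if (isspace(c)) {   /* a chain starts: defer the decision */
                state = WHITESPACE_CHAIN;
                chain_start = i;
                continue;
            }
            break;
        case WHITESPACE_CHAIN:
            if (isspace(c))
                continue;       /* the chain goes on */
            if (c != ',' && !after_comma)
                run = i - chain_start;  /* chain was inside a field: keep it */
            state = DEFAULT;
            break;
        }

        /* need room for the kept run, the character itself and the nul */
        if (j + run + 2 > out_length)
            return -1;
        memcpy(out + j, in + chain_start, run);
        j += run;
        out[j++] = in[i];
        after_comma = (c == ',');
    }

    out[j] = '\0';
    return 0;
}
```

An output buffer of strlen(in) + 1 bytes is always large enough, because the function only ever removes characters.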

Also, please use the word "whitespace" to describe these characters, since that is what they are usually called, rather than "white chars".

Answer 5

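A short and reliable way to approach any parsing problem is to use a state loop, which is nothing more than a loop over all the characters in the original string, where you use one (or more) flag variables to track the state of whatever you need to keep track of. In your case here, you need to know whether you are in a post-comma state (that is, just after a comma).

That controls how you handle the next character. You use a simple counter variable to track the number of spaces you have read, and when you reach the next normal character, you append that number of spaces to your new string if you are not post-comma. If you are post-comma, you discard the buffered spaces. (Encountering ',' itself can serve as a flag that does not need to be saved in a variable.)

To remove the whitespace around the ',' delimiters, you can write an rmdelimws() function that takes the new string to fill and the old string to copy from as parameters, and does something similar to the following: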

void rmdelimws (char *newstr, const char *old)
{
  size_t spcount = 0;               /* space count */
  int postcomma = 0;                /* post comma flag */
  
  while (*old) {                    /* loop each char in old */
    if (isspace (*old)) {           /* if space? */
      spcount += 1;                 /* increment space count */
    }
    else if (*old == ',') {         /* if comma? */
      *newstr++ = ',';              /* write to new string */
      spcount = 0;                  /* reset space count */
      postcomma = 1;                /* set post comma flag true */
    }
    else {                          /* normal char? */
      if (!postcomma) {             /* if not 1st char after comma */
        while (spcount--) {         /* append spcount spaces to newstr */
          *newstr++ = ' ';
        }
      }
      spcount = postcomma = 0;      /* reset spcount and postcomma */
      *newstr++ = *old;             /* copy char from old to newstr */
    }
    old++;                          /* increment pointer */
  }
  *newstr = 0;                      /* nul-terminate newstr */
}

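(Note: updated to explicitly nul-terminate newstr, in case it was not initialized to all zeros as it is in the example below.)

If you want to preserve trailing whitespace in the line (for example, the spaces after the final 1 in your example), you can add the following just before nul-terminating the string above: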

  if (!postcomma) {                 /* if tailing whitespace wanted */
    while (spcount--) {             /* append spcount spaces to newstr */
      *newstr++ = ' ';
    }
  }

Putting it all together, a short example is:

#include <stdio.h>
#include <ctype.h>

void rmdelimws (char *newstr, const char *old)
{
  size_t spcount = 0;               /* space count */
  int postcomma = 0;                /* post comma flag */
  
  while (*old) {                    /* loop each char in old */
    if (isspace (*old)) {           /* if space? */
      spcount += 1;                 /* increment space count */
    }
    else if (*old == ',') {         /* if comma? */
      *newstr++ = ',';              /* write to new string */
      spcount = 0;                  /* reset space count */
      postcomma = 1;                /* set post comma flag true */
    }
    else {                          /* normal char? */
      if (!postcomma) {             /* if not 1st char after comma */
        while (spcount--) {         /* append spcount spaces to newstr */
          *newstr++ = ' ';
        }
      }
      spcount = postcomma = 0;      /* reset spcount and postcomma */
      *newstr++ = *old;             /* copy char from old to newstr */
    }
    old++;                          /* increment pointer */
  }
  *newstr = 0;                      /* nul-terminate newstr */
}


int main (void) {
  
  char str[] = "NA ME, NAME   , 123 456, 124   , 14134, 134. 134   ,   1   ",
       newstr[sizeof str] = "";
  
  rmdelimws (newstr, str);
  
  printf ("\"%s\"\n\"%s\"\n", str, newstr);
  return 0;
}

Example Use/Output

$ ./bin/rmdelimws
"NA ME, NAME   , 123 456, 124   , 14134, 134. 134   ,   1   "
"NA ME,NAME,123 456,124,14134,134. 134,1"

Answer 6

You can modify it in place in O(n) using a state machine. In this example, I use re2c to set up and keep track of the state for me.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

static void lex(char *cursor) {
    char *out = cursor, *open = cursor, *close = 0;
start:
    /*!re2c /* Use "re2c parse.re.c -o parse.c" to get C output file. */
    re2c:define:YYCTYPE = "char";
    re2c:define:YYCURSOR = "cursor";
    re2c:yyfill:enable = 0;
    /* Whitespace. */
    [ \f\n\r\t\v]+ { if(!close) open = cursor; goto start; }
    /* Words. */
    [^, \f\n\r\t\v\x00]+ { close = cursor; goto start; }
    /* Comma: write [open, close) and reset. */
    "," {
        if(close)
            memmove(out, open, close - open), out += close - open, close = 0;
        *(out++) = ',';
        open = cursor;
        goto start;
    }
    /* End of string: write any [open, close). */
    "\x00" {
        if(close)
            memmove(out, open, close - open), out += close - open;
        *(out++) = '\0';
        return;
    }
    */
}

int main(void) {
    char command[]
        = "NA ME, NAME   , 123 456, 124   , 14134, 134. 134   ,   1   ";
    printf("<%s>\n", command);
    lex(command);
    printf("<%s>\n", command);
    return EXIT_SUCCESS;
}

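This works by being lazy; that is, by deferring the write of each field until we can be sure it is complete, either at a comma or at the end of the string. It is quite simple: the input belongs to a regular language and needs no lookahead. It preserves whitespace between words that are not separated by a comma. It also overwrites the string in place, so it uses no extra space; we can do that because the edit only ever deletes characters.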
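
For readers without re2c, the same lazy-write idea can be sketched by hand. This is an illustrative approximation of what the lexer above does, not its generated output; the function name trim_fields_in_place is made up, and isspace stands in for the whitespace character class. The open and close pointers play the same roles as in the lexer:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical hand-written counterpart of the lexer: trims whitespace
 * around every comma-separated field of s, in place. */
void trim_fields_in_place(char *s)
{
    char *out = s;      /* next write position, never ahead of cursor */
    char *open = s;     /* start of the current field's kept span */
    char *close = NULL; /* one past the field's last non-space char, if any */
    char *cursor = s;   /* read position */

    assert(s != NULL);

    for (;;) {
        unsigned char c = (unsigned char)*cursor;

        if (c == ',' || c == '\0') {
            /* the field is complete: write [open, close), then the delimiter */
            if (close != NULL) {
                memmove(out, open, (size_t)(close - open));
                out += close - open;
                close = NULL;
            }
            *out++ = *cursor;
            if (c == '\0')
                return;
            open = cursor + 1;
        } else if (isspace(c)) {
            if (close == NULL)
                open = cursor + 1;  /* still in leading whitespace */
        } else {
            close = cursor + 1;     /* extend the span over this character */
        }
        ++cursor;
    }
}
```

Whitespace between two words of the same field lies inside [open, close) and is therefore kept, while leading and trailing whitespace of each field falls outside the span and is never copied.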

c

c89

ansi-c