Why is malloc() allocating 2 more bytes than it's supposed to?

Question

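I am writing a C compiler. Flex recognizes my string tokens and passes them to a function that stores each one in a struct holding information about the token, but first the string's escape characters, i.e. '\', have to be removed. Here is my code: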

char* removeEscapeChars(char* svalue)
{
    char* processedString; //will be the string with escape characters removed
    int svalLen = strlen(svalue);
    printf("svalLen (size of string passed in): %d\n", svalLen);
    printf("svalue (string passed in): %s\n", svalue);
    int foundEscapedChars = 0;
    for (int i = 0; i < svalLen;) 
    {
        if (svalue[i] == '\\') {
            //Found escaped character
            if (svalue[i+1] == 'n') {
                //Found newline character
                svalue[i] = int('\n');
            }
            else if (svalue[i+1] == '0') {
                //Found null character
                svalue[i] = int('\0');
            }
            else {
                //Any other character
                svalue[i] = svalue[i+1];
            }
            i++;
            foundEscapedChars++;
            for (int j = i; j < svalLen + 1; j++) {
                svalue[j] = svalue[j+1];
            }
        }
        else {
            i++;
        }
    }
    int newSize = svalLen - foundEscapedChars;
    processedString = (char*) malloc(newSize * sizeof(char));
    memcpy(processedString, svalue, newSize * sizeof(char));
    printf("newSize: %d\n", newSize);
    printf("processedString: %s\n", processedString);
    printf("processedString Size: %d\n", strlen(processedString));
    
    free(svalue);
    return processedString;
}

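It works 99% of the time, but when it is tested on this particular string (or similar 40-character strings), "-//W3C//DTD XHTML 1.0 Transitional//EN", malloc() appears to allocate 2 bytes too many for the string. The output is below. Note that I call malloc() with int newSize, which is reported as 40, yet strlen() then returns 42. sizeof(char) is also == 1. The main problem is that garbage characters get inserted at the end of the string. What gives?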

"-//W3C//DTD XHTML 1.0 Transitional//EN"
svalLen (size of string passed in): 40
svalue (string passed in) "-//W3C//DTD XHTML 1.0 Transitional//EN"
newSize: 40
processedString: "-//W3C//DTD XHTML 1.0 Transitional//EN"Z
processedString Size: 42
Line 47 Token: STRINGCONST Value: "-//W3C//DTD XHTML 1.0 Transitional//EN"Z Len: 40 Input: "-//W3C//DTD XHTML 1.0 Transitional//EN"

Answer 1

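The code has at least this problem: it attempts to print a "string" that is not a string, because it lacks the terminating null character and the space to store it.

This leads to undefined behavior. Here the UB may show up as extra characters being printed: malloc() did not over-allocate, but strlen() and printf("%s") keep reading past the end of the 40-byte buffer until they happen to hit a zero byte.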

// processedString = (char*) malloc(newSize * sizeof(char));
// memcpy(processedString, svalue, newSize * sizeof(char));
processedString = malloc(newSize + 1);      // +1 leaves room for the terminator
memcpy(processedString, svalue, newSize);
processedString[newSize] = '\0';            // now it is a real string
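
The same allocate, copy, terminate pattern can be wrapped in a small helper. This is only a sketch: the name copy_as_string is hypothetical (it is not in the question's code), and it assumes src points to at least n readable bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy the first n bytes of src into a newly
 * allocated, properly terminated C string.
 * Returns NULL if malloc() fails. The caller must free() the result. */
char *copy_as_string(const char *src, size_t n)
{
    assert(src != NULL);

    char *dst = malloc(n + 1);   /* n characters plus 1 byte for '\0' */
    if (dst == NULL) {
        return NULL;
    }

    memcpy(dst, src, n);         /* copies exactly n bytes, no terminator */
    dst[n] = '\0';               /* strlen(dst) and "%s" are now well defined */
    return dst;
}
```

With this, copy_as_string(svalue, newSize) would take the place of the malloc()/memcpy() pair, and strlen() on the result is exactly n as long as the copied bytes contain no embedded null character.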

There may be other problems as well.

Answer 2

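Here is a rework of your code that takes a different, more conventional approach to handling strings. Start with a function that counts the escaped characters, since that is useful in the next step: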

int escapeCount(char* str) {
    int c = 0;

    // Can just increment and work through the string using the given pointer
    while (*str) {
        // Backslash something here
        if (*str == '\\') {
            ++str;
            ++c;
        }

        if (*str) {
          // Handle unmatched \ at end of string
          ++str;
        }
    }

    return c;
}

Now you can use that information to allocate a buffer of the right size:

char* removeEscapeChars(char* str)
{
    // IMPORTANT: Allocate strlen() + 1 for the NUL byte not counted
    char* result = malloc(strlen(str) - escapeCount(str) + 1);
    char* r = result;

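    while (*str) {
        if (*str == '\\') {
            ++str;  // Skip the backslash

            if (*str == '\0') {
                // Unmatched \ at end of string: drop it
                break;
            }

            switch (*str) {
                case 'n':
                    *r = '\n';
                    break;
                case 'r':
                    *r = '\r';
                    break;
                case 't':
                    *r = '\t';
                    break;
                default:
                    *r = *str;
                    break;
            }
        }
        else {
            *r = *str;
        }

        ++str;
        ++r;
    }

    // Terminate the result; the + 1 in the malloc() above reserves this byte
    *r = '\0';

    return result;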
}
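
Since unescaping never makes a string longer, the same read-cursor/write-cursor idea can also be applied in place, without a second buffer. The following is a minimal sketch rather than part of the original answer: the name unescape_in_place is hypothetical, it assumes the caller passes a writable, null-terminated buffer, and it treats any escape other than \n, \r and \t as the escaped character itself (so \0 becomes the digit 0 rather than an embedded terminator).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-place variant: replaces each escape sequence in s with
 * the character it stands for and returns the new string length. */
size_t unescape_in_place(char *s)
{
    assert(s != NULL);

    const char *in = s;   /* read cursor */
    char *out = s;        /* write cursor, never ahead of the read cursor */

    while (*in != '\0') {
        if (*in == '\\') {
            ++in;                         /* skip the backslash */
            if (*in == '\0') {
                break;                    /* unmatched backslash at the end: drop it */
            }
            switch (*in) {
                case 'n': *out = '\n'; break;
                case 'r': *out = '\r'; break;
                case 't': *out = '\t'; break;
                default:  *out = *in;  break;   /* e.g. \\ or \" */
            }
        } else {
            *out = *in;
        }
        ++in;
        ++out;
    }

    *out = '\0';                          /* always terminate the shortened string */
    return (size_t)(out - s);
}
```

The returned length is handy for a lexer that wants to store the token size alongside the text, and the caller keeps ownership of the buffer, so no free() is needed inside the function.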

c

parsing

compiler-construction

flex-lexer