Edge case in C program to find and replace words from a text file

Question

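I'm new to C and would appreciate help fixing a bug in my program.

I've found an edge case, and I'm not sure how to fix it.

Currently, the function finds and replaces words, and words within words, in a given text file. For example, changing 'water' to 'snow' changes the string 'waterfall' to 'snowfall'. This is the intended result.

However, when I enter 'waterfalls' as the replacement for the word 'waterfall', the program seems to get stuck in an infinite loop. I'm not sure why, but I'd appreciate it if someone could point me in the right direction.

Here is my code: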

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdbool.h>

#define BUFFER_SIZE 20

void replaceWord(char *str, const char *oldWord, const char *newWord)
{
    char *position, buffer[BUFFER_SIZE];
    int index, oldWordLength;
    oldWordLength = (long)strlen(oldWord);
    while ((position = strstr(str, oldWord)) != NULL)
    {
        strcpy(buffer, str);
        index = position - str;
        str[index] = '\0';
        strcat(str, newWord);
        strcat(str, buffer + index + oldWordLength);
    }
}

int main()
{
    char msg[100] = "This is some text with the word snowfall to replace";
    puts(msg);
    replaceWord(msg, "snowfall", "snowfalls");
    puts(msg);
    return 0;
}

Answer 1

OK. First, your buffer size is far too small. This:

char buffer[BUFFER_SIZE];

ends up being the target of a full-string copy of the original message. But in main, the original message:

char msg[100] = "This is some text with the word snowfall to replace";

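is 51 characters wide (not counting the terminator). That won't work, and running the program under an address sanitizer (or, ideally, a regular debugger) will show this: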

==1==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffd578054d4 at pc 0x7f7457e4c846 bp 0x7ffd57805460 sp 0x7ffd57804c10
WRITE of size 52 at 0x7ffd578054d4 thread T0
    #0 0x7f7457e4c845  (/opt/compiler-explorer/gcc-11.2.0/lib64/libasan.so.6+0x55845)
    #1 0x4012a7 in replaceWord /app/example.c:15
    #2 0x401592 in main /app/example.c:27
    #3 0x7f7457c2c0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)
    #4 0x40112d in _start (/app/output.s+0x40112d)

Address 0x7ffd578054d4 is located in stack of thread T0 at offset 52 in frame
    #0 0x4011f5 in replaceWord /app/example.c:9

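So that's obviously not good. Increasing the buffer size will "work", but to really do this right, the source/target buffer (they are one and the same in your function) should have its full writable width (including space for the terminator) passed in as a parameter, which you should absolutely be doing.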

Second, the code that does this:

while ((position = strstr(str, oldWord)) != NULL)

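always starts searching for oldWord from the beginning of the input string. That's wrong (well, it's right exactly once, on the first pass; after that, it's wrong). Consider this: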

i love c++

Suppose I look for i and want to replace it with is. I'll find it here:

i love c++
^

After the replacement, the new string I'm building looks like this:

is love c++

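So where do you start the next search? You start at the position where the match began in the original string, plus the length of the replacement string. The match was at pos 0 and the replacement's length is 2, so we start the next search at pos 2.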

is love c++
  ^
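
To make the effect of the starting position concrete, here is a minimal sketch. The helper foundFrom is hypothetical (it is not part of the original code); it only reports whether oldWord occurs at or after a given offset, and it assumes start is no larger than the length of text.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper for illustration only.
 * Returns 1 if oldWord occurs in text at or after offset start, else 0.
 * Assumes start <= strlen(text).
 */
static int foundFrom(const char *text, const char *oldWord, size_t start)
{
    return strstr(text + start, oldWord) != NULL;
}
```

With the string "is love c++", foundFrom(text, "i", 0) returns 1 because the search lands on the i inside the replacement that was just written, while foundFrom(text, "i", 2) returns 0. The same thing happens when snowfall is replaced by snowfalls: a search that restarts at the beginning finds snowfall inside snowfalls every time, so the loop never ends.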

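Note that this gets more complicated when you do everything in-place (i.e., without an intermediate buffer), but that doesn't seem to be your goal right now, and it's probably off your radar anyway. So a not-very-efficient, but functional, approach is: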

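1. Start at the beginning of the string (src = str).
2. Search for the old word, starting at src.
3. If found, copy the original string up to, but not including, the old word into the buffer.
4. Append the replacement string to the buffer.
5. Append the remainder of the original string, past the old word, to the buffer.
6. Copy the buffer back to the source string.
7. Reposition src to be the length of the replacement word past the position where the old word was found in (2).
8. Loop back to (2) until the old word is no longer found.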

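As I said: not efficient, but easy to understand. In code, it looks like this. Note that the buffer size is considerably larger, and that it is used to declare both the temporary buffer and the array in main. That's still not good, but it's what you brought, so I'm sticking with it. I urge you to consider doing this algorithm with dynamic memory management, or with size restrictions passed as additional parameters: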

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdbool.h>

#define BUFFER_SIZE 100

void replaceWord(char *str, const char *oldWord, const char *newWord)
{
    char buffer[ BUFFER_SIZE ];
    char *src = str;
    char *oldpos = NULL;

    size_t lenOldWord = strlen(oldWord);
    size_t lenNewWord = strlen(newWord);

    while ((oldpos = strstr(src, oldWord)) != NULL)
    {
        // 1. copy everything up to the old word.
        // 2. append the new word
        // 3. copy the remainder of source string *past* the old word
        // 4. copy back to the original string.
        memcpy(buffer, str, (size_t)(oldpos - str));
        memcpy(buffer + (oldpos - str), newWord, lenNewWord);
        strcpy(buffer + (oldpos - str) + lenNewWord, oldpos + lenOldWord);
        strcpy(str, buffer);

        // the new starting point will be the previous discovery
        //  location plus the length of the new word.
        src = oldpos + lenNewWord;
    }
}

int main()
{
    char msg[BUFFER_SIZE] = "This is some text with the word snowfall to replace";
    puts(msg);
    replaceWord(msg, "snowfall", "snowfalls");
    puts(msg);
    return 0;
}

Output

This is some text with the word snowfall to replace
This is some text with the word snowfalls to replace

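I strongly recommend running this in a debugger and stepping through it to watch how it works. It will help you see where you were missing the start-search-here logic. I even more strongly recommend that, as an exercise, you address the obvious holes in this algorithm. Think about the ways undefined behavior can easily be invoked (hint: a short, common old word and a very, very long replacement word), and what should be done to address them.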
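
One possible way to close the overflow hole is sketched below. It is not the code from the answer above: the name replaceWordBounded, the capacity parameter, and the 0 / -1 return convention are all assumptions made for illustration. It also swaps the temporary buffer for an in-place memmove. It assumes str already holds a valid string that fits in capacity bytes.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical variant: replaces every occurrence of oldWord in str with
 * newWord. capacity is the full writable size of str in bytes, including
 * room for the terminator. Returns 0 on success, or -1 if oldWord is empty
 * or the next replacement would not fit (str then keeps the replacements
 * made so far).
 */
int replaceWordBounded(char *str, size_t capacity,
                       const char *oldWord, const char *newWord)
{
    size_t lenOldWord = strlen(oldWord);
    size_t lenNewWord = strlen(newWord);
    size_t len = strlen(str);
    char *src = str;
    char *oldpos;

    if (lenOldWord == 0)
        return -1; /* an empty search word would match everywhere */

    while ((oldpos = strstr(src, oldWord)) != NULL)
    {
        /* bytes after the old word, including the terminator */
        size_t tailLen = len - (size_t)(oldpos - str) - lenOldWord + 1;

        /* refuse the replacement if the grown string would overflow */
        if (lenNewWord > lenOldWord &&
            lenNewWord - lenOldWord > capacity - len - 1)
            return -1;

        /* shift the tail to its new place, then drop in the new word */
        memmove(oldpos + lenNewWord, oldpos + lenOldWord, tailLen);
        memcpy(oldpos, newWord, lenNewWord);

        /* track the new length and resume just past the replacement */
        len = len - lenOldWord + lenNewWord;
        src = oldpos + lenNewWord;
    }
    return 0;
}
```

A call such as replaceWordBounded(msg, sizeof msg, "snowfall", "snowfalls") would stand in for the replaceWord call in main. Whether a failed call should leave the string partially replaced, as this sketch does, or untouched is a design choice the caller has to make.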
