Why do these two methods of counting words differ significantly?

Question

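I wrote a program that lets a user find the number of instances of a word, or of a collection of words, in any text file. The user can enter the following on the command line: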

$ ./wordCount Mars TripToMars.txt

to search the book Trip To Mars for the number of occurrences of the word "Mars", or

$ ./wordCount -f collectionOfSearchWords.txt TripToMars.txt

to search for the number of instances of several words, listed one per line in collectionOfSearchWords.txt.

To make sure the program is correct, I used the grep command:

$ grep -o 'Mars' TripToMars.txt | wc -w

and

$ grep -o -w 'Mars' TripToMars.txt | wc -w

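The first command finds the number of times the word appears anywhere, including inside terms such as "Marsa", "Marseen", and "Marses", whereas the second command finds only instances of "Mars" as a standalone word, which includes instances with trailing punctuation such as "Mars.", "Mars!", and "Mars?".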

Both grep commands return 49 as the number of instances of "Mars" in the book.

When I use the code in the while loop below (for simplicity, I include only the relevant code), the program returns 49. Great!

FILE *textToSearch;
char *readMode = "r";

int count = 0;
char nextWord[100];
int d;  // fscanf returns an int; a char cannot be compared with EOF reliably

textToSearch = fopen(argVector[argCount-1], readMode);
if (textToSearch == NULL) {
    fprintf(stderr, "Cannot open %s to be searched\n", argVector[argCount-1]);
    return 1;
} else {
    while (fscanf(textToSearch, "%*[^a-zA-Z]"), fscanf(textToSearch, "%80[a-zA-Z]", nextWord) > 0) { 

        // increment the counter if the word is a match
        if (strcmp(nextWord, argVector[word]) == 0) {
            count++;
        }
    }
}

But when I replace the previous while loop with this one, the program returns 17.

while(1) {
    d = fscanf(textToSearch, "%s", nextWord);
    if (d == EOF) break;

    // increment the counter if the word is a match
    if (strcmp(nextWord, argVector[word]) == 0) {
        count++;
    }
}

So, what is the major difference between

while (fscanf(textToSearch, "%*[^a-zA-Z]"), fscanf(textToSearch, "%80[a-zA-Z]", nextWord) > 0) {}

and

while(1) {
    d = fscanf(textToSearch, "%s", nextWord);
    if (d == EOF) break;
}

?

EDIT:

I added this code:

if (strcmp(nextWordDict, nextWord) == 0 ||
     strcmp(nextWordDict, strcat(nextWord, ".")) == 0 ||
     strcmp(nextWordDict, strcat(nextWord, "?")) == 0 ||
     strcmp(nextWordDict, strcat(nextWord, "!")) == 0 ||
     strcmp(nextWordDict, strcat(nextWord, ",")) == 0) {
        count++;
 }

I added it to the version that produces 17 for Mars, trying to account for trailing punctuation, but nothing changed. The count is still 17.

EDIT2:

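As John Bollinger correctly points out below, this code does nothing useful, because the string read into nextWord already carries its trailing punctuation, and the code only appends more punctuation to it. That was faulty thinking on my part.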

Answer 1

You are mistaken in saying that the command ...

$ grep -o -w 'Mars' TripToMars.txt | wc -w

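... "finds only instances of 'Mars' as a standalone word", or at least that statement is misleading in context. The command finds instances of "Mars" that are not part of a larger word, where a "word" is defined as a contiguous string of letters, digits, and/or underscores. In particular, it matches "Mars" wherever it is followed by punctuation, which conflicts with what you seem to be claiming.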

But what is the difference between your two scanning approaches? Well, this ...

while (fscanf(textToSearch, "%*[^a-zA-Z]"),
        fscanf(textToSearch, "%80[a-zA-Z]", nextWord) > 0) { /* ... */ }

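... scans zero or more characters that are not Latin letters, ignoring whether anything matched and whether an input error occurred, then scans a contiguous sequence of up to 80 Latin letters, recording that sequence in the nextWord buffer.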

On the other hand, this ...

while(1) {
    d = fscanf(textToSearch, "%s", nextWord);
    if (d == EOF) break;
}

... skips leading whitespace, then scans the next contiguous string of non-whitespace characters into nextWord.

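The two differ significantly in how they handle characters that are neither Latin letters nor whitespace: the former ignores them, whereas the latter includes them in nextWord. When you subsequently compare nextWord with the string "Mars", the latter misses the "Mars" in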

Going to Mars.

and

The name "Mars"

and

Is there water on Mars?

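because the adjacent punctuation is included in the comparison. Very likely your text has many constructions like these, and your grep commands do not demonstrate otherwise.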

