How can I find repeated words in a file using grep/egrep?

Question

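I need to find repeated words in a file using egrep (or grep -E) on Unix (bash).

I tried:

egrep "(\<[a-zA-Z]+\>) \1" file.txt

and

egrep "(\b[a-zA-Z]+\b) \1" file.txt

But for some reason these report things as repeated that are not! For example, they treat the string "word words" as a match despite the word-boundary conditions \> or \b.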

Answer 1

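\1 matches whatever string was matched by the first capture. That is not the same as matching the same pattern that the first capture matched. So the fact that the first capture matched at a word boundary no longer matters for the backreference, even though the \b sits inside the capturing parentheses.

If you want the second instance to end at a word boundary as well, you need to say so:

egrep "(\b[a-zA-Z]+) \1\b" file.txt

This is no different from:

egrep "\b([a-zA-Z]+) \1\b" file.txt

The space in the pattern forces a word boundary, so I removed the redundant \b's. If you want to be more explicit, you can put them in:

egrep "\<([a-zA-Z]+)\> \<\1\>" file.txt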
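To make the difference concrete, here is a minimal sketch that runs the loose pattern from the question and the stricter one from this answer against the same input. It assumes GNU grep (both \b and backreferences in -E mode are GNU extensions) and uses grep -E, the equivalent of egrep. The sample lines and the variable names are made up for illustration.

```shell
# Build a small sample file (hypothetical contents, for illustration only).
sample=$(mktemp)
printf '%s\n' 'word words' 'hello and and bye' 'hello bye' > "$sample"

# Loose pattern: \1 only repeats the captured text, so the "word" at the
# start of "words" is accepted as a repeat.
loose=$(grep -E '(\b[a-zA-Z]+\b) \1' "$sample")

# Strict pattern: a \b after the backreference requires the second copy
# to end at a word boundary too.
strict=$(grep -E '\b([a-zA-Z]+) \1\b' "$sample")

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
rm -f "$sample"
```

The loose pattern should report both "word words" and "hello and and bye", while the strict one should report only "hello and and bye".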

Answer 2

This is the expected behaviour. See what man grep says:

The Backslash Character and Special Expressions

The symbols \< and \> respectively match the empty string at the beginning and end of a word. The symbol \b matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the edge of a word. The symbol \w is a synonym for [[:alnum:]] and \W is a synonym for [^[:alnum:]].

Then, elsewhere, we see what a "word" is:

Matching Control

Word-constituent characters are letters, digits, and the underscore.

This is what it produces:

$ cat a
hello bye
hello and and bye
words words
this are words words
"words words"
$ egrep "(\b[a-zA-Z]+\b) \1" a
hello and and bye
words words
this are words words
"words words"
$ egrep "(\<[a-zA-Z]+\>) \1" a
hello and and bye
words words
this are words words
"words words"
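To see which part of each line actually matched, one option is the -o flag, which prints only the matched text. The sketch below recreates the sample file from the session above in a temporary file and assumes GNU grep; the variable names are illustrative.

```shell
# Recreate the sample file from the session above in a temporary location.
sample=$(mktemp)
printf '%s\n' 'hello bye' 'hello and and bye' 'words words' \
    'this are words words' '"words words"' > "$sample"

# -o prints only the matched portion of each matching line.
matches=$(grep -oE '(\b[a-zA-Z]+\b) \1' "$sample")

printf '%s\n' "$matches"
rm -f "$sample"
```

For the quoted line, the match should exclude the double quotes, since quotes are not word-constituent characters and therefore form a word boundary.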

Answer 3

egrep "(\<[a-zA-Z]+\>) \<\1\>" file.txt

solved the problem.

Basically, you have to tell \1 that it also needs to stay within word boundaries.

Answer 4

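I use

pcregrep -M '(\b[a-zA-Z]+)\s+\1\b' *

to check my documents for this kind of error. It also works when there is a line break between the repeated words.

Explanation:

-M, --multiline: run in multiline mode (important if there is a line break between the repeated words).
[a-zA-Z]+: matches a word
\b: word boundary
(\b[a-zA-Z]+): groups the word
\s+: matches at least one (but possibly more) whitespace character. This includes line breaks.
\1: matches whatever is in the first group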
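
As a rough sketch of the cross-line case, the snippet below writes two lines where the repeated word straddles a line break and runs the same pcregrep command on them. The file contents and variable names are hypothetical, and the guard is there because pcregrep is not installed everywhere.

```shell
# Two lines where "the" is repeated across the line break (made-up sample).
sample=$(mktemp)
printf '%s\n' 'this sentence ends with the' 'the same word' > "$sample"

# Run the multiline search only if pcregrep is available on this system.
if command -v pcregrep >/dev/null 2>&1; then
    out=$(pcregrep -M '(\b[a-zA-Z]+)\s+\1\b' "$sample")
    printf '%s\n' "$out"
else
    out=''
    echo 'pcregrep not installed; skipping' >&2
fi
rm -f "$sample"
```

With -M, a match that spans a line break is expected to print every line it touches, so both sample lines should appear in the output.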

