How to remove duplicate words from two sentences using a shell script?

Question

I have two sentences containing duplicate words, for example this input data in the file my_text.txt:

The Unix and Linux operating system.
The Unix and Linux system was to create an environment that promoted efficient program.

I used this script:

while read p
do
echo "$p"|sort -u | uniq
done < my_text.txt

But the output is identical to the contents of the input file:

The Unix and Linux operating system.
The Unix and Linux system was to create an environment that promoted efficient program.

How can I remove the duplicate words from the two sentences?

Answer 1

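To output the processed lines while preserving the order in which the words appear, you can use awk to parse the text and remove the duplicates. This script supports multiple sentences and takes into account words followed by common punctuation (.,;):

File remove_duplicates.awk: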

#!/usr/bin/awk -f

{
    # Count occurrences of each word across the whole input, keyed by the word itself
    for (i = 1; i <= NF; i++) {
        sub(/[.,;]/, "", $i)    # strip one punctuation character from the word
        seen_words[$i]++
    }
    # Store the line (rebuilt without that punctuation), keyed by line number
    lines[NR] = $0
}
END {
    # Process the stored lines in input order
    for (i = 1; i <= NR; i++) {
        n = split(lines[i], word, " ")
        output_line = ""
        for (j = 1; j <= n; j++) {
            # Keep only the words that occur exactly once in the whole input
            if (seen_words[word[j]] <= 1) {
                output_line = output_line (output_line == "" ? "" : " ") word[j]
            }
        }
        print output_line
    }
}

Usage:

./remove_duplicates.awk < input_text

Output:

operating
was to create an environment that promoted efficient program

Answer 2

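Your code would remove duplicate lines; both sort and uniq operate on lines, not on words. (Even then, the loop is superfluous; if you wanted to do that, your code should be simplified to sort -u my_text.txt.)

The usual fix is to split the input into one word per line. Real-world text brings some complications, but a first, basic Unix 101 implementation would look like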

tr ' ' '\n' <my_text.txt | sort -u

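Of course, this gives you the words in a different order than in the original, and it keeps one occurrence of each word. If you want to discard any word that occurs more than once, maybe try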

tr ' ' '\n' <my_text.txt | sort | uniq -c | awk '$1 == 1 { print $2 }'

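(If your tr doesn't recognize \n as a newline, maybe try '\012'.)

Here is a dead simple two-pass Awk script which hopefully is a little more useful. It collects all the words into memory during the first pass over the file, then on the second pass removes every word that occurred more than once.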

awk 'NR==FNR { for (i=1; i<=NF; ++i) ++a[$i]; next }
{ for (i=1; i<=NF; ++i) if (a[$i] > 1) $i="" } 1' my_text.txt my_text.txt

This leaves whitespace where words were removed; fixing that should be easy enough with a final sub().
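The whitespace cleanup mentioned above could look something like the following. This is only a sketch: the function name drop_repeated_words is made up for illustration, and it assumes the input is a regular file that can be read twice (not a pipe).

```shell
# Hypothetical helper: print each line of FILE with every word that
# occurs more than once in the whole file removed.
drop_repeated_words() {
    awk '
        # First pass: count every whitespace-separated word.
        NR == FNR { for (i = 1; i <= NF; ++i) ++count[$i]; next }
        # Second pass: blank out repeated words, then tidy the spacing.
        {
            for (i = 1; i <= NF; ++i) if (count[$i] > 1) $i = ""
            gsub(/  +/, " ")    # collapse the runs of spaces left behind
            sub(/^ /, "")       # trim a leading space
            sub(/ $/, "")       # trim a trailing space
            print
        }
    ' "$1" "$1"
}
```

Called as drop_repeated_words my_text.txt on the sample input, this should print "operating system." and "system was to create an environment that promoted efficient program.". Note that "system." and "system" still count as different words here, because punctuation is not handled at all.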

A more useful program would split off any punctuation and fold words to lowercase (so that Word, word, Word! and word? don't count as separate words).
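One possible way to do that normalization, as a rough sketch rather than a finished tool: compare words by a normalized key, but print them as they appeared. The names drop_repeated_words_normalized and norm are invented for this example, and the key simply drops everything that is not an ASCII letter or digit, which is an assumption that will not suit non-English text.

```shell
# Hypothetical helper: like the two-pass script, but words are compared
# case-insensitively and with punctuation ignored.
drop_repeated_words_normalized() {
    awk '
        # Build the comparison key: ASCII letters and digits only, lowercased.
        function norm(w) { gsub(/[^A-Za-z0-9]/, "", w); return tolower(w) }
        # First pass: count the normalized form of every word.
        NR == FNR { for (i = 1; i <= NF; ++i) ++count[norm($i)]; next }
        # Second pass: keep only the words whose key occurred exactly once.
        {
            out = ""
            for (i = 1; i <= NF; ++i)
                if (count[norm($i)] == 1)
                    out = out (out == "" ? "" : " ") $i
            print out
        }
    ' "$1" "$1"
}
```

On the sample file this should give "operating" for the first line, since "system." and "system" now share a key, and "was to create an environment that promoted efficient program." for the second. The surviving words keep their original punctuation because only the key is normalized.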

Answer 3

Using awk (GNU awk):

 awk '{
        for (i=1;i<=NF;i++) {                 # Loop over each word on each line
          gsub(/[[:punct:]]/,"",$i);          # Strip out any punctuation
          cnt++;                              # Running word counter
          if (!map[$i]) {                     # First time this word is seen:
            map[$i]=cnt                       # store the counter, keyed by the word
          }
         }
      }
  END {
        PROCINFO["sorted_in"]="@val_num_asc"; # Traverse the array by value, numerically ascending
        for (i in map) {
           printf "%s ",i                     # Print each word followed by a space
        }
       }' filename

One-liner:

 awk '{ for (i=1;i<=NF;i++) { gsub(/[[:punct:]]/,"",$i);cnt++;if (!map[$i]) { map[$i]=cnt } } } END { PROCINFO["sorted_in"]="@val_num_asc";for (i in map) { printf "%s ",i } }' filename

Note - this will strip out any punctuation (such as the full stops after words).
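PROCINFO["sorted_in"] is specific to GNU awk. If only a plain POSIX awk is available, the same first-occurrence order can be had by printing each word the first time it is seen. The sketch below is an alternative technique, not the script above: the function name unique_words_in_order is hypothetical, and punctuation is approximated by removing every character that is not an ASCII letter or digit.

```shell
# Hypothetical helper: print every distinct word once, in order of first
# appearance, on a single line. Reads the named files, or stdin if none.
unique_words_in_order() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                word = $i
                gsub(/[^A-Za-z0-9]/, "", word)    # ASCII-only stand-in for [[:punct:]]
                # Append the word only on its first occurrence.
                if (word != "" && !seen[word]++)
                    out = out (out == "" ? "" : " ") word
            }
        }
        END { print out }
    ' "$@"
}
```

For example, unique_words_in_order my_text.txt on the sample input should print "The Unix and Linux operating system was to create an environment that promoted efficient program". The comparison is case-sensitive, as in the gawk version.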

Answer 4

You can use this command to remove the duplicate words from the two sentences:

tr ' ' '\n' <my_text.txt | sort | uniq | xargs
