How can I remove stop words from a sentence using a shell script?

Question

I want to remove stop words from the sentences in a file.

By stop words I mean:
[I, a, an, as, at, the, by, in, for, of, on, that]

I have this sentence in the file my_text.txt:

One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program

I want to remove the stop words from the sentence above.

I used this script:

array=( I a an as at the by in for of on that  )
for i in "${array[@]}"
do
cat $p  | sed -e 's/\<$i\>//g' 
done < my_text.txt

But the output is:

One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program

The expected output should be:

One primary goals design Unix system was to create an environment promoted efficient program

Note: I want to remove stop words, not duplicate words.

Answer 1

Like this, assuming $p is an existing file:

 sed -i -e "s/\<$i\>//g" "$p"

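You must use double quotes, not single quotes, for the variable to be expanded.

The -i switch makes sed edit the file in place.

Learn how to quote properly in the shell; it is very important: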

"Double quote" every literal that contains spaces/metacharacters and every expansion: "$var", "$(command "$var")", "${array[@]}", "a & b". Use 'single quotes' for code or literal $'s: 'Costs $5 US', ssh host 'echo "$HOSTNAME"'. See
http://mywiki.wooledge.org/Quotes
http://mywiki.wooledge.org/Arguments
http://wiki.bash-hackers.org/syntax/words
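
To make the quoting point concrete, here is a small sketch comparing the two forms. It assumes GNU sed (the \< and \> word boundaries are GNU extensions); the variable names line, single and double are only illustrative.

```shell
#!/usr/bin/env bash
# Assumes GNU sed: \< and \> are GNU extensions.
i="the"
line="One of the primary goals"

# Single quotes: sed receives the literal characters $i, so nothing matches.
single=$(printf '%s\n' "$line" | sed -e 's/\<$i\>//g')

# Double quotes: the shell expands $i to "the" before sed runs.
double=$(printf '%s\n' "$line" | sed -e "s/\<$i\>//g")

printf 'single: [%s]\n' "$single"
printf 'double: [%s]\n' "$double"
```

The first result should be the unchanged line, which is what the script in the question produced; the second has "the" removed but still contains the space that followed it.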

Finally

array=( I a an as at the by in for of on that  )
for i in "${array[@]}"
do
    sed -i -e "s/\<$i\>\s*//g" Input_File 
done
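
The loop above runs sed once per stop word and rewrites the file each time. As an alternative that is not part of the original answer, the same list could be joined into one alternation and applied in a single pass. This is only a sketch, assuming GNU sed with -E; the names pattern and remove_stop_words are hypothetical, and the function filters standard input instead of editing a file in place.

```shell
#!/usr/bin/env bash
# Assumes GNU sed: \<, \> and \s are GNU extensions, also available with -E.
array=( I a an as at the by in for of on that )

# Join the array with "|" to get I|a|an|...|that (IFS is changed only in the subshell).
pattern=$(IFS='|'; printf '%s' "${array[*]}")

# Hypothetical helper: remove every stop word and the whitespace after it.
remove_stop_words() {
    sed -E "s/\<($pattern)\>\s*//g"
}

sentence="One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program"
printf '%s\n' "$sentence" | remove_stop_words
```

Because every word in the list is removed, "an" disappears as well.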

Bonus

Try it without \s* to see why I added it to the regex.
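
For readers who want to check their result, a short comparison follows. It is a sketch assuming GNU sed (\s is a GNU extension); the sample line and the variable names without and with are made up for illustration.

```shell
#!/usr/bin/env bash
# Assumes GNU sed: \<, \> and \s are GNU extensions.
line="goals in the design of the Unix system"

# Without \s*: the word is removed but the space after it stays,
# so two spaces end up next to each other.
without=$(printf '%s\n' "$line" | sed -e 's/\<the\>//g')

# With \s*: the whitespace that followed the word is removed too.
with=$(printf '%s\n' "$line" | sed -e 's/\<the\>\s*//g')

printf 'without: [%s]\n' "$without"
printf 'with:    [%s]\n' "$with"
```

Note that \s* only consumes whitespace after the word, so a stop word at the very end of a line still leaves the space before it.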

Answer 2

One in awk. It works as a starting point, but it needs proper punctuation handling and then some (luckily your data had none):

$ awk '
NR==FNR {                         # first file: process stop words
    split($0,a,/,/)               # comma separated without space
    for(i in a)                   # they go to b hash
        b[a[i]]
    next
}
{                                 # reading the text
    for(i=1;i<=NF;i++)            # iterating the words
        if(!($i in b))            # if current word not found in stop words
            printf "%s%s",$i,OFS  # output it (leftover space in the end, sorry)
    print ""                      # finish the line with a newline
}' words text

Output:

One primary goals design Unix system was to create environment promoted efficient program

Why use awk? The shell is a tool for managing files and starting programs. Everything else is better handled elsewhere.
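
The punctuation handling mentioned above could look roughly like the following. This is only a sketch, not the answerer's code: the function name remove_stop_words_awk, its argument (a comma-separated stop word list) and the set of punctuation characters are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: $1 is a comma-separated stop word list, text comes from stdin.
remove_stop_words_awk() {
    awk -v list="$1" '
    BEGIN {
        n = split(list, a, /,/)            # stop words into array a
        for (k = 1; k <= n; k++)
            stop[a[k]]                     # referencing the element creates it
    }
    {
        out = ""
        for (i = 1; i <= NF; i++) {
            key = $i
            # compare a copy with leading and trailing punctuation stripped
            gsub(/^[.,;:!?()"]+|[.,;:!?()"]+$/, "", key)
            if (!(key in stop))
                out = out (out == "" ? "" : OFS) $i
        }
        print out                          # no leftover space at the end
    }'
}

printf '%s\n' 'Look at that, then sit by the fire.' |
    remove_stop_words_awk 'I,a,an,as,at,the,by,in,for,of,on,that'
```

With this input the sketch prints "Look then sit fire.": the stop word "that," is dropped together with its comma, while kept words retain their punctuation. Matching stays case sensitive, as in the answer above.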

Answer 3

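I also like using awk for text processing. Assuming the input data is in the file mytext.txt and script is the file containing the code below, simply run awk -f script mytext.txt.

Also, keeping the words in the stopwords variable should make it easy to change the stop words when needed. Keep in mind that both mytext.txt and stopwords may only contain space-separated words.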

BEGIN {
stopwords = "I a an as at the by in for of on that"
split(stopwords, wordarray)
ORS = " "
RS = " "
}

{
equals = 0
for (w in wordarray)
  if ($0 == wordarray[w])
    equals = 1
if (equals == 0) print $0
}

Answer 4

You can use this script:

while read -r p
do
  # one sed call, one -e expression per stop word; the result is printed
  # rather than written back to my_text.txt, which the loop is still reading
  echo "$p" | sed -e 's/\<I\>//g' -e 's/\<an\>//g' -e 's/\<a\>//g' -e 's/\<as\>//g' -e 's/\<at\>//g' -e 's/\<the\>//g' -e 's/\<by\>//g' -e 's/\<in\>//g' -e 's/\<for\>//g' -e 's/\<of\>//g' -e 's/\<on\>//g'
done < my_text.txt

The output should then look like this:

One primary goals design Unix system was to create an environment promoted efficient program


