How can I remove stop words from a sentence using a shell script?

I want to remove stop words from the sentences in a file.

By stop words I mean:
[I, a, an, as, at, the, by, in, for, of, on, that]

I have this sentence in the file my_text.txt:

One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program

I want to remove the stop words from the sentence above.

I used this script:

array=( I a an as at the by in for of on that  )
for i in "${array[@]}"
do
cat $p  | sed -e 's/\<$i\>//g' 
done < my_text.txt

But the output is:

One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program

The expected output should be:

One primary goals design Unix system was to create an environment promoted efficient program

Note: I want to remove stop words, not duplicate words.

Like this, assuming $p is an existing file:

 sed -i -e "s/\<$i\>//g" "$p"

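You must use double quotes, not single quotes, so that the shell expands the variable.

The -i switch makes sed edit the file in place.

Learning how to quote properly in the shell is very important:

"Double quote" every literal that contains spaces/metacharacters and every expansion: "$var", "$(command "$var")", "${array[@]}", "a & b". Use 'single quotes' for code or literal $'s: 'Costs $5 US', ssh host 'echo "$HOSTNAME"'. See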
http://mywiki.wooledge.org/Quotes
http://mywiki.wooledge.org/Arguments
http://wiki.bash-hackers.org/syntax/words
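
To make the quoting point concrete, here is a minimal sketch that runs the same substitution with single and with double quotes. The variable names `text`, `single` and `double` are made up for this illustration, and `\<` / `\>` are assumed to be available (they are GNU sed extensions).

```shell
#!/usr/bin/env bash
# Illustration only: variable names are hypothetical.
i=the
text='One of the primary goals'

# Single quotes: sed receives the literal characters $i, so nothing matches.
single=$(printf '%s\n' "$text" | sed -e 's/\<$i\>//g')

# Double quotes: the shell expands $i to "the" before sed runs.
double=$(printf '%s\n' "$text" | sed -e "s/\<$i\>//g")

printf '[%s]\n' "$single"   # line is unchanged
printf '[%s]\n' "$double"   # "the" is gone, its surrounding spaces remain
```

With single quotes the sentence comes back untouched, which is exactly the symptom described in the question.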

Finally:

array=( I a an as at the by in for of on that  )
for i in "${array[@]}"
do
    sed -i -e "s/\<$i\>\s*//g" Input_File 
done

Bonus

Try it without \s* to understand why I added that part of the regex.
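
A small sketch of that experiment, assuming GNU sed (for `\<`, `\>` and `\s`). The sample line and the variable names `without` and `with` are chosen only for this illustration.

```shell
#!/usr/bin/env bash
# Illustration only: compare the substitution with and without \s*.
line='One of the primary goals'

# Without \s*: each removed word leaves both of its neighbouring spaces behind.
without=$(printf '%s\n' "$line" | sed -e 's/\<of\>//g' -e 's/\<the\>//g')

# With \s*: the whitespace that follows each removed word is removed too.
with=$(printf '%s\n' "$line" | sed -e 's/\<of\>\s*//g' -e 's/\<the\>\s*//g')

printf '[%s]\n' "$without" "$with"
```

The first result keeps a run of spaces where the words used to be; the second reads as a normal sentence. A stop word at the very end of a line would still leave the space in front of it, since `\s*` only consumes what follows the word.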

Here is one in awk. It is a working prototype, but it would need proper punctuation handling and then some (luckily your data has none):

$ awk '
NR==FNR {                         # process stop words
    split($0,a,/,/)               # comma separated without space
    for(i in a)                   # they go to b hash
        b[a[i]]
    next
}
{                                 # reading the text
    for(i=1;i<=NF;i++)            # iterating the words
        if(!($i in b))            # if current word not found in stop words
            printf "%s%s",$i,OFS  # output it (leftover space in the end, sorry)
    print ""                      # newline in the end
}' words text

Output:

One primary goals design Unix system was to create environment promoted efficient program 
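
The leftover trailing space could be avoided by printing the separator before each kept word instead of after it. The following is only a sketch of that variation, not the answer's own code: it assumes the same comma-separated words file and a one-line text file, both created here in a temporary directory, and the names `stop`, `sep` and `result` are made up for the example.

```shell
#!/bin/sh
# Sketch under assumptions: sample files are created in a temp directory.
dir=$(mktemp -d)
printf '%s\n' 'I,a,an,as,at,the,by,in,for,of,on,that' > "$dir/words"
printf '%s\n' 'One of the primary goals in the design of the Unix system' > "$dir/text"

result=$(awk '
NR==FNR {                      # first file: the stop words
    n = split($0, a, /,/)
    for (i = 1; i <= n; i++)
        stop[a[i]]             # referencing the element creates the key
    next
}
{                              # second file: the text
    sep = ""
    for (i = 1; i <= NF; i++)
        if (!($i in stop)) {
            printf "%s%s", sep, $i
            sep = OFS          # separator goes before every word but the first
        }
    print ""
}' "$dir/words" "$dir/text")

printf '[%s]\n' "$result"
rm -r "$dir"
```

Punctuation attached to a word (for example "the," or "of.") is still not handled, as noted above.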

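Why use awk? The shell is a tool for managing files and launching programs. Everything else is better handled elsewhere.

I also like using awk for text processing. Assuming the input data is in the file mytext.txt and script is the file containing the code below, simply run awk -f script mytext.txt.

Keeping the list in the stopwords variable also makes it easier to change the stop words when needed. Keep in mind that both mytext.txt and stopwords may contain only space-separated words.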

BEGIN {
    stopwords = "I a an as at the by in for of on that"
    split(stopwords, wordarray)
    ORS = " "    # print the kept words separated by spaces
    RS = " "     # read the input one space-separated word at a time
}

{
    equals = 0
    for (w in wordarray)
        if ($0 == wordarray[w])
            equals = 1
    if (equals == 0) print $0
}

You can use this script:

while IFS= read -r p
do
  printf '%s\n' "$p" | sed -e 's/\<I\>//g' | sed -e 's/\<an\>//g' | sed -e 's/\<a\>//g' | sed -e 's/\<as\>//g' | sed -e 's/\<at\>//g' | sed -e 's/\<the\>//g' | sed -e 's/\<by\>//g' | sed -e 's/\<in\>//g' | sed -e 's/\<for\>//g' | sed -e 's/\<of\>//g' | sed -e 's/\<on\>//g'
done < my_text.txt > my_text.tmp && mv my_text.tmp my_text.txt  # do not overwrite the file while it is being read

cat my_text.txt

The output should then look like this:

One primary goals design Unix system was to create an environment promoted efficient program
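
As an alternative to chaining one sed call per word, the stop words could be joined into a single alternation and removed in one pass. This is a sketch of a different technique, not taken from the answers above; it assumes bash and GNU sed (for `-E` together with `\<`, `\>` and `\s`), and the function name `remove_stop_words` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: relies on GNU sed extensions and on bash arrays.
remove_stop_words() {
    local array=( I a an as at the by in for of on that )
    local pattern
    # Join the array with "|" to form one alternation: I|a|an|...
    pattern=$(IFS='|'; printf '%s' "${array[*]}")
    # Remove every whole-word match plus the whitespace that follows it.
    sed -E "s/\<($pattern)\>\s*//g"
}

printf '%s\n' 'One of the primary goals in the design of the Unix system' | remove_stop_words
```

The function reads standard input, so it can also be fed a file with `remove_stop_words < my_text.txt`. The list in `array` is the one from the question, including "that".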