How can I remove stop words from a sentence using a shell script?
I want to remove stop words from the sentences in a file.
By stop words I mean:
[I, a, an, as, at, the, by, in, for, of, on, that]
I have these sentences in the file my_text.txt:
One of the primary goals in the design of the Unix system was to
create an environment that promoted efficient program
I want to remove the stop words from the sentences above.
I used this script:
array=( I a an as at the by in for of on that )
for i in "${array[@]}"
do
cat $p | sed -e 's/\<$i\>//g'
done < my_text.txt
But the output is:
One of the primary goals in the design of the Unix system was to
create an environment that promoted efficient program
The expected output should be:
One primary goals design Unix system was to
create an environment promoted efficient program
Note: I want to remove stop words, not duplicate words.
Like this, assuming $p is an existing file:
sed -i -e "s/\<$i\>//g" "$p"
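You have to use double quotes, not single quotes, for the shell to expand the variable.
The -i switch makes sed edit the file in place.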
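To see the quoting difference in isolation, here is a small sketch. The variable names and the sample string are made up for illustration, and it assumes GNU sed, since \< and \> are GNU extensions.

```shell
#!/bin/sh
# Hypothetical sample data, not taken from the question's file.
word="the"
line="One of the primary goals"

# Single quotes: the shell passes the text $word to sed unchanged, so sed
# searches for a literal "$word", finds nothing, and leaves the line as is.
single=$(printf '%s\n' "$line" | sed -e 's/\<$word\>//g')

# Double quotes: the shell expands $word to "the" before sed runs,
# so sed receives s/\<the\>//g and removes the word.
double=$(printf '%s\n' "$line" | sed -e "s/\<$word\>//g")

printf 'single: [%s]\n' "$single"
printf 'double: [%s]\n' "$double"
```

The double-quoted version leaves two adjacent spaces where the word used to be, because only the word itself is deleted.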
Learn how to quote properly in shell; it is very important:
"Double quote" every literal that contains spaces/metacharacters and every expansion: "$var", "$(command "$var")", "${array[@]}", "a & b". Use 'single quotes' for code or literal $'s: 'Costs $5 US', ssh host 'echo "$HOSTNAME"'. See
http://mywiki.wooledge.org/Quotes
http://mywiki.wooledge.org/Arguments
http://wiki.bash-hackers.org/syntax/words
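As a quick illustration of why expansions should be quoted, the sketch below counts how many arguments the shell produces from one variable. The file name is hypothetical and was chosen only because it contains a space.

```shell
#!/bin/sh
name="my text.txt"   # hypothetical file name containing a space

# Unquoted: the shell splits the value on whitespace into two arguments.
set -- $name
unquoted_count=$#

# Quoted: the value is passed on as a single argument.
set -- "$name"
quoted_count=$#

printf 'unquoted: %s argument(s)\n' "$unquoted_count"
printf 'quoted:   %s argument(s)\n' "$quoted_count"
```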
Finally:
array=( I a an as at the by in for of on that )
for i in "${array[@]}"
do
sed -i -e "s/\<$i\>\s*//g" Input_File
done
Bonus
Try it without \s* to see why I added it to the regex.
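The effect of \s* is easiest to see side by side. This is a minimal sketch, assuming GNU sed (\s is a GNU extension) and using a made-up sample line:

```shell
#!/bin/sh
line="goals in the design"   # hypothetical sample input

# Without \s*: each removed word leaves its trailing space behind,
# so spaces pile up where the stop words used to be.
without=$(printf '%s\n' "$line" | sed -e 's/\<in\>//g' -e 's/\<the\>//g')

# With \s*: the whitespace after each stop word is removed along with it.
with=$(printf '%s\n' "$line" | sed -e 's/\<in\>\s*//g' -e 's/\<the\>\s*//g')

printf 'without: [%s]\n' "$without"
printf 'with:    [%s]\n' "$with"
```

The first result keeps three spaces between the two remaining words, while the second keeps exactly one.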
One in awk. It works, but it needs proper punctuation handling and then some (luckily your data has none):
$ awk '
NR==FNR {                         # first file: process the stop words
    split($0,a,/,/)               # comma-separated, without spaces
    for(i in a)                   # store them as keys of hash b
        b[a[i]]
    next
}
{                                 # second file: the text
    for(i=1;i<=NF;i++)            # iterate over the words
        if(!($i in b))            # if the current word is not a stop word
            printf "%s%s",$i,OFS  # output it (leftover space at the end, sorry)
    print ""                      # newline at the end
}' words text
Output:
One primary goals design Unix system was to create environment promoted efficient program
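To reproduce that result end to end, the two input files can be created first. The sketch below assumes a comma-separated stop-word file named words and the question's sentence on one line in a file named text, matching the file names in the awk command; the temporary directory is only there so that no existing files are touched.

```shell
#!/bin/sh
# Work in a throwaway directory (an assumption for this sketch).
dir=$(mktemp -d)

# Stop words: one line, comma-separated, no spaces, as the awk code expects.
printf '%s\n' 'I,a,an,as,at,the,by,in,for,of,on,that' > "$dir/words"

# The sentence from the question, on a single line.
printf '%s\n' 'One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program' > "$dir/text"

result=$(awk '
NR==FNR { split($0, a, /,/); for (i in a) b[a[i]]; next }  # load stop words
{
    for (i = 1; i <= NF; i++)
        if (!($i in b)) printf "%s%s", $i, OFS             # keep other words
    print ""
}' "$dir/words" "$dir/text")

printf '[%s]\n' "$result"
rm -rf "$dir"
```

The brackets in the final printf make the leftover trailing space visible.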
Why use awk? The shell is a tool for managing files and launching programs. Anything beyond that is better handled elsewhere.
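I also like using awk for text processing. Assuming the input data is in the file mytext.txt and script is a file containing the code below, simply run awk -f script mytext.txt.
In addition, keeping the stop words in the stopwords variable should make them easier to change when needed. Keep in mind that both mytext.txt and stopwords may contain only space-separated words.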
BEGIN {
    stopwords = "I a an as at the by in for of on that"
    split(stopwords, wordarray)
    ORS = " "    # print a space after every word that is kept
    RS = " "     # read the input one space-separated word at a time
}
{
    equals = 0
    for (w in wordarray)
        if ($0 == wordarray[w])
            equals = 1
    if (equals == 0) print $0
}
You can use this script:
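while read -r p
do
    # One sed call with several -e expressions. The result goes to a temporary
    # file, because redirecting into my_text.txt inside the loop would
    # truncate the file while it is still being read.
    echo "$p" | sed -e 's/\<I\>//g' -e 's/\<an\>//g' -e 's/\<a\>//g' -e 's/\<as\>//g' -e 's/\<at\>//g' -e 's/\<the\>//g' -e 's/\<by\>//g' -e 's/\<in\>//g' -e 's/\<for\>//g' -e 's/\<of\>//g' -e 's/\<on\>//g' -e 's/\<that\>//g'
done < my_text.txt > my_text.tmp && mv my_text.tmp my_text.txt
cat my_text.txt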
The output should then look like this:
One primary goals design Unix system was to create an environment promoted efficient program