Unix script or parser to delete stop words in a file
I am looking for a parser or script to delete stop words from a file.
Here is a sample of the file:
entities_0_confidence|entities_0_name|entities_0_entity|entities_1_confidence|relation_relation|
-1.1956528741743269|ellen brown|Ellen_Brown|-3.9166730593775214|WOULD ATTORNEY FROM|||||||||||||||||||||
-2.3889038197374015|rick santorum|Rick_Santorum||CRITICIZED|||||||||||||||||||||
-1.5485422793287602|thomas jefferson|Thomas_Jefferson|-1.7299349891097682||IS LETTER TO|||||||||||||||||||||
-1.229126527004769|lewis powell|Lewis_Powell_%28conspirator%29|-3.024385187632112|IS JUSTICE OF|||||||||||||||||||||
-2.2268355006701155|michael bloomberg|Michael_Bloomberg|-2.1242762129476493|WON MAYOR OF À|||||||||||||||||||||
Here is the stop word list:
IS, OF ,WITH ,WON,WOULD,X,©,® FOR BEST ACTRESS PRESENTING,À,È,ÉS,ŞI,АND,И
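I only want to delete the stop words from each line, not the whole line. My current script also removes these words when they occur inside other words.
For example:
- A line in my file - "TOLD to stop using this line"
- Stop word - "To"
- Output - "LD sp using this line"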
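The output above is what plain substring replacement produces: "to" is also cut out of "TOLD" and "stop". A minimal sketch of the alternative, comparing whole space-separated tokens instead, is shown here. The function name `remove_word` is hypothetical, and the case-insensitive comparison is an assumption based on the "To" / "to" example.

```shell
#!/bin/sh
# Hypothetical helper: drop a stop word only when it is a complete token.
# $1 = text, $2 = stop word (compared case-insensitively; an assumption)
remove_word() {
    printf '%s\n' "$1" | awk -v w="$2" '{
        out = ""
        n = split($0, tok, " ")            # split on runs of whitespace
        for (i = 1; i <= n; i++) {
            if (toupper(tok[i]) == toupper(w))
                continue                   # whole token equals the stop word
            if (out == "")
                out = tok[i]
            else
                out = out " " tok[i]
        }
        print out
    }'
}

remove_word "TOLD to stop using this line" "To"
```

Because tokens are compared as whole strings rather than matched as regular expressions, "TOLD" and "stop" are left alone, and stop words containing characters such as `.` or `*` need no escaping.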
My file/dataset contains 70k entries.
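The code below removes stop words that appear at the beginning, at the end, or in between other words of the column whose number is passed in the col_num variable. It expects stop_words.txt to hold one stop word per line.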
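col_num=1   # pass the column you want to remove stop words from (relation_relation is column 5 in the sample header)
while IFS= read -r word
do
    [ -n "$word" ] || continue   # skip blank lines in the stop word list
    # -F/OFS: the sample file is pipe-delimited; -v passes the shell values into awk
    awk -F'|' -v OFS='|' -v word="$word" -v col_num="$col_num" '{
        gsub("^" word "[ ]|[ ]" word "$|^" word "$", "", $col_num)   # at the start, at the end, or the whole field
        gsub("[ ]" word "[ ]", " ", $col_num)                        # in between two other words
        gsub(/^ /, "", $col_num); gsub(/ $/, "", $col_num)           # trim leftover edge spaces
        print
    }' file > file.temp
    mv file.temp file
done < stop_words.txt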
Hope this helps!
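With 70k entries, rewriting the whole file once per stop word gets slow. A possible single-pass variant is sketched below. It is not the method from the answer above but a swapped-in technique: load every stop word into an awk array, then filter the tokens of one column. The function name `remove_stop_words` is hypothetical, and the sketch assumes pipe-delimited input, a non-empty list file with one single-word stop word per line, and case-insensitive matching.

```shell
#!/bin/sh
# Hypothetical helper: remove stop words from one column in a single awk pass.
# $1 = stop word list (one word per line), $2 = data file, $3 = column number
remove_stop_words() {
    awk -F'|' -v OFS='|' -v col="$3" '
        NR == FNR {                        # first file: load the stop words
            if ($0 != "")
                stop[toupper($0)] = 1
            next
        }
        {                                  # second file: filter one column
            n = split($col, tok, " ")      # whole tokens only, so TOLD survives TO
            out = ""
            for (i = 1; i <= n; i++) {
                if (toupper(tok[i]) in stop)
                    continue
                if (out == "")
                    out = tok[i]
                else
                    out = out " " tok[i]
            }
            $col = out                     # rebuilds the record using OFS
            print
        }' "$1" "$2"
}
```

It could be called as `remove_stop_words stop_words.txt file 5 > file.temp && mv file.temp file`. Multi-word entries in the list, such as "® FOR BEST ACTRESS PRESENTING", would not match under this token-by-token comparison and would need separate handling.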