How to completely erase duplicated lines using Linux tools?
This question is not the same as How to print only the unique lines in BASH?: that question is about removing only the duplicate copies, i.e., changing 1, 2, 3, 3 into 1, 2, 3, whereas this one is about erasing the duplicated lines completely, i.e., changing 1, 2, 3, 3 into just 1, 2.
This question was hard to put into words, but the example makes it straightforward. If I have a file like this:
1
2
2
3
4
then after parsing the file and removing the duplicated lines, it should become:
1
3
4
I know some Python, so here is a Python script I wrote to do it. Create a file named clean_duplicates.py and run it:
import sys

#
# Usage:
#   python clean_duplicates.py < input.txt > clean.txt
#
def main():
    lines = sys.stdin.readlines()
    clean_duplicates( lines )

#
# This only removes adjacent duplicated lines, so you need to sort the
# input (case-sensitively) before running it.
#
def clean_duplicates( lines ):
    linesCount = len( lines )

    # Nothing to do for an empty input
    if linesCount == 0:
        return

    # If it is a one-line file, print it and stop
    if linesCount == 1:
        sys.stdout.write( lines[ 0 ] )
        return

    # Print the first line only when it is not duplicated by the second
    if lines[ 0 ] != lines[ 1 ]:
        sys.stdout.write( lines[ 0 ] )

    # Print the middle lines; range( 1, n - 1 ) stops at index n - 2
    lastLine = lines[ 0 ]
    for index in range( 1, linesCount - 1 ):
        currentLine = lines[ index ]
        nextLine = lines[ index + 1 ]
        if currentLine == lastLine:
            continue
        lastLine = currentLine
        if currentLine == nextLine:
            continue
        sys.stdout.write( currentLine )

    # Print the last line only when it differs from the one before it
    if lines[ linesCount - 2 ] != lines[ linesCount - 1 ]:
        sys.stdout.write( lines[ linesCount - 1 ] )

if __name__ == "__main__":
    main()
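For the sample file above, a run might look like this (a minimal sketch; the printf call just recreates the example input):
$ printf '1\n2\n2\n3\n4\n' > input.txt
$ python clean_duplicates.py < input.txt
1
3
4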
Still, searching for how to remove duplicated lines turns up tools like grep, sort, sed, and uniq, which seem easier to use:
- How to remove duplicate lines inside a text file?
- removing line from list using sort, grep LINUX
- Find duplicate lines in a file and count how many time each line was duplicated?
- Remove duplicate entries in a Bash script
- How to delete duplicate lines in a file without sorting it in Unix?
- How to delete duplicate lines in a file...AWK, SED, UNIQ not working on my file
You can use uniq with the -u / --unique option. From the uniq man page:

-u, --unique
      Don't output lines that are repeated in the input.
      Print only lines that are unique in the INPUT.
For example:
cat /tmp/uniques.txt | uniq -u
Or, as explained in UUOC: Useless use of cat, the better way is:
uniq -u /tmp/uniques.txt
Both commands return:
1
3
4
where /tmp/uniques.txt holds the numbers mentioned in the question, i.e.
1
2
2
3
4
Note: uniq requires the file contents to be sorted. As the documentation states:

By default, uniq prints the unique lines in a sorted file; it discards all but one of identical successive input lines, so that the output contains unique lines.

If the file is not sorted, you need to sort the contents first and then use uniq on the sorted contents:
sort /tmp/uniques.txt | uniq -u
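To see why the sorting matters, here is a quick sketch with a hypothetical unsorted file: uniq -u only compares adjacent lines, so without sorting the two 3s are never seen as duplicates:
$ printf '3\n1\n3\n' | uniq -u
3
1
3
$ printf '3\n1\n3\n' | sort | uniq -u
1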
Without sorting, and with the output kept in the same order as the input: the file is read twice; the first pass (NR==FNR) counts how many times each line occurs, and the second pass prints only the lines whose count is 1:
$ awk 'NR==FNR{c[$0]++;next} c[$0]==1' file file
1
3
4
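As a side note, if the data comes from a command rather than a file, the same two-pass trick can be applied with bash process substitution; some_command below is a placeholder and runs twice, so its output must be deterministic:
$ awk 'NR==FNR{c[$0]++;next} c[$0]==1' <(some_command) <(some_command)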
Europe Finland Office Supplies Online H 5/21/2015 193508565 7/3/2015 2339 651.21 524.96 1523180.19 1227881.44 295298.75
Europe Greece Household Online L 9/11/2015 895509612 9/26/2015 49 668.27 502.54 32745.23 24624.46 8120.77
Europe Hungary Beverages Online C 8/21/2012 722931563 8/25/2012 370 47.45 31.79 17556.50 11762.30 5794.20
Europe Hungary Beverages Online C 8/21/2012 722931563 8/25/2012 370 47.45 31.79 17556.50 11762.30 5794.20
If you have lines like these, you can use the following command:
[isuru@192 ~]$ sort duplines.txt | sed 's/\ /\-/g' | uniq | sed 's/\-/\ /g'
But keep special characters in mind: if your lines contain dashes, make sure to pick a different substitute symbol. Here, I have kept a space between the backslash and the forward slash (s/\ /\-/g). Note also that plain uniq (without -u) keeps one copy of each duplicated line rather than erasing all copies.
(The original answer included before/after screenshots of the file.)
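In place of those screenshots, here is a rough text rendering of the run, assuming the four sample rows above are saved in duplines.txt:
$ sort duplines.txt | sed 's/\ /\-/g' | uniq | sed 's/\-/\ /g'
Europe Finland Office Supplies Online H 5/21/2015 193508565 7/3/2015 2339 651.21 524.96 1523180.19 1227881.44 295298.75
Europe Greece Household Online L 9/11/2015 895509612 9/26/2015 49 668.27 502.54 32745.23 24624.46 8120.77
Europe Hungary Beverages Online C 8/21/2012 722931563 8/25/2012 370 47.45 31.79 17556.50 11762.30 5794.20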
You can use the sort command with the -u flag to list the unique values of any command's output:

cat file_name | sort -u

Note, however, that this keeps one copy of each duplicated line rather than erasing all copies, so for the question's file it returns:
1
2
3
4
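For comparison, a minimal sketch of both behaviours on the question's sample file, assuming it is named file_name:
$ sort file_name | uniq -u    # erases every copy of a duplicated line
1
3
4
$ sort -u file_name           # keeps one copy of each line
1
2
3
4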