How to remove repetitive entries in files
I have a file (input.txt) that contains the following lines:
1_306500682 2_315577060 3_315161284 22_315577259 22_315576763
2_315578866 2_315579020 3_315163106 1_306500983
2_315579517 3_315162181 1_306502338 2_315578919
1_306500655 2_315579567 3_315161256 3_315161708
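From this, whenever several entries in a line share the same value before the _, I want to keep only the first of them. For the example above, output.txt should contain: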
1_306500682 2_315577060 3_315161284 22_315577259
2_315578866 3_315163106 1_306500983
2_315579517 3_315162181 1_306502338
1_306500655 2_315579567 3_315161256
Please help.
Perl from the command line:
perl -lane 'my %s; print join " ", grep /^(\d+)_/ && !$s{$1}++, @F' file
Output:
1_306500682 2_315577060 3_315161284 22_315577259
2_315578866 3_315163106 1_306500983
2_315579517 3_315162181 1_306502338
1_306500655 2_315579567 3_315161256
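For readers less familiar with Perl, the logic of the one-liner can be sketched in Python roughly as follows. This is an illustrative translation rather than part of the original answer, and the names `PREFIX_RE` and `dedupe_by_numeric_prefix` are hypothetical. Like a `grep` that tests `/^(\d+)_/`, it drops any entry that does not start with digits followed by `_`, and it keeps an entry only the first time its captured prefix is seen.

```python
import re

# Hypothetical helper names; the pattern mirrors /^(\d+)_/ from the one-liner.
PREFIX_RE = re.compile(r"^(\d+)_")


def dedupe_by_numeric_prefix(line):
    """Keep the first whitespace-separated entry for each numeric prefix."""
    seen = set()   # prefixes already kept for this line
    kept = []      # entries to keep, in their original order
    for field in line.split():
        match = PREFIX_RE.match(field)
        if match is None:
            continue  # entries without a numeric prefix are dropped
        prefix = match.group(1)
        if prefix not in seen:
            seen.add(prefix)
            kept.append(field)
    return " ".join(kept)


print(dedupe_by_numeric_prefix("2_315578866 2_315579020 3_315163106 1_306500983"))
# prints: 2_315578866 3_315163106 1_306500983
```

Because the prefix is compared as a string, `2` and `22` count as different prefixes, which matches the expected output in the question.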
with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        seen = set()  # prefixes already written for this line
        nums = line.split()
        for num in nums:
            header = num.split("_")[0]  # the part before the first "_"
            if header not in seen:
                outfile.write(num)
                outfile.write(" ")
                seen.add(header)
        outfile.write('\n')
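You could use a separate set to keep track of the word prefixes encountered so far, and collect the entries in each line whose prefix has not been seen yet into a list. After processing a line this way, it is easy to build a replacement line of text containing only the non-repeated entries that were found. Note: this is just a slightly more efficient version of inspectorG4dget's current answer.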
with open('input.txt', 'rt') as infile, \
        open('non_repetitive_input.txt', 'wt') as outfile:
    for line in infile:
        values, prefixes = [], set()
        for word, prefix in ((entry, entry.partition('_')[0])
                             for entry in line.split()):
            if prefix not in prefixes:
                values.append(word)
                prefixes.add(prefix)
        outfile.write(' '.join(values) + '\n')
Contents of the output file:
1_306500682 2_315577060 3_315161284 22_315577259
2_315578866 3_315163106 1_306500983
2_315579517 3_315162181 1_306502338
1_306500655 2_315579567 3_315161256
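As a possible variation, the set and the list can be folded into a single dict, since dicts preserve insertion order in Python 3.7 and later. The sketch below is not from any of the answers; the function name `dedupe_line` is hypothetical, and it assumes entries are separated by whitespace as in the sample data.

```python
def dedupe_line(line):
    """Return the line with only the first entry kept for each prefix."""
    first_by_prefix = {}
    for entry in line.split():
        prefix = entry.partition("_")[0]  # text before the first "_"
        # setdefault stores the entry only if the prefix is new, so the
        # first occurrence wins and insertion order is left unchanged.
        first_by_prefix.setdefault(prefix, entry)
    return " ".join(first_by_prefix.values())


# The sample lines from the question.
sample = [
    "1_306500682 2_315577060 3_315161284 22_315577259 22_315576763",
    "2_315578866 2_315579020 3_315163106 1_306500983",
    "2_315579517 3_315162181 1_306502338 2_315578919",
    "1_306500655 2_315579567 3_315161256 3_315161708",
]
for line in sample:
    print(dedupe_line(line))
```

Keeping the logic in a function makes it easy to check against the sample lines before pointing it at a real file.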