
Get unique lines based ONLY on 2 Columns

I have some large files (50k lines) in the following format:

chr1    35276   35481   NR_026820_exon_1_0_chr1_35277_r 0       -       0.526829        0.473171        54      37      60      54      0       0       205
chr1    35720   36081   NR_026818_exon_2_0_chr1_35721_r 0       -       0.398892        0.601108        73      116     101     71      0       0       361
chr1    35720   36081   NR_026820_exon_2_0_chr1_35721_r 0       -       0.398892        0.601108        73      116     101     71      0       0       361
chr1    69090   70008   NM_001005484_exon_0_0_chr1_69091_f      0       +       0.571895        0.428105        212     218     175     313     0       0       918
chr1    134772  139696  NR_039983_exon_0_0_chr1_134773_r        0       -       0.366775        0.633225        997     1194    1924    809     0       0       4924
chr1    139789  139847  NR_039983_exon_1_0_chr1_139790_r        0       -       0.551724        0.448276        13      12      14      19      0       0       58
chr1    140074  140566  NR_039983_exon_2_0_chr1_140075_r        0       -       0.475610        0.524390        126     144     114     108     0       0       492
chr1    323891  324060  NR_028322_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    323891  324060  NR_028325_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    323891  324060  NR_028327_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    324287  324345  NR_028322_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324287  324345  NR_028325_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324287  324345  NR_028327_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324438  326938  NR_028327_exon_2_0_chr1_324439_f        0       +       0.375200        0.624800        400     1013    549     538     0       0       2500
chr1    324438  328581  NR_028322_exon_2_0_chr1_324439_f        0       +       0.378228        0.621772        678     1580    996     889     0       0       4143
chr1    324438  328581  NR_028325_exon_2_0_chr1_324439_f        0       +       0.378228        0.621772        678     1580    996     889     0       0       4143

The second and third columns are the start and end positions. I want to remove every line whose start AND end positions both match those of an earlier line (the rest of the line doesn't matter), keeping only the first occurrence. For example, in the sample data I want to keep both line 14 and line 15: even though their start positions are the same, the end positions differ. Lines 15 and 16 have the same start and end, so I want to remove line 16. I'm trying to do this in Python, but I don't really know how to handle the uniqueness requirement across two columns.

Any ideas on the best/easiest way to do this?

You can use pandas to load your file and then drop rows based on the 2 columns as needed (a simple example):

import pandas as pd

raw_data = {'firstcolumn': ['chr1', 'chr1', 'chr1'],
            'start_position': [35276, 35720, 35720],
            'end_position': [35481, 36081, 36081],
            'A': [4, 4, 31],
            'B': [25, 25, 57]}
df = pd.DataFrame(raw_data, columns=['firstcolumn', 'start_position', 'end_position', 'A', 'B'])

# drop duplicate rows based on these 2 columns; drop_duplicates returns a
# new DataFrame, so assign the result if you want to keep it
df = df.drop_duplicates(['start_position', 'end_position'])
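The same idea applies end to end on the real file; a sketch, assuming the file is whitespace-delimited with no header row (a `StringIO` stand-in replaces the file path here so the example is self-contained):

```python
import pandas as pd
from io import StringIO

# A small stand-in for the real file; pass the file path to read_csv instead
data = """\
chr1 35276 35481 NR_026820_exon_1_0_chr1_35277_r 0 -
chr1 35720 36081 NR_026818_exon_2_0_chr1_35721_r 0 -
chr1 35720 36081 NR_026820_exon_2_0_chr1_35721_r 0 -
"""

df = pd.read_csv(StringIO(data), sep=r'\s+', header=None)

# with no header, columns are numbered 0, 1, 2, ... so start/end are 1 and 2
unique = df.drop_duplicates(subset=[1, 2])
print(len(unique))  # 2: the duplicated 35720-36081 line is gone

# write the result back out, tab-separated, without header or index
unique.to_csv('unique.tsv', sep='\t', header=False, index=False)
```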

You could use something like this:

with open('out_lines.dat', 'w') as out_file, open('in_lines.dat', 'r') as in_file:
    seen_start_end = set()  # a set makes the membership test O(1)
    for line in in_file:
        line_data = line.split()
        if len(line_data) == 15:  # skip blank or malformed lines
            start, end = line_data[1], line_data[2]
            if (start, end) not in seen_start_end:
                out_file.write(line)
                seen_start_end.add((start, end))

The output for your input data is:

chr1    35276   35481   NR_026820_exon_1_0_chr1_35277_r 0       -       0.526829        0.473171        54      37      60      54      0       0       205
chr1    35720   36081   NR_026818_exon_2_0_chr1_35721_r 0       -       0.398892        0.601108        73      116     101     71      0       0       361
chr1    69090   70008   NM_001005484_exon_0_0_chr1_69091_f      0       +       0.571895        0.428105        212     218     175     313     0       0       918
chr1    134772  139696  NR_039983_exon_0_0_chr1_134773_r        0       -       0.366775        0.633225        997     1194    1924    809     0       0       4924
chr1    139789  139847  NR_039983_exon_1_0_chr1_139790_r        0       -       0.551724        0.448276        13      12      14      19      0       0       58
chr1    140074  140566  NR_039983_exon_2_0_chr1_140075_r        0       -       0.475610        0.524390        126     144     114     108     0       0       492
chr1    323891  324060  NR_028322_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    324287  324345  NR_028322_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324438  326938  NR_028327_exon_2_0_chr1_324439_f        0       +       0.375200        0.624800        400     1013    549     538     0       0       2500
chr1    324438  328581  NR_028322_exon_2_0_chr1_324439_f        0       +       0.378228        0.621772        678     1580    996     889     0       0       4143

As someone suggested, store your "keys" (the start and end fields) in a set, and skip printing a line if its key has already been seen:

with open('datafile.tsv', 'r') as f:
    s = set()  # keys seen so far
    for line in f:
        fields = line.split('\t')
        key = tuple(fields[1:3])
        if key in s:
            continue
        s.add(key)
        print(line, end='')  # line already ends with '\n'

You can redirect Python's output to your new file.

Consider using the excellent pandas library to load and process this kind of data:

data_string = """
chr1    35276   35481   NR_026820_exon_1_0_chr1_35277_r 0       -       0.526829        0.473171        54      37      60      54      0       0       205
chr1    35720   36081   NR_026818_exon_2_0_chr1_35721_r 0       -       0.398892        0.601108        73      116     101     71      0       0       361
chr1    35720   36081   NR_026820_exon_2_0_chr1_35721_r 0       -       0.398892        0.601108        73      116     101     71      0       0       361
chr1    69090   70008   NM_001005484_exon_0_0_chr1_69091_f      0       +       0.571895        0.428105        212     218     175     313     0       0       918
chr1    134772  139696  NR_039983_exon_0_0_chr1_134773_r        0       -       0.366775        0.633225        997     1194    1924    809     0       0       4924
chr1    139789  139847  NR_039983_exon_1_0_chr1_139790_r        0       -       0.551724        0.448276        13      12      14      19      0       0       58
chr1    140074  140566  NR_039983_exon_2_0_chr1_140075_r        0       -       0.475610        0.524390        126     144     114     108     0       0       492
chr1    323891  324060  NR_028322_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    323891  324060  NR_028325_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    323891  324060  NR_028327_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    324287  324345  NR_028322_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324287  324345  NR_028325_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324287  324345  NR_028327_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324438  326938  NR_028327_exon_2_0_chr1_324439_f        0       +       0.375200        0.624800        400     1013    549     538     0       0       2500
chr1    324438  328581  NR_028322_exon_2_0_chr1_324439_f        0       +       0.378228        0.621772        678     1580    996     889     0       0       4143
chr1    324438  328581  NR_028325_exon_2_0_chr1_324439_f        0       +       0.378228        0.621772        678     1580    996     889     0       0       4143
"""

# this looks suspiciously csv-like

import pandas
from io import StringIO

buf = StringIO(data_string)

# this will create a DataFrame object with header: 0, 1, 2, ...
# if you have the file path, you can use that instead of the StringIO buffer
df = pandas.read_csv(buf, sep=r'\s+', header=None)

>>> print(df)

      0       1       2                                   3   4  5         6   \
0   chr1   35276   35481     NR_026820_exon_1_0_chr1_35277_r   0  -  0.526829   
1   chr1   35720   36081     NR_026818_exon_2_0_chr1_35721_r   0  -  0.398892   
2   chr1   35720   36081     NR_026820_exon_2_0_chr1_35721_r   0  -  0.398892   
3   chr1   69090   70008  NM_001005484_exon_0_0_chr1_69091_f   0  +  0.571895   
4   chr1  134772  139696    NR_039983_exon_0_0_chr1_134773_r   0  -  0.366775   
5   chr1  139789  139847    NR_039983_exon_1_0_chr1_139790_r   0  -  0.551724   
6   chr1  140074  140566    NR_039983_exon_2_0_chr1_140075_r   0  -  0.475610   
7   chr1  323891  324060    NR_028322_exon_0_0_chr1_323892_f   0  +  0.426035   
8   chr1  323891  324060    NR_028325_exon_0_0_chr1_323892_f   0  +  0.426035   
9   chr1  323891  324060    NR_028327_exon_0_0_chr1_323892_f   0  +  0.426035   
10  chr1  324287  324345    NR_028322_exon_1_0_chr1_324288_f   0  +  0.551724   
11  chr1  324287  324345    NR_028325_exon_1_0_chr1_324288_f   0  +  0.551724   
12  chr1  324287  324345    NR_028327_exon_1_0_chr1_324288_f   0  +  0.551724   
13  chr1  324438  326938    NR_028327_exon_2_0_chr1_324439_f   0  +  0.375200   
14  chr1  324438  328581    NR_028322_exon_2_0_chr1_324439_f   0  +  0.378228   
15  chr1  324438  328581    NR_028325_exon_2_0_chr1_324439_f   0  +  0.378228  

# ... more data skipped...

Now it's super simple:

# drop duplicates for non-unique sets of values in columns 1, 2 (start, end)
no_dups = df.drop_duplicates([1, 2])

>>> print(no_dups)
          0       1       2                                   3   4  5         6   \
0   chr1   35276   35481     NR_026820_exon_1_0_chr1_35277_r   0  -  0.526829   
1   chr1   35720   36081     NR_026818_exon_2_0_chr1_35721_r   0  -  0.398892   
3   chr1   69090   70008  NM_001005484_exon_0_0_chr1_69091_f   0  +  0.571895   
4   chr1  134772  139696    NR_039983_exon_0_0_chr1_134773_r   0  -  0.366775   
5   chr1  139789  139847    NR_039983_exon_1_0_chr1_139790_r   0  -  0.551724   
6   chr1  140074  140566    NR_039983_exon_2_0_chr1_140075_r   0  -  0.475610   
7   chr1  323891  324060    NR_028322_exon_0_0_chr1_323892_f   0  +  0.426035   
10  chr1  324287  324345    NR_028322_exon_1_0_chr1_324288_f   0  +  0.551724   
13  chr1  324438  326938    NR_028327_exon_2_0_chr1_324439_f   0  +  0.375200   
14  chr1  324438  328581    NR_028322_exon_2_0_chr1_324439_f   0  +  0.378228 
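Note that `drop_duplicates` keeps the first occurrence of each key by default; its `keep` parameter can keep the last occurrence instead, or drop every duplicated row entirely. A quick sketch with made-up values:

```python
import pandas as pd

# made-up rows: the first two share the same (start, end) pair
df = pd.DataFrame({'start': [1, 1, 2], 'end': [5, 5, 9], 'name': ['a', 'b', 'c']})

first = df.drop_duplicates(['start', 'end'])              # keeps 'a' and 'c'
last = df.drop_duplicates(['start', 'end'], keep='last')  # keeps 'b' and 'c'
none = df.drop_duplicates(['start', 'end'], keep=False)   # keeps only 'c'
print(list(first['name']), list(last['name']), list(none['name']))
```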

def unique_positions(filename):
    with open(filename) as lines:
        seen_positions = set()
        for line in lines:
            position = tuple(line.split()[1:3])
            if position not in seen_positions:
                seen_positions.add(position)
                yield line

for line in unique_positions('data.csv'):
    print(line, end='')  # line already ends with '\n'

If you just want to remove the duplicates and write to a file, you can use groupby, grouping on the two columns and calling next() to take just the first line of each group, whether the group holds multiple matches or a unique line; either way it stores very little in memory:

from itertools import groupby

with open("in.csv") as f, open("out.csv", "w") as out:
    for _, v in groupby(f, key=lambda x: x.split()[1:3]):
        out.write(next(v))

Output:

chr1    35276   35481   NR_026820_exon_1_0_chr1_35277_r 0       -       0.526829        0.473171        54      37      60      54      0       0       205
chr1    35720   36081   NR_026818_exon_2_0_chr1_35721_r 0       -       0.398892        0.601108        73      116     101     71      0       0       361
chr1    69090   70008   NM_001005484_exon_0_0_chr1_69091_f      0       +       0.571895        0.428105        212     218     175     313     0       0       918
chr1    134772  139696  NR_039983_exon_0_0_chr1_134773_r        0       -       0.366775        0.633225        997     1194    1924    809     0       0       4924
chr1    139789  139847  NR_039983_exon_1_0_chr1_139790_r        0       -       0.551724        0.448276        13      12      14      19      0       0       58
chr1    140074  140566  NR_039983_exon_2_0_chr1_140075_r        0       -       0.475610        0.524390        126     144     114     108     0       0       492
chr1    323891  324060  NR_028322_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    324287  324345  NR_028322_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324438  326938  NR_028327_exon_2_0_chr1_324439_f        0       +       0.375200        0.624800        400     1013    549     538     0       0       2500
chr1    324438  328581  NR_028322_exon_2_0_chr1_324439_f        0       +       0.378228        0.621772        678     1580    996     889     0       0       4143
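One caveat: `itertools.groupby` only merges *consecutive* equal keys, so this works because duplicate positions sit on adjacent lines in the file. If they might not, sort the lines by key first (which changes the output order) or fall back to the seen-set approach above. A small illustration with made-up lines:

```python
from itertools import groupby

key = lambda s: s.split()[1:3]

# made-up lines: the two '1 2' keys are NOT adjacent
lines = ['a 1 2 x', 'b 3 4 y', 'a 1 2 z']

kept = [next(g) for _, g in groupby(lines, key=key)]
print(len(kept))  # 3: the non-adjacent duplicate survives

kept_sorted = [next(g) for _, g in groupby(sorted(lines, key=key), key=key)]
print(len(kept_sorted))  # 2: sorting brings the duplicates together first
```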

If you want to modify the original file in place, write to a temporary file and then use shutil.move:
from itertools import groupby
from tempfile import NamedTemporaryFile
from shutil import move
with open("in.csv") as f, NamedTemporaryFile("w", dir=".", delete=False) as out:  # "w" for text mode
    for _, v in groupby(f, key=lambda x: x.split()[1:3]):
        out.write(next(v))
move(out.name, "foo.csv")