Compare consecutive columns of a file and return the number of non-matching elements

Question

I have a text file that looks like this:

# sampleID  HGDP00511  HGDP00511   HGDP00512   HGDP00512   HGDP00513  HGDP00513   

M rs4124251       0       0            A            G          0          A

M rs6650104       0       A            C            T          0          0

M rs12184279      0       0            G            A          T          0

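I want to compare consecutive columns and return the number of matching elements. I want to do this in Python. Earlier, I did it with Bash and AWK (shell scripting), but it was very slow because I have a huge amount of data to process. I believe Python would be a faster solution. However, I am very new to Python, and so far I have something like this: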

for line in open("phased.txt"):
    columns = line.split("\t")

    for i in range(len(columns)-1):
        a = columns[i+3]
        b = columns[i+4]
        for j in range(len(a)):
            if a[j] != b[j]:
                print j

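This obviously doesn't work. Since I'm new to Python, I don't really know what changes to make to get it working. (This is totally wrong code, and I guess I could use difflib, etc. But I have never coded proficiently in Python before, so take it with a grain of salt.)

I want to compare each column in the file (starting from the third) with every other column in the file and return the number of non-matching elements. I have 828 columns in total, so I need 828*828 outputs. (You can think of an n*n matrix where the (i,j)th element is the number of non-matching elements between columns i and j.) For the snippet above, my desired output would be: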

3 4: 1

3 5: 3

3 6: 3

......

4 6: 3
..etc

Any help would be appreciated. Thanks.

Answer 1

I highly recommend using pandas for this rather than writing your own code:

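import numpy as np
import pandas as pd

# The sample file is whitespace/tab-delimited, so pass a separator explicitly;
# dtype=str keeps an all-"0" column comparable with the letter columns.
df = pd.read_csv("phased.txt", sep=r"\s+", dtype=str)

# Despite the name, each value is the number of rows where columns i and j differ.
# i and j are 0-based positions in df.columns.
match_counts = {(i, j):
                   np.sum(df[df.columns[i]] != df[df.columns[j]])
                           for i in range(3, len(df.columns))
                           for j in range(3, len(df.columns))}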

match_counts
{(6, 4): 3,
 (4, 7): 2,
 (4, 4): 0,
 (4, 3): 3,
 (6, 6): 0,
 (4, 5): 3,
 (5, 4): 3,
 (3, 5): 3,
 (7, 7): 0,
 (7, 5): 3,
 (3, 7): 2,
 (6, 5): 3,
 (5, 5): 0,
 (7, 4): 2,
 (5, 3): 3,
 (6, 7): 2,
 (4, 6): 3,
 (7, 6): 2,
 (5, 7): 3,
 (6, 3): 2,
 (5, 6): 3,
 (3, 6): 2,
 (3, 3): 0,
 (7, 3): 2,
 (3, 4): 3}
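
If the full n*n matrix from the question is wanted rather than a dictionary, the same comparison can be sketched with NumPy alone. This is a minimal Python 3 sketch, not part of the original answer: the three sample rows are inlined instead of reading phased.txt, the fields are assumed to be whitespace-separated, and the names `rows`, `data`, and `mismatches` are made up for illustration.

```python
import numpy as np

# The three sample rows from the question, used here in place of phased.txt.
rows = [
    "M rs4124251       0       0            A            G          0          A",
    "M rs6650104       0       A            C            T          0          0",
    "M rs12184279      0       0            G            A          T          0",
]

# Drop the two leading label fields; shape is (n_rows, n_data_columns).
data = np.array([r.split()[2:] for r in rows])
n_cols = data.shape[1]

# mismatches[i, j] = number of rows where data column i differs from column j.
mismatches = np.zeros((n_cols, n_cols), dtype=int)
for i in range(n_cols):
    # Compare column i against every column at once via broadcasting.
    mismatches[i] = (data != data[:, [i]]).sum(axis=0)

print(mismatches)
```

Here row 0 of `mismatches` corresponds to the question's column 3, so `mismatches[0, 1]` is the "3 4" entry. Looping over one column at a time keeps each intermediate array at the size of the input, which should matter with 828 columns and many rows.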

Answer 2

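A pure native-Python (standard library only) approach to this problem. Let us know how it compares with Bash; 828 x 828 should be a walk in the park.

Counting elements by column:

For simplicity and illustration, I have deliberately written this in steps, flipping (transposing) the data first. You can improve it by changing the logic or by using class objects, function decorators, etc.

Code (Python 2.7):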

shiftcol = 2  # shift columns as first two are to be ignored
with open('phased.txt') as f:
    data = [x.strip().split('\t')[shiftcol:] for x in f.readlines()][1:]

# Step 1: Flipping the data first
flip = []
for idx, rows in enumerate(data):
    for i in range(len(rows)):
        if len(flip) <= i:
            flip.append([])
        flip[i].append(rows[i])

# Step 2: counts store in temp dictionary
for idx, v in enumerate(flip):
    for e in v:
        tmp = {}
        for i, z in enumerate(flip):
            if i != idx and e != '0':
                # Dictionary to store results
                if i+1 not in tmp:  # note has_key will be deprecated
                    tmp[i+1] = {'match': 0, 'notma': 0}
                tmp[i+1]['match'] += z.count(e)
                tmp[i+1]['notma'] += len([x for x in z if x != e])

        # results compensate for column shift..
        for key, count in tmp.iteritems():
            print idx+shiftcol+1, key+shiftcol, ': ', count

Sample output:

>>> 3 4 :  {'match': 0, 'notma': 3}
>>> 3 5 :  {'match': 0, 'notma': 3}
>>> 3 6 :  {'match': 2, 'notma': 1}
>>> 3 7 :  {'match': 2, 'notma': 1}
>>> 3 3 :  {'match': 1, 'notma': 2}
>>> 3 4 :  {'match': 1, 'notma': 2}
>>> 3 5 :  {'match': 1, 'notma': 2}
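
For comparison, the pairwise mismatch count the question asks for ("3 4: 1", "3 5: 3", and so on) can also be sketched in Python 3 with the standard library only. This is an illustrative sketch under a few assumptions rather than a port of the code above: fields are taken to be whitespace-separated, lines starting with "#" are treated as the header, and the function name `mismatch_counts` and its `skip` parameter are hypothetical.

```python
from itertools import combinations


def mismatch_counts(lines, skip=2):
    """Return {(i, j): mismatches} for each pair of data columns (1-based)."""
    # Keep only data rows and drop the first `skip` label fields.
    rows = [ln.split()[skip:] for ln in lines
            if ln.strip() and not ln.startswith("#")]
    # Transpose rows into columns (the "flip" step).
    columns = list(zip(*rows))
    counts = {}
    # Number columns the way the question does: first data column is skip + 1.
    numbered = enumerate(columns, start=skip + 1)
    for (i, a), (j, b) in combinations(numbered, 2):
        counts[(i, j)] = sum(x != y for x, y in zip(a, b))
    return counts


sample = """\
# sampleID  HGDP00511  HGDP00511   HGDP00512   HGDP00512   HGDP00513  HGDP00513
M rs4124251       0       0            A            G          0          A
M rs6650104       0       A            C            T          0          0
M rs12184279      0       0            G            A          T          0
""".splitlines()

counts = mismatch_counts(sample)
for (i, j), n in sorted(counts.items()):
    print(f"{i} {j}: {n}")
```

Each unordered pair is computed once, since the count for (i, j) equals the count for (j, i). To run it on the real file, the lines of phased.txt can be passed in place of `sample`.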
