如何将基因组 SNP 转换为二进制格式?

How to convert genomic SNPs to a binary format?

我正在努力弄清楚如何转换如下所示的 (SNP) 文件:

pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,C,C,C,C,C,C,C,C,C,C,C,C,G,G
79115,C,C,C,C,C,C,C,A,C,C,T,C,C,C
79116,A,A,A,A,A,A,A,A,A,A,A,A,A,A
79124,C,C,T,T,C,C,C,C,C,C,C,C,C,C
79128,G,G,G,G,A,G,G,G,G,C,G,G,G,G

像这样的二进制格式:

pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,0,0,0,0,0,0,0,0,0,0,0,0,1,1
79115,0,0,0,0,0,0,0,1,0,0,1,0,0,0
79116,0,0,0,0,0,0,0,0,0,0,0,0,0,0
79124,0,0,1,1,0,0,0,0,0,0,0,0,0,0
79128,0,0,0,0,1,0,0,0,0,1,0,0,0,0

有人知道 R 中的包或命令行工具、awk 代码可用于此吗?

以下解决方案的一个问题是都这样做:

1097023,A,A,G,G,G,G,G,G,G,G,G,G,A,A
1097027,C,C,C,C,C,C,C,C,C,C,C,C,C,C
4363243,C,A,A,A,A,A,A,A,A,A,A,A,A,A 
4363270,T,T,T,T,T,T,T,T,T,T,T,T,T,T 
4363275,A,G,G,G,G,G,G,G,G,G,G,G,G,G 
1097023,0,0,1,1,1,1,1,1,1,1,1,1,0,0
1097027,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4363243,0,1,1,1,1,1,1,1,1,1,1,1,1,1 
4363270,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
4363275,0,1,1,1,1,1,1,1,1,1,1,1,1,1 

虽然应该是:

1097023,1,1,0,0,0,0,0,0,0,0,0,0,1,1
1097027,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4363243,1,0,0,0,0,0,0,0,0,0,0,0,0,0 
4363270,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
4363275,1,0,0,0,0,0,0,0,0,0,0,0,0,0

所以我认为解决方案应该包括如果大多数样本 'i.e. > 7 samples' 相同,则变为 0,不符合多数的变为 1。

这是一个简单的awk(标准Linuxawk/gawk)脚本:

script.awk

BEGIN {FS = ","} # set field seperator to ","
NR>1{     # every line except the first line
  zeroIndc=; # identify the zero indicator and save the variable
  for (i = 2; i <= NF; i++) { # for each DNA letter
    if ($i == zeroIndc) { # if DNA letter match zero indicator
      $i = 0; # set DNA letter to 0
    } else { # if DNA letter not match zero indicator
      $i = 1; # set DNA letter to 1
    }
  }
  print; # print new line at end 
}

运行 脚本 script.awk

awk -f scirpt.awk input.txt

运行 单行 awk 脚本

awk 'BEGIN{FS=","}NR>1{z=;for(i=2;i<=NF;i++)$i=($i==z)?0:1;print;}' input.txt

使用 base R,您可以将每个 sample 列与第一个样本列进行比较,并在值不匹配的地方将 return 转为 1。

cols <- grep('sample', names(df))
df[cols] <- +(df$sample1 != df[cols])
df

#    pos sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9
#1 79107       0       0       0       0       0       0       0       0       0
#2 79115       0       0       0       0       0       0       0       1       0
#3 79116       0       0       0       0       0       0       0       0       0
#4 79124       0       0       1       1       0       0       0       0       0
#5 79128       0       0       0       0       1       0       0       0       0

#  sample10 sample11 sample12 sample13 sample14
#1        0        0        0        1        1
#2        0        1        0        0        0
#3        0        0        0        0        0
#4        0        0        0        0        0
#5        1        0        0        0        0

虽然上面的方法在大型数据集上会更有效,但这里有一个 dplyr 库的替代方法。

library(dplyr)
df <- df %>% mutate(across(contains('sample'), ~+(sample1 != .)))

数据

df <- structure(list(pos = c(79107L, 79115L, 79116L, 79124L, 79128L
), sample1 = c("C", "C", "A", "C", "G"), sample2 = c("C", "C", 
"A", "C", "G"), sample3 = c("C", "C", "A", "T", "G"), sample4 = c("C", 
"C", "A", "T", "G"), sample5 = c("C", "C", "A", "C", "A"), sample6 = c("C", 
"C", "A", "C", "G"), sample7 = c("C", "C", "A", "C", "G"), sample8 = c("C", 
"A", "A", "C", "G"), sample9 = c("C", "C", "A", "C", "G"), sample10 = c("C", 
"C", "A", "C", "C"), sample11 = c("C", "T", "A", "C", "G"), sample12 = c("C", 
"C", "A", "C", "G"), sample13 = c("G", "C", "A", "C", "G"), sample14 = c("G", 
"C", "A", "C", "G")), class = "data.frame", row.names = c(NA, -5L))
$ awk -F, 'NR>1{gsub(,0); gsub(/[ACTG]/,1)} 1' file
pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,0,0,0,0,0,0,0,0,0,0,0,0,1,1
79115,0,0,0,0,0,0,0,1,0,0,1,0,0,0
79116,0,0,0,0,0,0,0,0,0,0,0,0,0,0
79124,0,0,1,1,0,0,0,0,0,0,0,0,0,0
79128,0,0,0,0,1,0,0,0,0,1,0,0,0,0

假设您想将“参考”等位基因编码为 0,其中“参考”是最常见的,这里是使用 Python 3 的解决方案,其 pandas数据处理库(使用pip3 install pandas安装)和Counter specialized dictionary.

从包含示例数据的文件 snps_letters.csv 开始:

$ cat snps_letters.csv
pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,C,C,C,C,C,C,C,C,C,C,C,C,G,G
79115,C,C,C,C,C,C,C,A,C,C,T,C,C,C
79116,A,A,A,A,A,A,A,A,A,A,A,A,A,A
79124,C,C,T,T,C,C,C,C,C,C,C,C,C,C
79128,G,G,G,G,A,G,G,G,G,C,G,G,G,G

我们将打开它,转换它并保存在脚本中binarize.py:

#!/usr/bin/env python3

from collections import Counter
import pandas as pd

# Load the comma-separated data in a pandas DataFrame object:
snps = pd.read_csv("snps_letters.csv", index_col="pos")

def binarize(column):
    # Extracting the most common element as ref
    # (using list and tuple unpacking,
    # because the most_common method returns
    # a list of (element, count) pairs):
    [(ref, _)] = Counter(column).most_common(1)
    # We can now define an auxiliary function
    # to recode individual elements in the column:
    def to_bin(elem):
        # int(True) -> 1, int(False) -> 0
        return 1 - int(elem == ref)
    # Transform the column by applying the recoding function to its elements
    # (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html#pandas.Series.apply):
    return column.apply(to_bin)

# Now apply our column-transforming function to the columns (axis=0)
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html):
binarized = snps.apply(binarize, axis=0)
# Write the result to a file:
binarized.to_csv("snps_binary.csv")

正在执行 shell 中的脚本:

$ ./binarize.py 

检查保存的结果:

$ cat snps_binary.csv 
pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,0,0,0,0,0,0,0,0,0,0,0,0,0,0
79115,0,0,0,0,0,0,0,1,0,0,1,0,1,1
79116,1,1,1,1,1,1,1,1,1,1,1,1,1,1
79124,0,0,1,1,0,0,0,0,0,0,0,0,1,1
79128,1,1,1,1,1,1,1,1,1,0,1,1,0,0

要获得更多 re-usable 脚本,您可以将输入和输出文件路径作为参数传递给命令行,它们可以在 python 代码中作为 sys.argv[1]sys.argv[2](前提是你先 import sys)。


在 command-line 上获取输入和输出文件并将参考核苷酸编码为 1 而不是 0 的版本:

#!/usr/bin/env python3

import sys
from collections import Counter
import pandas as pd

# Load the comma-separated data in a pandas DataFrame object:
snps = pd.read_csv(sys.argv[1], index_col="pos")

def binarize(column):
    # Extracting the most common element as ref
    # (using list and tuple unpacking,
    # because the most_common method returns
    # a list of (element, count) pairs):
    [(ref, _)] = Counter(column).most_common(1)
    # We can now define an auxiliary function
    # to recode individual elements in the column:
    def to_bin(elem):
        # int(True) -> 1, int(False) -> 0
        return int(elem == ref)
    # Transform the column by applying the recoding function to its elements
    # (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html#pandas.Series.apply):
    return column.apply(to_bin)

# Now apply our column-transforming function to the columns (axis=0)
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html):
binarized = snps.apply(binarize, axis=0)
# Write the result to a file:
binarized.to_csv(sys.argv[2])

它将按如下方式使用:

./binarize.py snps_letters.csv snps_binary.csv

识别参考等位基因,在这种情况下,参考基于最常见的等位基因,使用 ,我们有 5 个 SNP REF 等位基因:

REF <- apply(df[, -1], 1, function(i) names(which.max(table(i)))) 
# [1] "C" "C" "A" "C" "G"

现在,逐列比较每个样本与 REF,然后通过将逻辑 TRUE/FALSE 乘以 1 转换为整数。并将其分配回原始 数据框:

df[, -1] <- (df[, -1] != REF) * 1
df
#     pos sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample11 sample12 sample13 sample14
# 1 79107       0       0       0       0       0       0       0       0       0        0        0        0        1        1
# 2 79115       0       0       0       0       0       0       0       1       0        0        1        0        0        0
# 3 79116       0       0       0       0       0       0       0       0       0        0        0        0        0        0
# 4 79124       0       0       1       1       0       0       0       0       0        0        0        0        0        0
# 5 79128       0       0       0       0       1       0       0       0       0        1        0        0        0        0