Compare two columns of one file with the same columns of another file and fetch the matches (large dataset, 14 GB)

Question

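I have a file (file 1) with 650,000 rows and two columns, "Chr" and "Pos". I want to compare this file against a dbSNP data dump (file 2), matching on the Chr and Pos columns present in the dump, and fetch the corresponding rsid for every match. I tried using Python pandas, but my process got killed. It worked when I tried it with 50,000 rows.

How can I fetch the rsids from dbSNP (file 2) for the entire dataset (file 1 = 650k rows)?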

#Program to compare Chr and Pos of a sample with dBSNP and fetching RSIDs
import pandas as pd
df1 = pd.read_csv("v2_infi_chr_pos.csv",sep='\t',dtype='unicode')
df2 = pd.read_csv("dbsnp150_header.txt",sep='\t',dtype='unicode')
df3 = pd.merge(df1, df2, on='Chr''Pos', how='inner')
export_csv = df3.to_csv (r'rsids_infiniumv2_hg38.txt', index = None, header=True)

Answer 1

After reading through the pandas 0.24.2 merge documentation, this is how I would approach it:

# Program to compare Chr and Pos of a sample with dBSNP and fetching RSIDs

# import pandas
import pandas as pd

# read in data files
df1 = pd.read_csv("v2_infi_chr_pos.csv",sep='\t',dtype='unicode')
df2 = pd.read_csv("dbsnp150_header.txt",sep='\t',dtype='unicode')

# merge on matched columns 
df3 = df1.merge(df2, on=['Chr', 'Pos'], how='inner')

# export merged df to file
df3.to_csv('rsids_infiniumv2_hg38.txt', index=False, header=True)

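The `on` parameter of `df.merge()` accepts either a single label or a list of labels. Since you want to match on multiple columns, passing a list of column names does the job.

Also, how exactly was your process killed? Posting the error message would help.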
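As a small illustration of why the original call fails: Python joins adjacent string literals, so `on='Chr''Pos'` is the same as `on='ChrPos'`, a column that does not exist in either file. The toy frames below are invented for demonstration; only the column names `Chr` and `Pos` come from the question, and `rsid` is an assumed name for the dbSNP ID column.

```python
import pandas as pd

# Adjacent string literals are concatenated, so this is one label, not two.
assert 'Chr''Pos' == 'ChrPos'

# Toy data (invented for illustration); "rsid" is an assumed column name.
sample = pd.DataFrame({"Chr": ["1", "1", "2"], "Pos": ["100", "200", "300"]})
dbsnp = pd.DataFrame({
    "Chr": ["1", "2", "2"],
    "Pos": ["100", "300", "999"],
    "rsid": ["rs1", "rs2", "rs3"],
})

# Passing a list of labels matches on both columns at once.
matched = sample.merge(dbsnp, on=["Chr", "Pos"], how="inner")
print(matched)
```

Only rows whose `Chr` and `Pos` both agree survive the inner merge.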
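If the process was killed because the 14 GB dbSNP dump does not fit in memory (a guess on my part, since no error message was posted), one option is to stream the dump in chunks and merge each chunk against the small sample file. The sketch below assumes both files are tab-separated with `Chr` and `Pos` headers, as in the question; the function name `fetch_rsids` and the default `chunksize` are my own choices, not anything pandas prescribes.

```python
import pandas as pd

def fetch_rsids(sample_path, dbsnp_path, out_path, chunksize=1_000_000):
    """Merge a small sample file against a large dbSNP dump, chunk by chunk."""
    # The sample file (650k rows in the question) is small enough to keep in memory.
    sample = pd.read_csv(sample_path, sep="\t", dtype=str)

    first = True
    # With chunksize set, read_csv returns an iterator of DataFrames,
    # so only one chunk of the dump is held in memory at a time.
    for chunk in pd.read_csv(dbsnp_path, sep="\t", dtype=str, chunksize=chunksize):
        matched = sample.merge(chunk, on=["Chr", "Pos"], how="inner")
        # Write the header once, then append the remaining chunks.
        matched.to_csv(
            out_path,
            sep="\t",
            index=False,
            header=first,
            mode="w" if first else "a",
        )
        first = False
```

With the file names from the question this would be called as `fetch_rsids("v2_infi_chr_pos.csv", "dbsnp150_header.txt", "rsids_infiniumv2_hg38.txt")`. Note that the output is tab-separated here and its rows follow the order of the dump rather than the order of the sample file. Lowering `chunksize`, or passing `usecols` to `read_csv` to load only the columns you need, should reduce memory use further.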

bioinformatics

genome

pandas