Compare two columns of one file with the same columns in another file and fetch the matches (large dataset, 14 GB)

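I have a file (file1) with 650,000 rows and two columns, "Chr" and "Pos". I want to compare this file with a dbSNP data dump (file2), matching on the Chr and Pos columns present in the dbSNP dump. For each match, I want to fetch the corresponding rsid. I tried using Python pandas, but my process got killed. It worked when I tried it with 50,000 rows.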

How can I get the rsids from dbSNP (file2) for the entire dataset (file1 = 650k rows)?

#Program to compare Chr and Pos of a sample with dBSNP and fetching RSIDs
import pandas as pd
df1 = pd.read_csv("v2_infi_chr_pos.csv",sep='\t',dtype='unicode')
df2 = pd.read_csv("dbsnp150_header.txt",sep='\t',dtype='unicode')
df3 = pd.merge(df1, df2, on='Chr''Pos', how='inner')
export_csv = df3.to_csv (r'rsids_infiniumv2_hg38.txt', index = None, header=True)

Based on reading through the pandas 0.24.2 merge documentation, here is how I would approach it:

# Program to compare Chr and Pos of a sample with dBSNP and fetching RSIDs

# import pandas
import pandas as pd

# read in data files
df1 = pd.read_csv("v2_infi_chr_pos.csv",sep='\t',dtype='unicode')
df2 = pd.read_csv("dbsnp150_header.txt",sep='\t',dtype='unicode')

# merge on matched columns 
df3 = df1.merge(df2, on=['Chr', 'Pos'], how='inner')

# export merged df to file
df3.to_csv(r'rsids_infiniumv2_hg38.txt', index=False, header=True)

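The on parameter of df.merge() accepts either a single label or a list of labels. Since you want to match on multiple columns, passing a list of column names does the job.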
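A small illustration on made-up data may help show why the list matters. The rows and rsid values below are invented for the example, and the `rsid` column name is an assumption about the dbSNP file's header. Python joins adjacent string literals, so `on='Chr''Pos'` is really `on='ChrPos'`, a single label that names neither column.

```python
import pandas as pd

# Toy stand-ins for the two files; all values are invented for illustration.
sample = pd.DataFrame({"Chr": ["1", "1", "2"], "Pos": ["100", "200", "300"]})
dbsnp = pd.DataFrame({
    "Chr": ["1", "2", "3"],
    "Pos": ["100", "300", "400"],
    "rsid": ["rs1", "rs2", "rs3"],  # hypothetical column name
})

# Adjacent string literals are concatenated into one string by Python,
# so this is the single label 'ChrPos', not two column names.
joined_label = 'Chr''Pos'

# A list of labels keeps only rows where both Chr and Pos agree.
matches = sample.merge(dbsnp, on=["Chr", "Pos"], how="inner")
print(matches)
```

Only the rows whose Chr and Pos both appear in the toy dbSNP frame survive the inner merge, each carrying its rsid.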

Also, how did your process get killed? Posting the error message would be more helpful.
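If the process was killed because the machine ran out of memory (an assumption, since no error message was posted), one option is to avoid loading the whole dbSNP dump at once. The sketch below reads file2 in pieces using the `chunksize` argument of `pd.read_csv`, merges each piece against the smaller file1, and appends the matches to the output file. The function name `fetch_rsids`, the default chunk size, and the tab-separated output are illustrative choices, not part of the original code.

```python
import pandas as pd

def fetch_rsids(sample_path, dbsnp_path, out_path, chunksize=1_000_000):
    """Merge a small sample file against a large dbSNP dump, chunk by chunk.

    Hypothetical helper: assumes both files are tab-separated and share
    'Chr' and 'Pos' header columns. Returns the number of matched rows.
    """
    # The sample file (650k rows in the question) is read fully into memory.
    sample = pd.read_csv(sample_path, sep="\t", dtype=str)

    total = 0
    first = True
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(dbsnp_path, sep="\t", dtype=str, chunksize=chunksize):
        matched = sample.merge(chunk, on=["Chr", "Pos"], how="inner")
        # Write the header with the first chunk only, then append.
        matched.to_csv(out_path, sep="\t", index=False,
                       mode="w" if first else "a", header=first)
        first = False
        total += len(matched)
    return total
```

With the file names from the question, a call might look like `fetch_rsids("v2_infi_chr_pos.csv", "dbsnp150_header.txt", "rsids_infiniumv2_hg38.txt")`. Only the sample frame and one chunk of the dump are held in memory at a time, so a smaller `chunksize` trades speed for a lower memory footprint.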