How to replace "00" with "N/A" in Python while skipping the first row and first column

Question

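I am working with GWAS data that has 2 million columns and 522 rows. I need to replace every "00" in the data with "N/A". Because the file is huge, I am using the open_reader method.

Sample data: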

ID,kgp11270025,kgp570033,rs707,kgp7500
1,CT,GT,CA,00
200,00,TG,00,GT
300,AA,00,CG,AA
400,GG,CC,AA,TA

Expected output:

ID,kgp11270025,kgp570033,rs707,kgp7500
1,CT,GT,CA,N/A
200,N/A,TG,N/A,GT
300,AA,N/A,CG,AA
400,GG,CC,AA,TA

The code I wrote:

import re

input_file = "test.csv"
output_file = "testresult.csv"

# print("Processing data from", input_file)
with open(input_file) as f:
    lineno = 0
    for line in f:
        lineno = lineno + 1
        if (lineno == 1):
            #need to skip first line
            # print("Skipping line 1 which is a header")
            print(line.rstrip())
        else:
            # print("Processing line {}".format(lineno))
            line = re.sub(r',00', ',N/A', line.rstrip())
            print(line)
    # print("Processed {} lines".format(lineno))

What am I missing?

Answer 1

You can do this easily with pandas:

import pandas as pd
df = pd.read_csv('test.csv', dtype = str)  # read every column as text so "00" is not parsed as a number
df = df.replace('00', 'N/A')
df.to_csv('test-result.csv', index = False)

For very large CSV files, you can process the file in chunks:

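chunk_size = 100  # rows per chunk; an example value, tune it to the available memory
header = True     # write the header row only with the first chunk
# mode = 'a' appends, so 'test-result.csv' should not already exist
for chunk in pd.read_csv('test.csv', chunksize = chunk_size, dtype = str):
    chunk = chunk.replace('00', 'N/A')
    chunk.to_csv('test-result.csv', index = False, header = header, mode = 'a')
    header = False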
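
The question also asks to leave the first column alone. `DataFrame.replace('00', 'N/A')` only matches cells whose whole value is "00", so an ID such as 200 is already safe, but an ID that is exactly "00" would be rewritten. The following is a minimal sketch of one way to restrict the replacement to the genotype columns; the helper name `replace_missing` and its parameters are made up for illustration, and the sample is the data from the question.

```python
import io
import pandas as pd

SAMPLE = """ID,kgp11270025,kgp570033,rs707,kgp7500
1,CT,GT,CA,00
200,00,TG,00,GT
300,AA,00,CG,AA
400,GG,CC,AA,TA
"""

def replace_missing(df, missing="00", marker="N/A"):
    """Return a copy of df with `missing` replaced in every column except the first."""
    out = df.copy()
    genotype_cols = list(out.columns[1:])  # everything after the ID column
    out[genotype_cols] = out[genotype_cols].replace(missing, marker)
    return out

df = pd.read_csv(io.StringIO(SAMPLE), dtype=str)
result = replace_missing(df)
print(result.to_csv(index=False))
```

The same helper could be applied to each chunk inside the chunked loop before calling `to_csv`.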

Answer 2

when I use print(line), its showing fine

Then simply use the file keyword argument of print, like this:

import re

input_file = "test.csv"
output_file = "testresult.csv"

# print("Processing data from", input_file)
with open(input_file) as f, open(output_file, "w") as g:
    lineno = 0
    for line in f:
        lineno = lineno + 1
        if (lineno == 1):
            #need to skip first line
            # print("Skipping line 1 which is a header")
            print(line.rstrip(),file=g)
        else:
            # print("Processing line {}".format(lineno))
            line = re.sub(r',00', ',N/A', line.rstrip())
            print(line,file=g)
    # print("Processed {} lines".format(lineno))

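Note that passing only the file name is enough to open the input file, because the default mode is read-text, whereas the output file must be opened with write mode ("w") specified explicitly.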
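
The pattern `,00` is not anchored at the end of the field, so it would also rewrite the start of a longer value such as `,001`. This does not matter for two-character genotypes, but a variant that compares whole fields avoids the issue and makes the "skip the first row and first column" rule explicit. Below is a minimal sketch using the standard `csv` module; the function name `replace_missing` and its parameters are made up for illustration, and the in-memory sample stands in for the real file.

```python
import csv
import io

def replace_missing(reader, writer, missing="00", marker="N/A"):
    """Copy CSV rows, replacing `missing` in every field except the first.

    The header row is copied unchanged. Returns the number of data rows written.
    """
    header = next(reader, None)
    if header is None:
        return 0  # empty input, nothing to copy
    writer.writerow(header)
    count = 0
    for row in reader:
        # row[:1] is the ID column, kept as-is; only exact "00" fields are replaced
        fixed = [marker if field == missing else field for field in row[1:]]
        writer.writerow(row[:1] + fixed)
        count += 1
    return count

sample = "ID,kgp1,kgp2\n00,CT,00\n200,00,TG\n"
out = io.StringIO()
n = replace_missing(csv.reader(io.StringIO(sample)),
                    csv.writer(out, lineterminator="\n"))
print(out.getvalue())
```

With real files, the reader and writer would wrap `open(input_file, newline="")` and `open(output_file, "w", newline="")`, as the `csv` documentation recommends. Rows are processed one at a time, so only a single row is held in memory.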

