How to replace "00" with "N/A" in Python, skipping the first row and first column
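I am working with GWAS data that has 2 million columns and 522 rows. I need to replace every "00" in the data with "N/A". Because the file is huge, I am using the open_reader method.
Sample data: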
ID,kgp11270025,kgp570033,rs707,kgp7500
1,CT,GT,CA,00
200,00,TG,00,GT
300,AA,00,CG,AA
400,GG,CC,AA,TA
Expected output:
ID,kgp11270025,kgp570033,rs707,kgp7500
1,CT,GT,CA,N/A
200,N/A,TG,N/A,GT
300,AA,N/A,CG,AA
400,GG,CC,AA,TA
The code I wrote:
import re
input_file = "test.csv"
output_file = "testresult.csv"
# print("Processing data from", input_file)
with open(input_file) as f:
    lineno = 0
    for line in f:
        lineno = lineno + 1
        if (lineno == 1):
            # need to skip first line
            # print("Skipping line 1 which is a header")
            print(line.rstrip())
        else:
            # print("Processing line {}".format(lineno))
            line = re.sub(r',00', ',N/A', line.rstrip())
            print(line)
# print("Processed {} lines".format(lineno))
What am I missing?
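You can do this easily with pandas:
import pandas as pd
# dtype=str keeps every cell as text, so "00" is not parsed as the number 0
df = pd.read_csv('test.csv', dtype=str)
df = df.replace('00', 'N/A')
df.to_csv('test-result.csv', index=False)
For very large CSV files, you can process the file in chunks:
import pandas as pd

chunk_size = 100  # rows per chunk; example value, tune to available memory
header = True
for chunk in pd.read_csv('test.csv', chunksize=chunk_size, dtype=str):
    chunk = chunk.replace('00', 'N/A')
    # mode='a' appends, so 'test-result.csv' should not already exist
    chunk.to_csv('test-result.csv', index=False, header=header, mode='a')
    header = False  # write the header only with the first chunk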
When I use print(line), the output looks fine.
Then just use the file keyword argument of print, as shown below:
import re
input_file = "test.csv"
output_file = "testresult.csv"
# print("Processing data from", input_file)
with open(input_file) as f, open(output_file, "w") as g:
    lineno = 0
    for line in f:
        lineno = lineno + 1
        if (lineno == 1):
            # need to skip first line
            # print("Skipping line 1 which is a header")
            print(line.rstrip(), file=g)
        else:
            # print("Processing line {}".format(lineno))
            line = re.sub(r',00', ',N/A', line.rstrip())
            print(line, file=g)
# print("Processed {} lines".format(lineno))
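Note that passing only the file name is enough when opening the input file, because the default mode is read-text, but the output file needs the write mode ("w") specified explicitly.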
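One thing to watch with the `,00` pattern is that it matches any field that merely starts with "00", so a value such as "001" would become "N/A1". The sample data contains only two-letter genotypes, so this may never happen in practice. As a hedged alternative, the sketch below compares whole fields instead of using a regular expression. The function name `replace_missing` and its parameters are hypothetical, and it assumes a plain comma-separated file with no quoted fields.

```python
import io

def replace_missing(src, dst, missing="00", replacement="N/A"):
    """Copy src to dst line by line, replacing whole fields equal to `missing`.

    The header row and the first (ID) column are copied unchanged.
    Assumption: simple comma-separated text without quoted fields.
    """
    header = next(src, None)
    if header is None:
        return  # empty input, nothing to write
    dst.write(header.rstrip("\r\n") + "\n")
    for line in src:
        fields = line.rstrip("\r\n").split(",")
        # Leave fields[0] (the ID) alone; compare the rest as whole fields.
        fields[1:] = [replacement if f == missing else f for f in fields[1:]]
        dst.write(",".join(fields) + "\n")

# Demonstration on the sample data from the question, using in-memory files.
sample = io.StringIO(
    "ID,kgp11270025,kgp570033,rs707,kgp7500\n"
    "1,CT,GT,CA,00\n"
    "200,00,TG,00,GT\n"
    "300,AA,00,CG,AA\n"
    "400,GG,CC,AA,TA\n"
)
result = io.StringIO()
replace_missing(sample, result)
print(result.getvalue(), end="")
```

For real files, the same function can be called as `with open("test.csv") as f, open("testresult.csv", "w") as g: replace_missing(f, g)`. Only one line is held in memory at a time, which fits a file with few rows and very many columns.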