How can I read and process part of a file and write the rest to another file?

Question

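I need to read this csv file with pandas, do some processing on it, and write the remaining 10% of the data to another sheet.

Based on an existing solution, I want to process the rest of store_data after taking out 10% of the rows. However, the elif condition prints the same rows as the original file. How can I fix my condition so that it skips the 10% of rows?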

store_data = pd.read_csv("heart_disease.csv")

with open("out1.csv","w") as outfile:
    outcsv = csv.writer(outfile)
    for i, row in store_data.iterrows():
        if not i % 10: #write 10% of the file to another file
            outcsv.writerow(row)
        elif i % 10:  #I need to do some process on the rest of the file
            store_data = store_data.applymap(str)

Answer 1

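It is easier and clearer to simply split your dataframe into two parts: save the 10% to a file (dataframe.to_csv(..)) and apply your computations to the 90% in the second df.

You do this by computing a new column that tells you whether a row belongs to the test part or not, and then splitting your dataframe in two along the values of this new column:

Data file creation: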

fn = "heart_disease.csv"
with open(fn,"w") as f:
    # doubled the data provided
    f.write("""Age,AL,SEX,DIAB,SMOK,CHOL,LAD,RCA,LM
65,0,M,n,y,220,80,75,20\n45,0.2,F,n,n,300,90,35,35\n66,-1,F,y,y,200,90,80,20
70,0.2,F,n,y,220,40,85,15\n80,1.1,M,y,y,200,90,90,25\n55,0,M,y,y,240,95,45,25
90,-1,M,n,y,350,35,75,20\n88,1,F,y,y,200,40,85,20\n50,1.1,M,n,n,220,55,30,30
95,-1,M,n,y,230,75,85,15\n30,1.1,F,n,y,235,75,20,30
65,0,M,n,y,220,80,75,20\n45,0.2,F,n,n,300,90,35,35\n66,-1,F,y,y,200,90,80,20
70,0.2,F,n,y,220,40,85,15\n80,1.1,M,y,y,200,90,90,25\n55,0,M,y,y,240,95,45,25
90,-1,M,n,y,350,35,75,20\n88,1,F,y,y,200,40,85,20\n50,1.1,M,n,n,220,55,30,30
95,-1,M,n,y,230,75,85,15\n30,1.1,F,n,y,235,75,20,30
""")

Program:

import numpy as np
import pandas as pd

fn = "heart_disease.csv"
store_data = pd.read_csv(fn)
print(store_data)

percentage = 0.1
# give every row a uniform random number in [0, 1)
store_data["test"] = np.random.rand(len(store_data))
# rows at or below the threshold form the (roughly 10%) test part
test_data = store_data[store_data.test <= percentage]
other_data = store_data[store_data.test > percentage]

print(test_data)
print(other_data)

Output:

# original data
    Age   AL SEX DIAB SMOK  CHOL  LAD  RCA  LM
0    65  0.0   M    n    y   220   80   75  20
1    45  0.2   F    n    n   300   90   35  35
2    66 -1.0   F    y    y   200   90   80  20
3    70  0.2   F    n    y   220   40   85  15
4    80  1.1   M    y    y   200   90   90  25
5    55  0.0   M    y    y   240   95   45  25
6    90 -1.0   M    n    y   350   35   75  20
7    88  1.0   F    y    y   200   40   85  20
8    50  1.1   M    n    n   220   55   30  30
9    95 -1.0   M    n    y   230   75   85  15
10   30  1.1   F    n    y   235   75   20  30
11   65  0.0   M    n    y   220   80   75  20
12   45  0.2   F    n    n   300   90   35  35
13   66 -1.0   F    y    y   200   90   80  20
14   70  0.2   F    n    y   220   40   85  15
15   80  1.1   M    y    y   200   90   90  25
16   55  0.0   M    y    y   240   95   45  25
17   90 -1.0   M    n    y   350   35   75  20
18   88  1.0   F    y    y   200   40   85  20
19   50  1.1   M    n    n   220   55   30  30
20   95 -1.0   M    n    y   230   75   85  15
21   30  1.1   F    n    y   235   75   20  30

# data with test <= 0.1
    Age   AL SEX DIAB SMOK  CHOL  LAD  RCA  LM      test
3    70  0.2   F    n    y   220   40   85  15  0.093135
10   30  1.1   F    n    y   235   75   20  30  0.021302

# data with test > 0.1
    Age   AL SEX DIAB SMOK  CHOL  LAD  RCA  LM      test
0    65  0.0   M    n    y   220   80   75  20  0.449546
1    45  0.2   F    n    n   300   90   35  35  0.953321
2    66 -1.0   F    y    y   200   90   80  20  0.928233
4    80  1.1   M    y    y   200   90   90  25  0.672880
5    55  0.0   M    y    y   240   95   45  25  0.136537
6    90 -1.0   M    n    y   350   35   75  20  0.439261
7    88  1.0   F    y    y   200   40   85  20  0.935340
8    50  1.1   M    n    n   220   55   30  30  0.737416
9    95 -1.0   M    n    y   230   75   85  15  0.461699
11   65  0.0   M    n    y   220   80   75  20  0.548624
12   45  0.2   F    n    n   300   90   35  35  0.679861
13   66 -1.0   F    y    y   200   90   80  20  0.195141
14   70  0.2   F    n    y   220   40   85  15  0.997854
15   80  1.1   M    y    y   200   90   90  25  0.871436
16   55  0.0   M    y    y   240   95   45  25  0.907141
17   90 -1.0   M    n    y   350   35   75  20  0.295690
18   88  1.0   F    y    y   200   40   85  20  0.970249
19   50  1.1   M    n    n   220   55   30  30  0.566218
20   95 -1.0   M    n    y   230   75   85  15  0.545188
21   30  1.1   F    n    y   235   75   20  30  0.217490

The split is random: you may get exactly 10% of your data, or you may get fewer or more rows than 10%. The larger your data, the closer you will get to 10%.

You can then store the "derived" dataframes, the test data and the other data, in separate files using df.to_csv.
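Writing the two parts out could look roughly like the sketch below. The small in-memory frame stands in for the data read from heart_disease.csv, and the output file names test_data.csv and other_data.csv are assumptions, not names from the question.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the frame read from heart_disease.csv
store_data = pd.DataFrame({
    "Age": [65, 45, 66, 70, 80, 55, 90, 88, 50, 95],
    "CHOL": [220, 300, 200, 220, 200, 240, 350, 200, 220, 230],
})

percentage = 0.1
store_data["test"] = np.random.rand(len(store_data))
test_data = store_data[store_data.test <= percentage]
other_data = store_data[store_data.test > percentage]

# drop the helper column and skip the index so both files keep the original layout
test_data.drop(columns="test").to_csv("test_data.csv", index=False)
other_data.drop(columns="test").to_csv("other_data.csv", index=False)
```

Because the split is random, test_data.csv may contain only the header row on some runs.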

For a pure pandas solution, "How do I create test and train samples from one dataframe with pandas?" would be a duplicate of your question, but you seem to handle the csv separately, so I am not sure whether it applies.
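A pure pandas split along those lines might be sketched with DataFrame.sample and DataFrame.drop. This is a different technique from the random column above: sample draws a fixed number of rows (frac times the length, rounded), so the part sizes do not fluctuate. The frame contents and random_state value are assumptions for illustration.

```python
import pandas as pd

# hypothetical stand-in for the frame read from heart_disease.csv
df = pd.DataFrame({"Age": range(30, 50), "CHOL": range(200, 220)})

# draw 10% of the rows at random; random_state makes the draw repeatable
test_data = df.sample(frac=0.1, random_state=42)
# every row that was not sampled belongs to the remaining 90%
other_data = df.drop(test_data.index)

print(len(test_data), len(other_data))
```

This relies on the index labels being unique, which holds for the default index that pd.read_csv produces.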

Answer 2

Here is a pure pandas solution:

import pandas as pd

df = pd.read_csv("heart_disease.csv")
# select the first 10% of the rows by position
n_slice = round(len(df) * 10 / 100)
df_slice = df.iloc[:n_slice]
# write the sliced df to csv without the index column
df_slice.to_csv("sliced.csv", index=False)
# to work with the rest of the data, drop the rows that went into df_slice
df = df.drop(df_slice.index)  # remaining 90% of the data
# now df holds the remaining 90% and you can do whatever you want with it
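If the goal is to keep the question's pattern of sending every tenth row to the file, a boolean mask avoids the row loop altogether. In the question's loop, store_data.applymap(str) converts the whole frame on each pass, which is why the 10% rows are not skipped. The sketch below is one possible way to do it, under these assumptions: the frame is a stand-in for heart_disease.csv, and astype(str) is swapped in for applymap(str) as the example "processing" step.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the frame read from heart_disease.csv
df = pd.DataFrame({"Age": range(30, 52), "CHOL": range(200, 222)})

# True at positions 0, 10, 20, ...: the rows that `not i % 10` selects in the loop
every_tenth = np.arange(len(df)) % 10 == 0

# write the selected rows to the file, without the index column
df[every_tenth].to_csv("out1.csv", index=False)

# process only the remaining rows; here the processing is a conversion to strings
rest = df[~every_tenth].astype(str)
print(len(df[every_tenth]), len(rest))
```

Note that taking every tenth row yields ten percent only approximately when the row count is not a multiple of ten.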

python

test-data

pandas