Speeding up Python file handling for a huge dataset

Question

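I have a large dataset stored as a 17 GB csv file (fileData), which contains a variable number of records (up to about 30,000) for each customer_id. I am trying to search for specific customers (listed in fileSelection, around 1,500 out of a total of 90,000) and copy the records for each of them into a separate csv file (fileOutput).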

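I am very new to Python, but I am using it because VBA and MATLAB (which I am more familiar with) cannot handle the file size. (I am writing the code in Aptana Studio, but running Python directly from the cmd line for speed. The machine runs 64-bit Windows 7.)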

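The code I have written extracts some of the customers, but it has two problems: 1) It fails to find most of the customers in the large dataset. (I believe they are all in the dataset, but I cannot be completely sure.) 2) It is VERY slow. Any way to speed up the code would be appreciated, including code that makes better use of a 16-core PC.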

Here is the code:

def main():

    # Initialisation : 

    #  - identify columns in selection file
    #
    fS = open (fileSelection,"r")
    if fS.mode == "r":
        header = fS.readline()
        selheaderlist = header.split(",")
        custkey =   selheaderlist.index('CUSTOMER_KEY')

    #
    # Identify columns in dataset file
    fileData = path2+file_data
    fD = open (fileData,"r")
    if fD.mode == "r":
        header = fD.readline()
        dataheaderlist = header.split(",")
        custID =   dataheaderlist.index('CUSTOMER_ID')
    fD.close()

    # For each customer in the selection file
    customercount=1
    for sr in fS:
        # Find customer key and locate it in customer ID field in dataset  
        selrecord = sr.split(",")
        requiredcustomer = selrecord[custkey]

        #Look for required customer in dataset
        found = 0
        fD = open (fileData,"r")
        if fD.mode == "r":
            while found == 0:
                dr = fD.readline()
                if not dr: break
                datrecord = dr.split(",")
                if datrecord[custID] == requiredcustomer:
                    found = 1

                    # Open outputfile
                    fileOutput= path3+file_out_root + str(requiredcustomer)+ ".csv"
                    fO=open(fileOutput,"w+")
                    fO.write(str(header))

                    #copy all records for required customer number
                    while datrecord[custID] == requiredcustomer:
                        fO.write(str(dr))
                        dr = fD.readline()
                        datrecord = dr.split(",")
                    #Close Output file          
                    fO.close()           

            if found == 1:
                print ("Customer Count "+str(customercount)+ "  Customer ID"+str(requiredcustomer)+" copied.  ")
                customercount = customercount+1
            else:
                print("Customer ID"+str(requiredcustomer)+" not found in dataset")
                fL.write (str(requiredcustomer)+","+"NOT FOUND")
            fD.close()

    fS.close()

It has taken several days to extract a few hundred customers, and it has failed to find many more.

Sample Output

Thanks @Paul Cornelius. This is much more efficient. I have adopted your approach, together with the csv handling suggested by @Bernardo:

# Import Modules

import csv


def main():

    # Initialisation : 


    fileSelection = path1+file_selection
    fileData = path2+file_data


    # Step through selection file and create dictionary with required ID's as keys, and empty objects
    with open(fileSelection,'rb') as csvfile:
        selected_IDs = csv.reader(csvfile)
        ID_dict = {}
        for row in selected_IDs:
            ID_dict.update({row[1]:[]})

    # Step through data file: for selected customer ID's, append records to dictionary objects
    with open(fileData,'rb') as csvfile:
        dataset = csv.reader(csvfile)
        for row in dataset:
            if row[0] in ID_dict:
                ID_dict[row[0]].extend([row[1]+','+row[4]])

    # Write all dictionary objects to csv files
    for row in ID_dict.keys():
        fileOutput = path3+file_out_root+row+'.csv'
        with open(fileOutput,'wb') as csvfile:
            output = csv.writer(csvfile, delimiter='\n')
            output.writerows([ID_dict[row]])

Answer 1

Use the csv reader instead. Python has a good library for handling CSV files, so you do not need to do the splitting yourself.

Take a look at the documentation: https://docs.python.org/2/library/csv.html

>>> import csv
>>> with open('eggs.csv', 'rb') as csvfile:
...     spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
...     for row in spamreader:
...         print ', '.join(row)
Spam, Spam, Spam, Spam, Spam, Baked Beans
Spam, Lovely Spam, Wonderful Spam

It should perform better.
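The example above comes from the Python 2 documentation. As a minimal sketch of the same idea in Python 3, applied to the question's data: the helper name iter_customer_rows and the demo file contents are made up for illustration, and only the CUSTOMER_ID column name comes from the question. One practical difference from line.split(","): csv.reader strips the line terminator, whereas split leaves a trailing "\n" on the last field of each line.

```python
import csv
import os
import tempfile


def iter_customer_rows(path, id_column="CUSTOMER_ID"):
    """Yield (customer_id, row) pairs from a csv file, one row at a time."""
    # newline="" is what the csv module documentation recommends in Python 3
    with open(path, newline="") as csvfile:
        reader = csv.reader(csvfile)
        header = next(reader)               # first line holds the column names
        id_index = header.index(id_column)  # locate the id column by name
        for row in reader:
            yield row[id_index], row


# Tiny made-up file, only to show the call
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", newline="") as f:
    f.write("CUSTOMER_ID,DATE,VALUE\r\n17,2015-01-01,3.5\r\n42,2015-01-02,1.0\r\n")

demo_rows = list(iter_customer_rows(demo_path))
print(demo_rows)
```

Because the function is a generator, it never holds more than one row in memory, which matters for a 17 GB file.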

Answer 2

If your machine can hold the whole csv in memory, try pandas.

If you are looking for out-of-core computation, take a look at dask (it offers a similar API).

In pandas, if you run into memory problems, you can read only specific columns from the csv file.

Either way, pandas and dask both use C bindings, which are significantly faster than pure Python.

In pandas, your code would look something like:

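import pandas as pd

# Reads the whole csv into memory
input_csv = pd.read_csv('path_to_csv')
# list_of_ids is the list of selected customer ids; CUSTOMER_ID is the id column named in the question
records_for_interesting_customers = input_csv[input_csv['CUSTOMER_ID'].isin(list_of_ids)]
records_for_interesting_customers.to_csv('output_path')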
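If the file does not fit in memory, one option that stays within pandas is to read it in chunks. The following is a minimal sketch under assumptions: the id column is called CUSTOMER_ID as in the question, and the function name filter_customers, the chunk size and the demo data are made up for illustration.

```python
import os
import tempfile

import pandas as pd


def filter_customers(data_path, wanted_ids, out_path, chunksize=100_000):
    """Copy rows whose CUSTOMER_ID is in wanted_ids, reading chunk by chunk."""
    wanted = set(wanted_ids)
    first = True
    # With chunksize set, read_csv returns an iterator of DataFrames
    chunks = pd.read_csv(data_path, dtype={"CUSTOMER_ID": str}, chunksize=chunksize)
    for chunk in chunks:
        matches = chunk[chunk["CUSTOMER_ID"].isin(wanted)]
        # Write the header with the first chunk only, then append
        matches.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
        first = False


# Tiny made-up dataset, only to show the call
demo_dir = tempfile.mkdtemp()
data_path = os.path.join(demo_dir, "data.csv")
out_path = os.path.join(demo_dir, "selected.csv")
with open(data_path, "w") as f:
    f.write("CUSTOMER_ID,VALUE\n1,a\n2,b\n1,c\n3,d\n")

filter_customers(data_path, ["1", "3"], out_path, chunksize=2)
result = pd.read_csv(out_path, dtype=str)
print(result)
```

Reading CUSTOMER_ID as a string keeps the comparison consistent with ids taken as text from the selection file. The usecols parameter of read_csv can be added to load only the columns that are needed.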

Answer 3

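The task is too involved for a simple answer. But your approach is very inefficient because you have too many nested loops. Try making ONE pass through the list of customers, and for each one build a "customer" object holding whatever information you will need later. Put these in a dictionary; the keys are the different requiredcustomer values and the values are the customer objects. If I were you, I would get this part working first, before touching the big file at all.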

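Now step through the massive file of customer data, and each time you encounter a record whose datrecord[custID] field is in the dictionary, append a line to the output file. You can use the relatively efficient in operator to test for membership in a dictionary.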

No nested loops are needed.
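The two steps described above could be sketched roughly as follows. This is an illustration under assumptions, not a drop-in replacement: the function name split_by_customer, the output naming scheme and the demo data are made up, a set is used in place of a dictionary of customer objects because only membership is needed here, and a single output file is kept open at a time so that the number of open file handles stays small.

```python
import csv
import os
import tempfile


def split_by_customer(data_path, wanted_ids, out_dir, id_column="CUSTOMER_ID"):
    """Stream data_path once, copying rows of wanted customers to <id>.csv."""
    wanted = set(wanted_ids)   # membership tests are O(1) on average
    started = set()            # customers whose output file already exists
    current_id, out_file, writer = None, None, None
    try:
        with open(data_path, newline="") as csvfile:
            reader = csv.reader(csvfile)
            header = next(reader)
            id_index = header.index(id_column)
            for row in reader:
                if len(row) <= id_index:
                    continue   # skip blank or short lines
                cust = row[id_index]
                if cust not in wanted:
                    continue
                if cust != current_id:   # switch files only when the id changes
                    if out_file is not None:
                        out_file.close()
                    is_new = cust not in started
                    path = os.path.join(out_dir, cust + ".csv")
                    # Start a fresh file the first time, append on later visits
                    out_file = open(path, "w" if is_new else "a", newline="")
                    writer = csv.writer(out_file)
                    if is_new:
                        writer.writerow(header)
                        started.add(cust)
                    current_id = cust
                writer.writerow(row)
    finally:
        if out_file is not None:
            out_file.close()
    return wanted - started    # ids that never appeared in the data


# Tiny made-up dataset, only to show the call
demo_dir = tempfile.mkdtemp()
data_path = os.path.join(demo_dir, "data.csv")
with open(data_path, "w", newline="") as f:
    f.write("CUSTOMER_ID,VALUE\n1,a\n1,b\n2,c\n3,d\n1,e\n")

missing = split_by_customer(data_path, ["1", "3", "9"], demo_dir)
with open(os.path.join(demo_dir, "1.csv"), newline="") as f:
    rows_for_1 = list(csv.reader(f))
print(missing, rows_for_1)
```

The big file is read exactly once regardless of how many customers are selected, and the returned set lists the ids that were not found, which replaces the "NOT FOUND" log of the original code.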

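The code as you present it cannot run, because you write to an object called fL without ever opening it. Also, as Tim Pietzcker pointed out, you are not closing your files, since you never actually call the close function.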
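To illustrate the point about calling close, here is a small sketch (the file name and variable names are made up for the demonstration). Writing f.close without parentheses only looks up the method and does nothing; a with block avoids the problem altogether.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")

f = open(demo_path, "w")
f.close                       # only references the method; nothing is called
still_open = not f.closed     # the file is still open at this point
f.close()                     # the parentheses make it an actual call
closed_after_call = f.closed

# A with block closes the file automatically, even if an exception is raised
with open(demo_path, "w") as g:
    g.write("CUSTOMER_ID,VALUE\n")
closed_after_with = g.closed

print(still_open, closed_after_call, closed_after_with)
```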

python

csv

performance

large-files

python-multithreading