Efficient way to process a CSV file into a numpy array

Question

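The CSV file may be dirty (rows with an inconsistent number of elements), and the dirty rows need to be ignored. String manipulation is required during processing.

Example input: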

20150701 20:00:15.173,0.5019,0.91665

Expected output: float32 (pseudo-date, seconds in the day, f3, f4)

0.150701 72015.173 0.5019 0.91665 (plus the trailing garbage digits that floats usually get)

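The CSV file is also large: the numpy array is expected to need 5-10 GB in memory, and the CSV file is over 30 GB.

I am looking for an efficient way to process the CSV file and end up with a numpy array.

Current solution: use the csv module, process the file line by line, and use a list() as a buffer that is later converted to a numpy array with asarray(). The problem is that memory consumption doubles during the conversion, and the copy adds execution overhead.

Numpy's genfromtxt and loadtxt do not appear to be able to process the data as needed.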

Answer 1

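If you know in advance how many rows the data contains, you can skip the intermediate list and write directly into the array.

import numpy as np

no_rows = 5      # number of rows, known in advance
no_columns = 4

# float32, as requested in the question
a = np.zeros((no_rows, no_columns), dtype=np.float32)

with open('myfile') as f:
    for i, line in enumerate(f):
        # placeholder: must return no_columns numeric values for the line
        a[i, :] = cool_function_that_returns_formatted_data(line)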
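As a rough sketch of what the placeholder function could look like for the sample line in the question: parse_line and load_csv below are hypothetical names, and the field layout (timestamp, f3, f4) is assumed from the single example row. Since the row count is usually not known, this version counts lines in a first pass to get an upper bound, then skips dirty lines in the second pass.

```python
import numpy as np

EXPECTED_FIELDS = 3  # timestamp, f3, f4 (assumed from the sample line)


def parse_line(line):
    """Return (pseudo_date, seconds_of_day, f3, f4), or None for a dirty line."""
    fields = line.strip().split(',')
    if len(fields) != EXPECTED_FIELDS:
        return None
    try:
        date_part, time_part = fields[0].split(' ')
        hours, minutes, seconds = time_part.split(':')
        pseudo_date = int(date_part[2:]) / 1e6  # '20150701' -> 0.150701
        seconds_of_day = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
        return pseudo_date, seconds_of_day, float(fields[1]), float(fields[2])
    except ValueError:
        # wrong number of sub-fields or a non-numeric value
        return None


def load_csv(path):
    # First pass: the line count is an upper bound on the number of good rows.
    with open(path) as f:
        n_lines = sum(1 for _ in f)

    out = np.empty((n_lines, 4), dtype=np.float32)
    n_good = 0
    # Second pass: write parsed rows straight into the preallocated array.
    with open(path) as f:
        for line in f:
            row = parse_line(line)
            if row is not None:
                out[n_good] = row
                n_good += 1
    return out[:n_good]  # a view; the unused tail is not copied
```

The trade-off is reading the file twice in exchange for never holding a list and an array at the same time. Note that float32 keeps only about 7 significant digits, so 72015.173 will be stored with less precision than the text shows.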

Answer 2

I think the I/O capability of pandas is the best way to get data into a numpy array. Specifically, the read_csv method reads the file into a pandas DataFrame. You can then get the underlying numpy array with the DataFrame's to_numpy method (as_matrix in older pandas versions; it was removed in pandas 1.0).
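A minimal sketch of that route for the sample format in the question, using an in-memory two-line sample instead of a real file. The column names stamp, f3 and f4 are made up for illustration, and the second sample row is invented. Dirty rows are not handled here; recent pandas versions have an on_bad_lines option for read_csv that may help with rows that have too many fields.

```python
import io

import numpy as np
import pandas as pd

# Two clean lines in the question's format (the second one is made up).
sample = io.StringIO(
    "20150701 20:00:15.173,0.5019,0.91665\n"
    "20150701 20:00:16.250,0.5021,0.91670\n"
)

df = pd.read_csv(sample, header=None, names=["stamp", "f3", "f4"])

# Split "YYYYMMDD HH:MM:SS.mmm" into its date and time parts.
parts = df["stamp"].str.split(" ", expand=True)
date_str, time_str = parts[0], parts[1]

# '20150701' -> 0.150701
pseudo_date = date_str.str[2:].astype(np.float64) / 1e6

# 'HH:MM:SS.mmm' -> seconds since midnight
hms = time_str.str.split(":", expand=True).astype(np.float64)
seconds = hms[0] * 3600 + hms[1] * 60 + hms[2]

arr = np.column_stack([pseudo_date, seconds, df["f3"], df["f4"]]).astype(np.float32)
print(arr.shape, arr.dtype)
```

The string operations are vectorized, but the intermediate DataFrame and string columns take extra memory, so for a 30 GB file this would likely need to be done in pieces.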

Answer 3

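Have you considered using pandas read_csv (with engine='c')?

I have found it to be one of the best and easiest solutions for handling CSV files. I work with 4 GB files, and it works for me.

import pandas as pd
df = pd.read_csv('abc.csv', engine='c')  # the engine name is lowercase
print(df.head(10))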
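For a file well beyond 4 GB, one option is to let read_csv hand back the data in chunks and copy each chunk into a preallocated float32 array. The sketch below is only an illustration: read_numeric_columns and n_rows are hypothetical names, n_rows is assumed to be a known upper bound on the row count, and only the two plain numeric columns (positions 1 and 2) are read, so the timestamp parsing from the question is left out.

```python
import numpy as np
import pandas as pd


def read_numeric_columns(path, n_rows, chunksize=1_000_000):
    """Fill a preallocated float32 array with columns 1 and 2 of the CSV.

    n_rows must be at least the number of rows in the file.
    """
    out = np.empty((n_rows, 2), dtype=np.float32)
    start = 0
    reader = pd.read_csv(
        path,
        header=None,
        usecols=[1, 2],      # skip the timestamp column
        dtype=np.float32,
        engine="c",
        chunksize=chunksize,  # returns an iterator of DataFrames
    )
    for chunk in reader:
        stop = start + len(chunk)
        out[start:stop] = chunk.to_numpy()
        start = stop
    return out[:start]
```

Only one chunk lives in memory as a DataFrame at a time, so peak usage stays close to the size of the final array plus one chunk.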
