Efficient read and write of pandas dataframe

Question

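I have a pandas DataFrame that I want to split into several smaller pieces of 100k rows each and save to disk, so that I can read the data back in and process the pieces one by one. I have tried dill and HDF storage, since CSV and raw text appear to take a lot of time.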

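I am trying this out on a subset of the data with ~500k rows and five columns of mixed data. Two contain strings, one an integer, one a float, and the last contains bigram counts from sklearn.feature_extraction.text.CountVectorizer, stored as a scipy.sparse.csr.csr_matrix sparse matrix.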

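It is the last column that I am having problems with. Dumping and loading the data works without a problem, but when I try to actually access the data, it is a pandas.Series object instead. Secondly, each row in that Series is a tuple that contains the whole data set.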

import dill
import numpy as np
import pandas
import sklearn.feature_extraction.text

# Before dumping, the original df has 100k rows.
# Each column has one value except for 'counts' which has 1400.
# Meaning that df['counts'] gives me a sparse matrix that is 100k x 1400.

vectorizer = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(2,2))
counts = vectorizer.fit_transform(df['string_data'])
df['counts'] = counts

df_split = pandas.DataFrame(np.column_stack([df['string1'][0:100000],
                                             df['string2'][0:100000],
                                             df['float'][0:100000],
                                             df['integer'][0:100000],
                                             df['counts'][0:100000]]),
                            columns=['string1', 'string2', 'float', 'integer', 'counts'])
# Pickle data is binary, so the file is opened in binary mode.
with open(file[i], 'wb') as f:
    dill.dump(df_split, f)

with open(file[i], 'rb') as f:
    df = dill.load(f)
print(type(df['counts']))
> <class 'pandas.core.series.Series'>
print(np.shape(df['counts']))
> (100000,)
print(np.shape(df['counts'][0]))
> (496718, 1400)    # 496718 is the number of rows in my complete data set.
print(type(df['counts'][0]))
> <type 'tuple'>

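Am I making an obvious mistake, or is there a better way to store this data in this format that is not very time consuming? It has to scale to my full data set, which contains 100 million rows.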

Answer 1

df['counts'] = counts

This generates a pandas Series (column) whose number of elements equals len(df), where each element is the entire sparse matrix returned by vectorizer.fit_transform(df['string_data']).

You can try the following instead:

# In scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead.
df = df.join(pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names(), index=df.index))

NOTE: be aware that this explodes your sparse matrix into a dense (non-sparse) DataFrame, so it will use much more memory and you may end up with a MemoryError.
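To get a feeling for how large that difference can be, here is a rough sketch that compares the memory held by a CSR matrix with the memory a dense array of the same shape would need. The matrix is a random stand-in, and its 1% density and 8-byte values are assumptions, not properties of the data in the question.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for the CountVectorizer output: 1,000 rows x 1,400
# columns with about 1% non-zero entries (the real density is an assumption).
counts = sparse.random(1000, 1400, density=0.01, format="csr",
                       dtype=np.float64, random_state=0)

# Memory held by the CSR representation: values, column indices, row pointers.
sparse_bytes = counts.data.nbytes + counts.indices.nbytes + counts.indptr.nbytes

# Memory a dense array of the same shape and dtype would need.
dense_bytes = counts.shape[0] * counts.shape[1] * counts.dtype.itemsize

print(f"sparse: {sparse_bytes / 1e6:.2f} MB, dense: {dense_bytes / 1e6:.2f} MB")

# The same arithmetic for the subset in the question (496,718 x 1,400),
# assuming 8-byte values.
full_dense_bytes = 496718 * 1400 * 8
print(f"dense size for 496718 x 1400: {full_dense_bytes / 1e9:.2f} GB")
```

The dense size grows with rows times columns no matter how many entries are zero, whereas the CSR size grows only with the number of non-zero entries plus one row pointer per row.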

CONCLUSION: that is why I would recommend storing your original DataFrame and the counts sparse matrix separately.
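One possible way to store the two separately is sketched below: each chunk is written as a pair of files, one holding the ordinary columns and one holding the matching rows of the sparse matrix. The tiny sizes, the file names and the choice of to_pickle for the DataFrame part are illustrative assumptions; scipy.sparse.save_npz and load_npz handle the matrix part.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical small stand-ins for the question's data.
n_rows, n_features, chunk_size = 250, 20, 100
df = pd.DataFrame({
    "string1": [f"a{i}" for i in range(n_rows)],
    "string2": [f"b{i}" for i in range(n_rows)],
    "float": np.linspace(0.0, 1.0, n_rows),
    "integer": np.arange(n_rows),
})
counts = sparse.random(n_rows, n_features, density=0.1, format="csr",
                       random_state=0)

out_dir = tempfile.mkdtemp()

# Write each chunk as two files: the plain columns and the matching matrix rows.
n_chunks = 0
for i, start in enumerate(range(0, n_rows, chunk_size)):
    stop = min(start + chunk_size, n_rows)
    df.iloc[start:stop].to_pickle(os.path.join(out_dir, f"frame_{i}.pkl"))
    sparse.save_npz(os.path.join(out_dir, f"counts_{i}.npz"), counts[start:stop])
    n_chunks += 1

# Read the chunks back one at a time; rows of the frame and of the matrix
# line up by position.
for i in range(n_chunks):
    chunk_df = pd.read_pickle(os.path.join(out_dir, f"frame_{i}.pkl"))
    chunk_counts = sparse.load_npz(os.path.join(out_dir, f"counts_{i}.npz"))
    assert chunk_df.shape[0] == chunk_counts.shape[0]
    print(i, chunk_df.shape, chunk_counts.shape)
```

Because the matrix never becomes a DataFrame column, it stays sparse both on disk and in memory, and only one chunk needs to be loaded at a time.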
