Is there a more efficient way to append lines from a large file to a numpy array? - MemoryError
I am trying to use the lda package to process a term-document matrix CSV file with 39568 rows and 27519 columns containing only counts (natural numbers).
Problem: I am getting a MemoryError with my method for reading the file and storing it in a numpy array.
Goal: Get the numbers from the TDM CSV file and convert them to a numpy array so that I can use the array as input for lda.
with open("Results/TDM - Matrix Only.csv", 'r') as matrix_file:
matrix = np.array([[int(value) for value in line.strip().split(',')] for line in matrix_file])
I have also tried numpy append, vstack, and concatenate, but I still get the MemoryError.
Is there a way to avoid the memory error?
Edit:
I tried using dtype int32 and int, and it gave me:
WindowsError: [Error 8] Not enough storage is available to process this command
I also tried using dtype float64, and it gave me:
OverflowError: cannot fit 'long' into an index-sized integer
using the following code:
fp = np.memmap("Results/TDM-memmap.txt", dtype='float64', mode='w+', shape=(len(documents), len(vocabulary)))
matrix = np.genfromtxt("Results/TDM.csv", dtype='float64', delimiter=',', skip_header=1)
fp[:] = matrix[:]
and
with open("Results/TDM.csv", 'r') as tdm_file:
vocabulary = [value for value in tdm_file.readline().strip().split(',')]
fp = np.memmap("Results/TDM-memmap.txt", dtype='float64', mode='w+', shape=(len(documents), len(vocabulary)))
for idx, line in enumerate(tdm_file):
fp[idx] = np.array(line.strip().split(','))
Other information that might be important:
- Win10 64-bit
- 8GB RAM (7.9 usable) | usage peaked at 5.5GB from around 3GB (around 2GB in use) before the MemoryError was reported
- Python 2.7.10 [MSC v.1500 32 bit (Intel)]
- Using PyCharm Community Edition 5.0.3
Since your word counts are almost all zeros, it would be much more efficient to store them in a scipy.sparse matrix.
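For a sense of scale (a back-of-the-envelope check, not from the original post): a dense 39568 x 27519 array cannot fit in a 32-bit Python process, which is capped at roughly 2GB of address space regardless of how much RAM the machine has:

rows, cols = 39568, 27519
print(rows * cols * 8 / 2.0**30)   # float64: ~8.1 GiB
print(rows * cols * 4 / 2.0**30)   # int32:   ~4.1 GiB

For example: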
from scipy import sparse
import textmining
import lda
# a small example matrix
tdm = textmining.TermDocumentMatrix()
tdm.add_doc("here's a bunch of words in a sentence")
tdm.add_doc("here's some more words")
tdm.add_doc("and another sentence")
tdm.add_doc("have some more words")
# tdm.sparse is a list of dicts, where each dict contains {word:count} for a single
# document
ndocs = len(tdm.sparse)
nwords = len(tdm.doc_count)
words = tdm.doc_count.keys()
# initialize output sparse matrix
X = sparse.lil_matrix((ndocs, nwords), dtype=int)
# iterate over documents, fill in rows of X
for ii, doc in enumerate(tdm.sparse):
    for word, count in doc.iteritems():
        jj = words.index(word)
        X[ii, jj] = count
X is now an (ndocs, nwords) scipy.sparse.lil_matrix, and words is a list corresponding to the columns of X:
print(words)
# ['a', 'and', 'another', 'sentence', 'have', 'of', 'some', 'here', 's', 'words', 'in', 'more', 'bunch']
print(X.todense())
# [[2 0 0 1 0 1 0 1 1 1 1 0 1]
# [0 0 0 0 0 0 1 1 1 1 0 1 0]
# [0 1 1 1 0 0 0 0 0 0 0 0 0]
# [0 0 0 0 1 0 1 0 0 1 0 1 0]]
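As an aside (my suggestion, not part of the original answer): with a 27519-word vocabulary, the words.index(word) call inside the loop is a linear scan per entry. Precomputing a word-to-column dict makes each lookup constant-time; a minimal sketch reusing the tdm and X above:

# hypothetical speed-up: map each word to its column index once (O(1) lookups)
col = {word: jj for jj, word in enumerate(words)}
for ii, doc in enumerate(tdm.sparse):
    for word, count in doc.iteritems():
        X[ii, col[word]] = count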
You could pass X directly to lda.LDA.fit, although it will probably be faster to convert it to a scipy.sparse.csr_matrix first:
X = X.tocsr()
model = lda.LDA(n_topics=2, random_state=0, n_iter=100)
model.fit(X)
# INFO:lda:n_documents: 4
# INFO:lda:vocab_size: 13
# INFO:lda:n_words: 21
# INFO:lda:n_topics: 2
# INFO:lda:n_iter: 100
# INFO:lda:<0> log likelihood: -126
# INFO:lda:<10> log likelihood: -102
# INFO:lda:<20> log likelihood: -99
# INFO:lda:<30> log likelihood: -97
# INFO:lda:<40> log likelihood: -100
# INFO:lda:<50> log likelihood: -100
# INFO:lda:<60> log likelihood: -104
# INFO:lda:<70> log likelihood: -108
# INFO:lda:<80> log likelihood: -98
# INFO:lda:<90> log likelihood: -98
# INFO:lda:<99> log likelihood: -99