TypeError from hstack on sparse matrices

Question

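I have two csr sparse matrices. One contains the transform from sklearn.feature_extraction.text.TfidfVectorizer, and the other was converted from a numpy array. I am trying to run scipy.sparse.hstack on the two to extend my feature matrix, but I always get this error message: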

TypeError: 'coo_matrix' object is not subscriptable

Here is the code:

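import scipy.sparse
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer="char", lowercase=True, ngram_range=(1, 2), strip_accents="unicode")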
ngram_features = vectorizer.fit_transform(df["strings"].values.astype(str))

list_other_features = ["entropy", "string_length"]
other_features = csr_matrix(df[list_other_features].values)

joined_features = scipy.sparse.hstack((ngram_features, other_features))

Both feature matrices are scipy.sparse.csr_matrix objects. I also tried not converting other_features and leaving it as a numpy.array, but that produces the same error.

Python package versions:

numpy == 1.13.3
pandas == 0.22.0
scipy == 1.1.0

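I don't understand why it talks about a coo_matrix object in this case, especially since I converted both matrices to csr_matrix. Looking at the scipy code I understand it will not do any conversion if the input matrices are csr_matrix objects.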

Answer 1

In its source code, scipy.sparse.hstack calls bmat, which may convert the matrices to coo_matrix when none of the fast-path cases applies.

Diagnosis

"Looking at the scipy code I understand it will not do any conversion if the input matrices are csr_matrix objects."

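In the source code of bmat, both matrices being csr_matrix is not the only condition for avoiding the conversion to coo_matrix. According to the source, one of the following two conditions must be met: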

# check for fast path cases
if (N == 1 and format in (None, 'csr') and all(isinstance(b, csr_matrix)
                                               for b in blocks.flat)):
    ...
elif (M == 1 and format in (None, 'csc')
      and all(isinstance(b, csc_matrix) for b in blocks.flat)):
    ...

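These checks run before line 573, A = coo_matrix(blocks[i,j]), is reached.

Suggestion

To resolve this, I suggest checking again whether you actually meet one of the fast-path cases for csr_matrix or csc_matrix (the two conditions listed above). Have a look at the full source code of bmat for a better understanding. If neither condition is met, execution falls through to the code that converts the matrices to coo_matrix.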
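As a rough illustration of the check suggested above, the sketch below re-evaluates the two quoted conditions for a given block layout. The helper name quoted_fast_path_applies is hypothetical and not part of scipy; it only mirrors the snippet quoted in this answer, on the assumption that hstack hands bmat a single row of blocks (M == 1) and vstack a single column (N == 1). Other scipy releases may use different fast-path rules.

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix


def quoted_fast_path_applies(blocks_2d, format=None):
    """Hypothetical helper: re-evaluate the two conditions quoted above.

    blocks_2d is a list of rows, each row a list of sparse matrices,
    i.e. the same layout that bmat receives.
    """
    # Fill an object array element by element so numpy does not try
    # to interpret the sparse matrices as nested sequences.
    blocks = np.empty((len(blocks_2d), len(blocks_2d[0])), dtype=object)
    for i, row in enumerate(blocks_2d):
        for j, block in enumerate(row):
            blocks[i, j] = block

    M, N = blocks.shape
    csr_case = (N == 1 and format in (None, 'csr')
                and all(isinstance(b, csr_matrix) for b in blocks.flat))
    csc_case = (M == 1 and format in (None, 'csc')
                and all(isinstance(b, csc_matrix) for b in blocks.flat))
    return bool(csr_case or csc_case)


a = csr_matrix(np.eye(3))
b = csr_matrix(np.ones((3, 2)))

# hstack layout: one row of blocks, so M == 1 and N == 2.
hstack_csr = quoted_fast_path_applies([[a, b]])
hstack_csc = quoted_fast_path_applies([[a.tocsc(), b.tocsc()]])
# vstack layout: one column of blocks, so N == 1.
vstack_csr = quoted_fast_path_applies([[a], [a]])

print(hstack_csr, hstack_csc, vstack_csr)  # False True True
```

Under the quoted conditions, a horizontal stack of csr matrices does not qualify for either fast path, which would explain why a coo_matrix shows up even though both inputs are csr_matrix objects.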

Answer 2

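It is not clear whether this error occurs inside hstack or afterwards, when you use the result.

If it occurs inside hstack, you need to provide the traceback so we can see what is going on.

hstack, which uses bmat, normally collects the coo attributes of all the inputs and combines them into a new coo matrix. So regardless of the inputs (apart from the special cases), the result will be coo. But hstack also accepts a format parameter.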

Or you can add a .tocsr() to the result. It costs nothing extra if the matrix is already csr.
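A minimal sketch of both options, using small made-up matrices in place of the TF-IDF and numeric features from the question (the names ngram_like and other_like are placeholders, and the values are illustrative only):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stand-ins for the question's two feature matrices (made-up values).
ngram_like = csr_matrix(np.array([[0.0, 0.5, 0.0],
                                  [0.3, 0.0, 0.7]]))
other_like = csr_matrix(np.array([[1.2, 4.0],
                                  [2.5, 9.0]]))

# Option 1: ask hstack for csr output directly.
joined_a = hstack((ngram_like, other_like), format="csr")

# Option 2: convert whatever hstack returns to csr afterwards.
joined_b = hstack((ngram_like, other_like)).tocsr()

# A csr result can be indexed, e.g. to select the first row.
first_row = joined_a[0]
print(joined_a.format, joined_b.format, first_row.shape)
```

Either way the result is in csr format whichever path bmat takes internally, so row indexing on it works.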
