Input data type for sklearn SVD fit_transform function

Question

I have processed document data in a CSV file, which I have read into a pandas DataFrame:

+----------+------+------------+
| document | term | count      |
+----------+------+------------+
| 1        | 126  | 1          |
| 1        | 80   | 1          |
| 1        | 1221 | 2          |
| 2        | 2332 | 1          |
+----------+------+------------+

So each row consists of a document ID, a term ID, and the term frequency.

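I don't have the original documents, only this processed data. I want to apply SVD with sklearn, but I don't know how to prepare this DataFrame for the SVD fit_transform() method, which expects: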

X : {array-like, sparse matrix}, shape (n_samples, n_features)

Answer 1

You can convert this CSV to libsvm format, where each line describes one document:

<label> <index1>:<value1> <index2>:<value2> ...
.
.
.

So your sample data would look like this (the label is just a placeholder, since SVD does not use it):

0 80:1 126:1 1221:2
0 2332:1
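The conversion itself is not shown above. A minimal sketch of one way to do it, assuming the DataFrame has exactly the columns `document`, `term`, and `count` from the question; `to_libsvm_lines` is a hypothetical helper name, not part of pandas or scikit-learn:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "document": [1, 1, 1, 2],
    "term": [126, 80, 1221, 2332],
    "count": [1, 1, 2, 1],
})

def to_libsvm_lines(frame, label=0):
    """Hypothetical helper: build one libsvm line per document.

    The label is a dummy value because SVD is unsupervised.
    Terms are sorted so feature indices appear in ascending order.
    """
    lines = []
    ordered = frame.sort_values(["document", "term"])
    for _, group in ordered.groupby("document"):
        feats = " ".join(
            f"{term}:{count}"
            for term, count in zip(group["term"], group["count"])
        )
        lines.append(f"{label} {feats}")
    return lines

lines = to_libsvm_lines(df)

# Write the lines to the file that will be loaded afterwards.
with open("your_libsvm_format_file.libsvm", "w") as fh:
    fh.write("\n".join(lines) + "\n")
```

For the four sample rows this produces the two lines shown above.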

Then read this file using sklearn.datasets.load_svmlight_file:

from sklearn.datasets import load_svmlight_file
X, y = load_svmlight_file('your_libsvm_format_file.libsvm')
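To see what the loader returns, here is a small self-contained sketch that writes the two sample lines to a temporary file and loads them back. Passing `zero_based=True` is an assumption made here so that the term IDs are used as column indices unchanged; the default `"auto"` setting guesses the indexing convention from the file contents.

```python
import os
import tempfile

from sklearn.datasets import load_svmlight_file

# The two sample lines from above.
sample = "0 80:1 126:1 1221:2\n0 2332:1\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.libsvm")
    with open(path, "w") as fh:
        fh.write(sample)
    # zero_based=True keeps term IDs as column indices (an assumption of this sketch).
    X, y = load_svmlight_file(path, zero_based=True)

print(type(X))   # a scipy sparse matrix, one row per document
print(X.shape)   # (number of documents, largest term ID + 1)
print(y)         # the dummy labels
```

X is a sparse matrix, which matches the `{array-like, sparse matrix}` input described in the question, so it can be passed to fit_transform without converting to a dense array.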

Then apply the SVD:

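from sklearn.decomposition import TruncatedSVD  # sklearn.decomposition has no class named SVD
svd = TruncatedSVD()  # works on scipy sparse input; n_components defaults to 2
X_transformed = svd.fit_transform(X)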
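As an alternative to the libsvm file round trip (not the approach described in this answer), the sparse matrix can also be built directly from the DataFrame. A minimal sketch, assuming the column names from the question and using `pd.factorize` to map document and term IDs to contiguous 0-based row and column indices; `n_components=1` is chosen only because the sample is tiny:

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Sample data copied from the question.
df = pd.DataFrame({
    "document": [1, 1, 1, 2],
    "term": [126, 80, 1221, 2332],
    "count": [1, 1, 2, 1],
})

# Map arbitrary IDs to contiguous 0-based indices, in order of first appearance.
row_idx, doc_ids = pd.factorize(df["document"])
col_idx, term_ids = pd.factorize(df["term"])

# Rows are documents, columns are terms, values are term counts.
X = csr_matrix(
    (df["count"].to_numpy(dtype=float), (row_idx, col_idx)),
    shape=(len(doc_ids), len(term_ids)),
)

svd = TruncatedSVD(n_components=1, random_state=0)
X_transformed = svd.fit_transform(X)
print(X_transformed.shape)  # one row per document, one column per component
```

Because the indices are remapped, the matrix has only as many columns as there are distinct terms; `doc_ids` and `term_ids` record which original ID each row and column corresponds to.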

