How to vectorize json data for KMeans?

Question

I have a set of questions, each with choices, that users answer. They have the following format:

question_id, text, choices

For each user, I store the questions they answered and the choices they picked as json in mongodb:

{"user_id": "", "question_answers": [{"question_id": "choice_id", ..}]}

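Now I'm trying to use K-Means clustering and streaming to find the most similar users based on their choices of questions, but I need to convert my user data into numeric vectors, like the example in the Spark docs.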

Sample of kmeans data, which is also the output I want:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

I've already tried scikit-learn's DictVectorizer, but it doesn't seem to work properly.

I created a key for each question_choice combination, like this:

from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
D = [{'question_1_choice_1': 1, 'question_1_choice_2': 1}, ..]
X = v.fit_transform(D)

Then I try to transform each user's question/choice pairs like this:

v.transform({'question_1_choice_2': 1, ...})

And I get a result like this:

[[ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]]

Is this the right approach? It means I have to build a dictionary containing all my choices and answers every time. Is there a way to do this in Spark?

Thanks in advance. Sorry, I'm new to data science.

Answer 1

Don't use K-Means with categorical data. Let me quote How to understand the drawbacks of K-means by KevinKim:

k-means assume the variance of the distribution of each attribute (variable) is spherical;

all variables have the same variance;

the prior probability for all k clusters are the same, i.e. each cluster has roughly equal number of observations;

If any one of these 3 assumptions is violated, then k-means will fail.

With encoded categorical data, the first two assumptions are almost certainly violated.
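To make that concrete: a one-hot column is a 0/1 indicator, so its variance is p(1 - p), where p is the share of users who picked that choice. The columns therefore only share a variance when every choice is equally popular. A minimal sketch, using made-up answers (the users and key names below are purely illustrative, not taken from the question):

```python
from statistics import pvariance

# Hypothetical answers: one dict per user, question -> chosen option.
users = [
    {"q1": "a", "q2": "x"},
    {"q1": "a", "q2": "y"},
    {"q1": "a", "q2": "x"},
    {"q1": "b", "q2": "y"},
]

# One 0/1 column per (question, choice) pair that occurs in the data.
columns = sorted({(q, c) for u in users for q, c in u.items()})
one_hot = {
    col: [1 if u.get(col[0]) == col[1] else 0 for u in users]
    for col in columns
}

# Population variance of an indicator column equals p * (1 - p).
variances = {col: pvariance(vals) for col, vals in one_hot.items()}
for col, var in variances.items():
    print(col, var)
```

Here the q1 columns (picked by 3 of 4 and 1 of 4 users) and the q2 columns (picked by 2 of 4 each) come out with different variances, so the "same variance" assumption already fails on four users.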

For further discussion, see K-means clustering is not a free lunch by David Robinson.

I'm trying to use K-Means clustering and streaming to find most similar users based on their choices of questions

For similarity search, use MinHashLSH with approximate joins:

https://spark.apache.org/docs/latest/ml-features.html#minhash-for-jaccard-distance
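MinHashLSH works on Jaccard distance between sets, which fits this data if each user is treated as a set of question/choice keys. As a rough illustration of the idea only, here is a plain-Python sketch; it is not Spark's API, and the helper names, the md5-based hash family, and the sample users are all assumptions made for the example:

```python
import hashlib


def jaccard_distance(a, b):
    """Exact Jaccard distance: 1 - |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def _hash(seed, item):
    # Deterministic seeded hash; stands in for MinHash's random hash family.
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def minhash_signature(items, num_hashes=64):
    """Keep the minimum hash value per seeded hash function (items must be non-empty)."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_distance(sig_a, sig_b):
    """The share of signature positions that differ approximates Jaccard distance."""
    mismatches = sum(1 for x, y in zip(sig_a, sig_b) if x != y)
    return mismatches / len(sig_a)


# Hypothetical users, each a set of "question:choice" keys.
alice = {"q1:a", "q2:x", "q3:m"}
bob = {"q1:a", "q2:x", "q3:n"}
carol = {"q1:b", "q2:y", "q3:n"}

print("exact alice/bob:", jaccard_distance(alice, bob))
print("estimated alice/bob:",
      estimated_distance(minhash_signature(alice), minhash_signature(bob)))
```

The signatures are short and fixed in length, so users can be bucketed and compared without touching the full answer sets; Spark's approximate similarity join builds on the same principle.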

You'll have to StringIndex and OneHotEncode all variables, as shown in the following answers:

How to handle categorical features with spark-ml?

See also the comment by henrikstroem.
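On the worry about rebuilding a dictionary of every choice for each user: the encoding is essentially a lookup table from each (question, choice) pair to a column index, built once and then reused. A minimal plain-Python sketch of that idea follows. It is a stand-in for the indexing and one-hot steps, not the Spark API, and it assumes "question_answers" is a list of {question_id: choice_id} dicts as in the question; the function names and sample documents are hypothetical.

```python
def build_index(user_docs):
    """Assign a stable column index to every (question_id, choice_id) pair seen."""
    pairs = sorted(
        {
            (question, choice)
            for doc in user_docs
            for answer in doc["question_answers"]
            for question, choice in answer.items()
        }
    )
    return {pair: i for i, pair in enumerate(pairs)}


def encode(doc, index):
    """Sorted column indices that are 1 for this user; unseen pairs are skipped."""
    hits = {
        index[(question, choice)]
        for answer in doc["question_answers"]
        for question, choice in answer.items()
        if (question, choice) in index
    }
    return sorted(hits)


def to_dense(indices, size):
    """Expand sparse indices into a dense 0/1 row like the DictVectorizer output."""
    row = [0.0] * size
    for i in indices:
        row[i] = 1.0
    return row


# Hypothetical documents shaped like the mongodb records in the question.
docs = [
    {"user_id": "u1", "question_answers": [{"q1": "c1"}, {"q2": "c2"}]},
    {"user_id": "u2", "question_answers": [{"q1": "c2"}, {"q2": "c2"}]},
]

index = build_index(docs)            # built once over all users
sparse_u1 = encode(docs[0], index)   # reused for each user
print(sparse_u1, to_dense(sparse_u1, len(index)))
```

The sparse index list is the form a set-based method such as MinHash needs, while the dense row is only useful for inspecting small examples.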

k-means

scikit-learn

apache-spark

pyspark