How to combine a text vector parameter with other parameters before feeding it to sklearn?
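I am trying to combine two types of parameters before clustering.
My parameters are text, represented as a sparse matrix,
and another array that represents other features of my data points.
I tried combining the two types of parameters into one array and passing it as input to the algorithm: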
db = DBSCAN(eps=1, min_samples=3, metric=get_distance).fit(array(combined_list))
I also built a custom distance metric that I want to use.
def get_distance(vec1,vec2):
    text_distance = cosine_similarity(vec1[0] ,vec2[0])
    other_distance = vec1[1]-vec2[1]
    return (text_distance+other_distance)/2
But I get an error when I try to pass my input array.
The combined array is constructed as follows:
combined_list = []
for i in range(len(hashes_list)):
    combined_list.append((hashes_list[i],text_list[i]))
combined_list = array(combined_list)
Full error traceback:
db = DBSCAN(eps=1, min_samples=3, metric=get_distance ).fit(array(combined_list))
Traceback (most recent call last):
File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
File "/Users/tal/src/campaign_detection/Data_Extractor/venv/lib/python3.7/site-packages/sklearn/cluster/dbscan_.py", line 319, in fit
X = check_array(X, accept_sparse='csr')
File "/Users/tal/src/campaign_detection/Data_Extractor/venv/lib/python3.7/site-packages/sklearn/utils/validation.py", line 527, in check_array
array = np.asarray(array, dtype=dtype, order=order)
File "/Users/tal/src/campaign_detection/Data_Extractor/venv/lib/python3.7/site-packages/numpy/core/numeric.py", line 538, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
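Is this the right way to combine a text vector with other parameters?
I have a few suggestions about your approach.
- The input to DBSCAN must be a 2D array, not an array of tuples. So you have to flatten your input data into one row of numeric features per sample.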
X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape
(n_samples, n_samples)
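- get_distance() must return a single value rather than an array. So I suggest using some metric for the non-text features. I have given an example with Euclidean distance.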
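To illustrate the second point, here is a minimal sketch of why the pairwise helpers need one more step before their result can be returned from a metric. The vectors `a` and `b` are made-up values for illustration only; they are not from the question's data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Hypothetical feature vectors, chosen only to keep the arithmetic simple.
a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 2.0])

# The pairwise helpers expect 2D input and return a 2D array,
# even when comparing a single pair of vectors.
sim = cosine_similarity([a], [b])
print(sim.shape)  # a 1x1 array, not a scalar

# Index [0, 0] to get a plain number. Note that this is a similarity,
# so it is converted to a distance with 1 - similarity.
text_distance = 1.0 - float(sim[0, 0])
other_distance = float(euclidean_distances([a], [b])[0, 0])
print(text_distance, other_distance)
```

Indexing with `[0, 0]` is only one way to reduce the result; the point is that the value handed back to DBSCAN should be a single float.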
Example:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
text_list = vectorizer.fit_transform(corpus)  # sparse matrix, one row per document
import numpy as np
hashes_list = np.array([[12,12,12],
                        [12,13,11],
                        [12,1,16],
                        [4,8,11]])
from scipy.sparse import hstack
# Put the text columns first so that the first n1 columns are the text features.
combined_list = hstack((text_list, hashes_list))
from sklearn.metrics.pairwise import cosine_distances
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.cluster import DBSCAN
n1 = text_list.shape[1]  # number of text features
def get_distance(vec1,vec2):
    # cosine_distances is 1 - cosine similarity, so both terms are distances.
    # [0, 0] turns each 1x1 result into a single value.
    text_distance = cosine_distances([vec1[:n1]], [vec2[:n1]])[0, 0]
    other_distance = euclidean_distances([vec1[n1:]], [vec2[n1:]])[0, 0]
    return (text_distance+other_distance)/2
db = DBSCAN(eps=1, min_samples=3, metric=get_distance).fit(combined_list.toarray())
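The docstring quoted above also allows an input of shape (n_samples, n_samples). As an alternative sketch, not part of the original answer, the combined distances can be computed once for all pairs and passed with metric="precomputed", which avoids calling a Python function for every pair. The helper name `combined_distance_matrix` and the equal weighting of the two parts are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
text_features = TfidfVectorizer().fit_transform(corpus)  # sparse, (4, n_terms)
other_features = np.array([[12, 12, 12],
                           [12, 13, 11],
                           [12, 1, 16],
                           [4, 8, 11]], dtype=float)

def combined_distance_matrix(text_features, other_features):
    """Hypothetical helper: average of cosine and Euclidean distance matrices."""
    text_dist = cosine_distances(text_features)       # accepts sparse input
    other_dist = euclidean_distances(other_features)  # dense (n, n) matrix
    return (text_dist + other_dist) / 2

dist = combined_distance_matrix(text_features, other_features)

# With metric="precomputed", fit() expects the square distance matrix itself.
db = DBSCAN(eps=1, min_samples=3, metric="precomputed").fit(dist)
labels = db.labels_
print(labels)
```

Because the two parts live on different scales (cosine distance is bounded, Euclidean distance is not), it may be worth scaling the non-text features before averaging; the eps value here is simply carried over from the question.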