Cluster strings based on DBSCAN
Summary:
Looking for a Python implementation of DBSCAN that clusters the rows of a multi-column CSV file based on the 'Contents' column.
Input:
Sample rows from the input CSV file:
Rank, Domain, Contents
1, abc.com, hello random text out
2, xyz.com, hello random somethingelse
3, not.com, a b c d
4, plus.com, a b asdsadsa asdsadasdsadsa
5, minus.com, man win
Where,
Column 1 => Rank = digit
Column 2 => Domain = domain name ex. abc.com
Column 3 => Contents = list of words (a string of cleaned-up words extracted from an HTML page)
Output:
The clusters should be based on similar lists of contents:
Cluster 1: abc.com, xyz.com
Cluster 2: not.com, plus.com
Cluster 3: minus.com
....
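Please note: In the output I am not looking for the words that are in the same cluster. Instead, I am looking for the domain names (column 2), clustered by the similarity of column 3, 'Contents'.
I looked at the following resources, but they are based on k-means and do not produce the DBSCAN clustering output I am looking for. Note that supplying the number of clusters is not applicable here, because we do not want to limit the number of clusters through an input parameter.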
1)
2)
3) http://brandonrose.org/clustering
4) https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/
So,
input <= csv file with 'Rank', 'Domain', 'Contents'
output <= cluster with domain name [NOT contents]
A Python implementation of DBSCAN clustering would be ideal.
Thanks!
First you need to select the "Contents" column of the dataset. You can use Python's csv module for that step.
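As a rough illustration of that first step, the sketch below reads the sample rows from the question with `csv.DictReader`. The `SAMPLE` string and the helper name `read_columns` are made up for this example; with a real file you would pass `open("input.csv", newline="")` (a hypothetical path) instead of the `io.StringIO` wrapper. It assumes the Contents field itself contains no commas.

```python
import csv
import io

# Sample rows copied from the question; stands in for a real CSV file.
SAMPLE = """Rank, Domain, Contents
1, abc.com, hello random text out
2, xyz.com, hello random somethingelse
3, not.com, a b c d
4, plus.com, a b asdsadsa asdsadasdsadsa
5, minus.com, man win
"""


def read_columns(handle):
    """Return parallel lists of domains and contents from a CSV handle."""
    # skipinitialspace=True drops the blank that follows each comma,
    # so the header names come out as "Rank", "Domain", "Contents".
    reader = csv.DictReader(handle, skipinitialspace=True)
    domains, contents = [], []
    for row in reader:
        domains.append(row["Domain"].strip())
        contents.append(row["Contents"].strip())
    return domains, contents


domains, contents = read_columns(io.StringIO(SAMPLE))
print(domains)
print(contents)
```

Keeping `domains` and `contents` as parallel lists means that position i in the clustering result later maps straight back to the domain on row i.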
Then you have to convert the text into vectors that DBSCAN can be trained on. The second link you provided contains everything needed for that step.
Then you have to train DBSCAN on the vectors. For example, you can use the DBSCAN implementation in scikit-learn.
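One possible way to combine vectorization and clustering is sketched below with scikit-learn's `TfidfVectorizer` and `DBSCAN`. Several choices here are assumptions, not something the question or the linked resources prescribe: TF-IDF as the vector representation, cosine distance as the metric, `eps=0.7`, and `min_samples=1` (so that a row with no similar neighbour forms its own cluster instead of being labelled noise). The custom `token_pattern` is there because the vectorizer's default pattern drops one-character tokens such as "a" and "b".

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# The "Contents" column of the sample rows, in file order.
contents = [
    "hello random text out",
    "hello random somethingelse",
    "a b c d",
    "a b asdsadsa asdsadasdsadsa",
    "man win",
]

# Keep single-character tokens; the default pattern requires 2+ characters.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
vectors = vectorizer.fit_transform(contents)  # sparse matrix, one row per document

# eps and min_samples are illustrative values picked for this tiny sample;
# they need tuning on real data.
model = DBSCAN(eps=0.7, min_samples=1, metric="cosine")
labels = model.fit_predict(vectors)  # one integer label per row

print(labels.tolist())
```

No number of clusters is passed anywhere; DBSCAN derives it from `eps` and `min_samples`. With a larger `min_samples`, isolated rows would receive the label -1 (noise) rather than a cluster of their own.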
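Once you have the label associated with each vector (i.e. with each row of the csv file), you can group the rows by cluster and retrieve the domains.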
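The grouping step could look roughly like the following. The `labels` list is a hypothetical clustering result written out by hand so the snippet runs on its own; in practice it would be the array returned by DBSCAN. The helper name `group_domains` is likewise invented for this sketch.

```python
from collections import defaultdict

domains = ["abc.com", "xyz.com", "not.com", "plus.com", "minus.com"]
# Hypothetical labels, one per row, standing in for DBSCAN's output.
labels = [0, 0, 1, 1, 2]


def group_domains(domains, labels):
    """Map each cluster label to the list of domains that received it."""
    clusters = defaultdict(list)
    for domain, label in zip(domains, labels):
        clusters[int(label)].append(domain)
    return dict(clusters)


clusters = group_domains(domains, labels)
for label in sorted(clusters):
    # scikit-learn's DBSCAN uses -1 for points it treats as noise.
    name = "Noise" if label == -1 else f"Cluster {label + 1}"
    print(f"{name}: {', '.join(clusters[label])}")
```

Only the domain names are printed, which matches the output format requested in the question.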