Calculating tf-idf for name/surname in pyspark
I have the following RDD (sample):
names_rdd.take(3)
[u'Daryll Dickenson', u'Dat Naijaboi', u'Duc Dung Lam']
I am trying to compute tf-idf:
from pyspark.mllib.feature import HashingTF, IDF
hashingTF = HashingTF()
tf_names = hashingTF.transform(names_rdd)
tf_names.cache()
idf_names = IDF().fit(tf_names)
tfidf_names = idf_names.transform(tf_names)
I don't understand why tf_names.take(3) gives these results:
[SparseVector(1048576, {60275: 1.0, 134386: 1.0, 145380: 1.0, 274465: 1.0, 441832: 1.0, 579064: 1.0, 590058: 1.0, 664173: 2.0, 812399: 2.0, 845381: 2.0, 886510: 1.0, 897504: 1.0, 1045730: 1.0}),
SparseVector(1048576, {208501: 1.0, 274465: 1.0, 441832: 2.0, 515947: 1.0, 537935: 1.0, 845381: 1.0, 886510: 1.0, 897504: 3.0, 971619: 1.0}),
SparseVector(1048576, {274465: 2.0, 282612: 2.0, 293606: 1.0, 389709: 1.0, 738284: 1.0, 812399: 1.0, 845381: 2.0, 897504: 1.0, 1045730: 1.0})]
Shouldn't there be 2 values per row, for example:
[SparseVector(1048576, {60275: 1.0, 134386: 1.0}),
SparseVector(1048576, {208501: 1.0, 274465: 1.0}),
SparseVector(1048576, {274365: 2.0, 282612: 2.0})]
?
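What I was doing wrong: each row first has to be split into a list of words. HashingTF treats every document as an iterable of terms, so passing a plain string makes it hash each character separately. Something like this: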
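def split_name(name):
    # Split a full name into its words and trim stray whitespace.
    list_name = name.split(' ')
    list_name = [word.strip() for word in list_name]
    return list_name

names = names_rdd.map(lambda name: split_name(name))
hashingTF = HashingTF()
# Hash the tokenized RDD (names), not the raw strings in names_rdd.
tf_names = hashingTF.transform(names)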
.
.
.
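To see why the untokenized version produced so many entries, here is a minimal sketch in plain Python, with no Spark involved. The helper `term_counts` is a hypothetical stand-in: it only mimics the way a term-frequency step iterates over a document, and it does not reproduce Spark's hash function or its vector indices.

```python
from collections import Counter

def term_counts(document):
    """Count the terms obtained by iterating over `document` (hypothetical helper)."""
    return Counter(document)

name = u'Daryll Dickenson'

# Iterating over a string yields single characters, so every character is a "term".
as_string = term_counts(name)

# Iterating over a list of words yields the words themselves.
as_words = term_counts(name.split(' '))

print(len(as_string))  # 13 distinct characters, including the space
print(len(as_words))   # 2 distinct words
```

The 13 distinct characters line up with the 13 entries in the first SparseVector from the question, which suggests the original code was counting characters rather than words.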
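For a rough idea of what the IDF step does once the names are tokenized, the sketch below works through the three sample names in plain Python. It assumes the smoothed formula described in the Spark MLlib documentation, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. The functions `idf` and `tfidf` are hypothetical helpers, and the weights are keyed by the words themselves rather than by hashed indices.

```python
import math
from collections import Counter

names = [u'Daryll Dickenson', u'Dat Naijaboi', u'Duc Dung Lam']
docs = [name.split(' ') for name in names]

m = len(docs)
# Document frequency: in how many names does each word occur?
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency (assumed to match MLlib's formula).
    return math.log((m + 1.0) / (df[term] + 1.0))

def tfidf(doc):
    # Term frequency times inverse document frequency, keyed by word.
    tf = Counter(doc)
    return {term: count * idf(term) for term, count in tf.items()}

weights = tfidf(docs[0])
print(weights)
```

Every word in this tiny sample occurs in exactly one name, so each one gets the same weight, log(4 / 2) = log 2. On a real list of names, common first names would get lower weights than rare surnames.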