TF/IDF measurement using MySQL
My MySQL table weightallofwordsintopic contains the following data:
Topic_Name Word WordCount
20160401-20160405 ahlak 954
20160401-20160405 çocuk 825
20160401-20160405 kadın 764
20160401-20160405 tecavüz 710
20160401-20160405 güzel 701
20160401-20160405 hayat 670
20160401-20160405 bakan 661
20160401-20160405 zaman 585
20160401-20160405 adam 494
20160401-20160405 çalış 453
20160406-20160407 kandil 4927
20160406-20160407 mübarek 2906
20160406-20160407 hayır 2342
20160406-20160407 çocuk 1893
20160406-20160407 güzel 1835
20160406-20160407 regaip 1574
20160406-20160407 allah 1536
20160406-20160407 tecavüz 1457
20160406-20160407 kadın 1442
20160406-20160407 hayat 1436
20160408-20160409 güzel 2385
20160408-20160409 hayat 2187
20160408-20160409 hayır 1972
20160408-20160409 zaman 1902
20160408-20160409 cuma 1589
20160408-20160409 allah 1550
20160408-20160409 gece 1233
20160408-20160409 adam 1198
20160408-20160409 saat 1153
20160408-20160409 dünya 1130
20160410-20160411 stat 1993
20160410-20160411 güzel 1854
20160410-20160411 hayat 1579
20160410-20160411 şampiyon 1464
20160410-20160411 taraftar 1426
20160410-20160411 zaman 1380
20160410-20160411 adam 1336
20160410-20160411 çalış 1297
20160410-20160411 saat 1283
20160410-20160411 başkan 1112
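I want to measure the tf/idf weight of each word in each topic. Assume that a topic is the same thing as a document, so I need the tf/idf of every word computed separately for each topic. I need a MySQL query for this. WordCount is the number of occurrences of the word in that topic. My table is too big, so I only wrote a sample to explain my problem. I need a query that does this job. Thanks a lot.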
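I did this according to the wiki.
Here you go:
1) t1 gets the sum of the word counts for each topic.
2) t2 gets the idf. This is the log10 of the ratio of the number of topics to the number of topics that contain the word.
3) Since you already have the word counts, divide by the per-topic sum (topic_sum) to get the tf.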
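Written as formulas, the three steps above look roughly like this. The symbols N (number of distinct topics) and n_w (number of topics containing word w) are introduced here only for illustration and do not appear in the query itself.

```latex
\mathrm{tf}(w, t) = \frac{\mathrm{WordCount}(w, t)}{\sum_{w'} \mathrm{WordCount}(w', t)},
\qquad
\mathrm{idf}(w) = \log_{10} \frac{N}{n_w},
\qquad
\mathrm{tf\_idf}(w, t) = \mathrm{tf}(w, t) \cdot \mathrm{idf}(w)
```

As a worked example, assuming the 40 sample rows were the whole table: the word counts of topic 20160401-20160405 sum to 6817, and ahlak appears in 1 of the 4 topics, so tf = 954 / 6817 ≈ 0.1399, idf = log10(4 / 1) ≈ 0.6021, and tf_idf ≈ 0.0843. A word such as güzel, which appears in all 4 sample topics, gets idf = log10(4 / 4) = 0 and therefore tf_idf = 0.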
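select w.Topic_Name,
       w.Word,
       w.WordCount / t1.topic_sum as tf,
       t2.idf,
       (w.WordCount / t1.topic_sum) * t2.idf as tf_idf
from weightallofwordsintopic w
-- t1: total word count per topic
join (
    select Topic_Name, sum(WordCount) as topic_sum
    from weightallofwordsintopic
    group by Topic_Name
) t1
  on w.Topic_Name = t1.Topic_Name
-- t2: idf per word = log10(number of topics / number of topics containing the word)
-- count(*) assumes one row per (Topic_Name, Word) pair
join (
    select w.Word, log10(t_cnts.cnts / count(*)) as idf
    from weightallofwordsintopic w
    cross join (
        select count(distinct Topic_Name) as cnts
        from weightallofwordsintopic
    ) t_cnts
    -- t_cnts.cnts is constant; listing it keeps the query valid under ONLY_FULL_GROUP_BY
    group by w.Word, t_cnts.cnts
) t2
  on w.Word = t2.Word
order by tf_idf desc,
         w.Word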
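To sanity-check the query's output on a handful of rows, the same arithmetic can be reproduced outside the database. The following is only a rough sketch in Python, not part of the original answer: the function name tf_idf and the four-row subset are made up for illustration, and it assumes one row per (Topic_Name, Word) pair, as the query does.

```python
import math
from collections import defaultdict

# A few rows copied from the sample table: (Topic_Name, Word, WordCount).
rows = [
    ("20160401-20160405", "ahlak", 954),
    ("20160401-20160405", "güzel", 701),
    ("20160406-20160407", "kandil", 4927),
    ("20160406-20160407", "güzel", 1835),
]


def tf_idf(rows):
    """Return {(topic, word): tf * idf}, mirroring t1 and t2 in the query."""
    topic_sum = defaultdict(int)         # like t1: total word count per topic
    topics_with_word = defaultdict(set)  # like t2: which topics contain each word
    for topic, word, count in rows:
        topic_sum[topic] += count
        topics_with_word[word].add(topic)

    n_topics = len(topic_sum)  # like t_cnts.cnts
    scores = {}
    for topic, word, count in rows:
        tf = count / topic_sum[topic]
        idf = math.log10(n_topics / len(topics_with_word[word]))
        scores[(topic, word)] = tf * idf
    return scores


scores = tf_idf(rows)
for (topic, word), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(topic, word, round(score, 4))
```

In this subset güzel occurs in both topics, so its idf is log10(2 / 2) = 0 and its score is 0, while ahlak and kandil each occur in a single topic and get a positive score. The numbers differ from what the query returns on the full table, because both the per-topic sums and the topic count depend on which rows are included.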