Spark MLlib LDA: what are the possible reasons for always getting very similar LDA topics?
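I am applying the MLlib LDA example to various downloaded corpora.
I filter out stop words and also exclude both very frequent and very rare terms.
The problem is that I always end up with very similar topics.
Here is an example of the topics I get when running the algorithm on a corpus of 300K English sentences from Wikipedia (eng_wikipedia_2010_300K-sentences); I see similar behavior with other corpora: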
TOPIC 0
dai 0.0020492776129338083
call 0.0019627409470977355
citi 0.0019496273507300062
three 0.0019172201890256511
gener 0.0018325842193426059
plai 0.0018287121439402873
peopl 0.001786839660855886
well 0.0017792000702589461
system 0.0017410979899730565
area 0.001721711978388363
power 0.0016906026954800833
forc 0.0016646631729486227
number 0.0016343386030518979
1 0.0016238591786476033
team 0.0016112030952801443
second 0.0015692071709961662
develop 0.0015670177558504078
group 0.0015378927495689552
unit 0.001535180513974118
nation 0.001520548489788889
TOPIC 1
dai 0.002027230927747474
call 0.0019861147606781222
citi 0.0019793753441068825
three 0.0019315799215582723
gener 0.0018482143436741026
plai 0.0018088629290540156
peopl 0.0017929339168126625
well 0.0017549252518608278
system 0.0016936542725510587
power 0.0016792684719108006
area 0.0016604962232717288
forc 0.0016575624332970456
1 0.0016344588453542676
number 0.0016147026427518426
team 0.0015914797457267642
develop 0.001580085843019015
unit 0.0015659585445574969
nation 0.0015412334667742672
second 0.0015292625574896467
group 0.0015111594105132022
TOPIC 2
dai 0.002028407701986021
call 0.001987655848237808
citi 0.0019737160296217846
three 0.0019183385421321895
plai 0.0018470661666555599
gener 0.0018431319454591765
peopl 0.0017947273975068192
well 0.00174922095206974
area 0.0017256327188664123
system 0.0016995971624202812
forc 0.001690002995539528
power 0.0016779250581379353
1 0.0016214669556130525
team 0.0016134935452659781
number 0.00161273946842774
develop 0.0015712560226793318
unit 0.0015385515465297065
second 0.001537016434433013
nation 0.001529578699246495
group 0.0015259003261706866
TOPIC 3
dai 0.0020271063080981745
call 0.001973996689805456
citi 0.0019709486233839084
three 0.0019445106630149387
gener 0.0018677792917783514
plai 0.0018485914586526906
peopl 0.0018082458859327093
well 0.0017955363877379456
area 0.0017455386898734308
system 0.0017118889300776724
power 0.0017085249825238942
forc 0.0016416026632813164
1 0.001625823945554925
team 0.0015984923365964885
number 0.001584888932954503
develop 0.0015753517064182336
unit 0.0015587234313666533
second 0.0015545107852806973
nation 0.001551230039407881
form 0.0015004750009120491
TOPIC 4
dai 0.0020367505428973216
citi 0.0019778590305849857
call 0.0019772546555550576
three 0.001909390366412786
peopl 0.001822249318126459
gener 0.0018136257455996375
plai 0.0018128359158538045
well 0.0017692106359278286
system 0.0017220797688845334
area 0.0017158874212548339
power 0.0016752592665713634
forc 0.0016481228833262157
1 0.0016364343814157618
develop 0.0016172188646470641
team 0.0016018835612051036
number 0.0015991873726231036
group 0.0015593423279207062
second 0.0015532604092917898
unit 0.001549525336335323
2 0.0015220460130066676
TOPIC 5
dai 0.0020635883517150367
call 0.0019664003159491844
citi 0.001961190935833301
three 0.001945998746077669
plai 0.0018498883070569758
peopl 0.0018146602342867515
gener 0.0018135991027718233
well 0.0017837359414291816
area 0.0017440315427199456
system 0.0016954828503859868
power 0.001684533695977363
forc 0.001669704443002364
number 0.00161528564937031
1 0.001615272821378791
team 0.0016121988960501902
unit 0.0015895009183487473
develop 0.001577936587739003
group 0.0015555325586313624
nation 0.0015404874848355308
second 0.0015394146696500102
TOPIC 6
dai 0.0020136284206896792
call 0.001992567179072041
citi 0.0019601308797825385
three 0.0019185595159400765
plai 0.0018409472012516875
gener 0.001829303983728153
peopl 0.0017780620849170163
well 0.001771180582253062
system 0.0017377818879564248
area 0.0016871361621009276
power 0.0016862650658960986
forc 0.00167141172198367
1 0.001629498191900329
number 0.0015977527836457993
develop 0.0015960475085336815
team 0.001571055963470908
unit 0.0015559866004530513
group 0.0015445653607137958
second 0.0015346412996486915
2 0.001533194322154979
TOPIC 7
dai 0.0020097600649219504
citi 0.001996121452902739
call 0.001976365831615543
three 0.0019444233325152307
gener 0.0018347697960641011
plai 0.0018294437097569366
peopl 0.001809068711352435
well 0.0017851474017785431
system 0.0017266117477556496
power 0.001696861186965475
area 0.0016963032173278431
forc 0.0016424242914518095
team 0.0016341651077031543
number 0.0016257268377783236
1 0.0016221579346215153
develop 0.0015930555191603342
unit 0.0015895942206181324
group 0.0015703868353222673
second 0.001515454552733173
2 0.0015143190174102155
TOPIC 8
dai 0.002044683052793855
call 0.001992448963405555
citi 0.00195425798896221
three 0.0018970773269210957
plai 0.001853887836159108
gener 0.0018252502592182695
peopl 0.0018160312050590462
well 0.0017935933754513543
system 0.0017479534729456555
area 0.0017288815955179666
power 0.0017029539375086361
forc 0.0016706673237865313
1 0.0016681586343593317
number 0.0016501255143390717
team 0.0015894156993455188
develop 0.0015724268907364824
unit 0.0015371351757786232
second 0.0015247527824288484
nation 0.0015235190916716697
group 0.0015194534324480095
TOPIC 9
dai 0.0020620160901430877
citi 0.001987856719658478
call 0.001973103036828604
three 0.001924295805136688
peopl 0.0018232321289066767
plai 0.0018172215529843724
gener 0.0018125979152302458
well 0.0018056742813131674
system 0.001725860669839185
area 0.0017232894719674296
power 0.001697643253119442
1 0.001640662972775316
forc 0.0016394197000681693
number 0.0015927389128238725
unit 0.0015785177165666606
team 0.0015751611459412492
develop 0.0015670613914512046
nation 0.0015287394547847542
2 0.0015262474392790497
group 0.0015196717933709822
TOPIC 10
dai 0.0020203137546454856
citi 0.001985814822156114
call 0.001974265937728284
three 0.001934180185122672
gener 0.0018803136198652043
plai 0.0018164056544889878
peopl 0.0018083393449413536
well 0.0017804569091358126
power 0.0017051544274740097
area 0.0016959804754901494
system 0.0016918620528211653
1 0.0016435864049172597
forc 0.0016413861291761263
number 0.001638383798987439
develop 0.0016053710214565596
team 0.0015754232749060797
unit 0.001543834810440448
group 0.0015352472722856185
nation 0.0015350540825884074
2 0.001500158078774582
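Why are you removing the frequent words? Leave them in. LDA does not always work well when given a very large number of features. Many published results restrict LDA to the 20k most common English words (stop words excluded). My guess is that this accounts for a lot of your problem right now.
There could be other issues as well. Did you run the algorithm to convergence? Are 10 topics too few to get reasonable topics? You have provided very little information.
Go to the original online LDA paper and first try to replicate its results, to confirm that you are using the library correctly. Once you have the hang of it, adapt the setup to your new corpus.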
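As a rough illustration of the vocabulary restriction described above, here is a minimal sketch in plain Python (not Spark). The helper names `build_vocabulary` and `filter_docs`, the toy documents, and the stop-word set are all hypothetical; the 20k cap is only the default mentioned above, and the example uses a cap of 2 so the result is easy to read.

```python
from collections import Counter


def build_vocabulary(tokenized_docs, stop_words, max_terms=20000):
    """Return the max_terms most frequent non-stop-word tokens."""
    counts = Counter(
        token
        for doc in tokenized_docs
        for token in doc
        if token not in stop_words
    )
    # most_common sorts by descending count, so frequent words are kept
    # and only the long tail of rare words is dropped.
    return [term for term, _ in counts.most_common(max_terms)]


def filter_docs(tokenized_docs, vocabulary):
    """Drop every token that is not in the chosen vocabulary."""
    keep = set(vocabulary)
    return [[token for token in doc if token in keep] for doc in tokenized_docs]


# Toy corpus and stop-word list, purely illustrative.
docs = [
    ["the", "city", "team", "played", "well"],
    ["the", "city", "power", "system"],
    ["a", "team", "of", "people", "in", "the", "city"],
]
stop_words = {"the", "a", "of", "in"}

vocab = build_vocabulary(docs, stop_words, max_terms=2)
filtered = filter_docs(docs, vocab)
print(vocab)
print(filtered)
```

The point of the design is that only stop words and the rare tail are removed; the frequent content words stay in and form the vocabulary that LDA sees.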
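To judge whether more iterations or a different number of topics actually changes anything, it can help to put a number on how similar two topics are. The sketch below is plain Python (not Spark) and computes the cosine similarity between two topics represented as term-to-weight dictionaries. The helper `cosine_similarity` is hypothetical, and the weights are the first five terms of TOPIC 0 and TOPIC 1 from the question, so this is only a rough proxy for comparing the full topic distributions.

```python
import math


def cosine_similarity(topic_a, topic_b):
    """Cosine similarity between two {term: weight} dictionaries."""
    terms = set(topic_a) | set(topic_b)
    dot = sum(topic_a.get(t, 0.0) * topic_b.get(t, 0.0) for t in terms)
    norm_a = math.sqrt(sum(w * w for w in topic_a.values()))
    norm_b = math.sqrt(sum(w * w for w in topic_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# First five terms of TOPIC 0 and TOPIC 1 as posted in the question.
topic_0 = {
    "dai": 0.0020492776129338083,
    "call": 0.0019627409470977355,
    "citi": 0.0019496273507300062,
    "three": 0.0019172201890256511,
    "gener": 0.0018325842193426059,
}
topic_1 = {
    "dai": 0.002027230927747474,
    "call": 0.0019861147606781222,
    "citi": 0.0019793753441068825,
    "three": 0.0019315799215582723,
    "gener": 0.0018482143436741026,
}

similarity = cosine_similarity(topic_0, topic_1)
print(f"cosine similarity between TOPIC 0 and TOPIC 1: {similarity:.4f}")
```

A value close to 1 for every pair of topics, as with the output in the question, suggests the topics have not separated; if the pairwise values drop after raising the iteration count or changing the number of topics, the change is helping.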