Transforming a gensim.interfaces.TransformedCorpus to a readable result

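I am using Mallet LDA through gensim's wrapper.

I now want to get the topic distributions for several unseen documents, store them in a nested list, and then print them.

This is my code: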

other_texts = [
    ['wlan', 'usb', 'router'],
    ['auto', 'auto', 'auto'],
    ['human', 'system', 'computer']
]

corpus1 = [id2word.doc2bow(text) for text in other_texts]

to_pro = []
for t in corpus1:
    unseen_doc = corpus1
    vector = lda[unseen_doc] # get topic probability distribution for a document
    to_pro.append(vector)

If I try to print the resulting list (to_pro), I get the following:

[<gensim.interfaces.TransformedCorpus object at 0x0000024CC1DFC940>, <gensim.interfaces.TransformedCorpus object at 0x0000024CC1DFC320>, <gensim.interfaces.TransformedCorpus object at 0x0000024CC1DFC6A0>]

I tried the following code to print it properly, but the topic distribution probabilities are wrong:

topic_dist = []
for line in to_pro:
    topic_dist += lda.get_document_topics(line)
td=[]
for topic in topic_dist:
    td.append(topic)

I got this result:

[[(0, 0.05458162849743133), (1, 0.05510823556400538), (2, 0.05603786367505091), (3, 0.05472256432318962), (4, 0.05471966342417517), (5, 0.05454446883678316), (6, 0.060267211268385176), (7, 0.05590590303517797), (8, 0.054558298009463865), (9, 0.0570497751708577), (10, 0.05586054626708894), (11, 0.05611284070096096), (12, 0.05483861615903838), (13, 0.054548627713420714), (14, 0.0548708631793431), (15, 0.055097199555668705), (16, 0.05572779508710042), (17, 0.05544789953285848)], [(0, 0.05457482739088479), (1, 0.05509130205455064), (2, 0.05599364448566309), (3, 0.05479472333893934), (4, 0.05489998490024729), (5, 0.054542940465732534), (6, 0.06014649090195501), (7, 0.0558787316629024), (8, 0.05455634249554292), (9, 0.05651159582517287), (10, 0.0558343047708517), (11, 0.05605027364084813), (12, 0.05483134591787102), (13, 0.054546952683828316), (14, 0.05488058477867337), (15, 0.0550725066190555), (16, 0.055951974201133244), (17, 0.055841473866147906)], [(0, 0.05457665942453363), (1, 0.055255130626316235), (2, 0.05616834056392741), (3, 0.05472749675259328), (4, 0.0547199851837743), (5, 0.054544546873748226), (6, 0.06037007389117332), (7, 0.05593838115178327), (8, 0.05456190582329174), (9, 0.056409168851414615), (10, 0.0559404965748031), (11, 0.05614914322415512), (12, 0.054842094317369555), (13, 0.054550171326841215), (14, 0.054870520851845996), (15, 0.05511732934346291), (16, 0.05579100118297473), (17, 0.05546755403599123)], [(0, 0.054581620307290336), (1, 0.05510823907508528), (2, 0.056037876384335425), (3, 0.05472256410518629), (4, 0.05471967034475046), (5, 0.05454446871605657), (6, 0.06026693118061518), (7, 0.05590622478877356), (8, 0.054558295773128575), (9, 0.05704995161755483), (10, 0.05586057502348091), (11, 0.056112803329985396), (12, 0.05483861481767718), (13, 0.05454862663175604), (14, 0.054870865577993026), (15, 0.055097113943380405), (16, 0.05572773919917307), (17, 0.055447819183777246)], [(0, 0.05457482815837349), (1, 0.05509132071436994), (2, 
0.05599364089981504), (3, 0.05479471920764724), (4, 0.05489999995707833), (5, 0.05454293700828862), (6, 0.06014645177706313), (7, 0.05587868116251209), (8, 0.05455634846240247), (9, 0.056511585085478364), (10, 0.055834295810939794), (11, 0.056050296895854265), (12, 0.054831353686471636), (13, 0.05454695325610574), (14, 0.05488059866846103), (15, 0.055072528844072065), (16, 0.05595218064057245), (17, 0.05584127976449436)], [(0, 0.054576657976703774), (1, 0.05525504608539575), (2, 0.05616829811928526), (3, 0.05472749878845379), (4, 0.05471997497183866), (5, 0.054544547686709126), (6, 0.06037016659013718), (7, 0.05593821008515276), (8, 0.05456190840675052), (9, 0.05640917964821885), (10, 0.05594054039873076), (11, 0.05614912143569156), (12, 0.0548420823035294), (13, 0.054550172872614225), (14, 0.054870521717331436), (15, 0.055117319561282), (16, 0.05579110737872705), (17, 0.055467645973447846)], [(0, 0.054581639915369816), (1, 0.055108252268374285), (2, 0.056037916094392765), (3, 0.05472256597071497), (4, 0.05471966744573819), (5, 0.0545444687939403), (6, 0.06026693966026536), (7, 0.055906213964449725), (8, 0.05455829555351338), (9, 0.05704968653857304), (10, 0.0558606261827436), (11, 0.05611290790292455), (12, 0.05483860593828801), (13, 0.05454862649308445), (14, 0.05487085805236639), (15, 0.05509715099521129), (16, 0.05572773695595529), (17, 0.05544784127409454)], [(0, 0.05457482754746605), (1, 0.05509132328696252), (2, 0.055993666140583764), (3, 0.05479472184721206), (4, 0.05489996963702654), (5, 0.05454294168997213), (6, 0.060146365105445465), (7, 0.05587886571230439), (8, 0.05455633757025994), (9, 0.056511632004648656), (10, 0.055834239764847755), (11, 0.05605028881626678), (12, 0.054831347261978546), (13, 0.05454695137813789), (14, 0.05488060185684171), (15, 0.05507250450434276), (16, 0.055951827151308337), (17, 0.05584158872439472)], [(0, 0.05457665857245025), (1, 0.05525503335748317), (2, 0.05616811411295409), (3, 0.054727501563580076), (4, 
0.054719978109952404), (5, 0.05454454660618627), (6, 0.060370135879343034), (7, 0.05593823717454384), (8, 0.05456190762146366), (9, 0.056409205316000424), (10, 0.05594060935464846), (11, 0.056149148701409454), (12, 0.05484207733245972), (13, 0.054550172010398135), (14, 0.05487051175914863), (15, 0.05511731933953272), (16, 0.055791267296383236), (17, 0.055467575892062346)]]

However, requesting a single element of the list gives the correct result:

to_pro = []
for t in corpus1:
    unseen_doc = corpus1
    vector = lda[unseen_doc[1]] # specifying document at index 1
    to_pro.append(vector)
[[(0, 0.052410901467505704), (1, 0.052410901467505704), (2, 0.052410901467505704), (3, 0.052410901467505704), (4, 0.052410901467505704), (5, 0.052410901467505704), (6, 0.052410901467505704), (7, 0.052410901467505704), (8, 0.052410901467505704), (9, 0.052410901467505704), (10, 0.052410901467505704), (11, 0.052410901467505704), (12, 0.052410901467505704), (13, 0.052410901467505704), (14, 0.10901467505240292), (15, 0.052410901467505704), (16, 0.052410901467505704), (17, 0.052410901467505704)], [(0, 0.052410901467505704), (1, 0.052410901467505704), (2, 0.052410901467505704), (3, 0.052410901467505704), (4, 0.052410901467505704), (5, 0.052410901467505704), (6, 0.052410901467505704), (7, 0.052410901467505704), (8, 0.052410901467505704), (9, 0.052410901467505704), (10, 0.052410901467505704), (11, 0.052410901467505704), (12, 0.052410901467505704), (13, 0.052410901467505704), (14, 0.10901467505240292), (15, 0.052410901467505704), (16, 0.052410901467505704), (17, 0.052410901467505704)], [(0, 0.052410901467505704), (1, 0.052410901467505704), (2, 0.052410901467505704), (3, 0.052410901467505704), (4, 0.052410901467505704), (5, 0.052410901467505704), (6, 0.052410901467505704), (7, 0.052410901467505704), (8, 0.052410901467505704), (9, 0.052410901467505704), (10, 0.052410901467505704), (11, 0.052410901467505704), (12, 0.052410901467505704), (13, 0.052410901467505704), (14, 0.10901467505240289), (15, 0.052410901467505704), (16, 0.052410901467505704), (17, 0.052410901467505704)]]

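Another problem is that, for a single document, the same distribution is printed three times.

I also looked at a related answer, but it did not help.

What am I doing wrong here?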

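I made a simple mistake.

The part that computes the topic probabilities has to be moved out of the loop. Inside the loop, lda[unseen_doc] was applied to the whole corpus on every pass, so each pass appended a lazy TransformedCorpus rather than one document's distribution: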

to_pro = []
unseen_doc = corpus1
vector = lda[unseen_doc]
for t in vector:
    print(t)
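To illustrate why the original loop stored three wrapper objects, here is a minimal sketch of the same indexing pattern. It deliberately does not use gensim: ToyModel and LazyCorpus are hypothetical stand-ins, the "inference" is a made-up placeholder, and the bag-of-words values are invented for the example. The only behaviour it is meant to mirror is that indexing with a whole corpus gives back a lazy wrapper, while indexing with a single document gives back that document's distribution.

```python
class LazyCorpus:
    """Hypothetical stand-in for a lazily transformed corpus (not gensim's class)."""

    def __init__(self, model, corpus):
        self.model = model
        self.corpus = corpus

    def __iter__(self):
        # Nothing is computed until the wrapper is iterated.
        for bow in self.corpus:
            yield self.model[bow]


class ToyModel:
    """Hypothetical stand-in for a topic model (not gensim's API)."""

    def __init__(self, num_topics):
        self.num_topics = num_topics

    def _infer(self, bow):
        # Placeholder "inference": add each word count to topic (word_id % num_topics).
        weights = [1.0] * self.num_topics
        for word_id, count in bow:
            weights[word_id % self.num_topics] += count
        total = sum(weights)
        return [(k, w / total) for k, w in enumerate(weights)]

    def __getitem__(self, item):
        # A document is a list of (word_id, count) tuples;
        # a corpus is a list of such documents.
        is_corpus = bool(item) and isinstance(item[0], list)
        if is_corpus:
            return LazyCorpus(self, item)
        return self._infer(item)


# Invented bag-of-words corpus with three documents.
corpus1 = [[(0, 1), (1, 1)], [(2, 3)], [(0, 1), (3, 1)]]
model = ToyModel(num_topics=3)

# Original pattern: the whole corpus is indexed on every pass,
# so the list ends up holding three lazy wrappers.
buggy = [model[corpus1] for _ in corpus1]

# Fix from the answer: index the corpus once, then iterate the result.
fixed = [dist for dist in model[corpus1]]

# Equivalent alternative: index with the loop variable instead.
per_doc = [model[bow] for bow in corpus1]
```

In this sketch, fixed and per_doc hold one distribution per document, whereas every element of buggy is a wrapper that would still have to be iterated.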

Thanks for the fix. You left out one line, the one that appends each result to to_pro:

to_pro = []
unseen_doc = corpus1
vector = lda[unseen_doc]
for t in vector:
    print(t)
    to_pro.append(t)

To export to CSV:

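import pandas as pd

# The column list needs one name per topic in the model ('Topic n' is a placeholder).
results = pd.DataFrame(to_pro, columns=['Topic 1', 'Topic 2',
                                        'Topic 3', 'Topic 4',
                                        'Topic 5', 'Topic n'])

# Write the table with the row index and the header row.
results.to_csv('test_results.csv', index=True, header=True)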
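Since each entry of to_pro is a list of (topic_id, probability) tuples, it may help to flatten it into plain rows of probabilities before exporting, so that every cell holds a number rather than a tuple. The following is only a sketch under assumptions: the helper name to_dense_rows, the three-topic sample values, and the use of the standard-library csv module (to keep the example free of dependencies) are all illustrative and not taken from the answer above.

```python
import csv
import io


def to_dense_rows(distributions, num_topics):
    """Turn lists of (topic_id, probability) tuples into fixed-width rows."""
    rows = []
    for dist in distributions:
        row = [0.0] * num_topics  # topics missing from a distribution stay at 0.0
        for topic_id, prob in dist:
            row[topic_id] = prob
        rows.append(row)
    return rows


# Made-up sample in the same shape as to_pro, with 3 topics for brevity.
to_pro = [
    [(0, 0.2), (1, 0.5), (2, 0.3)],
    [(0, 0.6), (2, 0.4)],  # topic 1 is absent here on purpose
]
num_topics = 3

# Build the header from the topic count instead of typing the names by hand.
header = ["Topic %d" % (k + 1) for k in range(num_topics)]
rows = to_dense_rows(to_pro, num_topics)

# Write to an in-memory buffer; open('test_results.csv', 'w', newline='')
# could be used in its place to write a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
```

The same header and rows could also be passed to pd.DataFrame(rows, columns=header) if the pandas route is preferred.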