Saving Graphlab LDA model turns topics into gibberish?

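OK, this is just plain odd. I think the problem may have been introduced by a recent graphlab update, because I've never seen it before, but I'm not sure. Anyway, check this out: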

import graphlab as gl

# Load the corpus and train a 10-topic LDA model.
corpus = gl.SArray('path/to/corpus_data')
lda_model = gl.topic_model.create(dataset=corpus, num_topics=10, num_iterations=50, alpha=1.0, beta=0.1)
# Print the top 3 words for each topic.
lda_model.get_topics(num_words=3).print_rows(30)

+-------+---------------+------------------+
| topic |      word     |      score       |
+-------+---------------+------------------+
|   0   |     Music     | 0.0195325651638  |
|   0   |      Love     | 0.0120906781994  |
|   0   |  Photography  | 0.00936914065591 |
|   1   |     Recipe    | 0.0205673829742  |
|   1   |      Food     | 0.0202932111556  |
|   1   |     Sugar     | 0.0162560126511  |
|   2   |    Business   | 0.0223993672813  |
|   2   |    Science    | 0.0164027313084  |
|   2   |   Education   | 0.0139221301443  |
|   3   |    Science    | 0.0134658216431  |
|   3   |   Video_game  | 0.0113924173881  |
|   3   |      NASA     | 0.0112188654905  |
|   4   | United_States | 0.0127908290673  |
|   4   |   Automobile  | 0.00888669047383 |
|   4   |   Australia   | 0.00854809547772 |
|   5   |    Disease    | 0.00704245203928 |
|   5   |     Earth     | 0.00693360028027 |
|   5   |    Species    | 0.00648700544757 |
|   6   |    Religion   | 0.0142311765509  |
|   6   |      God      | 0.0139990904439  |
|   6   |     Human     | 0.00765681454222 |
|   7   |     Google    | 0.0198547267697  |
|   7   |    Internet   | 0.0191105480317  |
|   7   |    Computer   | 0.0179914269911  |
|   8   |      Art      | 0.0378733245262  |
|   8   |     Design    | 0.0223646138082  |
|   8   |     Artist    | 0.0142755732766  |
|   9   |      Film     | 0.0205971724156  |
|   9   |     Earth     | 0.0125386246077  |
|   9   |   Television  | 0.0102082224947  |
+-------+---------------+------------------+

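OK, even without knowing anything about my corpus, you can see that these topics are at least reasonably comprehensible, in that the top terms per topic are more or less related.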

But now if I simply save and reload the model, the topics change completely (to nonsense, as far as I can tell):

# Save the model, load it back, and print the top 3 words per topic again.
lda_model.save('test')
lda_model = gl.load_model('test')
lda_model.get_topics(num_words=3).print_rows(30)

+-------+-----------------------+-------------------+
| topic |          word         |       score       |
+-------+-----------------------+-------------------+
|   0   |      Cleanliness      |  0.00468171463384 |
|   0   |      Chicken_soup     |  0.00326753275774 |
|   0   | The_Language_Instinct |  0.00314506174959 |
|   1   |      Equalization     |  0.0015724652078  |
|   1   |    Financial_crisis   |  0.00132675410371 |
|   1   |    Tulsa,_Oklahoma    |  0.00118899041288 |
|   2   |        Batoidea       |  0.00142300468887 |
|   2   |       Abbottabad      |  0.0013474225953  |
|   2   |   Migration_humaine   |  0.00124284781396 |
|   3   |     Gewürztraminer    |  0.00147470845039 |
|   3   |         Indore        |  0.00107223358321 |
|   3   |     White_wedding     |  0.00104791136102 |
|   4   |        Bregenz        |  0.00130871351963 |
|   4   |       Carl_Jung       | 0.000879345016186 |
|   4   |           ภ           | 0.000855001542873 |
|   5   |        18e_eeuw       | 0.000950866105797 |
|   5   |      Vesuvianite      | 0.000832367570269 |
|   5   |      Gary_Kirsten     | 0.000806410748201 |
|   6   |  Sunday_Bloody_Sunday | 0.000828552346797 |
|   6   |  Linear_cryptanalysis | 0.000681188343324 |
|   6   |     Clothing_sizes    |  0.00066708652481 |
|   7   |          Mile         | 0.000759081990574 |
|   7   |  Pinwheel_calculator  | 0.000721971708181 |
|   7   |       Third_Age       | 0.000623010955132 |
|   8   |   Tennessee_Williams  | 0.000597449568381 |
|   8   |         Levite        | 0.000551338743949 |
|   8   |   Time_Out_(company)  | 0.000536667117994 |
|   9   |     David_Deutsch     | 0.000543813843275 |
|   9   | Honing_(metalworking) |  0.00044496051774 |
|   9   |   Clearing_(finance)  | 0.000431699705779 |
+-------+-----------------------+-------------------+

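Any idea what could be happening here? save should just pickle the model, so I don't see where the weirdness comes from, but somehow the topic distributions are getting completely changed in some non-obvious way. I've verified this on two different machines (Linux and Mac), with similarly weird results.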
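To put a number on "completely changed", one option is to compare the top words per topic before and after the reload. The sketch below is plain Python and does not call graphlab at all; it assumes the rows of the get_topics output have already been turned into dicts with 'topic', 'word' and 'score' keys (for example by iterating over the returned table). The helper names top_words_by_topic and topic_overlap are made up for this illustration, and the sample rows are copied from topic 0 in the two tables above.

```python
from collections import defaultdict


def top_words_by_topic(rows):
    """Group rows of {'topic', 'word', 'score'} into {topic: set of words}."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["topic"]].add(row["word"])
    return dict(grouped)


def topic_overlap(before_rows, after_rows):
    """For each topic, return the fraction of 'before' words still present 'after'."""
    before = top_words_by_topic(before_rows)
    after = top_words_by_topic(after_rows)
    return {
        topic: len(words & after.get(topic, set())) / float(len(words))
        for topic, words in before.items()
    }


# Topic 0 before saving (copied from the first table).
before_rows = [
    {"topic": 0, "word": "Music", "score": 0.0195325651638},
    {"topic": 0, "word": "Love", "score": 0.0120906781994},
    {"topic": 0, "word": "Photography", "score": 0.00936914065591},
]

# Topic 0 after save + load (copied from the second table).
after_rows = [
    {"topic": 0, "word": "Cleanliness", "score": 0.00468171463384},
    {"topic": 0, "word": "Chicken_soup", "score": 0.00326753275774},
    {"topic": 0, "word": "The_Language_Instinct", "score": 0.00314506174959},
]

print(topic_overlap(before_rows, after_rows))  # {0: 0.0}
```

An overlap of 1.0 for every topic is what a lossless save and load should give; 0.0 for topic 0 matches what the tables show.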

EDIT

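Downgrading Graphlab from 1.7.1 to 1.6.1 seems to resolve the issue, but that's not a real solution. I don't see anything obvious in the 1.7.1 release notes that explains what happened, and I'd like this to work in 1.7.1 if possible...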

This is a bug in Graphlab Create 1.7.1. It has now been fixed in Graphlab Create 1.8.
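If you have saved models lying around and want to flag the ones that might be affected, a small version guard could look like the sketch below. It is plain Python and takes the installed version as a string supplied by the caller; how you obtain that string is up to you. The names parse_version, KNOWN_BAD and save_load_is_suspect are hypothetical, and the check covers only 1.7.1, since that is the only release this thread identifies as broken.

```python
def parse_version(version):
    """Turn a dotted version string such as '1.7.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# Assumption: 1.7.1 is the only affected release (the only one named in this thread).
KNOWN_BAD = (1, 7, 1)


def save_load_is_suspect(version):
    """Return True if topic models saved with this version may reload incorrectly."""
    return parse_version(version) == KNOWN_BAD


for v in ("1.6.1", "1.7.1", "1.8"):
    print(v, save_load_is_suspect(v))
```

For any model flagged this way, the simplest fix is probably to upgrade to 1.8 and retrain or re-save it.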