Dynamically assign similarity matrices per document to array for export to JSON

Question

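I'm new to Python, so I'm sure this is something simple that I'm missing, but I can't figure it out. I created a similarity matrix for each document in my corpus, and I want to assign them back to a dictionary keyed by document name so I can keep track of the similarity between each pair of documents.

However, it keeps assigning the last matrix to every key instead of assigning the corresponding matrix to each key.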

import pandas as pd
import numpy as np
import nltk
import string
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import json
import os

path = "stories/"
token_dict = {}
stemmer = PorterStemmer()

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

def stem_tokens(tokens, stemmer):
    stemmed_words = []
    for token in tokens:
        stemmed_words.append(stemmer.stem(token))
    return stemmed_words


for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = subdir + os.path.sep + file
        with open(file_path, "r", encoding = "utf-8") as file:
            story = file
            text = story.read()
            lowers = text.lower()
            map = str.maketrans('', '', string.punctuation)
            no_punctuation = lowers.translate(map)
            token_dict[file.name.split("\\", 1)[1]] = no_punctuation

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())

termarray = tfs.toarray()
nparray = np.array(termarray)
rows, cols = nparray.shape

similarity = []
for document in docdict:
    for row in range(0, rows-1):
        similarity = cosine_similarity(tfs[row:row+1], tfs)
        docdict[document] = similarity

Everything works as expected up until the assignment back to the dictionary.

This produces a dictionary:

{'98ststory1.txt': array([[ 0.10586559,  0.04742287,  0.02478352,    0.06587952,  0.12907377,
      0.07661095,  0.06941533,  0.05443182,  0.06616549,  0.0266565 ,
      0.04640984,  0.03356339,  0.02529364,  0.08210173,  0.16172138,
      0.05594719,  0.10231466,  0.03556236,  0.18374215,  0.0588386 ,
      0.16857304,  0.08866461,  0.12510476,  0.07107058,  0.0751615 ,
      0.06371055,  0.16820855,  0.07926561,  0.02590006,  0.03690054,
      0.01513446,  0.04677632,  0.11693509,  1.        ,  0.06086615]]),
 'alfredststory1.txt': array([[ 0.10586559,  0.04742287,  0.02478352,  0.06587952,  0.12907377,
      0.07661095,  0.06941533,  0.05443182,  0.06616549,  0.0266565 ,
      0.04640984,  0.03356339,  0.02529364,  0.08210173,  0.16172138,
      0.05594719,  0.10231466,  0.03556236,  0.18374215,  0.0588386 ,
      0.16857304,  0.08866461,  0.12510476,  0.07107058,  0.0751615 ,
      0.06371055,  0.16820855,  0.07926561,  0.02590006,  0.03690054,
      0.01513446,  0.04677632,  0.11693509,  1.        ,  0.06086615]]),
 'alfredststory2.txt': array([[ 0.10586559,  0.04742287,  0.02478352,     0.06587952,  0.12907377,
      0.07661095,  0.06941533,  0.05443182,  0.06616549,  0.0266565 ,
      0.04640984,  0.03356339,  0.02529364,  0.08210173,  0.16172138,
      0.05594719,  0.10231466,  0.03556236,  0.18374215,  0.0588386 ,
      0.16857304,  0.08866461,  0.12510476,  0.07107058,  0.0751615 ,
      0.06371055,  0.16820855,  0.07926561,  0.02590006,  0.03690054,
      0.01513446,  0.04677632,  0.11693509,  1.        ,  0.06086615]])

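Every document is assigned the matrix of the second-to-last document. That part is just a simple off-by-one error; the real problem is that they are all assigned the same matrix.

The matrix I get for a single document looks like this: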

array([[ 1.        ,  0.07015725,  0.01593837,  0.05618977,  0.03892873,
         0.02434279,  0.06029888,  0.02261425,  0.03531677,  0.02975444,
         0.01835854,  0.02145624,  0.00985163,  0.03645598,  0.0497407 ,
         0.04482995,  0.06677013,  0.03153055,  0.10919878,  0.12029462,
         0.07255828,  0.05499581,  0.06330188,  0.04719668,  0.08909685,
         0.04484428,  0.06725359,  0.04453039,  0.02381673,  0.02639529,
         0.01012012,  0.0218679 ,  0.09989828,  0.10586559,  0.01535069]])

This is the similarity of each document to the first document. What I want is a dictionary that looks like this:

{
    story1:
          {
              story1: 1.,
              story2: 0.07015725,
              story3: 0.01593837,
              story4: 0.05618977... 
          }
    story2:
          {
              story1: ...
          }
 }

...and so on.

A sample dataset looks like this:

story1 = """Four other streets were renamed in Cork at the turn of the last   century to celebrate this event: Wolfe Tone St. (Previously Fair Lane), John Philpot Curran St. (Philpot’s Lane), Emmet (Nelson’s) Place and Sheare’s (Nile) St."""
story2 = """Oliver Plunkett Street was originally named George's Street after George I, the then reigning King of Great Britain and Ireland. In 1920, during the Burning of Cork, large parts of the street were destroyed by British troops."""
story3 = """Alfred Street is a connecting Street between Kent Train Station and MacCurtain Street. Present Cork city centre signage uses letters inspired by the book of Kells. This has been an inspiration for many typefaces in the past, including the Petrie's 'B' typface, and Monotype's 'Column Cille', which was widely used for school textbooks."""

Run through the script, this produces the following similarity matrices:

[[ 1.          0.05814422  0.06032458]]
[[ 0.05814422  1.          0.21323354]]
[[ 0.06032458  0.21323354  1.        ]]

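Each of these is a 1×n matrix holding one document's similarity to every document. I want to turn this into a dictionary that lets me see the specific similarity of each document to every other document, like so: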

{
    story1: {
                story1: 1.,
                story2: 0.05814422,
                story3: 0.06032458
            },
    story2: {
                story1: 0.05814422,
                story2: 1.,
                story3: 0.21323354
            },
    story3: {
                story1: 0.06032458,
                story2: 0.21323354,
                story3: 1.
            }
}

I'm sure this is a basic question, but my understanding of Python's data structures is lacking, so any help would be appreciated!

Answer 1

Assume you have the following similarity matrix:

sim = cosine_similarity(tfs)

In [261]: sim
Out[261]:
array([[ 1.        ,  0.09933054,  0.08911641],
       [ 0.09933054,  1.        ,  0.27252107],
       [ 0.08911641,  0.27252107,  1.        ]])

Note: we don't need a loop to compute the similarity matrix.

Using the Pandas module we can do the following:
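
As for why the loop in the question put the same matrix under every key: the inner loop runs over the rows again for each key, so every key is overwritten until only the last computed row is left. A minimal sketch in plain Python, where `names` and the nested list `sim_rows` are hypothetical stand-ins for the document names and the similarity matrix, contrasts that pattern with pairing each name with its own row:

```python
# Hypothetical stand-ins: three document names and a 3x3 similarity matrix
# stored as a nested list (values copied from the question's sample output).
names = ["story1", "story2", "story3"]
sim_rows = [
    [1.0, 0.05814422, 0.06032458],
    [0.05814422, 1.0, 0.21323354],
    [0.06032458, 0.21323354, 1.0],
]

# Pattern from the question: for every key the inner loop walks the rows and
# overwrites the value each time, and range(0, n - 1) stops one row early.
buggy = {}
for name in names:
    for row in range(0, len(sim_rows) - 1):
        buggy[name] = sim_rows[row]

# Pairing each name with its own row avoids both problems.
fixed = {}
for name, row in zip(names, sim_rows):
    fixed[name] = row
```

With this data every key in `buggy` holds the second row (the second-to-last one), while `fixed` holds a different row per key.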

In [262]: df = pd.DataFrame(sim,
                            columns=list(token_dict.keys()),
                            index=list(token_dict.keys()))

The DataFrame:

In [263]: df
Out[263]:
          story1    story2    story3
story1  1.000000  0.099331  0.089116
story2  0.099331  1.000000  0.272521
story3  0.089116  0.272521  1.000000

Now we can easily convert the DataFrame to a dict:

In [264]: df.to_dict()
Out[264]:
{'story1': {'story1': 1.0000000000000009,
  'story2': 0.099330538266243495,
  'story3': 0.089116410701360893},
 'story2': {'story1': 0.099330538266243495,
  'story2': 0.99999999999999911,
  'story3': 0.27252107037687257},
 'story3': {'story1': 0.089116410701360893,
  'story2': 0.27252107037687257,
  'story3': 1.0}}
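
If pulling in pandas only for this step is not desirable, roughly the same nested dictionary can be built with a dict comprehension. This is a sketch under assumptions: `names` and the nested list `sim_rows` are hypothetical stand-ins (for a NumPy array, `sim.tolist()` gives that form), and rounding to 6 places is an optional choice to hide floating-point noise such as `1.0000000000000009`:

```python
# Hypothetical stand-ins for the document names and the similarity matrix.
names = ["story1", "story2", "story3"]
sim_rows = [
    [1.0000000000000009, 0.099330538266243495, 0.089116410701360893],
    [0.099330538266243495, 0.99999999999999911, 0.27252107037687257],
    [0.089116410701360893, 0.27252107037687257, 1.0],
]

# Outer keys name the row document, inner keys name the column document.
nested = {
    row_name: {
        col_name: round(value, 6)  # optional: trims floating-point noise
        for col_name, value in zip(names, row)
    }
    for row_name, row in zip(names, sim_rows)
}
```

Because a cosine similarity matrix is symmetric, it does not matter here whether the outer keys are taken from rows or columns.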

Or directly to JSON:

df.to_json('/path/to/file.json')
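
If the data is already a plain nested dict of Python floats, the standard-library `json` module can serialize it as well. A minimal sketch, with a hypothetical two-document `nested` dict as input; note that a NumPy array is not directly serializable by `json.dumps`, so a dict of arrays like the one in the question would need `.tolist()` first:

```python
import json

# Hypothetical nested dict of plain Python floats.
nested = {
    "story1": {"story1": 1.0, "story2": 0.099331},
    "story2": {"story1": 0.099331, "story2": 1.0},
}

# Serialize to a JSON string; json.dump(nested, fh) would write to an open
# file handle instead.
payload = json.dumps(nested, indent=2, sort_keys=True)

# Round-trip to check that nothing was lost.
restored = json.loads(payload)
```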
