Dynamically assign similarity matrices per document to array for export to JSON

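I'm new to Python, so I'm sure I'm missing something simple, but I can't figure it out. I've created a similarity matrix for each document in my corpus, and I want to assign them back to a dictionary keyed by document name so I can keep track of the similarity between each pair of documents.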


import pandas as pd
import numpy as np
import nltk
import string
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import json
import os

path = "stories/"
token_dict = {}
stemmer = PorterStemmer()

def tokenize(text):
   tokens = nltk.word_tokenize(text)
   stems = stem_tokens(tokens, stemmer)
   return stems

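def stem_tokens(tokens, stemmer):
    stemmed_words = []
    for token in tokens:
        # Reduce each token to its stem with the supplied stemmer
        stemmed_words.append(stemmer.stem(token))
    return stemmed_words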

for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = subdir + os.path.sep + file
        with open(file_path, "r", encoding = "utf-8") as file:
            story = file
            text = story.read()
            lowers = text.lower()
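            # Translation table that deletes all punctuation (avoids shadowing the built-in map)
            table = str.maketrans('', '', string.punctuation)
            no_punctuation = lowers.translate(table)
            # Key each document by the part of its path after the first backslash
            token_dict[file.name.split("\\", 1)[1]] = no_punctuation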

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())

termarray = tfs.toarray()
nparray = np.array(termarray)
rows, cols = nparray.shape

similarity = []
for document in docdict:
    for row in range(0, rows-1):
        similarity = cosine_similarity(tfs[row:row+1], tfs)
        docdict[document] = similarity



{'98ststory1.txt': array([[ 0.10586559,  0.04742287,  0.02478352,    0.06587952,  0.12907377,
      0.07661095,  0.06941533,  0.05443182,  0.06616549,  0.0266565 ,
      0.04640984,  0.03356339,  0.02529364,  0.08210173,  0.16172138,
      0.05594719,  0.10231466,  0.03556236,  0.18374215,  0.0588386 ,
      0.16857304,  0.08866461,  0.12510476,  0.07107058,  0.0751615 ,
      0.06371055,  0.16820855,  0.07926561,  0.02590006,  0.03690054,
      0.01513446,  0.04677632,  0.11693509,  1.        ,  0.06086615]]),
 'alfredststory1.txt': array([[ 0.10586559,  0.04742287,  0.02478352,  0.06587952,  0.12907377,
      0.07661095,  0.06941533,  0.05443182,  0.06616549,  0.0266565 ,
      0.04640984,  0.03356339,  0.02529364,  0.08210173,  0.16172138,
      0.05594719,  0.10231466,  0.03556236,  0.18374215,  0.0588386 ,
      0.16857304,  0.08866461,  0.12510476,  0.07107058,  0.0751615 ,
      0.06371055,  0.16820855,  0.07926561,  0.02590006,  0.03690054,
      0.01513446,  0.04677632,  0.11693509,  1.        ,  0.06086615]]),
 'alfredststory2.txt': array([[ 0.10586559,  0.04742287,  0.02478352,     0.06587952,  0.12907377,
      0.07661095,  0.06941533,  0.05443182,  0.06616549,  0.0266565 ,
      0.04640984,  0.03356339,  0.02529364,  0.08210173,  0.16172138,
      0.05594719,  0.10231466,  0.03556236,  0.18374215,  0.0588386 ,
      0.16857304,  0.08866461,  0.12510476,  0.07107058,  0.0751615 ,
      0.06371055,  0.16820855,  0.07926561,  0.02590006,  0.03690054,
      0.01513446,  0.04677632,  0.11693509,  1.        ,  0.06086615]])



array([[ 1.        ,  0.07015725,  0.01593837,  0.05618977,  0.03892873,
         0.02434279,  0.06029888,  0.02261425,  0.03531677,  0.02975444,
         0.01835854,  0.02145624,  0.00985163,  0.03645598,  0.0497407 ,
         0.04482995,  0.06677013,  0.03153055,  0.10919878,  0.12029462,
         0.07255828,  0.05499581,  0.06330188,  0.04719668,  0.08909685,
         0.04484428,  0.06725359,  0.04453039,  0.02381673,  0.02639529,
         0.01012012,  0.0218679 ,  0.09989828,  0.10586559,  0.01535069]])


              story1: 1.,
              story2: 0.07015725,
              story3: 0.01593837,
              story4: 0.05618977... 
              story1: ...



story1 = """Four other streets were renamed in Cork at the turn of the last   century to celebrate this event: Wolfe Tone St. (Previously Fair Lane), John Philpot Curran St. (Philpot’s Lane), Emmet (Nelson’s) Place and Sheare’s (Nile) St."""
story2 = """Oliver Plunkett Street was originally named George's Street after George I, the then reigning King of Great Britain and Ireland. In 1920, during the Burning of Cork, large parts of the street were destroyed by British troops."""
story3 = """Alfred Street is a connecting Street between Kent Train Station and MacCurtain Street. Present Cork city centre signage uses letters inspired by the book of Kells. This has been an inspiration for many typefaces in the past, including the Petrie's 'B' typface, and Monotype's 'Column Cille', which was widely used for school textbooks."""


[[ 1.          0.05814422  0.06032458]]
[[ 0.05814422  1.          0.21323354]]
[[ 0.06032458  0.21323354  1.        ]]

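Each of these is a 1×n matrix holding one document's similarity to every document in the corpus. I'd like to turn this into a dictionary that lets me look up the specific similarity between any document and each of the others, like so: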

    story1: {
                story1: 1.,
                story2: 0.05814422,
                story3: 0.06032458
    story2: {
                story1: 0.05814422,
                story2: 1.,
                story3: 0.21323354
    story3: {
                story1: 0.06032458,
                story2: 0.21323354,
                story3: 1.

I'm sure this is a basic question, but I don't yet have a good grasp of Python's data structures. Any help would be greatly appreciated!


sim = cosine_similarity(tfs)

In [261]: sim
array([[ 1.        ,  0.09933054,  0.08911641],
       [ 0.09933054,  1.        ,  0.27252107],
       [ 0.08911641,  0.27252107,  1.        ]])
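
Once the full matrix is available, the nested dictionary can also be built without any extra library. The following is only a minimal sketch: it assumes a list `names` (a hypothetical name, not from the original code) that holds the document names in the same order as the rows of `sim`, and it hard-codes the matrix values printed above instead of recomputing them.

```python
import numpy as np

# Assumed: document names in the same order as the rows/columns of `sim`.
names = ["story1", "story2", "story3"]

# Values copied from the output above; in the real pipeline this would be
# the result of cosine_similarity(tfs).
sim = np.array([
    [1.0,        0.09933054, 0.08911641],
    [0.09933054, 1.0,        0.27252107],
    [0.08911641, 0.27252107, 1.0],
])

# Pair every row with its document name, then pair each value in that row
# with the name of the document it was compared against.
# row.tolist() converts NumPy floats into plain Python floats.
nested = {
    name: dict(zip(names, row.tolist()))
    for name, row in zip(names, sim)
}

print(nested["story2"]["story3"])
```

Because `zip` pairs items by position, this only gives correct labels if `names` follows the same order as the documents passed to the vectorizer.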


Using the Pandas module, we can do the following:

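In [262]: df = pd.DataFrame(sim,
     ...:                   # assumed: label rows and columns with the document names (keys of token_dict)
     ...:                   index=list(token_dict),
     ...:                   columns=list(token_dict))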


In [263]: df
          story1    story2    story3
story1  1.000000  0.099331  0.089116
story2  0.099331  1.000000  0.272521
story3  0.089116  0.272521  1.000000


In [264]: df.to_dict()
{'story1': {'story1': 1.0000000000000009,
  'story2': 0.099330538266243495,
  'story3': 0.089116410701360893},
 'story2': {'story1': 0.099330538266243495,
  'story2': 0.99999999999999911,
  'story3': 0.27252107037687257},
 'story3': {'story1': 0.089116410701360893,
  'story2': 0.27252107037687257,
  'story3': 1.0}}
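
Since the stated goal is a JSON export, a dictionary of this shape can be handed to the standard `json` module. This is a minimal sketch, not part of the original answer: the variable name `similarity` is hypothetical and the values are shortened copies of the output above.

```python
import json

# Nested dict shaped like the df.to_dict() output above (values shortened).
similarity = {
    "story1": {"story1": 1.0, "story2": 0.099331, "story3": 0.089116},
    "story2": {"story1": 0.099331, "story2": 1.0, "story3": 0.272521},
    "story3": {"story1": 0.089116, "story2": 0.272521, "story3": 1.0},
}

# Serialize to a JSON string; indent only affects readability.
as_json = json.dumps(similarity, indent=2)

# Parse it back to confirm that nothing was lost in the round trip.
restored = json.loads(as_json)

print(as_json)
```

To write to disk instead, `json.dump(similarity, fh)` takes an open file handle. pandas also provides `DataFrame.to_json`, which skips the intermediate dictionary.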

