Python compute cosine similarity on two directories of files

I have two directories of files. One contains human-made transcriptions and the other contains transcriptions produced by IBM Watson. Both directories hold the same number of files, transcribed from the same phone recordings.

I'm computing cosine similarity between matching files using SpaCy's .similarity, and printing or storing the result along with the names of the files compared. Besides a for loop, I've also tried iterating with a function, but I can't find a way to iterate over both directories, compare the two files at matching indexes, and print the result.

Here is my current code:

# iterate through files in both directories
for human_file, api_file in os.listdir(human_directory), os.listdir(api_directory):
    # set the documents to be compared and parse them through the small spacy nlp model
    human_model = nlp_small(open(human_file).read())
    api_model = nlp_small(open(api_file).read())
    
    # print similarity score with the names of the compared files
    print("Similarity using small model:", human_file, api_file, human_model.similarity(api_model))

I've gotten this to work iterating over just one directory, and confirmed it has the expected output by printing the file names, but it doesn't work when using both directories at once. I've also tried something like this:

# define directories
human_directory = os.listdir("./00_data/Human Transcripts")
api_directory = os.listdir("./00_data/Watson Scripts")

# function for cosine similarity of files in two directories using small model
def nlp_small(human_directory, api_directory):
    for i in (0, (len(human_directory) - 1)):
        print(human_directory[i], api_directory[i])

nlp_small(human_directory, api_directory)

which returns:

human_10.txt watson_10.csv
human_9.txt watson_9.csv

But that's only two of the files, not all 17.

Any pointers on iterating over matching indexes across two directories would be much appreciated.

EDIT: Thanks to @kevinjiang, here is the working code block:

# set the directories containing transcripts
human_directory = os.path.join(os.getcwd(), "00_data", "Human Transcripts")
api_directory = os.path.join(os.getcwd(), "00_data", "Watson Scripts")

# iterate through files in both directories
for human_file, api_file in zip(os.listdir(human_directory), os.listdir(api_directory)):
    # set the documents to be compared and parse them through the small spacy nlp model
    human_model = nlp_small(open(os.path.join(human_directory, human_file)).read())
    api_model = nlp_small(open(os.path.join(api_directory, api_file)).read())
    
    # print similarity score with the names of the compared files
    print("Similarity using small model:", human_file, api_file, human_model.similarity(api_model))

And here is most of the output (one UTF-16 character in one of the files needs to be fixed to keep the loop from stopping):

nlp_small = spacy.load('en_core_web_sm')
Similarity using small model: human_10.txt watson_10.csv 0.9274665883462793
Similarity using small model: human_11.txt watson_11.csv 0.9348740684005554
Similarity using small model: human_12.txt watson_12.csv 0.9362025469343344
Similarity using small model: human_13.txt watson_13.csv 0.9557355330988958
Similarity using small model: human_14.txt watson_14.csv 0.9088701120190216
Similarity using small model: human_15.txt watson_15.csv 0.9479464053189846
Similarity using small model: human_16.txt watson_16.csv 0.9599724037676819
Similarity using small model: human_17.txt watson_17.csv 0.9367605599306302
Similarity using small model: human_18.txt watson_18.csv 0.8760760037870665
Similarity using small model: human_2.txt watson_2.csv 0.9184563762823503
Similarity using small model: human_3.txt watson_3.csv 0.9287452822270265
Similarity using small model: human_4.txt watson_4.csv 0.9415664367046419
Similarity using small model: human_5.txt watson_5.csv 0.9158895909429551
Similarity using small model: human_6.txt watson_6.csv 0.935313240861153

Once I've fixed the character encoding error, I'll wrap this in a function so I can call either the large or the small model on two directories, for the rest of the APIs I have to test.
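That planned wrapper might look something like the sketch below. The names `compare_directories` and `read_transcript` are hypothetical; the `nlp` argument stands for whichever loaded spaCy pipeline you pass in (small or large), and `errors="replace"` is one way to tolerate the stray character mentioned above rather than crashing:

```python
import os

def read_transcript(path):
    # Replace undecodable bytes instead of raising, so one bad
    # character does not stop the whole loop
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()

def compare_directories(nlp, human_dir, api_dir):
    # Sort both listings so files pair up by name regardless of
    # the order the OS happens to return them in
    for human_file, api_file in zip(sorted(os.listdir(human_dir)),
                                    sorted(os.listdir(api_dir))):
        human_doc = nlp(read_transcript(os.path.join(human_dir, human_file)))
        api_doc = nlp(read_transcript(os.path.join(api_dir, api_file)))
        yield human_file, api_file, human_doc.similarity(api_doc)
```

It could then be called as `compare_directories(nlp_small, human_directory, api_directory)`, or with a large model loaded via `spacy.load('en_core_web_lg')`, without duplicating the loop.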

Two small errors are keeping your loops from working. In your second example, the for loop only visits index 0 and index (len(human_directory) - 1), because (0, len(human_directory) - 1) is a two-element tuple, not a range. Instead you should do for i in range(len(human_directory)):, which will let you iterate over every index.
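As a quick illustration of the difference (a plain list standing in for the 17 directory entries):

```python
# Hypothetical stand-in for the 17 directory entries
names = [f"human_{i}.txt" for i in range(17)]

# A tuple literal: the loop body runs exactly twice
visited_tuple = [i for i in (0, len(names) - 1)]

# range(): the loop body runs once per index
visited_range = [i for i in range(len(names))]

print(visited_tuple)       # [0, 16]
print(len(visited_range))  # 17
```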

As for your first example, I think you're probably getting some kind of "too many values to unpack" error. To loop over two iterables at the same time, use zip(), so it should look like

for human_file, api_file in zip(os.listdir(human_directory), os.listdir(api_directory)):
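The pairing behaviour of zip() can be sketched with plain lists (the file names here are just examples). One caveat worth knowing: zip() stops at the end of the shorter iterable, so if one directory is missing a file, the leftover files in the other are silently skipped:

```python
human_files = ["human_10.txt", "human_2.txt", "human_3.txt"]
api_files = ["watson_10.csv", "watson_2.csv", "watson_3.csv"]

# zip() yields one pair per matching position
for human_file, api_file in zip(human_files, api_files):
    print(human_file, api_file)
# human_10.txt watson_10.csv
# human_2.txt watson_2.csv
# human_3.txt watson_3.csv
```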