Output differences between 2 texts when lines are dissimilar

Question

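I am fairly new to Python, so apologies in advance if this sounds a bit muddled at times. I will try to use Google and try out your tips as much as I can before asking more questions. Here is my situation: I am working with R and stylometry to find out the (likely) authorship of a text. What I want to do is see whether the stylometry of the second edition of a novel differs, after one of the (presumed) co-authors died and could therefore no longer contribute. To research this I need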

Text edition 1
Text edition 2

and for Python to output

the words that occur in text 1 but not in text 2
the words that occur in text 2 but not in text 1

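I would also like these words to be listed every time they occur, so not just 'the' once, but every time the program hits a spot that differs from the first edition (yes, I know I am asking for a lot, sorry).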

I have tried to solve this with

file1 = open("FRANKENST18.txt", "r")
file2 = open("FRANKENST31.txt", "r")
file3 = open("frankoutput.txt", "w")
list1 = file1.readlines()
list2 = file2.readlines()
file3.write("here: \n")
for i in list1:
    for j in list2:
        if  i==j:
            file3.write(i)

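But of course this does not work, because the texts are two huge balls of text rather than separate lines that can be compared, and the first text has far more lines than the second. Is there a way to get around this by going from lines to 'words', or to the text in general? Can I put a whole novel in a single string? I would guess not. I have also tried difflib, but I only started coding a few weeks ago and I find it rather complicated. For example, I used fraxel's script as a base: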

from difflib import Differ

s1 = open("FRANKENST18.txt", "r")
s1 = open("FRANKENST31.txt", "r")

def appendBoldChanges(s1, s2):
#"Adds <b></b> tags to words that are changed"
    l1 = s1.split(' ')
    l2 = s2.split(' ')
dif = list(Differ().compare(l1, l2))
return " ".join(['<b>'+i[2:]+'</b>' if i[:1] == '+' else i[2:] for i in dif 
                                                       if not i[:1] in '-?'])

print appendBoldChanges

But I could not get it to work.

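So my question is: is there any way to output the differences between texts whose lines are dissimilar like this? It sounded quite doable, but I greatly underestimated how hard I would find Python, haha. Thanks for reading, and thanks for any help!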

EDIT: posting my current code in case it helps fellow learners who are googling for answers:

file1 = open("1stein.txt")
originaltext1 = file1.read()
wordlist1={}

import string
text1 = [x.strip(string.punctuation) for x in originaltext1.split()]
text1 = [x.lower() for x in text1]

for word1 in text1:
    if word1 not in wordlist1:
        wordlist1[word1] = 1
    else:
        wordlist1[word1] += 1

for k,v in sorted(wordlist1.items()):
    #print "%s %s" % (k, v)
    col1 = ("%s %s" % (k, v))
    print col1

file2 = open("2stein.txt")
originaltext2 = file2.read()
wordlist2={}

import string
text2 = [x.strip(string.punctuation) for x in originaltext2.split()]
text2 = [x.lower() for x in text2]

for word2 in text2:
    if word2 not in wordlist2:
        wordlist2[word2] = 1
    else:
        wordlist2[word2] += 1

for k,v in sorted(wordlist2.items()):
    #print "%s %s" % (k, v)
    col2 = ("%s %s" % (k, v))
    print col2

What I would still like to compute and output is something like this, using the dictionary's key and value system (applied to col1 and col2): {apple 3, bridge 7, chair 5} - {apple 1, bridge 9, chair 5} = {apple 2, bridge -2, chair 0}?

Answer 1

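Let me know if this is not what you are looking for, but it seems you want to iterate over the lines of a file, which is easy to do in Python. Here is an example in which I drop the newline character at the end of each line and add the lines to a list: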

with open("filename.txt", "r") as f:
    lines = []
    for line in f:
        # Drop the trailing newline before storing the line
        lines.append(line.rstrip("\n"))

Hope this helps!

Answer 2

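I am not entirely sure whether you are trying to compare differences in word occurrences or in line occurrences, but one way to do this is with a dictionary. If you want to see which lines have changed, you can split the text into sentences on periods by doing something like this: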

text = 'this is a sentence. this is another sentence.'
sentences = text.split('.')

This splits the string you have (which I assume contains the whole text) on periods and returns an array (or list) of all the sentences.
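Splitting on '.' alone leaves a trailing empty string and leading spaces, and it ignores '!' and '?'. As a rough sketch of one alternative (the helper name split_sentences and the choice of sentence-ending characters are assumptions for illustration, not part of the original answer), a regular expression can split after the punctuation and keep it attached to the sentence:

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' and drop empty pieces."""
    # The lookbehind keeps the punctuation attached to its sentence;
    # the whitespace between sentences is what gets consumed by the split.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = "this is a sentence. this is another sentence."
print(split_sentences(text))
# ['this is a sentence.', 'this is another sentence.']
```

This will still mis-split abbreviations such as "Mr. Smith", so treat it as a starting point only.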

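You can then create a dictionary with seen = {}, loop over every sentence in the list created earlier, and make each one a key in the dictionary with some value (it can be anything, since most sentences probably will not occur more than once). After doing this for the first version, you can go through the second version and check which sentences are the same. Here is some code to get you started (assuming version1 contains all the sentences from the first version):

seen = {}
for sentence in version1:
    seen[sentence] = 1                     # put a counter for each sentence

Then you can loop over the second version and check whether the same sentence is found in the first version, for example:

for sentence in version2:
    if sentence in seen:            # if the sentence is in the dictionary
        pass
        # or do whatever you want here
    else:                           # if the sentence isn't
        print(sentence)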

Again, I am not sure whether this is what you are looking for, but I hope it helps.
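The same dictionary idea carries over from sentences to word counts, which is closer to the subtraction the question's edit describes. The following is only a sketch under assumptions: the helper names word_counts and count_difference are made up for illustration, and collections.Counter is swapped in for the hand-written counting loops.

```python
from collections import Counter
import string

def word_counts(text):
    """Lower-case the text, strip surrounding punctuation, count each word."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return Counter(w for w in words if w)

def count_difference(text1, text2):
    """Return {word: count in text1 minus count in text2}, omitting zeros."""
    diff = Counter(word_counts(text1))
    # Counter.subtract keeps negative results, unlike the "-" operator
    diff.subtract(word_counts(text2))
    return {w: n for w, n in diff.items() if n != 0}

v1 = "The apple, the bridge and the chair."
v2 = "The apple and a bridge; a chair."
print(count_difference(v1, v2))
# {'the': 2, 'a': -2}
```

A positive number means the word is more frequent in the first text, a negative number means it is more frequent in the second.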

Answer 3

You want to output:

Words that occur in text 1 but not in text 2
Words that occur in text 2 but not in text 1

Interesting. A set difference is what you need.

import re
s1 = open("FRANKENST18.txt", "r").read()
s2 = open("FRANKENST31.txt", "r").read()

# "+" matches whole words instead of single letters
words_s1 = re.findall("[A-Za-z]+", s1)
words_s2 = re.findall("[A-Za-z]+", s2)

set_s1 = set(words_s1)
set_s2 = set(words_s2)

words_in_s1_but_not_in_s2 = set_s1 - set_s2
words_in_s2_but_not_in_s1 = set_s2 - set_s1

words_in_s1 = '\n'.join(words_in_s1_but_not_in_s2)
words_in_s2 = '\n'.join(words_in_s2_but_not_in_s1)

with open("s1_output","w") as s1_output:
    s1_output.write(words_in_s1)

with open("s2_output","w") as s2_output:
    s2_output.write(words_in_s2)
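A set reports each differing word only once. If every differing occurrence is wanted, in reading order, one possible approach is to compare the two word lists with difflib.SequenceMatcher. This is a minimal sketch, not a tested solution for full novels: the function name word_diff and the tokenising pattern are assumptions, and on very long texts the comparison may be slow.

```python
import re
from difflib import SequenceMatcher

def word_diff(text1, text2):
    """Return (removed, added): every differing word occurrence, in order."""
    words1 = re.findall(r"[A-Za-z']+", text1.lower())
    words2 = re.findall(r"[A-Za-z']+", text2.lower())
    # autojunk=False stops difflib from ignoring very frequent words
    matcher = SequenceMatcher(None, words1, words2, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed.extend(words1[i1:i2])   # only in the first text here
        if tag in ("insert", "replace"):
            added.extend(words2[j1:j2])     # only in the second text here
    return removed, added

removed, added = word_diff("the cat sat on the mat", "the dog sat on a mat")
print(removed)  # ['cat', 'the']
print(added)    # ['dog', 'a']
```

Note that 'the' is reported as removed even though it still appears elsewhere in the second text, because the comparison is positional rather than based on vocabulary.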


python

difflib