Find unique sentences in two files
I have two files and I am trying to print the sentences that are unique to each of them. For this I am using difflib in Python.
import difflib
text = 'Physics is one of the oldest academic disciplines. Perhaps the oldest through its inclusion of astronomy. Over the last two millennia. Physics was a part of natural philosophy along with chemistry.'
text1 = 'Physics is one of the oldest academic disciplines. Physics was a part of natural philosophy along with chemistry. Quantum chemistry is a branch of chemistry.'
differ = difflib.Differ()
diff = differ.compare(text, text1)
print('\n'.join(diff))
It does not give me the output I want. Instead, it gives me this:
P
h
y
s
i
c
s
i
s
o
n
e
o
f
t
h
e
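The output I want is only the sentences that are unique to each file:
text = Perhaps the oldest through its inclusion of astronomy. Over the last two millennia.
text1 = Quantum chemistry is a branch of chemistry.
Also, difflib.Differ seems to work line by line rather than sentence by sentence. Any suggestions, please? How can I do this?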
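First, Differ().compare() does indeed compare lines, not sentences.
Second, it actually compares sequences, such as lists of strings. However, you are passing two strings, not two lists of strings. Since a string is also a sequence (of characters), Differ().compare() compares individual characters in your case.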
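To make the difference concrete, here is a minimal sketch with two short made-up strings (`a` and `b` are illustrative values, not taken from the question). Passing the strings directly produces one diff entry per character, while wrapping them in lists produces one entry per list element.

```python
import difflib

differ = difflib.Differ()

# Illustrative inputs, not taken from the question.
a = "ab"
b = "ac"

# A string is a sequence of characters, so each character is compared.
char_diff = list(differ.compare(a, b))
print(char_diff)

# A list of strings is compared element by element.
list_diff = list(differ.compare([a], [b]))
print(list_diff)
```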
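If you want to compare the files sentence by sentence, you must prepare two lists of sentences. You can use nltk.sent_tokenize(text) to split a string into sentences.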
import difflib
import nltk  # sent_tokenize needs NLTK's Punkt tokenizer data (installed via nltk.download)
differ = difflib.Differ()
diff = differ.compare(nltk.sent_tokenize(text), nltk.sent_tokenize(text1))
print('\n'.join(diff))
#   Physics is one of the oldest academic disciplines.
# - Perhaps the oldest through its inclusion of astronomy.
# - Over the last two millennia.
#   Physics was a part of natural philosophy along with chemistry.
# + Quantum chemistry is a branch of chemistry.
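The diff above still lists the sentences shared by both texts. To get only the unique sentences the question asks for, one option is to keep the entries that Differ prefixes with "- " (present only in the first sequence) or "+ " (present only in the second). The sketch below assumes the sentences have already been split into lists; they are written out by hand here so the example runs without NLTK, and the helper name `unique_sentences` is hypothetical.

```python
import difflib

def unique_sentences(sents_a, sents_b):
    """Return (only_in_a, only_in_b) based on the prefixes Differ emits."""
    only_a, only_b = [], []
    for entry in difflib.Differ().compare(sents_a, sents_b):
        if entry.startswith("- "):
            only_a.append(entry[2:])
        elif entry.startswith("+ "):
            only_b.append(entry[2:])
        # Entries starting with "  " are common to both sequences and
        # entries starting with "? " are intraline hints; both are skipped.
    return only_a, only_b

# Sentence lists for the two texts from the question, split by hand.
sents = [
    "Physics is one of the oldest academic disciplines.",
    "Perhaps the oldest through its inclusion of astronomy.",
    "Over the last two millennia.",
    "Physics was a part of natural philosophy along with chemistry.",
]
sents1 = [
    "Physics is one of the oldest academic disciplines.",
    "Physics was a part of natural philosophy along with chemistry.",
    "Quantum chemistry is a branch of chemistry.",
]

only_text, only_text1 = unique_sentences(sents, sents1)
print("text =", " ".join(only_text))
print("text1 =", " ".join(only_text1))
```

Note that Differ is sensitive to position: a sentence that appears in both texts but in a different place may be reported as both removed and added.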
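As DZinoviev said above, you are passing strings to a function that expects lists. You don't need NLTK; instead, you can turn each string into a list of sentences by splitting it on periods.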
import difflib
text1 = """Physics is one of the oldest academic disciplines. Perhaps the oldest through its inclusion of astronomy. Over the last two millennia. Physics was a part of natural philosophy along with chemistry."""
text2 = """Physics is one of the oldest academic disciplines. Physics was a part of natural philosophy along with chemistry. Quantum chemistry is a branch of chemistry."""
# Split on periods, strip surrounding whitespace, and drop empty pieces.
list1 = [s.strip() for s in text1.split(".") if s.strip()]
list2 = [s.strip() for s in text2.split(".") if s.strip()]
differ = difflib.Differ()
diff = differ.compare(list1, list2)
print("\n".join(diff))
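Since the question is about two files, here is a rough sketch of the same period-splitting idea applied to files. It swaps difflib for a plain set comparison, which ignores where a sentence sits in the file, so treat it as an alternative rather than an equivalent. The helper names `split_sentences` and `unique_to_each` and the file names are hypothetical, and the temporary files exist only to keep the demo self-contained.

```python
import tempfile
from pathlib import Path

def split_sentences(text):
    """Naive split on periods; strips whitespace and drops empty pieces."""
    return [s.strip() for s in text.split(".") if s.strip()]

def unique_to_each(path_a, path_b):
    """Return sentences found only in file A and only in file B, in file order."""
    sents_a = split_sentences(Path(path_a).read_text(encoding="utf-8"))
    sents_b = split_sentences(Path(path_b).read_text(encoding="utf-8"))
    set_a, set_b = set(sents_a), set(sents_b)
    only_a = [s for s in sents_a if s not in set_b]
    only_b = [s for s in sents_b if s not in set_a]
    return only_a, only_b

# Demo: write the two texts from the question to temporary files.
with tempfile.TemporaryDirectory() as tmp:
    file_a = Path(tmp) / "a.txt"
    file_b = Path(tmp) / "b.txt"
    file_a.write_text(
        "Physics is one of the oldest academic disciplines. "
        "Perhaps the oldest through its inclusion of astronomy. "
        "Over the last two millennia. "
        "Physics was a part of natural philosophy along with chemistry.",
        encoding="utf-8",
    )
    file_b.write_text(
        "Physics is one of the oldest academic disciplines. "
        "Physics was a part of natural philosophy along with chemistry. "
        "Quantum chemistry is a branch of chemistry.",
        encoding="utf-8",
    )
    only_a, only_b = unique_to_each(file_a, file_b)

print(only_a)
print(only_b)
```

Splitting on "." is crude: it also breaks on abbreviations and decimal numbers, and it drops the trailing period from each sentence. For real prose a sentence tokenizer is likely the safer choice.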