Semantic Similarity between Sentences in a Text
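Using the material from here and an earlier forum page, I have written some code for a program that should automatically calculate the semantic similarity between consecutive sentences across a whole text. Here it is:
The first part of the code is copied and pasted from the first link; I then added the following after line 245. I removed everything superfluous after line 245.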
with open ("File_Name", "r") as sentence_file:
    while x and y:
        x = sentence_file.readline()
        y = sentence_file.readline()
        similarity(x, y, true)
        #boolean set to false or true
        x = y
        y = sentence_file.readline()
My text file is formatted as follows:
Red alcoholic drink. Fresh orange juice. An English dictionary. The Yellow Wallpaper.
In the end I want to display every pair of adjacent sentences together with its similarity, like this:
["Red alcoholic drink.", "Fresh orange juice.", 0.611],
["Fresh orange juice.", "An English dictionary.", 0.0]
["An English dictionary.", "The Yellow Wallpaper.", 0.5]
if norm(vec_1) > 0 and if norm(vec_2) > 0:
    return np.dot(vec_1, vec_2.T) / (np.linalg.norm(vec_1)* np.linalg.norm(vec_2))
elif norm(vec_1) < 0 and if norm(vec_2) < 0:
    ???Move On???
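This should work. There are a few things to note in the comments. Basically, you can loop through the lines in the file and store the results as you go. One way to work through the lines in pairs is to set up an "infinite loop" and check the last line we read to see whether we have hit the end (readline() returns an empty string, '', at the end of the file, not None).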
# You'll probably need the file extension (.txt or whatever) in open as well
with open("File_Name.txt", "r") as sentence_file:
    # Initialize a list to hold the results
    results = []
    # Read the first line before entering the loop
    x = sentence_file.readline().rstrip('\n')
    # Loop until we hit the end of the file
    while True:
        # Read the next line
        y = sentence_file.readline()
        # readline() returns '' at the end of the file, if so, we're done
        if not y:
            # Break out of the infinite loop
            break
        # The .rstrip('\n') removes the newline character from the line
        y = y.rstrip('\n')
        try:
            # Calculate your similarity value
            similarity_value = similarity(x, y, True)
            # Add the two lines and similarity value to the results list
            results.append([x, y, similarity_value])
        except Exception:
            print("Error when parsing lines:\n{}\n{}\n".format(x, y))
        # The second line becomes the first line of the next pair
        x = y

# Loop through the pairs in the results list and print them
for pair in results:
    print(pair)
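The sample file in the question puts several sentences on the same line, so reading line by line may not line up with sentence boundaries. Here is a minimal sketch of an alternative, assuming a sentence ends in '.', '!' or '?' followed by whitespace: read the whole text, split it into sentences, and pair each sentence with the next one using zip. The names `split_sentences`, `consecutive_pairs`, and the `similarity_fn` parameter are made up for illustration; you would pass in your own `similarity` function.

```python
import re


def split_sentences(text):
    """Split text into sentences (assumes '.', '!' or '?' ends a sentence)."""
    # Collapse line breaks and repeated spaces so wrapped sentences are rejoined
    normalized = " ".join(text.split())
    # Split on whitespace that follows sentence-ending punctuation
    parts = re.split(r"(?<=[.!?])\s+", normalized)
    return [p for p in parts if p]


def consecutive_pairs(sentences, similarity_fn):
    """Return [first, second, score] for every pair of adjacent sentences."""
    results = []
    for first, second in zip(sentences, sentences[1:]):
        results.append([first, second, similarity_fn(first, second)])
    return results


def dummy_similarity(a, b):
    """Placeholder scorer for demonstration only; not a real similarity measure."""
    return 0.0


if __name__ == "__main__":
    sample = (
        "Red alcoholic drink. Fresh orange juice. "
        "An English dictionary. The Yellow Wallpaper."
    )
    # Swap dummy_similarity for something like: lambda a, b: similarity(a, b, True)
    for row in consecutive_pairs(split_sentences(sample), dummy_similarity):
        print(row)
```

A simple regex like this will mis-split abbreviations such as "Dr." or "e.g.", so for real text a proper sentence tokenizer is probably a better fit.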
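Edit: Regarding the problems you are getting from similarity(), if you simply want to ignore the pairs of lines that cause those errors (without digging into the source, I really don't know what is going on), you can wrap the call to similarity() in a try/except.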
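If the errors come from the cosine-similarity step dividing by a zero norm (an assumption on my part, since the failing line was not shown, but it is what the norm checks in the question seem to be aimed at), another option is to guard the division explicitly. A norm is never negative, so the case to handle is a norm equal to zero, not one less than zero. The sketch below uses plain Python lists and `math` so that it runs on its own; `safe_cosine_similarity` is a hypothetical helper, and the same check could go in front of the `np.dot(...) / (np.linalg.norm(...) * np.linalg.norm(...))` expression.

```python
import math


def safe_cosine_similarity(vec_1, vec_2):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    norm_1 = math.sqrt(sum(a * a for a in vec_1))
    norm_2 = math.sqrt(sum(b * b for b in vec_2))
    # A zero norm would cause a division by zero, so treat the pair as unrelated
    if norm_1 == 0 or norm_2 == 0:
        return 0.0
    dot = sum(a * b for a, b in zip(vec_1, vec_2))
    return dot / (norm_1 * norm_2)
```

Returning 0.0 for an all-zero vector is a design choice; returning None and skipping that pair in the results would work just as well.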