What's the best algorithm to check the similarity percentage among the submitted assignments?

I am planning to build a similarity checker as my final-year project. In this project, I plan to check the similarity percentage among the submitted assignments, i.e. offline.

For example:

  1. When the first student submits an assignment, it is not checked against any other assignment.

  2. When the second student submits an assignment, it is checked against the first assignment.

  3. When the third student submits an assignment, it is checked against the first and second submitted assignments.

  4. Similarly, if there are 35 students, then the 36th submitted assignment is checked against the other 35 submitted assignments.

Now, the question is how to compare two assignments. In this case, the comparison is the similarity between the text in the documents. I want a result like this:

I just want to display the percentage of similar sentences and what those sentences are.

What I have done:

I studied different algorithms, like tf-idf and the cosine similarity algorithm, but I was unable to interpret the results of these algorithms correctly.

So, I want to know which algorithm is best in this situation, and I want to know how this is done. Are there any references (websites, blogs) that could help me?

It depends on the kind of comparison result that the algorithm you use returns.

For example, the following function compares a list of document contents and returns a dictionary of document pairs mapped to the list of word sequences they have in common. It does not differentiate word sequences that are contained within each other, because a longer and a shorter overlapping word sequence may occur a different number of times.

import re
from itertools import combinations

def wordList(document):
    # Normalize a document into a list of lower-case words
    # (\w already matches digits, so a separate \d+ alternative is not needed).
    return re.findall(r"\w+", document.lower())

def compareDocs(documents, minSize=2, maxSize=25):
    result = dict() # { (documentIndex,documentIndex) : [CommonExpressions] }

    def tallyDuplicates(expressionDocs):
        # Transpose {expression: set of document indexes} into
        # {(docIndex,docIndex): [expressions]} for every pair of documents.
        for expression,docIndexes in expressionDocs.items():
            for docIndex,otherDoc in combinations(docIndexes,2):
                result.setdefault((docIndex,otherDoc),[]).append(expression)

    documentWords    = [ wordList(document) for document in documents ]
    wordCounts       = [ len(words) for words in documentWords ]
    expressionRanges = dict() # {expression: [(docIndex, position of first word), ...]}
    for docIndex,words in enumerate(documentWords):
        for wordIndex,word in enumerate(words):
            expressionRanges.setdefault(word,[]).append((docIndex,wordIndex))

    size = 1
    while size == 1 or expressionDocs and size <= maxSize:
        # Extend every expression by one word, then keep only the extensions
        # that are still present in more than one document.
        nextExpressions   = dict()
        expressionDocs    = dict()
        for expression,starts in expressionRanges.items():
            for docIndex,startIndex in starts:
                endIndex = startIndex+size
                if endIndex >= wordCounts[docIndex]: continue
                extended = " ".join([expression,documentWords[docIndex][endIndex]])
                expressionDocs.setdefault(extended,set()).add(docIndex)
                nextExpressions.setdefault(extended,[]).append( (docIndex,startIndex) )
        expressionDocs   = { expression:docIndexes for expression,docIndexes in expressionDocs.items() if len(docIndexes) > 1 }
        expressionRanges = { expression:ranges for expression,ranges in nextExpressions.items() if expression in expressionDocs }
        if size >= minSize: tallyDuplicates(expressionDocs)
        size += 1

    return result

Based on these comparison results, you need to analyze the content of each document pair to count the words that are covered by the common expressions (word sequences). Given that an expression contains several words, each expression will account for more than one word in the similarity ratio: words-in-matching-expressions / words-in-document.

[EDIT] I placed the result analysis in a function of its own and added an HTML output to highlight the expressions in the document texts:

def analyzeComparison(doc1,doc2,expressions):
    words1  = wordList(doc1)
    words2  = wordList(doc2)
    normalizedDoc1 = " ".join(words1)
    normalizedDoc2 = " ".join(words2)
    # Process longer expressions first so that their words are not
    # counted again by the shorter expressions contained within them.
    expressions.sort(key=lambda s:len(s),reverse=True)
    matches = []
    for expression in expressions:
        count1 = len(re.findall(expression,normalizedDoc1))
        count2 = len(re.findall(expression,normalizedDoc2))
        commonCount = min(count1,count2)
        if commonCount == 0: continue
        # Mask the matched expression with a placeholder so its words
        # are excluded from subsequent (shorter) matches.
        expressionId = "<#"+str(len(matches))+"#>"
        normalizedDoc1 = re.sub(expression,expressionId,normalizedDoc1,commonCount)
        normalizedDoc2 = re.sub(expression,expressionId,normalizedDoc2,commonCount)
        matches.append((expression,commonCount))
    # Similarity ratio: words covered by common expressions / words in document.
    commonWords = sum( count*len(expr.split(" ")) for expr,count in matches)
    percent1 = 100*commonWords/len(words1)
    percent2 = 100*commonWords/len(words2)
    # Re-inject the masked expressions with HTML highlighting tags.
    for index,match in enumerate(matches):
        expressionId = "<#"+str(index)+"#>"
        expressionHighlight = "<span style='background-color:yellow'>"+match[0]+"</span>"
        normalizedDoc1 = re.sub(expressionId,expressionHighlight,normalizedDoc1)
        normalizedDoc2 = re.sub(expressionId,expressionHighlight,normalizedDoc2)
    return (percent1,percent2,matches,normalizedDoc1,normalizedDoc2)

For example, if you have the following 3 documents (which you would normally read from files):

doc1 = """
Plagiarism, one of the main scourges of the academic life, is quite an easy concept, but, nonetheless, harmful. In short, to plagiarize means to steal someone else’s idea or part of work and use it as your own. But why exactly it is considered to be so bad and immoral? And it is really considered immoral and a serious offence. In case it is discovered, it may lead to very unpleasant consequences; the higher the position of the offender is, the more unpleasant they are.
copy and paste
There are two major kinds of harm plagiarism causes. First, it is something as simple as stealing and lying – you just steal someone else’s work and trick somebody into believing it was you who had written it, which is as immoral as any other kind of theft is. It means that somebody had actually spent time and effort in order to create something, while you did nothing but ripping it off and submitting it.
copy and paste function
Second, it is a crime you commit against yourself. If you study at an educational institution, there are certain tasks copy and paste you are given in order to ensure that you learn something. When you resort to plagiarism, you undo all these efforts for, instead of actually doing something and understanding it in process, you use someone else’s work and the certain amount of experience that you were supposed to get just misses you.
"""
doc2 = """
Plagiarism has always been a problem in schools. However, with the invention of the internet,copy and paste  it has made plagiarism even more of a challenge. Plagiarism.org, “estimates that nearly 30 percent of all students may be plagiarizing on all their written assignments and that the use of the Internet has made plagiarism much worse.” [1] The act of plagiarism can be defined as, “To steal and pass off (the ideas or words of another) as one’s own, to use (another’s production) without crediting the source, to commit literary theft, to present as new and original as idea or product derived from an existing source”2. Plagiarism has become such a concern for colleges that almost all the sites on this topic are sponsored by schools. The three main topics with plagiarism are the copy and paste function, “paper mills” and the ways that can be used to prevent students from doing this. 
it is quite an easy concept
The first major concern with the internet would be the copy and paste function. Wittenberg copy and paste function lists that “Widespread availability of the internet and increased access to full text databases has made cut and paste plagiarism very easy”.3 While the function is actually very nice to have, people are using it the wrong way. Instead of just using it to copy quotes from websites, than pasting it to their word document and giving it the proper credit, people are passing it off as their own. This is where the problem occurs.
"""

doc3 = """
Plagiarism has always been a problem in schools. However, it is something as simple as stealing and lying
it is a crime you. some other text
"""

You would first call compareDocs() on the list of document contents and, for each document pair returned by the function, use analyzeComparison() to obtain the percentages, counts and highlights:

documents   = [doc1,doc2,doc3]
comparisons = compareDocs( documents )
for documentPair,expressions in comparisons.items():
    docIndex1,docIndex2 = documentPair
    doc1 = documents[docIndex1]
    doc2 = documents[docIndex2]        
    pct1,pct2,matches,doc1,doc2 = analyzeComparison(doc1,doc2,expressions)

    # print result on console ...
    print(int(pct1//1)," % of document #",docIndex1," is same as document #", docIndex2)
    print(int(pct2//1)," % of document #",docIndex2," is same as document #", docIndex1)
    print("Common expressions are:")
    for expression,count in matches:
        print( "    ",expression,"(",count,"times )")
    print("")

    # output comparison result to an HTML file...
    htmlPage = "<html><body><table border='1'>"
    htmlPage += "<tr><th>#" + str(docIndex1) + ": Source " + str(int(pct1//1)) + "% duplicate</th>"
    htmlPage += "<th>#" + str(docIndex2) + ": Target  " + str(int(pct2//1)) + "% duplicate</th></tr>"
    htmlPage += "<tr><td width='50%' valign='top'>" + doc1 + "</td><td valign='top'>" + doc2 + "</td></tr>"
    htmlPage +="</table></body></html>"        
    fileName = str(docIndex1)+"-"+str(docIndex2)+".html"
    with open(fileName,"w") as f: f.write(htmlPage)       

This prints the following information and creates a bunch of HTML files that look similar to your desired result:

3.0  % of document # 1  is same as document # 2
34.0  % of document # 2  is same as document # 1
Common expressions are:
     plagiarism has always been a problem in schools however ( 1 times )

6.0  % of document # 0  is same as document # 1
5.0  % of document # 1  is same as document # 0
Common expressions are:
     is quite an easy concept ( 1 times )
     copy and paste function ( 1 times )
     copy and paste ( 2 times )

5.0  % of document # 0  is same as document # 2
53.0  % of document # 2  is same as document # 0
Common expressions are:
     it is something as simple as stealing and lying ( 1 times )
     it is a crime you ( 1 times )

To summarize, the whole process goes like this:

1) Run a comparison function that identifies the expressions (word sequences) that each pair of documents has in common.

  • Given a list of document texts, the compareDocs function does this in a single call.
  • If you use a different comparison algorithm, it may be designed to only perform a comparison between two documents or, in the case of classifiers, it may merely return a list of word/term frequencies for one document.
  • Depending on the algorithm's inputs and outputs, you will need to wrap more or less of the logic in your own code to obtain the desired result.
  • What you should be looking for at this stage is a list of common expressions (word sequences) between the various pairs of documents.
  • If you are working with an algorithm that only extracts term frequencies (e.g. tf-idf), you will be facing a very complex problem of cross-matching term frequencies between document pairs.

    For example, for a given document, a classifier may return the frequencies: "cut"=25 times, "and"=97 times, "paste"=31 times. This gives you no indication that the expression "cut and paste" is actually present in the document or how many times it occurs in it. The document could be talking about toothpaste and never have these 3 words in sequence. Comparing documents based only on word frequencies will find a high correlation between essays on the same subject, but that does not mean there was plagiarism (the short sketch below illustrates the problem).

    Furthermore, even if your classifier managed to return all the expressions of two words or more, each document would produce close to w*2^n expressions, where w is the number of words in the document and n is the maximum length of expressions in number of words (a maximum that you will have to decide on). This can easily reach millions per document, which you would then need to match against the millions in the other documents. This may not be an issue if you have Google's resources, but it is for the rest of us.
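
To illustrate the word-order problem, here is a minimal sketch (reusing the wordList helper from above; it is not part of the approach itself) of how a pure term-frequency comparison, such as the cosine similarity mentioned in the question, can rate two texts as identical even though they share no word sequence at all:

import math
from collections import Counter

def cosineSimilarity(doc1, doc2):
    # Compare raw term-frequency vectors; word order is ignored entirely.
    freq1, freq2 = Counter(wordList(doc1)), Counter(wordList(doc2))
    dot   = sum(freq1[word]*freq2[word] for word in freq1)
    norm1 = math.sqrt(sum(count*count for count in freq1.values()))
    norm2 = math.sqrt(sum(count*count for count in freq2.values()))
    return dot/(norm1*norm2)

# Same words in a completely different order: the similarity is 1.0,
# yet compareDocs() would find no common expression between the two.
print(cosineSimilarity("paste cut toothpaste and", "cut and paste toothpaste")) # 1.0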

2) To measure the percentage of similarity between documents, you need to find the common expressions on both sides and measure how many of each document's words are covered by these common expressions.

  • Identifying the position of the expressions is a simple text-search process
  • You need, however, to avoid counting any given word more than once, because the denominator of your percentage is the number of words in the document (and you don't want to overestimate or exceed 100%)
  • This can be achieved by processing the longer expressions first and removing them from the text (or masking them) so that their words are not counted again by the subsequent (shorter) expressions
  • The analyzeComparison() function masks the expressions found in the text by replacing them with a placeholder that is later used to re-inject the expression with highlighting tags (HTML); a small worked example follows this list.
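
As a small worked example of the masking (a sketch using the two functions above; the short strings are made up for illustration):

# The 5-word expression is counted and masked first, so the shorter
# "copy and paste" contained inside it is not counted a second time.
docA = "you just copy and paste the text"
docB = "never copy and paste the text you found"
pct1, pct2, matches, *_ = analyzeComparison(docA, docB, ["copy and paste the text", "copy and paste"])
print(matches)     # [('copy and paste the text', 1)]
print(round(pct1)) # 5 matching words out of 7 -> 71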

3) Use the document comparison analysis in your own program. This depends on how you want to present the information and on whether you need to store the results (up to you). You could, for example, decide on a threshold of similarity and only output the document pairs that are suspicious. That threshold could be based on the percentage, the number of common expressions, the maximum or average length of the common expressions, etc.
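
For instance, here is a minimal sketch of such a filter built on top of the two functions above (the 50% threshold and the suspiciousPairs name are arbitrary choices for this example):

SUSPICION_THRESHOLD = 50 # percent; an arbitrary cutoff for this sketch

def suspiciousPairs(documents):
    # Keep only the document pairs where either side exceeds the threshold.
    flagged = []
    for (index1, index2), expressions in compareDocs(documents).items():
        pct1, pct2, *_ = analyzeComparison(documents[index1], documents[index2], expressions)
        if max(pct1, pct2) >= SUSPICION_THRESHOLD:
            flagged.append((index1, index2, max(pct1, pct2)))
    return flagged

print(suspiciousPairs([doc1, doc2, doc3])) # with the sample documents: [(0, 2, 53.8...)]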

[EDIT2] How compareDocs works ...

  • The function creates a dictionary of expressions, mapping them to the position of their first word in each document. This is stored in the expressionRanges variable.

    • Example: { "copy and paste": [ (0,57), (1,7), (1,32) ] ... }
    • This means that the 3-word expression "copy and paste" is found in document #0 at position 57 (the position of the word "copy") and in document #1 at positions 7 and 32.
  • The expression dictionary (expressionRanges) starts out with one-word expressions and then uses these to obtain 2-word expressions, then 3-word expressions, and so on.

  • Before moving on to the next expression size, the expression dictionary is cleaned up by removing all the expressions that are found in only one document.

    • size 1 ==> { "copy": [ (0,57), (0,72), (1,7), (1,32), (1,92) ] ... }
    • cleanup ...
    • size 2 ==> { "copy and": [ (0,57), (1,7), (1,32), (1,92) ] ... }
    • cleanup ...
    • size 3 ==> { "copy and paste": [ (0,57), (1,7), (1,32) ] ... }
  • This cleanup is achieved by building a separate dictionary (expressionDocs) that maps expressions to the set of document indexes that contain them. Expressions that end up with only one document in their set are removed from both dictionaries.

  • The expressionDocs dictionary is also used to produce the output of the function. Expressions that appear in more than one document are mapped to document pairs (combinations of 2) to form a dictionary containing { (document pair): [list of expressions] }, which is the result of the function.
  • The tallyDuplicates sub-function performs this transposition from {expression: [list of document indexes]} to {(document pair): [list of expressions]} by adding the expression to the list for every combination of 2 of the document indexes, as illustrated below.
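
A condensed illustration of that transposition (with made-up data):

from itertools import combinations

# {expression: set of document indexes containing it} ...
expressionDocs = { "copy and paste": {0, 1, 3}, "an easy concept": {0, 1} }
# ... transposed into {(document pair): [list of expressions]}
result = {}
for expression, docIndexes in expressionDocs.items():
    for pair in combinations(sorted(docIndexes), 2):
        result.setdefault(pair, []).append(expression)
print(result)
# {(0, 1): ['copy and paste', 'an easy concept'], (0, 3): ['copy and paste'], (1, 3): ['copy and paste']}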

The successive refinements of expressionRanges considerably reduce the number of word matches to perform. Each pass merely adds one word to each expression, and the dictionary is cleaned up right away before moving to the next expression size. The expressionRanges dictionary starts out with as many entries as there are distinct words in the documents, but it shrinks rapidly to a much smaller size (unless the documents are practically identical).

One downside of this approach is that documents that have a large number of very long matching expressions will cause the dictionary to grow instead of shrinking, and the while loop will run much longer. The worst case is two identical documents. This can be avoided by introducing a maximum expression size that makes the loop stop earlier. If you set the maximum size to 25, for example, the function will only report a 25-word common expression and a 5-word one instead of a 30-word expression, if there is one. That may be an acceptable compromise to avoid the very long processing time that nearly identical documents would require. As far as the percentage of similarity is concerned, the difference will be minimal (i.e. one common word may be missed if there is a 26-word common expression and the maximum is 25, but a 27-word expression would come out as a 25-word and a 2-word match).
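
For example, a call that caps the expression size (the value 10 is arbitrary):

# Capping the size bounds the work per pass on near-identical documents;
# one very long overlap then surfaces as several shorter back-to-back matches.
comparisons = compareDocs(documents, minSize=2, maxSize=10)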