How to search for occurrences of a word/phrase within a webpage?

Question

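My end goal is to create a primitive plagiarism checker for a given text file. To do this, I plan to first split the data into sentences, search for each sentence on Google, and finally search each of the first few URLs returned by Google for occurrences of that sentence. This last step is the one I'm having trouble with.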

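When looping over the URLs in a for loop, I first read the contents of each URL with urlopen() from urllib.request, but I'm not sure what to do after that. The code is attached below, with some of the solutions I tried commented out. I have imported the googlesearch, urllib.request, and re libraries.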

from googlesearch import search
from urllib.request import urlopen
from re import findall

def plagCheck():
    global inpFile

    with open(inpFile) as data:
        sentences = data.read().split(".")

    for sentence in sentences:
        for url in search(sentence, tld='com', lang='en', num=5, start=0, stop=5, pause=2.0):
            content = urlopen(url).read()

            # if sentence in content:
            #     print("yes")
            # else:
            #     print("no")

            # matches = findall(sentence, content)
            # if len(matches) == 0:
            #     print("no")
            # else:
            #     print("yes")

Answer 1

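If I understand your code correctly, you now have two Python lists of sentences. It looks like you split them on periods only. That will produce fairly long run-on sentences wherever other punctuation (?, !) ends a sentence.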
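
For the second of those lists (the sentences found on the page), the bytes returned by urlopen(url).read() would first have to be decoded and stripped of markup. Comparing a str against raw bytes with `in` or findall raises a TypeError, which may be why the commented-out attempts did not work. Below is a minimal sketch using only the standard library; the names TextExtractor and page_text are hypothetical, and UTF-8 is an assumed encoding that will not be right for every page.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a script or style element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def page_text(raw_bytes, encoding="utf-8"):
    """Decode downloaded bytes and return the page text with whitespace collapsed."""
    html = raw_bytes.decode(encoding, errors="replace")  # assumed encoding
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Stand-in for urlopen(url).read(), so the sketch runs without network access.
sample = (b"<html><body><p>Red green fox.</p>"
          b"<script>var x = 1;</script><p>Over the wall!</p></body></html>")
text = page_text(sample)
```

In the loop from the question this would be used as page_text(urlopen(url).read()), and the resulting string can then be split into sentences the same way as the input file.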

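I would consider using a similarity-checking library. difflib has a suitable class, SequenceMatcher. Then decide on a percentage at which to flag a match, e.g. flag anything that is at least 40% the same. This reduces the amount of content you have to check manually.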
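
A minimal sketch of that comparison with difflib.SequenceMatcher follows. The helper names similarity and best_match, the normalisation step, and the sample sentences are assumptions for illustration; the 0.4 threshold is the 40% figure mentioned above.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a ratio between 0 and 1; 1.0 means the normalised strings are identical."""
    # Lowercase and collapse whitespace so trivial differences do not lower the score.
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def best_match(sentence, page_sentences):
    """Return (ratio, matched_sentence) for the closest sentence on the page."""
    best = (0.0, "")
    for candidate in page_sentences:
        score = similarity(sentence, candidate)
        if score > best[0]:
            best = (score, candidate)
    return best


THRESHOLD = 0.4  # flag anything that is at least 40% similar

# Made-up page sentences standing in for the text of a downloaded page.
page = [
    "The quick brown fox jumps over the lazy dog",
    "Something unrelated entirely",
]
score, match = best_match("The quick brown fox jumped over the lazy dog", page)
flagged = score >= THRESHOLD
```

Comparing sentence against sentence, rather than a sentence against the whole page, keeps the ratio meaningful, since ratio() drops quickly when the two strings differ greatly in length.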

Increase the number of punctuation marks you split on. That might look something like this:

with open(inpFile) as data:
    # Replace all !, ? with . so that a single split covers all three
    sentences = data.read().replace("!", ".").replace("?", ".").split(".")
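
The chained replace calls can also be written as a single regular-expression split, which makes it easy to drop the empty pieces that a trailing period leaves behind. This is only a sketch; split_sentences is a hypothetical helper name, and like the version above it will still break on abbreviations and decimal numbers.

```python
import re


def split_sentences(text):
    """Split on runs of '.', '!' or '?' and drop empty or whitespace-only pieces."""
    parts = re.split(r"[.!?]+", text)
    return [part.strip() for part in parts if part.strip()]


sentences = split_sentences("Is this copied? Yes! It is.")
print(sentences)  # ['Is this copied', 'Yes', 'It is']
```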

I would then write the results for this file out to a new output file, like so:

# Loop over each sentence and run it through Google.
# Compare the two sentences with the sequence matcher mentioned above (difflib).
# Add them to a dictionary with the percent, url, and sentence in question.
# Sample result
results = {
    "sentence_num": 0,
    "percent": 0.8,
    "url": "the google url found on",
    "original_sentence": "Red green fox over the wall",
}
outputFile = "results.html"  # hypothetical output path
outputStr = "<html>"
# Loop over the results and format the dictionary in a way that you can read,
# ideally an HTML table with columns representing the keys above.
outputStr += "<table>"  # etc
# Open in write mode, with a handle name that does not shadow `results`.
with open(outputFile, "w") as out:
    out.write(outputStr)
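
The "etc" part could be filled in roughly as follows. This is a sketch under assumptions: render_table and COLUMNS are hypothetical names, the results are taken to be a list of dictionaries shaped like the sample above, and the example URL is made up. html.escape keeps URLs and sentence text from breaking the markup.

```python
from html import escape

# Column order for the table; these are the keys of the sample result.
COLUMNS = ["sentence_num", "percent", "url", "original_sentence"]


def render_table(results):
    """Return a complete HTML document with one table row per result dictionary."""
    header = "".join(f"<th>{escape(col)}</th>" for col in COLUMNS)
    rows = [f"<tr>{header}</tr>"]
    for result in results:
        cells = "".join(f"<td>{escape(str(result[col]))}</td>" for col in COLUMNS)
        rows.append(f"<tr>{cells}</tr>")
    return "<html><body><table>" + "".join(rows) + "</table></body></html>"


results = [
    {
        "sentence_num": 0,
        "percent": 0.8,
        "url": "https://example.com/page?a=1&b=2",  # made-up URL
        "original_sentence": "Red green fox over the wall",
    }
]
html_out = render_table(results)
```

The returned string can then be written to the output file with a handle opened in "w" mode.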

You could even highlight the table rows based on the percentage, e.g.:

80% and above: red
61-79%: orange
40-60%: yellow
39% and below: green
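
Those bands translate into a small lookup. In this sketch row_color and styled_row are hypothetical names, the score is assumed to be a ratio between 0 and 1 such as SequenceMatcher returns, and it is rounded to a whole percent first so that values like 79.6% do not fall between two bands.

```python
def row_color(ratio):
    """Map a similarity ratio (0 to 1) to the highlight colour for its band."""
    percent = round(ratio * 100)  # whole percent, so every value lands in a band
    if percent >= 80:
        return "red"
    if percent >= 61:
        return "orange"
    if percent >= 40:
        return "yellow"
    return "green"


def styled_row(cells_html, ratio):
    """Wrap already-rendered <td> cells in a row with the matching background colour."""
    return f'<tr style="background-color: {row_color(ratio)}">{cells_html}</tr>'


row = styled_row("<td>Red green fox over the wall</td>", 0.85)
```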

python

urllib

plagiarism-detection