Wikipedia scraping with wikipedia 1.4.0: How to skip bad results?

Question

I am using the wikipedia package with Python 2.7 to scrape articles, using words from a very large dataset.

The code looks like this:

for node_id in top_k:
    human_string = label_lines[node_id]
    score = predictions[0][node_id]
    print('%s (score = %.5f)' % (human_string, score))       


    # Wiki = wikipedia.page(human_string)
    # print (Wiki.content)

    lista.append(human_string)

for i in xrange(5):
    wiki = wikipedia.page(lista[i])
    print (wiki.content)
    a = wiki.content
    #appendowanie = '%s (score = %.5f)' % (human_string, score)
    # appendowanie = str(human_string)
    appendFile = open('/home/inception/wikipedia.txt', 'a')
    appendFile.write('\n\n'+str(i))
    appendFile.write(a.encode("utf-8"))
    appendFile.close()

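I want to take 5 items from the list, look each one up on Wikipedia, and save the full article to the wikipedia.txt file. Sometimes the Wikipedia lookup raises an error because a word in the list is unknown. Example error: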

Traceback (most recent call last):
  File "label_image.py", line 68, in <module>
    wiki = wikipedia.page(lista[i])
  File "/usr/local/lib/python2.7/dist-packages/wikipedia/wikipedia.py", line 276, in page
    return WikipediaPage(title, redirect=redirect, preload=preload)
  File "/usr/local/lib/python2.7/dist-packages/wikipedia/wikipedia.py", line 299, in __init__
    self.__load(redirect=redirect, preload=preload)
  File "/usr/local/lib/python2.7/dist-packages/wikipedia/wikipedia.py", line 345, in __load
    raise PageError(self.title)
wikipedia.exceptions.PageError: Page id "gracile crown blackbird" does not match any pages. Try another id!

gracile crown blackbird

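I want to change the script so that it ignores words the Wikipedia scraper cannot load. Is there a way to find all the bad words with one script?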

Answer 1

Use try-except like this:

try:
    <get the article>
except wikipedia.exceptions.PageError as e:
    if "does not match any pages" in str(e):
        <ignore the error>
    else:
        # Some other error jumped out, so do not ignore it:
        raise

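Now, this is not 100% reliable, because a page's title could theoretically be "does not match any pages".

So you really should inspect the exception captured in the variable e and look only at its message, or check whether it carries an error number or something similar.

That is because I think PageError() may be raised for more than just a page that was not found.

I don't know how the PageError() exception is built, but maybe: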

e.msg

or

e.message

should give you the actual message, so you would not have to check str(e).
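To also answer the second part of the question (finding all the bad words in one run), here is a minimal sketch of a loop that skips titles that fail to load and records them. The helper name `fetch_all` and its parameters are hypothetical, not part of the wikipedia package, and `fake_fetch` is only a stand-in so the sketch runs without network access.

```python
def fetch_all(titles, fetch, skip_errors):
    """Call fetch(title) for every title, skipping the ones that raise.

    skip_errors is an exception class (or tuple of classes) to ignore;
    any other exception still propagates.
    Returns (pages, skipped): pages maps title -> fetched result, and
    skipped is a list of (title, exception class name, message) tuples.
    """
    pages = {}
    skipped = []
    for title in titles:
        try:
            pages[title] = fetch(title)
        except skip_errors as e:
            # Record the bad title instead of aborting the whole run.
            skipped.append((title, type(e).__name__, str(e)))
    return pages, skipped


# Offline stand-in for wikipedia.page(title).content (an assumption for the demo).
known = {"blackbird": "Placeholder article text."}

def fake_fetch(title):
    if title not in known:
        raise LookupError(title)
    return known[title]

pages, skipped = fetch_all(["blackbird", "gracile crown blackbird"],
                           fake_fetch, LookupError)
print(sorted(pages))
print([title for title, _, _ in skipped])
```

With the real library, the call would presumably look like `fetch_all(lista[:5], lambda t: wikipedia.page(t).content, wikipedia.exceptions.PageError)`. Catching by exception class avoids matching on the message text, and passing a tuple that also contains `wikipedia.exceptions.DisambiguationError` would skip ambiguous titles as well.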

list

wikipedia-api

web-scraping

python-2.7