How can I get a Wikipedia article's text using Python 3 with Beautiful Soup?

Question

I have made this script in Python 3:

from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Mathematics"
response = simple_get(url)  # simple_get is the asker's own helper, not shown here
result = {}
result["url"] = url
if response is not None:
    html = BeautifulSoup(response, 'html.parser')
    title = html.select("#firstHeading")[0].text

As you can see, I can get the title of the article, but I cannot figure out how to get the text from "Mathematics (from Greek μά..." up to the contents table...

Answer 1

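Select the <p> tags. There are 52 elements. I am not sure whether you want the whole thing, but you can iterate through those tags and store the text however you like. I just chose to print each of them to show the output.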

import bs4
import requests


response = requests.get("https://en.wikipedia.org/wiki/Mathematics")

# requests.get() never returns None, so check the HTTP status instead
if response.ok:
    html = bs4.BeautifulSoup(response.text, 'html.parser')

    title = html.select("#firstHeading")[0].text
    paragraphs = html.select("p")
    for para in paragraphs:
        print(para.text)

    # just grab the text up to the contents table, as stated in the question
    intro = '\n'.join([para.text for para in paragraphs[0:5]])
    print(intro)

Answer 2

Use the wikipedia library:

import wikipedia
#print(wikipedia.summary("Mathematics"))
#wikipedia.search("Mathematics")
print(wikipedia.page("Mathematics").content)

Answer 3

There is a much easier way to get information from Wikipedia: the Wikipedia API.

There is this Python wrapper, which lets you do it in a few lines with zero HTML parsing:

import wikipediaapi

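# Note: newer releases of this wrapper also require a user_agent argument,
# e.g. wikipediaapi.Wikipedia(user_agent='...', language='en').
wiki_wiki = wikipediaapi.Wikipedia('en')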

page = wiki_wiki.page('Mathematics')
print(page.summary)

Prints:

Mathematics (from Greek μάθημα máthēma, "knowledge, study, learning") includes the study of such topics as quantity, structure, space, and change...(omitted intentionally)

And, in general, try to avoid screen scraping if there is a direct API available.

Answer 4

You can get the desired output using the lxml library, as shown below.

import requests
from lxml.html import fromstring

url = "https://en.wikipedia.org/wiki/Mathematics"

res = requests.get(url)
source = fromstring(res.content)
paragraph = '\n'.join([item.text_content() for item in source.xpath('//p[following::h2[2][span="History"]]')])
print(paragraph)

Using BeautifulSoup:

from bs4 import BeautifulSoup
import requests

res = requests.get("https://en.wikipedia.org/wiki/Mathematics")
soup = BeautifulSoup(res.text, 'html.parser')
for item in soup.find_all("p"):
    # stop at the first paragraph of the "History" section
    if item.text.startswith("The history"):
        break
    print(item.text)

Answer 5

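What you seem to want is the (HTML) page content without the surrounding navigation elements. As I described in this earlier answer from 2013, there are (at least) two ways to get it: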

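Probably the simplest way is to include the parameter action=render in the URL, as in https://en.wikipedia.org/wiki/Mathematics?action=render. This gives you just the content HTML and nothing else.
Alternatively, you can obtain the page content via the MediaWiki API, as in https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Mathematics.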

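The advantage of using the API is that it can also give you a lot of other information about the page that you may find useful. For example, if you would like a list of the interlanguage links normally shown in the page's sidebar, or the categories normally shown below the content area, you can get those from the API like this: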

https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Mathematics&prop=langlinks|categories

(To also get the page content in the same request, use prop=langlinks|categories|text.)
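As a rough sketch of how such a request URL could be assembled in Python, the snippet below uses only the standard library. The function name build_parse_url and its parameters are made up for illustration; the query parameters themselves (format, action, page, prop) are the ones shown in the URLs above.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def build_parse_url(page, props=("text",), fmt="json"):
    """Build an action=parse URL (hypothetical helper, not part of any library)."""
    params = {
        "format": fmt,
        "action": "parse",
        "page": page,
        # Several prop values are joined with "|"; urlencode escapes it as %7C.
        "prop": "|".join(props),
    }
    return API + "?" + urlencode(params)


url = build_parse_url("Mathematics", props=("langlinks", "categories", "text"))
print(url)
```

The resulting URL can be fetched with urllib.request.urlopen or requests.get; the "|" separator is percent-encoded, which the API accepts just like the literal character.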

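There are several Python libraries for using the MediaWiki API that can automate some of the details of using it, although the feature sets they support can vary. That said, using the API directly from your code, without a library in between, is perfectly possible too.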

Answer 6

To do this properly with a function, you can use the JSON API offered by Wikipedia:

from urllib.request import urlopen
from urllib.parse import urlencode
from json import loads


def getJSON(page):
    params = urlencode({
        'format': 'json',
        'action': 'parse',
        'prop': 'text',
        'redirects' : 'true',
        'page': page})
    API = "https://en.wikipedia.org/w/api.php"
    response = urlopen(API + "?" + params)
    return response.read().decode('utf-8')


def getRawPage(page):
    parsed = loads(getJSON(page))
    try:
        title = parsed['parse']['title']
        content = parsed['parse']['text']['*']
        return title, content
    except KeyError:
        # The page doesn't exist
        return None, None

title, content = getRawPage("Mathematics")

Then you can parse the returned HTML with any library you like to extract what you need :)
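For instance, here is a minimal sketch that pulls out the introductory paragraphs (everything before the first <h2>) with the standard-library html.parser, so no extra install is needed. The class and function names are hypothetical, and it assumes the intro consists of <p> elements that precede the first <h2> in the HTML, which may not hold for every page or skin.

```python
from html.parser import HTMLParser


class IntroParagraphs(HTMLParser):
    """Collect the text of each <p> that appears before the first <h2>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished, non-empty paragraph texts
        self._buffer = None    # text chunks of the <p> currently open, if any
        self._done = False     # becomes True once the first <h2> is seen

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "h2":
            self._done = True
            self._buffer = None
        elif tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:  # skip empty placeholder paragraphs
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def intro_text(html):
    """Return the intro paragraphs of an HTML string, one per line."""
    parser = IntroParagraphs()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)
```

With the functions from this answer, it would be used as title, content = getRawPage("Mathematics") followed by print(intro_text(content)). Footnote markers such as [1] are kept in the text; strip them separately if they are unwanted.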

Answer 7

I use this: with 'idx' I can choose which paragraph I want to read.

from bs4 import BeautifulSoup
import requests

res = requests.get("https://de.wikipedia.org/wiki/Pferde")
soup = BeautifulSoup(res.text, 'html.parser')
for idx, item in enumerate(soup.find_all("p")):
    if idx == 1:
        break
print(item.text)
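A fixed index can be fragile, because the list of <p> tags may begin with empty or whitespace-only paragraphs. One possible workaround, sketched below with a hypothetical helper name, is to count only paragraphs that actually contain text.

```python
def nth_nonempty(texts, n):
    """Return the n-th (0-based) non-blank text, stripped, or None if absent."""
    nonempty = (t.strip() for t in texts if t.strip())
    for i, text in enumerate(nonempty):
        if i == n:
            return text
    return None
```

With the soup object from above, nth_nonempty([p.text for p in soup.find_all("p")], 0) would give the first paragraph that has content.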
