Scraping a Wikipedia infobox when table cells are in mixed formats

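I am trying to scrape a Wikipedia infobox and get the information for certain keywords. For example: https://en.wikipedia.org/wiki/A%26W_Root_Beer

Say I am looking for the value of Manufacturer. I want the values in a list, and I want only their text. So in this case the desired output would be ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']. No matter what I try, I cannot generate this list. Here is a piece of my code: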

url = "https://en.wikipedia.org/wiki/ABC_Studios"
soup = BeautifulSoup(requests.get(url), "lxml")
tbl = soup.find("table", {"class": "infobox vcard"})
list_of_table_rows = tbl.findAll('tr')
for tr in list_of_table_rows:
    th = tr.find("th")
    td = tr.find("td")
    # take th.text and td.text

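I want a method that works in all kinds of cases: when there are line breaks in between, when some values are links, when some values are paragraphs, and so on. In every case I want only the text we see on screen: not the link, not the paragraph, just plain text. I also do not want the output to be Keurig Dr Pepper (United States, Worldwide)A&W Canada (Canada), because later I want to be able to parse the result and do something with each entity.

I am going through many Wikipedia pages, and I cannot find a method that works for most of them. Can you help me with working code? I am not good at scraping.

OK, here is my attempt (the json library is only there to pretty-print the dictionary):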

import json
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/ABC_Studios"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
tbl = soup.find("table", {"class": "infobox vcard"})

list_of_table_rows = tbl.find_all("tr")
info = {}
for tr in list_of_table_rows:
    th = tr.find("th")
    td = tr.find("td")
    # Skip rows that lack a header or a value cell (e.g. full-width heading rows)
    if th is None or td is None:
        continue
    innerText = ''
    # Walk every descendant node: keep text, turn <br/> into a newline
    for elem in td.descendants:
        if isinstance(elem, str):
            innerText += elem.strip()
        elif elem.name == 'br':
            innerText += '\n'
    info[th.text] = innerText

print(json.dumps(info, indent=1))

The code replaces the <br/> tags with \n, which gives:

{
 "Trading name": "ABC Studios",
 "Type": "Subsidiary\nLimited liability company",
 "Industry": "Television production",
 "Predecessor": "Touchstone Television",
 "Founded": "March\u00a021, 1985; 33 years ago(1985-03-21)",
 "Headquarters": "Burbank, California,U.S.",
 "Area served": "Worldwide",
 "Key people": "Patrick Moran (President)",
 "Parent": "ABC Entertainment Group\n(Disney\u2013ABC Television Group)",
 "Website": "abcstudios.go.com"
}

If you want to return a list instead of a string containing \n characters, you can tweak it:
    innerTextList = innerText.split("\n")
    if len(innerTextList) < 2:
        info[th.text] = innerTextList[0]
    else:
        info[th.text] = innerTextList

which gives:

{
 "Trading name": "ABC Studios",
 "Type": [
  "Subsidiary",
  "Limited liability company"
 ],
 "Industry": "Television production",
 "Predecessor": "Touchstone Television",
 "Founded": "March\u00a021, 1985; 33 years ago(1985-03-21)",
 "Headquarters": "Burbank, California,U.S.",
 "Area served": "Worldwide",
 "Key people": "Patrick Moran (President)",
 "Parent": [
  "ABC Entertainment Group",
  "(Disney\u2013ABC Television Group)"
 ],
 "Website": "abcstudios.go.com"
}
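
Returning a string for one value and a list for several means every caller has to check the type first. One possible variation, offered only as a sketch, is to always return a list and drop empty pieces. Here split_cell_text is a hypothetical helper name, not part of the code above.

```python
def split_cell_text(inner_text):
    """Split newline-separated cell text into a list of non-empty, stripped entries."""
    return [part.strip() for part in inner_text.split("\n") if part.strip()]


print(split_cell_text("Subsidiary\nLimited liability company"))
print(split_cell_text("Worldwide"))
```

With this shape, a single-valued field such as "Worldwide" comes back as a one-element list, so later code can loop over every field the same way.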

This code does not work:

soup = BeautifulSoup(requests.get(url), "lxml")

BeautifulSoup needs the content of the response, not the response object itself, so append .text or .content.
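
As a quick illustration that needs no network access: BeautifulSoup accepts markup as str (what response.text returns) or as bytes (what response.content returns). The markup below is a made-up snippet, and html.parser is used only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

markup_text = "<p>A&amp;W Root Beer</p>"      # stands in for response.text (str)
markup_bytes = markup_text.encode("utf-8")    # stands in for response.content (bytes)

# Both forms parse to the same tree, so either attribute can be passed in
from_text = BeautifulSoup(markup_text, "html.parser").p.get_text()
from_bytes = BeautifulSoup(markup_bytes, "html.parser").p.get_text()
print(from_text == from_bytes)
```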

To get the expected Manufacturer result, select the a elements inside td[class="brand"] and then use .next_sibling.string:

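import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/A%26W_Root_Beer"
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')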
result = soup.select('td[class="brand"] a')
manufacturer = [a.text + a.next_sibling.string for a in result]
print(manufacturer)
# ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']
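
The selector above assumes every entry is a link followed by a text node, so it would fail on an entry with no link or no trailing text. A less markup-dependent sketch is to turn each <br/> into a newline and then split the cell's text. The HTML here is a hand-written stand-in modeled on the row in the question, not the live page, and cell_entries is a hypothetical helper name. Real infoboxes may use lists instead of <br/> and would need extra handling.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one infobox row (an assumption, not the live page markup)
html = (
    '<table class="infobox"><tr><th>Manufacturer</th><td>'
    '<a href="/wiki/Keurig_Dr_Pepper">Keurig Dr Pepper</a> (United States, Worldwide)<br/>'
    '<a href="/wiki/A%26W_Canada">A&amp;W Canada</a> (Canada)'
    '</td></tr></table>'
)


def cell_entries(td):
    """Return the visible text of a cell as a list, one entry per <br/>-separated line."""
    for br in td.find_all("br"):
        br.replace_with("\n")  # mutates the parsed tree in place
    return [line.strip() for line in td.get_text().split("\n") if line.strip()]


soup = BeautifulSoup(html, "html.parser")
manufacturer = cell_entries(soup.find("td"))
print(manufacturer)
```

Because links and plain text are both flattened by get_text() before splitting, the same helper covers cells that mix the two.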