Scraping a Wikipedia infobox when table cells are in mixed formats
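I am trying to scrape Wikipedia infoboxes and extract the values for certain keys. For example: https://en.wikipedia.org/wiki/A%26W_Root_Beer
Say I am looking for the value of Manufacturer. I want the entries in a list, and I want only their text. In this case the desired output would be ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)'].
No matter what I try, I cannot generate this list. Here is a piece of my code: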
url = "https://en.wikipedia.org/wiki/ABC_Studios"
soup = BeautifulSoup(requests.get(url), "lxml")
tbl = soup.find("table", {"class": "infobox vcard"})
list_of_table_rows = tbl.findAll('tr')
for tr in list_of_table_rows:
    th = tr.find("th")
    td = tr.find("td")
    # take th.text and td.text
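I want a method that works in a variety of cases: when there are line breaks in the middle, when some values are links, when some values are paragraphs, and so on. In all cases I want only the text we see on screen: not the link, not the paragraph markup, just plain text. I also do not want the output to be Keurig Dr Pepper (United States, Worldwide)A&W Canada (Canada), because later I want to be able to parse the result and do something with each entity.
I am going through many Wikipedia pages and have not found a method that works for a large share of them. Could you help me with working code? I am not proficient in scraping.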
OK, here is my attempt (the json library is only there to pretty-print the dictionary):
import json
from bs4 import BeautifulSoup
import requests
url = "https://en.wikipedia.org/wiki/ABC_Studios"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
tbl = soup.find("table", {"class": "infobox vcard"})
list_of_table_rows = tbl.find_all('tr')
info = {}
for tr in list_of_table_rows:
    th = tr.find("th")
    td = tr.find("td")
    # Skip rows that lack a header or a value cell (e.g. title or image rows).
    if th is not None and td is not None:
        innerText = ''
        # Walk every node in the cell: keep text, turn <br> into a newline.
        for elem in td.descendants:
            if isinstance(elem, str):
                innerText += elem.strip()
            elif elem.name == 'br':
                innerText += '\n'
        info[th.text] = innerText
print(json.dumps(info, indent=1))
The code replaces <br/> tags with \n, which gives:
{
 "Trading name": "ABC Studios",
 "Type": "Subsidiary\nLimited liability company",
 "Industry": "Television production",
 "Predecessor": "Touchstone Television",
 "Founded": "March\u00a021, 1985; 33 years ago(1985-03-21)",
 "Headquarters": "Burbank, California,U.S.",
 "Area served": "Worldwide",
 "Key people": "Patrick Moran (President)",
 "Parent": "ABC Entertainment Group\n(Disney\u2013ABC Television Group)",
 "Website": "abcstudios.go.com"
}
If you want to return lists instead of strings containing \n characters, you can adapt it like this:
innerTextList = innerText.split("\n")
if len(innerTextList) < 2:
    info[th.text] = innerTextList[0]
else:
    info[th.text] = innerTextList
This gives:
{
 "Trading name": "ABC Studios",
 "Type": [
  "Subsidiary",
  "Limited liability company"
 ],
 "Industry": "Television production",
 "Predecessor": "Touchstone Television",
 "Founded": "March\u00a021, 1985; 33 years ago(1985-03-21)",
 "Headquarters": "Burbank, California,U.S.",
 "Area served": "Worldwide",
 "Key people": "Patrick Moran (President)",
 "Parent": [
  "ABC Entertainment Group",
  "(Disney\u2013ABC Television Group)"
 ],
 "Website": "abcstudios.go.com"
}
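For cells that mix links, <br/> tags, and list items, a slightly more general variant may help. The following is only a sketch under some assumptions: the helper names cell_values and infobox_to_dict are made up for illustration, SAMPLE_HTML is a hypothetical, simplified fragment rather than real Wikipedia markup, and it uses the built-in html.parser so it runs without lxml. It always returns a list per key, one entry per visual line, and keeps a link together with the text that follows it.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified infobox fragment (real Wikipedia markup is messier).
SAMPLE_HTML = """
<table class="infobox hproduct">
<tr><th colspan="2">A&amp;W Root Beer</th></tr>
<tr><th>Type</th><td><a href="/wiki/Root_beer">Root beer</a></td></tr>
<tr><th>Manufacturer</th><td class="brand"><a href="/wiki/Keurig_Dr_Pepper">Keurig Dr Pepper</a> (United States, Worldwide)<br/><a href="/wiki/A%26W_(Canada)">A&amp;W Canada</a> (Canada)</td></tr>
<tr><th>Variants</th><td><ul><li>Diet</li><li><a href="#">Cream Soda</a></li></ul></td></tr>
</table>
"""

# Tags assumed to start a new visual line inside a cell.
LINE_BREAKING_TAGS = ["br", "li", "p"]


def cell_values(td):
    """Return the visible text of a cell as a list, one item per visual line."""
    for tag in td.find_all(LINE_BREAKING_TAGS):
        if tag.name == "br":
            # A <br> carries no text, so swap it for a newline marker.
            tag.replace_with("\n")
        else:
            # Keep the tag's text but fence it off with newline markers.
            tag.insert_before("\n")
            tag.insert_after("\n")
    text = td.get_text().replace("\xa0", " ")
    # Split on the markers, collapse inner whitespace, drop empty lines.
    return [" ".join(line.split()) for line in text.split("\n") if line.strip()]


def infobox_to_dict(html):
    """Map each infobox header to the list of lines in its value cell."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="infobox")
    info = {}
    if table is None:
        return info
    for tr in table.find_all("tr"):
        th, td = tr.find("th"), tr.find("td")
        if th is None or td is None:
            continue  # title rows, image rows, section headers
        info[th.get_text(" ", strip=True)] = cell_values(td)
    return info


print(infobox_to_dict(SAMPLE_HTML)["Manufacturer"])
```

Because link text and the text node after it are joined before splitting, each manufacturer stays in one list entry instead of being glued to the next one. Pages that separate values with other tags would need those tags added to LINE_BREAKING_TAGS.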
This code does not work:
soup = BeautifulSoup(requests.get(url), "lxml")
BeautifulSoup needs the content of the response, not the Response object itself; append .text or .content.
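As a small illustration of that fix, here is one way the fetch step could be wrapped. This is a sketch, not the only way to do it: fetch_soup is a hypothetical helper name, and the timeout value and the choice of html.parser are assumptions made here.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, parser="html.parser", timeout=10):
    """Download a page and parse its body, not the Response object."""
    response = requests.get(url, timeout=timeout)
    # Fail early on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    # response.text is the decoded body; response.content (bytes) also works.
    return BeautifulSoup(response.text, parser)
```

It would be called as soup = fetch_soup("https://en.wikipedia.org/wiki/A%26W_Root_Beer"), after which the soup.find and soup.select calls work as usual.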
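To get the expected Manufacturer result, select the a elements inside td[class="brand"] and then use .next_sibling.string:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/A%26W_Root_Beer"
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
# Each link is followed by a text node such as " (Canada)".
result = soup.select('td[class="brand"] a')
manufacturer = [a.text + a.next_sibling.string for a in result]
print(manufacturer)
# ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']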
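One caveat with .next_sibling.string: it assumes every link is followed by a text node. If a link is the last thing in the cell, next_sibling is None and the attribute access fails. A guarded variant might look like the sketch below; link_with_trailing_text is a hypothetical helper name, and SAMPLE_CELL is a made-up fragment (including the "Solo Brand" entry) used only so the example runs offline.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up cell: two links with trailing text, one link with nothing after it.
SAMPLE_CELL = (
    '<table><tr><td class="brand">'
    '<a href="#">Keurig Dr Pepper</a> (United States, Worldwide)<br/>'
    '<a href="#">A&amp;W Canada</a> (Canada)<br/>'
    '<a href="#">Solo Brand</a>'
    '</td></tr></table>'
)


def link_with_trailing_text(a):
    """Return the link text plus the text node right after it, if there is one."""
    text = a.get_text(strip=True)
    sibling = a.next_sibling
    # Only plain text nodes are appended; tags such as <br> and None are skipped.
    if isinstance(sibling, NavigableString) and sibling.strip():
        text = f"{text} {sibling.strip()}"
    return text


soup = BeautifulSoup(SAMPLE_CELL, "html.parser")
manufacturers = [link_with_trailing_text(a) for a in soup.select("td.brand a")]
print(manufacturers)
```

For the first two links this gives the same strings as the one-liner above, and the third link comes back as just its own text instead of raising an error.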