What's the best way to extract specific text from Wikipedia's Infobox using BeautifulSoup and lists?
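I'm using BeautifulSoup to extract specific text (the revenue) from Wikipedia's infobox. My code works when the revenue text is inside an 'a' tag. Unfortunately, not every page lists its revenue inside an 'a' tag; on some pages, for example, the revenue text comes after a 'span' tag. What is the best/safest way to get the revenue text for a list of companies? Would looking for another tag instead of 'a' work best, or is there something else I should do? Thanks for your help.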
from bs4 import BeautifulSoup
import urllib
import re

company = ['Lockheed_Martin', 'Phillips_66', 'ConocoPhillips', 'Sysco', 'Baker_Hughes']

for c in company:
    r = urllib.urlopen('https://en.wikipedia.org/wiki/' + c).read()
    soup = BeautifulSoup(r, "lxml")

    # Find the infobox header cell whose text starts with "Revenue"
    rev = re.compile('^Revenue')
    thRev = [e for e in soup.find_all('th', {'scope': 'row'}) if rev.search(e.text)][0]
    tdRev = thRev.find_next('td')

    # Only works when the revenue is wrapped in an <a> tag
    revenue = tdRev.find_all('a')
    for f in revenue:
        print c + " " + f.text
        break
You can try:
from bs4 import BeautifulSoup
import urllib
import re

company = ['Lockheed_Martin', 'Phillips_66', 'ConocoPhillips', 'Sysco', 'Baker_Hughes']

for c in company:
    r = urllib.urlopen('https://en.wikipedia.org/wiki/' + c).read()
    soup = BeautifulSoup(r, "lxml")
    for tr in soup.findAll('tr'):
        trText = tr.text
        # Match on the row's text rather than on a specific child tag
        if re.search(r"^\s*Revenue\b", trText):
            # Currency prefix, "$", optional whitespace, the number, one separator, the unit
            match = re.search(r"\w+\$(?:\s+)?[\d\.]+.{1}\w+", trText)
            if match:  # skip rows whose amount does not fit the pattern
                revenue = match.group()
                print c + "\n" + revenue + "\n"
Output:
Lockheed_Martin
US$ 46.132 billion
Phillips_66
US$ 161.21 billion
ConocoPhillips
US.52 billion
Sysco
US.41 Billion
Baker_Hughes
US$ 22.364 billion
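If you would rather stay with the `th`/`td` lookup from the question, a tag-agnostic variant is to read the whole cell with `get_text()` instead of searching for `a` or `span`. The following is only a minimal sketch in Python 3: the HTML fragments, the names `SAMPLE_PAGES` and `revenue_text`, and the "(2015)" suffix are made up for illustration, and real infoboxes may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical infobox fragments: one value inside <a>, one after a <span>.
SAMPLE_PAGES = {
    "Company_A": """
      <table class="infobox">
        <tr><th scope="row">Industry</th><td>Aerospace</td></tr>
        <tr><th scope="row">Revenue</th>
            <td><a href="#">US$ 46.132 billion</a> (2015)</td></tr>
      </table>""",
    "Company_B": """
      <table class="infobox">
        <tr><th scope="row">Revenue</th>
            <td><span class="decrease"></span> US$ 22.364 billion (2015)</td></tr>
      </table>""",
}


def revenue_text(html):
    """Return the text of the cell next to the 'Revenue' header, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for th in soup.find_all("th", {"scope": "row"}):
        if th.get_text(strip=True).startswith("Revenue"):
            td = th.find_next("td")
            if td is None:
                return None
            # get_text() flattens the cell, so <a>, <span> or bare text all work;
            # split/join collapses any leftover runs of whitespace.
            return " ".join(td.get_text(" ", strip=True).split())
    return None


for name, html in SAMPLE_PAGES.items():
    print(name, revenue_text(html))
```

Returning `None` when no "Revenue" row exists avoids the `IndexError` that `[...][0]` raises on pages without that row. The returned string can still be narrowed down with a regex like the one above if only the amount is wanted.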
Note: you may want to use the Wikipedia API instead, for example:
https://en.wikipedia.org/w/api.php?action=query&titles=Baker_Hughes&prop=revisions&rvprop=content&format=json
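As a rough illustration of the API route (Python 3, standard library only), the sketch below builds that URL for any title and pulls a `| revenue = ...` line out of the returned wikitext. The helper names, the `SAMPLE` response and its `revisions[0]["*"]` layout are assumptions for the example, not something taken from a live call, so check them against an actual response before relying on this.

```python
import json
import re
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def api_url(title):
    """Build the query URL shown above for an arbitrary page title."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
    }
    return API + "?" + urlencode(params)


def wikitext_from_response(data):
    """Pull the page source out of a decoded response (assumed layout)."""
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]


def infobox_field(wikitext, field):
    """Return the raw value of a '| field = value' line, or None."""
    pattern = r"^\|[ \t]*%s[ \t]*=[ \t]*(.+)$" % re.escape(field)
    m = re.search(pattern, wikitext, re.MULTILINE)
    return m.group(1).strip() if m else None


# Hypothetical, trimmed response standing in for a real API call.
SAMPLE = json.loads(r'''{"query": {"pages": {"1": {"title": "Baker Hughes",
  "revisions": [{"*": "{{Infobox company\n| name = Baker Hughes\n| revenue = US$ 22.364 billion\n}}"}]}}}}''')

print(api_url("Baker_Hughes"))
print(infobox_field(wikitext_from_response(SAMPLE), "revenue"))
```

In real page source the field value usually carries wiki markup (templates, references, links), so some extra cleanup would be needed after `infobox_field` returns.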