Extracting text from a chart with Beautiful Soup
I'm relatively new to BeautifulSoup and am trying to extract data from this web page: http://reports.workforce.test.ohio.gov/program-county-wia-reports.aspx?name=GTL8gAmmdulY5GSlycy7WQ==&dataType=hIp9ibmBIwbKor1WvT5Bkg==&dataTypeText=hIp9ibmBIwbKor1WvT5Bkg==#
I want to get the numbers under the headings "Program Completers", "Employed Second Quarter", and so on. The relevant part of the HTML is:
<ul class="listbox">
<li class="li1">
<p style="cursor:help" class="listtop" title="WIA Adult
completers are those individuals who have exited a WIA Adult program from
which the individual received a core staff-assisted service (such as job
search or placement assistance) or an intensive service (such as
counseling, career planning, or job training). Those individuals who
participated in WIA through self-service, like OhioMeansJobs.com, or other
less intensive programs are not included in the dashboard.">Program
Completers</p>
<p id="programcompleters1">18</p></li>
I want the strings "Program Completers" and "18". I have tried implementing the solutions here, here, and here, but without luck. One version of my code is:
from bs4 import BeautifulSoup
import urllib2
url="http://reports.workforce.test.ohio.gov/program-county-wia-reports.aspx?name=GTL8gAmmdulY5GSlycy7WQ==&dataType=hIp9ibmBIwbKor1WvT5Bkg==&dataTypeText=hIp9ibmBIwbKor1WvT5Bkg=="
hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36',
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
req = urllib2.Request(url, headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
for tag in soup.find_all('ul'):
    print tag.text, tag.next_sibling
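This returns text, but only from other parts of the page that also happen to be marked up as 'ul'. I have not managed to get any text from inside the chart area. How can I retrieve the text I want?
Thanks for your help!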
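The elements you need are inside an iframe, so they are not part of the HTML returned for the outer page. Try extracting them from the iframe's own page, located at http://reports.workforce.test.ohio.gov/WIAReports/WIA_COUNTY.ASPX?level=county&DataType=hIp9ibmBIwbKor1WvT5Bkg==&name=GTL8gAmmdulY5GSlycy7WQ==&programDate=Kf/2jvCFFRgQJjODWV7l08ATxxM/adw9p1FWfZ9J7O8=
So this should work: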
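# Python 2, as in the question; on Python 3 use urllib.request.urlopen instead of urllib2.urlopen
from bs4 import BeautifulSoup
import urllib2

url="http://reports.workforce.test.ohio.gov/WIAReports/WIA_COUNTY.ASPX?level=county&DataType=hIp9ibmBIwbKor1WvT5Bkg==&name=GTL8gAmmdulY5GSlycy7WQ==&programDate=Kf/2jvCFFRgQJjODWV7l08ATxxM/adw9p1FWfZ9J7O8="
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
# Collect every <div class="chartcontain"> on the iframe page
chartcontainers = soup.find_all('div', {"class": "chartcontain"})
for container in chartcontainers:
    print(container)
    # then do whatever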
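Rather than copying the iframe URL by hand, it can usually be read from the outer page. The following is only a sketch (Python 3), assuming the outer page embeds the report through an `<iframe>` with a relative `src`. The HTML string, the `id="report"` attribute, and the shortened query string are made-up stand-ins, not the real page source, and `iframe_urls` is a hypothetical helper name.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-ins for the outer page; the real markup may differ.
OUTER_URL = "http://reports.workforce.test.ohio.gov/program-county-wia-reports.aspx"
OUTER_HTML = """
<html><body>
  <ul><li>Navigation</li></ul>
  <iframe id="report" src="WIAReports/WIA_COUNTY.ASPX?level=county"></iframe>
</body></html>
"""


def iframe_urls(html, base_url):
    """Return the absolute URL of every iframe that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True keeps only iframes that actually carry a src attribute
    frames = soup.find_all("iframe", src=True)
    # Resolve relative src values against the URL of the outer page
    return [urljoin(base_url, frame["src"]) for frame in frames]


urls = iframe_urls(OUTER_HTML, OUTER_URL)
print(urls)
```

Each URL returned this way can then be fetched and parsed on its own, exactly as in the answer above.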
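As already mentioned, the data you are looking for is inside an iframe; load it as described in @chosen_codex's answer.
You can then collect the fields you are interested in like this: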
data = {}
for tag in soup.find_all('p'):
    if tag.get('id'):
        data[tag.get('id')] = tag.text
print(data)
>>> print(data.get('programcompleters1'))
18
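Since the question asks for both the heading text and the number, a possible variation pairs each `listtop` label with the value in the same `<li>`. This is a minimal sketch (Python 3) run against the fragment from the question, with the long `title` attribute shortened and a closing `</ul>` added; `label_value_pairs` is a hypothetical helper, and it assumes every list item holds one `p.listtop` label and one `<p>` with an `id`.

```python
from bs4 import BeautifulSoup

# Fragment from the question; title attribute shortened, closing </ul> added.
HTML = """
<ul class="listbox">
<li class="li1">
<p style="cursor:help" class="listtop" title="WIA Adult completers ...">Program
Completers</p>
<p id="programcompleters1">18</p></li>
</ul>
"""


def label_value_pairs(html):
    """Map each heading inside ul.listbox to the value shown beneath it."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = {}
    for listbox in soup.find_all("ul", class_="listbox"):
        for item in listbox.find_all("li"):
            label = item.find("p", class_="listtop")
            value = item.find("p", id=True)
            if label is None or value is None:
                continue  # skip items that do not follow the assumed layout
            # The label text wraps across lines, so collapse the whitespace
            name = " ".join(label.get_text().split())
            pairs[name] = value.get_text(strip=True)
    return pairs


pairs = label_value_pairs(HTML)
print(pairs)
```

Restricting the search to `ul.listbox` also avoids the unrelated `ul` elements that the loop in the question picked up.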