Extracting web elements from websites using Python

I want to extract various elements from the tables and the paragraph text of this site:

https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655

Here is the code I used:

import urllib2
from lxml import etree

# Download the page and parse the raw HTML into an element tree
source = urllib2.urlopen('https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30656&SSO=1').read()
x = etree.HTML(source)

# Positional XPath down to the paragraph in the cell I want
growth = x.xpath('//*[@id="home_feature_container"]/div/div[2]/div/table[2]/tbody/tr[3]/td[2]/p')
growth

What is the best way to extract the elements I want from the site without having to change the XPath in the code every time? They publish new data on the same site every month, but the XPath sometimes seems to shift a little.

BeautifulSoup to the rescue:

from bs4 import BeautifulSoup
import urllib2

r = urllib2.urlopen('https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655')
soup = BeautifulSoup(r.read(), 'html.parser')

# Find the feature container by its id, then the heading inside it
soup.find('div', {'id': 'home_feature_container'}).find('h4')

This code does what was described. If you use soup.find().contents, you get a list of every item contained in the element.
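
For instance, here is a minimal sketch of .contents using made-up HTML (the tags below are illustrative, not the site's actual markup):

from bs4 import BeautifulSoup

# Made-up HTML purely to illustrate .contents
html_doc = '<div id="home_feature_container"><h4>Report title</h4><p>Body text</p></div>'
soup = BeautifulSoup(html_doc, 'html.parser')

div = soup.find('div', {'id': 'home_feature_container'})
print(div.contents)  # [<h4>Report title</h4>, <p>Body text</p>]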

As for coping with changes to the page, it really depends. If the changes are drastic, you will have to change the soup.find() call. Otherwise, you can write code generic enough that it keeps working (for example, if the div with id home_feature_container always holds the feature, that lookup never needs to change), as sketched below.
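
A sketch of that idea, reusing the soup object from the snippet above: instead of a positional path such as .../table[2]/tbody/tr[3]/td[2]/p, anchor on the stable id and search within it (that this id stays stable is an assumption about the site):

# Positional paths break whenever a table or row is added; the id does not
container = soup.find('div', {'id': 'home_feature_container'})
if container is not None:
    paragraphs = container.find_all('p')  # every <p> inside, wherever it sits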

If the location of the items you want changes often, try retrieving them by name. For example, here is how to extract elements from the table on the "New Orders" row.

import requests  # friendlier than urllib2 for fetching pages
from lxml import html, etree

url = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
page = requests.get(url)
tree = html.fromstring(page.content)

# Anchor on the row label itself, then walk to the sibling cell holding its value
neworders = tree.xpath('//strong[text()="New Orders"]/../../following-sibling::td/p/text()')

print(neworders)

Or, if you want the whole HTML table:

# Anchor on the table heading, then step up to the enclosing table element
data = tree.xpath('//th[text()="MANUFACTURING AT A GLANCE"]/../..')

for elements in data:
    print(etree.tostring(elements, pretty_print=True))

Another example using BeautifulSoup:

from bs4 import BeautifulSoup
import requests

url = "https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1"
content = requests.get(url).content
soup = BeautifulSoup(content, "lxml")

# The second table on the page (assumed to be "MANUFACTURING AT A GLANCE")
table = soup.find_all('table')[1]
table_body = table.find('tbody')

# Collect the text of every cell, row by row, dropping empty cells
data = []
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])

print(data)
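
If the goal is to avoid editing the code each month, one option is to index the rows by their first cell, so a value can be looked up by name no matter where its row ends up. This is a sketch building on the data list above; it assumes the first cell of each row carries the label, as it does in the MANUFACTURING AT A GLANCE table:

# Key each row by its label so row positions in the table no longer matter
by_label = {row[0]: row[1:] for row in data if row}

print(by_label.get('New Orders'))  # the remaining cells of the "New Orders" row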