How to use lxml for web scraping?
I want to write a Python script that fetches my current reputation on Stack Overflow: https://whosebug.com/users/14483205/raunanza?tab=profile
Here is the code I have written so far.
from lxml import html
import requests
page = requests.get('https://whosebug.com/users/14483205/raunanza?tab=profile')
tree = html.fromstring(page.content)
Now, how do I get my reputation? (Even after googling, I can't figure out how to use XPath.)
A simple solution using lxml and BeautifulSoup:
from bs4 import BeautifulSoup
import requests

# Download the profile page as text
page = requests.get('https://whosebug.com/users/14483205/raunanza?tab=profile').text
# Parse it with BeautifulSoup, using lxml as the underlying parser
tree = BeautifulSoup(page, 'lxml')
# These class names come from inspecting the page and may change over time
name = tree.find("div", {'class': 'grid--cell fw-bold'}).text.strip()
title = tree.find("div", {'class': 'grid--cell fs-title fc-dark'}).text.strip()
print("Whosebug reputation of {} is: {}".format(name, title))
# output: Whosebug reputation of Raunanza is: 3
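If you don't mind using BeautifulSoup, you can extract the text directly from the tag that contains your reputation. Be sure to inspect the page structure first.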
from bs4 import BeautifulSoup
import requests

page = requests.get('https://whosebug.com/users/14483205/raunanza?tab=profile')
soup = BeautifulSoup(page.content, features='lxml')
# The class names come from inspecting the page and may change over time
for tag in soup.find_all('strong', {'class': 'ml6 fc-medium'}):
    print(tag.text)
# prints 3
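One thing to watch for with this approach is that a lookup returns nothing when the page layout changes. The sketch below shows one way to guard against that. It runs on an inline HTML snippet rather than the live page; the snippet and the helper name `extract_reputation` are made up for illustration, and only the class names are taken from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page layout may differ.
SAMPLE = '<div><strong class="ml6 fc-medium">3</strong></div>'


def extract_reputation(markup):
    """Return the reputation text, or None if the tag is not found."""
    soup = BeautifulSoup(markup, 'lxml')
    # select_one takes a CSS selector and returns the first match or None
    tag = soup.select_one('strong.ml6.fc-medium')
    return tag.get_text(strip=True) if tag is not None else None


print(extract_reputation(SAMPLE))
print(extract_reputation('<p>no reputation here</p>'))
```

Checking for `None` before reading the text avoids an `AttributeError` when the selector no longer matches.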
You need to make a small change to your code to use XPath. Here is the code:
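from lxml import html  # the module name is lowercase
import requests

page = requests.get('https://whosebug.com/users/14483205/raunanza?tab=profile')
tree = html.fromstring(page.content)
# xpath() returns a list of the matching text nodes
title = tree.xpath('//*[@id="avatar-card"]/div[2]/div/div[1]/text()')
print(title)  # the list holds the reputation text (3 for this profile)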
You can easily get an element's XPath from Chrome DevTools (the Inspect option).
To learn more about XPath, see: https://www.w3schools.com/xml/xpath_examples.asp
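To try XPath without depending on the live page, here is a minimal sketch that runs against an inline HTML snippet. The markup is hypothetical: only the `avatar-card` id comes from the answer above, while the inner structure and the `reputation` class are assumptions made for the example. It selects by attribute instead of by position, which tends to survive small layout changes better than a path copied from the browser.

```python
from lxml import html

# Hypothetical markup, loosely modeled on a profile card; not the real page.
SAMPLE = """
<html><body>
  <div id="avatar-card">
    <div class="name">Raunanza</div>
    <div class="stats">
      <div class="reputation"> 3 </div>
      <div class="label">reputation</div>
    </div>
  </div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# xpath() always returns a list, even when only one node matches.
matches = tree.xpath('//div[@id="avatar-card"]//div[@class="reputation"]/text()')

# Take the first match, if any, and strip the surrounding whitespace.
reputation = matches[0].strip() if matches else None
print(reputation)
```

Here `//` searches at any depth, `[@id="..."]` filters on an attribute, and `text()` returns the element's text nodes as strings.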