Use Python and BeautifulSoup to access the title attribute of tags in a web page
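I am new to Python and I want to retrieve all the titles from a specific URL, but I can't get it to work. The code runs without any errors, but I still get no output.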
import requests
import sys
from bs4 import BeautifulSoup

def test_function(num):
    url = "https://www.zomato.com/chennai/restaurants?buffet=1&page=" + str(num)
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for link in soup.findAll('title'):
        print(link)

test_function(1)
To get the page title, you can simply use:
soup.title.string
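However, it looks like you don't actually want the page title, but rather the title attribute of any tag that has one. If you want the title attribute of every tag (where it exists), you can do this: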
for tag in soup.findAll():
    try:
        print(tag['title'])
    except KeyError:
        pass
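This prints every title attribute found on the page. We go through all the tags and try to print each one's title value; if a tag has none, we get a KeyError and simply ignore it.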
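As a minimal sketch of the difference between the two, and of a variant that avoids the try/except: find_all also accepts an attribute filter, so find_all(title=True) returns only the tags that carry a title attribute. The HTML snippet and the helper name title_attributes are made up for illustration and are not taken from the page in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, standing in for the downloaded page.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="/one" title="First restaurant">One</a>
    <a href="/two">Two</a>
    <img src="x.png" title="Second restaurant">
  </body>
</html>
"""

def title_attributes(html):
    """Return the title attribute of every tag that has one, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # title=True matches any tag with a title attribute, whatever its value.
    return [tag["title"] for tag in soup.find_all(title=True)]

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(soup.title.string)              # text of the <title> element
print(title_attributes(SAMPLE_HTML))  # values of the title attributes
```

With this sample, the first print shows the page title and the second shows only the two title attributes; the link without a title is skipped instead of raising a KeyError.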
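Another problem is that no User-Agent is passed with the request. If you don't send one, the site returns a 500 error. I have added code below to do this.
With your code, that becomes: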
import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent so the site does not reject the request.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0"}

def test_function(num):
    url = "https://www.zomato.com/chennai/restaurants?buffet=1&page=" + str(num)
    source_code = requests.get(url, headers=HEADERS)
    plain_text = source_code.text
    # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning.
    soup = BeautifulSoup(plain_text, "html.parser")
    for tag in soup.findAll():
        try:
            print(tag['title'])
        except KeyError:
            # This tag has no title attribute; skip it.
            pass

test_function(1)
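Since a rejected request would otherwise fail silently (the loop just finds nothing to print), it may help to check the status before parsing. The following is only a sketch: the helper names text_if_ok and fetch_html and the 10-second timeout are assumptions, not part of the original code.

```python
import requests

def text_if_ok(response):
    """Return the response body, or raise requests.HTTPError on a 4xx/5xx status."""
    response.raise_for_status()
    return response.text

def fetch_html(url, headers, timeout=10):
    """Download a page and fail loudly if the server rejects the request."""
    return text_if_ok(requests.get(url, headers=headers, timeout=timeout))
```

Calling fetch_html(url, HEADERS) in place of requests.get(...).text would then raise an exception on a 500 response instead of producing empty output.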
You need to add headers to get a 200 response, and then do the same thing as before.
import requests
from bs4 import BeautifulSoup

def test_function(num):
    url = "https://www.zomato.com/chennai/restaurants"
    # requests builds the query string (?buffet=1&page=...) from this dict.
    params = {'buffet': 1, 'page': num}
    header = {'Accept-Encoding': 'gzip, deflate, sdch',
              'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
              'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}
    r = requests.get(url, params=params, headers=header)
    plain_text = r.text
    soup = BeautifulSoup(plain_text, "html.parser")
    # Print the text of the page's <title> element.
    for link in soup.findAll('title'):
        print(link.text)

test_function(1)
Restaurants in Chennai serving Buffet - Zomato
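To see that the params dict yields the same URL as the string concatenation in the question, the request can be prepared without being sent. This is a small sketch for inspection only; the variable names are illustrative.

```python
import requests

# Build the request without sending it, to inspect the final URL.
request = requests.Request(
    "GET",
    "https://www.zomato.com/chennai/restaurants",
    params={"buffet": 1, "page": 1},
)
prepared = request.prepare()
print(prepared.url)  # https://www.zomato.com/chennai/restaurants?buffet=1&page=1
```

Letting requests assemble the query string also takes care of URL-encoding any values that need it.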