Python BeautifulSoup web scraping: viewing a Tripadvisor review
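I'm new to web scraping and am trying to view the list of reviews for a specific hotel.
I first tried to view a single review by selecting a specific class, but I get no output at all, not even when I print the status code of the request. I suspect my code is taking a very long time to run.
Does web scraping normally take this long to run, or is there a problem with my code?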
import requests
from bs4 import BeautifulSoup
headers = {
'Access-Control-Allow-Origin': '*',
'Access-Control-Allow-Methods': 'GET',
'Access-Control-Allow-Headers': 'Content-Type',
'Access-Control-Max-Age': '3600',
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}
url = "https://www.tripadvisor.ca/Hotel_Review-g154913-d1587398-Reviews-Le_Germain_Hotel_Calgary-Calgary_Alberta.html"
req = requests.get(url, headers)
print (req.status_code)
soup = BeautifulSoup(req.content, 'html.parser')
review = soup.find_all(class_="XllAv H4 _a").get_text()
print(review)
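I changed a few of the headers keys and some of the requests parameters: headers is now passed by keyword (headers=headers) and a timeout is set.
I was getting an error on .get_text(), so I replaced it with a loop over the matched elements.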
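As far as I can tell, the .get_text() error comes from find_all returning a list-like ResultSet rather than a single element, so get_text() has to be called on each item. Here is a minimal offline sketch of that idea. SAMPLE_HTML and extract_texts are hypothetical names made up for illustration, and the markup is only a stand-in that reuses the class string from the question, not the real Tripadvisor page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; only the class string
# is taken from the question, the rest is invented for illustration.
SAMPLE_HTML = """
<div>
  <q class="XllAv H4 _a">Great stay.</q>
  <q class="XllAv H4 _a">Friendly staff.</q>
</div>
"""


def extract_texts(html, css_class):
    """Return the stripped text of every element whose class attribute equals css_class."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a ResultSet (a list of tags), so get_text() is
    # called on each tag instead of on the ResultSet itself.
    return [tag.get_text(strip=True) for tag in soup.find_all(class_=css_class)]


reviews = extract_texts(SAMPLE_HTML, "XllAv H4 _a")
print(reviews)
```

If the list comes back empty on the real page, the class names have probably changed or the reviews are rendered by JavaScript, which this sketch does not handle.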
import requests
from bs4 import BeautifulSoup
headers = {
'Access-Control-Allow-Origin': '*',
'Access-Control-Allow-Methods': 'GET',
'Access-Control-Allow-Headers': 'Content-Type',
'accept': '*/*',
'accept-encoding': 'gzip, deflate',
'accept-language': 'en,mr;q=0.9',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
url = "https://www.tripadvisor.ca/Hotel_Review-g154913-d1587398-Reviews-Le_Germain_Hotel_Calgary-Calgary_Alberta.html"
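# headers must be passed by keyword; timeout keeps the call from hanging indefinitely.
# Note: verify=False disables TLS certificate verification.
req = requests.get(url,headers=headers,timeout=5,verify=False)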
print (req.status_code)
soup = BeautifulSoup(req.content, 'html.parser')
#review = soup.find_all(class_="XllAv H4 _a").get_text()
#print(review)
# find_all returns a list of tags, so print the text of each one in turn
for x in soup.body.find_all(class_="XllAv H4 _a"):
    print(x.text)
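One more detail that seems worth spelling out: the second positional argument of requests.get is params, not headers, so requests.get(url, headers) in the question sends the dict as a query string and the User-Agent header is never set. The sketch below shows the difference without touching the network by building prepared requests. The shortened HEADERS dict is just an illustrative assumption.

```python
import requests

URL = ("https://www.tripadvisor.ca/Hotel_Review-g154913-d1587398-Reviews-"
       "Le_Germain_Hotel_Calgary-Calgary_Alberta.html")
HEADERS = {"User-Agent": "Mozilla/5.0"}  # shortened for illustration

# Equivalent to requests.get(URL, HEADERS): the dict is bound to `params`.
as_params = requests.Request("GET", URL, params=HEADERS).prepare()

# Equivalent to requests.get(URL, headers=HEADERS): the dict becomes HTTP headers.
as_headers = requests.Request("GET", URL, headers=HEADERS).prepare()

print(as_params.url)                     # the dict was appended to the URL as a query string
print("User-Agent" in as_params.headers)
print(as_headers.headers["User-Agent"])
```

Nothing is actually sent here, so this only demonstrates how the arguments are interpreted. Whether the site then responds quickly still depends on the server, which is why a timeout is worth keeping.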