Want to know how to crawl TripAdvisor
I am trying to get the links to all the restaurants in Singapore, but my code doesn't work:
data = requests.get("https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html").text
soup = BeautifulSoup(data, "html.parser")
for link in soup.find_all('a', {'property_title'}):
    print('https://www.tripadvisor.com/Restaurant_Review-g294265-' + link.get('href'))
    print(link.string)
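It keeps on loading and loading again at the line soup = BeautifulSoup(data, "html.parser")
I don't know why this happens, since the same approach works fine on other websites.
Is TripAdvisor blocking scraping, or is it an error in my code?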
To get a response, add a user-agent header:
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
data = requests.get(
    "https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html", headers=headers
).text
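Since the original symptom was a request that never returns, it may also help to set a timeout so a blocked request fails fast instead of hanging. Here is a minimal sketch of that idea using only the standard library's urllib.request in place of requests (with requests, the equivalent is passing timeout=10 to requests.get). The helper names build_request and fetch_html and the 10-second value are my own choices for illustration, not something from the question.

```python
import urllib.request

URL = "https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}


def build_request(url=URL):
    """Attach the browser-like User-Agent to a request object (no network I/O)."""
    return urllib.request.Request(url, headers=HEADERS)


def fetch_html(url=URL, timeout=10):
    """Download the page; raise an error if no reply arrives within `timeout` seconds."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html() then either returns the page text or raises within about ten seconds, which makes it easier to tell a blocked request from a slow parser.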
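However, the data is loaded dynamically, and requests doesn't support dynamically loaded pages. The data is, however, available on the page in JSON format (it's not clear what exactly you want to scrape). To get all the data, you can use the json/re modules: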
import json
import re
...
data = requests.get(
    "https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html", headers=headers
).text
# Capture the object assigned to window.__WEB_CONTEXT__ (still a plain string at this point)
json_data = re.search(r"window\.__WEB_CONTEXT__=({.*});", data, flags=re.MULTILINE).group(1)
print(
    # Prints all the data; use `json.loads(json_data)` instead to access it as Python objects
    json.dumps(json_data, indent=4)
)
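The greedy `({.*});` pattern above can swallow more than one object if the same line contains another `};` later on. As an alternative sketch, json.JSONDecoder().raw_decode parses exactly one JSON value starting at a given position and ignores whatever follows. This assumes the embedded text is strict JSON, which I haven't verified against the live page; the helper name extract_json_after and the sample string are made up for illustration.

```python
import json


def extract_json_after(html, marker):
    """Return the JSON value that starts right after `marker`, or None if absent."""
    start = html.find(marker)
    if start == -1:
        return None
    # raw_decode parses one JSON value at the given index and ignores what follows it
    value, _end = json.JSONDecoder().raw_decode(html, start + len(marker))
    return value


# Made-up sample standing in for the downloaded page (not real TripAdvisor markup)
sample = '<script>window.__WEB_CONTEXT__={"a": {"detailPageUrl": "/x.html"}};var other={};</script>'
context = extract_json_after(sample, "window.__WEB_CONTEXT__=")
print(json.dumps(context, indent=4))
```

If the text after the marker turns out not to be valid JSON, raw_decode raises json.JSONDecodeError, so the failure is at least explicit.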
To get all the links:
import re
import requests
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
data = requests.get(
    "https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html", headers=headers
).text
for link in re.findall(r'"detailPageUrl":"(.*?)"', data):
    # Each captured path already starts with "/", so the base URL has no trailing slash
    print("https://www.tripadvisor.com.sg" + link)
Output (truncated):
https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d1145149-Reviews-Grand_Shanghai_Restaurant-Singapore.html
https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d1193730-Reviews-Entre_Nous_creperie-Singapore.html
https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d1173583-Reviews-The_Courtyard-Singapore.html
https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d4611806-Reviews-NOX_Dine_in_the_Dark-Singapore.html
https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d13152787-Reviews-Positano_Risto-Singapore.html
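Because each detailPageUrl value already begins with a slash, urllib.parse.urljoin can take care of joining it to the base URL, and the same path may well appear more than once in the page text (an assumption on my part, I haven't counted). A small sketch that does both, run here on a made-up sample string rather than the real response; the function name extract_links is hypothetical:

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.tripadvisor.com.sg/"


def extract_links(page_text, base_url=BASE_URL):
    """Return absolute detail-page URLs in first-seen order, without duplicates."""
    paths = re.findall(r'"detailPageUrl":"(.*?)"', page_text)
    # dict.fromkeys keeps insertion order and drops repeated paths
    return [urljoin(base_url, path) for path in dict.fromkeys(paths)]


# Made-up sample imitating the embedded data; the first path is repeated on purpose
sample = (
    '{"detailPageUrl":"/Restaurant_Review-g294265-d1145149-Reviews-Grand_Shanghai_Restaurant-Singapore.html"},'
    '{"detailPageUrl":"/Restaurant_Review-g294265-d1173583-Reviews-The_Courtyard-Singapore.html"},'
    '{"detailPageUrl":"/Restaurant_Review-g294265-d1145149-Reviews-Grand_Shanghai_Restaurant-Singapore.html"}'
)
links = extract_links(sample)
for link in links:
    print(link)
```

With the real page you would pass the downloaded text (data in the snippet above) instead of sample.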