What's the best way to scrape data from Zillow?
I'm trying to collect data from Zillow, without success.
Example:
url = https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy
I want to extract the address, price, location, and other details for every home in Los Angeles.
I have tried scraping the HTML with packages like BeautifulSoup, and I have also tried working with the JSON. I'm fairly sure Zillow's API won't help; as I understand it, the API is best suited to collecting information about one specific property.
I've been able to scrape information from other sites, but Zillow seems to use dynamic IDs (they change on every refresh), which makes the information much harder to get at.
Update:
I tried the code below, but it still produces no results:
import requests
from bs4 import BeautifulSoup

url = 'https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

for li in soup.find_all('div', {'class': 'zsg-photo-card-caption'}):
    try:
        # There are sponsored links in the list; you might need to take
        # care of those. Better to check for null values too, which we
        # are not doing here.
        print(li.find('span', {'class': 'zsg-photo-card-price'}).text)
        print(li.find('span', {'class': 'zsg-photo-card-info'}).text)
        print(li.find('span', {'class': 'zsg-photo-card-address'}).text)
        print(li.find('span', {'class': 'zsg-photo-card-broker-name'}).text)
    except AttributeError:
        print('An error occurred')
It's probably because you're not passing any headers.
If you look at Chrome's Network tab in the developer tools, these are the headers the browser sends:
:authority:www.zillow.com
:method:GET
:path:/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy
:scheme:https
accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
accept-encoding:gzip, deflate, br
accept-language:en-US,en;q=0.8
upgrade-insecure-requests:1
user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
However, if you try to send all of these, the request will fail, because requests doesn't let you send headers whose names start with a colon ':'.
I tried skipping just those four and using the other five in this script, and it worked. So try this:
from bs4 import BeautifulSoup
import requests

req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

with requests.Session() as s:
    url = 'https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy'
    r = s.get(url, headers=req_headers)
After that, you can use BeautifulSoup to extract the information you need:
soup = BeautifulSoup(r.content, 'lxml')
price = soup.find('span', {'class': 'zsg-photo-card-price'}).text
info = soup.find('span', {'class': 'zsg-photo-card-info'}).text
address = soup.find('span', {'itemprop': 'address'}).text
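The three `find()` calls above return only the first match on the page. To collect every listing, loop over the cards and guard each lookup, since `find()` returns `None` when a field is missing (sponsored cards, for example). A minimal sketch, using a made-up inline HTML fragment that mirrors the class names above so it runs without a live request (the listing values are placeholders, not real data):

```python
from bs4 import BeautifulSoup

# Made-up sample mirroring the zsg-photo-card markup used above;
# with a live page you would pass r.content here instead.
sample_html = """
<div class="zsg-photo-card-caption">
  <span class="zsg-photo-card-price">$725,000</span>
  <span class="zsg-photo-card-info">2 bds - 2 ba - 1,400 sqft</span>
  <span itemprop="address">121 S Hope St APT 435 Los Angeles CA 90012</span>
</div>
<div class="zsg-photo-card-caption">
  <span class="zsg-photo-card-price">$549,000</span>
  <span itemprop="address">6427 Klump Ave North Hollywood CA 91606</span>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
rows = []
for card in soup.find_all('div', {'class': 'zsg-photo-card-caption'}):
    # find() returns None for absent fields, so check before reading .text
    # instead of letting an AttributeError swallow the whole card.
    price = card.find('span', {'class': 'zsg-photo-card-price'})
    info = card.find('span', {'class': 'zsg-photo-card-info'})
    address = card.find('span', {'itemprop': 'address'})
    rows.append({
        'price': price.text if price else None,
        'info': info.text if info else None,
        'address': address.text if address else None,
    })

for row in rows:
    print(row)
```

The second sample card has no info span, so its row comes back with `info` set to `None` rather than raising an exception.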
Here is a sample of the data extracted from that page:
+--------------+-----------------------------------------------------------+
| 5,000 | 121 S Hope St APT 435 Los Angeles CA 90012 |
| 0,000 | 4859 Coldwater Canyon Ave APT 14A Sherman Oaks CA 91423 |
| ,495,000 | 13446 Valley Vista Blvd Sherman Oaks CA 91423 |
| ,199,000 | 6241 Crescent Park W UNIT 410 Los Angeles CA 90094 |
| 1,472+ | Chase St. And Woodley Ave # HGS0YX North Hills CA 91343 |
| 9,000 | 8650 Gulana Ave UNIT L2179 Playa Del Rey CA 90293 |
| 5,000 | 6427 Klump Ave North Hollywood CA 91606 |
+--------------+-----------------------------------------------------------+
You could try a paid tool, for example:
https://www.scraping-bot.io/how-to-scrape-real-estate-listings-on-zillow/
Find what you need via the sitemap:
https://www.zillow.com/sitemap/catalog/sitemap.xml
Then scrape the data from the URLs listed in the sitemap.
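The sitemap step can be sketched with the standard library's XML parser. The fragment below is a made-up stand-in using the standard sitemap schema (in practice you would download the sitemap URL above, e.g. with requests and the headers shown earlier, and feed its bytes in; the real file may also be a sitemap index pointing at further sitemaps):

```python
import xml.etree.ElementTree as ET

# Made-up sitemap fragment following the sitemaps.org schema; replace
# with the downloaded contents of the real sitemap.xml.
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.zillow.com/homedetails/example-1/</loc></url>
  <url><loc>https://www.zillow.com/homedetails/example-2/</loc></url>
</urlset>"""

# Sitemap elements live in a namespace, so register it for findall().
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
root = ET.fromstring(sitemap_xml)

# Collect every <loc> URL; each one can then be fetched and parsed
# with BeautifulSoup as in the answers above.
urls = [loc.text for loc in root.findall('.//sm:loc', ns)]
print(urls)
```

The `.//sm:loc` path works whether the file is a `<urlset>` of pages or a `<sitemapindex>` of child sitemaps, since both nest their URLs in `<loc>` elements.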