BeautifulSoup not returning full HTML from Airbnb search page
I'm trying to scrape data from Airbnb using BeautifulSoup and Selenium. I want to collect every listing from this search page.
Here is what I have so far:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
def scrape_page(page_url):
    driver_path = "C:/Users/parkj/Downloads/chromedriver_win32/chromedriver.exe"
    driver = webdriver.Chrome(service = Service(driver_path))
    driver.get(page_url)
    wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'itemprop')))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.close()
    return soup

def extract_listing(page_url):
    page_soup = scrape_page(page_url)
    listings = page_soup.find_element(By.CLASS_NAME, "itemprop")
    return listings
page_url = "https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto%20Prefecture%2C%20Japan&date_picker_type=flexible_dates&search_type=unknown"
#items = extract_listing(page_url)
#process items to get all information you need, just an example
#[{'name':items.select_one('[itemprop="name"]')['content'],
# 'url':items.select_one('[itemprop="url"]')['content']}
# for i in items]
test = scrape_page(page_url)
test
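scrape_page() returns what looks like HTML for the search page, but not the full content. It is missing the information I need, which is this part of the HTML:
Image of HTML Script
I did some research and found that WebDriverWait might help, but I get a TimeoutException.
TimeoutException Error
The end goal is to get the name and URL of each listing.
The first 3 items of the resulting list should look something like this: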
[{'name': '✿Kyoto✿/Near Station & Bus/Temple/Twin Room(^^♪✿✿',
'url': 'www.airbnb.com/rooms/50290730?adults=1&children=0&infants=0&check_in=2022-07-20&check_out=2022-07-27&previous_page_section_name=1000'},
{'name': 'Stay in Kyoto central island',
'url': 'www.airbnb.com/rooms/42780789?adults=1&children=0&infants=0&check_in=2022-06-21&check_out=2022-06-28&previous_page_section_name=1000'},
{'name': '和楽庵【Single】100 Year old Machiya Guest House (1pax)',
'url': 'www.airbnb.com/rooms/48645312?adults=1&children=0&infants=0&check_in=2022-07-27&check_out=2022-08-03&previous_page_section_name=1000'}]
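Apologies in advance if I haven't included enough information in this question; this is my first time posting here.
Any help would be greatly appreciated. Thank you.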
I don't use selenium often, but I would recommend the requests library.
Try this:
from requests import get
from bs4 import BeautifulSoup
headers = {'User-agent':'Mozilla/5.0 (X11; Linux i686; rv:100.0) Gecko/20100101 Firefox/100.0.'}
res = get('https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto%20Prefecture%2C%20Japan&date_picker_type=flexible_dates&search_type=unknown', headers=headers)
soup = BeautifulSoup(res.text, features="html.parser")
url_list = soup.find_all("meta", attrs={"itemprop":"url"})
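In my case it returned 20 results, which is how many a single page displays.
If you want more results, you will need to scrape more pages.
Using the Firefox user agent is important here. It is an old scraping trick: many sites won't block the request when this agent is used.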
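The snippet above stops at the list of meta tags. As a rough sketch of going from the fetched HTML to name/URL pairs, something like the following might work. It assumes each listing is wrapped in an element with itemprop="itemListElement" that contains meta tags for "name" and "url", which is the structure the other answer relies on and may change at any time. listings_from_html is a hypothetical helper name, not part of requests or BeautifulSoup.

```python
from bs4 import BeautifulSoup


def listings_from_html(html):
    """Return a list of {'name', 'url'} dicts found in an HTML string.

    Assumption: every listing sits inside an element carrying
    itemprop="itemListElement" with <meta itemprop="name"> and
    <meta itemprop="url"> children.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.select('[itemprop="itemListElement"]'):
        name = item.select_one('meta[itemprop="name"]')
        url = item.select_one('meta[itemprop="url"]')
        # Skip incomplete entries instead of raising on a missing tag.
        if name is None or url is None:
            continue
        results.append({"name": name.get("content"), "url": url.get("content")})
    return results
```

With the code above it would be called as listings_from_html(res.text).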
Select the element you are waiting for more specifically, in this case with a CSS selector:
wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '[itemprop="itemListElement"]')))
Also avoid using selenium syntax on a BeautifulSoup object; use CSS selectors in bs4 syntax instead:
listings = page_soup.select('[itemprop="itemListElement"]')
Example
...
def scrape_page(page_url):
    driver.get(page_url)
    wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '[itemprop="itemListElement"]')))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.close()
    return soup

def extract_listing(page_url):
    page_soup = scrape_page(page_url)
    listings = page_soup.select('[itemprop="itemListElement"]')
    return listings
page_url = "https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto%20Prefecture%2C%20Japan&date_picker_type=flexible_dates&search_type=unknown"
items = extract_listing(page_url)
#process items to get all information you need, just an example
[{'name':i.select_one('[itemprop="name"]')['content'],
  'url':i.select_one('[itemprop="url"]')['content']}
 for i in items]
Output
[{'name': '✿Kyoto✿/Nähe Bahnhof & Bus/Tempel/Einzelzimmer(^^♪',
'url': 'www.airbnb.de/rooms/50293998?adults=1&children=0&infants=0&check_in=2022-06-22&check_out=2022-06-29&previous_page_section_name=1000'},
{'name': '100 Jahre altes Machiya-Gästehaus (1Pax)',
'url': 'www.airbnb.de/rooms/48645312?adults=1&children=0&infants=0&check_in=2022-08-22&check_out=2022-08-29&previous_page_section_name=1000'},
{'name': '27, Deluxe Designer Zweibett- / Dreibettzimmer in Shijo (1-3 Personen / Nichtraucher)',
'url': 'www.airbnb.de/rooms/41413491?adults=1&children=0&infants=0&check_in=2023-05-16&check_out=2023-05-23&previous_page_section_name=1000'},
{'name': 'Aufenthalt auf der zentralen Insel Kyoto',
'url': 'www.airbnb.de/rooms/42780789?adults=1&children=0&infants=0&check_in=2022-06-24&check_out=2022-07-01&previous_page_section_name=1000'},
{'name': 'Sweet 202 Privatzimmer ☘️',
'url': 'www.airbnb.de/rooms/30217767?adults=1&children=0&infants=0&check_in=2022-07-18&check_out=2022-07-25&previous_page_section_name=1000'},
{'name': 'Kyoto Sanjo Ohashi Superior Zweibettzimmer Studio Nichtraucher Superior Zweibettzimmer',
'url': 'www.airbnb.de/rooms/45207535?adults=1&children=0&infants=0&check_in=2022-09-27&check_out=2022-10-04&previous_page_section_name=1000'},
{'name': 'Toller Blick auf den Fluss, schönes traditionelles Haus',
'url': 'www.airbnb.de/rooms/25762078?adults=1&children=0&infants=0&check_in=2022-12-07&check_out=2022-12-14&previous_page_section_name=1000'},
{'name': 'Doppelzimmer - Waschmaschine in allen Zimmern ☆ Guest House 10-Minuten zu Fuß von Kyoto Station -',
'url': 'www.airbnb.de/rooms/51433076?adults=1&children=0&infants=0&check_in=2022-06-13&check_out=2022-06-20&previous_page_section_name=1000'},
{'name': 'In der Nähe des Bahnhofs Kyoto Gemütliches Zimmer in einem traditionellen Haus',
'url': 'www.airbnb.de/rooms/25600163?adults=1&children=0&infants=0&check_in=2022-09-12&check_out=2022-09-19&previous_page_section_name=1000'},
{'name': 'Gemütliche und ruhige zweistöckige japanische Wohnung',
'url': 'www.airbnb.de/rooms/38743436?adults=1&children=0&infants=0&check_in=2023-03-11&check_out=2023-03-18&previous_page_section_name=1000'},
{'name': '51★Günstigste★5 Minuten zu Fuß Shin-Osaka Sta.★Max 1 Gäste',
'url': 'www.airbnb.de/rooms/14539052?adults=1&children=0&infants=0&check_in=2022-07-03&check_out=2022-07-10&previous_page_section_name=1000'},
{'name': '和楽庵【Doppel】100 Jahre altes Machiya Gästehaus (2pax)',
'url': 'www.airbnb.de/rooms/22867502?adults=1&children=0&infants=0&check_in=2022-08-26&check_out=2022-09-02&previous_page_section_name=1000'},
{'name': 'Expo Hostel Nishi #1 /1000yen Fahrrad für deinen Aufenthalt',
'url': 'www.airbnb.de/rooms/8295322?adults=1&children=0&infants=0&check_in=2022-08-27&check_out=2022-09-03&previous_page_section_name=1000'},
{'name': '★Lovely RiverSide House in★der Nähe von Einkaufsviertel★3 Betten',
'url': 'www.airbnb.de/rooms/40117962?adults=1&children=0&infants=0&check_in=2022-07-07&check_out=2022-07-14&previous_page_section_name=1000'},
{'name': 'ZIMMER - Bereich Central Kyoto Gion',
'url': 'www.airbnb.de/rooms/15215980?adults=1&children=0&infants=0&check_in=2022-06-14&check_out=2022-06-21&previous_page_section_name=1000'},
{'name': 'Raum, um das Kyoto zu genießen.',
'url': 'www.airbnb.de/rooms/9263813?adults=1&children=0&infants=0&check_in=2022-09-08&check_out=2022-09-15&previous_page_section_name=1000'},
{'name': 'Stilvolles modernes Kyo-Machiya 500 金閣寺 m vom Trockner entfernt',
'url': 'www.airbnb.de/rooms/20041502?adults=1&children=0&infants=0&check_in=2022-07-27&check_out=2022-08-03&previous_page_section_name=1000'},
{'name': 'Hotel Sou Kyoto Gion Queen Studio',
'url': 'www.airbnb.de/rooms/40236377?adults=1&children=0&infants=0&check_in=2022-06-22&check_out=2022-06-29&previous_page_section_name=1000'},
{'name': 'Workation GroLiving in KYOTO',
'url': 'www.airbnb.de/rooms/612511811801466646?adults=1&children=0&infants=0&check_in=2022-07-19&check_out=2022-07-26&previous_page_section_name=1000'},
{'name': '【home quarantin ok】shibainuatiniya/Kyoto Sta/Toji',
'url': 'www.airbnb.de/rooms/34028813?adults=1&children=0&infants=0&check_in=2022-06-21&check_out=2022-06-28&previous_page_section_name=1000'}]
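Note that the URLs in this output have no scheme and carry the search's query parameters. If clean links are wanted, a small standard-library sketch like this could be applied to each 'url' value. normalize_listing_url is a hypothetical helper name, and defaulting to https is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_listing_url(raw, keep_query=False):
    """Add a scheme if missing and optionally drop the query string."""
    if "://" not in raw:
        # Assumption: the site is served over https.
        raw = "https://" + raw
    parts = urlsplit(raw)
    query = parts.query if keep_query else ""
    # Rebuild the URL without the fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
```

For example, normalize_listing_url('www.airbnb.de/rooms/48645312?adults=1&children=0') gives 'https://www.airbnb.de/rooms/48645312'.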