Scraping hotel info using an existing list of URLs in a csv file

Question

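I scraped the URLs of 3 hotel info pages from TripAdvisor and stored them in a csv file. After importing the csv file, I have to use these 3 URLs to scrape each hotel's name, its price range and its hotel class. I am using Selenium.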

Name	Link
The Upper House	https://en.tripadvisor.com.hk/Hotel_Review-g294217-d1513860-Reviews-The_Upper_House-Hong_Kong.html
Hotel ICON	https://en.tripadvisor.com.hk/Hotel_Review-g294217-d2031570-Reviews-Hotel_ICON-Hong_Kong.html
W Hong Kong	https://en.tripadvisor.com.hk/Hotel_Review-g294217-d1068719-Reviews-W_Hong_Kong-Hong_Kong.html

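Here is my code. With a single hotel URL I can scrape the hotel name. However, when there are several hotels to scrape, it does not work. Something seems to be wrong with the "for" loop.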

!pip install selenium

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
import csv
from time import sleep
from time import time
from random import randint

browser = webdriver.Chrome(executable_path= 'C:\ProgramData\Anaconda3\Lib\site-packages\jupyterlab\chromedriver.exe')
result_list=[]

def start_request(q):
   r = browser.get(q)
   print("crawling "+q)
   return r

def parse(text):
   container1 = browser.find_elements_by_xpath('//*[@id="taplc_hotel_review_atf_hotel_info_web_component_0"]')
   mydict = {}

   for results in container1:
        try:
            mydict['name'] = results.find_element_by_xpath('//*[@id="HEADING"]')

        except Exception as e:
            print(e)
            print('not____________________________found')
            mydict['name'] = 'null'
            result_list.append(mydict)

with open('Best3HotelsLink.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
          req = row['Link']
          text = start_request(req)
          parse(text)
          sleep(randint(1,3))

import pandas as pd
df = pd.DataFrame(result_list)
df.to_csv('Detailed Hotelinfo.csv')
df

I also tried to scrape the hotel class and the price range, but without success.

I would like your advice on how to solve the problems above. Thank you very much.

Answer 1

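If you have a lot of information to scrape, I suggest re-locating the elements on every iteration instead of reusing the first lookup.

Try this code:

def parse(text):
    sleep(2)   # give the page time to load; the question imports sleep directly, so time.sleep would fail
    container1 = browser.find_elements_by_xpath('//*[@id="taplc_hotel_review_atf_hotel_info_web_component_0"]')
    nbrcontainer = len(container1)

    for i in range(0, nbrcontainer):
        # look the containers up again on each pass so the reference is fresh
        container1 = browser.find_elements_by_xpath('//*[@id="taplc_hotel_review_atf_hotel_info_web_component_0"]')
        results = container1[i]
        mydict = {}   # a new dict for each result
        try:
            # .text stores the heading string rather than the WebElement object
            mydict['name'] = results.find_element_by_xpath('//*[@id="HEADING"]').text
        except Exception as e:
            print(e)
            print('not____________________________found')
            mydict['name'] = 'null'
        # append on success as well as on failure
        result_list.append(mydict)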
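
The loop structure matters at least as much as the locator. In the question's parse, result_list.append(mydict) sits inside the except branch, so a hotel whose name is found is never saved. Below is a minimal sketch of the intended loop using only the standard library. fake_fetch is a hypothetical stand-in for the Selenium lookup, and the example.com links and the in-memory CSV are made up so that the snippet runs without a browser.

```python
import csv
import io


def scrape_all(csv_file, fetch_name):
    """Visit every URL in the 'Link' column and return one dict per row."""
    results = []
    for row in csv.DictReader(csv_file):
        record = {"link": row["Link"]}      # fresh dict for each hotel
        try:
            record["name"] = fetch_name(row["Link"])
        except Exception as exc:
            print(f"not found: {exc}")
            record["name"] = "null"
        results.append(record)              # runs on success and on failure
    return results


def fake_fetch(url):
    # Hypothetical stand-in for browser.get(url) plus the heading lookup.
    if url.endswith("missing"):
        raise LookupError("HEADING not found")
    return url.rsplit("/", 1)[-1]


# Made-up CSV with the same column names as the question's file.
sample = io.StringIO(
    "Name,Link\n"
    "The Upper House,https://example.com/upper-house\n"
    "Hotel ICON,https://example.com/hotel-icon\n"
    "W Hong Kong,https://example.com/missing\n"
)

rows = scrape_all(sample, fake_fetch)
for r in rows:
    print(r)
```

Swapping fake_fetch for a function that calls browser.get(url) and returns the heading's .text should then give one row per hotel, with 'null' only where the lookup fails.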

Answer 2

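I'm not good with selenium, so here is how to get the price range and the hotel class with beautifulsoup. Both sit in different divs that share the same ID (...), which makes them hard to scrape. I don't think selenium can handle the first selector, but the second one should work.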

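from bs4 import BeautifulSoup

# html_data is the page HTML, for example browser.page_source after browser.get(url)
soup = BeautifulSoup(html_data, 'lxml')
price_range = soup.select_one('div:-soup-contains("PRICE RANGE") + div').text
hotel_class = soup.select_one('#ABOUT_TAB svg[title*="bubbles"]')['title']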
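
To see how the two selectors behave, here is a small sketch run against a hypothetical, heavily simplified snippet rather than the real TripAdvisor page. The markup, the price value and the text_or_null helper are assumptions made for illustration, and html.parser is used only so the example does not depend on lxml. select_one returns None when nothing matches, so the sketch checks for that before reading the text or the attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page is far larger and may differ.
html_data = """
<div id="ABOUT_TAB">
  <svg title="5.0 of 5 bubbles"></svg>
  <div>
    <div>PRICE RANGE</div>
    <div>HK$3,000 - HK$6,000</div>
  </div>
</div>
"""

soup = BeautifulSoup(html_data, "html.parser")


def text_or_null(node):
    """Return the stripped text of a tag, or 'null' if the tag was not found."""
    return node.get_text(strip=True) if node is not None else "null"


# The div that directly follows a sibling div containing "PRICE RANGE".
price_node = soup.select_one('div:-soup-contains("PRICE RANGE") + div')
# An svg inside #ABOUT_TAB whose title attribute contains "bubbles".
class_node = soup.select_one('#ABOUT_TAB svg[title*="bubbles"]')

price_range = text_or_null(price_node)
hotel_class = class_node["title"] if class_node is not None else "null"

print(price_range)
print(hotel_class)
```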

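They have an API, which may be worth it if you are going to do a lot of scraping on this site. The site's code is messy enough that I think it is already worth it, but that's just my opinion.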

python

selenium

tripadvisor