Is it possible to scrape websites with Python without a browser installed?
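I'm trying to scrape data from a website using Python. The problem is that no browser is installed and none can be installed (it is a plain Debian OS with no GUI). I thought I might be able to use the Chrome driver in headless mode with Selenium, but it doesn't seem to work.
Here is my test code: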
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
options = Options()
options.headless = True
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.get('https://www.kino-teatr.ru/')
search_bar = driver.find_element_by_id('search_input_top') # find search bar
search_bar.send_keys('Avengers') # enter the name of the movie
search_bar.send_keys(Keys.ENTER) # get the results
page = driver.page_source
soup = BeautifulSoup(page, 'html.parser')
div = soup.find('div', class_='list_item') # find the first item
print(div.find('a')['href']) # find a link to the page
It gives me the following error:
WebDriverException: Message: unknown error: cannot find Chrome binary
Stacktrace:
#0 0x5606b093c113 <unknown>
#1 0x5606b04046d8 <unknown>
#2 0x5606b04259c9 <unknown>
#3 0x5606b042319a <unknown>
#4 0x5606b045de0a <unknown>
#5 0x5606b0457f53 <unknown>
#6 0x5606b042dbda <unknown>
#7 0x5606b042eca5 <unknown>
#8 0x5606b096d8dd <unknown>
#9 0x5606b0986a9b <unknown>
#10 0x5606b096f6b5 <unknown>
#11 0x5606b0987725 <unknown>
#12 0x5606b096308f <unknown>
#13 0x5606b09a4188 <unknown>
#14 0x5606b09a4308 <unknown>
#15 0x5606b09bea6d <unknown>
#16 0x7f35ddc8bea7 <unknown>
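I have already tried installing the driver as described here and installing additional libraries as described here, but without success.
Is it possible to use Selenium without a browser installed? What should I do to make this work?
Thanks in advance for any help or advice!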
You can try installing the requests library and doing the following to get the HTML of the page you need:
>>> import requests
>>> url = 'https://www.geeksforgeeks.org'
>>> response = requests.get(url).text
>>> '7 Alternative Career Paths For Software Engineers' in response
True
Then you can parse the page with lxml or BeautifulSoup.
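To illustrate the parsing step, here is a minimal sketch using BeautifulSoup with the built-in 'html.parser'. The sample HTML and the helper name first_result_link are made up for the example; the div class list_item is taken from the question's code, but the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the text returned by requests.get(url).text
sample_html = """
<html><body>
  <div class="list_item"><a href="/movie/1/">First result</a></div>
  <div class="list_item"><a href="/movie/2/">Second result</a></div>
</body></html>
"""


def first_result_link(page_text):
    """Return the href of the first link inside a div.list_item, or None."""
    soup = BeautifulSoup(page_text, 'html.parser')
    item = soup.find('div', class_='list_item')  # first matching div
    if item is None:
        return None
    link = item.find('a')
    if link is None or not link.has_attr('href'):
        return None
    return link['href']


print(first_result_link(sample_html))
```

Returning None instead of indexing blindly avoids an exception when the page contains no results.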
Update
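import requests
from lxml import html
# Send the search form directly; the query is passed as cp1251-encoded bytes
response = requests.post('https://www.kino-teatr.ru/search/', data={'text': 'мстители'.encode('cp1251')}).content
doc = html.fromstring(response)
entries = doc.xpath('//div[@class="list_item_name"]/h4')  # headings of the search results
first_movie = entries[0].text_content()  # raises IndexError if nothing was found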
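The .encode('cp1251') call matters because the bytes that end up in the form body depend on the encoding. A small sketch with the standard library's urlencode shows the difference; as far as I know requests builds the form body in a similar way, and that the site expects Windows-1251 is an assumption based on the code above.

```python
from urllib.parse import urlencode

query = 'мстители'

# Bytes values are percent-encoded byte by byte, so the chosen encoding
# determines what the server receives.
form_cp1251 = urlencode({'text': query.encode('cp1251')})
form_utf8 = urlencode({'text': query.encode('utf-8')})

print(form_cp1251)  # text=%EC%F1%F2%E8%F2%E5%EB%E8
print(form_utf8)    # twice as long: each Cyrillic letter takes two bytes in UTF-8
```

If the search returns nothing for a Cyrillic query, a mismatch between these two encodings is one of the first things worth checking.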