Scrape urls from dynamic webpage using Scrapy
I want to make a web crawler in Scrapy that extracts 10,000 news links from this website: https://hamariweb.com/news/newscategory.aspx?cat=7
This webpage is dynamic: more links load as I scroll down.
I tried it with selenium, but it didn't work.
import time

import scrapy
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from scrapy import signals
from scrapy.http import HtmlResponse


class WebnewsSpider(scrapy.Spider):
    name = 'webnews'
    allowed_domains = ['www.hamariweb.com']
    start_urls = ['https://hamariweb.com/news/newscategory.aspx?cat=7']

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--start-maximized")
        # options.add_argument('--blink-settings=imagesEnabled=false')
        options.add_argument('--ignore-certificate-errors')
        options.add_argument('--incognito')
        self.driver = webdriver.Chrome("C://Users//hammad//Downloads//chromedriver",
                                       options=options)

    def parse(self, response):
        self.driver.get(response.url)
        pause_time = 1
        last_height = self.driver.execute_script("return document.body.scrollHeight")
        # start = datetime.datetime.now()
        while True:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight + 400);")
            time.sleep(pause_time)
            print("\n\n\nend\n\n\n")
            new_height = self.driver.execute_script("return document.body.scrollHeight")
The code above opens the browser in incognito mode and keeps scrolling down. I also want to extract 10,000 news links and stop the browser once that limit is reached.
You can add the URL-collecting logic to your parse() method by gathering the hrefs with a CSS selector:
    def parse(self, response):
        self.driver.get(response.url)
        pause_time = 1
        last_height = self.driver.execute_script("return document.body.scrollHeight")
        # start = datetime.datetime.now()
        urls = []
        while True:
            if len(urls) <= 10000:
                # Re-parse the rendered page: `response` only holds the initial static
                # HTML, so links loaded by scrolling exist only in the driver's DOM
                page = scrapy.Selector(text=self.driver.page_source)
                for href in page.css('a::attr(href)').getall():
                    urls.append(href)  # Follow the tutorial to learn how to use each href as you need
            else:
                break  # Exit the while True loop once 10,000 links have been collected
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight + 400);")
            time.sleep(pause_time)
            new_height = self.driver.execute_script("return document.body.scrollHeight")
The Scrapy tutorial has a lot of information on how to work with links in its following links section. You can use the information there to learn what else you can do with links in Scrapy.
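Note that the hrefs collected this way will often be relative (e.g. a path without the domain), so before storing or requesting them you would typically absolutize them against the page URL, which is what the tutorial's response.follow does for you. A minimal standard-library sketch, using made-up example paths:

```python
from urllib.parse import urljoin

base = 'https://hamariweb.com/news/newscategory.aspx?cat=7'
# Hypothetical relative hrefs as they might come out of a::attr(href)
hrefs = ['/news/article.aspx?id=1', 'newscategory.aspx?cat=8']

# urljoin resolves each href against the page URL, handling both
# root-relative (/news/...) and directory-relative forms
absolute = [urljoin(base, h) for h in hrefs]
```
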
I haven't tested this with infinite scrolling, so you may need to make some changes, but this should point you in the right direction.
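The stopping logic (collect unique links until 10,000 are found or the page stops growing) can also be sketched independently of Selenium. Here `fetch_hrefs` and `scroll_once` are hypothetical stand-ins for the driver calls; in the spider they would wrap page_source parsing and the scroll-and-wait step:

```python
def collect_links(fetch_hrefs, scroll_once, limit=10000):
    """Collect unique hrefs until `limit` is reached or scrolling adds nothing new."""
    urls = set()  # a set deduplicates links that reappear on every re-parse
    while len(urls) < limit:
        before = len(urls)
        urls.update(fetch_hrefs())  # hrefs currently present in the DOM
        grew = scroll_once()        # True if the scroll loaded more content
        if not grew and len(urls) == before:
            break                   # infinite scroll exhausted: stop early
    return list(urls)[:limit]
```

This also handles the case where the site runs out of articles before the limit, which the plain `while True` loop above would not.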