How to grab image links correctly? My scraper only makes blank folders

Question

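My code only creates empty folders; it doesn't download any images.

So I think I need to modify it so that the images are actually downloaded.

I tried to fix it myself, but I can't work out how.

Could anyone help me? Thanks!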

import requests
import parsel
import os
import time

for page in range(1, 310): # Total 309pages
    print(f'======= Scraping data from page {page} =======')
    url = f'https://www.bikeexif.com/page/{page}'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
    response = requests.get(url, headers=headers)
    html_data = response.text
    selector = parsel.Selector(html_data)

    containers = selector.xpath('//div[@class="container"]/div/article[@class="smallhalf"]')

    for v in containers:
        old_title = v.xpath('.//div[2]/h2/a/text()').get()#.replace(':', ' -')
        if old_title is not None:
            title = old_title.replace(':', ' -')
        title_url = v.xpath('.//div[2]/h2/a/@href').get()
        print(title, title_url)

        if not os.path.exists('img\\' + title):
            os.mkdir('img\\' + title)

        response_image = requests.get(url=title_url, headers=headers).text
        selector_image = parsel.Selector(response_image)
        # Full Size Images
        images_url = selector_image.xpath('//div[@class="image-context"]/a[@class="download"]/@href').getall()

        for title_url in images_url:
            image_data = requests.get(url=title_url, headers=headers).content
            file_name = title_url.split('/')[-1]

            time.sleep(1)

            with open(f'img\\{title}\\' + file_name, mode='wb') as f:
                f.write(image_data)
                print('Download complete!!:', file_name)

Answer 1

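This page uses JavaScript to create the "download" links, but requests, urllib, BeautifulSoup, lxml, parsel, and Scrapy can't run JavaScript, and that is what causes the problem.

However, the page appears to use the same URLs to display the images on the page, so you can use //img/@src instead.

That creates another problem: the page uses JavaScript to lazy-load images, so only the first img has a src. The other images keep their URL in data-src (normally JavaScript copies data-src to src as you scroll the page), so you have to read data-src to download the rest.

You need something like this to get both @src (the first image) and @data-src (the other images):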

images_url = selector_image.xpath('//div[@id="content"]//img/@src').getall() + \
             selector_image.xpath('//div[@id="content"]//img/@data-src').getall()
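If you want to see the src/data-src split without hitting the site, here is a minimal sketch that uses only the standard library's html.parser on a made-up HTML snippet. The class name ImgCollector, the sample markup, and the example.com URLs are all invented for illustration; preferring data-src over src when an image has both is my assumption, not something confirmed for this page.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Hypothetical helper: collect one URL per <img>, preferring data-src."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        attrs = dict(attrs)
        # Lazy-loaded images keep the real URL in data-src; src may be
        # missing (or a placeholder) until JavaScript runs in a browser.
        url = attrs.get('data-src') or attrs.get('src')
        if url:
            self.urls.append(url)


# Made-up HTML imitating the structure described above.
sample_html = '''
<div id="content">
  <img src="https://example.com/a.jpg">
  <img data-src="https://example.com/b.jpg">
  <img src="placeholder.gif" data-src="https://example.com/c.jpg">
</div>
'''

collector = ImgCollector()
collector.feed(sample_html)
print(collector.urls)
```

Unlike concatenating two getall() lists, this keeps the images in page order and yields one URL per image.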

Full working code (with other small changes)

I use Linux, where the string img\{title} creates the wrong path, so I use os.path.join('img', title, filename) to create a correct path on Windows, Linux, and Mac.
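As a side note before the full listing, here is a small sketch of why joining the parts is safer than hard-coding a backslash. It uses pathlib's PureWindowsPath and PurePosixPath only so that both results can be printed on any machine; os.path.join gives the equivalent result for whichever system the script runs on. The safe_name helper and the sample title are hypothetical and go a bit further than the replace(':', ' -') used in the listing.

```python
import re
from pathlib import PurePosixPath, PureWindowsPath


def safe_name(title):
    """Hypothetical helper: replace characters Windows forbids in file names."""
    return re.sub(r'[\\/:*?"<>|]', '-', title).strip()


# Made-up article title for illustration.
title = safe_name('Honda CB750: Cafe Racer?')

# The same three parts joined under each platform's rules.
windows_path = PureWindowsPath('img', title, 'photo.jpg')
posix_path = PurePosixPath('img', title, 'photo.jpg')

print(windows_path)  # img\Honda CB750- Cafe Racer-\photo.jpg
print(posix_path)    # img/Honda CB750- Cafe Racer-/photo.jpg
```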

import requests
import parsel
import os
import time

# you can define it once 
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

for page in range(1, 310): # Total 309pages
    
    print(f'======= Scraping data from page {page} =======')
    
    url = f'https://www.bikeexif.com/page/{page}'

    response = requests.get(url, headers=headers)
    selector = parsel.Selector(response.text)

    containers = selector.xpath('//div[@class="container"]/div/article[@class="smallhalf"]')

    for v in containers:

        old_title = v.xpath('.//div[2]/h2/a/text()').get()
        if old_title is None:
            continue  # no title link found, so skip this entry
        title = old_title.replace(':', ' -')

        title_url = v.xpath('.//div[2]/h2/a/@href').get()
        print(title, title_url)

        os.makedirs(os.path.join('img', title), exist_ok=True)  # creates the folder only if it doesn't exist

        response_article = requests.get(url=title_url, headers=headers)
        selector_article = parsel.Selector(response_article.text)
        
        # Full Size Images
        images_url = selector_article.xpath('//div[@id="content"]//img/@src').getall() + \
                     selector_article.xpath('//div[@id="content"]//img/@data-src').getall()

        print('len(images_url):', len(images_url))

        for img_url in images_url:

            response_image = requests.get(url=img_url, headers=headers)
            
            filename = img_url.split('/')[-1]

            with open( os.path.join('img', title, filename), 'wb') as f:
                f.write(response_image.content)
                print('Download complete!!:', filename)
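The loop above passes every collected value straight to requests.get and takes the file name with split('/'). I haven't checked whether this site ever emits relative URLs, data: placeholders, or query strings in its img attributes, so treat the following as an optional hardening sketch under that assumption. The helper names clean_image_urls and filename_from_url, and all the sample URLs, are made up.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def clean_image_urls(page_url, raw_urls):
    """Return absolute http(s) URLs, without duplicates, in original order."""
    seen = set()
    cleaned = []
    for raw in raw_urls:
        if raw.startswith('data:'):
            continue  # inline placeholder, nothing to download
        # urljoin resolves relative paths and scheme-less //host/... forms.
        absolute = urljoin(page_url, raw)
        if urlsplit(absolute).scheme not in ('http', 'https'):
            continue
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


def filename_from_url(url):
    """Take the last path segment, ignoring any ?query or #fragment."""
    return posixpath.basename(urlsplit(url).path)


# Made-up inputs for illustration.
page = 'https://example.com/articles/some-bike'
raw = [
    'https://example.com/img/a.jpg',
    '/img/b.jpg?w=625',
    'data:image/gif;base64,R0lGOD',
    '//cdn.example.com/c.jpg',
    'https://example.com/img/a.jpg',
]

urls = clean_image_urls(page, raw)
names = [filename_from_url(u) for u in urls]
print(urls)
print(names)
```

In the listing, you would call clean_image_urls(title_url, images_url) before the download loop and use filename_from_url(img_url) in place of img_url.split('/')[-1].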

python

web-crawler