Not quite understanding how to perform a request to Google's servers with Python requests

My problem at the moment is that I cannot form the request to Google's servers correctly. I have tried putting in all the request headers my browser (Chrome) uses, but nothing seems to work. The end goal of this is to be able to specify a search term, resolution and the jpg file type in the request, and download the images into a folder. Any suggestions are welcome, and thanks in advance.

Here is my code so far:

import requests

def funRequestsDownload(searchTerm):
    print("Getting image for track ", searchTerm)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
    queryStringParameters = {'hl': "en", "tbm": "isch", "source": "hp", "biw": 1109, "bih": 475, "q": searchTerm, "oq": "meme", "gs_l": "img.3..35i39k1j0l9.21651.21983.0.22205.10.10.0.0.0.0.131.269.2j1.3.0....0...1.1.64.img..7.3.267.0.4mTf5BYtfj8"}
    url = 'http://www.google.co.uk'
    # query-string parameters go in the `params` keyword; a literal
    # "Query String Parameters"=... keyword argument is a syntax error,
    # and content-length should not be set by hand on a GET
    dataDump = requests.get(url, params=queryStringParameters, headers=headers)
    temp = dataDump.content
    with open('C:/Users/Jordan/Desktop/Music Program/temp.html', 'w') as file:
        file.write(str(temp))
    print("Downloaded image for track ", searchTerm)
    return temp

Side note: I know the only thing I am saving is the HTML of the page; that is because it was returning a bad-request page and I wanted to look at said error.
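For reference, the query string the code above is trying to pass belongs in the `params` keyword of `requests.get`. A minimal sketch that builds (without sending) such a request so the final URL can be inspected:

```python
import requests

# requests encodes a dict passed as `params` into the query string for you.
params = {"hl": "en", "tbm": "isch", "q": "meme"}
headers = {"User-Agent": "Mozilla/5.0"}

# Prepare the request without sending it, to inspect the final URL:
req = requests.Request("GET", "http://www.google.co.uk/search",
                       params=params, headers=headers).prepare()
print(req.url)  # http://www.google.co.uk/search?hl=en&tbm=isch&q=meme
```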

Google doesn't like people using scraping to access search results. They prefer that you use their API.

The API they offer is called Google Custom Search. It supports searching for images. To use their API you will need an AdSense account; use the API key you get from it to make your API calls.

The URL you want to hit is

searchUrl = "https://www.googleapis.com/customsearch/v1?q=" + \
             searchTerm + "&start=" + startIndex + "&key=" + key + "&cx=" + cx + \
             "&searchType=image"

Passing that to requests will return your results as JSON.
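A sketch of that call with `requests` — `YOUR_API_KEY` and `YOUR_CX` are placeholders for the API key and custom search engine ID you get from Google:

```python
import requests

def build_search_url(search_term, start_index, key, cx):
    # Mirrors the URL above; searchType=image restricts results to images.
    return ("https://www.googleapis.com/customsearch/v1?q=" + search_term +
            "&start=" + str(start_index) + "&key=" + key + "&cx=" + cx +
            "&searchType=image")

search_url = build_search_url("kittens", 1, "YOUR_API_KEY", "YOUR_CX")
# data = requests.get(search_url).json()  # image URLs are in data["items"][i]["link"]
```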

Further reading.

Firstly, http://www.google.co.uk -> http://www.google.co.uk/search is probably the reason for the bad response.

To scrape images from Google Images, you need to parse the data from the page source (ctrl+u) that is located in the <script> tags. Here are the steps you need to take (simplified, but very close to the actual code below):

  1. Find all <script> tags:
soup.select('script')
  2. Match images data via regex from the <script> tags:
matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
  3. Match the desired (full-resolution) images via regex:
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)

matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                    matched_images_data_json)
  4. Extract and decode them using bytes() and decode():
for fixed_full_res_image in matched_google_full_resolution_images:
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
  5. To save the images, you can use urllib.request.urlretrieve, which is probably one of the easiest solutions.

Sometimes it won't download anything, because the request was sent via a script (bot). If you want to parse images from Google Images or other search engines, you need to pass a user-agent first and only then download the image, otherwise the request will be blocked and it will throw an error.

Pass the user-agent to urllib.request and download an image:

import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
urllib.request.install_opener(opener)

urllib.request.urlretrieve(URL, 'your_folder/image_name.jpg')
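The same download can also be done with `requests`, which lets you pass the header per call instead of installing a global opener; a sketch (the `download_image` helper is hypothetical):

```python
import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

def download_image(url, path):
    # Send the User-Agent with the request and write the raw bytes to disk.
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    with open(path, "wb") as f:
        f.write(response.content)

# download_image(URL, 'your_folder/image_name.jpg')
```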

Code to scrape and download the images locally (with an example in the online IDE):

import requests, lxml, re, json, urllib.request
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "cat",      # query
    "tbm": "isch",   # image results
    "hl": "en",      # language
    "ijn": "0",      # batch of 100 images. "1" is another 100 images and so on.
}

html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')


def get_images_data():

    print('\nGoogle Images Metadata:')
    for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
        title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
        source = google_image.select_one('.fxgdke').text
        link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
        print(f'{title}\n{source}\n{link}\n')

    # these steps could be refactored into a more compact block
    all_script_tags = soup.select('script')

    # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # if you try to json.loads() without json.dumps() it will throw an error:
    # "Expecting property name enclosed in double quotes"
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)

    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')

    print('Google Image Thumbnails:')  # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        #  comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')

        # after first decoding, Unicode characters are still present. After the second iteration, they were decoded.
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)

    # removing previously matched thumbnails for easier full resolution image matches.
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))

    # https://regex101.com/r/fXjfb1/4
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)


    print('\nFull Resolution Images:')  # in order
    for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
        #  comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)

        # ------------------------------------------------
        # Download original images

        # print(f'Downloading {index} image...')

        # opener = urllib.request.build_opener()
        # opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        # urllib.request.install_opener(opener)

        # urllib.request.urlretrieve(original_size_img, f'Bs4_Images/original_size_img_{index}.jpg')

Alternatively, you can achieve the same thing with the Google Images API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to deal with regex, bypass blocks from Google, or maintain the code over time when something breaks (when the HTML changes). Instead, you only need to iterate over structured JSON and grab the data you want.

Example code to integrate:

import os, json # json for pretty output
from serpapi import GoogleSearch

def get_google_images():
    params = {
      "api_key": os.getenv("API_KEY"),
      "engine": "google",
      "q": "pexels cat",
      "tbm": "isch"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))


get_google_images()

---------------
'''
[
... # other images 
  {
    "position": 100, # img number
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRR1FCGhFsr_qZoxPvQBDjVn17e_8bA5PB8mg&usqp=CAU",
    "source": "pexels.com",
    "title": "Close-up of Cat · Free Stock Photo",
    "link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
    "original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
    "is_product": false
  }
]
'''
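Once the results are structured JSON, downloading the originals is a plain loop; a sketch over a hard-coded sample shaped like the output above (assumption: only the fields used here):

```python
import urllib.request

# Sample shaped like SerpApi's images_results output above.
images_results = [
    {"position": 1,
     "original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg"},
]

# Keep only entries that actually carry an "original" link.
original_links = [result["original"] for result in images_results if "original" in result]
print(original_links)

# for index, link in enumerate(original_links):
#     urllib.request.urlretrieve(link, f"SerpApi_Images/original_size_img_{index}.jpg")
```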

P.S. - I wrote a more in-depth blog post about how to scrape Google Images, and how to reduce the chance of being blocked while web scraping search engines.

Disclaimer, I work for SerpApi.