
Problem with searching multiple keywords using google custom search API

I am trying to search for multiple keywords (from the list filteredList) and get a list of results for each search. This is the code I have tried:

from googleapiclient.discovery import build
import csv
import pprint

my_api_key = "xxx"
my_cse_id = "xxx"


def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res['items']


filteredList = ['Optimal Elektronika',
                'Evrascon',
                ]

words = [
    'vakansiya'
    ]

newDictList = []

# this is the htmlSnippets, link and also htmlTitle for filtering over the list of the dictionaries
keyValList = ['link', 'htmlTitle', 'htmlSnippet']

for word in filteredList:
    results = google_search(word, my_api_key, my_cse_id, num=5)
    # print(results)
    newDict = dict()

    for result in results:
        for (key, value) in result.items():
            if key in keyValList:
                if word in newDict['htmlSnippet']:
                    pass
                    newDict[key] = pprint.pprint(value)
        newDictList.append(newDict)
    print(newDictList)

The error I get when running the answer's script:

Traceback (most recent call last):
  File "/Users/valizadavali/PycharmProjects/webScrape/GCS.py", line 39, in <module>
    items = google_search(word, API_KEY, CSE_ID, num=5)
  File "/Users/valizadavali/PycharmProjects/webScrape/GCS.py", line 11, in google_search
    return res['items']
KeyError: 'items'
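As a side note, `res['items']` raises this `KeyError` whenever the API returns no results for a query, because the response dict then has no `items` key. A defensive sketch (`extract_items` is a hypothetical helper, not part of the original code) uses `dict.get` with an empty-list default:

```python
def extract_items(res):
    # the Custom Search response has no 'items' key when a query
    # returns zero results, so fall back to an empty list
    return res.get('items', [])

# simulated responses; real ones come from service.cse().list(...).execute()
print(len(extract_items({'items': [{'link': 'https://example.com'}]})))  # 1
print(len(extract_items({})))  # 0
```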

I don't have an API key to run this code, but I can see a few errors:

When you use

for items in filteredList:

you get the words from the list, not their indexes, so you cannot compare them with numbers.

To get the numbers you can use

for items in range(len(filteredList)):

but it is better to use the first version, and then use items instead of filteredList[items] in

results = google_search(items, my_api_key, my_cse_id, num=5)

If you choose the version with range(len(filteredList)): then do not add 1 to the index - otherwise you get the numbers 1..6 instead of 0..5, so you skip the first element filteredList[0] and the first word is never searched. Later you try to get filteredList[6], which does not exist in the list, and you get an error.
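The off-by-one effect described above can be reproduced with a plain list:

```python
filteredList = ['Optimal Elektronika', 'Evrascon']

# correct: range(len(...)) yields indexes 0..len-1
seen = [filteredList[i] for i in range(len(filteredList))]

# wrong: adding 1 shifts the indexes to 1..len, skipping element 0
# and running past the end of the list
try:
    wrong = [filteredList[i + 1] for i in range(len(filteredList))]
except IndexError:
    wrong = None  # filteredList[2] does not exist

print(seen)   # ['Optimal Elektronika', 'Evrascon']
print(wrong)  # None
```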

for word in filteredList:

    results = google_search(word, my_api_key, my_cse_id, num=5)
    print(results)    

    newDict = dict()

    for result in results:
        for (key, value) in result.items():
            if key in keyValList:
                newDict[key] = value
        newDictList.append(newDict)

    print(newDictList)

BTW: you have to create newDict = dict() in every loop.
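The reason is that appending the same dict object repeatedly puts several references to one object into the list, so every entry ends up showing the last values. A minimal sketch with plain dicts (not the API):

```python
keyValList = ['link', 'htmlTitle', 'htmlSnippet']
results = [{'link': 'a'}, {'link': 'b'}]

# WRONG: one dict reused - every list entry points at the same object
shared = dict()
bad = []
for result in results:
    for key, value in result.items():
        if key in keyValList:
            shared[key] = value
    bad.append(shared)

# RIGHT: a fresh dict per result keeps the entries independent
good = []
for result in results:
    row = dict()
    for key, value in result.items():
        if key in keyValList:
            row[key] = value
    good.append(row)

print(bad)   # [{'link': 'b'}, {'link': 'b'}] - both entries were mutated
print(good)  # [{'link': 'a'}, {'link': 'b'}]
```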


BTW: the standard print() and pprint.pprint() only send text to the screen and always return None, so you cannot assign the displayed text to a variable. If you have to format text, use string formatting for that.
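pprint.pformat() is the variant that returns the formatted text as a string instead of printing it:

```python
import pprint

data = {'htmlTitle': 'Vakansiya', 'link': 'https://example.com'}

shown = pprint.pprint(data)   # prints to the screen, but returns None
text = pprint.pformat(data)   # returns the formatted string instead

print(shown is None)          # True
print(type(text).__name__)    # str
```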


EDIT: the version with range(len(...)) is not preferred in Python.

for index in range(len(filteredList)):

    results = google_search(filteredList[index], my_api_key, my_cse_id, num=5)
    print(results)    

    newDict = dict()

    for result in results:
        for (key, value) in result.items():
            if key in keyValList:
                newDict[key] = value
        newDictList.append(newDict)

    print(newDictList)
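If you ever need the index together with the word, enumerate() is the idiomatic alternative to range(len(...)) (a hypothetical sketch, not part of the answer's script):

```python
filteredList = ['Optimal Elektronika', 'Evrascon']

# enumerate() yields (index, element) pairs, so no manual indexing is needed
pairs = [(index, word) for index, word in enumerate(filteredList)]

print(pairs)  # [(0, 'Optimal Elektronika'), (1, 'Evrascon')]
```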

EDIT:

from googleapiclient.discovery import build
import requests

API_KEY = "AIzXXX"
CSE_ID = "013XXX"

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res['items']

words = [
    'Semkir sistem',
    'Evrascon',
    'Baku Electronics',
    'Optimal Elektroniks',
    'Avtostar',
    'Improtex',
#    'Wayback Machine'
]

filtered_results = list()

keys = ['cacheId', 'link', 'htmlTitle', 'htmlSnippet', ]

for word in words:
    items = google_search(word, API_KEY, CSE_ID, num=5)

    for item in items:
        #print(item.keys()) # to check if every item has the same keys. It seems some items don't have 'cacheId'

        row = dict() # row of data in final list with results 
        for key in keys:
             row[key] = item.get(key) # None if there is no `key` in `item`
             #row[key] = item[key] # ERROR if there is no `key` in `item`

        # generate link to cached page
        if row['cacheId']:
            row['link_cache'] = 'https://webcache.googleusercontent.com/search?q=cache:{}:{}'.format(row['cacheId'], row['link'])
            # TODO: read HTML from `link_cache` and get full text.
            # Maybe the module `newspaper` can be useful for some pages.
            # For other pages module `urllib.request` or `requests` can be needed.
            row['html'] = requests.get(row['link_cache']).text
        else:
            row['link_cache'] = None
            row['html'] = ''

        # Check if the word appears in the title, snippet or cached page. The word may use
        # upper- and lower-case chars, so I convert to lower case to avoid this problem.
        # It may not work if the text uses native chars - i.e. Cyrillic.
        lower_word = word.lower()
        if (lower_word in row['htmlTitle'].lower()) or (lower_word in row['htmlSnippet'].lower()) or (lower_word in row['html'].lower()):
            filtered_results.append(row)
        else:
            print('SKIP:', word)
            print('    :', row['link'])
            print('    :', row['htmlTitle'])
            print('    :', row['htmlSnippet'])
            print('-----')


for item in filtered_results:
    print('htmlTitle:', item['htmlTitle'])
    print('link:', item['link'])
    print('cacheId:', item['cacheId'])
    print('link_cache:', item['link_cache'])
    print('part of html:', item['html'][:300])
    print('---')
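Since the question imports csv, saving filtered_results to a file could look like the following sketch. The sample row and fieldnames are assumptions in the shape of the rows built above; io.StringIO stands in for a real file so the snippet is self-contained:

```python
import csv
import io

# hypothetical rows in the same shape as `filtered_results`
filtered_results = [
    {'cacheId': 'abc', 'link': 'https://example.com',
     'htmlTitle': 'Evrascon', 'htmlSnippet': 'vakansiya ...'},
]

fieldnames = ['cacheId', 'link', 'htmlTitle', 'htmlSnippet']

buffer = io.StringIO()  # use open('results.csv', 'w', newline='') for a real file
# extrasaction='ignore' skips keys like 'html' that are not in fieldnames
writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction='ignore')
writer.writeheader()
writer.writerows(filtered_results)

print(buffer.getvalue().splitlines()[0])  # cacheId,link,htmlTitle,htmlSnippet
```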