Scrape Google Scholar Security Page

I have a string like this:

url = 'http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10'

I want to convert it to:

converted_url = 'https://scholar.google.pl/citations?view_op=search_authors&hl=en&mauthors=label:security&after_author=rukAAOJ8__8J&astart=10'

I tried this:

converted_url = url.decode('utf-8')

However, it throws this error:

AttributeError: 'str' object has no attribute 'decode'

decode converts bytes to str, but your url is already a str, not bytes.

You can use encode to convert the string to bytes, and then decode with the 'unicode_escape' codec to get the correct string.

(I use the r prefix to simulate text with this problem; without the prefix, the url needs no conversion.)

url = r'http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10'
print(url)

url = url.encode('utf-8').decode('unicode_escape')
print(url)

Result:

http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10

http://scholar.google.pl/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:security&after_author=rukAAOJ8__8J&astart=10

BTW: first check print(url); maybe you already have the correct url but used the wrong method to display it. The Python shell displays the result of a bare expression (without print()) as if with print(repr(...)), which shows some characters as escape codes to reveal the encoding used in the text (utf-8, iso-8859-1, win-1250, latin-1, etc.).
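The difference between the two displays can be reproduced directly (the r prefix again simulates the problematic text):

```python
# A raw string keeps the backslash escapes as literal characters,
# which is how the url in the question looks in memory
url = r'hl\x3dpl\x26oe\x3dLatin2'

print(url)        # what print() shows
print(repr(url))  # what a bare expression in the shell shows (backslashes doubled, quotes added)
```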

You can use requests to do the decoding automatically for you.
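Rather than hand-building (and then un-escaping) a query string, you can pass a plain params dict and let requests build and percent-encode the URL. A minimal sketch; no request is actually sent here, since requests.Request(...).prepare() only constructs the URL:

```python
import requests

params = {
    "view_op": "search_authors",
    "hl": "pl",
    "mauthors": "label:security",
}

# prepare() builds the final, correctly encoded URL without sending anything
prepared = requests.Request("GET", "https://scholar.google.pl/citations", params=params).prepare()
print(prepared.url)
```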

Note: the after_author URL parameter is the next-page token, so when you make a request to the exact URL you provided, the returned HTML will not be what you expect, because the after_author URL parameter changes on every request. For example, in my case it was different: uB8AAEFN__8J, while in your URL it is rukAAOJ8__8J.

To make it work, you need to parse the next-page token from the first page, which points to the second page, and so on. For example:

# from my other answer:
# https://github.com/dimitryzub/Whosebug-answers-archive/blob/main/answers/scrape_all_scholar_profiles_bs4.py
import re

params = {
    "view_op": "search_authors",
    "mauthors": "valve",
    "hl": "pl",
    "astart": 0
}

authors_is_present = True
while authors_is_present:
    # ... request the page with `params` and build `soup` (a BeautifulSoup object) here ...

    # if the next page is present -> update the next-page token and increment to the next page
    # if the next page is not present -> exit the while loop
    if soup.select_one("button.gs_btnPR")["onclick"]:
        params["after_author"] = re.search(r"after_author\x3d(.*)\x26", str(soup.select_one("button.gs_btnPR")["onclick"])).group(1)  # -> XB0HAMS9__8J
        params["astart"] += 10
    else:
        authors_is_present = False
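The token extraction in the loop above can be pulled out into a small helper and exercised on a sample onclick value. The sample string below is made up for illustration; depending on how the page serves the attribute, the separators may appear as literal =/& or as the escaped \x3d/\x26 form:

```python
import re

def extract_next_page_token(onclick: str):
    """Pull the after_author token out of the next-page button's onclick value."""
    # In a regex, \x3d and \x26 match the literal characters '=' and '&'
    match = re.search(r"after_author\x3d(.*?)\x26", onclick)
    return match.group(1) if match else None

# Hypothetical onclick value, for illustration only
sample = "window.location='/citations?view_op=search_authors&after_author=XB0HAMS9__8J&astart=10'"
print(extract_next_page_token(sample))  # XB0HAMS9__8J
```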

Code and example to extract profile data in the online IDE:

from parsel import Selector
import requests, json

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "label:security",
    "hl": "pl",
    "view_op": "search_authors"
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

response = requests.get("https://scholar.google.pl/citations", params=params, headers=headers, timeout=30)
selector = Selector(text=response.text)

profiles = []

for profile in selector.css(".gs_ai_chpr"):
    profile_name = profile.css(".gs_ai_name a::text").get()
    profile_link = f'https://scholar.google.com{profile.css(".gs_ai_name a::attr(href)").get()}'
    profile_email = profile.css(".gs_ai_eml::text").get()
    profile_interests = profile.css(".gs_ai_one_int::text").getall()

    profiles.append({
        "profile_name": profile_name,
        "profile_link": profile_link,
        "profile_email": profile_email,
        "profile_interests": profile_interests
    })

print(json.dumps(profiles, indent=2))

Alternatively, you can achieve the same thing using the Google Scholar Profiles API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to figure out how to extract the data, bypass blocks from search engines, scale the number of requests, and so on.

Example code to integrate:

from serpapi import GoogleSearch
import os, json

params = {
    "api_key": os.getenv("API_KEY"),     # SerpApi API key
    "engine": "google_scholar_profiles", # SerpApi profiles parsing engine
    "hl": "pl",                          # language
    "mauthors": "label:security"         # search query
}

search = GoogleSearch(params)
results = search.get_dict()

for profile in results["profiles"]:
    print(json.dumps(profile, indent=2))

# part of the output:
'''
{
  "name": "Johnson Thomas",
  "link": "https://scholar.google.com/citations?hl=pl&user=eKLr0EgAAAAJ",
  "serpapi_link": "https://serpapi.com/search.json?author_id=eKLr0EgAAAAJ&engine=google_scholar_author&hl=pl",
  "author_id": "eKLr0EgAAAAJ",
  "affiliations": "Professor of Computer Science, Oklahoma State University",
  "email": "Zweryfikowany adres z cs.okstate.edu",
  "cited_by": 159999,
  "interests": [
    {
      "title": "Security",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Asecurity",
      "link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:security"
    },
    {
      "title": "cloud computing",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Acloud_computing",
      "link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:cloud_computing"
    },
    {
      "title": "big data",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Abig_data",
      "link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:big_data"
    }
  ],
  "thumbnail": "https://scholar.google.com/citations/images/avatar_scholar_56.png"
}
'''

Disclaimer, I work for SerpApi.