Trouble finding content from python beautiful scraping

I am trying to scrape this page and get the URL of the title of each article, which is an 'h3' 'a' element, e.g. the first result is a link with the text "Functional annotation of a full-length mouse cDNA collection" that links to this page.

My search returns "[]".

My code is as follows:

import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.lens.org/lens/scholar/search/results?q="edith%20cowan"')
soup = BeautifulSoup(req.content, "html5lib")
article_links = soup.select('h3 a')
print(article_links)

Where am I going wrong?

You are having this problem because you are using the wrong link to get the articles. So, without changing much, I came up with this code (note that I removed the bs4 module because it is no longer needed):

import requests

search = "edith cowan"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

json = {"scholarly_search":{"from":0,"size":"10","_source":{"excludes":["referenced_by_patent_hash","referenced_by_patent","reference"]},"query":{"bool":{"must":[{"query_string":{"query":f"\"{search}\"","fields":["title","abstract","default"],"default_operator":"and"}}],"must_not":[{"terms":{"publication_type":["unknown"]}}],"filter":[]}},"highlight":{"pre_tags":["<span class=\"highlight\">"],"post_tags":["</span>"],"fields":{"title":{}},"number_of_fragments":0},"sort":[{"referenced_by_patent_count":{"order":"desc"}}]},"view":"scholar"}

req = requests.post("https://www.lens.org/lens/api/multi/search?request_cache=true", headers = headers, json = json).json()

links = [] #the article links are stored here
for x in req["query_result"]["hits"]["hits"]:
    links.append("https://www.lens.org/lens/scholar/article/{}/main".format(x["_source"]["record_lens_id"]))

The search variable holds the term you are searching for (in your case "edith cowan"). The links are stored in the links variable.
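To make the mapping from the API response to article URLs concrete, here is a minimal sketch with a mocked response (no network; the record_lens_id values below are made-up placeholders, not real data):

```python
# Mocked API response with the same shape as the real one
# (the hits below are fabricated placeholders, not real records).
fake_req = {
    "query_result": {"hits": {"hits": [
        {"_source": {"record_lens_id": "000-000-000-000-000"}},
        {"_source": {"record_lens_id": "111-111-111-111-111"}},
    ]}}
}

# Same loop as above: each hit's record_lens_id becomes an article URL.
links = []
for x in fake_req["query_result"]["hits"]["hits"]:
    links.append("https://www.lens.org/lens/scholar/article/{}/main".format(x["_source"]["record_lens_id"]))
```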


EDIT: How I did it

So the main question is probably where I got the link from and how I knew what to include in the json variable. For that I used a simple HTTP interceptor (in my case, Burp Suite Community Edition).

This tool showed me that when you visit this URL (the one that you used in your question to send the request to), your browser sends a POST request to https://www.lens.org/lens/api/multi/search?request_cache=true, which then retrieves all the info for the current page. Burp Suite also shows you which data is sent with the request, so I copy-pasted it into the json variable.

For better visualization, this is what it looks like inside Burp Suite:


EDIT: Scanning all the pages

To scan all the pages, you can use the following script:

import requests

search = "edith cowan" #Change this to the term you are searching for
r_to_show = 100 #This is the number of articles per page (I strongly recommend leaving it at 100)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

json = {"scholarly_search":{"from":0,"size":f"{r_to_show}","_source":{"excludes":["referenced_by_patent_hash","referenced_by_patent","reference"]},"query":{"bool":{"must":[{"query_string":{"query":f"\"{search}\"","fields":["title","abstract","default"],"default_operator":"and"}}],"must_not":[{"terms":{"publication_type":["unknown"]}}],"filter":[]}},"highlight":{"pre_tags":["<span class=\"highlight\">"],"post_tags":["</span>"],"fields":{"title":{}},"number_of_fragments":0},"sort":[{"referenced_by_patent_count":{"order":"desc"}}]},"view":"scholar"}

req = requests.post("https://www.lens.org/lens/api/multi/search?request_cache=true", headers = headers, json = json).json()

links = [] #links are stored here
count = 0

#link_before and link_after helps determine when to stop going to the next page 
link_before = 0
link_after = 0

while True:
    if count > 0:
        req = requests.post("https://www.lens.org/lens/api/multi/search?request_cache=true", headers = headers, json = json).json()
    for x in req["query_result"]["hits"]["hits"]:
        links.append("https://www.lens.org/lens/scholar/article/{}/main".format(x["_source"]["record_lens_id"]))
    count += 1
    link_after = len(links)
    if link_after == link_before:
        break
    link_before = len(links)
    json["scholarly_search"]["from"] += r_to_show #move on to the next page
    print(f"page {count} done, links recorded {len(links)}")

I added some comments in the code to make it easier to understand.
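The stop condition (comparing link_before and link_after) can be seen in isolation with a fake paginated source. This sketch uses a made-up fetch_page function in place of the real API call, but the loop logic is the same: keep requesting the next page until a page adds no new links.

```python
# Fake paginated source: three pages of results, then empty pages forever.
# This is a stand-in for the real requests.post call, just to show the loop.
PAGES = [["a", "b"], ["c", "d"], ["e"]]

def fetch_page(from_, size):
    page = from_ // size
    return PAGES[page] if page < len(PAGES) else []

r_to_show = 2      # results per page
links = []
link_before = 0
from_ = 0

while True:
    links.extend(fetch_page(from_, r_to_show))
    link_after = len(links)
    if link_after == link_before:   # no new links -> last page reached
        break
    link_before = link_after
    from_ += r_to_show              # move on to the next page

# links is now ["a", "b", "c", "d", "e"]
```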