Post request with scrapy on homepage with ajax

Question

I am trying to scrape the prices of various pharmacies on the website https://www.medizinfuchs.de for a specific drug (e.g., https://www.medizinfuchs.de/preisvergleich/aspirin-complex-beutel-20-st-bayer-vital-gmbh-pzn-4114918.html).

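The page uses infinite scrolling that is triggered by a "load more" button. Using the network analysis in the developer tools, I can see that clicking this button sends a POST request to https://www.medizinfuchs.de/ajax_apotheken.
If I copy this POST request as cURL and convert it with curl2scrapy, I get the following code: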

from scrapy import Request

url = 'https://www.medizinfuchs.de/ajax_apotheken'

request = Request(
    url=url,
    method='POST',
    dont_filter=True,
)

fetch(request)

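The network analysis shows that the response to the POST request is HTML (similar to the main page), but it lists all pharmacies with their prices, not just the ten pharmacies shown on the main page before I click the load more button.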

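My somewhat embarrassing question, as I am still an absolute beginner: how do I integrate this POST request into my existing Python code so that it goes through all pharmacies and collects the price information for each of them? My code so far is: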

import scrapy

class MedizinfuchsSpider(scrapy.Spider):
    name = "medizinfuchs"
    start_urls = [
            'https://www.medizinfuchs.de/preisvergleich/aspirin-complex-beutel-20-st-bayer-vital-gmbh-pzn-4114918.html'
        ]
        
    def parse(self, response):
        for apotheke in response.css('div.apotheke'):
            yield {
                'name': apotheke.css('a.name::text').getall(),
                'single': apotheke.css('div.single::text').getall(),
                'shipping': apotheke.css('div.shipping::text').getall(),
            }

Thank you very much for your support :-).

Christian

Answer 1

If you are open to a suggestion that uses only requests and beautifulsoup, you can:

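- use requests.Session() to store cookies and make the first call to the product url with s.get(url). This gives you a product_history cookie whose value is equal to the product id

- use requests.post to call the API you spotted in the Chrome dev tools and specify that id in the form data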

The following example iterates over a list of products and performs the flow described above:

import requests
from bs4 import BeautifulSoup
import pandas as pd

products = [
    "https://www.medizinfuchs.de/preisvergleich/aspirin-complex-beutel-20-st-bayer-vital-gmbh-pzn-4114918.html",
    "https://www.medizinfuchs.de/preisvergleich/alcohol-pads-b.braun-100-st-b.-braun-melsungen-ag-pzn-629703.html"
]

results = []

for url in products:
    # The first GET stores the product_history cookie, which holds the product id
    s = requests.Session()
    r = s.get(url)
    product_id = s.cookies.get_dict()["product_history"]

    soup = BeautifulSoup(r.text, "html.parser")
    # Skip the first 5 characters of the list item (presumably the label before the number)
    pzn = soup.find("li", {"class": "pzn"}).text[5:]
    print(f'pzn: {pzn}')

    # Call the AJAX endpoint with the product id and a limit of 300 entries
    r = requests.post("https://www.medizinfuchs.de/ajax_apotheken",
                      data={
                          "params[ppn]": product_id,
                          "params[entry_order]": "single_asc",
                          "params[filter][rating]": "",
                          "params[filter][country]": 7,
                          "params[filter][favorit]": 0,
                          "params[filter][products_from][de]": 0,
                          "params[filter][products_from][at]": 0,
                          "params[filter][send]": 1,
                          "params[limit]": 300,
                          "params[merkzettel_sel]": "",
                          "params[merkzettel_reload]":  "",
                          "params[apo_id]":  ""
                      })
    soup = BeautifulSoup(r.text, "html.parser")
    data = [
        {
            "name": t.find("a").text.strip(),
            "single": t.find("div", {"class": "single"}).text.strip(),
            "shipping": t.find("div", {"class": "shipping"}).text.strip().replace("\t", "").replace("\n", " "),
        }
        for t in soup.find_all("div", {"class": "apotheke"})
    ]
    for t in data:
        results.append({
            "pzn": pzn,
            **t
        })
df = pd.DataFrame(results)
df.to_csv('result.csv', index=False)
print(df)

repl.it: https://replit.com/@bertrandmartel/ScrapeMedicinFuchs

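Note that in the requests solution above, I use requests.Session() only to obtain the product_history cookie. The session is not needed for the subsequent call. This way, I get the product id directly without running a regex over the html/js. There may be a better way to get the product id, though. We can't take it from the url as it stands, since the url only contains the partial product id 4114918 and not 1104114918 (unless you are willing to hardcode the 110 prefix).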
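
For completeness, the hardcoded-prefix alternative could look roughly like the sketch below. It rests on a guess: the rule "product id = 110 followed by the PZN from the url" is inferred from the single example 4114918 / 1104114918 and may well not hold for other products, for instance those with shorter PZNs. The function names pzn_from_url and product_id_from_url are invented for this example.

```python
import re

# Matches the trailing "pzn-<digits>.html" part of a product url
PZN_PATTERN = re.compile(r"pzn-(\d+)\.html$")


def pzn_from_url(url):
    """Return the PZN digits at the end of a product url, or None if absent."""
    match = PZN_PATTERN.search(url)
    return match.group(1) if match else None


def product_id_from_url(url, prefix="110"):
    """Guess the product id by prepending a hardcoded prefix (unverified assumption)."""
    pzn = pzn_from_url(url)
    return None if pzn is None else prefix + pzn


url = "https://www.medizinfuchs.de/preisvergleich/aspirin-complex-beutel-20-st-bayer-vital-gmbh-pzn-4114918.html"
print(pzn_from_url(url))
print(product_id_from_url(url))
```

Reading the cookie avoids relying on this guess, which is why the solution above prefers it.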


python

ajax

scrapy

web-scraping

infinite-scroll