Web scrape table from site

I want to scrape a table from the following site: https://www.katastar.hr

To see what I mean, open Inspect (developer tools) and click on the Network tab. Now, when you open the site, you can see a request to this URL: https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined

The problem is that id and status are different every time the site is opened. How can I scrape the output of the request above (which is JSON, containing the table) when the GET query is different each time?

I would give a reproducible example, but there is nothing in particular I can try. I suppose I should start from the home page, but I don't know how to proceed:

headers <- c(
  "Accept" = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding' = "gzip, deflate, br",
  'Accept-Language' = 'hr-HR,hr;q=0.9,en-US;q=0.8,en;q=0.7',
  "Cache-Control" = "max-age=0",
  "Connection" = "keep-alive",
  "DNT" = "1",
  "Host" = "www.katastar.hr",
  "If-Modified-Since" = "Mon, 22 Mar 2021 13:39:38 GMT",
  "Referer" = "https://www.google.com/",
  "sec-ch-ua" = '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
  "sec-ch-ua-mobile" = "?0",
  "Sec-Fetch-Dest" = "document",
  "Sec-Fetch-Mode" = "navigate",
  "Sec-Fetch-Site" = "same-origin",
  "Sec-Fetch-User" = "?1",
  "Upgrade-Insecure-Requests" = "1",
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
)
p <- httr::GET(
  "https://www.katastar.hr/",
  httr::add_headers(.headers = headers))  # named vector goes in .headers
httr::cookies(p)

The code can be in either R or Python.

You only need the HTTP Origin header for it to run:

  • python
import requests

r = requests.get(
    "https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined",
    headers={
        "Origin": "https://www.katastar.hr"
    })

print(r.json())

repl.it: https://replit.com/@bertrandmartel/ScrapeKatastar

  • R
library(httr)

data <- content(GET(
  "https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined",
  add_headers(origin = "https://www.katastar.hr")
  ), as = "parsed", type = "application/json")

print(data)
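Since the goal is a table, the parsed JSON can be flattened into a data frame. A minimal Python sketch, using a made-up sample payload (the field names the endpoint actually returns may differ):

```python
import pandas as pd

# Hypothetical sample of the JSON the /position endpoint returns;
# the real keys may differ.
records = [
    {"instName": "Office A", "city": "Zagreb"},
    {"instName": "Office B", "city": "Split"},
]

# Flatten the list of records into a tabular DataFrame
df = pd.DataFrame.from_records(records)
print(df)
```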

To understand further how the site generates id and status, the JS contains the following code:

e.prototype.getSurveyors = function(e) {
    var t = this.runbase(),
      n = this.create(t.toString(), null);
    return this.httpClient.get(s + "/position", {
      params: {
        id: t.toString(),
        status: n,
        x: String(e[0]),
        y: String(e[1])
      }
    })
}
e.prototype.runbase = function() {
    return Math.floor(1e7 * Math.random())
}
e.prototype.create = function(e, t) {
    for (var n = 0, i = 0; i < e.length; i++) n = (n << 5) - n + e.charAt(i).charCodeAt(0), n &= n;
    return null == t && (t = e), Math.abs(n).toString().substring(0, 6) + (Number(t) << 1)
}

It takes a random number id, encodes it with a specific algorithm, and puts the result in the status field. The server then checks that the encoded status value matches the id value.
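Incidentally, the loop in create() is the well-known 31-multiplier string hash (the same one as Java's String.hashCode): (n << 5) - n is just 31 * n, and n &= n truncates the value to a signed 32-bit integer. A standalone sketch of just the hash part:

```python
def hash31(s):
    # Classic 31-based string hash, kept in signed 32-bit range
    # the way the JS bitwise operators force it to be
    n = 0
    for ch in s:
        n = (31 * n + ord(ch)) & 0xFFFFFFFF  # (n << 5) - n + charCode, mod 2^32
        if n >= 0x80000000:
            n -= 0x100000000                 # reinterpret as signed 32-bit
    return n

# "abc" hashes to 96354, the same value as Java's "abc".hashCode()
print(hash31("abc"))
```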

It seems that previous id values keep working, as in the example above (when no data is sent), but you can also reproduce the JS functions above like this (Python example):

from random import randint
import ctypes
import requests

number = randint(1000000, 9999999)

def encode(rand, data):
    # Port of the JS create(): a 31-based hash of the id digits,
    # truncated to a signed 32-bit int, followed by (data << 1)
    randStr = str(rand)
    n = 0
    for char in randStr:
        n = ctypes.c_int(n << 5).value - n + ord(char)  # 31*n + charCode
    n = ctypes.c_int(n & n).value  # truncate to signed 32 bits
    if data is None:
        suffix = ctypes.c_int(rand << 1).value
    else:
        suffix = ctypes.c_int(data << 1).value
    return f"{str(abs(n))[:6]}{suffix}"

r = requests.get("https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position",
                 params={
                     "id": number,
                     "status": encode(number, None)
                 },
                 headers={
                     "Origin": "https://www.katastar.hr"
                 })
print(r.json())

# GET parcel Id 13241901
parcelId = 13241901
r = requests.get("https://oss.uredjenazemlja.hr/rest/katHr/parcelInfo",
                 params={
                     "id": number,
                     "status": encode(number, parcelId)
                 },
                 headers={
                     "Origin": "https://www.katastar.hr"
                 })
print(r.json())

repl.it: https://replit.com/@bertrandmartel/ScrapeKatastarDecode