Web scrape a table from a site
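I want to scrape a table from the following website: https://www.katastar.hr
To see what I mean, open the browser's developer tools (Inspect) and click the Network tab.
When you then load the site, you will see a request to this URL:
https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined
The problem is that id and status are different every time the site is opened.
How can I scrape the output of this request (it is JSON, and it holds the data behind the table) when the GET query changes on every visit?
I would give a reproducible example, but there is not much I can try. I assume I should start from the home page, but I don't know how to proceed from there: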
headers <- c(
  "Accept" = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding' = "gzip, deflate, br",
  'Accept-Language' = 'hr-HR,hr;q=0.9,en-US;q=0.8,en;q=0.7',
  "Cache-Control" = "max-age=0",
  "Connection" = "keep-alive",
  "DNT" = "1",
  "Host" = "www.katastar.hr",
  "If-Modified-Since" = "Mon, 22 Mar 2021 13:39:38 GMT",
  "Referer" = "https://www.google.com/",
  "sec-ch-ua" = '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
  "sec-ch-ua-mobile" = "?0",
  "Sec-Fetch-Dest" = "document",
  "Sec-Fetch-Mode" = "navigate",
  "Sec-Fetch-Site" = "same-origin",
  "Sec-Fetch-User" = "?1",
  "Upgrade-Insecure-Requests" = "1",
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
)
# Request the home page with browser-like headers, then inspect the cookies it sets
p <- httr::GET(
  "https://www.katastar.hr/",
  httr::add_headers(.headers = headers))
httr::cookies(p)
The code can be in either R or Python.
The only thing you need is the Origin HTTP header.
With it, the request works:
- python
import requests
r = requests.get(
    "https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined",
    headers={
        "Origin": "https://www.katastar.hr"
    })
print(r.json())
repl.it: https://replit.com/@bertrandmartel/ScrapeKatastar
- R
library(httr)
data <- content(GET(
  "https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined",
  add_headers(origin = "https://www.katastar.hr")
), as = "parsed", type = "application/json")
print(data)
To understand how the site generates id and status, here is the relevant code from its JS:
e.prototype.getSurveyors = function(e) {
    var t = this.runbase(),
        n = this.create(t.toString(), null);
    return this.httpClient.get(s + "/position", {
        params: {
            id: t.toString(),
            status: n,
            x: String(e[0]),
            y: String(e[1])
        }
    })
}
e.prototype.runbase = function() {
    return Math.floor(1e7 * Math.random())
}
e.prototype.create = function(e, t) {
    for (var n = 0, i = 0; i < e.length; i++) n = (n << 5) - n + e.charAt(i).charCodeAt(0), n &= n;
    return null == t && (t = e), Math.abs(n).toString().substring(0, 6) + (Number(t) << 1)
}
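It takes a random number as id, encodes it with a specific algorithm, and puts the result in the status field. The server then checks that the encoded status value matches the id value.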
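As a concrete check of the scheme described above, here is a minimal sketch that recomputes the status for the id seen in the question's URL (id=2432593, status=1332094865186). The helper names to_int32, js_hash and make_status are my own, not part of the site's code, and the 32-bit wrapping is done with plain masking as an assumed equivalent of JS bitwise semantics.

```python
def to_int32(value: int) -> int:
    """Wrap a Python int to a signed 32-bit integer, as JS bitwise operators do."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def js_hash(text: str) -> int:
    """Port of the loop in create(): n = (n << 5) - n + charCode, then n &= n."""
    n = 0
    for char in text:
        n = to_int32(to_int32(n << 5) - n + ord(char))
    return n


def make_status(request_id: int, payload=None) -> str:
    """First six digits of |hash(id)|, followed by (payload << 1).

    When payload is None the id itself is shifted, mirroring `null == t && (t = e)`.
    """
    if payload is None:
        payload = request_id
    prefix = str(abs(js_hash(str(request_id))))[:6]
    suffix = to_int32(int(payload) << 1)
    return f"{prefix}{suffix}"


sample_id = 2432593
print(js_hash(str(sample_id)))   # -1332098482
print(make_status(sample_id))    # 1332094865186, the status in the question's URL
```

The result matches the pair captured in the browser: 133209 comes from the hash and 4865186 is simply 2432593 doubled.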
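It seems that previously used id values still work, as in the example above (when no data is sent), but you can also reproduce the JS functions above like this (example in Python):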
from random import randint
import ctypes
import requests

# Random 7-digit id (the JS uses Math.floor(1e7 * Math.random()))
number = randint(1000000, 9999999)

def encode(rand, data):
    randStr = str(rand)
    n = 0
    for char in randStr:
        # ctypes.c_int emulates the signed 32-bit wraparound of JS bitwise operators
        n = ctypes.c_int(n << 5).value - n + ord(char)
        n = ctypes.c_int(n & n).value
    # With no payload, the suffix is derived from the id itself
    if data is None:
        suffix = ctypes.c_int(rand << 1).value
    else:
        suffix = ctypes.c_int(data << 1).value
    return f"{str(abs(n))[:6]}{suffix}"

r = requests.get("https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position",
    params={
        "id": number,
        "status": encode(number, None)
    },
    headers={
        "Origin": "https://www.katastar.hr"
    })
print(r.json())

# GET parcel Id 13241901
parcelId = 13241901
r = requests.get("https://oss.uredjenazemlja.hr/rest/katHr/parcelInfo",
    params={
        "id": number,
        "status": encode(number, parcelId)
    },
    headers={
        "Origin": "https://www.katastar.hr"
    })
print(r.json())
repl.it: https://replit.com/@bertrandmartel/ScrapeKatastarDecode
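Since the question asks for a table and the snippets stop at r.json(), here is a rough sketch of that last step. It assumes the endpoint returns a JSON array of flat objects, which is not shown anywhere in this thread, so check the real response first. The function name records_to_csv and the keys field_a, field_b, field_c are placeholders invented for illustration, not real fields of the API.

```python
import csv
import io


def records_to_csv(records: list[dict]) -> str:
    """Turn a list of flat dicts (e.g. the result of r.json()) into CSV text."""
    # Collect column names in first-seen order so rows with missing keys still fit.
    columns: list[str] = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Placeholder records: keys and values are made up, NOT taken from the real API.
sample = [
    {"field_a": 1, "field_b": "x"},
    {"field_a": 2, "field_c": "y"},
]
print(records_to_csv(sample))
```

With a real response you would pass r.json() instead of sample. Nested objects would need flattening first, which this sketch does not attempt.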