In R, use rvest and xml2 to extract a JSON object from a <script> element on a website
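I previously posted about scraping a table on the leaderboard page of the PGA's website. To summarize that post, the leaderboard table is apparently difficult to scrape because of the way the page uses JavaScript to render both the page and the table.
I can inspect the page and see that one of the <script> tags contains an object, global.leaderboardConfig, which holds useful information.
Is it possible to get this object as a list in R? I can use xml2::read_html('https://www.pgatour.com/leaderboard.html') %>% html_nodes('script')
to get all 76 script elements on the page, but I am not sure how to identify the specific script tag I need, or how to get the object out of it.
Edit: in the Network tab of devtools there is also a request that provides the link for the API call that fetches the data. Rather than getting the object from the script tag, would it be easier to capture all the network requests and filter them?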
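This site generates the hmac and expire URL parameter values from a JS function that uses a specific algorithm. The parameters of that algorithm depend on the epoch time that is passed as a URL parameter to the JS file hosting the function. As a result, the hmac value is different every time, because it is computed from a file whose content keeps changing with its URL.
The algorithm consists of bitwise AND and XOR operations like this (pseudocode):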
step = ((value * value - fixedValue) & bitMask) ^ xorKey1
result += fromCharCode(step)
value = step
step = ((value * value - fixedValue) & bitMask) ^ xorKey2
result += fromCharCode(step)
value = step
....
....
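The xorKey numbers are generated dynamically at https://microservice.pgatour.com/js according to the epoch time. You just need to request this JS file with the current epoch time as the URL parameter, and use a regex to extract all the step values (the xorKey numbers, which start with -1) that the algorithm above needs. You also need to reproduce the algorithm in R.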
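The pseudocode above can be sketched in JavaScript as follows. This is a minimal sketch rather than the site's own code: the function name deriveToken and its parameter names are made up for illustration. The seed 101 and the fixed value 1798339286 are the constants used in the scripts in this answer, and the two keys in the usage line are the first two XOR constants quoted in the obfuscated excerpt further down, so they only apply to that particular epoch time.

```javascript
// Minimal sketch of the AND/XOR chain (names are illustrative, not from the site).
function deriveToken(seed, fixedValue, xorKeys) {
  const bitMask = 4294967295; // 32-bit mask, as in the original script
  let value = seed;
  let result = String.fromCharCode(value);
  for (const key of xorKeys) {
    // JavaScript bitwise operators work on signed 32-bit integers
    const step = ((value * value - fixedValue) & bitMask) ^ key;
    result += String.fromCharCode(step);
    value = step;
  }
  return result;
}

// Seed and fixed value come from the scripts in this answer; the keys are
// the first two XOR constants from the obfuscated excerpt.
const prefix = deriveToken(101, 1798339286, [-1798328965, -1798324966]);
console.log(prefix);
```

With those two keys the chain produces the first three characters of the token, exp. Each step feeds the previous character code back in as value, which is why the keys have to be applied in the order they appear in the file.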
The following script generates the URL parameter and makes the API call:
library(httr)
library(stringr)
library(bitops)
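# fixed values
init <- 4294967295 # 32-bit mask (0xFFFFFFFF)
value <- 101 # seed, the character code of "e"
encodedId <- 1798339286 # hash of the hardcoded id "id8730931"
result <- rawToChar(as.raw(value))
# epoch time (in milliseconds) is dynamic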
time <- as.numeric(as.POSIXct(Sys.time()))*1000
output <- content(GET("https://microservice.pgatour.com/js", query = list(
"_" = format(time, digits=13)
)), as = "text", encoding = "UTF-8")
steps <- regmatches(output, gregexpr("-1[0-9]+", output, perl=TRUE)) #extract steps
stepsNum <- as.numeric(unlist(steps)) #convert to num
for(t in stepsNum){
step <- bitXor(bitAnd(value * value - encodedId, init), t)
result <- paste0(result, rawToChar(as.raw(step)));
value <- step;
}
print(result)
# extract leaderboard config url
output <- content(GET("https://www.pgatour.com/leaderboard.html"), as = "text", encoding = "UTF-8")
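# backslashes are doubled for R string literals; the page is assumed to escape slashes as \/
configUrl <- gsub("\\/", "/", str_match(output, "leaderboardUrl:\\s*'(.*)'")[2], fixed = TRUE)
url <- paste0(configUrl, "?userTrackingId=", result)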
data <- content(GET(url), as = "parsed", type = "application/json")
print(data)
kaggle link: https://www.kaggle.com/bertrandmartel/pgatourextract
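For readers following along in JavaScript rather than R, the last step of the script above (finding leaderboardUrl in the page source) could look roughly like this. It is only a sketch: extractLeaderboardUrl is a made-up helper name, and the sample HTML is a hypothetical fragment that assumes the page writes the URL in single quotes with slashes escaped as \/, which is what the R regex also assumes.

```javascript
// Pull leaderboardUrl out of raw page text and undo the \/ escaping.
function extractLeaderboardUrl(html) {
  const match = html.match(/leaderboardUrl:\s*'([^']*)'/);
  if (!match) return null;
  return match[1].replace(/\\\//g, "/");
}

// Hypothetical page fragment; the real page content may differ.
const sampleHtml =
  "<script>global.leaderboardConfig = { leaderboardUrl: 'https:\\/\\/example.com\\/leaderboard.json' };</script>";
const configUrl = extractLeaderboardUrl(sampleHtml);
console.log(configUrl);
```

Matching on the raw text avoids having to decide which of the many script elements holds the config, at the cost of breaking if the site changes how the object is written.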
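How did I find this algorithm?
I searched through the JavaScript code and reverse-engineered the obfuscated code into something understandable. It is a fairly long road, so let's go step by step.
Task n°1 - Search for leaderboardUrl
You already gave the first hint in your question: the config contains a leaderboardUrl.
There is a JS file named stroke-play-leaderboard-controller-56223356ffc8423f5d6e.js in which leaderboardUrl appears as config.leaderboardUrl: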
{
key: "getLeaderboardData",
value: function (t, r, n) {
var o = this,
e = (0, h.resolveUrl)(this.config.leaderboardUrl, r()), // <== HERE
a = [this.performFetch(e)].concat(
g(
"initial" === n && this.config.translationsUrl
? [y.default.load(this.config.translationsUrl)]
: []
)
),
..........
}
Let's look at performFetch, the function that seems to send the request:
{
key: "performFetch",
value: function (t) {
var r = this,
e =
1 < arguments.length && void 0 !== arguments[1]
? arguments[1]
: {};
return t
? ((0, a.isProtectedUrl)(t) &&
(t = this.getUrlWithAuth(t)), // <== HERE
(0, o.default)(t, e)
.then(function (e) {
return r.checkFetchResponseStatus(e, t);
.................
We find the getUrlWithAuth function:
{
key: "getUrlWithAuth",
value: function (e) {
var t = u.setTrackingUserId,
r = u.UserIdTracker,
n = r && r.getTrackingUserIdParam && r.getUserId; // <== HERE
if (t && n) {
var o = r.getTrackingUserIdParam(), // <== HERE
a = t(r.getUserId());
return u.setUrlParameter(e, o, a);
}
return e;
},
},
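Now we have getUserId and getTrackingUserIdParam, which look like the functions that supply the auth parameter added to the URL. The problem is that we still have to find where these functions are defined.
Task n°2 - Deobfuscation challenge: substitution
I found a file named main.c03ddfd249437fcce43410c35a21c6f8.js in which getUserId and getTrackingUserIdParam appear: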
var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];
var A = function(g, e) {
return t[g -= 398]
},
(function(g, e) {
for (var t = A; ; )
try {
if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
break;
g.push(g.shift())
} catch (e) {
g.push(g.shift())
}
}
)(t)
.................
function(g, e) {
var t = A
, C = e[t(439) + t(403) + "r"] = e["pga" + t(403) + "r"] || {}
, I = t(428) + t(423) + t(407)
, o = t(483) + "rTr" + t(446) + t(477) + "Id";
C[t(489) + t(463) + t(469) + "cker"] = {
........................
getTrackingUserIdParam: function() {
return o
},
getUserId: function() {
return I
},
......................
}
}(jQuery, window)
},
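I skipped a lot of code in the snippet above to make it clearer.
You can see that there is substitution going on: the t array is the base, the A function looks strings up in it with an offset of 398, and an init function rotates the initial t array until a numeric check passes, so that the lookups decode to the correct strings.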
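The substitution scheme described above can be illustrated with a toy version. This is a sketch under stated assumptions: the three-entry table, the lookup and rotateUntilValid names, and the simplified check are all made up for illustration. Only the offset of 398 and the rotate-until-the-check-passes idea are taken from the real file.

```javascript
// Toy string table, deliberately stored in rotated order (hypothetical data).
const table = ["tour", "42", "pga"];
const OFFSET = 398; // same offset as the A function in the real file

// Same role as A: map an offset index to a table entry.
const lookup = (i) => table[i - OFFSET];

// Same role as the init function: rotate the array until a numeric check
// passes. Here the check is simply that index 398 parses to 42.
function rotateUntilValid(arr, isValid, maxTurns = arr.length) {
  for (let turn = 0; turn < maxTurns; turn++) {
    if (isValid()) return turn;
    arr.push(arr.shift());
  }
  throw new Error("no rotation satisfied the check");
}

const turns = rotateUntilValid(table, () => parseInt(lookup(398), 10) === 42);
const decoded = lookup(399) + lookup(400);
console.log(turns, decoded); // prints 1 pgatour
```

The real file works the same way, only with a much longer table and a checksum built from several parseInt calls, which is why the array has to be rotated before any A(XXX) call can be decoded.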
You can paste that obfuscated code into a nodejs script, modify it slightly, and then use something like this:
var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];
var A = function(g, e) {
return t[g -= 398]
};
console.log(t);
(function(g, e) {
for (var t = A; ; )
try {
if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
break;
g.push(g.shift())
} catch (e) {
g.push(g.shift())
}
}
)(t)
console.log(t);
console.log(`e[${A(439) + A(403) + "r"}] = e[${"pga" + A(403) + "r"}] || {};`);
// prints e[pgatour] = e[pgatour] || {};
Here e is window, so you "just" need to replace all the A(XXX) calls to better understand what is going on.
You will find this:
onBeforeSendRequest: function(g, e) {
var A = t;
if (this[A(408) + A(516) + "oTr" + A(446)](e.url) && window[A(439) + A(403) + "r"][A(460) + A(469) + A(450) + A(507) + A(398) + "Id"]) {
var I = this["getUse" + A(463)]()
, o = window[A(439) + A(403) + "r"]["set" + A(469) + A(450) + A(507) + A(398) + "Id"](I)
, n = this[A(438) + A(469) + A(450) + A(507) + "ser" + A(478) + A(488) + "m"]();
e.url = C[A(460) + A(509) + "Par" + A(424) + A(425)](e[A(443)], n, o)
}
},
Once decoded, it gives the following:
onBeforeSendRequest: function(g, e) {
if (this["isUrlToTrack"](e.url) && window["pgatour"]["setTrackingUserId"]) {
var I = this["getUserId"]()
, o = window["pgatour"]["setTrackingUserId"](I)
, n = this["getTrackingUserIdParam"]();
e.url = C["setUrlParameter"](e["url"], n, o)
}
},
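The function we are looking for is window["pgatour"]["setTrackingUserId"]. But we could have known this since the first task. Remember, in the first JS file:
var t = u.setTrackingUserId
and u is window.pgatour
But here we also have the input parameter I, which is hardcoded:
var I = A(428) + A(423) + A(407);
which is equivalent to var I = "id8730931"
Now let's look at the window["pgatour"]["setTrackingUserId"] function.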
Task n°3 - Crypto/reverse
Open the Chrome developer console on the website and paste window["pgatour"]["setTrackingUserId"]
You will get something like this:
function(_$_$){var $$ = _$__(_$_$); var _$_, ___, __;.................
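Yes :( more obfuscated code to deal with.
By looking through the application's scripts, you can find the file it lives in. This is the JS file URL:
https://microservice.pgatour.com/js?_=1618868625306
There is a URL parameter that specifies the epoch time, and the code changes depending on this parameter.
Looking at the code itself, after substituting the input parameters String.fromCharCode and Math.abs, we get something like this: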
((function($__$, _, $_$) {
var $$_ = 4294967295; // <== doesn't change when the epoch time is updated
function _$__($) {
var $$__ = 42;
for (var _ = 0; _ < $.length; _++) {
$$__ = ($$__ * 31 + $.charCodeAt(_)) & $$_;
}
return Math.abs($$__);
}
......
_$_ = (__ * __ - $$) & $$_ ^ -30086, // <== doesn't change when the epoch time is updated
___ += _(_$_),
__ = _$_,
_$_ = (__ * __ - $$) & $$_ ^ -33221,
___ += _(_$_),
.....
$__$[__$_] = (function(_$_$ = "id8730931") { // <== this is the window["pgatour"]["setTrackingUserId"] function; its input is id8730931
var $$ = _$__(_$_$);
var _$_, ___, __;
var __ = (__ = 101,
___ = String.fromCharCode(__),
_$_ = (__ * __ - $$) & $$_ ^ -1798328965, // <== this changes when the epoch time is updated
___ += String.fromCharCode(_$_),
__ = _$_,
_$_ = (__ * __ - $$) & $$_ ^ -1798324966,
___ += String.fromCharCode(_$_),
__ = _$_,
....
__ = _$_,
___);
return __
}
);
}
)((window.pgatour || (window.pgatour = {})), String.fromCharCode, Math.abs));
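We can write a nodejs script that reproduces this algorithm in a simpler way by extracting the step values (the constants used in the XOR stage):
const axios = require("axios");
const init = 4294967295; // 32-bit mask
let value = 101; // seed, the character code of "e"
const encodedId = 1798339286; // hash of the hardcoded id "id8730931"
let result = String.fromCharCode(value);
(async function () {
const response = await axios.get(
"https://microservice.pgatour.com/js?_=1618868625506"
);
// the XOR constants for this epoch time all start with -17
const data = response.data.match(/-17\d+/g).map((it) => parseInt(it, 10));
for (const t of data) {
const step = ((value * value - encodedId) & init) ^ t;
result += String.fromCharCode(step);
value = step;
}
console.log(result);
})();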
Output:
exp=1618882930~acl=*~hmac=0274aecb617168167713a757e301c33e9708da3ab643663f97a4775040bf3bdd
If you change the epoch time, it gives a different result.
repl.it: https://replit.com/@bertrandmartel/PegatourEncrypt
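The extraction step in the nodejs script can be tried offline on a short excerpt. The following is a sketch only: extractXorKeys is a made-up helper name, and the excerpt string is stitched together from lines of the obfuscated code quoted earlier in this answer rather than fetched from the live file.

```javascript
// Collect the XOR constants from the script text, in order of appearance.
// The default pattern matches the nodejs script above (keys starting with -17).
function extractXorKeys(scriptText, pattern = /-17\d+/g) {
  const found = scriptText.match(pattern) || [];
  return found.map((s) => parseInt(s, 10));
}

// Excerpt assembled from the obfuscated code quoted in this answer.
const excerpt =
  "_$_ = (__ * __ - $$) & $$_ ^ -30086, ___ += _(_$_), " +
  "_$_ = (__ * __ - $$) & $$_ ^ -1798328965, ___ += String.fromCharCode(_$_), " +
  "_$_ = (__ * __ - $$) & $$_ ^ -1798324966,";
const keys = extractXorKeys(excerpt);
console.log(keys);
```

The -17 prefix skips constants such as -30086 that belong to other functions in the same file. It relies on the time-dependent keys sharing that prefix, so the pattern may need adjusting if the generated file changes.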
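Then you just need to convert this nodejs script to R and make your HTTP call with the URL parameter.
Note that encodedId comes from the input id id8730931, transformed with this function (these values do not seem to change with the epoch time):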
var $$_ = 4294967295;
function _$__($) {
var $$__ = 42;
for (var _ = 0; _ < $.length; _++) {
$$__ = ($$__ * 31 + $.charCodeAt(_)) & $$_;
}
return Math.abs($$__);
}
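A readable version of the same hash makes it easy to check where encodedId comes from. This is a sketch: the names hashId, mask and hash are made up, and the logic is a direct transcription of the obfuscated function above.

```javascript
// Readable rewrite of the obfuscated _$__ hash (the name hashId is made up).
function hashId(id) {
  const mask = 4294967295;
  let hash = 42;
  for (let i = 0; i < id.length; i++) {
    // & truncates to a signed 32-bit integer on every round
    hash = (hash * 31 + id.charCodeAt(i)) & mask;
  }
  return Math.abs(hash);
}

const encodedId = hashId("id8730931");
console.log(encodedId);
```

For the input id8730931 this works out to 1798339286, the encodedId constant hardcoded in the R and nodejs scripts.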
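My guess is that the server checks that the hmac correctly refers to the initial id string id8730931, so it is safe to hardcode it (since it is also hardcoded on the server).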