In R, use rvest and xml2 to extract a JSON object from a <script> element on a website

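I previously posted about scraping a table on the leaderboard page of the PGA's website. To summarize that post, the leaderboard table is apparently difficult to scrape because the page uses JavaScript to render both the page and the table.

When I inspect the page, I can see in a <script> tag that there is an object, global.leaderboardConfig, which contains useful information.

Is it possible to get this object as a list in R? I can use xml2::read_html('https://www.pgatour.com/leaderboard.html') %>% html_nodes('script') to get all 76 script elements on the page, but I am not sure how to identify the specific script tag I need, or how to get the object out of it.

Edit: In the Network tab of devtools there is also a request that provides the link for the API call that fetches the data. Rather than getting the object from the script tag, would it be easier to capture all network requests and filter them?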

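This site generates the hmac and expire url parameter value from a JS function that uses a specific algorithm. The parameters of that algorithm depend on the epoch time, which is passed as a url parameter to the JS file hosting the function. As a result, the hmac value is different every time, because it is computed from a file whose content changes with its url.

The algorithm consists of bitwise AND and XOR operations like this (pseudo-code):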

step = ((value * value - fixedValue) & bitMask) ^ xorKey1
result += fromCharCode(step)
value = step

step = ((value * value - fixedValue) & bitMask) ^ xorKey2
result += fromCharCode(step)
value = step

....
....

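The xorKey numbers are generated dynamically at https://microservice.pgatour.com/js according to the epoch time. You only need to request this JS file with the current epoch time as a url parameter, then use a regular expression to extract all the step values the algorithm needs (they start with -1). You also need to reproduce the algorithm above in R.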
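To make the pseudo-code concrete, here is a minimal JavaScript sketch of the two pieces: extracting the XOR keys with a regular expression and running the step loop. The function names extractXorKeys and applySteps are made up for this illustration. The constants (seed 101, fixed value 1798339286, mask 4294967295) and the two sample keys are taken from the snippets quoted further down in this answer.

```javascript
// Pull the XOR keys out of the downloaded script text.
// The /-1\d+/g pattern mirrors the "starts with -1" rule described above.
function extractXorKeys(scriptSource) {
  const matches = scriptSource.match(/-1\d+/g) || [];
  return matches.map((m) => parseInt(m, 10));
}

// Apply the step loop: every step is computed from the previous step's value.
function applySteps(seed, fixedValue, bitMask, xorKeys) {
  let value = seed;
  let result = String.fromCharCode(value);
  for (const key of xorKeys) {
    const step = ((value * value - fixedValue) & bitMask) ^ key;
    result += String.fromCharCode(step);
    value = step;
  }
  return result;
}

// Two lines copied from the obfuscated function quoted later in this answer.
const sampleSource = [
  "_$_ = (__ * __ - $$) & $$_ ^ -1798328965,",
  "_$_ = (__ * __ - $$) & $$_ ^ -1798324966,",
].join("\n");

const keys = extractXorKeys(sampleSource);
console.log(applySteps(101, 1798339286, 4294967295, keys)); // prints "exp"
```

With only the first two keys the loop already yields "exp", which is the start of the exp=... token shown at the end of this answer.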

The following script generates the url parameter and makes the API call:

library(httr)
library(stringr)
library(bitops)

# fixed values
init <- 4294967295
value <- 101
encodedId <- 1798339286
result <- rawToChar(as.raw(value))

# epoch time is dynamic
time <- as.numeric(as.POSIXct(Sys.time()))*1000

output <- content(GET("https://microservice.pgatour.com/js", query = list(
    "_" = format(time, digits=13)
  )), as = "text", encoding = "UTF-8")

steps <- regmatches(output, gregexpr("-1[0-9]+", output, perl=TRUE)) #extract steps
stepsNum <- as.numeric(unlist(steps)) #convert to num

for(t in stepsNum){
    step <- bitXor(bitAnd(value * value - encodedId, init), t)
    result <- paste0(result, rawToChar(as.raw(step)));
    value <- step;
}
print(result)

# extract leaderboard config url
output <- content(GET("https://www.pgatour.com/leaderboard.html"), as = "text", encoding = "UTF-8")
# the page escapes slashes as "\/", so match the value and then unescape it
configUrl = gsub("\\\\/", "/", str_match(output, "leaderboardUrl:\\s*'(.*)'")[2])

url = paste0(configUrl,"?userTrackingId=",result)
data <- content(GET(url), as = "parsed", type = "application/json")

print(data)
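
The leaderboardUrl extraction at the end of the script can be hard to read because of the escaping, so here is a small JavaScript sketch of the same idea. The sample page fragment and the example.com URL are hypothetical; only the leaderboardUrl key and the "\/" escaping come from the script above. Unlike the R pattern, this one stops at the first closing quote.

```javascript
// Find the leaderboardUrl value in the page source and undo the "\/" escaping.
function extractLeaderboardUrl(html) {
  const match = html.match(/leaderboardUrl:\s*'([^']*)'/);
  if (!match) return null;
  return match[1].replace(/\\\//g, "/");
}

// Hypothetical page fragment; the real page and URL will differ.
const sampleHtml =
  "global.leaderboardConfig = { leaderboardUrl: 'https:\\/\\/example.com\\/leaderboard.json' };";

console.log(extractLeaderboardUrl(sampleHtml));
// prints https://example.com/leaderboard.json
```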

kaggle link: https://www.kaggle.com/bertrandmartel/pgatourextract

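How was this algorithm found?

I searched through the Javascript code and reversed the obfuscated code into something understandable. It is quite a long road, so let's go step by step.

Task n°1 - Search for leaderboardUrl

You already gave the first hint in your question: the config contains a leaderboardUrl.

There is a JS file named stroke-play-leaderboard-controller-56223356ffc8423f5d6e.js in which leaderboardUrl appears as config.leaderboardUrl: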

{
    key: "getLeaderboardData",
    value: function (t, r, n) {
      var o = this,
        e = (0, h.resolveUrl)(this.config.leaderboardUrl, r()), // <== HERE
        a = [this.performFetch(e)].concat(
          g(
            "initial" === n && this.config.translationsUrl
              ? [y.default.load(this.config.translationsUrl)]
              : []
          )
        ),
        ..........
}

Let's look at the performFetch function, which seems to send the request:

{
    key: "performFetch",
    value: function (t) {
      var r = this,
        e =
          1 < arguments.length && void 0 !== arguments[1]
            ? arguments[1]
            : {};
      return t
        ? ((0, a.isProtectedUrl)(t) &&
            (t = this.getUrlWithAuth(t)), // <== HERE
          (0, o.default)(t, e)
            .then(function (e) {
              return r.checkFetchResponseStatus(e, t);
    .................

That leads us to the getUrlWithAuth function:

  {
    key: "getUrlWithAuth",
    value: function (e) {
      var t = u.setTrackingUserId, 
        r = u.UserIdTracker, 
        n = r && r.getTrackingUserIdParam && r.getUserId; // <== HERE
      if (t && n) {
        var o = r.getTrackingUserIdParam(), // <== HERE
          a = t(r.getUserId());
        return u.setUrlParameter(e, o, a);
      }
      return e;
    },
  },

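Now we have getUserId and getTrackingUserIdParam, which look like the functions that add the authorization parameter to the url. The problem is that we still have to find where they are defined.

Task n°2 - Deobfuscation challenge: substitution

I found a file named main.c03ddfd249437fcce43410c35a21c6f8.js in which getUserId and getTrackingUserIdParam appear: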

var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];
var A = function(g, e) {
    return t[g -= 398]
},
(function(g, e) {
    for (var t = A; ; )
        try {
            if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
                break;
            g.push(g.shift())
        } catch (e) {
            g.push(g.shift())
        }
}
)(t)
.................
function(g, e) {
    var t = A
      , C = e[t(439) + t(403) + "r"] = e["pga" + t(403) + "r"] || {}
      , I = t(428) + t(423) + t(407)
      , o = t(483) + "rTr" + t(446) + t(477) + "Id";
    C[t(489) + t(463) + t(469) + "cker"] = {
        ........................
        getTrackingUserIdParam: function() {
            return o
        },
        getUserId: function() {
            return I
        },
        ......................
    }
}(jQuery, window)
},

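I skipped a lot of code in the snippet above to keep it readable.

You can see that string substitution is going on here. The t array is the base, the A function looks strings up in it by subtracting an offset from the index, and an init function rotates the initial t array until it decodes to the correct strings.

You can paste this code into a nodejs script, modify it a little, and then use something like this: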

var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];

var A = function(g, e) {
    return t[g -= 398]
};
console.log(t);
(function(g, e) {
    for (var t = A; ; )
        try {
            if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
                break;
            g.push(g.shift())
        } catch (e) {
            g.push(g.shift())
        }
}
)(t)
console.log(t);
console.log(`e[${A(439) + A(403) + "r"}] = e[${"pga" + A(403) + "r"}] || {};`);

// prints e[pgatour] = e[pgatour] || {};

Here e is window, so you "just" need to replace every A(XXX) call to get a better idea of what is going on.
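
Replacing the calls by hand gets tedious, so here is a rough sketch of how the replacement could be automated. The helper name inlineLookups is made up, and the two-entry table is a toy stand-in for the rotated t array, with values inferred from the printed result above. In a real run you would pass the actual A function after the rotation has completed.

```javascript
// Replace every A(<number>) call with the string it resolves to, then
// merge adjacent string literals that are joined by "+".
function inlineLookups(source, lookup) {
  const inlined = source.replace(/\bA\((\d+)\)/g, (call, index) => {
    const value = lookup(Number(index));
    // Leave the call untouched when the lookup does not return a string.
    return typeof value === "string" ? JSON.stringify(value) : call;
  });
  let merged = inlined;
  let previous;
  do {
    previous = merged;
    merged = merged.replace(/"([^"\\]*)"\s*\+\s*"([^"\\]*)"/g, '"$1$2"');
  } while (merged !== previous);
  return merged;
}

// Toy lookup standing in for the rotated t array (illustrative values only).
const table = { 439: "pga", 403: "tou" };
const toyLookup = (index) => table[index];

console.log(inlineLookups('e[A(439) + A(403) + "r"]', toyLookup));
// prints e["pgatour"]
```

The merge step only handles plain double-quoted literals without escapes, which is enough for the three-letter fragments used here.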

You will find this:

onBeforeSendRequest: function(g, e) {
    var A = t;
    if (this[A(408) + A(516) + "oTr" + A(446)](e.url) && window[A(439) + A(403) + "r"][A(460) + A(469) + A(450) + A(507) + A(398) + "Id"]) {
        var I = this["getUse" + A(463)]()
            , o = window[A(439) + A(403) + "r"]["set" + A(469) + A(450) + A(507) + A(398) + "Id"](I)
            , n = this[A(438) + A(469) + A(450) + A(507) + "ser" + A(478) + A(488) + "m"]();
        e.url = C[A(460) + A(509) + "Par" + A(424) + A(425)](e[A(443)], n, o)
    }
},

Once decoded, it gives the following:

onBeforeSendRequest: function(g, e) {
    if (this["isUrlToTrack"](e.url) && window["pgatour"]["setTrackingUserId"]) {
        var I = this["getUserId"]()
            , o = window["pgatour"]["setTrackingUserId"](I)
            , n = this["getTrackingUserIdParam"]();
        e.url = C["setUrlParameter"](e["url"], n, o)
    }
},

The function we are looking for is window["pgatour"]["setTrackingUserId"]. We could actually have known this from the first task. Remember, in the first JS file:

var t = u.setTrackingUserId

where u is window.pgatour.

But here, the input parameter I is hardcoded:

var I = A(428) + A(423) + A(407);

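which is equivalent to var I = "id8730931"

Now let's look at the window["pgatour"]["setTrackingUserId"] function.

Task n°3 - Crypto/reverse

Open the chrome developer console on the website and paste window["pgatour"]["setTrackingUserId"]. You will get something like this: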

function(_$_$){var $$ = _$__(_$_$); var _$_, ___, __;.................

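Yes :( more obfuscated code to deal with.

By looking through the scripts loaded by the application, you can find the file it is defined in. This is the JS file url: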

https://microservice.pgatour.com/js?_=1618868625306

There is a url parameter that specifies the epoch time, and the code changes depending on this parameter.
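
As a small illustration, the request url can be built from an epoch time in milliseconds like this. The helper name buildScriptUrl is made up; the host, path, and the "_" parameter come from the url above.

```javascript
// Build the script url for a given epoch time in milliseconds.
function buildScriptUrl(epochMs) {
  const url = new URL("https://microservice.pgatour.com/js");
  url.searchParams.set("_", String(Math.trunc(epochMs)));
  return url.toString();
}

console.log(buildScriptUrl(1618868625306));
// prints https://microservice.pgatour.com/js?_=1618868625306

// For a live request, pass the current time instead.
console.log(buildScriptUrl(Date.now()));
```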

Looking at the code itself, after substituting the input parameters String.fromCharCode and Math.abs, we get something like this:

((function($__$, _, $_$) { 
    var $$_ = 4294967295; // <== doesn't change when the epoch time is updated
    function _$__($) {
        var $$__ = 42;
        for (var _ = 0; _ < $.length; _++) {
            $$__ = ($$__ * 31 + $.charCodeAt(_)) & $$_;
        }
        return Math.abs($$__);
    }
    ......
    _$_ = (__ * __ - $$) & $$_ ^ -30086, // <== doesn't change when the epoch time is updated
    ___ += _(_$_),
    __ = _$_,
    _$_ = (__ * __ - $$) & $$_ ^ -33221,
    ___ += _(_$_),
    .....
    $__$[__$_] = (function(_$_$ = "id8730931") { // <== this is the window["pgatour"]["setTrackingUserId"] function; the input is id8730931
        var $$ = _$__(_$_$);
        var _$_, ___, __;
        var __ = (__ = 101,
        ___ = String.fromCharCode(__),
        _$_ = (__ * __ - $$) & $$_ ^ -1798328965, // <== this changes when the epoch time is updated
        ___ += String.fromCharCode(_$_),
        __ = _$_,
        _$_ = (__ * __ - $$) & $$_ ^ -1798324966,
        ___ += String.fromCharCode(_$_),
        __ = _$_,
        ....
        __ = _$_,
        ___);
        return __
    }
    );
}
)((window.pgatour || (window.pgatour = {})), String.fromCharCode, Math.abs));

We can write a script that reproduces this algorithm in a simpler way by extracting the step values (the ones used in the XOR stage):

const axios = require("axios");

const init = 4294967295;
var value = 101;
var encodedId = 1798339286;
var result = String.fromCharCode(value);

(async function () {
  const response = await axios.get(
    "https://microservice.pgatour.com/js?_=1618868625506"
  );
  // declare the variables explicitly instead of leaking implicit globals
  const data = response.data.match(/-17\d+/g).map((it) => parseInt(it, 10));

  for (const t of data) {
    var step = ((value * value - encodedId) & init) ^ t;
    result += String.fromCharCode(step);
    value = step;
  }
  console.log(result);
})();

Output:

exp=1618882930~acl=*~hmac=0274aecb617168167713a757e301c33e9708da3ab643663f97a4775040bf3bdd

If you change the epoch time, it gives a different result.
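
To inspect the generated value, it can be split into its fields. This is only a sketch: the helper name parseToken is made up, and reading exp as a Unix timestamp in seconds is my assumption based on its magnitude, not something the site documents.

```javascript
// Split a "key=value~key=value" token into an object.
function parseToken(token) {
  const fields = {};
  for (const part of token.split("~")) {
    const separator = part.indexOf("=");
    fields[part.slice(0, separator)] = part.slice(separator + 1);
  }
  return fields;
}

const token =
  "exp=1618882930~acl=*~hmac=0274aecb617168167713a757e301c33e9708da3ab643663f97a4775040bf3bdd";
const fields = parseToken(token);

console.log(fields.exp, fields.acl); // prints 1618882930 *

// Assumption: exp is a Unix timestamp in seconds.
console.log(new Date(Number(fields.exp) * 1000).toISOString());
```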

repl.it: https://replit.com/@bertrandmartel/PegatourEncrypt

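Then you just need to convert this JavaScript script to R and make your http call with the url parameter.

Note that encodedId comes from the input id id8730931, converted with this function (these values do not seem to change with the epoch time):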

var $$_ = 4294967295;
function _$__($) {
    var $$__ = 42;
    for (var _ = 0; _ < $.length; _++) {
        $$__ = ($$__ * 31 + $.charCodeAt(_)) & $$_;
    }
    return Math.abs($$__);
}
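
For readability, here is the same hash written with plain names. The function name encodeId is made up; the logic is copied from the snippet above. Running it on the hardcoded id reproduces the encodedId constant used in the scripts.

```javascript
// Readable version of the hash: start at 42, multiply by 31, add each
// character code, and truncate to 32 bits on every iteration.
function encodeId(id) {
  let hash = 42;
  for (let i = 0; i < id.length; i++) {
    hash = (hash * 31 + id.charCodeAt(i)) & 4294967295;
  }
  return Math.abs(hash);
}

console.log(encodeId("id8730931")); // prints 1798339286
```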

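My guess is that the server checks that the hmac correctly refers to the initial ID string id8730931, so hardcoding it is safe (since it is hardcoded on the server as well).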