Scraping the contents of a <script> tag from a web page

I'm looking for a way to scrape part of a page's source code. The information I need sits inside a tag similar to this one.

<script>
.......
var playerIdMap = {};
playerIdMap['4'] = '614';
playerIdMap['5'] = '84';
playerIdMap['6'] = '65';
playerIdMap['7'] = '701';
getPlayerIdMap = function() { return playerIdMap; };   // global
}
enclosePlayerMap();
</script>

I'm trying to get the playerIdMap numbers, for example 4 and 614, or else the whole line.

Edit-2

Complete PHP code, with the cURL request helper inspired by existing code:
<?php
/**
 * Handles making a cURL request
 *
 * @param string $url         URL to call out to for information.
 * @param bool   $callDetails Optional condition to allow for extended
 *   information return including error and getinfo details.
 *
 * @return array $returnGroup cURL response and optional details.
 */
function makeRequest($url, $callDetails = false)
{
  // Set handle
  $ch = curl_init($url);

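  // Set options
  // NOTE: disabling peer verification skips TLS certificate checks. It is
  // only tolerable for quick tests and has no effect on plain http:// URLs.
  curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

  // Execute the cURL handle and add the result to the return array.
  // curl_exec() returns the body as a string, or false on failure.
  $result = curl_exec($ch);
  $returnGroup = ['curlResult' => $result,];

  // If details of the cURL execution were requested, add them to the return group.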
  if ($callDetails) {
    $returnGroup['info'] = curl_getinfo($ch);
    $returnGroup['errno'] = curl_errno($ch);
    $returnGroup['error'] = curl_error($ch);
  }

  // Close cURL and return response.
  curl_close($ch);
  return $returnGroup;
}

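$url = "http://www.bullshooterlive.com/my-stats/999/";
$response = makeRequest($url, true);

// Stop early if the request failed, since there is nothing to match against.
if ($response['curlResult'] === false) {
  die('cURL error: ' . $response['error']);
}

// Capture the key and value of every playerIdMap['<id>'] = '<value>' line.
$re = '/playerIdMap\[\'(?P<id>\d+)\']\s+=\s+\'(?P<value>\d+)\'/';

preg_match_all($re, $response['curlResult'], $matches, PREG_SET_ORDER, 0);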

// Print the entire match result
var_dump($matches);

//var_dump($response);

Edit-1

Sorry, I didn't realize you were asking a PHP question. Not sure why I assumed Scrapy here. Anyway, the PHP code below should help.

$re = '/playerIdMap\[\'(?P<id>\d+)\']\s+=\s+\'(?P<value>\d+)\'/';
$str = '<script>
.......
var playerIdMap = {};
playerIdMap[\'4\'] = \'614\';
playerIdMap[\'5\'] = \'84\';
playerIdMap[\'6\'] = \'65\';
playerIdMap[\'7\'] = \'701\';
getPlayerIdMap = function() { return playerIdMap; };   // global
}
enclosePlayerMap();
</script>';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

// Print the entire match result
var_dump($matches);

Previous answer

You can use something like the following:

>>> data = """
... <script>
... .......
... var playerIdMap = {};
... playerIdMap['4'] = '614';
... playerIdMap['5'] = '84';
... playerIdMap['6'] = '65';
... playerIdMap['7'] = '701';
... getPlayerIdMap = function() { return playerIdMap; };   // global
... }
... enclosePlayerMap();
... </script>
... """
>>> import re
>>>
>>> regex = r"playerIdMap\['(?P<id>\d+)']\s+=\s+'(?P<value>\d+)'"
>>> re.findall(regex, data)
[('4', '614'), ('5', '84'), ('6', '65'), ('7', '701')]
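
If you would rather have a lookup table than a list of tuples, the same pattern can feed a dict. This is only a sketch: the `player_id_map` name, the shortened sample data and the conversion to int are my own assumptions, and I relaxed `\s+` to `\s*` so that lines written without spaces around `=` still match.

```python
import re

# Shortened sample; the second line has no spaces around "=" on purpose.
data = """
<script>
var playerIdMap = {};
playerIdMap['4'] = '614';
playerIdMap['5']='84';
</script>
"""

# Same idea as the regex above, but \s* tolerates missing whitespace.
pattern = re.compile(r"playerIdMap\['(?P<id>\d+)'\]\s*=\s*'(?P<value>\d+)'")

# Build {id: value} using the named groups; converting to int is optional.
player_id_map = {
    int(m.group("id")): int(m.group("value"))
    for m in pattern.finditer(data)
}

print(player_id_map)  # {4: 614, 5: 84}
```

Drop the `int()` calls if the ids might carry leading zeros that you need to keep.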

You need to get at the script tag first. In Scrapy, where response is the response object, that looks like this:

import re

# Select the <script> element whose text mentions getPlayerIdMap.
data = response.xpath("//script[contains(text(),'getPlayerIdMap')]").extract_first()

regex = r"playerIdMap\['(?P<id>\d+)']\s+=\s+'(?P<value>\d+)'"
print(re.findall(regex, data))
# [('4', '614'), ('5', '84'), ('6', '65'), ('7', '701')]
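
If Scrapy is not available, roughly the same selection can be done with the standard library alone. The following is a sketch rather than a drop-in solution: `ScriptCollector`, `extract_player_map` and `sample_html` are hypothetical names of mine, and the sample page is a cut-down stand-in for the real one.

```python
import re
from html.parser import HTMLParser

PLAYER_RE = re.compile(r"playerIdMap\['(?P<id>\d+)'\]\s*=\s*'(?P<value>\d+)'")


class ScriptCollector(HTMLParser):
    """Collects the text of every <script> element in a page."""

    def __init__(self):
        super().__init__()
        self.scripts = []    # finished script bodies
        self._buffer = None  # chunks of the script currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.scripts.append("".join(self._buffer))
            self._buffer = None


def extract_player_map(html):
    """Return {id: value} from the script that mentions getPlayerIdMap."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    for script in parser.scripts:
        if "getPlayerIdMap" in script:
            return dict(PLAYER_RE.findall(script))
    return {}


sample_html = """<html><body>
<script>var other = 1;</script>
<script>
var playerIdMap = {};
playerIdMap['4'] = '614';
playerIdMap['5'] = '84';
getPlayerIdMap = function() { return playerIdMap; };
</script>
</body></html>"""

print(extract_player_map(sample_html))  # {'4': '614', '5': '84'}
```

Filtering on `getPlayerIdMap` mirrors the `contains(text(), ...)` test in the XPath above, so unrelated scripts on the page are skipped.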