YQL: html table is no longer supported

Question

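I use YQL to fetch some HTML pages and read information from them. Since today I have been getting the return message "html table is no longer supported. See https://policies.yahoo.com/us/en/yahoo/terms/product-atos/yql/index.htm for YQL Terms of Use"

Example in the console: https://developer.yahoo.com/yql/console/#h=select+*+from+html+where+url%3D%22http%3A%2F%2Fwww.google.de%22

Did Yahoo discontinue this service? Does anyone know of an announcement from Yahoo? I am wondering whether this is just a bug or whether they really shut the service down...

All the documentation is still there (HTML scraping): https://developer.yahoo.com/yql/guide/yql-select-xpath.html , https://developer.yahoo.com/yql/

Some time ago I posted in Yahoo's YQL forum, but it no longer exists (or at least I cannot find it). How can you contact Yahoo to find out whether this service has really been discontinued?

Best regards, hebr3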

Answer 1

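It looks like Yahoo did indeed end support for the html table on June 8, 2017 (according to my error logs). There does not appear to be an official announcement yet.

Fortunately, there is a YQL community table that can be used in place of the official html table with only minor changes to your codebase. See the htmlstring table in the YQL Console.

Change your YQL query to reference htmlstring instead of html, and include the community environment in your REST query. For example: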

/*/ Old code /*/

var site = "http://www.test.com/foo.html";

var yql = "select * from html where url='" + site + "' AND xpath='//div'";

var resturl = "https://query.yahooapis.com/v1/public/yql?q="
    + encodeURIComponent(yql) + "&format=json";

/*/ New code /*/

var site = "http://www.test.com/foo.html";

var yql = "select * from htmlstring where url='" + site + "' AND xpath='//div'";

var resturl = "https://query.yahooapis.com/v1/public/yql?q="
    + encodeURIComponent(yql) + "&format=json"
    + "&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys";
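The change above can be wrapped in a small helper so the table name and the community environment live in one place. This is only a sketch: `buildHtmlstringUrl` and `COMMUNITY_ENV` are hypothetical names introduced here, and the helper does nothing beyond assembling the same URL as the "New code" snippet.

```javascript
// Community tables environment, taken from the "New code" snippet above.
var COMMUNITY_ENV = "store://datatables.org/alltableswithkeys";

// Hypothetical helper: builds the public YQL REST URL for an htmlstring query.
// It does not escape quotes inside `site` or `xpath`; callers must pass trusted values.
function buildHtmlstringUrl(site, xpath) {
    var yql = "select * from htmlstring where url='" + site + "' AND xpath='" + xpath + "'";
    return "https://query.yahooapis.com/v1/public/yql?q="
        + encodeURIComponent(yql)
        + "&format=json"
        + "&env=" + encodeURIComponent(COMMUNITY_ENV);
}

var resturl = buildHtmlstringUrl("http://www.test.com/foo.html", "//div");
```

Encoding the environment with `encodeURIComponent` yields the same `store%3A%2F%2Fdatatables.org%2Falltableswithkeys` string that is hard-coded in the snippet above.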

Answer 2

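Thank you very much for your code.

It helped me create my own script to read the pages I need. I had never written PHP before, but with your code and the wisdom of the internet I was able to adapt your script to my needs.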

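PHP

<?php
    header('Access-Control-Allow-Origin: *'); //all
    $url = $_GET['url'];
    // Compare against the full prefix length instead of a hard-coded 25.
    $allowedPrefix = "https://www.xxxx.yy";
    if (strncmp($url, $allowedPrefix, strlen($allowedPrefix)) !== 0) {
       echo "Only https://www.xxxx.yy allowed!";
       return;
    }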
    $xpathQuery = $_GET['xpath'];

    // security checks here are only basic; stricter validation is needed
    function check($target_url){
        $check = curl_init();
        //curl_setopt( $check, CURLOPT_HTTPHEADER, array("REMOTE_ADDR: $ip", "HTTP_X_FORWARDED_FOR: $ip"));
        //curl_setopt($check, CURLOPT_INTERFACE, "xxx.xxx.xxx.xxx");
        curl_setopt($check, CURLOPT_COOKIEJAR, 'cookiemon.txt');
        curl_setopt($check, CURLOPT_COOKIEFILE, 'cookiemon.txt');
        // CURLOPT_TIMEOUT is in seconds; use CURLOPT_TIMEOUT_MS for milliseconds
        curl_setopt($check, CURLOPT_TIMEOUT, 40000);
        curl_setopt($check, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($check, CURLOPT_URL, $target_url);
        curl_setopt($check, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
        curl_setopt($check, CURLOPT_FOLLOWLOCATION, false);
        $tmp = curl_exec($check);
        curl_close($check);
        return $tmp;
    }

    // get html
    $html = check($url);
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    // apply xpath filter
    $xpath = new DOMXPath($dom);
    $elements = $xpath->query($xpathQuery);
    $temp_dom = new DOMDocument();
    foreach($elements as $n)   $temp_dom->appendChild($temp_dom->importNode($n,true));
    $renderedHtml = $temp_dom->saveHTML();

    // return html in json response
    // json structure: 
    // {html: "xxxx"}
    $post_data = array(
      'html' => $renderedHtml
    );  
    echo json_encode($post_data); 

?>
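The PHP script notes that its security check is only basic. A plain prefix test also accepts look-alike hosts such as https://www.xxxx.yy.evil.com. One stricter option, sketched here in JavaScript (the language of the client code in this answer), is to parse the URL and compare its origin. `isAllowedUrl` and `ALLOWED_ORIGIN` are hypothetical names, and the origin is the same placeholder used in the PHP script.

```javascript
// Placeholder origin, same as in the PHP script; replace with the real site.
var ALLOWED_ORIGIN = "https://www.xxxx.yy";

// Hypothetical allowlist check based on the parsed URL rather than a string prefix.
function isAllowedUrl(candidate) {
    try {
        // URL#origin is scheme + host (+ port), so look-alike hosts do not match.
        return new URL(candidate).origin === ALLOWED_ORIGIN;
    } catch (e) {
        // new URL() throws a TypeError for strings it cannot parse.
        return false;
    }
}
```

The same idea carries over to the PHP side by comparing the parsed scheme and host instead of the raw string.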

Javascript

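$.ajax({
    url: "url of service",
    dataType: "json",
    data: {
        url: url,       // page to fetch through the PHP script
        xpath: "//*"
    },
    type: 'GET',
    success: function(data) {
        // data.html holds the HTML returned by the PHP script
    },
    error: function(jqXHR) {
        // request failed
    }
});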
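For callers that do not use jQuery, the request URL and the response handling can be sketched separately. The helper names `buildProxyUrl` and `extractHtml` are hypothetical; the only things taken from the PHP script are the `url` and `xpath` query parameters and the `{html: "xxxx"}` response structure.

```javascript
// Hypothetical helper: builds the GET URL for the PHP script above.
function buildProxyUrl(serviceUrl, pageUrl, xpath) {
    // URLSearchParams percent-encodes both values for the query string.
    var params = new URLSearchParams({ url: pageUrl, xpath: xpath });
    return serviceUrl + "?" + params.toString();
}

// Hypothetical helper: pulls the HTML fragment out of the JSON body ({html: "..."}).
function extractHtml(jsonText) {
    var payload = JSON.parse(jsonText);
    if (typeof payload.html !== "string") {
        throw new Error("Unexpected response: missing 'html' field");
    }
    return payload.html;
}
```

A caller would pass the result of `buildProxyUrl` to whatever HTTP client it uses and hand the response text to `extractHtml`.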

Answer 3

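Although YQL no longer supports the html table, I came to realize that instead of making one network call and parsing out the result, you can make several calls. For example, my previous call looked like this: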

select html from rss where url="http://w1.weather.gov/xml/current_obs/KFLL.rss"

That single call was supposed to give me all the information I needed.

Now I have to use these two:

select title from rss where url="http://w1.weather.gov/xml/current_obs/KFLL.rss"

select description from rss where url="http://w1.weather.gov/xml/current_obs/KFLL.rss"

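...to get what I want. I don't know why they would deprecate something like this without clearly listing a fallback, but you should be able to get your data this way.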
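If more than a couple of fields are needed, the per-field statements can be generated instead of written by hand. This is a sketch only: `buildFieldQueries` is a hypothetical helper, and it simply produces the same strings as the two queries above.

```javascript
// Hypothetical helper: one YQL statement per requested field of the rss table.
function buildFieldQueries(feedUrl, fields) {
    return fields.map(function (field) {
        return 'select ' + field + ' from rss where url="' + feedUrl + '"';
    });
}

var queries = buildFieldQueries(
    "http://w1.weather.gov/xml/current_obs/KFLL.rss",
    ["title", "description"]
);
```

Each entry in `queries` is then sent as its own request, and the results are combined on the client.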

Answer 4

I recently built an open source tool called CloudQuery (source code) that provides functionality similar to YQL. It can turn most websites into an API with a few clicks.
