cURL Multiple URLs & parse result

Question

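I'm developing a PHP scraper to do the following:

1. cURL several (always fewer than 10) URLs,
2. add the HTML from each URL to a DOMDocument,
3. parse that DOMDocument for <a> elements that link to PDFs,
4. store the href of each matching element in an array.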

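I have steps 1 and 2 working (my code outputs the combined HTML of all the URLs), but when I try to iterate over the result to find <a> elements that link to PDFs, I get nothing (an empty array).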

I have tried my parser code on a single cURL request and it works (it returns an array with the URL of every PDF on that page).

Here is my cURL code:

$urls = Array( 
 'http://www.example.com/about/1.htm', 
 'http://www.example.com/about/2.htm',
 'http://www.example.com/about/3.htm',
 'http://www.example.com/about/4.htm' 
); 

# Make DOMDoc
$dom = new DOMDocument();

foreach ($urls as $url) { 
    $ch = curl_init($url);  
    $html = curl_exec($ch);
    # Exec and close CURL, suppressing errors
    @$dom->createDocumentFragment($html);
    curl_close($ch);
}

Parser code:

#make pdf link array
$pdf_array = array();
# Iterate over all <a> tags and spit out those that end with ".pdf"
foreach($dom->getElementsByTagName('a') as $link) {
    # Show the <a href>
    $linkh = $link->getAttribute('href');
    $filend = ".pdf";
    # @ at beginning supresses string length warning
    @$pdftester = substr_compare($linkh, $filend, -4, 4, true);
    if ($pdftester === 0) {
        array_push($pdf_array, $linkh);
    }
}

The full code looks like this:

<?php 

$urls = Array( 
 'http://www.example.com/about/1.htm', 
 'http://www.example.com/about/2.htm',
 'http://www.example.com/about/3.htm',
 'http://www.example.com/about/4.htm' 
); 

# Make DOM parser
$dom = new DOMDocument();

foreach ($urls as $url) { 
    $ch = curl_init($url);  
    $html = curl_exec($ch);
    # Exec and close CURL, suppressing errors
    @$dom->createDocumentFragment($html);
    curl_close($ch);
} 

#make pdf link array
$pdf_array = array();
# Iterate over all <a> tags and spit out those that end with ".pdf"
foreach($dom->getElementsByTagName('a') as $link) {
    # Show the <a href>
    $linkh = $link->getAttribute('href');
    $filend = ".pdf";
    # @ at beginning supresses string length warning
    @$pdftester = substr_compare($linkh, $filend, -4, 4, true);
    if ($pdftester === 0) {
        array_push($pdf_array, $linkh);
    }
}

print_r($pdf_array);

?>

Any suggestions as to what I'm doing wrong in the DOM parsing and PDF array building?

Answer 1

1. To get the HTML content into $html, you need to set the cURL option CURLOPT_RETURNTRANSFER. Otherwise curl_exec() just prints the content to the page and puts true (success, shown as 1) into $html.

CURLOPT_RETURNTRANSFER: TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($ch);

2. The createDocumentFragment method doesn't do what you think it does.

This function creates a new instance of class DOMDocumentFragment. This node will not show up in the document unless it is inserted with (e.g.) DOMNode::appendChild().

So it does not read the HTML into the DOM document. It does not even take an $html argument.

You are probably better off using the loadHTML method, or loadHTMLFile if you want to skip cURL and load the file straight into the DOM object in one go.

@$dom->loadHTML($html);    // Like this
@$dom->loadHTMLFile($url); // or this (removing the CURL lines)
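
As a possible alternative to the @ operator, libxml's own error handling can be switched on so that warnings about malformed markup are collected instead of printed. The following is only a minimal sketch: the inline $html string is made-up sample markup standing in for a fetched page, not anything from the original question.

```php
<?php
// Hypothetical sample markup standing in for a page fetched with cURL.
// The <p> is deliberately left unclosed, as real-world HTML often is.
$html = '<html><body><p>Reports'
      . '<a href="/docs/a.pdf">A</a>'
      . '<a href="/about.htm">About</a>'
      . '</body></html>';

$dom = new DOMDocument();

// Collect libxml parse warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$loaded = $dom->loadHTML($html);
libxml_clear_errors();                  // discard the collected warnings
libxml_use_internal_errors($previous);  // restore the earlier setting

// Note: each loadHTML() call replaces whatever the document held before.
$count = $dom->getElementsByTagName('a')->length;
echo $count, "\n";
```

Because every call to loadHTML() replaces the previous contents, loading several pages into the same DOMDocument in a loop leaves only the last page in it.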

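3. It makes sense to extract the PDF links right after loading each page's HTML into the DOM object, rather than trying to combine all the pages into one document before extracting. The extraction code you have actually works fine :-)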
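
Putting the three points together, a minimal sketch might look like the following. The function names extractPdfLinks, fetchHtml and collectPdfLinks are hypothetical helpers introduced for illustration; treat this as one way to arrange the code under those assumptions rather than a finished scraper.

```php
<?php
// Hypothetical helper: return the href of every <a> in $html ending in ".pdf".
function extractPdfLinks(string $html): array
{
    $links = [];
    if ($html === '') {
        return $links; // loadHTML() does not accept an empty string
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep markup warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        // Case-insensitive comparison of the last four characters;
        // the length check avoids comparing against very short hrefs.
        if (strlen($href) >= 4 && substr_compare($href, '.pdf', -4, 4, true) === 0) {
            $links[] = $href;
        }
    }
    return $links;
}

// Hypothetical helper: fetch a page body as a string, or null on failure.
function fetchHtml(string $url): ?string
{
    $ch = curl_init($url);
    // Return the body from curl_exec() instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    // The handle is released when $ch goes out of scope.
    return is_string($html) ? $html : null;
}

// Hypothetical helper: extract links page by page and merge the results.
function collectPdfLinks(array $urls): array
{
    $pdfs = [];
    foreach ($urls as $url) {
        $html = fetchHtml($url);
        if ($html !== null) {
            $pdfs = array_merge($pdfs, extractPdfLinks($html));
        }
    }
    return $pdfs;
}
```

With the $urls array from the question, print_r(collectPdfLinks($urls)); would print the combined list of PDF links. Keeping the parsing in its own function also makes it possible to try it on a plain HTML string without any network access.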
