Trouble with preg_replace() in a foreach loop
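Long story short, my client has lost access to their server because of a dispute, and they need all of their club photos so that I can build them a new website. I have to download the photos by URL; they are served through a PHP script that outputs them at different sizes to reduce server load.
There are over 3,000 of them, and I am not going to waste time fetching them one by one.
So I decided to write a quick and [very] dirty PHP script that uses DOMDocument to crawl the pages for image links, walking through each album and then through each album's sub-pages.
Everything works except for one specific part of the script, which looks for the following on album pages:
(1) A link to an image, i.e.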
<div class='imagethumb'>
<a href="/gallery/index.php?album=blowout1&image=blahblah.jpg" title="Blahblah">
<img src="/gallery/index.php?album=blowout1&image=blahblah_thumb.jpg" />
</a>
</div>
(2) A link to the next page, i.e.
<li>
<a href="/gallery/index.php?album=beginning&page=2" title="Page 2">2</a>
</li>
(3) A link to the album's "Last Page", shown as "..."
<li>
<a href="/gallery/index.php?album=recognition&page=9" title="Page 9">...</a>
</li>
Here is the relevant part of the script:
//$url is an argument in the function wrapping this script
//look on albums for links
foreach ($album_links as $a_url) {
    $album_html = file_get_contents($a_url['url']);
    $album = new DOMDocument;
    $album->loadHTML($album_html);
    $i_links = $album->getElementsByTagName('a');
    $album_title = $album->getElementsByTagName('title')->item(0)->textContent;
    //to keep track of the number of sub-page links found, exclude page 1
    $num_page_lnks = 1;
    //search through all links on the page, look for:
    foreach ($i_links as $link) {
        //Links contained in div with class='imagethumb'
        if ($link->parentNode->getAttribute('class') == 'imagethumb' ) {
            array_push($image_links, ["album" => str_replace(" | ", "", $album_title), "title" => $link->getAttribute('title'), "url" => "http://" . parse_url($url, PHP_URL_HOST) . $link->getAttribute('href') . "&p=*full-image"]);
        }
        //links contained in li with no class, has a page number in the title, and is not a "..." link
        elseif ($link->parentNode->getAttribute('class') == '' && preg_match('/Page \d*/', $link->getAttribute('title')) && $link->textContent != "...") {
            //add to the number of sub page links found
            $num_page_lnks++;
            array_push($image_page_links, "http://" . parse_url($url, PHP_URL_HOST) . $link->getAttribute('href'));
        }
        //links containing the text "..." (link to last album page, if more than 7 pages)
        elseif($link->textContent == "...") {
            //Parse the url into parts
            $url_parse=[];
            parse_str($link->getAttribute('href'), $url_parse);
            //Last Page links appear when greater than 7 pages, so start at 8 ($num_page_lnks + 1)
            for ($count = ($num_page_lnks + 1); $count < ($url_parse['page'] + 1); $count++) {
                array_push($image_page_links, "http://" . parse_url($url, PHP_URL_HOST) . preg_replace("/[^\=]\d+$/", $count, $link->getAttribute('href')));
            }
        }
    }
    unset($album);
    unset($album_html);
    unset($i_links);
}
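Whenever the script finds a sub-page link it increments $num_page_lnks, so that when it reaches the "..." link it knows where to start when generating the links for the intermediate pages.
This is what it returns: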
{
"0": "http://club.website.com/gallery/index.php?album=beginning&page=2",
"1": "http://club.website.com/gallery/index.php?album=beginning&page=3",
"2": "http://club.website.com/gallery/index.php?album=history&page=2",
"3": "http://club.website.com/gallery/index.php?album=history&page=3",
"4": "http://club.website.com/gallery/index.php?album=history&page=4",
"5": "http://club.website.com/gallery/index.php?album=history&page=5",
"6": "http://club.website.com/gallery/index.php?album=history&page=6",
"7": "http://club.website.com/gallery/index.php?album=history&page=7",
"8": "http://club.website.com/gallery/index.php?album=memorial&page=2",
"9": "http://club.website.com/gallery/index.php?album=memorial&page=3",
"10": "http://club.website.com/gallery/index.php?album=memorial&page=4",
"11": "http://club.website.com/gallery/index.php?album=memorial&page=5",
"12": "http://club.website.com/gallery/index.php?album=memorial&page=6",
"13": "http://club.website.com/gallery/index.php?album=memorial&page=7",
"14": "http://club.website.com/gallery/index.php?album=memorial&page=9",
"15": "http://club.website.com/gallery/index.php?album=memorial&page=9",
"16": "http://club.website.com/gallery/index.php?album=members&page=2",
"17": "http://club.website.com/gallery/index.php?album=members&page=3",
"18": "http://club.website.com/gallery/index.php?album=members&page=4",
"19": "http://club.website.com/gallery/index.php?album=members&page=5",
"20": "http://club.website.com/gallery/index.php?album=members&page=6",
"21": "http://club.website.com/gallery/index.php?album=members&page=7",
"22": "http://club.website.com/gallery/index.php?album=members&page=8",
"23": "http://club.website.com/gallery/index.php?album=members&page=9",
"24": "http://club.website.com/gallery/index.php?album=members&page=10",
"25": "http://club.website.com/gallery/index.php?album=members&page=11",
"26": "http://club.website.com/gallery/index.php?album=toy_run&page=2",
"27": "http://club.website.com/gallery/index.php?album=toy_run&page=3",
"28": "http://club.website.com/gallery/index.php?album=toy_run&page=4",
"29": "http://club.website.com/gallery/index.php?album=toy_run&page=5",
"30": "http://club.website.com/gallery/index.php?album=toy_run&page=6",
"31": "http://club.website.com/gallery/index.php?album=toy_run&page=7",
"32": "http://club.website.com/gallery/index.php?album=toy_run&page=8",
"33": "http://club.website.com/gallery/index.php?album=recognition&page=2",
"34": "http://club.website.com/gallery/index.php?album=recognition&page=3",
"35": "http://club.website.com/gallery/index.php?album=recognition&page=4",
"36": "http://club.website.com/gallery/index.php?album=recognition&page=5",
"37": "http://club.website.com/gallery/index.php?album=recognition&page=6",
"38": "http://club.website.com/gallery/index.php?album=recognition&page=7",
"39": "http://club.website.com/gallery/index.php?album=recognition&page=9",
"40": "http://club.website.com/gallery/index.php?album=recognition&page=9",
"41": "http://club.website.com/gallery/index.php?album=blowout1&page=2",
"42": "http://club.website.com/gallery/index.php?album=blowout1&page=3",
"43": "http://club.website.com/gallery/index.php?album=blowout1&page=4",
"44": "http://club.website.com/gallery/index.php?album=blowout1&page=5",
"45": "http://club.website.com/gallery/index.php?album=blowout1&page=6",
"46": "http://club.website.com/gallery/index.php?album=blowout1&page=7",
"47": "http://club.website.com/gallery/index.php?album=blowout1&page=8",
"48": "http://club.website.com/gallery/index.php?album=blowout1&page=9",
"49": "http://club.website.com/gallery/index.php?album=blowout1&page=10"
}
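The number of sub-page entries in that object is exactly right, but here is the problem:
- When an album has 7 or fewer pages (up to 6 sub-pages), the script works fine
- When an album has 8 pages (7 sub-pages), the script works fine
- When an album has 9 pages (8 sub-pages: [1] current page, [2][3][4][5][6][7], and [...] for the last page (9)), the script lists page 9 twice
- When an album has 10 or more pages, there is no problem.
I have no idea what I am doing wrong.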
Edit:
Here is the HTML source behind $i_links:
<ul class="pagelist">
<li class="prev"><span class="disabledlink">« prev</span></li>
<li class="current"><a href="/gallery/index.php?album=recognition" title="Page 1 (Current Page)">1</a></li>
<li><a href="/gallery/index.php?album=recognition&page=2" title="Page 2">2</a></li>
<li><a href="/gallery/index.php?album=recognition&page=3" title="Page 3">3</a></li>
<li><a href="/gallery/index.php?album=recognition&page=4" title="Page 4">4</a></li>
<li><a href="/gallery/index.php?album=recognition&page=5" title="Page 5">5</a></li>
<li><a href="/gallery/index.php?album=recognition&page=6" title="Page 6">6</a></li>
<li><a href="/gallery/index.php?album=recognition&page=7" title="Page 7">7</a></li>
<li><a href="/gallery/index.php?album=recognition&page=9" title="Page 9">...</a></li>
<li class="next"><a href="/gallery/index.php?album=recognition&page=2" title="Next Page">next »</a></li>
</ul>
The problem is in your last nested loop:
//Last Page links appear when greater than 7 pages, so start at 8 ($num_page_lnks + 1)
for ($count = ($num_page_lnks + 1); $count < ($url_parse['page'] + 1); $count++) {
    array_push($image_page_links, "http://" . parse_url($url, PHP_URL_HOST) . preg_replace("/[^\=]\d+$/", $count, $link->getAttribute('href')));
}
When you reach the 7th sub-page link (the one whose text is "..."), $num_page_lnks is 7 and $url_parse['page'] is 9. So the loop runs twice, with $count taking the value 8 and then 9.
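To make the loop bounds concrete, here is a minimal sketch that replays just the counter for the "recognition" album. The two starting values are hard-coded from the HTML in the question; `$counts` is only an illustrative name and is not part of the original script.

```php
<?php
// Values the script holds when it reaches the "..." link of the
// "recognition" album (hard-coded here for illustration).
$num_page_lnks = 7;            // starts at 1, incremented for links 2..7
$url_parse = ['page' => '9'];  // page number carried by the "..." link

// Same bounds as the loop in the question.
$counts = [];
for ($count = $num_page_lnks + 1; $count < $url_parse['page'] + 1; $count++) {
    $counts[] = $count;
}

print_r($counts); // two iterations: 8, then 9
```

So the loop itself is fine: it asks for pages 8 and 9, and the wrong output has to come from how the URL is built inside it.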
But... those links stay unchanged:
"http://club.website.com/gallery/index.php?album=recognition&page=9"
"http://club.website.com/gallery/index.php?album=recognition&page=9"
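because your regex pattern does not perform the intended replacement. The leading [^\=] has to consume one character that is not "=" before the digits, so a single-digit page number sitting directly after "=" never matches. With two or more digits the first digit satisfies [^\=] and the rest satisfies \d+, which is why albums with 10 or more pages come out right. An 8-page album only looks right because the unchanged link already points at page 8.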
var_dump(preg_replace("/[^\=]\d+$/", 8, "/gallery/index.php?album=recognition&page=9"));
// will output:
string(43) "/gallery/index.php?album=recognition&page=9"
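If it helps to see where the pattern from the question breaks, the sketch below runs it against one single-digit and one two-digit "last page" link. Both hrefs come from the output in the question; the names `$pattern`, `$hrefs` and `$results` are just for this illustration.

```php
<?php
// The pattern from the question: one character that is not "=",
// followed by one or more digits, anchored at the end of the string.
$pattern = '/[^\=]\d+$/';

$hrefs = [
    '/gallery/index.php?album=recognition&page=9',  // last page has one digit
    '/gallery/index.php?album=members&page=11',     // last page has two digits
];

$results = [];
foreach ($hrefs as $href) {
    // Try to point each link at page 8.
    $results[] = preg_replace($pattern, '8', $href);
}

print_r($results);
// [0] is unchanged: nothing but "=" sits in front of the lone digit.
// [1] becomes ...&page=8: the first "1" is taken by [^\=], the second by \d+.
```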
Change your regex pattern to /\d+$/, or consider some other logic.
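As a rough sketch of both options: the first loop uses the corrected pattern, and the second shows one possible "other logic" that rebuilds the query string instead of using a regex. `with_page()` is a hypothetical helper written for this example and is not part of your script; the href and the 8..9 range are taken from the "recognition" album.

```php
<?php
$href = '/gallery/index.php?album=recognition&page=9';

// Option 1: the corrected pattern, which replaces the trailing digits only.
$viaRegex = [];
for ($count = 8; $count <= 9; $count++) {
    $viaRegex[] = preg_replace('/\d+$/', (string) $count, $href);
}

// Option 2 (hypothetical helper): set the "page" parameter explicitly,
// so the result does not depend on where the number sits in the URL.
function with_page(string $href, int $page): string
{
    $parts = parse_url($href);                 // splits path and query
    $query = [];
    parse_str($parts['query'] ?? '', $query);  // album=..., page=...
    $query['page'] = $page;
    return ($parts['path'] ?? '') . '?' . http_build_query($query, '', '&');
}

$viaQuery = [];
for ($count = 8; $count <= 9; $count++) {
    $viaQuery[] = with_page($href, $count);
}

print_r($viaRegex);
print_r($viaQuery);
// Both arrays hold ...&page=8 followed by ...&page=9.
```

The regex fix is the smaller change. The query-based version is a bit more code, but it keeps working if another parameter is ever appended after page.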