PHP: scraping and outputting a specific value or number in a given tag
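I'm very new to PHP. With some help, I've figured out how to scrape a site when the element has an identifier such as h1 class=____.
Even better, I've figured out how to output the exact word or value I want, as long as it is separated by a blank space. For example, if a given tag <INVENTORY> outputs "30 balls", I can echo element [0] and only "30" is printed. Great.
I'm running into a problem, though, when I try to pull a value that is not separated by a blank space. Say I want "-34.89" as the output (more precisely, whatever number is in that placeholder on the website, since the number on the source site changes over time).
However, the output I get is "-34.89dowjonessstockchange". There is no blank space in it.
What can I do to output only -34.89, or whatever the number happens to be on a given day? There must be some way to say, for the output above, "only output the characters at positions [0,1,2,3,4,5]", which here would be -34.89.
Below is a test example against a site that outputs words and values split on the " " blank space. It is almost what I need, but it lacks this more precise method.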
// Scraping function for the Ethereum change value.
// deleteBlankInArray() and globalVars are defined elsewhere (not shown).
function getEthereumchange(){
    $doc = new DOMDocument;
    // We don't want to bother with white spaces
    $doc->preserveWhiteSpace = false;
    $doc->strictErrorChecking = false;
    $doc->recover = true;
    $doc->loadHTMLFile('https://coinmarketcap.com/');
    $xpath = new DOMXPath($doc);
    $query = "//tr[@id='id-ethereum']";
    $entries = $xpath->query($query);
    foreach ($entries as $entry) {
        $result = trim($entry->textContent);
        $ret_ = explode(' ', $result);
        // make sure no element in the array starts or ends with a blank
        foreach ($ret_ as $key=>$val){
            $ret_[$key]=trim($val);
        }
        // delete empty elements and elements that are only "\n" "\r" "\t"
        // (callback name passed as a string; a bare name is an error in PHP 8)
        $ret_ = array_values(array_filter($ret_, 'deleteBlankInArray'));
        // write the element at index 7 to the cache file
        file_put_contents(globalVars::$_cache_dir . "ethereumchange",
            $ret_[7]);
    }
}
Thanks very much.
If you only care about that percent change, try this and remove the whole foreach section:
$query = "//tr[@id='id-ethereum']/td[contains(@class, 'percent-24h')]";
$entries = $xpath->query($query);
echo $entries->item(0)->getAttribute('data-usd'); //-5.15
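Note that item(0) returns null when the query matches nothing, so the call above fails with a fatal error if the site changes its markup. A minimal sketch of a guarded version follows; the helper name readAttribute and the inline sample HTML are made up for illustration and are not the live page.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = '<table><tr id="id-ethereum">'
      . '<td class="no-wrap percent-24h negative_change" data-usd="-5.15">-5.15%</td>'
      . '</tr></table>';

// Returns the attribute of the first node matched by $query, or null
// if the query is invalid, matches nothing, or the attribute is absent.
function readAttribute(string $html, string $query, string $attribute): ?string
{
    $doc = new DOMDocument();
    // Collect libxml warnings instead of printing them for imperfect HTML.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    $node = $nodes->item(0);
    if (!($node instanceof DOMElement) || !$node->hasAttribute($attribute)) {
        return null;
    }
    return $node->getAttribute($attribute);
}

$query = "//tr[@id='id-ethereum']/td[contains(@class, 'percent-24h')]";
echo readAttribute($html, $query, 'data-usd') ?? 'not found', PHP_EOL;
```

For the live site you would pass the downloaded page source instead of the sample string.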
Here are the rest of the columns:
$xpath = new DOMXPath($doc);
$market_cap = $xpath->query("//tr[@id='id-ethereum']/td[contains(@class, 'market-cap')]");
echo $market_cap->item(0)->getAttribute('data-usd'); //30574084827.1
$price = $xpath->query("//tr[@id='id-ethereum']/td/a[contains(@class, 'price')]");
echo $price->item(0)->getAttribute('data-usd'); //329.567
$circulating_supply = $xpath->query("//tr[@id='id-ethereum']/td/a[@data-supply]");
echo $circulating_supply->item(0)->getAttribute('data-supply'); //92770467.9991
$volume = $xpath->query("//tr[@id='id-ethereum']/td/a[contains(@class, 'volume')]");
echo $volume->item(0)->getAttribute('data-usd'); //810454000.0
$percent_change = $xpath->query("//tr[@id='id-ethereum']/td[contains(@class, 'percent-24h')]");
echo $percent_change->item(0)->getAttribute('data-usd'); //-3.79
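If a site does not expose the number in an attribute and you only have fused text such as "-34.89dowjonessstockchange", a regular expression is probably more robust than counting character positions, because the number can change length from day to day. This is a rough sketch; the function name extractFirstNumber is invented for this example, and it assumes a dot as the decimal separator and commas as thousands separators.

```php
<?php
// Returns the first signed decimal number found in $text, or null if none.
function extractFirstNumber(string $text): ?float
{
    // Optional sign, a digit, further digits or thousands commas,
    // then an optional fractional part.
    if (preg_match('/[-+]?\d[\d,]*(?:\.\d+)?/', $text, $m) === 1) {
        // Drop thousands separators before converting to float.
        return (float) str_replace(',', '', $m[0]);
    }
    return null;
}

echo extractFirstNumber('-34.89dowjonessstockchange'), PHP_EOL; // -34.89
```

substr($text, 0, 6) would also give "-34.89" for this particular string, but it breaks as soon as the value becomes "-4.10" or "-134.89".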
If you want to use a third-party library, you can use https://github.com/rajanrx/php-scrape
<?php
use Scraper\Scrape\Crawler\Types\GeneralCrawler;
use Scraper\Scrape\Extractor\Types\MultipleRowExtractor;

require_once(__DIR__ . '/../vendor/autoload.php');
date_default_timezone_set('UTC');

// Create crawler
$crawler = new GeneralCrawler('https://coinmarketcap.com/');

// Setup configuration
$configuration = new \Scraper\Structure\Configuration();
$configuration->setTargetXPath('//table[@id="currencies"]');
$configuration->setRowXPath('.//tbody/tr');
$configuration->setFields(
    [
        new \Scraper\Structure\TextField(
            [
                'name' => 'Name',
                'xpath' => './/td[2]/a',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name' => 'Market Cap',
                'xpath' => './/td[3]',
            ]
        ),
        new \Scraper\Structure\RegexField(
            [
                'name' => '% Change',
                'xpath' => './/td[7]',
                'regex' => '/(.*)%/'
            ]
        ),
    ]
);

// Extract data
$extractor = new MultipleRowExtractor($crawler, $configuration);
$data = $extractor->extract();
print_r($data);
This will print the following:
Array
(
[0] => Array
(
[Name] => Bitcoin
[Market Cap] => ,495,710,233
[% Change] => -1.09
[hash] => 76faae07da1d2f8c1209d86301d198b3
)
[1] => Array
(
[Name] => Ethereum
[Market Cap] => ,063,517,955
[% Change] => -8.10
[hash] => 18ade4435c69b5116acf0909e174b497
)
[2] => Array
(
[Name] => Ripple
[Market Cap] => ,483,663,781
[% Change] => -2.73
[hash] => 5bf61e4bb969c04d00944536e02d1e70
)
[3] => Array
(
[Name] => Litecoin
[Market Cap] => ,263,545,508
[% Change] => -3.36
[hash] => ea205770c30ddc9cbf267aa5c003933e
)
and so on ...
Hope this helps :)
Disclaimer: I am the author of this library.