NodeJS - Read HTML Head Tags
I want to scrape an HTML page in my Node.js application and build a list of the tags inside its head. For example:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Document</title>
<link rel="stylesheet" href="style.css">
<link rel="shortcut icon" href="favicon.ico" type="image/x-icon">
<script src="script.src"></script>
</head>
<body>
...
</body>
</html>
Desired output:
['<meta charset="UTF-8">','<meta name="viewport" content="width=device-width, initial-scale=1.0">','<title>Document</title>', ...etc]
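But I'm a bit stuck: meta tags have no closing tag, so this takes more than a simple regex and split. I wanted to use DOMParser, but I'm in a Node environment. I tried the xmldom npm package, but it only returned a list of line breaks (\r\n).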
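To make the closing-tag problem concrete, here is a rough, dependency-free sketch that treats void elements (such as meta and link) separately from elements that do have a closing tag. The helper name `naiveHeadTags`, the `VOID_TAGS` set, and `sampleHtml` are made up for illustration. This is a naive scanner, not an HTML parser: it assumes well-formed markup, no `>` inside attribute values, no tags inside comments, and no nested elements of the same name.

```javascript
// Elements allowed in <head> that never have a closing tag (assumed list).
const VOID_TAGS = new Set(['meta', 'link', 'base']);

// Hypothetical helper: returns each top-level tag inside <head> as a string.
function naiveHeadTags(html) {
  const head = /<head(?:\s[^>]*)?>([\s\S]*?)<\/head>/i.exec(html);
  if (!head) return [];

  const body = head[1];
  const lower = body.toLowerCase();
  const opener = /<([a-zA-Z][\w-]*)\b[^>]*>/g;
  const tags = [];
  let match;

  while ((match = opener.exec(body)) !== null) {
    const name = match[1].toLowerCase();
    if (VOID_TAGS.has(name)) {
      // Void element: the opening tag is the whole element.
      tags.push(match[0]);
      continue;
    }
    // Non-void element: take everything up to its closing tag.
    const closeTag = `</${name}>`;
    const end = lower.indexOf(closeTag, opener.lastIndex);
    if (end === -1) break; // malformed input; stop rather than guess
    tags.push(body.slice(match.index, end + closeTag.length));
    opener.lastIndex = end + closeTag.length;
  }
  return tags;
}

const sampleHtml = `<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Document</title>
<link rel="stylesheet" href="style.css">
<script src="script.src"></script>
</head>
<body></body>
</html>`;

console.log(naiveHeadTags(sampleHtml));
```

A real parser is still the safer choice for pages you do not control, since real-world markup routinely breaks the assumptions listed above.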
Use the request npm package to fetch the page, then, once you have the response, use the cheerio npm package to parse it and pull whatever you need out of the raw data.
Note: cheerio's syntax is similar to jQuery's.
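const express = require('express'); // assumed: `app` below is an Express app
const request = require('request');
const cheerio = require('cheerio');

const app = express();

app.get('/scrape', (req, res) => {
  request('---your website url to scrape here ---', (error, response, body) => {
    // Bail out early if the request itself failed
    if (error) return res.status(500).send(String(error));
    const $ = cheerio.load(body.toString());
    // Serialize every direct child of <head> back to one HTML string
    const headContents = $('head').children().toString();
    console.log('headContents', headContents);
    res.send(headContents);
  });
});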
One option is to use Cheerio to parse the HTML and extract the information you need from each element:
const cheerio = require('cheerio');
const htmlStr = `<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Document</title>
<link rel="stylesheet" href="style.css">
<link rel="shortcut icon" href="favicon.ico" type="image/x-icon">
<script src="script.src"></script>
</head>
<body>
...
</body>
</html>`;
const $ = cheerio.load(htmlStr);
const headTags = [];
$('head > *').each((_, elm) => {
  headTags.push({ name: elm.name, attribs: elm.attribs, text: $(elm).text() });
});
console.log(headTags);
Output:
[ { name: 'meta', attribs: { charset: 'UTF-8' }, text: '' },
{ name: 'meta',
attribs:
{ name: 'viewport',
content: 'width=device-width, initial-scale=1.0' },
text: '' },
{ name: 'title', attribs: {}, text: 'Document' },
{ name: 'link',
attribs: { rel: 'stylesheet', href: 'style.css' },
text: '' },
{ name: 'link',
attribs:
{ rel: 'shortcut icon',
href: 'favicon.ico',
type: 'image/x-icon' },
text: '' },
{ name: 'script', attribs: { src: 'script.src' }, text: '' } ]