Scrape Hacker News via x-ray/node
How can I scrape Hacker News (https://news.ycombinator.com/) with x-ray/nodejs?
I would like to get something like this out of it:
[
{title1, comment1},
{title2, comment2},
...
{"‘Minimal’ cell raises stakes in race to harness synthetic life", 48}
...
{title 30, comment 30}
]
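There is a news table, but I don't know how to scrape it...
Each story on the site consists of three table rows. The rows do not share a parent of their own, so the structure looks like this: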
<tbody>
<tr class="spacer"> //Markup 1
<tr class="athing"> //Headline 1 ('.deadmark+ a' contains title)
<tr class> //Meta Information 1 (.age+ a contains comments)
<tr class="spacer"> //Markup 2
<tr class="athing"> //Headline 2 ('.deadmark+ a' contains title)
<tr class> //Meta Information 2 (.age+ a contains comments)
...
<tr class="spacer"> //Markup 30
<tr class="athing"> //Headline 30 ('.deadmark+ a' contains title)
<tr class> //Meta Information 30 (.age+ a contains comments)
So far I have tried:
x("https://news.ycombinator.com/", "tr", [{
title: [".deadmark+ a"],
comments: ".age+ a"
}])
and
x("https://news.ycombinator.com/", {
title: [".deadmark+ a"],
comments: [".age+ a"]
})
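The second approach returns 30 titles and 29 comment counts... I don't see any way to map them to each other, since nothing indicates which of the 30 titles is missing its comments.
Any help is appreciated.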
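The markup is not easy to scrape with the X-ray package, because there is no way to reference the current context in a CSS selector. That would have been useful for getting the next tr sibling after the tr.athing row in order to read the comments.
We can still use the "next sibling" notation (+) to reach the next row, but instead of targeting the optional comments link, we take the full text of the row and then extract the comment count with a regular expression. If the row has no comment count, the value is set to 0.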
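The regular-expression step can be tried on its own, without a network request. The sketch below is only an illustration: the helper name parseCommentCount and the sample row texts are made up, and it returns a number rather than a string. It uses \s+ between the number and the word so that a non-breaking space in the page text would also match, which is an assumption about the markup rather than something confirmed here.

```javascript
// Hypothetical helper (not part of x-ray): extract the comment count
// from the full text of a story's meta row.
function parseCommentCount(rowText) {
  // \s also matches a non-breaking space, in case the page uses one.
  var match = rowText.match(/(\d+)\s+comments?/);
  // Rows without a comment count (e.g. "discuss") fall back to 0.
  return match ? Number(match[1]) : 0;
}

// Sample row texts, invented for illustration only.
var rows = [
  "120 points by alice 3 hours ago | 48 comments",
  "1 point by bob 5 minutes ago | discuss",
  "7 points by carol 1 hour ago | 1 comment"
];

console.log(rows.map(parseCommentCount)); // [ 48, 0, 1 ]
```

Because every story row yields exactly one meta row, the resulting array has the same length as the title array, which is what makes pairing by index safe.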
Complete working code:
var Xray = require('x-ray');
var x = Xray();
x("https://news.ycombinator.com/", {
title: ["tr.athing .deadmark+ a"],
comments: ["tr.athing + tr"]
})(function (err, obj) {
if (err) throw err; // stop here if the request or parsing failed
// extract the comment counts and map them into an array of objects
var result = obj.comments.map(function (elm, index) {
var match = elm.match(/(\d+) comments?/);
return {
title: obj.title[index],
comments: match ? match[1]: "0"
};
});
console.log(result);
});
Currently prints:
[ { title: 'Follow the money: what Apple vs. the FBI is really about',
comments: '85' },
{ title: 'Unable to open links in Safari, Mail or Messages on iOS 9.3',
comments: '12' },
{ title: 'Gogs – Go Git Service', comments: '13' },
{ title: 'Ubuntu Tablet now available for pre-order',
comments: '56' },
...
{ title: 'American Tech Giants Face Fight in Europe Over Encrypted Data',
comments: '7' },
{ title: 'Moving Beyond the OOP Obsession', comments: '34' } ]