JavaScript: Iterating a List of Objects
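I'm writing a scraper for Skyscanner just for fun. What I'm trying to do is iterate through all of the flight listings and extract the URL for each one.
So far I can select the listings container with $("div[class^='FlightsResults_dayViewItems']"),
but I'm not sure how to iterate through the returned object and get the URL (/transport/flight/bos...). My pseudocode is: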
for (listing in $("div[class^='FlightsResults_dayViewItems']")) {
    go to class^='EcoTicketWrapper_itineraryContainer'
    go to class^='FlightsTicket_container'
    go to class^='FlightsTicket_link' and get the href and save in an array
}
How can I do this?
Side note: I'm using cheerio and jQuery.
Update:
I figured out that the CSS selector is
$("div[class^='FlightsResults_dayViewItems'] > div:nth-child(at_index_i) > div[class^='EcoTicketWrapper_itineraryContainer'] > div[class^='FlightsTicket_container'] > a[class^='FlightsTicket_link']").href
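Now I'm trying to figure out how to iterate through the listings and apply the selector to each listing in the loop.
Also, it seems that leaving out div:nth-child(at_index_i) doesn't work. Is there a way around that?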
$("div[class^='FlightsResults_dayViewItems'] > div:nth-child(3) > div[class^='EcoTicketWrapper_itineraryContainer'] > div[class^='FlightsTicket_container'] > [class^='FlightsTicket_link']").attr("href")
"/transport/flights/bos/cun/210301/210331/config/10081-2103010815--32733-0-10803-2103011250|10803-2103311225--31722-1-10081-2103312125?adults=1&adultsv2=1&cabinclass=economy&children=0&childrenv2=&destinationentityid=27540602&inboundaltsenabled=false&infants=0&originentityid=27539525&outboundaltsenabled=false&preferdirects=false&preferflexible=false&ref=home&rtn=1"
$("div[class^='FlightsResults_dayViewItems'] > div[class^='EcoTicketWrapper_itineraryContainer'] > div[class^='FlightsTicket_container'] > [class^='FlightsTicket_link']").attr("href")
undefined
Here is the function that iterates through the listings and gets the URL for each one.
async function scrapeListingUrl(listingURL) {
  try {
    const page = await browser.newPage();
    await page.goto(listingURL, { waitUntil: "networkidle2" });
    // await page.waitForNavigation({ waitUntil: "networkidle2" }); // Wait until page is finished loading before navigating
    console.log("Finished loading page.");
    const html = await page.evaluate(() => document.body.innerHTML);
    fs.writeFileSync("./listing.html", html);
    const $ = await cheerio.load(html); // Inject jQuery to easily get content of site more easily compared to using raw js
    // Iterate through flight listings
    // Note: Using regex to match class containing "FlightsResults_dayViewItems" to get listing since actual class name contains nonsense string appended to end.
    const bookingURLs = $('a[class*="FlightsTicket_link"]')
      .map((i, elem) => console.log(elem.href))
      .get();
    console.log(bookingURLs);
    return bookingURLs;
  } catch (error) {
    console.log("Scrape flight url failed.");
    console.log(error);
  }
}
Use map():
const hrefs = $(selector).map((i, elem) => elem.href).get()
Looking at your code, you are not using jQuery, so the code above won't work. You just need a basic selector that matches part of the class with querySelectorAll, and then map to grab the hrefs.
const links = [...document.querySelectorAll('a[class*="FlightsTicket_link"]')]
.map(l=>l.href)
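If you would rather stay on the cheerio side of the function in the question, a minimal sketch might look like the one below. It assumes `$` is what `cheerio.load(html)` returned; the helper name `collectHrefs` is made up for illustration and is not part of cheerio. Two things differ from the question's code. Cheerio hands the `map()` callback a plain parsed node rather than a browser DOM element, so `elem.href` is `undefined` there and the attribute is read with `$(elem).attr("href")`. The callback also has to return the value: returning the result of `console.log(...)` gives `undefined`, which `map()` leaves out of the result. The `a[class*="FlightsTicket_link"]` selector matches the links wherever they sit in the tree, so it does not depend on the `>` child chain. That chain needs every level spelled out, which is most likely why dropping the `div:nth-child(...)` step gave `undefined`.

```javascript
// Sketch only: `collectHrefs` is a hypothetical helper name.
// `$` is assumed to be the function returned by cheerio.load(html).
function collectHrefs($, selector) {
  return $(selector)
    // Wrap the raw node again so the attribute can be read with attr().
    .map((i, elem) => $(elem).attr("href"))
    // get() turns the cheerio collection into a plain array.
    .get();
}

// Possible use inside scrapeListingUrl, after `const $ = cheerio.load(html);`:
//   const bookingURLs = collectHrefs($, 'a[class*="FlightsTicket_link"]');
//   return bookingURLs;
```

The hrefs on the page are site-relative paths like the one shown in the question, so they would still need the site's origin prepended before they can be opened directly.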