Need help scraping images from craigslist

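I've tried everything I can think of. I'm able to get the postUrl, date, title, price and location. If you go to https://sandiego.craigslist.org/search/sss?query=surfboards and paste the snippet below into the console, it returns all the images. But when I try to access them from my code, it returns undefined. Any help would be greatly appreciated!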

$('#search-results > li').each((index, element) => {
   console.log( $(element).children().find('img').attr('src') )
})

import axios from 'axios'
import request from 'request-promise'
import cheerio from 'cheerio'
import express from 'express'

import path from 'path'
const __dirname = path.resolve();

const PORT = process.env.PORT || 8000;

const app = express();

app.get('', (req, res) => {
  res.sendFile(__dirname + '/views/index.html')
});

const surfboards = [];

axios("https://sandiego.craigslist.org/search/sss?query=surfboards")
.then(res => {
  const htmlData = res.data;
  const $ = cheerio.load(htmlData);

  $('#search-results > li').each((index, element) => {
    const postUrl = $(element).children('a').attr('href');
    const date = $(element).children('.result-info').children('.result-date').text();
    const title = $(element).children('.result-info').children('.result-heading').text().trim();
    const price = $(element).children('.result-info').children('.result-meta').children('.result-price').text();
    const location = $(element).children('.result-info').children('.result-meta').children(".result-hood").text().trim();

  
    // Why is this not working?!?!?!?!?!
    const img = $(element).children().find('img').attr('src');
    
    surfboards.push({
      title,
      postUrl,
      date,
      price,
      location,
      img
    })
  })
  return surfboards
}).catch(err => console.error(err))

app.get('/api/surfboards', (req, res) => {

  const usedboards = surfboards
  
  return res.status(200).json({
    results: usedboards
  })
})
// Make App listen
app.listen(PORT, () => console.log(`Server is listening to port ${PORT}`))

It looks like the page sets the images with JavaScript, so axios gets the HTML without the actual image links.

But there seems to be a workaround. You can build the image links by concatenating https://images.craigslist.org with the data-ids values from the parent a tag.

You can get data-ids like this:

var data_ids = $(element).children('a').attr('data-ids')

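Then split it on commas into an array, strip the leading two characters (the 3: prefix) from each entry, and concatenate like this: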

`${img_base_url}/${ids}_${resolution_and_extension}`
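The split-and-concatenate steps above could be sketched roughly as follows. The helper name buildImageUrls and the sample data-ids string are hypothetical, and the '300x300.jpg' suffix is an assumption, since the answer does not say which resolution and extension to use.

```javascript
// Base URL taken from the answer above.
const IMG_BASE_URL = 'https://images.craigslist.org';
// Assumption: the answer does not give the size/extension suffix; this value is a guess.
const RESOLUTION_AND_EXTENSION = '300x300.jpg';

// Hypothetical helper: turn a data-ids attribute value into a list of image URLs.
function buildImageUrls(dataIds) {
  // Some li elements have no images, so data-ids can be undefined.
  if (!dataIds) return [];
  return dataIds
    .split(',')                                                       // one entry per image
    .map(entry => entry.trim())
    .filter(entry => entry.length > 0)
    .map(entry => (entry.startsWith('3:') ? entry.slice(2) : entry))  // drop the "3:" prefix
    .map(id => `${IMG_BASE_URL}/${id}_${RESOLUTION_AND_EXTENSION}`);
}

// Made-up sample value, only to show the shape of the result.
console.log(buildImageUrls('3:aaa_111,3:bbb_222'));
```

Inside the .each callback this would be called as buildImageUrls($(element).children('a').attr('data-ids')).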

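But if you only need the URL of the first image, there is no need to create a new array each time. Use substring instead (note that sometimes a li has no image at all):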

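let first_id; // stays undefined when the li has no image
if (data_ids && data_ids.includes(',')) {
    // several images: take the text between the "3:" prefix and the first comma
    first_id = data_ids.substring(data_ids.indexOf('3:') + 2, data_ids.indexOf(','))
} else if (data_ids) {
    // single image: take everything after the "3:" prefix
    first_id = data_ids.substring(data_ids.indexOf('3:') + 2, data_ids.length)
}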
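Wrapped into a function, the substring approach might look like the sketch below. The name firstImageUrl, its default parameters and the sample strings are hypothetical; the '300x300.jpg' suffix in particular is an assumption and not something the answer specifies.

```javascript
// Hypothetical helper: build the URL of the first image only, without splitting into an array.
// Assumption: suffix '300x300.jpg' is a guess at the resolution and extension.
function firstImageUrl(dataIds, baseUrl = 'https://images.craigslist.org', suffix = '300x300.jpg') {
  // Some li elements have no images at all.
  if (!dataIds) return undefined;

  // Skip the "3:" prefix when it is present.
  const start = dataIds.startsWith('3:') ? 2 : 0;
  // The first id ends at the first comma, or at the end of the string if there is only one id.
  const comma = dataIds.indexOf(',');
  const end = comma === -1 ? dataIds.length : comma;

  const id = dataIds.substring(start, end);
  return `${baseUrl}/${id}_${suffix}`;
}

// Made-up sample values, only to show the three cases.
console.log(firstImageUrl('3:aaa_111,3:bbb_222'));
console.log(firstImageUrl('3:ccc_333'));
console.log(firstImageUrl(undefined));
```

In the question's loop, the failing img line could then become const img = firstImageUrl($(element).children('a').attr('data-ids')), which leaves img undefined for listings without pictures.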