How can I scrape URLs that don't appear in the downloaded HTML? JavaScript might be the problem here

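I'm trying to scrape some URLs from this homepage (www.globo.com). I can get the headlines and other URLs, but some of them don't seem to be in the HTML and can't be scraped with requests and lxml. I don't want to use Selenium or bs4 (BeautifulSoup), because the code will run on a Heroku server and that would make everything harder.

The URLs I want to scrape come after a div with these two classes: container and false. Getting these is mandatory. The other URLs, the ones not under a div with the class "false", I can scrape easily.

Does anyone know how to scrape the URLs given this problem? Or can anyone recommend another library for this task (not bs4 or Selenium)?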

import requests
import lxml.html

url = 'https://www.globo.com/'
page = requests.get(url)
doc = lxml.html.fromstring(page.content)
urls = doc.xpath('//div[@class="container false"]//a/@href')
print(urls)

This doesn't work either:

import requests
import lxml.html

url = 'https://www.globo.com/'
page = requests.get(url)
doc = lxml.html.fromstring(page.content)
urls = doc.xpath('//div[contains(@class, "container") and contains(@class, "false")]//a/@href')
print(urls)

Thanks

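As it turns out, the "missing" URLs are actually in the page source, but you need to do some digging.

Basically, they are loaded by JS from embedded JSON. You can locate the div that holds the JSON and extract all the data for a given column.

Here's how: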

import json

import requests
from lxml import html

# Fetch the homepage once and parse it into an lxml tree.
source = html.fromstring(requests.get('https://www.globo.com/').content)
columns = ["esporte", "jornalismo", "entretenimento"]

for column in columns:
    # Each column's div keeps its items as a JSON string in a data-<column> attribute.
    column_data = json.loads(
        source.xpath(f'//div[@id="column-{column}"]')[0].get(f"data-{column}")
    )
    for item in column_data:
        try:
            print(item["content"]["url"])
            print(f'Item id: {item["id"]}')
            print("-" * 120)
        except KeyError:
            # Items without a URL (usually widgets) are skipped.
            continue

This should produce:

https://ge.globo.com/futebol/times/corinthians/noticia/2022/03/11/junior-moraes-e-aprovado-em-exames-cardiologicos-antes-de-assinar-com-o-corinthians.ghtml
Item id: 527df4d0-2310-4c6c-bda7-7215e2c43ce2
------------------------------------------------------------------------------------------------------------------------
https://ge.globo.com/futebol/times/sao-paulo/noticia/2022/03/11/passo-a-passo-entenda-a-polemica-entre-ceni-diego-costa-e-o-medico-do-sao-paulo-no-classico.ghtml
Item id: 6516b867-c2ca-412b-9a7b-52aca2a58b2d
------------------------------------------------------------------------------------------------------------------------
https://ge.globo.com/pe/futebol/noticia/2022/03/11/joelinton-diz-que-nao-conhece-oasis-e-sugere-alceu-valenca-para-musica-da-torcida-do-newcastle.ghtml
Item id: 3d30a7a6-2e13-44a0-957f-85e9ccd4a389
------------------------------------------------------------------------------------------------------------------------
https://oglobo.globo.com/esportes/futebol/apresentado-no-botafogo-piazon-mostra-empolgacao-com-projeto-da-saf-expectativa-grande-25428998?utm_source=globo.com&utm_medium=oglobo
Item id: f33b3e35-a9b9-4f0d-bb9e-95f145d54046
------------------------------------------------------------------------------------------------------------------------
https://ge.globo.com/futebol/times/corinthians/noticia/2022/03/11/joao-victor-cita-intensidade-maior-nos-treinos-do-corinthians-e-cre-em-evolucao-mais-adaptados.ghtml
Item id: c1306207-e4af-41da-ac65-b4bdc5bc6489
------------------------------------------------------------------------------------------------------------------------
https://extra.globo.com/famosos/jogador-douglas-luiz-da-selecao-brasileira-namora-companheira-de-clube-na-inglaterra-casal-posta-cliques-romanticos-25427740.html
Item id: 44de3874-5143-48a9-89ad-3da8b4c5e0d7
------------------------------------------------------------------------------------------------------------------------
https://ge.globo.com/am/futebol/times/amazonas-fc/noticia/2022/03/11/atacante-walter-ex-santa-cruz-e-anunciado-pelo-amazonas-fc.ghtml
Item id: d98971c4-b220-4c69-bc92-8c877a389951
------------------------------------------------------------------------------------------------------------------------
https://ge.globo.com/programas/verao-espetacular/noticia/2022/03/11/tecnologia-ajuda-surfistas-na-busca-por-ondulacoes-historicas-em-nazare.ghtml
Item id: 2737af2e-ee76-41c2-852c-cc0f0d00e01a
------------------------------------------------------------------------------------------------------------------------
https://revistaquem.globo.com/QUEM-News/noticia/2022/03/popo-tatua-cena-de-luta-com-whindersson-na-pele-e-no-coracao.html
Item id: 1505eee1-18df-4fa8-aa53-77feb10a5129
------------------------------------------------------------------------------------------------------------------------
https://ge.globo.com/combate/noticia/2022/03/11/ufc-marreta-e-ankalaev-batem-peso-rapido-para-luta-no-sabado.ghtml
Item id: c87ceed9-e9c1-47e5-a269-406b0c4a7636
------------------------------------------------------------------------------------------------------------------------
https://g1.globo.com/ce/ceara/noticia/2022/03/11/tres-dias-de-viagem-e-minha-irma-so-chorando-chamando-o-nome-da-minha-mae-diz-garoto-que-viajou-sem-responsavel-de-sao-paulo-ao-ceara.ghtml
Item id: 4379555e-9892-4e43-998a-567f8f4f1eb5
------------------------------------------------------------------------------------------------------------------------
https://g1.globo.com/es/espirito-santo/noticia/2022/03/11/brasileiro-e-preso-na-tailandia-com-cocaina-diluida-em-produtos-de-beleza.ghtml
Item id: e914e223-cb1a-43bb-89f5-eaafc6b475fa
------------------------------------------------------------------------------------------------------------------------
https://revistacrescer.globo.com/Saude/noticia/2022/03/apos-vencer-covid-19-e-um-quadro-de-pneumonia-menina-de-3-anos-sai-do-hospital-e-corre-para-abracar-prima.html
Item id: f60fdd89-01da-44c2-b131-ae132e1345c2
------------------------------------------------------------------------------------------------------------------------
https://g1.globo.com/pr/norte-noroeste/noticia/2022/03/11/justica-nega-posse-de-professora-sem-vacina-contra-covid-para-dar-aulas-na-rede-municipal-de-londrina.ghtml
Item id: 461a0f34-51fa-419b-8c72-80234ce05302
------------------------------------------------------------------------------------------------------------------------
https://g1.globo.com/to/tocantins/noticia/2022/03/11/mauro-carlesse-se-pronuncia-nas-redes-sociais-apos-renuncia-cheguei-no-limite.ghtml
Item id: d1f0e0fb-7ac5-4975-a7ba-ef7746099073
------------------------------------------------------------------------------------------------------------------------
https://revistagalileu.globo.com/Um-So-Planeta/noticia/2022/03/novas-observacoes-mostram-que-gelo-do-artico-afinou-nos-ultimos-3-anos.html
Item id: 5cd0e336-970f-48b6-a2cd-db59cab98964
------------------------------------------------------------------------------------------------------------------------
https://g1.globo.com/sp/ribeirao-preto-franca/noticia/2022/03/11/homem-fotografa-partes-intimas-de-mulher-de-saia-em-loja-de-sertaozinho-sp-video.ghtml
Item id: 6521e28a-fd0a-4666-864d-658d119ff31f
------------------------------------------------------------------------------------------------------------------------
https://g1.globo.com/ba/bahia/noticia/2022/03/11/passageiros-relatam-problema-nas-duas-linhas-do-metro-de-salvador.ghtml
Item id: 29428049-e314-41d0-a693-f2b71b259c79
------------------------------------------------------------------------------------------------------------------------
https://autoesporte.globo.com/curiosidades/noticia/2022/03/maior-carro-do-mundo-tem-26-rodas-heliponto-e-pode-levar-ate-75-pessoas.ghtml
Item id: 79fbc28c-4629-4403-bfc8-7e4511c33d8b
------------------------------------------------------------------------------------------------------------------------
https://revistacrescer.globo.com/Gravidez/noticia/2022/03/coercao-reprodutiva-em-documentario-mulheres-contam-que-parceiros-esconderam-suas-pilulas-anticoncepcionais-e-furaram-preservativos.html
Item id: 471eb31e-b5ae-4d9d-ae12-e9a6e74d1cd3
------------------------------------------------------------------------------------------------------------------------
https://g1.globo.com/fantastico/noticia/2022/03/11/uma-tarde-com-jade-apos-deixar-o-bbb-22-influencer-curtiu-praia-no-rio-e-atendeu-fas.ghtml
Item id: 5d4f5867-5001-498e-a1fe-d189a7adaed2
------------------------------------------------------------------------------------------------------------------------
https://revistaquem.globo.com/QUEM-News/noticia/2022/03/felipe-roque-curte-praia-com-atriz-sofia-starling-ex-de-andre-marques.html
Item id: fa1ead5a-0f95-43d0-bca8-ebd754421104
------------------------------------------------------------------------------------------------------------------------
https://revistaquem.globo.com/QUEM-Inspira/noticia/2022/03/entenda-frontoplastia-procedimento-para-diminuir-testa-feito-pela-ex-bbb-thais-braz.html
Item id: 0a4cebdf-0093-46b0-a192-b34c56e31e44
------------------------------------------------------------------------------------------------------------------------
https://vogue.globo.com/celebridade/noticia/2022/03/gabi-martins-confirma-que-ficou-felipe-neto-mas-descarta-relacionamento-estamos-solteiros.html
Item id: 60b7edde-0ef1-4280-a804-dcccb70c8197
------------------------------------------------------------------------------------------------------------------------
https://revistaquem.globo.com/QUEM-News/noticia/2022/03/jamie-lee-curtis-mostra-corpo-real-em-novo-papel-chupava-barriga-desde-os-11-anos.html
Item id: 07b8250e-4098-44a9-8bac-f70913867aa8
------------------------------------------------------------------------------------------------------------------------
https://gshow.globo.com/tudo-mais/tv-e-famosos/noticia/pergunta-de-susana-vieira-no-encontro-bomba-na-web-posso-falar-mal.ghtml
Item id: 08aacc1c-abd3-477a-a8b9-ffd5d1ab174f
------------------------------------------------------------------------------------------------------------------------
https://revistaquem.globo.com/Entrevista/noticia/2022/03/titi-muller-sobre-relacao-com-o-ex-marido-gente-quer-ver-o-outro-feliz.html
Item id: 05c58354-7b1b-4bd7-b8b6-48459cbfeec0
------------------------------------------------------------------------------------------------------------------------
https://gshow.globo.com/novelas/um-lugar-ao-sol/vem-por-ai/noticia/um-lugar-ao-sol-christianrenato-fica-entre-os-ciumes-de-barbara-e-as-exigencias-de-stephany.ghtml
Item id: f39cce20-143b-4ba7-a728-86bacedde3e0
------------------------------------------------------------------------------------------------------------------------
https://glamour.globo.com/lifestyle/noticia/2022/03/deborah-secco-exibe-marquinha-de-biquini-na-praia-e-ganha-elogio-do-marido.ghtml
Item id: f3d60eb1-3329-48f3-8eb1-7647ca353558
------------------------------------------------------------------------------------------------------------------------
https://glamour.globo.com/lifestyle/noticia/2022/03/kim-kardashian-compartilha-primeira-foto-no-instagram-ao-lado-de-pete-davidson.ghtml
Item id: 284d4a38-12cd-45ce-850a-ad436512444a
------------------------------------------------------------------------------------------------------------------------

Note: some items have an ID but no URL; these are usually widgets. Hence the try-except.
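
To make the data layout easier to see without hitting the live site, here is a minimal offline sketch of the same idea. It is not the code from the answer: it swaps lxml for the standard library's html.parser so it runs with no extra dependencies, and it replaces the try-except with dict.get(). The sample HTML, the example.com URLs, and the names ColumnDataParser and extract_urls are all made up for illustration; the only thing taken from the answer is the assumption that each column lives in a div with id "column-<name>" whose "data-<name>" attribute holds a JSON array.

```python
import html
import json
from html.parser import HTMLParser

# Hypothetical items: two with a URL and one widget-like item without one.
SAMPLE_ITEMS = [
    {"id": "a1", "content": {"url": "https://example.com/story-1"}},
    {"id": "w1"},
    {"id": "a2", "content": {"url": "https://example.com/story-2"}},
]

# Hypothetical page fragment; the JSON is HTML-escaped inside the attribute.
SAMPLE_HTML = (
    '<div id="column-esporte" '
    f'data-esporte="{html.escape(json.dumps(SAMPLE_ITEMS))}"></div>'
)


class ColumnDataParser(HTMLParser):
    """Collect the JSON payload stored on the div for one column."""

    def __init__(self, column):
        super().__init__()
        self.column = column
        self.items = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "div" and attributes.get("id") == f"column-{self.column}":
            raw = attributes.get(f"data-{self.column}")
            if raw:
                # HTMLParser has already unescaped the attribute value.
                self.items = json.loads(raw)


def extract_urls(html_text, column):
    """Return the URLs for one column, skipping items that have none."""
    parser = ColumnDataParser(column)
    parser.feed(html_text)
    urls = []
    for item in parser.items:
        url = (item.get("content") or {}).get("url")
        if url:
            urls.append(url)
    return urls


print(extract_urls(SAMPLE_HTML, "esporte"))
# -> ['https://example.com/story-1', 'https://example.com/story-2']
```

Using .get() means an item with no "content" key is simply skipped, and a column that is not on the page gives an empty list instead of an IndexError. Whether that is preferable to failing loudly depends on how you want to notice layout changes on the site.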