How to extract page_source from a webpage

I am trying to scrape data from a government web page, but the page source I get does not contain the data that is displayed in the browser.

from selenium import webdriver
from selenium.webdriver.support.ui import Select


page = 'http://web.cvm.gov.br/app/esforcosrestritos/#/consultarOferta'

driver = webdriver.Chrome()
driver.get(page)

## Click on "Encerrada"
driver.find_element_by_xpath('//*[@id="content"]/div[4]/div[2]/div/div /div[4]/div[2]/label[3]/input').click()


## Select year
year = Select(driver.find_element_by_xpath('//*[@id="content"]/div[4]/div[2]/div/div/div[4]/div[1]/div/select'))
year.select_by_visible_text('2017')


## Click on "Pesquisar"
driver.find_element_by_xpath('//*[@id="content"]/div[4]/div[3]/div/a[1]/span').click()


## Click on "DEBENTURES SIMPLES" inside "Ofertas Encerradas"
driver.find_element_by_css_selector('#content > div.container.ng-scope > div:nth-child(4) > div:nth-child(2) > div > table > tbody > tr:nth-child(15) > td.col-lg-2.text-left.ng-binding').click()

## Click on 1st result
driver.find_element_by_css_selector('#content > div.container.ng-scope > div:nth-child(4) > div > div > table > tbody > tr.text-center > td.text-left.ng-binding').click()

##Page Source
html = driver.page_source

In this example, for the first field, "CNPJ", instead of getting the value '04.031.960/00001-70', I get this:

<input type="text" class="form-control ng-pristine ng-untouched ng-valid ng-valid-maxlength" data-ng-cnpj="" data-ng-model="$responsavel.ofertante.cnpj" data-ng-change="getNomeResponsavelPorCnpj($responsavel.ofertante)" data-ng-disabled="mesmosDadosEmissor || $responsavel.disabled" maxlength="18" disabled="disabled">

Also, if I hover over the value in the browser, I cannot select it.

Is there a way to get data from this type of page?

Once you click() on the first result, you need to induce WebDriverWait for the heading to become visible; then you can extract the page_source as follows:

  • Code block:

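    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='ng-binding ng-scope'][contains(.,'RIO DE ENCERRAMENTO DE OFERTA P')]")))
    ## Page Source
    print(driver.page_source)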
    
  • Console output:

    <!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="pt_br" data-ng-app="app" class="ng-scope"><head><style type="text/css">@charset "UTF-8";[ng\:cloak],[ng-cloak],[data-ng-cloak],[x-ng-cloak],.ng-cloak,.x-ng-cloak,.ng-hide:not(.ng-hide-animate){display:none !important;}ng\:form{display:block;}</style>  
        <meta charset="UTF-8" />
        <meta name="viewport" content="width=device-width, initial-scale=1" />
        <meta http-equiv="CACHE-CONTROL" content="NO-CACHE" />
        <meta http-equiv="EXPIRES" content="Mon, 22 Jul 2002 11:12:01 GMT" />
    
        <title>Sistema Ofertas com Esforços Restritos</title>
    
        <link rel="shortcut icon" href="resources/img/favicon.ico" />
    
        <link rel="stylesheet" href="resources/css/open-sans.css" />
        <link rel="stylesheet" href="resources/css/bootstrap/css/bootstrap.min.css" />
        <link rel="stylesheet" href="resources/css/bootstrap/css/bootstrap-theme.min.css" />
        <link rel="stylesheet" href="resources/js/bootstrap-datepicker/datepicker.css" />
        <link rel="stylesheet" href="resources/js/ngTable/ng-table.min.css" />
        <link rel="stylesheet" href="resources/css/cvm.css" />
    </head>
    
    <body class="modal-open" style="padding-right: 17px;">
        <div id="fullContent">
    
            <div id="content" data-ng-controller="AutenticarUsuarioController" class="ng-scope">
            <!-- INICIO MENU BRASIL -->
                <div class="nav-brasil">
                    <div class="navbar navbar-default">
                        <div class="container">
                            <div class="navbar-header">
                              <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#brasil">
                              <img src="resources/img/brazil-flag_05.png" />
                              </button>
                            </div>
    
                            <!-- Collect the nav links, forms, and other content for toggling -->
                            <div class="collapse navbar-collapse" id="brasil">
                              <ul class="nav navbar-nav">
                            <li><a class="icon-brasil" href="http://www.brasil.gov.br/" target="_blank">BRASIL</a></li>
                            <li><a href="http://www.acessoainformacao.gov.br/sistema/" target="_blank">Acesso à informação</a></li>
                              </ul>
    
                              <ul class="nav navbar-nav navbar-right">
                            <li class="first-li"><a href="http://brasil.gov.br/barra#participe" target="_blank">Participe</a></li>
                            <li><a href="http://www.servicos.gov.br/" target="_blank">Serviços</a></li>
                            <li><a href="http://www.planalto.gov.br/legislacao" target="_blank">Legislação</a></li>
                            <li><a href="http://brasil.gov.br/barra#orgaos-atuacao-canais" target="_blank">Canais</a></li>
                              </ul>
    
                            </div><!-- /.navbar-collapse -->
                        </div>
                    </div>
                </div>
                <!-- FIM MENU BRASIL -->
    
                <!-- INICIO CABEÇALHO -->
                <div id="header">
                    <div class="container">
                        <div class="row">
                            <div class="col-lg-4">
                                <h5>CVM - Comissão de Valores Mobiliários</h5>
                            </div>
                            <div class="text-right" data-ng-init="initContraste()">
                                <a class="h6" href="javascript:void(0)" data-ng-click="altoContraste()">ALTO CONTRASTE</a>
                            </div>
                        </div>
    
                        <a class="h2" href="javascript:void(0)" data-ng-click="abrirPaginaPrincipal()">Sistemas de Ofertas Públicas com Esforços Restritos</a>
    
                        <div class="row">
                            <div class="col-lg-3">
                                <h5>GOVERNO FEDERAL</h5>
                            </div>
                            <!-- ngIf: temUsuario() -->
                        </div>
                    </div>
                </div>
                <!-- FIM CABEÇALHO -->
    
                <!-- INICIO MENU PRINCIPAL -->
                    <!-- INICIO MENU PRINCIPAL -->
                <div class="nav-principal">
                    <div class="navbar navbar-default">
                        <!-- ngIf: temUsuario() -->
                    </div>
                </div>
                <!-- FIM MENU PRINCIPAL -->
    
                <!-- INICIO CONTEÚDO -->
                <!-- ngView:  --><div data-ng-view="" class="container ng-scope">   
    <div data-ng-init="init()" class="ng-scope">
        <div class="row row-title">
            <div class="right-title">
                <!-- ngIf: acao.isAcaoVisualizar() && permissaoAlteracao -->
    
                <!-- ngIf: acao.isAcaoVisualizar() && permissaoAlteracao -->
    
                <a class="btn btn-link" href="ajuda/Envio_Formulario_Encerramento.pdf" target="_blank">
                    <img src="resources/img/ajuda.png" />
                    <span class="ng-binding">Ajuda</span>
                </a>
            </div>
    
            <!-- ngIf: acao.isAcaoIncluir() -->
            <!-- ngIf: acao.isAcaoAlterar() -->
            <!-- ngIf: acao.isAcaoVisualizar() --><div data-ng-if="acao.isAcaoVisualizar()" class="ng-binding ng-scope">VISUALIZAR FORMULÁRIO DE ENCERRAMENTO DE OFERTA PÚBLICA COM ESFORÇOS RESTRITOS</div><!-- end ngIf: acao.isAcaoVisualizar() -->
    
        </div>
    
        <div style="min-height: 1200px">
            <div class="row row-required ng-binding">* Campos Obrigatórios</div>
            <!-- ngIf: acao.isAcaoAlterar() && !usuarioGestor -->       
    
            <div data-ng-responsavel="$responsavel"></div>
            <div data-ng-oferta="$oferta"></div>
            <div data-ng-intermediario="$intermediario"></div>
            <div data-ng-colocacao="$colocacao"></div>
    
        </div>
    
        <div class="row row-center">
    
            <div class="col-center">
                <a class="btn btn-default" role="button" href="javascript:void(0)" data-ng-click="voltar()">
                    <img src="resources/img/arrow-left.png" />
                    <span class="ng-binding">Voltar</span>
                </a>
    
                <!-- ngIf: acao.isAcaoIncluir() -->
    
                <!-- ngIf: acao.isAcaoAlterar() -->
    
            </div>
    
        </div>
    </div></div>    
                <!-- FIM CONTEÚDO -->   
    
            </div>
    
            <!-- INICIO RODAPÉ -->
            <div id="footer">
                <div class="container footer-container">
                    <div class="row">
                        <div class="col-lg-8">
                            <a href="http://www.acessoainformacao.gov.br/sistema/" target="_blank">
                                <img src="resources/img/logo-acesso_25.png" />
                            </a>
                        </div>
                        <div class="col-lg-2 text-right cvm-footer-description">
                            <h6>CVM - Comissão de</h6><h6>Valores Mobiliários</h6>
                        </div>
                        <a href="http://www.brasil.gov.br/"><span class="logo-brasil-footer"></span></a>
                    </div>
                </div>
                <div class="version-sistem">
                    <div class="container">
                    </div>
                </div>
            </div>
            <!-- FIM RODAPÉ -->
    
        </div>
    
        <!-- DEPENDÊNCIAS JAVA SCRIPT -->
        <script type="text/javascript" src="resources/js/jquery/jquery-2.1.3.min.js"></script>
        <script type="text/javascript" src="resources/js/base64/jquery.base64.min.js"></script>
        <script type="text/javascript" src="resources/js/jquery/jquery.maskedinput.min.js"></script>
        <script type="text/javascript" src="resources/js/jquery/jquery.maskmoney.min.js"></script>
        <script type="text/javascript" src="resources/js/jquery/jquery.cookie.js"></script>
        <script type="text/javascript" src="resources/css/bootstrap/js/bootstrap.min.js"></script>
    
        <script type="text/javascript" src="resources/js/bootstrap-datepicker/bootstrap-datepicker.js"></script>
        <script type="text/javascript" src="resources/js/bootstrap-datepicker/bootstrap-datepicker.pt-BR.js"></script>
    
        <script type="text/javascript" src="resources/js/angular/angular.min.js"></script>
        <script type="text/javascript" src="resources/js/angular/angular-route.min.js"></script>
        <script type="text/javascript" src="resources/js/angular/angular-locale_pt-br.js"></script>
    
        <script type="text/javascript" src="resources/js/ngTable/ng-table.min.js"></script> 
    
        <script type="text/javascript" src="application/directives/directives.js"></script>
    
        <script type="text/javascript" src="application/message/message.js"></script>
        <script type="text/javascript" src="application/message/i18n.js"></script>
        <script type="text/javascript" src="application/security/security.js"></script>
    
        <script type="text/javascript" src="application/app.js"></script>
    
        <script type="text/javascript" src="application/controllers/AutenticarUsuarioController.js"></script>
        <script type="text/javascript" src="application/controllers/ConfigurarValoresMobiliariosController.js"></script>
        <script type="text/javascript" src="application/controllers/EnviarFormularioInicialController.js"></script>
        <script type="text/javascript" src="application/controllers/EnviarFormularioParcialController.js"></script>
        <script type="text/javascript" src="application/controllers/EnviarFormularioEncerramentoController.js"></script>
        <script type="text/javascript" src="application/controllers/EnviarComunicadoDispensaMicroEmpresaController.js"></script>
        <script type="text/javascript" src="application/controllers/EnviarFormularioDispensaLoteUnicoController.js"></script>
        <script type="text/javascript" src="application/controllers/GerenciarEnvioFormulariosController.js"></script>
        <script type="text/javascript" src="application/controllers/ConsultarOfertaController.js"></script>
    
    
    <div class="message" ng-messages=""></div><div class="loader modal in" aria-hidden="false" style="display: block; padding-right: 17px;"><div class="modal-backdrop  in" style="height: 672px;"></div><div class="modal-dialog"> <div class="modal-content"><div class="modal-header" style="text-align: center"><h5 class="modal-title">Aguarde</h5></div><div class="modal-body"><div class="row row-mg-1 row-center"><img src="resources/img/ajax-loader.gif" /></div></div></div></div></div></body></html>
    

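I finally solved the problem by getting the information from the browser log. The data does not appear directly in the HTML source, but it is in the POST used in the process.

Here is the final working code: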

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities 
import json
import pandas as pd


page = 'http://web.cvm.gov.br/app/esforcosrestritos/#/consultarOferta'

d = DesiredCapabilities.CHROME
d['loggingPrefs'] = { 'performance':'ALL' }
driver = webdriver.Chrome(desired_capabilities=d)
driver.get(page)

## Click on "Encerrada"
driver.find_element_by_xpath('//*[@id="content"]/div[4]/div[2]/div/div /div[4]/div[2]/label[3]/input').click()


## Select year
year = Select(driver.find_element_by_xpath('//*[@id="content"]/div[4]/div[2]/div/div/div[4]/div[1]/div/select'))
year.select_by_visible_text('2017')


## Click on "Pesquisar"
driver.find_element_by_xpath('//*[@id="content"]/div[4]/div[3]/div/a[1]/span').click()


## Click on "DEBENTURES SIMPLES" inside "Ofertas Encerradas"
driver.find_element_by_css_selector('#content > div.container.ng-scope > div:nth-child(4) > div:nth-child(2) > div > table > tbody > tr:nth-child(15) > td.col-lg-2.text-left.ng-binding').click()

## Click on 1st result
driver.find_element_by_css_selector('#content > div.container.ng-scope > div:nth-child(4) > div > div > table > tbody > tr.text-center > td.text-left.ng-binding').click()


## Selenium browser log
performance_log = driver.get_log('performance')

## Find log with allocation information
for j in range(len(performance_log)):
    if performance_log[j]['message'].find('Clubes de Investimento') != -1:
        break

allocation = performance_log[j]['message']

## Filter allocation data: strip the escaping backslashes, then keep the JSON that starts at "colocacoes"
allocation = allocation.replace('\\', '')
allocation = allocation[allocation.find('{"colocacoes":['):]


## Put data into a Pandas DataFrame
allocation_table = pd.DataFrame(columns = ['tipoInvestidor', 'numeroInvestidores', 'quantidadeValorMobiliario'])
slice_allocation = '{"tipoInvestidor":{"id":' 
slice_alternative= '{"numeroInvestidores":' 

for i in range(1,11):

    beginning = allocation.find(slice_allocation+str(i)) if allocation.find(slice_allocation+str(i))!=-1 else allocation.find(slice_alternative) 
    end = allocation.find(slice_allocation+str(i+1)) if allocation.find(slice_allocation+str(i+1))!=-1 else allocation.find(slice_alternative) 

    allocation_investor = allocation[beginning:end-1]
    allocation = allocation[end:]

    allocation_investor = json.loads(allocation_investor)
    allocation_investor['tipoInvestidor'] = allocation_investor['tipoInvestidor']['descricao']

    allocation_table = allocation_table.append(allocation_investor, ignore_index = True)

allocation_table.fillna(0, inplace = True)
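
As a possible alternative to the string slicing above, the log entry can be decoded with json.loads instead of stripping backslashes and cutting substrings by hand. The sketch below is only an illustration under assumptions: it assumes the JSON sits in the `postData` of a `Network.requestWillBeSent` event and that the payload has the `colocacoes` shape implied by the code above. The helper names `find_post_payload` and `allocation_rows` and the sample log are hypothetical, with made-up values; if the data turns out to live in a different event, the lookup path would need adjusting.

```python
import json


def find_post_payload(performance_log, marker):
    """Return the decoded POST body of the first logged request containing `marker`."""
    for entry in performance_log:
        # Each entry's 'message' is a JSON string wrapping one DevTools event.
        event = json.loads(entry["message"])["message"]
        if event.get("method") != "Network.requestWillBeSent":
            continue
        # Assumption: the data of interest is in the request's POST body.
        post_data = event.get("params", {}).get("request", {}).get("postData")
        if post_data and marker in post_data:
            return json.loads(post_data)
    return None


def allocation_rows(payload):
    """Flatten the assumed 'colocacoes' list into one plain dict per investor type."""
    rows = []
    for item in payload.get("colocacoes", []):
        investor = item.get("tipoInvestidor") or {}
        rows.append({
            "tipoInvestidor": investor.get("descricao"),
            "numeroInvestidores": item.get("numeroInvestidores", 0),
            "quantidadeValorMobiliario": item.get("quantidadeValorMobiliario", 0),
        })
    return rows


# Synthetic stand-in for driver.get_log('performance'); all values are made up.
sample_body = json.dumps({"colocacoes": [
    {"tipoInvestidor": {"id": 1, "descricao": "Clubes de Investimento"},
     "numeroInvestidores": 2, "quantidadeValorMobiliario": 100},
    {"numeroInvestidores": 5},
]})
sample_log = [
    {"message": json.dumps({"message": {
        "method": "Network.responseReceived", "params": {}}})},
    {"message": json.dumps({"message": {
        "method": "Network.requestWillBeSent",
        "params": {"request": {"method": "POST", "postData": sample_body}}}})},
]

payload = find_post_payload(sample_log, "Clubes de Investimento")
rows = allocation_rows(payload)
print(rows)
```

With a real session, `sample_log` would be replaced by `driver.get_log('performance')`, and the resulting list of dicts could be passed to `pd.DataFrame(rows)` in one step rather than appending row by row.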