Convert .xls file to latest version in Python

Question

I wrote a web scraper for a website: a Python script that directly downloads an .xls file, which then replaces the old file in the destination folder, as shown below:

My script:

import time 
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support.select import Select
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import shutil
import os
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import matplotlib



# Start a single Chrome session; webdriver_manager downloads a matching driver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

driver.get('http://estatisticas.cetip.com.br/astec/series_v05/paginas/lum_web_v05_series_introducao.asp?str_Modulo=Ativo&int_Idioma=1&int_Titulo=6&int_NivelBD=2/')
driver.find_element(By.XPATH, '//*[@id="divContainerIframeBmf"]/div/dl/dd[2]/a').click()
time.sleep(3)
driver.switch_to.frame(driver.find_element(By.XPATH, '//iframe[@name="dados_corpo"]'))
driver.switch_to.frame(driver.find_element(By.XPATH, '//frame[@name="ativo"]'))
find_dp1 = driver.find_element(By.XPATH, '//select[@name="ativo"]')
select_find_dp1 = Select(find_dp1)
select_find_dp1.select_by_visible_text("CBIO - Crédito de descarbonização")
time.sleep(3)

driver.switch_to.default_content()
driver.switch_to.frame(driver.find_element(By.ID, 'dados_corpo'))
driver.switch_to.frame(driver.find_element(By.TAG_NAME, 'frameset').find_elements(By.TAG_NAME, 'frame')[1])

time.sleep(1)
informacoes = Select(driver.find_element(By.NAME, 'selectopcoes'))
informacoes.select_by_visible_text('Estoque')
    
driver.switch_to.default_content()
driver.switch_to.frame(driver.find_element(By.ID, 'dados_corpo'))
driver.switch_to.frame(driver.find_element(By.TAG_NAME, 'frameset').find_elements(By.TAG_NAME, 'frame')[2])

time.sleep(1)
# Start date
driver.find_element(By.NAME, 'DT_DIA_DE').send_keys('16')
driver.find_element(By.NAME, 'DT_MES_DE').send_keys('10')
driver.find_element(By.NAME, 'DT_ANO_DE').send_keys('2020')

# End date
driver.find_element(By.NAME, 'DT_DIA_ATE').send_keys('10')
driver.find_element(By.NAME, 'DT_MES_ATE').send_keys('02')
driver.find_element(By.NAME, 'DT_ANO_ATE').send_keys('2022')

driver.find_elements(By.CLASS_NAME, 'button')[1].click()

driver.switch_to.default_content()
driver.switch_to.frame(driver.find_element(By.TAG_NAME, 'iframe'))
time.sleep(1)
driver.find_element(By.CLASS_NAME, 'primary-text').find_element(By.TAG_NAME,'a').click()

time.sleep(4)

# Raw strings keep the Windows backslashes from being read as escape sequences
origem = r'C:\Users\prmatteo\Downloads'
destino = os.path.join(origem, r'C:\Users\prmatteo\xxx\Área de Trabalho\Arquivos Python\renovabio2.xls')
extensao = '.xls'

for file in os.listdir(origem):
    if file.endswith(extensao):
        shutil.move(os.path.join(origem,file), destino)

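It always downloads the .xls file in the old Excel format. I would like to convert it to the latest version when it reaches the destination folder, so that it does not open in compatibility mode on the PC.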

Answer 1

If you have xlrd and pandas installed, use pandas.read_excel to read the data into a DataFrame, then use DataFrame.to_excel to write the file out as .xlsx.
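A minimal sketch of that approach, assuming the input is a genuine binary .xls workbook and that openpyxl is also installed (pandas uses it to write .xlsx). The function names and the `xlsx_path_for` helper are hypothetical, and only the first sheet is read, which is the read_excel default:

```python
from pathlib import Path

import pandas as pd


def xlsx_path_for(xls_path):
    """Return the same path with an .xlsx suffix (hypothetical helper)."""
    return Path(xls_path).with_suffix(".xlsx")


def convert_xls_to_xlsx(xls_path):
    """Read a genuine .xls workbook and write it back out as .xlsx."""
    # xlrd is the engine pandas uses for the legacy binary .xls format
    df = pd.read_excel(xls_path, engine="xlrd")
    out_path = xlsx_path_for(xls_path)
    # index=False keeps the DataFrame index out of the spreadsheet
    df.to_excel(out_path, index=False, engine="openpyxl")
    return out_path
```

Calling `convert_xls_to_xlsx("renovabio2.xls")` would write `renovabio2.xlsx` next to the original, provided the file really is an Excel workbook.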

Answer 2

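I tried downloading a file from that site and, unfortunately, it doesn't generate an Excel file at all. Many sites fake an Excel export by generating a CSV file, or an HTML file containing a table, and giving it a fake .xls extension. Excel recognizes this, tries to import the file as text or HTML, and shows you a warning.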
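One way to check what a downloaded file really is, regardless of its extension, is to look at its first bytes. This is a sketch, not part of the original answer: the function name and the return labels are made up for illustration. It relies on two well-known signatures: legacy .xls files are OLE2 compound documents, and .xlsx files are ZIP archives.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 container used by real .xls
ZIP_MAGIC = b"PK\x03\x04"  # ZIP local file header used by .xlsx


def sniff_excel_format(path):
    """Guess the real format of a file from its leading bytes."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    # Anything else (CSV, HTML, plain text) is not an Excel workbook
    return "not-excel"
```

A file that comes back as "not-excel" has to be parsed as text or HTML rather than opened with read_excel.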

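Unfortunately, in this case the file isn't even CSV. It's the results page saved as text, headers included. It doesn't even use UTF-8, so non-US characters get mangled. The table data is just a few lines with tabs as separators: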

B3
ADA - Alongamento da Dívida Agrícola - Estoque

De: 20/03/2022 -->  Até: 18/04/2022

Valores Financeiros em Reais.
Estoque Valorizado.
Metologia de cálculo: Preço Unitário da Curva x Quantidade Depositada na data.

Data    Volume
21/03/2022  0
22/03/2022  0
23/03/2022  0
24/03/2022  0
25/03/2022  0
28/03/2022  0
29/03/2022  0
30/03/2022  0
31/03/2022  0
01/04/2022  0
04/04/2022  0
05/04/2022  0
06/04/2022  0
07/04/2022  0
08/04/2022  0
11/04/2022  0
12/04/2022  0
13/04/2022  0
14/04/2022  0
18/04/2022  0

You'll have to parse this text file yourself and create an Excel file. One way to do this is to use Pandas: read the file with read_csv, skipping the first 9 rows, and then save it as Excel with to_excel:

Read the file as CSV

import pandas as pd

filename = "fake_excel.xls"
df = pd.read_csv(filename, sep='\t', skiprows=9)
# Display it, to see what we got
df

df.to_excel("real.xlsx")

The read_csv method lets you specify a different separator, skip header and footer rows, change the encoding, and so on.

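One possible problem is the date format. Unless you specify otherwise, read_csv imports dates as text. You can tell it to parse cells that look like dates, and even to try to infer the date format, using the appropriate parameters.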
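As a rough illustration of the date handling, the sketch below parses a shortened copy of the sample above from an in-memory string. `parse_dates` and `dayfirst` are real read_csv parameters; the `SAMPLE` text and the choice of the `Data` column are taken from the listing, and the encoding mentioned in the comment is only a guess about the real file.

```python
import io

import pandas as pd

# Shortened copy of the downloaded text: 9 preamble lines, then a tab-separated table
SAMPLE = (
    "B3\n"
    "ADA - Alongamento da Dívida Agrícola - Estoque\n"
    "\n"
    "De: 20/03/2022 -->  Até: 18/04/2022\n"
    "\n"
    "Valores Financeiros em Reais.\n"
    "Estoque Valorizado.\n"
    "Metologia de cálculo: Preço Unitário da Curva x Quantidade Depositada na data.\n"
    "\n"
    "Data\tVolume\n"
    "21/03/2022\t0\n"
    "22/03/2022\t0\n"
    "01/04/2022\t0\n"
)

df = pd.read_csv(
    io.StringIO(SAMPLE),
    sep="\t",
    skiprows=9,            # skip the preamble before the header row
    parse_dates=["Data"],  # convert this column to datetimes
    dayfirst=True,         # dates are written day/month/year
)
# For the real file on disk you would likely also pass an encoding such as
# encoding="latin-1"; that value is an assumption, not confirmed by the source.

print(df.dtypes)
```

With `dayfirst=True`, a value such as 01/04/2022 is read as 1 April rather than 4 January.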

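You can inspect the loaded data in several ways. If you type df, you'll see the first and last few rows of the DataFrame. You can use df.info() to get the number of columns, their types, and so on.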
