How to change file extension?
I'm trying to scrape an ".xlsx" file from the Tax Foundation website. Unfortunately, I keep getting this error message: Excel cannot open the file '2017-FF-For-Website-7-10-2017.xlsx' because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file.
I did some research, and it says the fix is to change the file extension to ".xls" instead of ".xlsx". Can anyone help?
from bs4 import BeautifulSoup
import urllib.request
import os
url = urllib.request.urlopen("https://taxfoundation.org/facts-figures-2017/")
soup = BeautifulSoup(url, from_encoding=url.info().get_param('charset'))
FHFA = os.chdir('C:/US_Census/Directory')
seen = set()
for link in soup.find_all('a', href=True):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.xlsx']):
        continue
    file = href.split('/')[-1]
    filename = file.rsplit('.', 1)[0]
    if filename not in seen: # only retrieve file if it has not been seen before
        seen.add(filename) # add the file to the set
        url = urllib.request.urlretrieve('https://taxfoundation.org/' + href, file)
        print(filename)
        print(' ')
print("All files successfully downloaded.")
P.S. I know I could download the file by hand, but I'm scraping it to automate a specific process.
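Your problem is on the line url = urllib.request.urlretrieve('https://taxfoundation.org/' + href, file)
If you go to the website and hover over the Excel download button, you'll see a longer link: https://files.taxfoundation.org/20170710170238/2017-FF-For-Website-7-10-2017.xlsx
(notice the 2017....238 part?). The href is already a complete URL, so prepending 'https://taxfoundation.org/' to it produces an address that does not point to the file. So you never actually downloaded the Excel file, and renaming the extension would not help. Here is the correct line: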
url = urllib.request.urlretrieve(href, file)
Everything else works fine.
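If you want the script to cope with both absolute and relative links, and to notice when a download is not really a workbook, something along these lines may help. It is only a sketch: resolve_href and looks_like_xlsx are hypothetical helper names, and '/example/file.xlsx' is a made-up relative path used for illustration. It relies on two things from the standard library: urllib.parse.urljoin leaves an absolute href untouched and resolves a relative one against the page URL, and an .xlsx file is a ZIP container, so zipfile.is_zipfile rejects something like a saved HTML error page.

```python
import os
import tempfile
import zipfile
from urllib.parse import urljoin

PAGE_URL = "https://taxfoundation.org/facts-figures-2017/"


def resolve_href(page_url, href):
    """Return an absolute URL for an href found on page_url (hypothetical helper)."""
    # Absolute hrefs come back unchanged; relative ones are resolved against page_url.
    return urljoin(page_url, href)


def looks_like_xlsx(path):
    """Cheap sanity check: .xlsx files are ZIP archives (hypothetical helper)."""
    return zipfile.is_zipfile(path)


# Absolute link, as found on the page in the question.
print(resolve_href(
    PAGE_URL,
    "https://files.taxfoundation.org/20170710170238/2017-FF-For-Website-7-10-2017.xlsx",
))
# Made-up relative link, for illustration only.
print(resolve_href(PAGE_URL, "/example/file.xlsx"))

# Offline demonstration of the file check, no network access needed.
with tempfile.TemporaryDirectory() as tmp:
    fake_page = os.path.join(tmp, "not-really.xlsx")
    with open(fake_page, "w") as fh:
        fh.write("<html><body>Not Found</body></html>")
    print(looks_like_xlsx(fake_page))
```

In the original loop you would then call urllib.request.urlretrieve(resolve_href(page_url, href), file) and run looks_like_xlsx(file) afterwards, so a bad download is reported by the script instead of by Excel.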