How to Load a CSV File from a Zipped Folder at a URL into a Pandas DataFrame

I want to load a CSV file from a zipped folder at a URL into a Pandas DataFrame. I followed an existing answer and used the same solution, shown below:

from urllib import request
import zipfile
import pandas as pd

# link to the zip file
link = 'https://cricsheet.org/downloads/'
# the zip file is named as ipl_csv2.zip
request.urlretrieve(link, 'ipl_csv2.zip')
compressed_file = zipfile.ZipFile('ipl_csv2.zip')

# I need the csv file named all_matches.csv from ipl_csv2.zip
csv_file = compressed_file.open('all_matches.csv')
data = pd.read_csv(csv_file)
data.head()

But after running the code, I get an error:

BadZipFile                                Traceback (most recent call last)
<ipython-input-3-7b7a01259813> in <module>
      1 link = 'https://cricsheet.org/downloads/'
      2 request.urlretrieve(link, 'ipl_csv2.zip')
----> 3 compressed_file = zipfile.ZipFile('ipl_csv2.zip')
      4 csv_file = compressed_file.open('all_matches.csv')
      5 data = pd.read_csv(csv_file)

~\Anaconda3\lib\zipfile.py in __init__(self, file, mode, compression, allowZip64, compresslevel, strict_timestamps)
   1267         try:
   1268             if mode == 'r':
-> 1269                 self._RealGetContents()
   1270             elif mode in ('w', 'x'):
   1271                 # set the modified flag so central directory gets written

~\Anaconda3\lib\zipfile.py in _RealGetContents(self)
   1334             raise BadZipFile("File is not a zip file")
   1335         if not endrec:
-> 1336             raise BadZipFile("File is not a zip file")
   1337         if self.debug > 1:
   1338             print(endrec)

BadZipFile: File is not a zip file

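I am not very familiar with zip file handling in Python. What corrections does my code need?

If I open the URL https://cricsheet.org/downloads/ipl_csv2.zip in a web browser, the zip file is downloaded to my system automatically. Since data is added to this zip file every day, I want to access the URL and fetch the CSV file directly through Python to save storage space.

Edit 1: If you have any other code solutions, please share them.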

Try this:

link = "https://cricsheet.org/downloads/ipl_csv2.zip"

Don't worry if the file gets downloaded; if you don't need the file, cancel the download. You will always get the updated data from the link.
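The BadZipFile error in the question fits this explanation: the original link pointed at the downloads directory rather than at the archive, so whatever was saved as ipl_csv2.zip was presumably a web page. A minimal sketch of how to check what a URL actually returns before handing it to zipfile is below. The helper names looks_like_zip and fetch_bytes are made up for illustration; zipfile.is_zipfile and urlopen are standard library calls.

```python
import io
import zipfile
from urllib.request import urlopen


def looks_like_zip(payload: bytes) -> bool:
    """Return True if the given bytes can be read as a zip archive."""
    # is_zipfile accepts a file-like object, so no temporary file is needed
    return zipfile.is_zipfile(io.BytesIO(payload))


def fetch_bytes(url: str) -> bytes:
    """Download the full response body for the given URL (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


# Example usage (needs network access, so it is left commented out):
# payload = fetch_bytes("https://cricsheet.org/downloads/ipl_csv2.zip")
# print(looks_like_zip(payload))
```

If looks_like_zip returns False for a URL, the server sent something other than a zip archive, and zipfile.ZipFile would be expected to raise BadZipFile on it.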

Here is what I did after the discussion with @nobleknight below:

# importing libraries
import os
import shutil
import zipfile
from urllib.request import urlopen

import pandas as pd

url = 'https://cricsheet.org/downloads/ipl_csv2.zip'
file_name = 'ipl_csv2.zip'

# downloading the zip file from the URL to disk
with urlopen(url) as response, open(file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

# extracting the required file from the zip file
# (done after the download block so the zip file is fully written and closed)
with zipfile.ZipFile(file_name) as zf:
    zf.extract('all_matches.csv')

# deleting the zip file from the directory
os.remove(file_name)

# loading data from the extracted file
data = pd.read_csv('all_matches.csv')

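This solution avoids the ContentTooShortError and HTTPForbiddenError errors that I ran into with every other solution I found online. Thanks to @nobleknight for the reference that gave me part of the solution.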

Any other ideas are welcome.
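One more idea, since the goal in the question was to avoid keeping files on disk: the archive can be held in memory and the CSV read straight out of it. This is only a sketch under some assumptions. The function names are made up for illustration, and the User-Agent header is a guess at what might avoid the "forbidden" responses mentioned above; it has not been verified against this server.

```python
import io
import zipfile
from urllib.request import Request, urlopen

import pandas as pd


def read_csv_from_zip_bytes(zip_bytes: bytes, member: str) -> pd.DataFrame:
    """Read one CSV member of an in-memory zip archive into a DataFrame."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # zf.open returns a file-like object that pandas can read directly
        with zf.open(member) as csv_file:
            return pd.read_csv(csv_file)


def read_csv_from_zip_url(url: str, member: str) -> pd.DataFrame:
    """Download a zip archive into memory and load one CSV member from it."""
    # Assumption: a browser-like User-Agent may be needed by some servers
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        return read_csv_from_zip_bytes(response.read(), member)


# Example usage (needs network access, so it is left commented out):
# data = read_csv_from_zip_url(
#     "https://cricsheet.org/downloads/ipl_csv2.zip", "all_matches.csv")
# data.head()
```

Nothing is written to disk here, so there is no zip file or extracted CSV to delete afterwards. The trade-off is that the whole archive sits in memory while the CSV is being parsed.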