How to load a CSV file from a zipped folder at a URL into a Pandas DataFrame
I want to load a CSV file from a zipped folder at a URL into a Pandas DataFrame. I followed an existing solution and used the same approach, as follows:
from urllib import request
import zipfile
import pandas as pd
# link to the zip file
link = 'https://cricsheet.org/downloads/'
# the zip file is named as ipl_csv2.zip
request.urlretrieve(link, 'ipl_csv2.zip')
compressed_file = zipfile.ZipFile('ipl_csv2.zip')
# I need the csv file named all_matches.csv from ipl_csv2.zip
csv_file = compressed_file.open('all_matches.csv')
data = pd.read_csv(csv_file)
data.head()
But after running the code, I get an error:
BadZipFile Traceback (most recent call last)
<ipython-input-3-7b7a01259813> in <module>
1 link = 'https://cricsheet.org/downloads/'
2 request.urlretrieve(link, 'ipl_csv2.zip')
----> 3 compressed_file = zipfile.ZipFile('ipl_csv2.zip')
4 csv_file = compressed_file.open('all_matches.csv')
5 data = pd.read_csv(csv_file)
~\Anaconda3\lib\zipfile.py in __init__(self, file, mode, compression, allowZip64, compresslevel, strict_timestamps)
1267 try:
1268 if mode == 'r':
-> 1269 self._RealGetContents()
1270 elif mode in ('w', 'x'):
1271 # set the modified flag so central directory gets written
~\Anaconda3\lib\zipfile.py in _RealGetContents(self)
1334 raise BadZipFile("File is not a zip file")
1335 if not endrec:
-> 1336 raise BadZipFile("File is not a zip file")
1337 if self.debug > 1:
1338 print(endrec)
BadZipFile: File is not a zip file
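I am not very familiar with zip file handling in Python. What corrections does my code need?
If I open the URL https://cricsheet.org/downloads/ipl_csv2.zip in a web browser, the zip file is downloaded to my system automatically. Since data is added to this zip file every day, I want to access the URL and get the CSV file directly through Python to save storage space.
Edit 1: If you have any other code solutions, please share them.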
Try this:
link = "https://cricsheet.org/downloads/ipl_csv2.zip"
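The link has to point at the zip file itself, not at the downloads page. If opening it in a browser starts a download, don't worry; cancel the download if you don't need the file.
You will always get the updated data from link.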
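The BadZipFile error in the question is consistent with the original link pointing at the downloads page rather than at the archive, in which case the saved file would be an HTML page. A minimal sketch of how that could be checked before opening the file, using zipfile.is_zipfile from the standard library. The helper name looks_like_zip and the sample payloads are made up for illustration; they are not from the thread.

```python
import io
import zipfile


def looks_like_zip(payload: bytes) -> bool:
    """Return True if the given bytes form a readable zip archive."""
    return zipfile.is_zipfile(io.BytesIO(payload))


# Build a tiny archive in memory to stand in for a real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("all_matches.csv", "match_id,runs\n1,4\n")
zip_payload = buffer.getvalue()

# A stand-in for the HTML page that a directory URL would typically return.
html_payload = b"<html><body>Downloads</body></html>"

print(looks_like_zip(zip_payload))   # True
print(looks_like_zip(html_payload))  # False
```

Running the same check on the bytes fetched from a URL should show quickly whether the link returned an archive or a web page.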
Here is what I ended up doing after the discussion with @nobleknight below:
# importing libraries
import os
import shutil
import zipfile
from urllib.request import urlopen

import pandas as pd

url = 'https://cricsheet.org/downloads/ipl_csv2.zip'
file_name = 'ipl_csv2.zip'

# downloading the zip file from the URL
with urlopen(url) as response, open(file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

# extracting the required file from the zip file
with zipfile.ZipFile(file_name) as zf:
    zf.extract('all_matches.csv')

# deleting the zip file from the directory
os.remove(file_name)

# loading data from the extracted file
data = pd.read_csv('all_matches.csv')
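This solution avoids the ContentTooShortError and HTTPForbiddenError errors that I ran into with every other solution I found online. Thanks to @nobleknight for providing part of the solution, with reference to this.
Any other ideas are welcome.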
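Since the goal in the question was to avoid keeping files on disk, one possible variation is to hold the archive in memory with io.BytesIO instead of writing and deleting it. This is only a sketch under assumptions: the function names read_zip_member and load_csv_from_zip_url are made up here, and the browser-like User-Agent header is a guess at what might avoid a 403 response, not something the thread confirms.

```python
import io
import zipfile
from urllib.request import Request, urlopen


def read_zip_member(zip_bytes: bytes, member: str) -> bytes:
    """Return the raw bytes of one file stored inside an in-memory zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(member)


def load_csv_from_zip_url(url: str, member: str, timeout: float = 30.0):
    """Download a zip archive into memory and load one CSV member as a DataFrame."""
    import pandas as pd  # imported here so read_zip_member works without pandas

    # Assumption: some servers reject the default urllib User-Agent,
    # so a browser-like one is sent instead.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        zip_bytes = response.read()

    # Nothing is written to disk; the CSV is parsed straight from memory.
    return pd.read_csv(io.BytesIO(read_zip_member(zip_bytes, member)))
```

With the names from the thread, usage would look like data = load_csv_from_zip_url('https://cricsheet.org/downloads/ipl_csv2.zip', 'all_matches.csv'). The trade-off is that the whole archive sits in RAM during the call, which should be fine for modest files but may not suit very large ones.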