Memory Error:While reading a large .txt file from BLOB in python
I am trying to read a large (~1.5 GB) .txt file from an Azure blob in Python, and it raises a MemoryError. Is there a way to read this file efficiently?
Below is the code I am trying to run:
from azure.storage.blob import BlockBlobService
import pandas as pd
from io import StringIO
import time

STORAGEACCOUNTNAME = '*********'
STORAGEACCOUNTKEY = "********"
CONTAINERNAME = '******'
BLOBNAME = 'path/to/blob'

blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)

start = time.time()
blobstring = blob_service.get_blob_to_text(CONTAINERNAME, BLOBNAME).content
df = pd.read_csv(StringIO(blobstring))
end = time.time()
print("Time taken = ", end - start)
Below are the last few lines of the error:
---> 16 blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME)
17
18 #df = pd.read_csv(StringIO(blobstring))
~/anaconda3_420/lib/python3.5/site-packages/azure/storage/blob/baseblobservice.py in get_blob_to_text(self, container_name, blob_name, encoding, snapshot, start_range, end_range, validate_content, progress_callback, max_connections, lease_id, if_modified_since, if_unmodified_since, if_match, if_none_match, timeout)
2378 if_none_match,
2379 timeout)
-> 2380 blob.content = blob.content.decode(encoding)
2381 return blob
2382
MemoryError:
How can I read a file of about 1.5 GB from a blob container in Python? I would also like to keep the runtime of my code as low as possible.
Assuming your machine has enough memory, then according to the pandas.read_csv API reference below, you can read the csv blob content into a pandas dataframe directly from the blob URL with a SAS token.
Here is my sample code for your reference.
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
from datetime import datetime, timedelta
import pandas as pd

account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'
blob_name = '<your csv blob name>'

url = f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}"

service = BaseBlobService(account_name=account_name, account_key=account_key)

# Generate the sas token for your csv blob
token = service.generate_blob_shared_access_signature(
    container_name,
    blob_name,
    permission=BlobPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=1),
)

# Directly read the csv blob content into a dataframe by the url with the sas token
df = pd.read_csv(f"{url}?{token}")
print(df)
Reading the CSV directly from the SAS URL should avoid copying the content in memory multiple times, as happens when the whole text content is downloaded first and then wrapped in a file-like buffer object.
Hope it helps.