File (.tar.gz) download and processing using the urllib and requests packages (Python)

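Scope: which library to use, urllib or requests? I was trying to download a log file available at a URL. The URL is hosted on AWS and contains the file name. Accessing the URL serves a .tar.gz file for download. I needed to download this file, extract it into a directory of my choice to get at the JSON file inside, and finally parse that JSON file. Searching the internet, I found only bits of information scattered all over the place. In this question I try to consolidate it in one place.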

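Using the requests library: requests is a PyPI package, generally considered the better choice when handling a high volume of HTTP requests. References: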

  1. https://docs.python.org/3/library/urllib.request.html#module-urllib.request
  2. What are the differences between the urllib, urllib2, urllib3 and requests module?

Code:

import requests
import urllib.request
import tempfile
import shutil
import tarfile
import json
import os
import re

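# respurl is the URL the .tar.gz file is downloaded from (a sample is shown further down).
with requests.get(respurl, stream=True) as response:
    # stream=True defers the body download so iter_content can read it chunk by chunk.
    response.raise_for_status()
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        # Write straight to the temp file; reopening it by name can fail on Windows.
        for chunk in response.iter_content(chunk_size=8192):
            tmp_file.write(chunk)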

with tarfile.open(tmp_file.name,"r:gz") as tf:
    # To save the extracted file in directory of choice with same name as downloaded file.
    tf.extractall(path)
    # for loop for parsing json inside tar.gz file.
    for tarinfo_member in tf:
        print("tarfilename", tarinfo_member.name, "is", tarinfo_member.size, "bytes in size and is", end="")
        if tarinfo_member.isreg():
            print(" a regular file.")
        elif tarinfo_member.isdir():
            print(" a directory.")
        else:
            print(" something else.")
        if os.path.splitext(tarinfo_member.name)[1] == ".json":
            print("json file name:",os.path.splitext(tarinfo_member.name)[0])
            json_file = tf.extractfile(tarinfo_member)
            # capturing json file to read its contents and further processing.
            content = json_file.read()
            json_file_data = json.loads(content)
            print("Status Code",json_file_data[0]['status_code'])
            print("Response Body",json_file_data[0]['response'])
            # Had to decode content again as it was double encoded.
            print("Errors:",json.loads(json_file_data[0]['response'])['errors'])
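The double decoding in the last step can be tried without downloading anything. The following is a minimal sketch, assuming the archive holds a JSON array whose entries carry a 'response' field that is itself a JSON string, as in the code above. The helper names build_sample_archive and read_json_members and the sample payload are hypothetical, made up for illustration.

```python
import io
import json
import tarfile


def build_sample_archive():
    """Return an in-memory .tar.gz holding one made-up, double-encoded JSON log."""
    inner = json.dumps({"errors": ["invalid token"]})  # first encoding
    payload = json.dumps([{"status_code": 401, "response": inner}]).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tf:
        info = tarfile.TarInfo(name="sample-response.json")
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def read_json_members(fileobj):
    """Parse every regular .json member of a gzip-compressed tar archive."""
    parsed = {}
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tf:
        for member in tf:
            if member.isreg() and member.name.endswith(".json"):
                # extractfile returns a binary file object; json.loads accepts bytes.
                parsed[member.name] = json.loads(tf.extractfile(member).read())
    return parsed


logs = read_json_members(build_sample_archive())
entry = logs["sample-response.json"][0]
# 'response' is a JSON string inside the JSON document, so it needs a second decode.
errors = json.loads(entry["response"])["errors"]
print(entry["status_code"], errors)
```

Reading members through extractfile works on the archive directly, so the JSON can be parsed whether or not extractall has been called.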


To save the extracted files in a directory of choice, named after the downloaded file, the variable 'path' is built as follows.

Here is a sample URL containing the file name '44301621eb-response.tar.gz':

https://yoursite.com/44301621eb-response.tar.gz?AccessKeyId=your_id&Expires=1575526260&Signature=you_signature

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
PROJECT_NAME = 'your_project_name'
PROJECT_ROOT = os.path.join(BASE_DIR, PROJECT_NAME)
LOG_ROOT = os.path.join(PROJECT_ROOT, 'log')
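# respurl is the URL from which the file will be downloaded.
# Raw string so that \? reaches the regex engine unchanged.
filename = re.split(r"([^?]+)(?:.+/)([^#?]+)(\?.*)?", respurl)
# For the sample URL, re.split returns ['', prefix, file name, query string, ''].
path = os.path.join(LOG_ROOT, filename[2])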

(Screenshot: regex match output from regex101.com)
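As an alternative to the regular expression, the file name can also be taken from the path component of the URL using only the standard library. This is a minimal sketch, assuming the file name is always the last path segment of the URL; the helper name filename_from_url is hypothetical.

```python
import posixpath
from urllib.parse import urlsplit, unquote


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring query string and fragment."""
    path = urlsplit(url).path
    return posixpath.basename(unquote(path))


sample_url = (
    "https://yoursite.com/44301621eb-response.tar.gz"
    "?AccessKeyId=your_id&Expires=1575526260&Signature=you_signature"
)
print(filename_from_url(sample_url))
```

Because urlsplit separates the query string before the path is examined, a '/' inside a query parameter such as the signature would not change the result.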

Comparison with urllib

To understand the nuances, I also implemented the same code using urllib.

Notice that the tempfile library is used slightly differently here, which is what worked for me. With urllib I used the shutil library, because urlopen returns a file-like response object that shutil.copyfileobj can read from. The Response object returned by requests is not file-like, so passing it to copyfileobj did not work.

with urllib.request.urlopen(respurl) as File:
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        shutil.copyfileobj(File, tmp_file)

with tarfile.open(tmp_file.name,"r:gz") as tf:
    print("Temp tf File:", tf.name)
    tf.extractall(path)
    for tarinfo in tf:
        print("tarfilename", tarinfo.name, "is", tarinfo.size, "bytes in size and is", end="")
        if tarinfo.isreg():
            print(" a regular file.")
        elif tarinfo.isdir():
            print(" a directory.")
        else:
            print(" something else.")
        # The loop variable in this version is tarinfo, not tarinfo_member.
        if os.path.splitext(tarinfo.name)[1] == ".json":
            print("json file name:",os.path.splitext(tarinfo.name)[0])
            json_file = tf.extractfile(tarinfo)
            # capturing json file to read its contents and further processing.
            content = json_file.read()
            json_file_data = json.loads(content)
            print("Status Code",json_file_data[0]['status_code'])
            print("Response Body",json_file_data[0]['response'])
            # Had to decode content again as it was double encoded.
            print("Errors:",json.loads(json_file_data[0]['response'])['errors'])
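On the shutil point above: copyfileobj only needs a binary file-like object with a read method. urlopen returns one, and with requests the underlying stream is exposed as response.raw when the request is made with stream=True. Below is a minimal sketch of a helper that accepts either, assuming the caller deletes the temporary file when done. The name save_to_tempfile is hypothetical, and io.BytesIO stands in for the network response so the example runs offline.

```python
import io
import os
import shutil
import tempfile


def save_to_tempfile(fileobj, suffix=".tar.gz"):
    """Copy a binary file-like object to a named temporary file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp_file:
        shutil.copyfileobj(fileobj, tmp_file)
    return tmp_file.name


# Stand-in for urllib.request.urlopen(respurl) or the response.raw of requests.
fake_response = io.BytesIO(b"example bytes")
tmp_path = save_to_tempfile(fake_response)
with open(tmp_path, "rb") as fh:
    copied = fh.read()
os.remove(tmp_path)  # delete=False means cleanup is the caller's job
print(copied)
```

Note that response.raw yields the bytes as they were sent on the wire, so any Content-Encoding applied by the server is not undone by default.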