Python code breaks when attempting to download larger zipped CSV file, works fine on smaller file

The following code works fine on a small zip file (roughly 8 MB) containing a 25 MB CSV file. As soon as I try to download a larger file (a 45 MB zip containing a 180 MB CSV), the code breaks and I get the following error message:

(venv) ufulu@ufulu awr % python get_awr_ranking_data.py
https://api.awrcloud.com/v2/get.php?action=get_topsites&token=REDACTED&project=REDACTED Client+%5Bw%5D&fileName=2017-01-04-2019-10-09
Traceback (most recent call last):
  File "get_awr_ranking_data.py", line 101, in <module>
    getRankingData(project['name'])
  File "get_awr_ranking_data.py", line 67, in getRankingData
    processRankingdata(rankDateData['details'])
  File "get_awr_ranking_data.py", line 79, in processRankingdata
    domain.append(row.split("//")[-1].split("/")[0].split('?')[0])
AttributeError: 'float' object has no attribute 'split'

My goal is to download the data for 170 projects and save it to a SQLite DB.

Please bear with me, as I'm new to programming and Python. I would greatly appreciate any help fixing the code below, as well as any other suggestions and improvements to make it more robust and Pythonic.

Thanks in advance

from dotenv import dotenv_values
import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile
from sqlalchemy import create_engine

# SQL Alchemy setup

engine = create_engine('sqlite:///rankingdata.sqlite', echo=False)



# Excerpt from the initial API Call

data = {'projects': [{'name': 'Client1',
   'id': '168',
   'frequency': 'daily',
   'depth': '5',
   'kwcount': '80',
   'last_updated': '2019-10-01',
   'keywordstamp': 1569941983},
                     {
                         "depth": "5",
                         "frequency": "ondemand",
                         "id": "194",
                         "kwcount": "10",
                         "last_updated": "2019-09-30",
                         "name": "Client2",
                         "timestamp": 1570610327
                     },

                     {
                         "depth": "5",
                         "frequency": "ondemand",
                         "id": "196",
                         "kwcount": "100",
                         "last_updated": "2019-09-30",
                         "name": "Client3",
                         "timestamp": 1570610331
                     }  
                     ]}

#setup
api_url = 'https://api.awrcloud.com/v2/get.php?action='
urls = [] # processed URLs
urlbacklog = [] # URLs that didn't return a downloadable File

# API call to receive the URL containing the downloadable zip and csv
def getRankingData(project):
    action = 'get_dates'
    response = requests.get(''.join([api_url, action]),
                            params=dict(token=dotenv_values()['AWR_API'],
                                        project=project))
    response = response.json()
    action2 = 'topsites_export'
    rankDateData = requests.get(''.join([api_url, action2]),
                            params=dict(token=dotenv_values()['AWR_API'],
                                        project=project, startDate=response['details']['dates'][0]['date'], stopDate=response['details']['dates'][-1]['date'] ))

    rankDateData = rankDateData.json()
    print(rankDateData['details'])
    urls.append(rankDateData['details'])
    processRankingdata(rankDateData['details'])

# API Call to download and unzip csv data and process it in pandas
def processRankingdata(url):
    content = requests.get(url)
    # {"response_code":25,"message":"Export in progress. Please come back later"}
    if "response_code" not in content:
        f = ZipFile(BytesIO(content.content))
        #print(f.namelist()) to get all filenames in Zip
        with f.open(f.namelist()[0], 'r') as g: rankingdatadf = pd.read_csv(g)
        rankingdatadf = rankingdatadf[rankingdatadf['Search Engine'].str.contains("Google")]
        domain = []
        for row in rankingdatadf['URL']:
            domain.append(row.split("//")[-1].split("/")[0].split('?')[0])
        rankingdatadf['Domain'] = domain
        rankingdatadf['Domain'] = rankingdatadf['Domain'].str.replace('www.', '')
        rankingdatadf = rankingdatadf.drop(columns=['Title', 'Meta description', 'Snippet', 'Page'])
        print(rankingdatadf['Search Engine'][0])
        writeData(rankingdatadf)
    else:
        urlbacklog.append(url)
        pass
# Finally write the data to database
def writeData(rankingdatadf):
    table_name_from_file = project['name']
    check = engine.has_table(table_name_from_file)
    print(check)  # boolean
    if check == False:
        rankingdatadf.to_sql(table_name_from_file, con=engine)
        print(project['name'] + ' ...Done')
    else:
        print(project['name'] + ' ... already in DB')


for project in data['projects']:
    getRankingData(project['name'])

The problem seems to be the split call on a float, not necessarily the download. Try changing line 79

from

domain.append(row.split("//")[-1].split("/")[0].split('?')[0])

to

domain.append(str(str(str(row).split("//")[-1]).split("/")[0]).split('?')[0])
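
For context, a minimal sketch (using a hand-made NaN rather than your actual data) of why the extra str() helps: pandas reads an empty CSV cell as NaN, which is a float, so calling .split() on it raises exactly the AttributeError from your traceback, while str() turns it into the string 'nan' first:

row = float("nan")                # what pandas gives you for a missing URL cell
print(type(row))                  # <class 'float'>
# row.split("//")                 # AttributeError: 'float' object has no attribute 'split'
print(str(row).split("//")[-1])   # prints 'nan' - no exception once cast to str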

It looks like you're trying to parse out the network location part of the URL here; instead of chaining all those splits, you could also use urllib.parse to simplify this:

from urllib.parse import urlparse
...
for row in rankingdatadf['URL']:
    domain.append(urlparse(row).netloc)
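
For reference, a quick sketch with a made-up URL (not one from your data) shows which pieces urlparse extracts:

from urllib.parse import urlparse

# Hypothetical URL, purely to illustrate netloc vs. path vs. query
parsed = urlparse("https://www.example.com/some/page?utm_source=test")
print(parsed.netloc)   # www.example.com
print(parsed.path)     # /some/page
print(parsed.query)    # utm_source=test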

I think a malformed URL was causing your problem; to diagnose it, try:

for row in rankingdatadf['URL']:
    try:
        domain.append(urlparse(row).netloc)
    except Exception:
        exit(row)

Looks like you figured it out above: you have a database entry where the URL field has a NULL value. Not sure what your fidelity requirements for this dataset are, but you may want to enforce a database rule on the URL field, or use pandas to drop the rows where URL is NaN:

rankingdatadf = rankingdatadf.dropna(subset=['URL'])
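
Putting the two suggestions together, the relevant part of processRankingdata could look roughly like this (a sketch only, reusing the column names from your code; the lambda is just one way to apply urlparse row by row):

from urllib.parse import urlparse

# Drop rows with a missing URL before deriving the domain
rankingdatadf = rankingdatadf.dropna(subset=['URL'])
rankingdatadf = rankingdatadf[rankingdatadf['Search Engine'].str.contains("Google")]
# Take the network location of each URL instead of chaining splits
rankingdatadf['Domain'] = rankingdatadf['URL'].apply(lambda u: urlparse(u).netloc)
rankingdatadf['Domain'] = rankingdatadf['Domain'].str.replace('www.', '')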