Python web scraping only collects 80–90% of the intended data rows. Is there something wrong with my loop?

I'm trying to collect 150 rows of data from the text that appears at the bottom of a given Showbuzzdaily.com web page (example), but my script only collects 132 rows.

I'm new to Python. Do I need to add something to my loop to make sure every record is collected as intended?

To troubleshoot, I created a list (program_count) to verify whether this was already happening in the code before the CSV is generated, and it shows only 132 items in the list instead of 150. Interestingly, for some reason the last row (#132) also ends up duplicated at the end of the CSV.

I ran into a similar issue scraping Google Trends (with pytrends), where only about 80% of the data I tried to scrape made it into the CSV. So I suspect either something is wrong with my code or I'm making too many requests.

Adding time.sleep(0.1) to the for and while loops in this code didn't change the results.

import time
import requests
import datetime
from bs4 import BeautifulSoup
import pandas as pd # import pandas module

from datetime import date, timedelta

# creates empty 'records' list
records = []

start_date = date(2021, 4, 12)
orig_start_date = start_date # Used for naming the CSV
end_date = date(2021, 4, 12)
delta = timedelta(days=1) # Defines delta as +1 day

print(str(start_date) + ' to ' + str(end_date)) # Visual reassurance

# begins while loop that will continue for each daily viewership report until end_date is reached
while start_date <= end_date: 
    start_weekday = start_date.strftime("%A") # define weekday name

    start_month_num = int(start_date.strftime("%m")) # define month number
    start_month_num = str(start_month_num) # convert to string so it is ready to be put into address

    start_month_day_num = int(start_date.strftime("%d")) # define day of the month
    start_month_day_num = str(start_month_day_num) # convert to string so it is ready to be put into address
    
    start_year = int(start_date.strftime("%Y")) # define year
    start_year = str(start_year) # convert to string so it is ready to be put into address

    #define address (URL)
    address = 'http://www.showbuzzdaily.com/articles/showbuzzdailys-top-150-'+start_weekday.lower()+'-cable-originals-network-finals-'+start_month_num+'-'+start_month_day_num+'-'+start_year+'.html'
    print(address) # print for visual reassurance

    # read the web page at the defined address (URL)
    r = requests.get(address)

    soup = BeautifulSoup(r.text, 'html.parser')

    # we're going to deal with results that appear within <td> tags
    results = soup.find_all('td')

    # reads the date text at the top of the web page so it can be inserted later to the CSV in the 'Date' column
    date_line = results[0].text.split(": ",1)[1] # reads the text after the colon and space (': '), which is where the date information is located
    weekday_name = date_line.split(' ')[0] # stores the weekday name
    month_name = date_line.split(' ',2)[1] # stores the month name
    day_month_num = date_line.split(' ',1)[1].split(' ')[1].split(',')[0] # stores the day of the month
    year = date_line.split(', ',1)[1] # stores the year

    # concatenates and stores the full date value
    mmmmm_d_yyyy = month_name+' '+day_month_num+', '+year

    del results[:10] # deletes the first 10 results, which contained the date information and column headers

    program_count = [] # empty list for program counting

    # (within the while loop) begins a for loop that appends data for each program in a daily viewership report
    for result in results:
        rank = results[0].text # stores P18-49 rank
        program = results[1].text # stores program name
        network = results[2].text # stores network name
        start_time = results[3].text # stores program's start time
        mins = results[4].text # stores program's duration in minutes
        p18_49 = results[5].text # stores program's P18-49 rating
        p2 = results[6].text # stores program's P2+ viewer count (in thousands)
        records.append((mmmmm_d_yyyy, weekday_name, rank, program, network, start_time, mins, p18_49, p2)) # appends the data to the 'records' list

        program_count.append(program) # adds each program name to the list.

        del results[:7] # deletes the first 7 results remaining, which contained the data for 1 row (1 program) which was just stored in 'records'
   
    print(len(program_count)) # Troubleshooting: prints to screen the number of programs counted. Should be 150.

    records.append((mmmmm_d_yyyy, weekday_name, rank, program, network, start_time, mins, p18_49, p2)) # appends the data to the 'records' list
    print(str(start_date)+' collected...') # Visual reassurance one page/day is finished being collected
    start_date += delta # at the end of while loop, advance one day


df = pd.DataFrame(records, columns=['Date','Weekday','P18-49 Rank','Program','Network','Start time','Mins','P18-49','P2+']) # Creates DataFrame using the columns listed
df.to_csv('showbuzz '+ str(orig_start_date) + ' to '+ str(end_date) + '.csv', index=False, encoding='utf-8') # generates the CSV file, using start and end dates in filename

It seems like you're making debugging harder on yourself by extracting all of the table data (<td>) individually like that. After stepping through the code and making some changes, my best guess is that the error comes from deleting entries from results while you're iterating over it, which gets messy. As a side note, you also never use result inside the loop, which makes declaring it pointless.
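
To see why you lose exactly 18 rows: the for loop's internal index advances by one each pass, while del results[:7] shrinks the list by seven, so the index reaches the end of the shrinking list long before the data runs out. With 150 rows × 7 cells = 1,050 <td> entries, the loop stops after 132 passes, the first point where the index overtakes the remaining length. A minimal sketch with toy data (not the real page) reproduces it:

cells = list(range(150 * 7))  # 1,050 stand-ins for the <td> cells
rows = []
for cell in cells:            # loop variable unused, just like 'result' in your code
    rows.append(cells[:7])    # grab what is currently the first row of 7 cells
    del cells[:7]             # shrink the list mid-iteration
print(len(rows))              # 132, not 150 -- the iterator hits the end early

Something like this ends up a bit cleaner and gets you your 150 results: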

results = soup.find_all('tr')

# reads the date text at the top of the web page so it can be inserted later to the CSV in the 'Date' column
date_line = results[0].select_one('td').text.split(": ", 1)[1] # Selects first td it finds under the first tr
weekday_name = date_line.split(' ')[0]
month_name = date_line.split(' ', 2)[1]
day_month_num = date_line.split(' ', 1)[1].split(' ')[1].split(',')[0]
year = date_line.split(', ', 1)[1]

mmmmm_d_yyyy = month_name + ' ' + day_month_num + ', ' + year

program_count = []  # empty list for program counting

for result in results[2:]:
    children = result.find_all('td')
    rank = children[0].text  # stores P18-49 rank
    program = children[1].text  # stores program name
    network = children[2].text  # stores network name
    start_time = children[3].text  # stores program's start time
    mins = children[4].text  # stores program's duration in minutes
    p18_49 = children[5].text  # stores program's P18-49 rating
    p2 = children[6].text  # stores program's P2+ viewer count (in thousands)
    records.append((mmmmm_d_yyyy, weekday_name, rank, program, network, start_time, mins, p18_49, p2))

    program_count.append(program)  # adds each program name to the list.
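
As an aside, since the rankings are a plain HTML table, pandas can often parse the whole thing in one call. A minimal sketch, assuming the ratings grid is the first <table> on the page (worth checking, and it needs lxml or html5lib installed):

import pandas as pd

tables = pd.read_html(address)  # parses every <table> on the page into DataFrames
day_df = tables[0]              # assumption: the ratings grid is the first table
print(len(day_df))              # row count for that day's report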

You also don't need the second list (appending each program to program_count) just to get the number of programs retrieved. Both lists end up with the same count anyway, since you append exactly one program name per result, so you could use print(len(records)) instead of print(len(program_count)). I assume it was only there for debugging. Incidentally, the duplicated row #132 at the end of your CSV comes from the extra records.append(...) sitting after your for loop, which re-appends whatever the loop variables held on the final pass.