Array length error while creating a DataFrame from scraped data using BeautifulSoup

I am creating a dataset of IMDb ratings and reviews.
Link
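I want to scrape all ratings and reviews on this page. Some reviews have no rating, so I end up with different numbers of reviews and ratings.
I tried various ways of handling the missing values but could not get any of them to work.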

My code:

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import itertools
import string

url = (
    "https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey={}"
)
key = ""
data = {"user_id": [], "rating":[], "title": [], "review": []}

while True:
    response = requests.get(url.format(key))
    soup = BeautifulSoup(response.content, "html.parser")
    # Find the pagination key
    pagination_key = soup.find("div", class_="load-more-data")
    if not pagination_key:
        break

    for user in (
        [tag.attrs['href'] for tag in soup.find_all('a', attrs={'class': None})
                if tag.attrs['href'].startswith('/user') & tag.attrs['href'].endswith('/')]
    ):
        data["user_id"].append(user[6:-1])

    for rate in (
        [tag.previous_element for tag in soup.find_all('span', attrs={'class': 'point-scale'})]
    ):
      if (rate.__eq__(None)):
        data["rating"].append(None)
      else:
        data["rating"].append(rate)
    
    ## Update the 'key' variable in-order to scrape more reviews
    key = pagination_key["data-key"]
    for title, review in zip(
        soup.find_all(class_="title"), soup.find_all(class_="text show-more__control")
    ):
        data["title"].append(title.get_text(strip=True))
        data["review"].append(review.get_text())
  

df = pd.DataFrame(data)
print(df)

len(data['rating'])
>>>2107

len(data['review'])
>>>2150

Error:

ValueError                                Traceback (most recent call last)
<ipython-input-28-0064f972ba2a> in <module>()
     41 
     42 
---> 43 df = pd.DataFrame(data)
     44 print(df)

3 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in extract_index(data)
    395             lengths = list(set(raw_lengths))
    396             if len(lengths) > 1:
--> 397                 raise ValueError("arrays must all be same length")
    398 
    399             if have_dicts:

ValueError: arrays must all be same length

I want the data frame to hold a blank value wherever a rating is not available.

Unfortunately, a rating is not always present, so the logic here fails:

for rate in (
        [tag.previous_element for tag in soup.find_all('span', attrs={'class': 'point-scale'})]
    ):
      if (rate.__eq__(None)):
        data["rating"].append(None)
      else:
        data["rating"].append(rate)

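Whatever you append, the loop iterates over fewer items than expected: find_all only returns the point-scale spans that actually exist, so a review without a rating produces no item at all and the None branch never runs.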
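To make the mismatch concrete, here is a minimal sketch on a hand-written HTML fragment. The markup is a simplified assumption about the page structure (only the class names lister-item-content, point-scale and title come from the code above), not a copy of the real IMDb response.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: two reviews, only the first has a rating.
html = """
<div class="lister-item-content">
  <span><span>8</span><span class="point-scale">/10</span></span>
  <a class="title">Great</a>
</div>
<div class="lister-item-content">
  <a class="title">No score given</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching the whole document for rating spans skips unrated reviews entirely.
rated = soup.find_all("span", attrs={"class": "point-scale"})
titles = soup.find_all(class_="title")

# One rating span versus two titles: the lists are already out of step.
print(len(rated), len(titles))
```

Because the unrated review contributes nothing to `rated`, there is no element on which a None check could ever fire.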


One possible solution:

You need to change the loop so that it iterates over the same number of items as the other lists, for example:

for rate in (
    [tag.select_one('.point-scale').previous_element if tag.select_one('.point-scale') is not None else None 
     for tag in soup.select('.lister-item-content')] 
):
    data["rating"].append(rate)
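An alternative sketch, offered as a variation rather than as part of the answer above: collect one dict per review container, so every field of a review is gathered together and the columns cannot drift apart. The helper name parse_reviews and the sample markup are hypothetical, and the markup is a simplified assumption about the page structure.

```python
from bs4 import BeautifulSoup


def parse_reviews(html):
    """Return one dict per review container; missing fields become None."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(".lister-item-content"):
        # Look up each field inside this review only, never document-wide.
        scale = item.select_one(".point-scale")
        title = item.select_one(".title")
        rows.append(
            {
                "rating": str(scale.previous_element).strip() if scale is not None else None,
                "title": title.get_text(strip=True) if title is not None else None,
            }
        )
    return rows


# Hypothetical, simplified markup: the second review has no rating.
sample = """
<div class="lister-item-content">
  <span><span>8</span><span class="point-scale">/10</span></span>
  <a class="title">Great</a>
</div>
<div class="lister-item-content">
  <a class="title">No score given</a>
</div>
"""

rows = parse_reviews(sample)
print(rows)
```

A list of dicts like this can be passed straight to pd.DataFrame(rows), which builds one row per dict, so the "arrays must all be same length" check is not triggered by a missing rating.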

Side note:

You can debug this by adding, beneath:

if not pagination_key:
    break

the following:

if len(soup.select('.lister-item-content, .point-scale')) % 2:
    print(url.format(key))
    break

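Then open the printed URL in a browser, type .lister-item-content, .point-scale into the search box of the Elements tab, and press Return. If the number of matches is odd, a rating is missing, and you can step through the reviews to see where.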
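Note that the parity check only fires when an odd number of ratings is missing on a page. A stricter check, sketched below under the same assumed markup, counts the containers that lack a rating directly. The helper name count_unrated is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: three reviews, only the first is rated.
html = """
<div class="lister-item-content"><span>7</span><span class="point-scale">/10</span></div>
<div class="lister-item-content"></div>
<div class="lister-item-content"></div>
"""


def count_unrated(soup):
    """Count review containers that have no .point-scale descendant."""
    return sum(
        1
        for item in soup.select(".lister-item-content")
        if item.select_one(".point-scale") is None
    )


soup = BeautifulSoup(html, "html.parser")

# Parity check from the side note: 3 containers + 1 rating = 4 matches.
parity_flag = len(soup.select(".lister-item-content, .point-scale")) % 2
unrated = count_unrated(soup)

# The parity flag is 0 here even though two ratings are missing.
print(parity_flag, unrated)
```

Inside the scraping loop, printing url.format(key) whenever count_unrated(soup) is non-zero would point at every page with a gap, not only those with an odd number of gaps.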