Array length error while creating a DataFrame from scraped data using BeautifulSoup

Question

I am building a dataset of IMDB ratings and reviews.
Link
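I want to scrape all the ratings and reviews on this page. Some reviews have no rating, so I end up with a different number of reviews and ratings.
I have tried various ways of handling the missing values, but none of them worked.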

My code:

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import itertools
import string

url = (
    "https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey={}"
)
key = ""
data = {"user_id": [], "rating":[], "title": [], "review": []}

while True:
    response = requests.get(url.format(key))
    soup = BeautifulSoup(response.content, "html.parser")
    # Find the pagination key
    pagination_key = soup.find("div", class_="load-more-data")
    if not pagination_key:
        break

    for user in (
        [tag.attrs['href'] for tag in soup.find_all('a', attrs={'class': None})
                if tag.attrs['href'].startswith('/user') & tag.attrs['href'].endswith('/')]
    ):
        data["user_id"].append(user[6:-1])

    for rate in (
        [tag.previous_element for tag in soup.find_all('span', attrs={'class': 'point-scale'})]
    ):
      if (rate.__eq__(None)):
        data["rating"].append(None)
      else:
        data["rating"].append(rate)
    
    ## Update the 'key' variable in-order to scrape more reviews
    key = pagination_key["data-key"]
    for title, review in zip(
        soup.find_all(class_="title"), soup.find_all(class_="text show-more__control")
    ):
        data["title"].append(title.get_text(strip=True))
        data["review"].append(review.get_text())
  

df = pd.DataFrame(data)
print(df)

len(data['rating'])
>>>2107

len(data['review'])
>>>2150

Error:

ValueError                                Traceback (most recent call last)
<ipython-input-28-0064f972ba2a> in <module>()
     41 
     42 
---> 43 df = pd.DataFrame(data)
     44 print(df)

3 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in extract_index(data)
    395             lengths = list(set(raw_lengths))
    396             if len(lengths) > 1:
--> 397                 raise ValueError("arrays must all be same length")
    398 
    399             if have_dicts:

ValueError: arrays must all be same length

I want the DataFrame to contain a blank value wherever a review has no rating.

Answer 1

Unfortunately, a rating is not always present, so the logic here fails:

for rate in (
        [tag.previous_element for tag in soup.find_all('span', attrs={'class': 'point-scale'})]
    ):
      if (rate.__eq__(None)):
        data["rating"].append(None)
      else:
        data["rating"].append(rate)

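Regardless of what you append, the loop ends up iterating over fewer items than expected. find_all only returns the point-scale spans that actually exist, so a review without a rating never produces an item and the None branch is never reached.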
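The mismatch can be reproduced offline. The snippet below is a minimal sketch: SAMPLE_HTML is hypothetical, heavily simplified markup that only borrows the class names used in the question, and the real IMDB page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup (an assumption, not the real IMDB page):
# three reviews, the second one has no rating.
SAMPLE_HTML = """
<div class="lister-item-content"><span>8</span><span class="point-scale">/10</span>
  <a class="title">Great</a><div class="text show-more__control">Loved it.</div></div>
<div class="lister-item-content">
  <a class="title">No score</a><div class="text show-more__control">Forgot to rate.</div></div>
<div class="lister-item-content"><span>3</span><span class="point-scale">/10</span>
  <a class="title">Weak</a><div class="text show-more__control">Not for me.</div></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same lookups as in the question.
rating_spans = soup.find_all("span", attrs={"class": "point-scale"})
reviews = soup.find_all(class_="text show-more__control")
ratings = [tag.previous_element for tag in rating_spans]

print(len(rating_spans))  # 2: only the spans that exist are returned
print(len(reviews))       # 3: one per review
print(any(rate is None for rate in ratings))  # False: the None branch never runs
```

The unrated review is simply absent from the list of rating spans, so there is nothing for the None check to catch.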

One possible solution:

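Change the loop so that it iterates over the same number of items as the other lists, i.e. once per review container rather than once per rating, e.g.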

for rate in (
    [tag.select_one('.point-scale').previous_element if tag.select_one('.point-scale') is not None else None 
     for tag in soup.select('.lister-item-content')] 
):
    data["rating"].append(rate)
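The same idea can be taken one step further by reading every field from within its own review container and collecting one dict per review, so the columns cannot drift apart. This is a sketch under assumptions: parse_reviews and SAMPLE_HTML are hypothetical names, and the markup is a simplified stand-in for the real page that reuses only the class names from the question.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical, simplified markup (an assumption, not the real IMDB page).
SAMPLE_HTML = """
<div class="lister-item-content"><span>8</span><span class="point-scale">/10</span>
  <a class="title">Great</a><div class="text show-more__control">Loved it.</div></div>
<div class="lister-item-content">
  <a class="title">No score</a><div class="text show-more__control">Forgot to rate.</div></div>
<div class="lister-item-content"><span>3</span><span class="point-scale">/10</span>
  <a class="title">Weak</a><div class="text show-more__control">Not for me.</div></div>
"""


def parse_reviews(html):
    """Return one dict per review container; missing fields become None."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(".lister-item-content"):
        scale = item.select_one(".point-scale")
        title = item.select_one(".title")
        text = item.select_one(".text.show-more__control")
        rows.append({
            # The rating value is the text node right before the "/10" span.
            "rating": str(scale.previous_element) if scale is not None else None,
            "title": title.get_text(strip=True) if title is not None else None,
            "review": text.get_text() if text is not None else None,
        })
    return rows


rows = parse_reviews(SAMPLE_HTML)
df = pd.DataFrame(rows)
print(df)
```

Because every row is built from a single container, pd.DataFrame(rows) always receives columns of equal length, and unrated reviews show up as missing values in the rating column.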

Side note:

To debug, find these lines:

if not pagination_key:
    break

and add the following directly below them:

if len(soup.select('.lister-item-content, .point-scale')) % 2:
    print(url.format(key))
    break

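Then open the printed URL in a browser, type .lister-item-content, .point-scale into the search box of the Elements tab in the developer tools, and press Return. If the number of matches is odd, at least one rating is missing, and you can step through the reviews to see where.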
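The manual inspection can also be scripted. Note that the parity check only fires when an odd number of ratings is missing on a page; the sketch below instead checks each review container directly. reviews_missing_rating and SAMPLE_HTML are hypothetical names, and the markup is a simplified assumption rather than the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup (an assumption, not the real IMDB page).
SAMPLE_HTML = """
<div class="lister-item-content"><span>8</span><span class="point-scale">/10</span>
  <a class="title">Great</a></div>
<div class="lister-item-content">
  <a class="title">No score</a></div>
<div class="lister-item-content"><span>3</span><span class="point-scale">/10</span>
  <a class="title">Weak</a></div>
"""


def reviews_missing_rating(html):
    """Return (position, title) for every review container without a rating."""
    soup = BeautifulSoup(html, "html.parser")
    missing = []
    for index, item in enumerate(soup.select(".lister-item-content")):
        if item.select_one(".point-scale") is None:
            title = item.select_one(".title")
            missing.append(
                (index, title.get_text(strip=True) if title is not None else None)
            )
    return missing


print(reviews_missing_rating(SAMPLE_HTML))  # [(1, 'No score')]
```

In the scraping loop, the same function could be called on response.content for each page to log which reviews came back without a rating.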

python

dataframe

beautifulsoup

nonetype