Array length error while creating a DataFrame from scraped data using BeautifulSoup
I am building a dataset of IMDb ratings and reviews.
Link
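I want to scrape all the ratings and reviews on this page. Some reviews have no rating, so I end up with different numbers of reviews and ratings.
I have tried several ways of handling the missing values, but none of them worked.
My code: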
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import itertools
import string

url = (
    "https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey={}"
)
key = ""
data = {"user_id": [], "rating": [], "title": [], "review": []}

while True:
    response = requests.get(url.format(key))
    soup = BeautifulSoup(response.content, "html.parser")
    # Find the pagination key
    pagination_key = soup.find("div", class_="load-more-data")
    if not pagination_key:
        break
    for user in (
        [tag.attrs['href'] for tag in soup.find_all('a', attrs={'class': None})
         if tag.attrs['href'].startswith('/user') & tag.attrs['href'].endswith('/')]
    ):
        data["user_id"].append(user[6:-1])
    for rate in (
        [tag.previous_element for tag in soup.find_all('span', attrs={'class': 'point-scale'})]
    ):
        if (rate.__eq__(None)):
            data["rating"].append(None)
        else:
            data["rating"].append(rate)
    # Update the 'key' variable in order to scrape more reviews
    key = pagination_key["data-key"]
    for title, review in zip(
        soup.find_all(class_="title"), soup.find_all(class_="text show-more__control")
    ):
        data["title"].append(title.get_text(strip=True))
        data["review"].append(review.get_text())

df = pd.DataFrame(data)
print(df)
len(data['rating'])
>>>2107
len(data['review'])
>>>2150
Error:
ValueError Traceback (most recent call last)
<ipython-input-28-0064f972ba2a> in <module>()
41
42
---> 43 df = pd.DataFrame(data)
44 print(df)
3 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in extract_index(data)
395 lengths = list(set(raw_lengths))
396 if len(lengths) > 1:
--> 397 raise ValueError("arrays must all be same length")
398
399 if have_dicts:
ValueError: arrays must all be same length
I want the DataFrame to hold a blank value wherever a rating is not available.
Unfortunately, a rating is not always present, so the logic here fails:
for rate in (
    [tag.previous_element for tag in soup.find_all('span', attrs={'class': 'point-scale'})]
):
    if (rate.__eq__(None)):
        data["rating"].append(None)
    else:
        data["rating"].append(rate)
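Whatever you append, the loop ends up iterating over fewer items than expected: find_all returns only the point-scale spans that exist, so a review without a rating contributes no item at all and the None branch is effectively never reached.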
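To make the shortfall concrete, here is a minimal sketch on a made-up HTML fragment. The markup below only mimics the class names used in the question and is an assumption, not IMDb's actual response. It shows that an unrated review simply produces no entry in the ratings list.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two reviews, only the first one carries a rating.
SAMPLE_HTML = """
<div class="lister-item-content">
  <span><span>8</span><span class="point-scale">/10</span></span>
  <a class="title">Great</a>
</div>
<div class="lister-item-content">
  <a class="title">No score given</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Page-wide searches, as in the question's loop.
scale_spans = soup.find_all("span", attrs={"class": "point-scale"})
titles = soup.find_all(class_="title")

# Only one span exists for two reviews, so the lists differ in length.
ratings = [span.previous_element for span in scale_spans]
print(len(ratings), len(titles))
```

Because the two searches are independent of each other, nothing ties a rating to the review it belongs to.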
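One possible solution:
Iterate over the review containers instead, so that the loop yields the same number of items as the other lists, e.g.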
for rate in (
    [tag.select_one('.point-scale').previous_element if tag.select_one('.point-scale') is not None else None
     for tag in soup.select('.lister-item-content')]
):
    data["rating"].append(rate)
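The same idea can be taken one step further by reading every field from its own review container, which keeps each rating next to its title and text. The following is only a sketch under assumptions: extract_rows is a hypothetical helper name, and the sample markup is invented to match the selectors used above rather than copied from IMDb.

```python
from bs4 import BeautifulSoup


def extract_rows(soup):
    """Return one dict per review container, with None for any missing field."""
    rows = []
    for item in soup.select(".lister-item-content"):
        scale = item.select_one(".point-scale")
        title = item.select_one(".title")
        review = item.select_one(".text.show-more__control")
        rows.append({
            # The score is the text node right before the "/10" span.
            "rating": str(scale.previous_element) if scale is not None else None,
            "title": title.get_text(strip=True) if title is not None else None,
            "review": review.get_text() if review is not None else None,
        })
    return rows


# Hypothetical fragment: the second review has no rating.
SAMPLE_HTML = """
<div class="lister-item-content">
  <span><span>8</span><span class="point-scale">/10</span></span>
  <a class="title">Great</a>
  <div class="text show-more__control">Loved it.</div>
</div>
<div class="lister-item-content">
  <a class="title">No score given</a>
  <div class="text show-more__control">It was fine.</div>
</div>
"""

rows = extract_rows(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(rows)
```

A list of dicts like this can be passed straight to pd.DataFrame(rows), and since every row carries every key, the columns cannot end up with different lengths.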
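Side note:
You can debug this by adding, right after:
if not pagination_key:
    break
the following check:
if len(soup.select('.lister-item-content, .point-scale')) % 2:
    print(url.format(key))
    break
Then open the printed URL in your browser, type .lister-item-content, .point-scale into the find box of the Elements tab, and press Return. If the number of matches is odd, a rating is missing, and you can step through the matches to see where. Note that this check only catches pages where an odd number of ratings is missing.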
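A complementary check that does not depend on the page markup is to compare the list lengths after each page, so the loop can stop at the first page where they diverge. This is a sketch only: find_mismatch is a hypothetical helper name, and the small dict stands in for the data dict from the question.

```python
def column_lengths(data):
    """Map each column name to the number of values collected for it."""
    return {name: len(values) for name, values in data.items()}


def find_mismatch(data):
    """Return the column lengths if they differ, otherwise an empty dict."""
    lengths = column_lengths(data)
    return lengths if len(set(lengths.values())) > 1 else {}


# Stand-in for the scraped data: one rating is missing.
data = {
    "rating": ["8"],
    "title": ["Great", "No score given"],
    "review": ["Loved it.", "It was fine."],
}

mismatch = find_mismatch(data)
print(mismatch)
```

Calling find_mismatch(data) at the end of each loop iteration and printing url.format(key) when the result is non-empty would point to the first page that introduces the discrepancy.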