Python Web Scraping - Two Different Parent Class Names, Different Structures but Same Child Class Names

Question

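I'm trying to scrape news articles from prnewswire.com. Each article is stored in a div with the class "row".

My problem is that some article previews have an image next to the title and description. So inside the "row" divs, the class name is either "card" (with image) or "col-sm-12 card" (without image).

My current code is as follows: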

import requests
from bs4 import BeautifulSoup
import pandas

headers = {
    'User-Agent':
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) ' +
        'Version/14.0.1 Safari/605.1.15'
}

articlelist = []


def getarticles(page):
    url = 'https://www.prnewswire.com/news-releases/news-releases-list/?page=' + str(page) + '&pagesize=100'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    prnewswire_articles = soup.find_all('div', {'class': 'col-sm-12 card'})

    for item in prnewswire_articles:
        prnewswire_article = {
            'page': page,
            'article_title': item.find('h3').text,
            'article_link': 'https://www.prnewswire.com/' +
                            item.find('a')['href'],
            'article_description': item.find('p').text,
        }
        articlelist.append(prnewswire_article)
    return

for x in range(1, 3):
    getarticles(x)

df = pandas.DataFrame(articlelist)
print(df.head())
print(len(df))
df.to_excel('PRNewsWire.xlsx', index=False)
print('Finished.')

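Here is what I found: on the line where I declare "prnewswire_articles" and look for divs with a specific class name, I get the results I want with the class "col-sm-12 card". But "card" or "row" doesn't work.

I noticed that the HTML structure of the "card" divs differs from that of the "col-sm-12 card" divs, but both contain an "h3" element (the article title), an "a href" and a "p" element.

This is the error message I get when I use "row" or "card" as the class name: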

Traceback (most recent call last):
  File "/Users/myname/PycharmProjects/projectname/prnewswire.py", line 33, in <module>
    getarticles(x)
  File "/Users/myname/PycharmProjects/projectname/prnewswire.py", line 24, in getarticles
    'article_title': item.find('h3').text,
AttributeError: 'NoneType' object has no attribute 'text'

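I've been searching all day and found nothing. I only recently started learning Python, so I apologize if this is a silly mistake, but I'm hoping to find an answer. I'd really appreciate any help! :)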

Answer 1

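You can select every .row element that sits directly under the .card-list class (using a CSS selector). I also changed how the article title is extracted (it simply takes the text that follows the <small> element):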

import pandas
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent':
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) ' +
        'Version/14.0.1 Safari/605.1.15'
}

articlelist = []


def getarticles(page):
    url = 'https://www.prnewswire.com/news-releases/news-releases-list/?page=' + str(page) + '&pagesize=100'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    prnewswire_articles = soup.select('.card-list > .row')    # <-- select all rows that are under class "card-list"

    for item in prnewswire_articles:
        prnewswire_article = {
            'page': page,
            'article_title': item.select_one('h3 small').find_next_sibling(string=True).strip(),   # <--- select text that is after <small> element ("string" replaces the deprecated "text" argument)
            'article_link': 'https://www.prnewswire.com/' +
                            item.find('a')['href'],
            'article_description': item.find('p').get_text(strip=True, separator='\n'),
        }
        articlelist.append(prnewswire_article)

for x in range(1, 3):
    getarticles(x)

df = pandas.DataFrame(articlelist)
print(df)

Prints:

...

186     2  Wugen Announces Exclusive Partnership Agreemen...  https://www.prnewswire.com//news-releases/wuge...  Wugen Inc., a clinical-stage biotechnology com...
187     2  Upstryve Initiates Mentor Network for Trade St...  https://www.prnewswire.com//news-releases/upst...  Upstryve Inc (Upstryve) www.upstryve.com. Upst...
188     2  Inkling Simplifies Integration to Learning and...  https://www.prnewswire.com//news-releases/inkl...  Inkling, a global leader in digital learning p...
189     2  CommerceHub to Participate in Bank of America'...  https://www.prnewswire.com//news-releases/comm...  CommerceHub, a leading provider of ecommerce s...
190     2        Instrument Promotes Kara Place to President  https://www.prnewswire.com//news-releases/inst...  Instrument, a digitally focused, creative agen...
191     2   Regent Properties Announces Executive Promotions  https://www.prnewswire.com//news-releases/rege...  Regent Properties ("Regent"), a real estate in...
192     2  PowerPay Hits  Billion in Home Renovations L...  https://www.prnewswire.com//news-releases/powe...  PowerPay, the nation's fastest-growing home im...
193     2  GoldenTree Announces Closing of 8 Million C...  https://www.prnewswire.com//news-releases/gold...  GoldenTree Loan Management II ("GLM II") and i...
194     2  The Real-Time Moving Show on the Screen "Showt...  https://www.prnewswire.com//news-releases/the-...  EnableWow (www.showtap.com) launched a new pre...
195     2  Black Knight: Lock Activity Suggests Q1 2021 R...  https://www.prnewswire.com//news-releases/blac...  Today, the Data & Analytics division of Black ...
196     2  LG Innotek Joins Hands with Microsoft to Proli...  https://www.prnewswire.com//news-releases/lg-i...  LG Innotek (CEO Cheoldong Jeong) announced on ...
197     2  MicroWorkers Integrates Ontology's ONTO Wallet...  https://www.prnewswire.com//news-releases/micr...  To bridge micro workers globally, Ontology and...
198     2     MemVerge Introduces M3 Channel Partner Program  https://www.prnewswire.com//news-releases/memv...  MemVerge™, the pioneers of Big Memory software...
199     2  Innovative Deals Spur the Growth of New Sports...  https://www.prnewswire.com//news-releases/inno...  Last year has shaped up to be crucial for the ...
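
For anyone who wants to try the selector without hitting the site, here is a minimal offline sketch. The HTML in SAMPLE_HTML is hypothetical and only approximates the two layouts described in the question (a "col-sm-12 card" without an image and a "card" with one), so the real page may differ. The helper name parse_rows is also just an illustration. The sketch shows that '.card-list > .row' covers both layouts. It also skips rows that have no h3, which is a plausible (not confirmed) reason for the 'NoneType' error when matching on "row" or "card" alone, and it builds links with urljoin, which avoids the double slash visible in the output above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup, NavigableString

BASE_URL = 'https://www.prnewswire.com/'

# Hypothetical markup, simplified from the description in the question.
SAMPLE_HTML = """
<div class="card-list">
  <div class="row">
    <div class="col-sm-12 card">
      <a href="/news-releases/no-image-1.html">
        <h3><small>08:00 ET</small> Title without image</h3>
        <p>Description A</p>
      </a>
    </div>
  </div>
  <div class="row">
    <div class="card">
      <div class="col-sm-3">
        <a href="/news-releases/with-image-2.html"><img src="thumb.jpg"/></a>
      </div>
      <div class="col-sm-9">
        <a href="/news-releases/with-image-2.html">
          <h3><small>08:05 ET</small> Title with image</h3>
          <p>Description B</p>
        </a>
      </div>
    </div>
  </div>
  <div class="row"><div class="col-sm-12">No article in this row</div></div>
</div>
"""


def parse_rows(html):
    """Return one dict per article row, whichever card layout the row uses."""
    soup = BeautifulSoup(html, 'html.parser')
    articles = []
    for row in soup.select('.card-list > .row'):
        heading = row.find('h3')
        link = row.find('a', href=True)
        if heading is None or link is None:
            # Not an article preview: calling .text here would raise AttributeError.
            continue
        # Keep only the text directly inside <h3>, dropping the <small> timestamp.
        title = ''.join(
            child for child in heading.children
            if isinstance(child, NavigableString)
        ).strip()
        paragraph = row.find('p')
        articles.append({
            'article_title': title,
            'article_link': urljoin(BASE_URL, link['href']),
            'article_description': paragraph.get_text(strip=True) if paragraph else '',
        })
    return articles


if __name__ == '__main__':
    for article in parse_rows(SAMPLE_HTML):
        print(article)
```

The same loop body can be dropped into getarticles() in place of the dictionary literal if you prefer skipping unexpected rows over crashing on them.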
