Need some help identifying the HTML tag that will allow me to pull all the relevant headlines, links and img URLs. My code is currently displaying only one.
I'm using the requests library to access the site and BeautifulSoup to parse the HTML. I'd like my scraper to pull at least four headlines, with their links and image URLs, from the site. I know it comes down to the right HTML tag, but I can't find which one. I've uploaded what I have so far; the code displays only the first headline, image URL, and article link.
from bs4 import BeautifulSoup
import requests

# user agent so the site serves the normal end-user page
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101'}

# identifying the website to be scraped
source = requests.get('https://www.jse.co.za/', headers=headers).text
# print(source)  # verify we got the HTML for the page

soup = BeautifulSoup(source, 'lxml')  # html parser
# print(soup.prettify())  # check the HTML has been parsed

for item in soup.find_all('div', {'class': 'view-content row row-flex'})[0:4]:  # indexing
    text = item.find('h3', {'class': 'card__title'}).text.strip()
    img = item.find('img', {'class': 'media__image'})
    link = item.find('a')
    article_link = link.attrs['href']
    print('ARTICLE HEADLINE')
    print(text)
    print('IMAGE URL')
    print(img['data-src'])
    print('LINK TO ARTICLE')
    print(article_link)
    print()
Output:
# looking at output of 4 headlines
ARTICLE HEADLINE
South Africa offers investment opportunities to Asia Pacific investors
# looking at output of at least 4 Image URL's
IMAGE URL
/sites/default/files/styles/standard_lg/public/medial/images/2021-06/Web_Banner_0.jpg?h=4ae650de&itok=hdGEy5jA
# I was hoping to scrape at least 4 links
LINK TO ARTICLE
/news/market-news/south-africa-offers-investment-opportunities-asia-pacific-investors
Looking at that JSE site, they use an article tag to list each news item, with a card class as well, so I'd suggest using for article in soup.find_all('article') to split it up, then grabbing each inner item within that.
Update: full example.
from bs4 import BeautifulSoup
import requests

base_url = 'https://www.jse.co.za'
source = requests.get(base_url).text
print("Got source")

soup = BeautifulSoup(source, 'html.parser')
print("Parsed source")

articles = soup.find_all("article", class_="card")
print(f"Number of articles found: {len(articles)}")

for article in articles:
    print("----------------------------------------------------")
    headline = article.h3.text.strip()
    link = base_url + article.a['href']
    text = article.find("div", class_="field--type-text-with-summary").text.strip()
    img_url = base_url + article.picture.img['data-src']
    print(headline)
    print(link)
    print(text)
    print("Image: " + img_url)
Runnable here.
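As a side note, the href and data-src values on that page are relative paths (as the question's output shows), so they need to be joined onto the site's base URL. Rather than concatenating strings, urllib.parse.urljoin from the standard library handles relative paths, absolute paths, and already-complete URLs uniformly. A minimal sketch (the example paths are illustrative, not taken from a live scrape):

```python
from urllib.parse import urljoin

base_url = 'https://www.jse.co.za'

# a site-relative path gets joined onto the base URL's scheme and host
print(urljoin(base_url, '/news/market-news/some-article'))
# -> https://www.jse.co.za/news/market-news/some-article

# an already-absolute URL is returned unchanged
print(urljoin(base_url, 'https://example.com/other'))
# -> https://example.com/other
```

This way the loop body can use `urljoin(base_url, article.a['href'])` and still work even if the site later switches to absolute links.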