How to select the first element in multi-valued HTML tags?

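I am building a web scraper to collect some information from AllMusic. However, I am having trouble returning the right information when a tag contains more than one option (for example, several href links).

Problem: I need to return the first music genre for each artist. My code works when an artist has only one genre. However, when an artist has several genres, I cannot select just the first one. Here is the code I wrote: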

import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request
artists =['Alexander 23', 'Alex & Sierra', 'Tion Wayne', 'Tom Cochrane','The Waked']
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}


performer = []
links = []
genre = []

for artist in artists:
  url= urllib.request.urlopen("https://www.allmusic.com/search/artist/" + urllib.parse.quote(artist))
  soup = BeautifulSoup(requests.get(url.geturl(), headers=headers).content, "html.parser")
  div = soup.select("div.name")[0]
  link = div.find_all('a')[0]['href']
  links.append(link)
  for l in links:
    soup = BeautifulSoup(requests.get(l, headers=headers).content, "html.parser")
    divGenre= soup.select("div.genre")[0] 
    genres = divGenre.find('a')
    performer.append(artist)
    genre.append(genres.text)

df = pd.DataFrame(zip(performer, genre, links), columns=["artist", "genre", "link"])
df

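I hope I understood your question correctly. The main issue is that you iterate over links again inside your for-loop: on every pass the inner loop revisits all links collected so far, which produces the duplicates.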
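To make the duplication visible without any network access, here is a minimal sketch of the same loop shape. The artist names and the example.com URLs are placeholders, not real scraped values.

```python
# Hypothetical stand-ins for the scraped values; no requests are made.
artists = ["A", "B", "C"]

links = []
performer = []

for artist in artists:
    links.append(f"https://example.com/{artist}")  # placeholder URL
    # Same pattern as in the question: the inner loop walks over every
    # link collected so far, so pass n appends n rows instead of one.
    for l in links:
        performer.append(artist)

print(len(links))      # 3
print(len(performer))  # 6, i.e. 1 + 2 + 3
print(performer)       # ['A', 'B', 'B', 'C', 'C', 'C']
```

Because performer and genre grow faster than links, zipping them together afterwards pairs rows that do not belong to each other.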

You may want to change your strategy: try to get all the information in a single iteration and store it in a more structured way.

Example
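As for picking only the first genre, BeautifulSoup's select_one returns the first element that matches a CSS selector, while select returns all of them. The sketch below runs on a made-up HTML fragment; the markup is an assumption for illustration and the real AllMusic page may be structured differently.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration only; the real AllMusic markup may differ.
html = """
<div class="genre">
  <h4>Genre</h4>
  <div>
    <a href="/genre/pop-rock">Pop/Rock</a>
    <a href="/genre/electronic">Electronic</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns every matching <a>, in document order.
all_genres = [a.get_text(strip=True) for a in soup.select("div.genre a")]

# select_one() returns only the first match, or None if nothing matches.
first = soup.select_one("div.genre a")
first_genre = first.get_text(strip=True) if first is not None else None

print(all_genres)   # ['Pop/Rock', 'Electronic']
print(first_genre)  # Pop/Rock
```

Checking for None before reading the text avoids an AttributeError on pages where the genre block is missing.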

import requests
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request
artists =['Alexander 23', 'Alex & Sierra', 'Tion Wayne', 'Tom Cochrane','The Waked']
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

data = []

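for artist in artists:
    # Search for the artist and follow any redirect to the final search URL
    url= urllib.request.urlopen("https://www.allmusic.com/search/artist/" + urllib.parse.quote(artist))
    soup = BeautifulSoup(requests.get(url.geturl(), headers=headers).content, "html.parser")
    # Take the link of the first search result only
    link = soup.select_one("div.name a").get('href')
    # Open the artist page and read the first genre link
    soup = BeautifulSoup(requests.get(link, headers=headers).content, "html.parser")
    # One dict per artist keeps artist, genre and link aligned
    data.append({
        'artist':artist,
        'genre':soup.select_one("div.genre a").text,
        'link':link
    })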

print(pd.DataFrame(data).to_markdown(index=False))
Output