How to select the first element in multi-valued HTML tags?
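I am building a web scraper to collect some information from AllMusic. However, I am having trouble returning the information correctly when a tag contains more than one option (for example, several href links).
Problem: I need to return only the first music genre for each artist. My code works when an artist has a single genre, but when there are several genres I cannot select just the first one.
Here is the code I wrote: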
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request

artists = ['Alexander 23', 'Alex & Sierra', 'Tion Wayne', 'Tom Cochrane', 'The Waked']
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
performer = []
links = []
genre = []

for artist in artists:
    url = urllib.request.urlopen("https://www.allmusic.com/search/artist/" + urllib.parse.quote(artist))
    soup = BeautifulSoup(requests.get(url.geturl(), headers=headers).content, "html.parser")
    div = soup.select("div.name")[0]
    link = div.find_all('a')[0]['href']
    links.append(link)
    for l in links:
        soup = BeautifulSoup(requests.get(l, headers=headers).content, "html.parser")
        divGenre = soup.select("div.genre")[0]
        genres = divGenre.find('a')
        performer.append(artist)
        genre.append(genres.text)

df = pd.DataFrame(zip(performer, genre, links), columns=["artist", "genre", "link"])
df
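I hope I have understood your question correctly. The main issue is that you iterate over links inside your for-loop: every pass of the outer loop visits all links collected so far again, and that causes the duplicates.
You might want to change your strategy: try to get all the information in a single iteration and store it in a more structured way.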
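To make the duplication visible without hitting the network, here is a minimal offline sketch of the same loop shape. The artist names and the `link-for-…` / `genre-of-…` strings are placeholders I made up for illustration; they stand in for the scraped values.

```python
artists = ["A", "B", "C"]

# Same shape as the loop in the question: the inner loop walks over
# every link collected so far, not just the one for the current artist.
performer, links, genre = [], [], []
for artist in artists:
    link = f"link-for-{artist}"  # placeholder for the scraped artist URL
    links.append(link)
    for l in links:
        performer.append(artist)
        genre.append(f"genre-of-{l}")  # placeholder for the scraped genre

print(len(links), len(performer), len(genre))

# zip() stops at the shortest list, so the rows silently go out of step.
misaligned = list(zip(performer, genre, links))
print(misaligned)

# One pass per artist keeps each value next to the artist it belongs to.
rows = []
for artist in artists:
    link = f"link-for-{artist}"
    rows.append({"artist": artist, "genre": f"genre-of-{link}", "link": link})
print(rows)
```

With three artists the inner loop appends six entries to `performer` and `genre` but only three to `links`, so zipping the lists pairs the second artist with the first artist's genre.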
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup
import urllib.parse
import urllib.request

artists = ['Alexander 23', 'Alex & Sierra', 'Tion Wayne', 'Tom Cochrane', 'The Waked']
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
data = []

for artist in artists:
    # Search for the artist; quote() escapes spaces and '&' in the name
    url = urllib.request.urlopen("https://www.allmusic.com/search/artist/" + urllib.parse.quote(artist))
    soup = BeautifulSoup(requests.get(url.geturl(), headers=headers).content, "html.parser")
    # select_one() returns only the first match: the top search result
    link = soup.select_one("div.name a").get('href')

    # Load the artist page and take only the first genre link
    soup = BeautifulSoup(requests.get(link, headers=headers).content, "html.parser")
    data.append({
        'artist': artist,
        'genre': soup.select_one("div.genre a").text,
        'link': link
    })

# to_markdown() needs the optional tabulate package
print(pd.DataFrame(data).to_markdown(index=False))
Output
artist | genre | link |
---|---|---|
Alexander 23 | Pop/Rock | https://www.allmusic.com/artist/alexander-23-mn0003823464 |
Alex & Sierra | Folk | https://www.allmusic.com/artist/alex-sierra-mn0003280540 |
Tion Wayne | Rap | https://www.allmusic.com/artist/tion-wayne-mn0003666177 |
Tom Cochrane | Pop/Rock | https://www.allmusic.com/artist/tom-cochrane-mn0000931015 |
The Waked | Electronic | https://www.allmusic.com/artist/the-waked-mn0004025091 |
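The part that picks the first genre can also be tried offline. The sketch below runs `select_one` against a small hand-written HTML snippet; the markup and the `first_text` helper are assumptions for illustration and are not copied from AllMusic, whose real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one genre block that holds several links.
html = """
<div class="genre">
  <h4>Genre</h4>
  <div>
    <a href="#pop-rock">Pop/Rock</a>
    <a href="#electronic">Electronic</a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first element matching the CSS selector.
first = soup.select_one("div.genre a")
print(first.get_text(strip=True))

# select() returns every match, in document order.
all_genres = [a.get_text(strip=True) for a in soup.select("div.genre a")]
print(all_genres)


def first_text(soup, css, default=None):
    """Return the text of the first match, or `default` if nothing matches."""
    tag = soup.select_one(css)
    return tag.get_text(strip=True) if tag is not None else default


# select_one() returns None when nothing matches, so guard before using .text
print(first_text(soup, "div.mood a"))
```

A guard like `first_text` may be worth adding to the scraper, since calling `.text` on a `None` result raises an `AttributeError` for any artist page that has no genre block.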