Get specific information from HTML code

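The idea is to collect the IDs (not the names) of all SoundCloud users who posted tracks whose title starts with, for example, "f" within the "past year" period.

I used the filters on SoundCloud and got the results at the following URL: https://soundcloud.com/search/sounds?q=f&filter.created_at=last_year&filter.genre_or_tag=hip-hop%20%26%20rap

I found the ID of the first user ("wavey-hefner") in the following line of the HTML code: <a class="sound__coverArt" href="/wavey-hefner/foreign" draggable="true">

I want to get the ID of every user from the whole HTML.

My code is: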

import requests
from bs4 import BeautifulSoup

url = "https://soundcloud.com/search/sounds?q=f&filter.created_at=last_year&filter.genre_or_tag=hip-hop%20%26%20rap"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Print the href of every cover-art link, e.g. /wavey-hefner/foreign
for link in soup.find_all("a", {"class": "sound__coverArt"}):
    print(link.get('href'))

It returns nothing :(

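The page is rendered with JavaScript, so the HTML that requests downloads does not contain the search results yet. You can use Selenium to render the page. First install Selenium: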

pip3 install selenium

Then get a driver, e.g. from https://sites.google.com/a/chromium.org/chromedriver/downloads (if you are on Windows or Mac you can get a headless version of Chrome, or use Canary if you like), and put the driver in your PATH.

from bs4 import BeautifulSoup
from selenium import webdriver
import time

browser = webdriver.Chrome()
url = 'https://soundcloud.com/search/sounds?q=f&filter.created_at=last_year&filter.genre_or_tag=hip-hop%20%26%20rap'
browser.get(url)
# Give the JavaScript time to render the results
time.sleep(5)
# To load more results, scroll to the bottom of the page (repeat if you want to)
browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)
html_source = browser.page_source
browser.quit()

soup = BeautifulSoup(html_source, 'html.parser')
for link in soup.find_all("a", {"class": "sound__coverArt"}):
    print(link.get('href'))

Output:

/tee-grizzley/from-the-d-to-the-a-feat-lil-yachty
/empire/fat-joe-remy-ma-all-the-way-up-ft-french-montana
/tee-grizzley/first-day-out
/21savage/feel-it
/pluggedsoundz/famous-dex-geek-1
/rodshootinbirds/fairytale-x-rod-da-god
/chancetherapper/finish-line-drown-feat-t-pain-kirk-franklin-eryn-allen-kane-noname
/alkermith/future-low-life-ft-the-weeknd-evol
/javon-woodbridge/fabolous-slim-thick
/hamburgerhelper/feed-the-streets-prod-dequexatron-1000
/rob-neal-139819089/french-montana-lockjaw-remix-ft-gucci-mane-kodak-black
/pluggedsoundz/famous-dex-energy
/ovosoundradiohits/future-ft-drake-used-to-this
/pluggedsoundz/famous
/a-boogie-wit-da-hoodie/fucking-kissing-feat-chris-brown
/wavey-hefner/foreign
/jalensantoy/foreplay
/yvng_swag/fall-in-luv
/rich-the-kid/intro-prod-by-lab-cook
/empire/fat-joe-remy-ma-money-showers-feat-ty-dolla-ign
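
Each href above has the form /user-id/track-slug, so the user ID asked for in the question is the first path segment. Below is a minimal sketch of that last step, assuming the hrefs have been collected into a list instead of printed. The helper name extract_user_ids and the sample list are only illustrative and are not part of SoundCloud or BeautifulSoup.

```python
def extract_user_ids(hrefs):
    """Return unique user IDs, in first-seen order, from '/user/track' paths."""
    user_ids = []
    seen = set()
    for href in hrefs:
        if not href:
            continue  # link.get('href') returns None when the attribute is missing
        # '/wavey-hefner/foreign' -> ['wavey-hefner', 'foreign']
        parts = href.strip("/").split("/")
        if len(parts) < 2 or not parts[0]:
            continue  # skip anything that is not a '/user/track' path
        user_id = parts[0]
        if user_id not in seen:
            seen.add(user_id)
            user_ids.append(user_id)
    return user_ids


# Hypothetical sample taken from the output listed above
sample = [
    "/tee-grizzley/from-the-d-to-the-a-feat-lil-yachty",
    "/empire/fat-joe-remy-ma-all-the-way-up-ft-french-montana",
    "/tee-grizzley/first-day-out",
    "/wavey-hefner/foreign",
]
print(extract_user_ids(sample))  # ['tee-grizzley', 'empire', 'wavey-hefner']
```

With the Selenium script, you could build the list with something like [link.get('href') for link in soup.find_all("a", {"class": "sound__coverArt"})] and pass it to the helper. Duplicates are dropped because one user can appear several times in the results.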