How to scrape the names of all the artists from the table using Selenium and Python?

Question

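I am trying to scrape a website that lists the top 1000 artists and append their names to a list, so that I can run a lyrics analysis by searching on each artist's name. The website has an option to show all 1000 artists at once, so I use Selenium to select that option. From there, I locate the artist names and collect them in a list of WebElements. I then iterate over that list, read each element's text, and append it to my own list. After it has retrieved a certain number of artists, the program keeps throwing a StaleElementReferenceException.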

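I tried some of the suggested fixes, such as explicit waits (wait until) or try/except blocks, but they ended up crashing the program. Most of the solutions I found deal with the exception being raised while clicking on or otherwise interacting with a web element, but I don't change anything on the page after selecting my option, so I'm not sure where I'm going wrong. I'm fairly new to Selenium, so I'm also not sure whether this is the best way to get the artist names. Any help would be greatly appreciated.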

My code:

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://chartmasters.org/most-streamed-artists-ever-on-spotify/')

try:
    # get the select tag
    select = Select(driver.find_element(By.CSS_SELECTOR, '#table_1_length > label > div > select'))
    # select by value (select All option to get all 1000 artists)
    select.select_by_value('-1')

    all_artists = []
    all_artists_references = driver.find_elements(By.CLASS_NAME, 'bolded.column-artist-name')

    for element in all_artists_references:
        print(element.text)
        all_artists.append(element.text)

    print(all_artists)

finally:
    driver.quit()

Answer 1

To extract and print the names of all 1000 artists, you need to induce WebDriverWait for visibility_of_all_elements_located(). You can use either of the following locator strategies:

Using CSS_SELECTOR:

print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table#table_1 tbody tr[role='row'] td:nth-of-type(2)")))])

Using XPATH:

print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@id='table_1']//tbody//tr[@role='row']/td[2]")))])

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
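
A likely cause of the StaleElementReferenceException in the question is that the table redraws its rows after the "All" option is selected, so references collected before the redraw finishes no longer point at live elements. Besides waiting, one option is to re-locate the elements and retry whenever a stale reference is hit. The helper below is only a sketch of that idea: `read_texts_with_retry`, `locate`, and `stale_exc` are hypothetical names introduced here, not part of Selenium, and the exception class is passed in so the helper itself does not depend on a running browser.

```python
import time


def read_texts_with_retry(locate, stale_exc, attempts=3, pause=0.5):
    """Return the .text of every element from locate(), re-locating on stale_exc.

    locate    -- zero-argument callable that returns a fresh list of elements
    stale_exc -- exception class (or tuple of classes) signalling a stale reference
    attempts  -- maximum number of lookups before giving up
    pause     -- seconds to sleep between attempts
    """
    for attempt in range(attempts):
        try:
            # Run the lookup again on every attempt so no old reference is reused.
            return [element.text for element in locate()]
        except stale_exc:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            time.sleep(pause)  # give the table time to finish redrawing


# Assumed usage with Selenium (needs a live driver, so it is left commented out):
# from selenium.common.exceptions import StaleElementReferenceException
# names = read_texts_with_retry(
#     lambda: driver.find_elements(
#         By.CSS_SELECTOR, "table#table_1 tbody tr[role='row'] td:nth-of-type(2)"),
#     StaleElementReferenceException,
# )
```

Because the whole list is rebuilt on each attempt, a redraw that happens halfway through the loop cannot leave a mix of old and new rows in the result.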

Answer 2

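The form data for the query that returns the exact table is quite lengthy, but fetching the data directly from the source is more efficient.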

import requests
import pandas as pd

url = 'https://chartmasters.org/wp-admin/admin-ajax.php'
params = {
    'action': 'get_wdtable',
    'table_id': '1'}
data = {
'draw': '1',
'columns[0][data]': '0',
'columns[0][name]': 'rank',
'columns[0][searchable]': 'true',
'columns[0][orderable]': 'false',
'columns[0][search][value]': '',
'columns[0][search][regex]': 'false',
'columns[1][data]': '1',
'columns[1][name]': 'Artist Name',
'columns[1][searchable]': 'true',
'columns[1][orderable]': 'false',
'columns[1][search][value]': '',
'columns[1][search][regex]': 'false',
'columns[2][data]': '2',
'columns[2][name]': 'Lead Streams',
'columns[2][searchable]': 'true',
'columns[2][orderable]': 'true',
'columns[2][search][value]': '',
'columns[2][search][regex]': 'false',
'columns[3][data]': '3',
'columns[3][name]': 'Featured Streams',
'columns[3][searchable]': 'true',
'columns[3][orderable]': 'true',
'columns[3][search][value]': '',
'columns[3][search][regex]': 'false',
'columns[4][data]': '4',
'columns[4][name]': 'Tracks',
'columns[4][searchable]': 'true',
'columns[4][orderable]': 'true',
'columns[4][search][value]': '',
'columns[4][search][regex]': 'false',
'columns[5][data]': '5',
'columns[5][name]': '1b+',
'columns[5][searchable]': 'true',
'columns[5][orderable]': 'true',
'columns[5][search][value]': '',
'columns[5][search][regex]': 'false',
'columns[6][data]': '6',
'columns[6][name]': '100m+',
'columns[6][searchable]': 'true',
'columns[6][orderable]': 'true',
'columns[6][search][value]': '',
'columns[6][search][regex]': 'false',
'columns[7][data]': '7',
'columns[7][name]': '10m+',
'columns[7][searchable]': 'true',
'columns[7][orderable]': 'true',
'columns[7][search][value]': '',
'columns[7][search][regex]': 'false',
'columns[8][data]': '8',
'columns[8][name]': '1m+',
'columns[8][searchable]': 'true',
'columns[8][orderable]': 'true',
'columns[8][search][value]': '',
'columns[8][search][regex]': 'false',
'columns[9][data]': '9',
'columns[9][name]': 'Last Update',
'columns[9][searchable]': 'true',
'columns[9][orderable]': 'true',
'columns[9][search][value]': '',
'columns[9][search][regex]': 'false',
'order[0][column]': '2',
'order[0][dir]': 'desc',
'start': '0',
'length': '9999',
'search[value]': '',
'search[regex]': 'false',
'wdtNonce': '64ac23afe1'}


cols = []
for k, v in data.items():
    if 'name' in k:
        cols.append(v)

jsonData = requests.post(url, params=params, data=data).json()
df = pd.DataFrame(jsonData['data'], columns=cols)

Output:

print(df)
     rank    Artist Name    Lead Streams  ... 10m+  1m+ Last Update
0       1          Drake  45,625,377,884  ...  241  244    29.03.22
1       2     Ed Sheeran  34,724,649,138  ...  165  199    29.03.22
2       3      Bad Bunny  33,419,082,838  ...  134  140    29.03.22
3       4     The Weeknd  30,455,269,996  ...  143  161    29.03.22
4       5  Ariana Grande  30,021,891,319  ...  126  175    29.03.22
..    ...            ...             ...  ...  ...  ...         ...
995   996          HONNE   1,229,848,408  ...   29   85    18.12.21
996   997  Darius Rucker   1,229,826,891  ...   14   77    28.03.22
997   998       King Von   1,224,925,368  ...   34   68    14.03.22
998   999        JP Saxe   1,224,510,818  ...   13   30    24.03.22
999  1000        Showtek   1,223,338,892  ...   19   69    26.02.21

[1000 rows x 10 columns]
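
Since the form fields repeat the same six keys for every column, the payload can also be generated from the list of column names, and the artist names can be pulled straight out of the returned rows. The following is a minimal sketch under stated assumptions: `build_payload`, `artist_names`, and `COLUMNS` are hypothetical names introduced here, the field layout is copied from the dictionary above, and the `wdtNonce` value is the one from this answer, which may no longer be valid and may need to be replaced with a current one from the page.

```python
def build_payload(columns, nonce, length=9999):
    """Build the form data for a list of (column name, orderable) pairs."""
    data = {'draw': '1'}
    for i, (name, orderable) in enumerate(columns):
        prefix = f'columns[{i}]'
        data[f'{prefix}[data]'] = str(i)
        data[f'{prefix}[name]'] = name
        data[f'{prefix}[searchable]'] = 'true'
        data[f'{prefix}[orderable]'] = 'true' if orderable else 'false'
        data[f'{prefix}[search][value]'] = ''
        data[f'{prefix}[search][regex]'] = 'false'
    data.update({
        'order[0][column]': '2',   # sort by Lead Streams
        'order[0][dir]': 'desc',
        'start': '0',
        'length': str(length),     # larger than the table, so all rows come back
        'search[value]': '',
        'search[regex]': 'false',
        'wdtNonce': nonce,         # assumption: copied from the answer, may have expired
    })
    return data


# Column names and orderable flags as they appear in the dictionary above.
COLUMNS = [
    ('rank', False), ('Artist Name', False), ('Lead Streams', True),
    ('Featured Streams', True), ('Tracks', True), ('1b+', True),
    ('100m+', True), ('10m+', True), ('1m+', True), ('Last Update', True),
]


def artist_names(rows, name_index=1):
    """Return the artist-name cell of every row (a list of cell values)."""
    return [row[name_index] for row in rows]


payload = build_payload(COLUMNS, '64ac23afe1')

# Assumed usage, reusing url and params from the code above:
# rows = requests.post(url, params=params, data=payload).json()['data']
# all_artists = artist_names(rows)
```

If the DataFrame from the answer is already available, `df['Artist Name'].tolist()` gives the same list of names.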
