Scrape tables from a wiki with Python and bs4

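I'm trying to scrape a table from a website, but I ran into this problem: I only get the cells of a single row, and I don't know where my mistake is. What I need is the first two cells from every row in the table. How can I adapt this code to other tables?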

My code:

from bs4 import BeautifulSoup
import requests

URL_TO = 'https://en.wikipedia.org/wiki/Rammstein_discography'
response = requests.get(URL_TO)
soup = BeautifulSoup(response.text,'html.parser')
soup.prettify()
table = soup.find("table", { "class" : "wikitable plainrowheaders" })
for row in table.findAll("tr"):
    cells = row.findAll("td")
    bells = row.findAll("th")
print(cells, bells)

My output:

[<td>
<ul><li>Released: 17 May 2019</li>
<li>Label: Universal</li>
<li>Format: CD, LP, DL</li></ul>
</td>, <td>1</td>, <td>5</td>, <td>1</td>, <td>1</td>, <td>1</td>, <td>1</td>, <td>1</td>, <td>1</td>, <td>1</td>, <td>2</td>, <td>1</td>, <td>3</td>, <td>9
</td>, <td>
<ul><li>FRA: 50,000 <sup class="reference" id="cite_ref-chartsinfrance_45-0"><a href="#cite_note-chartsinfrance-45">[45]</a></sup></li>
<li>GER: 260,000<sup class="reference" id="cite_ref-chartsinfrance_45-1"><a href="#cite_note-chartsinfrance-45">[45]</a></sup></li>
<li>US: 25,000<sup class="reference" id="cite_ref-46"><a href="#cite_note-46">[46]</a></sup></li>
<li>WW: 900,000<sup class="reference" id="cite_ref-47"><a href="#cite_note-47">[47]</a></sup></li></ul>
</td>, <td>
<ul><li>BVMI: 5× Gold<sup class="reference" id="cite_ref-musikindustrie_23-4"><a href="#cite_note-musikindustrie-23">[23]</a></sup></li>
<li>BEL: Gold<sup class="reference" id="cite_ref-48"><a href="#cite_note-48">[48]</a></sup></li>
<li>SNEP: Gold<sup class="reference" id="cite_ref-snep_44-1"><a href="#cite_note-snep-44">[44]</a></sup></li>
<li>IFPI AUT: 2× Platinum<sup class="reference" id="cite_ref-IFPIAUT_30-4"><a href="#cite_note-IFPIAUT-30">[30]</a></sup></li></ul>
</td>] [<th scope="row"><a href="/wiki/Untitled_Rammstein_album" title="Untitled Rammstein album">Untitled</a>
</th>]

What I need:

[Herzeleid]
[Released: 24 September 1995
Label: Motor, Slash
Format: CD, CS, LP, DL]

You can use pandas to scrape the table:

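import pandas as pd

URL_TO = 'https://en.wikipedia.org/wiki/Rammstein_discography'
# read_html returns a list with one DataFrame per table found on the page
df = pd.read_html(URL_TO)
# df[1] is the second table; take row 0, then the second of the two selected columns
df[1].loc[0, ['Title', 'Album details']].iloc[1]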

The 0 above selects the first record, Herzeleid.

Out[26]: 'Released: 24 September 1995 Label: Motor, Slash Format: CD, CS, LP, DL'
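
To get the first two cells of every row rather than only record 0, the same selection can be applied across the whole frame. The sketch below does not download anything: `albums` is a hypothetical, hand-built stand-in for `df[1]`, its two rows reuse values quoted in this thread, and the `Placeholder` column is invented just to show that extra columns are dropped.

```python
import pandas as pd

# Hypothetical stand-in for df[1]; not the real table from the page.
albums = pd.DataFrame({
    "Title": ["Herzeleid", "Untitled"],
    "Album details": [
        "Released: 24 September 1995 Label: Motor, Slash Format: CD, CS, LP, DL",
        "Released: 17 May 2019 Label: Universal Format: CD, LP, DL",
    ],
    "Placeholder": ["(ignored)", "(ignored)"],  # invented extra column
})

# One record: .loc picks the row label and two columns, giving a 2-item Series.
first = albums.loc[0, ["Title", "Album details"]]
print(first.iloc[0])  # title of record 0
print(first.iloc[1])  # album details of record 0

# Every record: keep the two columns and iterate over plain (title, details) tuples.
pairs = list(albums[["Title", "Album details"]].itertuples(index=False, name=None))
for title, details in pairs:
    print(f"[{title}]")
    print(f"[{details}]")
```

With the real data, `albums` would be replaced by `df[1]`, assuming its column labels are still `Title` and `Album details`.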

You can save the table to a CSV file with:

df[1].loc[:, ['Title', 'Album details']].to_csv('text_file.csv', index=False)
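
If you would rather stay with bs4: in the question's code, `print(cells, bells)` sits after the `for` loop, so it runs once with whatever the last row left in `cells` and `bells`. A minimal sketch of a per-row version follows. The helper name `first_two_cells` and the inline `SAMPLE_HTML` are assumptions made for illustration; the sample mimics the markup shown in the question's output, and its third column is an invented placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for response.text; not the real page.
SAMPLE_HTML = """
<table class="wikitable plainrowheaders">
<tr><th scope="col">Title</th><th scope="col">Album details</th><th scope="col">Other</th></tr>
<tr><th scope="row"><a href="/wiki/Herzeleid">Herzeleid</a>
</th><td>
<ul><li>Released: 24 September 1995</li>
<li>Label: Motor, Slash</li>
<li>Format: CD, CS, LP, DL</li></ul>
</td><td>(ignored)</td></tr>
<tr><th scope="row"><a href="/wiki/Untitled_Rammstein_album">Untitled</a>
</th><td>
<ul><li>Released: 17 May 2019</li>
<li>Label: Universal</li>
<li>Format: CD, LP, DL</li></ul>
</td><td>(ignored)</td></tr>
</table>
"""


def first_two_cells(html, table_class="wikitable"):
    """Return (title, [detail lines]) for each data row of the first matching table."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any single class, so "wikitable plainrowheaders" is found too.
    table = soup.find("table", class_=table_class)
    rows = []
    for row in table.find_all("tr"):
        title = row.find("th", scope="row")  # row header: the album title
        details = row.find("td")             # first data cell: the album details
        if title is None or details is None:
            continue                         # header rows have no row-scoped th / td
        lines = [li.get_text(strip=True) for li in details.find_all("li")]
        rows.append((title.get_text(strip=True), lines))
    return rows


rows = first_two_cells(SAMPLE_HTML)
for title, lines in rows:  # the print now happens once per row
    print(f"[{title}]")
    print("[" + "\n".join(lines) + "]")
```

For another table, pass a different `table_class`, or swap the `soup.find` call for whatever selector identifies that table; against the live page you would pass `response.text` instead of `SAMPLE_HTML`.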