How do I scrape the links from only one column of a Wikipedia table with Python?

Question

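I am a beginner, and this is my first question on this forum. As the title says, my goal is to scrape the links from only one column of the table on this Wikipedia page: https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain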

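I have already looked at some contributions on this forum (especially this one: How do I extract text data in first column from Wikipedia table?), but none of them seems to answer my question. As far as I understand, a DataFrame is not a solution, because it is a kind of copy/paste of the table, whereas I want to get the links.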

Here is my code so far:

import requests
from bs4 import BeautifulSoup as bs

res = requests.get("https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain")
soup = bs(res.text, "html.parser")

# Take the first table with class "wikitable" and collect every link in it
table = soup.find("table", "wikitable")
links = table.find_all("a")
communes = {}
for link in links:
    url = link.get("href", "")
    communes[link.text.strip()] = url
print(communes)

Thanks in advance for your answers!

Answer 1

To scrape a specific column, you can use the nth-of-type(n) CSS selector. To use a CSS selector, call the select() method instead of find_all().

For example, to scrape only the sixth column, select the sixth <td> of each row with soup.select("td:nth-of-type(6)").
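
To make the selector's behaviour concrete, here is a minimal offline sketch. The table in SAMPLE_HTML is a made-up miniature, not the real Wikipedia markup, and the names sample_soup, third_column and first_column_hrefs are only illustrative. It assumes every data cell is a <td>; if a row mixes <th> and <td> cells, the count shifts, because nth-of-type only counts siblings of the same tag.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table, used only to illustrate the selector.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Code</th><th>Canton</th></tr>
  <tr><td><a href="/wiki/A">A</a></td><td>01001</td><td><a href="/wiki/Canton_X">X</a></td></tr>
  <tr><td><a href="/wiki/B">B</a></td><td>01002</td><td><a href="/wiki/Canton_Y">Y</a></td></tr>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# td:nth-of-type(3) matches each <td> that is the third <td> among its siblings,
# which here means the third cell of every data row.
third_column = [td.get_text(strip=True) for td in sample_soup.select("td:nth-of-type(3)")]

# Appending " a" narrows the match to the links inside that column's cells.
first_column_hrefs = [a["href"] for a in sample_soup.select("td:nth-of-type(1) a")]

print(third_column)
print(first_column_hrefs)
```

The header row is skipped here only because it uses <th> cells, which td:nth-of-type never matches.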

Here is an example of how to print all the links from only the fifth column:

import requests
from bs4 import BeautifulSoup


BASE_URL = "https://fr.wikipedia.org"
URL = "https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain"

soup = BeautifulSoup(requests.get(URL).content, "html.parser")

# Find all `a` tags under each `td` that is the fifth of its type in its row, i.e. the fifth column
for tag in soup.select("td:nth-of-type(5) a"):
    print(BASE_URL + tag["href"])

Output:

https://fr.wikipedia.org/wiki/Canton_de_Bourg-en-Bresse-1
https://fr.wikipedia.org/wiki/Canton_de_Bourg-en-Bresse-2
https://fr.wikipedia.org/wiki/Canton_d%27Amb%C3%A9rieu-en-Bugey
https://fr.wikipedia.org/wiki/Canton_de_Villars-les-Dombes
https://fr.wikipedia.org/wiki/Canton_de_Belley
...
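
The selector above searches the whole page, so it would also pick up cells from any other table. As a possible refinement, the sketch below restricts the search to the first table with class wikitable (the class the question already uses) and builds absolute URLs with urljoin instead of string concatenation. The helper name column_links, its parameters, and SAMPLE_HTML are hypothetical and not part of the original answer.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def column_links(html, column, base_url, table_selector="table.wikitable"):
    """Return absolute URLs of the links in one 1-based <td> column
    of the first table matching table_selector (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one(table_selector)
    if table is None:
        return []
    # a[href] skips anchors that carry no href attribute
    anchors = table.select(f"td:nth-of-type({column}) a[href]")
    return [urljoin(base_url, a["href"]) for a in anchors]


# Made-up markup: the first table must be ignored, the second one is scraped.
SAMPLE_HTML = """
<table class="infobox"><tr><td><a href="/wiki/Ignored">ignored</a></td></tr></table>
<table class="wikitable">
  <tr><td><a href="/wiki/A">A</a></td><td><a href="/wiki/Canton_X">X</a></td></tr>
  <tr><td><a href="/wiki/B">B</a></td><td>no link</td></tr>
</table>
"""

print(column_links(SAMPLE_HTML, 2, "https://fr.wikipedia.org"))
```

With the real page, the html argument would be requests.get(URL).content and column would be 5, as in the example above.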

Answer 2

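If you want the first column, which contains the communes, you can also use the fact that it is left-aligned, in an attribute = value selector: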

commune_links = ['https://fr.wikipedia.org' + i['href'] for i in soup.select('[style="text-align:left;"] a')]
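
One caveat worth illustrating: [style="..."] matches only when the attribute value is exactly that string. The sketch below, on made-up markup (SAMPLE_HTML and the variable names are illustrative), compares it with the substring form [style*="..."], which may be more tolerant if a cell carries additional style declarations. Whether the real page needs this is an assumption that has to be checked against its HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical rows: the second cell has an extra style declaration.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><td style="text-align:left;"><a href="/wiki/A">A</a></td><td><a href="/wiki/Canton_X">X</a></td></tr>
  <tr><td style="text-align:left; font-weight:bold;"><a href="/wiki/B">B</a></td><td><a href="/wiki/Canton_Y">Y</a></td></tr>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact match: the style attribute must equal the given string character for character.
exact = [a["href"] for a in sample_soup.select('[style="text-align:left;"] a')]

# Substring match: the style attribute only has to contain the given text.
contains = [a["href"] for a in sample_soup.select('[style*="text-align:left"] a')]

print(exact)
print(contains)
```

The substring form is looser, so it can also match unrelated left-aligned elements; scoping it to the table first keeps the result tight.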
