How to edit a link that is stored in a list

Question

import requests
import re
from bs4 import BeautifulSoup


def getHTMLdocument(url):
    response = requests.get(url)
    return response.text


url_to_scrape = 'https://www.parliament.gov.sg/about-us/structure/the-cabinet'
links = []

while True:

    html_document = getHTMLdocument(url_to_scrape)
    soup = BeautifulSoup(html_document, 'lxml')

    if soup.find_all('a', attrs={'href': re.compile("/details/")}) == []:
        break

    for link in soup.find_all('a', attrs={'href': re.compile("/details/")}):
        if link.get('href') not in links:
            links.append(link.get('href'))
            print(links)

This is my code so far, and it gives me the following list as output:

['/mps/current-list-of-mps/mp/details/lee-hsien-loong', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/heng-swee-keat', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/teo-chee-hean', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/tharman-shanmugaratnam', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/ng-eng-hen', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/vivian-balakrishnan', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/k-shanmugam', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/gan-kim-yong', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/s-iswaran', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/grace-fu-hai-yien', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/chan-chun-sing', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/lawrence-wong', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/masagos-zulkifli-bin-masagos-mohamad', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/ong-ye-kung', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/desmond-lee', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/josephine-teo', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/indranee-rajah', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/mohamad-maliki-bin-osman', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/edwin-tong-chun-fai', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/tan-see-leng']

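In the next part of my code I try to scrape data from each link, but the first link in the list is a relative path rather than a valid URL, so I cannot retrieve anything from it.

How can I edit it so that it matches the other URLs in the list?

Thanks a lot.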

Answer 1

Before adding the string to the list, you can check whether it is in the correct format, and correct it if needed, with the following code:

def correct_url(url):
    # Prepend the site root when the href is a relative path
    if not url.startswith('https://www.parliament.gov.sg'):
        url = f'https://www.parliament.gov.sg{url}'
    return url

The for loop, updated to use the new function:

for link in soup.find_all('a', attrs={'href': re.compile("/details/")}):
    # Correct the href first so the duplicate check compares absolute URLs
    url = correct_url(link.get('href'))
    if url not in links:
        links.append(url)
        print(links)
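
As an alternative sketch, the standard library's urllib.parse.urljoin can do the same normalization without a manual prefix check: it resolves a relative path against a base and returns an already absolute URL unchanged. The names BASE_URL and to_absolute below are hypothetical, and the site root is assumed from the URLs in the question. The example also shows how a list that has already been built can be fixed in place.

```python
from urllib.parse import urljoin

# Assumed site root, taken from the absolute URLs shown in the question
BASE_URL = 'https://www.parliament.gov.sg'


def to_absolute(href, base=BASE_URL):
    # Hypothetical helper: relative paths are resolved against base,
    # absolute URLs are returned as they are
    return urljoin(base, href)


# Two sample entries copied from the question's output
links = [
    '/mps/current-list-of-mps/mp/details/lee-hsien-loong',
    'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/heng-swee-keat',
]

# Rewrite every entry of the existing list in place
links[:] = [to_absolute(href) for href in links]
print(links)
```

Inside the scraping loop, to_absolute(link.get('href')) could take the place of correct_url(link.get('href')).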

python

url

list

beautifulsoup

web-scraping