How to edit a link that is stored in a list
import re

import requests
from bs4 import BeautifulSoup

def getHTMLdocument(url):
    # Download the page and return its HTML as text
    response = requests.get(url)
    return response.text

url_to_scrape = 'https://www.parliament.gov.sg/about-us/structure/the-cabinet'
links = []

while True:
    html_document = getHTMLdocument(url_to_scrape)
    soup = BeautifulSoup(html_document, 'lxml')
    # Stop if the page has no links whose href contains "/details/"
    if soup.find_all('a', attrs={'href': re.compile("/details/")}) == []:
        break
    for link in soup.find_all('a', attrs={'href': re.compile("/details/")}):
        if link.get('href') not in links:
            links.append(link.get('href'))
    # url_to_scrape never changes, so leave the loop after one pass
    break

print(links)
Currently, this is my code, and it gives me the following list as output:
['/mps/current-list-of-mps/mp/details/lee-hsien-loong', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/heng-swee-keat', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/teo-chee-hean', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/tharman-shanmugaratnam', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/ng-eng-hen', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/vivian-balakrishnan', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/k-shanmugam', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/gan-kim-yong', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/s-iswaran', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/grace-fu-hai-yien', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/chan-chun-sing', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/lawrence-wong', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/masagos-zulkifli-bin-masagos-mohamad', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/ong-ye-kung', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/desmond-lee', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/josephine-teo', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/indranee-rajah', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/mohamad-maliki-bin-osman', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/edwin-tong-chun-fai', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/tan-see-leng']
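In the next part of my code, I try to scrape data from each link. However, the first link in the list is a relative path rather than a full URL, so I cannot fetch anything from it.
How can I edit it so that it has the same form as the other URLs in the list?
Thank you very much.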
Before adding the string to the list, you can use the following code to check whether it is correctly formatted and fix it if needed:
def correct_url(url):
    # Prefix site-relative paths with the domain; leave full URLs unchanged
    if not url.startswith('https://www.parliament.gov.sg'):
        url = f'https://www.parliament.gov.sg{url}'
    return url
The for loop, updated to use the new function:
for link in soup.find_all('a', attrs={'href': re.compile("/details/")}):
    # Correct the URL first, so the duplicate check compares full URLs
    url = correct_url(link.get('href'))
    if url not in links:
        links.append(url)
print(links)
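As an alternative to the string check above, the standard library's `urllib.parse.urljoin` can resolve a relative href against a base URL and returns an already-absolute href unchanged. The sketch below is only an illustration under assumptions: `BASE_URL` is taken from the URLs in the question, and `normalize_links` and `sample` are hypothetical names that do not appear in the original code. It runs on a hard-coded sample instead of fetching the page.

```python
from urllib.parse import urljoin

# Assumed site root, taken from the URLs shown in the question
BASE_URL = 'https://www.parliament.gov.sg'


def normalize_links(hrefs, base=BASE_URL):
    """Return absolute URLs in first-seen order, without duplicates."""
    result = []
    for href in hrefs:
        # urljoin leaves absolute URLs untouched and resolves relative paths
        absolute = urljoin(base, href)
        if absolute not in result:
            result.append(absolute)
    return result


# Hypothetical sample: one relative href (listed twice) and one absolute href
sample = [
    '/mps/current-list-of-mps/mp/details/lee-hsien-loong',
    'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/heng-swee-keat',
    '/mps/current-list-of-mps/mp/details/lee-hsien-loong',
]

print(normalize_links(sample))
```

In the scraping loop, the same idea would amount to calling `urljoin(BASE_URL, link.get('href'))` in place of `correct_url(link.get('href'))`. Because the duplicate check runs on the resolved URL, a relative and an absolute form of the same link are only stored once.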