BeautifulSoup: how to scrape multiple URLs that end differently

I want to scrape this dictionary for its different verbs. Each verb's page is at 'https://www.spanishdict.com/conjugate/' plus the verb. So, for example, for the verb 'hacer' we have: https://www.spanishdict.com/conjugate/hacer

I want to scrape all the possible links that contain a verb's conjugation and return them as a list of strings. So I did the following:

import requests
from bs4 import BeautifulSoup
url = 'https://www.spanishdict.com/conjugate/' 

for i in url:
    reqs = requests.get(url + str())
    soup = BeautifulSoup(reqs.text, 'html.parser')

    urls = []
    for link in soup.find_all('a'):
        urls.append(link.get('href'))

    print(urls)

But when I print urls, I only get several empty lists.

Sample of the expected output:

['https://www.spanishdict.com/conjugate/hacer', 'https://www.spanishdict.com/conjugate/tener',...etc]

When you loop over `url`, you are iterating over a string, one character at a time. Look at this code:

url = 'https://www.spanishdict.com/conjugate/' 

for i in url:
    print(i)

This prints each character of the URL:

h
t
t
p
s
:
/
/
w
w
w
<truncated>

This line is also wrong:

reqs = requests.get(url + str())

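I'm not sure what you are trying to do, but `url + str()` is just the URL plus an empty string, i.e. the URL itself.

If you remove the for loop and the unnecessary empty string, you get what I think you were after: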
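As a quick check of the point above, `str()` called with no arguments evaluates to the empty string. If the intent was to request one page per verb, one option is to loop over a list of verbs rather than over the URL string. The following is only a minimal sketch: the `verbs` list and the `build_conjugation_urls` helper are hypothetical names introduced for illustration, the percent-encoding of accented verbs is an assumption about what the site accepts, and no network request is made.

```python
from urllib.parse import quote

BASE_URL = 'https://www.spanishdict.com/conjugate/'


def build_conjugation_urls(verbs, base_url=BASE_URL):
    """Return one URL per verb by appending the verb to base_url."""
    # quote() percent-encodes non-ASCII characters such as accented vowels.
    return [base_url + quote(verb) for verb in verbs]


# str() with no arguments is the empty string, so url + str() == url.
assert BASE_URL + str() == BASE_URL

# Hypothetical sample list; the question does not say where the verbs come from.
verbs = ['hacer', 'tener', 'reír']
conjugation_urls = build_conjugation_urls(verbs)
print(conjugation_urls)
```

Each entry of `conjugation_urls` could then be passed to `requests.get` inside a loop.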

import requests
from bs4 import BeautifulSoup
url = 'https://www.spanishdict.com/conjugate/' 

reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    urls.append(link.get('href'))

print(urls)

This produces:

['/', '/learn', '/translation', '/conjugation', '/vocabulary', '#', '/translation', '/conjugation', '/vocabulary', '/guide', '/pronunciation', '/wordoftheday', '/learn', '/guide/spanish-present-tense-forms', '/guide/spanish-present-progressive-forms', '/guide/spanish-preterite-tense-forms', '/guide/spanish-imperfect-tense-forms', '/guide/simple-future-regular-forms-and-tenses', '/guide/spanish-present-subjunctive', '/guide/commands', '/guide/spanish-imperfect-subjunctive', '/guide', '/drill?drill_start_source=conjugation%20hubpage', 'https://play.google.com/store/apps/details?id=com.spanishdict.spanishdict&referrer=utm_campaign%3Dadhesion', '/wordoftheday', '/translate/patinar', '/', 'https://www.ingles.com/verbos', 'https://www.curiositymedia.com/', 'https://help.spanishdict.com/', '/company/privacy', '/company/tos', '/sitemap', '/', 'https://www.ingles.com/verbos', '/translation', '/conjugation', '/vocabulary', '/learn', '/guide', '/wordoftheday', 'https://www.curiositymedia.com/', '/company/privacy', '/company/tos', '/sitemap', 'https://help.spanishdict.com/', 'https://help.spanishdict.com/contact', 'https://www.facebook.com/pages/SpanishDict/92805940179', 'https://twitter.com/spanishdict', 'https://www.instagram.com/spanishdict/', 'https://itunes.apple.com/us/app/spanishdict/id332510494', 'https://play.google.com/store/apps/details?id=com.spanishdict.spanishdict&referrer=utm_source%3Dsd-footer']

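Is this list of links what you wanted?

EDIT

I hope I have understood what you mean; if so, the question should be improved. To get the information out of the JavaScript, you can parse the response with a regular expression: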
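Most of the hrefs in that list are relative paths, and `link.get('href')` returns `None` for an `<a>` tag that has no href attribute. If absolute URLs are needed, one possible approach is `urllib.parse.urljoin`. This is a sketch under assumptions: `absolute_links` is a hypothetical helper, and the sample hrefs are copied from the output above with a `None` added for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.spanishdict.com/conjugate/'


def absolute_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url, skipping missing and fragment-only ones."""
    result = []
    for href in hrefs:
        # Skip <a> tags without an href and in-page anchors such as '#'.
        if href is None or href.startswith('#'):
            continue
        result.append(urljoin(base_url, href))
    return result


# A few hrefs from the output above, plus None for an <a> without an href.
sample = ['/', '/learn', '#', None, '/translate/patinar', 'https://help.spanishdict.com/']
resolved = absolute_links(sample)
print(resolved)
```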

import requests
import json
import re

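r = requests.get('https://www.spanishdict.com/conjugation')
# The page embeds its data as a JSON object assigned to window.SD_COMPONENT_DATA.
m = re.search(r'window\.SD_COMPONENT_DATA = ({.*})', r.text)
# Build one conjugation URL per word in every quick-link section.
['https://www.spanishdict.com/conjugate/'+w for x in json.loads(m.group(1))['searchQuickLinkSections'] for w in x['words']]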

Output

['https://www.spanishdict.com/conjugate/tener',
 'https://www.spanishdict.com/conjugate/hacer',
 'https://www.spanishdict.com/conjugate/ser',
 'https://www.spanishdict.com/conjugate/estar',
 'https://www.spanishdict.com/conjugate/haber',
 'https://www.spanishdict.com/conjugate/ir',
 'https://www.spanishdict.com/conjugate/poder',
 'https://www.spanishdict.com/conjugate/decir',
 'https://www.spanishdict.com/conjugate/cerrar',
 'https://www.spanishdict.com/conjugate/mentir',
 'https://www.spanishdict.com/conjugate/dormir',
 'https://www.spanishdict.com/conjugate/recordar',
 'https://www.spanishdict.com/conjugate/seguir',
 'https://www.spanishdict.com/conjugate/medir',
 'https://www.spanishdict.com/conjugate/adquirir',
 'https://www.spanishdict.com/conjugate/jugar',
 'https://www.spanishdict.com/conjugate/vestirse',
 'https://www.spanishdict.com/conjugate/divertirse',
 'https://www.spanishdict.com/conjugate/acostarse',
 'https://www.spanishdict.com/conjugate/ponerse',
 'https://www.spanishdict.com/conjugate/despertarse',
 'https://www.spanishdict.com/conjugate/sentirse',
 'https://www.spanishdict.com/conjugate/levantarse',
 'https://www.spanishdict.com/conjugate/sentarse',
 'https://www.spanishdict.com/conjugate/gustar',
 'https://www.spanishdict.com/conjugate/alegrar',
 'https://www.spanishdict.com/conjugate/quedar',
 'https://www.spanishdict.com/conjugate/encantar',
 'https://www.spanishdict.com/conjugate/parecer',
 'https://www.spanishdict.com/conjugate/faltar',
 'https://www.spanishdict.com/conjugate/doler',
 'https://www.spanishdict.com/conjugate/interesar']
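The greedy `{.*}` pattern works only as long as the JSON object sits on one line and no later `}` on that line belongs to something else. A possible alternative, swapped in here and not part of the answer above, is to locate the assignment with a regex and let `json.JSONDecoder.raw_decode` parse exactly one JSON value from that position. The sketch below runs on a hypothetical page fragment (`sample_html`) instead of the live site, and `extract_component_data` is an illustrative name.

```python
import json
import re


def extract_component_data(html, marker='window.SD_COMPONENT_DATA'):
    """Return the JSON object assigned to `marker` in html, or None if absent."""
    match = re.search(re.escape(marker) + r'\s*=\s*', html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (for example a trailing semicolon).
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


# Hypothetical page fragment; the real payload is assumed to have more keys.
sample_html = (
    '<script>window.SD_COMPONENT_DATA = '
    '{"searchQuickLinkSections": [{"words": ["tener", "hacer"]}, {"words": ["ser"]}]};'
    '</script>'
)

data = extract_component_data(sample_html)
verb_urls = [
    'https://www.spanishdict.com/conjugate/' + word
    for section in data['searchQuickLinkSections']
    for word in section['words']
]
print(verb_urls)
```

On a real response, `r.text` would take the place of `sample_html`.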

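To get your expected output, you first need a list of verbs. Your question does not give a source for one, so as a starting point I used the list verbs-top-500 together with a list comprehension.

For every <a> whose href contains translate, it concatenates your URL with the verb text found in the <div> that is a direct child of that <a>: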

['https://www.spanishdict.com/conjugate/'+a.div.text for a in soup.select('a[href*="translate"]')]

Example

import requests
from bs4 import BeautifulSoup

url = 'https://www.spanishdict.com/lists/1690101/verbs-top-500'
# Send a browser-like User-Agent header with the request.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

# Build one conjugation URL per link whose href contains "translate/".
urls = ['https://www.spanishdict.com/conjugate/' + a.div.text for a in soup.select('a[href*="translate/"]')]
print(urls)

Output

['https://www.spanishdict.com/conjugate/procurar', 'https://www.spanishdict.com/conjugate/podar', 'https://www.spanishdict.com/conjugate/pillar', 'https://www.spanishdict.com/conjugate/perrear', 'https://www.spanishdict.com/conjugate/perfeccionar', 'https://www.spanishdict.com/conjugate/perdonar', 'https://www.spanishdict.com/conjugate/pegar', 'https://www.spanishdict.com/conjugate/pasear', 'https://www.spanishdict.com/conjugate/ordenar', 'https://www.spanishdict.com/conjugate/ondear', 'https://www.spanishdict.com/conjugate/ojalar', 'https://www.spanishdict.com/conjugate/ocultar', 'https://www.spanishdict.com/conjugate/nombrar',...]
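To see what the selector does without hitting the site, here is a small offline sketch. The markup in `sample_html` is hypothetical and only modelled on the structure described above (an `<a>` with a child `<div>` holding the verb); the real page may differ. It also guards against links that match the selector but have no `<div>`, a case in which the list comprehension above would raise an `AttributeError`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the first two links have the assumed structure.
sample_html = '''
<a href="/translate/procurar"><div>procurar</div></a>
<a href="/translate/podar"><div>podar</div></a>
<a href="/learn">Learn</a>
<a href="/translate/pillar">pillar</a>
'''

soup = BeautifulSoup(sample_html, 'html.parser')

urls = []
for a in soup.select('a[href*="translate/"]'):
    # a.div is None when the link has no <div> inside it, so skip those links.
    if a.div is None:
        continue
    urls.append('https://www.spanishdict.com/conjugate/' + a.div.get_text(strip=True))

print(urls)
```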