BeautifulSoup: how to scrape multiple URLs that end differently
I want to scrape this dictionary for its different verbs. The verbs appear at 'https://www.spanishdict.com/conjugate/' plus the verb. For example, for the verb 'hacer' we have: https://www.spanishdict.com/conjugate/hacer
I want to scrape all the possible links that contain each verb's conjugation and return them as a list of strings. So I did the following:
import requests
from bs4 import BeautifulSoup
url = 'https://www.spanishdict.com/conjugate/'
for i in url:
    reqs = requests.get(url + str())
    soup = BeautifulSoup(reqs.text, 'html.parser')
    urls = []
    for link in soup.find_all('a'):
        urls.append(link.get('href'))
    print(urls)
But when I print urls, I only get a few empty lists.
Sample of the expected output:
['https://www.spanishdict.com/conjugate/hacer', 'https://www.spanishdict.com/conjugate/tener',...etc]
When you loop over 'url', you are iterating over a string. Look at this code:
url = 'https://www.spanishdict.com/conjugate/'
for i in url:
    print(i)
This prints each character of the URL:
h
t
t
p
s
:
/
/
w
w
w
<truncated>
You are also doing something wrong here:
reqs = requests.get(url + str())
I'm not sure what you were trying to do, but 'url + str()' is just the URL plus an empty string, i.e. the URL.
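As a minimal sketch of the point above: str() with no arguments is the empty string, and what you most likely wanted was to loop over a list of verbs rather than over the URL itself. The verbs list here is a made-up example, not something taken from the site.

```python
base_url = 'https://www.spanishdict.com/conjugate/'

# str() called with no arguments returns '', so the URL is left unchanged.
unchanged = base_url + str()

# Hypothetical list of verbs, for illustration only.
verbs = ['hacer', 'tener', 'ser']

# Iterate over the verbs, not over the characters of the URL string.
urls = [base_url + verb for verb in verbs]
print(urls)
```

This only builds the strings; where the list of verbs comes from is a separate problem.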
If you remove the for loop and the unnecessary empty string, you get what I think you are after:
import requests
from bs4 import BeautifulSoup

url = 'https://www.spanishdict.com/conjugate/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')

# Collect the href of every <a> tag on the page
urls = []
for link in soup.find_all('a'):
    urls.append(link.get('href'))
print(urls)
This produces:
['/', '/learn', '/translation', '/conjugation', '/vocabulary', '#', '/translation', '/conjugation', '/vocabulary', '/guide', '/pronunciation', '/wordoftheday', '/learn', '/guide/spanish-present-tense-forms', '/guide/spanish-present-progressive-forms', '/guide/spanish-preterite-tense-forms', '/guide/spanish-imperfect-tense-forms', '/guide/simple-future-regular-forms-and-tenses', '/guide/spanish-present-subjunctive', '/guide/commands', '/guide/spanish-imperfect-subjunctive', '/guide', '/drill?drill_start_source=conjugation%20hubpage', 'https://play.google.com/store/apps/details?id=com.spanishdict.spanishdict&referrer=utm_campaign%3Dadhesion', '/wordoftheday', '/translate/patinar', '/', 'https://www.ingles.com/verbos', 'https://www.curiositymedia.com/', 'https://help.spanishdict.com/', '/company/privacy', '/company/tos', '/sitemap', '/', 'https://www.ingles.com/verbos', '/translation', '/conjugation', '/vocabulary', '/learn', '/guide', '/wordoftheday', 'https://www.curiositymedia.com/', '/company/privacy', '/company/tos', '/sitemap', 'https://help.spanishdict.com/', 'https://help.spanishdict.com/contact', 'https://www.facebook.com/pages/SpanishDict/92805940179', 'https://twitter.com/spanishdict', 'https://www.instagram.com/spanishdict/', 'https://itunes.apple.com/us/app/spanishdict/id332510494', 'https://play.google.com/store/apps/details?id=com.spanishdict.spanishdict&referrer=utm_source%3Dsd-footer']
Is this list of links what you want?
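Note that most of the hrefs in that list are relative. If you need absolute URLs, one possible approach (a sketch, not necessarily what you are after) is urllib.parse.urljoin from the standard library. The hrefs below are a few entries copied from the output above.

```python
from urllib.parse import urljoin

base_url = 'https://www.spanishdict.com/conjugate/'

# A few hrefs taken from the scraped output above.
hrefs = ['/', '/learn', '/translate/patinar', 'https://www.ingles.com/verbos']

# urljoin resolves relative hrefs against the base and leaves absolute ones alone.
absolute = [urljoin(base_url, href) for href in hrefs]
print(absolute)
```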
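Edit
I hope I understand what you mean; if so, the question should be improved. To get the information out of the JavaScript, you can parse the response with a regular expression: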
import json
import re
import requests

r = requests.get('https://www.spanishdict.com/conjugation')
# The page source contains a JSON object assigned to window.SD_COMPONENT_DATA
m = re.search(r'window\.SD_COMPONENT_DATA = (\{.*\})', r.text)
data = json.loads(m.group(1))
urls = ['https://www.spanishdict.com/conjugate/' + w for x in data['searchQuickLinkSections'] for w in x['words']]
print(urls)
Output
['https://www.spanishdict.com/conjugate/tener',
'https://www.spanishdict.com/conjugate/hacer',
'https://www.spanishdict.com/conjugate/ser',
'https://www.spanishdict.com/conjugate/estar',
'https://www.spanishdict.com/conjugate/haber',
'https://www.spanishdict.com/conjugate/ir',
'https://www.spanishdict.com/conjugate/poder',
'https://www.spanishdict.com/conjugate/decir',
'https://www.spanishdict.com/conjugate/cerrar',
'https://www.spanishdict.com/conjugate/mentir',
'https://www.spanishdict.com/conjugate/dormir',
'https://www.spanishdict.com/conjugate/recordar',
'https://www.spanishdict.com/conjugate/seguir',
'https://www.spanishdict.com/conjugate/medir',
'https://www.spanishdict.com/conjugate/adquirir',
'https://www.spanishdict.com/conjugate/jugar',
'https://www.spanishdict.com/conjugate/vestirse',
'https://www.spanishdict.com/conjugate/divertirse',
'https://www.spanishdict.com/conjugate/acostarse',
'https://www.spanishdict.com/conjugate/ponerse',
'https://www.spanishdict.com/conjugate/despertarse',
'https://www.spanishdict.com/conjugate/sentirse',
'https://www.spanishdict.com/conjugate/levantarse',
'https://www.spanishdict.com/conjugate/sentarse',
'https://www.spanishdict.com/conjugate/gustar',
'https://www.spanishdict.com/conjugate/alegrar',
'https://www.spanishdict.com/conjugate/quedar',
'https://www.spanishdict.com/conjugate/encantar',
'https://www.spanishdict.com/conjugate/parecer',
'https://www.spanishdict.com/conjugate/faltar',
'https://www.spanishdict.com/conjugate/doler',
'https://www.spanishdict.com/conjugate/interesar']
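To try the regex approach without hitting the site, here is a small offline sketch. The html string is a made-up stand-in that only mimics the window.SD_COMPONENT_DATA assignment and the two keys used above; the real page may look different or change over time, so the sketch also checks for a missing match.

```python
import json
import re

# Made-up stand-in for the page source (assumption, not the real page).
html = '''<script>
window.SD_COMPONENT_DATA = {"searchQuickLinkSections": [{"words": ["tener", "hacer"]}, {"words": ["ser"]}]};
</script>'''

m = re.search(r'window\.SD_COMPONENT_DATA = (\{.*\})', html)
if m is None:
    # re.search returns None when the pattern is not found.
    raise ValueError('SD_COMPONENT_DATA not found in the page source')

data = json.loads(m.group(1))
base_url = 'https://www.spanishdict.com/conjugate/'
urls = [base_url + word
        for section in data['searchQuickLinkSections']
        for word in section['words']]
print(urls)
```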
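To get your expected output, you need a list of verbs. Your question does not give a source for them, so as a starting point I used the list verbs-top-500 and a list comprehension. For every <a> whose href contains translate, it concatenates your url with the verb text from the <div> inside that <a>: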
['https://www.spanishdict.com/conjugate/'+a.div.text for a in soup.select('a[href*="translate"]')]
Example
import requests
from bs4 import BeautifulSoup

url = 'https://www.spanishdict.com/lists/1690101/verbs-top-500'
# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
urls = ['https://www.spanishdict.com/conjugate/' + a.div.text for a in soup.select('a[href*="translate/"]')]
print(urls)
Output
['https://www.spanishdict.com/conjugate/procurar', 'https://www.spanishdict.com/conjugate/podar', 'https://www.spanishdict.com/conjugate/pillar', 'https://www.spanishdict.com/conjugate/perrear', 'https://www.spanishdict.com/conjugate/perfeccionar', 'https://www.spanishdict.com/conjugate/perdonar', 'https://www.spanishdict.com/conjugate/pegar', 'https://www.spanishdict.com/conjugate/pasear', 'https://www.spanishdict.com/conjugate/ordenar', 'https://www.spanishdict.com/conjugate/ondear', 'https://www.spanishdict.com/conjugate/ojalar', 'https://www.spanishdict.com/conjugate/ocultar', 'https://www.spanishdict.com/conjugate/nombrar',...]
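The selector can also be tried on a small snippet without any request. The markup below is invented for illustration and is much simpler than the real list page; it uses the built-in 'html.parser' so lxml is not needed. Skipping anchors without a <div> is an extra guard I added, since a.div is None in that case.

```python
from bs4 import BeautifulSoup

# Invented markup, only to exercise the selector (assumption, not the real page).
html = '''
<a href="/translate/hacer"><div>hacer</div></a>
<a href="/translate/tener"><div>tener</div></a>
<a href="/learn"><div>Learn</div></a>
<a href="/translate/ser">ser</a>
'''

soup = BeautifulSoup(html, 'html.parser')
base_url = 'https://www.spanishdict.com/conjugate/'

# Keep only <a> tags whose href contains "translate/" and that contain a <div>.
urls = [base_url + a.div.get_text(strip=True)
        for a in soup.select('a[href*="translate/"]')
        if a.div is not None]
print(urls)
```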