Scrape a list of links found in a search result
I am trying to scrape search results from a library page. But since I want more than just the book title, I want the script to open each search result and scrape the detail page for more information.
So far I have the following:
import bs4 as bs
import urllib.request, urllib.error, urllib.parse
from http.cookiejar import CookieJar
from bs4 import Comment
cj = CookieJar()
basisurl = 'http://mz-villigst.cidoli.de/index.asp?stichwort=hans'
#just took any example page similar to the one i have in mind
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
p = opener.open(basisurl)
soup = bs.BeautifulSoup(p, 'html.parser')  # soup was never created in the original snippet

for mednrs in soup.find_all(string=lambda text: isinstance(text, Comment)):
    # and now when i do [0:] it gives me the medianumbers and i can create the links like this:
    links = 'http://mz-villigst.cidoli.de/index.asp?MEDIENNR=' + mednrs[10:17]
My main question now is: how do I get this to give me a list (like this: ["1", "2"] ...) that I can then iterate through?
Create a list and append to it inside the loop:
links = []
for mednrs in soup.find_all(string=lambda text: isinstance(text, Comment)):
    link = 'http://mz-villigst.cidoli.de/index.asp?MEDIENNR=' + mednrs[10:17]
    links.append(link)
Or use a list comprehension:
links = ['http://mz-villigst.cidoli.de/index.asp?MEDIENNR=' + mednrs[10:17]
         for mednrs in soup.find_all(string=lambda text: isinstance(text, Comment))]
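Once you have the list, you can loop over it and open each detail page with the same opener. Here is a minimal sketch of that step, assuming the detail pages can also be parsed with BeautifulSoup; what you pull out of each detail page is only a placeholder, since it depends on the actual page structure:

for link in links:
    detail_page = opener.open(link)
    detail_soup = bs.BeautifulSoup(detail_page, 'html.parser')
    # extract whatever fields you need from detail_soup here, e.g. (hypothetical):
    # title = detail_soup.find('h1')
    print(link)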