How do you get links containing certain words using BeautifulSoup?
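This code gets the links from an HTML page, but I want it to give me only the links that contain certain words.
For example, only links whose URL contains the word, such as "www.mywebsite.com/word".
My code: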
import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.mywebsite.com')

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        print link['href']
You can use the in operator for a simple substring test. The example below prints only the links whose href contains "/website-builder".
    if '/website-builder' in link['href']:
        print link['href']
Full code:
import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.mywebsite.com')

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        if '/website-builder' in link['href']:
            print link['href']
Sample output:
/website-builder?linkOrigin=website-builder&linkId=hd.mainnav.mywebsite
/website-builder?linkOrigin=website-builder&linkId=hd.subnav.mywebsite.mywebsite
/website-builder?linkOrigin=website-builder&linkId=hd.subnav.hosting.mywebsite
/website-builder?linkOrigin=website-builder&linkId=ct.btn.stickynavigation.easy-to-use#easy-to-use
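The snippets above are Python 2 with BeautifulSoup 3. If you just want to try the same substring filter without installing anything, here is a rough Python 3 sketch that swaps BeautifulSoup for the standard library's html.parser. That swap is my substitution, not what the answer uses, and the names HrefCollector, links_containing and SAMPLE_HTML are made up for illustration. The sample HTML stands in for the HTTP response body.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags that contain a given substring."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        href = dict(attrs).get("href")
        # Same test as in the answer: plain substring search on the href.
        if href and self.needle in href:
            self.matches.append(href)


def links_containing(html, needle):
    """Return every <a href> in html whose value contains needle."""
    collector = HrefCollector(needle)
    collector.feed(html)
    collector.close()
    return collector.matches


# Hypothetical page content, used instead of a real HTTP request.
SAMPLE_HTML = """
<a href="/website-builder?linkId=hd.mainnav">Builder</a>
<a href="/hosting">Hosting</a>
<a name="top">No href here</a>
<a href="/help/website-builder#easy-to-use">Help</a>
"""

if __name__ == "__main__":
    for href in links_containing(SAMPLE_HTML, "/website-builder"):
        print(href)
```

Note that the test matches the substring anywhere in the href, so "/help/website-builder#easy-to-use" is kept too. Anchors without an href are skipped, which is what the has_key('href') check does in the original code.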
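Here is what I came up with:
# Test the href string itself; link.find("word") would search for a child <word> tag instead.
links = [link for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a'))
         if link.has_key('href') and 'word' in link['href']]
print links
Of course, replace "word" with whatever word you want to filter by.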
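Since the question asks about "certain words" (plural), the same idea extends to a list of words once the href strings have been pulled out. This is only a sketch in Python 3: filter_hrefs, its match_all flag and the sample hrefs are hypothetical, and the match is a case-sensitive substring test just like the in check above.

```python
def filter_hrefs(hrefs, words, match_all=False):
    """Return the hrefs that contain any of the words (or all, if match_all)."""
    combine = all if match_all else any
    return [h for h in hrefs if combine(w in h for w in words)]


# Hypothetical hrefs, as they might come out of the link loop.
hrefs = [
    "/website-builder?linkId=hd.mainnav",
    "/hosting/plans",
    "/domains",
    "/website-builder/hosting",
]

if __name__ == "__main__":
    # Links that mention either word.
    print(filter_hrefs(hrefs, ["hosting", "domains"]))
    # Links that mention both words.
    print(filter_hrefs(hrefs, ["website-builder", "hosting"], match_all=True))
```

Keeping the filter separate from the parsing means it works the same whichever parser produced the hrefs.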