How to extract html links with a matching word from a website using python
I have a URL, say http://www.bbc.com/news/world/asia/. From just this page I want to extract all the links that contain India, INDIA or india (the match should be case-insensitive).
If I click on any of the output links, it should take me to the corresponding page. For example, two such India lines are 'India shock over Dhoni retirement' and 'India fog continues to cause chaos'; if I click these links, I should be redirected to http://www.bbc.com/news/world-asia-india-30640436
and http://www.bbc.com/news/world-asia-india-30630274
respectively.
from bs4 import BeautifulSoup, SoupStrainer
import re
import requests
url = "http://www.bbc.com/news/world/asia/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
only_links = SoupStrainer('a', href=re.compile('india'))
print(only_links)
I have written this very basic minimal code in Python 3.4.2.
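For reference, a runnable sketch of that SoupStrainer idea (the strainer is actually applied through BeautifulSoup's parse_only argument and the regex made case-insensitive). Note that it filters on the href value only, so links whose URL does not contain india are missed even when the displayed text does:
import re
import requests
from bs4 import BeautifulSoup, SoupStrainer

url = "http://www.bbc.com/news/world/asia/"
r = requests.get(url)
# Parse only the <a> tags whose href matches the regex (case-insensitive).
strainer = SoupStrainer('a', href=re.compile('india', re.I))
soup = BeautifulSoup(r.content, parse_only=strainer)
for link in soup.find_all('a'):
    print(link['href'])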
You need to search for the word india in the displayed text. For that you need a custom function:
from bs4 import BeautifulSoup
import requests
url = "http://www.bbc.com/news/world/asia/"
r = requests.get(url)
soup = BeautifulSoup(r.content)
india_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                           'href' in tag.attrs and
                           'india' in tag.get_text().lower())
results = soup.find_all(india_links)
The india_links lambda finds all <a> tags that have an href attribute and whose displayed text contains india (case-insensitively).
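Written as a named function (a direct, equivalent rewrite of the lambda, shown here as a sketch), the same filter may read a little more clearly:
def india_links(tag):
    # Match <a> tags that have an href and whose visible text mentions "india".
    return (tag.name == 'a' and
            tag.has_attr('href') and
            'india' in tag.get_text().lower())

# Used exactly as before:
results = soup.find_all(india_links)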
Note that I used the .content attribute of the requests response object; leave the decoding to BeautifulSoup!
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> url = "http://www.bbc.com/news/world/asia/"
>>> r = requests.get(url)
>>> soup = BeautifulSoup(r.content)
>>> india_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'href' in tag.attrs and 'india' in tag.get_text().lower()
>>> results = soup.find_all(india_links)
>>> from pprint import pprint
>>> pprint(results)
[<a href="/news/world/asia/india/">India</a>,
<a class="story" href="/news/world-asia-india-30647504" rel="published-1420102077277">India scheme to monitor toilet use </a>,
<a class="story" href="/news/world-asia-india-30640444" rel="published-1420022868334">India to scrap tax breaks on cars</a>,
<a class="story" href="/news/world-asia-india-30640436" rel="published-1420012598505">India shock over Dhoni retirement</a>,
<a href="/news/world/asia/india/">India</a>,
<a class="headline-anchor" href="/news/world-asia-india-30630274" rel="published-1419931669523"><img alt="A Delhi police officer with red flag walks amidst morning fog in Delhi, India, Monday, Dec 29, 2014. " src="http://news.bbcimg.co.uk/media/images/79979000/jpg/_79979280_79979240.jpg"/><span class="headline heading-13">India fog continues to cause chaos</span></a>,
<a class="headline-anchor" href="/news/world-asia-india-30632852" rel="published-1419940599384"><span class="headline heading-13">Court boost to India BJP chief</span></a>,
<a class="headline-anchor" href="/sport/0/cricket/30632182" rel="published-1419930930045"><span class="headline heading-13">India captain Dhoni quits Tests</span></a>,
<a class="story" href="http://www.bbc.co.uk/news/world-radio-and-tv-15386555" rel="published-1392018507550"><img alt="A woman riding a scooter waits for a traffic signal along a street in Mumbai February 5, 2014." src="http://news.bbcimg.co.uk/media/images/72866000/jpg/_72866856_020889093.jpg"/>Special report: India Direct</a>,
<a href="/2/hi/south_asia/country_profiles/1154019.stm">India</a>]
Note the http://www.bbc.co.uk/news/world-radio-and-tv-15386555
link here; we have to search with the lambda because a search with a text regular expression would not find that element: the contained text (Special report: India Direct) is not the only element in the tag, so it would not be found.
A similar problem applies to the /news/world-asia-india-30632852
link; the nested <span>
element means the 'Court boost to India BJP chief' headline text is not a direct child of the link tag.
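A quick sketch (assuming the same soup as above) of what a text regular-expression search would return instead, illustrating the point:
import re
# Matches only <a> tags whose sole content is a string containing "india";
# links like "Special report: India Direct" (string mixed with an <img>) and
# "Court boost to India BJP chief" (string nested in a <span>) are not returned.
text_matches = soup.find_all('a', text=re.compile('india', re.I))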
You can extract just the links:
from urllib.parse import urljoin
result_links = [urljoin(url, tag['href']) for tag in results]
where all relative URLs are resolved relative to the original URL:
>>> from urllib.parse import urljoin
>>> result_links = [urljoin(url, tag['href']) for tag in results]
>>> pprint(result_links)
['http://www.bbc.com/news/world/asia/india/',
'http://www.bbc.com/news/world-asia-india-30647504',
'http://www.bbc.com/news/world-asia-india-30640444',
'http://www.bbc.com/news/world-asia-india-30640436',
'http://www.bbc.com/news/world/asia/india/',
'http://www.bbc.com/news/world-asia-india-30630274',
'http://www.bbc.com/news/world-asia-india-30632852',
'http://www.bbc.com/sport/0/cricket/30632182',
'http://www.bbc.co.uk/news/world-radio-and-tv-15386555',
'http://www.bbc.com/2/hi/south_asia/country_profiles/1154019.stm']
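The India section link appears twice in that list; if duplicates are unwanted, a small sketch that removes repeats while keeping the original order (OrderedDict is used because the question targets Python 3.4, where plain dicts are unordered):
from collections import OrderedDict

# Keep the first occurrence of each URL, preserving order.
unique_links = list(OrderedDict.fromkeys(result_links))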