Having trouble extracting text from inside scraped HTML tags using Beautiful Soup
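The code I use to scrape the content:
from urllib import request
from bs4 import BeautifulSoup

class Scraper(object):
    # contains methods to scrape data from curse
    @staticmethod
    def scrape(url):
        # send a browser-like User-Agent and return the raw page bytes
        req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        return request.urlopen(req).read()

    @staticmethod
    def lookup(page, tag, class_name):
        # return every <tag> element whose class matches class_name
        parsed = BeautifulSoup(page, "html.parser")
        return parsed.find_all(tag, class_=class_name)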
This returns a list containing entries similar to this:
<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>
I'm attempting to extract the text in between the href tags, in this example:
World Quest Tracker
How can I do this?
from bs4 import BeautifulSoup
html_doc = '<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>'
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find('a').text)
This prints:
World Quest Tracker
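Since lookup returns a list of tags rather than a single one, the same idea can be applied to each entry. The following is only a minimal sketch: the sample markup, the second list entry ("Example Addon" and its href) and the variable names are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup

# Sample markup standing in for the scraped page (hypothetical; the
# second entry is invented so the list has more than one item).
page = """
<ul>
  <li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>
  <li class="title"><h4><a href="/addons/wow/example-addon">Example Addon</a></h4></li>
</ul>
"""

parsed = BeautifulSoup(page, "html.parser")
items = parsed.find_all("li", class_="title")  # same call that lookup() makes

# Each item is a Tag, so it can be searched again for its <a> child.
# Entries without an <a> are skipped instead of raising AttributeError.
names = [item.find("a").get_text(strip=True) for item in items if item.find("a")]
print(names)
```

get_text(strip=True) is used here instead of .text only to trim any surrounding whitespace; for the markup in the question both give the same string.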
Try this.
from bs4 import BeautifulSoup
html='''
<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>
'''
soup = BeautifulSoup(html, "lxml")
for item in soup.select(".title"):
    print(item.text)
Result:
World Quest Tracker
"I'm attempting to extract the text in between the href tags"
If you actually want the text in the href attribute, rather than the text content wrapped by the <a></a> anchor (your wording is a bit unclear), use get('href'):
from bs4 import BeautifulSoup
html = '<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>'
soup = BeautifulSoup(html, 'lxml')
soup.find('a').get('href')
# returns '/addons/wow/world-quest-tracker'
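If both the link text and the href are needed, they can be read from the same tag in one pass. This is a rough sketch, not part of the answers above: the "li.title a" selector and the links dictionary are assumptions chosen for illustration, and html.parser is used so that lxml does not have to be installed.

```python
from bs4 import BeautifulSoup

html = '<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>'
soup = BeautifulSoup(html, "html.parser")

# Map each anchor's visible text to its href attribute.
# a.get("href") returns None when the attribute is missing.
links = {a.get_text(strip=True): a.get("href") for a in soup.select("li.title a")}
print(links)
```

Using get("href") rather than a["href"] avoids a KeyError on anchors that have no href attribute.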