How do I extract the title of this webpage in Python 2.7 using bs4?

Question

I am currently using the code below:

import urllib
from bs4 import BeautifulSoup  
import codecs 

file_obj = open("E:\Sport_Cricket.txt", "r+")    
links_str = file_obj.readlines()    

c=1    
for j in links_str:
 url=j.rstrip('\n')  
 if(url.endswith("ece")):
    htmltext = urllib.urlopen(url).read()
    soup = BeautifulSoup(htmltext,"lxml")

    #title
    webpage_title = soup.find_all("h1", attrs = {"class": "title"}) 
    webpage_title = webpage_title[0].get_text(strip=True) 
    with codecs.open("E:\Corpus\Sport\Cricket\text"+str(c)+".txt", "w+", encoding="utf-8") as f:
     f.writelines(webpage_title+"\r\n")

    c=c+1

Sport_Cricket.txt contains:

http://www.thehindu.com/sport/cricket/unadkat-does-the-trick-for-pune/article18401543.ece
http://www.thehindu.com/sport/cricket/live-updates-delhi-daredevils-versus-mumbai-indians/article18400821.ece
http://www.thehindu.com/sport/cricket/old-guard-wants-pull-out-coa-warns-of-consequences/article18400811.ece
http://www.thehindu.com/sport/cricket/the-rise-of-sandeep-sharma/article18400700.ece
http://www.thehindu.com/sport/cricket/axar-has-found-his-mojo/article18400258.ece

I get the following error:

Traceback (most recent call last):
 File "C:\Users\PJM\working_extractor_sorted_complete.py", line 31, in <module>
webpage_title = webpage_title[0].get_text(strip=True)
IndexError: list index out of range

Is there any alternative to webpage_title = webpage_title[0].text(strip=True)?

Answer 1

Pankaj,

Use this instead to get the title with BeautifulSoup:

webpage_title = soup.title.string

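This returns the text of the first <title> element found anywhere in the HTML document. Note that it reads the page's <title> tag rather than the <h1 class="title"> headline, so the text may differ from the headline, and soup.title is None when the page has no <title> at all.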
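The IndexError in the question happens when find_all returns an empty list, that is, when the page has no <h1 class="title">. A minimal sketch of a guarded lookup follows. It assumes you want the headline when it exists and the <title> tag otherwise. The helper name extract_title and the sample HTML are hypothetical, and "html.parser" is used here only so the example runs without lxml installed.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the page title as text, or None if nothing usable is found.

    Hypothetical helper for illustration; not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")

    # find() returns None when there is no match, so it can be checked
    # directly instead of indexing into a possibly empty find_all() list.
    heading = soup.find("h1", attrs={"class": "title"})
    if heading is not None:
        return heading.get_text(strip=True)

    # Fall back to the document's <title> tag, which may also be absent.
    if soup.title is not None:
        text = soup.title.get_text(strip=True)
        if text:
            return text

    return None


# Made-up sample pages, not taken from the site in the question.
page_with_h1 = (
    '<html><head><title>Site name</title></head>'
    '<body><h1 class="title"> Headline </h1></body></html>'
)
page_without_h1 = '<html><head><title>Only title</title></head></html>'

print(extract_title(page_with_h1))
print(extract_title(page_without_h1))
```

In the loop from the question, a None result could be used to skip the URL (for example with continue) instead of letting the script crash.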

beautifulsoup

python-2.7

index-error