BeautifulSoup can't find text

Question

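I'm trying to write a scraper in Python using urllib and Beautiful Soup. I have a CSV of news story URLs, and for about 80% of the pages the scraper works, but when there is a picture at the top of the story the script no longer pulls the time or the body text. I'm mostly confused because soup.find and soup.find_all don't seem to produce different results. I've tried a variety of different tags that should capture the text, as well as both 'lxml' and 'html.parser'.

The code is as follows: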

import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

testcount = 0
titles1 = []
bodies1 = []
times1 = []

data = pd.read_csv('URLsALLjun27.csv', header=None)
for url in data[0]:
    try:
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, "lxml")

        titlemess = soup.find(id="title").get_text()  # get the title
        titlestring = str(titlemess)  # make it a string
        title = titlestring.replace("\n", "").replace("\r", "")
        titles1.append(title)

        bodymess = soup.find(class_="article").get_text()  # get the body text
        bodystring = str(bodymess)  # make body a string
        body = bodystring.replace("\n", "").replace("\u3000", "")  # strip newlines and full-width spaces
        bodies1.append(body)  # add to list for export

        timemess = soup.find('span', {"class": "time"}).get_text()
        timestring = str(timemess)
        # turn e.g. 2016年06月27日 into 2016-06-27
        time = timestring.replace("\n", "").replace("\r", "").replace("年", "-").replace("月", "-").replace("日", "")
        times1.append(time)

        testcount = testcount + 1  # counter
        print(testcount)
    except Exception as e:
        print(testcount, e)

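Here are some of the results I get (the ones marked 'NoneType' are those where the title was pulled successfully but the body/time came back empty):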

1 http://news.xinhuanet.com/politics/2016-06/27/c_1119122255.htm

2 http://news.xinhuanet.com/politics/2016-05/22/c_129004569.htm 'NoneType' object has no attribute 'get_text'

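Any help would be much appreciated! Thanks.

Edit: I don't have "10 reputation points", so I can't post more links to test, but if you need more example pages I'll add them in the comments.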

Answer 1

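The problem is that the pages with a picture have no class="article" element, and the same goes for "class":"time". Consequently, it seems you will have to detect whether the page has a picture and, if it does, search for the date and text as follows:

For the date, try: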

timemess = soup.find(id="pubtime").get_text()

For the body text, it seems that on these pages the article is just the caption for the picture. Thus, you could try the following:

bodymess = soup.find('img').findNext().get_text()

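In short, soup.find('img') finds the image and findNext() (spelled find_next() in current Beautiful Soup versions) moves to the next block, which, coincidentally, contains the text.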
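
To illustrate the lookup, here is a minimal sketch on made-up markup. The sample HTML, the caption text, and the assumption that the caption sits in a <p> right after the image are all hypothetical; the real page may be structured differently. The sketch passes a tag name to find_next() so the lookup states which element it expects rather than relying on whatever follows the image.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a photo story: an image followed by its caption.
SAMPLE_HTML = """
<div id="content">
  <img src="photo.jpg">
  <p>Caption text that doubles as the article body.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

img = soup.find("img")
# find_next("p") returns the first <p> that appears after the image in
# document order. That the caption lives in a <p> is an assumption here.
caption_tag = img.find_next("p")
caption = caption_tag.get_text(strip=True)
print(caption_tag.name, "->", caption)
```

If the first element after the image turns out to be something other than the caption (an empty wrapper, for instance), adjusting the tag name passed to find_next() is usually enough.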

Thus, in your code, I would do something like the following:

try:
    bodymess = soup.find(class_="article").get_text()

except AttributeError:
    bodymess = soup.find('img').findNext().get_text()

try:
    timemess = soup.find('span',{"class":"time"}).get_text()

except AttributeError:
    timemess = soup.find(id="pubtime").get_text()
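
The same fallback can also be written without try/except by checking for None, since find() returns None when nothing matches. The sketch below is one possible arrangement, not the only one: the helper names text_or_none and extract_body_and_time are hypothetical, the two sample documents are invented, and find_next("p") assumes the caption is a <p> after the image.

```python
from bs4 import BeautifulSoup


def text_or_none(tag):
    """Return the tag's text, or None when the lookup matched nothing."""
    return tag.get_text(strip=True) if tag is not None else None


def extract_body_and_time(soup):
    """Try the regular layout first, then fall back to the photo layout."""
    body = text_or_none(soup.find(class_="article"))
    if body is None:
        img = soup.find("img")
        # Assumption: on photo pages the caption is the first <p> after the image.
        body = text_or_none(img.find_next("p")) if img is not None else None

    published = text_or_none(soup.find("span", {"class": "time"}))
    if published is None:
        published = text_or_none(soup.find(id="pubtime"))
    return body, published


# Made-up markup for the two layouts; the real pages will differ.
REGULAR = '<div class="article">Body text</div><span class="time">2016年06月27日</span>'
PHOTO = '<img src="a.jpg"><p>Caption text</p><span id="pubtime">2016年05月22日</span>'

for html in (REGULAR, PHOTO):
    print(extract_body_and_time(BeautifulSoup(html, "html.parser")))
```

A page that matches neither layout yields (None, None) instead of raising, so the calling loop can log the URL and move on.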

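As a general workflow for web scraping, I usually visit the site itself in a browser and first locate the elements I need in the page's underlying HTML there.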
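
A similar check can be done from Python on a downloaded page by counting how many elements each candidate selector matches. This is only a rough sketch: the function name probe is hypothetical, the selector list is taken from the lookups used in this thread, and the sample markup is invented.

```python
from bs4 import BeautifulSoup

# CSS equivalents of the lookups used above; extend the list as needed.
CANDIDATES = ["#title", ".article", "span.time", "#pubtime", "img"]


def probe(html, selectors=CANDIDATES):
    """Map each CSS selector to the number of elements it matches."""
    soup = BeautifulSoup(html, "html.parser")
    return {sel: len(soup.select(sel)) for sel in selectors}


# Made-up markup standing in for a downloaded photo page.
PHOTO_PAGE = (
    '<h1 id="title">T</h1><img src="a.jpg"><p>Caption</p>'
    '<span id="pubtime">2016</span>'
)
print(probe(PHOTO_PAGE))
```

Running this over a few of the failing URLs should show quickly which selectors are missing on those pages.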
