Read article content using goose retrieving nothing

Question

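I am trying to read the text from an .html file (for convenience, a URL is used in the example here). Sometimes, however, no text is returned at all. Please help me solve this.

Goose version used: https://github.com/agolo/python-goose/ (the current version has some bugs).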

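from goose import Goose
from requests import get

# Fetch the page and hand the raw HTML to Goose
response = get('http://www.highbeam.com/doc/1P3-979471971.html')
extractor = Goose()
article = extractor.extract(raw_html=response.content)

# cleaned_text comes back empty when Goose cannot locate the article body
text = article.cleaned_text
print(text)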

Answer 1

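Goose does use several predefined elements that are likely to be a good starting point for finding the top node. If no "known" element is found, it starts looking for the top_node, which is usually an element containing many p tags. You can read extractors/content.py for more details.

The given article lacks many features of a typical article, which is usually wrapped in an article tag or in a div tag with a class or id such as 'post-content', 'story-body', or 'article'. Here the content sits in a div tag with id = 'docText' and contains no paragraphs, so Goose cannot make a good guess about it.

I suggest adding this entry at the beginning of the KNOWN_ARTICLE_CONTENT_TAGS constant in extractors/content.py: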

KNOWN_ARTICLE_CONTENT_TAGS = [ {'attr': 'id', 'value': 'docText'}, ... other paths go here ]

Here is the extracted body text:

Chennai, Dec. 19 -- The Tamil Nadu Government on Monday appointed a one-man judicial commission of inquiry to look into the reasons for Sunday's stampede in state capital Chennai, which claimed 42 lives and left another 37 injured.\n\nThe announcement of the formation of the commission came even as family members of those killed in a stampede agonised and agitated over the unexpected tragedy.\n\nThe 42 homeless people were trampled to death during the distribution of flood relief supplies at a shelter in the Tamil Nadu capital.\n\nOfficials said over 5,000 people rushed in as the gates of the shelter opened, causing the stampede.\n\nChitra, family member of a victim, said it was mismanagement that led to the tragedy. \u2026
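
If editing the installed library is not convenient, one possible fallback is to pull the text of the known container yourself whenever cleaned_text is empty. The following is only a minimal sketch under stated assumptions: it uses the Python 3 standard library rather than Goose, IdTextExtractor and text_by_id are hypothetical helper names, and it assumes the content container closes all of its nested tags properly.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IdTextExtractor(HTMLParser):
    """Collect the text inside the first element with a given id (hypothetical helper)."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0      # 0 means we are outside the target element
        self.done = False   # True once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif not self.done and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def text_by_id(html, target_id):
    """Return the whitespace-normalized text of the element with id=target_id, or ''."""
    parser = IdTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Small inline sample standing in for the real page (assumed structure).
sample = ('<div id="nav">Menu</div>'
          '<div id="docText">Chennai, Dec. 19 -- <br>The <b>Tamil Nadu</b> Government</div>'
          '<p>footer</p>')
print(text_by_id(sample, "docText"))
```

With the real page you would pass the decoded response body (for example response.text from requests) and call text_by_id(html, 'docText') only when article.cleaned_text is empty. Unlike the KNOWN_ARTICLE_CONTENT_TAGS change, this bypasses Goose's cleaning entirely, so any script or style text inside the container would be included as well.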


python

web-crawler

goose