Usage of python-readability

(https://github.com/buriy/python-readability)

I'm having trouble using this library and can't find any documentation for it. (Is there any?)

Calling help(Document) turns up a few useful snippets, but I'm still stuck.

My code so far:

from readability.readability import Document
import requests

url = 'http://www.somepage.com'

html = requests.get(url, verify=False).content
readable_article = Document(html, negative_keywords='test_keyword').summary()

with open('test.html', 'w', encoding='utf-8') as test_file:
    test_file.write(readable_article)

According to the help(Document) output, it should be possible to pass a list as negative_keywords.

However, calling

readable_article = Document(html, negative_keywords=['test_keyword1', 'test-keyword2']).summary()

gives me a bunch of errors that I don't understand:

Traceback (most recent call last):
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 163, in summary
    candidates = self.score_paragraphs()
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 300, in score_paragraphs
    candidates[parent_node] = self.score_node(parent_node)
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 360, in score_node
    content_score = self.class_weight(elem)
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 348, in class_weight
    if self.negative_keywords and self.negative_keywords.search(feature):
AttributeError: 'list' object has no attribute 'search'

Can someone give me a hint about what the error means or how to deal with it?

There is a bug in the library code. If you look at compile_pattern:

def compile_pattern(elements):
    if not elements:
        return None
    elif isinstance(elements, (list, tuple)):
        return list(elements)
    elif isinstance(elements, regexp_type):
        return elements
    else:
        # assume string or string like object
        elements = elements.split(',')
        return re.compile(u'|'.join([re.escape(x.lower()) for x in elements]), re.U)

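Evidently, it only returns a compiled regular expression when elements is not None, not a list or tuple, and not already a regular expression. A list or tuple is returned as a plain list.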
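To see these branches in isolation, here is a minimal sketch that copies the quoted function into a standalone script. The definition of regexp_type is an assumption, since the snippet does not show it; here it is taken to be the type of a compiled pattern.

```python
import re

# Assumption: regexp_type is not shown in the quoted snippet; we take it to be
# the type of a compiled regular expression.
regexp_type = type(re.compile(''))


def compile_pattern(elements):
    # Copy of the function quoted above, for experimentation only.
    if not elements:
        return None
    elif isinstance(elements, (list, tuple)):
        return list(elements)
    elif isinstance(elements, regexp_type):
        return elements
    else:
        # assume string or string like object
        elements = elements.split(',')
        return re.compile(u'|'.join([re.escape(x.lower()) for x in elements]), re.U)


from_list = compile_pattern(['test_keyword1', 'test_keyword2'])
from_string = compile_pattern('test_keyword1,test_keyword2')

# A list comes back as a list, which has no .search method;
# a comma-separated string comes back as a compiled pattern, which does.
print(type(from_list).__name__, hasattr(from_list, 'search'))
print(type(from_string).__name__, hasattr(from_string, 'search'))
```

The list branch is the one that later triggers the AttributeError in the traceback from the question.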

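Later on, though, it assumes that self.negative_keywords is a regular expression. So I suggest passing your list as a single string of the form "test_keyword1,test_keyword2". This ensures that compile_pattern returns a regular expression, which should fix the error.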
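If the keywords already live in a list, the workaround amounts to joining them with commas before passing them in. The sketch below assumes this; the helper name keywords_to_arg is hypothetical and not part of python-readability, and the commented-out call assumes html was fetched as in the question.

```python
def keywords_to_arg(keywords):
    """Join keywords into the comma-separated string form suggested above.

    Hypothetical helper, not part of python-readability.
    """
    # The quoted compile_pattern splits on ',' and does not strip whitespace,
    # so the keywords are joined without spaces after the commas.
    return ','.join(keyword.strip() for keyword in keywords)


negative = keywords_to_arg(['test_keyword1', 'test_keyword2'])
print(negative)  # test_keyword1,test_keyword2

# Usage with the code from the question (requires python-readability and html):
# readable_article = Document(html, negative_keywords=negative).summary()
```

Note that a keyword containing a comma would be split into two by the library's string branch, so this only suits comma-free keywords.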