Testing my classifier on a review

Question

OK, so I have been able to train my movie review classifier using the NaiveBayes algorithm. The task is:

Test your classifier against a negative review of the walking dead. http://metro.co.uk/2017/02/27/the-walking-dead-season-7-episode-11-hostiles-and-calamities-wasnt-as-exciting-as-it-sounds-6473911/#mv-a

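Now, my book gives an example of document classification that uses classifier.classify(df)... and I understand that df is the document features, and that the document has to be tokenized, etc.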

My question: Is there some way to test my classifier against the review using just the URL? Or do I have to highlight all the words of the review, store them as a string or document, then tokenize, etc.?

Answer 1

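Your program can read the contents of a URL like this (Python 3):

import urllib.request

with urllib.request.urlopen("http://example.com/review.html") as rec:
    data = rec.read()  # raw bytes of the page; decode before tokenizing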

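However, the URL you suggest points to an HTML document, so you will need to "scrape" the content (that is, extract the body of the review and convert it to plain text by removing bold markup and the like) before you continue. For that you can use BeautifulSoup or something similar. (NLTK used to have a scraping function, but dropped it in favor of BeautifulSoup.) Unless you have already learned how to do that, it really will be simpler to create a few test documents by copy-pasting from your browser into a plain-text editor such as Notepad, which strips all the markup.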
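If you do want to go the scraping route, here is a rough sketch of the steps from HTML to a classifier call. To keep it free of third-party dependencies it uses the standard library's html.parser rather than BeautifulSoup, and a crude regex tokenizer rather than NLTK's. The names html_to_text, document_features and word_features are my own placeholders, and the "contains(word)" feature format is an assumption: use whatever feature function you trained your classifier with, otherwise the features will not match.

```python
import re
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still arrives here.
        if self.in_paragraph:
            self.chunks.append(data)


def html_to_text(html):
    """Return the paragraph text of an HTML string with whitespace normalized."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def document_features(text, word_features):
    """Build a feature dict: one boolean per word in word_features.

    Assumption: the classifier was trained on features shaped like this.
    """
    words = set(re.findall(r"[a-z']+", text.lower()))
    return {"contains({})".format(w): (w in words) for w in word_features}


# Hypothetical usage, with `data` read from the URL as above and
# `classifier` / `word_features` taken from your own training code:
#
#   text = html_to_text(data.decode("utf-8"))
#   print(classifier.classify(document_features(text, word_features)))
```

A real news page will also have paragraphs that are not part of the review (captions, related links, comments), so expect to narrow the extraction down to the article container, which is where BeautifulSoup makes life easier.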

python

nlp

classification

nltk

document-classification