Word Frequency in a Wikipedia Article

Question

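How can I get the frequency of a given word in a Wikipedia article without first storing the entire article and then processing it? For example, the number of times the word "India" appears in this article: https://simple.wikipedia.org/wiki/India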

Answer 1

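Here is a simple example that reads the web page line by line. However, there is no guarantee that the HTML is broken into lines. (In this case it is, into more than 1300 of them.)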

import re
import urllib.request
from collections import Counter

URL = 'https://simple.wikipedia.org/wiki/India'

counter = Counter()

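# Iterating over the response yields one line (as bytes) at a time,
# so the whole page is never held in memory at once.
with urllib.request.urlopen(URL) as source:
    for line in source:
        # Split on runs of non-letters; re.I makes [^A-Z] match case-insensitively.
        words = re.split(r"[^A-Z]+", line.decode('utf-8'), flags=re.I)
        counter.update(words)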

for word in ['India', 'Indian', 'Indians']:
    print('{}: {}'.format(word, counter[word]))

Output

> python3 test.py
India: 547
Indian: 75
Indians: 11
>
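
Because the loop above depends on the server sending line breaks, a variant that reads fixed-size chunks may be more robust. The following is only a sketch under stated assumptions: the function name `count_words_in_chunks` and the `chunk_size` parameter are hypothetical, and words are taken to be runs of ASCII letters, as in the example above. A partial word at the end of a chunk is carried over to the next one so that it is not counted as two words.

```python
import codecs
import io
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z]+")
TRAILING_WORD = re.compile(r"[A-Za-z]+\Z")  # letters at the very end of the text


def count_words_in_chunks(stream, chunk_size=4096):
    """Count words from a binary stream, keeping only one chunk in memory.

    Hypothetical helper: `stream` is any object with a read(n) method that
    returns bytes, such as the response returned by urllib.request.urlopen.
    """
    counter = Counter()
    # An incremental decoder copes with multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    tail = ""  # partial word carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        text = tail + decoder.decode(chunk)
        match = TRAILING_WORD.search(text)
        if match:
            # The last word may continue in the next chunk, so hold it back.
            tail = match.group()
            text = text[:match.start()]
        else:
            tail = ""
        counter.update(WORD.findall(text))
    # Flush whatever is left once the stream is exhausted.
    tail += decoder.decode(b"", final=True)
    counter.update(WORD.findall(tail))
    return counter


# Small offline demonstration with a deliberately tiny chunk size.
sample = io.BytesIO("India and Indian; India's".encode("utf-8"))
demo = count_words_in_chunks(sample, chunk_size=3)
print(demo["India"], demo["Indian"])

# With a real page (network access assumed):
# with urllib.request.urlopen(URL) as source:
#     counter = count_words_in_chunks(source)
```

Memory use is bounded by `chunk_size` plus the length of the longest word, regardless of how the HTML is laid out.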

If the term appears in the page's HTML structure, and not just in its content, those occurrences are counted as well.
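
One way to leave the markup out of the count, still without storing the page, might be to feed each line to the standard library's `html.parser.HTMLParser` and count only the text nodes. This is a rough sketch, not a tested solution for Wikipedia specifically: the class name `TextWordCounter` and the helper `count_text_words` are hypothetical, and it assumes input arrives split at line breaks so that no word is cut in half between two `feed` calls.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextWordCounter(HTMLParser):
    """Count words in text nodes only, ignoring tags, attributes, scripts and styles."""

    SKIPPED = {"script", "style"}  # elements whose text is code, not prose

    def __init__(self):
        super().__init__()
        self.counter = Counter()
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Called for text between tags; attribute values never arrive here.
        if not self._skip_depth:
            self.counter.update(re.findall(r"[A-Za-z]+", data))


def count_text_words(lines, encoding="utf-8"):
    """Feed an iterable of byte lines to the parser and return the word counts."""
    parser = TextWordCounter()
    for line in lines:
        parser.feed(line.decode(encoding))
    parser.close()  # flush any text the parser is still buffering
    return parser.counter


# With a real page (network access assumed):
# with urllib.request.urlopen(URL) as source:
#     counter = count_text_words(source)
```

Text in navigation menus, sidebars, and footers is still counted, since the parser cannot tell article text from the rest of the page.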

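If you want to focus on the content, consider the Pywikibot Python library, which uses the preferred MediaWiki API to extract content, although it appears to be based on the "complete page at a time" model that you note you are trying to avoid. In any case, that module's documentation points to a list of similar but more advanced packages that you may want to review.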
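
For comparison, here is a minimal sketch that queries the MediaWiki API directly with the standard library instead of Pywikibot. It assumes the wiki has the TextExtracts extension enabled (`prop=extracts` with `explaintext`), and the function names and the User-Agent string are hypothetical. Like Pywikibot, it loads the whole plain-text extract into memory, so it trades the streaming requirement for cleaner input.

```python
import json
import re
import urllib.parse
import urllib.request
from collections import Counter

API = "https://simple.wikipedia.org/w/api.php"


def build_extract_url(title, api=API):
    """Build a query URL asking for the plain-text extract of one page."""
    params = {
        "action": "query",
        "prop": "extracts",   # assumes the TextExtracts extension is available
        "explaintext": 1,     # plain text instead of HTML
        "titles": title,
        "format": "json",
    }
    return api + "?" + urllib.parse.urlencode(params)


def count_words_in_response(response):
    """Count words in every page extract found in a decoded API response."""
    counter = Counter()
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        counter.update(re.findall(r"[A-Za-z]+", page.get("extract", "")))
    return counter


def fetch_counts(title):
    """Fetch the extract over the network and count its words."""
    request = urllib.request.Request(
        build_extract_url(title),
        headers={"User-Agent": "word-frequency-example/0.1"},  # placeholder value
    )
    with urllib.request.urlopen(request) as source:
        return count_words_in_response(json.load(source))


# Usage (network access assumed):
# counts = fetch_counts("India")
# print(counts["India"])
```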

information-retrieval

web-crawler

information-extraction

python-3.x

mediawiki-api