How to split a Wikipedia page into paragraphs using Python?

Question

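I am using the Python wikipedia library to extract the content of a Wikipedia page. I want to process each paragraph of this content (for example, count the number of words in each paragraph). What is the best way to split the Wikipedia content into paragraphs?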

import wikipedia

def getPage(title):

    content = wikipedia.page(title).content
    #for each paragraph in content do: 
        #...

Answer 1

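The wrong approach

The wikipedia library does not provide this kind of information.

As this example shows, the returned page content drops most layout elements: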

import wikipedia
print(wikipedia.page("New York City").content)

"[...] and sports. Home to the headquarters of the United Nations, New York is an important center for international diplomacy.Situated on one of the world's largest natural harbors, [...]"

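There are some hints, of course, but parsing them is tedious:

some paragraph breaks, like the one above, leave no space after the final period of the preceding paragraph;
headings take a form like == MyTitle ==\n;
newline characters are kept in the output.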
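
The hints above can be turned into a rough splitter. The following is only a minimal sketch working on a plain string such as the one returned by wikipedia.page(title).content; the function name split_plain_content, both regular expressions, and the sample text are assumptions made for illustration, and the "period followed directly by a capital letter" rule is a heuristic that will misfire on some abbreviations.

```python
import re

# Matches heading lines such as "== History ==" or "=== Early years ===".
HEADING_RE = re.compile(r"^=+\s*.*?\s*=+$")

# Heuristic (an assumption, not a library feature): a lowercase letter and a
# period followed immediately by a capital letter marks a missed paragraph break.
GLUED_BREAK_RE = re.compile(r"(?<=[a-z]\.)(?=[A-Z])")


def split_plain_content(content):
    """Split plain-text page content into a list of paragraph strings."""
    paragraphs = []
    for line in content.split("\n"):
        line = line.strip()
        if not line or HEADING_RE.match(line):
            continue  # skip blank lines and section headings
        # Splitting on an empty match requires Python 3.7 or later.
        paragraphs.extend(GLUED_BREAK_RE.split(line))
    return paragraphs


# Made-up sample text that imitates the quirks listed above.
sample = (
    "Center for international diplomacy.Situated on a natural harbor.\n"
    "\n\n== History ==\n"
    "In 1664, the city was named in honor of the Duke of York."
)
for paragraph in split_plain_content(sample):
    # Word count per paragraph, as asked in the question.
    print(len(paragraph.split()), paragraph)
```

This covers the three hints only; anything the library has already flattened (lists, tables, captions) still ends up mixed into the paragraphs.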

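Wikipedia-defined sections

If you are looking for the sections as Wikipedia defines them, try the wikipediaapi library, which is more actively maintained and more complete.

With it, you can easily retrieve sections: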

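import wikipediaapi
# Recent releases require a user agent as the first argument (older ones
# accepted Wikipedia('en')); the string below is only a placeholder.
page_py = wikipediaapi.Wikipedia('MyProjectName (contact@example.com)', 'en').page('New_York_City')
print(page_py.sections[0].text)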

"In 1664, the city was named in honor of the Duke of York, [...] seized it from the Dutch."

This approach yields very clean text, but it cannot identify paragraphs within a section.
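
If per-section granularity is enough, the word count from the question can be computed per section. This is a sketch that assumes each section object exposes title, text and sections attributes, as wikipediaapi section objects do; the helper name section_word_counts and the stand-in demo data are invented here so the example runs without network access.

```python
from types import SimpleNamespace


def section_word_counts(sections, depth=0):
    """Yield (depth, title, word_count) for each section and its subsections."""
    for section in sections:
        yield depth, section.title, len(section.text.split())
        # Subsections are nested, so walk them recursively.
        yield from section_word_counts(section.sections, depth + 1)


# Stand-in objects for illustration; with the real library you would pass
# page_py.sections instead.
demo = [
    SimpleNamespace(
        title="History",
        text="In 1664, the city was named.",
        sections=[
            SimpleNamespace(title="Early years", text="A short text.", sections=[])
        ],
    )
]
for depth, title, count in section_word_counts(demo):
    print("  " * depth + title, count)
```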

HTML paragraphs

However, if you are looking for paragraphs as defined by <p>...</p>, you need to parse the HTML and do some cleanup.

Here is one way to do that (using BeautifulSoup4):

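import re
import unicodedata

import bs4
import requests

def get_paragraphs(page_name):
    """Yield the cleaned text of each non-empty <p> element on the page."""
    url = 'https://en.wikipedia.org/api/rest_v1/page/html/{0}'.format(page_name)
    r = requests.get(url)
    r.raise_for_status()  # fail early on HTTP errors
    # Name the parser explicitly to avoid a BeautifulSoup warning.
    soup = bs4.BeautifulSoup(r.content, 'html.parser')
    html_paragraphs = soup.find_all('p')

    for p in html_paragraphs:
        # Normalize Unicode and strip citation markers such as [12].
        cleaned_text = re.sub(r'\[[0-9]+\]', '', unicodedata.normalize('NFKD', p.text)).strip()
        if cleaned_text:
            yield cleaned_text

print(list(get_paragraphs('New_York_City'))[0])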

"New York City (NYC), often called simply New York, is the most populous city in the United States. [...] Home to the headquarters of the United Nations, New York is an important center for international diplomacy."

Although the cleanup is not perfect, this approach is probably the best.
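
One place where the cleanup falls short is that only numeric markers such as [12] are removed. A slightly broader cleanup step might look like the sketch below; the function name clean_paragraph and the set of marker forms it removes (letters, "note N", "citation needed") are assumptions about what typically appears in page text, not an exhaustive list.

```python
import re
import unicodedata

# Bracketed markers: [12], [a], [note 3], [citation needed].
# The exact set of marker forms is an assumption; extend it as needed.
MARKER_RE = re.compile(r"\[(?:\d+|[a-z]|note \d+|citation needed)\]", re.IGNORECASE)


def clean_paragraph(text):
    """Normalize Unicode, drop reference markers, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)  # e.g. non-breaking space -> space
    text = MARKER_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


raw = "New York City[a] is the most populous city.[12]\u00a0 It is large.[citation needed]"
print(clean_paragraph(raw))
```

It could replace the re.sub call inside get_paragraphs, leaving the rest of the generator unchanged.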
