How can I show the title, header and paragraph in the same order as the original web page? - Python

Question

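I am parsing a Wikipedia web page. I want to search for a keyword, for example "The first abstraction", and display the title, the header, and the matching paragraph. How can I do that?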

Web :https://en.wikipedia.org/wiki/Mathematics
Search = "The first abstraction"
Output :
       title: Mathematics
       header: History
       paragraph : The history of mathematics can be seen as an ever-increasing series of   
                   abstractions. **The first abstraction**, which is shared by many animals,[14] was 
                   probably that of numbers: the realization that a collection of two apples and a            
                   collection of two oranges (for example) have something in common, namely quantity 
                   of their members. 
import bs4
import requests

response = requests.get("https://en.wikipedia.org/wiki/Mathematics")

if response is not None:
    html = bs4.BeautifulSoup(response.text, 'html.parser')

    # the page title lives in the element with id "firstHeading"
    title = html.select("#firstHeading")[0].text
    print(title)

    # print every paragraph on the page
    paragraphs = html.select("p")
    for para in paragraphs:
        print(para.text)

    # just grab the text up to contents as stated in question
    intro = '\n'.join([para.text for para in paragraphs[0:5]])
    print(intro)

This code displays the title fine, but the headers and paragraphs are not kept in order, so I cannot match them up. Thanks.

Answer 1

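First, as you loop over the p tags, search each one for "The first abstraction", since you only want the paragraph that contains that text.

So call the find() method on the text of each 'para' to check whether the expected text is present -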

paragraphs = html.select("p")

Search = "The first abstraction" # expected text

for para in paragraphs:
    px = para.text
    if px.find(Search)>-1:
        print (para.text)
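A minimal offline sketch of the same search, in case you want to try it without hitting Wikipedia. The `SAMPLE_HTML` string and the `find_paragraphs` helper are made up for illustration and are not part of the original page or code; the check uses the `in` operator, which does the same job as `find(...) > -1`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<h1 id="firstHeading">Mathematics</h1>
<p>Mathematics is an area of knowledge.</p>
<h2>History</h2>
<p>The first abstraction, which is shared by many animals, was probably that of numbers.</p>
"""


def find_paragraphs(html_text, search):
    """Return the text of every <p> whose text contains `search`."""
    soup = BeautifulSoup(html_text, "html.parser")
    return [p.get_text() for p in soup.select("p") if search in p.get_text()]


matches = find_paragraphs(SAMPLE_HTML, "The first abstraction")
for text in matches:
    print(text)
```

Only the second paragraph is returned here, because the first one does not contain the search text.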

This gives you the expected paragraph -

The history of mathematics can be seen as an ever-increasing series of abstractions. The first abstraction, which is shared by many animals,[14] was probably that of numbers: the realization that a collection of two apples and a collection of two oranges (for example) have something in common, namely quantity of their members.

The paragraph and title are now done. You still need to extract the header. Look at the HTML structure of the page you are parsing (this always helps).

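In the page's HTML, the h2 is a sibling of the p tag (where your text is located). It is worth reading up on siblings in the BeautifulSoup documentation.

So to traverse to the previous tag, you should call 'previous_sibling' twice on the p tag, because the whitespace between two tags counts as a sibling node of its own.

Since the h2 is the second sibling tag before the p, it takes four calls in total to reach the h2 (which holds the 'History' header) -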
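To see why a single 'previous_sibling' call often does not land on a tag, here is a small sketch on a made-up snippet (the `SAMPLE_HTML` string is an assumption, not the real Wikipedia markup). The newline between the two tags is parsed as a text node, so it is the first thing you hit when stepping backwards.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical two-tag snippet with a newline between the tags.
SAMPLE_HTML = "<div><h2>History</h2>\n<p>Some text.</p></div>"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
para = soup.find("p")

first = para.previous_sibling    # the whitespace text node between the tags
second = first.previous_sibling  # the h2 tag itself

print(repr(first))
print(second.name)

# find_previous_sibling() only considers tags, so it skips the text node.
heading = para.find_previous_sibling("h2")
print(heading.get_text())
```

If counting calls feels fragile, `find_previous_sibling("h2")` avoids it, as long as the h2 really is a sibling of the p.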

paragraphs = html.select("p")
for para in paragraphs:
    px = para.text
    if px.find(Search)>-1:
        print (para.text)
        print(para.previous_sibling.previous_sibling.previous_sibling.previous_sibling.text)

This prints -

History
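Chaining 'previous_sibling' depends on the exact layout of the page, which can change over time. As a possibly more robust sketch, you could ask BeautifulSoup for the nearest heading that appears before the paragraph with `find_previous`. The `SAMPLE_HTML` string and the `locate` helper below are hypothetical, and treating h2/h3 as the section headers is an assumption about the page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; an extra element sits between
# the heading and the paragraph to mimic a less tidy layout.
SAMPLE_HTML = """
<h1 id="firstHeading">Mathematics</h1>
<p>Intro paragraph.</p>
<h2>History</h2>
<div class="note">See also: History of mathematics</div>
<p>The first abstraction was probably that of numbers.</p>
"""


def locate(html_text, search):
    """Return (title, header, paragraph) for each <p> containing `search`."""
    soup = BeautifulSoup(html_text, "html.parser")
    title = soup.select_one("#firstHeading").get_text(strip=True)
    results = []
    for para in soup.select("p"):
        if search in para.get_text():
            # Nearest h2/h3 that appears earlier in the document, if any.
            heading = para.find_previous(["h2", "h3"])
            header = heading.get_text(strip=True) if heading else None
            results.append((title, header, para.get_text(strip=True)))
    return results


for title, header, paragraph in locate(SAMPLE_HTML, "The first abstraction"):
    print("title:", title)
    print("header:", header)
    print("paragraph:", paragraph)
```

A paragraph in the lead section has no heading before it, so its header comes back as None rather than raising an error.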

python

wikipedia

beautifulsoup