Scrape Wikipedia text with BS4 (pairing each heading with its associated paragraphs) and output it in CSV format

I want to pair the scraped paragraphs with the scraped headings from Wikipedia. For example, I picked the following Wikipedia article: https://en.wikipedia.org/wiki/England. I am currently scraping the page for every paragraph, and I am also scraping all the headings so I can put the two together. I am trying to pair each heading with its associated paragraphs, and I then want to write the result to a CSV file. Note: the article has the following h2 headings (for the paragraphs):

Toponymy
History
Governance
Geography
Economy
Healthcare

Later I could add the scraped text of another Wikipedia article (e.g. Italy, Spain, France, etc.) to the same CSV, since those articles use the same headings. By the way: I am not sure whether what I need is clear, so feel free to ask.

Approach: in the code I am using below, I find the heading tags with find_all('h2') and then use find_next_siblings('p') to collect the <p> tags that follow each h2, until the next h2 is reached.

from bs4 import BeautifulSoup
import requests

page_link = 'https://en.wikipedia.org/wiki/England'
page_response = requests.get(page_link, verify=False, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")

textContent = []
for tag in page_content.find_all('h2')[1:]:  # skip the first h2 on the page
    texth2 = tag.text.strip()
    textContent.append(texth2)               # the heading itself
    for item in tag.find_next_siblings('p'):
        # keep only the <p> tags whose nearest preceding <h2> is the current heading
        if texth2 in item.find_previous_siblings('h2')[0].text.strip():
            textContent.append(item.text.strip())

print(textContent)

See the output below.

By the way: we have the following h2 headings (for the paragraphs):

Toponymy
History
Governance
Geography
Economy
Healthcare
Demography

...and so on.

I want to write all of the headings as columns; the write function I have so far:

def write_csv_file(content_list):
    with open(...) as csv_file:
        writer = csv.writer(csv_file, delimiter=',')
        writer.writerows(content_list)
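For completeness, a minimal sketch of how that function could be filled in and fed heading/paragraph pairs, assuming the scrape loop above is changed to collect one [heading, text] row per section instead of the flat textContent list; the file name wiki_sections.csv and the header row are my own choices, not part of the original code:

import csv
import requests
from bs4 import BeautifulSoup

page_link = 'https://en.wikipedia.org/wiki/England'
soup = BeautifulSoup(requests.get(page_link, timeout=5).content, "html.parser")

rows = []
for tag in soup.find_all('h2')[1:]:
    heading = tag.text.strip()
    # paragraphs whose nearest preceding <h2> is the current heading
    paragraphs = [p.text.strip() for p in tag.find_next_siblings('p')
                  if p.find_previous_siblings('h2')
                  and p.find_previous_siblings('h2')[0].text.strip() == heading]
    if paragraphs:
        rows.append([heading.replace('[edit]', ''), "\n".join(paragraphs)])

def write_csv_file(content_list):
    # one row per heading: [heading, joined paragraph text]
    with open('wiki_sections.csv', 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file, delimiter=',')
        writer.writerow(['heading', 'text'])
        writer.writerows(content_list)

write_csv_file(rows)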

Console output:

 ['Toponymy', 'The name "England" is derived from the Old English name Englaland, which means "land of the Angles".[15] The Angles were one of the Germanic tribes that settled in Great Britain during the Early Middle Ages. The Angles came from the Anglia peninsula in the Bay of Kiel area (present-day German state of Schleswig–Holstein) of the Baltic Sea.[16] The earliest recorded use of the term, as "Engla londe", is in the late-ninth-century translation into Old English of Bede\'s Ecclesiastical History of the English People. The term was then used in a different sense to the modern one, meaning "the land inhabited by the English", and it included English people in what is now south-east Scotland but was then part of the English kingdom of Northumbria. The Anglo-Saxon Chronicle recorded that the Domesday Book of 1086 covered the whole of England, meaning the English kingdom, but a few years later the Chronicle stated that King Malcolm III went "out of Scotlande into Lothian in Englaland", thus using it in the more ancient sense.[17]', 'The earliest attested reference to the Angles occurs in the 1st-century work by Tacitus, Germania, in which the Latin word Anglii is used.[18] The etymology of the tribal name itself is disputed by scholars; it has been suggested that it derives from the shape of the Angeln peninsula, an angular shape.[19] How and why a term derived from the name of a tribe that was less significant than others, such as the Saxons, came to be used for the entire country and its people is not known, but it seems this is related to the custom of calling the Germanic people in Britain Angli Saxones or English Saxons to distinguish them from continental Saxons (Eald-Seaxe) of Old Saxony between the Weser and Eider rivers in Northern Germany.[20] In Scottish Gaelic, another language which developed on the island of Great Britain, the Saxon tribe gave their name to the word for England (Sasunn);[21] similarly, the Welsh name for the English language is "Saesneg". A romantic name for England is Loegria, related to the Welsh word for England, Lloegr, and made popular by its use in Arthurian legend. Albion is also applied to England in a more poetic capacity,[22] though its original meaning is the island of Britain as a whole.', 'History', 'The earliest known evidence of human presence in the area now known as England was that of Homo antecessor, dating to approximately 780,000 years ago. The oldest proto-human bones discovered in England date from 500,000\xa0years ago.[23] Modern humans are known to have inhabited the area during the Upper Paleolithic period, though permanent settlements were only established within the last 6,000 years.[24][25]\nAfter the last ice age only large mammals such as mammoths, bison and woolly rhinoceros remained. Roughly 11,000\xa0years ago, when the ice sheets began to recede, humans repopulated the area; genetic research suggests they came from the northern part of the Iberian Peninsula.[26] The sea level was lower than now and Britain was connected by land bridge to Ireland and Eurasia.[27]\nAs the seas rose, it was separated from Ireland 10,000\xa0years ago and from Eurasia two millennia later.', 
    ....so on]

I want to write all of this output to a CSV file. In other words: after scraping the Wikipedia text (pairing each heading with its associated paragraphs), I want to output it in CSV format.

By the way: I could then add the scraped text of another Wikipedia article (e.g. Italy, Spain, France, etc.) to the CSV, since I use the same headings:

Toponymy
History
Governance
Geography
Economy
Healthcare
Demography

Update: Hi @Andrej Kesely: this is wonderful, many thanks! By the way: could we also apply this to a similar task? See the collection data for the digital hubs shown here: https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool

with the hub cards as the data, for example:

https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/13281/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1417/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1349/view

I have already applied the scraper to these pages and it works, but how do I implement the CSV output for a scraper that iterates over the URLs? Can we write that output to a CSV as well, applying the same technique?
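For reference, a minimal sketch of how the same pattern could loop over the hub URLs and collect one CSV row per card; the field name "Title", the soup.find("h2") lookup and the output file name hubs.csv are placeholders, not the actual structure of the s3platform pages, so the extraction part would need to be swapped for whatever the working scraper already uses:

import requests
import pandas as pd
from bs4 import BeautifulSoup

hub_urls = [
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/13281/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1417/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1349/view",
]

rows = []
for url in hub_urls:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    row = {"URL": url}
    # placeholder extraction: replace this with the selectors the hub-card
    # scraper already uses for the fields of interest
    title_tag = soup.find("h2")
    if title_tag:
        row["Title"] = title_tag.get_text(strip=True)
    rows.append(row)

# one row per hub card; any missing field simply becomes an empty cell
pd.DataFrame(rows).to_csv("hubs.csv", index=False)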

Thank you very much.

This example fetches all <h2> tags from the different URLs as section titles and the <p> tags that follow them as text, and saves the result to a CSV file:

import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import accumulate, groupby


urls = [
    "https://en.wikipedia.org/wiki/England",
    "https://en.wikipedia.org/wiki/Portugal",
]

out = []
for url in urls:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # direct children of the article body container
    tags = soup.select_one(".mw-parser-output").find_all(recursive=False)
    # running count of <h2> tags seen so far: every tag between two <h2>s
    # gets the same number, so it can serve as a grouping key
    a = accumulate(t.name == "h2" for t in tags)

    d = {"Country": url.split("/")[-1]}
    for _, g in groupby(zip(a, tags), lambda k: k[0]):
        # keep only the heading and its paragraphs from each group
        g = list(t for _, t in g if t.name in {"h2", "p"})
        if g[0].name != "h2" or len(g) == 1:
            continue

        title = g[0].get_text(strip=True).replace("[edit]", "")
        text = "\n".join(
            [t.get_text(strip=True, separator=" ") for t in g[1:]]
        ).strip()

        if not text:
            continue

        d[title] = text

    out.append(d)

# one row per URL, one column per section heading
df = pd.DataFrame(out)
df.to_csv("data.csv", index=False)

The saved data.csv, as seen in LibreOffice, has one row per country and one column per section heading (screenshot omitted).
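As a usage note: because each article is collected into its own dict, pandas simply leaves a cell empty whenever an article lacks one of the headings, so more URLs can be added later without breaking the layout. Reading the file back is a one-liner; the column names in the comment are only the ones expected from the run above:

import pandas as pd

df = pd.read_csv("data.csv")
print(df.columns.tolist())           # e.g. ['Country', 'Toponymy', 'History', ...]
print(df.loc[0, "Toponymy"][:100])   # start of the 'Toponymy' text for the first row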