Extracting data from a Wikipedia page

Question

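This question may be very specific. I am trying to extract the number of employees from the Wikipedia pages of companies, such as https://en.wikipedia.org/wiki/3M.

I tried the Wikipedia Python API and some regular expressions. However, I couldn't find anything robust enough to generalize to any company (exceptions aside).

Also, because the table row has no id or class, I cannot access the value directly. Here is the source: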

<tr>
<th scope="row" style="padding-right:0.5em;">
<div style="padding:0.1em 0;line-height:1.2em;">Number of employees</div>
</th>
<td style="line-height:1.35em;">89,800 (2015)<sup id="cite_ref-FY_1-5" class="reference"><a href="#cite_note-FY-1">[1]</a></sup></td>
</tr>

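So even though I know the class of the table (infobox vcard), I couldn't find a way to scrape this information with BeautifulSoup.

Is there a way to extract this information? It appears in the summary table on the right-hand side at the top of the page.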

Answer 1

Using lxml.etree instead of BeautifulSoup, you can get what you want with an XPath expression:

>>> from lxml import etree
>>> import requests
>>> r = requests.get('https://en.wikipedia.org/wiki/3M')
>>> doc = etree.fromstring(r.text)
>>> e = doc.xpath('//table[@class="infobox vcard"]/tr[th/div/text()="Number of employees"]/td')
>>> e[0].text
'89,800 (2015)'

Let's take a closer look at the expression:

//table[@class="infobox vcard"]/tr[th/div/text()="Number of employees"]/td

That is:

Find all table elements whose class attribute is exactly infobox vcard; inside those, look for tr child elements that have a th child with a div child whose text is "Number of employees"; from each such tr, select the td children. Indexing the result with e[0] then gives the first match.
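For pages where the strict expression does not match, a more tolerant variant may help. The sketch below is not the code from this answer: it swaps in lxml.html.fromstring (an HTML parser) for etree.fromstring, matches the class with contains(), and uses //tr so that rows wrapped in a tbody are still found. The names infobox_value, parse_count and SAMPLE are made up for illustration, and SAMPLE is just the row quoted in the question wrapped in a table so the example runs without a network request.

```python
import re
from lxml import html

# Hypothetical test fragment: the row from the question inside an infobox table.
SAMPLE = """
<table class="infobox vcard">
<tbody>
<tr>
<th scope="row" style="padding-right:0.5em;">
<div style="padding:0.1em 0;line-height:1.2em;">Number of employees</div>
</th>
<td style="line-height:1.35em;">89,800 (2015)<sup id="cite_ref-FY_1-5" class="reference"><a href="#cite_note-FY-1">[1]</a></sup></td>
</tr>
</tbody>
</table>
"""


def infobox_value(page_html, label):
    """Return the text of the infobox cell whose row header equals `label`, or None."""
    doc = html.fromstring(page_html)
    # contains(@class, "infobox") tolerates extra classes on the table;
    # //tr matches rows whether or not they sit inside a <tbody>.
    cells = doc.xpath(
        '//table[contains(@class, "infobox")]'
        '//tr[normalize-space(th) = $label]/td',
        label=label,
    )
    if not cells:
        return None
    # .text stops at the first child element, so the <sup> citation is skipped.
    return (cells[0].text or "").strip()


def parse_count(text):
    """Pull the leading integer out of a string such as '89,800 (2015)'."""
    match = re.match(r"\s*(\d[\d,]*)", text)
    return int(match.group(1).replace(",", "")) if match else None


value = infobox_value(SAMPLE, "Number of employees")
print(value, parse_count(value))
```

For a live page, the same function could be given r.content from the requests call above. Whether the label text and table class are the same for every company page is an assumption that would need checking.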

Answer 2

Why reinvent the wheel?

DBpedia has this information as RDF triples.

See e.g. http://dbpedia.org/page/3M
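
One way to read such a triple programmatically is a SPARQL query. The following is only a sketch under assumptions that are not part of this answer: that the public endpoint lives at https://dbpedia.org/sparql, that it accepts the query and format parameters shown, and that the employee count is stored under the property dbo:numberOfEmployees (check the resource page for the property actually present). The function names are hypothetical, and it uses only the standard library.

```python
import json
import urllib.parse
import urllib.request

# Assumption: DBpedia's public SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"


def build_query(resource):
    """Build a SPARQL query asking for the employee count of one resource."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?employees WHERE {\n"
        # Assumption: the count is published as dbo:numberOfEmployees.
        f"  <http://dbpedia.org/resource/{resource}> dbo:numberOfEmployees ?employees .\n"
        "}"
    )


def parse_results(payload):
    """Extract the bound values from a SPARQL JSON results document."""
    bindings = payload["results"]["bindings"]
    return [row["employees"]["value"] for row in bindings]


def fetch_employees(resource):
    """Send the query over HTTP and return the values as strings."""
    params = urllib.parse.urlencode({
        "query": build_query(resource),
        "format": "application/sparql-results+json",
    })
    with urllib.request.urlopen(f"{ENDPOINT}?{params}", timeout=30) as resp:
        return parse_results(json.load(resp))


# Print the query only; fetch_employees("3M") would perform the request.
print(build_query("3M"))
```

An empty list from fetch_employees would mean the resource has no value under that property, in which case scraping the infobox remains the fallback.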

python

regex

wikipedia

web-scraping

dbpedia