How do I fix my Wikipedia table web-scraper - returns no cell values

Question

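I'm having a bit of a problem with my Wikipedia table web scraper: it can't read the text in the cells. I have defined the table without problems, and I have defined the rows without problems. My code looks like this: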

import requests
from bs4 import BeautifulSoup
import re
import dateutil

result = requests.get('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population')
assert result.status_code==200
print(result.status_code)

src = result.content
document = BeautifulSoup(src, 'lxml')

table = document.find('table')
table

assert table.find('th').get_text() == "Rank"

rows = table.find_all('tr')
rows

for row in rows[1:-1]:
    cells = row.find_all(['th'], ['td'])

    cells_text = [cell.get_text() for cell in cells]

    print(cells_text)

This gives me the following output:

200
[]
[]
[]
[]
... (an empty list is printed for every remaining row)

Process finished with exit code 0

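I have been following this video tutorial: https://www.youtube.com/watch?v=rzYeuMAo4Dw&t=641s. As far as I can tell, the guy does exactly the same thing as I do, but his scraper evidently works where mine doesn't.

I don't know what exactly the problem is or how to fix it.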

Answer 1

Pass th and td together in a single list to .find_all:

import requests
from bs4 import BeautifulSoup

result = requests.get(
    "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
)
assert result.status_code == 200

src = result.content
document = BeautifulSoup(src, "lxml")

table = document.find("table")
assert table.find("th").get_text() == "Rank"
rows = table.find_all("tr")

for row in rows[1:-1]:
    cells = row.find_all(["th", "td"])       # <--- put th, td in list
    cells_text = [cell.get_text(strip=True) for cell in cells]
    print(cells_text)

Prints:

['–', 'World', '7,892,391,000', '100%', '31 Aug 2021', 'UN projection[2]', '']
['1', 'China(more)', 'Asia', '1,411,778,724', '17.9%', '1 Nov 2020', '2020 census result[3]', 'The census figure refers tomainland China, excluding itsspecial administrative regionsofHong KongandMacau, the former of which returned to Chinese sovereignty on 1\xa0July 1997 and the latter on 20\xa0December 1999.']
['2', 'India(more)', 'Asia', '1,381,310,652', '17.5%', '31 Aug 2021', 'National population clock[4]', 'The figure includes the population of India-administered Kashmir but not of China- or Pakistan-administered Kashmir.']
['3', 'United States(more)', 'Americas', '332,282,961', '4.21%', '31 Aug 2021', 'National population clock[5]', 'Includes the50 statesand theDistrict of Columbia, but excludes theU.S. territories.']
['4', 'Indonesia(more)', 'Asia', '271,350,000', '3.44%', '31 Dec 2020', 'National annual estimate[6]', '']

...
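
To see why the original call returns nothing, here is a minimal sketch on a made-up two-row table (the HTML snippet and the variable names are illustrative, not taken from the Wikipedia page). The signature is find_all(name, attrs, ...), and as far as I know BeautifulSoup treats a non-dict value for attrs as a filter on the CSS class. So find_all(['th'], ['td']) most likely looks for th tags whose class is "td", which matches nothing. The sketch uses the built-in "html.parser" only so that it runs without lxml:

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the real table; this markup is invented for illustration.
html = """
<table>
  <tr><th>Rank</th><th>Country</th><th>Population</th></tr>
  <tr><th>1</th><td>China</td><td>1,411,778,724</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find_all("tr")[1]  # the data row

# Two separate lists: the second one lands in the `attrs` parameter and is
# treated as a CSS class filter, i.e. "th tags with class 'td'".
wrong = row.find_all(["th"], ["td"])

# One list of tag names: matches either tag, in document order.
right = row.find_all(["th", "td"])

# A CSS selector is another way to express "th or td".
selected = row.select("th, td")

wrong_text = [cell.get_text(strip=True) for cell in wrong]
right_text = [cell.get_text(strip=True) for cell in right]
selected_text = [cell.get_text(strip=True) for cell in selected]

print(wrong_text)     # []
print(right_text)     # ['1', 'China', '1,411,778,724']
print(selected_text)  # ['1', 'China', '1,411,778,724']
```

Whether to use find_all with a list or select with a selector is mostly a matter of taste; both return the header and data cells of the row in the order they appear.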
