Python: getting the list of all articles belonging to a certain category from Wikipedia

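I have a category on Wikipedia (for example, https://en.wikipedia.org/wiki/Category:Member_states_of_the_United_Nations ), and the task is to get the list of all articles that belong to this category and print the following result to the terminal (in Python), like this: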

A: 11

B: 17

...

Y: 1

Z: 2

I have only one idea (but I'm not sure it is correct): I'm using import wikipedia. Are there any other ways to solve this task besides using import wikipedia?
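
One other way, without either package, is to query the MediaWiki API directly through its list=categorymembers module. The following is a minimal sketch using only the standard library, not a definitive implementation. The helper names fetch_json and iter_category_titles and the User-Agent string are made up for illustration, and cmnamespace=0 assumes that only ordinary articles are wanted.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_json(params):
    """Send one GET request to the MediaWiki API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # The User-Agent value is a placeholder; replace it with your own contact info.
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-demo/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def iter_category_titles(category, fetch=fetch_json):
    """Yield the titles of all articles in a category, following continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmnamespace": 0,   # assumption: main (article) namespace only
        "cmlimit": "max",
        "format": "json",
    }
    while True:
        data = fetch(params)
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        # Merge the continuation values into the next request.
        params = {**params, **data["continue"]}
```

Calling list(iter_category_titles("Category:Member states of the United Nations")) would then perform the requests. The fetch parameter exists only so the paging logic can be exercised without network access.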

Solution using Wikipedia-API

Code

from itertools import groupby
import wikipediaapi

def get_categorymembers(categorymembers):
    '''
        Yield the title of every page that belongs to the category
    '''
    for c in categorymembers.values():
        yield c.title

# Get the Wikipedia API for English
# (newer Wikipedia-API releases may also require a user-agent string here)
wiki_wiki = wikipediaapi.Wikipedia('en')

# Get the Wikipedia category page for the UN (looked up by title)
page_un = wiki_wiki.page("Category:Member states of the United Nations")

# Generator over the titles of the category's members
categories = get_categorymembers(page_un.categorymembers)

# Drop the first member (it is just the starting UN page, not a country)
next(categories)

# Sort alphabetically (groupby only groups adjacent items)
categories = sorted(categories)

# Group by the first letter of the title
groups = groupby(categories, lambda k: k[0])

# Show the first letter and the count for each group
for k, v in groups:
    print(k, len(list(v)))

Output

Note: B has only 16 because the Bahamas is listed as "The Bahamas", so it is placed in the T group.

A 11
B 16
C 15
D 5
E 9
F 4
G 10
H 3
I 8
J 3
K 6
L 9
M 18
N 10
O 1
P 9
Q 1
R 5
S 26
T 12
U 8
V 3
Y 1
Z 2
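
The note above explains why "The Bahamas" is counted under T. The following is a minimal sketch of one way around that, assuming it is acceptable to ignore a leading "The " when picking the letter. sort_key and count_by_initial are hypothetical helper names, and the short sample list only stands in for the titles returned by Wikipedia-API.

```python
from collections import Counter

def sort_key(title):
    """Return the title without a leading 'The ' (assumed rule for filing)."""
    prefix = "The "
    return title[len(prefix):] if title.startswith(prefix) else title

def count_by_initial(titles):
    """Count titles by the upper-cased first letter of their sort key."""
    keys = (sort_key(t) for t in titles)
    counts = Counter(k[0].upper() for k in keys if k)
    # Return the letters in alphabetical order.
    return dict(sorted(counts.items()))

# Stand-in data; in the answer's code this would be the list of member titles.
sample = ["Albania", "The Bahamas", "Bahrain", "The Gambia", "Tonga"]
for letter, n in count_by_initial(sample).items():
    print(f"{letter}: {n}")
```

Using Counter also removes the need to sort before grouping, which groupby requires.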

Countries containing the letter A

Code

import wikipediaapi

def get_categorymembers(categorymembers):
    '''
        Yield the title of every page that belongs to the category
    '''
    for c in categorymembers.values():
        yield c.title

def chunk_list(seq, size):
    ' Break a list into chunks of at most size items '
    return (seq[i:i+size] for i in range(0, len(seq), size))

wiki_wiki = wikipediaapi.Wikipedia('en')

page_un = wiki_wiki.page("Category:Member states of the United Nations")
# Generator over the titles of the category's members
countries = get_categorymembers(page_un.categorymembers)
next(countries)  # Drop the first member since it is just the overall category page

# Countries with the letter a (upper or lower case) in the name
contains_a = [c for c in countries if 'a' in c.lower()]

# Show the list (4 per line)
for sublist in chunk_list(contains_a, 4):
    print(*sublist)

Output

Afghanistan Albania Algeria Andorra
Angola Antigua and Barbuda Argentina Armenia
Australia Austria Azerbaijan The Bahamas
Bahrain Bangladesh Barbados Belarus
Bhutan Bolivia Bosnia and Herzegovina Botswana
Brazil Bulgaria Burkina Faso Cambodia
Cameroon Canada Cape Verde Central African Republic
Chad China Colombia Democratic Republic of the Congo
Costa Rica Croatia Cuba Denmark
Dominica Dominican Republic East Timor Ecuador
El Salvador Equatorial Guinea Eritrea Estonia
Eswatini Ethiopia Finland France
User talk:FuzionEXA Gabon The Gambia Georgia (country)
Germany Ghana Grenada Guatemala
Guinea Guinea-Bissau Guyana Haiti
Honduras Hungary Iceland India
Indonesia Iran Iraq Republic of Ireland
Israel Italy Ivory Coast Jamaica
Japan Jordan Kazakhstan Kenya
Kiribati Kuwait Kyrgyzstan Laos
Latvia Lebanon Liberia Libya
Lithuania Madagascar Malawi Malaysia
Maldives Mali Malta Marshall Islands
Mauritania Mauritius Federated States of Micronesia Moldova
Monaco Mongolia Mozambique Myanmar
Namibia Nauru Nepal Kingdom of the Netherlands
New Zealand Nicaragua Nigeria North Korea
North Macedonia Norway Oman Pakistan
Palau Panama Papua New Guinea Paraguay
Poland Portugal Qatar Romania
Russia Rwanda Saint Kitts and Nevis Saint Lucia
Saint Vincent and the Grenadines Samoa San Marino São Tomé and Príncipe
Saudi Arabia Senegal Serbia Sierra Leone
Singapore Slovakia Slovenia Solomon Islands
Somalia South Africa South Korea South Sudan
Spain Sri Lanka Sudan Suriname
Switzerland Syria Tajikistan Tanzania
Thailand Tonga Trinidad and Tobago Tunisia
Turkmenistan Tuvalu Uganda Ukraine
United Arab Emirates United States Uruguay Uzbekistan
Vanuatu Venezuela Vietnam Zambia
Zimbabwe
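
The output above contains "User talk:FuzionEXA", which is a talk page that happens to be in the category, not an article. A minimal sketch of filtering such entries out, assuming each member object exposes the ns (namespace number) attribute that Wikipedia-API pages provide. article_titles is a hypothetical replacement for get_categorymembers, and the SimpleNamespace objects are stand-ins so the sketch runs without network access.

```python
from types import SimpleNamespace

MAIN_NAMESPACE = 0  # MediaWiki's namespace number for ordinary articles

def article_titles(categorymembers):
    """Yield titles of members that live in the main (article) namespace."""
    for page in categorymembers.values():
        if page.ns == MAIN_NAMESPACE:
            yield page.title

# Stand-in data shaped like page_un.categorymembers (title -> page object).
fake_members = {
    "France": SimpleNamespace(title="France", ns=0),
    "User talk:FuzionEXA": SimpleNamespace(title="User talk:FuzionEXA", ns=3),
    # Hypothetical subcategory entry (namespace 14 is Category).
    "Category:Example subcategory": SimpleNamespace(
        title="Category:Example subcategory", ns=14
    ),
}
print(list(article_titles(fake_members)))
```

The same check would also skip any subcategories, since those live in the Category namespace.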