Iterate over list of multiple strings using for loop

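I'm new to coding in Python. For a personal project, I'm looking for different ways to retrieve birth and death dates from a list of Wikipedia pages. I'm using the wikipedia package.

One approach I tried is to iterate over the Wikipedia summary and return the index at which I have counted four consecutive digits.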

import wikipedia as wp

names = ('Zaha Hadid', 'Rem Koolhaas')
wiki_summary = wp.summary(names)
b_counter = 0
i_b_year = []
d_counter = 0
i_d_year = []

for i,x in enumerate(wiki_summary):
    if x.isdigit() == True:
        b_counter += 1
        if b_counter == 4:
           i_b_year = i
           break
        else:
            continue        
    else:
        b_counter = 0

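So far this works for the first person in my list, but I would like to go through all the names in my names list. Is there a way to use a for loop to find the index and also use a for loop to iterate over names?

I know there are other approaches, such as parsing the page to find the bday tag, but I'd like to try a few different solutions.

I'm not familiar with the wikipedia package, but it looks like you can simply iterate over the names tuple: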

import wikipedia as wp

names = ('Zaha Hadid', 'Rem Koolhaas')

i_b_year = []
for name in names: #This line is new
    wiki_summary = wp.summary(name) #Just changed names for name
    b_counter = 0
    d_counter = 0
    i_d_year = []

    for i,x in enumerate(wiki_summary):
        if x.isdigit() == True:
            b_counter += 1
            if b_counter == 4:
               i_b_year.append(i) #I am guessing you want this list to increase with each name in names. Thus, 'append'.
               break
            else:
                continue        
        else:
            b_counter = 0
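One way to keep the per-name loop readable is to move the digit counting into a small helper and call it once per name. The following is only a sketch: first_year_end_index is a hypothetical name, and the sample summaries are short stand-ins (not the real Wikipedia text) so that the snippet runs without network access. In the real script the text would come from wp.summary(name).

```python
def first_year_end_index(text):
    """Return the index of the fourth digit of the first run of four
    consecutive digits in text, or None if there is no such run."""
    run = 0
    for i, ch in enumerate(text):
        if ch.isdigit():
            run += 1
            if run == 4:
                return i
        else:
            run = 0
    return None


# Stand-in summaries (assumption: not the real Wikipedia text).
sample_summaries = {
    'Zaha Hadid': 'Zaha Hadid (31 October 1950 – 31 March 2016) was an architect.',
    'Rem Koolhaas': 'Rem Koolhaas (born 17 November 1944) is an architect.',
}

# One index per name, in the same order as the names.
i_b_year = [first_year_end_index(text) for text in sample_summaries.values()]
print(i_b_year)
```

Returning None when no four-digit run exists makes the "not found" case explicit instead of silently leaving the list shorter than names.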

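You are trying to:

  1. Declare two empty lists to store each person's birth year and death year.
  2. Get each person's Wikipedia summary from the tuple.
  3. Parse the first two 4-digit numbers from the summary and append them to the birth year and death year lists.

The problem is that a person's summary may not contain the birth year and death year as its first two 4-digit numbers. For example, the Wikipedia summary for Rem_Koolhaas has his birth year as the first 4-digit number, but the second 4-digit number comes from this line: In 2005, he co-founded Volume Magazine together with Mark Wigley and Ole Bouman.

As we can see, the birth-year and death-year lists may contain inaccurate information.

Here is the code for what you are trying to achieve: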
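The pitfall is easy to reproduce offline. In this small sketch the sample string is a shortened stand-in built from the sentence quoted above, not the actual summary, and the regex is just one convenient way to list standalone 4-digit numbers:

```python
import re

# Shortened stand-in for a summary (assumption: not the real Wikipedia text).
sample = (
    "Rem Koolhaas (born 17 November 1944) is an architect. "
    "In 2005, he co-founded Volume Magazine together with "
    "Mark Wigley and Ole Bouman."
)

# Every standalone 4-digit number, in order of appearance.
years = re.findall(r"\b\d{4}\b", sample)
print(years)  # ['1944', '2005']
```

The second value is the year the magazine was co-founded, so treating it as a death year would be wrong.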

import wikipedia as wp

names = ('Zaha Hadid', 'Rem Koolhaas')
i_b_year = []
i_d_year = []

for person_name in names:
    wiki_summary = wp.summary(person_name)
    birth_year_found = False
    death_year_found = False
    digits = ""    

    for c in wiki_summary:
        if c.isdigit() == True:
            if birth_year_found == False:                
                digits += c
                if len(digits) == 4:
                    birth_year_found = True
                    i_b_year.append(int(digits))
                    digits = ""
            elif death_year_found == False:
                digits += c
                if len(digits) == 4:
                    death_year_found = True
                    i_d_year.append(int(digits))
                    break
        else:
            digits = ""
    if birth_year_found == False:
        i_b_year.append(0)
    if death_year_found == False:
        i_d_year.append(0)

for i in range(len(names)):
    print(names[i], i_b_year[i], i_d_year[i])

Output:

Zaha Hadid 1950 2016
Rem Koolhaas 1944 2005

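Disclaimer: in the code above, I append 0 if two 4-digit numbers cannot be found in a person's summary. As I already mentioned, nothing guarantees that a Wikipedia summary lists a person's birth year and death year as its first two 4-digit numbers, so these lists may contain incorrect information.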
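If 0 feels too easy to mistake for a real value, None is a possible alternative placeholder. A minimal sketch of that variation, using a regex instead of the character loop; first_two_years is a hypothetical helper name, and the same caveat applies: the two numbers are not guaranteed to be birth and death years.

```python
import re

# A standalone 4-digit number.
YEAR = re.compile(r"\b\d{4}\b")


def first_two_years(summary):
    """Return the first two 4-digit numbers in summary as a tuple of
    ints, padding with None when fewer than two are present."""
    found = [int(y) for y in YEAR.findall(summary)[:2]]
    found += [None] * (2 - len(found))
    return tuple(found)


print(first_two_years("born 17 November 1944"))  # (1944, None)
```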

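First of all, your code will not run, for a couple of reasons:

  1. The package can only be imported with a lowercase first letter: import wikipedia
  2. The summary method accepts a string (in your case a name), so you have to call it for each name in the collection

All that aside, let's try to implement what you are trying to do: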

import wikipedia as wp
import re

# First thing we see (at least for pages provided) is that dates all share the same format:
# For those who are no longer with us 31 October 1950 – 31 March 2016
# For those who are still alive 17 November 1944
# So we have to build regex patterns to find those
# First is the months pattern, since it's quite a big one
MONTHS_PATTERN = r"January|February|March|April|May|June|July|August|September|October|November|December"
# Next we build our date pattern; doubled curly braces give literal braces inside an f-string
DATE_PATTERN = re.compile(fr"\d{{1,2}}\s({MONTHS_PATTERN})\s\d{{4}}")
# Declare our set of names, great choice of architects BTW :)
names = ('Zaha Hadid', 'Rem Koolhaas')
# Since we're trying to get birthdays and dates of death, we will create a dictionary for storing values
lifespans = {}
# Iterate over them in a loop
for name in names:
    lifespan = {'birthday': None, 'deathday': None}
    try:
        summary = wp.summary(name)
        # First we find the first date in summary, since it's most likely to be the birthday
        first_date = DATE_PATTERN.search(summary)
        if first_date:
            # If we've found a date – suppose it's birthday
            bday = first_date.group()
            lifespan['birthday'] = bday
            # Let's check whether the person is no longer with us
            LIFESPAN_PATTERN = re.compile(fr"{re.escape(bday)}\s–\s{DATE_PATTERN.pattern}")
            lifespan_found = LIFESPAN_PATTERN.search(summary)
            if lifespan_found:
                lifespan['deathday'] = lifespan_found.group().replace(f"{bday} – ", '')
            lifespans[name] = lifespan
        else:
            print(f'No dates were found for {name}')
    except wp.exceptions.PageError:
        # Handle not found page, so that code won't break
        print(f'{name} was not found on Wikipedia')

# Print result
print(lifespans)

Output for the names provided:

{'Zaha Hadid': {'birthday': '31 October 1950', 'deathday': '31 March 2016'}, 'Rem Koolhaas': {'birthday': '17 November 1944', 'deathday': None}}

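This approach is inefficient and has plenty of flaws, for example when a page contains a date that matches the regex but is neither a birthday nor a date of death. It is also pretty ugly (although I did my best :)), and you would be better off parsing the tags.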
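One way to probe flaws like these without hitting Wikipedia is to run the pattern logic against fixed strings. The sketch below is a variation on the code above, not a replacement for it: extract_lifespan is a hypothetical helper, the named groups are my own choice, and the test strings are stand-ins rather than real summaries.

```python
import re

MONTHS = (
    r"January|February|March|April|May|June|July|"
    r"August|September|October|November|December"
)
# A date such as "31 October 1950": day, month name, four-digit year.
DATE = rf"\d{{1,2}}\s(?:{MONTHS})\s\d{{4}}"
DATE_RE = re.compile(DATE)
# Two dates joined by an en dash, captured as named groups.
LIFESPAN_RE = re.compile(rf"(?P<birth>{DATE})\s–\s(?P<death>{DATE})")


def extract_lifespan(summary):
    """Return (birthday, deathday) as strings; either may be None."""
    first = DATE_RE.search(summary)
    if first is None:
        return None, None
    # Only accept a lifespan that starts at the first date found.
    span = LIFESPAN_RE.match(summary, first.start())
    if span:
        return span.group("birth"), span.group("death")
    return first.group(), None


# Stand-in strings (assumption: not the real Wikipedia summaries).
print(extract_lifespan("Zaha Hadid (31 October 1950 – 31 March 2016) was an architect."))
print(extract_lifespan("Rem Koolhaas (born 17 November 1944) is an architect."))
print(extract_lifespan("In May 2005 he co-founded a magazine."))
```

Keeping the extraction in a function that takes plain text makes it possible to add more awkward inputs later and see where the pattern breaks.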

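If you're not happy with Wikipedia's date format, I suggest you look at datetime. Also, bear in mind that those regexes fit these two specific pages; I did not do any research into how dates are represented across Wikipedia. So if there are any inconsistencies, I suggest you stick with parsing the tags.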
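As a rough illustration of the datetime suggestion, the strings extracted above can be turned into date objects with strptime. parse_wiki_date is a hypothetical helper name, and the sketch assumes an English (or default C) locale, because %B matches the full month name in the current locale.

```python
from datetime import datetime


def parse_wiki_date(text):
    """Parse a date such as '31 October 1950' into a datetime.date.
    Raises ValueError if the text does not match the format."""
    return datetime.strptime(text, "%d %B %Y").date()


birthday = parse_wiki_date("31 October 1950")
deathday = parse_wiki_date("31 March 2016")

print(birthday.isoformat())  # 1950-10-31
# Date objects support arithmetic, e.g. the lifespan in days.
print((deathday - birthday).days)
```

Once the values are date objects, they can be reformatted with strftime or compared directly, which is harder to do reliably with raw strings.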