Generate a DataFrame from attributes of tags in a list

I have a list of revisions of a Wikipedia article, which I query like this:

import re
import urllib.request

def getRevisions(wikititle):
    url = "https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=" + wikititle
    revisions = []                                        #list of all accumulated revisions
    continuation = ''                                     #query suffix for the next request

    while True:
        response = urllib.request.urlopen(url + continuation).read()     #web request

        response = response.decode("utf-8")               #decode the bytes into text

        revisions += re.findall('<rev [^>]*>', response)  #adds all revisions from the current request to the list

        cont = re.search('<continue rvcontinue="([^"]+)"', response)
        if not cont:                                      #break the loop if 'continue' element missing
            break

        continuation = "&rvcontinue=" + cont.group(1)     #gets the revision Id from which to start the next request
    return revisions

This produces a list in which each element is a rev tag as a string:

['<rev revid="343143654" parentid="6546465" minor="" user="name" timestamp="2021-12-12T08:26:38Z" comment="abc" />',...]

How can I generate a DataFrame from this list?

Use the JSON output format (format=json instead of format=xml); you can then easily create a DataFrame from the JSON.

Example URL for JSON output
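As a minimal sketch of this suggestion: the function below takes an already-parsed JSON response and builds a DataFrame from it. The function name revisions_to_frame and the sample payload are hypothetical, and the response shape (query, then pages, then a revisions list per page) is an assumption about what the API returns with format=json, so check it against a real response before relying on it.

```python
import pandas as pd

def revisions_to_frame(payload):
    """Collect the revision dicts of every page in one parsed API response."""
    # Assumed shape: payload["query"]["pages"][<page id>]["revisions"] is a list of dicts.
    pages = payload.get("query", {}).get("pages", {})
    rows = [rev for page in pages.values() for rev in page.get("revisions", [])]
    # A list of dicts becomes one row per dict; keys missing from a dict become NaN.
    return pd.DataFrame(rows)

# Hand-written sample that mimics the assumed response shape (not real data).
sample_payload = {
    "query": {
        "pages": {
            "1": {
                "title": "Example",
                "revisions": [
                    {"revid": 343143654, "parentid": 6546465, "minor": "",
                     "user": "name", "timestamp": "2021-12-12T08:26:38Z",
                     "comment": "abc"},
                    {"revid": 343143000, "parentid": 343142000,
                     "user": "other name", "timestamp": "2021-12-11T10:00:00Z",
                     "comment": "fix typo"},
                ],
            }
        }
    }
}

df = revisions_to_frame(sample_payload)
print(df)
```

With a live request, the payload would come from json.loads(...) applied to the decoded response body, and the frames from successive requests could be combined with pd.concat.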

A "simple" way without regular expressions is to split the string and then parse it:

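import pandas as pd

for rev_string in revisions:
    rev_dict = {}

    # Skip the first and last pieces: they are the tag name "<rev" and the closing "/>".
    # Caveat: splitting on spaces breaks attribute values that contain spaces (e.g. comments).
    attributes = rev_string.split(' ')[1:-1]

    # Split each piece on the first "=" into key and value, and strip the surrounding quotes.
    for attribute in attributes:
        key, value = attribute.split("=", 1)
        rev_dict[key] = value.strip('"')

    # Wrap the dict in a list so pandas builds a one-row DataFrame from the scalar values.
    df = pd.DataFrame([rev_dict])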

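This example creates a DataFrame for each revision. If you want to collect multiple revisions in one dictionary, you need to handle the set of unique attributes (I don't know whether these attributes vary between wiki documents), and then convert the dictionary to a DataFrame once all attributes have been collected.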
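One possible way to collect all revisions into a single DataFrame is sketched below. It swaps the string splitting for the standard library's xml.etree.ElementTree parser, which assumes each list element is a well-formed, self-closing rev tag like the one shown in the question; the sample strings are made up for illustration. Passing a list of dicts to pd.DataFrame takes care of attributes that appear in only some revisions by filling the gaps with NaN.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Made-up sample in the same shape as the list returned by getRevisions.
revisions = [
    '<rev revid="343143654" parentid="6546465" minor="" user="name" '
    'timestamp="2021-12-12T08:26:38Z" comment="abc" />',
    '<rev revid="343143000" parentid="343142000" user="other name" '
    'timestamp="2021-12-11T10:00:00Z" comment="fix typo in lead" />',
]

# Parse each tag as a tiny XML document; .attrib is a dict of its attributes.
# Unlike splitting on spaces, this keeps values that contain spaces intact.
rows = [dict(ET.fromstring(rev_string).attrib) for rev_string in revisions]

# One row per revision; attributes missing from a revision become NaN.
df = pd.DataFrame(rows)
print(df)
```

All values arrive as strings here, so columns such as revid or timestamp may still need converting, for example with pd.to_numeric or pd.to_datetime.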