Generate a DataFrame from the attributes of tags in a list
I have a list of revisions of a Wikipedia article, which I query like this:
import re
import urllib.request

def getRevisions(wikititle):
    url = "https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=" + wikititle
    revisions = []  # list of all accumulated revisions
    next = ''  # information for the next request
    while True:
        response = urllib.request.urlopen(url + next).read()  # web request
        response = response.decode("utf-8")  # bytes -> text
        revisions += re.findall('<rev [^>]*>', response)  # adds all revisions from the current request to the list
        cont = re.search('<continue rvcontinue="([^"]+)"', response)
        if not cont:  # break the loop if 'continue' element missing
            break
        next = "&rvcontinue=" + cont.group(1)  # gets the revision Id from which to start the next request
    return revisions
This produces a list in which each element is a rev tag as a string:
['<rev revid="343143654" parentid="6546465" minor="" user="name" timestamp="2021-12-12T08:26:38Z" comment="abc" />',...]
How can I generate a DataFrame from this list?
Use the JSON output format; you can then easily create a DataFrame from the JSON.
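As a minimal sketch of that suggestion: assuming the request is made with format=json instead of format=xml, and assuming the parsed response nests the revisions under query, then pages, then a page id, then revisions (which, as far as I know, is the layout of the API's default JSON format), the revision dictionaries can be handed straight to pandas. The function name revisions_to_df and the sample response are hypothetical and only there for illustration.

```python
import pandas as pd

def revisions_to_df(response):
    """Flatten a parsed JSON response (a dict) into one DataFrame of revisions."""
    rows = []
    # Assumed layout: response["query"]["pages"][<page id>]["revisions"] is a list of dicts.
    for page in response.get("query", {}).get("pages", {}).values():
        rows.extend(page.get("revisions", []))
    return pd.DataFrame(rows)

# Hypothetical sample shaped like the assumed layout; not real API output.
sample = {
    "query": {
        "pages": {
            "123": {
                "title": "Example",
                "revisions": [
                    {"revid": 343143654, "parentid": 6546465, "user": "name",
                     "timestamp": "2021-12-12T08:26:38Z", "comment": "abc"},
                    {"revid": 6546465, "parentid": 0, "user": "other",
                     "timestamp": "2021-12-11T07:00:00Z", "comment": "first"},
                ],
            }
        }
    }
}

df = revisions_to_df(sample)
print(df)
```

With real data, response would be the result of json.loads on the body returned by the API, and for a paginated query the rows from every request would be collected before building the DataFrame.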
A "simple" way that avoids regular expressions is to split the string and then parse the pieces:
import pandas as pd

for rev_string in revisions:
    rev_dict = {}
    # Skip the first and last pieces: they are the tag name and the closing "/>".
    # Note: splitting on spaces breaks if an attribute value itself contains a space.
    attributes = rev_string.split(' ')[1:-1]
    # Split each piece on the first "=" into key and value, and strip the surrounding quotes.
    for attribute in attributes:
        key, value = attribute.split("=", 1)
        rev_dict[key] = value.strip('"')
    # Wrap the dict in a list so it becomes a single row.
    df = pd.DataFrame([rev_dict])
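This example creates one DataFrame per revision. If you want to collect multiple revisions in a single dictionary, you need to handle the set of distinct attributes (I don't know whether these attributes vary according to the wiki documentation), and then convert the dictionary to a DataFrame once all the attributes have been collected.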
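One way to collect every revision into a single DataFrame, sketched under the assumption that each list element is a well-formed, self-closing rev tag like the one in the question: parse each string with the standard library's xml.etree.ElementTree (a swap for the split-based parsing above, since it copes with spaces inside attribute values), keep one dictionary per revision, and build the DataFrame once at the end. The function name rev_tags_to_df and the sample list are made up for illustration.

```python
import xml.etree.ElementTree as ET
import pandas as pd

def rev_tags_to_df(revisions):
    """Build one DataFrame with a row per <rev .../> tag string."""
    # ET.fromstring parses a single element; .attrib is its attribute dict.
    rows = [dict(ET.fromstring(rev_string).attrib) for rev_string in revisions]
    # pandas uses the union of all keys as columns and marks gaps as missing.
    return pd.DataFrame(rows)

# Made-up sample in the shape shown in the question.
revisions = [
    '<rev revid="343143654" parentid="6546465" minor="" user="name" timestamp="2021-12-12T08:26:38Z" comment="abc" />',
    '<rev revid="6546465" parentid="0" user="other" timestamp="2021-12-11T07:00:00Z" comment="first version" />',
]

df = rev_tags_to_df(revisions)
print(df[["revid", "user", "comment"]])
```

Every value arrives as a string, so columns such as revid or timestamp would still need converting (for example with pd.to_numeric or pd.to_datetime) if numeric or date operations are wanted.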