Finding and storing children of roots in Beautiful Soup
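I am trying to find and store the child <orgname> of the parent <assignee>. So far my code runs through the XML document and already picks up certain other tags. I have it set up like this: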
for xml_string in separated_xml(infile): # Calls the output of the separated and read file to parse the data
    soup = BeautifulSoup(xml_string, "lxml") # BeautifulSoup parses the data strings where the XML is converted to Unicode
    pub_ref = soup.findAll("publication-reference") # Beginning parsing at every instance of a publication
    lst = [] # Creating empty list to append into
    with open('./output.csv', 'ab') as f:
        writer = csv.writer(f, dialect = 'excel')
        for info in pub_ref: # Looping over all instances of publication
            # The final loop finds every instance of invention name, patent number, date, and country to print and append
            for inv_name, pat_num, date_num, country, city, state in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), assign.find("orgname"), soup.findAll("date"), soup.findAll("country"), soup.findAll("city"), soup.findAll("state")):
                writer.writerow([inv_name.text, pat_num.text, org_name.text, date_num.text, country.text, city.text, state.text])
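I already have this in order so that each invention name is paired with its patent, and I now need the assignee's organization name as well. The problem is that other tags, associated with attorneys and similar organizations, also contain an <orgname>, like this: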
<agent sequence="01" rep-type="attorney">
  <addressbook>
    <orgname>Sawyer Law Group LLP</orgname>
    <address>
      <country>unknown</country>
    </address>
  </addressbook>
</agent>
</agents>
</parties>
<assignees>
  <assignee>
    <addressbook>
      <orgname>International Business Machines Corporation</orgname>
      <role>02</role>
      <address>
        <city>Armonk</city>
        <state>NY</state>
        <country>US</country>
      </address>
    </addressbook>
  </assignee>
</assignees>
I only want the organization name under the <assignee> tag. I tried:
assign = soup.findAll("assignee")
org_name = assign.findAll("orgname")
But to no avail. It just throws:
"ResultSet object has no attribute '%s'. You're probably treating a
list of items like a single item. Did you call find_all() when you
meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find'. You're
probably treating a list of items like a single item. Did you call
find_all() when you meant to call find()?
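How can I add these tags and find all organization names under the assignee tag?
It seems simple, but I can't get it to work.
Thanks in advance.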
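assign = soup.findAll("assignee") returns a list (a ResultSet), which is why calling org_name = assign.findAll("orgname") fails. You have to iterate over assign and call .findAll("orgname") on each element. However, there appears to be only one <orgname> in each <assignee>, so .find is enough and .findAll is unnecessary. Try a list comprehension that calls .find on each element of assign: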
orgnames = [item.find("orgname") for item in assign]
Or, to get their text directly, first check that <orgname> exists in the <assignee>:
orgnames = [item.find("orgname").text for item in assign if item.find("orgname")]
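To see the scoping in action, here is a minimal, self-contained sketch run against a trimmed copy of the XML from the question. Several things in it are assumptions for illustration only: the <root> wrapper, the second <assignee> without an <orgname> (with its <last-name> element), the helper name assignee_orgnames, and the use of the built-in "html.parser" so the example does not need lxml installed. The question's "lxml" parser should behave the same way for this lookup.

```python
from bs4 import BeautifulSoup

# Trimmed from the question's XML. The <root> wrapper and the second
# <assignee> (which has no <orgname>) are hypothetical additions.
SAMPLE = """
<root>
  <agents>
    <agent sequence="01" rep-type="attorney">
      <addressbook><orgname>Sawyer Law Group LLP</orgname></addressbook>
    </agent>
  </agents>
  <assignees>
    <assignee>
      <addressbook>
        <orgname>International Business Machines Corporation</orgname>
        <address><city>Armonk</city><state>NY</state><country>US</country></address>
      </addressbook>
    </assignee>
    <assignee>
      <addressbook><last-name>Doe</last-name></addressbook>
    </assignee>
  </assignees>
</root>
"""


def assignee_orgnames(xml_string, parser="html.parser"):
    """Return the text of the first <orgname> inside each <assignee>."""
    soup = BeautifulSoup(xml_string, parser)
    names = []
    for assignee in soup.find_all("assignee"):
        # .find() on a Tag searches only that tag's descendants,
        # so the attorney's <orgname> under <agent> is never seen.
        org = assignee.find("orgname")
        if org is not None:  # skip assignees that have no <orgname>
            names.append(org.get_text(strip=True))
    return names


orgnames = assignee_orgnames(SAMPLE)
print(orgnames)  # ['International Business Machines Corporation']
```

The explicit loop does the same thing as the list comprehension above, but calls .find only once per <assignee>. If you prefer CSS selectors, soup.select("assignee orgname") should return the same <orgname> tags.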