Python: Printing Arranged Output of Extracted HTML Tags
I am trying to extract data from the following HTML and organize the extracted output:
html_doc = """
<html>
<body>
<ul class="unordered-list">
<li class="menu-category">
<div class="h4 category-name section-title">Birds Toys</div>
<div class="category-description">Toys belonging to the Bird Category</div>
<ul class="menu-items-list unordered-list">
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Eagle</span>
<span class="item-price">.00</span>
</div>
<p class="description">Eagle is the national bird of the US.</p>
</li>
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Parrot</span>
<span class="item-price">.00</span>
</div>
<p class="description">Parrot is found in tropical and subtropical region.</p>
</li>
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Owls</span>
<span class="item-price">.00</span>
</div>
<p class="description">Owls are nocturnal.</p>
</li>
</ul>
<ul class="menu-items-list unordered-list">
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Kingfisher</span>
<span class="item-price">.00</span>
</div>
<p class="description">Kigfisher hunts in the water</p>
</li>
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Quail</span>
<span class="item-price">.00</span>
</div>
<p class="description"></p>
</li>
</ul>
</li>
</ul>
<ul class="unordered-list">
<li class="menu-category">
<div class="h4 category-name section-title">Reptiles Toys</div>
<div class="category-description">Toys belonging to Reptiles Category</div>
<ul class="menu-items-list unordered-list">
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Snake</span>
<span class="item-price">.00</span>
</div>
<p class="description">Snakes can be poisonous.</p>
</li>
</ul>
<ul class="menu-items-list unordered-list">
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Lizard</span>
<span class="item-price">.00</span>
</div>
<p class="description">Lizards are found both at homes and in jungle</p>
</li>
</ul>
</li>
</ul>
<ul class="unordered-list">
<li class="menu-category">
<div class="h4 category-name section-title">Germs Toys</div>
<div class="category-description">Toys that belong to germs category</div>
<ul class="menu-items-list unordered-list">
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Bacteria</span>
<span class="item-price">.95</span>
</div>
<p class="description">Bacteria can cause tuberclausis</p>
</li>
</ul>
<ul class="menu-items-list unordered-list">
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Protozoa</span>
<span class="item-price">.95</span>
</div>
<p class="description"></p>
</li>
</ul>
<ul class="menu-items-list unordered-list">
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Virus</span>
<span class="item-price">.95</span>
</div>
<p class="description">Viruses are known to cause Corona, Aids, etc.</p>
</li>
</ul>
</li>
</ul>
</body>
</html>
"""
I was able to extract the div-class, span-class, and p-class combinations successfully with the following code:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")

with open("output.txt", "w") as output:
    # CATEGORY: find a list of all category-name div elements
    divitemscatg = soup.find_all('div', {'class': 'h4 category-name section-title'})
    linesdivitemscatg = [div.get_text() for div in divitemscatg]
    print(linesdivitemscatg)
    # ITEM TITLE: find a list of all item-title span elements
    spansitemtitle = soup.find_all('span', {'class': 'item-title'})
    linesitemtitle = [span.get_text() for span in spansitemtitle]
    print(linesitemtitle)
    # ITEM PRICE: find a list of all item-price span elements
    spansitemprice = soup.find_all('span', {'class': 'item-price'})
    linesitemprice = [span.get_text() for span in spansitemprice]
    print(linesitemprice)
    # DESC: find a list of all description p elements
    spansitemdesc = soup.find_all('p', {'class': 'description'})
    linesitemdesc = [p.get_text() for p in spansitemdesc]
    print(linesitemdesc)
The output I get is:
['Birds Toys', 'Reptiles Toys', 'Germs Toys']
['Eagle', 'Parrot', 'Owls', 'Kingfisher', 'Quail', 'Snake', 'Lizard', 'Bacteria', 'Protozoa', 'Virus']
['.00', '.00', '.00', '.00', '.00', '.00', '.00', '.95', '.95', '.95']
['Eagle is the national bird of the US.', 'Parrot is found in tropical and subtropical region.', 'Owls are nocturnal.', 'Kigfisher hunts in the water', '', 'Snakes can be poisonous.', 'Lizards are found both at homes and in jungle', 'Bacteria can cause tuberclausis', '', 'Viruses are known to cause Corona, Aids, etc.']
But I need the output organized differently, one line per item, like this:
Birds Toys|Eagle|.00|Eagle is the national bird of the US.
Birds Toys|Parrot|.00|Parrot is found in tropical and subtropical region.
Birds Toys|Owls|.00|Owls are nocturnal.
Birds Toys|Kingfisher|.00|Kigfisher hunts in the water
Birds Toys|Quail|.00|
Reptiles Toys|Snake|.00|Snakes can be poisonous.
Reptiles Toys|Lizard|.00|Lizards are found both at homes and in jungle
Germs Toys|Bacteria|.95|Bacteria can cause tuberclausis
Germs Toys|Protozoa|.95|
Germs Toys|Virus|.95|Viruses are known to cause Corona, Aids, etc.
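What changes does the code above need in order to produce the latter? I have not been able to arrange it in the required format.
Thanks in advance.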
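You can achieve this as follows: select each menu-items element, find the category div that precedes it, and add that category to the row:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")

with open("output.txt", "w") as output:
    # One output row per <li class="menu-items">
    for item in soup.select('.menu-items'):
        data = [
            # Nearest preceding div whose classes include "h4" (the category name)
            item.find_previous('div', {'class': 'h4'}).text,
            item.select_one('.item-title').text,
            item.select_one('.item-price').text,
            item.select_one('.description').text
        ]
        output.write('|'.join(data) + '\n')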
Output (contents of output.txt):
Birds Toys|Eagle|.00|Eagle is the national bird of the US.
Birds Toys|Parrot|.00|Parrot is found in tropical and subtropical region.
Birds Toys|Owls|.00|Owls are nocturnal.
Birds Toys|Kingfisher|.00|Kigfisher hunts in the water
Birds Toys|Quail|.00|
Reptiles Toys|Snake|.00|Snakes can be poisonous.
Reptiles Toys|Lizard|.00|Lizards are found both at homes and in jungle
Germs Toys|Bacteria|.95|Bacteria can cause tuberclausis
Germs Toys|Protozoa|.95|
Germs Toys|Virus|.95|Viruses are known to cause Corona, Aids, etc.
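A variation on the same idea, offered only as a sketch: instead of looking backwards from each item with find_previous, iterate over each menu-category first and then over the items nested inside it. The helper names text_of, extract_rows, and to_pipe_delimited, the trimmed SAMPLE_HTML, and the use of the csv module for the pipe-delimited output are assumptions for illustration and are not part of the original answer.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample using the same class names as the question's HTML.
SAMPLE_HTML = """
<ul class="unordered-list">
  <li class="menu-category">
    <div class="h4 category-name section-title">Birds Toys</div>
    <ul class="menu-items-list unordered-list">
      <li class="menu-items">
        <div class="h6 item-main">
          <span class="item-title">Eagle</span>
          <span class="item-price">.00</span>
        </div>
        <p class="description">Eagle is the national bird of the US.</p>
      </li>
      <li class="menu-items">
        <div class="h6 item-main">
          <span class="item-title">Quail</span>
          <span class="item-price">.00</span>
        </div>
        <p class="description"></p>
      </li>
    </ul>
  </li>
</ul>
"""


def text_of(parent, selector):
    """Return the stripped text of the first match, or '' if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else ""


def extract_rows(html):
    """Return one [category, title, price, description] list per menu item."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for category in soup.select("li.menu-category"):
        name = text_of(category, ".category-name")
        # Only items nested inside this category are visited here.
        for item in category.select("li.menu-items"):
            rows.append([
                name,
                text_of(item, ".item-title"),
                text_of(item, ".item-price"),
                text_of(item, ".description"),
            ])
    return rows


def to_pipe_delimited(rows):
    """Serialize rows as '|'-separated lines; csv quotes fields that contain '|'."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="|", lineterminator="\n").writerows(rows)
    return buffer.getvalue()


print(to_pipe_delimited(extract_rows(SAMPLE_HTML)), end="")
```

Scoping the item lookup to its category makes the grouping explicit, and text_of returns an empty string rather than raising AttributeError when an element such as the description is missing altogether. To write a file instead of printing, the returned string can be passed to output.write inside the same with open("output.txt", "w") block.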