Parsing a tag in HTML

Question

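I know this question has been asked before, but I don't think it has been asked for this specific case. If it has, feel free to point me to it.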

I have an HTML file structured like this (you can view the original here):

<h5 id="foo1">Title 1</h5>
               <table class="foo2">
                  <tbody>
                     <tr>
                        <td>
                           <h3 class="foo3">SomeName1</h3>
                           <img src="Somesource" alt="SomeName2" title="SomeTitle"><br>
                              <p class="textcode">
                                    Some precious text here
                              </p>
                        </td>
                        ...
               </table>

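For each h5 separately, I want to extract the name, the image, and the <p> text contained in each table cell. In other words, I want to save each of these items in a separate folder named after the h5 it belongs to.

I tried: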

# coding: utf-8
import os
import re
from bs4 import BeautifulSoup as bs

os.chdir("WorkingDirectory")
# Select the HTML file and load its contents into the variable of the same name
with open("TheGoodPath.htm","r") as html:
    html = bs(html,'html.parser')
    # Select the headers, limit the results to the first six, and create the folders
    h5 = html.find_all("h5",limit=6)
    for h in h5:
        # Create the files named after the headers
        chemin = u"../Résulat/"
        nom = str(h.contents[0].string)
        os.makedirs(chemin + nom,exist_ok=True)
        # Select the sibling table located right after the header
        table = h.find_next_sibling(name = 'table')
        for t in table:
            # Select the headers containing the document titles
            h3 = t.find_all("h3")
            for k in h3:
                titre = str(k.string)
                # Create the directories named after the figures
                os.makedirs(chemin + nom + titre,exist_ok=True)
                os.fdopen(titre.tex)
                # Get the image from the sibling tag located right after the previous header
                img = k.find_next_sibling("img")
                chimg = img.img['src']
                os.fdopen(img.img['title'])
                # Get the TikZ code from the sibling tag located right after the previous header
                tikz = k.find_next_sibling('p')
                # Extract the TikZ code contained in the tag retrieved above
                code = tikz.get_text()
                # Define, then write, the preamble and the code needed to produce the previously saved image
                preambule = r"%PREAMBULE \n  \usepackage{pgfplots} \n  \usepackage{tikz} \n  \usepackage[european resistor, european voltage, european current]{circuitikz} \n  \usetikzlibrary{arrows,shapes,positioning} \n  \usetikzlibrary{decorations.markings,decorations.pathmorphing, decorations.pathreplacing} \n  \usetikzlibrary{calc,patterns,shapes.geometric} \n  %FIN PREAMBULE"
                with open(chemin + nom + titre,'w') as result:
                    result.write(preambule + code)

But for h3 = t.find_all("h3"), line 21, it prints:

AttributeError: 'NavigableString' object has no attribute 'find_next_element'

Answer 1

It looks like (judging by the for t in table loop) you meant to find multiple "table" elements. Use find_next_siblings() instead of find_next_sibling():

table = h.find_next_siblings(name='table') 
for t in table:
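
To see why the original loop fails, here is a minimal sketch (the HTML string is a made-up stand-in for the real page, and child_types and tables are illustrative names). Iterating over a single Tag walks its direct children, and those include the whitespace between tags as NavigableString objects, which is the kind of object the traceback complains about. find_next_siblings() returns a list of Tag objects instead.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed sample markup, loosely modelled on the snippet in the question.
html = """<h5 id="foo1">Title 1</h5>
<table class="foo2">
  <tr><td><h3 class="foo3">SomeName1</h3></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
h5 = soup.find("h5")

# find_next_sibling() returns ONE Tag; looping over it visits its children,
# and the text nodes between child tags show up as NavigableString objects.
table = h5.find_next_sibling("table")
child_types = [type(child).__name__ for child in table]
print(child_types)

# find_next_siblings() returns a list, so the loop variable is always a Tag.
tables = h5.find_next_siblings("table")
for t in tables:
    print([h3.get_text() for h3 in t.find_all("h3")])
```

If there really is only one table per h5, the loop is unnecessary and calling table.find_all("h3") directly on the single Tag should be enough.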

Answer 2

This seems to be what you want. There appears to be only one table after each h5, so don't iterate over it; just use find_next and work with the table it returns:

from bs4 import BeautifulSoup

import requests

cont = requests.get("http://www.physagreg.fr/schemas-figures-physique-svg-tikz.php").text

soup = BeautifulSoup(cont, "html.parser")

h5s = soup.find_all("h5",limit=6)
for h5 in h5s:
    # find first table after
    table = h5.find_next("table")
    # find all h3 elements in that table
    for h3 in table.select("h3"):
        print(h3.text)
        img = h3.find_next("img")
        print(img["src"])
        print(img["title"])
        print(img.find_next("p").text)
    print()

This gives you output like:

repere-plan.svg

\begin{tikzpicture}[scale=1]
\draw (0,0) --++ (1,1) --++ (3,0) --++ (-1,-1) --++ (-3,0);
\draw [thick] [->] (2,0.5) --++(0,2) node [right] {z};
%thick : gras ; very thick : très gras ; ultra thick : hyper gras
\draw (2,0.5) node [left] {O};
\draw [thick] [->] (2,0.5) --++(-1,-1) node [left] {x};
\draw [thick] [->] (2,0.5) --++(2,0) node [below] {y};
\end{tikzpicture}

Lignes de champ et équipotentielles
images/cours-licence/em3/ligne-champ-equipot.svg

ligne-champ-equipot.svg

\begin{tikzpicture}[scale=0.8]
\draw[->] (-2,0) -- (2,0);
\draw[->] (0,-2) -- (0,2);
\draw node [red] at (-2,1.25) {\scriptsize{Lignes de champ}};
\draw node [blue] at (2,-1.25) {\scriptsize{Equipotentielles}};
\draw[color=red,domain=-3.14:3.14,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={3*sin(\x r)*3*sin(\x r)*5});
%r = angle en radian
%domain permet de définir le domaine dans lequel la fonction sera tracée
%samples=200 permet d'augmenter le nombre de points pour le tracé
%smooth améliore également la qualité de la trace
\draw[color=red,domain=-3.14:3.14,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={2*sin(\x r)*2*sin(\x r)*5});
\draw[color=blue,domain=-pi:pi,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={3*sqrt(abs(cos(\x r)))*15});
\draw[color=blue,domain=-pi:pi,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={2*sqrt(abs(cos(\x r)))*15});
\end{tikzpicture}

Fonction arctangente
images/schemas/math/arctan.svg

arctan.svg

\begin{tikzpicture}[scale=0.8]
\draw[very thin,color=gray] (-pi,pi) grid (-pi,pi);
\draw[->] (-pi,0) -- (pi,0) node[right] {$x$};
\draw[->] (0,-2) -- (0,2);
\draw[color=red,domain=-pi:pi,samples=150] plot ({\x},{rad(atan(\x))} )node[right,red] {$\arctan(x)$};
\draw[color=blue,domain=-pi:pi] plot ({\x},{rad(-atan(\x))} )node[right,blue] {$-\arctan(x)$};
%Le rad() est une autre façon de dire que l'argument est en radian
\end{tikzpicture}

To write all the .svg files to disk:

from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin  # Python 3 location (Python 2: from urlparse import urljoin)

cont = requests.get("http://www.physagreg.fr/schemas-figures-physique-svg-tikz.php").text

soup = BeautifulSoup(cont, "html.parser")
base_url = "http://www.physagreg.fr/"

h5s = soup.find_all("h5", limit=6)
for h5 in h5s:
    # find first table after
    table = h5.find_next("table")
    # find all h3 elements in that table
    for h3 in table.select("h3"):
        print(h3.text)
        img = h3.find_next("img")
        src, title = img["src"], img["title"]
        # join base url and image url
        img_url = urljoin(base_url, src)
        # open file in binary mode using title as file name
        with open(title, "wb") as f:
            # request the img url and write the raw bytes
            f.write(requests.get(img_url).content)

This will give you arctan.svg, courbe-Epeff.svg, and all the others on the page.
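
The question also asked for one folder per h5. A rough sketch of that part, under some assumptions: extract_figures, write_figures and SAMPLE are hypothetical names, the sample markup is adapted from the snippet in the question, and writing each text to a .tex file named after its h3 is only one possible layout.

```python
from pathlib import Path
from bs4 import BeautifulSoup

# Assumed sample markup, adapted from the question's snippet.
SAMPLE = """
<h5 id="foo1">Title 1</h5>
<table class="foo2"><tbody><tr><td>
  <h3 class="foo3">SomeName1</h3>
  <img src="Somesource" alt="SomeName2" title="SomeTitle"><br>
  <p class="textcode">Some precious text here</p>
</td></tr></tbody></table>
"""

def extract_figures(html):
    """Map each h5 heading to (name, src, title, text) tuples from the table after it."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for h5 in soup.find_all("h5"):
        table = h5.find_next("table")
        if table is None:
            continue
        entries = []
        for h3 in table.select("h3"):
            img = h3.find_next("img")
            p = img.find_next("p") if img is not None else None
            if img is None or p is None:
                continue  # skip incomplete entries instead of crashing
            entries.append((h3.get_text(strip=True), img["src"],
                            img["title"], p.get_text().strip()))
        result[h5.get_text(strip=True)] = entries
    return result

def write_figures(figures, out_dir):
    """Create one folder per heading and one .tex file per entry; return the paths written."""
    written = []
    for heading, entries in figures.items():
        folder = Path(out_dir) / heading
        folder.mkdir(parents=True, exist_ok=True)
        for name, _src, _title, text in entries:
            target = folder / (name + ".tex")
            target.write_text(text, encoding="utf-8")
            written.append(target)
    return written
```

Calling write_figures(extract_figures(SAMPLE), "out") should create out/Title 1/SomeName1.tex. Headings or names containing characters such as "/" would need cleaning before being used as path components, which this sketch does not attempt.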
