Extracting everything between two lxml tags in Python

Question

Consider the following HTML snippet:

<html>
  .
  .
  .
  <div>
    <p> Hello </p>
    <div>
      <b>
        Text1
      </b>
      <p>
        This is a huge paragraph text
      </p>
       .
       .
       .
     </div>
  </div>
  .
  .
  .
  <div>
    <i>
      Text2
    </i>
  </div>

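Suppose I need to extract everything from Text1 to Text2, including the tags. By other means, I have already located the tags containing these two texts, i.e. their unique IDs.

Basically, I have two lxml etree elements corresponding to the two tags I need.

How can I extract everything between the two tags?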

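(One possible solution I can think of is to find the common ancestor of the two tags, then run iterwalk() over it, starting extraction at Element1 and stopping at Element2. However, I am not sure how that would be done.) Any solution would be greatly appreciated.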

Please note that I have already found the two tags I need; I am not looking for a way to find them (e.g. with xpath).
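
For illustration, the common-ancestor idea above might be sketched roughly as follows with lxml alone. This is only a sketch under assumptions, not a verified solution for every document: the helper names `lowest_common_ancestor` and `elements_between` and the sample markup are made up here, both elements are assumed to live in the same tree, and the first element is assumed to come before the second in document order. It collects the elements themselves rather than the raw markup.

```python
from lxml import etree


def lowest_common_ancestor(a, b):
    """Return the deepest element containing both a and b (hypothetical helper)."""
    ancestors_of_a = [a, *a.iterancestors()]
    for node in [b, *b.iterancestors()]:
        if any(node is candidate for candidate in ancestors_of_a):
            return node
    return None  # the two elements are not in the same tree


def elements_between(first, last):
    """List elements opened from `first` up to the end of `last`, in document order."""
    root = lowest_common_ancestor(first, last)
    if root is None:
        raise ValueError("elements do not share an ancestor")
    collected, inside = [], False
    for event, node in etree.iterwalk(root, events=("start", "end")):
        if event == "start":
            if node is first:
                inside = True  # start collecting at the first element
            if inside:
                collected.append(node)
        elif node is last:
            break  # stop once the last element has been closed
    return collected


# Made-up sample markup mirroring the structure in the question.
doc = etree.fromstring(
    "<html><div><p>Hello</p><div><b>Text1</b><p>Huge paragraph</p></div></div>"
    "<div><i>Text2</i></div><div>After</div></html>"
)
first = doc.xpath("//b")[0]
last = doc.xpath("//i")[0]
tags = [el.tag for el in elements_between(first, last)]
print(tags)
```

The `<p>` holding Hello is skipped because it opens before the first element, and the trailing `<div>` is skipped because the walk stops as soon as the last element closes.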

Edit: my desired output is

      <b>
        Text1
      </b>
      <p>
        This is a huge paragraph text
      </p>
       .
       .
       .
     </div>
  </div>
  .
  .
  .
  <div>
    <i>
      Text2
    </i>

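Note that I don't mind the first two <div> tags being included, but I do not want Hello. The same goes for the closing tags at the end. I am mostly interested in the content in the middle.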

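Edit 2: I extracted the etree elements using complex xpath conditions that are not feasible with alternatives such as bs4, so any solution that works with lxml elements would be greatly appreciated :)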

Answer 1

After review and some clarifying questions:

from essentials.tokening import CreateToken  # Imported only to generate a random string (pip install mknxgn_essentials)
import bs4

HTML = """<html>
    <div>
        <div>
            <div id="start">
                Hello, My name is mark
            </div>
        </div>
    </div>

    <div>
        This is in the middle
    </div>

    <div>
        <div id="end">
            This is the end
        </div>
    </div>

    <div>
        Do not include this.
    </div>

</html>"""

RandomString = CreateToken(30, HTML)  # Generate a random string that cannot occur on its own in the file; if it does occur, use something else
soup = bs4.BeautifulSoup(HTML, features="lxml")  # Parse the text into soup
start_div = soup.find("div", attrs={"id": "start"})  # Assuming you can find this element
start_div.insert_before(RandomString)  # Insert the random string before this element
end_div = soup.find("div", attrs={"id": "end"})  # Again, assuming you can also find this element
end_div.insert_after(RandomString)  # Insert the random string after this element

print(str(soup).split(RandomString)[1])  # Take the text between the two random strings

This returns the following output:

>>>             <div id="start">
>>>                 Hello, My name is mark
>>>             </div>
>>>     </div>
>>> </div>
>>>     <div>
>>>         This is in the middle
>>>     </div>
>>> <div>
>>>     <div id="end">
>>>         This is the end
>>>     </div>
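
Since the question asks for something that works on lxml elements, the same marker trick can probably be carried over to lxml without bs4. The following is a sketch under stated assumptions rather than a definitive implementation: `markup_between` is a hypothetical helper name, neither element may be the document root, and comment nodes carrying a random `uuid` token are used as markers instead of an inserted random string.

```python
import uuid

from lxml import etree, html


def markup_between(first, last):
    """Serialize the markup from the start of `first` through the end of `last`.

    Hypothetical helper; assumes neither element is the document root.
    """
    token = uuid.uuid4().hex  # random token, very unlikely to occur in the page
    before, after = etree.Comment(token), etree.Comment(token)
    first.addprevious(before)  # marker comment right before the first element
    last.addnext(after)        # marker comment following the last element
    try:
        root = first.getroottree().getroot()
        text = etree.tostring(root, encoding="unicode", method="html")
    finally:
        # Take the markers out again so the tree is left as it was.
        before.getparent().remove(before)
        after.getparent().remove(after)
    return text.split(f"<!--{token}-->")[1]


# Made-up sample markup similar to the HTML used in the answer.
SAMPLE = """<html><body>
<div><div><div id="start">Hello, My name is mark</div></div></div>
<div>This is in the middle</div>
<div><div id="end">This is the end</div></div>
<div>Do not include this.</div>
</body></html>"""

doc = html.fromstring(SAMPLE)
start = doc.xpath('//div[@id="start"]')[0]
end = doc.xpath('//div[@id="end"]')[0]
result = markup_between(start, end)
print(result)
```

As with the bs4 version, the result is a slice of the serialized page, so it can contain unbalanced opening and closing tags. As far as I know, `addnext()` places the new node after the element's tail text, so text directly following the end element may be included as well.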
