Creating Pandas dataframe from nested unordered html lists

Question

I have a web page made up of unordered lists, and I want to turn them into a pandas DataFrame as the first step of an NLP workflow.

import pandas as pd
from bs4 import BeautifulSoup
html = '''<html>
        <body>
          <ul>
              <li>
              Name
                    <ul>
                        <li>Many</li>
                        <li>Stories</li>
                    </ul>
                </li> 
          </ul>
          <ul>
              <li>
              More
              </li>
         </ul>
         <ul>
             <li>Stuff 
                     <ul>
                         <li>About</li>
                    </ul>
            </li>
        </ul>
        </body>
        </html>'''

soup = BeautifulSoup(html, 'lxml')

The goal is to turn each top-level list into one row of a DataFrame, so that the output looks like this:

       0      1        2
0   Name   Many  Stories
1   More   null     null
2  Stuff  About     null

I tried the following code to get all of the list items (including those in sub-lists):

target = soup.find_all('li')

But it returns the nested items twice, once inside their parent and once on their own:

[<li>
                   Name
                         <ul>
 <li>Many</li>
 <li>Stories</li>
 </ul>
 </li>, <li>Many</li>, <li>Stories</li>, <li>
                   More
                   </li>, <li>Stuff 
                          <ul>
 <li>About</li>
 </ul>
 </li>, <li>About</li>]

I'm really lost. Thanks.

Answer 1

The breakdown is in the comments. Enjoy!

from lxml import etree
import pandas as pd
import re

#Convert html string to correct format for parsing with XPATHs
root = etree.XML(html)
tree = etree.ElementTree(root)

#Your XPATH Selector 
xpathselector = 'body/ul'

#List of lxml items that need to be decoded
hold = tree.xpath(xpathselector)

'''
1. Get strings of each item in hold
2. Decode to string
3. Remove all tags and \n in each list
4. Split on spaces to create list of lists
'''
# Raw string avoids invalid-escape warnings; '<' and '>' need no escaping in a regex
df = pd.DataFrame([re.sub(r'\n|<.{0,3}>', '', etree.tostring(i).decode('utf-8')).split() for i in hold])
df
       0      1        2
0   Name   Many  Stories
1   More   None     None
2  Stuff  About     None
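
If you would rather stay with BeautifulSoup, as in the question, something along these lines may also work. This is only a sketch and not part of the answer above: it assumes every top-level list is a direct child of <body>, uses the built-in 'html.parser' instead of 'lxml' (an arbitrary choice here), and the names top_level_lists and rows are made up for illustration. Passing recursive=False keeps find_all from descending into nested lists, which is what caused the repeated items in the question.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Compact version of the HTML from the question
html = """<html><body>
<ul><li>Name<ul><li>Many</li><li>Stories</li></ul></li></ul>
<ul><li>More</li></ul>
<ul><li>Stuff <ul><li>About</li></ul></li></ul>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# recursive=False only looks at direct children of <body>,
# so nested <ul> elements are not returned a second time.
top_level_lists = soup.body.find_all("ul", recursive=False)

# stripped_strings yields each non-empty text node under the list,
# in document order, with surrounding whitespace removed.
rows = [list(ul.stripped_strings) for ul in top_level_lists]

# Rows of unequal length are padded with missing values.
df = pd.DataFrame(rows)
print(df)
```

One difference from the regex version: each text node stays whole instead of being split on spaces, so an item such as "Many Stories" would end up in a single cell rather than two.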

