Creating a Pandas DataFrame from nested unordered HTML lists
I have a web page of unordered lists, and I want to turn them into a pandas DataFrame as the first step of an NLP workflow.
import pandas as pd
from bs4 import BeautifulSoup
html = '''<html>
<body>
<ul>
<li>
Name
<ul>
<li>Many</li>
<li>Stories</li>
</ul>
</li>
</ul>
<ul>
<li>
More
</li>
</ul>
<ul>
<li>Stuff
<ul>
<li>About</li>
</ul>
</li>
</ul>
</body>
</html>'''
soup = BeautifulSoup(html, 'lxml')
The goal is to turn each top-level list into one row of a DataFrame, so that the output looks like this:
       0      1        2
0   Name   Many  Stories
1   More   null     null
2  Stuff  About     null
I tried the following code to get all list items (including those in sublists):
target = soup.find_all('li')
But it returns every nested item twice, once inside its parent and once on its own:
[<li>
Name
<ul>
<li>Many</li>
<li>Stories</li>
</ul>
</li>, <li>Many</li>, <li>Stories</li>, <li>
More
</li>, <li>Stuff
<ul>
<li>About</li>
</ul>
</li>, <li>About</li>]
I'm really lost. Thanks.
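The duplication described in the question is consistent with how `find_all` works: by default it searches all descendants, so a nested `<li>` is returned both inside its parent's markup and as a result of its own. Below is a minimal sketch of the difference, assuming a shortened copy of the question's markup and the built-in `html.parser` backend (the question uses `lxml`); the variable names `all_items`, `top_lists`, and `top_items` are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (assumption: same nesting, fewer lists)
html = """<html><body>
<ul><li>Name<ul><li>Many</li><li>Stories</li></ul></li></ul>
<ul><li>More</li></ul>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Default search is recursive: nested <li> elements are returned on their own
# in addition to appearing inside their parent's markup.
all_items = soup.find_all("li")

# recursive=False restricts the search to direct children of the element
# it is called on, so only the top-level lists and their own items are found.
top_lists = soup.body.find_all("ul", recursive=False)
top_items = [li for ul in top_lists for li in ul.find_all("li", recursive=False)]

print(len(all_items), len(top_lists), len(top_items))
```

Starting from the top-level `<ul>` elements and descending from there avoids having to de-duplicate the results afterwards.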
The breakdown is in the comments. Enjoy!
import re
import pandas as pd
from lxml import etree
# Parse the HTML string (it must be well-formed XML) so it can be queried with XPath
root = etree.XML(html)
tree = etree.ElementTree(root)
# XPath selector for the top-level lists
xpathselector = 'body/ul'
# List of lxml elements, one per top-level <ul>
hold = tree.xpath(xpathselector)
# For each element in hold:
# 1. Serialize it with etree.tostring
# 2. Decode the bytes to a string
# 3. Replace every tag and newline with a space (a space, not '', so that
#    adjacent items are not glued together when the markup has no indentation)
# 4. Split on whitespace to get one list of words per top-level list
df = pd.DataFrame([re.sub(r'\n|<.{0,3}>', ' ', etree.tostring(i).decode('utf-8')).split() for i in hold])
df
       0      1        2
0   Name   Many  Stories
1   More   None     None
2  Stuff  About     None
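As an alternative to stripping tags with a regular expression, the same table can be built with BeautifulSoup alone. This is a sketch rather than the answer's method: it assumes the question's markup, uses the built-in `html.parser` backend instead of `lxml`, and relies on `find_all(..., recursive=False)` plus the `stripped_strings` generator. The name `rows` is illustrative.

```python
import pandas as pd
from bs4 import BeautifulSoup

html = """<html>
<body>
<ul>
<li>
Name
<ul>
<li>Many</li>
<li>Stories</li>
</ul>
</li>
</ul>
<ul>
<li>
More
</li>
</ul>
<ul>
<li>Stuff
<ul>
<li>About</li>
</ul>
</li>
</ul>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# One row per top-level <ul>. stripped_strings yields every text fragment
# below the element (nested items included) with surrounding whitespace
# removed, and skips fragments that are whitespace only.
rows = [
    list(ul.stripped_strings)
    for ul in soup.body.find_all("ul", recursive=False)
]

# Rows of unequal length are padded with missing values by the constructor.
df = pd.DataFrame(rows)
print(df)
```

Because each list item is kept as one string, an item such as "Many Stories" would stay in a single cell, whereas splitting on whitespace would spread it over two columns.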