Extract a table from HTML text
I am trying to extract text from a page and save it as a dataframe. The page is not formatted as a table, so pandas cannot read it directly. I tried bs4 but could not extract the exact URL.
from bs4 import BeautifulSoup
import requests
import pandas as pd

html = requests.get('https://s3.amazonaws.com/todel162/veryimp/claps-0001.html')
soup = BeautifulSoup(html.text, "lxml")
links = soup.find_all('li')
df = pd.DataFrame(links)
I want the dataframe to have 4 columns, something like this...
vote title date url
1 How a TV Sitcom Triggered the Downfall of Western Civilization 2016-03-23 12:23 https://medium.com/p/how-a-tv-sitcom-triggered-the-downfall-of-western-civilization-336e8ccf7dd0
soup.find_all('li') just returns all the li tags on the page. What you need to do is pull the relevant information out of each li tag, such as votes, title, date, and URL, and save it into a list of lists. Then you can convert that into a dataframe. You can get the URL with BeautifulSoup by reading the 'href' attribute of the 'a' tag.
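Since the live page may change, here is a minimal sketch on a hardcoded snippet (the HTML structure is assumed to mimic the real page) showing how the 'href' attribute and link text are read from the 'a' tag inside an li:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with the assumed structure: vote count,
# then a link, then a time element inside each li
html = """
<ul>
  <li>1 claps <a href="https://medium.com/p/example-post">Example title</a>
      <time class="dt-published">2016-03-23 12:23</time></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
li = soup.find("li")
a = li.find("a")
print(a["href"])  # the url comes from the 'href' attribute
print(a.text)     # the link text is the title
```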
from bs4 import BeautifulSoup
import requests
import pandas as pd
html = requests.get('https://s3.amazonaws.com/todel162/veryimp/claps-0001.html')
soup = BeautifulSoup(html.text, "lxml")
links = soup.find_all('li')
final_list = []
for li in links:
    # the first text node of the li holds the vote count, e.g. "1 ..."
    votes = li.contents[0].split(' ')[0]
    title = li.find('a').text
    date = li.find('time').text
    url = li.find('a')['href']
    final_list.append([votes, title, date, url])

df = pd.DataFrame(final_list, columns=['Votes', 'title', 'Date', 'Url'])
print(df)
#just df if in Jupyter notebook
Sample output from a Jupyter notebook
You need to parse the html. find_all('li') simply finds all the li elements in the html; you need to be more specific about what you pull from each one.
Here is the full code:
import re

import requests
import pandas as pd
from bs4 import BeautifulSoup

html = requests.get('https://s3.amazonaws.com/todel162/veryimp/claps-0001.html')
soup = BeautifulSoup(html.text, "html.parser")
links = soup.find_all('li')

rows = []  # avoid shadowing the built-in all()
for element in links:
    find_a = element.find('a')
    vote = find_a.previous_sibling                # text node just before the link
    vote = int(re.search(r'\d+', vote).group())   # remove the + and -, just get the int
    title = find_a.text
    date = element.find('time', attrs={'class': 'dt-published'}).text
    url = find_a['href']
    rows.append([vote, title, date, url])

dataf = pd.DataFrame(rows, columns=['vote', 'title', 'date', 'url'])
print(dataf)
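The previous_sibling plus regex step can be illustrated on its own. This is a sketch under the assumption that the vote count precedes the link as a raw text node like '+12 claps ':

```python
import re
from bs4 import BeautifulSoup

# Hypothetical li with the vote count as a text node before the link
html = '<li>+12 claps <a href="https://example.com/post">A title</a></li>'
soup = BeautifulSoup(html, "html.parser")
a = soup.find("a")
raw = a.previous_sibling                    # the text node before the link: '+12 claps '
vote = int(re.search(r"\d+", raw).group())  # keep only the digits, dropping the '+'
print(vote)  # 12
```

previous_sibling returns a NavigableString here, which is a str subclass, so it can be passed straight to re.search.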